Why Run LLMs Locally

Running a model locally has three benefits: your data stays on your machine (privacy), it works offline, and there are no ongoing API costs.

The tradeoff: you need to invest in hardware, and inference is usually slower than with cloud services.

Model Size vs VRAM Requirements

Rough formula (weights only; the KV cache and runtime overhead need extra memory on top):

  • FP16: VRAM(GB) ≈ Parameters(B) × 2
  • INT4 quantization: VRAM(GB) ≈ Parameters(B) × 0.5

Model Size   FP16     INT4     Example Models
7B           14 GB    3.5 GB   Phi-4, Qwen3.5-7B
14B          28 GB    7 GB     Qwen3.5-14B
32B          64 GB    16 GB    Qwen3.5-32B
70B          140 GB   35 GB    Llama 4 Scout, Qwen3.5-72B
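
The rough formula can be written as a few lines of Python, which is handy for model sizes the table does not list. This is a minimal sketch: the function name `estimate_vram_gb` and the `BYTES_PER_PARAM` mapping are illustrative names, not part of any tool, and the result covers model weights only.

```python
# Multipliers from the rough formula: GB of VRAM per billion parameters.
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def estimate_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate VRAM (GB) needed to hold the model weights only."""
    return params_billions * BYTES_PER_PARAM[precision]


# Reproduce the table rows for the four model sizes.
for size in (7, 14, 32, 70):
    fp16 = estimate_vram_gb(size, "fp16")
    int4 = estimate_vram_gb(size, "int4")
    print(f"{size}B: FP16 {fp16:g} GB, INT4 {int4:g} GB")
```

Treat the numbers as a lower bound, since context length and the inference runtime add to actual memory use.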

Hardware Recommendations

Budget: Free (existing PC)

Any post-2020 computer can run 7B models on CPU alone. Speed: 5-10 tok/s.

Entry-level GPU (¥3,000-5,000)

RTX 3060 12GB — best value, runs 14B quantized at 20-30 tok/s.

Mid-range (¥8,000-15,000)

RTX 4070 Ti Super 16GB — runs 14B smoothly, some 32B at 40-60 tok/s.

High-end (¥20,000-50,000)

RTX 4090 24GB — runs 32B quantized. Dual-card for larger models.

Apple Silicon

Unified memory is ideal for LLMs — GPU accesses all RAM directly.

  • M4 Pro 24-48GB: runs 14B-32B
  • M4 Max 64-128GB: runs 32B-70B
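
To sanity-check the tiers above against your own machine, the INT4 rule of thumb (0.5 GB per billion parameters) can be turned into a small lookup. This is only a sketch: the 80% usable-memory figure is an assumption made here to leave headroom for the KV cache and the operating system, and `largest_model_that_fits` is a hypothetical helper name.

```python
from typing import Optional

MODEL_SIZES_B = (7, 14, 32, 70)  # model sizes discussed in this article
INT4_GB_PER_B = 0.5              # rough INT4 formula: GB per billion parameters
USABLE_FRACTION = 0.8            # assumption: keep ~20% of memory as headroom


def largest_model_that_fits(memory_gb: float) -> Optional[int]:
    """Return the largest listed model size (in billions) whose INT4 weights
    fit in the usable share of memory, or None if none of them fit."""
    budget = memory_gb * USABLE_FRACTION
    fitting = [size for size in MODEL_SIZES_B if size * INT4_GB_PER_B <= budget]
    return max(fitting) if fitting else None


for memory in (12, 16, 24, 64):
    print(f"{memory} GB -> {largest_model_that_fits(memory)}B")
```

With these assumptions a 12 GB card lands on 14B and a 24 GB card on 32B, which lines up with the recommendations above. A less conservative headroom setting would let a 16 GB card reach some 32B models.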

Inference Tools

Tool        Best For
Ollama      Beginners, developers: one-command install
LM Studio   Non-technical users: GUI, drag-and-drop
llama.cpp   Advanced users: maximum performance
vLLM        API servers: concurrent requests
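
As an illustration of what using one of these tools from code looks like, here is a minimal sketch that calls Ollama's local HTTP API with only the Python standard library. It assumes Ollama is already running on its default port (11434) and that a model has been pulled. The model name `my-local-model` is a placeholder, not a real model tag.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example call, with a placeholder model name:
# print(generate("my-local-model", "Say hello in one sentence."))
```

Nothing in this exchange leaves the machine, which is the privacy benefit described at the start of the article.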

Is Local Worth It?

Yes if: data privacy is mandatory, you expect heavy long-term use, or your network is unstable.

No if: you only use it occasionally, need top quality, or don't want to tinker.

For most people, the DeepSeek V4 Flash API (¥0.95 per million tokens) is more cost-effective than buying hardware.
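
The cost comparison can be made concrete with a quick break-even estimate. The sketch below assumes the quoted price means ¥0.95 per million tokens and ignores electricity and resale value. Both function names are illustrative.

```python
API_PRICE_YUAN_PER_M_TOKENS = 0.95  # price quoted above, assumed per million tokens


def break_even_tokens_millions(hardware_cost_yuan: float) -> float:
    """Millions of tokens the API would serve for the same money as the hardware."""
    return hardware_cost_yuan / API_PRICE_YUAN_PER_M_TOKENS


def break_even_days(hardware_cost_yuan: float, tokens_per_day_millions: float) -> float:
    """Days of use at a given daily volume before the hardware pays for itself."""
    return break_even_tokens_millions(hardware_cost_yuan) / tokens_per_day_millions


# Entry-level GPU at the low end of the range above (¥3,000), 1M tokens per day.
print(round(break_even_tokens_millions(3000)), "million tokens")
print(round(break_even_days(3000, 1.0)), "days")
```

Under these assumptions a ¥3,000 card only breaks even after roughly 3.2 billion tokens, which is why local hardware pays off mainly for heavy long-term use.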