| name | description | author | version | date |
|---|---|---|---|---|
| local-llm-gpu-selection | Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and comparing to Apple Silicon alternatives. Use when: (1) user asks about running local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs, (3) user wants to compare local models to Claude for coding, (4) user asks about quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility, multi-GPU considerations, and realistic quality comparisons to Claude models. | Claude Code | 1.0.0 | 2025-06-11 |
Local LLM GPU Selection & Performance Guide
Problem
Choosing the right hardware for local LLM inference requires understanding the relationship between VRAM capacity, memory bandwidth, GPU compatibility with server chassis, and realistic model quality expectations.
Context / Trigger Conditions
- User asks about running quantized models locally (Ollama, llama.cpp)
- User wants to know which GPU fits their server (Dell R730 or similar 2U)
- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
Key Principle: Memory Bandwidth Is Everything
LLM token generation is memory-bandwidth bound, not compute bound: each generated token requires reading every model weight from memory once. A rough upper bound:
approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs despite having less raw compute.
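As a rough illustration of the formula, the sketch below plugs in bandwidth and model-size figures quoted elsewhere in this guide. The function name is hypothetical, and the result is a theoretical ceiling; measured throughput will be lower.

```python
def estimate_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate: each generated token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth and model sizes taken from the tables in this guide.
t4 = estimate_tokens_per_sec(320, 5)     # Tesla T4, 8B model at Q4_K_M (~5 GB)
v100 = estimate_tokens_per_sec(900, 20)  # Tesla V100, 32B model at Q4_K_M (~20 GB)
print(f"T4, 8B Q4_K_M:    ~{t4:.0f} tokens/sec ceiling")
print(f"V100, 32B Q4_K_M: ~{v100:.0f} tokens/sec ceiling")
```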
VRAM Requirements by Model Size
| Model Size | Quant | VRAM Needed | Examples |
|---|---|---|---|
| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
| 7-8B | Q8_0 | ~8 GB | |
| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
| 32B | Q8_0 | ~34 GB | |
| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
| 70B | Q8_0 | ~70 GB | |
Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
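A quick fit check can be sketched from the table and the overhead rule. The dictionary keys and the function name are illustrative, and the default overhead of 2 GB is the upper end of the range given above; long contexts may need more.

```python
# Approximate weight sizes in GB, copied from the table above.
MODEL_VRAM_GB = {
    ("8B", "Q4_K_M"): 5,
    ("8B", "Q8_0"): 8,
    ("14B", "Q4_K_M"): 8,
    ("24B", "Q4_K_M"): 14,
    ("32B", "Q4_K_M"): 20,
    ("32B", "Q8_0"): 34,
    ("70B", "Q4_K_M"): 40,
    ("70B", "Q8_0"): 70,
}


def fits_in_vram(size: str, quant: str, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Return True if weights plus KV-cache overhead fit in the given VRAM."""
    return MODEL_VRAM_GB[(size, quant)] + overhead_gb <= vram_gb


print(fits_in_vram("32B", "Q4_K_M", vram_gb=24))  # 20 + 2 <= 24
print(fits_in_vram("32B", "Q4_K_M", vram_gb=16))  # 20 + 2 > 16
```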
Dell R730 GPU Compatibility
Constraints
- 2U chassis: Full-height cards fit, but limited to dual-slot width
- PCIe 3.0 x16 slots: 2-3 usable slots depending on riser configuration
- Power: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
- PSU: Check wattage headroom (dual 750W or 1100W typical)
Compatible GPUs
No external power needed (<=75W):
- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
Requires power cable (>75W):
- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
Won't fit (or impractical):
- RTX 3090/4090: Too thick (3-slot), too long
- A100: Fits physically but very expensive
- Any consumer RTX: Generally too large for 2U
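The compatible cards listed above can be captured as data and filtered by requirement. This is a minimal sketch: the class and function names are hypothetical, and the figures are the ones quoted in this guide, not independently verified specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    bandwidth_gb_s: int
    tdp_w: int

    @property
    def needs_power_cable(self) -> bool:
        # Cards above 75 W cannot be powered from the PCIe slot alone.
        return self.tdp_w > 75


# Figures copied from the compatibility list above.
R730_GPUS = [
    Gpu("Tesla T4", 16, 320, 70),
    Gpu("Tesla P4", 8, 192, 75),
    Gpu("NVIDIA L4", 24, 300, 72),
    Gpu("NVIDIA A2", 16, 200, 60),
    Gpu("Tesla P40", 24, 346, 250),
    Gpu("Tesla V100 PCIe", 32, 900, 250),
    Gpu("Tesla P100 PCIe", 16, 732, 250),
]


def candidates(min_vram_gb: float, allow_power_cable: bool = True) -> list[Gpu]:
    """Cards with enough VRAM, fastest memory first."""
    matches = [
        g for g in R730_GPUS
        if g.vram_gb >= min_vram_gb and (allow_power_cable or not g.needs_power_cable)
    ]
    return sorted(matches, key=lambda g: g.bandwidth_gb_s, reverse=True)


# Cards that can hold a 32B Q4_K_M model (~20 GB plus ~2 GB overhead).
for gpu in candidates(min_vram_gb=22):
    print(gpu.name, gpu.vram_gb, "GB,", gpu.bandwidth_gb_s, "GB/s")
```

Sorting by bandwidth reflects the key principle of this guide: once a model fits, memory bandwidth determines generation speed.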
Multi-GPU Considerations
- Ollama splits model layers across GPUs automatically
- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
- The R730's PCIe 3.0 slots halve the host-link bandwidth of PCIe 4.0 cards such as the L4; on-card memory bandwidth is unaffected
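The effect of a layer split can be approximated with a back-of-the-envelope model. The sketch below rests on two assumptions that are not confirmed Ollama behavior: each card holds a share of the weights proportional to its VRAM, and the cards process their shares one after another. The penalty default is the midpoint of the ~30-40% figure above.

```python
def multi_gpu_tokens_per_sec(
    model_size_gb: float,
    gpus: list[tuple[float, float]],  # (vram_gb, bandwidth_gb_s) per card
    pcie_penalty: float = 0.35,       # midpoint of the ~30-40% figure above
) -> float:
    """Rough estimate for a layer split where cards process their share in sequence."""
    total_vram = sum(vram for vram, _ in gpus)
    if model_size_gb > total_vram:
        raise ValueError("model does not fit in combined VRAM")
    # Assumption: each card holds a share of the weights proportional to its VRAM.
    seconds_per_token = sum(
        (model_size_gb * vram / total_vram) / bandwidth for vram, bandwidth in gpus
    )
    return 1.0 / (seconds_per_token * (1.0 + pcie_penalty))


# 32B Q4_K_M (~20 GB) across a T4 (16 GB, 320 GB/s) and a P40 (24 GB, 346 GB/s).
mixed = multi_gpu_tokens_per_sec(20, [(16, 320), (24, 346)])
print(f"T4 + P40: ~{mixed:.1f} tokens/sec")
```

Under these assumptions the pair lands below the single-card ceiling of a P40 running the same model (346 / 20, about 17 tokens/sec). A second card adds capacity, not speed.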
Apple Silicon Comparison
Apple Silicon unified memory means nearly all system RAM can serve as VRAM with no PCIe transfer penalty (macOS reserves a portion for the system by default).
| Device | Memory | Bandwidth | Advantage |
|---|---|---|---|
| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
| Mac Studio M3 Ultra 256 GB | 256 GB | 819 GB/s | Run anything |
A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for LLM inference due to zero cross-GPU overhead and high unified bandwidth.
Best Coding Models (for Ollama)
For coding tasks specifically, prefer dedicated coding models:
- Qwen 2.5 Coder 32B — best open-source coding model in this size class
- Codestral 22B — Mistral's dedicated coding model
- DeepSeek Coder V2 — good quality, efficient
- Llama 3.1 70B — strong general purpose but needs ~40 GB
Realistic Quality Comparison to Claude
For Claude Code-style agentic coding workflows:
| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
|---|---|---|---|---|
| Single function gen | Excellent | Good | Good | Decent |
| Multi-file refactoring | Excellent | Decent | Weak | Weak |
| Tool use / agentic loops | Excellent | Good | Poor | Poor |
| Long context (large codebases) | Excellent | Good | Weak | Weak |
Local models work for simple completions and code questions. They struggle badly with Claude Code's complex multi-step tool-use workflows, long context windows, and self-correction capabilities.
Quantization Quality Guide
From best to worst quality (and largest to smallest):
- FP16: Full precision, baseline quality
- Q8_0: Near-lossless, ~50% size reduction relative to FP16
- Q6_K: Minimal quality loss
- Q5_K_M: Good balance
- Q4_K_M: Recommended default — best quality/size tradeoff
- Q3_K_M: Noticeable degradation on complex reasoning
- Q2_K: Significant quality loss, emergency only
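Walking the list from best to worst and stopping at the first quant that fits can be sketched as follows. The bits-per-weight values are rough assumptions rather than figures from this guide, chosen so that the Q4_K_M and Q8_0 estimates land near the VRAM table above; actual GGUF file sizes vary by model. The function names are hypothetical.

```python
# Approximate bits per weight for llama.cpp quant types. These are rough,
# assumed values; real GGUF file sizes vary by model architecture.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}


def approx_size_gb(params_billions: float, quant: str) -> float:
    """Weights-only size estimate in GB (billions of params * bits / 8)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def best_quant(params_billions: float, vram_gb: float, overhead_gb: float = 2.0):
    """Highest-quality quant whose weights plus overhead fit, or None."""
    for quant in BITS_PER_WEIGHT:  # dicts keep insertion order: best quality first
        if approx_size_gb(params_billions, quant) + overhead_gb <= vram_gb:
            return quant
    return None


print(best_quant(32, vram_gb=24))  # 32B model on a 24 GB card
print(best_quant(70, vram_gb=24))  # 70B model on a 24 GB card
```

When the result is Q3_K_M or lower, a smaller model at Q4_K_M is usually the better choice given the degradation noted in the list.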
Verification
- Check GPU compatibility: `lspci | grep -i nvidia` on the host
- Check available VRAM: `nvidia-smi` inside the GPU VM
- Check model fit: Ollama shows VRAM usage during `ollama run`
- Check inference speed: Count tokens/sec in Ollama output
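Inference speed can also be measured programmatically. The sketch below assumes an Ollama server on its default local port and uses the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming `/api/generate` response; the helper names are hypothetical.

```python
import json
import urllib.request


def tokens_per_sec(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return tokens/sec."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_sec(json.load(reply))
```

For example, `benchmark("qwen2.5-coder:32b", "Write a quicksort in Python")` would return the measured rate, assuming that model tag has already been pulled. Comparing the result with the bandwidth-based ceiling shows how close the hardware gets to its theoretical limit.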
Notes
- GPU prices fluctuate significantly in the used market; check current prices
- The T4 is a PCIe 3.0 card, so it loses nothing in the R730; PCIe 4.0 GPUs in PCIe 3.0 slots run at reduced host-link bandwidth
- Power consumption matters for 24/7 homelab use (electricity cost)
- For Claude Code specifically, API-based Claude models remain significantly superior to any local model for agentic coding workflows