Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-09 08:45:33 +00:00

6 KiB

Raw Blame History

name	description	author	version	date
local-llm-gpu-selection	Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and comparing to Apple Silicon alternatives. Use when: (1) user asks about running local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs, (3) user wants to compare local models to Claude for coding, (4) user asks about quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility, multi-GPU considerations, and realistic quality comparisons to Claude models.	Claude Code	1.0.0	2025-06-11

Local LLM GPU Selection & Performance Guide

Problem

Choosing the right hardware for local LLM inference requires understanding the relationship between VRAM capacity, memory bandwidth, GPU compatibility with server chassis, and realistic model quality expectations.

Context / Trigger Conditions

User asks about running quantized models locally (Ollama, llama.cpp)
User wants to know which GPU fits their server (Dell R730 or similar 2U)
User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding

Key Principle: Memory Bandwidth Is Everything

LLM token generation is memory-bandwidth bound, not compute bound. The formula:

approx tokens/sec = memory_bandwidth_GB_s / model_size_GB

This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs despite having less raw compute.

VRAM Requirements by Model Size

Model Size	Quant	VRAM Needed	Examples
7-8B	Q4_K_M	~5 GB	Llama 3.1 8B, Mistral 7B
7-8B	Q8_0	~8 GB
13-14B	Q4_K_M	~8 GB	Qwen 2.5 Coder 14B
22-24B	Q4_K_M	~13-14 GB	Mistral Small, Codestral
32B	Q4_K_M	~20 GB	Qwen 2.5 Coder 32B
32B	Q8_0	~34 GB
70B	Q4_K_M	~40 GB	Llama 3.1 70B
70B	Q8_0	~70 GB

Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.

Dell R730 GPU Compatibility

Constraints

2U chassis: Full-height cards fit, but limited to dual-slot width
PCIe 3.0 x16 slots: 2-3 usable slots depending on riser configuration
Power: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
PSU: Check wattage headroom (dual 750W or 1100W typical)

Compatible GPUs

No external power needed (<=75W):

Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid

Requires power cable (>75W):

Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it

Won't fit:

RTX 3090/4090: Too thick (3-slot), too long
A100: Fits physically but very expensive
Any consumer RTX: Generally too large for 2U

Multi-GPU Considerations

Ollama splits model layers across GPUs automatically
PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
R730 PCIe 3.0 limits newer GPU bandwidth (L4 runs at half its rated speed)

Apple Silicon Comparison

Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty.

Device	Memory	Bandwidth	Advantage
Mac Mini M4 Pro 48 GB	48 GB	273 GB/s	Silent, 25W, no PCIe penalty
Mac Studio M4 Max 128 GB	128 GB	546 GB/s	Run 100B+ models
Mac Studio M4 Ultra 192 GB	192 GB	819 GB/s	Run anything

A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for LLM inference due to zero cross-GPU overhead and high unified bandwidth.

Best Coding Models (for Ollama)

For coding tasks specifically, prefer dedicated coding models:

Qwen 2.5 Coder 32B — best open-source coding model in this size class
Codestral 22B — Mistral's dedicated coding model
DeepSeek Coder V2 — good quality, efficient
Llama 3.1 70B — strong general purpose but needs ~40 GB

Realistic Quality Comparison to Claude

For Claude Code-style agentic coding workflows:

Capability	Opus/Sonnet	Haiku	Qwen 2.5 Coder 32B	70B General
Single function gen	Excellent	Good	Good	Decent
Multi-file refactoring	Excellent	Decent	Weak	Weak
Tool use / agentic loops	Excellent	Good	Poor	Poor
Long context (large codebases)	Excellent	Good	Weak	Weak

Local models work for simple completions and code questions. They struggle badly with Claude Code's complex multi-step tool-use workflows, long context windows, and self-correction capabilities.

Quantization Quality Guide

From best to worst quality (and largest to smallest):

FP16: Full precision, baseline quality
Q8_0: Near-lossless, ~50% size reduction
Q6_K: Minimal quality loss
Q5_K_M: Good balance
Q4_K_M: Recommended default — best quality/size tradeoff
Q3_K_M: Noticeable degradation on complex reasoning
Q2_K: Significant quality loss, emergency only

Verification

Check GPU compatibility: lspci | grep -i nvidia on the host
Check available VRAM: nvidia-smi inside the GPU VM
Check model fit: Ollama shows VRAM usage during ollama run
Check inference speed: Count tokens/sec in Ollama output

Notes

GPU prices fluctuate significantly in the used market; check current prices
The T4 is PCIe 3.0 only; newer GPUs in PCIe 3.0 slots run at reduced bandwidth
Power consumption matters for 24/7 homelab use (electricity cost)
For Claude Code specifically, API-based Claude models remain significantly superior to any local model for agentic coding workflows

6 KiB Raw Blame History