From 92f392f64c5467ea44a0ef0f9b9ea8f62cd922c1 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Fri, 13 Feb 2026 19:26:19 +0000 Subject: [PATCH] [ci skip] Add skill: local-llm-gpu-selection --- .../skills/local-llm-gpu-selection/SKILL.md | 143 ++++++++++++++++++ 1 file changed, 143 insertions(+) create mode 100644 .claude/skills/local-llm-gpu-selection/SKILL.md diff --git a/.claude/skills/local-llm-gpu-selection/SKILL.md b/.claude/skills/local-llm-gpu-selection/SKILL.md new file mode 100644 index 00000000..ac5943e6 --- /dev/null +++ b/.claude/skills/local-llm-gpu-selection/SKILL.md @@ -0,0 +1,143 @@ +--- +name: local-llm-gpu-selection +description: | + Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and + comparing to Apple Silicon alternatives. Use when: (1) user asks about running + local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs, + (3) user wants to compare local models to Claude for coding, (4) user asks about + quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for + LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility, + multi-GPU considerations, and realistic quality comparisons to Claude models. +author: Claude Code +version: 1.0.0 +date: 2025-06-11 +--- + +# Local LLM GPU Selection & Performance Guide + +## Problem +Choosing the right hardware for local LLM inference requires understanding the +relationship between VRAM capacity, memory bandwidth, GPU compatibility with +server chassis, and realistic model quality expectations. + +## Context / Trigger Conditions +- User asks about running quantized models locally (Ollama, llama.cpp) +- User wants to know which GPU fits their server (Dell R730 or similar 2U) +- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs +- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding + +## Key Principle: Memory Bandwidth Is Everything + +LLM token generation is **memory-bandwidth bound**, not compute bound. The formula: +``` +approx tokens/sec = memory_bandwidth_GB_s / model_size_GB +``` +This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs +despite having less raw compute. + +## VRAM Requirements by Model Size + +| Model Size | Quant | VRAM Needed | Examples | +|------------|-------|-------------|----------| +| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B | +| 7-8B | Q8_0 | ~8 GB | | +| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B | +| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral | +| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B | +| 32B | Q8_0 | ~34 GB | | +| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B | +| 70B | Q8_0 | ~70 GB | | + +Add ~1-2 GB overhead for KV cache and context. Longer conversations use more. + +## Dell R730 GPU Compatibility + +### Constraints +- **2U chassis**: Full-height cards fit, but limited to dual-slot width +- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration +- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP +- **PSU**: Check wattage headroom (dual 750W or 1100W typical) + +### Compatible GPUs + +**No external power needed (<=75W):** +- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option +- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs +- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive +- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid + +**Requires power cable (>75W):** +- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB +- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth +- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it + +**Won't fit:** +- RTX 3090/4090: Too thick (3-slot), too long +- A100: Fits physically but very expensive +- Any consumer RTX: Generally too large for 2U + +### Multi-GPU Considerations +- Ollama splits model layers across GPUs automatically +- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty +- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks +- R730 PCIe 3.0 limits newer GPU bandwidth (L4 runs at half its rated speed) + +## Apple Silicon Comparison + +Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty. + +| Device | Memory | Bandwidth | Advantage | +|--------|--------|-----------|-----------| +| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty | +| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models | +| Mac Studio M4 Ultra 192 GB | 192 GB | 819 GB/s | Run anything | + +A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for +LLM inference due to zero cross-GPU overhead and high unified bandwidth. + +## Best Coding Models (for Ollama) + +For coding tasks specifically, prefer dedicated coding models: +1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class +2. **Codestral 22B** — Mistral's dedicated coding model +3. **DeepSeek Coder V2** — good quality, efficient +4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB + +## Realistic Quality Comparison to Claude + +For Claude Code-style agentic coding workflows: + +| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General | +|-----------|-------------|-------|---------------------|-------------| +| Single function gen | Excellent | Good | Good | Decent | +| Multi-file refactoring | Excellent | Decent | Weak | Weak | +| Tool use / agentic loops | Excellent | Good | Poor | Poor | +| Long context (large codebases) | Excellent | Good | Weak | Weak | + +Local models work for simple completions and code questions. They struggle badly +with Claude Code's complex multi-step tool-use workflows, long context windows, +and self-correction capabilities. + +## Quantization Quality Guide + +From best to worst quality (and largest to smallest): +- FP16: Full precision, baseline quality +- Q8_0: Near-lossless, ~50% size reduction +- Q6_K: Minimal quality loss +- Q5_K_M: Good balance +- Q4_K_M: **Recommended default** — best quality/size tradeoff +- Q3_K_M: Noticeable degradation on complex reasoning +- Q2_K: Significant quality loss, emergency only + +## Verification +- Check GPU compatibility: `lspci | grep -i nvidia` on the host +- Check available VRAM: `nvidia-smi` inside the GPU VM +- Check model fit: Ollama shows VRAM usage during `ollama run` +- Check inference speed: Count tokens/sec in Ollama output + +## Notes +- GPU prices fluctuate significantly in the used market; check current prices +- The T4 is PCIe 3.0 only; newer GPUs in PCIe 3.0 slots run at reduced bandwidth +- Power consumption matters for 24/7 homelab use (electricity cost) +- For Claude Code specifically, API-based Claude models remain significantly + superior to any local model for agentic coding workflows