infra/llama-cpp: benchmark report + -fa flag fix

Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md: curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json: raw 300-row dump for
  diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form of
  the flash-attention flag; without the value llama-server exits before
  serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the GPU
exclusively allocated. immich-ml was scaled to 0 for the run (node1 RAM
constraint, not GPU; bumping node1 RAM is tracked as a follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 15:03:16 +00:00 · 6e7fe96a40
commit 6e7fe96a40
parent 3da01e6e1e
23 changed files with 8504 additions and 11 deletions
--- a/docs/architecture/llama-cpp.md
+++ b/docs/architecture/llama-cpp.md
@@ -14,6 +14,11 @@ choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
 Future consumers (Home Assistant, agentic tooling) can hit the same
 endpoint via LiteLLM at the cluster gateway.

+First benchmark run (2026-05-10): see
+`infra/docs/benchmarks/2026-05-10-vision-llm.md`. Verdict: **qwen3vl-4b**
+for the request path (3.55 s p50, 100% parse, decisive top-N
+distribution). qwen3vl-8b for caption polish on top picks.
+
 ## Why llama.cpp + llama-swap (not Ollama)

 Verified across 7+7 research/challenger subagents (2026-05-10):