infra/llama-cpp: benchmark report + -fa flag fix

Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse rate, decisive top-N distribution); qwen3vl-8b for
  caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Viktor Barzin

2026-05-10 15:03:16 +00:00

parent 3da01e6e1e

commit 6e7fe96a40

23 changed files with 8504 additions and 11 deletions

									
stacks/llama-cpp/main.tf
@@ -65,7 +65,7 @@ locals {
           "-c ${cfg.ctx_size}",
           "-np 1",
           "--jinja",
-          "-fa",
+          "-fa on",
         ])
         ttl           = 600 # unload after 10 min idle
         checkEndpoint = "/health"
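
The failure this hunk fixes (a bare -fa making llama-server exit before it serves anything) could in principle be caught before deploy. The following is a minimal sketch of such a check, not part of this commit. The helper name bare_flash_attn_flags, the accepted value set (on/off/auto), and the sample context size 8192 are assumptions made for illustration; only "-fa on" is confirmed by the change above.

```python
import shlex

# Values assumed to be accepted after -fa / --flash-attn.
# Only "on" is confirmed by the commit; "off" and "auto" are assumptions.
FLASH_ATTN_VALUES = {"on", "off", "auto"}


def bare_flash_attn_flags(args):
    """Return token indices of -fa / --flash-attn flags with no explicit value.

    `args` is a list of strings in the style of main.tf, where a flag and its
    value may share one string (e.g. "-np 1").
    """
    tokens = shlex.split(" ".join(args))
    bad = []
    for i, tok in enumerate(tokens):
        if tok in ("-fa", "--flash-attn"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt not in FLASH_ATTN_VALUES:
                bad.append(i)
    return bad


# Hypothetical argument lists mirroring the before/after state of the hunk.
old_args = ["-c 8192", "-np 1", "--jinja", "-fa"]
new_args = ["-c 8192", "-np 1", "--jinja", "-fa on"]

print(bare_flash_attn_flags(old_args))  # non-empty: the bare flag is reported
print(bare_flash_attn_flags(new_args))  # empty: the flag carries a value
```

A check like this would only guard the argument list itself; polling the /health endpoint after startup remains the authoritative signal that the server actually came up.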
