infra/docs/benchmarks/2026-05-10-vision-llm.md
Viktor Barzin 764c234b1c infra/llama-cpp: benchmark report + -fa flag fix
Phase 7 of the vision-LLM benchmark plan. Adds:

- docs/benchmarks/2026-05-10-vision-llm.md — curated report (TL;DR,
  per-model analysis, top-N agreement, cost vs cloud APIs, sample
  captions). Verdict: qwen3vl-4b for the request path (3.55 s p50,
  100% parse, decisive top-N distro); qwen3vl-8b for caption polish.
- docs/benchmarks/benchmark-2026-05-10-1424.json — raw 300-row dump
  for diff-checking against future runs.
- main.tf: -fa -> -fa on (b9085 llama.cpp removed the no-value form
  of the flash-attention flag; without the value llama-server exits
  before serving any request).
- llama-cpp.md architecture doc links the report so future operators
  land on the deployed-and-evaluated model from one entry point.

300/300 calls, 0 parse errors, 33m32s wall on a single T4 with the
GPU exclusively allocated. immich-ml was scaled to 0 for the run
(node1 RAM constraint, not GPU - bumping node1 RAM is tracked as a
follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 15:03:16 +00:00

11 KiB
Raw Blame History

Vision-LLM benchmark — Malaga / Seville album

Run ID: 2026-05-10-1424 · Date: 2026-05-10 · Operator: wizard

100 photos randomly sampled (seed=42) from the Immich album 🇪🇸 Malaga Seville (46565b85-7580-4ac1-91a6-1ece2cf8634d, 1556 image assets + 9 videos), scored by three local vision-LLMs served by llama-swap on a single Tesla T4. Goal: pick a model to wire into instagram-poster's /candidates ranking path.

TL;DR

Recommendation: qwen3vl-4b.

  • Fastest by a wide margin (3.55 s p50, 60% of qwen3vl-8b), important once this is in the request path of /candidates.
  • 100% structured-output success — same as the other two; GBNF grammar enforcement worked across the board.
  • Captions are competitive with the 8B model in qualitative review (tied or close on 8/10 sampled photos; 8B wins on Flair, 4B wins on Latency).
  • Most decisive scorer — 47/100 photos got IG-fit=9 vs 17 for qwen3vl-8b and 9 for minicpm. We get more signal at the top end for ranking.

Use qwen3vl-8b for manual caption refinement (top-1 of the day) if caption polish matters. Use minicpm-v-4-5 for nothing immediate — it's the most conservative scorer and the slowest at high quantiles, with no offsetting wins in this dataset.

Setup

  • Hardware: 1× Tesla T4 (16 GiB VRAM), nvidia.com/gpu time-slicing enabled (replicas=100), pod scheduled on k8s-node1.
  • Server: mostlygeek/llama-swap:cuda (ships llama.cpp b9085-046e28443) on llama-swap.llama-cpp.svc.cluster.local:8080.
  • Models: GGUF Q4_K_M, mmproj F16 except qwen3vl-4b which used the Q8_0 mmproj (alphabetically first matching the glob).
  • Image prep: EXIF-transposed, long-edge resized to 1024 px, JPEG q=90, base64-embedded as image_url data URLs.
  • Generation: temperature=0, top_k=1, enable_thinking=false, GBNF grammar pinning the JSON schema (6 fields, 110 ints, ≤8 tags).
  • Run isolation: immich-machine-learning scaled to 0 for the duration to avoid noisy GPU contention. (Diagnostic note: the scheduling failure that triggered this was actually node1 RAM — not GPU — at 94% allocated. Time-slicing was already on. Bumping node1 RAM is tracked as a follow-up.)

Headline numbers

model n parse_ok p50 latency p95 latency median IG-fit median aesthetic
qwen3vl-4b 100 100% 3.55 s 4.06 s 8.0 8.0
minicpm-v-4-5 100 100% 5.62 s 6.00 s 7.0 8.0
qwen3vl-8b 100 100% 5.98 s 6.64 s 7.0 8.0

Total wall time for the run: 33 m 32 s (300 calls + 3 cold loads of ~30 s each).

What each model is good at

qwen3vl-4b — fast and decisive

  • p50 3.55 s — comfortable for adding to /candidates request path.
  • IG-fit distribution skews right (47 nines), spreading 6 → 9 fairly evenly, which is what you want from a ranker.
  • Captions are emoji-friendly, hashtag-friendly, sometimes hallucinatory (e.g. labelled a Seville street as "Barcelona's colourful streets" once).
  • Failure mode to watch: occasional double-down on the same caption template ("Lost in the tiles. 🌿" repeated across two unrelated blue-dress photos).

minicpm-v-4-5 — conservative, terse

  • Most conservative scorer: 65% of photos got IG-fit=7. Only 9 nines. Less useful as a top-N ranker because the top is squashed.
  • Fastest p95 of the three (6.0 s) but slower p50 than qwen3vl-4b.
  • Captions are short and lower-case ("azulejo dreams.", "sunshine & secrets") — distinct voice but less Instagram-native.

qwen3vl-8b — most polished captions

  • Best subject identification (specifically named "Metropol Parasol" and "Plaza de España" by name where the others said "modern architecture" / "plaza").
  • Captions read well: "Coffee & calm vibes ", "where modern meets historic under a brilliant sky".
  • Slowest p50 (5.98 s) and tightest score distribution (median 7, 17 nines) — middle of the pack as a ranker.

Top-10 agreement (Kendall-tau-style overlap)

How many of each model's top-10 IG-fit picks appear in another model's top-10:

pair overlap
qwen3vl-4b ↔ qwen3vl-8b 5/10
minicpm-v-4-5 ↔ qwen3vl-4b 4/10
minicpm-v-4-5 ↔ qwen3vl-8b 4/10

Read: there's moderate but not strong agreement. The models pick roughly half the same "best" photos and half different ones. For ranking, that's a healthy sign — they're not collapsing to a single notion of "good", so combining their scores would add real signal.

Cost-equivalent context

Approximate cost to score the same 100 photos via cloud APIs (prompt ≈ 1100 tokens incl. image, completion ≈ 100 tokens):

backend input output per-100 photos
Local llama-swap on T4 ≈ $0.04 (electricity, ~70 W × 7 min)
Anthropic Haiku 4.5 $1.00/M $5.00/M ≈ $0.15
Anthropic Sonnet 4.6 $3.00/M $15.00/M ≈ $0.45
Google Gemini 2.5 Flash $0.30/M $2.50/M ≈ $0.05

Local is competitive with Gemini Flash on marginal cost. The case for keeping it local is privacy (Immich originals never leave the LAN), no rate-limits, and no per-call quota planning. The case against is the GPU is finite — adding this to a request path means sharing T4 time with frigate, ytdlp, and (when we restore it) immich-ml. With time-slicing on, that's tractable but each tenant gets fewer full-GPU kernels.

Sample captions (10 representative photos, all 3 models)

9f9c9faf-... — man at ornate arch in Seville

  • minicpm-v-4-5 (7/8): sunshine & secrets
  • qwen3vl-4b (9/8): Found the perfect spot to pause. 🌞 #travel #architecture
  • qwen3vl-8b (9/8): Standing in Seville's magic, framed by history and color.

0d53158b-... — Metropol Parasol

  • minicpm-v-4-5 (9/8): Seville's skyward statement.
  • qwen3vl-4b (9/8): Seville's iconic modern structure against a vibrant blue sky.
  • qwen3vl-8b (9/8): The Metropol Parasol in Seville, where modern meets historic under a brilliant sky.

23137c1c-... — coffee shot

  • minicpm-v-4-5 (8/7): a morning pick-me-up and a smile.
  • qwen3vl-4b (8/7): Coffee & chill. Perfectly imperfect moment.
  • qwen3vl-8b (9/8): Coffee & calm vibes

3376ee87-... — woman against tiled wall

  • minicpm-v-4-5 (7/8): azulejo dreams.
  • qwen3vl-4b (9/8): Lost in the tiles. 🌿
  • qwen3vl-8b (9/8): Blue tiles, golden light, and a moment of quiet grace.

4f2fea45-... — courtyard

  • minicpm-v-4-5 (7/8): hidden gems of seville
  • qwen3vl-4b (7/8): Timeless beauty in a Spanish courtyard. 🌿
  • qwen3vl-8b (7/8): A serene courtyard in Seville, where palm trees sway under the sun.

ea713729-... — flower-market street (qwen3vl-4b confused location)

  • minicpm-v-4-5 (7/8): Seville's hidden gems.
  • qwen3vl-4b (7/8): Walking through Barcelona's colorful streets, backlit by golden hour.
  • qwen3vl-8b (7/8): Walking through Seville's vibrant streets, lavender in hand.

The full list of 10 sample sets is in the auto-generated section below; the raw 300-row JSON is at benchmark-2026-05-10-1424.json in this directory.

Operational cost during the run

  • llama-swap pod (1× T4 wholly allocated for the duration): ~33 min.
  • Immich-ML downtime: ~33 min. New uploads weren't auto-tagged or CLIP-embedded during this window. No user-visible impact (Immich search against already-indexed assets still worked via pgvector).
  • Network egress: zero — Immich originals stayed on the LAN, all scoring traffic was in-cluster.

Reproducibility

DATA_DIR=/tmp/benchmark \
  IMMICH_API_KEY=\
  LLAMA_SWAP_URL=http://localhost:18080 \
  poetry run python -m instagram_poster.benchmark run \
    --album-id 46565b85-7580-4ac1-91a6-1ece2cf8634d \
    --models qwen3vl-8b,minicpm-v-4-5,qwen3vl-4b \
    --limit 100 --random-seed 42 --run-id 2026-05-10-1424

The same --random-seed reproduces the photo sample exactly. Prompt version 4bbb7e7721da24d9 is the SHA-256 of the system prompt + user prompt + GBNF grammar; rerunning under the same prompt version against the same seed should produce within-noise identical scores (the models themselves are temperature=0, top_k=1).

Next steps

  • Wire qwen3vl-4b into instagram-poster as an additional ranking signal alongside CLIP-based recency in /candidates. Cache the score per asset_id so we don't re-pay 4 s on every list refresh.
  • Bump k8s-node1 RAM so immich-ml + llama-swap can co-exist (drain → resize → uncordon, with kubelet systemReserved adjusted in stacks/infra/main.tf).
  • Re-benchmark with shared GPU once node1 RAM is bumped, to get realistic latency numbers when the T4 is also under load from immich-ml and frigate.
  • Front llama-swap with LiteLLM so Home Assistant and any other consumer can hit one OpenAI-compat gateway. Track separately.

Auto-generated report

Below is the unedited output of python -m instagram_poster.benchmark report --run-id 2026-05-10-1424, kept for diff-checking against future runs.

Per-model summary

model n parse_ok % error % p50 latency p95 latency median IG-fit median aesthetic
minicpm-v-4-5 100 100.0 0.0 5617 ms 5998 ms 7.0 8.0
qwen3vl-4b 100 100.0 0.0 3552 ms 4063 ms 8.0 8.0
qwen3vl-8b 100 100.0 0.0 5981 ms 6637 ms 7.0 8.0

Score histograms (instagram_fit_score 110)

minicpm-v-4-5

 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
 6: ███████ (7)
 7: █████████████████████████████████████████████████████████████████ (65)
 8: ███████████████████ (19)
 9: █████████ (9)
10: (0)

qwen3vl-4b

 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
 6: █████ (5)
 7: ████████████████ (16)
 8: ████████████████████████████████ (32)
 9: ███████████████████████████████████████████████ (47)
10: (0)

qwen3vl-8b

 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
 6: ███████████ (11)
 7: ███████████████████████████████████████████████████████ (55)
 8: █████████████████ (17)
 9: █████████████████ (17)
10: (0)

Top-10 by IG-fit per model — see benchmark-2026-05-10-1424.json

(Tables omitted from the curated report; available in the JSON dump alongside this file.)