immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog

immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated
onnxruntime's CUDA arena to ~10.7 GB and held it on the shared time-sliced T4,
starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently
for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models
(OCR, face) free VRAM while preloaded CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation;
add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md,
and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:16:11 +00:00
commit 052c776eba
parent cda858d560
4 changed files with 124 additions and 11 deletions
--- a/docs/architecture/llama-cpp.md
+++ b/docs/architecture/llama-cpp.md
@@ -68,11 +68,26 @@ for the initial deployment.

 ## GPU allocation

-The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4
-allocation). The shared T4 is also used by Immich's ML pod
-(`immich.immich-machine-learning`); only one of the two can hold the
-GPU at a time. Operator must scale immich-ml to 0 before running a
-benchmark and restore it after:
+The llama-swap pod requests `nvidia.com/gpu: 1`, but the T4 is
+**time-sliced** by the NVIDIA device plugin — several pods on k8s-node1
+each hold a `nvidia.com/gpu: 1` slice and run **concurrently**:
+`llama-swap`, `immich.immich-machine-learning`, `immich.immich-server`
+(NVENC transcode), and `frigate`. Time-slicing shares *compute* but
+**not memory** — the 16 GB VRAM is a single unpartitioned pool, so one
+greedy tenant can starve all the others.
+
+This is a real failure mode, not theoretical: on 2026-06-02 immich-ml
+(running with `MACHINE_LEARNING_MODEL_TTL=0`, so nothing ever unloaded)
+let its onnxruntime CUDA arena balloon to 10.7 GB during an OCR-heavy
+library job and held it, leaving only ~2 GB free. llama-swap then
+couldn't allocate qwen3-8b (~4.5 GB) → `cudaMalloc` OOM → `llama-server`
+exited → 502s → recruiter-responder triage failed silently for ~2.5 h.
+Fix: immich `MODEL_TTL=600` so idle models unload and return VRAM. See
+`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.
+
+Budget the T4 accordingly: with immich-ml idle (~2 GB CLIP) + frigate
+(~2 GB) there is ample room for an 8 B model. For a heavy benchmark you
+can still evict immich-ml entirely to guarantee headroom:

 ```bash
 kubectl scale -n immich deploy/immich-machine-learning --replicas=0
@@ -84,10 +99,15 @@ kubectl scale -n immich deploy/immich-machine-learning --replicas=1

 | ID | HF repo | Quant | Ctx | mmproj |
 |----|---------|-------|-----|--------|
+| `qwen3-8b` | `Qwen/Qwen3-8B-GGUF` | Q4_K_M | 16384 | no (text-only) |
 | `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
 | `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
 | `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |

+`qwen3-8b` (text-only) is the Tier-0 triage model for
+`recruiter-responder`; the `qwen3vl-*` / `minicpm-v` models serve the
+vision use cases.
+
 llama.cpp build pinned via the `llama-swap:cuda` image (ships a
 recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
 [#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
@@ -107,10 +127,13 @@ mtmd Flash-Attention regression fix

 ## Known issues / decisions

- **Cluster-wide GPU contention** — only one of llama-swap or
-  immich-ml can hold the T4. No GPU sharing solution wired in
-  (MPS/MIG would help but T4 has no MIG and MPS is overkill for two
-  workloads).
+- **Cluster-wide GPU contention** — the T4 is time-sliced across
+  llama-swap, immich-ml, immich-server, and frigate; compute is shared
+  but the 16 GB VRAM is **not** isolated, so any tenant can OOM the
+  others (see "GPU allocation" + the 2026-06-02 post-mortem). No hard
+  memory partitioning is wired in (T4 has no MIG; MPS memory limits are
+  overkill). Mitigation is keeping each tenant's resident footprint
+  bounded — for immich-ml that means `MACHINE_LEARNING_MODEL_TTL > 0`.
 - **Filename-agnostic config** — the download Job creates stable
  `model.gguf` / `mmproj.gguf` symlinks per model dir so the
  llama-swap config doesn't need to track exact HF filenames (which
--- a/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md
+++ b/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md
@@ -0,0 +1,85 @@
+# Post-Mortem: immich-ml VRAM hog (MODEL_TTL=0) starved llama-swap → recruiter-responder silently down
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-06-02 |
+| **Duration** | Triage failing 17:41 → ~20:08 EEST (~2.5 h confirmed in retained logs; first 502 at 17:41) |
+| **Severity** | SEV3 — one pipeline (recruiter-responder) fully down; no data loss (emails preserved unseen); no other user-facing impact |
+| **Affected** | `recruiter-responder` (triage). Root cause in `immich-machine-learning` + shared T4 GPU. |
+| **Status** | Fixed — `immich` `MACHINE_LEARNING_MODEL_TTL` 0 → 600; immich-ml VRAM dropped 10.7 GB → ~1.9 GB; qwen3-8b loads again; backlog reprocessed. |
+
+## Summary
+
+Reported by the operator: "receiving recruiter emails but seeing no responses."
+The recruiter-responder IMAP IDLE reader was healthy and fetching mail, but every
+email failed at the triage step with `502 Bad Gateway` from llama-swap. llama-swap
+could not load its `qwen3-8b` model because the shared Tesla T4 (16 GB) had only
+~2.2 GB free — `immich-machine-learning` was holding **10.7 GB** and never released
+it. Because triage *raised* the error instead of swallowing it, each email was left
+**unseen** and retried, so no mail was lost, but no draft/event/Telegram notification
+was ever produced.
+
+## Root cause (chain)
+
+```
+immich-ml runs with MACHINE_LEARNING_MODEL_TTL=0  →  ModelCache(revalidate=False),
+  per-model TTL eviction + idle-shutdown both DISABLED → nothing ever unloads
+        ▼
+heavy immich library job ~17:17 (metadata + smartSearch + OCR + face) runs OCR
+  (PP-OCRv5, dynamic input shapes) → onnxruntime BFC CUDA arena inflates to ~10.7 GB
+        ▼
+TTL=0 → the arena floor is permanent (onnxruntime doesn't cudaFree between runs;
+  only a process restart reclaims it)
+        ▼
+T4 free VRAM ~2.2 GB  (T4 is time-sliced across immich-ml / immich-server /
+  frigate / llama-swap with NO memory isolation)
+        ▼
+llama-swap gets a qwen3-8b request → llama-server: cudaMalloc 4455 MiB OOM →
+  "exiting due to model loading error" → llama-swap returns 502
+        ▼
+recruiter-responder triage.py raise_for_status() → orchestrator raises →
+  imap_idle leaves the message UNSEEN (BODY.PEEK) → no draft/event → no Telegram
+```
+
+## Why it was hard to spot
+
+- **Everything showed `Running`/healthy**: the recruiter-responder, llama-swap, and
+  immich-ml pods were all `1/1 Running` with 0 restarts. The failure was a runtime
+  502, not a crash.
+- **`nvidia-smi` inside a container shows "No running processes found"** (PID-namespace
+  isolation) — per-process VRAM attribution needed the host-PID `gpu-pod-exporter`
+  (`nvidia-smi --query-compute-apps`), which pinned the 10.7 GB on `immich_ml.main`.
+- **Silent**: triage errors only landed in recruiter-responder logs; no alert fired
+  on llama-swap 5xx or on low GPU free-VRAM. ~440 triage attempts failed before the
+  operator noticed organically.
+
+## Resolution
+
+- `stacks/immich/main.tf`: `MACHINE_LEARNING_MODEL_TTL` `0` → `600` (targeted apply of
+  `kubernetes_deployment.immich-machine-learning`). The Recreate rollout cleared the
+  stuck arena immediately; going forward, idle ad-hoc models (OCR, face) unload after
+  600 s and return VRAM, while preloaded CLIP (smart search) stays warm.
+- Verified: T4 used 12571 → 3785 MiB (11.1 GB free); immich-ml 10726 → 1940 MiB;
+  `qwen3-8b` chat completion returns HTTP 200; recruiter-responder reprocessed its
+  unseen backlog with triage `200 OK`.
+
+## Why MODEL_TTL=0 was set (and the correction)
+
+`MODEL_TTL=0` was almost certainly chosen to keep the smart-search model permanently
+warm for snappy search. The unintended consequence: it *also* pins every ad-hoc model
+(OCR/face) and lets onnxruntime's arena grow unbounded on a GPU it doesn't own alone.
+immich has **no per-model TTL** (a single global knob; the idle path kills the whole
+worker via `os.kill(getpid(), SIGINT)` and respawns), so the practical compromise is a
+moderate global TTL + CLIP preload: CLIP reloads in ~10 s on the rare idle miss, while
+OCR/face free their VRAM.
+
+## Follow-ups (not yet done — operator declined hardening this session)
+
+- **Alerting** on (a) GPU free-VRAM below a threshold and (b) llama-swap 5xx /
+  recruiter-responder triage failure rate, so a future starvation doesn't sit silent.
+  (Operator believes existing alerts cover it — unverified here.)
+- **Optional** recruiter-responder resilience: fall back to a smaller model
+  (`qwen3vl-4b`) or the Tier-1 GPT relay when llama-swap 502s.
+- **Separate pre-existing issue** surfaced in immich-server logs: repeated
+  `AssetExtractMetadata` `ENOENT` on `upload/upload/...` paths (missing originals) —
+  unrelated to this incident; worth a look.
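
The free-VRAM half of the alerting follow-up could be prototyped roughly as below. This is a minimal sketch and not part of the commit: the function names, the 512 MiB margin, and running it on a host with `nvidia-smi` available are all assumptions. Only the 4455 MiB figure comes from the post-mortem (the allocation that failed).

```python
import subprocess

# Assumption: threshold = the 4455 MiB allocation that failed in the incident
# plus an arbitrary 512 MiB margin. Tune this for the real workload.
REQUIRED_FREE_MIB = 4455 + 512


def parse_free_mib(csv_text: str) -> list[int]:
    """Parse one free-VRAM value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line) for line in csv_text.splitlines() if line.strip()]


def starved_gpus(csv_text: str, required_mib: int = REQUIRED_FREE_MIB) -> list[int]:
    """Return output positions of GPUs with less than required_mib free."""
    return [
        i for i, free in enumerate(parse_free_mib(csv_text)) if free < required_mib
    ]


def query_free_vram() -> str:
    """Run nvidia-smi and return its raw CSV output (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def main() -> int:
    """Hypothetical entry point: exit status 1 if any GPU is short on VRAM."""
    low = starved_gpus(query_free_vram())
    for idx in low:
        print(f"GPU {idx}: free VRAM below {REQUIRED_FREE_MIB} MiB")
    return 1 if low else 0
```

Calling `main()` from a scheduled job with GPU access would return a non-zero status whenever free VRAM drops below the threshold. With roughly 2.2 GB free, as during the incident, the check would flag the T4; with the post-fix ~11.1 GB free it would pass. Wiring the result into an actual alert is left open here.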