immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog

immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:16:11 +00:00 · 052c776eba
commit 052c776eba
parent cda858d560
4 changed files with 124 additions and 11 deletions
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -690,9 +690,14 @@ resource "kubernetes_deployment" "immich-machine-learning" {
            protocol       = "TCP"
            name           = "immich-ml"
          }
+          # Idle models unload after 600s, returning VRAM to the shared T4.
+          # MUST stay > 0: at 0 nothing ever unloads and onnxruntime's CUDA
+          # arena (OCR's dynamic input shapes balloon it to ~10GB) is held
+          # forever, starving llama-swap (qwen3-8b) on the same time-sliced
+          # GPU and silently breaking recruiter-responder triage.
          env {
            name  = "MACHINE_LEARNING_MODEL_TTL"
-            value = "0"
+            value = "600"
          }
          env {
            name  = "TRANSFORMERS_CACHE"