diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index bce23531..cdf6f09c 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -134,7 +134,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle | Service | Key Operational Knowledge | |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | -| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | +| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | | Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding | diff --git a/docs/architecture/llama-cpp.md b/docs/architecture/llama-cpp.md index bd7c4be2..f3e91674 100644 --- a/docs/architecture/llama-cpp.md +++ b/docs/architecture/llama-cpp.md @@ -68,11 +68,26 @@ for the initial deployment. ## GPU allocation -The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4 -allocation). The shared T4 is also used by Immich's ML pod -(`immich.immich-machine-learning`); only one of the two can hold the -GPU at a time. Operator must scale immich-ml to 0 before running a -benchmark and restore it after: +The llama-swap pod requests `nvidia.com/gpu: 1`, but the T4 is +**time-sliced** by the NVIDIA device plugin — several pods on k8s-node1 +each hold a `nvidia.com/gpu: 1` slice and run **concurrently**: +`llama-swap`, `immich.immich-machine-learning`, `immich.immich-server` +(NVENC transcode), and `frigate`. Time-slicing shares *compute* but +**not memory** — the 16 GB VRAM is a single unpartitioned pool, so one +greedy tenant can starve all the others. + +This is a real failure mode, not theoretical: on 2026-06-02 immich-ml +(running with `MACHINE_LEARNING_MODEL_TTL=0`, so nothing ever unloaded) +let its onnxruntime CUDA arena balloon to 10.7 GB during an OCR-heavy +library job and held it, leaving only ~2 GB free. llama-swap then +couldn't allocate qwen3-8b (~4.5 GB) → `cudaMalloc` OOM → `llama-server` +exited → 502s → recruiter-responder triage failed silently for ~5 h. +Fix: immich `MODEL_TTL=600` so idle models unload and return VRAM. See +`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`. + +Budget the T4 accordingly: with immich-ml idle (~2 GB CLIP) + frigate +(~2 GB) there is ample room for an 8 B model. For a heavy benchmark you +can still evict immich-ml entirely to guarantee headroom: ```bash kubectl scale -n immich deploy/immich-machine-learning --replicas=0 @@ -84,10 +99,15 @@ kubectl scale -n immich deploy/immich-machine-learning --replicas=1 | ID | HF repo | Quant | Ctx | mmproj | |----|---------|-------|-----|--------| +| `qwen3-8b` | `Qwen/Qwen3-8B-GGUF` | Q4_K_M | 16384 | no (text-only) | | `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes | | `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes | | `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes | +`qwen3-8b` (text-only) is the Tier-0 triage model for +`recruiter-responder`; the `qwen3vl-*` / `minicpm-v` models serve the +vision use cases. + llama.cpp build pinned via the `llama-swap:cuda` image (ships a recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix [#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and @@ -107,10 +127,13 @@ mtmd Flash-Attention regression fix ## Known issues / decisions -- **Cluster-wide GPU contention** — only one of llama-swap or - immich-ml can hold the T4. No GPU sharing solution wired in - (MPS/MIG would help but T4 has no MIG and MPS is overkill for two - workloads). +- **Cluster-wide GPU contention** — the T4 is time-sliced across + llama-swap, immich-ml, immich-server, and frigate; compute is shared + but the 16 GB VRAM is **not** isolated, so any tenant can OOM the + others (see "GPU allocation" + the 2026-06-02 post-mortem). No hard + memory partitioning is wired in (T4 has no MIG; MPS memory limits are + overkill). Mitigation is keeping each tenant's resident footprint + bounded — for immich-ml that means `MACHINE_LEARNING_MODEL_TTL > 0`. - **Filename-agnostic config** — the download Job creates stable `model.gguf` / `mmproj.gguf` symlinks per model dir so the llama-swap config doesn't need to track exact HF filenames (which diff --git a/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md b/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md new file mode 100644 index 00000000..872f6325 --- /dev/null +++ b/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md @@ -0,0 +1,85 @@ +# Post-Mortem: immich-ml VRAM hog (MODEL_TTL=0) starved llama-swap → recruiter-responder silently down + +| Field | Value | +|-------|-------| +| **Date** | 2026-06-02 | +| **Duration** | Triage failing 17:41 → ~20:08 EEST (~2.5 h confirmed in retained logs; first 502 at 17:41) | +| **Severity** | SEV3 — one pipeline (recruiter-responder) fully down; no data loss (emails preserved unseen); no other user-facing impact | +| **Affected** | `recruiter-responder` (triage). Root cause in `immich-machine-learning` + shared T4 GPU. | +| **Status** | Fixed — `immich` `MACHINE_LEARNING_MODEL_TTL` 0 → 600; immich-ml VRAM dropped 10.7 GB → ~1.9 GB; qwen3-8b loads again; backlog reprocessed. | + +## Summary + +Reported by the operator: "receiving recruiter emails but seeing no responses." +The recruiter-responder IMAP IDLE reader was healthy and fetching mail, but every +email failed at the triage step with `502 Bad Gateway` from llama-swap. llama-swap +could not load its `qwen3-8b` model because the shared Tesla T4 (16 GB) had only +~2.2 GB free — `immich-machine-learning` was holding **10.7 GB** and never released +it. Because triage *raised* (not swallowed), each email was left **unseen** and +retried, so no mail was lost — but no draft/event/Telegram notification was ever +produced. + +## Root cause (chain) + +``` +immich-ml runs with MACHINE_LEARNING_MODEL_TTL=0 → ModelCache(revalidate=False), + per-model TTL eviction + idle-shutdown both DISABLED → nothing ever unloads + ▼ +heavy immich library job ~17:17 (metadata + smartSearch + OCR + face) runs OCR + (PP-OCRv5, dynamic input shapes) → onnxruntime BFC CUDA arena inflates to ~10.7 GB + ▼ +TTL=0 → the arena floor is permanent (onnxruntime doesn't cudaFree between runs; + only a process restart reclaims it) + ▼ +T4 free VRAM ~2.2 GB (T4 is time-sliced across immich-ml / immich-server / + frigate / llama-swap with NO memory isolation) + ▼ +llama-swap gets a qwen3-8b request → llama-server: cudaMalloc 4455 MiB OOM → + "exiting due to model loading error" → llama-swap returns 502 + ▼ +recruiter-responder triage.py raise_for_status() → orchestrator raises → + imap_idle leaves the message UNSEEN (BODY.PEEK) → no draft/event → no Telegram +``` + +## Why it was hard to spot + +- **Everything showed `Running`/healthy**: the recruiter-responder, llama-swap, and + immich-ml pods were all `1/1 Running` with 0 restarts. The failure was a runtime + 502, not a crash. +- **`nvidia-smi` inside a container shows "No running processes found"** (PID-namespace + isolation) — per-process VRAM attribution needed the host-PID `gpu-pod-exporter` + (`nvidia-smi --query-compute-apps`), which pinned the 10.7 GB on `immich_ml.main`. +- **Silent**: triage errors only landed in recruiter-responder logs; no alert fired + on llama-swap 5xx or on low GPU free-VRAM. ~440 triage attempts failed before the + operator noticed organically. + +## Resolution + +- `stacks/immich/main.tf`: `MACHINE_LEARNING_MODEL_TTL` `0` → `600` (targeted apply of + `kubernetes_deployment.immich-machine-learning`). The Recreate rollout cleared the + stuck arena immediately; going forward, idle ad-hoc models (OCR, face) unload after + 600 s and return VRAM, while preloaded CLIP (smart search) stays warm. +- Verified: T4 used 12571 → 3785 MiB (11.1 GB free); immich-ml 10726 → 1940 MiB; + `qwen3-8b` chat completion returns HTTP 200; recruiter-responder reprocessed its + unseen backlog with triage `200 OK`. + +## Why MODEL_TTL=0 was set (and the correction) + +`MODEL_TTL=0` was almost certainly chosen to keep the smart-search model permanently +warm for snappy search. The unintended consequence: it *also* pins every ad-hoc model +(OCR/face) and lets onnxruntime's arena grow unbounded on a GPU it doesn't own alone. +immich has **no per-model TTL** (a single global knob; the idle path kills the whole +worker via `os.kill(getpid(), SIGINT)` and respawns), so the practical compromise is a +moderate global TTL + CLIP preload: CLIP reloads in ~10 s on the rare idle miss, while +OCR/face free their VRAM. + +## Follow-ups (not yet done — operator declined hardening this session) + +- **Alerting** on (a) GPU free-VRAM below a threshold and (b) llama-swap 5xx / + recruiter-responder triage failure rate, so a future starvation doesn't sit silent. + (Operator believes existing alerts cover it — unverified here.) +- **Optional** recruiter-responder resilience: fall back to a smaller model + (`qwen3vl-4b`) or the Tier-1 GPT relay when llama-swap 502s. +- **Separate pre-existing issue** surfaced in immich-server logs: repeated + `AssetExtractMetadata` `ENOENT` on `upload/upload/...` paths (missing originals) — + unrelated to this incident; worth a look. diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf index 183b3e50..6d3c7da4 100644 --- a/stacks/immich/main.tf +++ b/stacks/immich/main.tf @@ -690,9 +690,14 @@ resource "kubernetes_deployment" "immich-machine-learning" { protocol = "TCP" name = "immich-ml" } + # Idle models unload after 600s, returning VRAM to the shared T4. + # MUST stay > 0: at 0 nothing ever unloads and onnxruntime's CUDA + # arena (OCR's dynamic input shapes balloon it to ~10GB) is held + # forever, starving llama-swap (qwen3-8b) on the same time-sliced + # GPU and silently breaking recruiter-responder triage. env { name = "MACHINE_LEARNING_MODEL_TTL" - value = "0" + value = "600" } env { name = "TRANSFORMERS_CACHE"