infra/docs/post-mortems
Viktor Barzin 052c776eba immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:16:11 +00:00
2026-03-16-kured-containerd-cascade-outage.html
2026-03-16-nfs-csi-cascade-failure.md
2026-04-14-nfs-fsid0-dns-vault-outage.md docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] 2026-04-14 18:09:11 +00:00
2026-04-14-postmortem-pipeline-test.md
2026-04-18-authentik-outpost-shm-full.md
2026-04-19-registry-orphan-index.md
2026-04-22-vault-raft-leader-deadlock.md
2026-05-09-io-pressure-stale-nfs.md
2026-05-16-kured-stalled-and-anubis-ha.md
2026-05-17-gpu-driver-ubuntu2604-mismatch.md
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
2026-05-25-immich-anca-elements-io-storm.md
2026-05-30-redis-split-brain.md
2026-05-31-kured-sentinel-gate-oom.md
2026-06-01-cloudflared-stale-traefik-origin.md
2026-06-01-keel-match-tag-image-swap.md
2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog 2026-06-02 20:16:11 +00:00
index.html