immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog
immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent cda858d560
commit 052c776eba
4 changed files with 124 additions and 11 deletions
@@ -690,9 +690,14 @@ resource "kubernetes_deployment" "immich-machine-learning" {
     protocol = "TCP"
     name = "immich-ml"
   }
+  # Idle models unload after 600s, returning VRAM to the shared T4.
+  # MUST stay > 0: at 0 nothing ever unloads and onnxruntime's CUDA
+  # arena (OCR's dynamic input shapes balloon it to ~10GB) is held
+  # forever, starving llama-swap (qwen3-8b) on the same time-sliced
+  # GPU and silently breaking recruiter-responder triage.
   env {
     name = "MACHINE_LEARNING_MODEL_TTL"
-    value = "0"
+    value = "600"
   }
   env {
     name = "TRANSFORMERS_CACHE"
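
The "MUST stay > 0" rule above lives only in a comment, so a later edit could silently set the value back to 0. One possible guard is to route the value through a Terraform variable with a validation block. This is a minimal sketch, not part of this commit: the variable name `immich_ml_model_ttl` is hypothetical, and the `env` block is assumed to sit where the existing one does inside the `immich-machine-learning` container.

```hcl
# Hypothetical variable; this commit hardcodes "600" instead.
variable "immich_ml_model_ttl" {
  description = "Seconds an idle immich-ml model stays loaded before it is unloaded."
  type        = number
  default     = 600

  validation {
    # Per the commit message, 0 means models never unload, so reject it at plan time.
    condition     = var.immich_ml_model_ttl > 0
    error_message = "MACHINE_LEARNING_MODEL_TTL must be > 0, otherwise immich-ml never frees VRAM on the shared T4."
  }
}

# Assumed to replace the hardcoded env block in the container spec.
# env {
#   name  = "MACHINE_LEARNING_MODEL_TTL"
#   value = tostring(var.immich_ml_model_ttl)
# }
```

With this in place, `terraform plan` would fail with the error message above if the variable were set to 0 or a negative number, rather than relying on a reader noticing the comment.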