feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02
immich-ml's onnxruntime arena grew to 10.7 GB and silently starved
llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory
protection so we don't overallocate GPU memory, and chose to do it at the
scheduling level (no device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:

- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB)
  via a reconcile null_resource (immediate, apply-time) + hourly re-assert
  CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500;
  sum 13300 <= advertised) so the scheduler refuses to co-schedule past
  the card (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant
  ONLY when actual free VRAM < floor. Ships DRY_RUN=true
  (observe-then-enforce); flip to false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown /
  GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built
  free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md
  GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0"
  llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the
node shows the capacity, THEN the consumer stacks -- the cutover bounces
GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 308a174ad6
commit 74819d4061
8 changed files with 609 additions and 3 deletions
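The watchdog rule in the commit message (recycle the biggest over-budget tenant, but only when free VRAM drops below a floor, and only log while DRY_RUN is true) can be sketched as below. This is a minimal illustration, not the shipped gpu-vram-watchdog: the function names, the 1000 MiB floor, and the usage figures are hypothetical, and "biggest" is assumed here to mean the largest overage above budget.

```python
from typing import Dict, Optional

# Per-tenant budgets in MiB, copied from the commit message.
BUDGETS_MIB: Dict[str, int] = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def pick_victim(
    free_mib: int,
    floor_mib: int,
    usage_mib: Dict[str, int],
    budgets_mib: Dict[str, int] = BUDGETS_MIB,
) -> Optional[str]:
    """Return the tenant to recycle, or None if nothing should be done."""
    # Gate 1: do nothing while the card still has enough free VRAM.
    if free_mib >= floor_mib:
        return None
    # Gate 2: only tenants above their declared budget are candidates.
    overage = {
        tenant: used - budgets_mib[tenant]
        for tenant, used in usage_mib.items()
        if tenant in budgets_mib and used > budgets_mib[tenant]
    }
    if not overage:
        return None
    # Assumption: "biggest" means the largest overage, not the largest usage.
    return max(overage, key=lambda tenant: overage[tenant])


def act(victim: Optional[str], dry_run: bool = True) -> str:
    """Describe the action; a real watchdog would restart the pod here."""
    if victim is None:
        return "no action"
    return f"would recycle {victim}" if dry_run else f"recycle {victim}"


# Illustrative numbers loosely modelled on the 2026-06-02 incident.
usage = {"immich-ml": 10700, "llama-swap": 4000}
print(act(pick_victim(free_mib=300, floor_mib=1000, usage_mib=usage)))
```

Gating on free VRAM first means a tenant that is over budget on an otherwise idle card is left alone, which matches the "ONLY when actual free VRAM < floor" wording.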
@@ -266,8 +266,11 @@ resource "kubernetes_config_map" "llama_swap_config" {

 # Single Deployment running llama-swap. Spawns per-model llama-server
 # subprocesses on demand and unloads them after `ttl` seconds idle.
-# The whole T4 is allocated to this pod via nvidia.com/gpu=1; immich-ml
-# must be scaled to 0 during benchmark runs.
+# nvidia.com/gpu=1 buys ONE time-slice (a scheduling turn, NOT the card's
+# memory) — the T4 is shared with immich-ml/frigate/immich-server/portal-stt.
+# VRAM is bounded per-tenant by the gpumem budget + watchdog (ADR-0016), not by
+# scaling co-tenants to 0. llama-swap loads ONE model at a time (no `groups` =
+# swap mode, ttl=600 unloads idle), so its footprint is the largest single model.
 resource "kubernetes_deployment" "llama_swap" {
   metadata {
     name = "llama-swap"
@@ -355,6 +358,9 @@ resource "kubernetes_deployment" "llama_swap" {
             limits = {
               memory = "12Gi"
               "nvidia.com/gpu" = "1"
+              # GPU VRAM budget (ADR-0016): one model at a time; qwen3-8b @16k
+              # peak ~4.35 GiB + headroom (the protected recruiter-responder tenant).
+              "viktorbarzin.me/gpumem" = "5000"
             }
           }
         }
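The arithmetic behind the 5000 budget, and the "overflow -> Pending" behaviour from the commit message, can be checked with a short sketch. It assumes gpumem is counted in MiB (the commit message gives the advertised capacity as ~14000 MiB) and that the scheduler does plain sum-of-requests accounting for the extended resource; the `fits` helper is hypothetical and only mimics that accounting.

```python
# Budgets and capacity in MiB, copied from the commit message.
BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}
ADVERTISED_MIB = 14000  # approximate value advertised on the node


def fits(requested_mib: int, allocated_mib: int,
         capacity_mib: int = ADVERTISED_MIB) -> bool:
    """A pod fits only if the total of all requests stays within capacity."""
    return allocated_mib + requested_mib <= capacity_mib


# llama-swap headroom: qwen3-8b @16k peaks at about 4.35 GiB.
peak_mib = 4.35 * 1024
headroom_mib = BUDGETS_MIB["llama-swap"] - peak_mib
print(f"peak ~{round(peak_mib)} MiB, headroom ~{round(headroom_mib)} MiB")

# All five always-on tenants together, plus one hypothetical extra pod.
total_mib = sum(BUDGETS_MIB.values())
print(f"total budget {total_mib} MiB, spare {ADVERTISED_MIB - total_mib} MiB")
print("extra 1000 MiB pod fits:", fits(1000, total_mib))
```

With these figures the five budgets leave 700 MiB unallocated, so any further tenant requesting more than that would stay Pending rather than be co-scheduled onto the card.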