infra/docs/adr/0016-gpu-vram-extended-resource-budget.md
Viktor Barzin 74819d4061 feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:57:40 +00:00


GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)

The single Tesla T4 (16 GB, ~15360 MiB usable) on k8s-node1 is time-sliced (nvidia.com/gpu advertised ×100, migStrategy: none) and shared by ~9 tenants (immich-ml, immich-server, frigate, llama-swap, portal-stt, tts, ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a scheduling turn, not memory — the scheduler is blind to VRAM, so the tenants can collectively overallocate the card. On 2026-06-02 immich-ml's unbounded onnxruntime OCR arena grew from ~2 GB to 10.7 GB, starved llama-swap's qwen3-8b, and silently broke recruiter-responder triage for ~5 h (docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md). The post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.

Context

  • MIG is impossible. The T4 is Turing; hardware memory partitioning (MIG) only exists on Ampere+. So per-tenant hardware isolation is off the table.
  • The card is busy but not steadily oversubscribed. Measured steady residents (2026-06-17, gpu_pod_memory_used_bytes): immich-ml ~2.1 GiB, frigate ~1.9 GiB, llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server ~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB free. The failure mode is a single tenant's runtime runaway, not a scheduling-time pile-on.
  • Prior art already exists (soft): a gpu-workload PriorityClass (1,200,000) is auto-stamped on every GPU pod by the Kyverno inject-gpu-workload-priority policy (tts excluded → tier-2-gpu, evicted first); tts runs behind a free-VRAM demand-gate (stacks/tts, scales 0↔1 on sum(gpu_pod_memory_used_bytes) vs a floor); immich-ml is soft-bounded by MACHINE_LEARNING_MODEL_TTL=600. What was missing is anything that bounds a tenant's VRAM during active use.
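The steady-state figures in the second bullet can be checked with a few lines of arithmetic. This is only a worked sum of the numbers quoted above against the ~15360 MiB usable figure from the summary; the dictionary and function names are illustrative, not part of the repo.

```python
# Steady residents measured on 2026-06-17 via gpu_pod_memory_used_bytes (GiB),
# copied from the bullet above. llama-swap is its one-model-at-a-time peak.
STEADY_GIB = {
    "immich-ml": 2.1,
    "frigate": 1.9,
    "llama-swap": 4.35,
    "immich-server": 1.2,
    "portal-stt": 1.5,
    "android-emulator": 0.15,
}

# ~15360 MiB usable on the T4, expressed in GiB.
USABLE_GIB = 15360 / 1024


def steady_headroom_gib(residents: dict, usable_gib: float) -> float:
    """Return the GiB left on the card once every steady resident is loaded."""
    return usable_gib - sum(residents.values())


used = sum(STEADY_GIB.values())
free = steady_headroom_gib(STEADY_GIB, USABLE_GIB)
print(f"used ~{used:.1f} GiB, free ~{free:.1f} GiB")
```

The result (about 11.2 GiB used, 3.8 GiB free) rounds to the "~11 GiB used, ~4 GiB free" quoted above, which is why the risk is a runtime runaway rather than a full card at scheduling time.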

Alternatives considered and rejected

  • NVIDIA MPS (device-plugin sharing.mps, hard CUDA_MPS_PINNED_DEVICE_MEM_LIMIT): caps are uniform — slice = total ÷ replicas, tenants get integer multiples. Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without large rounding waste on a card that has none to spare. Rejected.
  • HAMi vGPU (per-container nvidia.com/gpumem MiB caps, libvgpu CUDA hook): the correct hard-cap primitive and T4-supported, but it replaces the operator's device plugin (the operator owns/reconciles it), enforces via an LD_PRELOAD CUDA hook that is unproven for our NVENC transcode path (open codec bug), cannot cap the android-emulator (QEMU bypasses the CUDA hook — KubeVirt/Kata explicitly unsupported), carries a restart-triggered false-OOM bug (#1181) directly in our blast radius (kured reboots node1 regularly), and its reservation-based scheduling would supersede the working demand-gate and strand the ~4 GB of steady headroom. Too much risk and behavioral change for the single proven failure mode. Rejected for now; this ADR is the record of why, so a future "let's just use HAMi" re-opens with the trade-offs already on the table.

Decision

Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native pieces, no device-plugin/driver change, time-slicing untouched:

  1. Budget (schedule-time). Advertise a custom node-level extended resource viktorbarzin.me/gpumem on the GPU node (~14000 MiB: the ~15360 MiB usable minus ~1.4 GB of driver/CUDA-context/exporter slack), via a reconcile Job + CronJob that run kubectl patch node --subresource=status (dynamic over nvidia.com/gpu.present=true nodes; re-asserts after node re-register). Every always-on GPU tenant declares resources.limits."viktorbarzin.me/gpumem" (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 ≤ 14000 advertised). Extended resources are non-overcommittable (request==limit, integer), so the scheduler refuses to co-schedule past the card and the overflow stays Pending. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
  2. Watchdog (runtime). A gpu-vram-watchdog CronJob (every minute, nvidia ns) reads per-pod gpu_pod_memory_used_bytes (the host-PID exporter) and each GPU pod's declared gpumem, and only when actual free VRAM < floor (~1536 MiB) recycles the biggest over-budget offender (used > declared). Contract enforcement, not priority (immich-ml and llama-swap share gpu-workload, so priority can't distinguish them). Acting only under pressure lets a tenant burst into genuine slack; the recycle clears its arena (exactly what the TTL=600 Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
  3. Alerting (the never-built follow-up): GPU free-VRAM below floor, GPU pod Pending on gpumem, and pod-over-budget → the #alerts digest.
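As a rough sketch of the budget step, the snippet below builds the JSON Patch body that advertises the extended resource and checks that the declared budgets fit under it. It assumes the resource is counted in whole MiB (consistent with the integer budgets above) and that the patch is applied with a JSON-type patch. The helper names are hypothetical, and the actual reconcile Job may build its patch differently.

```python
import json

RESOURCE = "viktorbarzin.me/gpumem"
ADVERTISED_MIB = 14000

# Declared budgets of the always-on tenants, in MiB (from the Decision text).
BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def json_pointer_escape(key: str) -> str:
    """Escape a key for an RFC 6901 JSON Pointer ('~' -> '~0', '/' -> '~1')."""
    return key.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, amount: int) -> list:
    """Build a JSON Patch that sets status.capacity[resource] on a node."""
    path = "/status/capacity/" + json_pointer_escape(resource)
    # Kubernetes resource quantities are serialized as strings.
    return [{"op": "add", "path": path, "value": str(amount)}]


def unallocated_mib(budgets: dict, advertised: int) -> int:
    """Return the MiB left after all budgets; raise if they exceed capacity."""
    total = sum(budgets.values())
    if total > advertised:
        raise ValueError(f"budgets total {total} MiB > advertised {advertised} MiB")
    return advertised - total


print(json.dumps(capacity_patch(RESOURCE, ADVERTISED_MIB)))
print(unallocated_mib(BUDGETS_MIB, ADVERTISED_MIB))  # 700
```

The printed patch could be passed to something like `kubectl patch node <node> --subresource=status --type=json -p '<patch>'`. The `/` in the resource name has to be written as `~1` in the path, which is the one detail that is easy to get wrong by hand.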

This is soft enforcement: the scheduler reserves on paper and the watchdog corrects at runtime with a detection lag (seconds to a minute), so a brief physical overshoot is possible before a recycle. Accepted, given the failure mode is a slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries disproportionate risk for this hardware.
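The watchdog's decision rule (act only under pressure, and only against a tenant that is over its declared budget) can be sketched as a pure function. This is a minimal illustration, not the shipped watchdog. It assumes free VRAM is computed as a fixed card total minus the sum of per-pod usage, and it interprets "biggest offender" as the largest overage in MiB. The `Tenant` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MIB = 1024 * 1024


@dataclass(frozen=True)
class Tenant:
    pod: str
    used_bytes: int            # e.g. from gpu_pod_memory_used_bytes
    budget_mib: Optional[int]  # declared gpumem limit; None if unbudgeted


def pick_recycle_target(
    tenants: Iterable[Tenant], total_mib: int, floor_mib: int = 1536
) -> Optional[Tenant]:
    """Return the tenant to recycle, or None if no action is warranted."""
    tenants = list(tenants)
    # Everyone's usage counts toward free VRAM, budgeted or not.
    free_mib = total_mib - sum(t.used_bytes for t in tenants) / MIB
    if free_mib >= floor_mib:
        return None  # no pressure: bursting into real slack is allowed
    # Under pressure, only tenants breaking their own contract are candidates.
    offenders = [
        t for t in tenants
        if t.budget_mib is not None and t.used_bytes / MIB > t.budget_mib
    ]
    if not offenders:
        return None
    # Assumption: "biggest" means the largest overage above the declared budget.
    return max(offenders, key=lambda t: t.used_bytes / MIB - t.budget_mib)
```

With numbers resembling 2026-06-02 (immich-ml at about 10700 MiB against a 3000 MiB budget, llama-swap within its 5000 MiB), free VRAM on a 15360 MiB card falls below the floor and immich-ml is selected. The same over-budget immich-ml with plenty of free VRAM is left alone.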

Consequences

  • The 2026-06-02 class is bounded without touching the pinned driver, the GPU operator, or time-slicing. immich-ml can no longer silently grow into llama-swap's VRAM: it either schedules within its budget or, on a true runaway under pressure, gets recycled (its heavy library job is the intended loser).
  • The card has a seating chart now. Sum of declared budgets ≤ ~14 GB, so a new always-on GPU tenant requires re-budgeting; a tenant whose declared budget does not fit the remaining capacity sits Pending. This is the intended, legible back-pressure.
  • Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are NOT budgeted in v1. They fill actual slack rather than holding a scheduler seat (tts via its existing free-VRAM demand-gate), and are covered by the ~1.4 GB physical reserve plus budget headroom (the five residents' budgets sum to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then the watchdog protects the budgeted five and counts everyone's usage when computing free VRAM.
  • New RBAC: the reconcile SA patches nodes/status; the watchdog SA lists pods cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than existing cluster-admin tooling (woodpecker-agent).
  • Apply order matters: advertise gpumem (nvidia stack) before the consumer stacks declare it; otherwise a pod requesting the not-yet-advertised extended resource is unschedulable. The reconcile runs as a Job (immediate) for this reason.
  • Fully reversible: delete the CronJobs/Job + the gpumem stanzas, and kubectl patch node --subresource=status to remove the capacity key. Nothing structural; no driver/operator state to unwind.
  • The gpumem numbers are first estimates; tune from gpu_pod_memory_used_bytes.
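For the reversibility bullet above, removing the capacity key is a single JSON Patch `remove` operation. The sketch below only builds that patch body; the function name is hypothetical, and it assumes the patch is applied as a JSON-type patch against the node's status subresource.

```python
import json


def remove_capacity_patch(resource: str) -> list:
    """Build a JSON Patch that deletes status.capacity[resource] from a node."""
    # RFC 6901 escaping: '~' becomes '~0' and '/' becomes '~1'.
    escaped = resource.replace("~", "~0").replace("/", "~1")
    return [{"op": "remove", "path": "/status/capacity/" + escaped}]


print(json.dumps(remove_capacity_patch("viktorbarzin.me/gpumem")))
# [{"op": "remove", "path": "/status/capacity/viktorbarzin.me~1gpumem"}]
```

The output could be passed to something like `kubectl patch node <node> --subresource=status --type=json -p '<patch>'` after the CronJobs/Job and the gpumem stanzas are gone, so nothing re-asserts the key.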