infra/docs/adr/0016-gpu-vram-extended-resource-budget.md
Viktor Barzin 74819d4061 feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:57:40 +00:00


# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
turn, not memory* — the scheduler is blind to VRAM, so the tenants can
collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
qwen3-8b, and silently broke recruiter-responder triage for ~5 h
(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
## Context
- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
- **The card is busy but not steadily oversubscribed.** Measured steady residents
(2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
free. **The failure mode is a single tenant's runtime runaway, not a
scheduling-time pile-on.**
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
was missing is anything that bounds a tenant's VRAM *during active use*.
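
The steady-state headroom quoted above can be re-derived with a few lines of arithmetic. This is only a sketch: the per-tenant figures are the measured values from the bullet, the usable total is the ~15360 MiB figure from the introduction, and the names `STEADY_GIB` and `steady_headroom` are illustrative, not part of the repo.

```python
# Measured steady residents in GiB (2026-06-17), copied from the Context bullet.
STEADY_GIB = {
    "immich-ml": 2.1,
    "frigate": 1.9,
    "llama-swap": 4.35,  # peak, one model at a time
    "immich-server": 1.2,
    "portal-stt": 1.5,
    "android-emulator": 0.15,
}

USABLE_MIB = 15360  # usable VRAM on the T4, per the introduction


def steady_headroom(residents_gib, usable_mib=USABLE_MIB):
    """Return (used_gib, free_gib) for a set of steady residents."""
    usable_gib = usable_mib / 1024
    used_gib = sum(residents_gib.values())
    return used_gib, usable_gib - used_gib


used, free = steady_headroom(STEADY_GIB)
print(f"used ~{used:.1f} GiB, free ~{free:.1f} GiB")
```

The result is roughly 11.2 GiB used and 3.8 GiB free, which matches the "~11 GiB used, ~4 GiB free" summary.
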
### Alternatives considered and rejected
- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
large rounding waste on a card that has none to spare. Rejected.
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
the *correct* hard-cap primitive and T4-supported, but it **replaces the
operator's device plugin** (the operator owns/reconciles it), enforces via an
`LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
(open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
regularly), and its reservation-based scheduling would **supersede the working
demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
behavioral change for the single proven failure mode. Rejected for now; this
ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
trade-offs already on the table.
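
The rounding-waste argument against MPS can be illustrated with a toy model. It assumes exactly what the bullet states (slice = total ÷ replicas, each tenant holds a whole number of slices) and nothing about the device plugin's real behavior. The per-tenant needs below reuse the budgets declared later in this ADR as stand-ins, and `mps_uniform_fit` is a hypothetical helper.

```python
# Toy model of uniform MPS slicing. Not the NVIDIA device plugin's logic,
# only the "slice = total / replicas, integer multiples" rule from the text.
TOTAL_MIB = 15360

# Stand-in needs in MiB: the budgets declared in the Decision section.
NEEDS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def mps_uniform_fit(total_mib, replicas, needs_mib):
    """Round every tenant up to whole slices and report fit and waste."""
    slice_mib = total_mib // replicas
    # Ceiling division: the smallest number of slices that covers each need.
    slices = {name: -(-need // slice_mib) for name, need in needs_mib.items()}
    slices_used = sum(slices.values())
    reserved_mib = slices_used * slice_mib
    return {
        "slice_mib": slice_mib,
        "slices": slices,
        "fits": slices_used <= replicas,
        "rounding_waste_mib": reserved_mib - sum(needs_mib.values()),
    }


for replicas in (4, 8, 20):
    print(replicas, mps_uniform_fit(TOTAL_MIB, replicas, NEEDS_MIB))
```

Under these assumptions, four replicas do not fit (six slices are needed), and twenty replicas fit only by reserving 1292 MiB more than the five tenants asked for, before the four smaller tenants are counted.
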
## Decision
Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
pieces, **no device-plugin/driver change, time-slicing untouched**:
1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
`viktorbarzin.me/gpumem`** on the GPU node (~14000 MiB: the ~15360 MiB usable
minus ~1.4 GB of driver/CUDA-context/exporter slack), via a reconcile Job plus a
CronJob that run `kubectl patch node --subresource=status` against every
`nvidia.com/gpu.present=true` node and re-assert the capacity after a node re-registers.
Every always-on GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
≤ advertised). Extended resources are **non-overcommittable** (request==limit,
integer), so the scheduler refuses to co-schedule past the card → overflow
`Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, nvidia ns)
reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
recycles the biggest **over-budget** offender (used > declared). This is contract
enforcement, not priority: immich-ml and llama-swap share `gpu-workload`, so
priority can't distinguish them. Acting only under pressure lets a tenant burst
into genuine slack; the recycle clears its arena (exactly what the TTL=600
Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
`Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
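
The schedule-time budget in item 1 reduces to integer bookkeeping. The sketch below mimics that bookkeeping; it is not the kube-scheduler's code. The budget values and the 14000 MiB capacity come from this ADR, while `fits` and `remaining` are illustrative names.

```python
# Advertised capacity and declared budgets, in MiB, as listed in item 1.
ADVERTISED_MIB = 14000
BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def remaining(advertised_mib, allocated_mib):
    """Capacity left after every already-scheduled budget is subtracted."""
    return advertised_mib - sum(allocated_mib.values())


def fits(advertised_mib, allocated_mib, request_mib):
    """Extended resources cannot be overcommitted: a request either fits
    entirely in what is left, or the pod stays Pending."""
    return request_mib <= remaining(advertised_mib, allocated_mib)


print(sum(BUDGETS_MIB.values()))               # total declared budget
print(remaining(ADVERTISED_MIB, BUDGETS_MIB))  # seats left on the card
print(fits(ADVERTISED_MIB, BUDGETS_MIB, 1000)) # would a 1000 MiB tenant schedule?
```

With the five budgets above, 700 MiB of advertised capacity is left, so a hypothetical new tenant asking for 1000 MiB would stay `Pending` until the budgets are reworked.
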
This is **soft enforcement**: the scheduler reserves on paper and the watchdog
corrects at runtime with a detection lag (seconds to a minute), so a brief physical
overshoot is possible before a recycle. Accepted, given the failure mode is a
slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
disproportionate risk for this hardware.
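
The watchdog's decision rule can be sketched as a pure function. This is not the shipped script. Three points are assumptions: free VRAM is taken as usable total minus the sum of per-pod usage, "biggest" is read as the largest overage (used minus declared), and unbudgeted tenants count toward usage but are never recycled. `PodUsage` and `pick_victim` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PodUsage:
    name: str
    used_mib: int                # from gpu_pod_memory_used_bytes, converted to MiB
    declared_mib: Optional[int]  # declared gpumem budget; None if unbudgeted


def pick_victim(pods: List[PodUsage],
                total_mib: int = 15360,
                floor_mib: int = 1536) -> Optional[PodUsage]:
    """Return the pod to recycle, or None if no action is warranted."""
    # Assumption: free VRAM = usable total minus everything the pods hold.
    free_mib = total_mib - sum(p.used_mib for p in pods)
    if free_mib >= floor_mib:
        return None  # no pressure: tenants may burst into genuine slack

    # Only tenants that broke their own declared budget are candidates.
    offenders = [p for p in pods
                 if p.declared_mib is not None and p.used_mib > p.declared_mib]
    if not offenders:
        return None

    # Assumption: "biggest" means the largest overage, not the largest usage.
    return max(offenders, key=lambda p: p.used_mib - p.declared_mib)


# Rough replay of 2026-06-02: immich-ml at ~10.7 GB against a 3000 MiB budget.
incident = [
    PodUsage("immich-ml", 10700, 3000),
    PodUsage("frigate", 1900, 2000),
    PodUsage("immich-server", 1200, 1800),
    PodUsage("portal-stt", 1500, 1500),
]
victim = pick_victim(incident)
print(victim.name if victim else "no action")
```

Keeping the rule a pure function makes a dry-run mode straightforward: the caller can log the chosen victim without deleting the pod.
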
## Consequences
- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
operator, or time-slicing. immich-ml can no longer silently grow into
llama-swap's VRAM: it either schedules within its budget or, on a true runaway
under pressure, gets recycled (its heavy library job is the intended loser).
- **The card has a seating chart now.** Sum of declared budgets ≤ ~14 GB, so a new
always-on GPU tenant requires re-budgeting; an over-budget on-demand tenant sits
`Pending`. This is the intended, legible back-pressure.
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
seat (tts via its existing free-VRAM demand-gate), and are covered by the
~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
the watchdog protects the budgeted five and counts everyone's usage toward free.
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
existing cluster-admin tooling (woodpecker-agent).
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
consumer stacks declare it, or a pod requesting an unadvertised extended
resource is unschedulable. The reconcile runs as a Job (immediate) for this.
- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
`kubectl patch node --subresource=status` to remove the capacity key. Nothing
structural; no driver/operator state to unwind.
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
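
Both the advertise step and the rollback amount to a JSON Patch against the node's `status.capacity`. The sketch below builds those patches and the matching `kubectl` argument list. Details worth hedging: the `/` in the resource name must be escaped as `~1` in a JSON Pointer path, the helper names are illustrative, and the repo's actual Job may build the patch differently.

```python
import json
from typing import List, Optional

RESOURCE = "viktorbarzin.me/gpumem"


def escape_pointer_token(token: str) -> str:
    """Escape a key for use in a JSON Pointer path (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, quantity: Optional[int] = None) -> List[dict]:
    """Build a JSON Patch that sets (or, with quantity=None, removes)
    an extended resource under the node's status.capacity."""
    path = "/status/capacity/" + escape_pointer_token(resource)
    if quantity is None:
        return [{"op": "remove", "path": path}]
    return [{"op": "add", "path": path, "value": str(quantity)}]


def kubectl_args(node: str, patch: List[dict]) -> List[str]:
    """Argument list for subprocess.run; nothing is executed here."""
    return ["kubectl", "patch", "node", node,
            "--subresource=status", "--type=json", "-p", json.dumps(patch)]


print(kubectl_args("k8s-node1", capacity_patch(RESOURCE, 14000)))  # advertise
print(kubectl_args("k8s-node1", capacity_patch(RESOURCE)))         # roll back
```
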