infra/docs/adr/0016-gpu-vram-extended-resource-budget.md

# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)

The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
turn, not memory* — the scheduler is blind to VRAM, so the tenants can
collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
qwen3-8b, and silently broke recruiter-responder triage for ~5 h
(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.

## Context

- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
  only exists on select Ampere-and-later data-center GPUs (for example A100, A30,
  H100). So per-tenant *hardware* isolation is off the table.
- **The card is busy but not steadily oversubscribed.** Measured steady residents
  (2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
  llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
  ~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
  free. **The failure mode is a single tenant's runtime runaway, not a
  scheduling-time pile-on.**
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
  is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
  policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
  free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
  vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
  was missing is anything that bounds a tenant's VRAM *during active use*.
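
The headroom claim, and why a single runaway breaks it, can be checked with a few lines of arithmetic. This is a minimal sketch: the steady figures are the 2026-06-17 measurements quoted above, the helper name `headroom` is hypothetical, and the incident's 10.7 GB is treated as roughly 10.7 GiB purely for a back-of-the-envelope comparison.

```python
# Steady per-tenant VRAM in GiB, copied from the 2026-06-17 measurements above.
STEADY_GIB = {
    "immich-ml": 2.1,
    "frigate": 1.9,
    "llama-swap": 4.35,  # peak, one model at a time
    "immich-server": 1.2,
    "portal-stt": 1.5,
    "android-emulator": 0.15,
}

USABLE_GIB = 15360 / 1024  # ~15360 MiB usable on the T4


def headroom(residents: dict, usable_gib: float) -> tuple:
    """Return (used, free) in GiB for a set of residents (hypothetical helper)."""
    used = sum(residents.values())
    return used, usable_gib - used


# Steady state: the card is busy but fits.
steady_used, steady_free = headroom(STEADY_GIB, USABLE_GIB)

# Assumption: replay 2026-06-02 by letting immich-ml alone grow to ~10.7 GiB.
runaway = {**STEADY_GIB, "immich-ml": 10.7}
runaway_used, runaway_free = headroom(runaway, USABLE_GIB)

print(f"steady:  used ~{steady_used:.1f} GiB, free ~{steady_free:.1f} GiB")
print(f"runaway: used ~{runaway_used:.1f} GiB, free ~{runaway_free:.1f} GiB")
```

A negative free figure in the second line means demand exceeds the card: one tenant's growth, with no new pod scheduled, is enough to starve a neighbour.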

### Alternatives considered and rejected

- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
  caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
  Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
  large rounding waste on a card that has none to spare. Rejected.
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
  the *correct* hard-cap primitive and T4-supported, but it **replaces the
  operator's device plugin** (the operator owns/reconciles it), enforces via an
  `LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
  (open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
  hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
  false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
  regularly), and its reservation-based scheduling would **supersede the working
  demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
  behavioral change for the single proven failure mode. Rejected for now; this
  ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
  trade-offs already on the table.
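
The rounding-waste objection to MPS can be made concrete. The sketch below takes the model stated above at face value (slice = `total ÷ replicas`, each tenant rounded up to a whole number of slices). `REPLICAS = 10` is a hypothetical choice, and the needs are the Context measurements converted to MiB and rounded; tts, ebook2audiobook and ytdlp are left out because no measurement is given for them.

```python
import math

TOTAL_MIB = 15360  # usable VRAM quoted in this ADR
REPLICAS = 10      # hypothetical MPS replica count, for illustration only

# Steady needs in MiB (Context measurements x 1024, rounded).
NEEDS_MIB = {
    "immich-ml": 2150,
    "frigate": 1946,
    "llama-swap": 4454,
    "immich-server": 1229,
    "portal-stt": 1536,
    "android-emulator": 154,
}


def uniform_slice_plan(total_mib: int, replicas: int, needs_mib: dict):
    """Round every tenant up to a whole number of equal slices.

    Returns (slice size, {tenant: (slices taken, MiB wasted by rounding)}).
    """
    slice_mib = total_mib // replicas
    plan = {}
    for tenant, need in needs_mib.items():
        slices = math.ceil(need / slice_mib)
        plan[tenant] = (slices, slices * slice_mib - need)
    return slice_mib, plan


slice_mib, plan = uniform_slice_plan(TOTAL_MIB, REPLICAS, NEEDS_MIB)
slices_used = sum(slices for slices, _ in plan.values())
wasted_mib = sum(waste for _, waste in plan.values())

for tenant, (slices, waste) in plan.items():
    print(f"{tenant:17s} {slices} x {slice_mib} MiB, {waste} MiB lost to rounding")
print(f"{slices_used}/{REPLICAS} slices taken, {wasted_mib} MiB wasted")
```

Under these assumptions the six measured residents already take every slice, and the rounding loss is about the size of the steady headroom the card has today, leaving no seat for the on-demand tenants.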

## Decision

Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
pieces, **no device-plugin/driver change, time-slicing untouched**:

1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
   `viktorbarzin.me/gpumem`** on the GPU node (~14000 MiB: the ~15360 MiB usable
   minus ~1.4 GB of driver/CUDA-context/exporter slack), via a reconcile Job +
   CronJob that run `kubectl patch node --subresource=status` (dynamic over
   `nvidia.com/gpu.present=true` nodes; re-asserts after node re-register).
   Every always-on GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`
   (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt
   1500; sum 13300 ≤ 14000 advertised). Extended resources are
   **non-overcommittable** (request==limit, integer), so the scheduler refuses to
   co-schedule past the card → overflow `Pending`. On-demand batch tenants
   (tts/ebook2audiobook/ytdlp) keep the free-VRAM demand-gate and fill the real
   slack rather than holding a reserved seat.
2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, nvidia ns)
   reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
   pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
   recycles the biggest **over-budget** offender (used > declared). Contract
   enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
   priority can't distinguish them). Acting only under pressure lets a tenant burst
   into genuine slack; the recycle clears its arena (exactly what the TTL=600
   Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
   `Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
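
The watchdog rule in step 2 can be sketched as a pure decision function. This is an illustration under stated assumptions, not the shipped watchdog: the `GpuPod` type and `pick_offender` name are hypothetical, free VRAM is approximated as usable minus the sum of per-pod usage, and "biggest offender" is read as largest overage (used minus declared), which the text above does not pin down.

```python
from dataclasses import dataclass
from typing import Optional

USABLE_MIB = 15360  # usable VRAM quoted in this ADR
FLOOR_MIB = 1536    # free-VRAM floor from step 2


@dataclass(frozen=True)
class GpuPod:
    name: str
    used_mib: int                       # gpu_pod_memory_used_bytes converted to MiB
    declared_mib: Optional[int] = None  # gpumem limit; None for unbudgeted tenants


def pick_offender(pods, usable_mib=USABLE_MIB, floor_mib=FLOOR_MIB):
    """Return the pod to recycle, or None when no action is warranted."""
    # Every pod's usage counts toward free VRAM, budgeted or not.
    free_mib = usable_mib - sum(p.used_mib for p in pods)
    if free_mib >= floor_mib:
        return None  # no pressure: bursting into genuine slack is allowed
    over_budget = [
        p for p in pods
        if p.declared_mib is not None and p.used_mib > p.declared_mib
    ]
    # Assumption: rank offenders by how far they exceed their declared budget.
    return max(over_budget, key=lambda p: p.used_mib - p.declared_mib, default=None)


# Hypothetical replay of 2026-06-02: immich-ml far past its 3000 MiB budget.
incident = [
    GpuPod("immich-ml", used_mib=10700, declared_mib=3000),
    GpuPod("llama-swap", used_mib=4400, declared_mib=5000),
    GpuPod("android-emulator", used_mib=150),
]
# Same tenant slightly over budget while the card still has plenty of slack.
burst = [
    GpuPod("immich-ml", used_mib=3500, declared_mib=3000),
    GpuPod("llama-swap", used_mib=4400, declared_mib=5000),
]

print(pick_offender(incident))
print(pick_offender(burst))
```

Keeping the decision free of I/O makes the contract easy to test: the first case selects immich-ml, while the second returns `None` because free VRAM is still above the floor.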

This is **soft enforcement**: the scheduler reserves on paper and the watchdog
corrects at runtime with a detection lag (seconds–minute), so a brief physical
overshoot is possible before a recycle. Accepted, given the failure mode is a
slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
disproportionate risk for this hardware.

## Consequences

- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
  operator, or time-slicing. immich-ml can no longer silently grow into
  llama-swap's VRAM: it either schedules within its budget or, on a true runaway
  under pressure, gets recycled (its heavy library job is the intended loser).
- **The card has a seating chart now.** Declared budgets must sum to ≤ 14000 MiB,
  so a new always-on GPU tenant requires re-budgeting; a pod whose declared budget
  does not fit the remaining capacity sits `Pending`. This is the intended,
  legible back-pressure.
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
  NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
  seat (tts via its existing free-VRAM demand-gate), and are covered by the
  ~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
  to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
  the watchdog protects the budgeted five and counts everyone's usage toward free.
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
  cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
  existing cluster-admin tooling (woodpecker-agent).
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
  consumer stacks declare it, or a pod requesting an unadvertised extended
  resource is unschedulable. The reconcile runs as a Job (immediate) for this.
- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
  `kubectl patch node --subresource=status` to remove the capacity key. Nothing
  structural; no driver/operator state to unwind.
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
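
The budget check and the advertise/remove patches mentioned above can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, and the JSON Patch shape follows the usual Kubernetes convention for extended resources, where the `/` in the resource name is escaped as `~1` in the path. The reconcile Job's actual script is not reproduced here.

```python
import json

RESOURCE = "viktorbarzin.me/gpumem"
ADVERTISED_MIB = 14000

# Declared budgets in MiB, copied from the Decision section.
BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def budget_headroom(budgets: dict, advertised: int) -> int:
    """Return the unreserved capacity; raise if the budgets oversubscribe the card."""
    total = sum(budgets.values())
    if total > advertised:
        raise ValueError(f"budgets sum to {total} MiB > {advertised} MiB advertised")
    return advertised - total


def json_pointer_escape(key: str) -> str:
    """Escape a key for use in a JSON Pointer path (RFC 6901)."""
    return key.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, value=None) -> list:
    """Build a JSON Patch that advertises (value given) or removes the capacity key."""
    path = f"/status/capacity/{json_pointer_escape(resource)}"
    if value is None:
        return [{"op": "remove", "path": path}]
    return [{"op": "add", "path": path, "value": str(value)}]


headroom_mib = budget_headroom(BUDGETS_MIB, ADVERTISED_MIB)
advertise = json.dumps(capacity_patch(RESOURCE, ADVERTISED_MIB))
rollback = json.dumps(capacity_patch(RESOURCE))

print(f"unreserved: {headroom_mib} MiB")
print(f"kubectl patch node k8s-node1 --subresource=status --type=json -p '{advertise}'")
print(f"kubectl patch node k8s-node1 --subresource=status --type=json -p '{rollback}'")
```

The printed commands are only a convenience for inspection; the first corresponds to the advertise step that must run before the consumer stacks apply, the second to the rollback.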

feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation

The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 07:57:40 +00:00