feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
308a174ad6
commit
74819d4061
8 changed files with 609 additions and 3 deletions
|
|
@ -852,6 +852,32 @@ serverFiles:
|
|||
# targets: "alertmanager.viktorbarzin.lan"
|
||||
alerting_rules.yml:
|
||||
groups:
|
||||
# GPU VRAM budget (ADR-0016). The post-mortem 2026-06-02 follow-up — alert
|
||||
# on GPU free-VRAM — finally built. Physical T4 = 15360 MiB; free =
|
||||
# physical - sum(gpu_pod_memory_used_bytes) (the host-PID exporter gauge).
|
||||
- name: GPU VRAM
|
||||
rules:
|
||||
- alert: GPUVRAMLow
|
||||
expr: (15360 * 1024 * 1024 - sum(gpu_pod_memory_used_bytes)) / 1024 / 1024 < 1024 and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "GPU free VRAM {{ $value | printf \"%.0f\" }} MiB (<1024) — T4 oversubscribed; gpu-vram-watchdog recycles the biggest over-budget tenant. See ADR-0016."
|
||||
- alert: GPUVRAMTelemetryDown
|
||||
expr: absent(gpu_pod_memory_used_bytes)
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "GPU VRAM telemetry (gpu-pod-exporter) absent 15m — the free-VRAM alert AND the watchdog are blind to per-pod usage."
|
||||
- alert: GPUVRAMWatchdogDown
|
||||
expr: kube_deployment_status_replicas_available{namespace="nvidia", deployment="gpu-vram-watchdog"} < 1
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "gpu-vram-watchdog has no available replica for 15m — runtime VRAM enforcement (over-budget recycle) is OFF. Budget still scheduler-enforced."
|
||||
- name: R730 Host
|
||||
rules:
|
||||
- alert: HighCPUTemperature
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue