feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched:

- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 308a174ad6
commit 74819d4061
8 changed files with 609 additions and 3 deletions
22
CONTEXT.md
|
|
@ -56,6 +56,28 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
|
|||
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
|
||||
_Avoid_: bare "user", "tenant".
|
||||
|
||||
### GPU sharing
|
||||
|
||||
**GPU slice**:
|
||||
One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
|
||||
_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
|
||||
|
||||
**GPU memory budget**:
|
||||
The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
|
||||
_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
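
A minimal sketch of the schedule-time accounting this entry describes, using the budgets and the advertised total from this commit. The helper name `fits` and the extra 1000 MiB request are hypothetical, for illustration only; the real check is done by the Kubernetes scheduler, not by repo code.

```python
# Sketch of schedule-time accounting for viktorbarzin.me/gpumem (integer MiB).
ADVERTISED_MIB = 14000  # node capacity advertised for the extended resource

BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def fits(budgets, capacity_mib, new_request_mib=0):
    """True if the existing budgets plus a new request stay within capacity."""
    return sum(budgets.values()) + new_request_mib <= capacity_mib


print(sum(BUDGETS_MIB.values()))                # 13300
print(fits(BUDGETS_MIB, ADVERTISED_MIB))        # True: the five residents schedule
print(fits(BUDGETS_MIB, ADVERTISED_MIB, 1000))  # False: the extra pod stays Pending
```

Nothing here limits what a running tenant actually allocates; the arithmetic only decides whether a pod is admitted to the node.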
|
||||
|
||||
**GPU watchdog**:
|
||||
The `gpu-vram-watchdog` Deployment (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (physical total minus the sum of `gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
|
||||
_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
|
||||
|
||||
**GPU demand-gate**:
|
||||
The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
|
||||
_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
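
The gate/watchdog distinction can be sketched as two predicates. This is an illustration under stated assumptions: the function names are hypothetical, and the gate floor of 4000 MiB is a made-up value (the real floor lives in `stacks/tts` and is not shown in this commit). The watchdog floor of 1536 MiB and the immich-ml budget of 3000 MiB are taken from this commit.

```python
def gate_admits(free_mib, gate_floor_mib):
    """Demand-gate: bring the batch tenant up only when enough VRAM is free."""
    return free_mib >= gate_floor_mib


def watchdog_acts(free_mib, watchdog_floor_mib, used_mib, budget_mib):
    """Watchdog: recycle only under pressure AND when the tenant is over budget."""
    return free_mib < watchdog_floor_mib and used_mib > budget_mib


# Hypothetical gate floor; the watchdog floor and budget come from this commit.
print(gate_admits(free_mib=4310, gate_floor_mib=4000))   # True: batch tenant may start
print(watchdog_acts(60, 1536, used_mib=10700, budget_mib=3000))   # True: recycle
print(watchdog_acts(2910, 1536, used_mib=3500, budget_mib=3000))  # False: burst into slack
```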
|
||||
|
||||
**gpu-workload priority**:
|
||||
The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy — the exclude list (`tts`) drops to `tier-2-gpu` (600,000) so it loses node-pressure eviction first. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
|
||||
_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
|
||||
|
||||
### Workstation (multi-user devvm)
|
||||
|
||||
**devvm**:
|
||||
|
|
|
|||
107
docs/adr/0016-gpu-vram-extended-resource-budget.md
Normal file
|
|
@ -0,0 +1,107 @@
|
|||
# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
|
||||
|
||||
The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
|
||||
(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
|
||||
(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
|
||||
ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
|
||||
turn, not memory* — the scheduler is blind to VRAM, so the tenants can
|
||||
collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
|
||||
onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
|
||||
qwen3-8b, and silently broke recruiter-responder triage for ~5 h
|
||||
(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
|
||||
post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
|
||||
|
||||
## Context
|
||||
|
||||
- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
|
||||
only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
|
||||
- **The card is busy but not steadily oversubscribed.** Measured steady residents
|
||||
(2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
|
||||
llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
|
||||
~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
|
||||
free. **The failure mode is a single tenant's runtime runaway, not a
|
||||
scheduling-time pile-on.**
|
||||
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
|
||||
is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
|
||||
policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
|
||||
free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
|
||||
vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
|
||||
was missing is anything that bounds a tenant's VRAM *during active use*.
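
The steady-state arithmetic in the second bullet can be checked directly. A small sketch using the measured figures quoted above; the 15 GiB total is 15360 MiB expressed in GiB.

```python
# Measured steady residents (GiB), as quoted in the Context bullet above.
RESIDENTS_GIB = {
    "immich-ml": 2.1,
    "frigate": 1.9,
    "llama-swap": 4.35,  # peak, one model at a time
    "immich-server": 1.2,
    "portal-stt": 1.5,
    "android-emulator": 0.15,
}

CARD_GIB = 15360 / 1024  # 15.0 GiB usable

used_gib = sum(RESIDENTS_GIB.values())
free_gib = CARD_GIB - used_gib

print(f"used={used_gib:.2f} GiB free={free_gib:.2f} GiB")  # used=11.20 GiB free=3.80 GiB
```

This matches the "~11 GiB used, ~4 GiB free" summary: at steady state the card is not oversubscribed, which is why the decision targets runtime runaway.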
|
||||
|
||||
### Alternatives considered and rejected
|
||||
|
||||
- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
|
||||
caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
|
||||
Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
|
||||
large rounding waste on a card that has none to spare. Rejected.
|
||||
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
|
||||
the *correct* hard-cap primitive and T4-supported, but it **replaces the
|
||||
operator's device plugin** (the operator owns/reconciles it), enforces via an
|
||||
`LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
|
||||
(open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
|
||||
hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
|
||||
false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
|
||||
regularly), and its reservation-based scheduling would **supersede the working
|
||||
demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
|
||||
behavioral change for the single proven failure mode. Rejected for now; this
|
||||
ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
|
||||
trade-offs already on the table.
|
||||
|
||||
## Decision
|
||||
|
||||
Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
|
||||
pieces, **no device-plugin/driver change, time-slicing untouched**:
|
||||
|
||||
1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
|
||||
`viktorbarzin.me/gpumem`** on the GPU node (~14000 MiB = 15360 MiB physical
minus ~1.4 GiB driver/CUDA-context/exporter slack), via an apply-time
`null_resource` plus an hourly CronJob, both of which run
`kubectl patch node --subresource=status` (dynamic over
`nvidia.com/gpu.present=true` nodes; the CronJob re-asserts after node re-register).
|
||||
Every GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
|
||||
3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
|
||||
≤ advertised). Extended resources are **non-overcommittable** (request==limit,
|
||||
integer), so the scheduler refuses to co-schedule past the card → overflow
|
||||
`Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
|
||||
free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
|
||||
2. **Watchdog (runtime).** A `gpu-vram-watchdog` Deployment (one pod with an internal 60 s loop, nvidia ns)
|
||||
reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
|
||||
pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
|
||||
recycles the biggest **over-budget** offender (used > declared). Contract
|
||||
enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
|
||||
priority can't distinguish them). Acting only under pressure lets a tenant burst
|
||||
into genuine slack; the recycle clears its arena (exactly what the TTL=600
|
||||
Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
|
||||
3. **Alerting** (the never-built follow-up): `GPUVRAMLow` (free VRAM below 1024 MiB),
`GPUVRAMTelemetryDown` (exporter metric absent), and `GPUVRAMWatchdogDown` (no
available watchdog replica), routed to the `#alerts` digest.
|
||||
|
||||
This is **soft enforcement**: the scheduler reserves on paper and the watchdog
|
||||
corrects at runtime with a detection lag (seconds–minute), so a brief physical
|
||||
overshoot is possible before a recycle. Accepted, given the failure mode is a
|
||||
slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
|
||||
disproportionate risk for this hardware.
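
To make the soft-enforcement rule concrete, here is a sketch that replays three situations through the selection logic described in item 2. The function name `pick_offender` is hypothetical, and the per-tenant MiB figures are rough illustrations derived from the GiB numbers in this ADR, not measurements.

```python
PHYSICAL_MIB = 15360
FLOOR_MIB = 1536
BUDGETS_MIB = {"immich-ml": 3000, "llama-swap": 5000, "frigate": 2000,
               "immich-server": 1800, "portal-stt": 1500}


def pick_offender(used_mib, budgets_mib, physical_mib, floor_mib):
    """Return the tenant to recycle, or None when no action is warranted."""
    free = physical_mib - sum(used_mib.values())
    if free >= floor_mib:
        return None  # no pressure: bursting into slack is allowed
    over = [(used_mib.get(t, 0) - b, t)
            for t, b in budgets_mib.items() if used_mib.get(t, 0) > b]
    return max(over)[1] if over else None  # biggest overshoot loses


steady = {"immich-ml": 2100, "llama-swap": 4350, "frigate": 1900,
          "immich-server": 1200, "portal-stt": 1500}
burst = dict(steady, **{"immich-ml": 3500})           # over budget, card still has slack
runaway = {"immich-ml": 10700, "frigate": 1900,       # 2026-06-02-style arena growth
           "immich-server": 1200, "portal-stt": 1500}

print(pick_offender(steady, BUDGETS_MIB, PHYSICAL_MIB, FLOOR_MIB))   # None
print(pick_offender(burst, BUDGETS_MIB, PHYSICAL_MIB, FLOOR_MIB))    # None
print(pick_offender(runaway, BUDGETS_MIB, PHYSICAL_MIB, FLOOR_MIB))  # immich-ml
```

The middle case is the design choice worth noticing: being over budget alone does not trigger a recycle; pressure must also be present.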
|
||||
|
||||
## Consequences
|
||||
|
||||
- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
|
||||
operator, or time-slicing. immich-ml can no longer silently grow into
|
||||
llama-swap's VRAM: it either schedules within its budget or, on a true runaway
|
||||
under pressure, gets recycled (its heavy library job is the intended loser).
|
||||
- **The card has a seating chart now.** Sum of declared budgets ≤ ~14 GB, so a new
|
||||
always-on GPU tenant requires re-budgeting; an over-budget on-demand tenant sits
|
||||
`Pending`. This is the intended, legible back-pressure.
|
||||
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
|
||||
NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
|
||||
seat (tts via its existing free-VRAM demand-gate), and are covered by the
|
||||
~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
|
||||
to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
|
||||
the watchdog protects the budgeted five and counts everyone's usage toward free.
|
||||
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA can list and
delete pods cluster-wide (the script only ever deletes an over-budget pod on the
GPU node). Far less privileged than
|
||||
existing cluster-admin tooling (woodpecker-agent).
|
||||
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
|
||||
consumer stacks declare it, or a pod requesting an unadvertised extended
|
||||
resource is unschedulable. The apply-time `null_resource` advertises it immediately for this reason.
|
||||
- **Fully reversible:** delete the reconcile CronJob, the watchdog Deployment, the `null_resource`, and the `gpumem` stanzas, and
|
||||
`kubectl patch node --subresource=status` to remove the capacity key. Nothing
|
||||
structural; no driver/operator state to unwind.
|
||||
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
|
||||
|
|
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "frigate" {
|
|||
metadata {
|
||||
name = "frigate"
|
||||
labels = {
|
||||
tier = local.tiers.gpu
|
||||
tier = local.tiers.gpu
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
# labels = {
|
||||
|
|
@ -117,6 +117,8 @@ resource "kubernetes_deployment" "frigate" {
|
|||
limits = {
|
||||
memory = "10Gi"
|
||||
"nvidia.com/gpu" = "1"
|
||||
# GPU VRAM budget (ADR-0016): detector + ffmpeg decode (~1.9 GiB).
|
||||
"viktorbarzin.me/gpumem" = "2000"
|
||||
}
|
||||
}
|
||||
env {
|
||||
|
|
|
|||
|
|
@ -376,6 +376,10 @@ resource "kubernetes_deployment" "immich_server" {
|
|||
limits = {
|
||||
memory = "8Gi"
|
||||
"nvidia.com/gpu" = "1"
|
||||
# GPU VRAM budget (ADR-0016): schedule-time reservation + the
|
||||
# gpu-vram-watchdog recycle threshold. Bounds the onnxruntime
|
||||
# OCR-arena runaway that starved llama-swap on 2026-06-02.
|
||||
"viktorbarzin.me/gpumem" = "3000"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -732,6 +736,8 @@ resource "kubernetes_deployment" "immich-machine-learning" {
|
|||
limits = {
|
||||
memory = "3584Mi"
|
||||
"nvidia.com/gpu" = "1"
|
||||
# GPU VRAM budget (ADR-0016): NVENC transcode footprint (~1.2 GiB).
|
||||
"viktorbarzin.me/gpumem" = "1800"
|
||||
}
|
||||
}
|
||||
liveness_probe {
|
||||
|
|
|
|||
|
|
@ -266,8 +266,11 @@ resource "kubernetes_config_map" "llama_swap_config" {
|
|||
|
||||
# Single Deployment running llama-swap. Spawns per-model llama-server
|
||||
# subprocesses on demand and unloads them after `ttl` seconds idle.
|
||||
# The whole T4 is allocated to this pod via nvidia.com/gpu=1; immich-ml
|
||||
# must be scaled to 0 during benchmark runs.
|
||||
# nvidia.com/gpu=1 buys ONE time-slice (a scheduling turn, NOT the card's
|
||||
# memory) — the T4 is shared with immich-ml/frigate/immich-server/portal-stt.
|
||||
# VRAM is bounded per-tenant by the gpumem budget + watchdog (ADR-0016), not by
|
||||
# scaling co-tenants to 0. llama-swap loads ONE model at a time (no `groups` =
|
||||
# swap mode, ttl=600 unloads idle), so its footprint is the largest single model.
|
||||
resource "kubernetes_deployment" "llama_swap" {
|
||||
metadata {
|
||||
name = "llama-swap"
|
||||
|
|
@ -355,6 +358,9 @@ resource "kubernetes_deployment" "llama_swap" {
|
|||
limits = {
|
||||
memory = "12Gi"
|
||||
"nvidia.com/gpu" = "1"
|
||||
# GPU VRAM budget (ADR-0016): one model at a time; qwen3-8b @16k
|
||||
# peak ~4.35 GiB + headroom (the protected recruiter-responder tenant).
|
||||
"viktorbarzin.me/gpumem" = "5000"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -852,6 +852,32 @@ serverFiles:
|
|||
# targets: "alertmanager.viktorbarzin.lan"
|
||||
  alerting_rules.yml:
    groups:
      # GPU VRAM budget (ADR-0016). The post-mortem 2026-06-02 follow-up (alert
      # on GPU free-VRAM), finally built. Physical T4 = 15360 MiB; free =
      # physical - sum(gpu_pod_memory_used_bytes) (the host-PID exporter gauge).
      - name: GPU VRAM
        rules:
          - alert: GPUVRAMLow
            expr: (15360 * 1024 * 1024 - sum(gpu_pod_memory_used_bytes)) / 1024 / 1024 < 1024 and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "GPU free VRAM {{ $value | printf \"%.0f\" }} MiB (<1024): T4 oversubscribed; gpu-vram-watchdog recycles the biggest over-budget tenant. See ADR-0016."
          - alert: GPUVRAMTelemetryDown
            expr: absent(gpu_pod_memory_used_bytes)
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "GPU VRAM telemetry (gpu-pod-exporter) absent 15m: the free-VRAM alert AND the watchdog are blind to per-pod usage."
          - alert: GPUVRAMWatchdogDown
            expr: kube_deployment_status_replicas_available{namespace="nvidia", deployment="gpu-vram-watchdog"} < 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "gpu-vram-watchdog has no available replica for 15m: runtime VRAM enforcement (over-budget recycle) is OFF. Budget still scheduler-enforced."
      - name: R730 Host
        rules:
          - alert: HighCPUTemperature
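
As a worked check of the `GPUVRAMLow` expression, the sketch below reproduces its arithmetic on plain numbers. It covers only the free-VRAM comparison; the `for: 10m` hold and the Prometheus-uptime guard are not modelled, and the function names are hypothetical.

```python
MIB = 1024 * 1024
PHYSICAL_BYTES = 15360 * MIB  # the constant hard-coded in the alert expression


def free_mib(used_bytes_per_pod):
    """(physical - sum(gpu_pod_memory_used_bytes)) converted to MiB."""
    return (PHYSICAL_BYTES - sum(used_bytes_per_pod)) / MIB


def gpu_vram_low(used_bytes_per_pod, threshold_mib=1024):
    """True when the GPUVRAMLow comparison would hold for this sample."""
    return free_mib(used_bytes_per_pod) < threshold_mib


sample = [10700 * MIB, 1900 * MIB, 1200 * MIB, 1000 * MIB]  # illustrative values
print(free_mib(sample))      # 560.0
print(gpu_vram_low(sample))  # True
```

Note that the alert threshold (1024 MiB) sits below the watchdog floor (1536 MiB), so with enforcement enabled the watchdog should normally act before this alert fires.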
|
||||
|
|
|
|||
435
stacks/nvidia/modules/nvidia/gpu_memory_budget.tf
Normal file
|
|
@ -0,0 +1,435 @@
|
|||
# =============================================================================
|
||||
# GPU VRAM protection — scheduler extended-resource budget + runtime watchdog
|
||||
# =============================================================================
|
||||
# See docs/adr/0016-gpu-vram-extended-resource-budget.md. The T4 is time-sliced
|
||||
# (nvidia.com/gpu = a scheduling turn, NOT memory), so the scheduler is blind to
|
||||
# VRAM and tenants can overallocate the card (post-mortem 2026-06-02: immich-ml's
|
||||
# onnxruntime arena 2->10.7 GiB starved llama-swap). MIG is impossible on Turing;
|
||||
# HAMi/MPS were rejected (ADR-0016). Instead, two repo-native layers, NO
|
||||
# device-plugin/driver change, time-slicing untouched:
|
||||
# 1. Budget — advertise a node extended resource `viktorbarzin.me/gpumem`;
|
||||
# each GPU tenant declares resources.limits gpumem; the scheduler
|
||||
# refuses to co-schedule past the card (overflow -> Pending).
|
||||
# 2. Watchdog — when ACTUAL free VRAM < floor, recycle the biggest tenant that
|
||||
# is OVER its declared budget (contract enforcement; the teeth the
|
||||
# schedule-time budget lacks).
|
||||
# =============================================================================
|
||||
|
||||
variable "gpumem_resource" {
|
||||
type = string
|
||||
default = "viktorbarzin.me/gpumem"
|
||||
description = "Custom node extended-resource name advertised for GPU memory budgeting (integer MiB)."
|
||||
}
|
||||
|
||||
variable "gpumem_total_mib" {
|
||||
type = number
|
||||
default = 14000
|
||||
description = "Schedulable GPU-memory budget advertised on the GPU node = ~15360 MiB physical minus ~1.4 GiB driver/CUDA-context/exporter slack. Sum of all tenants' declared gpumem must stay <= this."
|
||||
}
|
||||
|
||||
variable "watchdog_gpu_total_mib" {
|
||||
type = number
|
||||
default = 15360
|
||||
description = "PHYSICAL T4 framebuffer (MiB). The watchdog computes free = this - sum(gpu_pod_memory_used_bytes); distinct from gpumem_total_mib (the scheduler budget)."
|
||||
}
|
||||
|
||||
variable "watchdog_floor_mib" {
|
||||
type = number
|
||||
default = 1536
|
||||
description = "The watchdog acts only when actual free VRAM drops below this floor (genuine pressure), so a tenant may burst into real slack without being recycled."
|
||||
}
|
||||
|
||||
variable "watchdog_dry_run" {
|
||||
type = bool
|
||||
default = true
|
||||
description = "When true the watchdog logs the recycle it WOULD do but does not delete the pod. Ships true (observe-then-enforce); flip to false once a few cycles look right."
|
||||
}
|
||||
|
||||
locals {
|
||||
gpumem_json_pointer = "/status/capacity/${replace(var.gpumem_resource, "/", "~1")}"
|
||||
}
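
The `replace(..., "/", "~1")` above is JSON Pointer escaping (RFC 6901): a `/` inside a key must be written as `~1` so it is not read as a path separator. A small sketch of the same construction in Python; the helper names are hypothetical, and the extra `~` to `~0` step is part of the RFC but is not needed for this particular resource name.

```python
import json


def json_pointer_token(name):
    """Escape one JSON Pointer reference token: '~' first, then '/'."""
    return name.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource, total_mib):
    """Build the JSON Patch body that advertises the extended resource."""
    path = "/status/capacity/" + json_pointer_token(resource)
    return json.dumps([{"op": "add", "path": path, "value": str(total_mib)}])


print(capacity_patch("viktorbarzin.me/gpumem", 14000))
# [{"op": "add", "path": "/status/capacity/viktorbarzin.me~1gpumem", "value": "14000"}]
```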
|
||||
|
||||
# --- 1a. Advertise the extended resource at apply time (immediate) ------------
|
||||
# Mirrors the gpu_node_config null_resource (local-exec kubectl). Runs from the
|
||||
# apply environment, dynamic over GPU-labelled nodes so it follows the card.
|
||||
# `op:add` on an existing key replaces -> idempotent. wait/ordering: this MUST
|
||||
# succeed before any consumer stack declares gpumem, or those pods are
|
||||
# unschedulable (extended resource not advertised by any node).
|
||||
resource "null_resource" "advertise_gpumem" {
|
||||
provisioner "local-exec" {
|
||||
command = <<-EOT
|
||||
set -euo pipefail
|
||||
for node in $(kubectl get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[*].metadata.name}'); do
|
||||
echo "advertising ${var.gpumem_resource}=${var.gpumem_total_mib} on $node"
|
||||
kubectl patch node "$node" --subresource=status --type=json \
|
||||
-p="[{\"op\":\"add\",\"path\":\"${local.gpumem_json_pointer}\",\"value\":\"${var.gpumem_total_mib}\"}]"
|
||||
done
|
||||
EOT
|
||||
}
|
||||
triggers = {
|
||||
gpumem_total = var.gpumem_total_mib
|
||||
resource = var.gpumem_resource
|
||||
command_hash = "advertise-gpumem-v1"
|
||||
}
|
||||
depends_on = [helm_release.nvidia-gpu-operator]
|
||||
}
|
||||
|
||||
# --- 1b. Re-assert the extended resource periodically (drift / node rejoin) ---
|
||||
# A node object that is deleted + re-registered loses manually-advertised
|
||||
# extended resources. Hourly re-assert (low churn; node rejoin is rare).
|
||||
resource "kubernetes_service_account" "gpumem_reconcile" {
|
||||
metadata {
|
||||
name = "gpumem-reconcile"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role" "gpumem_reconcile" {
|
||||
metadata { name = "gpumem-reconcile" }
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["nodes"]
|
||||
verbs = ["get", "list", "patch"]
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["nodes/status"]
|
||||
verbs = ["get", "patch", "update"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role_binding" "gpumem_reconcile" {
|
||||
metadata { name = "gpumem-reconcile" }
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = kubernetes_cluster_role.gpumem_reconcile.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.gpumem_reconcile.metadata[0].name
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cron_job_v1" "gpumem_reconcile" {
|
||||
metadata {
|
||||
name = "gpumem-reconcile"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
labels = { app = "gpumem-reconcile", tier = var.tier }
|
||||
}
|
||||
spec {
|
||||
schedule = "0 * * * *" # hourly re-assert
|
||||
concurrency_policy = "Forbid"
|
||||
successful_jobs_history_limit = 1
|
||||
failed_jobs_history_limit = 1
|
||||
job_template {
|
||||
metadata { labels = { app = "gpumem-reconcile" } }
|
||||
spec {
|
||||
backoff_limit = 1
|
||||
ttl_seconds_after_finished = 300
|
||||
template {
|
||||
metadata { labels = { app = "gpumem-reconcile" } }
|
||||
spec {
|
||||
service_account_name = kubernetes_service_account.gpumem_reconcile.metadata[0].name
|
||||
restart_policy = "Never"
|
||||
container {
|
||||
name = "reconcile"
|
||||
image = "bitnami/kubectl:latest"
|
||||
command = ["/bin/bash", "-c"]
|
||||
args = [<<-EOT
|
||||
set -euo pipefail
|
||||
for node in $(kubectl get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[*].metadata.name}'); do
|
||||
echo "re-asserting ${var.gpumem_resource}=${var.gpumem_total_mib} on $node"
|
||||
kubectl patch node "$node" --subresource=status --type=json \
|
||||
-p="[{\"op\":\"add\",\"path\":\"${local.gpumem_json_pointer}\",\"value\":\"${var.gpumem_total_mib}\"}]"
|
||||
done
|
||||
EOT
|
||||
]
|
||||
resources {
|
||||
requests = { cpu = "10m", memory = "64Mi" }
|
||||
limits = { memory = "64Mi" }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
depends_on = [null_resource.advertise_gpumem]
|
||||
}
|
||||
|
||||
# --- 2. Watchdog: recycle the biggest over-budget tenant under VRAM pressure --
|
||||
resource "kubernetes_config_map" "gpu_vram_watchdog_script" {
|
||||
metadata {
|
||||
name = "gpu-vram-watchdog-script"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
}
|
||||
data = {
|
||||
"watchdog.py" = <<-EOT
|
||||
#!/usr/bin/env python3
"""GPU VRAM watchdog: recycle the biggest OVER-BUDGET tenant under pressure.

Soft runtime enforcement of the per-tenant gpumem budget (ADR-0016). Loops:
  free = PHYSICAL_TOTAL - sum(gpu_pod_memory_used_bytes)
  if free >= FLOOR: nothing (tenants may burst into genuine slack)
  else: among GPU pods that DECLARE viktorbarzin.me/gpumem, find those whose
        actual use exceeds their declared budget, and recycle the biggest
        offender (its arena clears on restart). Contract enforcement, not
        priority; co-tenants often share the gpu-workload PriorityClass.
"""
import json
import os
import ssl
import time
import urllib.parse
import urllib.request

RESOURCE = os.environ["GPUMEM_RESOURCE"]
PHYSICAL_TOTAL_MIB = int(os.environ["GPU_TOTAL_MIB"])
FLOOR_MIB = int(os.environ["FLOOR_MIB"])
INTERVAL = int(os.environ.get("CHECK_INTERVAL_SECONDS", "60"))
DRY_RUN = os.environ.get("DRY_RUN", "true").lower() == "true"
EXPORTER = os.environ.get(
    "EXPORTER_URL", "http://gpu-pod-exporter.nvidia.svc.cluster.local:80/metrics"
)
GPU_NODE_LABEL = "nvidia.com/gpu.present=true"

K8S = "https://kubernetes.default.svc"
TOKEN = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read().strip()
CA = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
_ctx = ssl.create_default_context(cafile=CA)
MIB = 1024 * 1024


def api(method, path):
    req = urllib.request.Request(
        K8S + path,
        method=method,
        headers={"Authorization": "Bearer " + TOKEN, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, context=_ctx, timeout=15) as r:
        return json.loads(r.read().decode()) if method == "GET" else None


def scrape_used_mib():
    """Return {(namespace, pod): used_mib} from the host-PID exporter."""
    try:
        with urllib.request.urlopen(EXPORTER, timeout=10) as r:
            text = r.read().decode()
    except Exception as e:  # noqa: BLE001
        print("WARN: exporter scrape failed: %s" % e, flush=True)
        return None
    used = {}
    for line in text.splitlines():
        if not line.startswith("gpu_pod_memory_used_bytes{"):
            continue
        labels = line[line.index("{") + 1 : line.index("}")]
        try:
            val = float(line.rsplit(" ", 1)[1])
        except ValueError:
            continue
        d = {}
        for kv in labels.split(","):
            if "=" in kv:
                k, v = kv.split("=", 1)
                d[k] = v.strip('"')
        key = (d.get("namespace"), d.get("pod"))
        used[key] = used.get(key, 0.0) + val / MIB
    return used


def gpu_node():
    items = api(
        "GET", "/api/v1/nodes?labelSelector=" + urllib.parse.quote(GPU_NODE_LABEL)
    ).get("items", [])
    return items[0]["metadata"]["name"] if items else None


def declared_budgets(node):
    """Return {(namespace, pod): declared_gpumem_mib} for pods on the GPU node."""
    pods = api("GET", "/api/v1/pods?fieldSelector=spec.nodeName=" + node).get(
        "items", []
    )
    budgets = {}
    for p in pods:
        ns = p["metadata"]["namespace"]
        name = p["metadata"]["name"]
        total = 0
        for c in p["spec"].get("containers", []):
            v = c.get("resources", {}).get("limits", {}).get(RESOURCE)
            if v is not None:
                try:
                    total += int(v)
                except ValueError:
                    pass
        if total:
            budgets[(ns, name)] = total
    return budgets


def tick():
    used = scrape_used_mib()
    if used is None:
        return  # fail-safe: no metrics -> no action
    total_used = sum(used.values())
    free = PHYSICAL_TOTAL_MIB - total_used
    print(
        "VRAM used=%.0fMiB free=%.0fMiB floor=%dMiB total=%dMiB"
        % (total_used, free, FLOOR_MIB, PHYSICAL_TOTAL_MIB),
        flush=True,
    )
    if free >= FLOOR_MIB:
        return
    node = gpu_node()
    if not node:
        print("PRESSURE but no GPU node found -> no action", flush=True)
        return
    budgets = declared_budgets(node)
    offenders = []
    for key, budget in budgets.items():
        u = used.get(key, 0.0)
        if u > budget:
            offenders.append((u - budget, key, u, budget))
    if not offenders:
        print(
            "PRESSURE but no tenant over its declared budget -> alert-only, no recycle",
            flush=True,
        )
        return
    offenders.sort(reverse=True)
    overshoot, (ns, pod), u, budget = offenders[0]
    print(
        "PRESSURE: recycling %s/%s (used=%.0fMiB > budget=%dMiB, overshoot=%.0fMiB)%s"
        % (ns, pod, u, budget, overshoot, " [DRY_RUN]" if DRY_RUN else ""),
        flush=True,
    )
    if DRY_RUN:
        return
    try:
        api("DELETE", "/api/v1/namespaces/%s/pods/%s" % (ns, pod))
        print("recycled %s/%s" % (ns, pod), flush=True)
    except Exception as e:  # noqa: BLE001
        print("ERROR deleting %s/%s: %s" % (ns, pod, e), flush=True)


if __name__ == "__main__":
    print(
        "gpu-vram-watchdog starting (interval=%ss dry_run=%s floor=%dMiB)"
        % (INTERVAL, DRY_RUN, FLOOR_MIB),
        flush=True,
    )
    while True:
        try:
            tick()
        except Exception as e:  # noqa: BLE001
            print("ERROR in tick: %s" % e, flush=True)
        time.sleep(INTERVAL)
|
||||
EOT
|
||||
}
|
||||
}
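
To show what the scrape step produces, here is a standalone sketch of the same parsing logic run on sample exposition text. The sample is hypothetical: the namespaces, pod names, and the `pid` label are invented for illustration, and the real exporter's label set may differ. Only the `namespace` and `pod` labels are relied on by the script.

```python
MIB = 1024 * 1024

# Hypothetical exporter output; two processes belong to the same pod.
SAMPLE = """\
# HELP gpu_pod_memory_used_bytes GPU memory used (sample text)
gpu_pod_memory_used_bytes{namespace="immich",pod="immich-ml-0",pid="101"} 2147483648
gpu_pod_memory_used_bytes{namespace="immich",pod="immich-ml-0",pid="102"} 1073741824
gpu_pod_memory_used_bytes{namespace="llama-cpp",pod="llama-swap-0",pid="7"} 4294967296
"""


def parse_used_mib(text):
    """Sum per-series bytes into {(namespace, pod): used_mib}."""
    used = {}
    for line in text.splitlines():
        if not line.startswith("gpu_pod_memory_used_bytes{"):
            continue  # skips comments and unrelated metrics
        labels = line[line.index("{") + 1 : line.index("}")]
        try:
            val = float(line.rsplit(" ", 1)[1])
        except ValueError:
            continue
        d = {}
        for kv in labels.split(","):
            if "=" in kv:
                k, v = kv.split("=", 1)
                d[k] = v.strip('"')
        key = (d.get("namespace"), d.get("pod"))
        used[key] = used.get(key, 0.0) + val / MIB
    return used


print(parse_used_mib(SAMPLE))
# {('immich', 'immich-ml-0'): 3072.0, ('llama-cpp', 'llama-swap-0'): 4096.0}
```

Series that share a `(namespace, pod)` pair are summed, so a pod with several GPU processes is judged on its total. One limitation of this simple split: a label value containing a comma would be mis-parsed.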
|
||||
|
||||
resource "kubernetes_service_account" "gpu_vram_watchdog" {
|
||||
metadata {
|
||||
name = "gpu-vram-watchdog"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role" "gpu_vram_watchdog" {
|
||||
metadata { name = "gpu-vram-watchdog" }
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods"]
|
||||
verbs = ["get", "list"]
|
||||
}
|
||||
# delete = the recycle. Broad (cluster-wide) but the script only ever targets
|
||||
# a GPU-node pod that is over its declared gpumem budget under VRAM pressure.
|
||||
# Far less privileged than existing cluster-admin tooling (woodpecker-agent).
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods"]
|
||||
verbs = ["delete"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cluster_role_binding" "gpu_vram_watchdog" {
|
||||
metadata { name = "gpu-vram-watchdog" }
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = kubernetes_cluster_role.gpu_vram_watchdog.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.gpu_vram_watchdog.metadata[0].name
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Long-running Deployment with an internal sleep loop (NOT an every-minute
|
||||
# CronJob) to avoid etcd pod-churn — one pod, the gpu-pod-exporter pattern.
|
||||
resource "kubernetes_deployment" "gpu_vram_watchdog" {
|
||||
metadata {
|
||||
name = "gpu-vram-watchdog"
|
||||
namespace = kubernetes_namespace.nvidia.metadata[0].name
|
||||
labels = { app = "gpu-vram-watchdog", tier = var.tier }
|
||||
}
|
||||
spec {
|
||||
replicas = 1
|
||||
selector { match_labels = { app = "gpu-vram-watchdog" } }
|
||||
strategy { type = "Recreate" }
|
||||
template {
|
||||
metadata { labels = { app = "gpu-vram-watchdog" } }
|
||||
spec {
|
||||
service_account_name = kubernetes_service_account.gpu_vram_watchdog.metadata[0].name
|
||||
container {
|
||||
name = "watchdog"
|
||||
image = "python:3.12-alpine"
|
||||
command = ["python3", "/scripts/watchdog.py"]
|
||||
env {
|
||||
name = "GPUMEM_RESOURCE"
|
||||
value = var.gpumem_resource
|
||||
}
|
||||
env {
|
||||
name = "GPU_TOTAL_MIB"
|
||||
value = tostring(var.watchdog_gpu_total_mib)
|
||||
}
|
||||
env {
|
||||
name = "FLOOR_MIB"
|
||||
value = tostring(var.watchdog_floor_mib)
|
||||
}
|
||||
env {
|
||||
name = "DRY_RUN"
|
||||
value = tostring(var.watchdog_dry_run)
|
||||
}
|
||||
volume_mount {
|
||||
name = "script"
|
||||
mount_path = "/scripts"
|
||||
read_only = true
|
||||
}
|
||||
resources {
|
||||
requests = { cpu = "10m", memory = "96Mi" }
|
||||
limits = { memory = "128Mi" }
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "script"
|
||||
config_map {
|
||||
name = kubernetes_config_map.gpu_vram_watchdog_script.metadata[0].name
|
||||
default_mode = "0755"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
depends_on = [kubernetes_cluster_role_binding.gpu_vram_watchdog]
|
||||
}
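
One detail of the `DRY_RUN` wiring is worth spelling out: the script treats only the string `true` (case-insensitive) as dry-run, so any other value enables deletion. A sketch of that parsing in isolation; the function name `parse_dry_run` is hypothetical, and the dict stands in for `os.environ`.

```python
def parse_dry_run(env):
    """Mirror the watchdog's check: only 'true' (any case) means dry-run."""
    return env.get("DRY_RUN", "true").lower() == "true"


print(parse_dry_run({}))                    # True: unset defaults to dry-run
print(parse_dry_run({"DRY_RUN": "true"}))   # True: what tostring(true) yields
print(parse_dry_run({"DRY_RUN": "false"}))  # False: enforcement on
print(parse_dry_run({"DRY_RUN": "1"}))      # False: NOT dry-run, pods get deleted
```

Because the Terraform variable is a `bool` rendered with `tostring`, the value is always `true` or `false` here; the last case only matters if the variable is ever overridden by hand on the Deployment.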
|
||||
|
|
@ -295,6 +295,8 @@ resource "kubernetes_deployment" "portal_stt" {
|
|||
limits = {
|
||||
memory = "4Gi"
|
||||
"nvidia.com/gpu" = "1" # ONE time-slice (operator advertises 100), NOT the whole card
|
||||
# GPU VRAM budget (ADR-0016): large-v3-turbo int8 warm-resident (~1.5 GiB).
|
||||
"viktorbarzin.me/gpumem" = "1500"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||