infra/docs/adr
Viktor Barzin 74819d4061 feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:57:40 +00:00
..
0001-android-emulator-in-cluster.md android-emulator: new stack — shared in-cluster Android 16 testing instance 2026-06-11 19:51:57 +00:00
0002-all-image-builds-off-infra-gha-ghcr.md docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip] 2026-06-12 19:55:47 +00:00
0003-keep-forgejo-canonical-complete-mirror.md plotting-book: pull image from private ghcr instead of public DockerHub 2026-06-27 15:32:19 +00:00
0004-homelab-unified-cli.md homelab: v0.1 docs, distribution wiring, and version 2026-06-18 19:25:51 +00:00
0005-homelab-v01-scope.md homelab: v0.1 docs, distribution wiring, and version 2026-06-18 19:25:51 +00:00
0006-homelab-work-and-tf.md homelab: v0.1 docs, distribution wiring, and version 2026-06-18 19:25:51 +00:00
0007-homelab-k8s-verbs.md homelab: v0.2.0 — docs + version for the k8s verb-group 2026-06-18 22:30:41 +00:00
0008-homelab-memory-verbs.md homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client 2026-06-19 05:56:25 +00:00
0009-homelab-ci-deploy-verbs.md homelab: v0.4.0 — ci/deploy verbs (watch what you trigger) 2026-06-19 10:59:14 +00:00
0010-homelab-net-obs-verbs.md homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution) 2026-06-19 11:27:31 +00:00
0011-homelab-usage-telemetry.md docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm 2026-06-26 08:22:29 +00:00
0012-homelab-ha-verbs.md homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo 2026-06-21 10:45:32 +00:00
0013-homelab-browser-verbs.md homelab v0.8.0: browser verbs for headful anti-bot web automation 2026-06-22 12:22:22 +00:00
0014-service-identity-and-east-west-observability.md monitoring: consolidate all Slack alerting to #alerts, abandon #security 2026-06-26 13:29:44 +00:00
0015-os-is-the-authorization-boundary.md docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm 2026-06-26 08:22:29 +00:00
0016-gpu-vram-extended-resource-budget.md feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation 2026-06-30 07:57:40 +00:00