infra

Viktor Barzin b9ac942647 nvidia: fix driver install deadlock + extend startup probe Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. Deadlock: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. Startup probe timeout: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (50min headroom). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:53:44 +00:00
..
nvidia	nvidia: fix driver install deadlock + extend startup probe	2026-05-25 11:53:44 +00:00

Viktor Barzin b9ac942647 nvidia: fix driver install deadlock + extend startup probe

Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):

1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
   for nvidia-operator-validator to shut down. The validator's
   driver-validation init container was in an infinite poll loop checking
   for /run/nvidia/validations/.driver-ctr-ready (which only appears after
   a successful driver install). The validator pod had deletionTimestamp
   set but its container remained in Terminating state indefinitely.
   Fix: force-delete the stuck Terminating validator pod to break the
   deadlock (kubectl delete --force --grace-period=0).

2. **Startup probe timeout**: Full driver install on this hardware
   (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
   exactly exhausted the default 120×10s=20min startup probe window,
   causing SIGKILL (exit 137) at exactly 21 minutes even when the install
   was succeeding. Extended failureThreshold 120→300 (50min headroom).

Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.

Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 11:53:44 +00:00

nvidia

nvidia: fix driver install deadlock + extend startup probe

2026-05-25 11:53:44 +00:00