nvidia: fix driver install deadlock + extend startup probe

Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. **Deadlock**: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. **Startup probe timeout**: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (50min headroom). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:53:44 +00:00 · 2026-05-25 11:53:44 +00:00 · b9ac942647
commit b9ac942647
parent da33919368
2 changed files with 64 additions and 2 deletions
--- a/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
+++ b/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
@ -130,8 +130,13 @@ to-working state pending an upstream fix or kernel rollback.

 - [x] Pin gpu-operator chart to v25.10.1 in TF
 - [x] Document situation in this post-mortem
- [ ] Roll back k8s-node1 host kernel to 6.8.0-117-generic + apt-mark
-      hold (needs user authorization for node reboot)
+- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
+      kernel rollback succeeded and NFD now reports
+      `kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
+- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
+      (50 min) in TF `values.yaml` — 2026-05-25. On this hardware the
+      full install sequence (apt headers + gcc compilation + file copy) takes
+      ~21min which exactly exhausted the old 120×10s window.
 - [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
      labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
      of 0 for >10m
@ -143,6 +148,47 @@ to-working state pending an upstream fix or kernel rollback.
      `unattended-upgrades` — `do-release-upgrade` is a separate path
      that should be gated too

+## Follow-up Incident: Driver install hang (2026-05-25)
+
+**Date**: 2026-05-25  
+**Status**: Resolved  
+
+After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
+(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
+Installing Linux kernel headers..." with no progress for 15–20 min.
+
+**Actual root causes (two compounding issues)**:
+
+1. **Deadlock between k8s-driver-manager and operator-validator**: The
+   `k8s-driver-manager` init container waits for `nvidia-operator-validator`
+   to shut down before it can begin the install sequence. The validator's
+   `driver-validation` init container was in an infinite retry loop polling
+   `/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
+   ready). Since the driver never finished, the validator never exited. The
+   validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
+   it — the container received SIGTERM but remained in "Terminating" state
+   indefinitely, blocking the new driver from starting.
+   **Fix**: Force-deleted the stuck validator pod
+   (`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
+   This broke the deadlock immediately.
+
+2. **Startup probe timeout**: The full driver install sequence on this hardware
+   (6 vCPUs, 16Gi RAM) takes ~21 minutes:
+   - `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
+   - `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
+     nvidia-peermem): ~12 min  
+   - nvidia-installer file copy + archive integrity check: ~7 min
+   The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
+   This caused a SIGKILL (exit 137) at 21 minutes even when the install was
+   progressing normally.
+   **Fix**: Patched `driver.startupProbe.failureThreshold` from 120 → 300
+   in `stacks/nvidia/modules/nvidia/values.yaml` (gives 51 min headroom).
+
+**Key observation**: "Installing Linux kernel headers..." is NOT a hang — the
+apt install just takes 2+ min and produces no log output during execution. The
+log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
+the container to confirm apt/dpkg are actively running.
+
 ## Lessons

 - **Operator-style charts that auto-detect host OS can silently break
@ -158,3 +204,9 @@ to-working state pending an upstream fix or kernel rollback.
  24.04 image on a 26.04 host), edit the NFD label — but only as a last
  resort; the chart upgrade made clear the operator will eventually
  reconcile this.
+- **A k8s-driver-manager deadlock on a stuck Terminating validator pod is
+  indistinguishable from an apt hang** — `ps auxf` inside the container is
+  the key diagnostic. Force-deleting a stuck Terminating pod with no
+  finalizers is safe and immediately resolves the deadlock.
+- **Driver startup probe must be sized for the full install wall-clock time**,
+  not just apt or just compilation. On slow hardware, 21 min is tight.
--- a/stacks/nvidia/modules/nvidia/values.yaml
+++ b/stacks/nvidia/modules/nvidia/values.yaml
@ -41,6 +41,16 @@ driver:
    limits:
      memory: "2Gi"

+  # 2026-05-25: extended startup probe from 120 to 300 failures.
+  # On k8s-node1 (6 vCPUs, 16Gi RAM, Ubuntu 24.04 + 6.8.0-117-generic),
+  # the full driver install sequence — apt install linux-headers (~2min) +
+  # gcc make -j16 kernel module compilation (~12min) + nvidia-installer
+  # file copy (~7min) = ~21min total, which exactly exhausted the default
+  # 120×10s=20min window (exit 137 = SIGKILL from startup probe).
+  # 300×10s = 50min gives 2.5× headroom on this hardware.
+  startupProbe:
+    failureThreshold: 300
+
  devicePlugin:
    config:
      name: time-slicing-config