nvidia: fix driver install deadlock + extend startup probe
Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. **Deadlock**: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. **Startup probe timeout**: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (50min headroom). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
da33919368
commit
b9ac942647
2 changed files with 64 additions and 2 deletions
|
|
@ -41,6 +41,16 @@ driver:
|
|||
limits:
|
||||
memory: "2Gi"
|
||||
|
||||
# 2026-05-25: extended startup probe from 120 to 300 failures.
|
||||
# On k8s-node1 (6 vCPUs, 16Gi RAM, Ubuntu 24.04 + 6.8.0-117-generic),
|
||||
# the full driver install sequence — apt install linux-headers (~2min) +
|
||||
# gcc make -j16 kernel module compilation (~12min) + nvidia-installer
|
||||
# file copy (~7min) = ~21min total, which exactly exhausted the default
|
||||
# 120×10s=20min window (exit 137 = SIGKILL from startup probe).
|
||||
# 300×10s = 50min gives 2.5× headroom on this hardware.
|
||||
startupProbe:
|
||||
failureThreshold: 300
|
||||
|
||||
devicePlugin:
|
||||
config:
|
||||
name: time-slicing-config
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue