Viktor Barzin b9ac942647 nvidia: fix driver install deadlock + extend startup probe

Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):

1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
   for nvidia-operator-validator to shut down. The validator's
   driver-validation init container was in an infinite poll loop checking
   for /run/nvidia/validations/.driver-ctr-ready (which only appears after
   a successful driver install). The validator pod had deletionTimestamp
   set but its container remained in Terminating state indefinitely.
   Fix: force-delete the stuck Terminating validator pod to break the
   deadlock (kubectl delete --force --grace-period=0).

2. **Startup probe timeout**: Full driver install on this hardware
   (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
   exactly exhausted the default 120×10s=20min startup probe window,
   causing SIGKILL (exit 137) at exactly 21 minutes even when the install
   was succeeding. Extended failureThreshold 120→300 (50min headroom).

Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.

Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 11:53:44 +00:00

11 KiB

Raw Blame History

Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1

Date: 2026-05-17 Author: Viktor Barzin / Claude (incident response) Severity: SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services) Beads: code-8vr0 (P1) Status: Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet

Summary

nvidia-driver-daemonset-sg22g on k8s-node1 went into CrashLoopBackOff with 76+ restarts. Root cause: k8s-node1 was upgraded to Ubuntu 26.04 LTS (Resolute Raccoon) at some point, putting the running kernel at 7.0.0-15-generic. The NVIDIA driver daemonset's installer container runs apt-get install linux-headers-<kernel> against Ubuntu 24.04's noble repositories (the container's base OS), which don't carry linux-headers-7.0.0-15-generic, so the build aborts with:

Could not resolve Linux kernel version

Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08 and kernelModuleType: open) succeeded at the chart level but produced a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD and constructs the image tag <version>-ubuntu26.04, which 404s on pull. skopeo list-tags docker://nvcr.io/nvidia/driver confirms zero ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).

Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest- to-working state pending an upstream fix or kernel rollback.

Impact

GPU resource nvidia.com/gpu = 0 on k8s-node1 (only GPU node)
All GPU-bound workloads Pending or 0/N Ready:
- frigate/frigate
- immich/immich-machine-learning
- llama-cpp/llama-swap
- nvidia/nvidia-exporter
- ytdlp/yt-highlights
Downstream alerts firing: NvidiaExporterDown, 5× Uptime Kuma monitors (Frigate, Immich ML, nvidia-exporter, …), GPUNodeUnschedulable not firing (node is schedulable, just no GPU advertised)
No data loss; no user-facing service degradation outside the GPU stack

Timeline (Europe/Sofia, UTC+3)

pre-incident — apt-get dist-upgrade (or do-release-upgrade) bumped k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture the upgrade (rotated by do-release-upgrade).
~2026-05-11 — node rebooted into kernel 7.0.0-15-generic. NFD reports system-os_release.VERSION_ID = 26.04, kernel-version.full = 7.0.0-15-generic.
2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff on every kubelet restart cycle. Error: "Could not resolve Linux kernel version".
2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver 570.195.03 → 580.105.08, kernelModuleType: open. Helm applies cleanly but driver pod ImagePullBackOff on driver:580.105.08-ubuntu26.04.
2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document the gotcha, file the kernel rollback as the next step.

Root Causes

Host OS upgraded to Ubuntu 26.04 ahead of NVIDIA's driver image support window. NVIDIA typically lags new Ubuntu LTS releases by weeks-to-months on the driver-container front.
gpu-operator chart was not pinned prior to today. The TF helm_release had version commented out, so any apply could re-resolve to the latest chart and follow its OS-auto-detection logic. With v25.10.1, the operator fell back to ubuntu24.04 image suffix (which pulls successfully but fails to compile against kernel 7.0). With v26.3.1, the operator picks the correct (per-NFD) ubuntu26.04 suffix — which doesn't exist.
No alert for "GPU device count = 0 on a GPU node" — the cluster had 14+ hours of silent GPU outage before noticing. NvidiaExporterDown fires only when the metrics exporter itself stops scraping, not when the operator's driver pod is unhealthy.

What We Changed in This Session

stacks/nvidia/modules/nvidia/main.tf — pinned helm_release.nvidia-gpu-operator.version = "v25.10.1" so future applies don't surprise us with v26.3.1's stricter OS detection.
stacks/nvidia/modules/nvidia/values.yaml — comment block explaining the situation; driver version stays at 570.195.03 as the last-known config that produced a pullable image.
docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md — this file.

What We Did NOT Do (Pending User Decision)

Roll back the host kernel on k8s-node1 from 7.0.0-15-generic to 6.8.0-117-generic. The 6.8 kernel is still installed at /lib/modules/6.8.0-117-generic and the matching headers at /usr/src/linux-headers-6.8.0-117-generic, so GRUB can boot it and the driver image's apt sources (Ubuntu 24.04 noble) carry linux-headers-6.8.0-117-generic. This would require draining the node, editing GRUB defaults, apt-mark hold to prevent future drift, and rebooting — needs explicit user OK.
Add a probe + alert for nvidia.com/gpu resource count on the GPU node. Should fire within 10 minutes of the operator failing to publish the resource, regardless of which sub-pod failed.

Recovery Procedure (next time)

If the driver-installer fails with "Could not resolve Linux kernel version"

Identify the running kernel: uname -r on the affected node.

Check whether NVIDIA ships an image for that kernel/distro combo:

docker run --rm quay.io/skopeo/stable list-tags \
    docker://nvcr.io/nvidia/driver \
  | python3 -c "import json,sys; d=json.load(sys.stdin); \
      print([t for t in d['Tags'] if '<distro>' in t][:5])"

If yes, point the chart at the right version + ensure NFD reports the matching OS.
If no (and a kernel rollback is acceptable):
- kubectl cordon <node> then kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub
- nsenter -t 1 -m -p -u update-grub
- nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic
- Reboot: nsenter -t 1 -m -p -u systemctl reboot
- After boot: kubectl uncordon <node> and wait for the GPU daemonset to come Ready

Action Items

Pin gpu-operator chart to v25.10.1 in TF
Document situation in this post-mortem
Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user; kernel rollback succeeded and NFD now reports kernel-version.full=6.8.0-117-generic, os_release.VERSION_ID=24.04)
Extend driver daemonset startup probe failureThreshold from 120 to 300 (50 min) in TF values.yaml — 2026-05-25. On this hardware the full install sequence (apt headers + gcc compilation + file copy) takes ~21min which exactly exhausted the old 120×10s window.
Add Prometheus alert GPUNodeNoGPUResource — fires when a node labeled nvidia.com/gpu.present=true has nvidia.com/gpu capacity of 0 for >10m
Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver tags — file as a quarterly checkup once we see the first 26.04 tag, unpin the chart and revert this post-mortem's mitigation
Audit ALL host packages with apt-mark hold semantics. The memory of the March 2026 outage says we disabled unattended-upgrades — do-release-upgrade is a separate path that should be gated too

Follow-up Incident: Driver install hang (2026-05-25)

Date: 2026-05-25
Status: Resolved

After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod (nvidia-driver-daemonset-529vg) was still reported as "stuck at Installing Linux kernel headers..." with no progress for 15–20 min.

Actual root causes (two compounding issues):

Deadlock between k8s-driver-manager and operator-validator: The k8s-driver-manager init container waits for nvidia-operator-validator to shut down before it can begin the install sequence. The validator's driver-validation init container was in an infinite retry loop polling /run/nvidia/validations/.driver-ctr-ready (which the driver creates when ready). Since the driver never finished, the validator never exited. The validator pod had deletionTimestamp set but kubelet on node1 couldn't GC it — the container received SIGTERM but remained in "Terminating" state indefinitely, blocking the new driver from starting. Fix: Force-deleted the stuck validator pod (kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0). This broke the deadlock immediately.
Startup probe timeout: The full driver install sequence on this hardware (6 vCPUs, 16Gi RAM) takes ~21 minutes:
- apt-get install linux-headers-6.8.0-117-generic: ~2 min
- gcc/make -j16 kernel module build (nvidia, nvidia-uvm, nvidia-modeset, nvidia-peermem): ~12 min
- nvidia-installer file copy + archive integrity check: ~7 min The default startup probe allows exactly 60 + (120 × 10) = 1260s = 21min. This caused a SIGKILL (exit 137) at 21 minutes even when the install was progressing normally. Fix: Patched driver.startupProbe.failureThreshold from 120 → 300 in stacks/nvidia/modules/nvidia/values.yaml (gives 51 min headroom).

Key observation: "Installing Linux kernel headers..." is NOT a hang — the apt install just takes 2+ min and produces no log output during execution. The log line appears before apt runs, so it looks frozen. Check ps auxf inside the container to confirm apt/dpkg are actively running.

Lessons

Operator-style charts that auto-detect host OS can silently break when the host fleet leapfrogs upstream image support. Pin the chart version + driver version, and treat upstream support gaps as a hard blocker rather than a guaranteed-to-resolve race condition.
Drain-and-revert host kernel is the right escape hatch when upstream image lags. Make sure the previous kernel and its headers stay installed (don't aggressively purge old kernels in apt autoremove).
NFD labels are authoritative for the operator's image-tag construction. If you need to lie about OS version (e.g., to force a 24.04 image on a 26.04 host), edit the NFD label — but only as a last resort; the chart upgrade made clear the operator will eventually reconcile this.
A k8s-driver-manager deadlock on a stuck Terminating validator pod is indistinguishable from an apt hang — ps auxf inside the container is the key diagnostic. Force-deleting a stuck Terminating pod with no finalizers is safe and immediately resolves the deadlock.
Driver startup probe must be sized for the full install wall-clock time, not just apt or just compilation. On slow hardware, 21 min is tight.

11 KiB Raw Blame History Unescape Escape