# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1

**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream (NVIDIA has not published Ubuntu 26.04 driver images yet)

## Summary

`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04 LTS (Resolute Raccoon)** at some point, putting the running kernel at `7.0.0-15-generic`. The NVIDIA driver daemonset's installer container runs `apt-get install linux-headers-` against Ubuntu 24.04's noble repositories (the container's base OS), which don't carry `linux-headers-7.0.0-15-generic`, so the build aborts with:

    Could not resolve Linux kernel version

The attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08 and `kernelModuleType: open`) succeeded at the chart level but produced a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD and constructs the image tag `-ubuntu26.04`, which 404s on pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).

Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-to-working state pending an upstream fix or kernel rollback.
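
The tag check described in the summary can be sketched as follows. This is a minimal illustration, assuming driver tags end in an OS suffix after the last `-` (as in `driver:580.105.08-ubuntu26.04`); the helper names and the sample tag list are hypothetical, and real input would be the `Tags` array from `skopeo list-tags` output.

```python
from collections import Counter


def os_suffix(tag: str) -> str:
    """Return the part after the last '-' (e.g. 'ubuntu24.04'), or '' if none."""
    _, sep, suffix = tag.rpartition("-")
    return suffix if sep else ""


def count_by_os(tags: list[str]) -> Counter:
    """Count tags per OS suffix, ignoring tags that have no suffix."""
    return Counter(s for s in map(os_suffix, tags) if s)


def tag_exists(tags: list[str], driver_version: str, os_id: str) -> bool:
    """Check whether '<driver_version>-<os_id>' appears in the tag list."""
    return f"{driver_version}-{os_id}" in set(tags)


# Hypothetical sample; in practice use json.load(...)["Tags"] from skopeo.
sample_tags = [
    "570.195.03-ubuntu24.04",
    "580.105.08-ubuntu24.04",
    "580.105.08-ubuntu22.04",
]

counts = count_by_os(sample_tags)
print(counts["ubuntu24.04"], counts["ubuntu26.04"])  # 2 0
print(tag_exists(sample_tags, "580.105.08", "ubuntu26.04"))  # False
```

Running a check like this before a chart upgrade would show whether the tag the operator is about to request can be pulled at all.
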
## Impact

- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
  - `frigate/frigate`
  - `immich/immich-machine-learning`
  - `llama-cpp/llama-swap`
  - `nvidia/nvidia-exporter`
  - `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown` and 5× Uptime Kuma monitors (Frigate, Immich ML, nvidia-exporter, …). `GPUNodeUnschedulable` did not fire (the node is schedulable, it just advertises no GPU).
- No data loss; no user-facing service degradation outside the GPU stack

## Timeline (Europe/Sofia, UTC+3)

- **pre-incident**: `apt-get dist-upgrade` (or `do-release-upgrade`) bumped k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture the upgrade (rotated by `do-release-upgrade`).
- **~2026-05-11**: node rebooted into kernel `7.0.0-15-generic`. NFD reports `system-os_release.VERSION_ID = 26.04`, `kernel-version.full = 7.0.0-15-generic`.
- **2026-05-17 04:00 (approx)**: driver daemonset enters CrashLoopBackOff on every kubelet restart cycle. Error: "Could not resolve Linux kernel version".
- **2026-05-17 13:35**: chart upgrade attempt v25.10.1 → v26.3.1, driver 570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies cleanly but the driver pod goes into ImagePullBackOff on `driver:580.105.08-ubuntu26.04`.
- **2026-05-17 ~13:45**: skopeo confirms zero ubuntu26.04 tags on nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document the gotcha, file the kernel rollback as the next step.

## Root Causes

1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image support window. NVIDIA typically lags new Ubuntu LTS releases by weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF `helm_release` had `version` commented out, so any apply could re-resolve to the latest chart and follow its OS-auto-detection logic. With v25.10.1, the operator fell back to the ubuntu24.04 image suffix (which pulls successfully but fails to compile against kernel 7.0).
   With v26.3.1, the operator picks the correct (per-NFD) ubuntu26.04 suffix, which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"**: the cluster had 14+ hours of silent GPU outage before anyone noticed. `NvidiaExporterDown` fires only when the metrics exporter itself stops scraping, not when the operator's driver pod is unhealthy.

## What We Changed in This Session

- `stacks/nvidia/modules/nvidia/main.tf`: pinned `helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml`: comment block explaining the situation; driver version stays at `570.195.03` as the last-known config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`: this file.

## What We Did NOT Do (Pending User Decision)

- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic` to `6.8.0-117-generic`. The 6.8 kernel is still installed at `/lib/modules/6.8.0-117-generic` and the matching headers at `/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and the driver image's apt sources (Ubuntu 24.04 noble) carry `linux-headers-6.8.0-117-generic`. This would require draining the node, editing GRUB defaults, running `apt-mark hold` to prevent future drift, and rebooting, so it needs an explicit user OK.
- **Add a probe + alert** for the `nvidia.com/gpu` resource count on the GPU node. It should fire within 10 minutes of the operator failing to publish the resource, regardless of which sub-pod failed.

## Recovery Procedure (next time)

### If the driver-installer fails with "Could not resolve Linux kernel version"

1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:

       docker run --rm quay.io/skopeo/stable list-tags \
         docker://nvcr.io/nvidia/driver \
         | python3 -c "import json,sys; d=json.load(sys.stdin); \
           print([t for t in d['Tags'] if '' in t][:5])"

3. If yes, point the chart at the right version + ensure NFD reports the matching OS.
4. If no (and a kernel rollback is acceptable):
   - `kubectl cordon ` then `kubectl drain --ignore-daemonsets --delete-emptydir-data`
   - `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
   - `nsenter -t 1 -m -p -u update-grub`
   - `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
   - Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
   - After boot: `kubectl uncordon ` and wait for the GPU daemonset to come Ready

## Action Items

- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user; kernel rollback succeeded and NFD now reports `kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300 (50 min) in TF `values.yaml` (done 2026-05-25). On this hardware the full install sequence (apt headers + gcc compilation + file copy) takes ~21 min, which exactly exhausted the old 60s + 120×10s window.
- [ ] Add Prometheus alert `GPUNodeNoGPUResource`: fires when a node labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver tags (file as a quarterly checkup). Once we see the first 26.04 tag, unpin the chart and revert this post-mortem's mitigation.
- [ ] Audit ALL host packages with `apt-mark hold` semantics.
  The memory of the March 2026 outage says we disabled `unattended-upgrades`; `do-release-upgrade` is a separate path that should be gated too.

## Follow-up Incident: Driver install hang (2026-05-25)

**Date**: 2026-05-25
**Status**: Resolved

After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod (`nvidia-driver-daemonset-529vg`) was still reported as "stuck at Installing Linux kernel headers..." with no progress for 15–20 min.

**Actual root causes (two compounding issues)**:

1. **Deadlock between k8s-driver-manager and operator-validator**: The `k8s-driver-manager` init container waits for `nvidia-operator-validator` to shut down before it can begin the install sequence. The validator's `driver-validation` init container was in an infinite retry loop polling `/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when ready). Since the driver never finished, the validator never exited. The validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC it: the container received SIGTERM but remained in "Terminating" state indefinitely, blocking the new driver from starting.

   **Fix**: Force-deleted the stuck validator pod (`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`). This broke the deadlock immediately.

2. **Startup probe timeout**: The full driver install sequence on this hardware (6 vCPUs, 16Gi RAM) takes ~21 minutes:
   - `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
   - `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset, nvidia-peermem): ~12 min
   - nvidia-installer file copy + archive integrity check: ~7 min

   The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`. This caused a SIGKILL (exit 137) at 21 minutes even when the install was progressing normally.

   **Fix**: Patched `driver.startupProbe.failureThreshold` from 120 → 300 in `stacks/nvidia/modules/nvidia/values.yaml` (a 51 min window, roughly 30 min of headroom over the ~21 min install).
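
The probe arithmetic above can be checked with a short sketch. It assumes the budget is `initialDelaySeconds + failureThreshold × periodSeconds`, which matches the `60 + (120 × 10)` figure quoted in this section; the function name is hypothetical and the step durations are the approximate ones listed above.

```python
def probe_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Time a container gets before the startup probe gives up and it is killed."""
    return initial_delay + period * failure_threshold


# Approximate install step durations from this incident, in minutes.
install_steps_min = {"apt headers": 2, "module build": 12, "installer copy + check": 7}
install_seconds = sum(install_steps_min.values()) * 60

old_budget = probe_budget_seconds(60, 10, 120)
new_budget = probe_budget_seconds(60, 10, 300)

for name, budget in (("old", old_budget), ("new", new_budget)):
    headroom_min = (budget - install_seconds) / 60
    print(f"{name}: budget {budget}s ({budget / 60:.0f} min), headroom {headroom_min:.0f} min")
# old: budget 1260s (21 min), headroom 0 min
# new: budget 3060s (51 min), headroom 30 min
```

The old threshold left no slack at all, so any slowdown in apt or the module build was enough to trigger the kill.
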

**Key observation**: "Installing Linux kernel headers..." is NOT a hang. The apt install just takes 2+ min and produces no log output during execution. The log line appears before apt runs, so it looks frozen. Check `ps auxf` inside the container to confirm apt/dpkg are actively running.

## Lessons

- **Operator-style charts that auto-detect host OS can silently break when the host fleet leapfrogs upstream image support.** Pin the chart version + driver version, and treat upstream support gaps as a hard blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when upstream image lags.** Make sure the previous kernel and its headers stay installed (don't aggressively purge old kernels in apt autoremove).
- **NFD labels are authoritative for the operator's image-tag construction.** If you need to lie about OS version (e.g., to force a 24.04 image on a 26.04 host), edit the NFD label, but only as a last resort: the chart upgrade made clear the operator will eventually reconcile this.
- **A k8s-driver-manager deadlock on a stuck Terminating validator pod is indistinguishable from an apt hang in the pod logs.** `ps auxf` inside the container is the key diagnostic. Force-deleting a stuck Terminating pod with no finalizers is safe and immediately resolves the deadlock.
- **Driver startup probe must be sized for the full install wall-clock time**, not just apt or just compilation. On slow hardware, 21 min is tight.
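
The lesson about stuck Terminating pods can be turned into a pre-check before reaching for `--force --grace-period=0`. The sketch below is a rough illustration, assuming a pod object shaped like `kubectl get pod -o json` output (`metadata.deletionTimestamp`, `metadata.finalizers`); the function names, the 5-minute threshold, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def overdue_for_deletion(pod: dict, now: datetime, grace: timedelta) -> bool:
    """True if deletionTimestamp is set and passed more than `grace` ago."""
    stamp = pod.get("metadata", {}).get("deletionTimestamp")
    if not stamp:
        return False  # pod is not being deleted at all
    # kubectl prints timestamps as RFC 3339 in UTC, e.g. 2026-05-25T10:00:00Z
    deadline = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return now - deadline > grace


def force_delete_candidate(pod: dict, now: datetime,
                           grace: timedelta = timedelta(minutes=5)) -> bool:
    """Overdue for deletion and carrying no finalizers."""
    has_finalizers = bool(pod.get("metadata", {}).get("finalizers"))
    return overdue_for_deletion(pod, now, grace) and not has_finalizers


# Hypothetical pod object; the timestamps are made up for illustration.
stuck_pod = {
    "metadata": {
        "name": "nvidia-operator-validator-sff98",
        "namespace": "nvidia",
        "deletionTimestamp": "2026-05-25T10:00:00Z",
    }
}
now = datetime(2026, 5, 25, 10, 30, tzinfo=timezone.utc)
print(force_delete_candidate(stuck_pod, now))  # True
```

A pod that still carries finalizers is deliberately excluded, since something else is expected to clean it up first.
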