# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1

**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream: NVIDIA has not published Ubuntu 26.04 driver images yet

## Summary

`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
LTS (Resolute Raccoon)** at some point, putting the running kernel at
`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
noble repositories (the container's base OS), which don't carry
`linux-headers-7.0.0-15-generic`, so the build aborts with:

    Could not resolve Linux kernel version

An attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
and `kernelModuleType: open`) succeeded at the chart level but produced
a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
and constructs the image tag `<version>-ubuntu26.04`, which 404s on
pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).

We rolled the chart back to v25.10.1 (pinned in TF) to restore the
closest-to-working state pending an upstream fix or kernel rollback.

## Impact

- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
  - `frigate/frigate`
  - `immich/immich-machine-learning`
  - `llama-cpp/llama-swap`
  - `nvidia/nvidia-exporter`
  - `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
  (Frigate, Immich ML, nvidia-exporter, …). `GPUNodeUnschedulable` did
  not fire (node is schedulable, just no GPU advertised)
- No data loss; no user-facing service degradation outside the GPU stack

## Timeline (Europe/Sofia, UTC+3)

- pre-incident: `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
  k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
  the upgrade (rotated by `do-release-upgrade`).
- ~2026-05-11: node rebooted into kernel `7.0.0-15-generic`. NFD
  reports `system-os_release.VERSION_ID = 26.04`,
  `kernel-version.full = 7.0.0-15-generic`.
- 2026-05-17 04:00 (approx): driver daemonset enters CrashLoopBackOff
  on every kubelet restart cycle. Error: "Could not resolve Linux kernel
  version".
- 2026-05-17 13:35: chart upgrade attempt v25.10.1 → v26.3.1, driver
  570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
  cleanly but the driver pod goes into ImagePullBackOff on
  `driver:580.105.08-ubuntu26.04`.
- 2026-05-17 ~13:45: skopeo confirms zero ubuntu26.04 tags on
  nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
  the gotcha, file the kernel rollback as the next step.

## Root Causes

1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
   support window. NVIDIA typically lags new Ubuntu LTS releases by
   weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF
   `helm_release` had `version` commented out, so any apply could
   re-resolve to the latest chart and follow its OS-auto-detection
   logic. With v25.10.1, the operator fell back to the ubuntu24.04 image
   suffix (which pulls successfully but fails to compile against kernel
   7.0). With v26.3.1, the operator picks the correct (per-NFD)
   ubuntu26.04 suffix, which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"**: the cluster
   had 14+ hours of silent GPU outage before anyone noticed.
   `NvidiaExporterDown` fires only when the metrics exporter itself
   stops responding to scrapes, not when the operator's driver pod is
   unhealthy.

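
Root cause 2 comes down to how the driver image tag is derived from the node's OS labels. The sketch below illustrates that derivation under stated assumptions; it is not the operator's actual code. The full NFD label keys are assumed (the timeline only quotes the short `system-os_release.VERSION_ID` form), and `driver_image_tag`, `is_pullable`, and the `published_tags` set are hypothetical names used for illustration.

```python
# Assumed full NFD label keys; the post-mortem only quotes the short form.
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def driver_image_tag(driver_version: str, node_labels: dict) -> str:
    """Build '<driver_version>-<os_id><os_version>' from NFD-style labels."""
    os_id = node_labels[OS_ID_LABEL]
    os_version = node_labels[OS_VERSION_LABEL]
    return f"{driver_version}-{os_id}{os_version}"


def is_pullable(tag: str, published_tags: set) -> bool:
    """A tag can only be pulled if the registry actually publishes it."""
    return tag in published_tags


# Labels as NFD reported them on k8s-node1 after the 26.04 upgrade.
labels_2604 = {OS_ID_LABEL: "ubuntu", OS_VERSION_LABEL: "26.04"}
tag = driver_image_tag("580.105.08", labels_2604)

# Illustrative stand-in for the registry's tag list (no ubuntu26.04 entries).
published_tags = {"570.195.03-ubuntu24.04"}
print(tag, is_pullable(tag, published_tags))
```

Under these assumptions the derived tag is `580.105.08-ubuntu26.04`, which matches the image the driver pod failed to pull in the timeline.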
## What We Changed in This Session

- `stacks/nvidia/modules/nvidia/main.tf`: pinned
  `helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
  applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml`: comment block explaining
  the situation; driver version stays at `570.195.03` as the last-known
  config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`:
  this file.

## What We Did NOT Do (Pending User Decision)

- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
  to `6.8.0-117-generic`. The 6.8 kernel is still installed at
  `/lib/modules/6.8.0-117-generic` and the matching headers at
  `/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
  the driver image's apt sources (Ubuntu 24.04 noble) carry
  `linux-headers-6.8.0-117-generic`. This would require draining the
  node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
  and rebooting, which needs explicit user OK.
- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
  GPU node. Should fire within 10 minutes of the operator failing to
  publish the resource, regardless of which sub-pod failed.

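
Until the alert exists, the condition it would watch for can be checked by hand. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get nodes -o json`; the function name `gpu_nodes_without_gpus` is hypothetical, and the "for 10 minutes" part of the proposed alert is deliberately left out.

```python
def gpu_nodes_without_gpus(node_list: dict) -> list:
    """Names of nodes labelled as GPU nodes whose nvidia.com/gpu capacity is 0 or absent."""
    broken = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        labels = meta.get("labels", {})
        # Only consider nodes the operator has labelled as carrying a GPU.
        if labels.get("nvidia.com/gpu.present") != "true":
            continue
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings in the Kubernetes API ("1", "0", ...).
        if int(capacity.get("nvidia.com/gpu", "0")) == 0:
            broken.append(meta.get("name", "<unknown>"))
    return broken


# Illustrative input: k8s-node1 labelled as a GPU node but advertising no GPU;
# "k8s-node2" is a made-up non-GPU node.
sample = {
    "items": [
        {
            "metadata": {"name": "k8s-node1", "labels": {"nvidia.com/gpu.present": "true"}},
            "status": {"capacity": {"cpu": "6"}},
        },
        {
            "metadata": {"name": "k8s-node2", "labels": {}},
            "status": {"capacity": {"cpu": "4"}},
        },
    ]
}
print(gpu_nodes_without_gpus(sample))
```

In practice the dictionary would come from something like `json.loads(subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"]))`, and a non-empty result is the state this incident sat in unnoticed.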
## Recovery Procedure (next time)

### If the driver-installer fails with "Could not resolve Linux kernel version"

1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:

       docker run --rm quay.io/skopeo/stable list-tags \
         docker://nvcr.io/nvidia/driver \
         | python3 -c "import json,sys; d=json.load(sys.stdin); \
         print([t for t in d['Tags'] if '<distro>' in t][:5])"

3. If yes, point the chart at the right version + ensure NFD reports
   the matching OS.
4. If no (and a kernel rollback is acceptable):
   - `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
   - `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
   - `nsenter -t 1 -m -p -u update-grub`
   - `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
   - Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
   - After boot: `kubectl uncordon <node>` and wait for the GPU
     daemonset to become Ready

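
The one-liner in step 2 only prints the first few matching tags. A slightly fuller sketch of the same check is shown here, assuming the `{"Tags": [...]}` JSON shape that the one-liner already relies on. The function name `count_tags_by_distro` and the sample tags are illustrative, not taken from the registry.

```python
import json
import re
from collections import Counter

# Matches a trailing distro suffix such as "-ubuntu24.04" or "-rhel9".
DISTRO_SUFFIX = re.compile(r"-([a-z]+\d+(?:\.\d+)?)$")


def count_tags_by_distro(list_tags_json: str) -> Counter:
    """Count image tags per trailing distro suffix in `skopeo list-tags` output."""
    tags = json.loads(list_tags_json)["Tags"]
    counts = Counter()
    for tag in tags:
        match = DISTRO_SUFFIX.search(tag)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the same shape as the skopeo output.
sample_json = json.dumps({
    "Repository": "nvcr.io/nvidia/driver",
    "Tags": [
        "570.195.03-ubuntu24.04",
        "580.105.08-ubuntu24.04",
        "570.195.03-ubuntu22.04",
        "latest",
    ],
})
counts = count_tags_by_distro(sample_json)
print(counts["ubuntu24.04"], counts["ubuntu22.04"], counts["ubuntu26.04"])
```

Because `Counter` returns 0 for missing keys, a distro with no published images (as ubuntu26.04 was during this incident) shows up as a plain zero rather than an error.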
## Action Items

- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
  kernel rollback succeeded and NFD now reports
  `kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
  in TF `values.yaml` (2026-05-25). On this hardware the full install
  sequence (apt headers + gcc compilation + file copy) takes ~21 min,
  which exhausted the old window of 60 s + 120 × 10 s = 21 min. The new
  threshold allows 60 s + 300 × 10 s = 51 min.
- [ ] Add Prometheus alert `GPUNodeNoGPUResource`: fires when a node
  labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
  of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
  tags (file as a quarterly checkup). Once we see the first 26.04
  tag, unpin the chart and revert this post-mortem's mitigation
- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
  memory of the March 2026 outage says we disabled
  `unattended-upgrades`, but `do-release-upgrade` is a separate path
  that should be gated too

## Follow-up Incident: Driver install hang (2026-05-25)

**Date**: 2026-05-25
**Status**: Resolved

After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
Installing Linux kernel headers..." with no progress for 15–20 min.

**Actual root causes (two compounding issues)**:

1. **Deadlock between k8s-driver-manager and operator-validator**: The
   `k8s-driver-manager` init container waits for `nvidia-operator-validator`
   to shut down before it can begin the install sequence. The validator's
   `driver-validation` init container was in an infinite retry loop polling
   `/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
   ready). Since the driver never finished, the validator never exited. The
   validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
   it: the container received SIGTERM but remained in "Terminating" state
   indefinitely, blocking the new driver from starting.

   **Fix**: Force-deleted the stuck validator pod
   (`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
   This broke the deadlock immediately.

2. **Startup probe timeout**: The full driver install sequence on this hardware
   (6 vCPUs, 16Gi RAM) takes ~21 minutes:
   - `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
   - `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
     nvidia-peermem): ~12 min
   - nvidia-installer file copy + archive integrity check: ~7 min

   The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
   This caused a SIGKILL (exit 137) at 21 minutes even when the install was
   progressing normally.

   **Fix**: Raised `driver.startupProbe.failureThreshold` from 120 to 300
   in `stacks/nvidia/modules/nvidia/values.yaml`. The window becomes
   `60 + (300 × 10) = 3060s = 51min`, about 30 min of headroom over the
   ~21 min install.

**Key observation**: "Installing Linux kernel headers..." is NOT a hang: the
apt install just takes 2+ min and produces no log output during execution. The
log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
the container to confirm apt/dpkg are actively running.

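
The probe arithmetic from root cause 2 can be checked with a few lines. This is a rough model that treats the window as initial delay plus period times failure threshold, using the 60 s delay and 10 s period quoted above; real kubelet timing can drift by a few seconds, and the function name `startup_probe_window_s` is hypothetical.

```python
def startup_probe_window_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate seconds a container gets to start before the kubelet kills it."""
    return initial_delay_s + period_s * failure_threshold


# Install phases in minutes, as measured during this incident.
INSTALL_PHASES_MIN = {
    "apt headers": 2,
    "kernel module build": 12,
    "installer file copy": 7,
}
install_s = sum(INSTALL_PHASES_MIN.values()) * 60

old_window_s = startup_probe_window_s(60, 10, 120)
new_window_s = startup_probe_window_s(60, 10, 300)

print(f"install time:  {install_s}s")
print(f"old window:    {old_window_s}s, headroom {old_window_s - install_s}s")
print(f"new window:    {new_window_s}s, headroom {new_window_s - install_s}s")
```

With the old threshold the headroom is zero, so any jitter in the install tips the pod into a SIGKILL. With 300 the headroom is 1800 s, i.e. 30 minutes.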
## Lessons

- **Operator-style charts that auto-detect host OS can silently break
  when the host fleet leapfrogs upstream image support.** Pin the chart
  version + driver version, and treat upstream support gaps as a hard
  blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when
  upstream image lags.** Make sure the previous kernel and its headers
  stay installed (don't aggressively purge old kernels in apt
  autoremove).
- **NFD labels are authoritative for the operator's image-tag
  construction.** If you need to lie about OS version (e.g., to force a
  24.04 image on a 26.04 host), edit the NFD label, but only as a last
  resort; the chart upgrade made clear the operator will eventually
  reconcile this.
- **A k8s-driver-manager deadlock on a stuck Terminating validator pod
  looks identical to an apt hang in the pod logs.** `ps auxf` inside the
  container is the key diagnostic. Force-deleting a stuck Terminating
  pod with no finalizers is safe and immediately resolves the deadlock.
- **Driver startup probe must be sized for the full install wall-clock time**,
  not just apt or just compilation. On slow hardware, 21 min is tight.

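
The stuck-validator lesson suggests a quick check before reaching for `--force`: list pods that are past their `deletionTimestamp` and carry no finalizers. The sketch below assumes the input is the parsed output of `kubectl get pods -n nvidia -o json` and that timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form. The function name `stuck_terminating_pods`, the 5-minute `slack` default, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stuck_terminating_pods(pod_list: dict, now: datetime,
                           slack: timedelta = timedelta(minutes=5)) -> list:
    """Names of finalizer-free pods still present more than `slack` after deletionTimestamp."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        deletion_ts = meta.get("deletionTimestamp")
        # Skip pods that are not being deleted, or that a finalizer is still holding.
        if deletion_ts is None or meta.get("finalizers"):
            continue
        deadline = datetime.strptime(deletion_ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - deadline > slack:
            stuck.append(meta.get("name", "<unknown>"))
    return stuck


# Illustrative input; only the pod names come from the incident.
sample_pods = {
    "items": [
        {"metadata": {"name": "nvidia-operator-validator-sff98",
                      "deletionTimestamp": "2026-05-25T10:00:00Z"}},
        {"metadata": {"name": "nvidia-driver-daemonset-529vg"}},
    ]
}
now = datetime(2026, 5, 25, 10, 30, tzinfo=timezone.utc)
print(stuck_terminating_pods(sample_pods, now))
```

A pod that appears in this list is a candidate for the `kubectl delete --force --grace-period=0` used above; a pod that still has finalizers is excluded on purpose, since the lesson only covers the no-finalizer case.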