infra/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md

213 lines
11 KiB
Markdown
Raw Normal View History

# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
## Summary
`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
LTS (Resolute Raccoon)** at some point, putting the running kernel at
`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
noble repositories (the container's base OS), which don't carry
`linux-headers-7.0.0-15-generic`, so the build aborts with:
Could not resolve Linux kernel version
Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
and `kernelModuleType: open`) succeeded at the chart level but produced
a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
and constructs the image tag `<version>-ubuntu26.04`, which 404s on
pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
to-working state pending an upstream fix or kernel rollback.
## Impact
- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
- `frigate/frigate`
- `immich/immich-machine-learning`
- `llama-cpp/llama-swap`
- `nvidia/nvidia-exporter`
- `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
(Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
firing (node is schedulable, just no GPU advertised)
- No data loss; no user-facing service degradation outside the GPU stack
## Timeline (Europe/Sofia, UTC+3)
- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
the upgrade (rotated by `do-release-upgrade`).
- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
reports `system-os_release.VERSION_ID = 26.04`,
`kernel-version.full = 7.0.0-15-generic`.
- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
on every kubelet restart cycle. Error: "Could not resolve Linux kernel
version".
- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
cleanly but driver pod ImagePullBackOff on
`driver:580.105.08-ubuntu26.04`.
- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
the gotcha, file the kernel rollback as the next step.
## Root Causes
1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
support window. NVIDIA typically lags new Ubuntu LTS releases by
weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF
`helm_release` had `version` commented out, so any apply could
re-resolve to the latest chart and follow its OS-auto-detection
logic. With v25.10.1, the operator fell back to ubuntu24.04 image
suffix (which pulls successfully but fails to compile against kernel
7.0). With v26.3.1, the operator picks the correct (per-NFD)
ubuntu26.04 suffix — which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
fires only when the metrics exporter itself stops scraping, not when
the operator's driver pod is unhealthy.
## What We Changed in This Session
- `stacks/nvidia/modules/nvidia/main.tf` — pinned
`helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
the situation; driver version stays at `570.195.03` as the last-known
config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
this file.
## What We Did NOT Do (Pending User Decision)
- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
to `6.8.0-117-generic`. The 6.8 kernel is still installed at
`/lib/modules/6.8.0-117-generic` and the matching headers at
`/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
the driver image's apt sources (Ubuntu 24.04 noble) carry
`linux-headers-6.8.0-117-generic`. This would require draining the
node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
and rebooting — needs explicit user OK.
- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
GPU node. Should fire within 10 minutes of the operator failing to
publish the resource, regardless of which sub-pod failed.
## Recovery Procedure (next time)
### If the driver-installer fails with "Could not resolve Linux kernel version"
1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print([t for t in d['Tags'] if '<distro>' in t][:5])"
3. If yes, point the chart at the right version + ensure NFD reports
the matching OS.
4. If no (and a kernel rollback is acceptable):
- `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
- `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
- `nsenter -t 1 -m -p -u update-grub`
- `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
- Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
- After boot: `kubectl uncordon <node>` and wait for the GPU
daemonset to come Ready
## Action Items
- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
kernel rollback succeeded and NFD now reports
`kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
(50 min) in TF `values.yaml` — 2026-05-25. On this hardware the
full install sequence (apt headers + gcc compilation + file copy) takes
~21min which exactly exhausted the old 120×10s window.
- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
tags — file as a quarterly checkup once we see the first 26.04
tag, unpin the chart and revert this post-mortem's mitigation
- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
memory of the March 2026 outage says we disabled
`unattended-upgrades``do-release-upgrade` is a separate path
that should be gated too
## Follow-up Incident: Driver install hang (2026-05-25)
**Date**: 2026-05-25
**Status**: Resolved
After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
Installing Linux kernel headers..." with no progress for 1520 min.
**Actual root causes (two compounding issues)**:
1. **Deadlock between k8s-driver-manager and operator-validator**: The
`k8s-driver-manager` init container waits for `nvidia-operator-validator`
to shut down before it can begin the install sequence. The validator's
`driver-validation` init container was in an infinite retry loop polling
`/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
ready). Since the driver never finished, the validator never exited. The
validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
it — the container received SIGTERM but remained in "Terminating" state
indefinitely, blocking the new driver from starting.
**Fix**: Force-deleted the stuck validator pod
(`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
This broke the deadlock immediately.
2. **Startup probe timeout**: The full driver install sequence on this hardware
(6 vCPUs, 16Gi RAM) takes ~21 minutes:
- `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
- `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
nvidia-peermem): ~12 min
- nvidia-installer file copy + archive integrity check: ~7 min
The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
This caused a SIGKILL (exit 137) at 21 minutes even when the install was
progressing normally.
**Fix**: Patched `driver.startupProbe.failureThreshold` from 120 → 300
in `stacks/nvidia/modules/nvidia/values.yaml` (gives 51 min headroom).
**Key observation**: "Installing Linux kernel headers..." is NOT a hang — the
apt install just takes 2+ min and produces no log output during execution. The
log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
the container to confirm apt/dpkg are actively running.
## Lessons
- **Operator-style charts that auto-detect host OS can silently break
when the host fleet leapfrogs upstream image support.** Pin the chart
version + driver version, and treat upstream support gaps as a hard
blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when
upstream image lags.** Make sure the previous kernel and its headers
stay installed (don't aggressively purge old kernels in apt
autoremove).
- **NFD labels are authoritative for the operator's image-tag
construction.** If you need to lie about OS version (e.g., to force a
24.04 image on a 26.04 host), edit the NFD label — but only as a last
resort; the chart upgrade made clear the operator will eventually
reconcile this.
- **A k8s-driver-manager deadlock on a stuck Terminating validator pod is
indistinguishable from an apt hang** — `ps auxf` inside the container is
the key diagnostic. Force-deleting a stuck Terminating pod with no
finalizers is safe and immediately resolves the deadlock.
- **Driver startup probe must be sized for the full install wall-clock time**,
not just apt or just compilation. On slow hardware, 21 min is tight.