nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images

k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(`skopeo list-tags docker://nvcr.io/nvidia/driver` returned 0
ubuntu26.04 tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).

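The tag counts quoted above can be reproduced from the JSON that `skopeo list-tags` prints, which has a `Tags` array. The following is a minimal sketch, not the script used for this commit: the helper name `count_tags_by_os` and the sample tag list are made up for illustration.

```python
import json


def count_tags_by_os(list_tags_json: str, os_suffixes: list[str]) -> dict[str, int]:
    """Count tags ending in '-<suffix>' in `skopeo list-tags` JSON output."""
    tags = json.loads(list_tags_json).get("Tags") or []
    return {
        suffix: sum(1 for tag in tags if tag.endswith("-" + suffix))
        for suffix in os_suffixes
    }


# Hypothetical sample input; the real registry returns far more tags.
sample = json.dumps({
    "Repository": "nvcr.io/nvidia/driver",
    "Tags": [
        "570.195.03-ubuntu22.04",
        "570.195.03-ubuntu24.04",
        "580.105.08-ubuntu24.04",
        "580.105.08-rhel9.4",
    ],
})

print(count_tags_by_os(sample, ["ubuntu22.04", "ubuntu24.04", "ubuntu26.04"]))
```

In real use the input would be the captured stdout of `skopeo list-tags docker://nvcr.io/nvidia/driver` instead of `sample`.
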
Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly, but the v26.3.1 operator auto-detects the host OS via
NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404
on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
a future `terraform apply` does not hit the same trap again.

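To illustrate why the pull 404s, here is a simplified model of how a `<version>-ubuntu26.04` tag falls out of the node's NFD labels. It is a sketch of the behavior described above, not the operator's actual code: `driver_image_ref` is a hypothetical helper, and it assumes the OS is read from NFD's `system-os_release.ID` and `system-os_release.VERSION_ID` labels.

```python
# NFD os_release labels (assumed here to be what the operator reads).
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def driver_image_ref(repo: str, driver_version: str, node_labels: dict[str, str]) -> str:
    """Build '<repo>:<driver_version>-<os_id><os_version>' from node labels."""
    os_id = node_labels[OS_ID_LABEL]
    os_version = node_labels[OS_VERSION_LABEL]
    return f"{repo}:{driver_version}-{os_id}{os_version}"


# Labels as they would look on k8s-node1 after the Ubuntu 26.04 upgrade.
labels = {OS_ID_LABEL: "ubuntu", OS_VERSION_LABEL: "26.04"}
print(driver_image_ref("nvcr.io/nvidia/driver", "580.105.08", labels))
# prints "nvcr.io/nvidia/driver:580.105.08-ubuntu26.04"
```

Because the suffix is derived from the node rather than from the chart values, pinning only the driver version does not change which OS variant is requested.
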
Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both the v25.10.1 and v26.3.1 operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) plus `apt-mark hold` to
prevent re-upgrade. Option (b) needs explicit user authorization for a
node reboot, so it is left as the next action item on code-8vr0.

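Before asking for the reboot in option (b), it may help to confirm that the target kernel is really still bootable. The sketch below is a hypothetical pre-flight check, not part of this commit: `rollback_status` is an illustrative name, and it assumes Ubuntu's usual `/boot/vmlinuz-<release>` naming for installed kernel images.

```python
import os
import platform


def rollback_status(running: str, target: str, boot_entries: list[str]) -> str:
    """Classify the node relative to the kernel we want to roll back to."""
    if running == target:
        return "already-on-target"
    # Assumes Ubuntu's /boot/vmlinuz-<release> naming for kernel images.
    if f"vmlinuz-{target}" not in boot_entries:
        return "target-not-installed"
    return "reboot-required"


if __name__ == "__main__":
    entries = os.listdir("/boot") if os.path.isdir("/boot") else []
    print(rollback_status(platform.release(), "6.8.0-117-generic", entries))
```

The check only reads state; selecting the boot entry, running `apt-mark hold`, and rebooting remain manual steps that need the authorization mentioned above.
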
Files:
- stacks/nvidia/modules/nvidia/main.tf: explicit version pin,
  explanatory comment
- stacks/nvidia/modules/nvidia/values.yaml: comment block
  documenting the situation; driver pinned at 570.195.03
- docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md:
  full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

parent 38602f7974
commit 128cfbbc30
3 changed files with 182 additions and 2 deletions

--- a/stacks/nvidia/modules/nvidia/main.tf
+++ b/stacks/nvidia/modules/nvidia/main.tf
@@ -117,7 +117,13 @@ resource "helm_release" "nvidia-gpu-operator" {
   repository = "https://helm.ngc.nvidia.com/nvidia"
   chart = "gpu-operator"
   atomic = true
-  # version = "0.9.3"
+  # Pinned 2026-05-17. v26.3.1's operator auto-detects the host OS via NFD
+  # and constructs `driver:<version>-ubuntu26.04` image tags, but NVIDIA
+  # has not published any ubuntu26.04 driver images yet. v25.10.1 falls
+  # back to ubuntu24.04 (which exists), so we stay here until NVIDIA ships
+  # 26.04 builds (or until the host kernel is rolled back to a 24.04 line
+  # one). See post-mortem 2026-05-17-gpu-driver-ubuntu2604-mismatch.md.
+  version = "v25.10.1"
   timeout = 6000
 
   values = [templatefile("${path.module}/values.yaml", {})]

--- a/stacks/nvidia/modules/nvidia/values.yaml
+++ b/stacks/nvidia/modules/nvidia/values.yaml
@@ -10,7 +10,21 @@ driver:
   #
   # Delete the cluster policy before each change
   # version: "575.57.08" # CUDA 12.9
-  version: "570.195.03" # CUDA 12.8
+  #
+  # 2026-05-17: tried bumping to 580.x with kernelModuleType=open but
+  # NVIDIA has NOT published any nvcr.io/nvidia/driver:*-ubuntu26.04
+  # images yet (skopeo list-tags shows 0 ubuntu26.04 tags vs 779 for
+  # ubuntu22.04 and 206 for ubuntu24.04). Chart v26.3.1's operator
+  # auto-detects the host OS (k8s-node1 was upgraded to Ubuntu 26.04
+  # with kernel 7.0.0-15-generic) and picks `<version>-ubuntu26.04` —
+  # which then 404s on pull. Rolled back to chart v25.10.1 + this
+  # 570.195.03 pin, which uses the ubuntu24.04 image suffix. That
+  # image still can't compile against kernel 7.0.0 (apt sources are
+  # 24.04 noble, which doesn't ship linux-headers-7.0.0-15-generic),
+  # so the host kernel needs to be rolled back to 6.8.0-117-generic
+  # (still installed in /boot) before the driver can come up.
+  # See post-mortem 2026-05-17-gpu-driver-ubuntu2604-mismatch.md.
+  version: "570.195.03" # CUDA 12.8 — pinned until NVIDIA ships ubuntu26.04 images
   upgradePolicy:
     autoUpgrade: false
 