infra/stacks
Viktor Barzin 128cfbbc30 nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).
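
The tag counts above can be reproduced from the JSON that `skopeo list-tags` prints. A minimal sketch, assuming that output has been captured as a string and that driver tags end in `-<os suffix>`; the helper name `count_tags_by_os` and the sample tags are made up for illustration:

```python
import json
from collections import Counter


def count_tags_by_os(list_tags_json: str, suffixes: tuple[str, ...]) -> Counter:
    """Count tags ending in "-<suffix>" for each OS suffix of interest."""
    # skopeo list-tags emits a JSON object with a "Tags" array.
    tags = json.loads(list_tags_json).get("Tags", [])
    # Start every suffix at 0 so a missing OS shows up explicitly.
    counts = Counter({suffix: 0 for suffix in suffixes})
    for tag in tags:
        for suffix in suffixes:
            if tag.endswith("-" + suffix):
                counts[suffix] += 1
    return counts


# Illustrative sample only, not real registry output.
sample = json.dumps({
    "Repository": "nvcr.io/nvidia/driver",
    "Tags": [
        "570.195.03-ubuntu22.04",
        "570.195.03-ubuntu24.04",
        "580.105.08-ubuntu24.04",
    ],
})
counts = count_tags_by_os(sample, ("ubuntu22.04", "ubuntu24.04", "ubuntu26.04"))
print(dict(counts))  # {'ubuntu22.04': 1, 'ubuntu24.04': 2, 'ubuntu26.04': 0}
```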

Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
that a future `terraform apply` does not hit the same trap again.
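
For illustration, the tag construction described above can be sketched as follows. This is not the operator's actual code; the function name `driver_image_tag` is hypothetical, and the label keys are an assumption modeled on NFD's os_release labels:

```python
# Assumed NFD label keys carrying the host OS identity.
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def driver_image_tag(driver_version: str, node_labels: dict[str, str]) -> str:
    """Build a "<version>-<os id><os version>" tag from node labels."""
    os_id = node_labels[OS_ID_LABEL]
    os_version = node_labels[OS_VERSION_LABEL]
    return f"{driver_version}-{os_id}{os_version}"


# Labels as they would look on a node reporting Ubuntu 26.04.
labels = {OS_ID_LABEL: "ubuntu", OS_VERSION_LABEL: "26.04"}
print(driver_image_tag("580.105.08", labels))  # 580.105.08-ubuntu26.04
```

Because the suffix comes from the node labels rather than from the chart values, pinning the driver version alone does not change which OS suffix is requested.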

Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. The operators in both v25.10.1 and v26.3.1 now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.

Files:
  - stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
    explanatory comment
  - stacks/nvidia/modules/nvidia/values.yaml — comment block
    documenting the situation; driver pinned at 570.195.03
  - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
    full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:56:05 +00:00
_template ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
actualbudget recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
affine recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
authentik infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
beads-server Bucket C: enroll 5 raw-deploy stacks in Keel auto-update 2026-05-16 23:14:43 +00:00
blog final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
broker-sync broker-sync(fidelity): un-suspend monthly CronJob 2026-05-17 00:36:23 +00:00
calico final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
changedetection enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
chrome-service recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
city-guesser enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
claude-agent-service recruiter-triage: AI culture & tooling section + warm-engage AI ask 2026-05-16 13:14:27 +00:00
claude-memory recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
cloudflared cloudflare: disable AI bot edge-block so x402 can issue payment offers 2026-05-10 18:37:29 +00:00
cnpg [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
coturn enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
crowdsec ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
cyberchef final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
dashy enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dawarich enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dbaas kured + cnpg: drain-safe defaults ahead of Monday reboot wave 2026-05-16 12:06:30 +00:00
descheduler final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
diun enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebook2audiobook enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebooks enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
echo enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
excalidraw enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
external-secrets recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
f1-stream final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
fire-planner ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
foolery recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
forgejo enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
freedify recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
freshrss ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
frigate ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
grampsweb ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hackmd ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
headscale infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
health ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hermes-agent ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
homepage final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
immich final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
infra [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 18:30:02 +00:00
infra-maintenance [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
insta2spotify ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
instagram-poster Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
isponsorblocktv ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
job-hunter ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
jsoncrack final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-dashboard final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-portal Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
k8s-version-upgrade final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
keel keel: enable Slack notifications on every upgrade 2026-05-16 13:01:35 +00:00
kms final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
kured ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
kyverno keel: use +() anchors on policy/match-tag so per-workload overrides stick 2026-05-17 00:32:19 +00:00
linkwarden ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
llama-cpp Bucket C: enroll 5 raw-deploy stacks in Keel auto-update 2026-05-16 23:14:43 +00:00
local-path final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
mailserver fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-10 21:57:01 +00:00
matrix ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
meshcentral ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
metallb [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
metrics-server [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
monitoring terminal: probe + alerts after Traefik replica routing-table skew 2026-05-17 10:04:26 +00:00
n8n ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
navidrome ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
netbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
networking-toolbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nextcloud ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nfs-csi nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) 2026-05-17 09:11:09 +00:00
nodelocal-dns [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) 2026-04-19 15:46:41 +00:00
novelapp Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
ntfy ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nvidia nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images 2026-05-17 10:56:05 +00:00
onlyoffice ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
openclaw openclaw: native MCP servers + daily claude-memory sync 2026-05-16 14:01:46 +00:00
osm_routing final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
owntracks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
paperless-ngx ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
payslip-ingest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
phpipam ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
poison-fountain ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
postiz Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
priority-pass ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
privatebin ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
proxmox-csi proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) 2026-05-10 18:22:25 +00:00
pvc-autoresizer [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
recruiter-responder recruiter-responder: bump image_tag to 50f43004 (backtest --persist) 2026-05-17 09:57:17 +00:00
redis fix: pvc-autoresizer threshold should be 10%, not 80% 2026-05-10 19:56:16 +00:00
reloader recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
resume ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
reverse-proxy chore: remove decommissioned registry.viktorbarzin.me ingress 2026-05-09 11:03:51 +00:00
rybbit ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
sealed-secrets [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
send ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
servarr ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
shadowsocks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
speedtest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
tandoor ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
technitium fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-10 21:57:01 +00:00
terminal terminal: probe + alerts after Traefik replica routing-table skew 2026-05-17 10:04:26 +00:00
tor-proxy ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
trading-bot final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
traefik ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
travel_blog final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
tuya-bridge ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
uptime-kuma Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
url ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
vault final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
vaultwarden Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
vpa ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
wealthfolio Woodpecker CI deploy [CI SKIP] 2026-05-16 13:45:45 +00:00
webhook_handler final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
whisper ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
wireguard wireguard: switch to iptables-nft so PostUp MASQUERADE works 2026-05-17 10:13:37 +00:00
woodpecker ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
xray infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
ytdlp ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00