6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):
1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
for nvidia-operator-validator to shut down. The validator's
driver-validation init container was in an infinite poll loop checking
for /run/nvidia/validations/.driver-ctr-ready (which only appears after
a successful driver install). The validator pod had deletionTimestamp
set but its container remained in Terminating state indefinitely.
Fix: force-delete the stuck Terminating validator pod to break the
deadlock (kubectl delete --force --grace-period=0).
2. **Startup probe timeout**: Full driver install on this hardware
(apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
exactly exhausted the default 120×10s=20min startup probe window,
causing SIGKILL (exit 137) at exactly 21 minutes even when the install
was succeeding. Extended failureThreshold 120→300 (50min headroom).
Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.
Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).
Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.
Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.
Files:
- stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
explanatory comment
- stacks/nvidia/modules/nvidia/values.yaml — comment block
documenting the situation; driver pinned at 570.195.03
- docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
full timeline, root causes, recovery procedure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>