infra

Author	SHA1	Message	Date
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	b9ac942647	nvidia: fix driver install deadlock + extend startup probe Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. Deadlock: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. Startup probe timeout: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (300×10s=50min window, ~29min headroom over the ~21min install). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:53:44 +00:00
Viktor Barzin	128cfbbc30	nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some point. NVIDIA has NOT published ubuntu26.04 driver images yet (skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04 tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04). Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 + driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart applied cleanly but the v26.3.1 operator auto-detects host OS via NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404 on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so future `terraform apply` doesn't surface the same trap again. Note: chart rollback alone does NOT restore GPU functionality on k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the ubuntu26.04 suffix (the NFD label is sticky once detected). The actual recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver images, or (b) rolling the host kernel back to 6.8.0-117-generic (still installed in /boot, headers in /usr/src) + `apt-mark hold` to prevent re-upgrade. That step needs explicit user authorization for a node reboot — left as the next action item on code-8vr0. Files: - stacks/nvidia/modules/nvidia/main.tf — explicit version pin, explanatory comment - stacks/nvidia/modules/nvidia/values.yaml — comment block documenting the situation; driver pinned at 570.195.03 - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md — full timeline, root causes, recovery procedure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 10:56:05 +00:00

4 commits
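
The startup-probe arithmetic in b9ac942647 can be checked with a short sketch. This is an illustration only, not code from the repository: the `StartupProbe` class, `headroom_minutes` helper, and `INSTALL_STEPS_MIN` table are hypothetical names, the step durations are the approximate figures quoted in the commit message, and the model assumes the probe window is simply failureThreshold × periodSeconds (it ignores any initial delay or per-probe timeout).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupProbe:
    """Hypothetical model of a startup probe: only threshold and period."""
    failure_threshold: int
    period_seconds: int = 10  # the commit quotes a 10s period (120x10s)

    @property
    def window_seconds(self) -> int:
        # Time the container has to become ready before it is killed,
        # assuming no initial delay (a simplification).
        return self.failure_threshold * self.period_seconds


def headroom_minutes(probe: StartupProbe, install_minutes: float) -> float:
    """Minutes left in the probe window after the install finishes.

    A negative value means the install outlasts the window.
    """
    return probe.window_seconds / 60 - install_minutes


# Approximate per-step durations from the commit message, in minutes.
INSTALL_STEPS_MIN = {"apt headers": 2, "gcc make -j16": 12, "file copy": 7}
install_total = sum(INSTALL_STEPS_MIN.values())

old_probe = StartupProbe(failure_threshold=120)
new_probe = StartupProbe(failure_threshold=300)

for name, probe in (("old", old_probe), ("new", new_probe)):
    print(
        f"{name}: window={probe.window_seconds // 60}min, "
        f"headroom={headroom_minutes(probe, install_total):+.0f}min"
    )
```

Under these assumptions the old threshold leaves the install about a minute short of the window, while the new one leaves roughly half an hour of slack, which is consistent with the exit 137 described in the commit.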