infra

Viktor Barzin b9ac942647 nvidia: fix driver install deadlock + extend startup probe Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. Deadlock: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. Startup probe timeout: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (50min headroom). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>		2026-05-25 11:53:44 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html	docs: consolidate all post-mortems under docs/post-mortems/	2026-04-14 08:24:36 +00:00
2026-03-16-nfs-csi-cascade-failure.md	docs: move post-mortems to docs/post-mortems/	2026-04-14 08:20:09 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]	2026-04-14 18:09:11 +00:00
2026-04-14-postmortem-pipeline-test.md	fix: use full path to claude CLI for non-interactive SSH	2026-04-14 17:44:50 +00:00
2026-04-18-authentik-outpost-shm-full.md	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items	2026-05-10 16:28:11 +00:00
2026-04-19-registry-orphan-index.md	[registry] bulk-clean 34 orphan manifests + beads-server image bump	2026-04-19 23:16:34 +00:00
2026-04-22-vault-raft-leader-deadlock.md	vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC	2026-04-25 17:10:00 +00:00
2026-05-09-io-pressure-stale-nfs.md	mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB	2026-05-09 17:01:57 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md	docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16)	2026-05-16 12:17:26 +00:00
2026-05-17-gpu-driver-ubuntu2604-mismatch.md	nvidia: fix driver install deadlock + extend startup probe	2026-05-25 11:53:44 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md	nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)	2026-05-17 09:11:09 +00:00
index.html	docs: consolidate all post-mortems under docs/post-mortems/	2026-04-14 08:24:36 +00:00