infra/docs/post-mortems
Viktor Barzin 2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00
2026-03-16-nfs-csi-cascade-failure.md docs: move post-mortems to docs/post-mortems/ 2026-04-14 08:20:09 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] 2026-04-14 18:09:11 +00:00
2026-04-14-postmortem-pipeline-test.md fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
2026-04-18-authentik-outpost-shm-full.md [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha 2026-04-18 13:23:14 +00:00
2026-04-19-registry-orphan-index.md [registry] bulk-clean 34 orphan manifests + beads-server image bump 2026-04-19 23:16:34 +00:00
2026-04-22-vault-raft-leader-deadlock.md vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem 2026-04-22 11:12:19 +00:00
index.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00