vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem

During the 2026-04-22 Vault outage, kubelet was stuck in a 2-minute
chown loop that never exited: the default fsGroupChangePolicy (Always)
walks every file on the NFS-backed data PVC. With the NFS mount options
retrans=3,timeo=30 and a 1 GB audit log, the recursive chown outlasted
the deadline and restarted forever, blocking raft quorum recovery.
OnRootMismatch makes the chown a no-op when the volume root is already
correct, which it always is after initial setup.
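
The difference between the two policies can be sketched as a toy model.
This is not kubelet's code: the Volume type, the chown_calls helper, and
the assumption that the root check compares only the root directory's
group and setgid bit are illustrative simplifications.

```python
from dataclasses import dataclass, field

SETGID = 0o2000  # setgid bit on the volume root directory


@dataclass
class Volume:
    """Hypothetical stand-in for a mounted PVC."""
    root_gid: int
    root_mode: int
    files: list[str] = field(default_factory=list)


def root_matches(vol: Volume, fs_group: int) -> bool:
    # Assumed check: root group equals fsGroup and the setgid bit is set.
    return vol.root_gid == fs_group and (vol.root_mode & SETGID) != 0


def chown_calls(vol: Volume, fs_group: int, policy: str) -> int:
    """Count the ownership operations one mount would issue under `policy`."""
    if policy == "OnRootMismatch" and root_matches(vol, fs_group):
        return 0  # root already correct: the recursive walk is skipped
    return 1 + len(vol.files)  # the root plus every file beneath it


data = Volume(root_gid=1000, root_mode=0o2770,
              files=[f"raft/segment-{i}" for i in range(500)])
print(chown_calls(data, 1000, "Always"))
print(chown_calls(data, 1000, "OnRootMismatch"))
```

Under Always the cost grows with the file count on every mount, which
is what hurts when each operation is a round trip to the NFS server.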

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
Viktor Barzin 2026-04-22 11:12:19 +00:00
parent 6a4a477336
commit 2f1f9107f8
2 changed files with 153 additions and 0 deletions

@@ -117,6 +117,17 @@ resource "helm_release" "vault" {
}
}
# fsGroupChangePolicy=OnRootMismatch skips the recursive chown on restart
# when the volume root already has the right ownership. Without this,
# kubelet walks every file over NFS on each restart; during the
# 2026-04-22 outage this looped for 10m+ and blocked quorum recovery.
statefulSet = {
  securityContext = {
    pod = {
      fsGroupChangePolicy = "OnRootMismatch"
    }
  }
}
# Mount unseal key secret
extraVolumes = [{
type = "secret"