vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that never exited, because the default fsGroupChangePolicy (Always) walks every file on the NFS-backed data PVC. With the NFS mount options retrans=3,timeo=30 and a 1 GB audit log, the recursive chown outlasted the deadline and restarted indefinitely, blocking raft quorum recovery.

OnRootMismatch makes the chown a no-op when the volume root already has the expected ownership, which it always does after initial setup. The breakglass fix was applied live via kubectl patch at 10:54 UTC; this commit persists it in Terraform so the next apply doesn't revert it.

The post-mortem also documents the upstream raft stuck-leader pattern, NFS kernel client corruption after force-kill, and the path to migrate Vault off NFS to proxmox-lvm-encrypted.
parent 6a4a477336
commit 2f1f9107f8
2 changed files with 153 additions and 0 deletions
@@ -117,6 +117,17 @@ resource "helm_release" "vault" {
         }
       }
 
+      # fsGroupChangePolicy=OnRootMismatch skips recursive chown on restart.
+      # Without this, kubelet walks every file over NFS each restart; during
+      # 2026-04-22 outage this looped for 10m+ and blocked quorum recovery.
+      statefulSet = {
+        securityContext = {
+          pod = {
+            fsGroupChangePolicy = "OnRootMismatch"
+          }
+        }
+      }
+
       # Mount unseal key secret
       extraVolumes = [{
         type = "secret"
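The commit message says OnRootMismatch turns the chown into a no-op when the volume root is already correct. The sketch below is a simplified model of that decision, not kubelet's actual code: the function name `needs_recursive_chown`, its parameters, and the fsGroup value 1000 are hypothetical, and the check is reduced to the root directory's group and setgid bit (the real check also compares permission bits).

```python
import stat


def needs_recursive_chown(root_gid: int, root_mode: int, fs_group: int,
                          policy: str = "Always") -> bool:
    """Return True if the whole volume would be walked on mount.

    Simplified model (assumption): only the root directory's group and
    setgid bit are inspected when the policy is OnRootMismatch.
    """
    if policy != "OnRootMismatch":
        # "Always" (the default) walks every file on every mount,
        # which is the slow path over NFS described in the commit message.
        return True
    group_ok = root_gid == fs_group
    setgid_ok = bool(root_mode & stat.S_ISGID)
    # Skip the walk only when the root already looks correct.
    return not (group_ok and setgid_ok)


# Hypothetical values: fsGroup 1000, root directory mode 2770.
print(needs_recursive_chown(1000, 0o2770, 1000, "Always"))
print(needs_recursive_chown(1000, 0o2770, 1000, "OnRootMismatch"))
```

Under this model the restart cost with OnRootMismatch is a single stat of the volume root rather than a walk whose duration grows with the number of files on the PVC.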