From 4cb2c157da6e6967b6153e6b4630de176171c34f Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Wed, 22 Apr 2026 11:44:56 +0000
Subject: [PATCH] =?UTF-8?q?post-mortem=202026-04-22:=20full=20timeline=20?=
 =?UTF-8?q?=E2=80=94=20second=20regression=20+=20node4=20reboot?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The initial recovery at 11:03 was premature; vault-1's audit writes over NFS
started hanging ~15 min later and the cluster regressed to 503. Full recovery
required rebooting node4 (to free vault-0's stuck NFS mount and shed PVE NFS
thread contention) and a second reboot of node3 (to clear another round of
kernel NFS client degradation). Final recovery at 11:43:28 UTC with vault-2 as
active leader on a vault-0 + vault-2 quorum.

vault-1 remains stuck in ContainerCreating on node2 — a third node2 reboot is
required for full 3/3 quorum, but 2/3 is operationally sufficient, so that's
deferred.
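
For the record, the recovery condenses to the sketch below. The `vault`
namespace/StatefulSet name and the exact `kubectl patch` invocation are
assumptions; the VM IDs, nfsd thread count, and `fsGroupChangePolicy` value
are taken from the timeline.

```bash
# 10:54 - skip the recursive chown on subsequent pod recreations
# (namespace/StatefulSet name "vault" is an assumption)
kubectl -n vault patch statefulset vault --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'

# 11:01 / 11:42 - hard-reset node3; 11:29 - hard-reset node4 (run on the PVE host)
qm reset 203
qm reset 204

# 11:31 - raise NFS server threads on PVE (did not clear the per-client hangs)
echo 64 > /proc/fs/nfsd/threads

# Verify quorum and leadership once vault-0/vault-2 are 2/2 Running
# (list-peers needs a valid Vault token in the pod environment).
kubectl -n vault exec vault-2 -- vault status
kubectl -n vault exec vault-2 -- vault operator raft list-peers
```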
---
 .../2026-04-22-vault-raft-leader-deadlock.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md b/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md
index 1be0ab20..ee66c054 100644
--- a/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md
+++ b/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md
@@ -3,10 +3,10 @@
 | Field | Value |
 |-------|-------|
 | **Date** | 2026-04-22 |
-| **Duration** | External endpoint 503 from ~09:00 UTC to 11:03 UTC (~2h). Full cluster recovery ~12:30 UTC after node4 reboot. |
+| **Duration** | External endpoint 503 from ~09:00 UTC to ~11:43 UTC (~2h 43m). vault-2 became active leader at 11:43:28 UTC. |
 | **Severity** | SEV1 (Vault — single source of secrets for 40+ services) |
 | **Affected Services** | All ESO-backed services (password rotation paused). CronJobs that read plan-time secrets (14 stacks). Woodpecker CI (blocked pipeline `d39770b3`). Everything with `ExternalSecret` refresh interval ≤ 2h. |
-| **Status** | Resolved for user traffic; vault-0 pod replacement pending node4 reboot. Terraform not yet reapplied. |
+| **Status** | Vault HA operational with a vault-0 + vault-2 quorum. vault-1 still stuck in ContainerCreating on node2 (third node2 reboot pending; 2/3 quorum is operationally sufficient). Terraform fix committed as `2f1f9107`; apply pending. |
 
 ## Summary
 
@@ -40,7 +40,14 @@ A Vault raft leader (`vault-2`) entered a stuck goroutine state where its cluste
 | **10:54** | Patched the Vault `StatefulSet` with `fsGroupChangePolicy: OnRootMismatch` so subsequent recreations skip the recursive chown. |
 | **10:57** | Force-deleted `vault-2` and `06fa940b` pod directory on node3. New pod spawned but kubelet again stuck on phantom state from the old pod. |
 | **11:01** | **Hard-reset node3 VM** (`qm reset 203`). |
-| **11:03** | **External endpoint returns 200.** vault-1 elected leader, vault-2 standby, both 2/2 Running. |
+| **11:03** | First 200 response: vault-1 elected leader, vault-2 standby. Premature celebration — vault-1's audit log on node2 NFS starts timing out; `/sys/ha-status` returns 500 even though raft thinks vault-1 is active. |
+| **~11:18** | Service regresses. vault-1's audit writes hang (`event not processed by enough 'sink' nodes, context deadline exceeded`). Readiness probe fails; pod goes 1/2; the `vault-active` endpoint stays pointed at vault-1's IP but the backend is unresponsive → 503. |
+| **11:22** | Force-restarted `vault-1` to trigger re-election with a new pod. Delete + containerd-shim cleanup leaves yet another zombie on node2. Same pattern: force-delete → zombie. |
+| **11:29** | **Hard-reset node4 VM** (`qm reset 204`). Rationale: vault-0 was still blocked there, and the 74 pods on node4 contribute to NFS server load (load avg 16 on PVE). After the reboot, vault-0 mounts its PVCs on fresh kernel state and comes up 2/2 Running at 11:31. |
+| **11:31** | Increased PVE NFS threads from 16 to 64 (`echo 64 > /proc/fs/nfsd/threads`). Did not help the immediate mount failures — the stuck state is per-client kernel, not server capacity. |
+| **11:38** | Discovered a DNS resolution issue: vault-2's Go resolver returns NXDOMAIN for the short name `vault-0.vault-internal` even though the glibc resolver works. An earlier CoreDNS restart hadn't fixed it. Restarted the vault-2 pod to force fresh resolver state. |
+| **11:42** | **Second hard-reset of node3 VM** (`qm reset 203`). Kubelet and CSI re-register; vault-2 is scheduled and its NFS mounts finally succeed on fresh kernel state. |
+| **11:43:28** | **vault-2 becomes active leader.** External endpoint returns 200 and stays there. vault-0 is a follower and catches up to raft index 2477632+. vault-1 still stuck on node2; left for later recovery. |
 
 ## Root Cause Chain