From 288efa89b336886053625646e8a8a1c554484f59 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 25 Apr 2026 16:19:49 +0000
Subject: [PATCH] vault: migrate vault-0 storage to proxmox-lvm-encrypted
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 2 of the NFS-hostile migration: the data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, followed by a per-pod rolling swap (24h soak
between pods). The vault-0 swap is done; vault-1 + vault-2 are still
on NFS — the rolling part is what makes this safe (raft quorum is
maintained by 2 healthy pods while one is replaced).

Also restores the chart-default pod securityContext fields. The
previous `statefulSet.securityContext.pod = {fsGroupChangePolicy =
"..."}` block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; on an
ext4 LV the volume root is root:root, so the vault user (UID 100)
couldn't open vault.db and the pod went into CrashLoopBackOff. Fix:
provide all five fields explicitly so they survive future chart
bumps.

vault-1 and vault-2 retained their correct securityContext from when
their pod specs were written to etcd, before the partial
customization landed — the bug only surfaces when a pod is
recreated.

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7
---
 ...2026-04-25-nfs-hostile-migration-design.md | 23 ++++++
 .../2026-04-25-nfs-hostile-migration-plan.md  | 73 ++++++++++++-------
 stacks/vault/main.tf                          | 11 ++-
 3 files changed, 80 insertions(+), 27 deletions(-)

diff --git a/docs/plans/2026-04-25-nfs-hostile-migration-design.md b/docs/plans/2026-04-25-nfs-hostile-migration-design.md
index 0b74d050..832064ea 100644
--- a/docs/plans/2026-04-25-nfs-hostile-migration-design.md
+++ b/docs/plans/2026-04-25-nfs-hostile-migration-design.md
@@ -90,6 +90,29 @@
 no RWX media migration, no backup-target migration.
 + nightly `vault-raft-backup` CronJob. RTO < 1h via snapshot restore.
 
+## Helm `securityContext.pod` replace-not-merge (Vault, discovered during execution)
+
+The Vault helm chart sets pod-level securityContext defaults
+(`fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true`)
+from chart templates, not from values.yaml. When `main.tf` provided
+its own `server.statefulSet.securityContext.pod = {fsGroupChangePolicy
+= "OnRootMismatch"}`, the helm rendering REPLACED the chart defaults
+rather than merging into them. On NFS this was harmless (`async,
+insecure` exports made the volume world-writable enough for any UID),
+but on a fresh ext4 LV via Proxmox CSI the volume root is `root:root`
+and the vault user (UID 100) cannot open `/vault/data/vault.db`.
+
+vault-1 and vault-2 happened to be Running with the correct
+securityContext because their pod specs were written into etcd
+**before** the customization landed; helm chart upgrades don't
+restart pods (the server STS uses the OnDelete update strategy), so
+the broken values lay dormant until vault-0 was recreated under the
+re-applied StatefulSet during this migration.
+
+Resolution: provide all five fields (`fsGroup`, `fsGroupChangePolicy`,
+`runAsGroup`, `runAsUser`, `runAsNonRoot`) explicitly in main.tf so
+`runAsGroup=1000` etc. survive future chart bumps. The fix is
+idempotent on both fresh PVCs and existing pods.
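+
+A quick way to see the replace-vs-merge effect on a live pod (sketch;
+assumes the `vault` namespace and pod names used in this repo):
+
+```sh
+# Pod-level securityContext the STS controller actually applied
+kubectl -n vault get pod vault-0 -o jsonpath='{.spec.securityContext}'
+# broken render: {"fsGroupChangePolicy":"OnRootMismatch"}
+# fixed render:  {"fsGroup":1000,"fsGroupChangePolicy":"OnRootMismatch",
+#                 "runAsGroup":1000,"runAsNonRoot":true,"runAsUser":100}
+```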
+
 ## Init container chicken-and-egg (Immich PG, discovered during execution)
 
 The pre-existing `write-pg-override-conf` init container on the
diff --git a/docs/plans/2026-04-25-nfs-hostile-migration-plan.md b/docs/plans/2026-04-25-nfs-hostile-migration-plan.md
index e36041c3..73d863dd 100644
--- a/docs/plans/2026-04-25-nfs-hostile-migration-plan.md
+++ b/docs/plans/2026-04-25-nfs-hostile-migration-plan.md
@@ -32,51 +32,74 @@
 ## Phase 2 — Vault Raft (IN PROGRESS)
 
-### Pre-flight (T-0)
+### Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC
 
-- [ ] Verify all 3 vault pods sealed=false, raft healthy.
-- [ ] Take fresh `vault operator raft snapshot save` (anchor).
-- [ ] Optional: scale ESO to 0 to reduce mid-migration churn.
-- [ ] Step-down leader if it's not vault-0 (current leader: vault-2 — needs step-down).
-- [ ] Verify thin pool headroom on PVE.
+- [x] Verify all 3 vault pods sealed=false, raft healthy.
+- [x] Take fresh `vault operator raft snapshot save` (anchor saved at
+  `/tmp/vault-pre-migration-20260425-155029.snap`, 1.5 MB).
+- [ ] Optional: scale ESO to 0 — skipped (the auto-unseal sidecar is
+  independent; ESO refresh churn is non-disruptive for one swap).
+- [x] Confirmed leader is **vault-2** → migrate vault-0 first
+  (non-leader), vault-1 next, vault-2 last (with step-down).
+  The plan originally assumed vault-0 was the leader; the intent is
+  the same (non-leader first).
+- [x] Thin pool headroom: 54.63% used, plenty for 6 × 2 GiB LVs.
 
-### Step 0 — Helm values + StatefulSet swap
+### Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC
 
-- [ ] Edit `infra/stacks/vault/main.tf`: change
+- [x] Edit `infra/stacks/vault/main.tf`: change
   `dataStorage.storageClass` and `auditStorage.storageClass` from
   `nfs-proxmox` → `proxmox-lvm-encrypted`.
-- [ ] `kubectl -n vault delete sts vault --cascade=orphan` (StatefulSet
+- [x] `kubectl -n vault delete sts vault --cascade=orphan` (StatefulSet
   `volumeClaimTemplates` is immutable; orphan keeps pods+PVCs alive
   while we recreate the controller with the new template).
-- [ ] `tg apply` → recreates StatefulSet with new VCT. Existing pods
-  still on old NFS PVCs.
+- [x] `tg apply -target=helm_release.vault` → recreates the STS with
+  the new VCT (a full-stack `tg plan` blocks on unrelated
+  for_each-with-apply-time-keys errors at lines 848/865/909/917; a
+  targeted apply on the helm release alone is the right scope here).
+  Existing pods are still on the old NFS PVCs.
 
-### Step 1 — Roll vault-2 (T+0)
+### Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC
 
-- [ ] `kubectl -n vault delete pod vault-2 --grace-period=30`
-- [ ] `kubectl -n vault delete pvc data-vault-2 audit-vault-2`
-- [ ] STS controller recreates pod; new PVCs auto-provision on
-  `proxmox-lvm-encrypted`.
-- [ ] Wait Ready; auto-unseal sidecar unseals; `retry_join` rejoins
+- [x] `kubectl -n vault delete pod vault-0 --grace-period=30`
+- [x] `kubectl -n vault delete pvc data-vault-0 audit-vault-0`
+- [x] STS controller recreated the pod; new PVCs auto-provisioned on
+  `proxmox-lvm-encrypted` (data LV `vm-9999-pvc-fb732fd7-...` at
+  4.12%, audit LV `vm-9999-pvc-36451f42-...` at 3.99%).
+- [x] **Hit and fixed**: vault-0 CrashLoopBackOff'd with
+  `permission denied` on `/vault/data/vault.db`. The
+  `statefulSet.securityContext.pod` block in main.tf only set
+  `fsGroupChangePolicy`, replacing (not merging) the chart's
+  defaults `fsGroup=1000, runAsGroup=1000, runAsUser=100,
+  runAsNonRoot=true`. NFS exports made the missing fsGroup a
+  no-op; an ext4 LV needs it so kubelet chowns the volume root for
+  the vault user. The old vault-1/vault-2 pods were created before
+  that block was added, so they still carry the chart-default
+  securityContext from their original spec. Fix: provide all five
+  fields explicitly in main.tf and re-apply. The same root cause
+  will hit the vault-1 and vault-2 swaps unless this stays in place.
+- [x] Wait Ready; auto-unseal sidecar unsealed; `retry_join` rejoined
   raft cluster.
-- [ ] Verify: `vault operator raft list-peers` shows 3 voters,
-  vault-2 reachable.
+- [x] Verify: `vault operator raft list-peers` shows 3 voters,
+  vault-0 follower, leader=vault-2. External HTTPS 200 (spot checks
+  sketched below).
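+
+Post-swap spot checks (sketch; nothing here needs a Vault token):
+
+```sh
+# Seal state / HA mode of the rebuilt pod
+kubectl -n vault exec vault-0 -- vault status
+# New PVCs actually landed on the encrypted storage class
+kubectl -n vault get pvc data-vault-0 audit-vault-0 \
+  -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName
+```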
 
-### Step 2 — 24h soak
+### Step 2 — 24h soak (IN PROGRESS, ends ~2026-04-26 16:18 UTC)
 
 Wait 24h. Confirm no Raft alarms, no Vault errors, downstream
-healthy. Rollback window for vault-2 closes here.
+healthy. Rollback window for vault-0 closes here.
 
 ### Step 3 — Roll vault-1 (T+24h)
 
-Same shape as Step 1.
+Same shape as Step 1. The securityContext fix is now in main.tf,
+so this one should be straightforward.
 
 ### Step 4 — 24h soak
 
-### Step 5 — Roll vault-0 (T+48h)
+### Step 5 — Roll vault-2 (T+48h, leader)
 
-- [ ] If vault-0 is leader at this point, step-down first:
-  `kubectl -n vault exec vault-0 -- vault operator step-down`.
+- [ ] Step-down vault-2 first:
+  `kubectl -n vault exec vault-2 -- vault operator step-down`.
 - [ ] Then delete pod + PVCs as Step 1.
 
 ### Step 6 — Cleanup
diff --git a/stacks/vault/main.tf b/stacks/vault/main.tf
index 4fd47595..52ea3b6a 100644
--- a/stacks/vault/main.tf
+++ b/stacks/vault/main.tf
@@ -72,13 +72,13 @@ resource "helm_release" "vault" {
       dataStorage = {
         enabled = true
         size = "2Gi"
-        storageClass = "nfs-proxmox" # Proxmox host NFS (was nfs-truenas)
+        storageClass = "proxmox-lvm-encrypted" # Migrated 2026-04-25 from nfs-proxmox; raft fsync is NFS-hostile (post-mortems/2026-04-22-vault-raft-leader-deadlock.md)
       }
 
       auditStorage = {
         enabled = true
         size = "2Gi"
-        storageClass = "nfs-proxmox" # Proxmox host NFS (was nfs-truenas)
+        storageClass = "proxmox-lvm-encrypted" # Migrated 2026-04-25 from nfs-proxmox
       }
 
       standalone = { enabled = false }
@@ -120,10 +120,17 @@ resource "helm_release" "vault" {
       # fsGroupChangePolicy=OnRootMismatch skips recursive chown on restart.
       # Without this, kubelet walks every file over NFS each restart; during
       # 2026-04-22 outage this looped for 10m+ and blocked quorum recovery.
+      # The other four fields restore the chart defaults — providing pod{}
+      # replaces them, and the missing fsGroup left vault unable to write to
+      # the freshly-formatted ext4 PVC during the 2026-04-25 migration.
       statefulSet = {
         securityContext = {
           pod = {
             fsGroupChangePolicy = "OnRootMismatch"
+            fsGroup = 1000
+            runAsGroup = 1000
+            runAsUser = 100
+            runAsNonRoot = true
           }
         }
       }
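+
+      # Post-apply sanity check (sketch; assumes the release is named
+      # `vault` in the `vault` namespace): `helm -n vault get values vault`
+      # should now show all five keys under
+      # server.statefulSet.securityContext.pod. Existing pods keep their old
+      # spec until deleted (OnDelete strategy), so each pod only picks this
+      # up when it is rolled per the migration plan.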