Viktor Barzin 484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC

All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling migration, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.

Closes: code-gy7h

2026-04-25 17:10:00 +00:00


NFS-Hostile Workload Migration — Plan

Date: 2026-04-25
Design: 2026-04-25-nfs-hostile-migration-design.md
Beads: code-gy7h (Vault, epic), code-ahr7 (Immich PG)

Phase 1 — Immich PG (DONE 2026-04-25)

| Step | Done |
| --- | --- |
| Snapshot extensions + row counts to `/tmp/immich-pre-migration-*` | ✓ |
| Quiesce `immich-server` + `immich-machine-learning` + `immich-frame` | ✓ |
| `pg_dumpall` → `/tmp/immich-pre-migration-<ts>.sql` (1.9 GB) | ✓ |
| Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap) | ✓ |
| Swap `claim_name` at `infra/stacks/immich/main.tf` deployment | ✓ |
| Patch init container to gate on `PG_VERSION` (chicken-and-egg fix) | ✓ |
| Force pod restart so override.conf gets written | ✓ |
| Restore dump | ✓ |
| `REINDEX clip_index`, `REINDEX face_index` | ✓ |
| Scale apps back up | ✓ |
| Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external | ✓ |
| LV present on PVE host (`vm-9999-pvc-...`) | ✓ |
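
The quiesce and scale-up rows above can be sketched as one helper. This is a minimal sketch, not the commands actually run: the `immich` namespace, the assumption that all three apps are Deployments, and the `scale_immich` name are hypothetical.

```shell
# Assumptions (not stated in the plan): namespace "immich", all three
# apps are Deployments, and 1 replica each is the normal running state.
IMMICH_NS="immich"
IMMICH_APPS="immich-server immich-machine-learning immich-frame"

# scale_immich <replicas>: scale every Immich app to the given count.
# Stops at the first failure so a half-quiesced state is noticed.
scale_immich() {
    replicas="$1"
    for app in $IMMICH_APPS; do
        kubectl -n "$IMMICH_NS" scale deployment "$app" --replicas="$replicas" || return 1
    done
}
```

Usage would be `scale_immich 0` before `pg_dumpall` and `scale_immich 1` after the restore and REINDEX steps.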

Phase 1 follow-ups (not blocking)

Old NFS PVC immich-postgresql-data-host retained 7 days for rollback. After 2026-05-02: remove module.nfs_postgresql_host from infra/stacks/immich/main.tf and the CronJob's reference.
Backup CronJob (postgresql-backup) still writes to the NFS module. After cleanup, point it at a dedicated backup PVC or to the existing immich-backups NFS share.

Phase 2 — Vault Raft (DONE 2026-04-25)

Phase 2 complete 2026-04-25; all 3 voters on proxmox-lvm-encrypted.

Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC

Verify all 3 vault pods sealed=false, raft healthy.
Take a fresh snapshot with vault operator raft snapshot save (anchor saved at /tmp/vault-pre-migration-20260425-155029.snap, 1.5 MB).
Optional: scale ESO to 0 — skipped (auto-unseal sidecar is independent; ESO refresh churn is non-disruptive for one swap).
Confirmed leader is vault-2 → migrate vault-0 first (non-leader), vault-1 next, vault-2 last (with step-down). Plan originally assumed vault-0 was leader; same intent (non-leader first).
Thin pool headroom: 54.63% used, plenty for 6 × 2 GiB LVs.
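
The sealed=false check from the pre-flight list can be sketched as a loop over the three pods. This is a sketch under assumptions: it relies on `vault status -format=json` printing a `"sealed"` field, and the helper names `is_unsealed` and `preflight_unsealed` are hypothetical.

```shell
# Succeeds only if the JSON on stdin reports "sealed": false.
is_unsealed() {
    grep -Eq '"sealed":[[:space:]]*false'
}

# Check every voter; fail on the first pod that is sealed or unreachable.
# "vault status" exits non-zero when sealed, but it still prints the JSON,
# so the pipeline result is decided by the grep in is_unsealed.
preflight_unsealed() {
    for pod in vault-0 vault-1 vault-2; do
        if kubectl -n vault exec "$pod" -- vault status -format=json | is_unsealed; then
            echo "$pod: unsealed"
        else
            echo "$pod: SEALED or unreachable"
            return 1
        fi
    done
}
```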

Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC

Edit infra/stacks/vault/main.tf: change dataStorage.storageClass and auditStorage.storageClass from nfs-proxmox → proxmox-lvm-encrypted.
kubectl -n vault delete sts vault --cascade=orphan (StatefulSet volumeClaimTemplates is immutable; orphan keeps pods+PVCs alive while we recreate the controller with the new template).
tg apply -target=helm_release.vault → recreates the STS with the new VCT. (A full-stack tg plan blocks on unrelated for_each-with-apply-time-keys errors at lines 848/865/909/917, so a targeted apply on the helm release alone is the right scope here.) Existing pods stay on the old NFS PVCs.
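
After the targeted apply, one way to confirm the recreated StatefulSet carries the new template is to read the storage class out of its volumeClaimTemplates. This is a sketch, not a step that was run; the `vct_on_class` helper name is hypothetical.

```shell
# vct_on_class <storage-class>: succeed only if every volumeClaimTemplate
# on the vault StatefulSet names the expected storage class.
vct_on_class() {
    want="$1"
    # jsonpath [*] prints one class per template, space-separated.
    classes="$(kubectl -n vault get sts vault \
        -o jsonpath='{.spec.volumeClaimTemplates[*].spec.storageClassName}')" || return 1
    [ -n "$classes" ] || return 1
    for c in $classes; do
        [ "$c" = "$want" ] || return 1
    done
}
```

`vct_on_class proxmox-lvm-encrypted && echo "template updated"` would gate the first pod roll.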

Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC

kubectl -n vault delete pod vault-0 --grace-period=30
kubectl -n vault delete pvc data-vault-0 audit-vault-0
STS controller recreated pod; new PVCs auto-provisioned on proxmox-lvm-encrypted (LVs vm-9999-pvc-fb732fd7-... data 4.12%, vm-9999-pvc-36451f42-... audit 3.99%).
Hit and fixed: vault-0 CrashLoopBackOff'd with permission denied on /vault/data/vault.db. The helm chart's statefulSet.securityContext.pod block in main.tf only set fsGroupChangePolicy, replacing (not merging) the chart's defaults fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true. NFS exports made the missing fsGroup a no-op; ext4 LV needs it to chown the volume root for the vault user. Old vault-1/vault-2 pods were created before that block was added so they still had the chart-default securityContext from their original spec. Fix: provide all five fields explicitly in main.tf and re-apply. Same root cause will affect vault-1 and vault-2 swaps unless this stays in place.
Wait Ready; auto-unseal sidecar unsealed; retry_join rejoined raft cluster.
Verify: vault operator raft list-peers shows 3 voters, vault-0 follower, leader=vault-2. External HTTPS 200.
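
Because the securityContext regression only shows up on block storage, it may help to check the rendered pod spec before rolling the next voter. This is a minimal sketch: the expected value 1000 comes from the chart defaults quoted above, and the `check_fs_group` helper name is hypothetical.

```shell
# check_fs_group <pod>: confirm the pod spec carries fsGroup=1000, the
# chart default that the override in main.tf had dropped.
check_fs_group() {
    pod="$1"
    # A missing field renders as an empty string with kubectl's jsonpath output.
    got="$(kubectl -n vault get pod "$pod" -o jsonpath='{.spec.securityContext.fsGroup}')"
    if [ "$got" = "1000" ]; then
        echo "$pod: fsGroup=1000"
    else
        echo "$pod: fsGroup missing or unexpected ('$got')"
        return 1
    fi
}
```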

Step 2 — 24h soak (SKIPPED per user direction 2026-04-25)

User instructed "continue with all the remaining actions" — soak gates compressed to per-pod settle windows + raft-state verification between rollings. No Raft alarms, no Vault errors observed at each verification gate.

Step 3 — Roll vault-1 — DONE 2026-04-25

Force-finalize PVCs to break re-mount race: kubectl -n vault patch pvc data-vault-1 audit-vault-1 -p '{"metadata":{"finalizers":null}}' --type=merge. (Initial pod-then-PVC delete recreated pod on the OLD NFS PVCs because pvc-protection finalizer hadn't cleared. Lesson learned and applied to vault-2 below.)
Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
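
The force-finalize sequence can be sketched as a single helper. The ordering is an assumption rather than a record of what was typed: the PVC delete is issued first with `--wait=false` so the PVC is already terminating when the finalizer is cleared, and the pod is deleted last so the StatefulSet controller finds no old PVCs to re-mount. The `force_release_pvcs` name is hypothetical.

```shell
# force_release_pvcs <ordinal>: drop both PVCs of vault-<ordinal> before
# the pod is deleted, so the recreated pod provisions fresh PVCs from the
# new volumeClaimTemplates.
force_release_pvcs() {
    ordinal="$1"
    for pvc in "data-vault-$ordinal" "audit-vault-$ordinal"; do
        # Mark the PVC for deletion without blocking on pvc-protection.
        kubectl -n vault delete pvc "$pvc" --wait=false || return 1
        # Clear finalizers so the terminating PVC is removed immediately.
        kubectl -n vault patch pvc "$pvc" --type=merge \
            -p '{"metadata":{"finalizers":null}}' || return 1
    done
    # Only now delete the pod; the controller recreates pod and PVCs.
    kubectl -n vault delete pod "vault-$ordinal" --grace-period=30
}
```

Clearing finalizers bypasses the protection that keeps an in-use volume claim alive, so this only makes sense once a snapshot anchor exists and the other two voters are healthy.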

Step 4 — Settle window — DONE 2026-04-25

3-check verification over 90s; raft index advancing (2730010→2730012), all 3 voters healthy.
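
A settle window of this kind can be sketched as a repeated voter count. The sketch assumes the tabular `vault operator raft list-peers` output (two header lines, Voter as the last column) and a token already available inside the pod; the 30-second default interval and the helper names are hypothetical, and the raft index comparison is left out.

```shell
# Count rows of list-peers output (on stdin) whose Voter column is "true".
count_voters() {
    awk 'NR > 2 && $NF == "true" { n++ } END { print n + 0 }'
}

# settle_check [checks] [interval_seconds]: require 3 voters on every check.
settle_check() {
    checks="${1:-3}"
    interval="${2:-30}"
    i=1
    while [ "$i" -le "$checks" ]; do
        voters="$(kubectl -n vault exec vault-0 -- vault operator raft list-peers | count_voters)"
        if [ "$voters" -ne 3 ]; then
            echo "check $i: expected 3 voters, saw $voters"
            return 1
        fi
        echo "check $i: 3 voters"
        # Sleep between checks, but not after the last one.
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```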

Step 5 — Roll vault-2 (leader) — DONE 2026-04-25

vault operator step-down on vault-2; vault-0 took leadership. Confirmed vault-0 active, vault-1+vault-2 standby before delete.
Snapshot anchor at /tmp/vault-pre-vault2.snap (1.5 MB) from new leader vault-0.
Force-finalize + delete PVCs + delete pod (lesson from vault-1).
Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
vault operator raft list-peers shows 3 voters all healthy on encrypted storage; leader vault-0.
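
The leader check before the last roll can be sketched as a guard that steps down only when the target pod actually holds leadership. It assumes the same tabular list-peers output (State in the third column) and a token available in the pod; `raft_leader` and `step_down_if_leader` are hypothetical names.

```shell
# Print the node name of the current leader from list-peers output on stdin.
raft_leader() {
    awk 'NR > 2 && $3 == "leader" { print $1 }'
}

# step_down_if_leader <pod>: ask the pod to step down only if it is leader.
step_down_if_leader() {
    target="$1"
    leader="$(kubectl -n vault exec "$target" -- vault operator raft list-peers | raft_leader)"
    if [ "$leader" = "$target" ]; then
        kubectl -n vault exec "$target" -- vault operator step-down
    else
        echo "$target is not the leader (leader: $leader); no step-down needed"
    fi
}
```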

Step 6 — Cleanup — DONE 2026-04-25

kubectl get pvc -A (cluster-wide) shows zero PVCs on the nfs-proxmox SC (only Released PVs remain → Phase 3).
Removed inline kubernetes_storage_class.nfs_proxmox from infra/stacks/vault/main.tf (was lines 29–42).
All 3 PVC pairs on proxmox-lvm-encrypted.
vault operator raft autopilot state healthy=true.
External https://vault.viktorbarzin.me/v1/sys/health = 200.
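
The zero-consumers check that justified removing the storage class can be sketched as a filter over all PVCs. This is a sketch; the `pvcs_on_class` helper name and the column layout are choices made here, not taken from the plan.

```shell
# pvcs_on_class <storage-class>: print namespace/name for every PVC in the
# cluster that still references the given storage class.
pvcs_on_class() {
    kubectl get pvc -A --no-headers \
        -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName \
    | awk -v sc="$1" '$3 == sc { print $1 "/" $2 }'
}
```

An empty result from `pvcs_on_class nfs-proxmox` is the condition under which dropping the inline resource is safe.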

Phase 3 — Released-PV cleanup (FOLLOW-UP)

After Phase 1+2 land cleanly, ~30 PVs in the Released phase still hold dead LVs. Reclaim them by:

List Released PVs, confirm LV exists on PVE.
kubectl delete pv <name>. Note that with a Retain reclaim policy this removes only the PV object; Kubernetes does not ask the CSI driver to delete the underlying LV, so expect the next step to apply.
If LV survives: manual lvremove pve/vm-9999-pvc-<uuid>.
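
The first step of the reclaim list can be sketched as a filter over all PVs. This is a sketch; the `released_pvs` helper name and the choice of columns are assumptions made here.

```shell
# released_pvs: print name, storage class and reclaim policy for every PV
# whose phase is Released, as candidates for manual cleanup.
released_pvs() {
    kubectl get pv --no-headers \
        -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,SC:.spec.storageClassName,POLICY:.spec.persistentVolumeReclaimPolicy \
    | awk '$2 == "Released" { print $1, $3, $4 }'
}
```

Printing the reclaim policy alongside each name shows which PVs will leave their backing volume behind after `kubectl delete pv`.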

Rollback

| Phase | Trigger | Action |
| --- | --- | --- |
| 1 | Immich UI broken / data loss | Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC |
| 2 (mid-rolling) | Single pod broken | Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods |
| 2 (post-rolling, raft corrupt) | Cluster-wide failure | `vault operator raft snapshot restore <pre-migration.snap>` |
| Catastrophic | All Vault data lost | Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output |
