infra/docs/plans/2026-04-25-nfs-hostile-migration-design.md
Viktor Barzin 288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).

Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly,
survives future chart bumps. vault-1 and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:19:49 +00:00


NFS-Hostile Workload Migration — Design

Date: 2026-04-25
Author: Viktor (with Claude)
Status: Phase 1 done, Phase 2 in progress
Beads: code-gy7h (Vault), code-ahr7 (Immich PG)

Problem

The 2026-04-22 Vault Raft leader deadlock (post-mortem 2026-04-22-vault-raft-leader-deadlock.md) was traced to NFS client writeback stalls poisoning kernel state. Recovery took 2h43m and required hard-resetting 3 of 4 cluster VMs. Two workload classes currently on NFS are NFS-hostile per the criteria in infra/.claude/CLAUDE.md ("Critical services MUST NOT use NFS"):

  1. Postgres with WAL fsync per commit — Immich primary
  2. Vault Raft consensus log — fsync per append-entry, 3 replicas

Everything else on NFS (47 PVCs, ~455 GiB) is correctly placed: RWX media libraries, append-only backups, ML caches.

Decision

Migrate exactly those two workload classes to proxmox-lvm-encrypted (LUKS2 LVM-thin via Proxmox CSI). No iSCSI, no RWX media migration, no backup-target migration.

Rationale

  • Block storage decouples PG / Raft fsync from NFS client kernel state. Failure mode that triggered the post-mortem cannot recur for these workloads.
  • proxmox-lvm-encrypted is the documented default for sensitive data (infra/.claude/CLAUDE.md storage decision rule). It already backs ~28 PVCs across the cluster — pattern is proven.
  • Existing nightly lvm-pvc-snapshot PVE host script (03:00, 7-day retention) auto-picks-up new PVCs via thin snapshots — no extra backup wiring needed for the live data side.
  • LUKS2 satisfies "encrypted at rest for sensitive data" requirement.

Out of scope

  • iSCSI evaluation (already retired 2026-04-13).
  • RWX media (Immich library, music, ebooks) — correct placement.
  • Backup target PVCs (*-backup on NFS) — append-only, NFS-tolerant.
  • Prometheus 200 GiB — already on proxmox-lvm.

Pattern per workload

Immich PG (single replica, Deployment, Recreate strategy)

  • Add new RWO PVC on proxmox-lvm-encrypted.
  • Quiesce app pods (server + ML + frame).
  • pg_dumpall from running NFS pod → local file.
  • Swap deployment claim_name → encrypted PVC.
  • PG bootstraps fresh on empty PVC; restore dump.
  • REINDEX vector indexes (clip_index, face_index).
  • Backup CronJob keeps writing to NFS module (correct: append-only).
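
The dump and restore steps above could be scripted roughly as follows. This is a sketch, not the commands that were actually run: the namespace `immich`, the deployment name `immich-postgresql`, the `postgres` superuser, the `immich` database name, the dump path and the two function names are all assumptions made for illustration. The claim_name swap itself happens between the two functions, outside this script.

```shell
#!/usr/bin/env bash
set -eu

NS="${NS:-immich}"                          # assumption: Immich namespace
PG="${PG:-deploy/immich-postgresql}"        # assumption: PG deployment name
DUMP="${DUMP:-./immich-pre-migration.sql}"  # hypothetical local dump path

# Step 1: with the app pods quiesced, dump everything from the NFS-backed pod.
dump_pg() {
  kubectl -n "$NS" exec "$PG" -- pg_dumpall -U postgres > "$DUMP"
  # Refuse to continue with an empty dump.
  [ -s "$DUMP" ] || { echo "dump is empty" >&2; return 1; }
}

# Step 2: run after claim_name points at the encrypted PVC and PG has
# bootstrapped on the empty volume.
restore_pg() {
  kubectl -n "$NS" exec -i "$PG" -- psql -U postgres -d postgres < "$DUMP"
  # Rebuild the vector indexes named in the plan.
  kubectl -n "$NS" exec "$PG" -- psql -U postgres -d immich \
    -c 'REINDEX INDEX clip_index;' -c 'REINDEX INDEX face_index;'
}
```

Keeping the dump on local disk rather than on either PVC means the rollback path (revert claim_name) does not depend on the new volume.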

Vault Raft (3 replicas, StatefulSet, helm-managed)

  • Change dataStorage.storageClass and auditStorage.storageClass from nfs-proxmox to proxmox-lvm-encrypted.
  • StatefulSet volumeClaimTemplates is immutable → use kubectl delete sts vault --cascade=orphan then re-apply (memory pattern for VCT swaps).
  • Per-pod rolling: delete pod + PVCs, controller recreates with new template. Auto-unseal sidecar handles unseal; raft retry_join rejoins cluster.
  • 24h validation window between pods. Migrate non-leader pods first; step-down current leader before migrating it last.
  • Backup target (vault-backup-host on NFS) stays on NFS.
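
A minimal sketch of one per-pod swap, assuming the release lives in a `vault` namespace and the PVCs are named `data-vault-N` and `audit-vault-N`. Neither is confirmed by this document, and `swap_vault_pod` is a hypothetical helper. The ordering is simplified: the 24h soak and the leader step-down stay manual.

```shell
#!/usr/bin/env bash
set -eu

NS="${NS:-vault}"                          # assumption: release namespace
NEW_SC="${NEW_SC:-proxmox-lvm-encrypted}"

# Hypothetical helper: swap one Vault pod (by ordinal) onto the new storage class.
# Assumes the StatefulSet was orphan-deleted and re-applied once beforehand:
#   kubectl -n "$NS" delete sts vault --cascade=orphan
swap_vault_pod() {
  local pod="vault-$1"
  local sc
  # --wait=false: the PVCs stay Terminating until the pod mounting them is gone.
  kubectl -n "$NS" delete pvc "data-${pod}" "audit-${pod}" --wait=false
  kubectl -n "$NS" delete pod "$pod"
  # Poll until the controller has recreated the pod, then wait for readiness.
  until kubectl -n "$NS" get pod "$pod" >/dev/null 2>&1; do sleep 2; done
  kubectl -n "$NS" wait --for=condition=Ready "pod/${pod}" --timeout=10m
  # Confirm the recreated data PVC landed on the new storage class.
  sc="$(kubectl -n "$NS" get pvc "data-${pod}" -o jsonpath='{.spec.storageClassName}')"
  [ "$sc" = "$NEW_SC" ] || { echo "data-${pod} is on ${sc}, expected ${NEW_SC}" >&2; return 1; }
}
```

Usage would be `swap_vault_pod 0`, followed by the validation window before moving to the next ordinal. The final storage-class check is there because a pod that reports Ready on a reused NFS PVC would otherwise look like a successful swap.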

Risks and rollbacks

Immich PG

  • pg_dumpall captures schema + data, not file-level state. Vector index versions matter (vchord 0.3.0 unchanged; vector 0.8.0 → 0.8.1 is a minor automatic bump on CREATE EXTENSION — confirmed benign). Rollback: revert claim_name, scale apps; old NFS PVC retained for 7 days post-migration.

Vault Raft

  • Cluster keeps quorum from the 2 remaining healthy replicas while one pod is swapped. Migrating the leader last avoids quorum churn.
  • Recovery anchor: pre-migration vault operator raft snapshot save + nightly vault-raft-backup CronJob. RTO < 1h via snapshot restore.

Helm securityContext.pod replace-not-merge (Vault, discovered during execution)

The Vault helm chart sets pod-level securityContext defaults (fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true) from chart templates, not from values.yaml. When main.tf provided its own server.statefulSet.securityContext.pod = {fsGroupChangePolicy = "OnRootMismatch"}, the helm rendering REPLACED the chart defaults rather than merging into them, so the other four fields were silently dropped. On NFS this was harmless (async, insecure exports made the volume world-writable enough for any UID), but on a fresh ext4 LV via Proxmox CSI the volume root is root:root, and without fsGroup the vault user (UID 100) cannot open /vault/data/vault.db. The pod then goes into CrashLoopBackOff.

vault-1 and vault-2 happened to be Running with the correct securityContext because their pod specs were written into etcd before the customization landed; helm chart upgrades don't restart pods, so the broken values lay dormant until vault-0 was recreated by the orphan-deleted STS during this migration.

Resolution: provide all five fields (fsGroup, fsGroupChangePolicy, runAsGroup, runAsUser, runAsNonRoot) explicitly in main.tf so runAsGroup=1000 etc. survive future chart bumps. Idempotent on both fresh PVCs and existing pods.
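
One way to catch this class of regression is to check the rendered pod spec for all five fields. The helper below is a hypothetical sketch, not part of the repo. It reads the securityContext JSON on stdin and prints whichever of the five fields are absent.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print any of the five pod securityContext fields
# missing from the JSON object supplied on stdin.
missing_pod_security_fields() {
  local json field
  json="$(cat)"
  for field in fsGroup fsGroupChangePolicy runAsGroup runAsUser runAsNonRoot; do
    # Match the quoted key so fsGroup does not match fsGroupChangePolicy.
    case "$json" in
      *"\"${field}\":"*) ;;
      *) echo "$field" ;;
    esac
  done
}
```

Assuming the pod runs in a `vault` namespace, it could be fed from the live spec with `kubectl -n vault get pod vault-0 -o jsonpath='{.spec.securityContext}' | missing_pod_security_fields`. Empty output means all five fields survived rendering.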

Init container chicken-and-egg (Immich PG, discovered during execution)

The pre-existing write-pg-override-conf init container on the Immich PG deployment writes postgresql.override.conf directly to PGDATA. On a populated NFS PVC this was harmless because the data directory was already initialised. On the fresh encrypted PVC, the file left PGDATA non-empty, so initdb refused to run and the pod went into CrashLoopBackOff.

Resolution: gate the init container on PG_VERSION presence. On first boot the override write is skipped and PG runs initdb cleanly; after a forced pod restart, the second boot writes the override and PG loads vchord / vectors / pg_prewarm before the dump restore. The change is permanent and idempotent (correct on both fresh and initialised PVCs). The extra restart is needed only once, before the migration.
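
The gate can be sketched as a small shell function. This is an illustration under assumptions, not the actual init container script: the function name is hypothetical and the override content is a guess based on the extensions named above.

```shell
#!/bin/sh
set -eu

# Hypothetical helper; $1 is the PGDATA directory.
write_override_if_initialised() {
  pgdata="$1"
  if [ ! -f "${pgdata}/PG_VERSION" ]; then
    # Fresh volume: leave the directory empty so initdb can run.
    echo "PG_VERSION missing, skipping override write"
    return 0
  fi
  # Assumed content; the real settings live in the deployment definition.
  cat > "${pgdata}/postgresql.override.conf" <<'EOF'
shared_preload_libraries = 'vchord.so, vectors.so, pg_prewarm'
EOF
  echo "override written"
}
```

Because the function only checks whether PG_VERSION exists, running it again on an initialised volume simply rewrites the same file, which is what makes the gate safe to leave in place permanently.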

Verification

End-to-end DONE when:

  • kubectl get pvc -A | grep nfs-proxmox returns only the vault-backup-host PVC (or zero, if backup PVC moves elsewhere).
  • vault operator raft list-peers shows 3 voters on proxmox-lvm-encrypted, leader elected.
  • Immich PG \dx matches pre-migration extensions (vector minor drift OK).
  • lvm-pvc-snapshot captures new LVs in next 03:00 run.
  • 7 consecutive days of clean backup CronJob runs and no new alerts.
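
The raft peer check above could be automated with a small filter over the list-peers table. This sketch assumes the usual four-column output (Node, Address, State, Voter), and `raft_peers_ok` is a hypothetical name. It verifies voter count and leader presence only, not which storage class backs each peer.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if stdin (list-peers output) shows
# exactly 3 voters and exactly 1 leader.
raft_peers_ok() {
  awk '$4 == "true"   { voters++ }
       $3 == "leader" { leaders++ }
       END { exit !(voters == 3 && leaders == 1) }'
}
```

Assuming a `vault` namespace and an authenticated CLI inside the pod, usage would be `kubectl -n vault exec vault-0 -- vault operator raft list-peers | raft_peers_ok`. The storage class side can be read separately with `kubectl -n vault get pvc -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName`.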