Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).
The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
initdb runs cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.
Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.
See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.
Closes: code-ahr7
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NFS-Hostile Workload Migration — Design
Date: 2026-04-25
Author: Viktor (with Claude)
Status: Phase 1 done, Phase 2 in progress
Beads: code-gy7h (Vault), code-ahr7 (Immich PG)
Problem
The 2026-04-22 Vault Raft leader deadlock (post-mortem
2026-04-22-vault-raft-leader-deadlock.md) was traced to NFS client
writeback stalls poisoning kernel state. Recovery took 2h43m and
required hard-resetting 3 of 4 cluster VMs. Two workload classes on
NFS are NFS-hostile per the criteria in
infra/.claude/CLAUDE.md ("Critical services MUST NOT use NFS"):
- Postgres with WAL fsync per commit — Immich primary
- Vault Raft consensus log — fsync per append-entry, 3 replicas
Everything else on NFS (47 PVCs, ~455 GiB) is correctly placed: RWX media libraries, append-only backups, ML caches.
Decision
Migrate exactly those two workload classes to
proxmox-lvm-encrypted (LUKS2 LVM-thin via Proxmox CSI). No iSCSI,
no RWX media migration, no backup-target migration.
Rationale
- Block storage decouples PG / Raft fsync from NFS client kernel state. Failure mode that triggered the post-mortem cannot recur for these workloads.
- proxmox-lvm-encrypted is the documented default for sensitive data (infra/.claude/CLAUDE.md storage decision rule). It already backs ~28 PVCs across the cluster, so the pattern is proven.
- Existing nightly lvm-pvc-snapshot PVE host script (03:00, 7-day retention) auto-picks-up new PVCs via thin snapshots, so no extra backup wiring is needed for the live data side.
- LUKS2 satisfies the "encrypted at rest for sensitive data" requirement.
Out of scope
- iSCSI evaluation (already retired 2026-04-13).
- RWX media (Immich library, music, ebooks) — correct placement.
- Backup target PVCs (*-backup on NFS): append-only, NFS-tolerant.
- Prometheus 200 GiB: already on proxmox-lvm.
Pattern per workload
Immich PG (single replica, Deployment, Recreate strategy)
- Add new RWO PVC on proxmox-lvm-encrypted.
- Quiesce app pods (server + ML + frame).
- pg_dumpall from running NFS pod → local file.
- Swap deployment claim_name → encrypted PVC.
- PG bootstraps fresh on empty PVC; restore dump.
- REINDEX vector indexes (clip_index, face_index).
- Backup CronJob keeps writing to NFS module (correct: append-only).
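The ordering of the steps above can be sketched as a dry-run plan printer. This is only an illustration: the namespace, deployment names, database user and name, dump path, and replica count are hypothetical placeholders, and the function prints commands rather than running them. The forced restart before the restore reflects the init container gate described later in this document.

```shell
# Print the Immich PG cutover commands in order without executing them.
# All names passed in (namespace, deployments, dump path) are hypothetical;
# the postgres user and the immich database name are assumptions too.
immich_pg_plan() {
    ns="$1"; pg="$2"; dump="$3"; shift 3
    apps=""
    for d in "$@"; do apps="$apps deploy/$d"; done

    # Quiesce the app pods so nothing writes during the dump.
    echo "kubectl -n $ns scale$apps --replicas=0"
    # Logical dump from the PG pod that still runs on NFS.
    echo "kubectl -n $ns exec deploy/$pg -- pg_dumpall -U postgres > $dump"
    echo "# swap claim_name to the encrypted PVC and apply; wait for initdb"
    # Second boot writes the override so the extensions preload.
    echo "kubectl -n $ns rollout restart deploy/$pg"
    echo "kubectl -n $ns exec -i deploy/$pg -- psql -U postgres < $dump"
    echo "kubectl -n $ns exec deploy/$pg -- psql -U postgres -d immich -c 'REINDEX INDEX clip_index' -c 'REINDEX INDEX face_index'"
    # Bring the app pods back (replica count assumed to be 1).
    echo "kubectl -n $ns scale$apps --replicas=1"
}
```

Calling `immich_pg_plan immich immich-postgresql /tmp/dumpall.sql immich-server immich-ml` prints the plan for review before anything is run by hand.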
Vault Raft (3 replicas, StatefulSet, helm-managed)
- Change dataStorage.storageClass and auditStorage.storageClass from nfs-proxmox → proxmox-lvm-encrypted.
- StatefulSet volumeClaimTemplates is immutable → use kubectl delete sts vault --cascade=orphan then re-apply (memory pattern for VCT swaps).
- Per-pod rolling: delete pod + PVCs, controller recreates with new template. Auto-unseal sidecar handles unseal; raft retry_join rejoins cluster.
- 24h validation window between pods. Migrate non-leader pods first; step-down current leader before migrating it last.
- Backup target (vault-backup-host on NFS) stays on NFS.
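The "non-leader pods first, leader last" rule can be expressed as a small ordering helper. This is a minimal sketch: the pod names in the usage example follow the usual StatefulSet naming for a set called vault, which is an assumption, and the current leader has to be supplied by the operator (for example, read off vault operator raft list-peers).

```shell
# Print pods in migration order: standbys first, current leader last.
# $1 is the current leader; the remaining arguments are all Vault pods.
migration_order() {
    leader="$1"; shift
    for pod in "$@"; do
        # Skip the leader here; it is emitted once, at the end.
        [ "$pod" = "$leader" ] || echo "$pod"
    done
    echo "$leader"
}
```

For example, `migration_order vault-1 vault-0 vault-1 vault-2` prints vault-0, vault-2, vault-1, one per line. Each printed pod would then get its delete-and-recreate cycle followed by the 24h validation window, with a step-down issued before the last one.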
Risks and rollbacks
Immich PG
- pg_dumpall captures schema + data, not file-level state. Vector index versions matter (vchord 0.3.0 unchanged; vector 0.8.0 → 0.8.1 is a minor automatic bump on CREATE EXTENSION, confirmed benign).
- Rollback: revert claim_name, scale apps; old NFS PVC retained for 7 days post-migration.
Vault Raft
- Cluster keeps quorum from 2 standby replicas while one pod is swapped. Migrating the leader last avoids quorum churn.
- Recovery anchor: pre-migration vault operator raft snapshot save plus the nightly vault-raft-backup CronJob. RTO < 1h via snapshot restore.
Init container chicken-and-egg (Immich PG, discovered during execution)
The pre-existing write-pg-override-conf init container on the
Immich PG deployment writes postgresql.override.conf directly to
PGDATA. On a populated NFS PVC this was harmless (initdb had already
run). On the fresh encrypted PVC, the file made initdb refuse the
non-empty directory and the pod went into CrashLoopBackOff.
Resolution: gate the init container on PG_VERSION presence — first
boot skips the override write, PG initdbs cleanly; force a pod
restart and the second boot writes the override and PG loads
vchord / vectors / pg_prewarm before the dump restore. Change
is permanent and idempotent (correct on both fresh and initialised
PVCs). One restart pre-migration only.
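The gate can be sketched as a small shell function of the kind an init container might run. This is an illustration, not the deployed script: the function name is hypothetical, and the shared_preload_libraries line is an assumed stand-in for whatever the real postgresql.override.conf contains.

```shell
# Write the override only when PGDATA has already been initialised.
# $1 is the PGDATA directory. The override contents below are an assumption.
write_override_if_initialised() {
    pgdata="$1"
    if [ ! -f "$pgdata/PG_VERSION" ]; then
        # Fresh PVC: leave PGDATA empty so initdb accepts the directory.
        echo "skip: $pgdata not initialised"
        return 0
    fi
    # Initialised PVC: rewrite the file; the result is the same on every run.
    printf "shared_preload_libraries = '%s'\n" "vchord, vectors, pg_prewarm" \
        > "$pgdata/postgresql.override.conf"
    echo "wrote: $pgdata/postgresql.override.conf"
}
```

Because the file is overwritten rather than appended to, repeated runs on an initialised PVC leave the same contents, which is what makes the gate idempotent.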
Verification
End-to-end DONE when:
- kubectl get pvc -A | grep nfs-proxmox returns only the vault-backup-host PVC (or zero, if backup PVC moves elsewhere).
- vault operator raft list-peers shows 3 voters on proxmox-lvm-encrypted, leader elected.
- Immich PG \dx matches pre-migration extensions (vector minor drift OK).
- lvm-pvc-snapshot captures new LVs in next 03:00 run.
- 7 consecutive days of clean backup CronJob runs and no new alerts.
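The first check can be made less error-prone than an ad hoc grep with a small filter. This is a sketch under assumptions: the function name is hypothetical, and it expects three whitespace-separated columns (namespace, PVC name, storage class) on stdin.

```shell
# Read "namespace name storageclass" lines on stdin and print every PVC
# still on nfs-proxmox, except the allowed backup PVC named in $1.
leftover_nfs_pvcs() {
    allowed="$1"
    awk -v allowed="$allowed" \
        '$3 == "nfs-proxmox" && $2 != allowed { print $1 "/" $2 }'
}
```

One way to feed it is `kubectl get pvc -A --no-headers -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName | leftover_nfs_pvcs vault-backup-host`. Empty output means the first criterion is met.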