infra/docs/plans/2026-04-25-nfs-hostile-migration-plan.md at 3f85cee1efbbdd4d87544f58d51e5407b306d82d

Viktor Barzin 43e4f3f68e immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted

Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).

The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
initdb's cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.

Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.

See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.

Closes: code-ahr7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Step	Done
Snapshot extensions + row counts to `/tmp/immich-pre-migration-*`	✓
Quiesce `immich-server` + `immich-machine-learning` + `immich-frame`	✓
`pg_dumpall` → `/tmp/immich-pre-migration-<ts>.sql` (1.9 GB)	✓
Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap)	✓
Swap `claim_name` at `infra/stacks/immich/main.tf` deployment	✓
Patch init container to gate on `PG_VERSION` (chicken-and-egg fix)	✓
Force pod restart so override.conf gets written	✓
Restore dump	✓
`REINDEX clip_index`, `REINDEX face_index`	✓
Scale apps back up	✓
Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external	✓
LV present on PVE host (`vm-9999-pvc-...`)	✓

Phase	Trigger	Action
1	Immich UI broken / data loss	Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC
2 (mid-rolling)	Single pod broken	Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods
2 (post-rolling, raft corrupt)	Cluster-wide failure	`vault operator raft snapshot restore <pre-migration.snap>`
Catastrophic	All Vault data lost	Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output

4.8 KiB

Raw Blame History

NFS-Hostile Workload Migration — Plan

Phase 1 — Immich PG (DONE 2026-04-25)

Phase 1 follow-ups (not blocking)

Phase 2 — Vault Raft (IN PROGRESS)

Pre-flight (T-0)

Step 0 — Helm values + StatefulSet swap

Step 1 — Roll vault-2 (T+0)

Step 2 — 24h soak

Step 3 — Roll vault-1 (T+24h)

Step 4 — 24h soak

Step 5 — Roll vault-0 (T+48h)

Step 6 — Cleanup

Verify (after each pod, then again at the end)

Phase 3 — Released-PV cleanup (FOLLOW-UP)

Rollback

4.8 KiB Raw Blame History

NFS-Hostile Workload Migration — Plan

Phase 1 — Immich PG (DONE 2026-04-25)

Phase 1 follow-ups (not blocking)

Phase 2 — Vault Raft (IN PROGRESS)

Pre-flight (T-0)

Step 0 — Helm values + StatefulSet swap

Step 1 — Roll vault-2 (T+0)

Step 2 — 24h soak

Step 3 — Roll vault-1 (T+24h)

Step 4 — 24h soak

Step 5 — Roll vault-0 (T+48h)

Step 6 — Cleanup

Verify (after each pod, then again at the end)

Phase 3 — Released-PV cleanup (FOLLOW-UP)

Rollback

4.8 KiB

Raw Blame History