infra/docs/plans/2026-04-25-nfs-hostile-migration-plan.md at bf4c7618d8ca1646bca60a45d17afbdf92df763c

Viktor Barzin 288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted

Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap is done; vault-1 and vault-2 are still on NFS. Rolling one
pod at a time is what makes this safe: the 2 healthy pods maintain
raft quorum while the third is replaced.
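
The quorum argument can be checked with a little arithmetic. This is a minimal sketch, assuming a 3-voter raft cluster as described in the commit message; the function names are hypothetical and not part of any Vault tooling.

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority of a raft cluster with `voters` voting members."""
    return voters // 2 + 1


def can_replace_one(voters: int, healthy: int) -> bool:
    """True if taking one healthy pod offline still leaves a majority."""
    return healthy - 1 >= raft_quorum(voters)


# Three Vault pods: quorum is 2, so one pod may be swapped at a time.
print(raft_quorum(3))         # 2
print(can_replace_one(3, 3))  # True
# With one pod already unhealthy, a second swap would lose quorum.
print(can_replace_one(3, 2))  # False
```

This is why the plan waits for each swapped pod to rejoin and soak before the next one is rolled.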

Also restores the chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults: fsGroup,
runAsGroup, runAsUser and runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup, but on the
ext4 LV the volume root is root:root, so the vault user (UID 100)
could not open vault.db and the pod went into CrashLoopBackOff. Fix:
set all five fields explicitly so the config survives future chart
bumps. vault-1 and vault-2 kept their correct securityContext because
their pod specs were written to etcd before the partial customization
landed; the bug only surfaces when a pod is recreated.
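
The replace-versus-merge behaviour can be modelled with plain dictionaries. This is an illustrative sketch only: the default values other than UID 100 are assumptions, the `fsGroupChangePolicy` value is left elided as in the commit message, and `effective_context` is a hypothetical helper, not the chart's actual templating logic.

```python
# Assumed chart defaults for the pod securityContext. Only runAsUser = 100
# is stated in the commit message; the other values are illustrative.
CHART_DEFAULTS = {
    "fsGroup": 1000,
    "runAsGroup": 1000,
    "runAsUser": 100,
    "runAsNonRoot": True,
}


def effective_context(override: dict) -> dict:
    """Model 'replace' semantics: a non-empty override wins wholesale."""
    return dict(override) if override else dict(CHART_DEFAULTS)


# The partial customization: one field only (value elided in the source).
partial = {"fsGroupChangePolicy": "..."}

# The fix: spell out all five fields explicitly.
full = {**CHART_DEFAULTS, "fsGroupChangePolicy": "..."}

missing = sorted(set(CHART_DEFAULTS) - set(effective_context(partial)))
print(missing)  # ['fsGroup', 'runAsGroup', 'runAsNonRoot', 'runAsUser']
print(set(CHART_DEFAULTS) <= set(effective_context(full)))  # True
```

Spelling out every field means the result no longer depends on whether the chart merges or replaces the block.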

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

| Step | Done |
| --- | --- |
| Snapshot extensions + row counts to `/tmp/immich-pre-migration-*` | ✓ |
| Quiesce `immich-server` + `immich-machine-learning` + `immich-frame` | ✓ |
| `pg_dumpall` → `/tmp/immich-pre-migration-<ts>.sql` (1.9 GB) | ✓ |
| Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap) | ✓ |
| Swap `claim_name` at `infra/stacks/immich/main.tf` deployment | ✓ |
| Patch init container to gate on `PG_VERSION` (chicken-and-egg fix) | ✓ |
| Force pod restart so override.conf gets written | ✓ |
| Restore dump | ✓ |
| `REINDEX clip_index`, `REINDEX face_index` | ✓ |
| Scale apps back up | ✓ |
| Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external | ✓ |
| LV present on PVE host (`vm-9999-pvc-...`) | ✓ |

| Phase | Trigger | Action |
| --- | --- | --- |
| 1 | Immich UI broken / data loss | Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC |
| 2 (mid-rolling) | Single pod broken | Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods |
| 2 (post-rolling, raft corrupt) | Cluster-wide failure | `vault operator raft snapshot restore <pre-migration.snap>` |
| Catastrophic | All Vault data lost | Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output |


NFS-Hostile Workload Migration — Plan

Phase 1 — Immich PG (DONE 2026-04-25)

Phase 1 follow-ups (not blocking)

Phase 2 — Vault Raft (IN PROGRESS)

Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC

Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC

Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC

Step 2 — 24h soak (IN PROGRESS, ends ~2026-04-26 16:18 UTC)

Step 3 — Roll vault-1 (T+24h)

Step 4 — 24h soak

Step 5 — Roll vault-2 (T+48h, leader)

Step 6 — Cleanup

Verify (after each pod, then again at the end)

Phase 3 — Released-PV cleanup (FOLLOW-UP)

Rollback
