# Block-Storage Scaling — Harden proxmox-csi + NFS (Decision + Design)
**Date**: 2026-06-05
**Status**: Decided — supersedes the recommendation in `2026-06-01-topolvm-evaluation.md`
**Decision owner**: Viktor
## TL;DR
We keep the **proxmox-csi** block-storage model (which already gives cross-node
PVC mobility) and **harden** it, rather than re-architecting to TopoLVM or
Longhorn. The 29-PVC/node cap is made *unreachable* (not removed) by shrinking
the block footprint via NFS migration of non-DB workloads; the ghost-disk
doom-loop is *prevented* (not just detected); and node placement is rebalanced.
**£0, no new hardware, mobility preserved.**
## Why this, not TopoLVM / Longhorn
Hard constraints set by Viktor (2026-06-05): **(a)** must keep the ability to
move pods across VM nodes if one goes down (mobility), **(b)** no new hardware,
**(c)** sdc IO contention is acceptable / not worth spending on.
Key architectural insight that drove the decision:
- **Mobility and the LUN cap are two sides of the same mechanism.** proxmox-csi
gives mobility *because* it hot-plugs each PVC as a Proxmox virtio-scsi disk
that re-attaches wherever the pod lands — and that hot-plug is exactly what
imposes the `lun < 30` cap and spawns the query-pci ghost-disk loop.
- **TopoLVM** removes the cap by killing the hot-plug — which is *why* it pins a
PVC to one node. Rejected: violates constraint (a).
- **Longhorn** keeps mobility via replication, but mobility-via-replication
costs **≥2× writes** (1 replica = no failover). On a single PVE host both
replicas land on the same sdc HDD — you pay double the write IO for redundancy
that dies with the host anyway (host = SPOF). Longhorn's own docs say "use a
dedicated disk, not the root disk." Rejected: wasteful on a single host;
reconsider only if a 2nd physical host is added.
- proxmox-csi already provides mobility at **1× write** (centralized LV
re-attaches) — strictly more IO-efficient than replication on one host. The
cap and ghost-loop are *warts on a good model*, not reasons to replace it.
| Option | 29-cap | Ghost loop | Mobility | sdc IO | Hardware | Verdict |
|---|---|---|---|---|---|---|
| **① Harden proxmox-csi + NFS** | managed (far off) | prevented | ✅ kept (1×) | same/better | £0 | **CHOSEN** |
| TopoLVM (A/C) | removed | eliminated | ❌ pinned | A: same / C: better | £0 / £200 | rejected — loses mobility |
| Longhorn | removed | eliminated | ✅ (2×) | worse | £0 | rejected — replication wasted on 1 host |
## Live state at decision time (2026-06-05)
- 6 workers (VMIDs 201–206), proxmox-csi `CSINode.allocatable.count = 28`/node →
**168 slots**; **69 used (41%)**; **0 PVCs Pending**.
- **Imbalance is the live risk, not aggregate capacity**: node6 **21/28** (hot),
node5 **3/28**. node1=9, node2/3/4=12.
- **Ghost-disk drift = 0** (the 2026-06-04 cleanup held; `qm config` scsi counts
match tracked VolumeAttachments). Prevention still open (beads `code-dfjn`).
Retained `unusedN` LVs: node1=6, node2=9, node3=6 (harmless to the cap).
- Block PVCs: **74** (44 encrypted + 30 plain). NFS: 64. local-path: 9.
- PVE host RAM **222/267 GiB used, swap in use** → adding more worker VMs is
memory-bound (the May 4→6 escape hatch is mostly spent).
- sdc thin pool `data`: 69.67% data / 15.89% meta. `nfs-data` LV 74% of 4 TiB.
VG `pve` raw free <16 GiB; VG `ssd` free 475 GiB.
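
The slot arithmetic in the first two bullets can be reproduced with a small sketch. The per-node counts and the 28-slot limit are taken from the snapshot above; the function name and report layout are illustrative only.

```python
# Attached block PVCs per worker, from the 2026-06-05 snapshot above.
ATTACHED = {"node1": 9, "node2": 12, "node3": 12, "node4": 12, "node5": 3, "node6": 21}
SLOTS_PER_NODE = 28  # CSINode.allocatable.count reported by proxmox-csi


def capacity_report(attached, slots_per_node):
    """Return aggregate utilisation plus the hottest and coldest node."""
    total_slots = slots_per_node * len(attached)
    used = sum(attached.values())
    hottest = max(attached, key=attached.get)
    coldest = min(attached, key=attached.get)
    return {
        "total_slots": total_slots,
        "used": used,
        "used_pct": round(100 * used / total_slots),
        "hottest": (hottest, attached[hottest]),
        "coldest": (coldest, attached[coldest]),
        # Free slots per node: this, not the aggregate, is the live risk.
        "headroom": {node: slots_per_node - n for node, n in attached.items()},
    }


report = capacity_report(ATTACHED, SLOTS_PER_NODE)
print(report["total_slots"], report["used"], report["used_pct"])  # 168 69 41
print(report["hottest"], report["coldest"])  # ('node6', 21) ('node5', 3)
```

Aggregate utilisation is 41%, but node6 has only 7 free slots while node5 has 25, which is why placement matters more than total capacity.
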
## NFS-migration candidates (embedded-DB preflight is mandatory)
Rule: embedded transactional stores (SQLite/LevelDB/RocksDB/H2/LMDB/ClickHouse)
corrupt on NFS; sensitive `-encrypted` PVCs lose LUKS-at-rest on NFS. Only
non-DB, non-sensitive (or app-encrypted) workloads qualify.
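
As a rough illustration of the preflight rule, the sketch below walks a volume's directory tree and flags files whose names commonly indicate an embedded store. The pattern list is an assumption, not the checklist actually used; it is not exhaustive (ClickHouse, for example, is not covered), and the `sensitive` / `app_encrypted` flags are hypothetical inputs a human still has to decide.

```python
import fnmatch
import os
import tempfile

# Hypothetical file-name signatures; H2 is listed first so "*.mv.db"
# is not misreported as SQLite by the broader "*.db" pattern.
EMBEDDED_DB_PATTERNS = {
    "H2": ["*.mv.db", "*.h2.db"],
    "SQLite": ["*.sqlite", "*.sqlite3", "*.db", "*.db-wal", "*.db-shm"],
    "LevelDB/RocksDB": ["MANIFEST-*", "*.ldb", "*.sst"],
    "LMDB": ["data.mdb", "lock.mdb"],
}


def find_embedded_db_files(root):
    """Return (engine, path) pairs for files matching a known signature."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            for engine, patterns in EMBEDDED_DB_PATTERNS.items():
                if any(fnmatch.fnmatchcase(name, p) for p in patterns):
                    hits.append((engine, os.path.join(dirpath, name)))
                    break  # one label per file is enough
    return hits


def nfs_eligible(root, sensitive=False, app_encrypted=False):
    """Apply the rule: no embedded DB, and non-sensitive or app-encrypted."""
    reasons = [f"{engine}: {path}" for engine, path in find_embedded_db_files(root)]
    if sensitive and not app_encrypted:
        reasons.append("sensitive data would lose LUKS-at-rest on NFS")
    return (not reasons, reasons)


# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "snapshot.json"), "w").close()
    before = nfs_eligible(root)
    open(os.path.join(root, "app.sqlite3"), "w").close()
    after = nfs_eligible(root)

print(before)    # (True, [])
print(after[0])  # False
```

A name-based scan only narrows the search; a clean result still needs the manual evidence recorded in the table below.
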
**Verified NFS-safe (preflighted 2026-06-05, no embedded DB):**
| PVC | Node | SC | Evidence |
|---|---|---|---|
| `tandoor/tandoor-data-proxmox` | node6 | proxmox-lvm | `/opt/recipes/mediafiles` = media + bundled static; PG-backed |
| `speedtest/speedtest-config-proxmox` | node6 | proxmox-lvm | `/config` = logs (383 MB `laravel.log`) + config; MySQL-backed |
| `hackmd/hackmd-data-encrypted` | node6 | encrypted | `/…/public/uploads` = PNG uploads (4.5 MB); MySQL-backed |
| `changedetection/changedetection-data-proxmox` | node6 | proxmox-lvm | `/datastore` = JSON + brotli snapshots; no DB |
| `send/send-data-proxmox` | node2 | proxmox-lvm | `/uploads` = encrypted blobs; Redis metadata |
**Phase-1 candidates (preflight before migrating):** instagram-poster,
insta2spotify, novelapp, openclaw/openlobster, servarr/qbittorrent, postiz
(scaled-0), priority-pass-uploads*, tripit-personal-documents* (*app-encrypted /
sensitive: keep app-layer crypto, confirm before moving).
**Must stay on block** (embedded DB or fsync-critical): vaultwarden, ntfy,
uptime-kuma, navidrome, actualbudget×3, openclaw×2, servarr arr-apps, freshrss
(SQLite); stirling-pdf (H2); rybbit (ClickHouse); beads/dolt; all CNPG
pg-cluster, mysql-standalone, immich-pg, redis; prometheus, alertmanager, loki,
vault×3, technitium×3, mailserver, paperless, forgejo, matrix, n8n.
## Plan
### Phase 0 — Tactical relief (now): migrate the 5 verified-safe PVCs
Per service, following the proven 2026-05-26 Wave-1 pattern (reversible: the
source block PVC is kept until the NFS copy is verified):
1. `presence claim service:<svc>`.
2. Create NFS export dir on PVE host + add to git-managed
`infra/scripts/pve-nfs-exports`; `exportfs -ra`.
3. Add `module "nfs_<svc>"` (`modules/kubernetes/nfs_volume`) to the stack;
`scripts/tg apply` to create the static NFS PV/PVC.
4. Scale the workload to 0 (RWO must release the block PVC).
5. rsync block → NFS with `--checksum` (exclude cruft: changedetection
`test-direct`/`test-seq`/`lost+found`; speedtest can drop `log/`).
6. Swap the workload's `claim_name` to the NFS PVC; `tg apply`; scale up.
7. Verify app health + data intact.
8. Delete the old block PVC, which frees the LUN slot; confirm with check #47.
9. Commit + push per service; wait for CI/Woodpecker.
Result: node6 **21 → 17**, node2 **12 → 11**.
hackmd note: confirm the LUKS → NFS downgrade is acceptable (low-sensitivity doc
images) or leave hackmd on encrypted block and accept 21 → 18.
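
The expected per-node counts, with and without hackmd, follow directly from the node column of the verified-candidates table. A minimal sketch of that arithmetic (the helper name is illustrative):

```python
# Phase-0 PVCs and the node each is attached to, from the verified table.
PHASE0 = {
    "tandoor/tandoor-data-proxmox": "node6",
    "speedtest/speedtest-config-proxmox": "node6",
    "hackmd/hackmd-data-encrypted": "node6",
    "changedetection/changedetection-data-proxmox": "node6",
    "send/send-data-proxmox": "node2",
}
BEFORE = {"node6": 21, "node2": 12}  # attached block PVCs before Phase 0


def counts_after(before, migrated):
    """Subtract one LUN slot per migrated PVC from its current node."""
    after = dict(before)
    for node in migrated.values():
        after[node] -= 1
    return after


all_five = counts_after(BEFORE, PHASE0)
without_hackmd = counts_after(
    BEFORE, {pvc: node for pvc, node in PHASE0.items() if not pvc.startswith("hackmd/")}
)
print(all_five)        # {'node6': 17, 'node2': 11}
print(without_hackmd)  # {'node6': 18, 'node2': 11}
```
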
### Phase 1 — Broader NFS sweep (this session if smooth, else tracked)
Preflight + migrate the Phase-1 candidates above. Goal: leave **only true DBs +
fsync-critical** services on block, so per-node block counts stay well under the
cap with years of runway.
### Phase 2 — Ghost-loop prevention (beads `code-dfjn`; design separately)
The structural half of "harden". Substantial; propose to design + plan on its
own rather than rush:
- Soft-cap block PVCs/node below the query-pci failure threshold (observed safe
at 24, fails at 25): alert + scheduler hint.
- Raise the proxmox-csi controller's QMP/query-pci timeout (and/or QEMU side).
- Auto-reconcile CronJob: detect drift (check #47 logic) → safe
`qm set <vmid> --delete scsiN` (detach-only, retains LV).
- Rebalance residual node6 block PVCs → node5, one at a time, check #47 watching.
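
One possible shape for the auto-reconcile drift check, as a dry-run sketch only. It assumes the caller supplies the `qm config` text and the set of volume IDs that Kubernetes VolumeAttachments currently track. The sample volume names, the `protected_slots` default, and both function names are hypothetical. The sketch only prints the detach commands and never runs them.

```python
import re

# Matches "scsi3: storage:volume,opts" but not "scsihw: ...".
SCSI_LINE = re.compile(r"^(scsi\d+):\s*(\S+)")


def parse_scsi_slots(qm_config_text):
    """Map scsiN -> volume ID from `qm config <vmid>` style text."""
    slots = {}
    for line in qm_config_text.splitlines():
        match = SCSI_LINE.match(line.strip())
        if match:
            # Drop trailing options such as ",size=1G".
            slots[match.group(1)] = match.group(2).split(",", 1)[0]
    return slots


def plan_detach(vmid, qm_config_text, tracked_volumes, protected_slots=("scsi0",)):
    """Return detach-only commands for slots no VolumeAttachment tracks."""
    slots = parse_scsi_slots(qm_config_text)
    ghosts = sorted(
        (s for s, vol in slots.items()
         if s not in protected_slots and vol not in tracked_volumes),
        key=lambda s: int(s[4:]),  # numeric order: scsi2 before scsi10
    )
    return [f"qm set {vmid} --delete {slot}" for slot in ghosts]


# Hypothetical sample input; real volume names will differ.
SAMPLE = """\
boot: order=scsi0
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-201-disk-0,size=64G
scsi1: local-lvm:vm-9999-pvc-aaa,size=1G
scsi2: local-lvm:vm-9999-pvc-bbb,size=2G
scsi3: local-lvm:vm-9999-pvc-ccc,size=1G
"""
TRACKED = {"local-lvm:vm-9999-pvc-aaa", "local-lvm:vm-9999-pvc-ccc"}

commands = plan_detach(201, SAMPLE, TRACKED)
print(commands)  # ['qm set 201 --delete scsi2']
```

Keeping the boot disk in `protected_slots` and emitting commands rather than executing them leaves room for a human or a guarded CronJob to apply the plan.
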
### Phase 3 — Docs
Update `storage.md` (Wave-2 NFS migration + the decision), `scale-k8s-cluster.md`,
`.claude/reference/service-catalog.md`; add a "Decided: ①" banner to
`2026-06-01-topolvm-evaluation.md` pointing here.
## Risks
- Data loss during migration: mitigated by rsync `--checksum` + keep source
until verified + the workload is scaled-0 during copy.
- LUKS-at-rest dropped for `-encrypted` PVCs moved to NFS: only migrate
low-sensitivity or app-encrypted ones; flag each.
- NFS soft-mount semantics: only non-DB workloads (preflighted); `nfsvers=4`,
`soft,timeo=30,retrans=3` per the `nfs_volume` module defaults.
- Block rebalance (Phase 2) re-introduces detach/reattach ghost risk: one at a
time with check #47.
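
For the data-loss mitigation, an independent post-copy check before deleting the source PVC could look like the sketch below. It hashes both trees with SHA-256 and reports differences. This is a separate verification pass, not what rsync does internally, and the function names and demo files are illustrative.

```python
import hashlib
import os
import tempfile


def tree_digests(root, exclude=()):
    """Map relative file path -> SHA-256 hex digest, skipping excluded dirs."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in exclude]  # prune in place
        for name in filenames:
            path = os.path.join(dirpath, name)
            sha = hashlib.sha256()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests


def compare_trees(src, dst, exclude=()):
    """Report files missing from dst, extra in dst, or with different content."""
    a, b = tree_digests(src, exclude), tree_digests(dst, exclude)
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# Demo: identical trees apart from an excluded lost+found, then a corrupted copy.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for root in (src, dst):
        with open(os.path.join(root, "a.json"), "w") as fh:
            fh.write("{}")
    os.mkdir(os.path.join(src, "lost+found"))
    with open(os.path.join(src, "lost+found", "x"), "w") as fh:
        fh.write("junk")
    clean = compare_trees(src, dst, exclude=("lost+found",))
    with open(os.path.join(dst, "a.json"), "w") as fh:
        fh.write("{ }")
    dirty = compare_trees(src, dst, exclude=("lost+found",))

print(clean)  # {'missing': [], 'extra': [], 'changed': []}
print(dirty)  # {'missing': [], 'extra': [], 'changed': ['a.json']}
```
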
## Related
- `2026-06-01-topolvm-evaluation.md` (superseded recommendation)
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap"
- `docs/runbooks/scale-k8s-cluster.md`
- beads `code-dfjn` (ghost prevention), `code-oflt` (IO isolation; not pursued here)