plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief)

Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
This commit is contained in:
Viktor Barzin 2026-06-01 20:00:00 +00:00
parent 599d67db51
commit 82855848d1

View file

@ -0,0 +1,156 @@
# TopoLVM Migration Evaluation
**Date**: 2026-06-01
**Status**: Evaluation — not yet decided
**Decision**: Pending. Used to understand whether to commit to the migration.
## Problem statement
The cluster's block storage hits a **hardcoded 29-PVC-per-VM ceiling** in `sergelogvinov/proxmox-csi-plugin` (`pkg/csi/utils.go:394`, `for lun = 1; lun < 30; lun++`). The plugin scans Proxmox SCSI indices `scsi1..scsi29`; when all are taken, `ControllerPublishVolume` returns `Internal desc = no free lun found`. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.
Path 1 (patch the plugin to `lun < 31`) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates **Path 3 — replace the CSI driver with TopoLVM**, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
## What TopoLVM is
CSI driver from cybozu-go. Each K8s node runs an `lvmd` daemon managing one or more LVM volume groups. The CSI controller creates `LogicalVolume` CRDs; `topolvm-node` on the target node reconciles them by asking `lvmd` to `lvcreate` an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices — there is no per-VM cap beyond kernel LV count limits (effectively thousands).
Mature project, used in production by Cybozu and others. Supports:
- Thin provisioning (`type: thin` device class with overprovision ratio)
- Multiple device classes per node (e.g., one for SSD, one for HDD)
- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
- Online volume expansion (ext4, xfs, btrfs)
- Striping and RAID via `lvcreate-options`
## The big architectural trade-off — read this first
| Aspect | proxmox-csi (today) | TopoLVM |
|---|---|---|
| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
| Per-VM PVC cap | **29** (plugin source) | None (kernel LV limits, thousands) |
| **PVC mobility** | **Migrates between VMs** — CSI re-attaches LV to wherever the pod schedules | **Pinned to one node** via `topology.topolvm.cybozu.com/node` label |
| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
| Snapshot mechanism | PVE-host `lvm-pvc-snapshot` script (custom) | CSI VolumeSnapshot CRDs (standard) |
| Encryption | LUKS via Proxmox CSI `extraParameters` + ESO-synced secret | LUKS via `csi.storage.k8s.io/{node-stage,node-expand}-secret` — same pattern, different secret target |
| Backup pipeline | sda → Synology via `daily-backup` script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |
**Data mobility is the most important difference.** Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means **the PVC data is still on that node's local disk** — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.
For Viktor's setup specifically:
- **Pro**: the underlying PVE host is a single point of failure anyway (192.168.1.127). If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at the homelab scale — the data isn't actually mobile across physical machines.
- **Con**: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
- **Mitigation**: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal — one replica's local LV being unavailable triggers a re-replication elsewhere. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
## Disk layout — three options
TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:
### Option A — Carve from sdc (HDD), one VG per VM
Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.
- **Sizing**: rough math from session-1 audit: 1.2 TB total LV allocation across 76 PVCs. Add 30% headroom = 1.6 TB. Distribute by current node placement:
- node1: Prometheus (433G) + others ≈ 600-700 GiB → **768 GiB disk**
- node2: Loki (50G) + smaller DBs ≈ 200 GiB → **256 GiB disk**
- node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → **256 GiB disk**
- node4: smaller → **256 GiB disk**
- node5: smaller → **256 GiB disk**
- node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → **256 GiB disk**
- **Total: ~2 TiB** carved from sdc thin pool (currently 66% used, 3.5 TiB free)
- **Pro**: simplest physical change, no hardware needed, just `qm set --scsiN local-lvm:NNN`
- **Con**: IO contention on sdc unchanged. The 6 thin pools all sit on the same HDD physical layer. Storms hit harder because there's no inter-pool isolation at the LVM level.
### Option B — Move hot workloads to sdb (SSD), keep cold on sdc
Use a hybrid layout:
- Per-VM SSD disk (sdb, 931 GB total, ~675 GB free) for hot DBs
- Per-VM HDD disk (sdc) for cold/bulk
TopoLVM supports multiple device classes per node — each VM would have an `ssd-thin` and `hdd-thin` class.
- **Pro**: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
- **Con**: 675 GB SSD has to host DBs across 6 VMs (~112 GiB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.
### Option C — Add a second physical disk for storage
Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.
- **Pro**: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads `code-oflt` task).
- **Con**: hardware investment. ~£200 for a 2 TB NVMe. Requires PVE host downtime to install. Existing PVE has 2 SATA ports used (sda + sdb) + M.2 slot (might be in use, need to check). LVM/thin pool setup is straightforward.
## Migration approach
Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:
1. **Install TopoLVM alongside proxmox-csi** — both run in parallel; new StorageClass `topolvm-provisioner` and `topolvm-provisioner-encrypted` created without touching existing PVCs
2. **Per-VM data disk provisioning**`qm set <vmid> --scsi8 local-lvm:NNN`, add `vgcreate` + `lvcreate` per VM (one-time)
3. **lvmd config per node** — Helm values point to the right VG per node
4. **Pilot migration** — pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
5. **Phased rollout** — migrate PVCs in batches by criticality:
- Wave A: regenerable / cache (5-10 PVCs, low risk)
- Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
- Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
- Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
- Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
6. **Rewrite backup pipeline** — current `daily-backup` mounts LVM snapshots on PVE host; new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
7. **Deprecate proxmox-csi** — once all PVCs migrated, remove the Helm release and the `proxmox-lvm` / `proxmox-lvm-encrypted` StorageClasses
8. **Update docs**`docs/architecture/storage.md`, `CLAUDE.md`, ingress factory references, several runbooks
## Effort estimate
| Phase | Time | Notes |
|-------|------|-------|
| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |
**Total: ~2.5-3 weeks of focused infra time.** Could stretch over a quarter if done alongside other work.
## Risks
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Data loss during PVC migration | Low | Rsync with `--checksum`, verify before deleting source, keep proxmox-csi running until each migration validates |
| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
| Backup pipeline regression | High | Keep old `daily-backup` running until new pipeline proven for ≥2 weeks |
| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads `code-oflt` remains open as a separate task |
| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; use rsync with checkpoint resumption |
| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |
## Decision criteria
Pick TopoLVM (any option) if:
- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
- We want to fix IO contention at the same time (then Option C only)
- We're comfortable with single-node data locality
Stay on proxmox-csi if:
- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year
## Recommended next steps if pursuing
1. **Run a small pilot first** — install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
2. **Pick Option A or C** — Option B is too SSD-constrained for the encrypted PVC volume we have
3. **Order hardware if Option C** — NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
4. **Schedule a 3-week window** — partition the migration waves around other infra commitments; flag in beads as a P1
## Related
- `docs/architecture/storage.md` — current storage architecture
- `docs/runbooks/scale-k8s-cluster.md` — current scaling playbook (Path 1+2 alternative)
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention is the related-but-separate concern
- Beads `code-oflt` — IO isolation long-term fix (Option C would close this)
- Remote memory id=2788 — proxmox-csi-plugin LUN cap explanation