Viktor Barzin 82855848d1 plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief)

Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).

2026-06-01 21:22:05 +00:00


TopoLVM Migration Evaluation

Date: 2026-06-01
Status: Evaluation, not yet decided
Decision: Pending. This doc exists to inform whether to commit to the migration.

Problem statement

The cluster's block storage hits a hardcoded 29-PVC-per-VM ceiling in sergelogvinov/proxmox-csi-plugin (pkg/csi/utils.go:394, for lun = 1; lun < 30; lun++). The plugin scans Proxmox SCSI indices scsi1..scsi29; when all are taken, ControllerPublishVolume returns Internal desc = no free lun found. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.

Path 1 (patch the plugin to lun < 31) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates Path 3 — replace the CSI driver with TopoLVM, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
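To make the cap concrete, here is a minimal Python sketch of the scan described above. It is a re-statement of the quoted Go loop bounds, not the plugin's actual code; `find_free_lun` and `cluster_slots` are hypothetical names used only for illustration.

```python
from typing import Optional, Set


def find_free_lun(used: Set[int], upper_bound: int = 30) -> Optional[int]:
    """Return the first free SCSI index in 1..upper_bound-1, or None if all are taken.

    Mirrors the loop bounds quoted above (lun = 1; lun < 30). The real plugin
    is written in Go; this is only an illustration of the same scan.
    """
    for lun in range(1, upper_bound):
        if lun not in used:
            return lun
    return None


def cluster_slots(worker_vms: int, upper_bound: int = 30) -> int:
    """Total attachable PVC slots across all worker VMs for a given loop bound."""
    return worker_vms * (upper_bound - 1)


if __name__ == "__main__":
    full = set(range(1, 30))                    # scsi1..scsi29 all in use
    print(find_free_lun(full))                  # no free lun at the current bound
    print(find_free_lun(full, upper_bound=31))  # Path 1: one extra slot appears
    print(cluster_slots(4), cluster_slots(6))   # slots before and after scaling 4 -> 6 VMs
```

The slot total is cluster-wide, but the cap bites per VM: one node can be full while others still have free indices, which matches the stuck PVCs seen on k8s-node1.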

What TopoLVM is

TopoLVM is a CSI driver that originated at cybozu-go. Each K8s node runs an lvmd daemon managing one or more LVM volume groups. The CSI controller creates LogicalVolume custom resources; topolvm-node on the target node reconciles them by asking lvmd to lvcreate an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices, so there is no per-VM cap beyond kernel LV count limits (effectively thousands).

Mature project, used in production by Cybozu and others. Supports:

- Thin provisioning (type: thin device class with overprovision ratio)
- Multiple device classes per node (e.g., one for SSD, one for HDD)
- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
- Online volume expansion (ext4, xfs, btrfs)
- Striping and RAID via lvcreate-options

The big architectural trade-off — read this first

| Aspect | proxmox-csi (today) | TopoLVM |
| --- | --- | --- |
| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
| Per-VM PVC cap | 29 (plugin source) | None (kernel LV limits, thousands) |
| PVC mobility | Migrates between VMs: CSI re-attaches LV to wherever the pod schedules | Pinned to one node via `topology.topolvm.cybozu.com/node` label |
| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
| Snapshot mechanism | PVE-host `lvm-pvc-snapshot` script (custom) | CSI VolumeSnapshot CRDs (standard) |
| Encryption | LUKS via Proxmox CSI `extraParameters` + ESO-synced secret | LUKS via `csi.storage.k8s.io/{node-stage,node-expand}-secret`; same pattern, different secret target |
| Backup pipeline | sda → Synology via `daily-backup` script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |

Data mobility is the most important difference. Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means the PVC data is still on that node's local disk — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.

For Viktor's setup specifically:

Pro: the underlying PVE host (192.168.1.127) is a single point of failure anyway. If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at homelab scale: the data is never mobile across physical machines, only across VMs on the same host.
Con: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
Mitigation: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal: when one replica's local LV is unavailable, the remaining replicas absorb the load until the node returns. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
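The mobility argument above can be sketched as a small model. This is a deliberate simplification under stated assumptions: replicas are assumed to sit on distinct nodes, "keeps serving" ignores the brief reschedule blip under proxmox-csi, and the `Service` type and `serves_during_node_outage` function are hypothetical names, not anything from TopoLVM or the cluster's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    replicas: int


def serves_during_node_outage(svc: Service, volumes_pinned: bool) -> bool:
    """True if the service can keep serving while one node is unavailable.

    Assumptions (not from the source): replicas run on distinct nodes, and a
    mobile volume (proxmox-csi) always lets the pod reschedule elsewhere.
    """
    if not volumes_pinned:
        return True           # volume follows the pod to another VM
    return svc.replicas >= 2  # pinned volume: need another replica elsewhere


# Replica counts as stated in the text above.
SERVICES: List[Service] = [
    Service("CNPG Postgres", 3),
    Service("Redis-v2", 3),
    Service("Vault (Raft)", 3),
    Service("MySQL standalone", 1),
    Service("Nextcloud", 1),
    Service("Vaultwarden", 1),
    Service("mailserver", 1),
    Service("claude-memory", 1),
]

if __name__ == "__main__":
    at_risk = [s.name for s in SERVICES
               if not serves_during_node_outage(s, volumes_pinned=True)]
    print("Down while their node is down (TopoLVM):", at_risk)
```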

Disk layout — three options

TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:

Option A — Carve from sdc (HDD), one VG per VM

Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.

Sizing (rough math from the session-1 audit): 1.2 TB total LV allocation across 76 PVCs. Adding 30% headroom gives about 1.6 TB. Distribute by current node placement:
- node1: Prometheus (433G) + others ≈ 600-700 GiB → 768 GiB disk
- node2: Loki (50G) + smaller DBs ≈ 200 GiB → 256 GiB disk
- node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → 256 GiB disk
- node4: smaller → 256 GiB disk
- node5: smaller → 256 GiB disk
- node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → 256 GiB disk
- Total: ~2 TiB carved from sdc thin pool (currently 66% used, 3.5 TiB free)
Pro: simplest physical change, no hardware needed, just qm set <vmid> --scsiN local-lvm:NNN
Con: IO contention on sdc unchanged. The 6 thin pools all sit on the same physical HDD, and LVM provides no isolation between them, so an IO storm in one pool still degrades the others.
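The Option A sizing figures can be checked with a few lines of arithmetic. The numbers are the ones listed above; the helper names are hypothetical. Note that the source mixes TB and GiB/TiB, and this sketch keeps each figure in the unit the text uses rather than converting between them.

```python
from typing import Dict

AUDIT_TOTAL_TB = 1.2  # total LV allocation from the session-1 audit
HEADROOM = 0.30       # 30% headroom

# Proposed per-VM data disk sizes in GiB, as listed above.
DISKS_GIB: Dict[str, int] = {
    "node1": 768,
    "node2": 256,
    "node3": 256,
    "node4": 256,
    "node5": 256,
    "node6": 256,
}


def with_headroom(total: float, headroom: float) -> float:
    """Required capacity after adding a fractional headroom."""
    return total * (1 + headroom)


def total_tib(disks_gib: Dict[str, int]) -> float:
    """Sum of per-VM disks, converted from GiB to TiB."""
    return sum(disks_gib.values()) / 1024


if __name__ == "__main__":
    print(f"needed with headroom: {with_headroom(AUDIT_TOTAL_TB, HEADROOM):.2f} TB")
    print(f"carved from sdc:      {total_tib(DISKS_GIB):.1f} TiB")
```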

Option B — Move hot workloads to sdb (SSD), keep cold on sdc

Use a hybrid layout:

- Per-VM SSD disk (sdb, 931 GB total, ~675 GB free) for hot DBs
- Per-VM HDD disk (sdc) for cold/bulk

TopoLVM supports multiple device classes per node — each VM would have an ssd-thin and hdd-thin class.

Pro: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
Con: 675 GB of free SSD has to host DBs across 6 VMs (675 / 6 ≈ 112 GB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.
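A quick sketch of the SSD budget behind the "tight" verdict. The free-space figure and VM count come from the text; the even split across VMs and the example PVC sizes are assumptions for illustration only, not audit data.

```python
from typing import Iterable

SSD_FREE_GB = 675  # free space on sdb, from the text
WORKER_VMS = 6


def per_vm_budget(free_gb: float, vms: int) -> float:
    """Even split of the free SSD space across worker VMs (an assumption)."""
    return free_gb / vms


def fits(budget_gb: float, pvc_sizes_gb: Iterable[float]) -> bool:
    """True if the given PVC sizes fit in one VM's SSD budget."""
    return sum(pvc_sizes_gb) <= budget_gb


if __name__ == "__main__":
    budget = per_vm_budget(SSD_FREE_GB, WORKER_VMS)
    print(f"SSD budget per VM: {budget} GB")
    # Hypothetical hot-PVC sizes, chosen only to show how quickly the budget fills.
    print(fits(budget, [50, 30, 20]))  # 100 GB of hot DBs fits
    print(fits(budget, [50, 50, 20]))  # 120 GB does not
```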

Option C — Add a second physical disk for storage

Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.

Pro: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads code-oflt task).
Con: hardware investment (~£200 for a 2 TB NVMe) and PVE host downtime to install it. The host has 2 SATA ports in use (sda + sdb) plus an M.2 slot that might already be occupied (need to check). The LVM/thin pool setup itself is straightforward.

Migration approach

Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:

1. Install TopoLVM alongside proxmox-csi: both run in parallel; new StorageClasses topolvm-provisioner and topolvm-provisioner-encrypted are created without touching existing PVCs
2. Per-VM data disk provisioning: qm set <vmid> --scsi8 local-lvm:NNN on the PVE host, then vgcreate + lvcreate inside each VM (one-time)
3. lvmd config per node: Helm values point to the right VG per node
4. Pilot migration: pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
5. Phased rollout: migrate PVCs in batches by criticality:
   - Wave A: regenerable / cache (5-10 PVCs, low risk)
   - Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
   - Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
   - Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
   - Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
6. Rewrite backup pipeline: current daily-backup mounts LVM snapshots on the PVE host; the new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
7. Deprecate proxmox-csi: once all PVCs are migrated, remove the Helm release and the proxmox-lvm / proxmox-lvm-encrypted StorageClasses
8. Update docs: docs/architecture/storage.md, CLAUDE.md, ingress factory references, several runbooks
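The per-PVC pattern named in the pilot step (scale-to-0 → rsync helper → swap claim_name → apply) could be laid out as a dry-run plan like the sketch below. It only builds and prints steps; nothing is executed. Everything specific here is an assumption: the workload is taken to be a Deployment, the `/old/` and `/new/` mount paths inside the rsync helper are invented, and the `swap-claim` and `apply` steps are left as descriptions because the source does not say which tool performs them.

```python
import shlex
from typing import List, Tuple

Step = Tuple[str, str]  # (step name, command or description)


def migration_plan(namespace: str, deployment: str,
                   old_pvc: str, new_pvc: str) -> List[Step]:
    """Build an ordered, dry-run plan for migrating one PVC.

    Hypothetical helper: assumes a Deployment workload and a helper pod that
    mounts the old PVC at /old and the new PVC at /new.
    """
    scale_down = shlex.join([
        "kubectl", "-n", namespace, "scale",
        f"deployment/{deployment}", "--replicas=0",
    ])
    # --checksum compares file contents rather than size and mtime.
    copy = shlex.join(["rsync", "-a", "--checksum", "/old/", "/new/"])
    return [
        ("scale-to-0", scale_down),
        ("rsync", f"in helper pod mounting {old_pvc} and {new_pvc}: {copy}"),
        ("swap-claim", f"point the workload's claim_name at {new_pvc}"),
        ("apply", "apply the change and let the workload scale back up"),
        ("validate", f"verify the service, then retire {old_pvc}"),
    ]


if __name__ == "__main__":
    # Names below are placeholders, not real cluster objects.
    for name, detail in migration_plan("demo", "demo-app",
                                       "data-proxmox", "data-topolvm"):
        print(f"{name:12s} {detail}")
```

Keeping the plan as data makes it easy to review the order for each wave before anything touches the cluster.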

Effort estimate

| Phase | Time | Notes |
| --- | --- | --- |
| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |

Total: ~2.5-3 weeks of focused infra time. Could stretch over a quarter if done alongside other work.
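As a cross-check, the phase estimates can be summed directly. Taken as written, the rows come to 15 to 19 days; how that maps onto the "~2.5-3 weeks" headline depends on how many working days a week is assumed to hold, which this doc does not state. The sketch below shows both a 5-day and a 7-day reading as assumptions.

```python
from typing import List, Tuple

# (phase, low estimate in days, high estimate in days), from the table above.
PHASES: List[Tuple[str, float, float]] = [
    ("Decision + option pick", 1, 1),
    ("TopoLVM install + lvmd config", 1, 1),
    ("Per-VM data disk provisioning", 0.5, 0.5),
    ("Encrypted PVC LUKS plumbing", 1, 1),
    ("Pilot migration", 0.5, 0.5),
    ("Waves A-D migrations", 5, 7),
    ("Wave E (huge PVCs)", 2, 3),
    ("Backup pipeline rewrite", 2, 3),
    ("Deprecation + cleanup", 1, 1),
    ("Docs + runbook updates", 1, 1),
]


def total_days(phases: List[Tuple[str, float, float]]) -> Tuple[float, float]:
    """Sum the low and high estimates across all phases."""
    low = sum(p[1] for p in phases)
    high = sum(p[2] for p in phases)
    return low, high


def to_weeks(days: float, days_per_week: int) -> float:
    """Convert days to weeks under an assumed days-per-week figure."""
    return days / days_per_week


if __name__ == "__main__":
    low, high = total_days(PHASES)
    print(f"total: {low}-{high} days")
    for dpw in (5, 7):  # both readings are assumptions
        print(f"at {dpw} days/week: {to_weeks(low, dpw):.1f}-{to_weeks(high, dpw):.1f} weeks")
```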

Risks

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Data loss during PVC migration | Low | Rsync with `--checksum`, verify before deleting source, keep proxmox-csi running until each migration validates |
| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
| Backup pipeline regression | High | Keep old `daily-backup` running until new pipeline proven for ≥2 weeks |
| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads `code-oflt` remains open as a separate task |
| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; use rsync with checkpoint resumption |
| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |
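For the huge-PVC migration window, a rough copy-time estimate helps with staging. The 433 GiB size is from this doc; the throughput values are purely hypothetical placeholders, since no measured rsync rate is given here, and the estimate ignores checksum passes and small-file overhead.

```python
def transfer_hours(size_gib: float, throughput_mib_s: float) -> float:
    """Hours to copy size_gib at a sustained throughput in MiB/s."""
    if throughput_mib_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gib * 1024 / throughput_mib_s
    return seconds / 3600


if __name__ == "__main__":
    prometheus_gib = 433
    # Hypothetical sustained rates; measure the real one during the pilot.
    for rate in (30, 60, 100):
        hours = transfer_hours(prometheus_gib, rate)
        print(f"{rate:>3} MiB/s -> {hours:.1f} h")
```

Because all pools share the same HDD under Option A, the sustained rate during a migration is likely to sit toward the lower end of whatever the pilot measures.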

Decision criteria

Pick TopoLVM (any option) if:

- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
- We want to fix IO contention at the same time (then Option C only)
- We're comfortable with single-node data locality

Stay on proxmox-csi if:

- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year

Recommended next steps if pursuing

1. Run a small pilot first: install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
2. Pick Option A or C: Option B is too SSD-constrained for the encrypted PVC volume we have
3. Order hardware if Option C: NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
4. Schedule a 3-week window: partition the migration waves around other infra commitments; flag in beads as a P1

Related

- docs/architecture/storage.md: current storage architecture
- docs/runbooks/scale-k8s-cluster.md: current scaling playbook (Path 1+2 alternative)
- docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md: IO contention is the related-but-separate concern
- Beads code-oflt: IO isolation long-term fix (Option C would close this)
- Remote memory id=2788: proxmox-csi-plugin LUN cap explanation
