Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.
Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.
Three disk-layout options:
A. Carve per-VM data disks from sdc — simple, no hardware,
IO contention unchanged
B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
C. Add a dedicated NVMe — also closes beads code-oflt (IO
contention), ~£200 hardware investment
Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.
Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.
Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
TopoLVM Migration Evaluation
Date: 2026-06-01
Status: Evaluation (not yet decided)
Decision: Pending. Used to understand whether to commit to the migration.
Problem statement
The cluster's block storage hits a hardcoded 29-PVC-per-VM ceiling in sergelogvinov/proxmox-csi-plugin (pkg/csi/utils.go:394, for lun = 1; lun < 30; lun++). The plugin scans Proxmox SCSI indices scsi1..scsi29; when all are taken, ControllerPublishVolume returns Internal desc = no free lun found. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.
Path 1 (patch the plugin to lun < 31) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates Path 3 — replace the CSI driver with TopoLVM, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
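The slot arithmetic behind the cap and behind Path 1 can be checked with a small model. This is a Python sketch of the loop bounds quoted above, not the plugin's Go code; the function name `free_luns` is made up for illustration.

```python
def free_luns(upper_bound, in_use):
    """Model the scan: try scsi1 .. scsi(upper_bound - 1) and keep unused slots."""
    return [lun for lun in range(1, upper_bound) if lun not in in_use]


# Current bound (lun < 30) covers scsi1..scsi29.
capacity_today = len(free_luns(30, in_use=set()))

# A VM with every slot taken has nothing left: the "no free lun found" case.
full_vm = set(range(1, 30))
slots_left_when_full = free_luns(30, in_use=full_vm)

# Path 1 bound (lun < 31) exposes exactly one more slot, scsi30.
slots_left_when_patched = free_luns(31, in_use=full_vm)

print(capacity_today, slots_left_when_full, slots_left_when_patched)
```

On a full VM the patched bound yields a single extra slot, which is why Path 1 is only a tactical fix.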
What TopoLVM is
CSI driver that originated at Cybozu (now maintained under the topolvm GitHub organization). Each K8s node runs an lvmd daemon managing one or more LVM volume groups. The CSI controller creates LogicalVolume custom resources; topolvm-node on the target node reconciles them by asking lvmd to lvcreate an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices, so there is no per-VM cap beyond kernel LV count limits (effectively thousands).
Mature project, used in production by Cybozu and others. Supports:
- Thin provisioning (`type: thin` device class with overprovision ratio)
- Multiple device classes per node (e.g., one for SSD, one for HDD)
- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
- Online volume expansion (ext4, xfs, btrfs)
- Striping and RAID via `lvcreate-options`
The big architectural trade-off — read this first
| Aspect | proxmox-csi (today) | TopoLVM |
|---|---|---|
| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
| Per-VM PVC cap | 29 (plugin source) | None (kernel LV limits, thousands) |
| PVC mobility | Migrates between VMs: CSI re-attaches the LV to wherever the pod schedules | Pinned to one node via the topology.topolvm.io/node topology key (topology.topolvm.cybozu.com/node in older releases) |
| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
| Snapshot mechanism | PVE-host lvm-pvc-snapshot script (custom) | CSI VolumeSnapshot CRDs (standard) |
| Encryption | LUKS via Proxmox CSI extraParameters + ESO-synced secret | LUKS via csi.storage.k8s.io/{node-stage,node-expand}-secret; same pattern, different secret target |
| Backup pipeline | sda → Synology via daily-backup script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |
Data mobility is the most important difference. Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means the PVC data is still on that node's local disk — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.
For Viktor's setup specifically:
- Pro: the underlying PVE host is a single point of failure anyway (192.168.1.127). If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at the homelab scale — the data isn't actually mobile across physical machines.
- Con: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
- Mitigation: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal — one replica's local LV being unavailable triggers a re-replication elsewhere. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
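The mobility difference can be made concrete with a toy scheduling model. This is a minimal sketch, not how either CSI driver is implemented; the `Volume` class, the `pinned` flag, and the volume name `vaultwarden-data` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    home_node: str  # node that currently holds the data
    pinned: bool    # True models TopoLVM, False models proxmox-csi


def schedulable_nodes(volume, ready_nodes):
    """Return the nodes where a pod using this volume could start."""
    if volume.pinned:
        # Local LV: only the node that holds the data qualifies.
        return [n for n in ready_nodes if n == volume.home_node]
    # Re-attachable volume: any ready node qualifies.
    return list(ready_nodes)


nodes = ["k8s-node1", "k8s-node2", "k8s-node3"]
after_drain = [n for n in nodes if n != "k8s-node1"]

mobile = Volume("vaultwarden-data", "k8s-node1", pinned=False)
local = Volume("vaultwarden-data", "k8s-node1", pinned=True)

print(schedulable_nodes(mobile, after_drain))
print(schedulable_nodes(local, after_drain))
```

With k8s-node1 drained, the re-attachable volume still has two candidate nodes while the pinned one has none, which is the downtime case for single-replica services.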
Disk layout — three options
TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:
Option A — Carve from sdc (HDD), one VG per VM
Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.
- Sizing: rough math from session-1 audit: 1.2 TB total LV allocation across 76 PVCs. Add 30% headroom = 1.6 TB. Distribute by current node placement:
  - node1: Prometheus (433G) + others ≈ 600-700 GiB → 768 GiB disk
  - node2: Loki (50G) + smaller DBs ≈ 200 GiB → 256 GiB disk
  - node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → 256 GiB disk
  - node4: smaller → 256 GiB disk
  - node5: smaller → 256 GiB disk
  - node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → 256 GiB disk
- Total: ~2 TiB carved from sdc thin pool (currently 66% used, 3.5 TiB free)
- Pro: simplest physical change, no hardware needed, just `qm set <vmid> --scsiN local-lvm:NNN`
- Con: IO contention on sdc unchanged. The 6 thin pools all sit on the same HDD physical layer. Storms hit harder because there's no inter-pool isolation at the LVM level.
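The Option A sizing can be re-derived in a few lines. This is a rough check using only the figures quoted above; it treats "TB" and "TiB" as the doc writes them, so the headroom comparison is approximate.

```python
allocated_tb = 1.2   # total LV allocation across 76 PVCs (session-1 audit)
headroom = 0.30      # 30% headroom
target_tb = allocated_tb * (1 + headroom)

# Proposed per-VM disk sizes in GiB, from the list above.
disks_gib = {
    "node1": 768,
    "node2": 256,
    "node3": 256,
    "node4": 256,
    "node5": 256,
    "node6": 256,
}
total_gib = sum(disks_gib.values())
total_tib = total_gib / 1024

sdc_free_tib = 3.5
remaining_tib = sdc_free_tib - total_tib

print(round(target_tb, 2), total_gib, total_tib, remaining_tib)
```

The six disks add up to 2048 GiB (2 TiB), which covers the ~1.6 TB target and leaves about 1.5 TiB of the sdc thin pool's free space unassigned.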
Option B — Move hot workloads to sdb (SSD), keep cold on sdc
Use a hybrid layout:
- Per-VM SSD disk (sdb, 931 GB total, ~675 GB free) for hot DBs
- Per-VM HDD disk (sdc) for cold/bulk
TopoLVM supports multiple device classes per node — each VM would have an ssd-thin and hdd-thin class.
- Pro: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
- Con: 675 GB SSD has to host DBs across 6 VMs (~112 GiB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.
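The "tight" SSD budget works out as follows. This sketch assumes an even split across VMs and, as a worst case, that all 45 encrypted PVCs land on SSD; neither split is stated as a plan in the text.

```python
ssd_free_gb = 675      # free space on sdb
vm_count = 6
encrypted_pvcs = 45    # mostly DBs, the SSD candidates

per_vm_gb = ssd_free_gb / vm_count
per_pvc_gb = ssd_free_gb / encrypted_pvcs

print(per_vm_gb, per_pvc_gb)
```

An average of roughly 15 GB per encrypted PVC leaves little room for growth or thin-pool overprovisioning, which is the constraint that makes Option B hard to recommend.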
Option C — Add a second physical disk for storage
Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.
- Pro: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads `code-oflt` task).
- Con: hardware investment. ~£200 for a 2 TB NVMe. Requires PVE host downtime to install. Existing PVE has 2 SATA ports used (sda + sdb) + M.2 slot (might be in use, need to check). LVM/thin pool setup is straightforward.
Migration approach
Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:
- Install TopoLVM alongside proxmox-csi: both run in parallel; new StorageClasses `topolvm-provisioner` and `topolvm-provisioner-encrypted` are created without touching existing PVCs
- Per-VM data disk provisioning: `qm set <vmid> --scsi8 local-lvm:NNN`, then `vgcreate` + `lvcreate` per VM (one-time)
- lvmd config per node: Helm values point to the right VG per node
- Pilot migration — pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
- Phased rollout — migrate PVCs in batches by criticality:
  - Wave A: regenerable / cache (5-10 PVCs, low risk)
  - Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
  - Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
  - Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
  - Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
- Rewrite backup pipeline: the current `daily-backup` mounts LVM snapshots on the PVE host; the new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
- Deprecate proxmox-csi: once all PVCs are migrated, remove the Helm release and the `proxmox-lvm` / `proxmox-lvm-encrypted` StorageClasses
- Update docs: `docs/architecture/storage.md`, `CLAUDE.md`, ingress factory references, several runbooks
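The per-VM disk provisioning step could be rehearsed as a dry-run plan before touching any VM. This is a sketch that only builds command strings and executes nothing; the VM id `201`, the in-guest device path `/dev/sdX`, the VG name `topolvm-vg`, the pool name `thinpool`, and the `90%FREE` pool size are all placeholders, not values from the cluster.

```python
def provisioning_plan(vmid, size_gib, device, vg="topolvm-vg", pool="thinpool"):
    """Return the one-time commands for one VM as strings; nothing is executed."""
    return [
        # On the PVE host: attach a new disk from local-lvm (form taken from the text).
        f"qm set {vmid} --scsi8 local-lvm:{size_gib}",
        # Inside the VM: turn the new disk into a PV, then a VG, then a thin pool.
        f"pvcreate {device}",
        f"vgcreate {vg} {device}",
        f"lvcreate --type thin-pool -l 90%FREE -n {pool} {vg}",
    ]


for cmd in provisioning_plan(vmid=201, size_gib=256, device="/dev/sdX"):
    print(cmd)
```

Printing the plan for each of the six VMs gives a reviewable checklist; the real device path has to be confirmed inside each guest before anything is run.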
Effort estimate
| Phase | Time | Notes |
|---|---|---|
| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |
Total: ~2.5-3 weeks of focused infra time. Could stretch over a quarter if done alongside other work.
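As a cross-check, the phase rows can be summed directly. The sketch below uses only the (low, high) day figures from the table; the five-day working week is an assumption.

```python
# (low, high) estimates in days, one entry per table row.
phases = {
    "Decision + option pick": (1, 1),
    "TopoLVM install + lvmd config": (1, 1),
    "Per-VM data disk provisioning": (0.5, 0.5),
    "Encrypted PVC LUKS plumbing": (1, 1),
    "Pilot migration": (0.5, 0.5),
    "Waves A-D migrations": (5, 7),
    "Wave E (huge PVCs)": (2, 3),
    "Backup pipeline rewrite": (2, 3),
    "Deprecation + cleanup": (1, 1),
    "Docs + runbook updates": (1, 1),
}

low_days = sum(low for low, _ in phases.values())
high_days = sum(high for _, high in phases.values())

days_per_week = 5  # assumption: five working days per week
print(low_days, high_days, low_days / days_per_week, high_days / days_per_week)
```

The rows sum to 15-19 days, or 3.0-3.8 weeks at five days per week. The ~2.5-3 week headline therefore appears to assume that some phases overlap or that weeks are longer than five working days.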
Risks
| Risk | Likelihood | Mitigation |
|---|---|---|
| Data loss during PVC migration | Low | Rsync with --checksum, verify before deleting source, keep proxmox-csi running until each migration validates |
| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
| Backup pipeline regression | High | Keep old daily-backup running until new pipeline proven for ≥2 weeks |
| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads code-oflt remains open as a separate task |
| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; use a resumable rsync (for example with --partial) so an interrupted copy can continue |
| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |
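To size the migration window for the largest PVC, a back-of-the-envelope transfer time helps. The throughput values below are assumptions for illustration only; the doc does not state a measured rsync rate on sdc.

```python
def transfer_hours(size_gib, throughput_mib_s):
    """Hours needed to copy size_gib at a sustained throughput in MiB/s."""
    return size_gib * 1024 / throughput_mib_s / 3600


prometheus_gib = 433

# Assumed sustained rates; real numbers depend on HDD contention during the copy.
for rate in (25, 50, 100):
    print(rate, "MiB/s ->", round(transfer_hours(prometheus_gib, rate), 1), "h")
```

Even at an optimistic 100 MiB/s the copy takes over an hour, and at contended HDD speeds it takes several hours, so a resumable copy and a low-traffic window both matter.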
Decision criteria
Pick TopoLVM (any option) if:
- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
- We want to fix IO contention at the same time (then Option C only)
- We're comfortable with single-node data locality
Stay on proxmox-csi if:
- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year
Recommended next steps if pursuing
- Run a small pilot first — install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
- Pick Option A or C — Option B is too SSD-constrained for the encrypted PVC volume we have
- Order hardware if Option C — NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
- Schedule a 3-week window — partition the migration waves around other infra commitments; flag in beads as a P1
Related
- `docs/architecture/storage.md`: current storage architecture
- `docs/runbooks/scale-k8s-cluster.md`: current scaling playbook (Path 1+2 alternative)
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`: IO contention is the related-but-separate concern
- Beads `code-oflt`: IO isolation long-term fix (Option C would close this)
- Remote memory id=2788: proxmox-csi-plugin LUN cap explanation