From 82855848d1f573177ca5ce2d58f715888fe24215 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Mon, 1 Jun 2026 20:00:00 +0000
Subject: [PATCH] plans: TopoLVM migration evaluation (Path 3 for LUN-cap
 relief)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.io/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
---
 docs/plans/2026-06-01-topolvm-evaluation.md | 156 ++++++++++++++++++++
 1 file changed, 156 insertions(+)
 create mode 100644 docs/plans/2026-06-01-topolvm-evaluation.md

diff --git a/docs/plans/2026-06-01-topolvm-evaluation.md b/docs/plans/2026-06-01-topolvm-evaluation.md
new file mode 100644
index 00000000..64726f68
--- /dev/null
+++ b/docs/plans/2026-06-01-topolvm-evaluation.md
@@ -0,0 +1,156 @@
+# TopoLVM Migration Evaluation
+
+**Date**: 2026-06-01
+**Status**: Evaluation — not yet decided
+**Decision**: Pending. This doc exists to inform whether to commit to the migration.
+
+## Problem statement
+
+The cluster's block storage hits a **hardcoded 29-PVC-per-VM ceiling** in `sergelogvinov/proxmox-csi-plugin` (`pkg/csi/utils.go:394`, `for lun = 1; lun < 30; lun++`). The plugin scans Proxmox SCSI indices `scsi1..scsi29`; when all are taken, `ControllerPublishVolume` returns `Internal desc = no free lun found`. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.
+
+Path 1 (patch the plugin to `lun < 31`) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates **Path 3 — replace the CSI driver with TopoLVM**, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
+
+## What TopoLVM is
+
+CSI driver originally developed at Cybozu (`cybozu-go`), now maintained under the `topolvm` GitHub organization. Each K8s node runs an `lvmd` daemon managing one or more LVM volume groups. The CSI controller creates `LogicalVolume` custom resources; `topolvm-node` on the target node reconciles them by asking `lvmd` to `lvcreate` an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices, so there is no per-VM cap beyond kernel LV count limits (effectively thousands).
+
+Mature project, used in production by Cybozu and others. Supports:
+- Thin provisioning (`type: thin` device class with overprovision ratio)
+- Multiple device classes per node (e.g., one for SSD, one for HDD)
+- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
+- Online volume expansion (ext4, xfs, btrfs)
+- Striping and RAID via `lvcreate-options`
+
+## The big architectural trade-off — read this first
+
+| Aspect | proxmox-csi (today) | TopoLVM |
+|---|---|---|
+| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
+| Per-VM PVC cap | **29** (plugin source) | None (kernel LV limits, thousands) |
+| **PVC mobility** | **Migrates between VMs**: CSI re-attaches LV to wherever the pod schedules | **Pinned to one node** via PV node affinity on the `topology.topolvm.io/node` topology key (`topology.topolvm.cybozu.com/node` in legacy installs) |
+| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
+| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
+| Snapshot mechanism | PVE-host `lvm-pvc-snapshot` script (custom) | CSI VolumeSnapshot CRDs (standard) |
+| Encryption | LUKS via Proxmox CSI `extraParameters` + ESO-synced secret | LUKS via `csi.storage.k8s.io/{node-stage,node-expand}-secret` — same pattern, different secret target |
+| Backup pipeline | sda → Synology via `daily-backup` script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
+| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |
+
+**Data mobility is the most important difference.** Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means **the PVC data is still on that node's local disk** — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.
+
+For Viktor's setup specifically:
+- **Pro**: the underlying PVE host is a single point of failure anyway (192.168.1.127). If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at the homelab scale — the data isn't actually mobile across physical machines.
+- **Con**: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
+- **Mitigation**: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal: when one replica's local LV is unavailable, the remaining replicas keep serving until the node returns or the replica is rebuilt on another node. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
+
+## Disk layout — three options
+
+TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:
+
+### Option A — Carve from sdc (HDD), one VG per VM
+
+Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.
+
+- **Sizing**: rough math from session-1 audit: 1.2 TB total LV allocation across 76 PVCs. Adding 30% headroom gives ≈ 1.6 TB (1.2 × 1.3 = 1.56). Distribute by current node placement:
+  - node1: Prometheus (433G) + others ≈ 600-700 GiB → **768 GiB disk**
+  - node2: Loki (50G) + smaller DBs ≈ 200 GiB → **256 GiB disk**
+  - node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → **256 GiB disk**
+  - node4: smaller → **256 GiB disk**
+  - node5: smaller → **256 GiB disk**
+  - node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → **256 GiB disk**
+  - **Total: 2 TiB** (768 + 5 × 256 = 2048 GiB) carved from sdc thin pool (currently 66% used, 3.5 TiB free)
+- **Pro**: simplest physical change, no hardware needed, just `qm set <vmid> --scsiN local-lvm:NNN`
+- **Con**: IO contention on sdc unchanged. The 6 thin pools all sit on the same HDD physical layer. An IO storm in any one pool still hits all the others because there's no inter-pool isolation at the LVM level.
+
+### Option B — Move hot workloads to sdb (SSD), keep cold on sdc
+
+Use a hybrid layout:
+- Per-VM SSD disk (sdb, 931 GiB total, ~675 GiB free) for hot DBs
+- Per-VM HDD disk (sdc) for cold/bulk
+
+TopoLVM supports multiple device classes per node — each VM would have an `ssd-thin` and `hdd-thin` class.
+
+- **Pro**: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
+- **Con**: the ~675 GiB of free SSD has to host DBs across 6 VMs (~112 GiB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.
+
+### Option C — Add a second physical disk for storage
+
+Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.
+
+- **Pro**: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads `code-oflt` task).
+- **Con**: hardware investment. ~£200 for a 2 TB NVMe. Requires PVE host downtime to install. The existing PVE host has 2 SATA ports in use (sda + sdb) plus an M.2 slot (might be in use, need to check). LVM/thin pool setup is straightforward.
+
+## Migration approach
+
+Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:
+
+1. **Install TopoLVM alongside proxmox-csi** — both run in parallel; new StorageClass `topolvm-provisioner` and `topolvm-provisioner-encrypted` created without touching existing PVCs
+2. **Per-VM data disk provisioning** — `qm set <vmid> --scsi8 local-lvm:NNN`, add `vgcreate` + `lvcreate` per VM (one-time)
+3. **lvmd config per node** — Helm values point to the right VG per node
+4. **Pilot migration** — pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
+5. **Phased rollout** — migrate PVCs in batches by criticality:
+   - Wave A: regenerable / cache (5-10 PVCs, low risk)
+   - Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
+   - Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
+   - Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
+   - Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
+6. **Rewrite backup pipeline** — current `daily-backup` mounts LVM snapshots on PVE host; new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
+7. **Deprecate proxmox-csi** — once all PVCs migrated, remove the Helm release and the `proxmox-lvm` / `proxmox-lvm-encrypted` StorageClasses
+8. **Update docs** — `docs/architecture/storage.md`, `CLAUDE.md`, ingress factory references, several runbooks
+
+## Effort estimate
+
+| Phase | Time | Notes |
+|-------|------|-------|
+| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
+| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
+| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
+| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
+| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
+| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
+| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
+| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
+| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
+| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |
+
+**Total: ~2.5-3 weeks of focused infra time** (the phases above sum to 15-19 days). Could stretch over a quarter if done alongside other work.
+
+## Risks
+
+| Risk | Likelihood | Mitigation |
+|------|------------|------------|
+| Data loss during PVC migration | Low | Rsync with `--checksum`, verify before deleting source, keep proxmox-csi running until each migration validates |
+| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
+| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
+| Backup pipeline regression | High | Keep old `daily-backup` running until new pipeline proven for ≥2 weeks |
+| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
+| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads `code-oflt` remains open as a separate task |
+| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; rsync is restartable, so re-run it after an interruption (`--partial` keeps partially transferred files) |
+| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
+| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |
+
+## Decision criteria
+
+Pick TopoLVM (any option) if:
+- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
+- We want to fix IO contention at the same time (then Option C only)
+- We're comfortable with single-node data locality
+
+Stay on proxmox-csi if:
+- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
+- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
+- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year
+
+## Recommended next steps if pursuing
+
+1. **Run a small pilot first** — install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
+2. **Pick Option A or C** — Option B is too SSD-constrained for the encrypted PVC volume we have
+3. **Order hardware if Option C** — NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
+4. **Schedule a 3-week window** — partition the migration waves around other infra commitments; flag in beads as a P1
+
+## Related
+
+- `docs/architecture/storage.md` — current storage architecture
+- `docs/runbooks/scale-k8s-cluster.md` — current scaling playbook (Path 1+2 alternative)
+- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention is the related-but-separate concern
+- Beads `code-oflt` — IO isolation long-term fix (Option C would close this)
+- Remote memory id=2788 — proxmox-csi-plugin LUN cap explanation