The cluster uses two storage backends: **Proxmox CSI** for database block storage and **Proxmox NFS** for application data.
**Block storage (Proxmox CSI)**: ~69 PVCs for databases and stateful apps use two StorageClasses provisioned from the same `local-lvm` thin pool (sdc, 10.7TB RAID1 HDD):
- **`proxmox-lvm`**: Unencrypted block storage for non-sensitive workloads (~26 PVCs)
- **`proxmox-lvm-encrypted`**: LUKS2-encrypted block storage for all sensitive data (~43 PVCs) — databases, auth, email, password managers, git repos, health data, etc. Uses Argon2id key derivation with passphrase from Vault KV.
- **Both StorageClasses use `reclaimPolicy: Retain`.** Deleting a PVC frees the SCSI-LUN slot (the volume is detached) but **retains the underlying LV** for data safety — the PV goes `Released` and the LV (plus its daily `lvm-pvc-snapshot` snapshots) lingers on the thin pool. ~63 such orphan Released PVs exist as of 2026-06-05; batch orphan-LV reclaim is tracked in beads `code-dfjn`. The slot is freed regardless — orphans consume thin-pool space, not LUN slots.
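A minimal sketch for listing the `Released` PVs that `reclaimPolicy: Retain` leaves behind. It assumes the parsed output of `kubectl get pv -o json` is passed in as a dictionary; the function name and the returned field names are illustrative, not part of any existing script in this repo.

```python
# StorageClasses named in this document; adjust if the cluster differs.
BLOCK_STORAGE_CLASSES = {"proxmox-lvm", "proxmox-lvm-encrypted"}


def released_block_pvs(pv_list: dict) -> list[dict]:
    """Return Released PVs that belong to the Proxmox CSI block StorageClasses.

    `pv_list` is the parsed JSON of `kubectl get pv -o json`.
    """
    orphans = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        # Only PVs whose claim was deleted are in the Released phase.
        if pv.get("status", {}).get("phase") != "Released":
            continue
        if spec.get("storageClassName") not in BLOCK_STORAGE_CLASSES:
            continue
        orphans.append({
            "name": pv["metadata"]["name"],
            "storageClass": spec["storageClassName"],
            "reclaimPolicy": spec.get("persistentVolumeReclaimPolicy"),
            # The CSI volume handle identifies the backing volume on Proxmox.
            "volumeHandle": spec.get("csi", {}).get("volumeHandle"),
        })
    return orphans
```

The result is only a candidate list: each entry still needs a manual check before the PV or its LV is removed.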
All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on 2026-04-15. This eliminates the previous double-CoW (ZFS + LVM-thin) path and ensures data-at-rest encryption.
**NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127`. Two NFS export roots exist:
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
**History (2026-04-13)**: TrueNAS (VM 9000, 10.0.10.15) fully decommissioned. NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`. TrueNAS VM still exists in stopped state on PVE pending user decision on deletion.
**History (2026-06-05) — Wave 2 NFS migration + strategy decision**: Decided to **keep proxmox-csi and harden it** (option ① — keeps PVC mobility, £0, no new hardware) rather than re-architect to TopoLVM (pins PVCs to a node) or Longhorn (2× write-amplification on the single shared sdc HDD). See `docs/plans/2026-06-05-block-storage-harden-nfs-design.md`. Migrated 5 non-DB, embedded-DB-free workloads off block to NFS to relieve the per-VM LUN cap: **tandoor** (media, PG-backed), **speedtest** (config, MySQL), **hackmd** (image uploads, MySQL — dropped LUKS for low-sensitivity images), **changedetection** (JSON datastore), **send** (upload blobs, Redis). Freed 5 SCSI-LUN slots (4 on the then-hot node6, 21→16). Each followed the scale-0 → busybox mover (`cp -a`) → swap `claim_name` → delete block PVC pattern. (Phase-1 follow-on 2026-06-05: insta2spotify also migrated — note its reschedule re-pulled a 3.26 GB image, a ~6 min blip; large-image services incur a pull-delay when a migration moves the pod to a fresh node.)
**The "harden" half is now SHIPPED (2026-06-05):**
- **Orphan cleanup** — removed 67 `Released` proxmox PVs + 475 orphan LVs/snapshots (VG `pve` 997 → ~410 LVs; thin pool freed). 1 LV left (`f127a41c`, stuck-open stale qemu fd — harmless, clears on node reboot; do not force `dmsetup remove`).
- **Ghost-loop prevention** — `csi-ghost-reconcile` CronJob (`stacks/proxmox-csi/ghost-reconcile.tf`, every 15 min) compares each worker VM's real scsi disks (Proxmox API, scoped CSI token) against k8s VolumeAttachments and safely detaches ghosts (`PUT .../config delete=scsiN`); detection mirrors check #47, with a 60 s re-confirm + per-run cap-5. Verified live (66 VAs, 0 ghosts). This closes the doom loop by construction — **beads `code-dfjn` can be retired.**
- **Cap deliberately kept at 28** (NOT lowered to 24): the labeler value (`node_labels` in `stacks/proxmox-csi/.../main.tf`) was raised 24→28 per the 2026-05-25 eviction-cascade post-mortem; lowering it would reverse that fix. With auto-reconcile keeping drift at 0, the 28 cap is safe.
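The ghost detection described in the list above can be sketched as a set difference with a re-confirm pass and a per-run cap. This is an illustration, not the code in `ghost-reconcile.tf`: the `ScsiDisk` type, the `volume_id` matching key, and both function names are assumptions, and the input is assumed to contain only CSI-provisioned disks (no boot disk).

```python
from dataclasses import dataclass

MAX_DETACH_PER_RUN = 5  # per-run cap mentioned above


@dataclass(frozen=True)
class ScsiDisk:
    slot: str       # e.g. "scsi3"
    volume_id: str  # hypothetical key that also appears on the VolumeAttachment


def find_ghosts(vm_disks, attached_volume_ids):
    """Disks present on the VM that no VolumeAttachment accounts for."""
    attached = set(attached_volume_ids)
    return [d for d in vm_disks if d.volume_id not in attached]


def plan_detach(first_pass, second_pass, cap=MAX_DETACH_PER_RUN):
    """Keep only ghosts seen in both passes, then apply the per-run cap.

    `second_pass` is the result of running find_ghosts again after the
    re-confirm delay, so attachments that were merely in flight drop out.
    """
    still_ghost = set(second_pass)
    confirmed = [d for d in first_pass if d in still_ghost]
    return confirmed[:cap]
```

Requiring a disk to appear in both passes trades a short delay for safety: a volume that is mid-attach during the first pass is never detached.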
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
4. **Pod mount**: Applications reference PVCs in their deployment specs
5. **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.
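The rule above lends itself to a simple lint. The sketch below assumes a pod spec (for a Deployment, `spec.template.spec`) parsed into a dictionary; the function name is illustrative and no such check is claimed to exist in this repo.

```python
def inline_nfs_volumes(pod_spec: dict) -> list[str]:
    """Return the names of volumes that mount NFS inline instead of via a PVC."""
    return [
        vol.get("name", "<unnamed>")
        for vol in pod_spec.get("volumes", [])
        # An inline NFS volume carries an `nfs` key; PVC-backed volumes
        # carry `persistentVolumeClaim` instead.
        if "nfs" in vol
    ]
```

An empty result means every NFS mount in the pod goes through a PVC and therefore inherits the StorageClass mount options.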
1. **PVC creation**: Pod requests a PVC with `storageClass: proxmox-lvm`
2. **CSI provisioning**: Proxmox CSI plugin calls the Proxmox API to create a thin LV in the `local-lvm` storage
3. **SCSI hotplug**: The thin LV is hotplugged as a VirtIO-SCSI disk directly into the K8s node VM
4. **Filesystem**: CSI formats the disk as ext4 and mounts it into the pod
5. **Exclusive access**: RWO only; the disk is attached to one VM at a time
6. **Topology**: Nodes are labeled with `topology.kubernetes.io/region=pve` and `zone=pve` for scheduling
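As an illustration of step 1, the sketch below builds a PVC manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The PVC name, namespace, and size are placeholders; in this repo PVCs are managed through Terraform, so treat this as a way to see the shape of the object rather than the provisioning path.

```python
import json


def block_pvc(name: str, namespace: str, size: str,
              storage_class: str = "proxmox-lvm") -> dict:
    """Build a ReadWriteOnce PVC manifest for a Proxmox CSI StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Block volumes attach to one VM at a time, hence RWO.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Placeholder values for illustration only.
print(json.dumps(block_pvc("example-data", "example", "5Gi"), indent=2))
```

Passing `storage_class="proxmox-lvm-encrypted"` yields the same manifest for the encrypted class.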
**Key advantage**: Single CoW layer (LVM-thin only). No ZFS, no iSCSI network hop, no double-CoW corruption.
**Proxmox API token**: `csi@pve!csi-token` with CSI role (`VM.Audit VM.Config.Disk Datastore.Allocate Datastore.AllocateSpace Datastore.Audit`). Stored in Vault at `secret/viktor`.
1. **PVC creation**: Pod requests a PVC with `storageClass: proxmox-lvm-encrypted`
2. **CSI provisioning**: Same as `proxmox-lvm`: a thin LV is created in `local-lvm`
3. **LUKS encryption**: CSI node plugin reads the encryption passphrase from K8s Secret `proxmox-csi-encryption` (namespace `kube-system`), formats the disk with LUKS2 (Argon2id key derivation), then creates ext4 on top
4. **Transparent mounting**: Application sees a normal ext4 filesystem; encryption/decryption is handled by dm-crypt in the kernel
5. **Passphrase management**: ExternalSecret syncs passphrase from Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`) → K8s Secret. Backup key at `/root/.luks-backup-key` on PVE host.
**Services on encrypted storage (2026-04-15 migration):**
**Services migrated later** (post-audit catch-up): paperless-ngx (2026-04-25 — sensitive document scans had been left on plain `proxmox-lvm` by an abandoned attempt; rsync swap cleaned up the orphan and re-did via Terraform). Vault raft cluster (2026-04-25 — all 3 voters migrated from `nfs-proxmox` to `proxmox-lvm-encrypted` after the 2026-04-22 raft-leader-deadlock post-mortem found NFS fsync semantics incompatible with raft consensus log; rolled non-leader-first with force-finalize on the pvc-protection finalizer to avoid pod-recreating on the old PVCs).
**CSI node plugin memory**: Requires 1280Mi limit for LUKS2 Argon2id key derivation (~1GiB). Set via `node.plugin.resources` in Helm values (not `node.resources`).
**Terraform stack**: `stacks/proxmox-csi/` manages both StorageClasses, the ExternalSecret, and CSI plugin resources.
> **This section is historical.** All iSCSI PVCs have been migrated to Proxmox CSI (`proxmox-lvm`). The democratic-csi iSCSI driver is pending removal.
1. ~~Zvol creation: democratic-csi creates ZFS zvols under `main/iscsi/<pvc-name>` via SSH commands~~
2. ~~Target setup: TrueNAS iSCSI service exposes zvols as iSCSI LUNs~~
3. ~~Initiator connection: K8s nodes connect via open-iscsi~~
### SQLite on NFS — Why It Fails
SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantics break this:
- Soft mount returns success even if data is still in client cache
- Network blips during fsync → incomplete writes → corruption
- WAL mode helps but doesn't eliminate the race
**Solution**: Put every SQLite database (Vaultwarden, plotting-book) on Proxmox CSI block storage (`proxmox-lvm`), or on local disk if the data is ephemeral.
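The placement rules in this document can be condensed into a small decision function. This is one reading of those rules, not an official policy: the function name and its three flags are assumptions, and real workloads may need a case-by-case decision.

```python
def pick_storage_class(*, sensitive: bool, uses_sqlite: bool,
                       is_database: bool) -> str:
    """Pick a StorageClass name following the rules of thumb in this document."""
    if sensitive:
        # Sensitive data goes on LUKS2-encrypted block storage.
        return "proxmox-lvm-encrypted"
    if uses_sqlite or is_database:
        # Databases, SQLite included, stay off NFS.
        return "proxmox-lvm"
    # Everything else (media, backups, plain app data) can live on NFS.
    return "nfs-truenas"
```

For example, a media library with no embedded database maps to `nfs-truenas`, while a non-sensitive app with a SQLite file maps to `proxmox-lvm`.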
1. Move hot data to SSD NFS: relocate from `/srv/nfs/<service>` to `/srv/nfs-ssd/<service>` and update PV path
2. Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions`
## Nextcloud as PVE-NFS browser
Both NFS export roots are mounted into the Nextcloud server pod — `/srv/nfs` at `/mnt/pve-nfs` and `/srv/nfs-ssd` at `/mnt/pve-nfs-ssd` — via standard NFS PVs (`nfs_volume` module). No host-level Unix user/group setup; Nextcloud is the sole household-facing surface.
**ACL model — two patterns:**
- **Root browser mounts** (`PVE NFS Pool`, `PVE NFS-SSD Pool`): scoped to NC group `admin`. Used by Viktor for ad-hoc browsing of any cluster NFS state. Other users never see these mounts.
- **Per-archive mounts** (e.g. `/anca-elements` → `/mnt/pve-nfs/anca-elements`): one NC External mount per archive, `applicable_users` set to the archive owners. Users see only the mounts assigned to them. Write/delete access is implicit at the OS level (NC pod writes via `no_root_squash`); deny semantics come from mount visibility — if the mount is not in your list, you cannot reach the path.
**Why mount-level ACL, not Files Access Control**: NC 30/31's workflow engine check classes are `FileName` (basename), `FileMimeType`, `FileSize`, `FileSystemTags`, and `UserGroupMembership`. There is no `FilePath` and no `UserId` check class. Per-(directory, user) rules are not expressible via FAC. Mount-level ACL via `occ files_external:applicable` is the supported primitive and maps cleanly onto the model.
**Manifest**: `kubernetes_config_map_v1.nextcloud_external_storage_manifest` in `stacks/nextcloud/external_storage.tf`. Mount entries reference NC usernames (`admin`, `anca`, `emo` — not display names; admin is Viktor). JSON shape:
A one-shot K8s bootstrap Job applies the manifest idempotently on every `tg apply` via `occ files_external:*`, `occ files_external:applicable`, and `occ files_external:option`. `enableSharing: true` lets admin re-share a subfolder of the mount with another NC user/group/public link; default is `false` (NC's local-backend default).
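To make the ACL step concrete, the sketch below turns one mount entry into `occ` command lines. It is not the bootstrap Job's actual logic: the entry keys other than `applicable_users` and `enableSharing` are hypothetical, the mount ID is assumed to be known already, and the `enable_sharing` option key is an assumption about how the manifest field maps onto Nextcloud's mount options.

```python
import shlex


def occ_acl_commands(mount_id: int, entry: dict) -> list[str]:
    """Build occ commands that scope a mount and set its sharing option."""
    applicable = ["occ", "files_external:applicable", str(mount_id)]
    for user in entry.get("applicable_users", []):
        applicable += ["--add-user", user]
    # `applicable_groups` is a hypothetical key used here for illustration.
    for group in entry.get("applicable_groups", []):
        applicable += ["--add-group", group]

    # Sharing is off unless the entry turns it on.
    sharing = "true" if entry.get("enableSharing", False) else "false"
    option = ["occ", "files_external:option", str(mount_id),
              "enable_sharing", sharing]

    return [shlex.join(applicable), shlex.join(option)]
```

The commands are returned as strings so they can be reviewed or logged before anything is executed inside the Nextcloud pod.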
**Adding a new archive**: drop the directory under `/srv/nfs/<name>/` on PVE, append an `archiveMounts` entry to the manifest, then `scripts/tg apply` the nextcloud stack. See `docs/runbooks/nextcloud-add-archive.md` for the full step-by-step.
**Trade-off**: a compromised NC admin account has destructive reach over the cluster NFS roots (admin sees the root browser mounts). Accepted — Viktor's account is the single high-value target either way. No lateral movement to databases or block PVCs via this path (those are not NFS).
**Backup**: Synology retains a frozen copy of each archive (3-2-1 coverage); the existing `offsite-sync-backup` pipeline provides nightly delta sync from `/srv/nfs/<archive>` → Synology `nfs/`.
## Related
- **Runbooks**:
  - `docs/runbooks/restore-postgresql.md`
  - `docs/runbooks/restore-mysql.md`
  - `docs/runbooks/recover-nfs-mount.md`
  - `docs/runbooks/nextcloud-add-archive.md`
- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using LVM snapshots and Proxmox host scripts)
- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs proxmox-lvm)