# Storage Architecture

Last updated: 2026-04-03

## Overview

The cluster uses two storage backends: Proxmox CSI for database block storage and TrueNAS NFS for application data.

- **Block storage (Proxmox CSI)**: 13 PVCs for databases (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web) use StorageClass `proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage. This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
- **NFS storage (TrueNAS)**: ~100 NFS shares for application data, media, configs, and backup targets continue to use TrueNAS ZFS at 10.0.10.15 via StorageClass `nfs-truenas`.
- **Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). The democratic-csi iSCSI driver is deprecated and pending removal.
## Architecture Diagram

```mermaid
graph TB
    subgraph TrueNAS["TrueNAS (10.0.10.15)<br/>VMID 9000, 16c/16GB"]
        ZFS_Main["ZFS Pool: main<br/>1.64 TiB<br/>32G + 7x256G + 1T disks"]
        ZFS_SSD["ZFS Pool: ssd<br/>~256GB SSD<br/>Immich ML, PostgreSQL hot data"]
        ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/<service>"]
        ZFS_Main --> iSCSI_Datasets["iSCSI Datasets<br/>main/iscsi (zvols)<br/>main/iscsi-snaps"]
        NFS_Datasets --> NFS_Exports["NFS Exports<br/>managed by secrets/nfs_exports.sh"]
        iSCSI_Datasets --> iSCSI_Targets["iSCSI Targets<br/>SSH-managed via democratic-csi"]
        ZFS_SSD --> SSD_Data["Immich ML models<br/>PostgreSQL CNPG"]
    end

    subgraph K8s["Kubernetes Cluster"]
        CSI_NFS["democratic-csi-nfs<br/>StorageClass: nfs-truenas<br/>soft,timeo=30,retrans=3"]
        CSI_iSCSI["democratic-csi-iscsi (DEPRECATED)<br/>StorageClass: iscsi-truenas<br/>SSH driver"]
        NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
        iSCSI_PV["iSCSI PersistentVolumes (DEPRECATED)<br/>RWO, ~19 volumes"]
        Pods["Application Pods"]
        DBPods["Database Pods<br/>PostgreSQL CNPG<br/>MySQL InnoDB"]
    end

    NFS_Exports -->|CSI driver| CSI_NFS
    iSCSI_Targets -->|CSI driver| CSI_iSCSI
    CSI_NFS --> NFS_PV
    CSI_iSCSI --> iSCSI_PV
    NFS_PV --> Pods
    iSCSI_PV --> DBPods

    style TrueNAS fill:#e1f5ff
    style K8s fill:#fff4e1
    style ZFS_Main fill:#c8e6c9
    style ZFS_SSD fill:#ffe0b2
```
## Components

| Component | Version/Config | Location | Purpose |
|---|---|---|---|
| Proxmox CSI plugin | Helm chart | Namespace: `proxmox-csi` | Block storage via LVM-thin hotplug |
| StorageClass `proxmox-lvm` | RWO, WaitForFirstConsumer | Cluster-wide | Databases and stateful apps |
| TrueNAS VM | VMID 9000, 16c/16GB | Proxmox host (10.0.10.15) | ZFS NFS storage server |
| ZFS pool `main` | 1.64 TiB usable | 32G + 7x256G + 1T disks | NFS data for all services |
| ZFS pool `ssd` | ~256GB SSD | Dedicated SSD | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: `nfs-csi` | NFS CSI driver |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | Default storage for apps |
| democratic-csi iSCSI driver | DEPRECATED | Namespace: `iscsi-csi` | Replaced by Proxmox CSI (2026-04-02) |
| StorageClass `iscsi-truenas` | DEPRECATED | Cluster-wide | Replaced by `proxmox-lvm` |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | NFS PV/PVC factory |
## How It Works

### NFS Storage Flow

- **Dataset creation**: NFS shares are created as ZFS datasets under `main/<service>` (e.g., `main/immich`, `main/nextcloud`)
- **Export configuration**: `/root/secrets/nfs_exports.sh` on TrueNAS generates `/etc/exports` with per-dataset exports (`/mnt/main/<service>`)
- **CSI provisioning**: democratic-csi-nfs mounts the NFS shares and creates K8s PersistentVolumes
- **Terraform module**: Stacks use `modules/kubernetes/nfs_volume/` to declaratively create PV + PVC pairs:

  ```hcl
  module "nfs_data" {
    source     = "../../modules/kubernetes/nfs_volume"
    name       = "immich-data"
    namespace  = kubernetes_namespace.immich.metadata[0].name
    nfs_server = var.nfs_server # 10.0.10.15
    nfs_path   = "/mnt/main/immich"
  }
  ```

- **Pod mount**: Applications reference the PVCs in their deployment specs
- **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in the StorageClass) to prevent indefinite hangs

**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600`, which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass via PVCs.
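For reference, a minimal sketch of what the `nfs-truenas` StorageClass looks like with these mount options. Only the `mountOptions` list is prescribed by this document; the provisioner name, reclaim policy, and expansion flag shown here are assumptions about the democratic-csi Helm deployment.

```yaml
# Sketch only: provisioner, reclaimPolicy, and allowVolumeExpansion are assumptions;
# the mountOptions list is the part mandated above.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-truenas
provisioner: org.democratic-csi.nfs   # assumed to match csiDriver.name in the Helm values
reclaimPolicy: Retain                 # assumption
allowVolumeExpansion: true            # assumption
mountOptions:
  - soft        # return EIO instead of hanging indefinitely
  - timeo=30
  - retrans=3
```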
### Block Storage Flow (Proxmox CSI) — NEW

- **PVC creation**: Pod requests a PVC with `storageClass: proxmox-lvm`
- **CSI provisioning**: The Proxmox CSI plugin calls the Proxmox API to create a thin LV in the `local-lvm` storage
- **SCSI hotplug**: The thin LV is hotplugged as a VirtIO-SCSI disk directly into the K8s node VM
- **Filesystem**: CSI formats the disk as ext4 and mounts it into the pod
- **Exclusive access**: RWO only — the disk is attached to one VM at a time
- **Topology**: Nodes are labeled with `topology.kubernetes.io/region=pve` and `topology.kubernetes.io/zone=pve` for scheduling

**Key advantage**: Single CoW layer (LVM-thin only). No ZFS, no iSCSI network hop, no double-CoW corruption.

**Proxmox API token**: `csi@pve!csi-token` with the CSI role (VM.Audit, VM.Config.Disk, Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit). Stored in Vault at `secret/viktor`.
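As an illustration of the PVC side of this flow, a minimal claim against `proxmox-lvm` could look like the sketch below; the name, namespace, and size are hypothetical, only the access mode and StorageClass reflect the conventions above.

```yaml
# Hypothetical PVC; only storageClassName and the RWO access mode are prescribed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-db-data      # hypothetical name
  namespace: example-app     # hypothetical namespace
spec:
  accessModes:
    - ReadWriteOnce          # proxmox-lvm volumes are RWO only
  storageClassName: proxmox-lvm
  resources:
    requests:
      storage: 5Gi           # illustrative size
```

With `WaitForFirstConsumer` binding (see the Components table), the thin LV is only created and hotplugged once a pod using the claim is scheduled onto a node.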
### iSCSI Storage Flow (DEPRECATED — replaced 2026-04-02)

This section is historical. All iSCSI PVCs have been migrated to Proxmox CSI (`proxmox-lvm`). The democratic-csi iSCSI driver is pending removal.

- **Zvol creation**: democratic-csi creates ZFS zvols under `main/iscsi/<pvc-name>` via SSH commands
- **Target setup**: The TrueNAS iSCSI service exposes the zvols as iSCSI LUNs
- **Initiator connection**: K8s nodes connect via open-iscsi
### SQLite on NFS — Why It Fails

SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantics break this:

- Soft mount returns success even if data is still in the client cache
- Network blips during `fsync()` → incomplete writes → corruption
- WAL mode helps but doesn't eliminate the race

**Solution**: Use Proxmox CSI (`proxmox-lvm`) for any SQLite database (Vaultwarden, plotting-book), or local disk (ephemeral).
### Democratic-CSI Sidecar Resources

The Helm chart spawns 17 sidecar containers (driver-registrar, external-provisioner, etc.) across the controller and node DaemonSet pods. Each sidecar defaults to `resources: {}`, which then gets the LimitRange default of 256Mi.

**Fix**: Set explicit resources in `values.yaml`:

```yaml
csiProxy: # TOP-LEVEL key, not nested
  resources:
    requests:
      memory: "32Mi"
    limits:
      memory: "32Mi"
controller:
  externalProvisioner:
    resources:
      requests: {memory: "64Mi"}
      limits: {memory: "64Mi"}
# ... repeat for all sidecars
```

Total footprint: ~1.5Gi → ~400Mi.
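For context, the 256Mi default comes from a namespace LimitRange along the lines of the sketch below (the name and exact values are assumptions); any container that ships with `resources: {}` inherits these defaults, which is why the sidecars balloon without explicit values.

```yaml
# Hypothetical LimitRange illustrating why empty resources get 256Mi defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem          # hypothetical name
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi        # applied as the request when none is set
      default:
        memory: 256Mi        # applied as the limit when none is set
```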
## Configuration

### Key Files

| Path | Purpose |
|---|---|
| `/root/secrets/nfs_exports.sh` | TrueNAS: generates `/etc/exports` with all service shares |
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
| `stacks/iscsi-csi/` | DEPRECATED — democratic-csi iSCSI driver (pending removal) |
| `stacks/nfs-csi/` | NFS CSI driver |
| `modules/kubernetes/nfs_volume/` | Reusable module for NFS PV/PVC creation |
| `config.tfvars` | Variable `nfs_server = "10.0.10.15"` shared by all stacks |
### Vault Paths

| Path | Contents |
|---|---|
| `secret/viktor/truenas_ssh_key` | SSH private key for the democratic-csi SSH driver |
| `secret/viktor/truenas_root_password` | TrueNAS root password (web UI access) |
### Terraform Stacks

- `stacks/proxmox-csi/`: Deploys the Proxmox CSI plugin + `proxmox-lvm` StorageClass + node topology labels
- `stacks/nfs-csi/`: Deploys the NFS CSI driver for TrueNAS
- `stacks/iscsi-csi/`: Deploys the democratic-csi iSCSI driver — DEPRECATED, pending removal
- All application stacks reference NFS volumes via `module "nfs_<name>"` calls
- Database PVCs use `storageClass: proxmox-lvm` (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs); see the CNPG sketch below
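A minimal sketch of how a CNPG cluster selects `proxmox-lvm` for its data volumes; the cluster name, instance count, and size are illustrative, not taken from the actual stacks.

```yaml
# Illustrative CNPG Cluster; only storage.storageClass reflects the convention above.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-pg             # hypothetical name
spec:
  instances: 2                 # illustrative
  storage:
    size: 10Gi                 # illustrative size
    storageClass: proxmox-lvm  # block storage, single CoW layer
```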
### NFS Export Management

NFS exports are NOT managed by Terraform. To add a new service:

- SSH to TrueNAS: `ssh root@10.0.10.15`
- Edit `/root/secrets/nfs_exports.sh`
- Add a dataset + export entry: `create_nfs_export "main/<service>" "/mnt/main/<service>"`
- Run the script: `/root/secrets/nfs_exports.sh`
- Verify: `showmount -e 10.0.10.15`
## Decisions & Rationale

### Why NFS for Most Workloads?

- **Simplicity**: No volume provisioning delays, instant mounts
- **RWX support**: Multiple pods can share one volume (Nextcloud, Immich)
- **ZFS benefits**: Snapshots, compression, and dedup all work at the dataset level
- **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate block storage (now `proxmox-lvm`, previously iSCSI) for critical DBs
### Why Block Storage for Databases?

- **ACID guarantees**: Block device + local filesystem = real `fsync()`
- **Performance**: No NFS protocol overhead for random I/O
- **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster ran on iSCSI for 2+ years with zero corruption; both now run on `proxmox-lvm`
### Why SSH Driver Over API?

The democratic-csi API driver (`driver: freenas-api-iscsi`) has these issues:

- Requires TrueNAS API credentials in a plaintext ConfigMap
- Fails silently when the API schema changes between TrueNAS versions
- No retry logic on transient API errors

The SSH driver (`driver: freenas-ssh`) is simpler:

- Direct `zfs` commands, no API translation layer
- SSH key auth (Vault-managed)
- Deterministic error messages
### Why Soft Mount for NFS?

Hard mounts with the default `timeo=600` (10 minutes) cause:

- 10-minute pod startup delays if the NFS server is unreachable
- `kubectl delete pod` hangs for 10 minutes
- Kernel task hangs blocking node operations

Soft mount (`soft,timeo=30,retrans=3`) trades availability for responsiveness:

- Max 90s hang (30s × 3 retries)
- Operations return EIO after the timeout → the app can handle the error
- Acceptable for non-critical data paths

**Critical paths**: Databases use block storage (`proxmox-lvm`), not NFS, so soft mount never affects data integrity.
## Troubleshooting

### NFS Mount Hangs

**Symptom**: Pod stuck in `ContainerCreating`, `df -h` hangs on the NFS mount

**Diagnosis**:

```bash
# On K8s node
mount | grep nfs
showmount -e 10.0.10.15

# Check NFS server
ssh root@10.0.10.15
zfs list | grep main/<service>
cat /etc/exports | grep <service>
```

**Fix**:

- Verify the dataset exists: `zfs list main/<service>`
- Verify the export: `grep <service> /etc/exports`
- If missing: re-run `/root/secrets/nfs_exports.sh`
- Restart the NFS server: `service nfs-server restart`
### iSCSI Session Drops

**Symptom**: PostgreSQL/MySQL pod restarts, iSCSI reconnection loops

**Diagnosis**:

```bash
# On K8s node
iscsiadm -m session
dmesg | grep iscsi
journalctl -u iscsid -f
```

**Fix**:

- Check the TrueNAS iSCSI service: WebUI → Sharing → iSCSI → Targets
- Verify hardened timeouts: `iscsiadm -m node -o show | grep timeout`
- If defaults: re-apply cloud-init or manually update `/etc/iscsi/iscsid.conf`
- Restart the session: `iscsiadm -m node -u`, then `iscsiadm -m node -l`
### Democratic-CSI Sidecar OOMKill

**Symptom**: `kubectl describe pod` shows sidecar containers OOMKilled

**Diagnosis**:

```bash
kubectl get events -n democratic-csi | grep OOM
kubectl top pod -n democratic-csi
```

**Fix**:

- Set explicit resources in the Helm values (see "Democratic-CSI Sidecar Resources" above)
- Apply: `terragrunt apply` in `stacks/democratic-csi/`
### SQLite Corruption on NFS

**Symptom**: `database disk image is malformed`, checksum errors

**Diagnosis**:

```bash
# In pod
sqlite3 /data/db.sqlite "PRAGMA integrity_check;"
```

**Fix**: Migrate to block storage (`proxmox-lvm`)

- Create a `proxmox-lvm` PVC in the service's Terraform stack
- Restore from backup to the new volume
- Update the deployment to use the new PVC
- Delete the old NFS PVC
### Slow NFS Performance

**Symptom**: High latency on file operations, `iostat` shows NFS wait times

**Diagnosis**:

```bash
# On TrueNAS
zpool iostat -v 5
arc_summary | grep "Hit Rate"

# On K8s node
nfsiostat 5
```

**Optimization**:

- Check the ZFS ARC hit rate (should be >90%)
- Move hot datasets to the SSD pool: snapshot first, then `zfs send main/<dataset>@<snap> | zfs recv ssd/<dataset>`
- Tune the NFS mount: add `rsize=1048576,wsize=1048576` to the StorageClass `mountOptions`
## Related

- **Runbooks**: `docs/runbooks/restore-postgresql.md`, `docs/runbooks/restore-mysql.md`, `docs/runbooks/recover-nfs-mount.md`
- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using ZFS snapshots)
- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs iSCSI)