
Storage Architecture

Last updated: 2026-04-13

Overview

The cluster uses two storage backends: Proxmox CSI for database block storage and Proxmox NFS for application data.

Block storage (Proxmox CSI): 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use StorageClass: proxmox-lvm, which provisions thin LVs directly from the Proxmox host's local-lvm storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.

NFS storage (Proxmox host): ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (*-backup/ directories), and app data are served directly from the Proxmox host at 192.168.1.127. Two NFS export roots exist:

  • HDD NFS: /srv/nfs on ext4 LV pve/nfs-data (2TB) — bulk media and backup targets
  • SSD NFS: /srv/nfs-ssd on ext4 LV ssd/nfs-ssd-data (100GB) — high-performance data (Immich ML)

Both StorageClass: nfs-truenas (name kept for compatibility) and StorageClass: nfs-proxmox (identical) point to the Proxmox host. NFS storage was migrated from TrueNAS (10.0.10.15), which has been fully decommissioned.

Backup storage (sda): 1.1TB RAID1 SAS disk, VG backup, LV data (ext4), mounted at /mnt/backup on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc).

Migration (2026-04-02): All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). The democratic-csi iSCSI driver has been removed.

Migration (2026-04): TrueNAS (10.0.10.15) fully decommissioned. All NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under /mnt/main/ and /mnt/ssd/ moved to ext4 LVs at /srv/nfs/ and /srv/nfs-ssd/. Legacy PVs referencing /mnt/main/ paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use /srv/nfs/ and /srv/nfs-ssd/.
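
The compatibility layer for legacy paths is not managed by Terraform. A minimal sketch of what it might look like on the Proxmox host, assuming bind mounts (the host may use symlinks or per-service mounts instead):

# On the Proxmox host: keep legacy /mnt/main/... and /mnt/ssd/... paths working for older PVs
# (hypothetical sketch; the host may use symlinks or per-service bind mounts instead)
mkdir -p /mnt/main /mnt/ssd
mount --bind /srv/nfs /mnt/main
mount --bind /srv/nfs-ssd /mnt/ssd
exportfs -ra   # re-export so legacy export paths resolve to the new LVs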

Architecture Diagram

graph TB
    subgraph Proxmox["Proxmox Host (192.168.1.127)"]
        sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
        sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
        NFS_HDD["LV pve/nfs-data (2TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
        NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)<br/>/srv/nfs-ssd<br/>High-performance data<br/>(Immich ML)"]
        NFS_Exports["NFS Exports<br/>managed by /etc/exports"]
        NFS_HDD --> NFS_Exports
        NFS_SSD --> NFS_Exports
    end

    subgraph K8s["Kubernetes Cluster"]
        CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas / nfs-proxmox<br/>soft,timeo=30,retrans=3"]
        CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm"]

        NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
        Block_PV["Block PersistentVolumes<br/>RWO, 65 PVCs"]

        Pods["Application Pods"]
        DBPods["Database Pods<br/>PostgreSQL CNPG<br/>MySQL InnoDB"]
    end

    NFS_Exports -->|NFS mount| CSI_NFS
    sdc -->|LVM-thin hotplug| CSI_PVE

    CSI_NFS --> NFS_PV
    CSI_PVE --> Block_PV

    NFS_PV --> Pods
    Block_PV --> DBPods

    style Proxmox fill:#e1f5ff
    style K8s fill:#fff4e1
    style NFS_HDD fill:#c8e6c9
    style NFS_SSD fill:#ffe0b2

Components

| Component | Version/Config | Location | Purpose |
|---|---|---|---|
| Proxmox CSI plugin | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
| StorageClass proxmox-lvm | RWO, WaitForFirstConsumer | Cluster-wide | Databases and stateful apps |
| Proxmox NFS (HDD) | LV pve/nfs-data, 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV ssd/nfs-ssd-data, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass nfs-truenas | RWX, soft mount | Cluster-wide | NFS storage (name kept for compatibility, points to Proxmox) |
| StorageClass nfs-proxmox | RWX, soft mount | Cluster-wide | NFS storage (identical to nfs-truenas) |
| TF module nfs_volume | modules/kubernetes/nfs_volume/ | Infra repo | Static NFS PV/PVC factory |
| TrueNAS VM | DECOMMISSIONED | Was VMID 9000 at 10.0.10.15 | Replaced by Proxmox NFS (2026-04) |
| democratic-csi-iscsi | REMOVED | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
| StorageClass iscsi-truenas | REMOVED | Was cluster-wide | Replaced by proxmox-lvm |

How It Works

NFS Storage Flow

  1. Directory creation: NFS share directories are created under /srv/nfs/<service> (HDD) or /srv/nfs-ssd/<service> (SSD) on the Proxmox host
  2. Export configuration: /etc/exports on the Proxmox host lists per-directory NFS exports
  3. Terraform module: Stacks use modules/kubernetes/nfs_volume/ to declaratively create static PV + PVC pairs:
    module "nfs_data" {
      source     = "../../modules/kubernetes/nfs_volume"
      name       = "immich-data"
      namespace  = kubernetes_namespace.immich.metadata[0].name
      nfs_server = var.nfs_server  # 192.168.1.127
      nfs_path   = "/srv/nfs/immich"
    }
    
  4. Pod mount: Applications reference PVCs in their deployment specs
  5. Mount options: All NFS mounts use soft,timeo=30,retrans=3 (set in StorageClass) to prevent indefinite hangs

Note: Some legacy PVs still reference /mnt/main/<service> paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use /srv/nfs/<service> or /srv/nfs-ssd/<service>.

CRITICAL: Never use inline nfs {} blocks in pod specs — they mount with the kernel defaults (hard, timeo=600, i.e. 60 s per retry, retried indefinitely), so I/O and pod teardown can hang for many minutes on network issues. Always use the nfs-truenas or nfs-proxmox StorageClass via PVCs.
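
For reference, here is roughly what that StorageClass amounts to, sketched as a Terraform resource. The server/share parameter names follow the upstream csi-driver-nfs conventions and are an assumption about the actual stacks/nfs-csi/ code; the values are the ones documented above.

resource "kubernetes_storage_class" "nfs_proxmox" {
  metadata {
    name = "nfs-proxmox"
  }

  storage_provisioner = "nfs.csi.k8s.io"

  # Soft mount options set here apply to every PV using this class
  mount_options = ["soft", "timeo=30", "retrans=3"]

  parameters = {
    server = "192.168.1.127" # var.nfs_server from config.tfvars
    share  = "/srv/nfs"
  }
}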

Block Storage Flow (Proxmox CSI) — NEW

  1. PVC creation: Pod requests a PVC with storageClass: proxmox-lvm
  2. CSI provisioning: Proxmox CSI plugin calls the Proxmox API to create a thin LV in the local-lvm storage
  3. SCSI hotplug: The thin LV is hotplugged as a VirtIO-SCSI disk directly into the K8s node VM
  4. Filesystem: CSI formats the disk as ext4 and mounts it into the pod
  5. Exclusive access: RWO only — disk is attached to one VM at a time
  6. Topology: Nodes are labeled with topology.kubernetes.io/region=pve and zone=pve for scheduling

Key advantage: Single CoW layer (LVM-thin only). No ZFS, no iSCSI network hop, no double-CoW corruption.

Proxmox API token: csi@pve!csi-token with CSI role (VM.Audit VM.Config.Disk Datastore.Allocate Datastore.AllocateSpace Datastore.Audit). Stored in Vault at secret/viktor.
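
As an illustration, a standalone PVC on proxmox-lvm sketched as a Terraform resource (the name, namespace, and size are hypothetical; CNPG clusters and the MySQL/Redis Helm charts request their volumes through their own templates instead):

resource "kubernetes_persistent_volume_claim" "vaultwarden_data" {
  metadata {
    name      = "vaultwarden-data" # hypothetical name
    namespace = "vaultwarden"
  }

  spec {
    access_modes       = ["ReadWriteOnce"] # proxmox-lvm is RWO only
    storage_class_name = "proxmox-lvm"

    resources {
      requests = {
        storage = "5Gi"
      }
    }
  }

  # proxmox-lvm binds WaitForFirstConsumer, so don't block terraform apply on binding
  wait_until_bound = false
}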

iSCSI Storage Flow (DEPRECATED — replaced 2026-04-02)

This section is historical. All iSCSI PVCs have been migrated to Proxmox CSI (proxmox-lvm) and the democratic-csi iSCSI driver has been removed.

  1. Zvol creation: democratic-csi creates ZFS zvols under main/iscsi/<pvc-name> via SSH commands
  2. Target setup: TrueNAS iSCSI service exposes zvols as iSCSI LUNs
  3. Initiator connection: K8s nodes connect via open-iscsi

SQLite on NFS — Why It Fails

SQLite uses fsync() to guarantee durability. NFS's soft mount + async semantics break this:

  • Soft mount returns success even if data is still in client cache
  • Network blips during fsync → incomplete writes → corruption
  • WAL mode helps but doesn't eliminate the race

Solution: Use Proxmox CSI (proxmox-lvm) for any SQLite database (Vaultwarden, plotting-book), or a local disk for ephemeral data.

Democratic-CSI Sidecar Resources (HISTORICAL — democratic-csi removed)

Democratic-csi has been removed along with TrueNAS decommissioning (2026-04). This section is kept for historical reference only.

Configuration

Key Files

| Path | Purpose |
|---|---|
| /etc/exports (on Proxmox host) | NFS export configuration for all service shares |
| stacks/proxmox-csi/ | Terraform stack for Proxmox CSI plugin + StorageClass |
| stacks/nfs-csi/ | NFS CSI driver + StorageClasses (nfs-truenas, nfs-proxmox) |
| modules/kubernetes/nfs_volume/ | Reusable module for static NFS PV/PVC creation |
| config.tfvars | Variable nfs_server = "192.168.1.127" shared by all stacks |

Vault Paths

| Path | Contents |
|---|---|
| secret/viktor/truenas_ssh_key | LEGACY — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned) |
| secret/viktor/truenas_root_password | LEGACY — was TrueNAS root password (TrueNAS decommissioned) |

Terraform Stacks

  • stacks/proxmox-csi/: Deploys Proxmox CSI plugin + proxmox-lvm StorageClass + node topology labels
  • stacks/nfs-csi/: Deploys NFS CSI driver + StorageClasses for Proxmox NFS
  • All application stacks reference NFS volumes via module "nfs_<name>" calls
  • Database PVCs use storageClass: proxmox-lvm (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs)
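
For instance, a CNPG cluster requests its volumes through spec.storage; a minimal sketch via kubernetes_manifest (cluster name, namespace, size, and instance count are hypothetical):

resource "kubernetes_manifest" "postgres_example" {
  manifest = {
    apiVersion = "postgresql.cnpg.io/v1"
    kind       = "Cluster"
    metadata = {
      name      = "app-postgres" # hypothetical
      namespace = "app"
    }
    spec = {
      instances = 2
      storage = {
        storageClass = "proxmox-lvm"
        size         = "20Gi"
      }
    }
  }
}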

NFS Export Management

NFS exports are NOT managed by Terraform. To add a new service:

  1. SSH to Proxmox host: ssh root@192.168.1.127
  2. Create the directory: mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>
  3. Edit /etc/exports — add the export entry (see the example after these steps)
  4. Reload exports: exportfs -ra
  5. Verify: showmount -e 192.168.1.127
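
For example, adding a share for a hypothetical service (the client CIDR and export options are assumptions; match whatever the existing /etc/exports entries use):

# On the Proxmox host (192.168.1.127)
mkdir -p /srv/nfs/myservice && chmod 777 /srv/nfs/myservice

# Append an export entry (hypothetical CIDR and options)
echo '/srv/nfs/myservice 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)' >> /etc/exports

exportfs -ra
showmount -e 192.168.1.127   # the new share should appear in the list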

Decisions & Rationale

Why NFS for Most Workloads?

  • Simplicity: No volume provisioning delays, instant mounts
  • RWX support: Multiple pods can share one volume (Nextcloud, Immich)
  • Good enough: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate proxmox-lvm for critical DBs

Why Proxmox CSI for Databases? (formerly iSCSI)

  • ACID guarantees: Block device + local filesystem = real fsync
  • Performance: No NFS protocol overhead for random I/O, no network hop (LVM-thin hotplug direct to VM)
  • Tested: PostgreSQL CNPG and MySQL InnoDB Cluster both run on proxmox-lvm, zero corruption
  • Single CoW layer: LVM-thin only, no ZFS double-CoW issues

Why Soft Mount for NFS?

Hard mounts with the default timeo=600 (60 s per retry; timeo is in deciseconds) are retried indefinitely and cause:

  • 10-minute pod startup delays if NFS server is unreachable
  • kubectl delete pod hangs for 10 minutes
  • Kernel task hangs blocking node operations

Soft mount (soft,timeo=30,retrans=3) trades availability for responsiveness:

  • Failed operations abort within tens of seconds (timeo is in deciseconds, so timeo=30 ≈ 3 s per retry, with retrans=3 retransmissions) rather than hanging
  • Operations return EIO after timeout → app can handle error
  • Acceptable for non-critical data paths

Critical paths: Databases use proxmox-lvm (not NFS), so soft mount never affects data integrity.

Troubleshooting

NFS Mount Hangs

Symptom: Pod stuck in ContainerCreating, df -h hangs on NFS mount

Diagnosis:

# On K8s node
mount | grep nfs
showmount -e 192.168.1.127

# Check NFS server (Proxmox host)
ssh root@192.168.1.127
ls -la /srv/nfs/<service>
cat /etc/exports | grep <service>

Fix:

  1. Verify directory exists: ls /srv/nfs/<service> (or /srv/nfs-ssd/<service>)
  2. Verify export: grep <service> /etc/exports
  3. If missing: add to /etc/exports and run exportfs -ra
  4. Restart NFS server: systemctl restart nfs-server

iSCSI Session Drops (HISTORICAL — iSCSI removed)

iSCSI was replaced by Proxmox CSI (2026-04-02) and TrueNAS has been decommissioned. This section is kept for historical reference only.

SQLite Corruption on NFS

Symptom: database disk image is malformed, checksum errors

Diagnosis:

# In pod
sqlite3 /data/db.sqlite "PRAGMA integrity_check;"

Fix: Migrate to proxmox-lvm

  1. Create proxmox-lvm PVC in Terraform stack
  2. Restore from backup to the new volume (one approach is sketched below)
  3. Update deployment to use new PVC
  4. Delete old NFS PVC
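
If the NFS copy still passes an integrity check, one hedged way to carry out step 2 is to copy it onto the new volume from a maintenance pod that mounts both PVCs (the /old and /data mount paths are hypothetical; stop writers first):

# Inside a pod that mounts the old NFS PVC at /old and the new proxmox-lvm PVC at /data
sqlite3 /old/db.sqlite "PRAGMA integrity_check;"      # confirm the source is still readable
sqlite3 /old/db.sqlite ".backup '/data/db.sqlite'"    # consistent page-level copy onto the new volume
sqlite3 /data/db.sqlite "PRAGMA integrity_check;"     # verify the copy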

Slow NFS Performance

Symptom: High latency on file operations, iostat shows NFS wait times

Diagnosis:

# On Proxmox host
ssh root@192.168.1.127
iostat -x 5
lvs --reportformat json pve/nfs-data ssd/nfs-ssd-data

# On K8s node
nfsiostat 5

Optimization:

  1. Move hot data to SSD NFS: relocate from /srv/nfs/<service> to /srv/nfs-ssd/<service> and update PV path
  2. Tune NFS mount: add rsize=1048576,wsize=1048576 to StorageClass mountOptions

Related Documentation

  • Runbooks:
    • docs/runbooks/restore-postgresql.md
    • docs/runbooks/restore-mysql.md
    • docs/runbooks/recover-nfs-mount.md
  • Architecture: docs/architecture/backup-dr.md (backup strategy using LVM snapshots and Proxmox host scripts)
  • Reference: .claude/reference/service-catalog.md (which services use NFS vs proxmox-lvm)