add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
# Storage Architecture
2026-04-13 14:41:15 +00:00
Last updated: 2026-04-13
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Overview
2026-04-13 14:41:15 +00:00
The cluster uses two storage backends: **Proxmox CSI** for database block storage and **Proxmox NFS** for application data.
2026-04-03 19:45:34 +03:00
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
**Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm` , which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
2026-04-03 19:45:34 +03:00
2026-04-13 14:41:15 +00:00
**NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127` . Two NFS export roots exist:
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` (name kept for compatibility) and `StorageClass: nfs-proxmox` (identical) point to the Proxmox host. Migrated from TrueNAS (10.0.10.15) which has been fully decommissioned.
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup` , LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc).
2026-04-03 19:45:34 +03:00
2026-04-13 14:41:15 +00:00
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver has been removed.
**Migration (2026-04)**: TrueNAS (10.0.10.15) fully decommissioned. All NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/` . Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/` .
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Architecture Diagram
```mermaid
graph TB
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
sdc["sdc: 10.7TB RAID1 HDD< br / > VG pve, LV data (thin pool)< br / > 65 proxmox-lvm PVCs"]
sda["sda: 1.1TB RAID1 SAS< br / > VG backup, LV data (ext4)< br / > /mnt/backup"]
2026-04-13 14:41:15 +00:00
NFS_HDD["LV pve/nfs-data (2TB ext4)< br / > /srv/nfs< br / > ~100 NFS shares< br / > Media + backup targets"]
NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)< br / > /srv/nfs-ssd< br / > High-performance data< br / > (Immich ML)"]
NFS_Exports["NFS Exports< br / > managed by /etc/exports"]
NFS_HDD --> NFS_Exports
NFS_SSD --> NFS_Exports
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
end
subgraph K8s["Kubernetes Cluster"]
2026-04-13 14:41:15 +00:00
CSI_NFS["nfs-csi driver< br / > StorageClass: nfs-truenas / nfs-proxmox< br / > soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin< br / > StorageClass: proxmox-lvm"]
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
NFS_PV["NFS PersistentVolumes< br / > RWX, ~100 volumes"]
2026-04-13 14:41:15 +00:00
Block_PV["Block PersistentVolumes< br / > RWO, 65 PVCs"]
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
Pods["Application Pods"]
DBPods["Database Pods< br / > PostgreSQL CNPG< br / > MySQL InnoDB"]
end
2026-04-13 14:41:15 +00:00
NFS_Exports -->|NFS mount| CSI_NFS
sdc -->|LVM-thin hotplug| CSI_PVE
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
CSI_NFS --> NFS_PV
2026-04-13 14:41:15 +00:00
CSI_PVE --> Block_PV
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
NFS_PV --> Pods
2026-04-13 14:41:15 +00:00
Block_PV --> DBPods
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
style Proxmox fill:#e1f5ff
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
style K8s fill:#fff4e1
2026-04-13 14:41:15 +00:00
style NFS_HDD fill:#c8e6c9
style NFS_SSD fill:#ffe0b2
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
```
## Components
| Component | Version/Config | Location | Purpose |
|-----------|---------------|----------|---------|
2026-04-03 19:45:34 +03:00
| **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
| **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Databases and stateful apps |
2026-04-13 14:41:15 +00:00
| Proxmox NFS (HDD) | LV `pve/nfs-data` , 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data` , 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
2026-04-03 19:45:34 +03:00
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
2026-04-13 14:41:15 +00:00
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | NFS storage (name kept for compatibility, points to Proxmox) |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage (identical to nfs-truenas) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED** | Was VMID 9000 at 10.0.10.15 | Replaced by Proxmox NFS (2026-04) |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
| ~~StorageClass `iscsi-truenas`~~ | **REMOVED** | Was cluster-wide | Replaced by `proxmox-lvm` |
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## How It Works
### NFS Storage Flow
2026-04-13 14:41:15 +00:00
1. **Directory creation** : NFS share directories are created under `/srv/nfs/<service>` (HDD) or `/srv/nfs-ssd/<service>` (SSD) on the Proxmox host
2. **Export configuration** : `/etc/exports` on the Proxmox host lists per-directory NFS exports
3. **Terraform module** : Stacks use `modules/kubernetes/nfs_volume/` to declaratively create static PV + PVC pairs:
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
```hcl
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume"
name = "immich-data"
namespace = kubernetes_namespace.immich.metadata[0].name
2026-04-13 14:41:15 +00:00
nfs_server = var.nfs_server # 192.168.1.127
nfs_path = "/srv/nfs/immich"
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
}
```
2026-04-13 14:41:15 +00:00
4. **Pod mount** : Applications reference PVCs in their deployment specs
5. **Mount options** : All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>` .
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` or `nfs-proxmox` StorageClass via PVCs.
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-03 19:45:34 +03:00
### Block Storage Flow (Proxmox CSI) — NEW
1. **PVC creation** : Pod requests a PVC with `storageClass: proxmox-lvm`
2. **CSI provisioning** : Proxmox CSI plugin calls the Proxmox API to create a thin LV in the `local-lvm` storage
3. **SCSI hotplug** : The thin LV is hotplugged as a VirtIO-SCSI disk directly into the K8s node VM
4. **Filesystem** : CSI formats the disk as ext4 and mounts it into the pod
5. **Exclusive access** : RWO only — disk is attached to one VM at a time
6. **Topology** : Nodes are labeled with `topology.kubernetes.io/region=pve` and `zone=pve` for scheduling
**Key advantage**: Single CoW layer (LVM-thin only). No ZFS, no iSCSI network hop, no double-CoW corruption.
**Proxmox API token**: `csi@pve!csi-token` with CSI role (`VM.Audit VM.Config.Disk Datastore.Allocate Datastore.AllocateSpace Datastore.Audit` ). Stored in Vault at `secret/viktor` .
### iSCSI Storage Flow (DEPRECATED — replaced 2026-04-02)
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-03 19:45:34 +03:00
> **This section is historical.** All iSCSI PVCs have been migrated to Proxmox CSI (`proxmox-lvm`). The democratic-csi iSCSI driver is pending removal.
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-03 19:45:34 +03:00
1. ~~Zvol creation: democratic-csi creates ZFS zvols under `main/iscsi/<pvc-name>` via SSH commands~~
2. ~~Target setup: TrueNAS iSCSI service exposes zvols as iSCSI LUNs~~
3. ~~Initiator connection: K8s nodes connect via open-iscsi~~
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### SQLite on NFS — Why It Fails
SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantics break this:
- Soft mount returns success even if data is still in client cache
- Network blips during fsync → incomplete writes → corruption
- WAL mode helps but doesn't eliminate the race
2026-04-03 19:45:34 +03:00
**Solution**: Use Proxmox CSI (`proxmox-lvm` ) for any SQLite database (Vaultwarden, plotting-book) or local disk (ephemeral).
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
### ~~Democratic-CSI Sidecar Resources~~ (HISTORICAL — democratic-csi removed)
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
> Democratic-csi has been removed along with TrueNAS decommissioning (2026-04). This section is kept for historical reference only.
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Configuration
### Key Files
| Path | Purpose |
|------|---------|
2026-04-13 14:41:15 +00:00
| `/etc/exports` (on Proxmox host) | NFS export configuration for all service shares |
2026-04-03 19:45:34 +03:00
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
2026-04-13 14:41:15 +00:00
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-truenas` , `nfs-proxmox` ) |
| `modules/kubernetes/nfs_volume/` | Reusable module for static NFS PV/PVC creation |
| `config.tfvars` | Variable `nfs_server = "192.168.1.127"` shared by all stacks |
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### Vault Paths
| Path | Contents |
|------|----------|
2026-04-13 14:41:15 +00:00
| ~~`secret/viktor/truenas_ssh_key`~~ | **LEGACY** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned) |
| ~~`secret/viktor/truenas_root_password`~~ | **LEGACY** — was TrueNAS root password (TrueNAS decommissioned) |
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### Terraform Stacks
2026-04-03 19:45:34 +03:00
- **`stacks/proxmox-csi/` **: Deploys Proxmox CSI plugin + `proxmox-lvm` StorageClass + node topology labels
2026-04-13 14:41:15 +00:00
- **`stacks/nfs-csi/` **: Deploys NFS CSI driver + StorageClasses for Proxmox NFS
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
- All application stacks reference NFS volumes via `module "nfs_<name>"` calls
2026-04-03 19:45:34 +03:00
- Database PVCs use `storageClass: proxmox-lvm` (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs)
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### NFS Export Management
NFS exports are NOT managed by Terraform. To add a new service:
2026-04-13 14:41:15 +00:00
1. SSH to Proxmox host: `ssh root@192.168.1.127`
2. Create the directory: `mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>`
3. Edit `/etc/exports` — add the export entry
4. Reload exports: `exportfs -ra`
5. Verify: `showmount -e 192.168.1.127`
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Decisions & Rationale
### Why NFS for Most Workloads?
- **Simplicity**: No volume provisioning delays, instant mounts
- **RWX support**: Multiple pods can share one volume (Nextcloud, Immich)
2026-04-13 14:41:15 +00:00
- **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate proxmox-lvm for critical DBs
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
### Why Proxmox CSI for Databases? (formerly iSCSI)
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
- **ACID guarantees**: Block device + local filesystem = real fsync
2026-04-13 14:41:15 +00:00
- **Performance**: No NFS protocol overhead for random I/O, no network hop (LVM-thin hotplug direct to VM)
- **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster both run on proxmox-lvm, zero corruption
- **Single CoW layer**: LVM-thin only, no ZFS double-CoW issues
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### Why Soft Mount for NFS?
Hard mounts with default `timeo=600` (10 minutes) cause:
- 10-minute pod startup delays if NFS server is unreachable
- `kubectl delete pod` hangs for 10 minutes
- Kernel task hangs blocking node operations
Soft mount (`soft,timeo=30,retrans=3` ) trades availability for responsiveness:
- Max 90s hang (30s × 3 retries)
- Operations return EIO after timeout → app can handle error
- Acceptable for non-critical data paths
2026-04-13 14:41:15 +00:00
**Critical paths**: Databases use proxmox-lvm (not NFS), so soft mount never affects data integrity.
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Troubleshooting
### NFS Mount Hangs
**Symptom**: Pod stuck in `ContainerCreating` , `df -h` hangs on NFS mount
**Diagnosis**:
```bash
# On K8s node
mount | grep nfs
2026-04-13 14:41:15 +00:00
showmount -e 192.168.1.127
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
# Check NFS server (Proxmox host)
ssh root@192 .168.1.127
ls -la /srv/nfs/< service >
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
cat /etc/exports | grep < service >
```
**Fix**:
2026-04-13 14:41:15 +00:00
1. Verify directory exists: `ls /srv/nfs/<service>` (or `/srv/nfs-ssd/<service>` )
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2. Verify export: `grep <service> /etc/exports`
2026-04-13 14:41:15 +00:00
3. If missing: add to `/etc/exports` and run `exportfs -ra`
4. Restart NFS server: `systemctl restart nfs-server`
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
### ~~iSCSI Session Drops~~ (HISTORICAL — iSCSI removed)
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2026-04-13 14:41:15 +00:00
> iSCSI was replaced by Proxmox CSI (2026-04-02) and TrueNAS has been decommissioned. This section is kept for historical reference only.
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
### SQLite Corruption on NFS
**Symptom**: `database disk image is malformed` , checksum errors
**Diagnosis**:
```bash
# In pod
sqlite3 /data/db.sqlite "PRAGMA integrity_check;"
```
2026-04-13 14:41:15 +00:00
**Fix**: Migrate to proxmox-lvm
1. Create proxmox-lvm PVC in Terraform stack
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
2. Restore from backup to new volume
3. Update deployment to use new PVC
4. Delete old NFS PVC
### Slow NFS Performance
**Symptom**: High latency on file operations, `iostat` shows NFS wait times
**Diagnosis**:
```bash
2026-04-13 14:41:15 +00:00
# On Proxmox host
ssh root@192 .168.1.127
iostat -x 5
lvs --reportformat json pve/nfs-data ssd/nfs-ssd-data
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
# On K8s node
nfsiostat 5
```
**Optimization**:
2026-04-13 14:41:15 +00:00
1. Move hot data to SSD NFS: relocate from `/srv/nfs/<service>` to `/srv/nfs-ssd/<service>` and update PV path
2. Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions`
add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
## Related
- **Runbooks**:
- `docs/runbooks/restore-postgresql.md`
- `docs/runbooks/restore-mysql.md`
- `docs/runbooks/recover-nfs-mount.md`
2026-04-13 14:41:15 +00:00
- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using LVM snapshots and Proxmox host scripts)
- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs proxmox-lvm)