deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]

- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-13 14:41:15 +00:00
parent 69248eaa7b
commit 38d51ab0af
20 changed files with 245 additions and 524 deletions
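The claim that zero PVs remain on TrueNAS can be checked against the live cluster. The following is a minimal sketch and not part of this commit: `list_pv_servers` and `pvs_on_server` are hypothetical helper names, and it assumes the NFS CSI driver records the server under `spec.csi.volumeAttributes.server` (inline NFS volumes use `spec.nfs.server`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the repository): find PVs that still
# point at a given NFS server.

# Print "name<TAB>csi-server<TAB>inline-nfs-server" for every PV.
# Assumes kubectl is already configured for the cluster.
list_pv_servers() {
  kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeAttributes.server}{"\t"}{.spec.nfs.server}{"\n"}{end}'
}

# Read the tab-separated listing on stdin and print the names of PVs
# whose CSI or inline NFS server equals $1.
pvs_on_server() {
  local server="$1"
  awk -F'\t' -v s="$server" '$2 == s || $3 == s { print $1 }'
}

# Usage (against a live cluster):
#   list_pv_servers | pvs_on_server 10.0.10.15
```

Empty output from the usage line would be consistent with the decommission claim; any name printed is a PV that still references the old server.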

@ -1002,16 +1002,16 @@ check_nfs() {
# Try native tools first (available locally), fall back to kubectl-based check (pod environment) # Try native tools first (available locally), fall back to kubectl-based check (pod environment)
if command -v showmount &>/dev/null; then if command -v showmount &>/dev/null; then
if showmount -e 10.0.10.15 &>/dev/null; then if showmount -e 192.168.1.127 &>/dev/null; then
pass "NFS server 10.0.10.15 reachable (exports listed)" pass "NFS server 192.168.1.127 reachable (exports listed)"
json_add "nfs" "PASS" "NFS reachable" json_add "nfs" "PASS" "NFS reachable"
return 0 return 0
fi fi
fi fi
if command -v nc &>/dev/null; then if command -v nc &>/dev/null; then
if nc -z -G 3 10.0.10.15 2049 &>/dev/null; then if nc -z -G 3 192.168.1.127 2049 &>/dev/null; then
pass "NFS server 10.0.10.15 port 2049 open" pass "NFS server 192.168.1.127 port 2049 open"
json_add "nfs" "PASS" "NFS port open" json_add "nfs" "PASS" "NFS port open"
return 0 return 0
fi fi
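The `nc -z -G 3` probe in this hunk depends on the `-G` connect-timeout flag, which appears to be specific to the macOS build of `nc`; other netcat variants generally use `-w` instead. As a hedged alternative, the sketch below probes the port with bash's `/dev/tcp` redirection. `tcp_port_open` is a hypothetical helper, not something in the healthcheck script, and it assumes the `timeout` command is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the healthcheck script): TCP reachability
# probe that avoids nc flag differences between platforms.

# tcp_port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection succeeds within the timeout, non-zero otherwise.
# Relies on bash's /dev/tcp redirection and the `timeout` command.
tcp_port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  # Host and port are passed as positional arguments so they are never
  # interpolated into the inner script text.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage:
#   tcp_port_open 192.168.1.127 2049 && echo "NFS port open"
```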

@ -3,29 +3,28 @@
Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up. Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up.
## NFS Volume Pattern ## NFS Volume Pattern
Use the `nfs_volume` shared module for all NFS volumes (CSI-backed, `soft,timeo=30,retrans=3`): Use the `nfs_volume` shared module for all NFS volumes (creates static PVs, CSI-backed, `soft,timeo=30,retrans=3`):
```hcl ```hcl
module "nfs_data" { module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks source = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks
name = "<service>-data" # Must be globally unique (PV is cluster-scoped) name = "<service>-data" # Must be globally unique (PV is cluster-scoped)
namespace = kubernetes_namespace.<service>.metadata[0].name namespace = kubernetes_namespace.<service>.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.nfs_server # 192.168.1.127 (Proxmox host)
nfs_path = "/mnt/main/<service>" nfs_path = "/srv/nfs/<service>" # HDD NFS, or "/srv/nfs-ssd/<service>" for SSD
} }
# In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name } # In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name }
``` ```
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths (from the TrueNAS era). These work via compatibility on the Proxmox host. New PVs should use `/srv/nfs/` or `/srv/nfs-ssd/`.
**DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever. **DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever.
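To confirm on a node that NFS mounts actually carry the soft-mount options described above, a check along these lines may help. This is a sketch under assumptions: `has_soft_nfs_opts` and `nfs_mounts_missing_soft` are hypothetical names, and it assumes the kernel reports the options in `/proc/mounts` with the same spelling used in the module (`soft`, `timeo=30`, `retrans=3`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the repository): verify NFS mount options.

# has_soft_nfs_opts OPTIONS
# Returns 0 if the comma-separated option string includes
# soft, timeo=30 and retrans=3.
has_soft_nfs_opts() {
  local opts=",$1,"
  [[ "$opts" == *,soft,* && "$opts" == *,timeo=30,* && "$opts" == *,retrans=3,* ]]
}

# Read a /proc/mounts-style listing on stdin and print the mount point of
# every nfs/nfs4 mount whose options do not match the expected settings.
nfs_mounts_missing_soft() {
  local dev mnt fstype opts rest
  while read -r dev mnt fstype opts rest; do
    [[ "$fstype" == nfs || "$fstype" == nfs4 ]] || continue
    has_soft_nfs_opts "$opts" || echo "$mnt"
  done
}

# Usage on a node:
#   nfs_mounts_missing_soft < /proc/mounts
```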
## Adding NFS Exports ## Adding NFS Exports
1. Create dir on TrueNAS: `ssh root@10.0.10.15 "mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>"` 1. Create dir on Proxmox host: `ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>"`
2. Edit `secrets/nfs_directories.txt` — add path, keep sorted 2. Edit `/etc/exports` on the Proxmox host — add the export entry
3. Run `secrets/nfs_exports.sh` from `secrets/` 3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
4. If any path doesn't exist on TrueNAS, the API rejects the entire update. 4. Verify: `showmount -e 192.168.1.127`
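Step 2 of the new procedure (editing `/etc/exports`) can be made idempotent so that repeating it does not duplicate entries. The sketch below is illustrative only: `ensure_export` is a hypothetical helper, and the default option string is an assumption borrowed from the `/srv/nfs` export line shown elsewhere in this diff, without the `fsid` setting.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): add an export entry once.

# ensure_export EXPORTS_FILE DIR [OPTIONS]
# Appends "DIR *(OPTIONS)" to EXPORTS_FILE unless an entry whose first
# field is exactly DIR already exists.
ensure_export() {
  local file="$1" dir="$2"
  local opts="${3:-rw,no_subtree_check,no_root_squash,insecure}"  # assumed defaults
  if awk -v d="$dir" '$1 == d { found = 1 } END { exit !found }' "$file" 2>/dev/null; then
    return 0  # entry already present, nothing to do
  fi
  printf '%s *(%s)\n' "$dir" "$opts" >> "$file"
}

# Usage on the Proxmox host (path placeholder as in the steps above):
#   ensure_export /etc/exports /srv/nfs/<service> && exportfs -ra
```

Matching on the first field rather than a substring avoids treating `/srv/nfs/foo-backup` as an existing entry for `/srv/nfs/foo`.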
## iSCSI Storage (Databases) ## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
**StorageClass**: `iscsi-truenas` (democratic-csi, `freenas-iscsi` SSH driver — NOT `freenas-api-iscsi`). > iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
Used by: PostgreSQL (CNPG), MySQL (InnoDB Cluster). ZFS: `main/iscsi` (zvols), `main/iscsi-snaps`.
All K8s nodes have `open-iscsi` + `iscsid` running.
## Anti-AI Scraping (5-Layer Defense) ## Anti-AI Scraping (5-Layer Defense)
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`. Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.

@ -8,30 +8,10 @@
- **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below) - **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1) - **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **iDRAC**: 192.168.1.4 (root/calvin) - **iDRAC**: 192.168.1.4 (root/calvin)
- **Disks**: 1.1TB RAID1 SAS (unused) + 931GB Samsung SSD + 10.7TB RAID1 HDD - **Disks**: 1.1TB RAID1 SAS (backup) + 931GB Samsung SSD + 10.7TB RAID1 HDD
- **NFS server**: Proxmox host serves NFS directly. HDD NFS: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB). SSD NFS: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB). TrueNAS (10.0.10.15) decommissioned.
- **Proxmox access**: `ssh root@192.168.1.127` - **Proxmox access**: `ssh root@192.168.1.127`
## NFS Exports (Proxmox Host)
The Proxmox host serves NFS for all workloads except Immich (which remains on TrueNAS).
### HDD NFS
- **LV**: `pve/nfs-data` (thin LV, 1TB)
- **Filesystem**: ext4 (chosen over btrfs — btrfs CoW on LVM thin = double-CoW problem)
- **Mount**: `/srv/nfs` with `noatime,commit=30`
- **Export**: `/srv/nfs *(rw,no_subtree_check,no_root_squash,insecure,fsid=0)`
### SSD NFS
- **LV**: `ssd/nfs-ssd-data` (100GB)
- **Filesystem**: ext4
- **Mount**: `/srv/nfs-ssd` with `noatime,commit=30`
- **Export**: `/srv/nfs-ssd *(rw,no_subtree_check,no_root_squash,insecure,fsid=1)`
- **Current users**: Ollama (migrated from TrueNAS SSD `/mnt/ssd/ollama`)
### Notes
- `insecure` option required: pfSense NATs source ports >1024 when routing between VLANs
- 21 stacks migrated from TrueNAS, only Immich (8 PVCs) remains on TrueNAS
## Memory Layout (updated 2026-04-01) ## Memory Layout (updated 2026-04-01)
### Physical DIMM Slot Map ### Physical DIMM Slot Map
@ -97,10 +77,10 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
## Network Topology ## Network Topology
``` ```
10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15) 10.0.10.0/24 - Management: Wizard (10.0.10.10)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10), 10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200) k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127, NFS server for k8s) 192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
``` ```
## Network Bridges ## Network Bridges
@ -122,7 +102,7 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker | | 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) | | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM | | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| 9000 | truenas | running | 16 | 8GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) — Immich only | | ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
**Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs **Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs

@ -3,7 +3,7 @@ set -euo pipefail
AGENT="nfs-health" AGENT="nfs-health"
KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config" KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
TRUENAS_HOST="10.0.10.15" NFS_HOST="192.168.1.127"
NODES=("k8s-master:10.0.20.100" "k8s-node1:10.0.20.101" "k8s-node2:10.0.20.102" "k8s-node3:10.0.20.103" "k8s-node4:10.0.20.104") NODES=("k8s-master:10.0.20.100" "k8s-node1:10.0.20.101" "k8s-node2:10.0.20.102" "k8s-node3:10.0.20.103" "k8s-node4:10.0.20.104")
SSH_USER="wizard" SSH_USER="wizard"
DRY_RUN=false DRY_RUN=false
@ -21,33 +21,61 @@ add_check() {
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}") checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
} }
check_truenas_reachable() { check_nfs_reachable() {
if $DRY_RUN; then if $DRY_RUN; then
add_check "truenas-reachable" "ok" "dry-run: would ping $TRUENAS_HOST" add_check "nfs-reachable" "ok" "dry-run: would ping $NFS_HOST"
return return
fi fi
if timeout 5 ping -c 1 "$TRUENAS_HOST" &>/dev/null; then if timeout 5 ping -c 1 "$NFS_HOST" &>/dev/null; then
add_check "truenas-reachable" "ok" "TrueNAS at $TRUENAS_HOST is reachable" add_check "nfs-reachable" "ok" "Proxmox NFS at $NFS_HOST is reachable"
else else
add_check "truenas-reachable" "fail" "TrueNAS at $TRUENAS_HOST is unreachable" add_check "nfs-reachable" "fail" "Proxmox NFS at $NFS_HOST is unreachable"
fi fi
} }
check_truenas_nfs_service() { check_nfs_exports() {
if $DRY_RUN; then if $DRY_RUN; then
add_check "truenas-nfs-service" "ok" "dry-run: would check NFS service on TrueNAS" add_check "nfs-exports" "ok" "dry-run: would check NFS exports on Proxmox"
return return
fi fi
local result local result
if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$TRUENAS_HOST" \ if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
"service nfs-server status 2>/dev/null || systemctl is-active nfs-server 2>/dev/null || echo 'unknown'" 2>/dev/null); then "exportfs -v 2>/dev/null || cat /etc/exports 2>/dev/null" 2>/dev/null); then
if echo "$result" | grep -qiE "running|active|is running"; then local export_count
add_check "truenas-nfs-service" "ok" "NFS service is running on TrueNAS" export_count=$(echo "$result" | grep -c '/' || true)
if [ "$export_count" -gt 0 ]; then
add_check "nfs-exports" "ok" "$export_count NFS exports active on Proxmox"
else else
add_check "truenas-nfs-service" "warn" "NFS service status unclear: $(echo "$result" | head -1 | tr '"' "'")" add_check "nfs-exports" "warn" "No NFS exports found on Proxmox"
fi fi
else else
add_check "truenas-nfs-service" "fail" "Could not check NFS service on TrueNAS via SSH" add_check "nfs-exports" "fail" "Could not check NFS exports on Proxmox via SSH"
fi
}
check_nfs_disk_usage() {
if $DRY_RUN; then
add_check "nfs-disk" "ok" "dry-run: would check NFS disk usage"
return
fi
local result
if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
"df -h /srv/nfs /srv/nfs-ssd 2>/dev/null" 2>/dev/null); then
while IFS= read -r line; do
local mount pct
mount=$(echo "$line" | awk '{print $6}')
pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
[ -z "$pct" ] || ! [[ "$pct" =~ ^[0-9]+$ ]] && continue
if [ "$pct" -ge 90 ]; then
add_check "nfs-disk-$mount" "fail" "$mount is ${pct}% full"
elif [ "$pct" -ge 80 ]; then
add_check "nfs-disk-$mount" "warn" "$mount is ${pct}% full"
else
add_check "nfs-disk-$mount" "ok" "$mount is ${pct}% full"
fi
done <<< "$result"
else
add_check "nfs-disk" "warn" "Could not check NFS disk usage"
fi fi
} }
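The disk-usage check above parses `df -h` line by line in a shell loop. A possible variation, sketched here under assumptions, uses `df -P` (the POSIX output format, which should keep each filesystem on a single line) and classifies usage in one `awk` pass. `classify_usage` is a hypothetical name; the 80 and 90 percent thresholds are copied from the function above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in nfs-health.sh): classify filesystem usage.

# classify_usage
# Reads `df -P` output on stdin and prints "status mount pct" per filesystem.
# Thresholds follow the check above: >=90 fail, >=80 warn, otherwise ok.
classify_usage() {
  awk 'NR > 1 {
    pct = $5
    sub(/%$/, "", pct)               # strip the trailing percent sign
    if (pct !~ /^[0-9]+$/) next      # skip anything that is not a number
    status = (pct + 0 >= 90) ? "fail" : ((pct + 0 >= 80) ? "warn" : "ok")
    print status, $6, pct
  }'
}

# Usage:
#   ssh root@192.168.1.127 "df -P /srv/nfs /srv/nfs-ssd" | classify_usage
```

Mount points containing spaces would not parse correctly with this field-based approach; the two paths used here do not have that problem.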
@ -116,8 +144,9 @@ check_nfs_pvcs() {
} }
# Run checks # Run checks
check_truenas_reachable check_nfs_reachable
check_truenas_nfs_service check_nfs_exports
check_nfs_disk_usage
for node_entry in "${NODES[@]}"; do for node_entry in "${NODES[@]}"; do
node_name="${node_entry%%:*}" node_name="${node_entry%%:*}"

@ -1,186 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
AGENT="truenas-status"
TRUENAS_HOST="root@10.0.10.15"
DRY_RUN=false
for arg in "$@"; do
case "$arg" in
--dry-run) DRY_RUN=true ;;
esac
done
checks=()
add_check() {
local name="$1" status="$2" message="$3"
checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
}
ssh_cmd() {
timeout 15 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$TRUENAS_HOST" "$@" 2>/dev/null
}
check_zfs_pools() {
if $DRY_RUN; then
add_check "zfs-pools" "ok" "dry-run: would check ZFS pool status"
return
fi
local pool_status
if ! pool_status=$(ssh_cmd "zpool status -x" 2>/dev/null); then
add_check "zfs-pools" "fail" "Could not retrieve ZFS pool status via SSH"
return
fi
if echo "$pool_status" | grep -q "all pools are healthy"; then
add_check "zfs-pools" "ok" "All ZFS pools are healthy"
else
local degraded_pools
degraded_pools=$(echo "$pool_status" | grep "pool:" | awk '{print $2}' | tr '\n' ', ' | sed 's/,$//')
if [ -n "$degraded_pools" ]; then
add_check "zfs-pools" "fail" "Degraded ZFS pools: $degraded_pools"
else
add_check "zfs-pools" "warn" "ZFS pool status unclear: $(echo "$pool_status" | head -1 | tr '"' "'")"
fi
fi
# Check pool capacity
local pool_list
if pool_list=$(ssh_cmd "zpool list -H -o name,cap" 2>/dev/null); then
while IFS=$'\t' read -r pool_name cap_pct; do
local cap_num
cap_num=$(echo "$cap_pct" | tr -d '%')
if [ -n "$cap_num" ] && [ "$cap_num" -ge 90 ]; then
add_check "zfs-capacity-$pool_name" "fail" "Pool $pool_name is ${cap_pct} full"
elif [ -n "$cap_num" ] && [ "$cap_num" -ge 80 ]; then
add_check "zfs-capacity-$pool_name" "warn" "Pool $pool_name is ${cap_pct} full"
else
add_check "zfs-capacity-$pool_name" "ok" "Pool $pool_name is ${cap_pct} full"
fi
done <<< "$pool_list"
fi
}
check_smart_health() {
if $DRY_RUN; then
add_check "smart-health" "ok" "dry-run: would check SMART disk health"
return
fi
local disk_list
if ! disk_list=$(ssh_cmd "smartctl --scan" 2>/dev/null); then
add_check "smart-health" "warn" "Could not scan disks for SMART status"
return
fi
local fail_count=0
local total_count=0
local failed_disks=""
while IFS= read -r line; do
local dev
dev=$(echo "$line" | awk '{print $1}')
[ -z "$dev" ] && continue
total_count=$((total_count + 1))
local health
if health=$(ssh_cmd "smartctl -H '$dev'" 2>/dev/null); then
if ! echo "$health" | grep -qiE "PASSED|OK"; then
fail_count=$((fail_count + 1))
failed_disks="$failed_disks $dev"
fi
fi
done <<< "$disk_list"
if [ "$fail_count" -gt 0 ]; then
add_check "smart-health" "fail" "$fail_count/$total_count disks failing SMART:$failed_disks"
elif [ "$total_count" -gt 0 ]; then
add_check "smart-health" "ok" "All $total_count disks pass SMART health checks"
else
add_check "smart-health" "warn" "No disks found for SMART check"
fi
}
check_replication() {
if $DRY_RUN; then
add_check "replication" "ok" "dry-run: would check replication task status"
return
fi
# Check for any running/failed replication tasks via midclt if available
local repl_status
if repl_status=$(ssh_cmd "midclt call replication.query 2>/dev/null" 2>/dev/null); then
local failed
failed=$(echo "$repl_status" | python3 -c "
import sys, json
try:
tasks = json.load(sys.stdin)
failed = [t.get('name','unknown') for t in tasks if t.get('state',{}).get('state','') == 'ERROR']
print(len(failed))
except: print('error')
" 2>/dev/null || echo "error")
if [ "$failed" = "error" ]; then
add_check "replication" "warn" "Could not parse replication task status"
elif [ "$failed" = "0" ]; then
add_check "replication" "ok" "All replication tasks healthy"
else
add_check "replication" "fail" "$failed replication tasks in ERROR state"
fi
else
# Fallback: check if zfs send/recv processes are stuck
local send_procs
send_procs=$(ssh_cmd "pgrep -c 'zfs send' 2>/dev/null || echo 0")
add_check "replication" "warn" "midclt unavailable; $send_procs active zfs send processes"
fi
}
check_iscsi() {
if $DRY_RUN; then
add_check "iscsi-targets" "ok" "dry-run: would check iSCSI target status"
return
fi
local target_status
if target_status=$(ssh_cmd "ctladm islist 2>/dev/null || targetcli ls 2>/dev/null" 2>/dev/null); then
local target_count
target_count=$(echo "$target_status" | wc -l | tr -d ' ')
if [ "$target_count" -gt 0 ]; then
add_check "iscsi-targets" "ok" "iSCSI service active with $target_count entries"
else
add_check "iscsi-targets" "warn" "iSCSI service active but no targets listed"
fi
else
# Try checking if the service is at least running
if ssh_cmd "midclt call iscsi.global.config" &>/dev/null; then
add_check "iscsi-targets" "ok" "iSCSI service is configured and running"
else
add_check "iscsi-targets" "warn" "Could not query iSCSI target status"
fi
fi
}
# Run checks
check_zfs_pools
check_smart_health
check_replication
check_iscsi
# Determine overall status
overall="ok"
for c in "${checks[@]}"; do
if echo "$c" | grep -q '"status": "fail"'; then
overall="fail"
break
elif echo "$c" | grep -q '"status": "warn"'; then
overall="warn"
fi
done
# Output JSON
checks_json=$(IFS=,; echo "${checks[*]}")
cat <<EOF
{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
EOF

@ -3,7 +3,7 @@
## Critical Rules (MUST FOLLOW) ## Critical Rules (MUST FOLLOW)
- **ALL changes through Terraform/Terragrunt** — NEVER `kubectl apply/edit/patch/delete` for persistent changes. Read-only kubectl is fine. - **ALL changes through Terraform/Terragrunt** — NEVER `kubectl apply/edit/patch/delete` for persistent changes. Read-only kubectl is fine.
- **NEVER put secrets in plaintext** — use `secrets.sops.json` (SOPS-encrypted) or `terraform.tfvars` (git-crypt, legacy) - **NEVER put secrets in plaintext** — use `secrets.sops.json` (SOPS-encrypted) or `terraform.tfvars` (git-crypt, legacy)
- **NEVER restart NFS on Proxmox host or TrueNAS** — causes cluster-wide mount failures across all pods - **NEVER restart NFS on the Proxmox host** — causes cluster-wide mount failures across all pods
- **NEVER commit secrets** — triple-check before every commit - **NEVER commit secrets** — triple-check before every commit
- **`[ci skip]` in commit messages** when changes were already applied locally - **`[ci skip]` in commit messages** when changes were already applied locally
- **Ask before `git push`** — always confirm with the user first - **Ask before `git push`** — always confirm with the user first
@ -59,19 +59,15 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- `scripts/cluster_healthcheck.sh` — 25-check cluster health script - `scripts/cluster_healthcheck.sh` — 25-check cluster health script
## Storage ## Storage
- **NFS — Proxmox host** (primary): HDD pool at `/srv/nfs` (ext4 thin LV `pve/nfs-data`, 1TB) and SSD pool at `/srv/nfs-ssd` (ext4 LV `ssd/nfs-ssd-data`, 100GB) on 192.168.1.127. Serves all workloads except Immich. - **NFS** (`nfs-truenas` / `nfs-proxmox` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
- `nfs-proxmox` StorageClass: Dynamic provisioning (Vault uses this). - **proxmox-lvm** (`proxmox-lvm` StorageClass): For databases (PostgreSQL, MySQL). Proxmox CSI driver.
- `nfs_volume` module: Static PVs for most services. Use this, never inline `nfs {}` blocks. - **NFS server**: Proxmox host at 192.168.1.127. HDD NFS at `/srv/nfs` (2TB ext4 LV `pve/nfs-data`), SSD NFS at `/srv/nfs-ssd` (100GB ext4 LV `ssd/nfs-ssd-data`). NFS exports managed via `/etc/exports` on the Proxmox host. TrueNAS (10.0.10.15) has been decommissioned.
- **NFS — TrueNAS** (10.0.10.15): Now **only serves Immich** (8 PVCs). `nfs-truenas` StorageClass retained for Immich only.
- **proxmox-lvm** (`proxmox-lvm` StorageClass): For databases (PostgreSQL, MySQL). TopoLVM driver.
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases. - **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state). - **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
- **NFS export directory must exist** on the NFS server before Terraform can create the PV. - **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
- **NFS `insecure` option**: Required on Proxmox host exports — pfSense NATs source ports >1024 when routing between VLANs, which NFS `secure` mode rejects.
- **CSI PV volumeAttributes are immutable**: Can't update NFS server in place. Must create new PV/PVC pairs (pattern: append `-host` to PV name).
## Shared Variables (never hardcode) ## Shared Variables (never hardcode)
`var.nfs_server` (192.168.1.127 — Proxmox host), `var.nfs_server_truenas` (10.0.10.15 — Immich only), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host` `var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Tier System ## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label. `0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
@ -100,7 +96,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
- **Add a secret**: `sops set secrets.sops.json '["key"]' '"value"'` then commit. - **Add a secret**: `sops set secrets.sops.json '["key"]' '"value"'` then commit.
- **NFS exports**: Create dir on Proxmox host (`/srv/nfs/<dir>` or `/srv/nfs-ssd/<dir>`). For Immich (TrueNAS only): add to `secrets/nfs_directories.txt`, run `secrets/nfs_exports.sh`. - **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
## Detailed Reference ## Detailed Reference
See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index. See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.

Binary file not shown.

@ -1,6 +1,6 @@
# Backup & Disaster Recovery Architecture # Backup & Disaster Recovery Architecture
Last updated: 2026-04-06 Last updated: 2026-04-13
## Overview ## Overview
@ -9,9 +9,8 @@ The homelab uses a defense-in-depth 3-2-1 backup strategy: **3 copies** (live PV
**3-2-1 Breakdown**: **3-2-1 Breakdown**:
- **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD) - **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD)
- **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS) - **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS)
- **Copy 3** (offsite): Synology NAS at 192.168.1.13 via two paths: - **Copy 3** (offsite): Synology NAS at 192.168.1.13:
- `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (rsync --files-from weekly) - `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (rsync --files-from weekly)
- `Synology/Backup/Viki/truenas/` — TrueNAS NFS media (Cloud Sync, narrowed to media only)
## Architecture Diagram ## Architecture Diagram
@ -42,9 +41,8 @@ graph TB
PVEConfig --> sda PVEConfig --> sda
end end
subgraph TrueNAS["TrueNAS (10.0.10.15)"] subgraph NFS_Storage["Proxmox NFS (/srv/nfs)"]
NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"] NFS_Backup["NFS dirs<br/>/srv/nfs/*-backup/"]
Media["Media (NFS)<br/>Immich ~800GB<br/>audiobookshelf, servarr, navidrome"]
subgraph AppBackups["App-Level Backup CronJobs"] subgraph AppBackups["App-Level Backup CronJobs"]
CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"] CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
@ -57,16 +55,13 @@ graph TB
subgraph Layer3["Layer 3: Offsite Sync"] subgraph Layer3["Layer 3: Offsite Sync"]
PVEOffsite["PVE → Synology<br/>Sunday 08:00<br/>rsync --files-from<br/>/Backup/Viki/pve-backup/"] PVEOffsite["PVE → Synology<br/>Sunday 08:00<br/>rsync --files-from<br/>/Backup/Viki/pve-backup/"]
CloudSync["TrueNAS → Synology<br/>Monday 09:00<br/>Cloud Sync (media only)<br/>/Backup/Viki/truenas/"]
end end
sda --> PVEOffsite sda --> PVEOffsite
Media --> CloudSync
Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"] Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"]
PVEOffsite --> Synology PVEOffsite --> Synology
CloudSync --> Synology
NFS_Backup -.->|mirrored to sda| NFSMirror NFS_Backup -.->|mirrored to sda| NFSMirror
@ -100,14 +95,7 @@ graph LR
S01 --> S02 --> S03a --> S03b --> S05 --> S08 S01 --> S02 --> S03a --> S03b --> S05 --> S08
subgraph Monday["Monday"]
M09["09:00 TrueNAS Cloud Sync<br/>Media → Synology"]
end
S08 -.->|next day| M09
style Sunday fill:#ffe0b2 style Sunday fill:#ffe0b2
style Monday fill:#e1f5ff
``` ```
### Physical Disk Layout ### Physical Disk Layout
@ -159,7 +147,7 @@ graph TB
Type -->|"Database"| AppBackup["Use app-level dump<br/>/mnt/backup/nfs-mirror/<service>-backup/<br/>OR Synology/pve-backup/nfs-mirror/<br/>RTO: <10 min"] Type -->|"Database"| AppBackup["Use app-level dump<br/>/mnt/backup/nfs-mirror/<service>-backup/<br/>OR Synology/pve-backup/nfs-mirror/<br/>RTO: <10 min"]
Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"] Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"]
Type -->|"Media (NFS)"| CloudSync["Use Synology backup<br/>Synology/truenas/<service>/<br/>RTO: varies by size"] Type -->|"Media (NFS)"| OffsiteMedia["Use Synology backup<br/>Synology/pve-backup/nfs-mirror/<br/>RTO: varies by size"]
style Start fill:#ffcdd2 style Start fill:#ffcdd2
style LVM fill:#c8e6c9 style LVM fill:#c8e6c9
@ -213,7 +201,7 @@ graph LR
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot | | Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy | | Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric | | Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
| TrueNAS Cloud Sync | Monday 09:00 (weekly) | TrueNAS Cloud Sync Task 1 | Media → Synology NAS | | ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup (Path 1) |
## How It Works ## How It Works
@ -251,7 +239,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
- 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes) - 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes)
**2. NFS Backup Mirror** (`/mnt/backup/nfs-mirror/`): **2. NFS Backup Mirror** (`/mnt/backup/nfs-mirror/`):
- Mount TrueNAS NFS ro → rsync DB dump dirs → unmount - Rsync DB dump dirs from Proxmox NFS (`/srv/nfs/*-backup/`)
- Covers: `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/` - Covers: `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/`
- Single copy (no rotation) — latest dump only - Single copy (no rotation) — latest dump only
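The mirror step described above (one rsync per `*-backup` directory, latest copy only) could look roughly like the following. This is a sketch, not the actual `weekly-backup` script: `plan_nfs_mirror` is a hypothetical helper that only prints the commands it would run, and the `rsync -a --delete` flags are an assumption inferred from the "single copy, no rotation" description.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the real weekly-backup script): plan the NFS mirror.

# plan_nfs_mirror SRC_ROOT DEST_ROOT
# Prints one rsync command per "<service>-backup" directory under SRC_ROOT.
# Nothing is copied; the output can be reviewed before it is executed.
plan_nfs_mirror() {
  local src_root="$1" dest_root="$2" dir
  for dir in "$src_root"/*-backup/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    printf 'rsync -a --delete %q %q\n' "$dir" "$dest_root/$(basename "$dir")/"
  done
}

# Usage on the Proxmox host:
#   plan_nfs_mirror /srv/nfs /mnt/backup/nfs-mirror
```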
@ -274,11 +262,11 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
### Layer 2b: Application-Level Backups ### Layer 2b: Application-Level Backups
K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/<service>-backup/`. K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/srv/nfs/<service>-backup/` (some legacy paths still use `/mnt/main/<service>-backup/`).
**Why needed**: LVM snapshots capture block-level state, but: **Why needed**: LVM snapshots capture block-level state, but:
- Cannot restore individual databases from a PostgreSQL snapshot - Cannot restore individual databases from a PostgreSQL snapshot
- Proxmox CSI LVs are opaque to TrueNAS (raw block devices) - Proxmox CSI LVs are opaque raw block devices
- Need point-in-time recovery for specific apps without full LVM rollback - Need point-in-time recovery for specific apps without full LVM rollback
**Daily backups (00:00-00:30)**: **Daily backups (00:00-00:30)**:
@ -331,28 +319,9 @@ Two independent paths push backups offsite:
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`. **Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
#### Path 2: TrueNAS Media (Cloud Sync) #### ~~Path 2: TrueNAS Media (Cloud Sync)~~ — DECOMMISSIONED
**Task**: TrueNAS Cloud Sync Task 1 runs `rclone sync` Monday 09:00 > TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04). Media offsite backup is now handled by the Proxmox host `offsite-sync-backup` script (Path 1) which includes NFS media directories in its manifest. The `Synology/Backup/Viki/truenas/` directory on the Synology NAS contains the last Cloud Sync snapshot and is no longer updated.
**Source**: `/mnt/main/` (NFS pool on TrueNAS)
**Destination**: `sftp://192.168.1.13/Backup/Viki/truenas`
**Scope**: Media libraries only (Immich ~800GB, audiobookshelf, servarr, navidrome music)
**Excludes** (Cloud Sync configured to skip):
- `clickhouse/**` (2.47M files, regenerable)
- `loki/**` (68K files, regenerable)
- `prometheus/**` (covered by monthly app backup)
- `frigate/**` (ephemeral recordings)
- `audiblez/**`, `ebook2audiobook/**` (regenerable)
- `ollama/**` (chat history, low value)
- `real-estate-crawler/**` (regenerable)
- `crowdsec/**` (regenerable)
- `servarr/downloads/**` (transient)
- `ytldp/**` (replaceable)
- `iscsi/**`, `iscsi-snaps/**` (raw zvols, backed at app level)
- `*-backup/**` (already mirrored via Path 1)
**Monitoring**: Existing `CloudSyncStale`, `CloudSyncNeverRun`, `CloudSyncFailing` alerts still apply.
## Configuration ## Configuration
@ -488,12 +457,12 @@ df -h /mnt/backup
**Common causes**: **Common causes**:
- Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`) - Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`)
- LV mount failed (check `lvs pve`, `dmesg | grep backup`) - LV mount failed (check `lvs pve`, `dmesg | grep backup`)
- NFS mount failed (check `showmount -e 10.0.10.15`) - NFS mount failed (check `showmount -e 192.168.1.127`)
**Fix**: **Fix**:
1. If disk full: Clean up old weekly versions manually, adjust retention 1. If disk full: Clean up old weekly versions manually, adjust retention
2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup` 2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup`
3. If NFS failed: Check TrueNAS availability, verify exports 3. If NFS failed: Check Proxmox NFS availability (`showmount -e 192.168.1.127`), verify exports
4. Manually trigger: `systemctl start weekly-backup.service` 4. Manually trigger: `systemctl start weekly-backup.service`
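
The three common causes above can be checked in one pass before re-triggering the service. The following is a minimal sketch, not the actual `weekly-backup` script: the `preflight` and `usage_at_or_above` helper names and the 90% threshold are assumptions chosen for illustration (the real `BackupDiskFull` alert threshold is not stated here), and `df --output=pcent` assumes GNU coreutils as shipped on the Proxmox host.

```bash
#!/usr/bin/env bash
# Pre-flight sketch for the common causes listed above (hypothetical helper names).
BACKUP_MOUNT="${BACKUP_MOUNT:-/mnt/backup}"
NFS_SERVER="${NFS_SERVER:-192.168.1.127}"
DISK_THRESHOLD="${DISK_THRESHOLD:-90}"   # assumption: illustrative value only

# Return 0 when a usage figure such as " 93%" or "93" is at or above the threshold.
usage_at_or_above() {
  local pct="${1//[% ]/}" threshold="$2"
  [ "$pct" -ge "$threshold" ]
}

preflight() {
  local rc=0 pct
  if ! mountpoint -q "$BACKUP_MOUNT"; then
    echo "FAIL: $BACKUP_MOUNT is not mounted (check: lvs, dmesg | grep backup)"
    rc=1
  else
    # Last line of df output is the usage percentage for the mount.
    pct="$(df --output=pcent "$BACKUP_MOUNT" | tail -n 1)"
    if usage_at_or_above "$pct" "$DISK_THRESHOLD"; then
      echo "FAIL: $BACKUP_MOUNT at ${pct// /} (threshold ${DISK_THRESHOLD}%)"
      rc=1
    fi
  fi
  if ! showmount -e "$NFS_SERVER" >/dev/null 2>&1; then
    echo "FAIL: no NFS exports listed by $NFS_SERVER"
    rc=1
  fi
  [ "$rc" -eq 0 ] && echo "OK: preflight passed"
  return "$rc"
}

# Usage: preflight && systemctl start weekly-backup.service
```

Keeping the percentage comparison in its own function makes the threshold logic easy to exercise without a real backup disk.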
### Offsite Sync Failing ### Offsite Sync Failing
@ -531,12 +500,12 @@ kubectl logs -n dbaas job/postgresql-backup-<timestamp>
**Common causes**: **Common causes**:
- Pod OOMKilled (increase memory limit) - Pod OOMKilled (increase memory limit)
- NFS mount unavailable (check TrueNAS) - NFS mount unavailable (check Proxmox NFS)
- pg_dumpall command failed (check PostgreSQL connectivity) - pg_dumpall command failed (check PostgreSQL connectivity)
**Fix**: **Fix**:
1. If OOM: Increase `resources.limits.memory` in `stacks/dbaas/backup.tf` 1. If OOM: Increase `resources.limits.memory` in `stacks/dbaas/backup.tf`
2. If NFS: Verify mount on worker node, restart NFS server if needed 2. If NFS: Verify mount on worker node, restart NFS server on Proxmox host if needed (`systemctl restart nfs-server`)
3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas` 3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas`
### Vaultwarden Integrity Check Failing ### Vaultwarden Integrity Check Failing
@ -662,7 +631,7 @@ module "nfs_backup" {
name = "${var.service_name}-backup" name = "${var.service_name}-backup"
namespace = kubernetes_namespace.service.metadata[0].name namespace = kubernetes_namespace.service.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.nfs_server
nfs_path = "/mnt/main/${var.service_name}-backup" nfs_path = "/srv/nfs/${var.service_name}-backup"
} }
``` ```
@ -678,9 +647,9 @@ module "nfs_backup" {
│ VaultBackupStale > 8d since last success │ │ VaultBackupStale > 8d since last success │
│ VaultwardenBackupStale > 8d since last success │ │ VaultwardenBackupStale > 8d since last success │
│ RedisBackupStale > 8d since last success │ │ RedisBackupStale > 8d since last success │
CloudSyncStale > 8d since last success ~~CloudSyncStale~~ REMOVED (TrueNAS decommissioned)
CloudSyncNeverRun task never completed ~~CloudSyncNeverRun~~ REMOVED (TrueNAS decommissioned)
CloudSyncFailing task in error state ~~CloudSyncFailing~~ REMOVED (TrueNAS decommissioned)
│ VaultwardenIntegrityFail integrity_ok == 0 │ │ VaultwardenIntegrityFail integrity_ok == 0 │
│ LVMSnapshotStale > 24h since last snapshot │ │ LVMSnapshotStale > 24h since last snapshot │
│ LVMSnapshotFailing snapshot creation failed │ │ LVMSnapshotFailing snapshot creation failed │
@ -698,7 +667,7 @@ module "nfs_backup" {
- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent` - LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
- Weekly backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent` - Weekly backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp` - Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp`
- CloudSync monitor: Queries TrueNAS API every 6h, pushes `cloudsync_last_success_timestamp` - ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly - Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
**Alert routing**: **Alert routing**:
@ -738,7 +707,7 @@ module "nfs_backup" {
- — = Not needed (other layers cover it, or data is regenerable/disposable) - — = Not needed (other layers cover it, or data is regenerable/disposable)
- excluded = Too large/regenerable, not worth offsite bandwidth - excluded = Too large/regenerable, not worth offsite bandwidth
**Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media relies on TrueNAS Cloud Sync for offsite. **Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media is included in the Proxmox host weekly-backup offsite sync.
## Recovery Procedures ## Recovery Procedures
@ -761,7 +730,7 @@ Detailed runbooks in `docs/runbooks/`:
- Vault: <10 min - Vault: <10 min
- Vaultwarden: <5 min - Vaultwarden: <5 min
- etcd: <20 min (requires cluster rebuild) - etcd: <20 min (requires cluster rebuild)
- Full cluster from offsite: <4 hours (TrueNAS restore + K8s bootstrap + app deploys) - Full cluster from offsite: <4 hours (NFS restore + K8s bootstrap + app deploys)
## Related ## Related

@ -23,7 +23,6 @@ graph TB
NODE3["k8s-node3<br/>203"] NODE3["k8s-node3<br/>203"]
NODE4["k8s-node4<br/>204"] NODE4["k8s-node4<br/>204"]
REG["docker-registry<br/>220"] REG["docker-registry<br/>220"]
TN["TrueNAS<br/>9000"]
end end
subgraph Network["Network Bridges"] subgraph Network["Network Bridges"]
@ -48,7 +47,6 @@ graph TB
PF --> VMBR1_20 PF --> VMBR1_20
HA --> VMBR0 HA --> VMBR0
DEV --> VMBR1_10 DEV --> VMBR1_10
TN --> VMBR1_10
MASTER --> VMBR1_20 MASTER --> VMBR1_20
NODE1 --> VMBR1_20 NODE1 --> VMBR1_20
@ -78,7 +76,7 @@ graph TB
| Network | VLAN | CIDR | Purpose | | Network | VLAN | CIDR | Purpose |
|---------|------|------|---------| |---------|------|------|---------|
| Physical | - | 192.168.1.0/24 | Physical devices, Proxmox host (192.168.1.127) | | Physical | - | 192.168.1.0/24 | Physical devices, Proxmox host (192.168.1.127) |
| Management | 10 | 10.0.10.0/24 | Infrastructure VMs, TrueNAS, devvm | | Management | 10 | 10.0.10.0/24 | Infrastructure VMs, devvm |
| Kubernetes | 20 | 10.0.20.0/24 | K8s cluster nodes and services | | Kubernetes | 20 | 10.0.20.0/24 | K8s cluster nodes and services |
### Virtual Machine Inventory ### Virtual Machine Inventory
@ -94,7 +92,7 @@ graph TB
| 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node | | 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node | | 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry | | 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry |
| 9000 | truenas | 16 | 16GB | vmbr1:vlan10 | 10.0.10.15 | NFS storage server | | ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED** — NFS now served by Proxmox host (192.168.1.127) |
### Kubernetes Cluster ### Kubernetes Cluster
@ -103,7 +101,7 @@ graph TB
| Version | v1.34.2 | | Version | v1.34.2 |
| Nodes | 5 (1 control plane, 4 workers) | | Nodes | 5 (1 control plane, 4 workers) |
| CNI | Calico | | CNI | Calico |
| Storage | NFS (democratic-csi) + Proxmox-LVM (Proxmox CSI) | | Storage | NFS (Proxmox host, nfs-csi) + Proxmox-LVM (Proxmox CSI) |
| Ingress | Traefik v3 | | Ingress | Traefik v3 |
| Total Services | 70+ services across 5 tiers | | Total Services | 70+ services across 5 tiers |
@ -164,8 +162,8 @@ Kyverno policies automatically inject namespace labels, LimitRange, ResourceQuot
- **Headscale**: Tailscale-compatible mesh VPN control plane - **Headscale**: Tailscale-compatible mesh VPN control plane
**Storage & Security**: **Storage & Security**:
- **TrueNAS**: NFS storage backend (10.0.10.15) - **Proxmox NFS**: NFS storage served directly from Proxmox host (192.168.1.127) at `/srv/nfs` (HDD) and `/srv/nfs-ssd` (SSD)
- **democratic-csi**: Dynamic PV provisioning from TrueNAS - **Proxmox CSI**: Block storage via LVM-thin hotplug for databases
- **Vaultwarden**: Password manager - **Vaultwarden**: Password manager
- **Immich**: Photo management - **Immich**: Photo management
- **CrowdSec**: IPS/IDS with community threat intelligence - **CrowdSec**: IPS/IDS with community threat intelligence

@ -1,18 +1,24 @@
# Storage Architecture # Storage Architecture
Last updated: 2026-04-06 Last updated: 2026-04-13
## Overview ## Overview
The cluster uses two storage backends: **Proxmox CSI** for database block storage and **TrueNAS NFS** for application data. The cluster uses two storage backends: **Proxmox CSI** for database block storage and **Proxmox NFS** for application data.
**Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors. **Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
**NFS storage (TrueNAS)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and legacy app data continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`. **NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127`. Two NFS export roots exist:
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` (name kept for compatibility) and `StorageClass: nfs-proxmox` (identical) point to the Proxmox host. NFS storage was migrated from TrueNAS (10.0.10.15), which has been fully decommissioned.
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc). **Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc).
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver is deprecated and pending removal. **Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver has been removed.
**Migration (2026-04)**: TrueNAS (10.0.10.15) fully decommissioned. All NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`.
## Architecture Diagram ## Architecture Diagram
@ -21,43 +27,37 @@ graph TB
subgraph Proxmox["Proxmox Host (192.168.1.127)"] subgraph Proxmox["Proxmox Host (192.168.1.127)"]
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"] sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"] sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
end NFS_HDD["LV pve/nfs-data (2TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)<br/>/srv/nfs-ssd<br/>High-performance data<br/>(Immich ML)"]
subgraph TrueNAS["TrueNAS (10.0.10.15)<br/>VMID 9000, 16c/16GB"] NFS_Exports["NFS Exports<br/>managed by /etc/exports"]
ZFS_Main["ZFS Pool: main<br/>1.64 TiB<br/>32G + 7x256G + 1T disks"] NFS_HDD --> NFS_Exports
ZFS_SSD["ZFS Pool: ssd<br/>~256GB SSD<br/>Immich ML, PostgreSQL hot data"] NFS_SSD --> NFS_Exports
ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/&lt;service&gt;<br/>Media + backup targets"]
NFS_Datasets --> NFS_Exports["NFS Exports<br/>managed by secrets/nfs_exports.sh"]
ZFS_SSD --> SSD_Data["Immich ML models"]
end end
subgraph K8s["Kubernetes Cluster"] subgraph K8s["Kubernetes Cluster"]
CSI_NFS["democratic-csi-nfs<br/>StorageClass: nfs-truenas<br/>soft,timeo=30,retrans=3"] CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas / nfs-proxmox<br/>soft,timeo=30,retrans=3"]
CSI_iSCSI["democratic-csi-iscsi<br/>StorageClass: iscsi-truenas<br/>SSH driver"] CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm"]
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"] NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
iSCSI_PV["iSCSI PersistentVolumes<br/>RWO, ~19 volumes"] Block_PV["Block PersistentVolumes<br/>RWO, 65 PVCs"]
Pods["Application Pods"] Pods["Application Pods"]
DBPods["Database Pods<br/>PostgreSQL CNPG<br/>MySQL InnoDB"] DBPods["Database Pods<br/>PostgreSQL CNPG<br/>MySQL InnoDB"]
end end
NFS_Exports -->|CSI driver| CSI_NFS NFS_Exports -->|NFS mount| CSI_NFS
iSCSI_Targets -->|CSI driver| CSI_iSCSI sdc -->|LVM-thin hotplug| CSI_PVE
CSI_NFS --> NFS_PV CSI_NFS --> NFS_PV
CSI_iSCSI --> iSCSI_PV CSI_PVE --> Block_PV
NFS_PV --> Pods NFS_PV --> Pods
iSCSI_PV --> DBPods Block_PV --> DBPods
style TrueNAS fill:#e1f5ff style Proxmox fill:#e1f5ff
style K8s fill:#fff4e1 style K8s fill:#fff4e1
style ZFS_Main fill:#c8e6c9 style NFS_HDD fill:#c8e6c9
style ZFS_SSD fill:#ffe0b2 style NFS_SSD fill:#ffe0b2
``` ```
## Components ## Components
@ -66,36 +66,38 @@ graph TB
|-----------|---------------|----------|---------| |-----------|---------------|----------|---------|
| **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug | | **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
| **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Databases and stateful apps | | **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Databases and stateful apps |
| TrueNAS VM | VMID 9000, 16c/16GB | Proxmox host (10.0.10.15) | ZFS NFS storage server | | Proxmox NFS (HDD) | LV `pve/nfs-data`, 2TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| ZFS pool `main` | 1.64 TiB usable | 32G + 7x256G + 1T disks | NFS data for all services | | Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| ZFS pool `ssd` | ~256GB SSD | Dedicated SSD | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver | | nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | Default storage for apps | | StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | NFS storage (name kept for compatibility, points to Proxmox) |
| ~~democratic-csi-iscsi~~ | **DEPRECATED** | Namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) | | StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage (identical to nfs-truenas) |
| ~~StorageClass `iscsi-truenas`~~ | **DEPRECATED** | Cluster-wide | Replaced by `proxmox-lvm` | | TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | NFS PV/PVC factory | | ~~TrueNAS VM~~ | **DECOMMISSIONED** | Was VMID 9000 at 10.0.10.15 | Replaced by Proxmox NFS (2026-04) |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
| ~~StorageClass `iscsi-truenas`~~ | **REMOVED** | Was cluster-wide | Replaced by `proxmox-lvm` |
## How It Works ## How It Works
### NFS Storage Flow ### NFS Storage Flow
1. **Dataset creation**: NFS shares are created as ZFS datasets under `main/<service>` (e.g., `main/immich`, `main/nextcloud`) 1. **Directory creation**: NFS share directories are created under `/srv/nfs/<service>` (HDD) or `/srv/nfs-ssd/<service>` (SSD) on the Proxmox host
2. **Export configuration**: `/root/secrets/nfs_exports.sh` on TrueNAS generates `/etc/exports` with per-dataset exports (`/mnt/main/<service>`) 2. **Export configuration**: `/etc/exports` on the Proxmox host lists per-directory NFS exports
3. **CSI provisioning**: democratic-csi-nfs mounts NFS shares and creates K8s PersistentVolumes 3. **Terraform module**: Stacks use `modules/kubernetes/nfs_volume/` to declaratively create static PV + PVC pairs:
4. **Terraform module**: Stacks use `modules/kubernetes/nfs_volume/` to declaratively create PV + PVC pairs:
```hcl ```hcl
module "nfs_data" { module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-data" name = "immich-data"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server # 10.0.10.15 nfs_server = var.nfs_server # 192.168.1.127
nfs_path = "/mnt/main/immich" nfs_path = "/srv/nfs/immich"
} }
``` ```
5. **Pod mount**: Applications reference PVCs in their deployment specs 4. **Pod mount**: Applications reference PVCs in their deployment specs
6. **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs 5. **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass via PVCs. **Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` or `nfs-proxmox` StorageClass via PVCs.
### Block Storage Flow (Proxmox CSI) — NEW ### Block Storage Flow (Proxmox CSI) — NEW
@ -127,28 +129,9 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
**Solution**: Use Proxmox CSI (`proxmox-lvm`) for any SQLite database (Vaultwarden, plotting-book) or local disk (ephemeral). **Solution**: Use Proxmox CSI (`proxmox-lvm`) for any SQLite database (Vaultwarden, plotting-book) or local disk (ephemeral).
### Democratic-CSI Sidecar Resources ### ~~Democratic-CSI Sidecar Resources~~ (HISTORICAL — democratic-csi removed)
The Helm chart spawns 17 sidecar containers (driver-registrar, external-provisioner, etc.) across controller + node DaemonSet pods. Each sidecar defaults to `resources: {}`, which gets LimitRange defaults of 256Mi. > Democratic-csi has been removed along with TrueNAS decommissioning (2026-04). This section is kept for historical reference only.
**Fix**: Set explicit resources in `values.yaml`:
```yaml
csiProxy: # TOP-LEVEL key, not nested
resources:
requests:
memory: "32Mi"
limits:
memory: "32Mi"
controller:
externalProvisioner:
resources:
requests: {memory: "64Mi"}
limits: {memory: "64Mi"}
# ... repeat for all sidecars
```
Total footprint: ~1.5Gi → ~400Mi.
## Configuration ## Configuration
@ -156,25 +139,23 @@ Total footprint: ~1.5Gi → ~400Mi.
| Path | Purpose | | Path | Purpose |
|------|---------| |------|---------|
| `/root/secrets/nfs_exports.sh` | TrueNAS: generates `/etc/exports` with all service shares | | `/etc/exports` (on Proxmox host) | NFS export configuration for all service shares |
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass | | `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
| `stacks/iscsi-csi/` | **DEPRECATED** — democratic-csi iSCSI driver (pending removal) | | `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-truenas`, `nfs-proxmox`) |
| `stacks/nfs-csi/` | NFS CSI driver | | `modules/kubernetes/nfs_volume/` | Reusable module for static NFS PV/PVC creation |
| `modules/kubernetes/nfs_volume/` | Reusable module for NFS PV/PVC creation | | `config.tfvars` | Variable `nfs_server = "192.168.1.127"` shared by all stacks |
| `config.tfvars` | Variable `nfs_server = "10.0.10.15"` shared by all stacks |
### Vault Paths ### Vault Paths
| Path | Contents | | Path | Contents |
|------|----------| |------|----------|
| `secret/viktor/truenas_ssh_key` | SSH private key for democratic-csi SSH driver | | ~~`secret/viktor/truenas_ssh_key`~~ | **LEGACY** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned) |
| `secret/viktor/truenas_root_password` | TrueNAS root password (web UI access) | | ~~`secret/viktor/truenas_root_password`~~ | **LEGACY** — was TrueNAS root password (TrueNAS decommissioned) |
### Terraform Stacks ### Terraform Stacks
- **`stacks/proxmox-csi/`**: Deploys Proxmox CSI plugin + `proxmox-lvm` StorageClass + node topology labels - **`stacks/proxmox-csi/`**: Deploys Proxmox CSI plugin + `proxmox-lvm` StorageClass + node topology labels
- **`stacks/nfs-csi/`**: Deploys NFS CSI driver for TrueNAS - **`stacks/nfs-csi/`**: Deploys NFS CSI driver + StorageClasses for Proxmox NFS
- **`stacks/iscsi-csi/`**: ~~Deploys democratic-csi iSCSI driver~~**DEPRECATED**, pending removal
- All application stacks reference NFS volumes via `module "nfs_<name>"` calls - All application stacks reference NFS volumes via `module "nfs_<name>"` calls
- Database PVCs use `storageClass: proxmox-lvm` (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs) - Database PVCs use `storageClass: proxmox-lvm` (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs)
@ -182,14 +163,11 @@ Total footprint: ~1.5Gi → ~400Mi.
NFS exports are NOT managed by Terraform. To add a new service: NFS exports are NOT managed by Terraform. To add a new service:
1. SSH to TrueNAS: `ssh root@10.0.10.15` 1. SSH to Proxmox host: `ssh root@192.168.1.127`
2. Edit `/root/secrets/nfs_exports.sh` 2. Create the directory: `mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>`
3. Add dataset + export entry: 3. Edit `/etc/exports` — add the export entry
```bash 4. Reload exports: `exportfs -ra`
create_nfs_export "main/<service>" "/mnt/main/<service>" 5. Verify: `showmount -e 192.168.1.127`
```
4. Run the script: `/root/secrets/nfs_exports.sh`
5. Verify: `showmount -e 10.0.10.15`
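
The Proxmox-side steps (create the directory, add the export, reload) can be wrapped in an idempotent helper. This is a sketch under stated assumptions, not a script that exists in the repo: the `add_export` and `export_line` names are hypothetical, and the client range `10.0.20.0/24` and options `rw,sync,no_subtree_check` are placeholders; copy the actual client spec and options from an existing entry in `/etc/exports`.

```bash
#!/usr/bin/env bash
# Sketch: add one NFS export on the Proxmox host (hypothetical helper names).
NFS_ROOT="${NFS_ROOT:-/srv/nfs}"
EXPORTS_FILE="${EXPORTS_FILE:-/etc/exports}"
NFS_CLIENTS="${NFS_CLIENTS:-10.0.20.0/24}"              # assumption: Kubernetes VLAN
EXPORT_OPTS="${EXPORT_OPTS:-rw,sync,no_subtree_check}"  # assumption: placeholder options
RELOAD_CMD="${RELOAD_CMD:-exportfs -ra}"

# Print the exports(5) line for a service directory.
export_line() {
  printf '%s/%s %s(%s)\n' "$NFS_ROOT" "$1" "$NFS_CLIENTS" "$EXPORT_OPTS"
}

add_export() {
  local service="$1" dir="$NFS_ROOT/$1"
  # Same directory setup as the manual steps above.
  mkdir -p "$dir" && chmod 777 "$dir" || return 1
  # Append only when no entry for this exact directory exists yet.
  if ! awk -v d="$dir" '$1 == d { found = 1 } END { exit !found }' \
       "$EXPORTS_FILE" 2>/dev/null; then
    export_line "$service" >> "$EXPORTS_FILE"
  fi
  $RELOAD_CMD
}

# Usage (as root on the Proxmox host): add_export <service>
```

Running the helper twice for the same service leaves a single entry in the exports file, so it is safe to re-run after a partial failure.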
## Decisions & Rationale ## Decisions & Rationale
@ -197,26 +175,14 @@ NFS exports are NOT managed by Terraform. To add a new service:
- **Simplicity**: No volume provisioning delays, instant mounts - **Simplicity**: No volume provisioning delays, instant mounts
- **RWX support**: Multiple pods can share one volume (Nextcloud, Immich) - **RWX support**: Multiple pods can share one volume (Nextcloud, Immich)
- **ZFS benefits**: Snapshots, compression, dedup all work at dataset level - **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate proxmox-lvm for critical DBs
- **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate iSCSI for critical DBs
### Why iSCSI for Databases? ### Why Proxmox CSI for Databases? (formerly iSCSI)
- **ACID guarantees**: Block device + local filesystem = real fsync - **ACID guarantees**: Block device + local filesystem = real fsync
- **Performance**: No NFS protocol overhead for random I/O - **Performance**: No NFS protocol overhead for random I/O, no network hop (LVM-thin hotplug direct to VM)
- **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster both run on iSCSI, zero corruption in 2+ years - **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster both run on proxmox-lvm, zero corruption
- **Single CoW layer**: LVM-thin only, no ZFS double-CoW issues
### Why SSH Driver Over API?
The democratic-csi API driver (`driver: freenas-api-iscsi`) has these issues:
- Requires TrueNAS API credentials in plaintext ConfigMap
- Fails silently when API schema changes between TrueNAS versions
- No retry logic on transient API errors
SSH driver (`driver: freenas-ssh`) is simpler:
- Direct `zfs` commands, no API translation layer
- SSH key auth (Vault-managed)
- Deterministic error messages
### Why Soft Mount for NFS? ### Why Soft Mount for NFS?
@ -230,7 +196,7 @@ Soft mount (`soft,timeo=30,retrans=3`) trades availability for responsiveness:
- Operations return EIO after timeout → app can handle error - Operations return EIO after timeout → app can handle error
- Acceptable for non-critical data paths - Acceptable for non-critical data paths
**Critical paths**: Databases use iSCSI (not NFS), so soft mount never affects data integrity. **Critical paths**: Databases use proxmox-lvm (not NFS), so soft mount never affects data integrity.
## Troubleshooting ## Troubleshooting
@ -242,55 +208,23 @@ Soft mount (`soft,timeo=30,retrans=3`) trades availability for responsiveness:
```bash ```bash
# On K8s node # On K8s node
mount | grep nfs mount | grep nfs
showmount -e 10.0.10.15 showmount -e 192.168.1.127
# Check NFS server # Check NFS server (Proxmox host)
ssh root@10.0.10.15 ssh root@192.168.1.127
zfs list | grep main/<service> ls -la /srv/nfs/<service>
cat /etc/exports | grep <service> cat /etc/exports | grep <service>
``` ```
**Fix**: **Fix**:
1. Verify dataset exists: `zfs list main/<service>` 1. Verify directory exists: `ls /srv/nfs/<service>` (or `/srv/nfs-ssd/<service>`)
2. Verify export: `grep <service> /etc/exports` 2. Verify export: `grep <service> /etc/exports`
3. If missing: re-run `/root/secrets/nfs_exports.sh` 3. If missing: add to `/etc/exports` and run `exportfs -ra`
4. Restart NFS server: `service nfs-server restart` 4. Restart NFS server: `systemctl restart nfs-server`
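
From a K8s node, the "is it exported at all?" question can be answered by parsing `showmount -e` output before logging in to the NFS host. A minimal sketch, assuming the usual `showmount -e` layout (one header line, then one `path clients` pair per line); `export_listed` and `diagnose_export` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch: decide whether a service path is exported (hypothetical helper names).
NFS_SERVER="${NFS_SERVER:-192.168.1.127}"
NFS_ROOT="${NFS_ROOT:-/srv/nfs}"

# Succeeds when the given path is an exported directory in the
# `showmount -e` listing supplied on stdin (header line skipped).
export_listed() {
  awk -v p="$1" 'NR > 1 && $1 == p { found = 1 } END { exit !found }'
}

diagnose_export() {
  local path="$NFS_ROOT/$1" listing
  if ! listing="$(showmount -e "$NFS_SERVER" 2>/dev/null)"; then
    echo "NFS server $NFS_SERVER did not answer showmount; check nfs-server on the host"
    return 2
  fi
  if printf '%s\n' "$listing" | export_listed "$path"; then
    echo "$path is exported; inspect the mount on the node (mount | grep nfs)"
  else
    echo "$path is not exported; add it to /etc/exports and run exportfs -ra"
    return 1
  fi
}

# Usage: diagnose_export <service>
```

The exact-match comparison on the first column avoids false positives when one service name is a prefix of another.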
### iSCSI Session Drops ### ~~iSCSI Session Drops~~ (HISTORICAL — iSCSI removed)
**Symptom**: PostgreSQL/MySQL pod restarts, iSCSI reconnection loops > iSCSI was replaced by Proxmox CSI (2026-04-02) and TrueNAS has been decommissioned. This section is kept for historical reference only.
**Diagnosis**:
```bash
# On K8s node
iscsiadm -m session
dmesg | grep iscsi
journalctl -u iscsid -f
```
**Fix**:
1. Check TrueNAS iSCSI service: WebUI → Sharing → iSCSI → Targets
2. Verify hardened timeouts: `iscsiadm -m node -o show | grep timeout`
3. If defaults: re-apply cloud-init or manually update `/etc/iscsi/iscsid.conf`
4. Restart session:
```bash
iscsiadm -m node -u
iscsiadm -m node -l
```
### Democratic-CSI Sidecar OOMKill
**Symptom**: `kubectl describe pod` shows sidecar containers OOMKilled
**Diagnosis**:
```bash
kubectl get events -n democratic-csi | grep OOM
kubectl top pod -n democratic-csi
```
**Fix**:
1. Set explicit resources in Helm values (see "Democratic-CSI Sidecar Resources" above)
2. Apply: `terragrunt apply` in `stacks/democratic-csi/`
### SQLite Corruption on NFS ### SQLite Corruption on NFS
@ -302,8 +236,8 @@ kubectl top pod -n democratic-csi
sqlite3 /data/db.sqlite "PRAGMA integrity_check;" sqlite3 /data/db.sqlite "PRAGMA integrity_check;"
``` ```
**Fix**: Migrate to iSCSI **Fix**: Migrate to proxmox-lvm
1. Create iSCSI PVC in Terraform stack 1. Create proxmox-lvm PVC in Terraform stack
2. Restore from backup to new volume 2. Restore from backup to new volume
3. Update deployment to use new PVC 3. Update deployment to use new PVC
4. Delete old NFS PVC 4. Delete old NFS PVC
@ -314,18 +248,18 @@ sqlite3 /data/db.sqlite "PRAGMA integrity_check;"
**Diagnosis**: **Diagnosis**:
```bash ```bash
# On TrueNAS # On Proxmox host
zpool iostat -v 5 ssh root@192.168.1.127
arc_summary | grep "Hit Rate" iostat -x 5
lvs --reportformat json pve/nfs-data ssd/nfs-ssd-data
# On K8s node # On K8s node
nfsiostat 5 nfsiostat 5
``` ```
**Optimization**: **Optimization**:
1. Check ZFS ARC hit rate (should be >90%) 1. Move hot data to SSD NFS: relocate from `/srv/nfs/<service>` to `/srv/nfs-ssd/<service>` and update PV path
2. Move hot datasets to SSD pool: `zfs send main/<dataset> | zfs recv ssd/<dataset>` 2. Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions`
3. Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions`
## Related ## Related
@ -333,5 +267,5 @@ nfsiostat 5
- `docs/runbooks/restore-postgresql.md` - `docs/runbooks/restore-postgresql.md`
- `docs/runbooks/restore-mysql.md` - `docs/runbooks/restore-mysql.md`
- `docs/runbooks/recover-nfs-mount.md` - `docs/runbooks/recover-nfs-mount.md`
- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using ZFS snapshots) - **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using LVM snapshots and Proxmox host scripts)
- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs iSCSI) - **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs proxmox-lvm)

@ -938,15 +938,15 @@ check_kyverno() {
check_nfs() { check_nfs() {
section 20 "NFS Connectivity" section 20 "NFS Connectivity"
if showmount -e 10.0.10.15 &>/dev/null; then if showmount -e 192.168.1.127 &>/dev/null; then
pass "NFS server 10.0.10.15 reachable (exports listed)" pass "NFS server 192.168.1.127 (Proxmox) reachable (exports listed)"
json_add "nfs" "PASS" "NFS reachable" json_add "nfs" "PASS" "NFS reachable"
elif nc -z -G 3 10.0.10.15 2049 &>/dev/null; then elif nc -z -G 3 192.168.1.127 2049 &>/dev/null; then
pass "NFS server 10.0.10.15 port 2049 open" pass "NFS server 192.168.1.127 port 2049 open"
json_add "nfs" "PASS" "NFS port open" json_add "nfs" "PASS" "NFS port open"
else else
[[ "$QUIET" == true ]] && section_always 20 "NFS Connectivity" [[ "$QUIET" == true ]] && section_always 20 "NFS Connectivity"
fail "NFS server 10.0.10.15 unreachable — 30+ services depend on NFS" fail "NFS server 192.168.1.127 (Proxmox) unreachable — 30+ services depend on NFS"
json_add "nfs" "FAIL" "NFS unreachable" json_add "nfs" "FAIL" "NFS unreachable"
fi fi
} }
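
The `nc -z -G 3` fallback relies on `-G`, which is the connect-timeout flag in the `nc` shipped with macOS; other `nc` builds may not accept it and commonly use `-w` instead. Where the check has to run on both, a probe built on bash's `/dev/tcp` redirection is one alternative. This is a sketch, not part of the healthcheck script: `port_open` is a hypothetical helper, and it assumes `timeout` from GNU coreutils is installed (it is not present on a stock macOS).

```bash
#!/usr/bin/env bash
# Sketch: TCP port probe without nc (hypothetical helper, requires bash and timeout).

# Succeeds when a TCP connection to host:port is established within the
# given number of seconds (default 3).
port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  # In the child shell, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage: port_open 192.168.1.127 2049 && echo "NFS port 2049 open"
```

A refused connection and an expired timeout both yield a non-zero status, so the helper can stand in for the `nc` branch of an `if`/`elif` chain.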

@ -6,9 +6,9 @@ set -euo pipefail
# --- Configuration --- # --- Configuration ---
BACKUP_ROOT="/mnt/backup" BACKUP_ROOT="/mnt/backup"
NFS_SERVER="10.0.10.15" NFS_SERVER="192.168.1.127"
NFS_BASE="/mnt/main" NFS_BASE="/srv/nfs"
NFS_MOUNT="/mnt/nfs-truenas" NFS_MOUNT="/mnt/nfs-proxmox"
PVC_MOUNT="/tmp/pvc-mount" PVC_MOUNT="/tmp/pvc-mount"
PUSHGATEWAY="${WEEKLY_BACKUP_PUSHGATEWAY:-http://10.0.20.100:30091}" PUSHGATEWAY="${WEEKLY_BACKUP_PUSHGATEWAY:-http://10.0.20.100:30091}"
PUSHGATEWAY_JOB="weekly-backup" PUSHGATEWAY_JOB="weekly-backup"

Binary file not shown.

@ -63,11 +63,11 @@ variable "ha_sofia_token" {
} }
variable "nfs_music_server" { variable "nfs_music_server" {
type = string type = string
default = "10.0.10.15" default = "192.168.1.127"
} }
variable "nfs_music_path" { variable "nfs_music_path" {
type = string type = string
default = "/mnt/main/freedify-music" default = "/srv/nfs/freedify-music"
} }

@ -5,6 +5,7 @@ provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1" version = "3.1.1"
hashes = [ hashes = [
"h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=", "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
"h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
"zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275", "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
"zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a", "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
"zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29", "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
@ -24,6 +25,7 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.0.1" version = "3.0.1"
hashes = [ hashes = [
"h1:P0c8knzZnouTNFIRij8IS7+pqd0OKaFDYX0j4GRsiqo=", "h1:P0c8knzZnouTNFIRij8IS7+pqd0OKaFDYX0j4GRsiqo=",
"h1:vyHdH0p6bf9xp1NPePObAJkXTJb/I09FQQmmevTzZe0=",
"zh:02d55b0b2238fd17ffa12d5464593864e80f402b90b31f6e1bd02249b9727281", "zh:02d55b0b2238fd17ffa12d5464593864e80f402b90b31f6e1bd02249b9727281",
"zh:20b93a51bfeed82682b3c12f09bac3031f5bdb4977c47c97a042e4df4fb2f9ba", "zh:20b93a51bfeed82682b3c12f09bac3031f5bdb4977c47c97a042e4df4fb2f9ba",
"zh:6e14486ecfaee38c09ccf33d4fdaf791409f90795c1b66e026c226fad8bc03c7", "zh:6e14486ecfaee38c09ccf33d4fdaf791409f90795c1b66e026c226fad8bc03c7",
@ -44,6 +46,7 @@ provider "registry.terraform.io/hashicorp/vault" {
constraints = "~> 4.0" constraints = "~> 4.0"
hashes = [ hashes = [
"h1:GPfhH6dr1LY0foPBDYv9bEGifx7eSwYqFcEAOWOUxLk=", "h1:GPfhH6dr1LY0foPBDYv9bEGifx7eSwYqFcEAOWOUxLk=",
"h1:aHqgWQhDBMeZO9iUKwJYMlh4q+xNMUlMIcjRbF4d02Y=",
"zh:269ab13433f67684012ae7e15876532b0312f5d0d2002a9cf9febb1279ce5ea6", "zh:269ab13433f67684012ae7e15876532b0312f5d0d2002a9cf9febb1279ce5ea6",
"zh:4babc95bf0c40eb85005db1dc2ca403c46be4a71dd3e409db3711a56f7a5ca0e", "zh:4babc95bf0c40eb85005db1dc2ca403c46be4a71dd3e409db3711a56f7a5ca0e",
"zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",

@ -1,6 +1,6 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa # Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform { terraform {
backend "local" { backend "local" {
path = "/Users/viktorbarzin/code/infra/state/stacks/immich/terraform.tfstate" path = "/home/wizard/code/infra/state/stacks/immich/terraform.tfstate"
} }
} }

@ -17,7 +17,7 @@ variable "immich_version" {
# Change me to upgrade # Change me to upgrade
default = "v2.7.4" default = "v2.7.4"
} }
variable "nfs_server" { type = string } variable "proxmox_host" { type = string }
variable "redis_host" { type = string } variable "redis_host" { type = string }
@ -27,71 +27,70 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name tls_secret_name = var.tls_secret_name
} }
# NFS volumes for immich-server # NFS volumes on Proxmox host (migrated from TrueNAS 2026-04-13)
module "nfs_backups" { module "nfs_backups_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-backups" name = "immich-backups-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/immich/backups" nfs_path = "/srv/nfs/immich/backups"
} }
module "nfs_encoded_video" { module "nfs_encoded_video_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-encoded-video" name = "immich-encoded-video-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/immich/encoded-video" nfs_path = "/srv/nfs/immich/encoded-video"
} }
module "nfs_library" { module "nfs_library_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-library" name = "immich-library-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/immich/library" nfs_path = "/srv/nfs/immich/library"
} }
module "nfs_profile" { module "nfs_profile_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-profile" name = "immich-profile-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/immich/profile" nfs_path = "/srv/nfs/immich/profile"
} }
module "nfs_thumbs" { module "nfs_thumbs_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-thumbs" name = "immich-thumbs-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/ssd/immich/thumbs" nfs_path = "/srv/nfs-ssd/immich/thumbs"
} }
module "nfs_upload" { module "nfs_upload_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-upload" name = "immich-upload-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/immich/upload" nfs_path = "/srv/nfs/immich/upload"
} }
# NFS volume for immich-postgresql (shared with backup cronjob)
module "nfs_postgresql" { module "nfs_postgresql_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-postgresql-data" name = "immich-postgresql-data-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/main/immich/data-immich-postgresql" nfs_path = "/srv/nfs/immich/postgresql"
} }
# NFS volume for immich-machine-learning cache
module "nfs_ml_cache" { module "nfs_ml_cache_host" {
source = "../../modules/kubernetes/nfs_volume" source = "../../modules/kubernetes/nfs_volume"
name = "immich-ml-cache" name = "immich-ml-cache-host"
namespace = kubernetes_namespace.immich.metadata[0].name namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.nfs_server nfs_server = var.proxmox_host
nfs_path = "/mnt/ssd/immich/machine-learning" nfs_path = "/srv/nfs-ssd/immich/machine-learning"
} }
resource "kubernetes_namespace" "immich" { resource "kubernetes_namespace" "immich" {
@ -303,37 +302,37 @@ resource "kubernetes_deployment" "immich_server" {
volume { volume {
name = "backups" name = "backups"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_backups.claim_name claim_name = module.nfs_backups_host.claim_name
} }
} }
volume { volume {
name = "encoded-video" name = "encoded-video"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_encoded_video.claim_name claim_name = module.nfs_encoded_video_host.claim_name
} }
} }
volume { volume {
name = "library" name = "library"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_library.claim_name claim_name = module.nfs_library_host.claim_name
} }
} }
volume { volume {
name = "profile" name = "profile"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_profile.claim_name claim_name = module.nfs_profile_host.claim_name
} }
} }
volume { volume {
name = "thumbs" name = "thumbs"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_thumbs.claim_name claim_name = module.nfs_thumbs_host.claim_name
} }
} }
volume { volume {
name = "upload" name = "upload"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_upload.claim_name claim_name = module.nfs_upload_host.claim_name
} }
} }
} }
@ -478,7 +477,7 @@ resource "kubernetes_deployment" "immich-postgres" {
volume { volume {
name = "postgresql-persistent-storage" name = "postgresql-persistent-storage"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_postgresql.claim_name claim_name = module.nfs_postgresql_host.claim_name
} }
} }
} }
@ -646,7 +645,7 @@ resource "kubernetes_deployment" "immich-machine-learning" {
volume { volume {
name = "cache" name = "cache"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_ml_cache.claim_name claim_name = module.nfs_ml_cache_host.claim_name
} }
} }
} }
@ -771,7 +770,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
volume { volume {
name = "postgresql-backup" name = "postgresql-backup"
persistent_volume_claim { persistent_volume_claim {
claim_name = module.nfs_postgresql.claim_name claim_name = module.nfs_postgresql_host.claim_name
} }
} }
} }

@ -95,8 +95,8 @@ resource "kubernetes_cron_job_v1" "monitor_prom" {
} }
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Cloud Sync Monitor - check TrueNAS Cloud Sync job status, push to Pushgateway # Cloud Sync Monitor - DEPRECATED: TrueNAS decommissioned 2026-04-13
# Runs every 6h. Alert fires if no successful sync in 8 days. # TODO: Remove this resource entirely once TrueNAS VM is shut down
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
resource "kubernetes_cron_job_v1" "cloudsync_monitor" { resource "kubernetes_cron_job_v1" "cloudsync_monitor" {
metadata { metadata {
@ -123,11 +123,11 @@ resource "kubernetes_cron_job_v1" "cloudsync_monitor" {
set -euo pipefail set -euo pipefail
apk add --no-cache curl jq apk add --no-cache curl jq
# Query TrueNAS Cloud Sync tasks # Query TrueNAS Cloud Sync tasks (TrueNAS deprecated - this monitor should be removed)
RESPONSE=$(curl -sf -H "Authorization: Bearer $TRUENAS_API_KEY" \ RESPONSE=$(curl -sf -H "Authorization: Bearer $TRUENAS_API_KEY" \
"http://10.0.10.15/api/v2.0/cloudsync" 2>&1) || { "http://10.0.10.15/api/v2.0/cloudsync" 2>&1) || {
echo "ERROR: Failed to query TrueNAS API" echo "WARN: TrueNAS API unreachable (VM deprecated)"
exit 1 exit 0
} }
# Parse each task's last successful run # Parse each task's last successful run
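
The hunk above still queries `http://10.0.10.15/api/v2.0/cloudsync`, so the old address remains in the tree until this CronJob is removed. A minimal sketch for listing such leftovers, assuming a hypothetical helper `find_stale_refs` that is not part of the repository:

```shell
# Hypothetical helper: print files under a directory that still contain
# a given literal string (for example the retired TrueNAS IP).
# $1 = directory to scan, $2 = literal string to search for
find_stale_refs() {
  # -r: recurse, -l: file names only, -F: treat the pattern literally
  # so the dots in an IP address are not regex wildcards.
  grep -rlF -- "$2" "$1" 2>/dev/null | sort
}
```

Running `find_stale_refs . 10.0.10.15` from the repository root lists every file that still mentions the old server; an empty result means no references remain in the scanned tree.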

@ -1013,7 +1013,7 @@ serverFiles:
labels: labels:
severity: critical severity: critical
annotations: annotations:
summary: "Only {{ $value | printf \"%.0f\" }} node(s) have NFS activity — TrueNAS (10.0.10.15) may be down (need ≥2)" summary: "Only {{ $value | printf \"%.0f\" }} node(s) have NFS activity — Proxmox NFS (192.168.1.127) may be down (need ≥2)"
- name: K8s Health - name: K8s Health
rules: rules:
- alert: PodCrashLooping - alert: PodCrashLooping

@ -87,7 +87,7 @@ resource "kubernetes_storage_class" "nfs_truenas" {
] ]
parameters = { parameters = {
server = "192.168.1.127" server = var.nfs_server
share = "/srv/nfs" share = "/srv/nfs"
} }
} }