deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]

- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Viktor Barzin, 2026-04-13 14:41:15 +00:00
parent 69248eaa7b · commit 38d51ab0af
20 changed files with 245 additions and 524 deletions
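A quick way to sanity-check the "Zero PVs remain on TrueNAS" claim from the summary (assumes a kubeconfig pointed at the cluster; the grep is a blunt filter, but any hit warrants a look):

```bash
# Count remaining references to the old TrueNAS IP across all PersistentVolumes.
# Expected result after this commit: 0
kubectl get pv -o yaml | grep -c '10.0.10.15'
```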

--- cluster-healthcheck script ---

@@ -1002,16 +1002,16 @@ check_nfs() {
   # Try native tools first (available locally), fall back to kubectl-based check (pod environment)
   if command -v showmount &>/dev/null; then
-    if showmount -e 10.0.10.15 &>/dev/null; then
-      pass "NFS server 10.0.10.15 reachable (exports listed)"
+    if showmount -e 192.168.1.127 &>/dev/null; then
+      pass "NFS server 192.168.1.127 reachable (exports listed)"
       json_add "nfs" "PASS" "NFS reachable"
       return 0
     fi
   fi
   if command -v nc &>/dev/null; then
-    if nc -z -G 3 10.0.10.15 2049 &>/dev/null; then
-      pass "NFS server 10.0.10.15 port 2049 open"
+    if nc -z -G 3 192.168.1.127 2049 &>/dev/null; then
+      pass "NFS server 192.168.1.127 port 2049 open"
       json_add "nfs" "PASS" "NFS port open"
       return 0
     fi
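The two probes in this hunk can also be run by hand when debugging; note that `-G` (connect timeout) is BSD/macOS netcat, so substitute `-w` on Linux:

```bash
# Probe 1: ask mountd for the export list (proves the NFS server is fully up)
showmount -e 192.168.1.127
# Probe 2: bare TCP check against nfsd on 2049 (fallback when showmount is missing)
nc -z -G 3 192.168.1.127 2049   # macOS/BSD; on Linux: nc -z -w 3 192.168.1.127 2049
```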

--- reference docs ---

@@ -3,29 +3,28 @@
 Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up.
 ## NFS Volume Pattern
-Use the `nfs_volume` shared module for all NFS volumes (CSI-backed, `soft,timeo=30,retrans=3`):
+Use the `nfs_volume` shared module for all NFS volumes (creates static PVs, CSI-backed, `soft,timeo=30,retrans=3`):
 ```hcl
 module "nfs_data" {
   source    = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks
   name      = "<service>-data" # Must be globally unique (PV is cluster-scoped)
   namespace = kubernetes_namespace.<service>.metadata[0].name
-  nfs_server = var.nfs_server
-  nfs_path   = "/mnt/main/<service>"
+  nfs_server = var.nfs_server # 192.168.1.127 (Proxmox host)
+  nfs_path   = "/srv/nfs/<service>" # HDD NFS, or "/srv/nfs-ssd/<service>" for SSD
 }
 # In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name }
 ```
+**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths (from the TrueNAS era); a compatibility path on the Proxmox host keeps them working. New PVs should use `/srv/nfs/` or `/srv/nfs-ssd/`.
 **DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever.
 ## Adding NFS Exports
-1. Create dir on TrueNAS: `ssh root@10.0.10.15 "mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>"`
-2. Edit `secrets/nfs_directories.txt` — add path, keep sorted
-3. Run `secrets/nfs_exports.sh` from `secrets/`
-4. If any path doesn't exist on TrueNAS, the API rejects the entire update.
+1. Create dir on Proxmox host: `ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>"`
+2. Edit `/etc/exports` on the Proxmox host — add the export entry
+3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
+4. Verify: `showmount -e 192.168.1.127`
-## iSCSI Storage (Databases)
-**StorageClass**: `iscsi-truenas` (democratic-csi, `freenas-iscsi` SSH driver — NOT `freenas-api-iscsi`).
-Used by: PostgreSQL (CNPG), MySQL (InnoDB Cluster). ZFS: `main/iscsi` (zvols), `main/iscsi-snaps`.
-All K8s nodes have `open-iscsi` + `iscsid` running.
+## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
+> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
 ## Anti-AI Scraping (5-Layer Defense)
 Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
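A worked pass through the new "Adding NFS Exports" steps above, using a hypothetical service name (`frigate`); the export options are copied from the existing `/srv/nfs` export documented in the architecture docs, minus `fsid`, which must stay unique per export:

```bash
# 1. Create the directory on the Proxmox host
ssh root@192.168.1.127 "mkdir -p /srv/nfs/frigate && chmod 777 /srv/nfs/frigate"
# 2. Add the entry to /etc/exports on the host, e.g.:
#      /srv/nfs/frigate *(rw,no_subtree_check,no_root_squash,insecure)
# 3. Reload the export table without disturbing existing mounts
ssh root@192.168.1.127 "exportfs -ra"
# 4. Confirm the export is visible from a client
showmount -e 192.168.1.127 | grep frigate
```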

--- architecture docs ---

@@ -8,30 +8,10 @@
 - **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below)
 - **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
 - **iDRAC**: 192.168.1.4 (root/calvin)
-- **Disks**: 1.1TB RAID1 SAS (unused) + 931GB Samsung SSD + 10.7TB RAID1 HDD
+- **Disks**: 1.1TB RAID1 SAS (backup) + 931GB Samsung SSD + 10.7TB RAID1 HDD
+- **NFS server**: Proxmox host serves NFS directly. HDD NFS: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB). SSD NFS: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB). TrueNAS (10.0.10.15) decommissioned.
 - **Proxmox access**: `ssh root@192.168.1.127`
-## NFS Exports (Proxmox Host)
-The Proxmox host serves NFS for all workloads except Immich (which remains on TrueNAS).
-### HDD NFS
-- **LV**: `pve/nfs-data` (thin LV, 1TB)
-- **Filesystem**: ext4 (chosen over btrfs — btrfs CoW on LVM thin = double-CoW problem)
-- **Mount**: `/srv/nfs` with `noatime,commit=30`
-- **Export**: `/srv/nfs *(rw,no_subtree_check,no_root_squash,insecure,fsid=0)`
-### SSD NFS
-- **LV**: `ssd/nfs-ssd-data` (100GB)
-- **Filesystem**: ext4
-- **Mount**: `/srv/nfs-ssd` with `noatime,commit=30`
-- **Export**: `/srv/nfs-ssd *(rw,no_subtree_check,no_root_squash,insecure,fsid=1)`
-- **Current users**: Ollama (migrated from TrueNAS SSD `/mnt/ssd/ollama`)
-### Notes
-- `insecure` option required: pfSense NATs source ports >1024 when routing between VLANs
-- 21 stacks migrated from TrueNAS, only Immich (8 PVCs) remains on TrueNAS
 ## Memory Layout (updated 2026-04-01)
 ### Physical DIMM Slot Map
@@ -97,10 +77,10 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
 ## Network Topology
 ```
-10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
+10.0.10.0/24 - Management: Wizard (10.0.10.10)
 10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
                k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
-192.168.1.0/24 - Physical: Proxmox (192.168.1.127, NFS server for k8s)
+192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
 ```
 ## Network Bridges
@@ -122,7 +102,7 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
 | 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
 | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
 | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
-| 9000 | truenas | running | 16 | 8GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) — Immich only |
+| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
 **Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs
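A one-liner spot-check of the layout described in this hunk, run against the host (LV and mount names taken from the doc above):

```bash
# LV sizes, live export table, and fill level of both NFS mounts in one pass
ssh root@192.168.1.127 'lvs pve/nfs-data ssd/nfs-ssd-data; exportfs -v; df -h /srv/nfs /srv/nfs-ssd'
```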

--- nfs-health.sh ---

@@ -3,7 +3,7 @@ set -euo pipefail
 AGENT="nfs-health"
 KUBECTL="kubectl --kubeconfig /Users/viktorbarzin/code/infra/config"
-TRUENAS_HOST="10.0.10.15"
+NFS_HOST="192.168.1.127"
 NODES=("k8s-master:10.0.20.100" "k8s-node1:10.0.20.101" "k8s-node2:10.0.20.102" "k8s-node3:10.0.20.103" "k8s-node4:10.0.20.104")
 SSH_USER="wizard"
 DRY_RUN=false
@@ -21,33 +21,61 @@ add_check() {
   checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
 }
-check_truenas_reachable() {
+check_nfs_reachable() {
   if $DRY_RUN; then
-    add_check "truenas-reachable" "ok" "dry-run: would ping $TRUENAS_HOST"
+    add_check "nfs-reachable" "ok" "dry-run: would ping $NFS_HOST"
     return
   fi
-  if timeout 5 ping -c 1 "$TRUENAS_HOST" &>/dev/null; then
-    add_check "truenas-reachable" "ok" "TrueNAS at $TRUENAS_HOST is reachable"
+  if timeout 5 ping -c 1 "$NFS_HOST" &>/dev/null; then
+    add_check "nfs-reachable" "ok" "Proxmox NFS at $NFS_HOST is reachable"
   else
-    add_check "truenas-reachable" "fail" "TrueNAS at $TRUENAS_HOST is unreachable"
+    add_check "nfs-reachable" "fail" "Proxmox NFS at $NFS_HOST is unreachable"
   fi
 }
-check_truenas_nfs_service() {
+check_nfs_exports() {
   if $DRY_RUN; then
-    add_check "truenas-nfs-service" "ok" "dry-run: would check NFS service on TrueNAS"
+    add_check "nfs-exports" "ok" "dry-run: would check NFS exports on Proxmox"
     return
   fi
   local result
-  if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$TRUENAS_HOST" \
-    "service nfs-server status 2>/dev/null || systemctl is-active nfs-server 2>/dev/null || echo 'unknown'" 2>/dev/null); then
-    if echo "$result" | grep -qiE "running|active|is running"; then
-      add_check "truenas-nfs-service" "ok" "NFS service is running on TrueNAS"
+  if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
+    "exportfs -v 2>/dev/null || cat /etc/exports 2>/dev/null" 2>/dev/null); then
+    local export_count
+    export_count=$(echo "$result" | grep -c '/' || true)
+    if [ "$export_count" -gt 0 ]; then
+      add_check "nfs-exports" "ok" "$export_count NFS exports active on Proxmox"
     else
-      add_check "truenas-nfs-service" "warn" "NFS service status unclear: $(echo "$result" | head -1 | tr '"' "'")"
+      add_check "nfs-exports" "warn" "No NFS exports found on Proxmox"
     fi
   else
-    add_check "truenas-nfs-service" "fail" "Could not check NFS service on TrueNAS via SSH"
+    add_check "nfs-exports" "fail" "Could not check NFS exports on Proxmox via SSH"
   fi
 }
+check_nfs_disk_usage() {
+  if $DRY_RUN; then
+    add_check "nfs-disk" "ok" "dry-run: would check NFS disk usage"
+    return
+  fi
+  local result
+  if result=$(timeout 10 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NFS_HOST" \
+    "df -h /srv/nfs /srv/nfs-ssd 2>/dev/null" 2>/dev/null); then
+    while IFS= read -r line; do
+      local mount pct
+      mount=$(echo "$line" | awk '{print $6}')
+      pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
+      [ -z "$pct" ] || ! [[ "$pct" =~ ^[0-9]+$ ]] && continue
+      if [ "$pct" -ge 90 ]; then
+        add_check "nfs-disk-$mount" "fail" "$mount is ${pct}% full"
+      elif [ "$pct" -ge 80 ]; then
+        add_check "nfs-disk-$mount" "warn" "$mount is ${pct}% full"
+      else
+        add_check "nfs-disk-$mount" "ok" "$mount is ${pct}% full"
+      fi
+    done <<< "$result"
+  else
+    add_check "nfs-disk" "warn" "Could not check NFS disk usage"
+  fi
+}
@@ -116,8 +144,9 @@ check_nfs_pvcs() {
 }
 # Run checks
-check_truenas_reachable
-check_truenas_nfs_service
+check_nfs_reachable
+check_nfs_exports
+check_nfs_disk_usage
 for node_entry in "${NODES[@]}"; do
   node_name="${node_entry%%:*}"

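The rewritten script keeps its `--dry-run` flag, so the new checks can be exercised without touching the cluster or the NFS host. The output shape below is inferred from the `add_check` helper, not captured from a real run:

```bash
./nfs-health.sh --dry-run
# Inferred output shape:
# {"status": "ok", "agent": "nfs-health", "checks": [
#   {"name": "nfs-reachable", "status": "ok", "message": "dry-run: would ping 192.168.1.127"},
#   {"name": "nfs-exports", "status": "ok", "message": "dry-run: would check NFS exports on Proxmox"},
#   ... ]}
```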
--- truenas-status.sh (deleted) ---

@@ -1,186 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-AGENT="truenas-status"
-TRUENAS_HOST="root@10.0.10.15"
-DRY_RUN=false
-for arg in "$@"; do
-  case "$arg" in
-    --dry-run) DRY_RUN=true ;;
-  esac
-done
-checks=()
-add_check() {
-  local name="$1" status="$2" message="$3"
-  checks+=("{\"name\": \"$name\", \"status\": \"$status\", \"message\": \"$message\"}")
-}
-ssh_cmd() {
-  timeout 15 ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$TRUENAS_HOST" "$@" 2>/dev/null
-}
-check_zfs_pools() {
-  if $DRY_RUN; then
-    add_check "zfs-pools" "ok" "dry-run: would check ZFS pool status"
-    return
-  fi
-  local pool_status
-  if ! pool_status=$(ssh_cmd "zpool status -x" 2>/dev/null); then
-    add_check "zfs-pools" "fail" "Could not retrieve ZFS pool status via SSH"
-    return
-  fi
-  if echo "$pool_status" | grep -q "all pools are healthy"; then
-    add_check "zfs-pools" "ok" "All ZFS pools are healthy"
-  else
-    local degraded_pools
-    degraded_pools=$(echo "$pool_status" | grep "pool:" | awk '{print $2}' | tr '\n' ', ' | sed 's/,$//')
-    if [ -n "$degraded_pools" ]; then
-      add_check "zfs-pools" "fail" "Degraded ZFS pools: $degraded_pools"
-    else
-      add_check "zfs-pools" "warn" "ZFS pool status unclear: $(echo "$pool_status" | head -1 | tr '"' "'")"
-    fi
-  fi
-  # Check pool capacity
-  local pool_list
-  if pool_list=$(ssh_cmd "zpool list -H -o name,cap" 2>/dev/null); then
-    while IFS=$'\t' read -r pool_name cap_pct; do
-      local cap_num
-      cap_num=$(echo "$cap_pct" | tr -d '%')
-      if [ -n "$cap_num" ] && [ "$cap_num" -ge 90 ]; then
-        add_check "zfs-capacity-$pool_name" "fail" "Pool $pool_name is ${cap_pct} full"
-      elif [ -n "$cap_num" ] && [ "$cap_num" -ge 80 ]; then
-        add_check "zfs-capacity-$pool_name" "warn" "Pool $pool_name is ${cap_pct} full"
-      else
-        add_check "zfs-capacity-$pool_name" "ok" "Pool $pool_name is ${cap_pct} full"
-      fi
-    done <<< "$pool_list"
-  fi
-}
-check_smart_health() {
-  if $DRY_RUN; then
-    add_check "smart-health" "ok" "dry-run: would check SMART disk health"
-    return
-  fi
-  local disk_list
-  if ! disk_list=$(ssh_cmd "smartctl --scan" 2>/dev/null); then
-    add_check "smart-health" "warn" "Could not scan disks for SMART status"
-    return
-  fi
-  local fail_count=0
-  local total_count=0
-  local failed_disks=""
-  while IFS= read -r line; do
-    local dev
-    dev=$(echo "$line" | awk '{print $1}')
-    [ -z "$dev" ] && continue
-    total_count=$((total_count + 1))
-    local health
-    if health=$(ssh_cmd "smartctl -H '$dev'" 2>/dev/null); then
-      if ! echo "$health" | grep -qiE "PASSED|OK"; then
-        fail_count=$((fail_count + 1))
-        failed_disks="$failed_disks $dev"
-      fi
-    fi
-  done <<< "$disk_list"
-  if [ "$fail_count" -gt 0 ]; then
-    add_check "smart-health" "fail" "$fail_count/$total_count disks failing SMART:$failed_disks"
-  elif [ "$total_count" -gt 0 ]; then
-    add_check "smart-health" "ok" "All $total_count disks pass SMART health checks"
-  else
-    add_check "smart-health" "warn" "No disks found for SMART check"
-  fi
-}
-check_replication() {
-  if $DRY_RUN; then
-    add_check "replication" "ok" "dry-run: would check replication task status"
-    return
-  fi
-  # Check for any running/failed replication tasks via midclt if available
-  local repl_status
-  if repl_status=$(ssh_cmd "midclt call replication.query 2>/dev/null" 2>/dev/null); then
-    local failed
-    failed=$(echo "$repl_status" | python3 -c "
-import sys, json
-try:
-    tasks = json.load(sys.stdin)
-    failed = [t.get('name','unknown') for t in tasks if t.get('state',{}).get('state','') == 'ERROR']
-    print(len(failed))
-except: print('error')
-" 2>/dev/null || echo "error")
-    if [ "$failed" = "error" ]; then
-      add_check "replication" "warn" "Could not parse replication task status"
-    elif [ "$failed" = "0" ]; then
-      add_check "replication" "ok" "All replication tasks healthy"
-    else
-      add_check "replication" "fail" "$failed replication tasks in ERROR state"
-    fi
-  else
-    # Fallback: check if zfs send/recv processes are stuck
-    local send_procs
-    send_procs=$(ssh_cmd "pgrep -c 'zfs send' 2>/dev/null || echo 0")
-    add_check "replication" "warn" "midclt unavailable; $send_procs active zfs send processes"
-  fi
-}
-check_iscsi() {
-  if $DRY_RUN; then
-    add_check "iscsi-targets" "ok" "dry-run: would check iSCSI target status"
-    return
-  fi
-  local target_status
-  if target_status=$(ssh_cmd "ctladm islist 2>/dev/null || targetcli ls 2>/dev/null" 2>/dev/null); then
-    local target_count
-    target_count=$(echo "$target_status" | wc -l | tr -d ' ')
-    if [ "$target_count" -gt 0 ]; then
-      add_check "iscsi-targets" "ok" "iSCSI service active with $target_count entries"
-    else
-      add_check "iscsi-targets" "warn" "iSCSI service active but no targets listed"
-    fi
-  else
-    # Try checking if the service is at least running
-    if ssh_cmd "midclt call iscsi.global.config" &>/dev/null; then
-      add_check "iscsi-targets" "ok" "iSCSI service is configured and running"
-    else
-      add_check "iscsi-targets" "warn" "Could not query iSCSI target status"
-    fi
-  fi
-}
-# Run checks
-check_zfs_pools
-check_smart_health
-check_replication
-check_iscsi
-# Determine overall status
-overall="ok"
-for c in "${checks[@]}"; do
-  if echo "$c" | grep -q '"status": "fail"'; then
-    overall="fail"
-    break
-  elif echo "$c" | grep -q '"status": "warn"'; then
-    overall="warn"
-  fi
-done
-# Output JSON
-checks_json=$(IFS=,; echo "${checks[*]}")
-cat <<EOF
-{"status": "$overall", "agent": "$AGENT", "checks": [$checks_json]}
-EOF
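Of the deleted checks, only the SMART health check still has a meaningful target now that the disks live on the Proxmox host; a minimal equivalent, with an illustrative device path:

```bash
# On the Proxmox host: enumerate disks, then query each one's SMART verdict
smartctl --scan
smartctl -H /dev/sda   # repeat per scanned device; expect "PASSED" (ATA) or "OK" (SAS)
```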