[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup

- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00 · 2026-03-06 19:54:21 +00:00 · 1d80c49201
commit 1d80c49201
parent a8e07ad930
17 changed files with 378 additions and 13 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@ -61,6 +61,14 @@ For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`
 **StorageClass**: `nfs-truenas` (deployed via `stacks/platform/modules/nfs-csi/`).
 **DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever on stale mounts.

+### iSCSI Storage for Databases
+**StorageClass**: `iscsi-truenas` (deployed via `stacks/platform/modules/iscsi-csi/` using democratic-csi).
+- Used by: PostgreSQL (CNPG), MySQL (InnoDB Cluster) — any pod, any node, same data
+- Driver: `freenas-iscsi` (SSH-based, NOT `freenas-api-iscsi` which is TrueNAS SCALE only)
+- ZFS datasets: `main/iscsi` (zvols), `main/iscsi-snaps` (snapshots)
+- All K8s nodes have `open-iscsi` + `iscsid` running
+- Redis stays on `local-path` (StatefulSet `volumeClaimTemplates` are immutable)
+
 ### Adding NFS Exports
 1. **Create the directory on TrueNAS first**: `ssh root@10.0.10.15 "mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>"`
 2. Edit `secrets/nfs_directories.txt` — add path, keep sorted
--- a/.claude/cluster-health.sh
+++ b/.claude/cluster-health.sh
@ -1522,12 +1522,9 @@ main() {
    print_summary
    send_slack

-    # Exit code: 2 for failures, 1 for warnings, 0 for clean
-    if [[ "$FAIL_COUNT" -gt 0 ]]; then
-        exit 2
-    elif [[ "$WARN_COUNT" -gt 0 ]]; then
-        exit 1
-    fi
+    # Always exit 0 — reporting is done via Slack notification.
+    # Non-zero exits mark the CronJob as Failed, which triggers Prometheus
+    # JobFailed alerts, creating a circular alert loop.
    exit 0
 }