[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup

- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
Viktor Barzin 2026-03-06 19:54:21 +00:00
parent a8e07ad930
commit 1d80c49201
GPG key ID: 0EB088298288D958
17 changed files with 378 additions and 13 deletions


@@ -61,6 +61,14 @@ For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`
 **StorageClass**: `nfs-truenas` (deployed via `stacks/platform/modules/nfs-csi/`).
 **DO NOT use inline `nfs {}` blocks**: they mount with `hard,timeo=600` defaults, which hang forever on stale mounts.
+### iSCSI Storage for Databases
+**StorageClass**: `iscsi-truenas` (deployed via `stacks/platform/modules/iscsi-csi/` using democratic-csi).
+- Used by: PostgreSQL (CNPG), MySQL (InnoDB Cluster); a pod sees the same data on whichever node it runs
+- Driver: `freenas-iscsi` (SSH-based, NOT `freenas-api-iscsi`, which is TrueNAS SCALE only)
+- ZFS datasets: `main/iscsi` (zvols), `main/iscsi-snaps` (snapshots)
+- All K8s nodes have `open-iscsi` + `iscsid` running
+- Redis stays on `local-path` (StatefulSet `volumeClaimTemplates` are immutable)
 ### Adding NFS Exports
 1. **Create the directory on TrueNAS first**: `ssh root@10.0.10.15 "mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>"`
 2. Edit `secrets/nfs_directories.txt`: add path, keep sorted
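
For illustration only, a claim against the `iscsi-truenas` StorageClass documented in this hunk might be rendered as in the sketch below. The helper name `render_pvc`, the claim name, the namespace, and the size are hypothetical, and `ReadWriteOnce` is an assumption (iSCSI block volumes are normally attached to one node at a time). The repository provisions storage through Terraform and the database operators, so the actual claims are most likely generated rather than hand-written.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a PersistentVolumeClaim manifest that requests
# storage from the iscsi-truenas StorageClass. Only the StorageClass name
# comes from the commit; everything else is illustrative.
render_pvc() {
    name="$1"   # claim name (hypothetical)
    size="$2"   # requested capacity, e.g. 10Gi
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: iscsi-truenas
  resources:
    requests:
      storage: ${size}
EOF
}
```

With cluster access, the output could be piped to `kubectl`, for example `render_pvc example-db-data 10Gi | kubectl apply -n example -f -`.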


@@ -1522,12 +1522,9 @@ main() {
     print_summary
     send_slack
-    # Exit code: 2 for failures, 1 for warnings, 0 for clean
-    if [[ "$FAIL_COUNT" -gt 0 ]]; then
-        exit 2
-    elif [[ "$WARN_COUNT" -gt 0 ]]; then
-        exit 1
-    fi
+    # Always exit 0: reporting is done via Slack notification.
+    # Non-zero exits mark the CronJob as Failed, which triggers Prometheus
+    # JobFailed alerts, creating a circular alert loop.
     exit 0
 }
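
Because the script no longer signals severity through its exit code, the severity has to travel in the report itself. The sketch below shows one way that could look. It is not code from `cluster-health.sh`: the function names `severity_label` and `finish` are hypothetical, and only the `FAIL_COUNT` and `WARN_COUNT` counters come from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the failure and warning counters to a label that
# can be embedded in the Slack message instead of being encoded in $?.
severity_label() {
    fail_count="$1"
    warn_count="$2"
    if [ "$fail_count" -gt 0 ]; then
        echo "FAIL"
    elif [ "$warn_count" -gt 0 ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

# Hypothetical final step: print a one-line summary and always succeed, so the
# CronJob is never marked Failed because of what the check found.
finish() {
    label="$(severity_label "$FAIL_COUNT" "$WARN_COUNT")"
    echo "cluster health: ${label} (fail=${FAIL_COUNT}, warn=${WARN_COUNT})"
    return 0
}
```

The trade-off is that a crash of the script itself (for example, a failure before the Slack message is sent) is the only thing that still surfaces as a failed Job.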