From 0d8e0ca6fcc99440fa968fac430f29b7316ce2b0 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 9 May 2026 17:41:04 +0000
Subject: [PATCH] backup: fix daily-backup silent failures, add postiz pg_dump
 CronJob, reconcile docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

daily-backup ran out of its 1h budget and got SIGTERMed for 10 days straight
(Apr 30 → May 9). Each failed run left its snapshot mount stacked on
/tmp/pvc-mount, which blocked the next run from completing — root cause of
the WeeklyBackupStale alert going silent (the metric never reached its
end-of-script push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was
  hitting the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script
  start as belt-and-braces against any stuck state inherited from a prior
  crashed run
- TERM/INT trap pushes a status=2 metric so WeeklyBackupFailing fires
  instead of the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only pushed
  on success, so any ssh-to-pfsense outage went unreported until the
  PfsenseBackupStale staleness threshold elapsed)

Postiz backup CronJob: the bundled bitnami PG/Redis live on local-path (K8s
node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added
postiz-postgres-backup, which pg_dumps postiz + temporal +
temporal_visibility daily at 03:00 to /srv/nfs/postiz-backup, getting
Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway
metric received. Note: the bitnamilegacy/postgresql image is stripped (no
curl/wget/python) — switched to docker.io/library/postgres, matching the
dbaas/postgresql-backup pattern with apt-installed curl.

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs
claimed backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp).
Updated to match what's actually emitted, and added a "default-covered"
footnote to the Service Protection Matrix so the ~40 services with PVCs not
enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (the origin LV the
  backup loop was processing each time it got SIGTERMed; its snapshots
  stayed mounted, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7
---
 docs/architecture/backup-dr.md       |  20 +++-
 scripts/daily-backup.service         |   5 +-
 scripts/daily-backup.sh              |  54 ++++++++++++--
 stacks/postiz/modules/postiz/main.tf | 107 +++++++++++++++++++++++++++
 4 files changed, 174 insertions(+), 12 deletions(-)

diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md
index b307ec6c..55201417 100644
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -267,7 +267,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
 
 **Snapshot Pruning**: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive `lvm-pvc-snapshot` timer).
 
-**Monitoring**: Pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`.
+**Monitoring**: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, and `daily_backup_bytes_synced` to Pushgateway (job `daily-backup`). Alerts: `WeeklyBackupStale` (>9d on `daily_backup_last_run_timestamp`), `WeeklyBackupFailing` (`daily_backup_last_status != 0`). The metric is pushed both on clean exit AND from a `trap TERM INT` handler — a 2026-04-30 → 2026-05-09 silent-failure incident traced to systemd SIGTERMing the script before it reached its final push, leaving the alert blind.
 
 ### Layer 2b: Application-Level Backups
 
@@ -686,9 +686,11 @@ module "nfs_backup" {
 
 **Metrics sources**:
 
 - Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion
-- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
-- Daily backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
-- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp`
+- LVM snapshot script: Pushes `lvm_snapshot_last_run_timestamp`, `lvm_snapshot_last_status`, `lvm_snapshot_created_total`, `lvm_snapshot_failed_total`, `lvm_snapshot_pruned_total`, `lvm_snapshot_thinpool_free_pct` (job `lvm-pvc-snapshot`)
+- Daily backup script: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, `daily_backup_bytes_synced` (job `daily-backup`). Disk-fullness alert (`BackupDiskFull`) does NOT use a script-pushed metric; it derives from node-exporter `node_filesystem_avail_bytes{job="proxmox-host", mountpoint="/mnt/backup"}`.
+- pfSense backup (step 3 of `daily-backup`): Pushes `backup_last_run_timestamp`, `backup_last_status`, and `backup_last_success_timestamp` (only on success) under job `pfsense-backup`. Pushed in BOTH success and failure paths so `PfsenseBackupStale` doesn't go silent when SSH-to-pfsense breaks.
+- Offsite sync script: Pushes `backup_last_success_timestamp`, `offsite_sync_last_status` (job `offsite-backup-sync`)
+- Prometheus backup (sidecar in prometheus-server pod, monthly 1st-Sunday 04:00 UTC): Pushes `prometheus_backup_last_success_timestamp` (job `prometheus-backup`)
 - ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
 - Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
@@ -728,6 +730,8 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
 | NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
 | Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
 | Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
+| **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted |
+| **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS |
 | **Media (NFS)** |
 | Immich (~800GB) | — | — | — | ✓ | NFS |
 | Audiobookshelf | — | — | — | ✓ | NFS |
@@ -739,7 +743,13 @@
 - — = Not needed (other layers cover it, or data is regenerable/disposable)
 - excluded = Too large/regenerable, not worth offsite bandwidth
 
-**Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
+**Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
+
+¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count.
+
+**Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2:
+- **Postiz** PG and Redis (bundled bitnami chart) live on `local-path` (K8s node OS disk). PG covered by the postiz-postgres-backup CronJob (daily pg_dump → `/srv/nfs/postiz-backup/`, Layer 3 via offsite sync). Redis is regenerable cache — not backed up.
+- **Prometheus, Alertmanager, Pushgateway** — `monitoring` namespace excluded by policy; loss is acceptable (metrics regenerable, silences ephemeral, Pushgateway has on-disk persistence for 24h gap tolerance).
 
 ## Recovery Procedures
 
diff --git a/scripts/daily-backup.service b/scripts/daily-backup.service
index a2bf2d85..752c79dd 100644
--- a/scripts/daily-backup.service
+++ b/scripts/daily-backup.service
@@ -8,4 +8,7 @@ ExecStart=/usr/local/bin/daily-backup
 StandardOutput=journal
 StandardError=journal
 SyslogIdentifier=daily-backup
-TimeoutStartSec=3600
+# 4h budget — the snapshot mount + LUKS decrypt + rsync + sqlite scan loop
+# scales with the number of PVCs (118 today). Hit the 1h ceiling around week
+# 18 of 2026 and silently SIGTERM'd for 10 days. Bumped to 4h with margin.
+TimeoutStartSec=14400
diff --git a/scripts/daily-backup.sh b/scripts/daily-backup.sh
index a9d776a7..1d5b289a 100644
--- a/scripts/daily-backup.sh
+++ b/scripts/daily-backup.sh
@@ -21,15 +21,48 @@ warn() { log "WARN: $*" >&2; }
 die() { log "FATAL: $*" >&2; push_metrics 1 0; exit 1; }
 
 # --- Locking ---
+# Track whether we got SIGTERM/SIGINT so cleanup can push a non-success metric.
+# Without this, a systemd timeout-kill leaves WeeklyBackupFailing alerts blind:
+# the script never reaches the success push at the end and the metric goes stale
+# silently. (Root cause of 2026-04-30 → 2026-05-09 silent-failure run.)
+KILLED=""
+
 cleanup() {
-    umount "${PVC_MOUNT}" 2>/dev/null || true
+    # Recursively unmount /tmp/pvc-mount: previous SIGTERM'd runs left snapshot
+    # mounts stacked here, which made every subsequent run start with an
+    # already-occupied mountpoint and time out before reaching its own umount.
+    while mountpoint -q "${PVC_MOUNT}" 2>/dev/null; do
+        umount "${PVC_MOUNT}" 2>/dev/null || umount -l "${PVC_MOUNT}" 2>/dev/null || break
+    done
+    # Close any LUKS mappers we opened (or that were left over from a prior crash).
+    for m in /dev/mapper/pvc-snap-*; do
+        [ -e "$m" ] || continue
+        cryptsetup close "$(basename "$m")" 2>/dev/null || true
+    done
     rm -f "${LOCKFILE}"
+    if [ -n "${KILLED}" ]; then
+        # status=2 = aborted (matches lvm-pvc-snapshot's convention)
+        push_metrics 2 "${TOTAL_BYTES:-0}"
+    fi
 }
 trap cleanup EXIT
+trap 'KILLED=1; exit 143' TERM INT
+
 if ! ( set -o noclobber; echo $$ > "${LOCKFILE}" ) 2>/dev/null; then
     die "Another instance is running (PID $(cat "${LOCKFILE}" 2>/dev/null || echo unknown))"
 fi
+
+# Belt-and-braces: if a previous run was SIGTERM'd before its trap completed,
+# /tmp/pvc-mount may have stacked mounts and stale LUKS mappers. The lock above
+# guarantees we're alone, so it's safe to clean these up now.
+while mountpoint -q "${PVC_MOUNT}" 2>/dev/null; do
+    umount "${PVC_MOUNT}" 2>/dev/null || umount -l "${PVC_MOUNT}" 2>/dev/null || break
+done
+for m in /dev/mapper/pvc-snap-*; do
+    [ -e "$m" ] || continue
+    cryptsetup close "$(basename "$m")" 2>/dev/null || true
+done
+
 # --- Metrics ---
 push_metrics() {
     local status="${1:-0}" bytes="${2:-0}"
@@ -243,6 +276,7 @@ fi
 log "--- Step 3: pfsense backup ---"
 PFSENSE_DEST="${BACKUP_ROOT}/pfsense"
 DATE=$(date +%Y%m%d)
+PFSENSE_STATUS=0
 mkdir -p "${PFSENSE_DEST}"
 
 if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/dev/null; then
@@ -253,6 +287,7 @@ if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/de
     else
         warn "Failed to copy pfsense config.xml"
         STATUS=1
+        PFSENSE_STATUS=1
     fi
 
     # Full filesystem tar
@@ -264,21 +299,28 @@ if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/de
     else
         warn "Failed to tar pfsense filesystem"
         STATUS=1
+        PFSENSE_STATUS=1
     fi
 
     # Retention: keep 4 weekly copies
     ls -t "${PFSENSE_DEST}"/config-*.xml 2>/dev/null | tail -n +5 | xargs rm -f 2>/dev/null || true
     ls -t "${PFSENSE_DEST}"/pfsense-full-*.tar.gz 2>/dev/null | tail -n +5 | xargs rm -f 2>/dev/null || true
-
-    # Push pfsense-specific metric
-    echo "backup_last_success_timestamp $(date +%s)" | \
-        curl -s --connect-timeout 5 --max-time 10 --data-binary @- \
-        "${PUSHGATEWAY}/metrics/job/pfsense-backup" 2>/dev/null || true
 else
     warn "Cannot SSH to pfsense (10.0.20.1) — skipping"
     STATUS=1
+    PFSENSE_STATUS=1
 fi
 
+# Push pfsense-backup metrics in BOTH success and failure paths so
+# PfsenseBackupStale + PfsenseBackupFailing alerts can fire instead of going
+# silent when ssh-to-pfsense is broken.
+{
+    echo "backup_last_run_timestamp $(date +%s)"
+    echo "backup_last_status ${PFSENSE_STATUS}"
+    [ "${PFSENSE_STATUS}" -eq 0 ] && echo "backup_last_success_timestamp $(date +%s)"
+} | curl -s --connect-timeout 5 --max-time 10 --data-binary @- \
+    "${PUSHGATEWAY}/metrics/job/pfsense-backup" 2>/dev/null || true
+
 # ============================================================
 # STEP 4: PVE host config backup
 # ============================================================
diff --git a/stacks/postiz/modules/postiz/main.tf b/stacks/postiz/modules/postiz/main.tf
index 351dfd66..a55c6711 100644
--- a/stacks/postiz/modules/postiz/main.tf
+++ b/stacks/postiz/modules/postiz/main.tf
@@ -428,6 +428,113 @@ resource "kubernetes_service" "temporal" {
 # NestJS bootstrap crashes with "cannot have more than 3 search attribute
 # of type Text" and the backend never starts.
 # Upstream issue: https://github.com/gitroomhq/postiz-app/issues/1504
+# ──────────────────────────────────────────────────────────────────────────────
+# Backup CronJob — nightly pg_dump of the bundled postiz-postgresql to NFS.
+#
+# The bundled PostgreSQL StatefulSet uses local-path storage on the K8s node
+# OS disk (chart default), which is NOT covered by Layer 1 (LVM thin
+# snapshots) or Layer 2 (sda file backup) of the 3-2-1 pipeline. A pg_dump
+# CronJob writing to /srv/nfs/postiz-backup/ closes the gap: dumps land on
+# Proxmox host NFS → covered by inotify-driven offsite sync to Synology.
+# Three databases are dumped: postiz (app data), temporal (workflow engine),
+# temporal_visibility (workflow search). Bitnami chart-default credentials
+# are used — same creds the Postiz pod itself uses, scoped to the postiz
+# namespace via ClusterIP-only Services.
+# ──────────────────────────────────────────────────────────────────────────────
+
+module "nfs_backup_host" {
+  source     = "../../../../modules/kubernetes/nfs_volume"
+  name       = "postiz-backup-host"
+  namespace  = kubernetes_namespace.postiz.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs/postiz-backup"
+}
+
+resource "kubernetes_cron_job_v1" "postgres_backup" {
+  metadata {
+    name      = "postiz-postgres-backup"
+    namespace = kubernetes_namespace.postiz.metadata[0].name
+    labels    = { app = "postiz", component = "backup" }
+  }
+  spec {
+    schedule                      = "0 3 * * *"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 5
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = { app = "postiz", component = "backup" }
+          }
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name = "backup"
+              # Same image/pattern as dbaas/postgresql-backup: official postgres
+              # client tools + apt-installed curl for the Pushgateway push. The
+              # bitnamilegacy/postgresql variant is stripped (no curl/wget/python),
+              # so the metric push silently failed there.
+              image   = "docker.io/library/postgres:16.4-bullseye"
+              command = ["/bin/bash", "-c"]
+              args = [
+                <<-EOT
+                set -uo pipefail
+                apt-get update -qq && apt-get install -yqq curl >/dev/null 2>&1 || true
+                TIMESTAMP=$(date +%Y%m%d_%H%M)
+                BACKUP_DIR=/backup
+                STATUS=0
+                for db in postiz temporal temporal_visibility; do
+                  echo "Dumping $db..."
+                  if PGPASSWORD=postiz-password pg_dump -h postiz-postgresql -U postiz \
+                    --format=custom --compress=6 \
+                    --file="$BACKUP_DIR/$db-$TIMESTAMP.dump" \
+                    "$db"; then
+                    echo " OK: $db ($(du -h "$BACKUP_DIR/$db-$TIMESTAMP.dump" | cut -f1))"
+                  else
+                    echo " FAIL: $db" >&2
+                    STATUS=1
+                  fi
+                done
+                find "$BACKUP_DIR" -name '*.dump' -mtime +30 -delete 2>/dev/null || true
+                {
+                  echo "backup_last_run_timestamp $(date +%s)"
+                  echo "backup_last_status $STATUS"
+                  [ "$STATUS" -eq 0 ] && echo "backup_last_success_timestamp $(date +%s)"
+                } | curl -sf --connect-timeout 5 --max-time 10 --data-binary @- \
+                  "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/postiz-postgres-backup" || true
+                exit $STATUS
+                EOT
+              ]
+              volume_mount {
+                name       = "backup"
+                mount_path = "/backup"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "64Mi" }
+                limits   = { memory = "256Mi" }
+              }
+            }
+            volume {
+              name = "backup"
+              persistent_volume_claim {
+                claim_name = module.nfs_backup_host.claim_name
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
+  depends_on = [helm_release.postiz]
+}
+
 resource "kubernetes_job" "temporal_search_attr_cleanup" {
   metadata {
     name = "temporal-search-attr-cleanup"
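For reviewers: the TERM/INT-trap change in daily-backup.sh can be exercised without touching the PVE host. This is a minimal stand-alone sketch, not the production script — `push_metrics` is stubbed to write a temp file instead of curl-ing Pushgateway, and a background `sleep` stands in for the rsync/snapshot loop. 143 = 128 + SIGTERM.

```shell
# Reproduce the trap wiring from daily-backup.sh in a child shell, then
# SIGTERM it the way systemd's TimeoutStartSec would, and check that the
# status=2 metric still gets "pushed" (here: written to a temp file).
out="$(mktemp)"

bash -c '
  out="$1"
  KILLED=""
  # Stub: the real script curls this line to Pushgateway.
  push_metrics() { echo "daily_backup_last_status $1" > "$out"; }
  cleanup() { if [ -n "$KILLED" ]; then push_metrics 2; fi; }
  trap cleanup EXIT
  trap "KILLED=1; exit 143" TERM INT
  # Stand-in for the long-running backup loop. Run in background and wait:
  # bash defers trap execution until a foreground command finishes, so the
  # long step must be wait-ed on for the TERM trap to fire promptly.
  sleep 30 >/dev/null 2>&1 & wait $!
  push_metrics 0   # only reached on a clean run
' _ "$out" &
child=$!
sleep 1
kill -TERM "$child"
wait "$child"
rc=$?
echo "child exit: $rc"
cat "$out"
```

The child exits 143 and the stub metric file holds status 2 — the same path that lets WeeklyBackupFailing fire after a timeout-kill instead of the metric going stale silently.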