From 0d8e0ca6fcc99440fa968fac430f29b7316ce2b0 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 9 May 2026 17:41:04 +0000
Subject: [PATCH] backup: fix daily-backup silent failures, add postiz pg_dump
 CronJob, reconcile docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

daily-backup ran out of its 1h budget and got SIGTERMed for 10 days straight
(Apr 30 → May 9). Each failed run left its snapshot mount stacked on
/tmp/pvc-mount, which blocked the next run from completing — root cause of
the WeeklyBackupStale alert going silent (the metric never reached its
end-of-script push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was
  hitting the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script
  start as belt-and-braces against any stuck state inherited from a prior
  crashed run
- TERM/INT trap pushes a status=2 metric so WeeklyBackupFailing fires
  instead of the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only pushed
  on success, so any ssh-to-pfsense outage went unreported until the
  PfsenseBackupStale staleness threshold elapsed)

Postiz backup CronJob: the bundled bitnami PG/Redis live on local-path (K8s
node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added
postiz-postgres-backup, which pg_dumps postiz + temporal +
temporal_visibility daily at 03:00 to /srv/nfs/postiz-backup, getting
Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway
metric received. Note: the bitnamilegacy/postgresql image is stripped (no
curl/wget/python) — switched to docker.io/library/postgres, matching the
dbaas/postgresql-backup pattern with apt-installed curl.

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs
claimed backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp).
Updated to match what's actually emitted, and added a "default-covered"
footnote to the Service Protection Matrix so the ~40 services with PVCs not
enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (the origin LV the
  backup loop was processing each time it got SIGTERMed; its snapshots
  stayed mounted, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7
---
 docs/architecture/backup-dr.md       |  20 +++-
 scripts/daily-backup.service         |   5 +-
 scripts/daily-backup.sh              |  54 ++++++++++++--
 stacks/postiz/modules/postiz/main.tf | 107 +++++++++++++++++++++++++++
 4 files changed, 174 insertions(+), 12 deletions(-)

diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md
index b307ec6c..55201417 100644
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -267,7 +267,7 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
 
 **Snapshot Pruning**: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive `lvm-pvc-snapshot` timer).
 
-**Monitoring**: Pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`.
+**Monitoring**: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, and `daily_backup_bytes_synced` to Pushgateway (job `daily-backup`). Alerts: `WeeklyBackupStale` (>9d on `daily_backup_last_run_timestamp`), `WeeklyBackupFailing` (`daily_backup_last_status != 0`). The metric is pushed both on clean exit AND from a `trap TERM INT` handler — a 2026-04-30 → 2026-05-09 silent-failure incident traced to systemd SIGTERMing the script before it reached its final push, leaving the alert blind.
 
 ### Layer 2b: Application-Level Backups
 
@@ -686,9 +686,11 @@ module "nfs_backup" {
 
 **Metrics sources**:
 
 - Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion
-- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
-- Daily backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
-- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp`
+- LVM snapshot script: Pushes `lvm_snapshot_last_run_timestamp`, `lvm_snapshot_last_status`, `lvm_snapshot_created_total`, `lvm_snapshot_failed_total`, `lvm_snapshot_pruned_total`, `lvm_snapshot_thinpool_free_pct` (job `lvm-pvc-snapshot`)
+- Daily backup script: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, `daily_backup_bytes_synced` (job `daily-backup`). Disk-fullness alert (`BackupDiskFull`) does NOT use a script-pushed metric; it derives from node-exporter `node_filesystem_avail_bytes{job="proxmox-host", mountpoint="/mnt/backup"}`.
+- pfSense backup (step 3 of `daily-backup`): Pushes `backup_last_run_timestamp`, `backup_last_status`, and `backup_last_success_timestamp` (only on success) under job `pfsense-backup`. Pushed in BOTH success and failure paths so `PfsenseBackupStale` doesn't go silent when SSH-to-pfsense breaks.
+- Offsite sync script: Pushes `backup_last_success_timestamp`, `offsite_sync_last_status` (job `offsite-backup-sync`)
+- Prometheus backup (sidecar in prometheus-server pod, monthly 1st-Sunday 04:00 UTC): Pushes `prometheus_backup_last_success_timestamp` (job `prometheus-backup`)
 - ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
 - Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
@@ -728,6 +730,8 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
 | NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
 | Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
 | Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
+| **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted |
+| **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS |
 | **Media (NFS)** |
 | Immich (~800GB) | — | — | — | ✓ | NFS |
 | Audiobookshelf | — | — | — | ✓ | NFS |
@@ -739,7 +743,13 @@
 - — = Not needed (other layers cover it, or data is regenerable/disposable)
 - excluded = Too large/regenerable, not worth offsite bandwidth
 
-**Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
+**Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
+
+¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count.
+
+**Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2:
+- **Postiz** PG and Redis (bundled bitnami chart) live on `local-path` (K8s node OS disk). PG covered by the postiz-postgres-backup CronJob (daily pg_dump → `/srv/nfs/postiz-backup/`, Layer 3 via offsite sync). Redis is regenerable cache — not backed up.
+- **Prometheus, Alertmanager, Pushgateway** — `monitoring` namespace excluded by policy; loss is acceptable (metrics regenerable, silences ephemeral, Pushgateway has on-disk persistence for 24h gap tolerance).
 
 ## Recovery Procedures
 
diff --git a/scripts/daily-backup.service b/scripts/daily-backup.service
index a2bf2d85..752c79dd 100644
--- a/scripts/daily-backup.service
+++ b/scripts/daily-backup.service
@@ -8,4 +8,7 @@ ExecStart=/usr/local/bin/daily-backup
 StandardOutput=journal
 StandardError=journal
 SyslogIdentifier=daily-backup
-TimeoutStartSec=3600
+# 4h budget — the snapshot mount + LUKS decrypt + rsync + sqlite scan loop
+# scales with the number of PVCs (118 today). Hit the 1h ceiling around week
+# 18 of 2026 and silently SIGTERM'd for 10 days. Bumped to 4h with margin.
+TimeoutStartSec=14400
diff --git a/scripts/daily-backup.sh b/scripts/daily-backup.sh
index a9d776a7..1d5b289a 100644
--- a/scripts/daily-backup.sh
+++ b/scripts/daily-backup.sh
@@ -21,15 +21,48 @@ warn() { log "WARN: $*" >&2; }
 die() { log "FATAL: $*" >&2; push_metrics 1 0; exit 1; }
 
 # --- Locking ---
+# Track whether we got SIGTERM/SIGINT so cleanup can push a non-success metric.
+# Without this, a systemd timeout-kill leaves WeeklyBackupFailing alerts blind:
+# the script never reaches the success push at the end and the metric goes stale
+# silently. (Root cause of 2026-04-30 → 2026-05-09 silent-failure run.)
+KILLED=""
+
 cleanup() {
-    umount "${PVC_MOUNT}" 2>/dev/null || true
+    # Recursively unmount /tmp/pvc-mount: previous SIGTERM'd runs left snapshot
+    # mounts stacked here, which made every subsequent run start with an
+    # already-occupied mountpoint and time out before reaching its own umount.
+    while mountpoint -q "${PVC_MOUNT}" 2>/dev/null; do
+        umount "${PVC_MOUNT}" 2>/dev/null || umount -l "${PVC_MOUNT}" 2>/dev/null || break
+    done
+    # Close any LUKS mappers we opened (or that were left over from a prior crash).
+    for m in /dev/mapper/pvc-snap-*; do
+        [ -e "$m" ] || continue
+        cryptsetup close "$(basename "$m")" 2>/dev/null || true
+    done
     rm -f "${LOCKFILE}"
+    if [ -n "${KILLED}" ]; then
+        # status=2 = aborted (matches lvm-pvc-snapshot's convention)
+        push_metrics 2 "${TOTAL_BYTES:-0}"
+    fi
 }
 trap cleanup EXIT
+trap 'KILLED=1; exit 143' TERM INT
+
 if ! ( set -o noclobber; echo $$ > "${LOCKFILE}" ) 2>/dev/null; then
     die "Another instance is running (PID $(cat "${LOCKFILE}" 2>/dev/null || echo unknown))"
 fi
+
+# Belt-and-braces: if a previous run was SIGTERM'd before its trap completed,
+# /tmp/pvc-mount may have stacked mounts and stale LUKS mappers. The lock above
+# guarantees we're alone, so it's safe to clean these up now.
+while mountpoint -q "${PVC_MOUNT}" 2>/dev/null; do
+    umount "${PVC_MOUNT}" 2>/dev/null || umount -l "${PVC_MOUNT}" 2>/dev/null || break
+done
+for m in /dev/mapper/pvc-snap-*; do
+    [ -e "$m" ] || continue
+    cryptsetup close "$(basename "$m")" 2>/dev/null || true
+done
+
 # --- Metrics ---
 push_metrics() {
     local status="${1:-0}" bytes="${2:-0}"
@@ -243,6 +276,7 @@ fi
 log "--- Step 3: pfsense backup ---"
 PFSENSE_DEST="${BACKUP_ROOT}/pfsense"
 DATE=$(date +%Y%m%d)
+PFSENSE_STATUS=0
 mkdir -p "${PFSENSE_DEST}"
 
 if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/dev/null; then
@@ -253,6 +287,7 @@ if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/de
     else
         warn "Failed to copy pfsense config.xml"
         STATUS=1
+        PFSENSE_STATUS=1
     fi
 
     # Full filesystem tar
@@ -264,21 +299,28 @@ if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/de
     else
         warn "Failed to tar pfsense filesystem"
         STATUS=1
+        PFSENSE_STATUS=1
     fi
 
     # Retention: keep 4 weekly copies
     ls -t "${PFSENSE_DEST}"/config-*.xml 2>/dev/null | tail -n +5 | xargs rm -f 2>/dev/null || true
     ls -t "${PFSENSE_DEST}"/pfsense-full-*.tar.gz 2>/dev/null | tail -n +5 | xargs rm -f 2>/dev/null || true
-
-    # Push pfsense-specific metric
-    echo "backup_last_success_timestamp $(date +%s)" | \
-        curl -s --connect-timeout 5 --max-time 10 --data-binary @- \
-        "${PUSHGATEWAY}/metrics/job/pfsense-backup" 2>/dev/null || true
 else
     warn "Cannot SSH to pfsense (10.0.20.1) — skipping"
     STATUS=1
+    PFSENSE_STATUS=1
 fi
 
+# Push pfsense-backup metrics in BOTH success and failure paths so
+# PfsenseBackupStale + PfsenseBackupFailing alerts can fire instead of going
+# silent when ssh-to-pfsense is broken.
+{
+    echo "backup_last_run_timestamp $(date +%s)"
+    echo "backup_last_status ${PFSENSE_STATUS}"
+    [ "${PFSENSE_STATUS}" -eq 0 ] && echo "backup_last_success_timestamp $(date +%s)"
+} | curl -s --connect-timeout 5 --max-time 10 --data-binary @- \
+    "${PUSHGATEWAY}/metrics/job/pfsense-backup" 2>/dev/null || true
+
 # ============================================================
 # STEP 4: PVE host config backup
 # ============================================================
diff --git a/stacks/postiz/modules/postiz/main.tf b/stacks/postiz/modules/postiz/main.tf
index 351dfd66..a55c6711 100644
--- a/stacks/postiz/modules/postiz/main.tf
+++ b/stacks/postiz/modules/postiz/main.tf
@@ -428,6 +428,113 @@ resource "kubernetes_service" "temporal" {
 # NestJS bootstrap crashes with "cannot have more than 3 search attribute
 # of type Text" and the backend never starts.
 # Upstream issue: https://github.com/gitroomhq/postiz-app/issues/1504
+# ──────────────────────────────────────────────────────────────────────────────
+# Backup CronJob — nightly pg_dump of the bundled postiz-postgresql to NFS.
+#
+# The bundled PostgreSQL StatefulSet uses local-path storage on the K8s node
+# OS disk (chart default), which is NOT covered by Layer 1 (LVM thin
+# snapshots) or Layer 2 (sda file backup) of the 3-2-1 pipeline. A pg_dump
+# CronJob writing to /srv/nfs/postiz-backup/ closes the gap: dumps land on
+# Proxmox host NFS → covered by inotify-driven offsite sync to Synology.
+# Three databases are dumped: postiz (app data), temporal (workflow engine),
+# temporal_visibility (workflow search). Bitnami chart-default credentials
+# are used — same creds the Postiz pod itself uses, scoped to the postiz
+# namespace via ClusterIP-only Services.
+# ──────────────────────────────────────────────────────────────────────────────
+
+module "nfs_backup_host" {
+  source     = "../../../../modules/kubernetes/nfs_volume"
+  name       = "postiz-backup-host"
+  namespace  = kubernetes_namespace.postiz.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs/postiz-backup"
+}
+
+resource "kubernetes_cron_job_v1" "postgres_backup" {
+  metadata {
+    name      = "postiz-postgres-backup"
+    namespace = kubernetes_namespace.postiz.metadata[0].name
+    labels    = { app = "postiz", component = "backup" }
+  }
+  spec {
+    schedule                      = "0 3 * * *"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 5
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit              = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = { app = "postiz", component = "backup" }
+          }
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name = "backup"
+              # Same image/pattern as dbaas/postgresql-backup: official postgres
+              # client tools + apt-installed curl for the Pushgateway push. The
+              # bitnamilegacy/postgresql variant is stripped (no curl/wget/python),
+              # so the metric push silently failed there.
+              image   = "docker.io/library/postgres:16.4-bullseye"
+              command = ["/bin/bash", "-c"]
+              args = [
+                <<-EOT
+                set -uo pipefail
+                apt-get update -qq && apt-get install -yqq curl >/dev/null 2>&1 || true
+                TIMESTAMP=$(date +%Y%m%d_%H%M)
+                BACKUP_DIR=/backup
+                STATUS=0
+                for db in postiz temporal temporal_visibility; do
+                  echo "Dumping $db..."
+                  if PGPASSWORD=postiz-password pg_dump -h postiz-postgresql -U postiz \
+                    --format=custom --compress=6 \
+                    --file="$BACKUP_DIR/$db-$TIMESTAMP.dump" \
+                    "$db"; then
+                    echo " OK: $db ($(du -h "$BACKUP_DIR/$db-$TIMESTAMP.dump" | cut -f1))"
+                  else
+                    echo " FAIL: $db" >&2
+                    STATUS=1
+                  fi
+                done
+                find "$BACKUP_DIR" -name '*.dump' -mtime +30 -delete 2>/dev/null || true
+                {
+                  echo "backup_last_run_timestamp $(date +%s)"
+                  echo "backup_last_status $STATUS"
+                  [ "$STATUS" -eq 0 ] && echo "backup_last_success_timestamp $(date +%s)"
+                } | curl -sf --connect-timeout 5 --max-time 10 --data-binary @- \
+                  "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/postiz-postgres-backup" || true
+                exit $STATUS
+                EOT
+              ]
+              volume_mount {
+                name       = "backup"
+                mount_path = "/backup"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "64Mi" }
+                limits   = { memory = "256Mi" }
+              }
+            }
+            volume {
+              name = "backup"
+              persistent_volume_claim {
+                claim_name = module.nfs_backup_host.claim_name
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
+  depends_on = [helm_release.postiz]
+}
+
 resource "kubernetes_job" "temporal_search_attr_cleanup" {
   metadata {
     name = "temporal-search-attr-cleanup"
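For reviewers: the TERM/INT-trap change in daily-backup.sh can be exercised without touching the PVE host. This is a minimal stand-alone sketch, not the production script — `push_metrics` is stubbed to write a temp file instead of curl-ing Pushgateway, and a background `sleep` stands in for the rsync/snapshot loop. 143 = 128 + SIGTERM.

```shell
# Reproduce the trap wiring from daily-backup.sh in a child shell, then
# SIGTERM it the way systemd's TimeoutStartSec would, and check that the
# status=2 metric still gets "pushed" (here: written to a temp file).
out="$(mktemp)"

bash -c '
  out="$1"
  KILLED=""
  # Stub: the real script curls this line to Pushgateway.
  push_metrics() { echo "daily_backup_last_status $1" > "$out"; }
  cleanup() { if [ -n "$KILLED" ]; then push_metrics 2; fi; }
  trap cleanup EXIT
  trap "KILLED=1; exit 143" TERM INT
  # Stand-in for the long-running backup loop. Run in background and wait:
  # bash defers trap execution until a foreground command finishes, so the
  # long step must be wait-ed on for the TERM trap to fire promptly.
  sleep 30 >/dev/null 2>&1 & wait $!
  push_metrics 0   # only reached on a clean run
' _ "$out" &
child=$!
sleep 1
kill -TERM "$child"
wait "$child"
rc=$?
echo "child exit: $rc"
cat "$out"
```

The child exits 143 and the stub metric file holds status 2 — the same path that lets WeeklyBackupFailing fire after a timeout-kill instead of the metric going stale silently.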