infra/scripts
Viktor Barzin 0d8e0ca6fc backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile
daily-backup exceeded its 1h TimeoutStartSec budget and was SIGTERMed by systemd
on 10 consecutive days (Apr 30 → May 9). Each failed run left its snapshot mount
stacked on /tmp/pvc-mount, which blocked the next run from completing. This was
the root cause of the WeeklyBackupStale alert going silent (the script never
reached its end-of-script metric push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
  the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
  belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
  the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
  any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
  threshold expired)
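
The trap-related fixes above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the contents of daily-backup.sh:
the function names, the PUSHGATEWAY_URL variable and the metric name
daily_backup_last_status are hypothetical, and LUKS cleanup is left out
because the mapping names are not given here. Only /tmp/pvc-mount and the
status=2 convention come from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: push_status, cleanup_mounts, run_guarded, PUSHGATEWAY_URL and
# the metric name are hypothetical and not taken from daily-backup.sh.
set -u

MOUNT_ROOT="${MOUNT_ROOT:-/tmp/pvc-mount}"
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-}"   # empty: print the metric instead of pushing

push_status() {
  # 0 = success, 1 = failure, 2 = killed by TERM/INT.
  local body="daily_backup_last_status $1"
  if [ -n "$PUSHGATEWAY_URL" ]; then
    printf '%s\n' "$body" |
      curl -fsS --data-binary @- "$PUSHGATEWAY_URL/metrics/job/daily-backup" || true
  else
    printf '%s\n' "$body"
  fi
}

cleanup_mounts() {
  # Recursive unmount of whatever is still mounted at MOUNT_ROOT.
  # A real script would also close LUKS mappings here (omitted in this sketch).
  if mountpoint -q "$MOUNT_ROOT" 2>/dev/null; then
    umount -R "$MOUNT_ROOT" || true
  fi
}

run_guarded() (
  # Subshell body, so the traps and STATUS stay local to this run.
  STATUS=1                                  # assume failure until the work succeeds
  trap 'cleanup_mounts; push_status "$STATUS"' EXIT
  trap 'STATUS=2; exit 143' TERM INT        # systemd timeout kill lands here
  cleanup_mounts                            # clear state left by a crashed run
  "$@" && STATUS=0
)
```

With this shape, `run_guarded some_backup_command` emits exactly one status
metric on every exit path, so an alert keyed on that metric cannot go quiet
merely because the script was killed before its last line.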

Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily at 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.
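
A minimal sketch of the per-database dump loop such a CronJob might run. The
database names and target directory come from the text above; the dump_all
function, the PG_DUMP override, the custom dump format and the file naming
are assumptions made for illustration. Connection settings are assumed to
arrive through the standard libpq environment variables (PGHOST, PGUSER,
PGPASSWORD).

```shell
#!/usr/bin/env bash
# Sketch only: dump_all, PG_DUMP and the file naming are hypothetical.
set -eu

BACKUP_DIR="${BACKUP_DIR:-/srv/nfs/postiz-backup}"
PG_DUMP="${PG_DUMP:-pg_dump}"            # overridable so the loop can be dry-run
DATABASES="postiz temporal temporal_visibility"

dump_all() {
  stamp="$(date +%Y%m%d)"
  # One dump file per database; with set -e the first failing dump aborts the run.
  for db in $DATABASES; do
    "$PG_DUMP" --format=custom --file="${BACKUP_DIR}/${db}-${stamp}.dump" "$db"
  done
}
```

Calling `dump_all` as the container command would write three files per day
into the backup directory; pushing a metric afterwards, as the CronJob
described above does, is left out of this sketch.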

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
  loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 11:12:39 +00:00
server_safe_poweroff move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00
cluster_healthcheck.sh healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep) 2026-04-19 22:13:32 +00:00
cluster_manager.py chore: add untracked stacks, scripts, and agent configs 2026-04-15 09:33:06 +00:00
daily-backup.service backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile 2026-05-10 11:12:39 +00:00
daily-backup.sh backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile 2026-05-10 11:12:39 +00:00
daily-backup.timer rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] 2026-04-13 18:37:04 +00:00
extend_vm_storage.sh [ci skip] expand k8s worker nodes to 256G, update inventory and extend script 2026-02-28 16:00:16 +00:00
forgejo-migrate-orphan-images.sh [forgejo] Migration script: exclude empty repos, all-images full mode 2026-05-07 23:29:34 +00:00
frigate-bulk-classify.js [ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
frigate-inspect.mjs [ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
gen_service_stacks.py cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] 2026-03-25 23:56:07 +02:00
graceful-db-maintenance.sh add pod dependency management via Kyverno init container injection 2026-03-15 19:17:57 +00:00
image_pull.sh chore: add untracked stacks, scripts, and agent configs 2026-04-15 09:33:06 +00:00
image_pull_remote.sh chore: add untracked stacks, scripts, and agent configs 2026-04-15 09:33:06 +00:00
kill_ns.sh move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00
lvm-pvc-snapshot.sh [backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count) 2026-04-25 14:30:58 +00:00
lvm-pvc-snapshot.timer add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync 2026-04-06 14:53:28 +03:00
migrate-state-to-pg [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
migrate_service_state.sh cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] 2026-03-25 23:56:07 +02:00
nfs-change-tracker.service consolidate offsite backup: inotify change tracking, deduplicate Synology paths [ci skip] 2026-04-13 18:06:20 +00:00
node_registry_manager.sh some nits on the registry manager script - note it is still not working correctly [ci skip] 2025-10-17 19:23:43 +00:00
offsite-sync-backup.service rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] 2026-04-13 18:37:04 +00:00
offsite-sync-backup.sh rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] 2026-04-13 18:37:04 +00:00
offsite-sync-backup.timer switch backup + offsite sync from weekly to daily — RPO 7d → 1d [ci skip] 2026-04-13 18:24:38 +00:00
parse-postmortem-todos.sh fix: use sh instead of bash in pipeline (Alpine compat) 2026-04-14 17:29:14 +00:00
pfsense-haproxy-bootstrap.php mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP 2026-05-05 19:45:33 +00:00
pfsense-nat-mailserver-haproxy-flip.php [mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip] 2026-04-19 12:24:50 +00:00
pfsense-nat-mailserver-haproxy-unflip.php [mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip] 2026-04-19 12:24:50 +00:00
postmortem-pipeline.sh [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP 2026-04-18 10:12:02 +00:00
pve-nfs-exports fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] 2026-04-14 18:08:24 +00:00
renew_worker_certs.sh move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00
setup-containerd-pullthrough.sh chore: add untracked stacks, scripts, and agent configs 2026-04-15 09:33:06 +00:00
setup-forgejo-containerd-mirror.sh [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry 2026-05-07 23:29:33 +00:00
setup-task-pipeline.sh [ci skip] add Forgejo task pipeline for OpenClaw AI agent 2026-03-07 21:11:07 +00:00
setup_containerd_mirrors.sh add upstream fallback to containerd registry mirrors 2026-04-02 11:05:30 +03:00
state-sync [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
stop_storage_services.sh cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] 2026-03-25 23:56:07 +02:00
task-processor.sh [ci skip] add Forgejo task pipeline for OpenClaw AI agent 2026-03-07 21:11:07 +00:00
tg ci: add vault CLI to infra-ci image + surface real errors in scripts/tg 2026-04-22 08:46:50 +00:00
update-istio-injection.sh move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00
update_k8s.sh upgrade to k8s 1.34.2 [ci skip] 2025-12-18 12:37:14 +00:00
update_node.sh move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00
vault-kubeconfig remove SOPS pipeline, deploy ESO + Vault DB/K8s engines 2026-03-15 16:37:38 +00:00
woodpecker-register-forgejo-repo.sh [woodpecker] Programmatic Forgejo repo registration 2026-05-10 11:12:36 +00:00