infra/docs/architecture/backup-dr.md
Viktor Barzin 4798583db7 backup pipeline: S1 fixes from 2026-05-24 audit
Three immediate fixes surfaced by the backup-pipeline audit:

1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
   `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
   already lives in offsite-sync-backup at line 159, gated on a successful
   sync. With both scripts truncating, an offsite-sync failure followed by
   the next morning's daily-backup would silently wipe yesterday's
   unconsumed manifest entries — those files would only reach Synology
   via the monthly full sync (1st-7th of month). Now only offsite-sync
   truncates, and only on success.

2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
   but was never added to prometheus_chart_values.tpl. Step 1 or Step 2
   failure pushes offsite_sync_last_status=1 but nothing read it. Added.

3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
   snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
   transfers — compression wastes CPU and yields nothing (gigabit local
   path, intermediate disk doesn't benefit).

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
  (the timer is daily, not weekly — legacy from earlier monthly-rotation
  schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
  Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
  config.xml stays daily (it's the primary restore artifact and tiny).
  Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
  ~3G/month to sda + Synology for unchanged content. Weekly is plenty.

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
  both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
  daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
2026-05-24 16:18:44 +00:00

Backup & Disaster Recovery Architecture

Last updated: 2026-05-24

2026-05-24 session — what changed today (deeper structural review pending — see the open backup-pipeline simplification audit):

  • anca-elements archive direction inverted — Synology /Backup/Anca/Elements (770G) deleted; PVE /srv/nfs/anca-elements is now source of truth. anca-elements-sync.sh retired.
  • anca-elements-mirror.{sh,service,timer} retired, subsumed into the new nfs-mirror weekly job covering all critical NFS subtrees (anca-elements + ~80 services) → sda.
  • offsite-sync-backup Step 2 filter inverted: NFS-direct-to-Synology now only carries the sda-bypass paths (immich + frigate + prometheus + *-backup + …). Two-leg invariant: the nfs-mirror.sh EXCLUDES list must match the offsite-sync-backup Step 2 INCLUDES list. Cross-referenced in both scripts.
  • Synology /Backup/Viki/nfs/<svc>/ orphan cleanup — 84 dirs renamed in-place (btrfs metadata-only) to /Backup/Viki/pve-backup/<svc>/ so daily-incremental Step 1 sees them as pre-existing and only ships deltas. No re-transfer.
  • Synology snapshot retention 7d → 3d, all 8 backlog snapshots deleted via sudo synosharesnapshot delete Backup .... Reclaimed ~800G btrfs (98% → 83% used). DSM API was blocked by 2FA; sudo over the existing Administrator SSH key worked with the Vault-stored password.
  • Manifest mechanism extended: nfs-mirror now appends its transferred file list to /mnt/backup/.changed-files so daily Step 1 incremental picks it up (was previously only fed by daily-backup).
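The manifest rule described in the commit notes above (producers append, and only a successful offsite sync truncates) can be sketched as follows. This is a minimal illustration, not the code in offsite-sync-backup: the helper name `consume_manifest` is hypothetical, and the sync command is passed as arguments so the logic can be exercised without a real Synology target.

```shell
# Hypothetical helper: run a sync command and clear the manifest only if
# that command succeeds. On failure the entries stay queued for the next run.
consume_manifest() {
  local manifest=$1
  shift
  # Nothing queued: nothing to ship, nothing to truncate.
  [ -s "$manifest" ] || return 0
  if "$@"; then
    : > "$manifest"   # success: entries were shipped, safe to clear
  else
    return 1          # failure: keep entries untouched
  fi
}

# Illustrative call (rsync flags and destination are assumptions):
# consume_manifest /mnt/backup/.changed-files \
#   rsync -a --files-from=/mnt/backup/.changed-files /mnt/backup/ \
#     synology.viktorbarzin.lan:/Backup/Viki/pve-backup/
```

The sketch does not address entries appended while the sync is running; that is the write-collision follow-up listed in the commit notes.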

Overview

The homelab runs a 3-2-1 strategy with a two-leg path to Synology so every NFS byte takes exactly one route to offsite (no duplication, no gaps):

sdc /srv/nfs/<svc>/   ──nfs-mirror weekly──→  sda /mnt/backup/<svc>/   ──offsite-sync Step 1──→  Synology /Backup/Viki/pve-backup/<svc>/      [leg 1]
sdc /srv/nfs/<bypass>/  ──inotify (nfs-change-tracker)──→  offsite-sync Step 2  ──→  Synology /Backup/Viki/nfs/<bypass>/                       [leg 2]
sdc PVCs (LVM thin)   ──daily-backup~snapshot~rsync──→  sda /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config}/  ──Step 1──→  Synology /Backup/Viki/pve-backup/

The bypass list (paths that take leg 2 — too big for sda, transient, or already-a-backup): immich, frigate, prometheus, loki, temp, alertmanager, ollama, audiblez, ebook2audiobook, *-backup. Anything NOT in this list rides leg 1 via nfs-mirror.

3-2-1 Breakdown:

  • Copy 1 (live): all PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD); all NFS data at /srv/nfs[-ssd]/
  • Copy 2 (local backup): sda /mnt/backup (1.1TB RAID1 SAS) — at ~90% used post-2026-05-24 (was ~10% in April)
  • Copy 3 (offsite): Synology NAS at 192.168.1.13 — at ~83% used / 934G free post-2026-05-24 (was 98% / 121G before today's cleanup)
    • Synology/Backup/Viki/pve-backup/ — sda contents (PVC backups + nfs-mirror output: ~90 service dirs)
    • Synology/Backup/Viki/nfs/ — bypass-list NFS (immich, frigate, etc.)
    • Synology/Backup/Viki/nfs-ssd/ — bypass-list SSD NFS (immich-ML, ollama, llamacpp)

Architecture Diagram

Overall Backup Flow

graph TB
    subgraph Proxmox["Proxmox Host (192.168.1.127)"]
        sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
        sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]

        subgraph Layer1["Layer 1: LVM Thin Snapshots"]
            Snap["Daily 03:00<br/>7-day retention<br/>62 PVCs (excludes dbaas+monitoring)"]
        end

        subgraph Layer2["Layer 2: Weekly File Backup"]
            PVCBackup["PVC File Copy<br/>Daily 05:00<br/>4 weekly versions<br/>/mnt/backup/pvc-data/<YYYY-WW>/"]
            SQLiteBackup["Auto SQLite Backup<br/>magic number check + ?mode=ro<br/>from PVC snapshots"]
            PfsenseBackup["pfSense Backup<br/>config.xml + full tar<br/>4 weekly versions"]
            PVEConfig["PVE Config<br/>/etc/pve + scripts"]
        end

        sdc --> Snap
        sdc --> PVCBackup
        PVCBackup --> sda
        SQLiteBackup --> sda
        PfsenseBackup --> sda
        PVEConfig --> sda
    end

    subgraph NFS_Storage["Proxmox NFS (/srv/nfs)"]
        NFS_Backup["NFS dirs<br/>/srv/nfs/*-backup/"]

        subgraph AppBackups["App-Level Backup CronJobs"]
            CronDaily["Daily 00:00-00:45<br/>PostgreSQL, MySQL<br/>14d retention"]
            CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden<br/>30d retention"]
        end

        CronDaily --> NFS_Backup
        CronWeekly --> NFS_Backup
    end

    subgraph Layer3["Layer 3: Offsite Sync"]
        PVEOffsite["Step 1: sda → Synology<br/>Daily 06:00<br/>pve-backup/ only"]
        NFSOffsite["Step 2: NFS → Synology<br/>inotify change-tracked<br/>rsync --files-from<br/>nfs/ + nfs-ssd/"]
    end

    sda --> PVEOffsite
    NFS_Storage --> NFSOffsite

    Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"]

    PVEOffsite --> Synology
    NFSOffsite --> Synology

    NFS_Backup -.->|app-level dumps| NFS_Storage

    subgraph Monitoring["Monitoring & Alerting"]
        Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale, MySQLBackupStale<br/>WeeklyBackupStale, OffsiteBackupSyncStale<br/>LVMSnapshotStale, BackupDiskFull<br/>VaultwardenIntegrityFail"]
        Pushgateway["Pushgateway<br/>backup script metrics<br/>vaultwarden integrity"]
    end

    PVCBackup -.->|push metrics| Pushgateway
    Snap -.->|push metrics| Pushgateway
    Pushgateway --> Prometheus

    style Layer1 fill:#c8e6c9
    style Layer2 fill:#ffe0b2
    style Layer3 fill:#e1f5ff
    style Monitoring fill:#f3e5f5

Weekly Backup Timeline

graph LR
    subgraph Sunday["Sunday Timeline"]
        S01["01:00 etcd backup<br/>(CronJob)"]
        S02["02:00 Vault backup<br/>(CronJob)"]
        S03a["03:00 Redis backup<br/>(CronJob)"]
        S03b["03:00 LVM snapshots<br/>(lvm-pvc-snapshot timer)"]
        S05["05:00 Daily backup<br/>(daily-backup timer)<br/>1. PVC file copy (auto-discovered BACKUP_DIRS)<br/>2. Auto SQLite backup (magic number + ?mode=ro)<br/>3. pfSense backup<br/>4. PVE config<br/>5. Prune snapshots"]
        S08["06:00 Offsite sync<br/>(offsite-sync-backup timer)<br/>Step 1: sda → Synology pve-backup/<br/>Step 2: NFS → Synology nfs/ + nfs-ssd/<br/>(inotify change-tracked)"]
    end

    S01 --> S02 --> S03a --> S03b --> S05 --> S08

    style Sunday fill:#ffe0b2

Physical Disk Layout

graph TB
    subgraph PVE["Proxmox Host (192.168.1.127)"]
        subgraph sda["sda: 1.1TB RAID1 SAS"]
            sda_vg["VG: backup<br/>LV: data (ext4)<br/>/mnt/backup"]
            sda_content["pvc-data/<YYYY-WW>/<ns>/<pvc>/<br/>sqlite-backup/<br/>pfsense/<YYYY-WW>/<br/>pve-config/"]
        end

        subgraph sdb["sdb: 931GB SSD"]
            sdb_vg["VG: pve<br/>LV: root (ext4)<br/>PVE host OS"]
        end

        subgraph sdc["sdc: 10.7TB RAID1 HDD"]
            sdc_vg["VG: pve<br/>LV: data (thin pool)<br/>65 proxmox-lvm PVCs<br/>+ VM disks"]
        end

        sda_vg --> sda_content
    end

    sdc -.->|weekly backup<br/>mount snapshot ro| sda
    sda -.->|offsite sync<br/>rsync| Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/{pve-backup,nfs,nfs-ssd}/"]

    style sda fill:#fff9c4
    style sdb fill:#c8e6c9
    style sdc fill:#e1f5ff

Restore Decision Tree

graph TB
    Start["Data loss detected"]
    Age{"How old is<br/>the lost data?"}
    Type{"What type<br/>of data?"}

    Start --> Age

    Age -->|"< 7 days"| LVM["Use LVM snapshot<br/>lvm-pvc-snapshot restore<br/>RTO: <5 min"]
    Age -->|"> 7 days,<br/>< 4 weeks"| FileBackup["Use sda file backup<br/>/mnt/backup/pvc-data/<week>/<br/>RTO: <15 min"]
    Age -->|"> 4 weeks or<br/>site disaster"| Offsite["Use Synology backup<br/>Synology/pve-backup/<br/>RTO: <4 hours"]

    LVM --> Type
    FileBackup --> Type
    Offsite --> Type

    Type -->|"Database"| AppBackup["Use app-level dump<br/>/srv/nfs/<service>-backup/<br/>OR Synology/nfs/<service>-backup/<br/>RTO: <10 min"]
    Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"]
    Type -->|"Media (NFS)"| OffsiteMedia["Use Synology backup<br/>Synology/nfs/ or nfs-ssd/<br/>RTO: varies by size"]

    style Start fill:#ffcdd2
    style LVM fill:#c8e6c9
    style FileBackup fill:#fff9c4
    style Offsite fill:#e1f5ff
    style AppBackup fill:#e1bee7

Vaultwarden Enhanced Protection

graph LR
    subgraph Every6h["Every 6 hours"]
        VWBackup["vaultwarden-backup CronJob"]
        Step1["1. PRAGMA integrity_check<br/>(fail → abort)"]
        Step2["2. sqlite3 .backup<br/>/mnt/main/vaultwarden-backup/"]
        Step3["3. PRAGMA integrity_check<br/>on backup copy"]
        Step4["4. Copy RSA keys, attachments,<br/>sends, config.json"]
        Step5["5. Rotate backups (30d)"]

        VWBackup --> Step1 --> Step2 --> Step3 --> Step4 --> Step5
    end

    subgraph Hourly["Every hour"]
        VWCheck["vaultwarden-integrity-check"]
        Check1["PRAGMA integrity_check"]
        Metric["Push metric to Pushgateway:<br/>vaultwarden_sqlite_integrity_ok"]

        VWCheck --> Check1 --> Metric
    end

    Metric -.->|Prometheus scrape| Alert["Alert if integrity_ok == 0"]

    style Every6h fill:#fff9c4
    style Hourly fill:#e1bee7

Components

| Component | Version/Schedule | Location | Purpose |
|---|---|---|---|
| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: lvm-pvc-snapshot | CoW snapshots of 62 proxmox-lvm PVCs |
| Daily PVC Backup | Daily 05:00, 4 weeks | PVE host: daily-backup | File-level PVC copy to sda |
| Auto SQLite Backup | Daily 05:00 + daily-backup | PVE host: magic number check + ?mode=ro | Safe SQLite backup from PVC snapshots |
| NFS Change Tracker | Continuous (inotifywait) | PVE host: nfs-change-tracker.service | Logs changed NFS file paths to /mnt/backup/.nfs-changes.log |
| pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar |
| Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: offsite-sync-backup | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify |
| PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in dbaas namespace | pg_dumpall for all databases |
| PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in dbaas namespace | pg_dump -Fc per database → /backup/per-db/<db>/ |
| MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in dbaas namespace | mysqldump --all-databases |
| MySQL Backup (per-db) | Daily 00:45, 14d retention | CronJob in dbaas namespace | mysqldump per database → /backup/per-db/<db>/ |
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in kube-system | etcdctl snapshot |
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in vaultwarden | sqlite3 .backup + integrity |
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in vault | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in redis | BGSAVE + copy |
| Vaultwarden Integrity Check | Hourly | CronJob in vaultwarden | PRAGMA integrity_check → metric |
| TrueNAS Cloud Sync | DECOMMISSIONED 2026-04-13 | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup + inotify change tracking on Proxmox host NFS |

How It Works

Layer 1: LVM Thin Snapshots (Fast Local Recovery)

Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.

Script: /usr/local/bin/lvm-pvc-snapshot on PVE host (source: infra/scripts/lvm-pvc-snapshot.sh).

Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot

Schedule: Daily 03:00 via systemd timer, 7-day retention

Discovery: Auto-discovers PVC LVs matching vm-*-pvc-* pattern in VG pve thin pool data

Coverage: 62 of the 65 proxmox-lvm PVCs. The dbaas and monitoring namespaces are excluded because:

  • MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour)
  • They already have app-level dumps (Layer 2b)
  • Including them causes ~36% write amplification; excluding them reduces overhead to ~0%

Monitoring: Pushes metrics to Pushgateway via NodePort (30091). Alerts: LVMSnapshotStale (>30h since last run + 30m for:), LVMSnapshotFailing, LVMThinPoolLow (<15% free).

Restore: lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv> — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See docs/runbooks/restore-lvm-snapshot.md.

Layer 2: Weekly File-Level Backup (sda Backup Disk)

Backup disk: sda (1.1TB RAID1 SAS) → VG backup → LV data → ext4 → mounted at /mnt/backup on PVE host. Dedicated backup disk, independent of live storage.

Script: /usr/local/bin/daily-backup on PVE host (source: infra/scripts/daily-backup.sh)

Schedule: Daily 05:00 via systemd timer

Retention: 4 weekly versions (weeks 0-3 via --link-dest hardlink dedup)

What Gets Backed Up

1. PVC File Copies (/mnt/backup/pvc-data/<YYYY-WW>/):

  • Mount each LVM thin LV ro on PVE host → rsync files (not block) → unmount
  • 62 PVCs covered (all except dbaas + monitoring)
  • Organized as /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/
  • 4 weekly versions with --link-dest hardlink dedup (unchanged files share inodes)
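The weekly hardlink rotation above can be sketched as follows. This is an illustration under assumptions, not the contents of daily-backup.sh: the function names are hypothetical, and the week label is passed in because the exact `date` format behind `<YYYY-WW>` is not stated here.

```shell
# Print the newest week directory under $1 that sorts before week label $2.
# Relies on YYYY-WW labels sorting lexicographically (glob results are sorted).
previous_week_dir() {
  local root=$1 week=$2 d name prev=""
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    # expr does a string comparison here because the labels are not integers.
    if expr "X$name" \< "X$week" >/dev/null; then
      prev="$root/$name"
    fi
  done
  if [ -n "$prev" ]; then printf '%s\n' "$prev"; fi
}

# Hypothetical wrapper: copy one mounted PVC into this week's directory,
# hardlinking unchanged files against last week's copy when one exists.
backup_pvc() {
  local src=$1 root=$2 week=$3 rel=$4 prev
  prev=$(previous_week_dir "$root" "$week")
  mkdir -p "$root/$week/$rel"
  if [ -n "$prev" ]; then
    rsync -a --delete --link-dest="$prev/$rel" "$src/" "$root/$week/$rel/"
  else
    rsync -a --delete "$src/" "$root/$week/$rel/"
  fi
}
```

With `--link-dest`, a file that is identical to last week's copy is created as a hardlink instead of being written again, which is why four weekly versions cost little more than one full copy plus the weekly changes.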

2. Auto SQLite Backup (/mnt/backup/sqlite-backup/):

  • Detects SQLite databases in PVC snapshots via magic number check (SQLite format 3)
  • Opens each database with ?mode=ro (read-only, safe — no WAL replay)
  • Runs .backup to create a consistent copy
  • Covers all SQLite files across all PVC snapshots automatically
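The detection and copy steps above can be sketched like this. Function names are hypothetical and the real script may differ. The only facts relied on are the SQLite 3 header string and the sqlite3 CLI's `.backup` command with a read-only `file:` URI.

```shell
# True if the file begins with the SQLite 3 header string.
# The on-disk header is "SQLite format 3" followed by a NUL byte; only the
# 15 printable bytes are compared because shell strings cannot hold NUL.
is_sqlite() {
  [ -f "$1" ] && [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}

# Hypothetical backup step: open the database read-only through a file: URI
# and write a consistent copy with .backup.
# Paths containing "?", "#" or single quotes would need escaping first.
backup_sqlite() {
  local db=$1 out=$2
  sqlite3 "file:${db}?mode=ro" ".backup '${out}'"
}
```

Checking the header rather than the file extension catches databases named `*.db`, `*.sqlite3`, or anything else, and skips unrelated files that happen to use those extensions.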

3. pfSense Backup (/mnt/backup/pfsense/<YYYY-WW>/):

  • config.xml via API (base64 decode), daily
  • Full filesystem tar via SSH (tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf), Sunday-only as of the 2026-05-24 changes
  • 4 weekly versions

4. PVE Config (/mnt/backup/pve-config/):

  • /etc/pve/ (cluster config, VM definitions)
  • /usr/local/bin/ (custom scripts)
  • /etc/systemd/system/ (timers)
  • Single copy (no rotation)

Auto-discovered BACKUP_DIRS: Uses glob-based discovery instead of a hardcoded list. Any new PVC LV matching vm-*-pvc-* is automatically included.

Snapshot Pruning: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive lvm-pvc-snapshot timer).

Monitoring: Pushes daily_backup_last_run_timestamp, daily_backup_last_status, and daily_backup_bytes_synced to Pushgateway (job daily-backup). Alerts: WeeklyBackupStale (>9d on daily_backup_last_run_timestamp), WeeklyBackupFailing (daily_backup_last_status != 0).

The metrics are pushed both on clean exit and from a trap TERM INT handler. A silent-failure incident from 2026-04-30 to 2026-05-09 was traced to systemd sending SIGTERM to the script before it reached its final push, which left the alert blind.
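A minimal sketch of pushing on both exit paths, assuming a placeholder Pushgateway host (this document only states NodePort 30091) and hypothetical function names. The metric names and job label are the ones listed above.

```shell
# Assumed endpoint: the host name is a placeholder, not the real address.
PUSHGATEWAY_URL=${PUSHGATEWAY_URL:-http://pushgateway.example:30091}
BYTES_SYNCED=0

# Render the three daily-backup metrics in Prometheus text format.
render_metrics() {
  local status=$1 bytes=$2 now=${3:-$(date +%s)}
  printf 'daily_backup_last_run_timestamp %s\n' "$now"
  printf 'daily_backup_last_status %s\n' "$status"
  printf 'daily_backup_bytes_synced %s\n' "$bytes"
}

# POST the rendered metrics to the job's grouping key on the Pushgateway.
push_metrics() {
  render_metrics "$@" | curl -fsS --max-time 10 --data-binary @- \
    "$PUSHGATEWAY_URL/metrics/job/daily-backup"
}

# Hypothetical main: push a failure status if the script is signalled,
# and the success status only after every step has finished.
main() {
  trap 'push_metrics 1 "$BYTES_SYNCED"; exit 1' TERM INT
  # ... backup steps would run here and update BYTES_SYNCED ...
  push_metrics 0 "$BYTES_SYNCED"
}
```

The trap matters because a script killed mid-run otherwise leaves the previous successful status in Pushgateway, and only the staleness alert would ever notice.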

Layer 2b: Application-Level Backups

K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to /srv/nfs/<service>-backup/ (some legacy paths still use /mnt/main/<service>-backup/).

Why needed: LVM snapshots capture block-level state, but:

  • Cannot restore individual databases from a PostgreSQL snapshot
  • Proxmox CSI LVs are opaque raw block devices
  • Need point-in-time recovery for specific apps without full LVM rollback

Daily backups (00:00-00:45):

  • PostgreSQL full (pg_dumpall, 00:00): Dumps all databases to /mnt/main/postgresql-backup/dump_*.sql.gz. 14-day rotation.
  • PostgreSQL per-db (pg_dump -Fc, 00:15): Dumps each database individually to /mnt/main/postgresql-backup/per-db/<dbname>/dump_*.dump. Enables single-database restore via pg_restore -d <db> --clean --if-exists. 14-day rotation.
  • MySQL full (mysqldump --all-databases, 00:30): Dumps all databases to /mnt/main/mysql-backup/dump_*.sql.gz. 14-day rotation.
  • MySQL per-db (mysqldump, 00:45): Dumps each database individually to /mnt/main/mysql-backup/per-db/<dbname>/dump_*.sql.gz. Enables single-database restore. 14-day rotation.
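The 14-day rotation on these dump directories could be implemented along these lines. This is a sketch with a hypothetical function name. The actual CronJobs may rotate differently, and only the `dump_*` naming and the 14-day window are taken from the list above.

```shell
# Delete dump files older than the retention window and print what was removed.
# find's "-mtime +N" matches files last modified more than N full days ago,
# so with 14 a file survives until it is at least 15 days old.
prune_old_dumps() {
  local dir=$1 days=${2:-14}
  find "$dir" -type f -name 'dump_*' -mtime +"$days" -print -delete
}

# Illustrative call covering both the full dumps and the per-db subtree:
# prune_old_dumps /mnt/main/postgresql-backup 14
```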

Weekly backups (Sunday 01:00-04:00):

  • etcd: etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db. 30-day retention. Critical for cluster recovery.
  • Vaultwarden: runs every 6 hours rather than weekly; see "Vaultwarden Enhanced Protection" below. 30-day retention.
  • Vault: vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap. 30-day retention.
  • Redis: redis-cli BGSAVE then copy RDB file. 30-day retention.

Vaultwarden Enhanced Protection

Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards prevent corruption:

Every 6 hours (vaultwarden-backup CronJob):

  1. Run PRAGMA integrity_check on live database
  2. If check fails → abort (alert fires)
  3. If check passes → sqlite3 .backup /mnt/main/vaultwarden-backup/db-$(date +%Y%m%d%H%M).sqlite
  4. Run PRAGMA integrity_check on backup copy
  5. Copy RSA keys, attachments, sends folder, config.json
  6. Rotate backups older than 30 days

Every hour (vaultwarden-integrity-check CronJob):

  1. Run PRAGMA integrity_check on live database
  2. Push metric to Pushgateway: vaultwarden_sqlite_integrity_ok{status="ok"}=1 or =0
  3. Prometheus scrapes Pushgateway and alerts on integrity_ok == 0

This provides both frequent backups (every 6h) AND continuous integrity monitoring (hourly).
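The gate in the six-hour job (check the live database, back up, then check the copy) can be sketched as below. Function names are hypothetical. `SQLITE3` is an overridable variable introduced only so the flow can be exercised with a stub, and it defaults to the real sqlite3 CLI.

```shell
SQLITE3=${SQLITE3:-sqlite3}

# PRAGMA integrity_check prints the single row "ok" for a healthy database.
integrity_ok() {
  [ "$("$SQLITE3" "$1" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ]
}

# Hypothetical flow mirroring steps 1-4 above.
# Returns 1 if the live DB is corrupt or the copy fails, 2 if the copy is corrupt.
vaultwarden_backup() {
  local live=$1 dest=$2
  if ! integrity_ok "$live"; then
    echo "live DB failed integrity_check, aborting" >&2
    return 1
  fi
  "$SQLITE3" "$live" ".backup '$dest'" || return 1
  if ! integrity_ok "$dest"; then
    echo "backup copy failed integrity_check" >&2
    return 2
  fi
}
```

Aborting before `.backup` when the live check fails keeps a corrupt database from being written into the rotation and aging healthy copies out of the 30-day window.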

Layer 3: Offsite Sync to Synology NAS

Script: /usr/local/bin/offsite-sync-backup on PVE host (source: infra/scripts/offsite-sync-backup)

Schedule: Daily 06:00 via systemd timer (After=daily-backup.service)

Two-step offsite sync:

Step 1: sda to Synology pve-backup/

Method: rsync from /mnt/backup/ to synology.viktorbarzin.lan:/Backup/Viki/pve-backup/

Content: PVC snapshots (pvc-data/), pfSense backups, PVE config, SQLite backups, plus the nfs-mirror output (anca-elements + ~30 critical NFS subtrees); see Layer 3a. After consolidation, sda is the single source for the bulk of Synology's payload.

Destination: Synology/Backup/Viki/pve-backup/:

  • pvc-data/<YYYY-WW>/ — 4 weekly PVC file backups
  • sqlite-backup/ — auto SQLite backups
  • pfsense/<YYYY-WW>/ — 4 weekly pfSense backups
  • pve-config/ — latest PVE config
  • anca-elements/, mysql/, postgresql/, nextcloud/, health/, <other critical NFS dirs>/ — from nfs-mirror (Layer 3a)

Step 2: sda-bypass NFS to Synology nfs/ + nfs-ssd/ (inotify change-tracked, FILTERED)

Role: Only carries paths that bypass sda — i.e., paths the nfs-mirror script explicitly skips (immich, frigate, prometheus, *-backup, …). Paths that ARE on sda reach Synology via Step 1 and are explicitly excluded from Step 2 to prevent double-syncing. The Step 2 INCLUDE list MUST stay in sync with nfs-mirror's EXCLUDES — they are complementary.
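One way to keep the two lists complementary is to derive both from a single source. The sketch below is hypothetical: neither script is known to work this way today, and the variable and function names are illustrative. The list itself is the bypass list from the Overview.

```shell
# Single source of truth: one bypass entry per line ("*" means any name part).
BYPASS_LIST='immich
frigate
prometheus
loki
temp
alertmanager
ollama
audiblez
ebook2audiobook
*-backup'

# Leg 1 (nfs-mirror): rsync --exclude arguments, anchored at the transfer root.
mirror_excludes() {
  printf '%s\n' "$BYPASS_LIST" | while IFS= read -r p; do
    printf '%s\n' "--exclude=/$p/"
  done
}

# Leg 2 (offsite-sync Step 2): extended regex matching only bypass paths.
bypass_regex() {
  local alt
  alt=$(printf '%s\n' "$BYPASS_LIST" | sed 's#\*#[^/]+#g' | paste -sd'|' -)
  printf '^/srv/nfs/(%s)/\n' "$alt"
}
```

Because both outputs come from the same list, adding a service to the bypass list moves it from leg 1 to leg 2 in one edit, with no window where it is carried by both legs or by neither.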

Method: rsync --files-from /mnt/backup/.nfs-changes.log with regex filter ^/srv/nfs/(immich|frigate|prometheus|loki|temp|alertmanager|ollama|audiblez|ebook2audiobook|[^/]+-backup)/. The monthly full sync uses --include='/<bypass-path>/***' … --exclude='*' to limit to the same set. nfs-ssd/ (all of immich-ML / ollama / llamacpp) is entirely bypass-list, so a plain --delete still applies.

Change tracking: nfs-change-tracker.service (systemd, inotifywait) on PVE host watches /srv/nfs and /srv/nfs-ssd continuously. Changed file paths are logged to /mnt/backup/.nfs-changes.log. Step 2 reads this log and transfers only changed files matching the bypass regex. Incremental syncs complete in seconds.
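The step from change log to rsync input can be sketched as follows, assuming the log holds one absolute path per line (its exact format is not shown here). The function name is hypothetical, and the regex is the one quoted in the Method paragraph above.

```shell
BYPASS_RE='^/srv/nfs/(immich|frigate|prometheus|loki|temp|alertmanager|ollama|audiblez|ebook2audiobook|[^/]+-backup)/'

# Keep only bypass-list paths, drop duplicates, and print them relative to
# /srv/nfs so they can be fed to rsync --files-from with /srv/nfs/ as source.
bypass_files_from() {
  grep -E "$BYPASS_RE" "$1" | sort -u | sed 's#^/srv/nfs/##'
}

# Illustrative use (destination path taken from this document, flags assumed):
# bypass_files_from /mnt/backup/.nfs-changes.log > /tmp/step2.list
# rsync -a --files-from=/tmp/step2.list /srv/nfs/ \
#   synology.viktorbarzin.lan:/Backup/Viki/nfs/
```

De-duplicating matters because a file rewritten many times between syncs appears once per inotify event in the log but only needs one transfer.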

Monthly full sync: On 1st Sunday of month, runs rsync --delete with the bypass-only include list for cleanup.
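A first-Sunday check can be written in a few lines with GNU date. This is a sketch with a hypothetical function name; how the real script makes this decision is not shown in this document.

```shell
# True if the given date (default: today) is the first Sunday of its month.
# Requires GNU date: %u is the ISO weekday (7 = Sunday) and %-d is the
# day of month without zero padding.
is_first_sunday() {
  local d=${1:-today}
  [ "$(date -d "$d" +%u)" -eq 7 ] && [ "$(date -d "$d" +%-d)" -le 7 ]
}

# Illustrative use inside a daily job:
# if is_first_sunday; then MODE=full; else MODE=incremental; fi
```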

/srv/nfs/anca-elements/ history: had its own dedicated Synology exclusion line earlier in 2026-05-24 because the original Synology source (/volume1/Backup/Anca/Elements) was being preserved while we moved canonical to PVE. After the original was deleted (same day), anca-elements joined the broader "NOT bypassing sda" category and is covered by Step 1 via nfs-mirror.

Layer 3a: NFS local mirror on sda (3-2-1 second copy): /usr/local/bin/nfs-mirror rsyncs the critical subset of /srv/nfs/ → /mnt/backup/<service>/ weekly (Mon 04:00). Single rsync invocation, single destination. The skip-list (in nfs-mirror.sh EXCLUDES) drops paths that don't justify a second local copy:

  • immich (1.2T) — too big for sda; Synology offsite is the only 2nd copy by design
  • frigate (camera recordings, 14d auto-rotate)
  • prometheus, loki (TSDB + logs — rebuildable / policy-driven retention)
  • ollama, llamacpp, audiblez, ebook2audiobook (re-downloadable / regenerable)
  • temp, alertmanager (transient state)
  • *-backup (CronJob outputs — these ARE backups; backing up the backup is meta)
  • /srv/nfs-ssd entirely (after the SSD skips above, residual is ~0)

Everything else under /srv/nfs/ (anca-elements + ~30 critical service NFS subtrees: mysql, postgresql, nextcloud, health, real-estate-crawler, audiobookshelf, servarr, technitium, openclaw, ...) lands at /mnt/backup/<svc>/. Total mirror size ≈ 900 GB (mostly anca-elements at 770G).

Pushes nfs_mirror_last_run_timestamp + nfs_mirror_last_status + nfs_mirror_bytes to Pushgateway. Alerts: NfsMirrorStale (>16d), NfsMirrorFailing (status != 0). The rsync invocation is rsync -rlt --delete -H --no-perms --no-owner --no-group, which is idempotent. The unit runs with Nice=10 and IOSchedulingClass=idle so it does not compete with foreground IO.

History: anca-elements-mirror.{sh,service,timer} was a precursor (2026-05-24 morning) dedicated to /srv/nfs/anca-elements only. Subsumed by nfs-mirror later the same day to consolidate ad-hoc copy scripts into one.

Destination:

  • Synology/Backup/Viki/nfs/ — mirrors /srv/nfs
  • Synology/Backup/Viki/nfs-ssd/ — mirrors /srv/nfs-ssd

Monitoring: Pushes offsite_backup_sync_last_success_timestamp to Pushgateway. Alerts: OffsiteBackupSyncStale (>8d), OffsiteBackupSyncFailing.

TrueNAS Cloud Sync — DECOMMISSIONED 2026-04-13

TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04-13). The current offsite path is inotify-change-tracked rsync from the Proxmox host NFS (/srv/nfs, /srv/nfs-ssd) to Synology.

Synology snapshot management

Synology DSM keeps daily btrfs snapshots of every shared folder (the Backup share most importantly). Retention is configured per-share in DSM's Snapshot Replication app, and persists in synosharesnapshot shareconf.

Current settings (Backup share, 2026-05-24): daily at 02:00, snap_auto_remove_keep_days=3 (tightened from 7 to reduce the window where deleted data continues to consume space).

Snapshots are CoW — deleting a file from the live filesystem does NOT free its blocks while any retained snapshot references them. Reclaim only happens after ALL referencing snapshots roll off.

DSM Web API is gated by 2FA (FIDO/OTP) — programmatic snapshot management has to go via SSH + sudo instead:

# Password is in Vault: secret/viktor → synology_admin_password
PASS=$(VAULT_ADDR=https://vault.viktorbarzin.me vault kv get -field=synology_admin_password secret/viktor)

# List snapshots on the Backup share
ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup"

# Bulk delete ALL snapshots (reclaims everything once btrfs cleaner runs)
ssh Administrator@192.168.1.13 "
  SNAPS=\$(echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup 2>/dev/null \
    | grep -oE 'GMT-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+\.[0-9]+\.[0-9]+' | sort -u)
  echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot delete Backup \$SNAPS
"

# Tighten retention
ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot shareconf set Backup snap_auto_remove_keep_days=3"

The btrfs cleaner thread reclaims async — df may lag the snapshot-delete by minutes (typical reclaim rate observed 2026-05-24: ~300 MB/s sustained, with bursts of 800 GB in 2 minutes).

Memory: id=2673-2676 (Synology snapshot retention gotcha — deletion vs reclaim timing).

Configuration

Key Files

| Path | Purpose |
|---|---|
| /usr/local/bin/lvm-pvc-snapshot | PVE host: LVM snapshot creation + restore |
| /usr/local/bin/daily-backup | PVE host: PVC file copy + auto SQLite backup + pfSense |
| /usr/local/bin/offsite-sync-backup | PVE host: two-step rsync to Synology (sda + NFS via inotify) |
| /mnt/backup/ | PVE host: sda mount point (1.1TB backup disk) |
| /mnt/backup/.nfs-changes.log | NFS change log from inotifywait, consumed by offsite-sync |
| /etc/systemd/system/nfs-change-tracker.service | inotifywait watcher for /srv/nfs + /srv/nfs-ssd |
| /etc/systemd/system/lvm-pvc-snapshot.timer | Daily 03:00 (LVM snapshots) |
| /etc/systemd/system/daily-backup.timer | Daily 05:00 (file backup) |
| /etc/systemd/system/offsite-sync-backup.timer | Daily 06:00 (offsite sync) |
| /usr/local/bin/nfs-mirror | PVE host: weekly selective mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
| /etc/systemd/system/nfs-mirror.timer | Weekly Mon 04:00 (NFS local mirror to sda) |
| stacks/dbaas/ | Terraform: PostgreSQL/MySQL backup CronJobs |
| stacks/vault/ | Terraform: Vault backup CronJob |
| stacks/vaultwarden/ | Terraform: Vaultwarden backup + integrity CronJobs |
| stacks/monitoring/ | Terraform: Prometheus alerts |
| synology:Administrator@192.168.1.13 | Synology SSH; sudo password = Vault secret/viktor synology_admin_password; DSM API itself gated by 2FA |
| /usr/syno/sbin/synosharesnapshot | Synology: btrfs snapshot CLI; must run as root via sudo |

Vault Paths

| Path | Contents |
|---|---|
| secret/viktor/synology_ssh_key | SSH key for Synology NAS SFTP access |
| secret/viktor/pfsense_api_key | pfSense API key + secret for config backup |

Terraform Stacks

Each backup CronJob is defined in the application's stack:

  • PostgreSQL/MySQL: stacks/dbaas/backup.tf
  • Vault: stacks/vault/backup.tf
  • Vaultwarden: stacks/vaultwarden/backup.tf
  • etcd: stacks/platform/etcd-backup.tf

Decisions & Rationale

Why 3-2-1 Strategy?

3 copies:

  • Live PVCs (zero RTO for recent data)
  • sda local backup (fast recovery without network)
  • Synology offsite (site-level disaster protection)

2 media types:

  • sdc RAID1 HDD thin pool (live PVCs and VM disks)
  • sda RAID1 SAS (dedicated backup disk, separate VG)

1 offsite:

  • Protection against fire, theft, catastrophic hardware failure
  • Weekly RPO acceptable for offsite (daily/weekly app backups reduce exposure)

Why File-Level + Block-Level Snapshots?

LVM snapshots (Layer 1):

  • Near-instant (<1s), zero overhead
  • Point-in-time recovery for entire PVCs
  • BUT: Cannot restore individual files, no offsite protection, 7-day retention

File-level backup (Layer 2):

  • Can restore single files or directories
  • Offsite-compatible (rsync)
  • Longer retention (4 weeks local, unlimited offsite)
  • BUT: Slower RTO (rsync), higher storage overhead

Both together provide flexibility: fast local rollback for recent changes, granular recovery for older data.

Why Dedicated Backup Disk (sda)?

Isolation: If sdc fails (thin pool corruption, controller failure), sda is independent (different disk, different VG).

Performance: Backup I/O doesn't compete with live PVC I/O.

Simplicity: Single mount point (/mnt/backup/) for all backup data, easy to monitor disk usage.

Why Not Velero/Longhorn Backup?

Evaluated K8s-native backup solutions (Velero, Longhorn):

  • Velero: Requires object storage backend, complex restore, doesn't handle databases well
  • Longhorn: High overhead (replicas, snapshots in-cluster), no offsite by default

Current approach wins because:

  • Leverages existing Proxmox LVM infrastructure (already running)
  • Database-native backups (pg_dump/mysqldump) are battle-tested
  • Simple restore procedures (documented runbooks)
  • Lower resource overhead (no in-cluster replicas)

Why Hybrid Incremental + Full Sync?

Incremental alone (rsync --files-from via inotify change log) is risky:

  • Deleted files on source never deleted on destination
  • Renamed paths create duplicates
  • No cleanup of orphaned files

Full sync alone (rsync --delete) is slow:

  • 30-60 min per run (all files scanned)
  • 7d RPO → 14d if a sync fails

Hybrid approach:

  • Fast daily incremental via inotify change tracking (completes in seconds)
  • Monthly full rsync --delete for cleanup (tolerates longer runtime)

Why 6h Vaultwarden Backup vs Daily for Others?

Vaultwarden stores password vault data — highest-value target:

  • User creates 10 new passwords → disaster 5h later → daily backup loses all 10
  • 6h RPO acceptable for password vaults (industry standard is 1-24h)
  • Hourly integrity checks detect corruption before it spreads to backups

Other services (MySQL, PostgreSQL):

  • Mostly application data (not authentication secrets)
  • Daily RPO acceptable per user tolerance
  • Lower change velocity

Troubleshooting

LVM Snapshot Restore Issues

See docs/runbooks/restore-lvm-snapshot.md.

Weekly Backup Failing

Symptom: WeeklyBackupStale or WeeklyBackupFailing alert

Diagnosis:

ssh root@192.168.1.127
systemctl status daily-backup.service
journalctl -u daily-backup.service --since "7 days ago"
df -h /mnt/backup

Common causes:

  • Backup disk full (check df -h /mnt/backup, alert: BackupDiskFull)
  • LV mount failed (check lvs pve, dmesg | grep backup)
  • NFS mount failed (check showmount -e 192.168.1.127)

Fix:

  1. If disk full: Clean up old weekly versions manually, adjust retention
  2. If LV mount failed: lvchange -ay backup/data && mount /mnt/backup
  3. If NFS failed: Check Proxmox NFS availability (showmount -e 192.168.1.127), verify exports
  4. Manually trigger: systemctl start daily-backup.service
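For the disk-full case, a quick usage check can be scripted with POSIX `df`. The function names and the 90% default are assumptions; the BackupDiskFull alert threshold is not stated in this document.

```shell
# Print the used percentage (integer, no % sign) of the filesystem holding $1.
# "df -P" forces the one-line-per-filesystem POSIX format; column 5 is Capacity.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if usage is at or above the threshold (default 90, an assumed value).
disk_nearly_full() {
  local pct
  pct=$(disk_used_pct "$1")
  [ "$pct" -ge "${2:-90}" ]
}

# Illustrative use on the PVE host:
# disk_nearly_full /mnt/backup && echo "sda above threshold, prune old weeks"
```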

Offsite Sync Failing

Symptom: OffsiteBackupSyncStale or OffsiteBackupSyncFailing alert

Diagnosis:

ssh root@192.168.1.127
systemctl status offsite-sync-backup.service
journalctl -u offsite-sync-backup.service --since "7 days ago"
wc -l /mnt/backup/.nfs-changes.log  # verify change log exists
systemctl status nfs-change-tracker.service  # verify inotify watcher

Common causes:

  • Synology NAS unreachable (network, SFTP down)
  • SSH key auth failed (permissions, expired key)
  • nfs-change-tracker.service stopped (no change log)

Fix:

  1. Verify Synology: ping 192.168.1.13, ssh root@192.168.1.13
  2. Verify SSH key: ssh -i /root/.ssh/synology_backup root@192.168.1.13
  3. Verify change tracker running: systemctl status nfs-change-tracker.service
  4. Manually trigger: systemctl start offsite-sync-backup.service
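
The diagnosis and fix steps above can be run as one pass that reports every check instead of stopping at the first failure. The helper below is a hypothetical sketch; the function name and labels are illustrative, while the hosts, key path, and unit names in the usage comments come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: run a diagnostic command and print PASS/FAIL for it.

# check LABEL CMD [ARGS...] -> prints "PASS LABEL" or "FAIL LABEL"
check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Example calls (run on the Proxmox host):
#   check "synology ping"  ping -c 1 -W 2 192.168.1.13
#   check "synology ssh"   ssh -i /root/.ssh/synology_backup \
#                              -o BatchMode=yes -o ConnectTimeout=5 \
#                              root@192.168.1.13 true
#   check "change tracker" systemctl is-active --quiet nfs-change-tracker.service
#   check "change log"     test -e /mnt/backup/.nfs-changes.log
```

BatchMode=yes makes ssh fail immediately rather than prompt for a password, so a rejected key shows up as a FAIL line instead of a hung terminal.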

PostgreSQL Backup Stale Alert

Symptom: PostgreSQLBackupStale firing in Prometheus

Diagnosis:

kubectl get cronjob -n dbaas
kubectl logs -n dbaas job/postgresql-backup-<timestamp>

Common causes:

  • Pod OOMKilled (increase memory limit)
  • NFS mount unavailable (check Proxmox NFS)
  • pg_dumpall command failed (check PostgreSQL connectivity)

Fix:

  1. If OOM: Increase resources.limits.memory in stacks/dbaas/backup.tf
  2. If NFS: Verify mount on worker node, restart NFS server on Proxmox host if needed (systemctl restart nfs-server)
  3. Manually trigger: kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas

Vaultwarden Integrity Check Failing

Symptom: VaultwardenIntegrityFail alert, vaultwarden_sqlite_integrity_ok=0

Diagnosis:

kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "PRAGMA integrity_check;"

Critical: If integrity check fails, database is corrupt.

Recovery:

  1. Stop writes: kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden
  2. Restore from latest backup (see restore-vaultwarden.md)
  3. Verify integrity on restored DB
  4. Scale back up: kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden

pfSense Backup Failing

Symptom: PfsenseBackupStale alert

Diagnosis:

ssh root@192.168.1.127
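# systemctl status only shows the last few log lines, so search the journal instead
journalctl -u daily-backup.service --since "2 days ago" | grep -i -A5 pfsense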

Common causes:

  • API key expired/invalid
  • SSH auth failed (password changed, key rejected)
  • pfSense unreachable

Fix:

  1. Verify API key: curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <key>"
  2. Verify SSH: ssh root@pfsense.viktorbarzin.me
  3. Update credentials in Vault secret/viktor/pfsense_api_key

Backup Disk Full

Symptom: BackupDiskFull alert, df -h /mnt/backup >85%

Fix:

ssh root@192.168.1.127

# Check space usage by component
du -sh /mnt/backup/pvc-data/*
du -sh /mnt/backup/pfsense/*
du -sh /mnt/backup/sqlite-backup

# Clean up old weekly versions (keep latest 2)
# -r: do nothing when there is nothing to prune; --: stop option parsing before paths
find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs -r rm -rf --
find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs -r rm -rf --
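
Before running the destructive cleanup above, it can help to preview what it would remove. The function below is a minimal sketch of the same "keep latest N" selection without the rm step; the function name is hypothetical, and it assumes GNU find/sort/head and version directory names that sort chronologically, as the ????-?? pattern implies.

```shell
#!/usr/bin/env bash
# Sketch only: list the version directories a "keep latest N" cleanup would
# remove, without deleting anything.

# prune_candidates DIR KEEP -> prints directories older than the newest KEEP
prune_candidates() {
  local dir=$1 keep=$2
  find "$dir" -mindepth 1 -maxdepth 1 -type d -name '????-??' \
    | sort \
    | head -n "-$keep"
}
```

For example, `prune_candidates /mnt/backup/pvc-data 2` prints the directories that would go; once the list looks right, the same output can be piped to `xargs -r rm -rf --`.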

Missing Backup for New Service

Symptom: Added new service using proxmox-lvm storage, no backup exists

Fix: The service is automatically covered by:

  1. LVM snapshots (if not in dbaas/monitoring namespace) — automatic, no config needed
  2. Weekly file backup — automatic, no config needed

If the service has a database that needs app-level dumps: Add backup CronJob in service's Terraform stack (see template below).

Template:

resource "kubernetes_cron_job_v1" "backup" {
  metadata {
    name      = "${var.service_name}-backup"
    namespace = kubernetes_namespace.service.metadata[0].name
  }
  spec {
    schedule = "0 3 * * 0"  # Weekly Sunday 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            container {
              name  = "backup"
              image = "appropriate/image:tag"
              command = ["/bin/sh", "-c"]
              args = [
                <<-EOT
                TIMESTAMP=$(date +%Y%m%d)
                # Dump command here (sqlite3 .backup, pg_dump, etc.)
                # Prune dumps older than 30 days (files only, never the mount point itself)
                find /backup -type f -mtime +30 -delete
                EOT
              ]
              volume_mount {
                name       = "data"
                mount_path = "/data"
              }
              volume_mount {
                name       = "backup"
                mount_path = "/backup"
              }
            }
            volume {
              name = "data"
              persistent_volume_claim {
                claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
              }
            }
            volume {
              name = "backup"
              persistent_volume_claim {
                claim_name = module.nfs_backup.pvc_name
              }
            }
          }
        }
      }
    }
  }
}

module "nfs_backup" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "${var.service_name}-backup"
  namespace  = kubernetes_namespace.service.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/srv/nfs/${var.service_name}-backup"
}
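
As an illustration of what could replace the "Dump command here" placeholder in the template, the sketch below shows a dump-then-prune body for a SQLite-backed service. The function names, the database path, and the file naming are assumptions for the example; a PostgreSQL or MySQL service would swap the sqlite3 call for its own dump tool.

```shell
#!/bin/sh
# Sketch only: dump-then-prune body for a SQLite-backed service.

# backup_name PREFIX -> prints e.g. PREFIX-20260524.sqlite3
backup_name() {
  printf '%s-%s.sqlite3\n' "$1" "$(date +%Y%m%d)"
}

# prune_old DIR DAYS -> delete regular files in DIR older than DAYS days
prune_old() {
  find "$1" -type f -mtime "+$2" -delete
}

# run_backup DB_PATH BACKUP_DIR -> online copy via sqlite3 .backup, then prune.
# Pruning only happens when the dump succeeded, so a broken dump never
# shrinks the set of restorable copies.
run_backup() {
  sqlite3 "$1" ".backup '$2/$(backup_name db)'" && prune_old "$2" 30
}
```

Inside the CronJob this would be called as `run_backup /data/db.sqlite3 /backup`, matching the two volume mounts in the template.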

Monitoring & Alerting

┌────────────────────────────────────────────────────────────────┐
│                     Prometheus Alerts                           │
│                                                                 │
│  PostgreSQLBackupStale      > 36h since last success            │
│  MySQLBackupStale           > 36h since last success            │
│  EtcdBackupStale            > 8d  since last success            │
│  VaultBackupStale           > 8d  since last success            │
│  VaultwardenBackupStale     > 8d  since last success            │
│  RedisBackupStale           > 8d  since last success            │
│  ~~CloudSyncStale~~         REMOVED (TrueNAS decommissioned)    │
│  ~~CloudSyncNeverRun~~      REMOVED (TrueNAS decommissioned)    │
│  ~~CloudSyncFailing~~       REMOVED (TrueNAS decommissioned)    │
│  VaultwardenIntegrityFail   integrity_ok == 0                   │
│  LVMSnapshotStale           > 30h since last snapshot           │
│  LVMSnapshotFailing         snapshot creation failed            │
│  LVMThinPoolLow             < 15% free space in thin pool       │
│  WeeklyBackupStale          > 8d  since last success            │
│  WeeklyBackupFailing        backup script exited non-zero       │
│  PfsenseBackupStale         > 8d  since last success            │
│  OffsiteBackupSyncStale     > 8d  since last success            │
│  OffsiteBackupSyncFailing   offsite sync step failed            │
│  BackupDiskFull             > 85% usage on /mnt/backup          │
└────────────────────────────────────────────────────────────────┘
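
The "> N since last success" rules all reduce to comparing the age of a timestamp with a threshold. The function below is a minimal sketch of that arithmetic for ad-hoc checks from a shell; its name and arguments are illustrative, and the real alert expressions live in the Prometheus rules (typically of the form time() - <timestamp metric> > <seconds>).

```shell
#!/usr/bin/env bash
# Sketch only: staleness check equivalent to "> THRESHOLD since last success".

# is_stale LAST_SUCCESS_EPOCH THRESHOLD_HOURS [NOW_EPOCH]
# Returns 0 (stale) when more than THRESHOLD_HOURS have passed, 1 otherwise.
is_stale() {
  local last=$1 threshold_h=$2 now=${3:-$(date +%s)}
  [ $(( now - last )) -gt $(( threshold_h * 3600 )) ]
}
```

With the thresholds from the table, a daily database dump is checked with `is_stale "$ts" 36` and a weekly job with `is_stale "$ts" 192` (8 days). The slack above the nominal interval keeps one slow run from paging.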

Metrics sources:

  • Backup CronJobs: Push backup_last_success_timestamp to Pushgateway on completion
  • LVM snapshot script: Pushes lvm_snapshot_last_run_timestamp, lvm_snapshot_last_status, lvm_snapshot_created_total, lvm_snapshot_failed_total, lvm_snapshot_pruned_total, lvm_snapshot_thinpool_free_pct (job lvm-pvc-snapshot)
  • Daily backup script: Pushes daily_backup_last_run_timestamp, daily_backup_last_status, daily_backup_bytes_synced (job daily-backup). Disk-fullness alert (BackupDiskFull) does NOT use a script-pushed metric; it derives from node-exporter node_filesystem_avail_bytes{job="proxmox-host", mountpoint="/mnt/backup"}.
  • pfSense backup (step 3 of daily-backup): Pushes backup_last_run_timestamp, backup_last_status, and backup_last_success_timestamp (only on success) under job pfsense-backup. Pushed in BOTH success and failure paths so PfsenseBackupStale doesn't go silent when SSH-to-pfsense breaks.
  • Offsite sync script: Pushes backup_last_success_timestamp, offsite_sync_last_status (job offsite-backup-sync)
  • Prometheus backup (sidecar in prometheus-server pod, monthly 1st-Sunday 04:00 UTC): Pushes prometheus_backup_last_success_timestamp (job prometheus-backup)
  • CloudSync monitor: Removed (TrueNAS decommissioned)
  • Vaultwarden integrity: Pushes vaultwarden_sqlite_integrity_ok hourly
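
The push pattern described for the pfSense backup (run timestamp and status always, success timestamp only on success) can be sketched as follows. This is not the actual script: the function names and the PUSHGATEWAY_URL placeholder are assumptions, and only the metric names are taken from the list above.

```shell
#!/usr/bin/env bash
# Sketch only: report a backup result to a Pushgateway.
# PUSHGATEWAY_URL is a placeholder, not the real endpoint.

PUSHGATEWAY_URL=${PUSHGATEWAY_URL:-http://pushgateway.example:9091}

# render_metrics STATUS NOW_EPOCH -> Prometheus text-format body
# The success timestamp is only emitted when STATUS is 0.
render_metrics() {
  local status=$1 now=$2
  echo "backup_last_run_timestamp $now"
  echo "backup_last_status $status"
  if [ "$status" -eq 0 ]; then
    echo "backup_last_success_timestamp $now"
  fi
}

# push_metrics JOB STATUS -> POST the body to /metrics/job/JOB
push_metrics() {
  render_metrics "$2" "$(date +%s)" \
    | curl --silent --show-error --fail --data-binary @- \
        "$PUSHGATEWAY_URL/metrics/job/$1"
}
```

Using POST rather than PUT matters here: POST replaces only the metric names present in the body, so the previous backup_last_success_timestamp survives a failed run and the staleness alert keeps counting from the last real success.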

Pushgateway persistence: The Pushgateway is configured with --persistence.file=/data/pushgateway.bin --persistence.interval=1m on a 2Gi proxmox-lvm-encrypted PVC (helm values: prometheus-pushgateway.persistentVolume). Without persistence, every pod restart drops the in-memory metrics, and once-per-day pushers (offsite-sync, daily-backup) stay invisible for up to 24h until their next push. That is exactly what triggered the 2026-04-22 backup_offsite_sync FAIL: a node3 kubelet hiccup at 11:42 UTC terminated the Pushgateway 8h after the 03:12 UTC push.

Alert routing:

  • All backup alerts → Slack #infra-alerts
  • Vaultwarden integrity fail → Slack #infra-critical (immediate action required)

Service Protection Matrix

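| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage |
|---|---|---|---|---|---|
| Databases | | | | | |
| PostgreSQL (all DBs) | | | ✓ daily | | proxmox-lvm |
| MySQL (all DBs) | | | ✓ daily | | proxmox-lvm |
| Critical State | | | | | |
| Vault | | | ✓ weekly | | proxmox-lvm |
| etcd | | | ✓ weekly | | proxmox-lvm |
| Vaultwarden | | | ✓ 6h + integrity | | proxmox-lvm |
| Redis | | | ✓ weekly | | proxmox-lvm |
| Applications (65 proxmox-lvm PVCs) | | | | | |
| Prometheus | | | | excluded | proxmox-lvm |
| Nextcloud | | | | | proxmox-lvm |
| Calibre-Web | | | | | proxmox-lvm |
| Forgejo | | | | | proxmox-lvm |
| FreshRSS | | | | | proxmox-lvm |
| ActualBudget | | | | | proxmox-lvm |
| NovelApp | | | | | proxmox-lvm |
| Headscale | | | | | proxmox-lvm |
| Uptime Kuma | | | | | proxmox-lvm |
| Other apps not enumerated above | ✓¹ | ✓¹ | varies | | proxmox-lvm / proxmox-lvm-encrypted |
| Postiz (bundled bitnami PG on local-path) | | | ✓ daily pg_dump → NFS | | local-path + NFS |
| Media (NFS) | | | | | |
| Immich (~800GB) | | | | | NFS |
| Audiobookshelf | | | | | NFS |
| Servarr | | | | | NFS |
| Navidrome | | | | | NFS |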

Legend:

  • ✓ = Protected at this layer
  • — = Not needed (other layers cover it, or data is regenerable/disposable)
  • excluded = Too large/regenerable, not worth offsite bandwidth

Note: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except dbaas and monitoring namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology nfs/ and nfs-ssd/ via inotify change tracking.

¹ "Other apps not enumerated above": the table only enumerates services worth calling out. The default backup posture for any service using proxmox-lvm or proxmox-lvm-encrypted (outside dbaas/monitoring) is automatic Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (vm-*-pvc-*), so adding a new service to the cluster gets it covered without any explicit registration. Run ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^ *vm-.*-pvc-' | grep -v _snap_ | wc -l to see the live count (lvs indents its output, hence the leading ' *' in the pattern).
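
The name filter behind the auto-discovery rule can be sketched as a function that reads lvs output on stdin. It is an illustration, not the snapshot script itself; the function name is hypothetical. Because lvs pads its output with leading spaces, the sketch matches on the first whitespace-separated field rather than on the raw line.

```shell
#!/usr/bin/env bash
# Sketch only: count PVC-backed LVs from `lvs --noheadings -o lv_name` output.

# count_pvc_lvs < LV_NAMES -> number of PVC volumes, excluding snapshots
count_pvc_lvs() {
  awk '$1 ~ /^vm-.*-pvc-/ && $1 !~ /_snap_/ { n++ } END { print n + 0 }'
}
```

Usage: `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | count_pvc_lvs`. Volumes whose names do not match the pattern are skipped without any warning, so a PVC that is missing from the count is not being snapshotted.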

Known gaps — services with PVCs not on the proxmox-lvm path lose Layer 1+2:

  • Postiz PG and Redis (bundled bitnami chart) live on local-path (K8s node OS disk). PG covered by the postiz-postgres-backup CronJob (daily pg_dump → /srv/nfs/postiz-backup/, Layer 3 via offsite sync). Redis is regenerable cache — not backed up.
  • Prometheus, Alertmanager, Pushgateway: monitoring namespace excluded by policy; loss is acceptable (metrics regenerable, silences ephemeral, Pushgateway has on-disk persistence for 24h gap tolerance).

Recovery Procedures

Detailed runbooks in docs/runbooks/:

  • restore-lvm-snapshot.md — Instant rollback of a PVC using LVM snapshot (RTO <5 min)
  • restore-pvc-from-backup.md — Restore a PVC from sda file backup (when snapshots expired)
  • restore-postgresql.md — Restore individual database (from per-db pg_dump -Fc) or full cluster (from pg_dumpall)
  • restore-mysql.md — Restore individual database (from per-db mysqldump) or full cluster (from mysqldump --all-databases)
  • restore-vault.md — Restore Vault from raft snapshot
  • restore-vaultwarden.md — Restore password vault from sqlite3 backup
  • restore-etcd.md — Restore etcd cluster from snapshot
  • restore-full-cluster.md — Disaster recovery: rebuild cluster from offsite backups

RTO estimates:

  • LVM snapshot rollback: <5 min (instant swap)
  • File-level restore from sda: <15 min (depends on PVC size)
  • Single PostgreSQL database: <5 min
  • Full MySQL cluster: <15 min
  • Vault: <10 min
  • Vaultwarden: <5 min
  • etcd: <20 min (requires cluster rebuild)
  • Full cluster from offsite: <4 hours (NFS restore + K8s bootstrap + app deploys)

Related Documentation

  • Architecture: docs/architecture/storage.md (NFS/Proxmox storage layer)
  • Reference: .claude/reference/service-catalog.md (which services need backups)
  • Runbooks: docs/runbooks/restore-*.md (step-by-step recovery procedures)
  • Monitoring: stacks/monitoring/alerts/backup-alerts.yaml (Prometheus alert definitions)