# Backup & Disaster Recovery Architecture

Last updated: 2026-04-06

## Overview

The homelab uses a defense-in-depth 3-2-1 backup strategy: 3 copies (live PVCs on sdc, weekly backups on sda, offsite on Synology), 2 media (the sdc HDD thin pool and the independent sda SAS backup disk), 1 offsite copy (Synology NAS). This architecture provides ≤24h RPO for recent changes (daily LVM snapshots, 7-day retention), ≤7d RPO for file-level recovery, and <30min RTO for most services.
**3-2-1 Breakdown:**
- **Copy 1 (live):** all PVC data + VM disks on the Proxmox sdc thin pool (10.7TB RAID1 HDD)
- **Copy 2 (local backup):** weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS)
- **Copy 3 (offsite):** Synology NAS at 192.168.1.13 via two paths:
  - `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (weekly `rsync --files-from`)
  - `Synology/Backup/Viki/truenas/` — TrueNAS NFS media (Cloud Sync, narrowed to media only)
## Architecture Diagram

### Overall Backup Flow

```mermaid
graph TB
    subgraph Proxmox["Proxmox Host (192.168.1.127)"]
        sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
        sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
        subgraph Layer1["Layer 1: LVM Thin Snapshots"]
            Snap["Daily 03:00<br/>7-day retention<br/>62 PVCs (excludes dbaas+monitoring)"]
        end
        subgraph Layer2["Layer 2: Weekly File Backup"]
            PVCBackup["PVC File Copy<br/>Sunday 05:00<br/>4 weekly versions<br/>/mnt/backup/pvc-data/<YYYY-WW>/"]
            NFSMirror["NFS Mirror<br/>DB dumps + backup CronJob output<br/>/mnt/backup/nfs-mirror/"]
            PfsenseBackup["pfSense Backup<br/>config.xml + full tar<br/>4 weekly versions"]
            PVEConfig["PVE Config<br/>/etc/pve + scripts"]
        end
        sdc --> Snap
        sdc --> PVCBackup
        PVCBackup --> sda
        NFSMirror --> sda
        PfsenseBackup --> sda
        PVEConfig --> sda
    end
    subgraph TrueNAS["TrueNAS (10.0.10.15)"]
        NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
        Media["Media (NFS)<br/>Immich ~800GB<br/>audiobookshelf, servarr, navidrome"]
        subgraph AppBackups["App-Level Backup CronJobs"]
            CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
            CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden<br/>30d retention"]
        end
        CronDaily --> NFS_Backup
        CronWeekly --> NFS_Backup
    end
    subgraph Layer3["Layer 3: Offsite Sync"]
        PVEOffsite["PVE → Synology<br/>Sunday 08:00<br/>rsync --files-from<br/>/Backup/Viki/pve-backup/"]
        CloudSync["TrueNAS → Synology<br/>Monday 09:00<br/>Cloud Sync (media only)<br/>/Backup/Viki/truenas/"]
    end
    sda --> PVEOffsite
    Media --> CloudSync
    Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"]
    PVEOffsite --> Synology
    CloudSync --> Synology
    NFS_Backup -.->|mirrored to sda| NFSMirror
    subgraph Monitoring["Monitoring & Alerting"]
        Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale, MySQLBackupStale<br/>WeeklyBackupStale, OffsiteBackupSyncStale<br/>LVMSnapshotStale, BackupDiskFull<br/>VaultwardenIntegrityFail"]
        Pushgateway["Pushgateway<br/>backup script metrics<br/>cloudsync metrics<br/>vaultwarden integrity"]
    end
    PVCBackup -.->|push metrics| Pushgateway
    Snap -.->|push metrics| Pushgateway
    Pushgateway --> Prometheus
    style Layer1 fill:#c8e6c9
    style Layer2 fill:#ffe0b2
    style Layer3 fill:#e1f5ff
    style Monitoring fill:#f3e5f5
```
### Weekly Backup Timeline

```mermaid
graph LR
    subgraph Sunday["Sunday Timeline"]
        S01["01:00 etcd backup<br/>(CronJob)"]
        S02["02:00 Vault backup<br/>(CronJob)"]
        S03a["03:00 Redis backup<br/>(CronJob)"]
        S03b["03:00 LVM snapshots<br/>(lvm-pvc-snapshot timer)"]
        S05["05:00 Weekly backup<br/>(weekly-backup timer)<br/>1. NFS mirror<br/>2. PVC file copy<br/>3. pfSense backup<br/>4. PVE config<br/>5. Prune snapshots<br/>6. Generate manifest"]
        S08["08:00 Offsite sync<br/>(offsite-sync-backup timer)<br/>rsync --files-from"]
    end
    S01 --> S02 --> S03a --> S03b --> S05 --> S08
    subgraph Monday["Monday"]
        M09["09:00 TrueNAS Cloud Sync<br/>Media → Synology"]
    end
    S08 -.->|next day| M09
    style Sunday fill:#ffe0b2
    style Monday fill:#e1f5ff
```
### Physical Disk Layout

```mermaid
graph TB
    subgraph PVE["Proxmox Host (192.168.1.127)"]
        subgraph sda["sda: 1.1TB RAID1 SAS"]
            sda_vg["VG: backup<br/>LV: data (ext4)<br/>/mnt/backup"]
            sda_content["pvc-data/<YYYY-WW>/<ns>/<pvc>/<br/>nfs-mirror/<service>-backup/<br/>pfsense/<YYYY-WW>/<br/>pve-config/"]
        end
        subgraph sdb["sdb: 931GB SSD"]
            sdb_vg["VG: pve<br/>LV: root (ext4)<br/>PVE host OS"]
        end
        subgraph sdc["sdc: 10.7TB RAID1 HDD"]
            sdc_vg["VG: pve<br/>LV: data (thin pool)<br/>65 proxmox-lvm PVCs<br/>+ VM disks"]
        end
        sda_vg --> sda_content
    end
    sdc -.->|weekly backup<br/>mount snapshot ro| sda
    sda -.->|offsite sync<br/>rsync| Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/pve-backup/"]
    style sda fill:#fff9c4
    style sdb fill:#c8e6c9
    style sdc fill:#e1f5ff
```
### Restore Decision Tree

```mermaid
graph TB
    Start["Data loss detected"]
    Age{"How old is<br/>the lost data?"}
    Type{"What type<br/>of data?"}
    Start --> Age
    Age -->|"< 7 days"| LVM["Use LVM snapshot<br/>lvm-pvc-snapshot restore<br/>RTO: <5 min"]
    Age -->|"> 7 days,<br/>< 4 weeks"| FileBackup["Use sda file backup<br/>/mnt/backup/pvc-data/<week>/<br/>RTO: <15 min"]
    Age -->|"> 4 weeks or<br/>site disaster"| Offsite["Use Synology backup<br/>Synology/pve-backup/<br/>RTO: <4 hours"]
    LVM --> Type
    FileBackup --> Type
    Offsite --> Type
    Type -->|"Database"| AppBackup["Use app-level dump<br/>/mnt/backup/nfs-mirror/<service>-backup/<br/>OR Synology/pve-backup/nfs-mirror/<br/>RTO: <10 min"]
    Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"]
    Type -->|"Media (NFS)"| CloudSync["Use Synology backup<br/>Synology/truenas/<service>/<br/>RTO: varies by size"]
    style Start fill:#ffcdd2
    style LVM fill:#c8e6c9
    style FileBackup fill:#fff9c4
    style Offsite fill:#e1f5ff
    style AppBackup fill:#e1bee7
```
### Vaultwarden Enhanced Protection

```mermaid
graph LR
    subgraph Every6h["Every 6 hours"]
        VWBackup["vaultwarden-backup CronJob"]
        Step1["1. PRAGMA integrity_check<br/>(fail → abort)"]
        Step2["2. sqlite3 .backup<br/>/mnt/main/vaultwarden-backup/"]
        Step3["3. PRAGMA integrity_check<br/>on backup copy"]
        Step4["4. Copy RSA keys, attachments,<br/>sends, config.json"]
        Step5["5. Rotate backups (30d)"]
        VWBackup --> Step1 --> Step2 --> Step3 --> Step4 --> Step5
    end
    subgraph Hourly["Every hour"]
        VWCheck["vaultwarden-integrity-check"]
        Check1["PRAGMA integrity_check"]
        Metric["Push metric to Pushgateway:<br/>vaultwarden_sqlite_integrity_ok"]
        VWCheck --> Check1 --> Metric
    end
    Metric -.->|Prometheus scrape| Alert["Alert if integrity_ok == 0"]
    style Every6h fill:#fff9c4
    style Hourly fill:#e1bee7
```
## Components

| Component | Schedule | Location | Purpose |
|---|---|---|---|
| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 62 proxmox-lvm PVCs |
| Weekly PVC Backup | Sunday 05:00, 4 weeks | PVE host: `weekly-backup` | File-level PVC copy to sda |
| NFS Mirror | Sunday 05:00 + weekly-backup | PVE host: mount NFS ro → rsync | Mirror DB dumps to sda |
| pfSense Backup | Sunday 05:00 + weekly-backup | PVE host: SSH + API | config.xml + full filesystem tar |
| Offsite Sync | Sunday 08:00 (after weekly-backup) | PVE host: `offsite-sync-backup` | rsync sda → Synology |
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in dbaas namespace | pg_dumpall for all databases |
| MySQL Backup | Daily 00:30, 14d retention | CronJob in dbaas namespace | mysqldump for all databases |
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in kube-system | etcdctl snapshot |
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in vaultwarden | sqlite3 .backup + integrity |
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in vault | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in redis | BGSAVE + copy |
| Vaultwarden Integrity Check | Hourly | CronJob in vaultwarden | PRAGMA integrity_check → metric |
| TrueNAS Cloud Sync | Monday 09:00 (weekly) | TrueNAS Cloud Sync Task 1 | Media → Synology NAS |
## How It Works

### Layer 1: LVM Thin Snapshots (Fast Local Recovery)

Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.

**Script:** `/usr/local/bin/lvm-pvc-snapshot` on the PVE host (source: `infra/scripts/lvm-pvc-snapshot`)

**Schedule:** daily 03:00 via systemd timer, 7-day retention

**Discovery:** auto-discovers PVC LVs matching the `vm-*-pvc-*` pattern in the VG `pve` thin pool `data`

**Coverage:** all 65 proxmox-lvm PVCs except those in the dbaas and monitoring namespaces, which are excluded because:
- MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour)
- they already have app-level dumps (Layer 2)
- including them causes ~36% write amplification; excluding them reduces overhead to ~0%

**Monitoring:** pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).

**Restore:** `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` auto-discovers the K8s workload, scales it down, swaps LVs, and scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
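The discovery step can be illustrated as a filter over `lvs` output. The sketch below uses hard-coded sample output, and the LV names plus the commented `lvcreate` invocation are hypothetical; the real logic lives in `infra/scripts/lvm-pvc-snapshot`:

```shell
# Hypothetical sample of `lvs --noheadings -o lv_name,pool_lv` output on the PVE host.
lvs_output='vm-101-disk-0    data
vm-101-pvc-0b1c2d3e    data
vm-102-pvc-9f8e7d6c    data
root'
# Keep only PVC LVs in the thin pool, mirroring the vm-*-pvc-* pattern.
pvc_lvs=$(printf '%s\n' "$lvs_output" | awk '$2 == "data" && $1 ~ /^vm-[0-9]+-pvc-/ {print $1}')
printf '%s\n' "$pvc_lvs"
# For each discovered LV the script would then create a CoW snapshot, e.g.:
#   lvcreate -s -n "snap-$(date +%Y%m%d)-$lv" "pve/$lv"
```

Note how `vm-101-disk-0` (a VM disk, not a PVC) and non-thin LVs fall out of the filter.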
### Layer 2: Weekly File-Level Backup (sda Backup Disk)

**Backup disk:** sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4, mounted at `/mnt/backup` on the PVE host. A dedicated backup disk, independent of live storage.

**Script:** `/usr/local/bin/weekly-backup` on the PVE host (source: `infra/scripts/weekly-backup`)

**Schedule:** Sunday 05:00 via systemd timer

**Retention:** 4 weekly versions (weeks 0-3, deduplicated via `--link-dest` hardlinks)

#### What Gets Backed Up

1. **PVC file copies** (`/mnt/backup/pvc-data/<YYYY-WW>/`):
   - mount each LVM thin LV read-only on the PVE host → rsync files (not blocks) → unmount
   - 62 PVCs covered (all except dbaas + monitoring)
   - organized as `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/`
   - 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes)
2. **NFS backup mirror** (`/mnt/backup/nfs-mirror/`):
   - mount TrueNAS NFS read-only → rsync DB dump dirs → unmount
   - covers `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/`
   - single copy (no rotation) — latest dump only
3. **pfSense backup** (`/mnt/backup/pfsense/<YYYY-WW>/`):
   - `config.xml` via API (base64-decoded)
   - full filesystem tar via SSH (`tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf`)
   - 4 weekly versions
4. **PVE config** (`/mnt/backup/pve-config/`):
   - `/etc/pve/` (cluster config, VM definitions)
   - `/usr/local/bin/` (custom scripts)
   - `/etc/systemd/system/` (timers)
   - single copy (no rotation)
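The hardlink dedup behind `--link-dest` can be demonstrated locally and safely; `ln` stands in here for what rsync does internally when a file is unchanged between weeks (paths are illustrative):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/pvc-data/2026-14/nextcloud" "$tmp/pvc-data/2026-15/nextcloud"
echo "unchanged file" > "$tmp/pvc-data/2026-14/nextcloud/config.php"
# rsync --link-dest hardlinks unchanged files to the previous week's copy:
ln "$tmp/pvc-data/2026-14/nextcloud/config.php" "$tmp/pvc-data/2026-15/nextcloud/config.php"
# Both weeks list the file, but they share one inode (no extra space used):
inode_old=$(stat -c %i "$tmp/pvc-data/2026-14/nextcloud/config.php")
inode_new=$(stat -c %i "$tmp/pvc-data/2026-15/nextcloud/config.php")
[ "$inode_old" = "$inode_new" ] && echo "deduplicated"
rm -rf "$tmp"
```

This is why 4 weekly versions cost little more than one full copy plus the weekly churn.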
**Manifest generation:** after the backup completes, the script writes `/mnt/backup/manifest.txt` with all file paths (relative to `/mnt/backup/`), consumed by the offsite sync's `--files-from`.
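A sketch of the manifest step under stated assumptions (the real script's exact `find` flags are not documented here, and the paths are illustrative):

```shell
tmp=$(mktemp -d)   # stands in for /mnt/backup
mkdir -p "$tmp/pvc-data/2026-15/nextcloud" "$tmp/pve-config"
echo x > "$tmp/pvc-data/2026-15/nextcloud/config.php"
echo y > "$tmp/pve-config/storage.cfg"
# List every backed-up file relative to the backup root (the shape that
# rsync --files-from expects), excluding the manifest itself:
(cd "$tmp" && find . -type f ! -name manifest.txt | sed 's|^\./||' | sort) > "$tmp/manifest.txt"
manifest=$(cat "$tmp/manifest.txt")
printf '%s\n' "$manifest"
rm -rf "$tmp"
```

Relative paths matter: rsync resolves each manifest line against both the source and destination roots.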
**Snapshot pruning:** deletes LVM snapshots older than 7 days (a safety net for snapshots that outlive the `lvm-pvc-snapshot` timer).

**Monitoring:** pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`.
### Layer 2b: Application-Level Backups

K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/<service>-backup/`.

**Why needed:** LVM snapshots capture block-level state, but:
- individual databases cannot be restored from a PostgreSQL snapshot
- Proxmox CSI LVs are opaque to TrueNAS (raw block devices)
- point-in-time recovery for a specific app shouldn't require a full LVM rollback

**Daily backups (00:00-00:30):**
- **PostgreSQL** (`pg_dumpall`): dumps all databases to `/mnt/main/postgresql-backup/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
- **MySQL** (`mysqldump`): dumps all databases. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.

**Weekly backups (Sunday 01:00-04:00):**
- **etcd:** `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery.
- **Vaultwarden:** see "Vaultwarden Enhanced Protection" below. 30-day retention.
- **Vault:** `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention.
- **Redis:** `redis-cli BGSAVE`, then copy the RDB file. 30-day retention.
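The rotation step these CronJobs rely on is plain `find -mtime`. A safe, self-contained demonstration (assumes GNU `touch -d`; filenames are illustrative):

```shell
tmp=$(mktemp -d)
touch -d "20 days ago" "$tmp/backup-20260317.sql.gz"  # outside the 14-day window
touch "$tmp/backup-20260406.sql.gz"                   # fresh dump
# The rotation step from the backup CronJobs: delete dumps older than 14 days.
find "$tmp" -name 'backup-*.sql.gz' -mtime +14 -delete
remaining=$(ls "$tmp")
printf '%s\n' "$remaining"
rm -rf "$tmp"
```

`-mtime +14` means strictly more than 14 whole 24-hour periods, so a 14-day-old dump survives one more day.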
### Vaultwarden Enhanced Protection

Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards protect against corruption.

**Every 6 hours** (`vaultwarden-backup` CronJob):
1. Run `PRAGMA integrity_check` on the live database
2. If the check fails → abort (alert fires)
3. If it passes → `sqlite3 .backup /mnt/main/vaultwarden-backup/db-$(date +%Y%m%d%H%M).sqlite`
4. Run `PRAGMA integrity_check` on the backup copy
5. Copy RSA keys, attachments, sends folder, `config.json`
6. Rotate backups older than 30 days

**Every hour** (`vaultwarden-integrity-check` CronJob):
1. Run `PRAGMA integrity_check` on the live database
2. Push a metric to Pushgateway: `vaultwarden_sqlite_integrity_ok{status="ok"}` = 1 or = 0
3. Prometheus scrapes Pushgateway and alerts on `integrity_ok == 0`

This provides both frequent backups (every 6h) AND continuous integrity monitoring (hourly).
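The hourly push amounts to sending one line of Prometheus text-exposition format to Pushgateway. A sketch — the NodePort 30091 comes from this document, but the job name in the URL and the exact label set are assumptions:

```shell
ok=1   # result of PRAGMA integrity_check (1 = "ok", 0 = corruption detected)
payload="vaultwarden_sqlite_integrity_ok $ok"
printf '%s\n' "$payload"
# The real CronJob would then push it, e.g. (hypothetical job name):
#   printf '%s\n' "$payload" | \
#     curl --data-binary @- http://192.168.1.127:30091/metrics/job/vaultwarden-integrity
```

Pushgateway retains the last pushed value, which is what lets Prometheus alert on staleness as well as on the value itself.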
### Layer 3: Offsite Sync to Synology NAS

Two independent paths push backups offsite.

#### Path 1: PVE Host Backups (rsync)

**Script:** `/usr/local/bin/offsite-sync-backup` on the PVE host (source: `infra/scripts/offsite-sync-backup`)

**Schedule:** Sunday 08:00 via systemd timer (`After=weekly-backup.service`)

**Method:** `rsync --files-from /mnt/backup/manifest.txt` to `synology.viktorbarzin.lan:/Backup/Viki/pve-backup/`

**Monthly full sync:** on the 1st Sunday of each month, runs `rsync --delete` (full sync, removes deleted files)

**Why fast:** only changed files are transferred (the manifest is generated by weekly-backup), with no directory traversal (`--no-implied-dirs`).

**Destination:** `Synology/Backup/Viki/pve-backup/` mirrors the sda `/mnt/backup/` structure:
- `pvc-data/<YYYY-WW>/` — 4 weekly PVC file backups
- `nfs-mirror/` — latest DB dumps
- `pfsense/<YYYY-WW>/` — 4 weekly pfSense backups
- `pve-config/` — latest PVE config

**Monitoring:** pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
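The monthly full-sync trigger ("1st Sunday of the month") reduces to a date check: the weekday is Sunday and the day-of-month is at most 7. A sketch assuming GNU `date`; the helper name is made up, and the real script may decide differently:

```shell
# First Sunday of the month <=> weekday is Sunday AND day-of-month is 1..7.
is_first_sunday() {
  dow=$(date -d "$1" +%u)   # 1=Mon .. 7=Sun
  dom=$(date -d "$1" +%-d)  # day of month without leading zero
  [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}
is_first_sunday 2026-04-05 && echo "full sync (rsync --delete)"
is_first_sunday 2026-04-12 || echo "incremental (rsync --files-from)"
```

2026-04-05 is a Sunday with day-of-month 5, so it triggers the full sync; 2026-04-12 is also a Sunday but past day 7, so it stays incremental.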
#### Path 2: TrueNAS Media (Cloud Sync)

**Task:** TrueNAS Cloud Sync Task 1 runs `rclone sync` Monday 09:00

**Source:** `/mnt/main/` (NFS pool on TrueNAS)

**Destination:** `sftp://192.168.1.13/Backup/Viki/truenas`

**Scope:** media libraries only (Immich ~800GB, audiobookshelf, servarr, navidrome music)

**Excludes** (Cloud Sync is configured to skip):
- `clickhouse/**` (2.47M files, regenerable)
- `loki/**` (68K files, regenerable)
- `prometheus/**` (covered by monthly app backup)
- `frigate/**` (ephemeral recordings)
- `audiblez/**`, `ebook2audiobook/**` (regenerable)
- `ollama/**` (chat history, low value)
- `real-estate-crawler/**` (regenerable)
- `crowdsec/**` (regenerable)
- `servarr/downloads/**` (transient)
- `ytldp/**` (replaceable)
- `iscsi/**`, `iscsi-snaps/**` (raw zvols, backed up at app level)
- `*-backup/**` (already mirrored via Path 1)

**Monitoring:** the existing `CloudSyncStale`, `CloudSyncNeverRun`, `CloudSyncFailing` alerts still apply.
## Configuration

### Key Files

| Path | Purpose |
|---|---|
| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
| `/usr/local/bin/weekly-backup` | PVE host: PVC file copy + NFS mirror + pfSense + manifest |
| `/usr/local/bin/offsite-sync-backup` | PVE host: rsync to Synology |
| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
| `/mnt/backup/manifest.txt` | Generated by weekly-backup, consumed by offsite-sync |
| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
| `/etc/systemd/system/weekly-backup.timer` | Sunday 05:00 (file backup) |
| `/etc/systemd/system/offsite-sync-backup.timer` | Sunday 08:00 (offsite sync) |
| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
| `stacks/vault/` | Terraform: Vault backup CronJob |
| `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
| `stacks/monitoring/` | Terraform: Prometheus alerts |

### Vault Paths

| Path | Contents |
|---|---|
| `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access |
| `secret/viktor/pfsense_api_key` | pfSense API key + secret for config backup |

### Terraform Stacks

Each backup CronJob is defined in the application's stack:
- PostgreSQL/MySQL: `stacks/dbaas/backup.tf`
- Vault: `stacks/vault/backup.tf`
- Vaultwarden: `stacks/vaultwarden/backup.tf`
- etcd: `stacks/platform/etcd-backup.tf`
## Decisions & Rationale

### Why 3-2-1 Strategy?

**3 copies:**
1. Live PVCs (zero RTO for recent data)
2. sda local backup (fast recovery without the network)
3. Synology offsite (site-level disaster protection)

**2 media:**
- sdc 10.7TB RAID1 HDD thin pool (live storage)
- sda 1.1TB RAID1 SAS (dedicated, cost-effective backup disk)

**1 offsite:**
- protection against fire, theft, catastrophic hardware failure
- weekly RPO is acceptable for offsite (daily/weekly app backups reduce exposure)
### Why File-Level + Block-Level Snapshots?

**LVM snapshots (Layer 1):**
- near-instant (<1s), zero overhead
- point-in-time recovery for entire PVCs
- BUT: cannot restore individual files, no offsite protection, 7-day retention

**File-level backup (Layer 2):**
- can restore single files or directories
- offsite-compatible (rsync)
- longer retention (4 weeks local, unlimited offsite)
- BUT: slower RTO (rsync), higher storage overhead

Together they provide flexibility: fast local rollback for recent changes, granular recovery for older data.
### Why Dedicated Backup Disk (sda)?

- **Isolation:** if sdc fails (thin pool corruption, controller failure), sda is unaffected (different disk, different VG).
- **Performance:** backup I/O doesn't compete with live PVC I/O.
- **Simplicity:** a single mount point (`/mnt/backup/`) holds all backup data, making disk usage easy to monitor.
### Why Not Velero/Longhorn Backup?

K8s-native backup solutions (Velero, Longhorn) were evaluated:
- **Velero:** requires an object-storage backend, has complex restores, and doesn't handle databases well
- **Longhorn:** high overhead (in-cluster replicas and snapshots), no offsite support by default

The current approach wins because it:
- leverages the existing Proxmox LVM infrastructure (already running)
- uses battle-tested database-native backups (pg_dump/mysqldump)
- keeps restore procedures simple (documented runbooks)
- has lower resource overhead (no in-cluster replicas)
### Why Hybrid Incremental + Full Sync?

**Incremental alone** (`rsync --files-from`) is risky:
- files deleted on the source are never deleted on the destination
- renamed paths create duplicates
- orphaned files are never cleaned up

**Full sync alone** (`rsync --delete`) is slow:
- 30-60 min per run (every file is scanned)
- one failed sync stretches the 7d offsite RPO to 14d

**Hybrid approach:**
- fast incremental weekly (sub-5-minute runtime via the manifest)
- monthly full sync for cleanup (tolerates the longer runtime)
### Why 6h Vaultwarden Backup vs Daily for Others?

Vaultwarden stores password vault data — the highest-value target:
- if a user creates 10 new passwords and disaster strikes 5h later, a daily backup loses all 10
- a 6h RPO is acceptable for password vaults (industry standard is 1-24h)
- hourly integrity checks detect corruption before it spreads into the backups

Other services (MySQL, PostgreSQL):
- mostly application data (not authentication secrets)
- daily RPO acceptable per user tolerance
- lower change velocity
## Troubleshooting

### LVM Snapshot Restore Issues

See `docs/runbooks/restore-lvm-snapshot.md`.

### Weekly Backup Failing

**Symptom:** `WeeklyBackupStale` or `WeeklyBackupFailing` alert

**Diagnosis:**
```shell
ssh root@192.168.1.127
systemctl status weekly-backup.service
journalctl -u weekly-backup.service --since "7 days ago"
df -h /mnt/backup
```

**Common causes:**
- backup disk full (check `df -h /mnt/backup`; alert: `BackupDiskFull`)
- LV mount failed (check `lvs pve`, `dmesg | grep backup`)
- NFS mount failed (check `showmount -e 10.0.10.15`)

**Fix:**
- if the disk is full: clean up old weekly versions manually, adjust retention
- if the LV mount failed: `lvchange -ay backup/data && mount /mnt/backup`
- if NFS failed: check TrueNAS availability, verify exports
- trigger manually: `systemctl start weekly-backup.service`
### Offsite Sync Failing

**Symptom:** `OffsiteBackupSyncStale` or `OffsiteBackupSyncFailing` alert

**Diagnosis:**
```shell
ssh root@192.168.1.127
systemctl status offsite-sync-backup.service
journalctl -u offsite-sync-backup.service --since "7 days ago"
wc -l /mnt/backup/manifest.txt  # verify the manifest exists and is non-empty
```

**Common causes:**
- Synology NAS unreachable (network, SFTP down)
- SSH key auth failed (permissions, expired key)
- manifest missing (weekly-backup failed)

**Fix:**
- verify the Synology is reachable: `ping 192.168.1.13`, `ssh root@192.168.1.13`
- verify the SSH key: `ssh -i /root/.ssh/synology_backup root@192.168.1.13`
- verify the manifest exists: `ls -lh /mnt/backup/manifest.txt`
- trigger manually: `systemctl start offsite-sync-backup.service`
### PostgreSQL Backup Stale Alert

**Symptom:** `PostgreSQLBackupStale` firing in Prometheus

**Diagnosis:**
```shell
kubectl get cronjob -n dbaas
kubectl logs -n dbaas job/postgresql-backup-<timestamp>
```

**Common causes:**
- pod OOMKilled (increase the memory limit)
- NFS mount unavailable (check TrueNAS)
- `pg_dumpall` failed (check PostgreSQL connectivity)

**Fix:**
- if OOM: increase `resources.limits.memory` in `stacks/dbaas/backup.tf`
- if NFS: verify the mount on the worker node, restart the NFS server if needed
- trigger manually: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas`
### Vaultwarden Integrity Check Failing

**Symptom:** `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0`

**Diagnosis:**
```shell
kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "PRAGMA integrity_check;"
```

**Critical:** if the integrity check fails, the database is corrupt.

**Recovery:**
1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden`
2. Restore from the latest backup (see `restore-vaultwarden.md`)
3. Verify integrity on the restored DB
4. Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden`
### pfSense Backup Failing

**Symptom:** `PfsenseBackupStale` alert (if implemented)

**Diagnosis:**
```shell
ssh root@192.168.1.127
systemctl status weekly-backup.service | grep -A5 pfsense
```

**Common causes:**
- API key expired/invalid
- SSH auth failed (password changed, key rejected)
- pfSense unreachable

**Fix:**
- verify the API key: `curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <key>"`
- verify SSH: `ssh root@pfsense.viktorbarzin.me`
- update credentials in Vault under `secret/viktor/pfsense_api_key`
### Backup Disk Full

**Symptom:** `BackupDiskFull` alert; `df -h /mnt/backup` shows >85% usage

**Fix:**
```shell
ssh root@192.168.1.127
# Check space usage by component
du -sh /mnt/backup/pvc-data/*
du -sh /mnt/backup/pfsense/*
du -sh /mnt/backup/nfs-mirror
# Clean up old weekly versions (keep the latest 2)
find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
```
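The `sort | head -n -2` selection works because `<YYYY-WW>` directory names sort chronologically. A dry-run version you can try anywhere, printing the victims instead of deleting them (directory names are illustrative):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/2026-12" "$tmp/2026-13" "$tmp/2026-14" "$tmp/2026-15"
# Same selection as the cleanup above, but printing instead of rm -rf:
victims=$(find "$tmp" -maxdepth 1 -type d -name "????-??" | sort | head -n -2)
printf '%s\n' "$victims"
rm -rf "$tmp"
```

Verifying the selection this way before adding `| xargs rm -rf` is cheap insurance on a backup disk.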
### Missing Backup for New Service

**Symptom:** a newly added service using proxmox-lvm storage has no backup

**Fix:** the service is already covered automatically by:
- LVM snapshots (unless it lives in the dbaas or monitoring namespace)
- the weekly file backup

If the service has a database that needs app-level dumps, add a backup CronJob to the service's Terraform stack (see the template below).

**Template:**
```hcl
resource "kubernetes_cron_job_v1" "backup" {
  metadata {
    name      = "${var.service_name}-backup"
    namespace = kubernetes_namespace.service.metadata[0].name
  }
  spec {
    schedule = "0 3 * * 0" # Weekly Sunday 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure" # Job pods must not use the default "Always"
            container {
              name    = "backup"
              image   = "appropriate/image:tag"
              command = ["/bin/sh", "-c"]
              args = [
                <<-EOT
                  TIMESTAMP=$(date +%Y%m%d)
                  # Dump command here (sqlite3 .backup, pg_dump, etc.)
                  find /backup -mtime +30 -delete
                EOT
              ]
              volume_mount {
                name       = "data"
                mount_path = "/data"
              }
              volume_mount {
                name       = "backup"
                mount_path = "/backup"
              }
            }
            volume {
              name = "data"
              persistent_volume_claim {
                claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
              }
            }
            volume {
              name = "backup"
              persistent_volume_claim {
                claim_name = module.nfs_backup.pvc_name
              }
            }
          }
        }
      }
    }
  }
}

module "nfs_backup" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "${var.service_name}-backup"
  namespace  = kubernetes_namespace.service.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/${var.service_name}-backup"
}
```
## Monitoring & Alerting

| Alert | Condition |
|---|---|
| PostgreSQLBackupStale | > 36h since last success |
| MySQLBackupStale | > 36h since last success |
| EtcdBackupStale | > 8d since last success |
| VaultBackupStale | > 8d since last success |
| VaultwardenBackupStale | > 8d since last success |
| RedisBackupStale | > 8d since last success |
| CloudSyncStale | > 8d since last success |
| CloudSyncNeverRun | task never completed |
| CloudSyncFailing | task in error state |
| VaultwardenIntegrityFail | integrity_ok == 0 |
| LVMSnapshotStale | > 24h since last snapshot |
| LVMSnapshotFailing | snapshot creation failed |
| LVMThinPoolLow | < 15% free space in thin pool |
| WeeklyBackupStale | > 8d since last success |
| WeeklyBackupFailing | backup script exited non-zero |
| PfsenseBackupStale | > 8d since last success |
| OffsiteBackupSyncStale | > 8d since last success |
| BackupDiskFull | > 85% usage on /mnt/backup |

**Metrics sources:**
- backup CronJobs push `backup_last_success_timestamp` to Pushgateway on completion
- the LVM snapshot script pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
- the weekly backup script pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
- the offsite sync script pushes `offsite_backup_sync_last_success_timestamp`
- the CloudSync monitor queries the TrueNAS API every 6h and pushes `cloudsync_last_success_timestamp`
- the Vaultwarden integrity check pushes `vaultwarden_sqlite_integrity_ok` hourly

**Alert routing:**
- all backup alerts → Slack `#infra-alerts`
- Vaultwarden integrity failure → Slack `#infra-critical` (immediate action required)
## Service Protection Matrix

| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage |
|---|---|---|---|---|---|
| **Databases** | | | | | |
| PostgreSQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
| MySQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
| **Critical State** | | | | | |
| Vault | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| etcd | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| Vaultwarden | ✓ | ✓ | ✓ 6h + integrity | ✓ | proxmox-lvm |
| Redis | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| **Applications (65 proxmox-lvm PVCs)** | | | | | |
| Prometheus | — | — | — | excluded | proxmox-lvm |
| Nextcloud | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Calibre-Web | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Forgejo | ✓ | ✓ | — | ✓ | proxmox-lvm |
| FreshRSS | ✓ | ✓ | — | ✓ | proxmox-lvm |
| ActualBudget | ✓ | ✓ | — | ✓ | proxmox-lvm |
| NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
| **Media (NFS)** | | | | | |
| Immich (~800GB) | — | — | — | ✓ | NFS |
| Audiobookshelf | — | — | — | ✓ | NFS |
| Servarr | — | — | — | ✓ | NFS |
| Navidrome | — | — | — | ✓ | NFS |

**Legend:**
- ✓ = protected at this layer
- — = not needed (other layers cover it, or the data is regenerable/disposable)
- excluded = too large/regenerable, not worth the offsite bandwidth

**Note:** all 65 proxmox-lvm PVCs get LVM snapshots and file-level backups, except the dbaas + monitoring PVCs (3 PVCs). NFS-backed media relies on TrueNAS Cloud Sync for offsite.
## Recovery Procedures

Detailed runbooks live in `docs/runbooks/`:
- `restore-lvm-snapshot.md` — instant rollback of a PVC using an LVM snapshot (RTO <5 min)
- `restore-pvc-from-backup.md` — restore a PVC from the sda file backup (when snapshots have expired)
- `restore-postgresql.md` — restore an individual database or the full cluster from a pg_dumpall backup
- `restore-mysql.md` — restore MySQL databases from a mysqldump backup
- `restore-vault.md` — restore Vault from a raft snapshot
- `restore-vaultwarden.md` — restore the password vault from a sqlite3 backup
- `restore-etcd.md` — restore the etcd cluster from a snapshot
- `restore-full-cluster.md` — disaster recovery: rebuild the cluster from offsite backups

**RTO estimates:**
- LVM snapshot rollback: <5 min (instant swap)
- file-level restore from sda: <15 min (depends on PVC size)
- single PostgreSQL database: <5 min
- full MySQL cluster: <15 min
- Vault: <10 min
- Vaultwarden: <5 min
- etcd: <20 min (requires cluster rebuild)
- full cluster from offsite: <4 hours (TrueNAS restore + K8s bootstrap + app deploys)
## Related

- Architecture: `docs/architecture/storage.md` (NFS/Proxmox storage layer)
- Reference: `.claude/reference/service-catalog.md` (which services need backups)
- Runbooks: `docs/runbooks/restore-*.md` (step-by-step recovery procedures)
- Monitoring: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions)