
Backup & Disaster Recovery Architecture

Last updated: 2026-03-24

Overview

The homelab uses a defense-in-depth, 3-layer backup strategy:

  • Layer 1: near-instant local snapshots via ZFS auto-snapshots on TrueNAS (every 12h + daily, up to 3-week retention).
  • Layer 2: application-level backups for complex stateful services (databases, Vault, etcd) via K8s CronJobs that dump to NFS-exported directories (14-30 day retention).
  • Layer 3: offsite protection via hybrid sync to a Synology NAS: an incremental sync every 6 hours (driven by ZFS diff) plus a weekly full sync (Sunday 09:00) for cleanup.

Together these give sub-second snapshot creation for local recovery, a 6h offsite RPO, and a <30min RTO for most services.

Architecture Diagram

Overall Backup Flow

graph TB
    subgraph TrueNAS["TrueNAS (10.0.10.15)"]
        ZFS_Data["ZFS Pools<br/>main (1.64 TiB)<br/>ssd (~256GB)"]

        subgraph Layer1["Layer 1: ZFS Auto-Snapshots"]
            Snap12h["Every 12h<br/>auto-12h-*<br/>24h retention"]
            SnapDaily["Daily 00:00<br/>auto-*<br/>3-week retention"]
        end

        ZFS_Data --> Snap12h
        ZFS_Data --> SnapDaily

        NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
    end

    subgraph K8s["Kubernetes Cluster"]
        subgraph Layer2["Layer 2: App Backups"]
            CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
            CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden, plotting-book<br/>30d retention"]
            CronMonthly["Monthly 1st Sunday<br/>Prometheus TSDB<br/>2 copies"]
            Cron6h["Every 6h<br/>Vaultwarden backup<br/>+ integrity check"]
        end

        CronDaily --> NFS_Backup
        CronWeekly --> NFS_Backup
        CronMonthly --> NFS_Backup
        Cron6h --> NFS_Backup
    end

    subgraph Layer3["Layer 3: Offsite Sync"]
        Incremental["Every 6h<br/>zfs diff → rclone copy<br/>--files-from --no-traverse"]
        FullSync["Weekly Sunday 09:00<br/>rclone sync<br/>handles deletions"]
    end

    ZFS_Data --> Incremental
    ZFS_Data --> FullSync

    Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/truenas"]

    Incremental --> Synology
    FullSync --> Synology

    subgraph Monitoring["Monitoring & Alerting"]
        Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale<br/>MySQLBackupStale<br/>CloudSyncStale<br/>VaultwardenIntegrityFail"]
        Pushgateway["Pushgateway<br/>cloudsync metrics<br/>vaultwarden integrity"]
    end

    NFS_Backup -.->|scrape| Prometheus
    Synology -.->|API query| Pushgateway
    Pushgateway --> Prometheus

    style Layer1 fill:#c8e6c9
    style Layer2 fill:#ffe0b2
    style Layer3 fill:#e1f5ff
    style Monitoring fill:#f3e5f5

Vaultwarden Enhanced Protection

graph LR
    subgraph Every6h["Every 6 hours"]
        VWBackup["vaultwarden-backup CronJob"]
        Step1["1. PRAGMA integrity_check<br/>(fail → abort)"]
        Step2["2. sqlite3 .backup<br/>/mnt/main/vaultwarden-backup/"]
        Step3["3. PRAGMA integrity_check<br/>on backup copy"]
        Step4["4. Copy RSA keys, attachments,<br/>sends, config.json"]
        Step5["5. Rotate backups (30d)"]

        VWBackup --> Step1 --> Step2 --> Step3 --> Step4 --> Step5
    end

    subgraph Hourly["Every hour"]
        VWCheck["vaultwarden-integrity-check"]
        Check1["PRAGMA integrity_check"]
        Metric["Push metric to Pushgateway:<br/>vaultwarden_sqlite_integrity_ok"]

        VWCheck --> Check1 --> Metric
    end

    Metric -.->|Prometheus scrape| Alert["Alert if integrity_ok == 0"]

    style Every6h fill:#fff9c4
    style Hourly fill:#e1bee7

Incremental Offsite Sync

graph TB
    Prev["ZFS snapshot<br/>main@cloudsync-prev"]
    New["ZFS snapshot<br/>main@cloudsync-new"]

    Prev --> Diff["zfs diff -F -H<br/>prev vs new"]
    New --> Diff

    Diff --> Filter["Filter type=F<br/>Apply excludes"]
    Filter --> FileList["/tmp/cloudsync_copy_files.txt"]

    FileList --> Rclone["rclone copy<br/>--files-from-raw<br/>--no-traverse"]

    Rclone --> Synology["Synology NAS<br/>192.168.1.13"]

    Synology --> Rotate["Rotate snapshots:<br/>destroy prev<br/>rename new → prev"]

    Excludes["Excludes:<br/>clickhouse (2.47M files)<br/>loki (68K files)<br/>prometheus, iscsi<br/>frigate/recordings<br/>*.log"]

    Filter -.->|uses| Excludes

    style FileList fill:#fff9c4
    style Excludes fill:#ffcdd2

Components

| Component | Version/Schedule | Location | Purpose |
|---|---|---|---|
| ZFS Auto-Snapshots | Every 12h + daily | TrueNAS pools (main, ssd) | Near-instant local protection |
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in dbaas namespace | pg_dumpall for 12 databases |
| MySQL Backup | Daily 00:30, 14d retention | CronJob in dbaas namespace | mysqldump for 7 databases |
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in kube-system | etcdctl snapshot |
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in vaultwarden | sqlite3 .backup + integrity |
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in vault | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in redis | BGSAVE + copy |
| Prometheus Backup | Monthly 1st Sunday, 2 copies | CronJob in monitoring | TSDB snapshot → tar.gz |
| plotting-book Backup | Weekly Sunday 03:00, 30d | CronJob in plotting-book | sqlite3 .backup |
| Incremental Sync | Every 6h (cron) | TrueNAS: /root/cloudsync-copy.sh | ZFS diff → rclone copy |
| Full Sync | Weekly Sunday 09:00 | TrueNAS Cloud Sync Task 1 | rclone sync with deletions |
| CloudSync Monitor | Every 6h (cron) | CronJob in monitoring | Query TrueNAS API → Pushgateway |
| Vaultwarden Integrity Check | Hourly | CronJob in vaultwarden | PRAGMA integrity_check → metric |

How It Works

Layer 1: ZFS Auto-Snapshots

ZFS snapshots are copy-on-write markers that capture filesystem state in <1 second with zero I/O overhead (only metadata).

Schedule:

| Pool | Frequency | Naming | Retention | Purpose |
|---|---|---|---|---|
| main | Every 12h | auto-12h-YYYY-MM-DD_HH-MM | 24 hours | Recover from recent mistakes |
| main | Daily 00:00 | auto-YYYY-MM-DD_HH-MM | 3 weeks | Point-in-time recovery |
| ssd | Every 12h | auto-12h-YYYY-MM-DD_HH-MM | 24 hours | Same as main |
| ssd | Daily 00:00 | auto-YYYY-MM-DD_HH-MM | 3 weeks | Same as main |

Performance: Snapshot creation takes <1s for both pools (tested 2026-03-23).

Rollback:

# List snapshots
zfs list -t snapshot | grep main/<service>

# Rollback to snapshot
zfs rollback main/<service>@auto-2026-03-23_00-00

# Clone snapshot (non-destructive)
zfs clone main/<service>@auto-2026-03-23_00-00 main/<service>-recovered

Layer 2: Application-Level Backups

K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to /mnt/main/<service>-backup/.

Why needed: ZFS snapshots capture block-level state, but:

  • Cannot restore individual databases from a PostgreSQL zvol snapshot
  • iSCSI zvols are opaque to TrueNAS (raw blocks)
  • Need point-in-time recovery for specific apps without full ZFS rollback

Daily backups (00:00-00:30):

  • PostgreSQL (pg_dumpall): Dumps all 12 databases to /mnt/main/dbaas-backups/postgresql/. Command: pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz. 14-day rotation via find -mtime +14 -delete.
  • MySQL (mysqldump): Dumps all 7 databases individually. Command: mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz. 14-day rotation.
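
Put together, the daily PostgreSQL job is essentially the following script (a sketch: the authoritative version is the CronJob under stacks/dbaas/, credential handling is omitted, and the MySQL job has the same shape):

set -euo pipefail

BACKUP_DIR=/mnt/main/dbaas-backups/postgresql
TIMESTAMP=$(date +%Y%m%d)

# Dump every database in the cluster and compress the output
pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > "${BACKUP_DIR}/backup-${TIMESTAMP}.sql.gz"

# Apply the 14-day rotation
find "${BACKUP_DIR}" -name 'backup-*.sql.gz' -mtime +14 -delete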

Weekly backups (Sunday 01:00-04:00):

  • etcd: etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db. 30-day retention. Critical for cluster recovery.
  • Vaultwarden: See "Vaultwarden Enhanced Protection" below. 30-day retention.
  • Vault: vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap. 30-day retention.
  • Redis: redis-cli BGSAVE then copy RDB file. 30-day retention.
  • plotting-book: sqlite3 /data/db.sqlite ".backup '/mnt/main/plotting-book-backup/backup-$(date +%Y%m%d).sqlite'". 30-day retention.
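
As a concrete example, the weekly etcd job boils down to roughly the following (a sketch; the TLS certificate paths are typical kubeadm locations and are assumptions, the real values live in stacks/platform/etcd-backup.tf):

set -euo pipefail

SNAP=/mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db

# Take a consistent snapshot of the etcd keyspace
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "${SNAP}"

# Sanity-check the snapshot, then apply the 30-day rotation
ETCDCTL_API=3 etcdctl snapshot status "${SNAP}"
find /mnt/main/etcd-backup -name 'snapshot-*.db' -mtime +30 -delete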

Monthly backups (1st Sunday 04:00):

  • Prometheus: curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot → tar.gz snapshot. Keeps 2 most recent copies (older ones purged).
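
The job amounts to asking Prometheus for a snapshot and archiving it (a sketch; it assumes the admin API is enabled with --web.enable-admin-api, the TSDB is mounted at /prometheus, and archives land in /mnt/main/prometheus-backup/ per the naming convention):

set -euo pipefail

BACKUP_DIR=/mnt/main/prometheus-backup

# Ask Prometheus to create a TSDB snapshot and extract its directory name from the JSON response
SNAP=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')

# Archive the snapshot directory to the NFS backup path
tar -czf "${BACKUP_DIR}/prometheus-$(date +%Y%m%d).tar.gz" -C /prometheus/snapshots "${SNAP}"

# Keep only the 2 most recent archives
ls -1t "${BACKUP_DIR}"/prometheus-*.tar.gz | tail -n +3 | xargs -r rm --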

Vaultwarden Enhanced Protection

Vaultwarden stores sensitive password-vault data in SQLite on an iSCSI volume, so it gets extra safeguards to catch corruption early and keep verified backups on hand:

Every 6 hours (vaultwarden-backup CronJob):

  1. Run PRAGMA integrity_check on live database
  2. If check fails → abort (alert fires)
  3. If check passes → sqlite3 .backup /mnt/main/vaultwarden-backup/db-$(date +%Y%m%d%H%M).sqlite
  4. Run PRAGMA integrity_check on backup copy
  5. Copy RSA keys, attachments, sends folder, config.json
  6. Rotate backups older than 30 days
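
Expressed as a script, the 6-hourly job looks roughly like this (a sketch; /data/db.sqlite3 matches the path used in the troubleshooting section, and the exact file list in step 5 may differ from the real CronJob):

set -euo pipefail

DB=/data/db.sqlite3
DEST=/mnt/main/vaultwarden-backup
STAMP=$(date +%Y%m%d%H%M)

# Steps 1-2: verify the live database first; abort (and let the alert fire) if it is corrupt
[ "$(sqlite3 "${DB}" 'PRAGMA integrity_check;')" = "ok" ] || exit 1

# Step 3: online backup of the live database
sqlite3 "${DB}" ".backup '${DEST}/db-${STAMP}.sqlite'"

# Step 4: verify the backup copy as well
[ "$(sqlite3 "${DEST}/db-${STAMP}.sqlite" 'PRAGMA integrity_check;')" = "ok" ] || exit 1

# Step 5: copy the non-database state (RSA keys, config, attachments, sends)
mkdir -p "${DEST}/attachments" "${DEST}/sends"
cp -a /data/rsa_key* /data/config.json "${DEST}/"
cp -a /data/attachments/. "${DEST}/attachments/"
cp -a /data/sends/. "${DEST}/sends/"

# Step 6: rotate backups older than 30 days
find "${DEST}" -name 'db-*.sqlite' -mtime +30 -delete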

Every hour (vaultwarden-integrity-check CronJob):

  1. Run PRAGMA integrity_check on live database
  2. Push metric to Pushgateway: vaultwarden_sqlite_integrity_ok{status="ok"}=1 or =0
  3. Prometheus scrapes Pushgateway and alerts on integrity_ok == 0
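
The hourly check is only a few lines: run the pragma and publish the result as a gauge (a sketch; the Pushgateway address is illustrative):

# Run the integrity check and push 1 (ok) or 0 (corrupt) to the Pushgateway
RESULT=$(sqlite3 /data/db.sqlite3 'PRAGMA integrity_check;')
[ "${RESULT}" = "ok" ] && VALUE=1 || VALUE=0

cat <<EOF | curl -s --data-binary @- http://pushgateway.monitoring:9091/metrics/job/vaultwarden-integrity-check
# TYPE vaultwarden_sqlite_integrity_ok gauge
vaultwarden_sqlite_integrity_ok ${VALUE}
EOF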

This provides both frequent backups (every 6h) AND continuous integrity monitoring (hourly).

Layer 3: Offsite Sync to Synology NAS

Two complementary sync methods run on TrueNAS:

Incremental COPY (every 6 hours):

Runs /root/cloudsync-copy.sh via cron. Uses ZFS diff to identify changed files since last sync, then copies only those files.

Flow:

  1. Take new snapshot: zfs snapshot main@cloudsync-new
  2. If previous snapshot exists: zfs diff -F -H main@cloudsync-prev main@cloudsync-new
  3. Filter output:
    • Keep only type=F (files, not directories)
    • Apply excludes (clickhouse, loki, prometheus, etc.)
    • Write to /tmp/cloudsync_copy_files.txt
  4. Run rclone copy --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse
  5. Rotate snapshots: zfs destroy cloudsync-prev, zfs rename cloudsync-new cloudsync-prev

Why fast: Only changed files are transferred. ZFS diff is instant (metadata scan). --no-traverse skips SFTP directory scan.

Fallback: If there is no previous snapshot, or more than 100k files changed, the script falls back to building the file list with a full find scan (a condensed sketch of the normal path follows).
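
A condensed sketch of what /root/cloudsync-copy.sh does (error handling, the fallback path, and most excludes are omitted; "synology:" is the rclone remote used elsewhere in this document):

set -eu

POOL=main
FILELIST=/tmp/cloudsync_copy_files.txt

# 1. Snapshot the current state of the pool
zfs snapshot ${POOL}@cloudsync-new

# 2-3. Diff against the previous sync point, keep regular files (type F), drop excluded paths,
#      and make the paths relative to the pool mountpoint for rclone
zfs diff -F -H ${POOL}@cloudsync-prev ${POOL}@cloudsync-new \
  | awk '$2 == "F" {print $3}' \
  | grep -Ev '/(clickhouse|loki|prometheus|iscsi)/|\.log$' \
  | sed "s|^/mnt/${POOL}/||" > "${FILELIST}"

# 4. Copy only the changed files; --no-traverse avoids listing the destination over SFTP
rclone copy "/mnt/${POOL}" synology:/Backup/Viki/truenas \
  --files-from-raw "${FILELIST}" --no-traverse

# 5. Rotate the sync snapshots so the next run diffs from this point
zfs destroy ${POOL}@cloudsync-prev
zfs rename ${POOL}@cloudsync-new ${POOL}@cloudsync-prev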

Weekly SYNC (Sunday 09:00):

TrueNAS Cloud Sync Task 1 runs rclone sync which:

  • Mirrors source → destination (removes deleted files on destination)
  • Full directory traversal (~30-60 min)
  • Ensures offsite is clean (no orphaned files from renamed paths)
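
For comparison, the weekly task is equivalent to roughly this rclone invocation (only a few excludes shown; the full list is in the table below):

# Mirror the pool to the Synology share, deleting destination files that no longer exist on source
rclone sync /mnt/main synology:/Backup/Viki/truenas \
  --exclude 'clickhouse/**' --exclude 'loki/**' --exclude 'prometheus/**' \
  --exclude 'iscsi/**' --exclude '*.log'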

Why both methods:

  • Incremental: Fast recovery for recent changes (seconds to minutes)
  • Full sync: Cleanup pass to handle deletions, renames, edge cases

Destination: sftp://192.168.1.13/Backup/Viki/truenas

Excludes (both incremental and full sync)

| Pattern | Reason | File count |
|---|---|---|
| clickhouse/** | Regenerable logs/metrics | 2.47M files |
| loki/** | Regenerable logs | 68K files |
| iocage/** | Legacy FreeBSD jails (unused) | 96K files |
| frigate/** | Ephemeral recordings/clips, trivial config | 57K+ files |
| audiblez/** | Generated audiobooks, regenerable from source ebooks | |
| ebook2audiobook/** | Same service as audiblez, second volume | |
| ollama/** | UI data (chat history/settings), low value | |
| real-estate-crawler/** | Scraped property data, regenerable by re-crawling | |
| prometheus/** | Covered by monthly app backup | Large TSDB |
| crowdsec/** | Regenerable threat intelligence | |
| servarr/downloads/** | Transient download staging | |
| iscsi/**, iscsi-snaps/** | Raw zvols, backed up at app level | |
| ytldp/** | YouTube downloads (replaceable) | |
| *.log | Log files (regenerable) | |
| post | Transient POST data | |

iSCSI Backup Architecture

iSCSI zvols are raw block devices exported to K8s nodes. TrueNAS cannot read the filesystem inside a zvol.

Protection strategy:

  • Layer 1: ZFS snapshots cover zvols automatically (block-level)
  • Layer 2: Application CronJobs inside pods dump data to NFS paths
  • Layer 3: NFS paths sync offsite

Current coverage:

| Service | Storage | Layer 2 Backup | Offsite |
|---|---|---|---|
| PostgreSQL CNPG (12 DBs) | iSCSI | ✓ daily | ✓ |
| MySQL InnoDB (7 DBs) | iSCSI | ✓ daily | ✓ |
| Vault | iSCSI | ✓ weekly | ✓ |
| Vaultwarden | iSCSI | ✓ 6h + integrity | ✓ |
| Redis | iSCSI | ✓ weekly | ✓ |
| plotting-book | iSCSI | ✓ weekly | ✓ |

Convention: Any new iSCSI-backed app MUST add a backup CronJob writing to /mnt/main/<app>-backup/ in its Terraform stack.

Uncovered (acceptable risk):

  • Prometheus (disposable metrics, monthly TSDB backup for long-term trends)
  • Loki (disposable logs)

iSCSI Hardening

To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes:

| Setting | Default | Hardened | Impact |
|---|---|---|---|
| node.session.timeo.replacement_timeout | 120s | 300s | Time before declaring session dead |
| node.conn[0].timeo.noop_out_interval | 5s | 10s | Keepalive interval |
| node.conn[0].timeo.noop_out_timeout | 5s | 15s | Keepalive timeout |
| node.conn[0].iscsi.HeaderDigest | None | CRC32C,None | Error detection |
| node.conn[0].iscsi.DataDigest | None | CRC32C,None | Error detection |

Applied to: All 5 K8s nodes (k8s-master, k8s-node1-4) on 2026-03-23.

Persistence: Baked into cloud-init template (modules/create-template-vm/cloud_init.yaml) so new nodes get these settings automatically.

Why needed: The default timeouts are too aggressive. A brief network hiccup (5-10s) can tear down the iSCSI session mid-write, so SQLite sees incomplete writes → corruption. The relaxed keepalives and 300s replacement timeout tolerate longer blips.
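
In iscsid.conf terms, the hardened values are:

# /etc/iscsi/iscsid.conf (hardened values; defaults shown in the table above)
node.session.timeo.replacement_timeout = 300
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 15
node.conn[0].iscsi.HeaderDigest = CRC32C,None
node.conn[0].iscsi.DataDigest = CRC32C,None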

Configuration

Key Files

| Path | Purpose |
|---|---|
| /root/cloudsync-copy.sh | TrueNAS: incremental sync script |
| /var/log/cloudsync-copy.log | TrueNAS: sync script log output |
| stacks/dbaas/ | Terraform: PostgreSQL/MySQL backup CronJobs |
| stacks/vault/ | Terraform: Vault backup CronJob |
| stacks/vaultwarden/ | Terraform: Vaultwarden backup + integrity CronJobs |
| stacks/monitoring/ | Terraform: CloudSync monitor, Prometheus backup |
| modules/create-template-vm/cloud_init.yaml | iSCSI hardening params for new nodes |
| /etc/iscsi/iscsid.conf | K8s nodes: iSCSI initiator config |

Vault Paths

| Path | Contents |
|---|---|
| secret/viktor/truenas_api_key | TrueNAS API key for CloudSync monitor |
| secret/viktor/synology_ssh_key | SSH key for Synology NAS SFTP access |

Terraform Stacks

Each backup CronJob is defined in the application's stack:

  • PostgreSQL/MySQL: stacks/dbaas/backup.tf
  • Vault: stacks/vault/backup.tf
  • Vaultwarden: stacks/vaultwarden/backup.tf
  • etcd: stacks/platform/etcd-backup.tf
  • Prometheus: stacks/monitoring/prometheus-backup.tf

Decisions & Rationale

Why 3 Layers?

Layer 1 (ZFS snapshots):

  • Near-instant (<1s), zero overhead
  • Point-in-time recovery for entire datasets
  • BUT: Cannot restore individual database records, no offsite protection

Layer 2 (App backups):

  • Granular restore (single DB, single table)
  • Database-native tools (pg_dump, mysqldump) produce portable backups
  • BUT: Higher overhead (CPU, I/O), longer RPO (daily/weekly)

Layer 3 (Offsite):

  • Protection against site-level disaster (fire, theft, catastrophic hardware failure)
  • BUT: 6h RPO (incremental), connectivity dependency

All three together provide defense-in-depth.

Why Not Velero/Longhorn Backup?

Evaluated K8s-native backup solutions (Velero, Longhorn):

  • Velero: Requires object storage backend, complex restore, doesn't handle databases well
  • Longhorn: High overhead (replicas, snapshots in-cluster), no offsite by default

Current approach wins because:

  • Leverages existing ZFS infrastructure (already running TrueNAS)
  • Database-native backups (pg_dump/mysqldump) are battle-tested
  • Simple restore procedures (documented runbooks)

Why Hybrid Incremental + Full Sync?

Incremental alone is risky:

  • Deleted files on source never deleted on destination
  • Renamed paths create duplicates
  • No cleanup of orphaned snapshots

Full sync alone is slow:

  • 30-60 min per run
  • High network/CPU on both ends
  • 6h RPO → 12h if a sync fails

Hybrid approach:

  • Fast incremental every 6h (sub-minute runtime)
  • Weekly full sync for cleanup (tolerates longer runtime)

Why 6h Vaultwarden Backup vs Daily for Others?

Vaultwarden stores password vault data — highest-value target:

  • User creates 10 new passwords → disaster 5h later → daily backup loses all 10
  • 6h RPO acceptable for password vaults (industry standard is 1-24h)
  • Hourly integrity checks detect corruption before it spreads to backups

Other services (MySQL, PostgreSQL):

  • Mostly application data (not authentication secrets)
  • Daily RPO acceptable per user tolerance
  • Lower change velocity

Troubleshooting

PostgreSQL Backup Stale Alert

Symptom: PostgreSQLBackupStale firing in Prometheus

Diagnosis:

kubectl get cronjob -n dbaas
kubectl logs -n dbaas job/postgresql-backup-<timestamp>

Common causes:

  • Pod OOMKilled (increase memory limit)
  • NFS mount unavailable (check TrueNAS)
  • pg_dumpall command failed (check PostgreSQL connectivity)

Fix:

  1. If OOM: Increase resources.limits.memory in stacks/dbaas/backup.tf
  2. If NFS: Verify mount on worker node, restart NFS server if needed
  3. Manually trigger: kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas

CloudSync Stale/Failing

Symptom: CloudSyncStale or CloudSyncFailing alert

Diagnosis:

# SSH to TrueNAS
ssh root@10.0.10.15
cat /var/log/cloudsync-copy.log
zfs list -t snapshot | grep cloudsync

Common causes:

  • Synology NAS unreachable (network, SFTP down)
  • ZFS diff failed (snapshot deleted manually)
  • rclone error (quota, permission)

Fix:

  1. Verify Synology: ping 192.168.1.13, ssh root@192.168.1.13
  2. Verify snapshots exist: zfs list -t snapshot | grep cloudsync
  3. Manually run: /root/cloudsync-copy.sh (check output)
  4. Check rclone config: rclone ls synology:/Backup/Viki/truenas

Vaultwarden Integrity Check Failing

Symptom: VaultwardenIntegrityFail alert, vaultwarden_sqlite_integrity_ok=0

Diagnosis:

kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "PRAGMA integrity_check;"

Critical: If integrity check fails, database is corrupt.

Recovery:

  1. Stop writes: kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden
  2. Restore from latest backup:
    # Find latest backup
    ls -lh /mnt/main/vaultwarden-backup/
    # Copy to pod volume
kubectl cp /mnt/main/vaultwarden-backup/db-<latest>.sqlite \
      vaultwarden/<vaultwarden-pod>:/data/db.sqlite3
    
  3. Verify integrity on restored DB
  4. Scale back up: kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden

iSCSI Session Drops Causing Backup Failures

Symptom: Backup CronJob fails with "I/O error" or "Transport endpoint not connected"

Diagnosis:

# On K8s node
iscsiadm -m session
dmesg | grep -i iscsi
journalctl -u iscsid | tail -50

Fix:

  1. Verify hardened timeouts applied: iscsiadm -m node -o show | grep -E 'replacement_timeout|noop_out'
  2. If defaults: Apply hardening:
    iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300
    iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10
    iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15
    iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None
    iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None
    
  3. Restart session: iscsiadm -m node -u && iscsiadm -m node -l

Missing Backup for New Service

Symptom: Added new service using iSCSI storage, no backup exists

Fix: Add backup CronJob in service's Terraform stack

Template:

resource "kubernetes_cron_job_v1" "backup" {
  metadata {
    name      = "${var.service_name}-backup"
    namespace = kubernetes_namespace.service.metadata[0].name
  }
  spec {
    schedule = "0 3 * * 0"  # Weekly Sunday 03:00
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            container {
              name  = "backup"
              image = "appropriate/image:tag"
              command = ["/bin/sh", "-c"]
              args = [
                <<-EOT
                TIMESTAMP=$(date +%Y%m%d)
                # Dump command here
                find /backup -mtime +30 -delete
                EOT
              ]
              volume_mount {
                name       = "data"
                mount_path = "/data"
              }
              volume_mount {
                name       = "backup"
                mount_path = "/backup"
              }
            }
            volume {
              name = "data"
              persistent_volume_claim {
                claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
              }
            }
            volume {
              name = "backup"
              persistent_volume_claim {
                claim_name = module.nfs_backup.pvc_name
              }
            }
          }
        }
      }
    }
  }
}

module "nfs_backup" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "${var.service_name}-backup"
  namespace  = kubernetes_namespace.service.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/${var.service_name}-backup"
}

Monitoring & Alerting

┌────────────────────────────────────────────────────────────────┐
│                     Prometheus Alerts                           │
│                                                                 │
│  PostgreSQLBackupStale      > 36h since last success            │
│  MySQLBackupStale           > 36h since last success            │
│  EtcdBackupStale            > 8d  since last success            │
│  VaultBackupStale           > 8d  since last success            │
│  VaultwardenBackupStale     > 8d  since last success            │
│  RedisBackupStale           > 8d  since last success            │
│  PrometheusBackupStale      > 32d since last success            │
│  PlottingBookBackupStale    > 8d  since last success            │
│  CloudSyncStale             > 8d  since last success            │
│  CloudSyncNeverRun          task never completed                │
│  CloudSyncFailing           task in error state                 │
│  VaultwardenIntegrityFail   integrity_ok == 0                   │
└────────────────────────────────────────────────────────────────┘

Metrics sources:

  • Backup CronJobs: Push backup_last_success_timestamp to Pushgateway on completion
  • CloudSync monitor: Queries TrueNAS API every 6h, pushes cloudsync_last_success_timestamp
  • Vaultwarden integrity: Pushes vaultwarden_sqlite_integrity_ok hourly
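
The staleness alerts all key off the same convention: on success, a job pushes the current Unix time, along these lines (a sketch; the Pushgateway address is illustrative):

# Record a successful run; PostgreSQLBackupStale fires once this timestamp is >36h old
echo "backup_last_success_timestamp $(date +%s)" \
  | curl -s --data-binary @- http://pushgateway.monitoring:9091/metrics/job/postgresql-backup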

Alert routing:

  • All backup alerts → Slack #infra-alerts
  • Vaultwarden integrity fail → Slack #infra-critical (immediate action required)

Service Protection Matrix

| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
|---|---|---|---|---|
| Databases | | | | |
| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
| Critical State | | | | |
| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
| etcd | | ✓ weekly | ✓ | local disk |
| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
| Applications | | | | |
| Prometheus | ✓ | ✓ monthly | excluded | NFS |
| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
| Immich | ✓ | — | ✓ | NFS |
| Forgejo | ✓ | — | ✓ | NFS |
| Paperless-ngx | ✓ | — | ✓ | NFS |
| Nextcloud | ✓ | — | ✓ | NFS |
| Other NFS services | ✓ | — | ✓ | NFS |

Legend:

  • ✓ = Protected at this layer
  • — = Not needed (simple file storage, ZFS snapshots sufficient)
  • excluded = Too large/regenerable, not worth offsite bandwidth

Note: NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency requirements).

Recovery Procedures

Detailed runbooks in docs/runbooks/:

  • restore-postgresql.md — Restore individual database or full cluster from pg_dumpall backup
  • restore-mysql.md — Restore MySQL databases from mysqldump backup
  • restore-vault.md — Restore Vault from raft snapshot
  • restore-vaultwarden.md — Restore password vault from sqlite3 backup
  • restore-etcd.md — Restore etcd cluster from snapshot
  • restore-full-cluster.md — Disaster recovery: rebuild cluster from offsite backups

RTO estimates (tested 2026-03-23):

  • Single PostgreSQL database: <5 min
  • Full MySQL cluster: <15 min
  • Vault: <10 min
  • Vaultwarden: <5 min
  • etcd: <20 min (requires cluster rebuild)
  • Full cluster from offsite: <4 hours (TrueNAS restore + K8s bootstrap + app deploys)

Related Documentation

  • Architecture: docs/architecture/storage.md (NFS/iSCSI storage layer)
  • Reference: .claude/reference/service-catalog.md (which services need backups)
  • Runbooks: docs/runbooks/restore-*.md (step-by-step recovery procedures)
  • Monitoring: stacks/monitoring/alerts/backup-alerts.yaml (Prometheus alert definitions)