Backup & Disaster Recovery Strategy

Last updated: 2026-03-23

Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                        TrueNAS (10.0.10.15)                        │
│                                                                    │
│  ZFS Pool "main" (1.64 TiB)      ZFS Pool "ssd"                    │
│  ├── NFS shares (~100)           ├── Immich ML data                │
│  └── iSCSI zvols (~19 PVCs)      └── PostgreSQL data               │
│                                                                    │
│  Layer 1: ZFS Auto-Snapshots                                       │
│  ┌──────────────────────────────────────────────┐                  │
│  │ Every 12h  → auto-12h-*  (24h retention)     │                  │
│  │ Daily      → auto-*      (3-week retention)  │                  │
│  │ Both pools, recursive, near-instant (<1s)    │                  │
│  └──────────────────────────────────────────────┘                  │
└────────────────┬────────────────────────────────┬──────────────────┘
                 │                                │
    ┌────────────▼────────────┐      ┌────────────▼────────────┐
    │  Layer 2: App Backups   │      │  Layer 3: Offsite Sync  │
    │  (K8s CronJobs → NFS)   │      │  (TrueNAS → Synology)   │
    └────────────┬────────────┘      └────────────┬────────────┘
                 │                                │
                 ▼                                ▼
    ┌──────────────────────┐      ┌──────────────────────────────┐
    │  /mnt/main/*-backup  │      │  Synology NAS (192.168.1.13) │
    │  (NFS-exported dirs) │      │  /Backup/Viki/truenas        │
    └──────────────────────┘      └──────────────────────────────┘

Layer 1: ZFS Auto-Snapshots

Near-instant copy-on-write snapshots. No disk I/O beyond tiny metadata writes.

Pool   Schedule      Retention   Naming Schema
──────────────────────────────────────────────────────────
main   Every 12h     24 hours    auto-12h-YYYY-MM-DD_HH-MM
main   Daily 00:00   3 weeks     auto-YYYY-MM-DD_HH-MM
ssd    Every 12h     24 hours    auto-12h-YYYY-MM-DD_HH-MM
ssd    Daily 00:00   3 weeks     auto-YYYY-MM-DD_HH-MM

Performance: Both pools snapshot in <1 second (tested 2026-03-23).
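
Listing and creating snapshots uses stock zfs commands; a minimal sketch following the naming schema above (the ad-hoc snapshot name is illustrative):

  # List auto-snapshots on both pools, oldest first
  zfs list -t snapshot -o name,creation,used -s creation | grep -E '@auto-'

  # Ad-hoc recursive snapshot of "main", following the daily naming schema
  zfs snapshot -r "main@auto-$(date +%Y-%m-%d_%H-%M)"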

Layer 2: Application-Level Backups

K8s CronJobs dump application data to NFS-exported backup directories.

┌──────────────────────────────────────────────────────────────────┐
│                    K8s CronJob Backup Schedule                   │
│                                                                  │
│  Daily:                                                          │
│  00:00 ─── PostgreSQL (pg_dumpall → gzip -9) ────→ 14d retention │
│  00:30 ─── MySQL (mysqldump → gzip -9) ──────────→ 14d retention │
│                                                                  │
│  Sunday:                                                         │
│  01:00 ─── etcd (etcdctl snapshot) ──────────────→ 30d retention │
│  01:30 ─── Vaultwarden (sqlite3 .backup) ────────→ 30d retention │
│  02:00 ─── Vault (raft snapshot) ────────────────→ 30d retention │
│  03:00 ─── Redis (BGSAVE + copy) ────────────────→ 30d retention │
│  03:00 ─── plotting-book (sqlite3 .backup) ──────→ 30d retention │
│                                                                  │
│  Monthly (1st Sunday):                                           │
│  04:00 ─── Prometheus TSDB (snapshot → tar.gz) ──→ 2 copies      │
│                                                                  │
│  Every 6h:                                                       │
│  */6   ─── Vaultwarden backup ───────────────────→ 30d retention │
│                                                                  │
│  Hourly:                                                         │
│  :30   ─── Vaultwarden integrity check ──────────→ metric push   │
└──────────────────────────────────────────────────────────────────┘
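
The database jobs are plain dump-and-compress pipelines. A minimal sketch of the nightly PostgreSQL job, assuming a hypothetical $POSTGRES_HOST and an NFS-mounted /backup path (not the actual CronJob spec):

  # Dump all databases, compress hard, write to the NFS-mounted backup dir
  pg_dumpall -h "$POSTGRES_HOST" -U postgres \
    | gzip -9 > "/backup/postgres-$(date +%Y-%m-%d).sql.gz"

  # Enforce the 14-day retention window
  find /backup -name 'postgres-*.sql.gz' -mtime +14 -delete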

Vaultwarden Enhanced Protection

Vaultwarden stores its SQLite database on an iSCSI block device, so it gets extra safeguards:

Every 6 hours                   Every hour
┌───────────────────────────┐   ┌────────────────────────────┐
│ vaultwarden-backup        │   │ vaultwarden-integrity-check│
│                           │   │                            │
│ 1. PRAGMA integrity_check │   │ 1. PRAGMA integrity_check  │
│    (fail → abort)         │   │ 2. Push metric to          │
│ 2. sqlite3 .backup        │   │    Pushgateway:            │
│ 3. PRAGMA integrity_check │   │    vaultwarden_sqlite_     │
│    on backup copy         │   │    integrity_ok {0|1}      │
│ 4. Copy RSA keys,         │   └────────────────────────────┘
│    attachments, sends,    │
│    config.json            │
│ 5. Rotate (30d)           │
└───────────────────────────┘
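
In shell terms the 6-hourly job boils down to the sequence below; the database and destination paths are illustrative, not the real mount points:

  DB=/data/db.sqlite3                               # live Vaultwarden DB (assumed path)
  OUT="/backup/db-$(date +%Y-%m-%d_%H-%M).sqlite3"  # assumed backup destination

  # 1. Refuse to back up a corrupt source
  [ "$(sqlite3 "$DB" 'PRAGMA integrity_check;')" = "ok" ] || exit 1
  # 2. Online copy via SQLite's backup API (safe while Vaultwarden is running)
  sqlite3 "$DB" ".backup $OUT"
  # 3. Verify the copy too
  [ "$(sqlite3 "$OUT" 'PRAGMA integrity_check;')" = "ok" ] || exit 1
  # 4./5. RSA keys, attachments, sends, config.json and 30d rotation follow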

Layer 3: Offsite Sync to Synology NAS

Hybrid approach: fast incremental copies every 6 hours, plus a weekly full sync that also propagates deletions.

                    TrueNAS                              Synology
                 (10.0.10.15)                        (192.168.1.13)
                       │                                    │
  Every 6h (cron)      │    zfs diff → changed files list   │
  ════════════════     │                                    │
  /root/cloudsync-     │    rclone copy --files-from        │
  copy.sh              │    --no-traverse                   │
                       │──────────────────────────────────→ │
                       │    Only changed files,             │
                       │    seconds to minutes              │
                       │                                    │
  Sunday 09:00         │    rclone sync                     │
  (Cloud Sync Task 1)  │    (full traversal)                │
  ════════════════     │──────────────────────────────────→ │
                       │    ~30-60 min,                     │
                       │    handles deletions               │
                       │                                    │

Incremental COPY — How It Works

  cloudsync-copy-prev          cloudsync-copy
  (previous snapshot)          (new snapshot)
         │                          │
         └────── zfs diff -F -H ────┘
                      │
                      ▼
              Changed files only
              (type=F, excludes applied)
                      │
                      ▼
         /tmp/cloudsync_copy_files.txt
                      │
                      ▼
         rclone copy --files-from-raw
         --no-traverse (skip SFTP scan)
                      │
                      ▼
              Synology updated
                      │
                      ▼
         Rotate: prev→destroy, new→prev

Key files:

  • Script: /root/cloudsync-copy.sh
  • Log: /var/log/cloudsync-copy.log
  • Cron job: TrueNAS cron id=1, 0 */6 * * *
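
Condensed sketch of the script's core (the real /root/cloudsync-copy.sh also applies the excludes listed below and does error handling):

  # Snapshot the current state and diff it against the previous marker
  zfs snapshot main@cloudsync-copy
  zfs diff -F -H main@cloudsync-copy-prev main@cloudsync-copy \
    | awk -F'\t' '$2 == "F" {print $3}' \
    | sed 's|^/mnt/main/||' > /tmp/cloudsync_copy_files.txt

  # Push only those files; --no-traverse avoids scanning the SFTP remote
  rclone copy /mnt/main synology:/Backup/Viki/truenas \
    --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse

  # Rotate markers for the next run
  zfs destroy main@cloudsync-copy-prev
  zfs rename main@cloudsync-copy main@cloudsync-copy-prev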

Excludes (both incremental and weekly sync)

Pattern                  Reason
─────────────────────────────────────────────────────────
clickhouse/**            2.47M files, regenerable
loki/**                  68K files, regenerable logs
iocage/**                96K files, legacy FreeBSD jails
frigate/recordings/**    57K files, ephemeral video clips
prometheus/**            Large TSDB, separate monthly backup
crowdsec/**              Regenerable threat data
servarr/downloads/**     Transient download staging
iscsi/**                 Raw zvols, backed up at app level
iscsi-snaps/**           Snapshot metadata
ytldp/**                 YouTube downloads, replaceable
*.log                    Log files
post                     Transient POST data

Weekly SYNC (Cloud Sync Task 1)

  • Mode: SYNC (mirrors source → destination, removes deleted files)
  • Schedule: Sunday 09:00
  • Pre-script: Creates ZFS snapshot main@cloudsync-new
  • Post-script: Rotates snapshots (new→prev, creates placeholder)
  • Source path: /mnt/main/.zfs/snapshot/cloudsync-new
  • Destination: synology:/Backup/Viki/truenas (SFTP)
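
The pre/post scripts reduce to zfs snapshot rotation; a rough sketch (snapshot rotation names are simplified, and the real post-script also creates the placeholder mentioned above):

  # Pre-script: freeze a consistent view for rclone to read from
  zfs snapshot -r main@cloudsync-new

  # Post-script: promote the just-synced snapshot for the next cycle
  zfs destroy -r main@cloudsync-prev 2>/dev/null || true
  zfs rename -r main@cloudsync-new main@cloudsync-prev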

iSCSI Hardening

To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes:

Setting                                  Default   Hardened
────────────────────────────────────────────────────────────
node.session.timeo.replacement_timeout   120s      300s
node.conn[0].timeo.noop_out_interval     5s        10s
node.conn[0].timeo.noop_out_timeout      5s        15s
node.conn[0].iscsi.HeaderDigest          None      CRC32C,None
node.conn[0].iscsi.DataDigest            None      CRC32C,None
  • Applied to all 5 nodes (k8s-master + k8s-node1-4) on 2026-03-23
  • Baked into cloud-init template (modules/create-template-vm/cloud_init.yaml) so new nodes get these settings automatically
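
On a live node the same values can be applied to the existing node records with iscsiadm; a sketch, with a placeholder target IQN:

  TARGET="iqn.2005-10.org.freenas.ctl:example"   # placeholder IQN
  PORTAL="10.0.10.15"
  for kv in \
    "node.session.timeo.replacement_timeout=300" \
    "node.conn[0].timeo.noop_out_interval=10" \
    "node.conn[0].timeo.noop_out_timeout=15" \
    "node.conn[0].iscsi.HeaderDigest=CRC32C,None" \
    "node.conn[0].iscsi.DataDigest=CRC32C,None"
  do
    iscsiadm -m node -T "$TARGET" -p "$PORTAL" \
      -o update -n "${kv%%=*}" -v "${kv#*=}"
  done
  # New values take effect on the next session login (re-login or reboot the node)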

Monitoring & Alerting

┌─────────────────────────────────────────────────────────┐
│                    Prometheus Alerts                    │
│                                                         │
│  PostgreSQLBackupStale      > 36h since last success    │
│  MySQLBackupStale           > 36h since last success    │
│  EtcdBackupStale            > 8d  since last success    │
│  VaultBackupStale           > 8d  since last success    │
│  VaultwardenBackupStale     > 8d  since last success    │
│  RedisBackupStale           > 8d  since last success    │
│  PrometheusBackupStale      > 32d since last success    │
│  CloudSyncStale             > 8d  since last success    │
│  CloudSyncNeverRun          task never completed        │
│  CloudSyncFailing           task in error state         │
│  VaultwardenIntegrityFail   integrity_ok == 0           │
└─────────────────────────────────────────────────────────┘
  • The cloudsync-monitor CronJob queries the TrueNAS API every 6h and pushes Cloud Sync task status to Pushgateway
  • The Vaultwarden integrity check pushes vaultwarden_sqlite_integrity_ok hourly
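
Pushing a metric from a job is one request against Pushgateway's text endpoint; a sketch of the hourly integrity push, with an assumed in-cluster Pushgateway URL:

  OK=0
  [ "$(sqlite3 /data/db.sqlite3 'PRAGMA integrity_check;')" = "ok" ] && OK=1
  # Pushgateway accepts the Prometheus text format via a plain POST
  printf 'vaultwarden_sqlite_integrity_ok %s\n' "$OK" \
    | curl --data-binary @- \
      http://pushgateway.monitoring.svc:9091/metrics/job/vaultwarden-integrity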

Service Protection Matrix

Service                Layer 1 (ZFS)   Layer 2 (App)      Layer 3 (Offsite)   Storage
─────────────────────────────────────────────────────────────────────────────────────
PostgreSQL (12 DBs)    ✓               ✓ daily            ✓ (dumps)           NFS
MySQL (7 DBs)          ✓               ✓ daily            ✓ (dumps)           NFS
Vault                  ✓               ✓ weekly           ✓ (dumps)           iSCSI
etcd                                   ✓ weekly           ✓ (dumps)           local
Vaultwarden            ✓               ✓ 6h + integrity   ✓ (dumps)           iSCSI
Redis                  ✓               ✓ weekly           ✓ (dumps)           iSCSI
Prometheus             ✓               ✓ monthly          excluded            NFS
plotting-book          ✓               ✓ weekly           ✓ (dumps)           iSCSI
Immich                 ✓                                  ✓                   NFS
Forgejo                ✓                                  ✓                   NFS
Paperless-ngx          ✓                                  ✓                   NFS
Other NFS services     ✓                                  ✓                   NFS

NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency).
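
For those NFS-backed services, point-in-time recovery is a plain copy out of the ZFS snapshot directory; for example (snapshot and file names are illustrative):

  # Browse available snapshots of the main pool
  ls /mnt/main/.zfs/snapshot/

  # Pull yesterday's copy of a single file back into the live dataset
  cp /mnt/main/.zfs/snapshot/auto-2026-03-22_00-00/forgejo/app.ini \
     /mnt/main/forgejo/app.ini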

Recovery Procedures

See individual runbooks in docs/runbooks/:

  • restore-postgresql.md
  • restore-mysql.md
  • restore-vault.md
  • restore-vaultwarden.md
  • restore-etcd.md
  • restore-full-cluster.md