From 6e661fdfc507c5a6f83a726c2f4b8f23a8eb3800 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Mon, 23 Mar 2026 02:24:02 +0200
Subject: [PATCH] add backup & DR strategy documentation with ASCII diagrams

Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
---
 docs/backup-strategy.md | 248 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 248 insertions(+)
 create mode 100644 docs/backup-strategy.md

diff --git a/docs/backup-strategy.md b/docs/backup-strategy.md
new file mode 100644
index 00000000..0e18055c
--- /dev/null
+++ b/docs/backup-strategy.md
@@ -0,0 +1,248 @@
+# Backup & Disaster Recovery Strategy
+
+Last updated: 2026-03-23
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        TrueNAS (10.0.10.15)                         │
+│                                                                     │
+│  ZFS Pool "main" (1.64 TiB)        ZFS Pool "ssd"                   │
+│  ├── NFS shares (~100)             ├── Immich ML data               │
+│  └── iSCSI zvols (~19 PVCs)        └── PostgreSQL data              │
+│                                                                     │
+│  Layer 1: ZFS Auto-Snapshots                                        │
+│  ┌──────────────────────────────────────────────┐                   │
+│  │ Every 12h → auto-12h-*  (24h retention)      │                   │
+│  │ Daily     → auto-*      (3-week retention)   │                   │
+│  │ Both pools, recursive, near-instant (<1s)    │                   │
+│  └──────────────────────────────────────────────┘                   │
+└────────────────┬────────────────────────────────┬───────────────────┘
+                 │                                │
+    ┌────────────▼────────────┐      ┌────────────▼────────────┐
+    │ Layer 2: App Backups    │      │ Layer 3: Offsite Sync   │
+    │ (K8s CronJobs → NFS)    │      │ (TrueNAS → Synology)    │
+    └────────────┬────────────┘      └────────────┬────────────┘
+                 │                                │
+                 ▼                                ▼
+    ┌─────────────────────┐          ┌──────────────────────────────┐
+    │ /mnt/main/*-backup  │          │ Synology NAS (192.168.1.13)  │
+    │ (NFS-exported dirs) │          │ /Backup/Viki/truenas         │
+    └─────────────────────┘          └──────────────────────────────┘
+```
+
+## Layer 1: ZFS Auto-Snapshots
+
+Near-instant copy-on-write snapshots. No disk I/O beyond tiny metadata writes.
+
+| Pool   | Schedule       | Retention | Naming Schema               |
+|--------|----------------|-----------|-----------------------------|
+| `main` | Every 12h      | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
+| `main` | Daily 00:00    | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
+| `ssd`  | Every 12h      | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
+| `ssd`  | Daily 00:00    | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
+
+**Performance**: Both pools snapshot in <1 second (tested 2026-03-23).
+
+## Layer 2: Application-Level Backups
+
+K8s CronJobs dump application data to NFS-exported backup directories.
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│                   K8s CronJob Backup Schedule                    │
+│                                                                  │
+│  00:00 ─── PostgreSQL (pg_dumpall → gzip -9) ────→ 14d retention │
+│  00:30 ─── MySQL (mysqldump → gzip -9) ──────────→ 14d retention │
+│                                                                  │
+│  Sunday:                                                         │
+│  01:00 ─── etcd (etcdctl snapshot) ──────────────→ 30d retention │
+│  01:30 ─── Vaultwarden (sqlite3 .backup) ────────→ 30d retention │
+│  02:00 ─── Vault (raft snapshot) ────────────────→ 30d retention │
+│  03:00 ─── Redis (BGSAVE + copy) ────────────────→ 30d retention │
+│  03:00 ─── plotting-book (sqlite3 .backup) ──────→ 30d retention │
+│                                                                  │
+│  Monthly (1st Sunday):                                           │
+│  04:00 ─── Prometheus TSDB (snapshot → tar.gz) ──→ 2 copies      │
+│                                                                  │
+│  Every 6h:                                                       │
+│  */6   ─── Vaultwarden backup ───────────────────→ 30d retention │
+│  :30   ─── Vaultwarden integrity check ──────────→ metric push   │
+└──────────────────────────────────────────────────────────────────┘
+```
+
+### Vaultwarden Enhanced Protection
+
+Vaultwarden uses iSCSI storage (SQLite on block device) and has extra safeguards:
+
+```
+Every 6 hours                   Every hour
+┌──────────────────────────┐    ┌────────────────────────────┐
+│ vaultwarden-backup       │    │ vaultwarden-integrity-check│
+│                          │    │                            │
+│ 1. PRAGMA integrity_check│    │ 1. PRAGMA integrity_check  │
+│    (fail → abort)        │    │ 2. Push metric to          │
+│ 2. sqlite3 .backup       │    │    Pushgateway:            │
+│ 3. PRAGMA integrity_check│    │    vaultwarden_sqlite_     │
+│    on backup copy        │    │    integrity_ok {0|1}      │
+│ 4. Copy RSA keys,        │    └────────────────────────────┘
+│    attachments, sends,   │
+│    config.json           │
+│ 5. Rotate (30d)          │
+└──────────────────────────┘
+```
+
+## Layer 3: Offsite Sync to Synology NAS
+
+Hybrid approach: fast incremental copies + weekly full sync for cleanup.
+
+```
+                   TrueNAS                             Synology
+                (10.0.10.15)                        (192.168.1.13)
+                      │                                    │
+  Every 6h (cron)     │  zfs diff → changed files list     │
+  ════════════════    │                                    │
+  /root/cloudsync-    │  rclone copy --files-from          │
+  copy.sh             │    --no-traverse                   │
+                      │──────────────────────────────────→ │
+                      │  Only changed files,               │
+                      │  seconds to minutes                │
+                      │                                    │
+  Sunday 09:00        │  rclone sync                       │
+  (Cloud Sync Task 1) │  (full traversal)                  │
+  ════════════════    │──────────────────────────────────→ │
+                      │  ~30-60 min,                       │
+                      │  handles deletions                 │
+                      │                                    │
+```
+
+### Incremental COPY — How It Works
+
+```
+  cloudsync-copy-prev          cloudsync-copy
+  (previous snapshot)          (new snapshot)
+           │                          │
+           └────── zfs diff -F -H ────┘
+                        │
+                        ▼
+               Changed files only
+           (type=F, excludes applied)
+                        │
+                        ▼
+          /tmp/cloudsync_copy_files.txt
+                        │
+                        ▼
+          rclone copy --files-from-raw
+         --no-traverse (skip SFTP scan)
+                        │
+                        ▼
+                Synology updated
+                        │
+                        ▼
+         Rotate: prev→destroy, new→prev
+```
+
+**Key files**:
+- Script: `/root/cloudsync-copy.sh`
+- Log: `/var/log/cloudsync-copy.log`
+- Cron job: TrueNAS cron id=1, `0 */6 * * *`
+
+### Excludes (both incremental and weekly sync)
+
+| Pattern                | Reason                              |
+|------------------------|-------------------------------------|
+| `clickhouse/**`        | 2.47M files, regenerable            |
+| `loki/**`              | 68K files, regenerable logs         |
+| `iocage/**`            | 96K files, legacy FreeBSD jails     |
+| `frigate/recordings/**`| 57K files, ephemeral video clips    |
+| `prometheus/**`        | Large TSDB, separate monthly backup |
+| `crowdsec/**`          | Regenerable threat data             |
+| `servarr/downloads/**` | Transient download staging          |
+| `iscsi/**`             | Raw zvols, backed up at app level   |
+| `iscsi-snaps/**`       | Snapshot metadata                   |
+| `ytldp/**`             | YouTube downloads, replaceable      |
+| `*.log`                | Log files                           |
+| `post`                 | Transient POST data                 |
+
+### Weekly SYNC (Cloud Sync Task 1)
+
+- **Mode**: SYNC (mirrors source → destination, removes deleted files)
+- **Schedule**: Sunday 09:00
+- **Pre-script**: Creates ZFS snapshot `main@cloudsync-new`
+- **Post-script**: Rotates snapshots (`new` → `prev`, creates placeholder)
+- **Source path**: `/mnt/main/.zfs/snapshot/cloudsync-new`
+- **Destination**: `synology:/Backup/Viki/truenas` (SFTP)
+
+## iSCSI Hardening
+
+To prevent SQLite corruption from transient network disruptions, iSCSI
+initiator timeouts are relaxed on all K8s nodes:
+
+```
+Setting                                   Default   Hardened
+───────────────────────────────────────────────────────────────
+node.session.timeo.replacement_timeout    120s      300s
+node.conn[0].timeo.noop_out_interval      5s        10s
+node.conn[0].timeo.noop_out_timeout       5s        15s
+node.conn[0].iscsi.HeaderDigest           None      CRC32C,None
+node.conn[0].iscsi.DataDigest             None      CRC32C,None
+```
+
+- Applied to all 5 nodes (k8s-master + k8s-node1-4) on 2026-03-23
+- Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`)
+  so new nodes get these settings automatically
+
+## Monitoring & Alerting
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                    Prometheus Alerts                    │
+│                                                         │
+│  PostgreSQLBackupStale      > 36h since last success    │
+│  MySQLBackupStale           > 36h since last success    │
+│  EtcdBackupStale            > 8d since last success     │
+│  VaultBackupStale           > 8d since last success     │
+│  VaultwardenBackupStale     > 8d since last success     │
+│  RedisBackupStale           > 8d since last success     │
+│  PrometheusBackupStale      > 32d since last success    │
+│  CloudSyncStale             > 8d since last success     │
+│  CloudSyncNeverRun          task never completed        │
+│  CloudSyncFailing           task in error state         │
+│  VaultwardenIntegrityFail   integrity_ok == 0           │
+└─────────────────────────────────────────────────────────┘
+```
+
+- `cloudsync-monitor` CronJob queries TrueNAS API every 6h, pushes to Pushgateway
+- Vaultwarden integrity check pushes `vaultwarden_sqlite_integrity_ok` hourly
+
+## Service Protection Matrix
+
+| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
+|---------|:---:|:---:|:---:|---------|
+| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | NFS |
+| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | NFS |
+| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
+| etcd | ✓ | ✓ weekly | ✓ | local |
+| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
+| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
+| Prometheus | ✓ | ✓ monthly | excluded | NFS |
+| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
+| Immich | ✓ | — | ✓ | NFS |
+| Forgejo | ✓ | — | ✓ | NFS |
+| Paperless-ngx | ✓ | — | ✓ | NFS |
+| Other NFS services | ✓ | — | ✓ | NFS |
+
+NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots +
+offsite sync. Application-level backups are only needed for services with
+complex state (databases, Raft consensus, multi-file consistency).
+
+## Recovery Procedures
+
+See individual runbooks in `docs/runbooks/`:
+- `restore-postgresql.md`
+- `restore-mysql.md`
+- `restore-vault.md`
+- `restore-vaultwarden.md`
+- `restore-etcd.md`
+- `restore-full-cluster.md`
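
The incremental copy flow in Layer 3 turns `zfs diff -F -H` output into a list of relative paths for `rclone copy --files-from-raw`. The Python below is a minimal sketch of that filtering step only; it is an illustration, not the contents of `/root/cloudsync-copy.sh`. It assumes tab-separated `zfs diff` lines of the form change type, file type, path (with a fourth field holding the new path on renames), a dataset mounted at `/mnt/main`, and a deliberately simplified exclude matcher that only approximates rclone's filter rules. The names `changed_files` and `is_excluded` are hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(rel_path, pattern):
    """Simplified exclude check (an approximation, not rclone's real matcher).

    "dir/**" excludes everything under a top-level directory; any other
    pattern is matched against the file's base name.
    """
    if pattern.endswith("/**"):
        # "clickhouse/**" -> prefix "clickhouse/"
        return rel_path.startswith(pattern[:-2])
    return fnmatchcase(rel_path.rsplit("/", 1)[-1], pattern)


def changed_files(diff_output, mount="/mnt/main", excludes=()):
    """Return paths relative to `mount` for regular files added, modified,
    or renamed between two snapshots.

    Deletions are skipped on the assumption that the weekly sync handles them.
    """
    prefix = mount.rstrip("/") + "/"
    result = []
    for line in diff_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a parsable diff line
        change, file_type = fields[0], fields[1]
        if file_type != "F" or change == "-":
            continue  # keep regular files only, ignore removals
        path = fields[-1]  # on renames the last field is the new path
        if not path.startswith(prefix):
            continue
        rel_path = path[len(prefix):]
        if any(is_excluded(rel_path, pattern) for pattern in excludes):
            continue
        result.append(rel_path)
    return result
```

Writing the returned paths one per line to a file such as `/tmp/cloudsync_copy_files.txt` gives the list that `--files-from-raw` expects. Matching excludes in the script itself, rather than relying on rclone filters, is one way to keep the file list short before any SFTP traffic starts.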