From 6e661fdfc507c5a6f83a726c2f4b8f23a8eb3800 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <viktorbarzin@meta.com>
Date: Mon, 23 Mar 2026 02:24:02 +0200
Subject: [PATCH] add backup & DR strategy documentation with ASCII diagrams

Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
---
 docs/backup-strategy.md | 248 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 248 insertions(+)
 create mode 100644 docs/backup-strategy.md

diff --git a/docs/backup-strategy.md b/docs/backup-strategy.md
new file mode 100644
index 00000000..0e18055c
--- /dev/null
+++ b/docs/backup-strategy.md
@@ -0,0 +1,248 @@
+# Backup & Disaster Recovery Strategy
+
+Last updated: 2026-03-23
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        TrueNAS (10.0.10.15)                        │
+│                                                                     │
+│  ZFS Pool "main" (1.64 TiB)     ZFS Pool "ssd"                    │
+│  ├── NFS shares (~100)           ├── Immich ML data                │
+│  └── iSCSI zvols (~19 PVCs)     └── PostgreSQL data               │
+│                                                                     │
+│  Layer 1: ZFS Auto-Snapshots                                       │
+│  ┌──────────────────────────────────────────────┐                  │
+│  │ Every 12h  → auto-12h-*  (24h retention)     │                  │
+│  │ Daily      → auto-*      (3-week retention)  │                  │
+│  │ Both pools, recursive, near-instant (<1s)     │                  │
+│  └──────────────────────────────────────────────┘                  │
+└────────────────┬────────────────────────────────┬───────────────────┘
+                 │                                │
+    ┌────────────▼────────────┐      ┌────────────▼────────────┐
+    │  Layer 2: App Backups   │      │  Layer 3: Offsite Sync  │
+    │  (K8s CronJobs → NFS)   │      │  (TrueNAS → Synology)   │
+    └────────────┬────────────┘      └────────────┬────────────┘
+                 │                                │
+                 ▼                                ▼
+    ┌─────────────────────┐      ┌──────────────────────────────┐
+    │  /mnt/main/*-backup │      │  Synology NAS (192.168.1.13) │
+    │ (NFS-exported dirs) │      │  /Backup/Viki/truenas        │
+    └─────────────────────┘      └──────────────────────────────┘
+```
+
+## Layer 1: ZFS Auto-Snapshots
+
+Near-instant copy-on-write snapshots. No disk I/O beyond tiny metadata writes.
+
+| Pool   | Schedule       | Retention | Naming Schema             |
+|--------|----------------|-----------|---------------------------|
+| `main` | Every 12h      | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
+| `main` | Daily 00:00    | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
+| `ssd`  | Every 12h      | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
+| `ssd`  | Daily 00:00    | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
+
+**Performance**: Both pools snapshot in <1 second (tested 2026-03-23).
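+
+As a minimal sketch of how the naming schema and retention columns relate, the
+Python below parses the snapshot part of a name (the text after `@`) and checks
+it against its schedule's retention window. TrueNAS prunes snapshots itself;
+the helper names `parse_snapshot` and `is_expired` are hypothetical and exist
+only to illustrate the rule.
+
+```python
+from datetime import datetime, timedelta
+
+# Retention windows from the table above, keyed by naming-schema prefix.
+RETENTION = {
+    "auto-12h-": timedelta(hours=24),
+    "auto-": timedelta(weeks=3),
+}
+
+
+def parse_snapshot(name: str) -> tuple[str, datetime]:
+    """Split a snapshot name into (schema prefix, creation time)."""
+    # Check the longer prefix first: "auto-12h-" also starts with "auto-".
+    for prefix in sorted(RETENTION, key=len, reverse=True):
+        if name.startswith(prefix):
+            stamp = datetime.strptime(name[len(prefix):], "%Y-%m-%d_%H-%M")
+            return prefix, stamp
+    raise ValueError(f"not an auto-snapshot name: {name}")
+
+
+def is_expired(name: str, now: datetime) -> bool:
+    """True if the snapshot is older than its schedule's retention."""
+    prefix, stamp = parse_snapshot(name)
+    return now - stamp > RETENTION[prefix]
+```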
+
+## Layer 2: Application-Level Backups
+
+K8s CronJobs dump application data to NFS-exported backup directories.
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│                    K8s CronJob Backup Schedule                   │
+│                                                                  │
+│  00:00 ─── PostgreSQL (pg_dumpall → gzip -9) ──→ 14d retention  │
+│  00:30 ─── MySQL (mysqldump → gzip -9) ─────────→ 14d retention │
+│                                                                  │
+│  Sunday:                                                         │
+│  01:00 ─── etcd (etcdctl snapshot) ──────────────→ 30d retention │
+│  01:30 ─── Vaultwarden (sqlite3 .backup) ────────→ 30d retention │
+│  02:00 ─── Vault (raft snapshot) ────────────────→ 30d retention │
+│  03:00 ─── Redis (BGSAVE + copy) ────────────────→ 30d retention │
+│  03:00 ─── plotting-book (sqlite3 .backup) ──────→ 30d retention │
+│                                                                  │
+│  Monthly (1st Sunday):                                           │
+│  04:00 ─── Prometheus TSDB (snapshot → tar.gz) ──→ 2 copies     │
+│                                                                  │
+│  Every 6h (backup) and hourly at :30 (integrity check):          │
+│  */6   ─── Vaultwarden backup ───────────────────→ 30d retention │
+│  :30   ─── Vaultwarden integrity check ──────────→ metric push  │
+└──────────────────────────────────────────────────────────────────┘
+```
+
+### Vaultwarden Enhanced Protection
+
+Vaultwarden uses iSCSI storage (SQLite on block device) and has extra safeguards:
+
+```
+Every 6 hours                          Every hour
+┌──────────────────────────┐           ┌────────────────────────────┐
+│ vaultwarden-backup       │           │ vaultwarden-integrity-check│
+│                          │           │                            │
+│ 1. PRAGMA integrity_check│           │ 1. PRAGMA integrity_check  │
+│    (fail → abort)        │           │ 2. Push metric to          │
+│ 2. sqlite3 .backup       │           │    Pushgateway:            │
+│ 3. PRAGMA integrity_check│           │    vaultwarden_sqlite_     │
+│    on backup copy        │           │    integrity_ok {0|1}      │
+│ 4. Copy RSA keys,        │           └────────────────────────────┘
+│    attachments, sends,   │
+│    config.json           │
+│ 5. Rotate (30d)          │
+└──────────────────────────┘
+```
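+
+Steps 1-3 of the backup job can be sketched with Python's standard `sqlite3`
+module. This is an illustration under stated assumptions, not the CronJob's
+actual script: the function names are hypothetical, and `Connection.backup`
+stands in for the `sqlite3` CLI's `.backup` command (both use SQLite's online
+backup API).
+
+```python
+import sqlite3
+
+
+def integrity_ok(db_path: str) -> bool:
+    """Run PRAGMA integrity_check; SQLite returns one row 'ok' on success."""
+    conn = sqlite3.connect(db_path)
+    try:
+        rows = conn.execute("PRAGMA integrity_check").fetchall()
+    finally:
+        conn.close()
+    return rows == [("ok",)]
+
+
+def backup_sqlite(src_path: str, dst_path: str) -> None:
+    """Check the source, copy it while it stays online, then check the copy."""
+    if not integrity_ok(src_path):
+        raise RuntimeError("source failed integrity_check; aborting")
+    src = sqlite3.connect(src_path)
+    dst = sqlite3.connect(dst_path)
+    try:
+        # Online backup: yields a consistent copy even if the source is in use.
+        src.backup(dst)
+    finally:
+        dst.close()
+        src.close()
+    if not integrity_ok(dst_path):
+        raise RuntimeError("backup copy failed integrity_check")
+```
+
+Checking the source first means a corrupt database is never rotated into the
+backup set; checking the copy catches damage introduced during the transfer.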
+
+## Layer 3: Offsite Sync to Synology NAS
+
+Hybrid approach: fast incremental copies + weekly full sync for cleanup.
+
+```
+                    TrueNAS                              Synology
+                 (10.0.10.15)                        (192.168.1.13)
+                       │                                    │
+  Every 6h (cron)      │    zfs diff → changed files list   │
+  ════════════════     │                                    │
+  /root/cloudsync-     │  rclone copy --files-from          │
+  copy.sh              │  --no-traverse                     │
+                       │──────────────────────────────────→ │
+                       │    Only changed files,             │
+                       │    seconds to minutes              │
+                       │                                    │
+  Sunday 09:00         │    rclone sync                     │
+  (Cloud Sync Task 1)  │    (full traversal)                │
+  ════════════════     │──────────────────────────────────→ │
+                       │    ~30-60 min,                     │
+                       │    handles deletions               │
+                       │                                    │
+```
+
+### Incremental COPY — How It Works
+
+```
+  cloudsync-copy-prev          cloudsync-copy
+  (previous snapshot)          (new snapshot)
+         │                          │
+         └────── zfs diff -F -H ────┘
+                      │
+                      ▼
+              Changed files only
+              (type=F, excludes applied)
+                      │
+                      ▼
+         /tmp/cloudsync_copy_files.txt
+                      │
+                      ▼
+         rclone copy --files-from-raw
+         --no-traverse (skip SFTP scan)
+                      │
+                      ▼
+              Synology updated
+                      │
+                      ▼
+         Rotate: prev→destroy, new→prev
+```
+
+**Key files**:
+- Script: `/root/cloudsync-copy.sh`
+- Log: `/var/log/cloudsync-copy.log`
+- Cron job: TrueNAS cron id=1, `0 */6 * * *`
+
+### Excludes (both incremental and weekly sync)
+
+| Pattern                | Reason                              |
+|------------------------|-------------------------------------|
+| `clickhouse/**`        | 2.47M files, regenerable            |
+| `loki/**`              | 68K files, regenerable logs         |
+| `iocage/**`            | 96K files, legacy FreeBSD jails     |
+| `frigate/recordings/**`| 57K files, ephemeral video clips    |
+| `prometheus/**`        | Large TSDB, separate monthly backup |
+| `crowdsec/**`          | Regenerable threat data             |
+| `servarr/downloads/**` | Transient download staging          |
+| `iscsi/**`             | Raw zvols, backed up at app level   |
+| `iscsi-snaps/**`       | Snapshot metadata                   |
+| `ytldp/**`             | YouTube downloads, replaceable      |
+| `*.log`                | Log files                           |
+| `post`                 | Transient POST data                 |
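+
+The filtering step of the incremental copy (regular files only, excludes
+applied) can be sketched as below. It assumes `zfs diff -F -H` emits
+tab-separated fields in the order change type, file type, path (plus a new
+path for renames), and it approximates the exclude table with root-level
+directory prefixes and a `.log` suffix rule. rclone's real filter syntax is
+richer, the bare `post` pattern is left out, and `changed_files` is a
+hypothetical name rather than a function from `/root/cloudsync-copy.sh`.
+
+```python
+# Exclude table simplified to directory prefixes (an approximation).
+EXCLUDED_DIRS = (
+    "clickhouse/", "loki/", "iocage/", "frigate/recordings/", "prometheus/",
+    "crowdsec/", "servarr/downloads/", "iscsi/", "iscsi-snaps/", "ytldp/",
+)
+
+
+def changed_files(diff_output: str, mountpoint: str = "/mnt/main") -> list[str]:
+    """Turn `zfs diff -F -H` output into paths relative to the mountpoint."""
+    prefix = mountpoint.rstrip("/") + "/"
+    paths = []
+    for line in diff_output.splitlines():
+        fields = line.split("\t")
+        if len(fields) < 3:
+            continue
+        change, file_type = fields[0], fields[1]
+        # Keep regular files (type F) that still exist. Removals ("-") are
+        # skipped here because only the weekly sync handles deletions.
+        if file_type != "F" or change == "-":
+            continue
+        # For renames (R) the last field is the new path.
+        path = fields[-1]
+        rel = path[len(prefix):] if path.startswith(prefix) else path
+        if rel.startswith(EXCLUDED_DIRS) or rel.endswith(".log"):
+            continue
+        paths.append(rel)
+    return paths
+```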
+
+### Weekly SYNC (Cloud Sync Task 1)
+
+- **Mode**: SYNC (mirrors source → destination, removes deleted files)
+- **Schedule**: Sunday 09:00
+- **Pre-script**: Creates ZFS snapshot `main@cloudsync-new`
+- **Post-script**: Rotates snapshots (`new` → `prev`, creates placeholder)
+- **Source path**: `/mnt/main/.zfs/snapshot/cloudsync-new`
+- **Destination**: `synology:/Backup/Viki/truenas` (SFTP)
+
+## iSCSI Hardening
+
+To prevent SQLite corruption from transient network disruptions, iSCSI initiator
+timeouts are relaxed and CRC32C digests are preferred on all K8s nodes:
+
+```
+Setting                              Default    Hardened
+─────────────────────────────────────────────────────────
+node.session.timeo.replacement_timeout  120s      300s
+node.conn[0].timeo.noop_out_interval      5s       10s
+node.conn[0].timeo.noop_out_timeout       5s       15s
+node.conn[0].iscsi.HeaderDigest         None   CRC32C,None
+node.conn[0].iscsi.DataDigest           None   CRC32C,None
+```
+
+- Applied to all 5 nodes (k8s-master + k8s-node1-4) on 2026-03-23
+- Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`)
+  so new nodes get these settings automatically
+
+## Monitoring & Alerting
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                   Prometheus Alerts                      │
+│                                                         │
+│  PostgreSQLBackupStale      > 36h since last success    │
+│  MySQLBackupStale           > 36h since last success    │
+│  EtcdBackupStale            > 8d  since last success    │
+│  VaultBackupStale           > 8d  since last success    │
+│  VaultwardenBackupStale     > 8d  since last success    │
+│  RedisBackupStale           > 8d  since last success    │
+│  PrometheusBackupStale      > 32d since last success    │
+│  CloudSyncStale             > 8d  since last success    │
+│  CloudSyncNeverRun          task never completed        │
+│  CloudSyncFailing           task in error state         │
+│  VaultwardenIntegrityFail   integrity_ok == 0           │
+└─────────────────────────────────────────────────────────┘
+```
+
+- `cloudsync-monitor` CronJob queries TrueNAS API every 6h, pushes to Pushgateway
+- Vaultwarden integrity check pushes `vaultwarden_sqlite_integrity_ok` hourly
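+
+The staleness alerts all follow one pattern: time since the last successful
+run compared against a per-job threshold. The sketch below restates that
+comparison in Python with the thresholds from the box above. The real alerts
+are Prometheus rules that this document does not reproduce, and `stale_alerts`
+is a hypothetical helper.
+
+```python
+from datetime import datetime, timedelta
+
+# Staleness thresholds copied from the alert list above.
+THRESHOLDS = {
+    "PostgreSQLBackupStale": timedelta(hours=36),
+    "MySQLBackupStale": timedelta(hours=36),
+    "EtcdBackupStale": timedelta(days=8),
+    "VaultBackupStale": timedelta(days=8),
+    "VaultwardenBackupStale": timedelta(days=8),
+    "RedisBackupStale": timedelta(days=8),
+    "PrometheusBackupStale": timedelta(days=32),
+    "CloudSyncStale": timedelta(days=8),
+}
+
+
+def stale_alerts(last_success: dict[str, datetime], now: datetime) -> list[str]:
+    """Return alert names whose last success is older than the threshold."""
+    return sorted(
+        name
+        for name, limit in THRESHOLDS.items()
+        if name in last_success and now - last_success[name] > limit
+    )
+```
+
+The daily, weekly, and monthly thresholds (36h, 8d, 32d) each exceed the job's
+schedule interval, which leaves slack for a run that finishes late.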
+
+## Service Protection Matrix
+
+| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
+|---------|:---:|:---:|:---:|---------|
+| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | NFS |
+| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | NFS |
+| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
+| etcd | ✓ | ✓ weekly | ✓ | local |
+| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
+| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
+| Prometheus | ✓ | ✓ monthly | excluded | NFS |
+| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
+| Immich | ✓ | — | ✓ | NFS |
+| Forgejo | ✓ | — | ✓ | NFS |
+| Paperless-ngx | ✓ | — | ✓ | NFS |
+| Other NFS services | ✓ | — | ✓ | NFS |
+
+NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots +
+offsite sync. Application-level backups are only needed for services with
+complex state (databases, Raft consensus, multi-file consistency).
+
+## Recovery Procedures
+
+See individual runbooks in `docs/runbooks/`:
+- `restore-postgresql.md`
+- `restore-mysql.md`
+- `restore-vault.md`
+- `restore-vaultwarden.md`
+- `restore-etcd.md`
+- `restore-full-cluster.md`