diff --git a/docs/backup-strategy.md b/docs/backup-strategy.md
deleted file mode 100644
index 0e18055c..00000000
--- a/docs/backup-strategy.md
+++ /dev/null
@@ -1,248 +0,0 @@

# Backup & Disaster Recovery Strategy

Last updated: 2026-03-23

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                        TrueNAS (10.0.10.15)                         │
│                                                                     │
│  ZFS Pool "main" (1.64 TiB)        ZFS Pool "ssd"                   │
│  ├── NFS shares (~100)             ├── Immich ML data               │
│  └── iSCSI zvols (~19 PVCs)        └── PostgreSQL data              │
│                                                                     │
│  Layer 1: ZFS Auto-Snapshots                                        │
│  ┌──────────────────────────────────────────────┐                   │
│  │ Every 12h  → auto-12h-*  (24h retention)     │                   │
│  │ Daily      → auto-*      (3-week retention)  │                   │
│  │ Both pools, recursive, near-instant (<1s)    │                   │
│  └──────────────────────────────────────────────┘                   │
└────────────────┬────────────────────────────────┬───────────────────┘
                 │                                │
    ┌────────────▼────────────┐      ┌────────────▼────────────┐
    │  Layer 2: App Backups   │      │  Layer 3: Offsite Sync  │
    │  (K8s CronJobs → NFS)   │      │  (TrueNAS → Synology)   │
    └────────────┬────────────┘      └────────────┬────────────┘
                 │                                │
                 ▼                                ▼
    ┌─────────────────────┐      ┌──────────────────────────────┐
    │ /mnt/main/*-backup  │      │ Synology NAS (192.168.1.13)  │
    │ (NFS-exported dirs) │      │ /Backup/Viki/truenas         │
    └─────────────────────┘      └──────────────────────────────┘
```

## Layer 1: ZFS Auto-Snapshots

Near-instant copy-on-write snapshots. No disk I/O beyond tiny metadata writes.

| Pool   | Schedule    | Retention | Naming Schema               |
|--------|-------------|-----------|-----------------------------|
| `main` | Every 12h   | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
| `main` | Daily 00:00 | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
| `ssd`  | Every 12h   | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
| `ssd`  | Daily 00:00 | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |

**Performance**: Both pools snapshot in <1 second (tested 2026-03-23).
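The retention table above implies a simple pruning rule: drop `auto-12h-*` snapshots older than 24 hours and `auto-*` snapshots older than three weeks. A minimal Python sketch of that logic, for illustration only — TrueNAS periodic snapshot tasks handle expiry natively, and the sample snapshot names are made up:

```python
from datetime import datetime, timedelta

# Illustrative only: TrueNAS expires these snapshots itself.
# Retention windows mirror the table above, keyed by name prefix.
RETENTION = {
    "auto-12h-": timedelta(hours=24),  # every-12h snapshots
    "auto-":     timedelta(weeks=3),   # daily snapshots
}

def parse_snapshot(name: str):
    """Return (prefix, creation time) for an auto-snapshot name, else None."""
    # Check the more specific prefix first: "auto-12h-" also starts with "auto-".
    for prefix in ("auto-12h-", "auto-"):
        if name.startswith(prefix):
            stamp = name[len(prefix):]  # YYYY-MM-DD_HH-MM
            return prefix, datetime.strptime(stamp, "%Y-%m-%d_%H-%M")
    return None

def expired(snapshots, now):
    """List snapshot names whose retention window has passed."""
    out = []
    for name in snapshots:
        parsed = parse_snapshot(name)
        if parsed is None:
            continue  # not an auto snapshot; leave it alone
        prefix, created = parsed
        if now - created > RETENTION[prefix]:
            out.append(name)
    return out

now = datetime(2026, 3, 23, 12, 0)
snaps = [
    "auto-12h-2026-03-23_00-00",  # 12h old   -> keep
    "auto-12h-2026-03-22_00-00",  # 36h old   -> expire
    "auto-2026-03-20_00-00",      # 3d old    -> keep
    "auto-2026-02-20_00-00",      # >3 weeks  -> expire
]
print(expired(snaps, now))  # ['auto-12h-2026-03-22_00-00', 'auto-2026-02-20_00-00']
```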

## Layer 2: Application-Level Backups

K8s CronJobs dump application data to NFS-exported backup directories.

```
┌──────────────────────────────────────────────────────────────────┐
│                   K8s CronJob Backup Schedule                    │
│                                                                  │
│  00:00 ─── PostgreSQL (pg_dumpall → gzip -9) ──→ 14d retention   │
│  00:30 ─── MySQL (mysqldump → gzip -9) ─────────→ 14d retention  │
│                                                                  │
│  Sunday:                                                         │
│  01:00 ─── etcd (etcdctl snapshot) ──────────────→ 30d retention │
│  01:30 ─── Vaultwarden (sqlite3 .backup) ────────→ 30d retention │
│  02:00 ─── Vault (raft snapshot) ────────────────→ 30d retention │
│  03:00 ─── Redis (BGSAVE + copy) ────────────────→ 30d retention │
│  03:00 ─── plotting-book (sqlite3 .backup) ──────→ 30d retention │
│                                                                  │
│  Monthly (1st Sunday):                                           │
│  04:00 ─── Prometheus TSDB (snapshot → tar.gz) ──→ 2 copies      │
│                                                                  │
│  Every 6h / hourly:                                              │
│  */6 ─── Vaultwarden backup ───────────────────→ 30d retention   │
│  :30 ─── Vaultwarden integrity check ──────────→ metric push     │
└──────────────────────────────────────────────────────────────────┘
```

### Vaultwarden Enhanced Protection

Vaultwarden uses iSCSI storage (SQLite on a block device) and has extra safeguards:

```
Every 6 hours                        Every hour
┌────────────────────────────┐       ┌─────────────────────────────┐
│ vaultwarden-backup         │       │ vaultwarden-integrity-check │
│                            │       │                             │
│ 1. PRAGMA integrity_check  │       │ 1. PRAGMA integrity_check   │
│    (fail → abort)          │       │ 2. Push metric to           │
│ 2. sqlite3 .backup         │       │    Pushgateway:             │
│ 3. PRAGMA integrity_check  │       │    vaultwarden_sqlite_      │
│    on backup copy          │       │    integrity_ok {0|1}       │
│ 4. Copy RSA keys,          │       └─────────────────────────────┘
│    attachments, sends,     │
│    config.json             │
│ 5. Rotate (30d)            │
└────────────────────────────┘
```

## Layer 3: Offsite Sync to Synology NAS

Hybrid approach: fast incremental copies every 6 hours, plus a weekly full sync that also propagates deletions.
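The incremental leg works by diffing two ZFS snapshots and feeding only the changed files to rclone. A minimal Python sketch of that selection step; the exact field order of `zfs diff -F -H` output, the sample paths, and the exclude subset are assumptions here, and the real implementation is the `/root/cloudsync-copy.sh` shell script:

```python
# Illustrative translation of the selection step in /root/cloudsync-copy.sh.
# Assumption: `zfs diff -F -H` emits tab-separated lines of
# "change<TAB>type<TAB>path" (renames carry a fourth field, ignored here).
import fnmatch

EXCLUDES = ["clickhouse/**", "loki/**", "*.log"]  # subset of the real list

def changed_files(zfs_diff_output: str, mountpoint: str = "/mnt/main/"):
    """Keep created/modified regular files (type 'F'), relative to the pool
    mountpoint, minus excluded patterns: the list fed to rclone later."""
    keep = []
    for line in zfs_diff_output.splitlines():
        if not line.strip():
            continue
        change, ftype, path = line.split("\t")[:3]
        if change == "-" or ftype != "F":  # skip deletions and non-files
            continue
        rel = path.removeprefix(mountpoint)
        if any(fnmatch.fnmatch(rel, pat) for pat in EXCLUDES):
            continue
        keep.append(rel)
    return keep

diff = (
    "M\tF\t/mnt/main/photos/2026/img.jpg\n"
    "+\tF\t/mnt/main/clickhouse/store/part.bin\n"
    "M\t/\t/mnt/main/photos/2026\n"
    "-\tF\t/mnt/main/old-debug.log\n"
)
print(changed_files(diff))  # ['photos/2026/img.jpg']
```

The resulting list is then written to a file and handed to `rclone copy --files-from-raw ... --no-traverse`, so rclone never scans the whole SFTP destination.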

```
   TrueNAS                                            Synology
 (10.0.10.15)                                      (192.168.1.13)
      │                                                  │
      │  Every 6h (cron): /root/cloudsync-copy.sh        │
      │  zfs diff → changed files list, then             │
      │  rclone copy --files-from --no-traverse          │
      │────────────────────────────────────────────────→ │
      │  Only changed files, seconds to minutes          │
      │                                                  │
      │  Sunday 09:00 (Cloud Sync Task 1):               │
      │  rclone sync (full traversal)                    │
      │────────────────────────────────────────────────→ │
      │  ~30-60 min, handles deletions                   │
      │                                                  │
```

### Incremental COPY: How It Works

```
 cloudsync-copy-prev           cloudsync-copy
 (previous snapshot)           (new snapshot)
          │                          │
          └────── zfs diff -F -H ────┘
                      │
                      ▼
            Changed files only
        (type=F, excludes applied)
                      │
                      ▼
        /tmp/cloudsync_copy_files.txt
                      │
                      ▼
        rclone copy --files-from-raw
        --no-traverse (skip SFTP scan)
                      │
                      ▼
             Synology updated
                      │
                      ▼
      Rotate: prev→destroy, new→prev
```

**Key files**:
- Script: `/root/cloudsync-copy.sh`
- Log: `/var/log/cloudsync-copy.log`
- Cron job: TrueNAS cron id=1, `0 */6 * * *`

### Excludes (both incremental and weekly sync)

| Pattern                 | Reason                              |
|-------------------------|-------------------------------------|
| `clickhouse/**`         | 2.47M files, regenerable            |
| `loki/**`               | 68K files, regenerable logs         |
| `iocage/**`             | 96K files, legacy FreeBSD jails     |
| `frigate/recordings/**` | 57K files, ephemeral video clips    |
| `prometheus/**`         | Large TSDB, separate monthly backup |
| `crowdsec/**`           | Regenerable threat data             |
| `servarr/downloads/**`  | Transient download staging          |
| `iscsi/**`              | Raw zvols, backed up at app level   |
| `iscsi-snaps/**`        | Snapshot metadata                   |
| `ytldp/**`              | YouTube downloads, replaceable      |
| `*.log`                 | Log files                           |
| `post`                  | Transient POST data                 |

### Weekly SYNC (Cloud Sync Task 1)

- **Mode**: SYNC (mirrors source → destination, removes deleted files)
- **Schedule**: Sunday 09:00
- **Pre-script**: Creates ZFS snapshot `main@cloudsync-new`
- **Post-script**: Rotates snapshots (`new` →
  `prev`, creates placeholder)
- **Source path**: `/mnt/main/.zfs/snapshot/cloudsync-new`
- **Destination**: `synology:/Backup/Viki/truenas` (SFTP)

## iSCSI Hardening

To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes:

```
Setting                                   Default   Hardened
────────────────────────────────────────────────────────────────
node.session.timeo.replacement_timeout    120s      300s
node.conn[0].timeo.noop_out_interval      5s        10s
node.conn[0].timeo.noop_out_timeout       5s        15s
node.conn[0].iscsi.HeaderDigest           None      CRC32C,None
node.conn[0].iscsi.DataDigest             None      CRC32C,None
```

- Applied to all 5 nodes (k8s-master + k8s-node1-4) on 2026-03-23
- Baked into the cloud-init template (`modules/create-template-vm/cloud_init.yaml`), so new nodes get these settings automatically

## Monitoring & Alerting

```
┌─────────────────────────────────────────────────────────┐
│                    Prometheus Alerts                    │
│                                                         │
│  PostgreSQLBackupStale     > 36h since last success     │
│  MySQLBackupStale          > 36h since last success     │
│  EtcdBackupStale           > 8d since last success      │
│  VaultBackupStale          > 8d since last success      │
│  VaultwardenBackupStale    > 8d since last success      │
│  RedisBackupStale          > 8d since last success      │
│  PrometheusBackupStale     > 32d since last success     │
│  CloudSyncStale            > 8d since last success      │
│  CloudSyncNeverRun         task never completed         │
│  CloudSyncFailing          task in error state          │
│  VaultwardenIntegrityFail  integrity_ok == 0            │
└─────────────────────────────────────────────────────────┘
```

- `cloudsync-monitor` CronJob queries the TrueNAS API every 6h, pushes to Pushgateway
- Vaultwarden integrity check pushes `vaultwarden_sqlite_integrity_ok` hourly

## Service Protection Matrix

| Service             | Layer 1 (ZFS) | Layer 2 (App)    | Layer 3 (Offsite) | Storage |
|---------------------|:---:|:---:|:---:|---------|
| PostgreSQL (12 DBs) | ✓ | ✓ daily          | ✓        | NFS   |
| MySQL (7 DBs)       | ✓ | ✓ daily          | ✓        | NFS   |
| Vault               | ✓ | ✓ weekly         | ✓        | iSCSI |
| etcd                | ✓ | ✓ weekly         | ✓        | local |
| Vaultwarden         | ✓ | ✓ 6h + integrity | ✓        | iSCSI |
| Redis               | ✓ | ✓ weekly         | ✓        | iSCSI |
| Prometheus          | ✓ | ✓ monthly        | excluded | NFS   |
| plotting-book       | ✓ | ✓ weekly         | ✓        | iSCSI |
| Immich              | ✓ | —                | ✓        | NFS   |
| Forgejo             | ✓ | —                | ✓        | NFS   |
| Paperless-ngx       | ✓ | —                | ✓        | NFS   |
| Other NFS services  | ✓ | —                | ✓        | NFS   |

NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency).

## Recovery Procedures

See individual runbooks in `docs/runbooks/`:
- `restore-postgresql.md`
- `restore-mysql.md`
- `restore-vault.md`
- `restore-vaultwarden.md`
- `restore-etcd.md`
- `restore-full-cluster.md`
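As a worked illustration of the staleness thresholds from the Monitoring & Alerting section: a backup alert fires when the time since the last successful run exceeds its threshold. The threshold values below mirror the alert table, while the helper function and sample timestamps are purely illustrative; the real checks are Prometheus rules over Pushgateway metrics.

```python
from datetime import datetime, timedelta

# Thresholds mirror the Prometheus alert table (subset shown).
STALE_AFTER = {
    "PostgreSQLBackupStale":  timedelta(hours=36),
    "MySQLBackupStale":       timedelta(hours=36),
    "EtcdBackupStale":        timedelta(days=8),
    "VaultwardenBackupStale": timedelta(days=8),
    "PrometheusBackupStale":  timedelta(days=32),
}

def firing(last_success: dict, now: datetime) -> list:
    """Alert names whose last successful backup is older than its threshold."""
    return sorted(
        alert for alert, limit in STALE_AFTER.items()
        if now - last_success[alert] > limit
    )

now = datetime(2026, 3, 23, 12, 0)
last = {
    "PostgreSQLBackupStale":  now - timedelta(hours=12),  # fresh
    "MySQLBackupStale":       now - timedelta(hours=40),  # stale
    "EtcdBackupStale":        now - timedelta(days=9),    # stale
    "VaultwardenBackupStale": now - timedelta(days=2),    # fresh
    "PrometheusBackupStale":  now - timedelta(days=20),   # fresh
}
print(firing(last, now))  # ['EtcdBackupStale', 'MySQLBackupStale']
```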