remove docs/backup-strategy.md, absorbed into architecture/backup-dr.md [ci skip]

Viktor Barzin 2026-03-24 01:08:06 +02:00
parent 5a42643176
commit dbff547741


@@ -1,248 +0,0 @@
# Backup & Disaster Recovery Strategy
Last updated: 2026-03-23
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│                         TrueNAS (10.0.10.15)                         │
│                                                                      │
│  ZFS Pool "main" (1.64 TiB)          ZFS Pool "ssd"                  │
│  ├── NFS shares (~100)               ├── Immich ML data              │
│  └── iSCSI zvols (~19 PVCs)          └── PostgreSQL data             │
│                                                                      │
│  Layer 1: ZFS Auto-Snapshots                                         │
│  ┌──────────────────────────────────────────────┐                    │
│  │  Every 12h  → auto-12h-*  (24h retention)    │                    │
│  │  Daily      → auto-*      (3-week retention) │                    │
│  │  Both pools, recursive, near-instant (<1s)   │                    │
│  └──────────────────────────────────────────────┘                    │
└────────────────┬────────────────────────────────┬────────────────────┘
                 │                                │
    ┌────────────▼────────────┐      ┌────────────▼────────────┐
    │  Layer 2: App Backups   │      │  Layer 3: Offsite Sync  │
    │  (K8s CronJobs → NFS)   │      │  (TrueNAS → Synology)   │
    └────────────┬────────────┘      └────────────┬────────────┘
                 │                                │
                 ▼                                ▼
    ┌─────────────────────┐          ┌──────────────────────────────┐
    │  /mnt/main/*-backup │          │  Synology NAS (192.168.1.13) │
    │  (NFS-exported dirs)│          │  /Backup/Viki/truenas        │
    └─────────────────────┘          └──────────────────────────────┘
```
## Layer 1: ZFS Auto-Snapshots
ZFS copy-on-write snapshots are near-instant and incur no disk I/O beyond a small metadata write.
| Pool   | Schedule    | Retention | Naming Scheme               |
|--------|-------------|-----------|-----------------------------|
| `main` | Every 12h   | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
| `main` | Daily 00:00 | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
| `ssd`  | Every 12h   | 24 hours  | `auto-12h-YYYY-MM-DD_HH-MM` |
| `ssd`  | Daily 00:00 | 3 weeks   | `auto-YYYY-MM-DD_HH-MM`     |
**Performance**: Both pools snapshot in <1 second (tested 2026-03-23).
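The scheduled snapshots are created by TrueNAS periodic snapshot tasks, but the same objects can be inspected and exercised from a shell. A minimal sketch (the share name and snapshot timestamp below are placeholders):

```
# List recent snapshots on both pools, oldest first
zfs list -t snapshot -o name,creation,used -s creation | tail -20

# Take an ad-hoc recursive snapshot of "main", matching the daily naming scheme
zfs snapshot -r "main@auto-$(date +%Y-%m-%d_%H-%M)"

# Recover a single file without a rollback: every dataset exposes its
# snapshots read-only under the hidden .zfs directory
cp /mnt/main/some-share/.zfs/snapshot/auto-2026-03-23_00-00/file.txt /tmp/
```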
## Layer 2: Application-Level Backups
K8s CronJobs dump application data to NFS-exported backup directories.
```
┌──────────────────────────────────────────────────────────────────┐
│                   K8s CronJob Backup Schedule                    │
│                                                                  │
│  00:00 ─── PostgreSQL (pg_dumpall → gzip -9) ────→ 14d retention │
│  00:30 ─── MySQL (mysqldump → gzip -9) ──────────→ 14d retention │
│                                                                  │
│  Sunday:                                                         │
│  01:00 ─── etcd (etcdctl snapshot) ──────────────→ 30d retention │
│  01:30 ─── Vaultwarden (sqlite3 .backup) ────────→ 30d retention │
│  02:00 ─── Vault (raft snapshot) ────────────────→ 30d retention │
│  03:00 ─── Redis (BGSAVE + copy) ────────────────→ 30d retention │
│  03:00 ─── plotting-book (sqlite3 .backup) ──────→ 30d retention │
│                                                                  │
│  Monthly (1st Sunday):                                           │
│  04:00 ─── Prometheus TSDB (snapshot → tar.gz) ──→ 2 copies      │
│                                                                  │
│  Every 6h:                                                       │
│  */6   ─── Vaultwarden backup ───────────────────→ 30d retention │
│  :30   ─── Vaultwarden integrity check ──────────→ metric push   │
└──────────────────────────────────────────────────────────────────┘
```
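The dump jobs all follow the same shape. Here is a sketch of the core of the 00:00 PostgreSQL job; the mount path and host name are illustrative, and the real CronJob mounts the NFS backup share and reads credentials from a Secret:

```
#!/bin/sh
set -eu
BACKUP_DIR=/backup            # NFS-backed mount inside the pod (assumed path)
STAMP=$(date +%Y-%m-%d_%H-%M)

# Dump every database in one pass and compress at maximum level
pg_dumpall -h postgres -U postgres | gzip -9 > "$BACKUP_DIR/pg_dumpall-$STAMP.sql.gz"

# Enforce the 14-day retention from the schedule above
find "$BACKUP_DIR" -name 'pg_dumpall-*.sql.gz' -mtime +14 -delete
```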
### Vaultwarden Enhanced Protection
Vaultwarden uses iSCSI storage (SQLite on a block device) and has extra safeguards:
```
Every 6 hours                          Every hour
┌─────────────────────────────┐   ┌──────────────────────────────┐
│  vaultwarden-backup         │   │  vaultwarden-integrity-check │
│                             │   │                              │
│  1. PRAGMA integrity_check  │   │  1. PRAGMA integrity_check   │
│     (fail → abort)          │   │  2. Push metric to           │
│  2. sqlite3 .backup         │   │     Pushgateway:             │
│  3. PRAGMA integrity_check  │   │     vaultwarden_sqlite_      │
│     on backup copy          │   │     integrity_ok {0|1}       │
│  4. Copy RSA keys,          │   └──────────────────────────────┘
│     attachments, sends,     │
│     config.json             │
│  5. Rotate (30d)            │
└─────────────────────────────┘
```
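A condensed sketch of the SQLite steps from the diagram above; the database and backup paths are assumptions, and steps 4-5 are omitted:

```
#!/bin/sh
set -eu
DB=/data/db.sqlite3                               # live Vaultwarden DB (assumed path)
OUT=/backup/db-$(date +%Y-%m-%d_%H-%M).sqlite3    # assumed backup target

# 1. Refuse to back up a database that is already corrupt
[ "$(sqlite3 "$DB" 'PRAGMA integrity_check;')" = "ok" ] || exit 1

# 2. Online backup: a consistent copy even while Vaultwarden is writing
sqlite3 "$DB" ".backup '$OUT'"

# 3. Verify the copy independently before trusting it
[ "$(sqlite3 "$OUT" 'PRAGMA integrity_check;')" = "ok" ] || exit 1
```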
## Layer 3: Offsite Sync to Synology NAS
Hybrid approach: fast incremental copies + weekly full sync for cleanup.
```
                 TrueNAS                               Synology
              (10.0.10.15)                          (192.168.1.13)
                    │                                      │
Every 6h (cron)     │  zfs diff → changed files list       │
════════════════    │                                      │
/root/cloudsync-    │  rclone copy --files-from            │
copy.sh             │  --no-traverse                       │
                    │──────────────────────────────────────→
                    │         Only changed files,          │
                    │         seconds to minutes           │
                    │                                      │
Sunday 09:00        │  rclone sync                         │
(Cloud Sync Task 1) │  (full traversal)                    │
════════════════    │──────────────────────────────────────→
                    │         ~30-60 min,                  │
                    │         handles deletions            │
                    │                                      │
```
### Incremental COPY — How It Works
```
cloudsync-copy-prev           cloudsync-copy
(previous snapshot)           (new snapshot)
        │                            │
        └────── zfs diff -F -H ──────┘
                      │
                      ▼
        Changed files only
        (type=F, excludes applied)
        /tmp/cloudsync_copy_files.txt
                      │
                      ▼
        rclone copy --files-from-raw
        --no-traverse (skip SFTP scan)
                      │
                      ▼
        Synology updated
                      │
                      ▼
        Rotate: prev → destroy, new → prev
```
**Key files**:
- Script: `/root/cloudsync-copy.sh`
- Log: `/var/log/cloudsync-copy.log`
- Cron job: TrueNAS cron id=1, `0 */6 * * *`
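A simplified skeleton of the script, reduced to a single dataset and without the exclude handling described in the next subsection (the real script also logs to the file above):

```
#!/bin/sh
set -eu
LIST=/tmp/cloudsync_copy_files.txt

# Snapshot marking "now"
zfs snapshot -r main@cloudsync-copy

# Regular files (type F) added or modified since the previous run;
# with -H -F, zfs diff prints: change-type <TAB> file-type <TAB> path.
# Deletions ("-") are skipped here; the weekly SYNC handles them.
zfs diff -H -F main@cloudsync-copy-prev main@cloudsync-copy \
  | awk -F'\t' '$1 != "-" && $2 == "F" { sub("^/mnt/main/", "", $3); print $3 }' > "$LIST"

# Push only the changed files; --no-traverse skips a full SFTP listing
rclone copy --files-from-raw "$LIST" --no-traverse \
  /mnt/main/.zfs/snapshot/cloudsync-copy synology:/Backup/Viki/truenas

# Rotate: the new snapshot becomes the baseline for the next diff
zfs destroy -r main@cloudsync-copy-prev
zfs rename -r main@cloudsync-copy main@cloudsync-copy-prev
```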
### Excludes (both incremental and weekly sync)
| Pattern | Reason |
|------------------------|-------------------------------------|
| `clickhouse/**` | 2.47M files, regenerable |
| `loki/**` | 68K files, regenerable logs |
| `iocage/**` | 96K files, legacy FreeBSD jails |
| `frigate/recordings/**`| 57K files, ephemeral video clips |
| `prometheus/**` | Large TSDB, separate monthly backup |
| `crowdsec/**` | Regenerable threat data |
| `servarr/downloads/**` | Transient download staging |
| `iscsi/**` | Raw zvols, backed up at app level |
| `iscsi-snaps/**` | Snapshot metadata |
| `ytldp/**` | YouTube downloads, replaceable |
| `*.log` | Log files |
| `post` | Transient POST data |
### Weekly SYNC (Cloud Sync Task 1)
- **Mode**: SYNC (mirrors source → destination, removes deleted files)
- **Schedule**: Sunday 09:00
- **Pre-script**: Creates ZFS snapshot `main@cloudsync-new`
- **Post-script**: Rotates snapshots (`new` → `prev`, creates placeholder)
- **Source path**: `/mnt/main/.zfs/snapshot/cloudsync-new`
- **Destination**: `synology:/Backup/Viki/truenas` (SFTP)
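The pre/post scripts amount to a small snapshot rotation around the rclone run. A sketch under the assumption that the "placeholder" is an empty `cloudsync-new` snapshot left for the next cycle:

```
# Pre-script: replace the placeholder with a fresh, consistent snapshot
zfs destroy -r main@cloudsync-new 2>/dev/null || true
zfs snapshot -r main@cloudsync-new

# ... Cloud Sync Task runs, roughly equivalent to:
#   rclone sync /mnt/main/.zfs/snapshot/cloudsync-new \
#     synology:/Backup/Viki/truenas --exclude 'clickhouse/**' ...

# Post-script: promote the just-synced snapshot to "prev"
zfs destroy -r main@cloudsync-prev
zfs rename -r main@cloudsync-new main@cloudsync-prev
zfs snapshot -r main@cloudsync-new    # placeholder so the next rotation has a target
```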
## iSCSI Hardening
To prevent SQLite corruption from transient network disruptions, iSCSI
initiator timeouts are relaxed on all K8s nodes:
```
Setting                                    Default   Hardened
─────────────────────────────────────────────────────────────
node.session.timeo.replacement_timeout     120s      300s
node.conn[0].timeo.noop_out_interval       5s        10s
node.conn[0].timeo.noop_out_timeout        5s        15s
node.conn[0].iscsi.HeaderDigest            None      CRC32C,None
node.conn[0].iscsi.DataDigest              None      CRC32C,None
```
- Applied to all 5 nodes (k8s-master + k8s-node1-4) on 2026-03-23
- Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`)
so new nodes get these settings automatically
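For reference, the same values can be applied by hand with iscsiadm; the cloud-init template presumably bakes the equivalent defaults into `/etc/iscsi/iscsid.conf`:

```
# Update every existing node record in place (values from the table above)
iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15
iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None
iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None

# Changes only apply on the next login; re-log sessions in a maintenance window
iscsiadm -m node --logoutall=all && iscsiadm -m node --loginall=all
```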
## Monitoring & Alerting
```
┌─────────────────────────────────────────────────────────┐
│                    Prometheus Alerts                    │
│                                                         │
│  PostgreSQLBackupStale     > 36h since last success     │
│  MySQLBackupStale          > 36h since last success     │
│  EtcdBackupStale           > 8d since last success      │
│  VaultBackupStale          > 8d since last success      │
│  VaultwardenBackupStale    > 8d since last success      │
│  RedisBackupStale          > 8d since last success      │
│  PrometheusBackupStale     > 32d since last success     │
│  CloudSyncStale            > 8d since last success      │
│  CloudSyncNeverRun         task never completed         │
│  CloudSyncFailing          task in error state          │
│  VaultwardenIntegrityFail  integrity_ok == 0            │
└─────────────────────────────────────────────────────────┘
```
- `cloudsync-monitor` CronJob queries the TrueNAS API every 6h and pushes Cloud Sync task status to Pushgateway
- Vaultwarden integrity check pushes `vaultwarden_sqlite_integrity_ok` hourly
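The `*Stale` alerts all follow the same pattern: each job pushes a last-success timestamp to Pushgateway, and the alert fires when `time()` minus that timestamp exceeds the threshold. A sketch with assumed metric, job, and host names:

```
# Pushed at the end of a successful backup run (metric/job names assumed)
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/postgresql-backup
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds $(date +%s)
EOF

# Corresponding PromQL alert expression (36h threshold for PostgreSQL):
#   time() - backup_last_success_timestamp_seconds{job="postgresql-backup"} > 36 * 3600
```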
## Service Protection Matrix
| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
|---------|:---:|:---:|:---:|---------|
| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | NFS |
| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | NFS |
| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
| etcd | ✓ | ✓ weekly | ✓ | local |
| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
| Prometheus | ✓ | ✓ monthly | excluded | NFS |
| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
| Immich | ✓ | — | ✓ | NFS |
| Forgejo | ✓ | — | ✓ | NFS |
| Paperless-ngx | ✓ | — | ✓ | NFS |
| Other NFS services | ✓ | — | ✓ | NFS |
NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots +
offsite sync. Application-level backups are only needed for services with
complex state (databases, Raft consensus, multi-file consistency).
## Recovery Procedures
See individual runbooks in `docs/runbooks/`:
- `restore-postgresql.md`
- `restore-mysql.md`
- `restore-vault.md`
- `restore-vaultwarden.md`
- `restore-etcd.md`
- `restore-full-cluster.md`