update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
Parent: d5b0990ed1
Commit: b345b086ef
10 changed files with 1051 additions and 332 deletions
@ -1,10 +1,17 @@
|
|||
# Backup & Disaster Recovery Architecture
|
||||
|
||||
Last updated: 2026-03-24
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Overview
|
||||
|
||||
The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides near-instant local snapshots via ZFS auto-snapshots on TrueNAS (every 12h + daily, up to 3-week retention). Layer 2 adds application-level backups for complex stateful services (databases, Vault, etcd) via K8s CronJobs dumping to NFS-exported directories with 14-30 day retention. Layer 3 ensures offsite protection through hybrid incremental/full sync to a Synology NAS every 6 hours (incremental via ZFS diff) plus weekly full sync (Sunday 09:00) for cleanup. This architecture provides <1s RPO for file data, 6h RPO for offsite, and <30min RTO for most services.
|
||||
The homelab uses a 3-2-1 backup strategy: **3 copies** (live PVCs on sdc, weekly backups on sda, offsite on Synology), **2 media types** (RAID1 HDD thin pool on sdc, RAID1 SAS disk on sda), **1 offsite copy** (Synology NAS). Daily LVM snapshots (created in <1s, kept for 7 days) bound the RPO for snapshot rollback at 24h, weekly file-level backups bound it at 7 days, and RTO is <30min for most services.
|
||||
|
||||
**3-2-1 Breakdown**:
|
||||
- **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD)
|
||||
- **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS)
|
||||
- **Copy 3** (offsite): Synology NAS at 192.168.1.13 via two paths:
|
||||
- `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (rsync --files-from weekly)
|
||||
- `Synology/Backup/Viki/truenas/` — TrueNAS NFS media (Cloud Sync, narrowed to media only)
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
|
|
@ -12,54 +19,64 @@ The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides ne
|
|||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph TrueNAS["TrueNAS (10.0.10.15)"]
|
||||
ZFS_Data["ZFS Pools<br/>main (1.64 TiB)<br/>ssd (~256GB)"]
|
||||
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
|
||||
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
|
||||
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
|
||||
|
||||
subgraph Layer1["Layer 1: ZFS Auto-Snapshots"]
|
||||
Snap12h["Every 12h<br/>auto-12h-*<br/>24h retention"]
|
||||
SnapDaily["Daily 00:00<br/>auto-*<br/>3-week retention"]
|
||||
subgraph Layer1["Layer 1: LVM Thin Snapshots"]
|
||||
Snap["Daily 03:00<br/>7-day retention<br/>62 PVCs (excludes dbaas+monitoring)"]
|
||||
end
|
||||
|
||||
ZFS_Data --> Snap12h
|
||||
ZFS_Data --> SnapDaily
|
||||
subgraph Layer2["Layer 2: Weekly File Backup"]
|
||||
PVCBackup["PVC File Copy<br/>Sunday 05:00<br/>4 weekly versions<br/>/mnt/backup/pvc-data/<YYYY-WW>/"]
|
||||
NFSMirror["NFS Mirror<br/>DB dumps + backup CronJob output<br/>/mnt/backup/nfs-mirror/"]
|
||||
PfsenseBackup["pfSense Backup<br/>config.xml + full tar<br/>4 weekly versions"]
|
||||
PVEConfig["PVE Config<br/>/etc/pve + scripts"]
|
||||
end
|
||||
|
||||
NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
|
||||
sdc --> Snap
|
||||
sdc --> PVCBackup
|
||||
PVCBackup --> sda
|
||||
NFSMirror --> sda
|
||||
PfsenseBackup --> sda
|
||||
PVEConfig --> sda
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster"]
|
||||
subgraph Layer2["Layer 2: App Backups"]
|
||||
subgraph TrueNAS["TrueNAS (10.0.10.15)"]
|
||||
NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
|
||||
Media["Media (NFS)<br/>Immich ~800GB<br/>audiobookshelf, servarr, navidrome"]
|
||||
|
||||
subgraph AppBackups["App-Level Backup CronJobs"]
|
||||
CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
|
||||
CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden, plotting-book<br/>30d retention"]
|
||||
CronMonthly["Monthly 1st Sunday<br/>Prometheus TSDB<br/>2 copies"]
|
||||
Cron6h["Every 6h<br/>Vaultwarden backup<br/>+ integrity check"]
|
||||
CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden<br/>30d retention"]
|
||||
end
|
||||
|
||||
CronDaily --> NFS_Backup
|
||||
CronWeekly --> NFS_Backup
|
||||
CronMonthly --> NFS_Backup
|
||||
Cron6h --> NFS_Backup
|
||||
end
|
||||
|
||||
subgraph Layer3["Layer 3: Offsite Sync"]
|
||||
Incremental["Every 6h<br/>zfs diff → rclone copy<br/>--files-from --no-traverse"]
|
||||
FullSync["Weekly Sunday 09:00<br/>rclone sync<br/>handles deletions"]
|
||||
PVEOffsite["PVE → Synology<br/>Sunday 08:00<br/>rsync --files-from<br/>/Backup/Viki/pve-backup/"]
|
||||
CloudSync["TrueNAS → Synology<br/>Monday 09:00<br/>Cloud Sync (media only)<br/>/Backup/Viki/truenas/"]
|
||||
end
|
||||
|
||||
ZFS_Data --> Incremental
|
||||
ZFS_Data --> FullSync
|
||||
sda --> PVEOffsite
|
||||
Media --> CloudSync
|
||||
|
||||
Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/truenas"]
|
||||
Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"]
|
||||
|
||||
Incremental --> Synology
|
||||
FullSync --> Synology
|
||||
PVEOffsite --> Synology
|
||||
CloudSync --> Synology
|
||||
|
||||
NFS_Backup -.->|mirrored to sda| NFSMirror
|
||||
|
||||
subgraph Monitoring["Monitoring & Alerting"]
|
||||
Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale<br/>MySQLBackupStale<br/>CloudSyncStale<br/>VaultwardenIntegrityFail"]
|
||||
Pushgateway["Pushgateway<br/>cloudsync metrics<br/>vaultwarden integrity"]
|
||||
Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale, MySQLBackupStale<br/>WeeklyBackupStale, OffsiteBackupSyncStale<br/>LVMSnapshotStale, BackupDiskFull<br/>VaultwardenIntegrityFail"]
|
||||
Pushgateway["Pushgateway<br/>backup script metrics<br/>cloudsync metrics<br/>vaultwarden integrity"]
|
||||
end
|
||||
|
||||
NFS_Backup -.->|scrape| Prometheus
|
||||
Synology -.->|API query| Pushgateway
|
||||
PVCBackup -.->|push metrics| Pushgateway
|
||||
Snap -.->|push metrics| Pushgateway
|
||||
Pushgateway --> Prometheus
|
||||
|
||||
style Layer1 fill:#c8e6c9
|
||||
|
|
@ -68,6 +85,89 @@ graph TB
|
|||
style Monitoring fill:#f3e5f5
|
||||
```
|
||||
|
||||
### Weekly Backup Timeline
|
||||
|
||||
```mermaid
graph LR
    subgraph Sunday["Sunday Timeline"]
        S01["01:00 etcd backup<br/>(CronJob)"]
        S02["02:00 Vault backup<br/>(CronJob)"]
        S03a["03:00 Redis backup<br/>(CronJob)"]
        S03b["03:00 LVM snapshots<br/>(lvm-pvc-snapshot timer)"]
        S05["05:00 Weekly backup<br/>(weekly-backup timer)<br/>1. NFS mirror<br/>2. PVC file copy<br/>3. pfSense backup<br/>4. PVE config<br/>5. Prune snapshots<br/>6. Generate manifest"]
        S08["08:00 Offsite sync<br/>(offsite-sync-backup timer)<br/>rsync --files-from"]
    end

    S01 --> S02 --> S03a --> S03b --> S05 --> S08

    subgraph Monday["Monday"]
        M09["09:00 TrueNAS Cloud Sync<br/>Media → Synology"]
    end

    S08 -.->|next day| M09

    style Sunday fill:#ffe0b2
    style Monday fill:#e1f5ff
```
|
||||
|
||||
### Physical Disk Layout
|
||||
|
||||
```mermaid
graph TB
    subgraph PVE["Proxmox Host (192.168.1.127)"]
        subgraph sda["sda: 1.1TB RAID1 SAS"]
            sda_vg["VG: backup<br/>LV: data (ext4)<br/>/mnt/backup"]
            sda_content["pvc-data/<YYYY-WW>/<ns>/<pvc>/<br/>nfs-mirror/<service>-backup/<br/>pfsense/<YYYY-WW>/<br/>pve-config/"]
        end

        subgraph sdb["sdb: 931GB SSD"]
            sdb_vg["VG: pve<br/>LV: root (ext4)<br/>PVE host OS"]
        end

        subgraph sdc["sdc: 10.7TB RAID1 HDD"]
            sdc_vg["VG: pve<br/>LV: data (thin pool)<br/>65 proxmox-lvm PVCs<br/>+ VM disks"]
        end

        sda_vg --> sda_content
    end

    sdc -.->|weekly backup<br/>mount snapshot ro| sda
    sda -.->|offsite sync<br/>rsync| Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/pve-backup/"]

    style sda fill:#fff9c4
    style sdb fill:#c8e6c9
    style sdc fill:#e1f5ff
```
|
||||
|
||||
### Restore Decision Tree
|
||||
|
||||
```mermaid
graph TB
    Start["Data loss detected"]
    Age{"How old is<br/>the lost data?"}
    Type{"What type<br/>of data?"}

    Start --> Age

    Age -->|"< 7 days"| LVM["Use LVM snapshot<br/>lvm-pvc-snapshot restore<br/>RTO: <5 min"]
    Age -->|"> 7 days,<br/>< 4 weeks"| FileBackup["Use sda file backup<br/>/mnt/backup/pvc-data/<week>/<br/>RTO: <15 min"]
    Age -->|"> 4 weeks or<br/>site disaster"| Offsite["Use Synology backup<br/>Synology/pve-backup/<br/>RTO: <4 hours"]

    LVM --> Type
    FileBackup --> Type
    Offsite --> Type

    Type -->|"Database"| AppBackup["Use app-level dump<br/>/mnt/backup/nfs-mirror/<service>-backup/<br/>OR Synology/pve-backup/nfs-mirror/<br/>RTO: <10 min"]
    Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"]
    Type -->|"Media (NFS)"| CloudSync["Use Synology backup<br/>Synology/truenas/<service>/<br/>RTO: varies by size"]

    style Start fill:#ffcdd2
    style LVM fill:#c8e6c9
    style FileBackup fill:#fff9c4
    style Offsite fill:#e1f5ff
    style AppBackup fill:#e1bee7
```
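
The age branch of the decision tree can be sketched as a small shell function. This is an illustrative sketch only: `choose_restore_source` and its return labels are hypothetical names, and treating exactly 7 days as "use the sda file backup" and exactly 28 days as "use Synology" is an assumption, since the tree does not say which side the boundaries fall on.

```bash
# Hypothetical helper; the boundary handling is an assumption (see above).
# Map the age of the lost data (in days) to a restore source.
# Pass "yes" as the second argument for a site disaster.
choose_restore_source() {
    age_days=$1
    site_disaster=${2:-no}
    if [ "$site_disaster" = "yes" ] || [ "$age_days" -ge 28 ]; then
        echo "synology-offsite"
    elif [ "$age_days" -ge 7 ]; then
        echo "sda-file-backup"
    else
        echo "lvm-snapshot"
    fi
}
```

For example, `choose_restore_source 3` selects the LVM snapshot path, while `choose_restore_source 3 yes` selects Synology regardless of age. The data-type branch (database, PVC files, media) still has to be applied afterwards.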
|
||||
|
||||
### Vaultwarden Enhanced Protection
|
||||
|
||||
```mermaid
|
||||
|
|
@ -97,127 +197,103 @@ graph LR
|
|||
style Hourly fill:#e1bee7
|
||||
```
|
||||
|
||||
### Incremental Offsite Sync
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
Prev["ZFS snapshot<br/>main@cloudsync-prev"]
|
||||
New["ZFS snapshot<br/>main@cloudsync-new"]
|
||||
|
||||
Prev --> Diff["zfs diff -F -H<br/>prev vs new"]
|
||||
New --> Diff
|
||||
|
||||
Diff --> Filter["Filter type=F<br/>Apply excludes"]
|
||||
Filter --> FileList["/tmp/cloudsync_copy_files.txt"]
|
||||
|
||||
FileList --> Rclone["rclone copy<br/>--files-from-raw<br/>--no-traverse"]
|
||||
|
||||
Rclone --> Synology["Synology NAS<br/>192.168.1.13"]
|
||||
|
||||
Synology --> Rotate["Rotate snapshots:<br/>destroy prev<br/>rename new → prev"]
|
||||
|
||||
Excludes["Excludes:<br/>clickhouse (2.47M files)<br/>loki (68K files)<br/>prometheus, iscsi<br/>frigate/recordings<br/>*.log"]
|
||||
|
||||
Filter -.->|uses| Excludes
|
||||
|
||||
style FileList fill:#fff9c4
|
||||
style Excludes fill:#ffcdd2
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version/Schedule | Location | Purpose |
|
||||
|-----------|-----------------|----------|---------|
|
||||
| ZFS Auto-Snapshots | Every 12h + daily | TrueNAS pools (main, ssd) | Near-instant local protection |
|
||||
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for 12 databases |
|
||||
| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for 7 databases |
|
||||
| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 62 proxmox-lvm PVCs |
|
||||
| Weekly PVC Backup | Sunday 05:00, 4 weeks | PVE host: `weekly-backup` | File-level PVC copy to sda |
|
||||
| NFS Mirror | Sunday 05:00 (step of `weekly-backup`) | PVE host: mount NFS ro → rsync | Mirror DB dumps to sda |
| pfSense Backup | Sunday 05:00 (step of `weekly-backup`) | PVE host: SSH + API | config.xml + full filesystem tar |
|
||||
| Offsite Sync | Sunday 08:00 (after weekly-backup) | PVE host: `offsite-sync-backup` | rsync sda → Synology |
|
||||
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
|
||||
| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for all databases |
|
||||
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot |
|
||||
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity |
|
||||
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
|
||||
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
|
||||
| Prometheus Backup | Monthly 1st Sunday, 2 copies | CronJob in `monitoring` | TSDB snapshot → tar.gz |
|
||||
| plotting-book Backup | Weekly Sunday 03:00, 30d | CronJob in `plotting-book` | sqlite3 .backup |
|
||||
| LVM Thin Snapshots | Twice daily (00:00, 12:00), 7d | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 13 proxmox-lvm PVCs |
|
||||
| Incremental Sync | Every 6h (cron) | TrueNAS: `/root/cloudsync-copy.sh` | ZFS diff → rclone copy |
|
||||
| Full Sync | Weekly Sunday 09:00 | TrueNAS Cloud Sync Task 1 | rclone sync with deletions |
|
||||
| CloudSync Monitor | Every 6h (cron) | CronJob in `monitoring` | Query TrueNAS API → Pushgateway |
|
||||
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
|
||||
| TrueNAS Cloud Sync | Monday 09:00 (weekly) | TrueNAS Cloud Sync Task 1 | Media → Synology NAS |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Layer 1: ZFS Auto-Snapshots
|
||||
### Layer 1: LVM Thin Snapshots (Fast Local Recovery)
|
||||
|
||||
ZFS snapshots are copy-on-write markers that capture filesystem state in <1 second with zero I/O overhead (only metadata).
|
||||
|
||||
**Schedule**:
|
||||
| Pool | Frequency | Naming | Retention | Purpose |
|
||||
|------|-----------|--------|-----------|---------|
|
||||
| `main` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Recover from recent mistakes |
|
||||
| `main` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Point-in-time recovery |
|
||||
| `ssd` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Same as main |
|
||||
| `ssd` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Same as main |
|
||||
|
||||
**Performance**: Snapshot creation takes <1s for both pools (tested 2026-03-23).
|
||||
|
||||
**Rollback**:
|
||||
```bash
|
||||
# List snapshots
|
||||
zfs list -t snapshot | grep main/<service>
|
||||
|
||||
# Rollback to snapshot
|
||||
zfs rollback main/<service>@auto-2026-03-23_00-00
|
||||
|
||||
# Clone snapshot (non-destructive)
|
||||
zfs clone main/<service>@auto-2026-03-23_00-00 main/<service>-recovered
|
||||
```
|
||||
|
||||
### Layer 1b: LVM Thin Snapshots (Proxmox CSI PVCs)
|
||||
|
||||
Native LVM thin snapshots provide crash-consistent point-in-time recovery for all 13 Proxmox CSI PVCs (~340Gi). These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
|
||||
Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
|
||||
|
||||
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot`)
|
||||
**Schedule**: Twice daily (00:00, 12:00) via systemd timer, 7-day retention (max 14 snapshots per LV)
|
||||
**Schedule**: Daily 03:00 via systemd timer, 7-day retention
|
||||
**Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data`
|
||||
|
||||
**Coverage**: All proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because:
|
||||
**Coverage**: 62 of the 65 proxmox-lvm PVCs, that is, all **except** those in the `dbaas` and `monitoring` namespaces. These are excluded because:
|
||||
- MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour)
|
||||
- They already have app-level dumps (Layer 2)
|
||||
- Including them causes ~36% write amplification; excluding them reduces overhead to ~0%
|
||||
|
||||
Snapshotted PVCs include: Redis, Vaultwarden, Calibre, Nextcloud, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc. (~20 low-churn LVs)
|
||||
|
||||
**Exclusion config**: `EXCLUDE_NAMESPACES` variable in script (default: `dbaas,monitoring`). Uses kubectl to resolve LV names dynamically.
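
The discovery pattern and the namespace exclusion can be sketched as two small shell helpers. This is a sketch, not the contents of `lvm-pvc-snapshot`: the function names `is_pvc_lv` and `is_excluded_namespace` are hypothetical, and the kubectl lookup that maps an LV name to its namespace is omitted.

```bash
# Hypothetical helpers; not taken from the real lvm-pvc-snapshot script.
EXCLUDE_NAMESPACES="${EXCLUDE_NAMESPACES:-dbaas,monitoring}"

# Succeeds when an LV name matches the vm-*-pvc-* discovery pattern.
is_pvc_lv() {
    case "$1" in
        vm-*-pvc-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the namespace appears in the comma-separated exclusion list.
# Wrapping both sides in commas avoids partial matches such as "db".
is_excluded_namespace() {
    case ",${EXCLUDE_NAMESPACES}," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A snapshot loop would then skip any LV for which `is_pvc_lv` fails or whose resolved namespace passes `is_excluded_namespace`.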
|
||||
|
||||
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
|
||||
|
||||
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
|
||||
|
||||
### Layer 2: Application-Level Backups
|
||||
### Layer 2: Weekly File-Level Backup (sda Backup Disk)
|
||||
|
||||
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
|
||||
|
||||
**Script**: `/usr/local/bin/weekly-backup` on PVE host (source: `infra/scripts/weekly-backup`)
|
||||
**Schedule**: Sunday 05:00 via systemd timer
|
||||
**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup)
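
A minimal sketch of how the 4-version retention could be enforced, assuming week directories are named `YYYY-WW` with zero-padded weeks so that lexical order equals chronological order. The function name `weekly_dirs_to_prune` is hypothetical, and whether `weekly-backup` prunes this way is an assumption. The function only lists candidates and deletes nothing.

```bash
# Hypothetical helper; lists week directories outside the retention window.
# Args: <root-dir> [versions-to-keep, default 4]
# Prints one path per line, oldest first.
weekly_dirs_to_prune() {
    root=$1
    keep=${2:-4}
    pattern='[0-9][0-9][0-9][0-9]-[0-9][0-9]'
    total=$(find "$root" -mindepth 1 -maxdepth 1 -type d -name "$pattern" | wc -l)
    excess=$((total - keep))
    [ "$excess" -gt 0 ] || return 0
    find "$root" -mindepth 1 -maxdepth 1 -type d -name "$pattern" \
        | LC_ALL=C sort | head -n "$excess"
}
```

Separating the listing from the deletion keeps a dry run trivial: review the output of `weekly_dirs_to_prune /mnt/backup/pvc-data` before piping it into a removal step.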
|
||||
|
||||
#### What Gets Backed Up
|
||||
|
||||
**1. PVC File Copies** (`/mnt/backup/pvc-data/<YYYY-WW>/`):
|
||||
- Mount each LVM thin LV ro on PVE host → rsync files (not block) → unmount
|
||||
- 62 PVCs covered (all except dbaas + monitoring)
|
||||
- Organized as `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/`
|
||||
- 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes)
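
The weekly directory label and the `--link-dest` copy could fit together roughly as follows. This sketch assumes GNU `date`, that `<YYYY-WW>` means ISO year and week (`%G-%V`), and that the snapshot is already mounted read-only. The names `week_label`, `previous_week_label`, and `pvc_rsync_cmd` are hypothetical, and the last function only prints the rsync command instead of running it.

```bash
# ISO year-week label (e.g. 2026-15) for a date; assumes GNU date.
week_label() {
    TZ=UTC date -d "$1" +%G-%V
}

# Label of the week before the given date, used as the --link-dest baseline.
previous_week_label() {
    TZ=UTC date -d "$1 7 days ago" +%G-%V
}

# Print the rsync command for one PVC. Files unchanged since last week are
# hardlinked to the previous copy, so only changed files use new space.
# Args: <mounted-snapshot-dir> <namespace> <pvc-name> <date>
pvc_rsync_cmd() {
    root=/mnt/backup/pvc-data
    this=$(week_label "$4")
    prev=$(previous_week_label "$4")
    printf 'rsync -a --link-dest=%s/%s/%s/%s/ %s/ %s/%s/%s/%s/\n' \
        "$root" "$prev" "$2" "$3" "$1" "$root" "$this" "$2" "$3"
}
```

If last week's directory does not exist, rsync falls back to a plain full copy, so the first run needs no special handling.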
|
||||
|
||||
**2. NFS Backup Mirror** (`/mnt/backup/nfs-mirror/`):
|
||||
- Mount TrueNAS NFS ro → rsync DB dump dirs → unmount
|
||||
- Covers: `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/`
|
||||
- Single copy (no rotation) — latest dump only
|
||||
|
||||
**3. pfSense Backup** (`/mnt/backup/pfsense/<YYYY-WW>/`):
|
||||
- `config.xml` via API (base64 decode)
|
||||
- Full filesystem tar via SSH (`tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf`)
|
||||
- 4 weekly versions
|
||||
|
||||
**4. PVE Config** (`/mnt/backup/pve-config/`):
|
||||
- `/etc/pve/` (cluster config, VM definitions)
|
||||
- `/usr/local/bin/` (custom scripts)
|
||||
- `/etc/systemd/system/` (timers)
|
||||
- Single copy (no rotation)
|
||||
|
||||
**Manifest Generation**: After backup completes, generates `/mnt/backup/manifest.txt` with all file paths (relative to `/mnt/backup/`). Used by offsite sync `--files-from`.
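
A minimal sketch of manifest generation, assuming the manifest is simply a sorted list of every regular file under the backup root. The function name `generate_manifest` and the temporary file name are hypothetical, and the real script may filter or order entries differently.

```bash
# Hypothetical helper; writes <root>/manifest.txt with paths relative to
# <root>, the form that rsync --files-from expects.
generate_manifest() {
    root=$1
    (
        cd "$root" || exit 1
        # Skip the manifest itself and its temp file so they are never listed.
        find . -type f ! -name manifest.txt ! -name manifest.txt.tmp \
            | sed 's|^\./||' | LC_ALL=C sort
    ) > "$root/manifest.txt.tmp" && mv "$root/manifest.txt.tmp" "$root/manifest.txt"
}

# The offsite sync would then consume it roughly like this (not executed here):
#   rsync -a --files-from=/mnt/backup/manifest.txt /mnt/backup/ \
#       synology.viktorbarzin.lan:/Backup/Viki/pve-backup/
```

Writing to a temporary file and renaming it means a reader never sees a half-written manifest if the offsite sync starts while generation is still running.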
|
||||
|
||||
**Snapshot Pruning**: Deletes LVM snapshots older than 7 days, as a safety net for any snapshots that the `lvm-pvc-snapshot` timer failed to prune.
|
||||
|
||||
**Monitoring**: Pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`.
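
Pushing the success timestamp could look roughly like this. The sketch assumes the standard Pushgateway HTTP interface (a POST of text-format metrics to `/metrics/job/<job>`). `PUSHGATEWAY_URL`, the `pushgateway.example` host, and the function names are placeholders, because the document only states that the NodePort is 30091.

```bash
# Prometheus text-format line for a "last success" gauge.
# Args: <metric-name> [unix-timestamp]; defaults to the current time.
metric_payload() {
    printf '%s %s\n' "$1" "${2:-$(date +%s)}"
}

# POST the payload to a Pushgateway job (placeholder URL, see above).
# Args: <metric-name> <unix-timestamp-or-empty> <job-name>
push_metric() {
    metric_payload "$1" "$2" | curl -fsS --data-binary @- \
        "${PUSHGATEWAY_URL:-http://pushgateway.example:30091}/metrics/job/$3"
}
```

A backup script would call something like `push_metric backup_weekly_last_success_timestamp "" weekly-backup` only after every step has succeeded, so that a stale timestamp is what triggers `WeeklyBackupStale`.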
|
||||
|
||||
### Layer 2b: Application-Level Backups
|
||||
|
||||
K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/<service>-backup/`.
|
||||
|
||||
**Why needed**: ZFS snapshots capture block-level state, but:
|
||||
- Cannot restore individual databases from a PostgreSQL zvol snapshot
|
||||
- iSCSI zvols are opaque to TrueNAS (raw blocks)
|
||||
- Need point-in-time recovery for specific apps without full ZFS rollback
|
||||
**Why needed**: LVM snapshots capture block-level state, but:
|
||||
- Cannot restore individual databases from a PostgreSQL snapshot
|
||||
- Proxmox CSI LVs are opaque to TrueNAS (raw block devices)
|
||||
- Need point-in-time recovery for specific apps without full LVM rollback
|
||||
|
||||
**Daily backups (00:00-00:30)**:
|
||||
- **PostgreSQL** (`pg_dumpall`): Dumps all 12 databases to `/mnt/main/dbaas-backups/postgresql/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
|
||||
- **MySQL** (`mysqldump`): Dumps all 7 databases individually. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.
|
||||
- **PostgreSQL** (`pg_dumpall`): Dumps all databases to `/mnt/main/postgresql-backup/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
|
||||
- **MySQL** (`mysqldump`): Dumps all databases. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.
|
||||
|
||||
**Weekly backups (Sunday 01:00-04:00)**:
|
||||
- **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery.
|
||||
- **Vaultwarden**: See "Vaultwarden Enhanced Protection" below. 30-day retention.
|
||||
- **Vault**: `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention.
|
||||
- **Redis**: `redis-cli BGSAVE` then copy RDB file. 30-day retention.
|
||||
- **plotting-book**: `sqlite3 /data/db.sqlite ".backup '/mnt/main/plotting-book-backup/backup-$(date +%Y%m%d).sqlite'"`. 30-day retention.
|
||||
|
||||
**Monthly backups (1st Sunday 04:00)**:
|
||||
- **Prometheus**: `curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot` → tar.gz snapshot. Keeps 2 most recent copies (older ones purged).
|
||||
|
||||
### Vaultwarden Enhanced Protection
|
||||
|
||||
Vaultwarden stores sensitive password vault data in SQLite on an iSCSI volume. Extra safeguards prevent corruption:
|
||||
Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards prevent corruption:
|
||||
|
||||
**Every 6 hours** (vaultwarden-backup CronJob):
|
||||
1. Run `PRAGMA integrity_check` on live database
|
||||
|
|
@ -236,101 +312,47 @@ This provides both frequent backups (every 6h) AND continuous integrity monitori
|
|||
|
||||
### Layer 3: Offsite Sync to Synology NAS
|
||||
|
||||
Two complementary sync methods run on TrueNAS:
|
||||
Two independent paths push backups offsite:
|
||||
|
||||
**Incremental COPY (every 6 hours)**:
|
||||
#### Path 1: PVE Host Backups (rsync)
|
||||
|
||||
Runs `/root/cloudsync-copy.sh` via cron. Uses ZFS diff to identify changed files since last sync, then copies only those files.
|
||||
**Script**: `/usr/local/bin/offsite-sync-backup` on PVE host (source: `infra/scripts/offsite-sync-backup`)
|
||||
**Schedule**: Sunday 08:00 via systemd timer (After=weekly-backup.service)
|
||||
**Method**: `rsync --files-from /mnt/backup/manifest.txt` to `synology.viktorbarzin.lan:/Backup/Viki/pve-backup/`
|
||||
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` (full sync, removes deleted files)
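
The choice between the weekly manifest-driven run and the monthly full sync can be sketched as a date check. This assumes GNU `date`. The function names are hypothetical, and the exact flag sets are assumptions, since the document names only `--files-from` and `--delete`.

```bash
# Succeeds when the given date (YYYY-MM-DD) is the first Sunday of its month.
is_first_sunday() {
    dow=$(date -d "$1" +%u)   # 1 = Monday ... 7 = Sunday
    dom=$(date -d "$1" +%d)   # zero-padded day of month
    dom=${dom#0}              # strip the padding
    [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}

# Print the rsync flags for that day's run (illustrative only).
offsite_rsync_flags() {
    if is_first_sunday "$1"; then
        echo "-a --delete"
    else
        echo "-a --files-from=/mnt/backup/manifest.txt"
    fi
}
```

The first Sunday always falls on day 1 to 7 of the month, so checking the weekday plus `day <= 7` is sufficient.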
|
||||
|
||||
Flow:
|
||||
1. Take new snapshot: `zfs snapshot main@cloudsync-new`
|
||||
2. If previous snapshot exists: `zfs diff -F -H main@cloudsync-prev main@cloudsync-new`
|
||||
3. Filter output:
|
||||
- Keep only `type=F` (files, not directories)
|
||||
- Apply excludes (clickhouse, loki, prometheus, etc.)
|
||||
- Write to `/tmp/cloudsync_copy_files.txt`
|
||||
4. Run `rclone copy --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse`
|
||||
5. Rotate snapshots: `zfs destroy cloudsync-prev`, `zfs rename cloudsync-new cloudsync-prev`
|
||||
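**Why fast**: rsync reads its file list from the manifest generated by weekly-backup instead of walking the whole tree, and its quick check skips files whose size and mtime already match on the Synology. `--no-implied-dirs` keeps rsync from re-sending attributes for the implied parent directories of each listed path.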
|
||||
|
||||
**Why fast**: Only changed files are transferred. ZFS diff is instant (metadata scan). `--no-traverse` skips SFTP directory scan.
|
||||
**Destination**: `Synology/Backup/Viki/pve-backup/` mirrors sda `/mnt/backup/` structure:
|
||||
- `pvc-data/<YYYY-WW>/` — 4 weekly PVC file backups
|
||||
- `nfs-mirror/` — latest DB dumps
|
||||
- `pfsense/<YYYY-WW>/` — 4 weekly pfSense backups
|
||||
- `pve-config/` — latest PVE config
|
||||
|
||||
**Fallback**: If no previous snapshot or >100k changed files → falls back to full `find` command.
|
||||
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
|
||||
|
||||
**Weekly SYNC (Sunday 09:00)**:
|
||||
|
||||
TrueNAS Cloud Sync Task 1 runs `rclone sync` which:
|
||||
- Mirrors source → destination (removes deleted files on destination)
|
||||
- Full directory traversal (~30-60 min)
|
||||
- Ensures offsite is clean (no orphaned files from renamed paths)
|
||||
|
||||
**Why both methods**:
|
||||
- Incremental: Fast recovery for recent changes (seconds to minutes)
|
||||
- Full sync: Cleanup pass to handle deletions, renames, edge cases
|
||||
#### Path 2: TrueNAS Media (Cloud Sync)
|
||||
|
||||
**Task**: TrueNAS Cloud Sync Task 1 runs `rclone sync` Monday 09:00
|
||||
**Source**: `/mnt/main/` (NFS pool on TrueNAS)
|
||||
**Destination**: `sftp://192.168.1.13/Backup/Viki/truenas`
|
||||
**Scope**: Media libraries only (Immich ~800GB, audiobookshelf, servarr, navidrome music)
|
||||
|
||||
### Excludes (both incremental and full sync)
|
||||
**Excludes** (Cloud Sync configured to skip):
|
||||
- `clickhouse/**` (2.47M files, regenerable)
|
||||
- `loki/**` (68K files, regenerable)
|
||||
- `prometheus/**` (covered by monthly app backup)
|
||||
- `frigate/**` (ephemeral recordings)
|
||||
- `audiblez/**`, `ebook2audiobook/**` (regenerable)
|
||||
- `ollama/**` (chat history, low value)
|
||||
- `real-estate-crawler/**` (regenerable)
|
||||
- `crowdsec/**` (regenerable)
|
||||
- `servarr/downloads/**` (transient)
|
||||
- `ytldp/**` (replaceable)
|
||||
- `iscsi/**`, `iscsi-snaps/**` (raw zvols, backed at app level)
|
||||
- `*-backup/**` (already mirrored via Path 1)
|
||||
|
||||
| Pattern | Reason | File count |
|
||||
|---------|--------|-----------|
|
||||
| `clickhouse/**` | Regenerable logs/metrics | 2.47M files |
|
||||
| `loki/**` | Regenerable logs | 68K files |
|
||||
| `iocage/**` | Legacy FreeBSD jails (unused) | 96K files |
|
||||
| `frigate/**` | Ephemeral recordings/clips, trivial config | 57K+ files |
|
||||
| `audiblez/**` | Generated audiobooks, regenerable from source ebooks | — |
|
||||
| `ebook2audiobook/**` | Same service as audiblez, second volume | — |
|
||||
| `ollama/**` | UI data (chat history/settings), low value | — |
|
||||
| `real-estate-crawler/**` | Scraped property data, regenerable by re-crawling | — |
|
||||
| `prometheus/**` | Covered by monthly app backup | Large TSDB |
|
||||
| `crowdsec/**` | Regenerable threat intelligence | — |
|
||||
| `servarr/downloads/**` | Transient download staging | — |
|
||||
| `iscsi/**`, `iscsi-snaps/**` | Raw zvols, backed at app level | — |
|
||||
| `ytldp/**` | YouTube downloads (replaceable) | — |
|
||||
| `*.log` | Log files (regenerable) | — |
|
||||
| `post` | Transient POST data | — |
|
||||
|
||||
### iSCSI Backup Architecture
|
||||
|
||||
iSCSI zvols are raw block devices exported to K8s nodes. TrueNAS cannot read the filesystem inside a zvol.
|
||||
|
||||
**Protection strategy**:
|
||||
- **Layer 1**: ZFS snapshots cover zvols automatically (block-level)
|
||||
- **Layer 2**: Application CronJobs inside pods dump data to NFS paths
|
||||
- **Layer 3**: NFS paths sync offsite
|
||||
|
||||
**Current coverage**:
|
||||
| Service | Storage | Layer 2 Backup | Offsite |
|
||||
|---------|---------|----------------|---------|
|
||||
| PostgreSQL CNPG (12 DBs) | iSCSI | ✓ daily | ✓ |
|
||||
| MySQL InnoDB (7 DBs) | iSCSI | ✓ daily | ✓ |
|
||||
| Vault | iSCSI | ✓ weekly | ✓ |
|
||||
| Vaultwarden | iSCSI | ✓ 6h + integrity | ✓ |
|
||||
| Redis | iSCSI | ✓ weekly | ✓ |
|
||||
| plotting-book | iSCSI | ✓ weekly | ✓ |
|
||||
|
||||
**Convention**: Any new iSCSI-backed app MUST add a backup CronJob writing to `/mnt/main/<app>-backup/` in its Terraform stack.
|
||||
|
||||
**Uncovered (acceptable risk)**:
|
||||
- Prometheus (disposable metrics, monthly TSDB backup for long-term trends)
|
||||
- Loki (disposable logs)
|
||||
|
||||
### iSCSI Hardening
|
||||
|
||||
To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes:
|
||||
|
||||
| Setting | Default | Hardened | Impact |
|
||||
|---------|---------|----------|--------|
|
||||
| `node.session.timeo.replacement_timeout` | 120s | 300s | Time before declaring session dead |
|
||||
| `node.conn[0].timeo.noop_out_interval` | 5s | 10s | Keepalive interval |
|
||||
| `node.conn[0].timeo.noop_out_timeout` | 5s | 15s | Keepalive timeout |
|
||||
| `node.conn[0].iscsi.HeaderDigest` | None | CRC32C,None | Error detection |
|
||||
| `node.conn[0].iscsi.DataDigest` | None | CRC32C,None | Error detection |
|
||||
|
||||
**Applied to**: All 5 K8s nodes (k8s-master, k8s-node1-4) on 2026-03-23.
|
||||
|
||||
**Persistence**: Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`) so new nodes get these settings automatically.
|
||||
|
||||
**Why needed**: Default 120s timeout is too aggressive. Brief network hiccup (5-10s) can trigger failover, causing SQLite to see incomplete writes → corruption. 300s timeout tolerates longer blips.
|
||||
**Monitoring**: Existing `CloudSyncStale`, `CloudSyncNeverRun`, `CloudSyncFailing` alerts still apply.
|
||||
|
||||
## Configuration
|
||||
|
||||
|
|
@ -338,21 +360,25 @@ To prevent SQLite corruption from transient network disruptions, iSCSI initiator
|
|||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `/root/cloudsync-copy.sh` | TrueNAS: incremental sync script |
|
||||
| `/var/log/cloudsync-copy.log` | TrueNAS: sync script log output |
|
||||
| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
|
||||
| `/usr/local/bin/weekly-backup` | PVE host: PVC file copy + NFS mirror + pfSense + manifest |
|
||||
| `/usr/local/bin/offsite-sync-backup` | PVE host: rsync to Synology |
|
||||
| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
|
||||
| `/mnt/backup/manifest.txt` | Generated by weekly-backup, consumed by offsite-sync |
|
||||
| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
|
||||
| `/etc/systemd/system/weekly-backup.timer` | Sunday 05:00 (file backup) |
|
||||
| `/etc/systemd/system/offsite-sync-backup.timer` | Sunday 08:00 (offsite sync) |
|
||||
| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
|
||||
| `stacks/vault/` | Terraform: Vault backup CronJob |
|
||||
| `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
|
||||
| `stacks/monitoring/` | Terraform: CloudSync monitor, Prometheus backup |
|
||||
| `modules/create-template-vm/cloud_init.yaml` | iSCSI hardening params for new nodes |
|
||||
| `/etc/iscsi/iscsid.conf` | K8s nodes: iSCSI initiator config |
|
||||
| `stacks/monitoring/` | Terraform: Prometheus alerts |
|
||||
|
||||
### Vault Paths
|
||||
|
||||
| Path | Contents |
|
||||
|------|----------|
|
||||
| `secret/viktor/truenas_api_key` | TrueNAS API key for CloudSync monitor |
|
||||
| `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access |
|
||||
| `secret/viktor/pfsense_api_key` | pfSense API key + secret for config backup |
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
|
|
@ -361,27 +387,46 @@ Each backup CronJob is defined in the application's stack:
|
|||
- Vault: `stacks/vault/backup.tf`
|
||||
- Vaultwarden: `stacks/vaultwarden/backup.tf`
|
||||
- etcd: `stacks/platform/etcd-backup.tf`
|
||||
- Prometheus: `stacks/monitoring/prometheus-backup.tf`
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why 3 Layers?
|
||||
### Why 3-2-1 Strategy?
|
||||
|
||||
**Layer 1 (ZFS snapshots)**:
|
||||
**3 copies**:
|
||||
- Live PVCs (zero RTO for recent data)
|
||||
- sda local backup (fast recovery without network)
|
||||
- Synology offsite (site-level disaster protection)
|
||||
|
||||
**2 media types**:
|
||||
- sdc SSD (live, low latency)
|
||||
- sda HDD (backup, cost-effective bulk storage)
|
||||
|
||||
**1 offsite**:
|
||||
- Protection against fire, theft, catastrophic hardware failure
|
||||
- Weekly RPO acceptable for offsite (daily/weekly app backups reduce exposure)
|
||||
|
||||
### Why File-Level + Block-Level Snapshots?
|
||||
|
||||
**LVM snapshots** (Layer 1):
|
||||
- Near-instant (<1s), zero overhead
|
||||
- Point-in-time recovery for entire datasets
|
||||
- BUT: Cannot restore individual database records, no offsite protection
|
||||
- Point-in-time recovery for entire PVCs
|
||||
- BUT: Cannot restore individual files, no offsite protection, 7-day retention
|
||||
|
||||
**Layer 2 (App backups)**:
|
||||
- Granular restore (single DB, single table)
|
||||
- Database-native tools (pg_dump, mysqldump) produce portable backups
|
||||
- BUT: Higher overhead (CPU, I/O), longer RPO (daily/weekly)
|
||||
**File-level backup** (Layer 2):
|
||||
- Can restore single files or directories
|
||||
- Offsite-compatible (rsync)
|
||||
- Longer retention (4 weeks local, unlimited offsite)
|
||||
- BUT: Slower RTO (rsync), higher storage overhead
|
||||
|
||||
**Layer 3 (Offsite)**:
|
||||
- Protection against site-level disaster (fire, theft, catastrophic hardware failure)
|
||||
- BUT: 6h RPO (incremental), connectivity dependency
|
||||
Both together provide flexibility: fast local rollback for recent changes, granular recovery for older data.
|
||||
|
||||
All three together provide defense-in-depth.
|
||||
### Why Dedicated Backup Disk (sda)?
|
||||
|
||||
**Isolation**: If sdc fails (thin pool corruption, controller failure), sda is independent (different disk, different VG).
|
||||
|
||||
**Performance**: Backup I/O doesn't compete with live PVC I/O.
|
||||
|
||||
**Simplicity**: Single mount point (`/mnt/backup/`) for all backup data, easy to monitor disk usage.
|
||||
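The disk-usage monitoring mentioned above can be sketched in a few lines of shell. This is a minimal illustration, not the actual monitoring code: the function names `usage_percent` and `disk_state` are hypothetical, and the 85% limit is taken from the `BackupDiskFull` alert threshold quoted elsewhere in this document.

```bash
# Hypothetical helpers; names are illustrative, not part of the real scripts.

# Print the used-space percentage of the filesystem holding PATH.
# -P forces the POSIX one-line-per-filesystem format; column 5 is "Capacity".
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print "full" when PCT is strictly above LIMIT, otherwise "ok".
disk_state() {
  local pct="$1"
  local limit="$2"
  if [ "$pct" -gt "$limit" ]; then
    echo full
  else
    echo ok
  fi
}
```

On the PVE host this could be combined as `disk_state "$(usage_percent /mnt/backup)" 85`.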
|
||||
### Why Not Velero/Longhorn Backup?
|
||||
|
||||
|
|
@ -390,25 +435,25 @@ Evaluated K8s-native backup solutions (Velero, Longhorn):
|
|||
- **Longhorn**: High overhead (replicas, snapshots in-cluster), no offsite by default
|
||||
|
||||
**Current approach wins** because:
|
||||
- Leverages existing ZFS infrastructure (already running TrueNAS)
|
||||
- Leverages existing Proxmox LVM infrastructure (already running)
|
||||
- Database-native backups (pg_dump/mysqldump) are battle-tested
|
||||
- Simple restore procedures (documented runbooks)
|
||||
- Lower resource overhead (no in-cluster replicas)
|
||||
|
||||
### Why Hybrid Incremental + Full Sync?
|
||||
|
||||
**Incremental alone** is risky:
|
||||
**Incremental alone** (rsync --files-from) is risky:
|
||||
- Deleted files on source never deleted on destination
|
||||
- Renamed paths create duplicates
|
||||
- No cleanup of orphaned snapshots
|
||||
- No cleanup of orphaned files
|
||||
|
||||
**Full sync alone** is slow:
|
||||
- 30-60 min per run
|
||||
- High network/CPU on both ends
|
||||
- 6h RPO → 12h if a sync fails
|
||||
**Full sync alone** (rsync --delete) is slow:
|
||||
- 30-60 min per run (all files scanned)
|
||||
- 7d RPO → 14d if a sync fails
|
||||
|
||||
**Hybrid approach**:
|
||||
- Fast incremental every 6h (sub-minute runtime)
|
||||
- Weekly full sync for cleanup (tolerates longer runtime)
|
||||
- Fast incremental weekly (sub-5min runtime via manifest)
|
||||
- Monthly full sync for cleanup (tolerates longer runtime)
|
||||
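The two rsync modes described above can be sketched as follows. This is an illustration under stated assumptions rather than the contents of `offsite-sync-backup`: the function names are hypothetical, the source and destination arguments are placeholders, and treating the first Sunday of the month (day 1 to 7) as the full-sync run is an assumption, since the document does not say how the monthly run is scheduled. Only the `--files-from` and `--delete` flags come from the text.

```bash
# Hypothetical helpers; the scheduling rule below is an assumption.

# Decide the sync mode from the day of the month (1-31, leading zero allowed).
sync_mode() {
  local day="${1#0}"   # strip a leading zero such as "03"
  if [ "$day" -le 7 ]; then
    echo full
  else
    echo incremental
  fi
}

# Print the rsync arguments for the chosen mode.
# incremental: copy only the paths listed in the manifest
# full:        mirror everything and delete files that vanished from the source
rsync_args() {
  local mode="$1"
  local src="$2"
  local dest="$3"
  local manifest="$4"
  case "$mode" in
    incremental) printf '%s\n' "-a --files-from=$manifest $src $dest" ;;
    full)        printf '%s\n' "-a --delete $src $dest" ;;
    *)           return 1 ;;
  esac
}
```

Printing the arguments instead of running rsync keeps the sketch safe to try; a real script would pass them to `rsync` together with its SSH options.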
|
||||
### Why 6h Vaultwarden Backup vs Daily for Others?
|
||||
|
||||
|
|
@ -424,6 +469,56 @@ Other services (MySQL, PostgreSQL):
|
|||
|
||||
## Troubleshooting
|
||||
|
||||
### LVM Snapshot Restore Issues
|
||||
|
||||
See `docs/runbooks/restore-lvm-snapshot.md`.
|
||||
|
||||
### Weekly Backup Failing
|
||||
|
||||
**Symptom**: `WeeklyBackupStale` or `WeeklyBackupFailing` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
ssh root@192.168.1.127
systemctl status weekly-backup.service
journalctl -u weekly-backup.service --since "7 days ago"
df -h /mnt/backup
```
|
||||
|
||||
**Common causes**:
|
||||
- Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`)
|
||||
- LV mount failed (check `lvs backup`, `dmesg | grep backup`)
|
||||
- NFS mount failed (check `showmount -e 10.0.10.15`)
|
||||
|
||||
**Fix**:
|
||||
1. If disk full: Clean up old weekly versions manually, adjust retention
|
||||
2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup`
|
||||
3. If NFS failed: Check TrueNAS availability, verify exports
|
||||
4. Manually trigger: `systemctl start weekly-backup.service`
|
||||
|
||||
### Offsite Sync Failing
|
||||
|
||||
**Symptom**: `OffsiteBackupSyncStale` or `OffsiteBackupSyncFailing` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
ssh root@192.168.1.127
systemctl status offsite-sync-backup.service
journalctl -u offsite-sync-backup.service --since "7 days ago"
wc -l < /mnt/backup/manifest.txt  # count manifest entries; errors if the file is missing
```
|
||||
|
||||
**Common causes**:
|
||||
- Synology NAS unreachable (network, SFTP down)
|
||||
- SSH key auth failed (permissions, expired key)
|
||||
- Manifest missing (weekly-backup failed)
|
||||
|
||||
**Fix**:
|
||||
1. Verify Synology: `ping -c 3 192.168.1.13`, `ssh root@192.168.1.13`
|
||||
2. Verify SSH key: `ssh -i /root/.ssh/synology_backup root@192.168.1.13`
|
||||
3. Verify manifest exists: `ls -lh /mnt/backup/manifest.txt`
|
||||
4. Manually trigger: `systemctl start offsite-sync-backup.service`
|
||||
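Because the offsite sync depends on the manifest written by the weekly backup, a pre-flight check along these lines can separate "manifest missing" from "manifest empty" before rsync is started. This is a minimal sketch; `manifest_status` is a hypothetical helper, not a function from the real script.

```bash
# Hypothetical pre-flight check for the manifest consumed by the offsite sync.
# Prints "missing", "empty", or "ok <line count>".
manifest_status() {
  local manifest="$1"
  if [ ! -f "$manifest" ]; then
    echo missing
  elif [ ! -s "$manifest" ]; then
    echo empty
  else
    # tr strips the padding some wc implementations add around the count
    echo "ok $(wc -l < "$manifest" | tr -d ' ')"
  fi
}
```

For example, `manifest_status /mnt/backup/manifest.txt` on the PVE host.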
|
||||
### PostgreSQL Backup Stale Alert
|
||||
|
||||
**Symptom**: `PostgreSQLBackupStale` firing in Prometheus
|
||||
|
|
@ -444,29 +539,6 @@ kubectl logs -n dbaas job/postgresql-backup-<timestamp>
|
|||
2. If NFS: Verify mount on worker node, restart NFS server if needed
|
||||
3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas`
|
||||
|
||||
### CloudSync Stale/Failing
|
||||
|
||||
**Symptom**: `CloudSyncStale` or `CloudSyncFailing` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# SSH to TrueNAS
|
||||
ssh root@10.0.10.15
|
||||
cat /var/log/cloudsync-copy.log
|
||||
zfs list -t snapshot | grep cloudsync
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Synology NAS unreachable (network, SFTP down)
|
||||
- ZFS diff failed (snapshot deleted manually)
|
||||
- rclone error (quota, permission)
|
||||
|
||||
**Fix**:
|
||||
1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13`
|
||||
2. Verify snapshots exist: `zfs list -t snapshot | grep cloudsync`
|
||||
3. Manually run: `/root/cloudsync-copy.sh` (check output)
|
||||
4. Check rclone config: `rclone ls synology:/Backup/Viki/truenas`
|
||||
|
||||
### Vaultwarden Integrity Check Failing
|
||||
|
||||
**Symptom**: `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0`
|
||||
|
|
@ -480,46 +552,58 @@ kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "
|
|||
|
||||
**Recovery**:
|
||||
1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden`
|
||||
2. Restore from latest backup:
|
||||
```bash
|
||||
# Find latest backup
|
||||
ls -lh /mnt/main/vaultwarden-backup/
|
||||
# Copy to pod volume
|
||||
kubectl cp /mnt/main/vaultwarden-backup/db-<latest>.sqlite \
|
||||
vaultwarden/vaultwarden-0:/data/db.sqlite3
|
||||
```
|
||||
2. Restore from latest backup (see `restore-vaultwarden.md`)
|
||||
3. Verify integrity on restored DB
|
||||
4. Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden`
|
||||
|
||||
### iSCSI Session Drops Causing Backup Failures
|
||||
### pfSense Backup Failing
|
||||
|
||||
**Symptom**: Backup CronJob fails with "I/O error" or "Transport endpoint not connected"
|
||||
**Symptom**: `PfsenseBackupStale` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# On K8s node
|
||||
iscsiadm -m session
|
||||
dmesg | grep -i iscsi
|
||||
journalctl -u iscsid | tail -50
|
||||
ssh root@192.168.1.127
|
||||
journalctl -u weekly-backup.service --since "7 days ago" | grep -i -A5 pfsense
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- API key expired/invalid
|
||||
- SSH auth failed (password changed, key rejected)
|
||||
- pfSense unreachable
|
||||
|
||||
**Fix**:
|
||||
1. Verify hardened timeouts applied: `iscsiadm -m node -o show | grep -E 'replacement_timeout|noop_out'`
|
||||
2. If defaults: Apply hardening:
|
||||
```bash
|
||||
iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300
|
||||
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10
|
||||
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15
|
||||
iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None
|
||||
iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None
|
||||
```
|
||||
3. Restart session: `iscsiadm -m node -u && iscsiadm -m node -l`
|
||||
1. Verify API key: `curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <key>"`
|
||||
2. Verify SSH: `ssh root@pfsense.viktorbarzin.me`
|
||||
3. Update credentials in Vault `secret/viktor/pfsense_api_key`
|
||||
|
||||
### Backup Disk Full
|
||||
|
||||
**Symptom**: `BackupDiskFull` alert; `df -h /mnt/backup` reports usage above 85%
|
||||
|
||||
**Fix**:
|
||||
```bash
ssh root@192.168.1.127

# Check space usage by component
du -sh /mnt/backup/pvc-data/*
du -sh /mnt/backup/pfsense/*
du -sh /mnt/backup/nfs-mirror

# Clean up old weekly versions (keep latest 2)
# head -n -2 needs GNU coreutils; xargs -r skips rm when nothing is listed
find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs -r rm -rf
find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs -r rm -rf
```
|
||||
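The cleanup one-liners above can also be written as a reusable function that does not rely on `head -n -2`. The sketch below is an assumption-laden illustration: `prune_old_versions` is a hypothetical name, and it assumes version directories match `????-??` and that their lexical order equals their age order.

```bash
# Hypothetical helper: keep the newest KEEP version directories under DIR.
# Assumes names such as 2026-13, 2026-14 sort oldest-first lexically.
prune_old_versions() {
  local dir="$1"
  local keep="$2"
  local total excess
  total=$(find "$dir" -mindepth 1 -maxdepth 1 -type d -name '????-??' | wc -l | tr -d ' ')
  excess=$(( total - keep ))
  # Nothing to delete when there are KEEP or fewer versions
  [ "$excess" -gt 0 ] || return 0
  find "$dir" -mindepth 1 -maxdepth 1 -type d -name '????-??' | sort | head -n "$excess" |
    while IFS= read -r old; do
      rm -rf -- "$old"
    done
}
```

Usage would mirror the commands above, for example `prune_old_versions /mnt/backup/pvc-data 2`. Directories that do not match the pattern are left untouched.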
|
||||
### Missing Backup for New Service
|
||||
|
||||
**Symptom**: Added new service using iSCSI storage, no backup exists
|
||||
**Symptom**: Added new service using proxmox-lvm storage, no backup exists
|
||||
|
||||
**Fix**: Add backup CronJob in service's Terraform stack
|
||||
**Fix**: The service is automatically covered by:
|
||||
1. **LVM snapshots** (if not in dbaas/monitoring namespace) — automatic, no config needed
|
||||
2. **Weekly file backup** — automatic, no config needed
|
||||
|
||||
**If the service has a database that needs app-level dumps**:
|
||||
Add backup CronJob in service's Terraform stack (see template below).
|
||||
|
||||
**Template**:
|
||||
```hcl
|
||||
|
|
@ -541,7 +625,7 @@ resource "kubernetes_cron_job_v1" "backup" {
|
|||
args = [
|
||||
<<-EOT
|
||||
TIMESTAMP=$(date +%Y%m%d)
|
||||
# Dump command here
|
||||
# Dump command here (sqlite3 .backup, pg_dump, etc.)
|
||||
find /backup -mtime +30 -delete
|
||||
EOT
|
||||
]
|
||||
|
|
@ -594,17 +678,26 @@ module "nfs_backup" {
|
|||
│ VaultBackupStale > 8d since last success │
|
||||
│ VaultwardenBackupStale > 8d since last success │
|
||||
│ RedisBackupStale > 8d since last success │
|
||||
│ PrometheusBackupStale > 32d since last success │
|
||||
│ PlottingBookBackupStale > 8d since last success │
|
||||
│ CloudSyncStale > 8d since last success │
|
||||
│ CloudSyncNeverRun task never completed │
|
||||
│ CloudSyncFailing task in error state │
|
||||
│ VaultwardenIntegrityFail integrity_ok == 0 │
|
||||
│ LVMSnapshotStale > 24h since last snapshot │
|
||||
│ LVMSnapshotFailing snapshot creation failed │
|
||||
│ LVMThinPoolLow < 15% free space in thin pool │
|
||||
│ WeeklyBackupStale > 8d since last success │
|
||||
│ WeeklyBackupFailing backup script exited non-zero │
|
||||
│ PfsenseBackupStale > 8d since last success │
|
||||
│ OffsiteBackupSyncStale > 8d since last success │
|
||||
│ BackupDiskFull > 85% usage on /mnt/backup │
|
||||
└────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
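Most alerts in the box above follow the same rule: fire when the last success timestamp is older than a threshold. The real rules live in the Prometheus alert definitions; the shell sketch below only illustrates the arithmetic, and `backup_status` is a hypothetical name.

```bash
# Hypothetical illustration of the "> N days since last success" rule.
# Arguments: last-success epoch seconds, current epoch seconds, max age in days.
# Prints "stale" when the age strictly exceeds the limit, otherwise "fresh".
backup_status() {
  local last_success="$1"
  local now="$2"
  local max_age_days="$3"
  if [ $(( now - last_success )) -gt $(( max_age_days * 86400 )) ]; then
    echo stale
  else
    echo fresh
  fi
}
```

For the weekly backup this would be called as `backup_status "$last" "$(date +%s)" 8`, where `$last` is the pushed timestamp value.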
|
||||
**Metrics sources**:
|
||||
- Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion
|
||||
- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
|
||||
- Weekly backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
|
||||
- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp`
|
||||
- CloudSync monitor: Queries TrueNAS API every 6h, pushes `cloudsync_last_success_timestamp`
|
||||
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
|
||||
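A push of one of the timestamps listed above might look like the sketch below. It is an assumption about how the scripts could do it, not a copy of them: the function names, the gateway URL, and the job name are placeholders. The payload uses the Prometheus text exposition format, and the request targets the Pushgateway `/metrics/job/<job>` endpoint.

```bash
# Hypothetical helpers; the URL and job name passed in are placeholders.

# Print one gauge sample in the Prometheus text exposition format.
format_gauge() {
  local name="$1"
  local value="$2"
  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$value"
}

# POST the current time as the metric value to a Pushgateway job.
push_success_timestamp() {
  local gateway_url="$1"
  local job="$2"
  local metric="$3"
  format_gauge "$metric" "$(date +%s)" |
    curl --silent --fail --data-binary @- "$gateway_url/metrics/job/$job"
}
```

A script would call `push_success_timestamp` only after its work finished without errors, so a stale timestamp reliably means a missed or failed run.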
|
||||
|
|
@ -614,36 +707,45 @@ module "nfs_backup" {
|
|||
|
||||
## Service Protection Matrix
|
||||
|
||||
| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
|
||||
|---------|:-------------:|:-------------:|:-----------------:|---------|
|
||||
| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage |
|
||||
|---------|:------------------:|:----------------:|:----------:|:-------:|---------|
|
||||
| **Databases** |
|
||||
| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
|
||||
| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
|
||||
| PostgreSQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
|
||||
| MySQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
|
||||
| **Critical State** |
|
||||
| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
|
||||
| etcd | ✓ | ✓ weekly | ✓ | local disk |
|
||||
| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
|
||||
| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
|
||||
| **Applications** |
|
||||
| Prometheus | ✓ | ✓ monthly | excluded | NFS |
|
||||
| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
|
||||
| Immich | ✓ | — | ✓ | NFS |
|
||||
| Forgejo | ✓ | — | ✓ | NFS |
|
||||
| Paperless-ngx | ✓ | — | ✓ | NFS |
|
||||
| Nextcloud | ✓ | — | ✓ | NFS |
|
||||
| **Other NFS services** | ✓ | — | ✓ | NFS |
|
||||
| Vault | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| etcd | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| Vaultwarden | ✓ | ✓ | ✓ 6h + integrity | ✓ | proxmox-lvm |
|
||||
| Redis | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| **Applications (65 proxmox-lvm PVCs)** |
|
||||
| Prometheus | — | — | — | excluded | proxmox-lvm |
|
||||
| Nextcloud | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Calibre-Web | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Forgejo | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| FreshRSS | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| ActualBudget | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| **Media (NFS)** |
|
||||
| Immich (~800GB) | — | — | — | ✓ | NFS |
|
||||
| Audiobookshelf | — | — | — | ✓ | NFS |
|
||||
| Servarr | — | — | — | ✓ | NFS |
|
||||
| Navidrome | — | — | — | ✓ | NFS |
|
||||
|
||||
**Legend**:
|
||||
- ✓ = Protected at this layer
|
||||
- — = Not needed (simple file storage, ZFS snapshots sufficient)
|
||||
- — = Not needed (other layers cover it, or data is regenerable/disposable)
|
||||
- excluded = Too large/regenerable, not worth offsite bandwidth
|
||||
|
||||
**Note**: NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency requirements).
|
||||
**Note**: Of the 65 proxmox-lvm PVCs, all except the 3 in the dbaas and monitoring namespaces get both LVM snapshots and file-level backups. NFS-backed media relies on TrueNAS Cloud Sync for offsite protection.
|
||||
|
||||
## Recovery Procedures
|
||||
|
||||
Detailed runbooks in `docs/runbooks/`:
|
||||
|
||||
- **`restore-lvm-snapshot.md`** — Instant rollback of a PVC using LVM snapshot (RTO <5 min)
|
||||
- **`restore-pvc-from-backup.md`** — Restore a PVC from sda file backup (when snapshots expired)
|
||||
- **`restore-postgresql.md`** — Restore individual database or full cluster from pg_dumpall backup
|
||||
- **`restore-mysql.md`** — Restore MySQL databases from mysqldump backup
|
||||
- **`restore-vault.md`** — Restore Vault from raft snapshot
|
||||
|
|
@ -651,7 +753,9 @@ Detailed runbooks in `docs/runbooks/`:
|
|||
- **`restore-etcd.md`** — Restore etcd cluster from snapshot
|
||||
- **`restore-full-cluster.md`** — Disaster recovery: rebuild cluster from offsite backups
|
||||
|
||||
**RTO estimates** (tested 2026-03-23):
|
||||
**RTO estimates**:
|
||||
- LVM snapshot rollback: <5 min (instant swap)
|
||||
- File-level restore from sda: <15 min (depends on PVC size)
|
||||
- Single PostgreSQL database: <5 min
|
||||
- Full MySQL cluster: <15 min
|
||||
- Vault: <10 min
|
||||
|
|
@ -661,7 +765,7 @@ Detailed runbooks in `docs/runbooks/`:
|
|||
|
||||
## Related
|
||||
|
||||
- **Architecture**: `docs/architecture/storage.md` (NFS/iSCSI storage layer)
|
||||
- **Architecture**: `docs/architecture/storage.md` (NFS/Proxmox storage layer)
|
||||
- **Reference**: `.claude/reference/service-catalog.md` (which services need backups)
|
||||
- **Runbooks**: `docs/runbooks/restore-*.md` (step-by-step recovery procedures)
|
||||
- **Monitoring**: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions)
|
||||
|
|
|
|||
|
|
@ -1,14 +1,16 @@
|
|||
# Storage Architecture
|
||||
|
||||
Last updated: 2026-04-03
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Overview
|
||||
|
||||
The cluster uses two storage backends: **Proxmox CSI** for database block storage and **TrueNAS NFS** for application data.
|
||||
|
||||
**Block storage (Proxmox CSI)**: 13 PVCs for databases (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage. This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
|
||||
**Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
|
||||
|
||||
**NFS storage (TrueNAS)**: ~100 NFS shares for application data, media, configs, and backup targets continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`.
|
||||
**NFS storage (TrueNAS)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and legacy app data continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`.
|
||||
|
||||
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc).
|
||||
|
||||
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver is deprecated and pending removal.
|
||||
|
||||
|
|
@ -16,17 +18,20 @@ The cluster uses two storage backends: **Proxmox CSI** for database block storag
|
|||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
|
||||
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
|
||||
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
|
||||
end
|
||||
|
||||
subgraph TrueNAS["TrueNAS (10.0.10.15)<br/>VMID 9000, 16c/16GB"]
|
||||
ZFS_Main["ZFS Pool: main<br/>1.64 TiB<br/>32G + 7x256G + 1T disks"]
|
||||
ZFS_SSD["ZFS Pool: ssd<br/>~256GB SSD<br/>Immich ML, PostgreSQL hot data"]
|
||||
|
||||
ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/<service>"]
|
||||
ZFS_Main --> iSCSI_Datasets["iSCSI Datasets<br/>main/iscsi (zvols)<br/>main/iscsi-snaps"]
|
||||
ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/<service><br/>Media + backup targets"]
|
||||
|
||||
NFS_Datasets --> NFS_Exports["NFS Exports<br/>managed by secrets/nfs_exports.sh"]
|
||||
iSCSI_Datasets --> iSCSI_Targets["iSCSI Targets<br/>SSH-managed via democratic-csi"]
|
||||
|
||||
ZFS_SSD --> SSD_Data["Immich ML models<br/>PostgreSQL CNPG"]
|
||||
ZFS_SSD --> SSD_Data["Immich ML models"]
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster"]
|
||||