diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 7a69ccd6..e51f539b 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -174,17 +174,30 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" { - Autoresizer annotations are **required** on all proxmox-lvm PVCs - Every proxmox-lvm app **MUST** add a backup CronJob writing to NFS `/mnt/main/<app>-backup/` -### Cloud Sync (TrueNAS → Synology NAS) -- **Task 1**: Weekly push (Monday 09:00) of `/mnt/main` NFS data to `nas.viktorbarzin.lan:/Backup/Viki/truenas` -- **zfs diff optimization**: Pre-script diffs `main@cloudsync-prev` vs `main@cloudsync-new`, writes changed files to `/tmp/cloudsync_files.txt`. Args: `--files-from /tmp/cloudsync_files.txt --no-traverse`. Post-script rotates snapshots. Falls back to full `find` if no prev snapshot or >100k changes. -- **Excludes**: ytldp, prometheus, logs, post, crowdsec, servarr/downloads, iscsi, iscsi-snaps, frigate, audiblez, ebook2audiobook, ollama, real-estate-crawler +### 3-2-1 Backup Strategy +**Copy 1**: Live data on sdc thin pool (65 PVCs + VMs) +**Copy 2**: sda backup disk (`/mnt/backup`, 1.1TB ext4, VG `backup`) +**Copy 3**: Synology NAS offsite (two paths) -### Proxmox-LVM Backup Architecture -- proxmox-lvm volumes are thin LVs on the Proxmox host — opaque to TrueNAS -- **Offsite protection**: Application-level backup CronJobs dump data to NFS paths, which Cloud Sync Task 1 syncs to Synology -- **Current CronJob coverage**: MySQL (mysqldump), PostgreSQL (pg_dumpall), Vault (raft snapshot), Redis (BGSAVE), Vaultwarden (sqlite3 .backup), Headscale (sqlite3 .backup) -- **Convention**: Any new proxmox-lvm app MUST add a backup CronJob to its Terraform stack that writes to `/mnt/main/<app>-backup/` -- **Uncovered (acceptable)**: Prometheus (disposable metrics), Loki (disposable logs), plotting-book and novelapp (small, low-priority) +**PVE host scripts** (source: `infra/scripts/`): +- `/usr/local/bin/weekly-backup` — Sunday 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data/<week>/<namespace>/<pvc>/` with `--link-dest` versioning (4 weeks). Also mirrors NFS backup dirs, pfsense (config.xml + tar), PVE config. Prunes snapshots >7d. +- `/usr/local/bin/offsite-sync-backup` — Sunday 08:00 (After=weekly-backup). `rsync --files-from` manifest to `Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/`. Monthly full `--delete` on 1st Sunday. +- `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore <snapshot-name>`.
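+A minimal sketch of the weekly per-PVC copy step (hypothetical LV name `vm-999-pvc-abc123` and `ns/pvc` dirs; the real logic lives in `infra/scripts/weekly-backup`):
+```bash
+WEEK=$(date +%Y-%V); PREV=$(date -d '7 days ago' +%Y-%V)
+lvcreate -s -n vm-999-pvc-abc123-bak pve/vm-999-pvc-abc123   # thin snapshot, instant
+lvchange -ay -K pve/vm-999-pvc-abc123-bak                    # activate (ignore skip-activation flag)
+mkdir -p /mnt/snap "/mnt/backup/pvc-data/$WEEK/ns/pvc"
+mount -o ro /dev/pve/vm-999-pvc-abc123-bak /mnt/snap         # mount snapshot read-only
+rsync -a --link-dest="/mnt/backup/pvc-data/$PREV/ns/pvc" /mnt/snap/ "/mnt/backup/pvc-data/$WEEK/ns/pvc/"
+umount /mnt/snap && lvremove -fy pve/vm-999-pvc-abc123-bak   # drop the snapshot
+```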
+ +**Offsite sync (two paths)**: +- `Synology/Backup/Viki/pve-backup/` — structured data from PVE host (PVC files, DB dumps, pfsense, PVE config) +- `Synology/Backup/Viki/truenas/` — NFS media from TrueNAS Cloud Sync (Immich, audiobookshelf, servarr — narrowed, excludes backup dirs) + +**App-level CronJobs** (write to TrueNAS NFS, mirrored to sda weekly): +- MySQL (daily), PostgreSQL (daily), Vault (weekly), Vaultwarden (6h + integrity), Redis (weekly), etcd (weekly) +- **Convention**: New proxmox-lvm apps MUST add a backup CronJob writing to `/mnt/main/<app>-backup/` + +**Restore paths**: +- Accidental delete: `lvm-pvc-snapshot restore` (instant, 7 daily snapshots) +- Older data: Browse `/mnt/backup/pvc-data/<week>/<namespace>/<pvc>/`, rsync back +- Database: Restore from dump at `/mnt/backup/nfs-mirror/<service>-backup/` +- pfsense: Upload config.xml via web UI, or extract tar for custom scripts +- Full disaster: Restore from Synology ## Known Issues - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation. diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md index e2e171f3..5aab1189 100644 --- a/docs/architecture/backup-dr.md +++ b/docs/architecture/backup-dr.md @@ -1,10 +1,17 @@ # Backup & Disaster Recovery Architecture -Last updated: 2026-03-24 +Last updated: 2026-04-06 ## Overview -The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides near-instant local snapshots via ZFS auto-snapshots on TrueNAS (every 12h + daily, up to 3-week retention). Layer 2 adds application-level backups for complex stateful services (databases, Vault, etcd) via K8s CronJobs dumping to NFS-exported directories with 14-30 day retention. Layer 3 ensures offsite protection through hybrid incremental/full sync to a Synology NAS every 6 hours (incremental via ZFS diff) plus weekly full sync (Sunday 09:00) for cleanup. This architecture provides <1s RPO for file data, 6h RPO for offsite, and <30min RTO for most services. +The homelab uses a defense-in-depth 3-2-1 backup strategy: **3 copies** (live PVCs on sdc, weekly backups on sda, offsite on Synology), **2 storage devices** (sdc live thin pool, sda dedicated backup disk — separate RAID1 arrays), **1 offsite copy** (Synology NAS). This architecture provides <24h RPO for snapshot-covered PVCs (7 daily LVM snapshots), <7d RPO for offsite copies, and <30min RTO for most services. + +**3-2-1 Breakdown**: +- **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD) +- **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS) +- **Copy 3** (offsite): Synology NAS at 192.168.1.13 via two paths: + - `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (rsync --files-from weekly) + - `Synology/Backup/Viki/truenas/` — TrueNAS NFS media (Cloud Sync, narrowed to media only) ## Architecture Diagram @@ -12,54 +19,64 @@ The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides ne ```mermaid graph TB - subgraph TrueNAS["TrueNAS (10.0.10.15)"] - ZFS_Data["ZFS Pools
main (1.64 TiB)
ssd (~256GB)"] + subgraph Proxmox["Proxmox Host (192.168.1.127)"] + sdc["sdc: 10.7TB RAID1 HDD
VG pve, LV data (thin pool)
65 proxmox-lvm PVCs"] + sda["sda: 1.1TB RAID1 SAS
VG backup, LV data (ext4)
/mnt/backup"] - subgraph Layer1["Layer 1: ZFS Auto-Snapshots"] - Snap12h["Every 12h
auto-12h-*
24h retention"] - SnapDaily["Daily 00:00
auto-*
3-week retention"] + subgraph Layer1["Layer 1: LVM Thin Snapshots"] + Snap["Daily 03:00
7-day retention
62 PVCs (excludes dbaas+monitoring)"] end - ZFS_Data --> Snap12h - ZFS_Data --> SnapDaily + subgraph Layer2["Layer 2: Weekly File Backup"] + PVCBackup["PVC File Copy
Sunday 05:00
4 weekly versions
/mnt/backup/pvc-data//"] + NFSMirror["NFS Mirror
DB dumps + backup CronJob output
/mnt/backup/nfs-mirror/"] + PfsenseBackup["pfSense Backup
config.xml + full tar
4 weekly versions"] + PVEConfig["PVE Config
/etc/pve + scripts"] + end - NFS_Backup["NFS-exported
/mnt/main/*-backup/"] + sdc --> Snap + sdc --> PVCBackup + PVCBackup --> sda + NFSMirror --> sda + PfsenseBackup --> sda + PVEConfig --> sda end - subgraph K8s["Kubernetes Cluster"] - subgraph Layer2["Layer 2: App Backups"] + subgraph TrueNAS["TrueNAS (10.0.10.15)"] + NFS_Backup["NFS-exported
/mnt/main/*-backup/"] + Media["Media (NFS)
Immich ~800GB
audiobookshelf, servarr, navidrome"] + + subgraph AppBackups["App-Level Backup CronJobs"] CronDaily["Daily 00:00-00:30
PostgreSQL, MySQL
14d retention"] - CronWeekly["Weekly Sunday
etcd, Vault, Redis
Vaultwarden, plotting-book
30d retention"] - CronMonthly["Monthly 1st Sunday
Prometheus TSDB
2 copies"] - Cron6h["Every 6h
Vaultwarden backup
+ integrity check"] + CronWeekly["Weekly Sunday
etcd, Vault, Redis
Vaultwarden
30d retention"] end CronDaily --> NFS_Backup CronWeekly --> NFS_Backup - CronMonthly --> NFS_Backup - Cron6h --> NFS_Backup end subgraph Layer3["Layer 3: Offsite Sync"] - Incremental["Every 6h
zfs diff → rclone copy
--files-from --no-traverse"] - FullSync["Weekly Sunday 09:00
rclone sync
handles deletions"] + PVEOffsite["PVE → Synology
Sunday 08:00
rsync --files-from
/Backup/Viki/pve-backup/"] + CloudSync["TrueNAS → Synology
Monday 09:00
Cloud Sync (media only)
/Backup/Viki/truenas/"] end - ZFS_Data --> Incremental - ZFS_Data --> FullSync + sda --> PVEOffsite + Media --> CloudSync - Synology["Synology NAS
192.168.1.13
/Backup/Viki/truenas"] + Synology["Synology NAS
192.168.1.13
Offsite protection"] - Incremental --> Synology - FullSync --> Synology + PVEOffsite --> Synology + CloudSync --> Synology + + NFS_Backup -.->|mirrored to sda| NFSMirror subgraph Monitoring["Monitoring & Alerting"] - Prometheus["Prometheus Alerts
PostgreSQLBackupStale
MySQLBackupStale
CloudSyncStale
VaultwardenIntegrityFail"] - Pushgateway["Pushgateway
cloudsync metrics
vaultwarden integrity"] + Prometheus["Prometheus Alerts
PostgreSQLBackupStale, MySQLBackupStale
WeeklyBackupStale, OffsiteBackupSyncStale
LVMSnapshotStale, BackupDiskFull
VaultwardenIntegrityFail"] + Pushgateway["Pushgateway
backup script metrics
cloudsync metrics
vaultwarden integrity"] end - NFS_Backup -.->|scrape| Prometheus - Synology -.->|API query| Pushgateway + PVCBackup -.->|push metrics| Pushgateway + Snap -.->|push metrics| Pushgateway Pushgateway --> Prometheus style Layer1 fill:#c8e6c9 @@ -68,6 +85,89 @@ graph TB style Monitoring fill:#f3e5f5 ``` +### Weekly Backup Timeline + +```mermaid +graph LR + subgraph Sunday["Sunday Timeline"] + S01["01:00 etcd backup
(CronJob)"] + S02["02:00 Vault backup
(CronJob)"] + S03a["03:00 Redis backup
(CronJob)"] + S03b["03:00 LVM snapshots
(lvm-pvc-snapshot timer)"] + S05["05:00 Weekly backup
(weekly-backup timer)
1. NFS mirror
2. PVC file copy
3. pfSense backup
4. PVE config
5. Prune snapshots
6. Generate manifest"] + S08["08:00 Offsite sync
(offsite-sync-backup timer)
rsync --files-from"] + end + + S01 --> S02 --> S03a --> S03b --> S05 --> S08 + + subgraph Monday["Monday"] + M09["09:00 TrueNAS Cloud Sync
Media → Synology"] + end + + S08 -.->|next day| M09 + + style Sunday fill:#ffe0b2 + style Monday fill:#e1f5ff +``` + +### Physical Disk Layout + +```mermaid +graph TB + subgraph PVE["Proxmox Host (192.168.1.127)"] + subgraph sda["sda: 1.1TB RAID1 SAS"] + sda_vg["VG: backup
LV: data (ext4)
/mnt/backup"] + sda_content["pvc-data////
nfs-mirror/<service>-backup/
pfsense/<week>/
pve-config/"] + end + + subgraph sdb["sdb: 931GB SSD"] + sdb_vg["VG: pve
LV: root (ext4)
PVE host OS"] + end + + subgraph sdc["sdc: 10.7TB RAID1 HDD"] + sdc_vg["VG: pve
LV: data (thin pool)
65 proxmox-lvm PVCs
+ VM disks"] + end + + sda_vg --> sda_content + end + + sdc -.->|weekly backup
mount snapshot ro| sda + sda -.->|offsite sync
rsync| Synology["Synology NAS
192.168.1.13
/Backup/Viki/pve-backup/"] + + style sda fill:#fff9c4 + style sdb fill:#c8e6c9 + style sdc fill:#e1f5ff +``` + +### Restore Decision Tree + +```mermaid +graph TB + Start["Data loss detected"] + Age{"How old is
the lost data?"} + Type{"What type
of data?"} + + Start --> Age + + Age -->|"< 7 days"| LVM["Use LVM snapshot
lvm-pvc-snapshot restore
RTO: <5 min"] + Age -->|"> 7 days,
< 4 weeks"| FileBackup["Use sda file backup
/mnt/backup/pvc-data/<week>/
RTO: <15 min"] + Age -->|"> 4 weeks or
site disaster"| Offsite["Use Synology backup
Synology/pve-backup/
RTO: <4 hours"] + + LVM --> Type + FileBackup --> Type + Offsite --> Type + + Type -->|"Database"| AppBackup["Use app-level dump
/mnt/backup/nfs-mirror/<service>-backup/
OR Synology/pve-backup/nfs-mirror/
RTO: <10 min"] + Type -->|"PVC files"| Proceed["Proceed with
selected restore method"] + Type -->|"Media (NFS)"| CloudSync["Use Synology backup
Synology/truenas/<service>/
RTO: varies by size"] + + style Start fill:#ffcdd2 + style LVM fill:#c8e6c9 + style FileBackup fill:#fff9c4 + style Offsite fill:#e1f5ff + style AppBackup fill:#e1bee7 +``` + ### Vaultwarden Enhanced Protection ```mermaid @@ -97,127 +197,103 @@ graph LR style Hourly fill:#e1bee7 ``` -### Incremental Offsite Sync - -```mermaid -graph TB - Prev["ZFS snapshot
main@cloudsync-prev"] - New["ZFS snapshot
main@cloudsync-new"] - - Prev --> Diff["zfs diff -F -H
prev vs new"] - New --> Diff - - Diff --> Filter["Filter type=F
Apply excludes"] - Filter --> FileList["/tmp/cloudsync_copy_files.txt"] - - FileList --> Rclone["rclone copy
--files-from-raw
--no-traverse"] - - Rclone --> Synology["Synology NAS
192.168.1.13"] - - Synology --> Rotate["Rotate snapshots:
destroy prev
rename new → prev"] - - Excludes["Excludes:
clickhouse (2.47M files)
loki (68K files)
prometheus, iscsi
frigate/recordings
*.log"] - - Filter -.->|uses| Excludes - - style FileList fill:#fff9c4 - style Excludes fill:#ffcdd2 -``` - ## Components | Component | Version/Schedule | Location | Purpose | |-----------|-----------------|----------|---------| -| ZFS Auto-Snapshots | Every 12h + daily | TrueNAS pools (main, ssd) | Near-instant local protection | -| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for 12 databases | -| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for 7 databases | +| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 62 proxmox-lvm PVCs | +| Weekly PVC Backup | Sunday 05:00, 4 weeks | PVE host: `weekly-backup` | File-level PVC copy to sda | +| NFS Mirror | Sunday 05:00 + weekly-backup | PVE host: mount NFS ro → rsync | Mirror DB dumps to sda | +| pfSense Backup | Sunday 05:00 + weekly-backup | PVE host: SSH + API | config.xml + full filesystem tar | +| Offsite Sync | Sunday 08:00 (after weekly-backup) | PVE host: `offsite-sync-backup` | rsync sda → Synology | +| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases | +| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for all databases | | etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot | | Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity | | Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot | | Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy | -| Prometheus Backup | Monthly 1st Sunday, 2 copies | CronJob in `monitoring` | TSDB snapshot → tar.gz | -| plotting-book Backup | Weekly Sunday 03:00, 30d | CronJob in `plotting-book` | sqlite3 .backup | -| LVM Thin Snapshots | Twice daily (00:00, 12:00), 7d | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 13 proxmox-lvm PVCs | -| Incremental Sync | Every 6h (cron) | TrueNAS: `/root/cloudsync-copy.sh` | ZFS diff → rclone copy | -| Full Sync | Weekly Sunday 09:00 | TrueNAS Cloud Sync Task 1 | rclone sync with deletions | -| CloudSync Monitor | Every 6h (cron) | CronJob in `monitoring` | Query TrueNAS API → Pushgateway | | Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric | +| TrueNAS Cloud Sync | Monday 09:00 (weekly) | TrueNAS Cloud Sync Task 1 | Media → Synology NAS | ## How It Works -### Layer 1: ZFS Auto-Snapshots +### Layer 1: LVM Thin Snapshots (Fast Local Recovery) -ZFS snapshots are copy-on-write markers that capture filesystem state in <1 second with zero I/O overhead (only metadata). - -**Schedule**: -| Pool | Frequency | Naming | Retention | Purpose | -|------|-----------|--------|-----------|---------| -| `main` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Recover from recent mistakes | -| `main` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Point-in-time recovery | -| `ssd` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Same as main | -| `ssd` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Same as main | - -**Performance**: Snapshot creation takes <1s for both pools (tested 2026-03-23). 
- -**Rollback**: -```bash -# List snapshots -zfs list -t snapshot | grep main/ - -# Rollback to snapshot -zfs rollback main/<dataset>@auto-2026-03-23_00-00 - -# Clone snapshot (non-destructive) -zfs clone main/<dataset>@auto-2026-03-23_00-00 main/<dataset>-recovered -``` - -### Layer 1b: LVM Thin Snapshots (Proxmox CSI PVCs) - -Native LVM thin snapshots provide crash-consistent point-in-time recovery for all 13 Proxmox CSI PVCs (~340Gi). These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space. +Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space. **Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot`) -**Schedule**: Twice daily (00:00, 12:00) via systemd timer, 7-day retention (max 14 snapshots per LV) +**Schedule**: Daily 03:00 via systemd timer, 7-day retention **Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data` -**Coverage**: All proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because: +**Coverage**: All 65 proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because: - MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour) - They already have app-level dumps (Layer 2) - Including them causes ~36% write amplification; excluding them reduces overhead to ~0% -Snapshotted PVCs include: Redis, Vaultwarden, Calibre, Nextcloud, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc. (~20 low-churn LVs) - -**Exclusion config**: `EXCLUDE_NAMESPACES` variable in script (default: `dbaas,monitoring`). Uses kubectl to resolve LV names dynamically. - **Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free). **Restore**: `lvm-pvc-snapshot restore <snapshot-name>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`. -### Layer 2: Application-Level Backups +### Layer 2: Weekly File-Level Backup (sda Backup Disk) + +**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage. + +**Script**: `/usr/local/bin/weekly-backup` on PVE host (source: `infra/scripts/weekly-backup`) +**Schedule**: Sunday 05:00 via systemd timer +**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup) + +#### What Gets Backed Up + +**1. PVC File Copies** (`/mnt/backup/pvc-data/<week>/`): +- Mount each LVM thin LV ro on PVE host → rsync files (not block) → unmount +- 62 PVCs covered (all except dbaas + monitoring) +- Organized as `/mnt/backup/pvc-data/<week>/<namespace>/<pvc>/` +- 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes) + +**2. NFS Backup Mirror** (`/mnt/backup/nfs-mirror/`): +- Mount TrueNAS NFS ro → rsync DB dump dirs → unmount +- Covers: `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/` +- Single copy (no rotation) — latest dump only + +**3. pfSense Backup** (`/mnt/backup/pfsense/<week>/`): +- `config.xml` via API (base64 decode) +- Full filesystem tar via SSH (`tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf`) +- 4 weekly versions + +**4. 
PVE Config** (`/mnt/backup/pve-config/`): +- `/etc/pve/` (cluster config, VM definitions) +- `/usr/local/bin/` (custom scripts) +- `/etc/systemd/system/` (timers) +- Single copy (no rotation) + +**Manifest Generation**: After backup completes, generates `/mnt/backup/manifest.txt` with all file paths (relative to `/mnt/backup/`). Used by offsite sync `--files-from`. + +**Snapshot Pruning**: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive the `lvm-pvc-snapshot` timer). + +**Monitoring**: Pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`. + +### Layer 2b: Application-Level Backups K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/<service>-backup/`. -**Why needed**: ZFS snapshots capture block-level state, but: -- Cannot restore individual databases from a PostgreSQL zvol snapshot -- iSCSI zvols are opaque to TrueNAS (raw blocks) -- Need point-in-time recovery for specific apps without full ZFS rollback +**Why needed**: LVM snapshots capture block-level state, but: +- Cannot restore individual databases from a PostgreSQL snapshot +- Proxmox CSI LVs are raw block devices — no file-level access without mounting them +- Need point-in-time recovery for specific apps without full LVM rollback **Daily backups (00:00-00:30)**: -- **PostgreSQL** (`pg_dumpall`): Dumps all 12 databases to `/mnt/main/dbaas-backups/postgresql/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`. -- **MySQL** (`mysqldump`): Dumps all 7 databases individually. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation. +- **PostgreSQL** (`pg_dumpall`): Dumps all databases to `/mnt/main/postgresql-backup/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`. +- **MySQL** (`mysqldump`): Dumps all databases. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation. **Weekly backups (Sunday 01:00-04:00)**: - **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery. - **Vaultwarden**: See "Vaultwarden Enhanced Protection" below. 30-day retention. - **Vault**: `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention. - **Redis**: `redis-cli BGSAVE` then copy RDB file. 30-day retention. -- **plotting-book**: `sqlite3 /data/db.sqlite ".backup '/mnt/main/plotting-book-backup/backup-$(date +%Y%m%d).sqlite'"`. 30-day retention. - -**Monthly backups (1st Sunday 04:00)**: -- **Prometheus**: `curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot` → tar.gz snapshot. Keeps 2 most recent copies (older ones purged). ### Vaultwarden Enhanced Protection -Vaultwarden stores sensitive password vault data in SQLite on an iSCSI volume. Extra safeguards prevent corruption: +Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards prevent corruption: **Every 6 hours** (vaultwarden-backup CronJob): 1. 
Run `PRAGMA integrity_check` on live database @@ -236,101 +312,47 @@ This provides both frequent backups (every 6h) AND continuous integrity monitori ### Layer 3: Offsite Sync to Synology NAS -Two complementary sync methods run on TrueNAS: +Two independent paths push backups offsite: -**Incremental COPY (every 6 hours)**: +#### Path 1: PVE Host Backups (rsync) -Runs `/root/cloudsync-copy.sh` via cron. Uses ZFS diff to identify changed files since last sync, then copies only those files. +**Script**: `/usr/local/bin/offsite-sync-backup` on PVE host (source: `infra/scripts/offsite-sync-backup`) +**Schedule**: Sunday 08:00 via systemd timer (After=weekly-backup.service) +**Method**: `rsync --files-from /mnt/backup/manifest.txt` to `synology.viktorbarzin.lan:/Backup/Viki/pve-backup/` +**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` (full sync, removes deleted files) -Flow: -1. Take new snapshot: `zfs snapshot main@cloudsync-new` -2. If previous snapshot exists: `zfs diff -F -H main@cloudsync-prev main@cloudsync-new` -3. Filter output: - - Keep only `type=F` (files, not directories) - - Apply excludes (clickhouse, loki, prometheus, etc.) - - Write to `/tmp/cloudsync_copy_files.txt` -4. Run `rclone copy --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse` -5. Rotate snapshots: `zfs destroy cloudsync-prev`, `zfs rename cloudsync-new cloudsync-prev` +**Why fast**: Only changed files are transferred (manifest generated by weekly-backup). No directory traversal (`--no-implied-dirs`). -**Why fast**: Only changed files are transferred. ZFS diff is instant (metadata scan). `--no-traverse` skips SFTP directory scan. +**Destination**: `Synology/Backup/Viki/pve-backup/` mirrors sda `/mnt/backup/` structure: +- `pvc-data/<week>/` — 4 weekly PVC file backups +- `nfs-mirror/` — latest DB dumps +- `pfsense/<week>/` — 4 weekly pfSense backups +- `pve-config/` — latest PVE config -**Fallback**: If no previous snapshot or >100k changed files → falls back to full `find` command. +**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`. 
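+A sketch of what the offsite invocation reduces to, assumed from the description above (exact flags live in `infra/scripts/offsite-sync-backup`):
+```bash
+DEST="synology.viktorbarzin.lan:/Backup/Viki/pve-backup/"
+if [ "$(date +%u)" -eq 7 ] && [ "$(date +%d)" -le 7 ]; then
+  rsync -a --delete /mnt/backup/ "$DEST"            # 1st Sunday: full sync, propagates deletions
+else
+  rsync -a --files-from=/mnt/backup/manifest.txt \
+    --no-implied-dirs /mnt/backup/ "$DEST"          # weekly: manifest-listed files only
+fi
+```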
-**Weekly SYNC (Sunday 09:00)**: - -TrueNAS Cloud Sync Task 1 runs `rclone sync` which: -- Mirrors source → destination (removes deleted files on destination) -- Full directory traversal (~30-60 min) -- Ensures offsite is clean (no orphaned files from renamed paths) - -**Why both methods**: -- Incremental: Fast recovery for recent changes (seconds to minutes) -- Full sync: Cleanup pass to handle deletions, renames, edge cases +#### Path 2: TrueNAS Media (Cloud Sync) +**Task**: TrueNAS Cloud Sync Task 1 runs `rclone sync` Monday 09:00 +**Source**: `/mnt/main/` (NFS pool on TrueNAS) **Destination**: `sftp://192.168.1.13/Backup/Viki/truenas` +**Scope**: Media libraries only (Immich ~800GB, audiobookshelf, servarr, navidrome music) -### Excludes (both incremental and full sync) +**Excludes** (Cloud Sync configured to skip): +- `clickhouse/**` (2.47M files, regenerable) +- `loki/**` (68K files, regenerable) +- `prometheus/**` (covered by monthly app backup) +- `frigate/**` (ephemeral recordings) +- `audiblez/**`, `ebook2audiobook/**` (regenerable) +- `ollama/**` (chat history, low value) +- `real-estate-crawler/**` (regenerable) +- `crowdsec/**` (regenerable) +- `servarr/downloads/**` (transient) +- `ytldp/**` (replaceable) +- `iscsi/**`, `iscsi-snaps/**` (raw zvols, backed at app level) +- `*-backup/**` (already mirrored via Path 1) -| Pattern | Reason | File count | -|---------|--------|-----------| -| `clickhouse/**` | Regenerable logs/metrics | 2.47M files | -| `loki/**` | Regenerable logs | 68K files | -| `iocage/**` | Legacy FreeBSD jails (unused) | 96K files | -| `frigate/**` | Ephemeral recordings/clips, trivial config | 57K+ files | -| `audiblez/**` | Generated audiobooks, regenerable from source ebooks | — | -| `ebook2audiobook/**` | Same service as audiblez, second volume | — | -| `ollama/**` | UI data (chat history/settings), low value | — | -| `real-estate-crawler/**` | Scraped property data, regenerable by re-crawling | — | -| `prometheus/**` | Covered by monthly app backup | Large TSDB | -| `crowdsec/**` | Regenerable threat intelligence | — | -| `servarr/downloads/**` | Transient download staging | — | -| `iscsi/**`, `iscsi-snaps/**` | Raw zvols, backed at app level | — | -| `ytldp/**` | YouTube downloads (replaceable) | — | -| `*.log` | Log files (regenerable) | — | -| `post` | Transient POST data | — | - -### iSCSI Backup Architecture - -iSCSI zvols are raw block devices exported to K8s nodes. TrueNAS cannot read the filesystem inside a zvol. - -**Protection strategy**: -- **Layer 1**: ZFS snapshots cover zvols automatically (block-level) -- **Layer 2**: Application CronJobs inside pods dump data to NFS paths -- **Layer 3**: NFS paths sync offsite - -**Current coverage**: -| Service | Storage | Layer 2 Backup | Offsite | -|---------|---------|----------------|---------| -| PostgreSQL CNPG (12 DBs) | iSCSI | ✓ daily | ✓ | -| MySQL InnoDB (7 DBs) | iSCSI | ✓ daily | ✓ | -| Vault | iSCSI | ✓ weekly | ✓ | -| Vaultwarden | iSCSI | ✓ 6h + integrity | ✓ | -| Redis | iSCSI | ✓ weekly | ✓ | -| plotting-book | iSCSI | ✓ weekly | ✓ | - -**Convention**: Any new iSCSI-backed app MUST add a backup CronJob writing to `/mnt/main/-backup/` in its Terraform stack. 
- -**Uncovered (acceptable risk)**: -- Prometheus (disposable metrics, monthly TSDB backup for long-term trends) -- Loki (disposable logs) - -### iSCSI Hardening - -To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes: - -| Setting | Default | Hardened | Impact | -|---------|---------|----------|--------| -| `node.session.timeo.replacement_timeout` | 120s | 300s | Time before declaring session dead | -| `node.conn[0].timeo.noop_out_interval` | 5s | 10s | Keepalive interval | -| `node.conn[0].timeo.noop_out_timeout` | 5s | 15s | Keepalive timeout | -| `node.conn[0].iscsi.HeaderDigest` | None | CRC32C,None | Error detection | -| `node.conn[0].iscsi.DataDigest` | None | CRC32C,None | Error detection | - -**Applied to**: All 5 K8s nodes (k8s-master, k8s-node1-4) on 2026-03-23. - -**Persistence**: Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`) so new nodes get these settings automatically. - -**Why needed**: Default 120s timeout is too aggressive. Brief network hiccup (5-10s) can trigger failover, causing SQLite to see incomplete writes → corruption. 300s timeout tolerates longer blips. +**Monitoring**: Existing `CloudSyncStale`, `CloudSyncNeverRun`, `CloudSyncFailing` alerts still apply. ## Configuration @@ -338,21 +360,25 @@ To prevent SQLite corruption from transient network disruptions, iSCSI initiator | Path | Purpose | |------|---------| -| `/root/cloudsync-copy.sh` | TrueNAS: incremental sync script | -| `/var/log/cloudsync-copy.log` | TrueNAS: sync script log output | +| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore | +| `/usr/local/bin/weekly-backup` | PVE host: PVC file copy + NFS mirror + pfSense + manifest | +| `/usr/local/bin/offsite-sync-backup` | PVE host: rsync to Synology | +| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) | +| `/mnt/backup/manifest.txt` | Generated by weekly-backup, consumed by offsite-sync | +| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) | +| `/etc/systemd/system/weekly-backup.timer` | Sunday 05:00 (file backup) | +| `/etc/systemd/system/offsite-sync-backup.timer` | Sunday 08:00 (offsite sync) | | `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs | | `stacks/vault/` | Terraform: Vault backup CronJob | | `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs | -| `stacks/monitoring/` | Terraform: CloudSync monitor, Prometheus backup | -| `modules/create-template-vm/cloud_init.yaml` | iSCSI hardening params for new nodes | -| `/etc/iscsi/iscsid.conf` | K8s nodes: iSCSI initiator config | +| `stacks/monitoring/` | Terraform: Prometheus alerts | ### Vault Paths | Path | Contents | |------|----------| -| `secret/viktor/truenas_api_key` | TrueNAS API key for CloudSync monitor | | `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access | +| `secret/viktor/pfsense_api_key` | pfSense API key + secret for config backup | ### Terraform Stacks @@ -361,27 +387,46 @@ Each backup CronJob is defined in the application's stack: - Vault: `stacks/vault/backup.tf` - Vaultwarden: `stacks/vaultwarden/backup.tf` - etcd: `stacks/platform/etcd-backup.tf` -- Prometheus: `stacks/monitoring/prometheus-backup.tf` ## Decisions & Rationale -### Why 3 Layers? +### Why 3-2-1 Strategy? 
-**Layer 1 (ZFS snapshots)**: +**3 copies**: +- Live PVCs (zero RTO for recent data) +- sda local backup (fast recovery without network) +- Synology offsite (site-level disaster protection) + +**2 storage devices**: +- sdc RAID1 HDD thin pool (live data) +- sda RAID1 SAS disk (backup, separate array and VG) + +**1 offsite**: +- Protection against fire, theft, catastrophic hardware failure +- Weekly RPO acceptable for offsite (daily/weekly app backups reduce exposure) + +### Why File-Level + Block-Level Snapshots? + +**LVM snapshots** (Layer 1): - Near-instant (<1s), zero overhead -- Point-in-time recovery for entire datasets -- BUT: Cannot restore individual database records, no offsite protection +- Point-in-time recovery for entire PVCs +- BUT: Cannot restore individual files, no offsite protection, 7-day retention -**Layer 2 (App backups)**: -- Granular restore (single DB, single table) -- Database-native tools (pg_dump, mysqldump) produce portable backups -- BUT: Higher overhead (CPU, I/O), longer RPO (daily/weekly) +**File-level backup** (Layer 2): +- Can restore single files or directories +- Offsite-compatible (rsync) +- Longer retention (4 weeks local, unlimited offsite) +- BUT: Slower RTO (rsync), higher storage overhead -**Layer 3 (Offsite)**: -- Protection against site-level disaster (fire, theft, catastrophic hardware failure) -- BUT: 6h RPO (incremental), connectivity dependency +Both together provide flexibility: fast local rollback for recent changes, granular recovery for older data. -All three together provide defense-in-depth. +### Why Dedicated Backup Disk (sda)? + +**Isolation**: If sdc fails (thin pool corruption, controller failure), sda is independent (different disk, different VG). + +**Performance**: Backup I/O doesn't compete with live PVC I/O. + +**Simplicity**: Single mount point (`/mnt/backup/`) for all backup data, easy to monitor disk usage. ### Why Not Velero/Longhorn Backup? @@ -390,25 +435,25 @@ Evaluated K8s-native backup solutions (Velero, Longhorn): - **Longhorn**: High overhead (replicas, snapshots in-cluster), no offsite by default **Current approach wins** because: -- Leverages existing ZFS infrastructure (already running TrueNAS) +- Leverages existing Proxmox LVM infrastructure (already running) - Database-native backups (pg_dump/mysqldump) are battle-tested - Simple restore procedures (documented runbooks) +- Lower resource overhead (no in-cluster replicas) ### Why Hybrid Incremental + Full Sync? -**Incremental alone** is risky: +**Incremental alone** (rsync --files-from) is risky: - Deleted files on source never deleted on destination - Renamed paths create duplicates -- No cleanup of orphaned snapshots +- No cleanup of orphaned files -**Full sync alone** is slow: -- 30-60 min per run -- High network/CPU on both ends -- 6h RPO → 12h if a sync fails +**Full sync alone** (rsync --delete) is slow: +- 30-60 min per run (all files scanned) +- 7d RPO → 14d if a sync fails **Hybrid approach**: -- Fast incremental every 6h (sub-minute runtime) -- Weekly full sync for cleanup (tolerates longer runtime) +- Fast incremental weekly (sub-5min runtime via manifest) +- Monthly full sync for cleanup (tolerates longer runtime) ### Why 6h Vaultwarden Backup vs Daily for Others? @@ -424,6 +469,56 @@ Other services (MySQL, PostgreSQL): ## Troubleshooting +### LVM Snapshot Restore Issues + +See `docs/runbooks/restore-lvm-snapshot.md`. 
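+Quick reference (the `list` and `restore` subcommands as documented in Layer 1; `<snapshot-name>` comes from the `list` output):
+```bash
+ssh root@192.168.1.127
+lvm-pvc-snapshot list                      # snapshots with origin LV, age, divergence
+lvm-pvc-snapshot restore <snapshot-name>   # scales workload down, swaps LVs, scales up
+```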
+ +### Weekly Backup Failing + +**Symptom**: `WeeklyBackupStale` or `WeeklyBackupFailing` alert + +**Diagnosis**: +```bash +ssh root@192.168.1.127 +systemctl status weekly-backup.service +journalctl -u weekly-backup.service --since "7 days ago" +df -h /mnt/backup +``` + +**Common causes**: +- Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`) +- LV mount failed (check `lvs pve`, `dmesg | grep backup`) +- NFS mount failed (check `showmount -e 10.0.10.15`) + +**Fix**: +1. If disk full: Clean up old weekly versions manually, adjust retention +2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup` +3. If NFS failed: Check TrueNAS availability, verify exports +4. Manually trigger: `systemctl start weekly-backup.service` + +### Offsite Sync Failing + +**Symptom**: `OffsiteBackupSyncStale` or `OffsiteBackupSyncFailing` alert + +**Diagnosis**: +```bash +ssh root@192.168.1.127 +systemctl status offsite-sync-backup.service +journalctl -u offsite-sync-backup.service --since "7 days ago" +cat /mnt/backup/manifest.txt | wc -l # verify manifest exists +``` + +**Common causes**: +- Synology NAS unreachable (network, SFTP down) +- SSH key auth failed (permissions, expired key) +- Manifest missing (weekly-backup failed) + +**Fix**: +1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13` +2. Verify SSH key: `ssh -i /root/.ssh/synology_backup root@192.168.1.13` +3. Verify manifest exists: `ls -lh /mnt/backup/manifest.txt` +4. Manually trigger: `systemctl start offsite-sync-backup.service` + ### PostgreSQL Backup Stale Alert **Symptom**: `PostgreSQLBackupStale` firing in Prometheus @@ -444,29 +539,6 @@ kubectl logs -n dbaas job/postgresql-backup- 2. If NFS: Verify mount on worker node, restart NFS server if needed 3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas` -### CloudSync Stale/Failing - -**Symptom**: `CloudSyncStale` or `CloudSyncFailing` alert - -**Diagnosis**: -```bash -# SSH to TrueNAS -ssh root@10.0.10.15 -cat /var/log/cloudsync-copy.log -zfs list -t snapshot | grep cloudsync -``` - -**Common causes**: -- Synology NAS unreachable (network, SFTP down) -- ZFS diff failed (snapshot deleted manually) -- rclone error (quota, permission) - -**Fix**: -1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13` -2. Verify snapshots exist: `zfs list -t snapshot | grep cloudsync` -3. Manually run: `/root/cloudsync-copy.sh` (check output) -4. Check rclone config: `rclone ls synology:/Backup/Viki/truenas` - ### Vaultwarden Integrity Check Failing **Symptom**: `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0` @@ -480,46 +552,58 @@ kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 " **Recovery**: 1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden` -2. Restore from latest backup: - ```bash - # Find latest backup - ls -lh /mnt/main/vaultwarden-backup/ - # Copy to pod volume - kubectl cp /mnt/main/vaultwarden-backup/db-.sqlite \ - vaultwarden/vaultwarden-0:/data/db.sqlite3 - ``` +2. Restore from latest backup (see `restore-vaultwarden.md`) 3. Verify integrity on restored DB 4. 
Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden` -### iSCSI Session Drops Causing Backup Failures +### pfSense Backup Failing -**Symptom**: Backup CronJob fails with "I/O error" or "Transport endpoint not connected" +**Symptom**: `PfsenseBackupStale` alert (if implemented) **Diagnosis**: ```bash -# On K8s node -iscsiadm -m session -dmesg | grep -i iscsi -journalctl -u iscsid | tail -50 +ssh root@192.168.1.127 +systemctl status weekly-backup.service | grep -A5 pfsense ``` +**Common causes**: +- API key expired/invalid +- SSH auth failed (password changed, key rejected) +- pfSense unreachable + **Fix**: -1. Verify hardened timeouts applied: `iscsiadm -m node -o show | grep -E 'replacement_timeout|noop_out'` -2. If defaults: Apply hardening: - ```bash - iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300 - iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10 - iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15 - iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None - iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None - ``` -3. Restart session: `iscsiadm -m node -u && iscsiadm -m node -l` +1. Verify API key: `curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <api-key>"` +2. Verify SSH: `ssh root@pfsense.viktorbarzin.me` +3. Update credentials in Vault `secret/viktor/pfsense_api_key` + +### Backup Disk Full + +**Symptom**: `BackupDiskFull` alert, `df -h /mnt/backup` >85% + +**Fix**: +```bash +ssh root@192.168.1.127 + +# Check space usage by component +du -sh /mnt/backup/pvc-data/* +du -sh /mnt/backup/pfsense/* +du -sh /mnt/backup/nfs-mirror + +# Clean up old weekly versions (keep latest 2) +find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf +find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf +``` ### Missing Backup for New Service -**Symptom**: Added new service using iSCSI storage, no backup exists +**Symptom**: Added new service using proxmox-lvm storage, no backup exists -**Fix**: Add backup CronJob in service's Terraform stack +**Fix**: The service is automatically covered by: +1. **LVM snapshots** (if not in dbaas/monitoring namespace) — automatic, no config needed +2. **Weekly file backup** — automatic, no config needed + +**If the service has a database that needs app-level dumps**: +Add backup CronJob in service's Terraform stack (see template below). **Template**: ```hcl resource "kubernetes_cron_job_v1" "backup" { @@ -541,7 +625,7 @@ resource "kubernetes_cron_job_v1" "backup" { args = [ <<-EOT TIMESTAMP=$(date +%Y%m%d) - # Dump command here + # Dump command here (sqlite3 .backup, pg_dump, etc.) 
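+          # Example (hypothetical SQLite app; swap in your service's real dump tool):
+          # sqlite3 /data/db.sqlite ".backup '/backup/backup-$TIMESTAMP.sqlite'"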
find /backup -mtime +30 -delete EOT ] @@ -594,17 +678,26 @@ module "nfs_backup" { │ VaultBackupStale > 8d since last success │ │ VaultwardenBackupStale > 8d since last success │ │ RedisBackupStale > 8d since last success │ -│ PrometheusBackupStale > 32d since last success │ -│ PlottingBookBackupStale > 8d since last success │ │ CloudSyncStale > 8d since last success │ │ CloudSyncNeverRun task never completed │ │ CloudSyncFailing task in error state │ │ VaultwardenIntegrityFail integrity_ok == 0 │ +│ LVMSnapshotStale > 24h since last snapshot │ +│ LVMSnapshotFailing snapshot creation failed │ +│ LVMThinPoolLow < 15% free space in thin pool │ +│ WeeklyBackupStale > 8d since last success │ +│ WeeklyBackupFailing backup script exited non-zero │ +│ PfsenseBackupStale > 8d since last success │ +│ OffsiteBackupSyncStale > 8d since last success │ +│ BackupDiskFull > 85% usage on /mnt/backup │ └────────────────────────────────────────────────────────────────┘ ``` **Metrics sources**: - Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion +- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent` +- Weekly backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent` +- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp` - CloudSync monitor: Queries TrueNAS API every 6h, pushes `cloudsync_last_success_timestamp` - Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly @@ -614,36 +707,45 @@ module "nfs_backup" { ## Service Protection Matrix -| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage | -|---------|:-------------:|:-------------:|:-----------------:|---------| +| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage | +|---------|:------------------:|:----------------:|:----------:|:-------:|---------| | **Databases** | -| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | iSCSI | -| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | iSCSI | +| PostgreSQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm | +| MySQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm | | **Critical State** | -| Vault | ✓ | ✓ weekly | ✓ | iSCSI | -| etcd | ✓ | ✓ weekly | ✓ | local disk | -| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI | -| Redis | ✓ | ✓ weekly | ✓ | iSCSI | -| **Applications** | -| Prometheus | ✓ | ✓ monthly | excluded | NFS | -| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI | -| Immich | ✓ | — | ✓ | NFS | -| Forgejo | ✓ | — | ✓ | NFS | -| Paperless-ngx | ✓ | — | ✓ | NFS | -| Nextcloud | ✓ | — | ✓ | NFS | -| **Other NFS services** | ✓ | — | ✓ | NFS | +| Vault | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm | +| etcd | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm | +| Vaultwarden | ✓ | ✓ | ✓ 6h + integrity | ✓ | proxmox-lvm | +| Redis | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm | +| **Applications (65 proxmox-lvm PVCs)** | +| Prometheus | — | — | — | excluded | proxmox-lvm | +| Nextcloud | ✓ | ✓ | — | ✓ | proxmox-lvm | +| Calibre-Web | ✓ | ✓ | — | ✓ | proxmox-lvm | +| Forgejo | ✓ | ✓ | — | ✓ | proxmox-lvm | +| FreshRSS | ✓ | ✓ | — | ✓ | proxmox-lvm | +| ActualBudget | ✓ | ✓ | — | ✓ | proxmox-lvm | +| NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm | +| Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm | +| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm | +| **Media (NFS)** | +| Immich (~800GB) | — | — | — | ✓ | NFS | +| Audiobookshelf | — | — | — | ✓ | NFS | +| Servarr | — | — | — | ✓ | NFS | +| Navidrome | — | — | — | ✓ | NFS | **Legend**: - ✓ = Protected at this layer -- — = Not 
needed (simple file storage, ZFS snapshots sufficient) +- — = Not needed (other layers cover it, or data is regenerable/disposable) - excluded = Too large/regenerable, not worth offsite bandwidth -**Note**: NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency requirements). +**Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media relies on TrueNAS Cloud Sync for offsite. ## Recovery Procedures Detailed runbooks in `docs/runbooks/`: +- **`restore-lvm-snapshot.md`** — Instant rollback of a PVC using LVM snapshot (RTO <5 min) +- **`restore-pvc-from-backup.md`** — Restore a PVC from sda file backup (when snapshots expired) - **`restore-postgresql.md`** — Restore individual database or full cluster from pg_dumpall backup - **`restore-mysql.md`** — Restore MySQL databases from mysqldump backup - **`restore-vault.md`** — Restore Vault from raft snapshot @@ -651,7 +753,9 @@ Detailed runbooks in `docs/runbooks/`: - **`restore-etcd.md`** — Restore etcd cluster from snapshot - **`restore-full-cluster.md`** — Disaster recovery: rebuild cluster from offsite backups -**RTO estimates** (tested 2026-03-23): +**RTO estimates**: +- LVM snapshot rollback: <5 min (instant swap) +- File-level restore from sda: <15 min (depends on PVC size) - Single PostgreSQL database: <5 min - Full MySQL cluster: <15 min - Vault: <10 min @@ -661,7 +765,7 @@ Detailed runbooks in `docs/runbooks/`: ## Related -- **Architecture**: `docs/architecture/storage.md` (NFS/iSCSI storage layer) +- **Architecture**: `docs/architecture/storage.md` (NFS/Proxmox storage layer) - **Reference**: `.claude/reference/service-catalog.md` (which services need backups) - **Runbooks**: `docs/runbooks/restore-*.md` (step-by-step recovery procedures) - **Monitoring**: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions) diff --git a/docs/architecture/storage.md b/docs/architecture/storage.md index c79bfe0f..33ea9253 100644 --- a/docs/architecture/storage.md +++ b/docs/architecture/storage.md @@ -1,14 +1,16 @@ # Storage Architecture -Last updated: 2026-04-03 +Last updated: 2026-04-06 ## Overview The cluster uses two storage backends: **Proxmox CSI** for database block storage and **TrueNAS NFS** for application data. -**Block storage (Proxmox CSI)**: 13 PVCs for databases (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage. This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors. +**Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors. -**NFS storage (TrueNAS)**: ~100 NFS shares for application data, media, configs, and backup targets continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`. 
+**NFS storage (TrueNAS)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and legacy app data continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`. + +**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc). **Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver is deprecated and pending removal. @@ -16,17 +18,20 @@ The cluster uses two storage backends: **Proxmox CSI** for database block storag ```mermaid graph TB + subgraph Proxmox["Proxmox Host (192.168.1.127)"] + sdc["sdc: 10.7TB RAID1 HDD
VG pve, LV data (thin pool)
65 proxmox-lvm PVCs"] + sda["sda: 1.1TB RAID1 SAS
VG backup, LV data (ext4)
/mnt/backup"] + end + subgraph TrueNAS["TrueNAS (10.0.10.15)
VMID 9000, 16c/16GB"] ZFS_Main["ZFS Pool: main
1.64 TiB
32G + 7x256G + 1T disks"] ZFS_SSD["ZFS Pool: ssd
~256GB SSD
Immich ML, PostgreSQL hot data"] - ZFS_Main --> NFS_Datasets["NFS Datasets
~100 shares
main/<service>"] - ZFS_Main --> iSCSI_Datasets["iSCSI Datasets
main/iscsi (zvols)
main/iscsi-snaps"] + ZFS_Main --> NFS_Datasets["NFS Datasets
~100 shares
main/<service>
Media + backup targets"] NFS_Datasets --> NFS_Exports["NFS Exports
managed by secrets/nfs_exports.sh"] - iSCSI_Datasets --> iSCSI_Targets["iSCSI Targets
SSH-managed via democratic-csi"] - ZFS_SSD --> SSD_Data["Immich ML models
PostgreSQL CNPG"] + ZFS_SSD --> SSD_Data["Immich ML models"] end subgraph K8s["Kubernetes Cluster"] diff --git a/docs/runbooks/restore-full-cluster.md b/docs/runbooks/restore-full-cluster.md index 0e8fd6c1..0316514d 100644 --- a/docs/runbooks/restore-full-cluster.md +++ b/docs/runbooks/restore-full-cluster.md @@ -1,5 +1,7 @@ # Full Cluster Rebuild +Last updated: 2026-04-06 + ## When to Use - Complete cluster failure (all VMs lost) - etcd corruption requiring full rebuild @@ -7,7 +9,8 @@ ## Prerequisites - Proxmox host (192.168.1.127) accessible -- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups +- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups +- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first) - Git repo with infra code - SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`) - Vault unseal keys (emergency kit) @@ -41,15 +44,55 @@ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml ### Phase 3: Storage Layer ```bash -# 6. Deploy CSI drivers (NFS + iSCSI) +# 6. Deploy CSI drivers (NFS + Proxmox) scripts/tg apply stacks/nfs-csi -scripts/tg apply stacks/iscsi-csi +scripts/tg apply stacks/proxmox-csi # 7. Verify PVs are accessible kubectl get pv kubectl get pvc -A | grep -v Bound ``` +### Phase 3.5: Restore PVC Data from sda Backup + +After storage layer is deployed, restore PVC data from the sda backup disk: + +```bash +# 8a. List available backup weeks +ssh root@192.168.1.127 +ls -l /mnt/backup/pvc-data/ + +# 8b. For each critical PVC, restore files: +# Example: vaultwarden-data-proxmox +WEEK="2026-14" # Use most recent week +NAMESPACE="vaultwarden" +PVC_NAME="vaultwarden-data-proxmox" + +# Find the PV LV name +kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME + +# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123" +LV_NAME="vm-999-pvc-abc123" + +# Mount the LV +lvchange -ay pve/$LV_NAME +mkdir -p /mnt/restore-temp +mount /dev/pve/$LV_NAME /mnt/restore-temp + +# Restore from backup +rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/ + +# Unmount +umount /mnt/restore-temp +lvchange -an pve/$LV_NAME + +# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud) +``` + +**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense//config.xml` via web UI, or full filesystem tar for custom scripts. + +**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (weekly-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers). + ### Phase 4: Vault (secrets foundation) ```bash # 8. 
Deploy Vault (see restore-vault.md for full procedure) @@ -117,10 +160,11 @@ kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwa ## Dependency Graph ``` -etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps - ↓ - Restore data from - NFS/Synology backups +etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps + ↓ + Restore DB dumps from + /mnt/backup/nfs-mirror + or Synology/pve-backup ``` ## Estimated Time diff --git a/docs/runbooks/restore-lvm-snapshot.md b/docs/runbooks/restore-lvm-snapshot.md new file mode 100644 index 00000000..7ffb6af7 --- /dev/null +++ b/docs/runbooks/restore-lvm-snapshot.md @@ -0,0 +1,159 @@ +# Runbook: Restore PVC from LVM Thin Snapshot + +Last updated: 2026-04-06 + +## When to Use + +- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion +- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails +- Fast recovery for data changed within the last 7 days + +## Prerequisites + +- SSH access to PVE host (192.168.1.127) +- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot` +- kubectl configured on PVE host (`/root/.kube/config`) + +## Snapshot Retention + +- **Daily snapshots**: Created at 03:00 via systemd timer +- **Retention**: 7 days (older snapshots automatically pruned) +- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces + +**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below. + +## Procedure + +### 1. List Available Snapshots + +```bash +ssh root@192.168.1.127 lvm-pvc-snapshot list +``` + +Output shows all snapshots with their original LV, age, and data divergence percentage. + +### 2. Identify the PVC LV Name + +Find the LV name for your PVC: + +```bash +# From your workstation (with kubectl): +kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' + +# The HANDLE column shows "local-lvm:<lv-name>" +``` + +### 3. Run the Restore + +```bash +ssh root@192.168.1.127 +lvm-pvc-snapshot restore <snapshot-name> +``` + +The script will: +1. Look up the K8s PV/PVC/workload for the LV +2. Show a dry-run of all actions +3. Ask for confirmation (type `yes`) +4. Scale down the workload (Deployment or StatefulSet) +5. Rename the current LV to `<lv-name>_pre_restore_<timestamp>` +6. Rename the snapshot LV to the original name +7. Scale the workload back up +8. Wait for pod to become Ready + +### 4. Verify + +```bash +# Check pod is running +kubectl get pods -n <namespace> -l app=<app> + +# Check the application is working correctly +# (service-specific verification) +``` + +### 5. Clean Up + +Once you've verified the restore is correct, remove the pre-restore backup: + +```bash +ssh root@192.168.1.127 lvremove -f pve/<lv-name>_pre_restore_<timestamp> +``` + +## Manual Restore (if script fails) + +If the automated restore fails, perform these steps manually: + +```bash +# 1. Scale down the workload +kubectl scale deployment/<name> -n <namespace> --replicas=0 +# or for StatefulSets: +kubectl scale statefulset/<name> -n <namespace> --replicas=0 + +# 2. Wait for pods to terminate +kubectl wait --for=delete pod -l app=<app> -n <namespace> --timeout=120s + +# 3. SSH to PVE host +ssh root@192.168.1.127 + +# 4. Verify LV is inactive +lvs -o lv_name,lv_active pve | grep <lv-name> + +# 5. Rename LVs +lvrename pve <lv-name> <lv-name>_pre_restore_$(date +%Y%m%d_%H%M) +lvrename pve <snapshot-name> <lv-name> + +# 6. 
Scale back up +kubectl scale deployment/<name> -n <namespace> --replicas=1 +``` + +## Database-Specific Notes + +- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress. +- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs. +- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status. + +For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a very recent point-in-time that predates the last dump. + +## Alternative: Restore from sda Backup + +If LVM snapshots are too old or missing (data lost >7 days ago), use the weekly file-level backup on sda: + +**Location**: `/mnt/backup/pvc-data/<week>/<namespace>/<pvc>/` on PVE host +**Retention**: 4 weekly versions (weeks 0-3) + +### Procedure + +```bash +# 1. List available backup weeks +ssh root@192.168.1.127 +ls -l /mnt/backup/pvc-data/ + +# 2. Identify the PVC backup directory +ls -l /mnt/backup/pvc-data/2026-14/<namespace>/ + +# 3. Scale down the workload +kubectl scale deployment/<name> -n <namespace> --replicas=0 + +# 4. Mount the live PVC LV on PVE host +lvchange -ay pve/<lv-name> +mkdir -p /mnt/restore-temp +mount /dev/pve/<lv-name> /mnt/restore-temp + +# 5. Restore from backup +rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc>/ /mnt/restore-temp/ + +# 6. Unmount and scale up +umount /mnt/restore-temp +lvchange -an pve/<lv-name> +kubectl scale deployment/<name> -n <namespace> --replicas=1 +``` + +See `restore-pvc-from-backup.md` for detailed walkthrough. + +## Troubleshooting + +| Problem | Cause | Fix | +|---------|-------|-----| +| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` | +| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv-name>` | +| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors | +| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format | diff --git a/docs/runbooks/restore-mysql.md b/docs/runbooks/restore-mysql.md index 78ac83ef..145f7f5d 100644 --- a/docs/runbooks/restore-mysql.md +++ b/docs/runbooks/restore-mysql.md @@ -1,5 +1,7 @@ # Restore MySQL (InnoDB Cluster) +Last updated: 2026-04-06 + ## Prerequisites - `kubectl` access to the cluster - MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`) @@ -7,8 +9,9 @@ ## Backup Location - NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` -- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication -- Retention: 14 days +- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127) +- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/` +- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology) - Size: ~11MB per dump ## Restore Procedure @@ -93,6 +96,39 @@ kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --p kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306 ``` +## Alternative: Restore from sda Backup + +If TrueNAS NFS is unavailable but the PVE host is accessible: + +```bash +# 1. SSH to PVE host +ssh root@192.168.1.127 + +# 2. Find the latest backup +ls -lt /mnt/backup/nfs-mirror/mysql-backup/ + +# 3. 
diff --git a/docs/runbooks/restore-mysql.md b/docs/runbooks/restore-mysql.md
index 78ac83ef..145f7f5d 100644
--- a/docs/runbooks/restore-mysql.md
+++ b/docs/runbooks/restore-mysql.md
@@ -1,5 +1,7 @@
 # Restore MySQL (InnoDB Cluster)
 
+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
@@ -7,8 +9,9 @@
 ## Backup Location
 - NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 14 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
+- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
 - Size: ~11MB per dump
 
 ## Restore Procedure
@@ -93,6 +96,39 @@ kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --p
 kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
 ```
 
+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/mysql-backup/
+
+# 3. Copy the backup to a location accessible from the cluster (e.g., via kubectl cp),
+# or mount the sda backup on a pod:
+kubectl run mysql-restore --rm -it --image=mysql \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
+  -n dbaas
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
+
+# 3. Copy the dump to a temporary location accessible from the cluster
+# (e.g., via rsync to a surviving node, or restore TrueNAS first)
+```
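+
+After either restore path, a quick sanity check is worth the minute. A sketch using the root credentials from Prerequisites (the `cluster-secret` key `ROOT_PASSWORD`); which schemas you expect back is deployment-specific:
+
+```bash
+ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+# List schemas and confirm the expected databases reappeared
+kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- \
+  mysql -u root -p"$ROOT_PWD" -e 'SHOW DATABASES;'
+```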
+
 ## Estimated Time
 - Data restore: ~5 minutes (11MB dump)
 - InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)
diff --git a/docs/runbooks/restore-postgresql.md b/docs/runbooks/restore-postgresql.md
index 6ac73c7a..387f620c 100644
--- a/docs/runbooks/restore-postgresql.md
+++ b/docs/runbooks/restore-postgresql.md
@@ -1,5 +1,7 @@
 # Restore PostgreSQL (CNPG)
 
+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - CNPG operator running in the cluster
@@ -8,8 +10,9 @@
 ## Backup Location
 - NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 14 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
+- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
 
 ## Restore from pg_dumpall
 
@@ -81,11 +84,39 @@
 kubectl rollout restart deployment -n linkwarden
 # ... repeat for all PG-dependent services (excluding trading — disabled)
 ```
 
-## Restore from Synology (if TrueNAS is down)
-1. SSH to Synology NAS (192.168.1.13)
-2. Find the replicated dataset: `zfs list | grep postgresql-backup`
-3. Mount or copy the backup file to a location accessible from the cluster
-4. Follow the restore procedure above
+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
+
+# 3. Mount the sda backup on a pod
+PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
+
+kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
+  -n dbaas
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
+
+# 3. Copy the dump to a temporary location accessible from the cluster
+# (e.g., via rsync to a surviving node, or restore TrueNAS first)
+```
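+
+As a sanity check after either path, confirm the databases came back. A sketch reusing the superuser secret and image from above:
+
+```bash
+PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
+# List databases via the rw service; expect the app databases to be present
+kubectl run pg-verify --rm -it --image=postgres:16.4-bullseye \
+  --env="PGPASSWORD=$PGPASSWORD" -n dbaas \
+  --command -- psql -h pg-cluster-rw.dbaas -U postgres -c '\l'
+```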
+
 ## Estimated Time
 - Restore into existing cluster: ~10 minutes (depends on dump size)
diff --git a/docs/runbooks/restore-pvc-from-backup.md b/docs/runbooks/restore-pvc-from-backup.md
new file mode 100644
index 00000000..62132091
--- /dev/null
+++ b/docs/runbooks/restore-pvc-from-backup.md
@@ -0,0 +1,231 @@
+# Runbook: Restore PVC from sda File Backup
+
+Last updated: 2026-04-06
+
+## When to Use
+
+- LVM snapshots are too old (>7 days) or missing
+- Need to restore data from a specific week (up to 4 weeks back)
+- LVM snapshot restore failed or the snapshot is corrupt
+- Granular file-level restore (not full PVC)
+
+## Prerequisites
+
+- SSH access to PVE host (192.168.1.127)
+- kubectl configured (either on PVE host or your workstation)
+- sda backup disk mounted at `/mnt/backup` on PVE host
+
+## Backup Location
+
+**Path**: `/mnt/backup/pvc-data/<week>/<namespace>/<pvc-name>/` on PVE host
+**Retention**: 4 weekly versions (weeks 0-3)
+**Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
+
+## Procedure
+
+### 1. List Available Backup Weeks
+
+```bash
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# Output shows week directories like:
+# 2026-13
+# 2026-14
+# 2026-15
+# 2026-16
+```
+
+### 2. Identify the PVC Backup Directory
+
+```bash
+# List namespaces in a specific week
+ls -l /mnt/backup/pvc-data/2026-14/
+
+# List PVCs in a namespace
+ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
+
+# Example: vaultwarden-data-proxmox/
+```
+
+### 3. Find the Live PVC LV Name
+
+From your workstation (or PVE host with kubectl):
+
+```bash
+# Get the PV volumeHandle (contains the LV name)
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
+
+# Example output:
+# pvc-abc123   vaultwarden-data-proxmox   vaultwarden   local-lvm:vm-999-pvc-abc123
+#                                                                 ↑ this is the LV name
+```
+
+### 4. Scale Down the Workload
+
+```bash
+# Find the workload using the PVC
+kubectl get deployment,statefulset -n <namespace> -o json | jq -r '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>") | .metadata.name'
+
+# Scale down (Deployment example)
+kubectl scale deployment/<name> -n <namespace> --replicas=0
+
+# Or StatefulSet:
+kubectl scale statefulset/<name> -n <namespace> --replicas=0
+
+# Wait for pod to terminate
+kubectl wait --for=delete pod -l app=<app-label> -n <namespace> --timeout=120s
+```
+
+### 5. Mount the Live PVC LV
+
+```bash
+ssh root@192.168.1.127
+
+# Activate the LV (should already be inactive after pod termination)
+lvchange -ay pve/<lv-name>
+
+# Create mount point
+mkdir -p /mnt/restore-temp
+
+# Mount the LV
+mount /dev/pve/<lv-name> /mnt/restore-temp
+```
+
+### 6. Restore from Backup
+
+**Option A: Full PVC restore (replace all data)**
+
+```bash
+# This will delete existing files in the PVC and replace them with the backup
+rsync -avP --delete /mnt/backup/pvc-data/<week>/<namespace>/<pvc-name>/ /mnt/restore-temp/
+
+# Example:
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+```
+
+**Option B: Selective file restore (merge)**
+
+```bash
+# Restore specific files or directories without deleting existing data
+rsync -avP /mnt/backup/pvc-data/<week>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
+
+# Example: Restore only db.sqlite3
+rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
+```
+
+### 7. Unmount and Deactivate LV
+
+```bash
+# Unmount
+umount /mnt/restore-temp
+
+# Deactivate the LV (optional; the kubelet will activate it when the pod starts)
+lvchange -an pve/<lv-name>
+```
+
+### 8. Scale Up the Workload
+
+```bash
+# From your workstation:
+kubectl scale deployment/<name> -n <namespace> --replicas=1
+
+# Or StatefulSet:
+kubectl scale statefulset/<name> -n <namespace> --replicas=1
+
+# Wait for pod to be ready
+kubectl wait --for=condition=Ready pod -l app=<app-label> -n <namespace> --timeout=120s
+```
+
+### 9. Verify
+
+```bash
+# Check pod logs for startup errors
+kubectl logs -n <namespace> -l app=<app-label> --tail=20
+
+# Test application functionality (service-specific)
+curl -s -o /dev/null -w "%{http_code}" https://<app>.viktorbarzin.me/
+```
+
+## Example: Full Vaultwarden Restore
+
+```bash
+# 1. List backups
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 2. Scale down
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
+kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
+
+# 3. Find LV name
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
+# Output: pvc-xyz   vaultwarden-data-proxmox   local-lvm:vm-105-pvc-xyz456
+
+# 4. Mount and restore
+ssh root@192.168.1.127
+lvchange -ay pve/vm-105-pvc-xyz456
+mkdir -p /mnt/restore-temp
+mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
+
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+
+umount /mnt/restore-temp
+lvchange -an pve/vm-105-pvc-xyz456
+
+# 5. Scale up
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
+kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
+
+# 6. Test
+curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
+```
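+
+A quick way to see the `--link-dest` dedup at work, and to sanity-check that weekly versions really are hardlinked. A sketch, run on the PVE host:
+
+```bash
+# Each week looks full-sized on its own...
+for w in /mnt/backup/pvc-data/*/; do du -sh "$w"; done
+# ...but one pass over all weeks counts each hardlinked inode once,
+# so the total is far smaller than the sum of the per-week numbers
+du -sh /mnt/backup/pvc-data/
+```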
+
+## Database-Specific Notes
+
+For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
+- You need a point in time more recent than the last dump
+- The database dump is corrupt or missing
+- You're restoring a non-SQL database (e.g., Redis RDB)
+
+## Troubleshooting
+
+| Problem | Cause | Fix |
+|---------|-------|-----|
+| "LV is active" during mount | Workload pod still running or stuck | `kubectl get pods -A \| grep <app>`, delete the pod if stuck |
+| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check the `weekly-backup` script's EXCLUDE_NAMESPACES |
+| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<namespace>/<pvc-name>/` |
+| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check the pod again |
+| Backup week missing | Weekly backup hasn't run for that week | Check `systemctl status weekly-backup.service`, verify retention |
+
+## Restore from Synology (if PVE host sda is unavailable)
+
+If the PVE host sda backup disk is unavailable or corrupt:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/pvc-data/
+
+# 3. Find the PVC backup
+ls -l 2026-14/<namespace>/<pvc-name>/
+
+# 4. Copy to a temporary location accessible from the cluster
+# Option A: Restore sda on the PVE host first
+# Option B: rsync to a surviving node's local disk
+# Option C: Mount a Synology NFS share on a pod (if network accessible)
+```
+
+## Estimated Time
+
+- Small PVC (<1GB): ~5 minutes
+- Medium PVC (1-10GB): ~10-15 minutes
+- Large PVC (>10GB): ~30+ minutes (depends on size and network)
+
+## Related
+
+- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
+- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
+- **`docs/architecture/backup-dr.md`** — Backup architecture overview
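+
+When deciding which week to restore from, inode numbers tell you whether a file actually changed between weeks (hardlinked copies are byte-identical). A sketch using this runbook's vaultwarden example paths:
+
+```bash
+# Same inode across weeks = file unchanged; a new inode marks the week it changed
+ls -li /mnt/backup/pvc-data/*/vaultwarden/vaultwarden-data-proxmox/db.sqlite3
+```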
diff --git a/docs/runbooks/restore-vault.md b/docs/runbooks/restore-vault.md
index adbfdcdb..55abd27c 100644
--- a/docs/runbooks/restore-vault.md
+++ b/docs/runbooks/restore-vault.md
@@ -1,5 +1,7 @@
 # Restore Vault (Raft)
 
+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
@@ -8,8 +10,9 @@
 ## Backup Location
 - NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 30 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
+- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
 - Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
 
 ## CRITICAL: Vault is a dependency for many services
@@ -88,6 +91,45 @@
 kubectl rollout restart deployment -n external-secrets
 kubectl get externalsecrets -A | grep -v "SecretSynced"
 ```
 
+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest snapshot
+ls -lt /mnt/backup/nfs-mirror/vault-backup/
+
+# 3. Copy the snapshot to a location accessible from the cluster.
+# Port-forward to Vault and restore:
+kubectl port-forward svc/vault-active -n vault 8200:8200 &
+export VAULT_ADDR=http://127.0.0.1:8200
+export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
+
+# Copy the snapshot from the PVE host to the local workstation, then restore
+scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
+vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
+
+# 3. Copy the snapshot to the local workstation
+scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
+
+# 4. Restore via port-forward (same as above)
+```
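+
+Whichever path you used, verify before declaring victory. A short sketch with standard Vault CLI checks, reusing the port-forward and env vars from above; which mounts you expect back is deployment-specific:
+
+```bash
+vault status          # expect Sealed: false (unseal with the snapshot's keys if needed)
+vault secrets list    # expect the usual KV mounts to reappear
+# Once ESO re-syncs, this should shrink to just the header row:
+kubectl get externalsecrets -A | grep -v SecretSynced
+```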
+
 ## Full Vault Rebuild (from zero)
 If Vault needs to be rebuilt from scratch:
 1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
diff --git a/docs/runbooks/restore-vaultwarden.md b/docs/runbooks/restore-vaultwarden.md
index 0edeae60..d46504ae 100644
--- a/docs/runbooks/restore-vaultwarden.md
+++ b/docs/runbooks/restore-vaultwarden.md
@@ -1,5 +1,7 @@
 # Restore Vaultwarden
 
+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - Backup available on NFS at `/mnt/main/vaultwarden-backup/`
@@ -7,8 +9,10 @@
 ## Backup Location
 - NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
 - Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 30 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
+- PVC file backup (alternative): `/mnt/backup/pvc-data/<week>/vaultwarden/vaultwarden-data-proxmox/`
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
+- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
 - Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
 - Integrity check: Both source and backup are verified before/after each backup
 
@@ -69,6 +73,56 @@
 Log in to the Vaultwarden web UI and verify:
 - [ ] Attachments are accessible
 - [ ] TOTP codes are generating correctly
 
+## Alternative: Restore from PVC File Backup
+
+If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
+
+```bash
+# 1. List available backup weeks
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 2. Scale down Vaultwarden
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
+
+# 3. Mount the live PVC LV on the PVE host
+# Find the LV name first:
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
+# Assuming the volumeHandle is "local-lvm:vm-999-pvc-abc123":
+LV_NAME="vm-999-pvc-abc123"
+
+lvchange -ay pve/$LV_NAME
+mkdir -p /mnt/restore-temp
+mount /dev/pve/$LV_NAME /mnt/restore-temp
+
+# 4. Restore from backup (pick a week)
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+
+# 5. Unmount and scale up
+umount /mnt/restore-temp
+lvchange -an pve/$LV_NAME
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
+```
+
+## Alternative: Restore from sda NFS Mirror
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
+
+# 3. Mount the sda backup on a pod (scale Vaultwarden down first so the RWO PVC is free)
+BACKUP_DIR="YYYY_MM_DD_HH_MM"  # Set to the desired backup
+
+kubectl run vw-restore --rm -it --image=alpine \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; echo Restore complete"]}],"nodeName":"k8s-master"}}' \
+  -n vaultwarden
+```
+
 ## Estimated Time
 - Restore: ~5 minutes
 - Verification: ~5 minutes
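+
+Before scaling Vaultwarden back up after either alternative, a quick integrity check on the restored database is cheap insurance. A sketch run on the PVE host with the PVC still mounted at `/mnt/restore-temp` (assumes the `sqlite3` CLI is installed there):
+
+```bash
+# Expect a single line of output: "ok"
+sqlite3 /mnt/restore-temp/db.sqlite3 'PRAGMA integrity_check;'
+```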