update backup/DR docs and runbooks for 3-2-1 architecture

- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
Viktor Barzin 2026-04-06 15:06:01 +03:00
parent d5b0990ed1
commit b345b086ef
10 changed files with 1051 additions and 332 deletions

@@ -174,17 +174,30 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
- Autoresizer annotations are **required** on all proxmox-lvm PVCs
- Every proxmox-lvm app **MUST** add a backup CronJob writing to NFS `/mnt/main/<app>-backup/`
### Cloud Sync (TrueNAS → Synology NAS)
- **Task 1**: Weekly push (Monday 09:00) of `/mnt/main` NFS data to `nas.viktorbarzin.lan:/Backup/Viki/truenas`
- **zfs diff optimization**: Pre-script diffs `main@cloudsync-prev` vs `main@cloudsync-new`, writes changed files to `/tmp/cloudsync_files.txt`. Args: `--files-from /tmp/cloudsync_files.txt --no-traverse`. Post-script rotates snapshots. Falls back to full `find` if no prev snapshot or >100k changes.
- **Excludes**: ytldp, prometheus, logs, post, crowdsec, servarr/downloads, iscsi, iscsi-snaps, frigate, audiblez, ebook2audiobook, ollama, real-estate-crawler
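A rough sketch of the pre-script's diff step, assuming typical `zfs diff -F -H` output handling (illustration only; the real script lives on TrueNAS as `/root/cloudsync-copy.sh`):
```bash
# Hypothetical sketch of the cloudsync pre-script diff step (not the real script).
zfs snapshot main@cloudsync-new
if zfs list -t snapshot main@cloudsync-prev >/dev/null 2>&1; then
  # -H: tab-separated output; -F: adds a file-type column (F = regular file)
  zfs diff -F -H main@cloudsync-prev main@cloudsync-new \
    | awk -F'\t' '$2 == "F" {print $3}' > /tmp/cloudsync_files.txt
  # (rename entries carry two paths, and paths may need trimming so they are
  # relative to the sync root before rclone can consume them)
else
  find /mnt/main -type f > /tmp/cloudsync_files.txt   # fallback: full listing
fi
```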
### 3-2-1 Backup Strategy
**Copy 1**: Live data on sdc thin pool (65 PVCs + VMs)
**Copy 2**: sda backup disk (`/mnt/backup`, 1.1TB ext4, VG `backup`)
**Copy 3**: Synology NAS offsite (two paths)
### Proxmox-LVM Backup Architecture
- proxmox-lvm volumes are thin LVs on the Proxmox host — opaque to TrueNAS
- **Offsite protection**: Application-level backup CronJobs dump data to NFS paths, which Cloud Sync Task 1 syncs to Synology
- **Current CronJob coverage**: MySQL (mysqldump), PostgreSQL (pg_dumpall), Vault (raft snapshot), Redis (BGSAVE), Vaultwarden (sqlite3 .backup), Headscale (sqlite3 .backup)
- **Convention**: Any new proxmox-lvm app MUST add a backup CronJob to its Terraform stack that writes to `/mnt/main/<app>-backup/`
- **Uncovered (acceptable)**: Prometheus (disposable metrics), Loki (disposable logs), plotting-book and novelapp (small, low-priority)
**PVE host scripts** (source: `infra/scripts/`):
- `/usr/local/bin/weekly-backup` — Sunday 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data/<YYYY-WW>/<ns>/<pvc>/` with `--link-dest` versioning (4 weeks). Also mirrors NFS backup dirs, pfsense (config.xml + tar), PVE config. Prunes snapshots >7d.
- `/usr/local/bin/offsite-sync-backup` — Sunday 08:00 (After=weekly-backup). `rsync --files-from` manifest to `Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/`. Monthly full `--delete` on 1st Sunday.
- `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore <lv> <snap>`.
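To confirm the three host timers are installed and scheduled (timer names assumed to match the script names):
```bash
ssh root@192.168.1.127 systemctl list-timers 'weekly-backup*' 'offsite-sync-backup*' 'lvm-pvc-snapshot*'
```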
**Offsite sync (two paths)**:
- `Synology/Backup/Viki/pve-backup/` — structured data from PVE host (PVC files, DB dumps, pfsense, PVE config)
- `Synology/Backup/Viki/truenas/` — NFS media from TrueNAS Cloud Sync (Immich, audiobookshelf, servarr — narrowed, excludes backup dirs)
**App-level CronJobs** (write to TrueNAS NFS, mirrored to sda weekly):
- MySQL (daily), PostgreSQL (daily), Vault (weekly), Vaultwarden (6h + integrity), Redis (weekly), etcd (weekly)
- **Convention**: New proxmox-lvm apps MUST add a backup CronJob writing to `/mnt/main/<app>-backup/`
**Restore paths**:
- Accidental delete: `lvm-pvc-snapshot restore` (instant, 7 daily snapshots)
- Older data: Browse `/mnt/backup/pvc-data/<week>/<ns>/<pvc>/`, rsync back
- Database: Restore from dump at `/mnt/backup/nfs-mirror/<db>-backup/`
- pfsense: Upload config.xml via web UI, or extract tar for custom scripts
- Full disaster: Restore from Synology
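For example, a file-level restore of one directory from the sda backup might look like this (hypothetical week and PVC names):
```bash
rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
```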
## Known Issues
- **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.

@@ -1,10 +1,17 @@
# Backup & Disaster Recovery Architecture
Last updated: 2026-03-24
Last updated: 2026-04-06
## Overview
The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides near-instant local snapshots via ZFS auto-snapshots on TrueNAS (every 12h + daily, up to 3-week retention). Layer 2 adds application-level backups for complex stateful services (databases, Vault, etcd) via K8s CronJobs dumping to NFS-exported directories with 14-30 day retention. Layer 3 ensures offsite protection through hybrid incremental/full sync to a Synology NAS every 6 hours (incremental via ZFS diff) plus weekly full sync (Sunday 09:00) for cleanup. This architecture provides <1s RPO for file data, 6h RPO for offsite, and <30min RTO for most services.
The homelab uses a defense-in-depth 3-2-1 backup strategy: **3 copies** (live PVCs on sdc, weekly backups on sda, offsite on Synology), **2 media types** (sdc RAID1 HDD thin pool, sda RAID1 SAS backup disk), **1 offsite copy** (Synology NAS). This architecture provides <24h RPO for recent changes (daily LVM snapshots, 7-day retention), <7d RPO for file-level recovery, and <30min RTO for most services.
**3-2-1 Breakdown**:
- **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD)
- **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS)
- **Copy 3** (offsite): Synology NAS at 192.168.1.13 via two paths:
- `Synology/Backup/Viki/pve-backup/` — structured PVE host backups (rsync --files-from weekly)
- `Synology/Backup/Viki/truenas/` — TrueNAS NFS media (Cloud Sync, narrowed to media only)
## Architecture Diagram
@@ -12,54 +19,64 @@ The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides ne
```mermaid
graph TB
subgraph TrueNAS["TrueNAS (10.0.10.15)"]
ZFS_Data["ZFS Pools<br/>main (1.64 TiB)<br/>ssd (~256GB)"]
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
subgraph Layer1["Layer 1: ZFS Auto-Snapshots"]
Snap12h["Every 12h<br/>auto-12h-*<br/>24h retention"]
SnapDaily["Daily 00:00<br/>auto-*<br/>3-week retention"]
subgraph Layer1["Layer 1: LVM Thin Snapshots"]
Snap["Daily 03:00<br/>7-day retention<br/>62 PVCs (excludes dbaas+monitoring)"]
end
ZFS_Data --> Snap12h
ZFS_Data --> SnapDaily
subgraph Layer2["Layer 2: Weekly File Backup"]
PVCBackup["PVC File Copy<br/>Sunday 05:00<br/>4 weekly versions<br/>/mnt/backup/pvc-data/<YYYY-WW>/"]
NFSMirror["NFS Mirror<br/>DB dumps + backup CronJob output<br/>/mnt/backup/nfs-mirror/"]
PfsenseBackup["pfSense Backup<br/>config.xml + full tar<br/>4 weekly versions"]
PVEConfig["PVE Config<br/>/etc/pve + scripts"]
end
NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
sdc --> Snap
sdc --> PVCBackup
PVCBackup --> sda
NFSMirror --> sda
PfsenseBackup --> sda
PVEConfig --> sda
end
subgraph K8s["Kubernetes Cluster"]
subgraph Layer2["Layer 2: App Backups"]
subgraph TrueNAS["TrueNAS (10.0.10.15)"]
NFS_Backup["NFS-exported<br/>/mnt/main/*-backup/"]
Media["Media (NFS)<br/>Immich ~800GB<br/>audiobookshelf, servarr, navidrome"]
subgraph AppBackups["App-Level Backup CronJobs"]
CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden, plotting-book<br/>30d retention"]
CronMonthly["Monthly 1st Sunday<br/>Prometheus TSDB<br/>2 copies"]
Cron6h["Every 6h<br/>Vaultwarden backup<br/>+ integrity check"]
CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden<br/>30d retention"]
end
CronDaily --> NFS_Backup
CronWeekly --> NFS_Backup
CronMonthly --> NFS_Backup
Cron6h --> NFS_Backup
end
subgraph Layer3["Layer 3: Offsite Sync"]
Incremental["Every 6h<br/>zfs diff → rclone copy<br/>--files-from --no-traverse"]
FullSync["Weekly Sunday 09:00<br/>rclone sync<br/>handles deletions"]
PVEOffsite["PVE → Synology<br/>Sunday 08:00<br/>rsync --files-from<br/>/Backup/Viki/pve-backup/"]
CloudSync["TrueNAS → Synology<br/>Monday 09:00<br/>Cloud Sync (media only)<br/>/Backup/Viki/truenas/"]
end
ZFS_Data --> Incremental
ZFS_Data --> FullSync
sda --> PVEOffsite
Media --> CloudSync
Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/truenas"]
Synology["Synology NAS<br/>192.168.1.13<br/>Offsite protection"]
Incremental --> Synology
FullSync --> Synology
PVEOffsite --> Synology
CloudSync --> Synology
NFS_Backup -.->|mirrored to sda| NFSMirror
subgraph Monitoring["Monitoring & Alerting"]
Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale<br/>MySQLBackupStale<br/>CloudSyncStale<br/>VaultwardenIntegrityFail"]
Pushgateway["Pushgateway<br/>cloudsync metrics<br/>vaultwarden integrity"]
Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale, MySQLBackupStale<br/>WeeklyBackupStale, OffsiteBackupSyncStale<br/>LVMSnapshotStale, BackupDiskFull<br/>VaultwardenIntegrityFail"]
Pushgateway["Pushgateway<br/>backup script metrics<br/>cloudsync metrics<br/>vaultwarden integrity"]
end
NFS_Backup -.->|scrape| Prometheus
Synology -.->|API query| Pushgateway
PVCBackup -.->|push metrics| Pushgateway
Snap -.->|push metrics| Pushgateway
Pushgateway --> Prometheus
style Layer1 fill:#c8e6c9
@@ -68,6 +85,89 @@ graph TB
style Monitoring fill:#f3e5f5
```
### Weekly Backup Timeline
```mermaid
graph LR
subgraph Sunday["Sunday Timeline"]
S01["01:00 etcd backup<br/>(CronJob)"]
S02["02:00 Vault backup<br/>(CronJob)"]
S03a["03:00 Redis backup<br/>(CronJob)"]
S03b["03:00 LVM snapshots<br/>(lvm-pvc-snapshot timer)"]
S05["05:00 Weekly backup<br/>(weekly-backup timer)<br/>1. NFS mirror<br/>2. PVC file copy<br/>3. pfSense backup<br/>4. PVE config<br/>5. Prune snapshots<br/>6. Generate manifest"]
S08["08:00 Offsite sync<br/>(offsite-sync-backup timer)<br/>rsync --files-from"]
end
S01 --> S02 --> S03a --> S03b --> S05 --> S08
subgraph Monday["Monday"]
M09["09:00 TrueNAS Cloud Sync<br/>Media → Synology"]
end
S08 -.->|next day| M09
style Sunday fill:#ffe0b2
style Monday fill:#e1f5ff
```
### Physical Disk Layout
```mermaid
graph TB
subgraph PVE["Proxmox Host (192.168.1.127)"]
subgraph sda["sda: 1.1TB RAID1 SAS"]
sda_vg["VG: backup<br/>LV: data (ext4)<br/>/mnt/backup"]
sda_content["pvc-data/<YYYY-WW>/<ns>/<pvc>/<br/>nfs-mirror/<service>-backup/<br/>pfsense/<YYYY-WW>/<br/>pve-config/"]
end
subgraph sdb["sdb: 931GB SSD"]
sdb_vg["VG: pve<br/>LV: root (ext4)<br/>PVE host OS"]
end
subgraph sdc["sdc: 10.7TB RAID1 HDD"]
sdc_vg["VG: pve<br/>LV: data (thin pool)<br/>65 proxmox-lvm PVCs<br/>+ VM disks"]
end
sda_vg --> sda_content
end
sdc -.->|weekly backup<br/>mount snapshot ro| sda
sda -.->|offsite sync<br/>rsync| Synology["Synology NAS<br/>192.168.1.13<br/>/Backup/Viki/pve-backup/"]
style sda fill:#fff9c4
style sdb fill:#c8e6c9
style sdc fill:#e1f5ff
```
### Restore Decision Tree
```mermaid
graph TB
Start["Data loss detected"]
Age{"How old is<br/>the lost data?"}
Type{"What type<br/>of data?"}
Start --> Age
Age -->|"< 7 days"| LVM["Use LVM snapshot<br/>lvm-pvc-snapshot restore<br/>RTO: <5 min"]
Age -->|"> 7 days,<br/>< 4 weeks"| FileBackup["Use sda file backup<br/>/mnt/backup/pvc-data/<week>/<br/>RTO: <15 min"]
Age -->|"> 4 weeks or<br/>site disaster"| Offsite["Use Synology backup<br/>Synology/pve-backup/<br/>RTO: <4 hours"]
LVM --> Type
FileBackup --> Type
Offsite --> Type
Type -->|"Database"| AppBackup["Use app-level dump<br/>/mnt/backup/nfs-mirror/<service>-backup/<br/>OR Synology/pve-backup/nfs-mirror/<br/>RTO: <10 min"]
Type -->|"PVC files"| Proceed["Proceed with<br/>selected restore method"]
Type -->|"Media (NFS)"| CloudSync["Use Synology backup<br/>Synology/truenas/<service>/<br/>RTO: varies by size"]
style Start fill:#ffcdd2
style LVM fill:#c8e6c9
style FileBackup fill:#fff9c4
style Offsite fill:#e1f5ff
style AppBackup fill:#e1bee7
```
### Vaultwarden Enhanced Protection
```mermaid
@@ -97,127 +197,103 @@ graph LR
style Hourly fill:#e1bee7
```
### Incremental Offsite Sync
```mermaid
graph TB
Prev["ZFS snapshot<br/>main@cloudsync-prev"]
New["ZFS snapshot<br/>main@cloudsync-new"]
Prev --> Diff["zfs diff -F -H<br/>prev vs new"]
New --> Diff
Diff --> Filter["Filter type=F<br/>Apply excludes"]
Filter --> FileList["/tmp/cloudsync_copy_files.txt"]
FileList --> Rclone["rclone copy<br/>--files-from-raw<br/>--no-traverse"]
Rclone --> Synology["Synology NAS<br/>192.168.1.13"]
Synology --> Rotate["Rotate snapshots:<br/>destroy prev<br/>rename new → prev"]
Excludes["Excludes:<br/>clickhouse (2.47M files)<br/>loki (68K files)<br/>prometheus, iscsi<br/>frigate/recordings<br/>*.log"]
Filter -.->|uses| Excludes
style FileList fill:#fff9c4
style Excludes fill:#ffcdd2
```
## Components
| Component | Version/Schedule | Location | Purpose |
|-----------|-----------------|----------|---------|
| ZFS Auto-Snapshots | Every 12h + daily | TrueNAS pools (main, ssd) | Near-instant local protection |
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for 12 databases |
| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for 7 databases |
| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 62 proxmox-lvm PVCs |
| Weekly PVC Backup | Sunday 05:00, 4 weeks | PVE host: `weekly-backup` | File-level PVC copy to sda |
| NFS Mirror | Sunday 05:00 + weekly-backup | PVE host: mount NFS ro → rsync | Mirror DB dumps to sda |
| pfSense Backup | Sunday 05:00 + weekly-backup | PVE host: SSH + API | config.xml + full filesystem tar |
| Offsite Sync | Sunday 08:00 (after weekly-backup) | PVE host: `offsite-sync-backup` | rsync sda → Synology |
| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for all databases |
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot |
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity |
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
| Prometheus Backup | Monthly 1st Sunday, 2 copies | CronJob in `monitoring` | TSDB snapshot → tar.gz |
| plotting-book Backup | Weekly Sunday 03:00, 30d | CronJob in `plotting-book` | sqlite3 .backup |
| LVM Thin Snapshots | Twice daily (00:00, 12:00), 7d | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 13 proxmox-lvm PVCs |
| Incremental Sync | Every 6h (cron) | TrueNAS: `/root/cloudsync-copy.sh` | ZFS diff → rclone copy |
| Full Sync | Weekly Sunday 09:00 | TrueNAS Cloud Sync Task 1 | rclone sync with deletions |
| CloudSync Monitor | Every 6h (cron) | CronJob in `monitoring` | Query TrueNAS API → Pushgateway |
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
| TrueNAS Cloud Sync | Monday 09:00 (weekly) | TrueNAS Cloud Sync Task 1 | Media → Synology NAS |
## How It Works
### Layer 1: ZFS Auto-Snapshots
### Layer 1: LVM Thin Snapshots (Fast Local Recovery)
ZFS snapshots are copy-on-write markers that capture filesystem state in <1 second with zero I/O overhead (only metadata).
**Schedule**:
| Pool | Frequency | Naming | Retention | Purpose |
|------|-----------|--------|-----------|---------|
| `main` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Recover from recent mistakes |
| `main` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Point-in-time recovery |
| `ssd` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Same as main |
| `ssd` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Same as main |
**Performance**: Snapshot creation takes <1s for both pools (tested 2026-03-23).
**Rollback**:
```bash
# List snapshots
zfs list -t snapshot | grep main/<service>
# Rollback to snapshot
zfs rollback main/<service>@auto-2026-03-23_00-00
# Clone snapshot (non-destructive)
zfs clone main/<service>@auto-2026-03-23_00-00 main/<service>-recovered
```
### Layer 1b: LVM Thin Snapshots (Proxmox CSI PVCs)
Native LVM thin snapshots provide crash-consistent point-in-time recovery for all 13 Proxmox CSI PVCs (~340Gi). These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot`)
**Schedule**: Twice daily (00:00, 12:00) via systemd timer, 7-day retention (max 14 snapshots per LV)
**Schedule**: Daily 03:00 via systemd timer, 7-day retention
**Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data`
**Coverage**: All proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because:
**Coverage**: All 65 proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because:
- MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour)
- They already have app-level dumps (Layer 2)
- Including them causes ~36% write amplification; excluding them reduces overhead to ~0%
Snapshotted PVCs include: Redis, Vaultwarden, Calibre, Nextcloud, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc. (~20 low-churn LVs)
**Exclusion config**: `EXCLUDE_NAMESPACES` variable in script (default: `dbaas,monitoring`). Uses kubectl to resolve LV names dynamically.
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>24h), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
### Layer 2: Application-Level Backups
### Layer 2: Weekly File-Level Backup (sda Backup Disk)
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
**Script**: `/usr/local/bin/weekly-backup` on PVE host (source: `infra/scripts/weekly-backup`)
**Schedule**: Sunday 05:00 via systemd timer
**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup)
#### What Gets Backed Up
**1. PVC File Copies** (`/mnt/backup/pvc-data/<YYYY-WW>/`):
- Mount each LVM thin LV ro on PVE host → rsync files (not block) → unmount
- 62 PVCs covered (all except dbaas + monitoring)
- Organized as `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/`
- 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes; see the sketch at the end of this section)
**2. NFS Backup Mirror** (`/mnt/backup/nfs-mirror/`):
- Mount TrueNAS NFS ro → rsync DB dump dirs → unmount
- Covers: `mysql-backup/`, `postgresql-backup/`, `vault-backup/`, `vaultwarden-backup/`, `redis-backup/`, `etcd-backup/`
- Single copy (no rotation) — latest dump only
**3. pfSense Backup** (`/mnt/backup/pfsense/<YYYY-WW>/`):
- `config.xml` via API (base64 decode)
- Full filesystem tar via SSH (`tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf`)
- 4 weekly versions
**4. PVE Config** (`/mnt/backup/pve-config/`):
- `/etc/pve/` (cluster config, VM definitions)
- `/usr/local/bin/` (custom scripts)
- `/etc/systemd/system/` (timers)
- Single copy (no rotation)
**Manifest Generation**: After backup completes, generates `/mnt/backup/manifest.txt` with all file paths (relative to `/mnt/backup/`). Used by offsite sync `--files-from`.
**Snapshot Pruning**: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive `lvm-pvc-snapshot` timer).
**Monitoring**: Pushes `backup_weekly_last_success_timestamp` to Pushgateway. Alerts: `WeeklyBackupStale` (>8d), `WeeklyBackupFailing`.
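The per-PVC copy step can be pictured as follows: a minimal sketch with hypothetical LV/PVC names; the authoritative logic is in `infra/scripts/weekly-backup`.
```bash
# Illustration only: one PVC copy inside weekly-backup (names are examples).
WEEK=$(date +%Y-%W)                     # e.g. 2026-14
PREV=$(date -d '7 days ago' +%Y-%W)     # previous week, used for hardlink dedup
LV="vm-999-pvc-abc123"; NS="vaultwarden"; PVC="vaultwarden-data-proxmox"

lvcreate -s -n "${LV}-wkly" "pve/${LV}"       # instant CoW thin snapshot
lvchange -ay -K "pve/${LV}-wkly"              # thin snapshots need -K to activate
mkdir -p /mnt/snap "/mnt/backup/pvc-data/${WEEK}/${NS}/${PVC}"
mount -o ro "/dev/pve/${LV}-wkly" /mnt/snap   # read-only mount of the snapshot
rsync -a --delete \
  --link-dest="/mnt/backup/pvc-data/${PREV}/${NS}/${PVC}/" \
  /mnt/snap/ "/mnt/backup/pvc-data/${WEEK}/${NS}/${PVC}/"
umount /mnt/snap && lvremove -fy "pve/${LV}-wkly"
```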
### Layer 2b: Application-Level Backups
K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/<service>-backup/`.
**Why needed**: ZFS snapshots capture block-level state, but:
- Cannot restore individual databases from a PostgreSQL zvol snapshot
- iSCSI zvols are opaque to TrueNAS (raw blocks)
- Need point-in-time recovery for specific apps without full ZFS rollback
**Why needed**: LVM snapshots capture block-level state, but:
- Cannot restore individual databases from a PostgreSQL snapshot
- Proxmox CSI LVs are opaque to TrueNAS (raw block devices)
- Need point-in-time recovery for specific apps without full LVM rollback
**Daily backups (00:00-00:30)**:
- **PostgreSQL** (`pg_dumpall`): Dumps all 12 databases to `/mnt/main/dbaas-backups/postgresql/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
- **MySQL** (`mysqldump`): Dumps all 7 databases individually. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.
- **PostgreSQL** (`pg_dumpall`): Dumps all databases to `/mnt/main/postgresql-backup/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`.
- **MySQL** (`mysqldump`): Dumps all databases. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation.
**Weekly backups (Sunday 01:00-04:00)**:
- **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery.
- **Vaultwarden**: See "Vaultwarden Enhanced Protection" below. 30-day retention.
- **Vault**: `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention.
- **Redis**: `redis-cli BGSAVE` then copy RDB file. 30-day retention.
- **plotting-book**: `sqlite3 /data/db.sqlite ".backup '/mnt/main/plotting-book-backup/backup-$(date +%Y%m%d).sqlite'"`. 30-day retention.
**Monthly backups (1st Sunday 04:00)**:
- **Prometheus**: `curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot` → tar.gz snapshot. Keeps 2 most recent copies (older ones purged).
### Vaultwarden Enhanced Protection
Vaultwarden stores sensitive password vault data in SQLite on an iSCSI volume. Extra safeguards prevent corruption:
Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards prevent corruption:
**Every 6 hours** (vaultwarden-backup CronJob):
1. Run `PRAGMA integrity_check` on live database
@@ -236,101 +312,47 @@ This provides both frequent backups (every 6h) AND continuous integrity monitori
### Layer 3: Offsite Sync to Synology NAS
Two complementary sync methods run on TrueNAS:
Two independent paths push backups offsite:
**Incremental COPY (every 6 hours)**:
#### Path 1: PVE Host Backups (rsync)
Runs `/root/cloudsync-copy.sh` via cron. Uses ZFS diff to identify changed files since last sync, then copies only those files.
**Script**: `/usr/local/bin/offsite-sync-backup` on PVE host (source: `infra/scripts/offsite-sync-backup`)
**Schedule**: Sunday 08:00 via systemd timer (After=weekly-backup.service)
**Method**: `rsync --files-from /mnt/backup/manifest.txt` to `synology.viktorbarzin.lan:/Backup/Viki/pve-backup/`
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` (full sync, removes deleted files)
Flow:
1. Take new snapshot: `zfs snapshot main@cloudsync-new`
2. If previous snapshot exists: `zfs diff -F -H main@cloudsync-prev main@cloudsync-new`
3. Filter output:
- Keep only `type=F` (files, not directories)
- Apply excludes (clickhouse, loki, prometheus, etc.)
- Write to `/tmp/cloudsync_copy_files.txt`
4. Run `rclone copy --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse`
5. Rotate snapshots: `zfs destroy cloudsync-prev`, `zfs rename cloudsync-new cloudsync-prev`
**Why fast**: Only files listed in the manifest (generated by weekly-backup) are transferred, so rsync never scans the whole backup tree; `--no-implied-dirs` also skips recreating parent directory metadata.
**Why fast**: Only changed files are transferred. ZFS diff is instant (metadata scan). `--no-traverse` skips SFTP directory scan.
**Destination**: `Synology/Backup/Viki/pve-backup/` mirrors sda `/mnt/backup/` structure:
- `pvc-data/<YYYY-WW>/` — 4 weekly PVC file backups
- `nfs-mirror/` — latest DB dumps
- `pfsense/<YYYY-WW>/` — 4 weekly pfSense backups
- `pve-config/` — latest PVE config
**Fallback**: If no previous snapshot or >100k changed files → falls back to full `find` command.
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
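Conceptually the sync reduces to one rsync invocation per mode. The sketch below uses assumed flags; the authoritative version is `infra/scripts/offsite-sync-backup`:
```bash
# Hypothetical sketch; the real script's flags may differ.
MANIFEST=/mnt/backup/manifest.txt
DEST='Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/'
if [ "$(date +%d)" -le 7 ]; then
  # runs on Sundays, so day-of-month <= 7 means the 1st Sunday: full sync
  rsync -a --delete /mnt/backup/ "$DEST"
else
  # weekly: transfer only what the manifest lists (no full traversal)
  rsync -a --files-from="$MANIFEST" --no-implied-dirs /mnt/backup/ "$DEST"
fi
```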
**Weekly SYNC (Sunday 09:00)**:
TrueNAS Cloud Sync Task 1 runs `rclone sync` which:
- Mirrors source → destination (removes deleted files on destination)
- Full directory traversal (~30-60 min)
- Ensures offsite is clean (no orphaned files from renamed paths)
**Why both methods**:
- Incremental: Fast recovery for recent changes (seconds to minutes)
- Full sync: Cleanup pass to handle deletions, renames, edge cases
#### Path 2: TrueNAS Media (Cloud Sync)
**Task**: TrueNAS Cloud Sync Task 1 runs `rclone sync` Monday 09:00
**Source**: `/mnt/main/` (NFS pool on TrueNAS)
**Destination**: `sftp://192.168.1.13/Backup/Viki/truenas`
**Scope**: Media libraries only (Immich ~800GB, audiobookshelf, servarr, navidrome music)
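To spot-check the offsite copy from TrueNAS (remote name `synology` assumed from the rclone config referenced elsewhere in this doc):
```bash
ssh root@10.0.10.15 rclone lsd synology:/Backup/Viki/truenas
```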
### Excludes (both incremental and full sync)
**Excludes** (Cloud Sync configured to skip):
- `clickhouse/**` (2.47M files, regenerable)
- `loki/**` (68K files, regenerable)
- `prometheus/**` (covered by monthly app backup)
- `frigate/**` (ephemeral recordings)
- `audiblez/**`, `ebook2audiobook/**` (regenerable)
- `ollama/**` (chat history, low value)
- `real-estate-crawler/**` (regenerable)
- `crowdsec/**` (regenerable)
- `servarr/downloads/**` (transient)
- `ytldp/**` (replaceable)
- `iscsi/**`, `iscsi-snaps/**` (raw zvols, backed at app level)
- `*-backup/**` (already mirrored via Path 1)
| Pattern | Reason | File count |
|---------|--------|-----------|
| `clickhouse/**` | Regenerable logs/metrics | 2.47M files |
| `loki/**` | Regenerable logs | 68K files |
| `iocage/**` | Legacy FreeBSD jails (unused) | 96K files |
| `frigate/**` | Ephemeral recordings/clips, trivial config | 57K+ files |
| `audiblez/**` | Generated audiobooks, regenerable from source ebooks | — |
| `ebook2audiobook/**` | Same service as audiblez, second volume | — |
| `ollama/**` | UI data (chat history/settings), low value | — |
| `real-estate-crawler/**` | Scraped property data, regenerable by re-crawling | — |
| `prometheus/**` | Covered by monthly app backup | Large TSDB |
| `crowdsec/**` | Regenerable threat intelligence | — |
| `servarr/downloads/**` | Transient download staging | — |
| `iscsi/**`, `iscsi-snaps/**` | Raw zvols, backed at app level | — |
| `ytldp/**` | YouTube downloads (replaceable) | — |
| `*.log` | Log files (regenerable) | — |
| `post` | Transient POST data | — |
### iSCSI Backup Architecture
iSCSI zvols are raw block devices exported to K8s nodes. TrueNAS cannot read the filesystem inside a zvol.
**Protection strategy**:
- **Layer 1**: ZFS snapshots cover zvols automatically (block-level)
- **Layer 2**: Application CronJobs inside pods dump data to NFS paths
- **Layer 3**: NFS paths sync offsite
**Current coverage**:
| Service | Storage | Layer 2 Backup | Offsite |
|---------|---------|----------------|---------|
| PostgreSQL CNPG (12 DBs) | iSCSI | ✓ daily | ✓ |
| MySQL InnoDB (7 DBs) | iSCSI | ✓ daily | ✓ |
| Vault | iSCSI | ✓ weekly | ✓ |
| Vaultwarden | iSCSI | ✓ 6h + integrity | ✓ |
| Redis | iSCSI | ✓ weekly | ✓ |
| plotting-book | iSCSI | ✓ weekly | ✓ |
**Convention**: Any new iSCSI-backed app MUST add a backup CronJob writing to `/mnt/main/<app>-backup/` in its Terraform stack.
**Uncovered (acceptable risk)**:
- Prometheus (disposable metrics, monthly TSDB backup for long-term trends)
- Loki (disposable logs)
### iSCSI Hardening
To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes:
| Setting | Default | Hardened | Impact |
|---------|---------|----------|--------|
| `node.session.timeo.replacement_timeout` | 120s | 300s | Time before declaring session dead |
| `node.conn[0].timeo.noop_out_interval` | 5s | 10s | Keepalive interval |
| `node.conn[0].timeo.noop_out_timeout` | 5s | 15s | Keepalive timeout |
| `node.conn[0].iscsi.HeaderDigest` | None | CRC32C,None | Error detection |
| `node.conn[0].iscsi.DataDigest` | None | CRC32C,None | Error detection |
**Applied to**: All 5 K8s nodes (k8s-master, k8s-node1-4) on 2026-03-23.
**Persistence**: Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`) so new nodes get these settings automatically.
**Why needed**: Default 120s timeout is too aggressive. Brief network hiccup (5-10s) can trigger failover, causing SQLite to see incomplete writes → corruption. 300s timeout tolerates longer blips.
**Monitoring**: Existing `CloudSyncStale`, `CloudSyncNeverRun`, `CloudSyncFailing` alerts still apply.
## Configuration
@@ -338,21 +360,25 @@ To prevent SQLite corruption from transient network disruptions, iSCSI initiator
| Path | Purpose |
|------|---------|
| `/root/cloudsync-copy.sh` | TrueNAS: incremental sync script |
| `/var/log/cloudsync-copy.log` | TrueNAS: sync script log output |
| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
| `/usr/local/bin/weekly-backup` | PVE host: PVC file copy + NFS mirror + pfSense + manifest |
| `/usr/local/bin/offsite-sync-backup` | PVE host: rsync to Synology |
| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
| `/mnt/backup/manifest.txt` | Generated by weekly-backup, consumed by offsite-sync |
| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
| `/etc/systemd/system/weekly-backup.timer` | Sunday 05:00 (file backup) |
| `/etc/systemd/system/offsite-sync-backup.timer` | Sunday 08:00 (offsite sync) |
| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
| `stacks/vault/` | Terraform: Vault backup CronJob |
| `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
| `stacks/monitoring/` | Terraform: CloudSync monitor, Prometheus backup |
| `modules/create-template-vm/cloud_init.yaml` | iSCSI hardening params for new nodes |
| `/etc/iscsi/iscsid.conf` | K8s nodes: iSCSI initiator config |
| `stacks/monitoring/` | Terraform: Prometheus alerts |
### Vault Paths
| Path | Contents |
|------|----------|
| `secret/viktor/truenas_api_key` | TrueNAS API key for CloudSync monitor |
| `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access |
| `secret/viktor/pfsense_api_key` | pfSense API key + secret for config backup |
### Terraform Stacks
@@ -361,27 +387,46 @@ Each backup CronJob is defined in the application's stack:
- Vault: `stacks/vault/backup.tf`
- Vaultwarden: `stacks/vaultwarden/backup.tf`
- etcd: `stacks/platform/etcd-backup.tf`
- Prometheus: `stacks/monitoring/prometheus-backup.tf`
## Decisions & Rationale
### Why 3 Layers?
### Why 3-2-1 Strategy?
**Layer 1 (ZFS snapshots)**:
**3 copies**:
- Live PVCs (zero RTO for recent data)
- sda local backup (fast recovery without network)
- Synology offsite (site-level disaster protection)
**2 media types**:
- sdc RAID1 HDD thin pool (live PVCs and VM disks)
- sda RAID1 SAS (dedicated backup disk, independent of live storage)
**1 offsite**:
- Protection against fire, theft, catastrophic hardware failure
- Weekly RPO acceptable for offsite (daily/weekly app backups reduce exposure)
### Why File-Level + Block-Level Snapshots?
**LVM snapshots** (Layer 1):
- Near-instant (<1s), zero overhead
- Point-in-time recovery for entire datasets
- BUT: Cannot restore individual database records, no offsite protection
- Point-in-time recovery for entire PVCs
- BUT: Cannot restore individual files, no offsite protection, 7-day retention
**Layer 2 (App backups)**:
- Granular restore (single DB, single table)
- Database-native tools (pg_dump, mysqldump) produce portable backups
- BUT: Higher overhead (CPU, I/O), longer RPO (daily/weekly)
**File-level backup** (Layer 2):
- Can restore single files or directories
- Offsite-compatible (rsync)
- Longer retention (4 weeks local, unlimited offsite)
- BUT: Slower RTO (rsync), higher storage overhead
**Layer 3 (Offsite)**:
- Protection against site-level disaster (fire, theft, catastrophic hardware failure)
- BUT: 6h RPO (incremental), connectivity dependency
Both together provide flexibility: fast local rollback for recent changes, granular recovery for older data.
All three together provide defense-in-depth.
### Why Dedicated Backup Disk (sda)?
**Isolation**: If sdc fails (thin pool corruption, controller failure), sda is independent (different disk, different VG).
**Performance**: Backup I/O doesn't compete with live PVC I/O.
**Simplicity**: Single mount point (`/mnt/backup/`) for all backup data, easy to monitor disk usage.
### Why Not Velero/Longhorn Backup?
@@ -390,25 +435,25 @@ Evaluated K8s-native backup solutions (Velero, Longhorn):
- **Longhorn**: High overhead (replicas, snapshots in-cluster), no offsite by default
**Current approach wins** because:
- Leverages existing ZFS infrastructure (already running TrueNAS)
- Leverages existing Proxmox LVM infrastructure (already running)
- Database-native backups (pg_dump/mysqldump) are battle-tested
- Simple restore procedures (documented runbooks)
- Lower resource overhead (no in-cluster replicas)
### Why Hybrid Incremental + Full Sync?
**Incremental alone** is risky:
**Incremental alone** (rsync --files-from) is risky:
- Deleted files on source never deleted on destination
- Renamed paths create duplicates
- No cleanup of orphaned snapshots
- No cleanup of orphaned files
**Full sync alone** is slow:
- 30-60 min per run
- High network/CPU on both ends
- 6h RPO → 12h if a sync fails
**Full sync alone** (rsync --delete) is slow:
- 30-60 min per run (all files scanned)
- 7d RPO → 14d if a sync fails
**Hybrid approach**:
- Fast incremental every 6h (sub-minute runtime)
- Weekly full sync for cleanup (tolerates longer runtime)
- Fast incremental weekly (sub-5min runtime via manifest)
- Monthly full sync for cleanup (tolerates longer runtime)
### Why 6h Vaultwarden Backup vs Daily for Others?
@@ -424,6 +469,56 @@ Other services (MySQL, PostgreSQL):
## Troubleshooting
### LVM Snapshot Restore Issues
See `docs/runbooks/restore-lvm-snapshot.md`.
### Weekly Backup Failing
**Symptom**: `WeeklyBackupStale` or `WeeklyBackupFailing` alert
**Diagnosis**:
```bash
ssh root@192.168.1.127
systemctl status weekly-backup.service
journalctl -u weekly-backup.service --since "7 days ago"
df -h /mnt/backup
```
**Common causes**:
- Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`)
- LV mount failed (check `lvs pve`, `dmesg | grep backup`)
- NFS mount failed (check `showmount -e 10.0.10.15`)
**Fix**:
1. If disk full: Clean up old weekly versions manually, adjust retention
2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup`
3. If NFS failed: Check TrueNAS availability, verify exports
4. Manually trigger: `systemctl start weekly-backup.service`
### Offsite Sync Failing
**Symptom**: `OffsiteBackupSyncStale` or `OffsiteBackupSyncFailing` alert
**Diagnosis**:
```bash
ssh root@192.168.1.127
systemctl status offsite-sync-backup.service
journalctl -u offsite-sync-backup.service --since "7 days ago"
wc -l /mnt/backup/manifest.txt # verify manifest exists and is non-empty
```
**Common causes**:
- Synology NAS unreachable (network, SFTP down)
- SSH key auth failed (permissions, expired key)
- Manifest missing (weekly-backup failed)
**Fix**:
1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13`
2. Verify SSH key: `ssh -i /root/.ssh/synology_backup root@192.168.1.13`
3. Verify manifest exists: `ls -lh /mnt/backup/manifest.txt`
4. Manually trigger: `systemctl start offsite-sync-backup.service`
### PostgreSQL Backup Stale Alert
**Symptom**: `PostgreSQLBackupStale` firing in Prometheus
@@ -444,29 +539,6 @@ kubectl logs -n dbaas job/postgresql-backup-<timestamp>
2. If NFS: Verify mount on worker node, restart NFS server if needed
3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas`
### CloudSync Stale/Failing
**Symptom**: `CloudSyncStale` or `CloudSyncFailing` alert
**Diagnosis**:
```bash
# SSH to TrueNAS
ssh root@10.0.10.15
cat /var/log/cloudsync-copy.log
zfs list -t snapshot | grep cloudsync
```
**Common causes**:
- Synology NAS unreachable (network, SFTP down)
- ZFS diff failed (snapshot deleted manually)
- rclone error (quota, permission)
**Fix**:
1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13`
2. Verify snapshots exist: `zfs list -t snapshot | grep cloudsync`
3. Manually run: `/root/cloudsync-copy.sh` (check output)
4. Check rclone config: `rclone ls synology:/Backup/Viki/truenas`
### Vaultwarden Integrity Check Failing
**Symptom**: `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0`
@@ -480,46 +552,58 @@ kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "
**Recovery**:
1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden`
2. Restore from latest backup:
```bash
# Find latest backup
ls -lh /mnt/main/vaultwarden-backup/
# Copy to pod volume
kubectl cp /mnt/main/vaultwarden-backup/db-<latest>.sqlite \
vaultwarden/vaultwarden-0:/data/db.sqlite3
```
2. Restore from latest backup (see `restore-vaultwarden.md`)
3. Verify integrity on restored DB
4. Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden`
### iSCSI Session Drops Causing Backup Failures
### pfSense Backup Failing
**Symptom**: Backup CronJob fails with "I/O error" or "Transport endpoint not connected"
**Symptom**: `PfsenseBackupStale` alert (if implemented)
**Diagnosis**:
```bash
# On K8s node
iscsiadm -m session
dmesg | grep -i iscsi
journalctl -u iscsid | tail -50
ssh root@192.168.1.127
systemctl status weekly-backup.service | grep -A5 pfsense
```
**Common causes**:
- API key expired/invalid
- SSH auth failed (password changed, key rejected)
- pfSense unreachable
**Fix**:
1. Verify hardened timeouts applied: `iscsiadm -m node -o show | grep -E 'replacement_timeout|noop_out'`
2. If defaults: Apply hardening:
```bash
iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15
iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None
iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None
```
3. Restart session: `iscsiadm -m node -u && iscsiadm -m node -l`
1. Verify API key: `curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <key>"`
2. Verify SSH: `ssh root@pfsense.viktorbarzin.me`
3. Update credentials in Vault `secret/viktor/pfsense_api_key`
### Backup Disk Full
**Symptom**: `BackupDiskFull` alert, `df -h /mnt/backup` >85%
**Fix**:
```bash
ssh root@192.168.1.127
# Check space usage by component
du -sh /mnt/backup/pvc-data/*
du -sh /mnt/backup/pfsense/*
du -sh /mnt/backup/nfs-mirror
# Clean up old weekly versions (keep latest 2)
find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
```
### Missing Backup for New Service
**Symptom**: Added new service using iSCSI storage, no backup exists
**Symptom**: Added new service using proxmox-lvm storage, no backup exists
**Fix**: Add backup CronJob in service's Terraform stack
**Fix**: The service is automatically covered by:
1. **LVM snapshots** (if not in dbaas/monitoring namespace) — automatic, no config needed
2. **Weekly file backup** — automatic, no config needed
**If the service has a database that needs app-level dumps**:
Add backup CronJob in service's Terraform stack (see template below).
**Template**:
```hcl
@@ -541,7 +625,7 @@ resource "kubernetes_cron_job_v1" "backup" {
args = [
<<-EOT
TIMESTAMP=$(date +%Y%m%d)
# Dump command here
# Dump command here (sqlite3 .backup, pg_dump, etc.)
find /backup -mtime +30 -delete
EOT
]
@@ -594,17 +678,26 @@ module "nfs_backup" {
│ VaultBackupStale > 8d since last success │
│ VaultwardenBackupStale > 8d since last success │
│ RedisBackupStale > 8d since last success │
│ PrometheusBackupStale > 32d since last success │
│ PlottingBookBackupStale > 8d since last success │
│ CloudSyncStale > 8d since last success │
│ CloudSyncNeverRun task never completed │
│ CloudSyncFailing task in error state │
│ VaultwardenIntegrityFail integrity_ok == 0 │
│ LVMSnapshotStale > 24h since last snapshot │
│ LVMSnapshotFailing snapshot creation failed │
│ LVMThinPoolLow < 15% free space in thin pool │
│ WeeklyBackupStale > 8d since last success │
│ WeeklyBackupFailing backup script exited non-zero │
│ PfsenseBackupStale > 8d since last success │
│ OffsiteBackupSyncStale > 8d since last success │
│ BackupDiskFull > 85% usage on /mnt/backup │
└────────────────────────────────────────────────────────────────┘
```
**Metrics sources**:
- Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion
- LVM snapshot script: Pushes `lvm_snapshot_last_success_timestamp`, `lvm_snapshot_count`, `lvm_thin_pool_free_percent`
- Weekly backup script: Pushes `backup_weekly_last_success_timestamp`, `backup_disk_usage_percent`
- Offsite sync script: Pushes `offsite_backup_sync_last_success_timestamp`
- CloudSync monitor: Queries TrueNAS API every 6h, pushes `cloudsync_last_success_timestamp`
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
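For the host-side scripts, pushing a metric is a one-liner. A sketch assuming the NodePort 30091 path described above, where `<node-ip>` is any K8s node:
```bash
cat <<EOF | curl -s --data-binary @- "http://<node-ip>:30091/metrics/job/weekly-backup"
backup_weekly_last_success_timestamp $(date +%s)
backup_disk_usage_percent $(df --output=pcent /mnt/backup | tail -1 | tr -dc '0-9')
EOF
```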
@@ -614,36 +707,45 @@ module "nfs_backup" {
## Service Protection Matrix
| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage |
|---------|:-------------:|:-------------:|:-----------------:|---------|
| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage |
|---------|:------------------:|:----------------:|:----------:|:-------:|---------|
| **Databases** |
| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | iSCSI |
| PostgreSQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
| MySQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
| **Critical State** |
| Vault | ✓ | ✓ weekly | ✓ | iSCSI |
| etcd | ✓ | ✓ weekly | ✓ | local disk |
| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI |
| Redis | ✓ | ✓ weekly | ✓ | iSCSI |
| **Applications** |
| Prometheus | ✓ | ✓ monthly | excluded | NFS |
| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI |
| Immich | ✓ | — | ✓ | NFS |
| Forgejo | ✓ | — | ✓ | NFS |
| Paperless-ngx | ✓ | — | ✓ | NFS |
| Nextcloud | ✓ | — | ✓ | NFS |
| **Other NFS services** | ✓ | — | ✓ | NFS |
| Vault | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| etcd | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| Vaultwarden | ✓ | ✓ | ✓ 6h + integrity | ✓ | proxmox-lvm |
| Redis | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
| **Applications (65 proxmox-lvm PVCs)** |
| Prometheus | — | — | — | excluded | proxmox-lvm |
| Nextcloud | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Calibre-Web | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Forgejo | ✓ | ✓ | — | ✓ | proxmox-lvm |
| FreshRSS | ✓ | ✓ | — | ✓ | proxmox-lvm |
| ActualBudget | ✓ | ✓ | — | ✓ | proxmox-lvm |
| NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
| **Media (NFS)** |
| Immich (~800GB) | — | — | — | ✓ | NFS |
| Audiobookshelf | — | — | — | ✓ | NFS |
| Servarr | — | — | — | ✓ | NFS |
| Navidrome | — | — | — | ✓ | NFS |
**Legend**:
- ✓ = Protected at this layer
- — = Not needed (simple file storage, ZFS snapshots sufficient)
- — = Not needed (other layers cover it, or data is regenerable/disposable)
- excluded = Too large/regenerable, not worth offsite bandwidth
**Note**: NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency requirements).
**Note**: All 65 proxmox-lvm PVCs get LVM snapshots (except dbaas+monitoring = 3 PVCs) + file-level backup (except dbaas+monitoring). NFS-backed media relies on TrueNAS Cloud Sync for offsite.
## Recovery Procedures
Detailed runbooks in `docs/runbooks/`:
- **`restore-lvm-snapshot.md`** — Instant rollback of a PVC using LVM snapshot (RTO <5 min)
- **`restore-pvc-from-backup.md`** — Restore a PVC from sda file backup (when snapshots expired)
- **`restore-postgresql.md`** — Restore individual database or full cluster from pg_dumpall backup
- **`restore-mysql.md`** — Restore MySQL databases from mysqldump backup
- **`restore-vault.md`** — Restore Vault from raft snapshot
@@ -651,7 +753,9 @@ Detailed runbooks in `docs/runbooks/`:
- **`restore-etcd.md`** — Restore etcd cluster from snapshot
- **`restore-full-cluster.md`** — Disaster recovery: rebuild cluster from offsite backups
**RTO estimates** (tested 2026-03-23):
**RTO estimates**:
- LVM snapshot rollback: <5 min (instant swap)
- File-level restore from sda: <15 min (depends on PVC size)
- Single PostgreSQL database: <5 min
- Full MySQL cluster: <15 min
- Vault: <10 min
@@ -661,7 +765,7 @@ Detailed runbooks in `docs/runbooks/`:
## Related
- **Architecture**: `docs/architecture/storage.md` (NFS/iSCSI storage layer)
- **Architecture**: `docs/architecture/storage.md` (NFS/Proxmox storage layer)
- **Reference**: `.claude/reference/service-catalog.md` (which services need backups)
- **Runbooks**: `docs/runbooks/restore-*.md` (step-by-step recovery procedures)
- **Monitoring**: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions)

@ -1,14 +1,16 @@
# Storage Architecture
Last updated: 2026-04-03
Last updated: 2026-04-06
## Overview
The cluster uses two storage backends: **Proxmox CSI** for database block storage and **TrueNAS NFS** for application data.
**Block storage (Proxmox CSI)**: 13 PVCs for databases (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage. This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
**Block storage (Proxmox CSI)**: 65 PVCs for databases and stateful apps (CNPG PostgreSQL, MySQL InnoDB, Redis, Vaultwarden, Prometheus, Nextcloud, Calibre-Web, Forgejo, FreshRSS, ActualBudget, NovelApp, Headscale, Uptime Kuma, etc.) use `StorageClass: proxmox-lvm`, which provisions thin LVs directly from the Proxmox host's `local-lvm` storage (sdc, 10.7TB RAID1 HDD thin pool). This eliminates the previous double-CoW (ZFS + LVM-thin) path that caused 56 ZFS checksum errors.
**NFS storage (TrueNAS)**: ~100 NFS shares for application data, media, configs, and backup targets continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`.
**NFS storage (TrueNAS)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and legacy app data continue to use TrueNAS ZFS at `10.0.10.15` via `StorageClass: nfs-truenas`.
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, NFS mirrors, pfSense backups, and PVE config. Independent of live storage (sdc).
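To verify this layout on the PVE host (read-only checks; the device-mapper name follows the usual `<vg>-<lv>` convention):
```bash
ssh root@192.168.1.127
lvs backup              # expect LV "data" in VG "backup"
findmnt /mnt/backup     # ext4 on /dev/mapper/backup-data
df -h /mnt/backup       # ~1.1TB capacity
```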
**Migration (2026-04-02)**: All iSCSI block volumes were migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver is deprecated and pending removal.
@@ -16,17 +18,20 @@ The cluster uses two storage backends: **Proxmox CSI** for database block storag
```mermaid
graph TB
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
end
subgraph TrueNAS["TrueNAS (10.0.10.15)<br/>VMID 9000, 16c/16GB"]
ZFS_Main["ZFS Pool: main<br/>1.64 TiB<br/>32G + 7x256G + 1T disks"]
ZFS_SSD["ZFS Pool: ssd<br/>~256GB SSD<br/>Immich ML, PostgreSQL hot data"]
ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/&lt;service&gt;"]
ZFS_Main --> iSCSI_Datasets["iSCSI Datasets<br/>main/iscsi (zvols)<br/>main/iscsi-snaps"]
ZFS_Main --> NFS_Datasets["NFS Datasets<br/>~100 shares<br/>main/&lt;service&gt;<br/>Media + backup targets"]
NFS_Datasets --> NFS_Exports["NFS Exports<br/>managed by secrets/nfs_exports.sh"]
iSCSI_Datasets --> iSCSI_Targets["iSCSI Targets<br/>SSH-managed via democratic-csi"]
ZFS_SSD --> SSD_Data["Immich ML models<br/>PostgreSQL CNPG"]
ZFS_SSD --> SSD_Data["Immich ML models"]
end
subgraph K8s["Kubernetes Cluster"]

@ -1,5 +1,7 @@
# Full Cluster Rebuild
Last updated: 2026-04-06
## When to Use
- Complete cluster failure (all VMs lost)
- etcd corruption requiring full rebuild
@@ -7,7 +9,8 @@
## Prerequisites
- Proxmox host (192.168.1.127) accessible
- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
- Git repo with infra code
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
- Vault unseal keys (emergency kit)
@@ -41,15 +44,55 @@ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
### Phase 3: Storage Layer
```bash
# 6. Deploy CSI drivers (NFS + iSCSI)
# 6. Deploy CSI drivers (NFS + Proxmox)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/iscsi-csi
scripts/tg apply stacks/proxmox-csi
# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
```
### Phase 3.5: Restore PVC Data from sda Backup
After storage layer is deployed, restore PVC data from the sda backup disk:
```bash
# 8a. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 8b. For each critical PVC, restore files:
# Example: vaultwarden-data-proxmox
WEEK="2026-14" # Use most recent week
NAMESPACE="vaultwarden"
PVC_NAME="vaultwarden-data-proxmox"
# Find the PV LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
# Mount the LV
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
# Unmount
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
```
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via the web UI, or extract the full filesystem tar to recover custom scripts.
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (weekly-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
### Phase 4: Vault (secrets foundation)
```bash
# 8. Deploy Vault (see restore-vault.md for full procedure)
@@ -117,10 +160,11 @@ kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwa
## Dependency Graph
```
etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
                     ↑
        Restore data from NFS/Synology backups
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
                                    ↑
        Restore DB dumps from /mnt/backup/nfs-mirror or Synology/pve-backup
```
## Estimated Time

@@ -0,0 +1,159 @@
# Runbook: Restore PVC from LVM Thin Snapshot
Last updated: 2026-04-06
## When to Use
- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
- Fast recovery for data changed within the last 7 days
## Prerequisites
- SSH access to PVE host (192.168.1.127)
- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
- kubectl configured on PVE host (`/root/.kube/config`)
## Snapshot Retention
- **Daily snapshots**: Created at 03:00 via systemd timer
- **Retention**: 7 days (older snapshots automatically pruned)
- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces
**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
## Procedure
### 1. List Available Snapshots
```bash
ssh root@192.168.1.127 lvm-pvc-snapshot list
```
Output shows all snapshots with their original LV, age, and data divergence percentage.
### 2. Identify the PVC LV Name
Find the LV name for your PVC:
```bash
# From your workstation (with kubectl):
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
# The HANDLE column shows "local-lvm:<lv-name>"
```
### 3. Run the Restore
```bash
ssh root@192.168.1.127
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
```
The script will:
1. Look up the K8s PV/PVC/workload for the LV
2. Show a dry-run of all actions
3. Ask for confirmation (type `yes`)
4. Scale down the workload (Deployment or StatefulSet)
5. Rename the current LV to `<name>_pre_restore_<timestamp>`
6. Rename the snapshot LV to the original name
7. Scale the workload back up
8. Wait for pod to become Ready
### 4. Verify
```bash
# Check pod is running
kubectl get pods -n <namespace> -l app=<workload>
# Check the application is working correctly
# (service-specific verification)
```
### 5. Clean Up
Once you've verified the restore is correct, remove the pre-restore backup:
```bash
ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
```
## Manual Restore (if script fails)
If the automated restore fails, perform these steps manually:
```bash
# 1. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=0
# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
# 3. SSH to PVE host
ssh root@192.168.1.127
# 4. Verify LV is inactive
lvs -o lv_name,lv_active pve | grep <lv-name>
# 5. Rename LVs
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
lvrename pve <snapshot-lv> <original-lv>
# 6. Scale back up
kubectl scale deployment/<name> -n <ns> --replicas=1
```
## Database-Specific Notes
- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a very recent point-in-time that predates the last dump.
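Quick post-restore health checks, assuming the pod/deployment names used elsewhere in these runbooks (`mysql-cluster-0`, a CNPG `pg-cluster` pod, a `redis` deployment):
```bash
# MySQL: confirm InnoDB finished crash recovery ($ROOT_PWD as in restore-mysql.md)
kubectl exec -n dbaas mysql-cluster-0 -c mysql -- \
  mysql -uroot -p"$ROOT_PWD" -e 'SHOW ENGINE INNODB STATUS\G' | head -n 40
# PostgreSQL: returns false once WAL replay has completed
kubectl exec -n dbaas pg-cluster-1 -- psql -U postgres -c 'SELECT pg_is_in_recovery();'
# Redis: loading:0 means the RDB file has been loaded
kubectl exec -n redis deploy/redis -- redis-cli INFO persistence | grep -E '^(loading|rdb_last_bgsave_status)'
```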
## Alternative: Restore from sda Backup
If LVM snapshots are too old or missing (data lost >7 days ago), use the weekly file-level backup on sda:
**Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
**Retention**: 4 weekly versions (weeks 0-3)
### Procedure
```bash
# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Identify the PVC backup directory
ls -l /mnt/backup/pvc-data/2026-14/<namespace>/
# 3. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# 4. Mount the live PVC LV on PVE host
lvchange -ay pve/<pvc-lv-name>
mkdir -p /mnt/restore-temp
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
# 5. Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/
# 6. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/<pvc-lv-name>
kubectl scale deployment/<name> -n <ns> --replicas=1
```
See `restore-pvc-from-backup.md` for a detailed walkthrough.
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |


@ -1,5 +1,7 @@
# Restore MySQL (InnoDB Cluster)
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
@ -7,8 +9,9 @@
## Backup Location
- NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
- Size: ~11MB per dump
## Restore Procedure
@ -93,6 +96,39 @@ kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --p
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
```
## Alternative: Restore from sda Backup
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/mysql-backup/
# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
# Or mount the sda backup on a pod (ROOT_PWD per Prerequisites: cluster-secret in dbaas, key ROOT_PASSWORD):
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
kubectl run mysql-restore --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
-n dbaas
```
## Alternative: Restore from Synology (if PVE host is down)
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
```
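For step 3, one hedged option is pushing the newest dump to a node that can still reach the cluster (the host below is a placeholder):
```bash
# From the Synology shell, in the mysql-backup directory
LATEST=$(ls -t dump_*.sql.gz | head -1)
rsync -avP "$LATEST" root@<surviving-node>:/tmp/
# Then follow the standard restore procedure above using /tmp/$LATEST
```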
## Estimated Time
- Data restore: ~5 minutes (11MB dump)
- InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)


@ -1,5 +1,7 @@
# Restore PostgreSQL (CNPG)
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- CNPG operator running in the cluster
@ -8,8 +10,9 @@
## Backup Location
- NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
## Restore from pg_dumpall
@ -81,11 +84,39 @@ kubectl rollout restart deployment -n linkwarden
# ... repeat for all PG-dependent services (excluding trading — disabled)
```
## Alternative: Restore from sda Backup
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
# 3. Mount sda backup on a pod
PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
-n dbaas
```
## Alternative: Restore from Synology (if PVE host is down)
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
```
## Estimated Time
- Restore into existing cluster: ~10 minutes (depends on dump size)


@ -0,0 +1,231 @@
# Runbook: Restore PVC from sda File Backup
Last updated: 2026-04-06
## When to Use
- LVM snapshots are too old (>7 days) or missing
- Need to restore data from a specific week (up to 4 weeks back)
- LVM snapshot restore failed or snapshot is corrupt
- Granular file-level restore (not full PVC)
## Prerequisites
- SSH access to PVE host (192.168.1.127)
- kubectl configured (either on PVE host or your workstation)
- sda backup disk mounted at `/mnt/backup` on PVE host
## Backup Location
**Path**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
**Retention**: 4 weekly versions (weeks 0-3)
**Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
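As an illustration of what the `--link-dest` dedup implies on disk, the per-PVC rsync presumably looks like this. A sketch, not the actual `weekly-backup` source; names are hypothetical:
```bash
# Hypothetical namespace/PVC; /mnt/snap-mount is the ro-mounted thin snapshot
NS=vaultwarden; PVC=vaultwarden-data-proxmox
rsync -a \
  --link-dest="/mnt/backup/pvc-data/2026-14/$NS/$PVC/" \
  /mnt/snap-mount/ \
  "/mnt/backup/pvc-data/2026-15/$NS/$PVC/"
# Files unchanged since 2026-14 become hardlinks (same inode, no extra space);
# only changed files are written, so 4 weeks cost roughly 1 full copy plus deltas.
```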
## Procedure
### 1. List Available Backup Weeks
```bash
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# Output shows week directories like:
# 2026-13
# 2026-14
# 2026-15
# 2026-16
```
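The week directories appear to use ISO week numbering (an assumption based on the `<YYYY-WW>` pattern), so the current week's name can be computed directly:
```bash
date +%G-%V   # ISO year-week, e.g. 2026-15; the newest backup dir should match
```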
### 2. Identify the PVC Backup Directory
```bash
# List namespaces in a specific week
ls -l /mnt/backup/pvc-data/2026-14/
# List PVCs in a namespace
ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
# Example: vaultwarden-data-proxmox/
```
### 3. Find the Live PVC LV Name
From your workstation (or PVE host with kubectl):
```bash
# Get the PV volumeHandle (contains LV name)
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
# Example output:
# pvc-abc123 vaultwarden-data-proxmox vaultwarden local-lvm:vm-999-pvc-abc123
# ↑ this is the LV name
```
### 4. Scale Down the Workload
```bash
# Find the workload using the PVC
kubectl get deployment,statefulset -n <namespace> -o json | jq '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>") | .metadata.name'
# Scale down (Deployment example)
kubectl scale deployment/<workload-name> -n <namespace> --replicas=0
# Or StatefulSet:
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=0
# Wait for pod to terminate
kubectl wait --for=delete pod -l app=<workload-name> -n <namespace> --timeout=120s
```
### 5. Mount the Live PVC LV
```bash
ssh root@192.168.1.127
# Activate the LV (should already be inactive after pod termination)
lvchange -ay pve/<lv-name>
# Create mount point
mkdir -p /mnt/restore-temp
# Mount the LV
mount /dev/pve/<lv-name> /mnt/restore-temp
```
### 6. Restore from Backup
**Option A: Full PVC restore (replace all data)**
```bash
# This will delete existing files in the PVC and replace with backup
rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/
# Example:
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
```
**Option B: Selective file restore (merge)**
```bash
# Restore specific files or directories without deleting existing data
rsync -avP /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
# Example: Restore only db.sqlite3
rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
```
### 7. Unmount and Deactivate LV
```bash
# Unmount
umount /mnt/restore-temp
# Deactivate LV (optional, kubelet will activate it when pod starts)
lvchange -an pve/<lv-name>
```
### 8. Scale Up the Workload
```bash
# From your workstation:
kubectl scale deployment/<workload-name> -n <namespace> --replicas=1
# Or StatefulSet:
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=1
# Wait for pod to be ready
kubectl wait --for=condition=Ready pod -l app=<workload-name> -n <namespace> --timeout=120s
```
### 9. Verify
```bash
# Check pod logs for startup errors
kubectl logs -n <namespace> -l app=<workload-name> --tail=20
# Test application functionality (service-specific)
curl -s -o /dev/null -w "%{http_code}" https://<service>.viktorbarzin.me/
```
## Example: Full Vaultwarden Restore
```bash
# 1. List backups
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Scale down
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 3. Find LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
# Output: pvc-xyz vaultwarden-data-proxmox local-lvm:vm-105-pvc-xyz456
# 4. Mount and restore
ssh root@192.168.1.127
lvchange -ay pve/vm-105-pvc-xyz456
mkdir -p /mnt/restore-temp
mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
umount /mnt/restore-temp
lvchange -an pve/vm-105-pvc-xyz456
# 5. Scale up
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 6. Test
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
```
## Database-Specific Notes
For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
- You need a very recent point-in-time that predates the last dump
- The database dump is corrupt or missing
- You're restoring a non-SQL database (e.g., Redis RDB)
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| "LV is active" during mount | Workload pod still running or stuck | `kubectl get pods -A | grep <pvc-name>`, delete pod if stuck |
| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check `weekly-backup` script EXCLUDE_NAMESPACES |
| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<ns>/<pvc>/` |
| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check pod again |
| Backup week missing | Weekly backup hasn't run for that week | Check `systemctl status weekly-backup.service`, verify retention |
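For the "not backed up" row, a quick way to inspect the exclusion list without opening the script (assumes the variable name shown in the table):
```bash
ssh root@192.168.1.127 grep -n EXCLUDE_NAMESPACES /usr/local/bin/weekly-backup
```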
## Restore from Synology (if PVE host sda is unavailable)
If the PVE host sda backup disk is unavailable or corrupt:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/pve-backup/pvc-data/
# 3. Find the PVC backup
ls -l 2026-14/<namespace>/<pvc-name>/
# 4. Copy to a temporary location accessible from cluster
# Option A: Restore sda on PVE host first
# Option B: rsync to a surviving node's local disk
# Option C: Mount Synology NFS share on a pod (if network accessible)
```
## Estimated Time
- Small PVC (<1GB): ~5 minutes
- Medium PVC (1-10GB): ~10-15 minutes
- Large PVC (>10GB): ~30+ minutes (depends on size and network)
## Related
- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
- **`docs/architecture/backup-dr.md`** — Backup architecture overview


@ -1,5 +1,7 @@
# Restore Vault (Raft)
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
@ -8,8 +10,9 @@
## Backup Location
- NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
- Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
## CRITICAL: Vault is a dependency for many services
@ -88,6 +91,45 @@ kubectl rollout restart deployment -n external-secrets
kubectl get externalsecrets -A | grep -v "SecretSynced"
```
## Alternative: Restore from sda Backup
If TrueNAS NFS is unavailable but the PVE host is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest snapshot
ls -lt /mnt/backup/nfs-mirror/vault-backup/
# 3. From your workstation: port-forward to Vault, copy the snapshot down, and restore
kubectl port-forward svc/vault-active -n vault 8200:8200 &
export VAULT_ADDR=http://127.0.0.1:8200
export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
# Copy snapshot from PVE host to local workstation, then restore
scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
```
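Before moving on, it is worth confirming raft health with the standard checks (VAULT_ADDR and VAULT_TOKEN as exported above):
```bash
vault status                      # expect Sealed=false and an active HA node
vault operator raft list-peers    # all vault pods listed as voters
kubectl get pods -n vault         # all pods Running and Ready
```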
## Alternative: Restore from Synology (if PVE host is down)
If both TrueNAS and PVE host are unavailable:
```bash
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
# 3. Copy snapshot to local workstation
scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
# 4. Restore via port-forward (same as above)
```
## Full Vault Rebuild (from zero)
If Vault needs to be rebuilt from scratch:
1. Comment out data sources + OIDC config in `stacks/vault/main.tf`


@ -1,5 +1,7 @@
# Restore Vaultwarden
Last updated: 2026-04-06
## Prerequisites
- `kubectl` access to the cluster
- Backup available on NFS at `/mnt/main/vaultwarden-backup/`
@ -7,8 +9,10 @@
## Backup Location
- NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
- Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
- PVC file backup (alternative): `/mnt/backup/pvc-data/<YYYY-WW>/vaultwarden/vaultwarden-data-proxmox/`
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
- Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
- Integrity check: Both source and backup are verified before/after each backup
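To repeat the integrity check by hand against any backup copy (a sketch; assumes `sqlite3` is installed wherever the file lives):
```bash
sqlite3 /mnt/backup/nfs-mirror/vaultwarden-backup/<YYYY_MM_DD_HH_MM>/db.sqlite3 'PRAGMA integrity_check;'
# Expected output: ok
```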
@ -69,6 +73,56 @@ Log in to the Vaultwarden web UI and verify:
- [ ] Attachments are accessible
- [ ] TOTP codes are generating correctly
## Alternative: Restore from PVC File Backup
If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
```bash
# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 2. Scale down Vaultwarden
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 3. Mount the live PVC LV on PVE host
# Find the LV name first (from a machine with kubectl):
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# 4. Restore from backup (pick a week)
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
# 5. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
```
## Alternative: Restore from sda NFS Mirror
If TrueNAS NFS is unavailable but PVE host is accessible:
```bash
# 1. SSH to PVE host
ssh root@192.168.1.127
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
# 3. Scale down Vaultwarden first (the PVC is RWO and must be free before another pod can mount it)
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
# 4. Mount the sda backup and the data PVC on a throwaway pod
BACKUP_DIR="YYYY_MM_DD_HH_MM"  # Set to desired backup
kubectl run vw-restore --rm -it --image=alpine \
  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/; cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; cp -a /backup/'$BACKUP_DIR'/sends /data/ 2>/dev/null; cp /backup/'$BACKUP_DIR'/config.json /data/ 2>/dev/null; echo Restore complete"]}],"nodeName":"k8s-master"}}' \
  -n vaultwarden
# 5. Scale back up
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
```
## Estimated Time
- Restore: ~5 minutes
- Verification: ~5 minutes