update backup/DR docs and runbooks for 3-2-1 architecture

- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00 · b345b086ef
commit b345b086ef
parent d5b0990ed1
10 changed files with 1051 additions and 332 deletions
--- a/docs/runbooks/restore-full-cluster.md
+++ b/docs/runbooks/restore-full-cluster.md
@@ -1,5 +1,7 @@
 # Full Cluster Rebuild

+Last updated: 2026-04-06
+
 ## When to Use
 - Complete cluster failure (all VMs lost)
 - etcd corruption requiring full rebuild
@@ -7,7 +9,8 @@

 ## Prerequisites
 - Proxmox host (192.168.1.127) accessible
-- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
+- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
+- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
 - Git repo with infra code
 - SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
 - Vault unseal keys (emergency kit)
@@ -41,15 +44,55 @@ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml

 ### Phase 3: Storage Layer
 ```bash
-# 6. Deploy CSI drivers (NFS + iSCSI)
+# 6. Deploy CSI drivers (NFS + Proxmox)
 scripts/tg apply stacks/nfs-csi
-scripts/tg apply stacks/iscsi-csi
+scripts/tg apply stacks/proxmox-csi

 # 7. Verify PVs are accessible
 kubectl get pv
 kubectl get pvc -A | grep -v Bound
 ```

+### Phase 3.5: Restore PVC Data from sda Backup
+
+After the storage layer is deployed, restore PVC data from the sda backup disk:
+
+```bash
+# 8a. List available backup weeks
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 8b. For each critical PVC, restore files:
+# Example: vaultwarden-data-proxmox
+WEEK="2026-14"  # Use most recent week
+NAMESPACE="vaultwarden"
+PVC_NAME="vaultwarden-data-proxmox"
+
+# Find the PV LV name
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep "$PVC_NAME"
+
+# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
+LV_NAME="vm-999-pvc-abc123"
+
+# Mount the LV
+lvchange -ay pve/$LV_NAME
+mkdir -p /mnt/restore-temp
+mount /dev/pve/$LV_NAME /mnt/restore-temp
+
+# Restore from backup
+rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
+
+# Unmount
+umount /mnt/restore-temp
+lvchange -an pve/$LV_NAME
+
+# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
+```
+
+**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via the web UI, or use the full filesystem tar to recover custom scripts.
+
+**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (weekly-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
+
 ### Phase 4: Vault (secrets foundation)
 ```bash
 # 8. Deploy Vault (see restore-vault.md for full procedure)
@@ -117,10 +160,11 @@ kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwa

 ## Dependency Graph
 ```
-etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
-                                                              ↓
-                                                        Restore data from
-                                                        NFS/Synology backups
+etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
+                                                                                          ↓
+                                                                                    Restore DB dumps from
+                                                                                    /mnt/backup/nfs-mirror
+                                                                                    or Synology/pve-backup
 ```

 ## Estimated Time
--- a/docs/runbooks/restore-lvm-snapshot.md
+++ b/docs/runbooks/restore-lvm-snapshot.md
@@ -0,0 +1,159 @@
+# Runbook: Restore PVC from LVM Thin Snapshot
+
+Last updated: 2026-04-06
+
+## When to Use
+
+- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
+- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
+- Fast recovery for data changed within the last 7 days
+
+## Prerequisites
+
+- SSH access to PVE host (192.168.1.127)
+- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
+- kubectl configured on PVE host (`/root/.kube/config`)
+
+## Snapshot Retention
+
+- **Daily snapshots**: Created at 03:00 via systemd timer
+- **Retention**: 7 days (older snapshots automatically pruned)
+- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces
+
+**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
+
+## Procedure
+
+### 1. List Available Snapshots
+
+```bash
+ssh root@192.168.1.127 lvm-pvc-snapshot list
+```
+
+Output shows all snapshots with their original LV, age, and data divergence percentage.
+
+### 2. Identify the PVC LV Name
+
+Find the LV name for your PVC:
+
+```bash
+# From your workstation (with kubectl):
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
+
+# The HANDLE column shows "local-lvm:<lv-name>"
+```
+
+### 3. Run the Restore
+
+```bash
+ssh root@192.168.1.127
+lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
+```
+
+The script will:
+1. Look up the K8s PV/PVC/workload for the LV
+2. Show a dry-run of all actions
+3. Ask for confirmation (type `yes`)
+4. Scale down the workload (Deployment or StatefulSet)
+5. Rename the current LV to `<name>_pre_restore_<timestamp>`
+6. Rename the snapshot LV to the original name
+7. Scale the workload back up
+8. Wait for pod to become Ready
+
+### 4. Verify
+
+```bash
+# Check pod is running
+kubectl get pods -n <namespace> -l app=<workload>
+
+# Check the application is working correctly
+# (service-specific verification)
+```
+
+### 5. Clean Up
+
+Once you've verified the restore is correct, remove the pre-restore backup:
+
+```bash
+ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
+```
+
+## Manual Restore (if script fails)
+
+If the automated restore fails, perform these steps manually:
+
+```bash
+# 1. Scale down the workload
+kubectl scale deployment/<name> -n <ns> --replicas=0
+# or for StatefulSets:
+kubectl scale statefulset/<name> -n <ns> --replicas=0
+
+# 2. Wait for pods to terminate
+kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
+
+# 3. SSH to PVE host
+ssh root@192.168.1.127
+
+# 4. Verify LV is inactive
+lvs -o lv_name,lv_active pve | grep <lv-name>
+
+# 5. Rename LVs
+lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
+lvrename pve <snapshot-lv> <original-lv>
+
+# 6. Scale back up (use statefulset/<name> for a StatefulSet)
+kubectl scale deployment/<name> -n <ns> --replicas=1
+```
+
+## Database-Specific Notes
+
+- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
+- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
+- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
+
+For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a point in time that the available dumps do not cover, such as one more recent than the last dump.
+
+## Alternative: Restore from sda Backup
+
+If the restore point you need is older than the 7-day snapshot retention, or the snapshots are missing, use the weekly file-level backup on sda:
+
+- **Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
+- **Retention**: 4 weekly versions (weeks 0-3)
+
+### Procedure
+
+```bash
+# 1. List available backup weeks
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 2. Identify the PVC backup directory
+ls -l /mnt/backup/pvc-data/2026-14/<namespace>/
+
+# 3. Scale down the workload
+kubectl scale deployment/<name> -n <ns> --replicas=0
+
+# 4. Mount the live PVC LV on PVE host
+lvchange -ay pve/<pvc-lv-name>
+mkdir -p /mnt/restore-temp
+mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
+
+# 5. Restore from backup
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/
+
+# 6. Unmount and scale up
+umount /mnt/restore-temp
+lvchange -an pve/<pvc-lv-name>
+kubectl scale deployment/<name> -n <ns> --replicas=1
+```
+
+See `restore-pvc-from-backup.md` for a detailed walkthrough.
+
+## Troubleshooting
+
+| Problem | Cause | Fix |
+|---------|-------|-----|
+| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
+| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
+| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
+| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |
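
The LV naming used in steps 2 and 5 of this runbook can be sketched as two small shell helpers. This is a minimal sketch, not taken from `lvm-pvc-snapshot`: the function names `lv_from_handle` and `pre_restore_name` are hypothetical, the handle is assumed to have the `local-lvm:<lv-name>` form shown above, and the timestamp format is assumed to match the `date +%Y%m%d_%H%M` call from the manual procedure.

```bash
# lv_from_handle HANDLE
# Strip everything up to the last ":" from a CSI volumeHandle.
# Assumes the "local-lvm:<lv-name>" format described in this runbook.
lv_from_handle() {
  printf '%s\n' "${1##*:}"
}

# pre_restore_name LV [TIMESTAMP]
# Build the name under which the current LV is parked before the
# snapshot takes its place. TIMESTAMP defaults to the current time.
pre_restore_name() {
  local lv ts
  lv="$1"
  ts="${2:-$(date +%Y%m%d_%H%M)}"
  printf '%s_pre_restore_%s\n' "$lv" "$ts"
}
```

With these in place, the rename in the manual procedure could be written as `lvrename pve "$LV" "$(pre_restore_name "$LV")"`, where `LV` comes from `lv_from_handle`.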
--- a/docs/runbooks/restore-mysql.md
+++ b/docs/runbooks/restore-mysql.md
@@ -1,5 +1,7 @@
 # Restore MySQL (InnoDB Cluster)

+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
@@ -7,8 +9,9 @@

 ## Backup Location
 - NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 14 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
+- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
 - Size: ~11MB per dump

 ## Restore Procedure
@@ -93,6 +96,39 @@ kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --p
 kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
 ```

+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/mysql-backup/
+
+# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
+# Or mount sda backup on a pod:
+kubectl run mysql-restore --rm -it --image=mysql \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'"$ROOT_PWD"'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
+  -n dbaas
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
+
+# 3. Copy dump to a temporary location accessible from cluster
+# (e.g., via rsync to a surviving node, or restore TrueNAS first)
+```
+
 ## Estimated Time
 - Data restore: ~5 minutes (11MB dump)
 - InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)
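
Before piping a mirrored dump into `mysql`, it may be worth confirming that the newest file is an intact gzip archive. The following is a minimal sketch under stated assumptions: `latest_valid_dump` is a hypothetical helper, not part of the existing backup tooling, and it relies on the `dump_YYYY_MM_DD_HH_MM.sql.gz` naming above so that the shell's sorted glob order is also chronological order. `gzip -t` only checks archive integrity; it does not validate the SQL inside.

```bash
# latest_valid_dump DIR
# Print the newest dump_*.sql.gz in DIR that passes `gzip -t`.
# Returns 1 if no dump in DIR passes the check.
latest_valid_dump() {
  local dir best f
  dir="$1"
  best=""
  for f in "$dir"/dump_*.sql.gz; do
    [ -f "$f" ] || continue              # glob matched nothing
    if gzip -t "$f" 2>/dev/null; then    # integrity test, no extraction
      best="$f"                          # later matches are newer
    fi
  done
  [ -n "$best" ] || return 1
  printf '%s\n' "$best"
}
```

On the PVE host this could be called as `latest_valid_dump /mnt/backup/nfs-mirror/mysql-backup`, and the printed file name substituted for the `dump_YYYY_MM_DD_HH_MM.sql.gz` placeholder in the restore command.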
--- a/docs/runbooks/restore-postgresql.md
+++ b/docs/runbooks/restore-postgresql.md
@@ -1,5 +1,7 @@
 # Restore PostgreSQL (CNPG)

+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - CNPG operator running in the cluster
@@ -8,8 +10,9 @@

 ## Backup Location
 - NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 14 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
+- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)

 ## Restore from pg_dumpall

@@ -81,11 +84,39 @@ kubectl rollout restart deployment -n linkwarden
 # ... repeat for all PG-dependent services (excluding trading — disabled)
 ```

-## Restore from Synology (if TrueNAS is down)
-1. SSH to Synology NAS (192.168.1.13)
-2. Find the replicated dataset: `zfs list | grep postgresql-backup`
-3. Mount or copy the backup file to a location accessible from the cluster
-4. Follow the restore procedure above
+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
+
+# 3. Mount sda backup on a pod
+PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
+
+kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'"$PGPASSWORD"'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
+  -n dbaas
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
+
+# 3. Copy dump to a temporary location accessible from cluster
+# (e.g., via rsync to a surviving node, or restore TrueNAS first)
+```

 ## Estimated Time
 - Restore into existing cluster: ~10 minutes (depends on dump size)
--- a/docs/runbooks/restore-pvc-from-backup.md
+++ b/docs/runbooks/restore-pvc-from-backup.md
@@ -0,0 +1,231 @@
+# Runbook: Restore PVC from sda File Backup
+
+Last updated: 2026-04-06
+
+## When to Use
+
+- The restore point you need is older than the 7-day LVM snapshot retention, or the snapshots are missing
+- Need to restore data from a specific week (up to 4 weeks back)
+- LVM snapshot restore failed or snapshot is corrupt
+- Granular file-level restore (not full PVC)
+
+## Prerequisites
+
+- SSH access to PVE host (192.168.1.127)
+- kubectl configured (either on PVE host or your workstation)
+- sda backup disk mounted at `/mnt/backup` on PVE host
+
+## Backup Location
+
+- **Path**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
+- **Retention**: 4 weekly versions (weeks 0-3)
+- **Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
+
+## Procedure
+
+### 1. List Available Backup Weeks
+
+```bash
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# Output shows week directories like:
+# 2026-13
+# 2026-14
+# 2026-15
+# 2026-16
+```
+
+### 2. Identify the PVC Backup Directory
+
+```bash
+# List namespaces in a specific week
+ls -l /mnt/backup/pvc-data/2026-14/
+
+# List PVCs in a namespace
+ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
+
+# Example: vaultwarden-data-proxmox/
+```
+
+### 3. Find the Live PVC LV Name
+
+From your workstation (or PVE host with kubectl):
+
+```bash
+# Get the PV volumeHandle (contains LV name)
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
+
+# Example output:
+# pvc-abc123  vaultwarden-data-proxmox  vaultwarden  local-lvm:vm-999-pvc-abc123
+#                                                                   ↑ this is the LV name
+```
+
+### 4. Scale Down the Workload
+
+```bash
+# Find the workload using the PVC
+kubectl get deployment,statefulset -n <namespace> -o json | jq '.items[] | select(.spec.template.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>") | .metadata.name'
+
+# Scale down (Deployment example)
+kubectl scale deployment/<workload-name> -n <namespace> --replicas=0
+
+# Or StatefulSet:
+kubectl scale statefulset/<workload-name> -n <namespace> --replicas=0
+
+# Wait for pod to terminate
+kubectl wait --for=delete pod -l app=<workload-name> -n <namespace> --timeout=120s
+```
+
+### 5. Mount the Live PVC LV
+
+```bash
+ssh root@192.168.1.127
+
+# Activate the LV (should already be inactive after pod termination)
+lvchange -ay pve/<lv-name>
+
+# Create mount point
+mkdir -p /mnt/restore-temp
+
+# Mount the LV
+mount /dev/pve/<lv-name> /mnt/restore-temp
+```
+
+### 6. Restore from Backup
+
+**Option A: Full PVC restore (replace all data)**
+
+```bash
+# This will delete existing files in the PVC and replace with backup
+rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/
+
+# Example:
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+```
+
+**Option B: Selective file restore (merge)**
+
+```bash
+# Restore specific files or directories without deleting existing data
+rsync -avP /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
+
+# Example: Restore only db.sqlite3
+rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
+```
+
+### 7. Unmount and Deactivate LV
+
+```bash
+# Unmount
+umount /mnt/restore-temp
+
+# Deactivate the LV so it can be attached when the pod starts (an LV left active on the PVE host can leave the pod stuck in ContainerCreating)
+lvchange -an pve/<lv-name>
+```
+
+### 8. Scale Up the Workload
+
+```bash
+# From your workstation:
+kubectl scale deployment/<workload-name> -n <namespace> --replicas=1
+
+# Or StatefulSet:
+kubectl scale statefulset/<workload-name> -n <namespace> --replicas=1
+
+# Wait for pod to be ready
+kubectl wait --for=condition=Ready pod -l app=<workload-name> -n <namespace> --timeout=120s
+```
+
+### 9. Verify
+
+```bash
+# Check pod logs for startup errors
+kubectl logs -n <namespace> -l app=<workload-name> --tail=20
+
+# Test application functionality (service-specific)
+curl -s -o /dev/null -w "%{http_code}" https://<service>.viktorbarzin.me/
+```
+
+## Example: Full Vaultwarden Restore
+
+```bash
+# 1. List backups
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 2. Scale down
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
+kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
+
+# 3. Find LV name
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
+# Output: pvc-xyz  vaultwarden-data-proxmox  local-lvm:vm-105-pvc-xyz456
+
+# 4. Mount and restore
+ssh root@192.168.1.127
+lvchange -ay pve/vm-105-pvc-xyz456
+mkdir -p /mnt/restore-temp
+mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
+
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+
+umount /mnt/restore-temp
+lvchange -an pve/vm-105-pvc-xyz456
+
+# 5. Scale up
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
+kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
+
+# 6. Test
+curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
+```
+
+## Database-Specific Notes
+
+For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
+- You need a point in time that the available dumps do not cover
+- The database dump is corrupt or missing
+- You're restoring a non-SQL database (e.g., Redis RDB)
+
+## Troubleshooting
+
+| Problem | Cause | Fix |
+|---------|-------|-----|
+| "LV is active" during mount | Workload pod still running or stuck | `kubectl describe pvc <pvc-name> -n <namespace>` (see "Used By"), delete the pod if stuck |
+| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check `weekly-backup` script EXCLUDE_NAMESPACES |
+| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<ns>/<pvc>/` |
+| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check pod again |
+| Backup week missing | Weekly backup hasn't run for that week | Check `systemctl status weekly-backup.service`, verify retention |
+
+## Restore from Synology (if PVE host sda is unavailable)
+
+If the PVE host sda backup disk is unavailable or corrupt:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/pvc-data/
+
+# 3. Find the PVC backup
+ls -l 2026-14/<namespace>/<pvc-name>/
+
+# 4. Copy to a temporary location accessible from cluster
+# Option A: Restore sda on PVE host first
+# Option B: rsync to a surviving node's local disk
+# Option C: Mount Synology NFS share on a pod (if network accessible)
+```
+
+## Estimated Time
+
+- Small PVC (<1GB): ~5 minutes
+- Medium PVC (1-10GB): ~10-15 minutes
+- Large PVC (>10GB): ~30+ minutes (depends on size and network)
+
+## Related
+
+- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
+- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
+- **`docs/architecture/backup-dr.md`** — Backup architecture overview
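
Steps 1 and 2 of this runbook (pick a week, then build the backup path) can be sketched as small shell helpers. This is a minimal sketch, not part of the `weekly-backup` script: `latest_week` and `backup_path` are hypothetical names, and the sketch assumes week directories are always named `YYYY-WW` with a zero-padded week number, so that the shell's sorted glob order matches chronological order.

```bash
# latest_week ROOT
# Print the name of the newest YYYY-WW directory under ROOT.
# Returns 1 if ROOT contains no such directory.
latest_week() {
  local root newest d
  root="$1"
  newest=""
  # Glob results are sorted, so the last directory seen is the newest.
  for d in "$root"/[0-9][0-9][0-9][0-9]-[0-9][0-9]/; do
    [ -d "$d" ] || continue      # glob matched nothing
    newest="$d"
  done
  [ -n "$newest" ] || return 1
  newest="${newest%/}"           # drop the trailing slash
  printf '%s\n' "${newest##*/}"  # keep only the directory name
}

# backup_path ROOT WEEK NAMESPACE PVC
# Build the rsync source path, with the trailing slash rsync needs
# to copy directory contents rather than the directory itself.
backup_path() {
  printf '%s/%s/%s/%s/\n' "$1" "$2" "$3" "$4"
}
```

On the PVE host this might be used as `WEEK="$(latest_week /mnt/backup/pvc-data)"`, followed by a dry run such as `rsync -avPn --delete "$(backup_path /mnt/backup/pvc-data "$WEEK" vaultwarden vaultwarden-data-proxmox)" /mnt/restore-temp/` to preview what the real restore would change.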
--- a/docs/runbooks/restore-vault.md
+++ b/docs/runbooks/restore-vault.md
@@ -1,5 +1,7 @@
 # Restore Vault (Raft)

+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
@@ -8,8 +10,9 @@

 ## Backup Location
 - NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 30 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
+- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
 - Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)

 ## CRITICAL: Vault is a dependency for many services
@@ -88,6 +91,45 @@ kubectl rollout restart deployment -n external-secrets
 kubectl get externalsecrets -A | grep -v "SecretSynced"
 ```

+## Alternative: Restore from sda Backup
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest snapshot
+ls -lt /mnt/backup/nfs-mirror/vault-backup/
+
+# 3. From your workstation, copy the snapshot off the PVE host
+scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
+
+# 4. Port-forward to Vault and restore the snapshot
+kubectl port-forward svc/vault-active -n vault 8200:8200 &
+export VAULT_ADDR=http://127.0.0.1:8200
+export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
+
+vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
+```
+
+## Alternative: Restore from Synology (if PVE host is down)
+
+If both TrueNAS and PVE host are unavailable:
+
+```bash
+# 1. SSH to Synology NAS
+ssh Administrator@192.168.1.13
+
+# 2. Navigate to backup directory
+cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
+
+# 3. From your workstation, copy the snapshot off the Synology NAS
+scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
+
+# 4. Restore via port-forward (same as above)
+```
+
 ## Full Vault Rebuild (from zero)
 If Vault needs to be rebuilt from scratch:
 1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
--- a/docs/runbooks/restore-vaultwarden.md
+++ b/docs/runbooks/restore-vaultwarden.md
@@ -1,5 +1,7 @@
 # Restore Vaultwarden

+Last updated: 2026-04-06
+
 ## Prerequisites
 - `kubectl` access to the cluster
 - Backup available on NFS at `/mnt/main/vaultwarden-backup/`
@@ -7,8 +9,10 @@
 ## Backup Location
 - NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
 - Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
-- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
-- Retention: 30 days
+- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
+- PVC file backup (alternative): `/mnt/backup/pvc-data/<YYYY-WW>/vaultwarden/vaultwarden-data-proxmox/`
+- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
+- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
 - Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
 - Integrity check: Both source and backup are verified before/after each backup

@@ -69,6 +73,56 @@ Log in to the Vaultwarden web UI and verify:
 - [ ] Attachments are accessible
 - [ ] TOTP codes are generating correctly

+## Alternative: Restore from PVC File Backup
+
+If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
+
+```bash
+# 1. List available backup weeks
+ssh root@192.168.1.127
+ls -l /mnt/backup/pvc-data/
+
+# 2. Scale down Vaultwarden
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
+
+# 3. Mount the live PVC LV on PVE host
+# Find the LV name first:
+kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
+# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
+LV_NAME="vm-999-pvc-abc123"
+
+lvchange -ay pve/$LV_NAME
+mkdir -p /mnt/restore-temp
+mount /dev/pve/$LV_NAME /mnt/restore-temp
+
+# 4. Restore from backup (pick a week)
+rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
+
+# 5. Unmount and scale up
+umount /mnt/restore-temp
+lvchange -an pve/$LV_NAME
+kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
+```
+
+## Alternative: Restore from sda NFS Mirror
+
+If TrueNAS NFS is unavailable but the PVE host is accessible:
+
+```bash
+# 1. SSH to PVE host
+ssh root@192.168.1.127
+
+# 2. Find the latest backup
+ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
+
+# 3. Mount sda backup on a pod
+BACKUP_DIR="YYYY_MM_DD_HH_MM"  # Set to desired backup
+
+kubectl run vw-restore --rm -it --image=alpine \
+  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; echo Restore complete"]}],"nodeName":"k8s-master"}}' \
+  -n vaultwarden
+```
+
 ## Estimated Time
 - Restore: ~5 minutes
 - Verification: ~5 minutes
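
The NFS-mirror restore copies `db.sqlite3`, `rsa_key.pem`, and `rsa_key.pub.pem` unconditionally, so it can help to confirm the chosen backup directory actually contains them before scaling anything down. The check below is a minimal sketch: `check_vw_backup` is a hypothetical helper, not part of the existing backup job, and it only tests that each file exists and is non-empty. It does not verify SQLite integrity or the optional `attachments/` directory.

```bash
# check_vw_backup DIR
# Succeed only if DIR holds the three files the restore always copies.
# Each missing or empty file is reported on stderr.
check_vw_backup() {
  local dir f missing
  dir="$1"
  missing=0
  for f in db.sqlite3 rsa_key.pem rsa_key.pub.pem; do
    if [ ! -s "$dir/$f" ]; then          # -s: exists and is non-empty
      printf 'missing or empty: %s\n' "$dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

On the PVE host this might be run as `check_vw_backup "/mnt/backup/nfs-mirror/vaultwarden-backup/$BACKUP_DIR"` before starting the restore pod.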