update backup/DR docs and runbooks for 3-2-1 architecture

- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
This commit is contained in:
Viktor Barzin 2026-04-06 15:06:01 +03:00
parent d5b0990ed1
commit b345b086ef
10 changed files with 1051 additions and 332 deletions

View file

@ -1,5 +1,7 @@
# Full Cluster Rebuild
Last updated: 2026-04-06
## When to Use
- Complete cluster failure (all VMs lost)
- etcd corruption requiring full rebuild
@ -7,7 +9,8 @@
## Prerequisites
- Proxmox host (192.168.1.127) accessible
- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
- Git repo with infra code
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
- Vault unseal keys (emergency kit)
@ -41,15 +44,55 @@ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
### Phase 3: Storage Layer
```bash
# 6. Deploy CSI drivers (NFS + iSCSI)
# 6. Deploy CSI drivers (NFS + Proxmox)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/iscsi-csi
scripts/tg apply stacks/proxmox-csi
# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
```
### Phase 3.5: Restore PVC Data from sda Backup
After storage layer is deployed, restore PVC data from the sda backup disk:
```bash
# 8a. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
# 8b. For each critical PVC, restore files:
# Example: vaultwarden-data-proxmox
WEEK="2026-14" # Use most recent week
NAMESPACE="vaultwarden"
PVC_NAME="vaultwarden-data-proxmox"
# Find the PV LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
# Mount the LV
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
# Unmount
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
```
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via web UI, or full filesystem tar for custom scripts.
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (weekly-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
### Phase 4: Vault (secrets foundation)
```bash
# 8. Deploy Vault (see restore-vault.md for full procedure)
@ -117,10 +160,11 @@ kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwa
## Dependency Graph
```
etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
Restore data from
NFS/Synology backups
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
Restore DB dumps from
/mnt/backup/nfs-mirror
or Synology/pve-backup
```
## Estimated Time