update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
parent d5b0990ed1
commit b345b086ef
10 changed files with 1051 additions and 332 deletions
|
|
@ -1,5 +1,7 @@
|
|||
# Full Cluster Rebuild
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
- Complete cluster failure (all VMs lost)
|
||||
- etcd corruption requiring full rebuild
|
||||
|
|
@ -7,7 +9,8 @@
|
|||
|
||||
## Prerequisites
|
||||
- Proxmox host (192.168.1.127) accessible
|
||||
- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
|
||||
- TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
|
||||
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
|
||||
- Git repo with infra code
|
||||
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
|
||||
- Vault unseal keys (emergency kit)
|
||||
|
|
@ -41,15 +44,55 @@ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
|
|||
|
||||
### Phase 3: Storage Layer
|
||||
```bash
|
||||
# 6. Deploy CSI drivers (NFS + iSCSI)
|
||||
# 6. Deploy CSI drivers (NFS + Proxmox)
|
||||
scripts/tg apply stacks/nfs-csi
|
||||
scripts/tg apply stacks/iscsi-csi
|
||||
scripts/tg apply stacks/proxmox-csi
|
||||
|
||||
# 7. Verify PVs are accessible
|
||||
kubectl get pv
|
||||
kubectl get pvc -A | grep -v Bound
|
||||
```
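The `grep -v Bound` check in step 7 also prints the header row and matches the word `Bound` anywhere on the line. A minimal sketch of a stricter filter follows, assuming the default `kubectl get pvc -A` column order (NAMESPACE, NAME, STATUS, ...). The function name `unbound_pvcs` is hypothetical and not part of the repo.

```bash
# Hypothetical helper: print "namespace/name status" for every PVC whose
# STATUS column is not "Bound". Reads `kubectl get pvc -A` output on stdin.
# Assumes the default column order: NAMESPACE NAME STATUS ...
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Usage would be `kubectl get pvc -A | unbound_pvcs`; empty output means every PVC is bound.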
|
||||
|
||||
### Phase 3.5: Restore PVC Data from sda Backup
|
||||
|
||||
After the storage layer is deployed, restore PVC data from the sda backup disk:
|
||||
|
||||
```bash
|
||||
# 8a. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 8b. For each critical PVC, restore files:
|
||||
# Example: vaultwarden-data-proxmox
|
||||
WEEK="2026-14" # Use most recent week
|
||||
NAMESPACE="vaultwarden"
|
||||
PVC_NAME="vaultwarden-data-proxmox"
|
||||
|
||||
# Find the PV LV name
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep "$PVC_NAME"
|
||||
|
||||
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
|
||||
LV_NAME="vm-999-pvc-abc123"
|
||||
|
||||
# Mount the LV
|
||||
lvchange -ay pve/$LV_NAME
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/$LV_NAME /mnt/restore-temp
|
||||
|
||||
# Restore from backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
|
||||
|
||||
# Unmount
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/$LV_NAME
|
||||
|
||||
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
|
||||
```
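Step 8b copies the LV name out of the volumeHandle by hand. When repeating the restore for many PVCs, the extraction could be scripted. This is a sketch only, assuming the handle has the `<storage>:<lv-name>` shape shown in the example above; `lv_from_handle` is a hypothetical name.

```bash
# Hypothetical helper: strip the storage prefix from a CSI volumeHandle.
# Assumes the "<storage>:<lv-name>" shape used in the example above;
# everything up to and including the last colon is removed.
lv_from_handle() {
  printf '%s\n' "${1##*:}"
}
```

For example, `LV_NAME=$(lv_from_handle "local-lvm:vm-999-pvc-abc123")` would set `LV_NAME` to the part after the colon. Check the real handle format with `kubectl get pv -o yaml` before relying on it.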
|
||||
|
||||
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via the web UI, or use the full filesystem tar if custom scripts must also be recovered.
|
||||
|
||||
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (weekly-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
|
||||
|
||||
### Phase 4: Vault (secrets foundation)
|
||||
```bash
|
||||
# 8. Deploy Vault (see restore-vault.md for full procedure)
|
||||
|
|
@ -117,10 +160,11 @@ kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwa
|
|||
|
||||
## Dependency Graph
|
||||
```
|
||||
etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
|
||||
↓
|
||||
Restore data from
|
||||
NFS/Synology backups
|
||||
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
|
||||
↓
|
||||
Restore DB dumps from
|
||||
/mnt/backup/nfs-mirror
|
||||
or Synology/pve-backup
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
|
|
|
|||
159
docs/runbooks/restore-lvm-snapshot.md
Normal file
|
|
@ -0,0 +1,159 @@
|
|||
# Runbook: Restore PVC from LVM Thin Snapshot
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
|
||||
- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
|
||||
- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
|
||||
- Fast recovery for data changed within the last 7 days
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- SSH access to PVE host (192.168.1.127)
|
||||
- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
|
||||
- kubectl configured on PVE host (`/root/.kube/config`)
|
||||
|
||||
## Snapshot Retention
|
||||
|
||||
- **Daily snapshots**: Created at 03:00 via systemd timer
|
||||
- **Retention**: 7 days (older snapshots automatically pruned)
|
||||
- **Coverage**: All proxmox-lvm PVCs except `dbaas` and `monitoring` namespaces
|
||||
|
||||
**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
|
||||
|
||||
## Procedure
|
||||
|
||||
### 1. List Available Snapshots
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 lvm-pvc-snapshot list
|
||||
```
|
||||
|
||||
Output shows all snapshots with their original LV, age, and data divergence percentage.
|
||||
|
||||
### 2. Identify the PVC LV Name
|
||||
|
||||
Find the LV name for your PVC:
|
||||
|
||||
```bash
|
||||
# From your workstation (with kubectl):
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
|
||||
|
||||
# The HANDLE column shows "local-lvm:<lv-name>"
|
||||
```
|
||||
|
||||
### 3. Run the Restore
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
|
||||
```
|
||||
|
||||
The script will:
|
||||
1. Look up the K8s PV/PVC/workload for the LV
|
||||
2. Show a dry-run of all actions
|
||||
3. Ask for confirmation (type `yes`)
|
||||
4. Scale down the workload (Deployment or StatefulSet)
|
||||
5. Rename the current LV to `<name>_pre_restore_<timestamp>`
|
||||
6. Rename the snapshot LV to the original name
|
||||
7. Scale the workload back up
|
||||
8. Wait for pod to become Ready
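The rename in steps 5 and 6 keeps the current data under a new name instead of deleting it. A minimal sketch of how that name could be built follows. The timestamp format is borrowed from the manual procedure in this runbook and may differ from what the script actually uses; `pre_restore_name` is a hypothetical helper.

```bash
# Hypothetical helper: build the name the current LV is renamed to before
# the snapshot takes its place. The optional second argument overrides the
# timestamp; the default format is an assumption taken from the manual steps.
pre_restore_name() {
  lv="$1"
  ts="${2:-$(date +%Y%m%d_%H%M)}"
  printf '%s_pre_restore_%s\n' "$lv" "$ts"
}
```

The result is the LV to pass to `lvremove` during clean-up once the restore has been verified.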
|
||||
|
||||
### 4. Verify
|
||||
|
||||
```bash
|
||||
# Check pod is running
|
||||
kubectl get pods -n <namespace> -l app=<workload>
|
||||
|
||||
# Check the application is working correctly
|
||||
# (service-specific verification)
|
||||
```
|
||||
|
||||
### 5. Clean Up
|
||||
|
||||
Once you've verified the restore is correct, remove the pre-restore backup:
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
|
||||
```
|
||||
|
||||
## Manual Restore (if script fails)
|
||||
|
||||
If the automated restore fails, perform these steps manually:
|
||||
|
||||
```bash
|
||||
# 1. Scale down the workload
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=0
|
||||
# or for StatefulSets:
|
||||
kubectl scale statefulset/<name> -n <ns> --replicas=0
|
||||
|
||||
# 2. Wait for pods to terminate
|
||||
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
|
||||
|
||||
# 3. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 4. Verify LV is inactive
|
||||
lvs -o lv_name,lv_active pve | grep <lv-name>
|
||||
|
||||
# 5. Rename LVs
|
||||
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
|
||||
lvrename pve <snapshot-lv> <original-lv>
|
||||
|
||||
# 6. Scale back up
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=1
|
||||
```
|
||||
|
||||
## Database-Specific Notes
|
||||
|
||||
- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
|
||||
- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
|
||||
- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
|
||||
|
||||
For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless you need a point in time that the dumps do not cover.
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If the restore point you need is older than the 7-day snapshot retention, or the snapshots are missing, use the weekly file-level backup on sda:
|
||||
|
||||
**Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
|
||||
**Retention**: 4 weekly versions (weeks 0-3)
|
||||
|
||||
### Procedure
|
||||
|
||||
```bash
|
||||
# 1. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Identify the PVC backup directory
|
||||
ls -l /mnt/backup/pvc-data/2026-14/<namespace>/
|
||||
|
||||
# 3. Scale down the workload
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=0
|
||||
|
||||
# 4. Mount the live PVC LV on PVE host
|
||||
lvchange -ay pve/<pvc-lv-name>
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
|
||||
|
||||
# 5. Restore from backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/
|
||||
|
||||
# 6. Unmount and scale up
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/<pvc-lv-name>
|
||||
kubectl scale deployment/<name> -n <ns> --replicas=1
|
||||
```
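The procedure above hard-codes `2026-14` as the week. A sketch for picking the newest week directory instead, assuming directories are named `YYYY-WW` with a zero-padded week number so that name order matches time order; `latest_week` is a hypothetical helper.

```bash
# Hypothetical helper: print the newest week directory under a pvc-data root.
# Assumes YYYY-WW names with a zero-padded week, so lexicographic order is
# chronological. Entries that do not match the pattern are ignored.
latest_week() {
  root="${1:-/mnt/backup/pvc-data}"
  ls -1 "$root" | grep -E '^[0-9]{4}-[0-9]{2}$' | sort | tail -n 1
}
```

On the PVE host, `WEEK=$(latest_week)` could then replace the literal week in the `rsync` source path.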
|
||||
|
||||
See `restore-pvc-from-backup.md` for detailed walkthrough.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
|
||||
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
|
||||
| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
|
||||
| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |
|
||||
|
|
@ -1,5 +1,7 @@
|
|||
# Restore MySQL (InnoDB Cluster)
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
|
||||
|
|
@ -7,8 +9,9 @@
|
|||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
|
||||
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
|
||||
- Retention: 14 days
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
|
||||
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
|
||||
- Size: ~11MB per dump
|
||||
|
||||
## Restore Procedure
|
||||
|
|
@ -93,6 +96,39 @@ kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --p
|
|||
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest backup
|
||||
ls -lt /mnt/backup/nfs-mirror/mysql-backup/
|
||||
|
||||
# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
|
||||
# Or mount sda backup on a pod (hostPath resolves on the node in nodeName, so the path must exist there):
|
||||
kubectl run mysql-restore --rm -it --image=mysql \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
|
||||
-n dbaas
|
||||
```
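The restore command above still contains the `dump_YYYY_MM_DD_HH_MM.sql.gz` placeholder. A sketch for resolving it to the newest dump, assuming the naming shown under "Backup Location", where sorting by name equals sorting by time; `latest_dump` is a hypothetical helper.

```bash
# Hypothetical helper: print the file name of the newest dump in a directory.
# Assumes dump_YYYY_MM_DD_HH_MM.sql.gz naming, so a plain sort is chronological.
latest_dump() {
  dir="$1"
  ls -1 "$dir" | grep -E '^dump_[0-9_]+\.sql\.gz$' | sort | tail -n 1
}
```

Run on the PVE host as `latest_dump /mnt/backup/nfs-mirror/mysql-backup` and substitute the result into the `zcat` path.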
|
||||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/
|
||||
|
||||
# 3. Copy dump to a temporary location accessible from cluster
|
||||
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Data restore: ~5 minutes (11MB dump)
|
||||
- InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)
|
||||
|
|
|
|||
|
|
@ -1,5 +1,7 @@
|
|||
# Restore PostgreSQL (CNPG)
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- CNPG operator running in the cluster
|
||||
|
|
@ -8,8 +10,9 @@
|
|||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/postgresql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
|
||||
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
|
||||
- Retention: 14 days
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/postgresql-backup/` (PVE host 192.168.1.127)
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/`
|
||||
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
|
||||
|
||||
## Restore from pg_dumpall
|
||||
|
||||
|
|
@ -81,11 +84,39 @@ kubectl rollout restart deployment -n linkwarden
|
|||
# ... repeat for all PG-dependent services (excluding trading — disabled)
|
||||
```
|
||||
|
||||
## Restore from Synology (if TrueNAS is down)
|
||||
1. SSH to Synology NAS (192.168.1.13)
|
||||
2. Find the replicated dataset: `zfs list | grep postgresql-backup`
|
||||
3. Mount or copy the backup file to a location accessible from the cluster
|
||||
4. Follow the restore procedure above
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest backup
|
||||
ls -lt /mnt/backup/nfs-mirror/postgresql-backup/
|
||||
|
||||
# 3. Mount sda backup on a pod (hostPath resolves on the node in nodeName, so the path must exist there)
|
||||
PGPASSWORD=$(kubectl get secret pg-cluster-superuser -n dbaas -o jsonpath='{.data.password}' | base64 -d)
|
||||
|
||||
kubectl run pg-restore --rm -it --image=postgres:16.4-bullseye \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/postgresql-backup"}}],"containers":[{"name":"pg-restore","image":"postgres:16.4-bullseye","env":[{"name":"PGPASSWORD","value":"'$PGPASSWORD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | psql -h pg-cluster-rw.dbaas -U postgres"]}],"nodeName":"k8s-master"}}' \
|
||||
-n dbaas
|
||||
```
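Piping a truncated dump into `psql` can leave the cluster half restored, so it may be worth testing the archive first. A minimal sketch using gzip's built-in integrity test; `dump_is_intact` is a hypothetical helper, and the check does not validate the SQL itself.

```bash
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and passes gzip's integrity test. It does not inspect the SQL content.
dump_is_intact() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

On the PVE host this could be used as `dump_is_intact /mnt/backup/nfs-mirror/postgresql-backup/<dump-file> && echo OK` before starting the restore pod.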
|
||||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/postgresql-backup/
|
||||
|
||||
# 3. Copy dump to a temporary location accessible from cluster
|
||||
# (e.g., via rsync to a surviving node, or restore TrueNAS first)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
- Restore into existing cluster: ~10 minutes (depends on dump size)
|
||||
|
|
|
|||
231
docs/runbooks/restore-pvc-from-backup.md
Normal file
|
|
@ -0,0 +1,231 @@
|
|||
# Runbook: Restore PVC from sda File Backup
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## When to Use
|
||||
|
||||
- The restore point you need is older than the 7-day LVM snapshot retention, or the snapshots are missing
|
||||
- Need to restore data from a specific week (up to 4 weeks back)
|
||||
- LVM snapshot restore failed or snapshot is corrupt
|
||||
- Granular file-level restore (not full PVC)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- SSH access to PVE host (192.168.1.127)
|
||||
- kubectl configured (either on PVE host or your workstation)
|
||||
- sda backup disk mounted at `/mnt/backup` on PVE host
|
||||
|
||||
## Backup Location
|
||||
|
||||
**Path**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
|
||||
**Retention**: 4 weekly versions (weeks 0-3)
|
||||
**Deduplication**: `--link-dest` hardlink dedup (unchanged files share inodes across weeks)
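
Because unchanged files are hard links, the same file in two week directories is one inode on disk. A small sketch for confirming that on the PVE host; `is_deduplicated` is a hypothetical helper, not part of the backup scripts.

```bash
# Hypothetical helper: succeed if both paths refer to the same inode,
# meaning the two weekly copies are hard links to one file on disk.
is_deduplicated() {
  [ "$1" -ef "$2" ]
}
```

One consequence of this layout is that editing a file in place inside a week directory changes it in every week that shares the inode, so treat the backup tree as read-only and restore by copying out of it.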
|
||||
|
||||
## Procedure
|
||||
|
||||
### 1. List Available Backup Weeks
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# Output shows week directories like:
|
||||
# 2026-13
|
||||
# 2026-14
|
||||
# 2026-15
|
||||
# 2026-16
|
||||
```
|
||||
|
||||
### 2. Identify the PVC Backup Directory
|
||||
|
||||
```bash
|
||||
# List namespaces in a specific week
|
||||
ls -l /mnt/backup/pvc-data/2026-14/
|
||||
|
||||
# List PVCs in a namespace
|
||||
ls -l /mnt/backup/pvc-data/2026-14/vaultwarden/
|
||||
|
||||
# Example: vaultwarden-data-proxmox/
|
||||
```
|
||||
|
||||
### 3. Find the Live PVC LV Name
|
||||
|
||||
From your workstation (or PVE host with kubectl):
|
||||
|
||||
```bash
|
||||
# Get the PV volumeHandle (contains LV name)
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep <pvc-name>
|
||||
|
||||
# Example output:
|
||||
# pvc-abc123 vaultwarden-data-proxmox vaultwarden local-lvm:vm-999-pvc-abc123
|
||||
# ↑ this is the LV name
|
||||
```
|
||||
|
||||
### 4. Scale Down the Workload
|
||||
|
||||
```bash
|
||||
# Find the workload using the PVC
|
||||
kubectl get deployment,statefulset -n <namespace> -o json | jq -r '.items[] | select(any(.spec.template.spec.volumes[]?; .persistentVolumeClaim.claimName == "<pvc-name>")) | .metadata.name'
|
||||
|
||||
# Scale down (Deployment example)
|
||||
kubectl scale deployment/<workload-name> -n <namespace> --replicas=0
|
||||
|
||||
# Or StatefulSet:
|
||||
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=0
|
||||
|
||||
# Wait for pod to terminate
|
||||
kubectl wait --for=delete pod -l app=<workload-name> -n <namespace> --timeout=120s
|
||||
```
|
||||
|
||||
### 5. Mount the Live PVC LV
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# Activate the LV (should already be inactive after pod termination)
|
||||
lvchange -ay pve/<lv-name>
|
||||
|
||||
# Create mount point
|
||||
mkdir -p /mnt/restore-temp
|
||||
|
||||
# Mount the LV
|
||||
mount /dev/pve/<lv-name> /mnt/restore-temp
|
||||
```
|
||||
|
||||
### 6. Restore from Backup
|
||||
|
||||
**Option A: Full PVC restore (replace all data)**
|
||||
|
||||
```bash
|
||||
# This will delete existing files in the PVC and replace with backup
|
||||
rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/
|
||||
|
||||
# Example:
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
```
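Since `--delete` removes every file on the live volume that is absent from the backup, a preview run before Option A can catch a wrong week or PVC path. A minimal sketch using rsync's standard `--dry-run` and `--itemize-changes` flags; `preview_restore` is a hypothetical wrapper.

```bash
# Hypothetical helper: list what a full restore would change without
# writing to the target. Trailing slashes are normalised so the contents
# of src are compared against the contents of dst.
preview_restore() {
  src="$1"
  dst="$2"
  rsync -a --delete --dry-run --itemize-changes "${src%/}/" "${dst%/}/"
}
```

Lines starting with `*deleting` in the output are files that the real run would remove from the volume.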
|
||||
|
||||
**Option B: Selective file restore (merge)**
|
||||
|
||||
```bash
|
||||
# Restore specific files or directories without deleting existing data
|
||||
rsync -avP /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/path/to/file /mnt/restore-temp/path/to/
|
||||
|
||||
# Example: Restore only db.sqlite3
|
||||
rsync -avP /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/db.sqlite3 /mnt/restore-temp/
|
||||
```
|
||||
|
||||
### 7. Unmount and Deactivate LV
|
||||
|
||||
```bash
|
||||
# Unmount
|
||||
umount /mnt/restore-temp
|
||||
|
||||
# Deactivate LV (optional, kubelet will activate it when pod starts)
|
||||
lvchange -an pve/<lv-name>
|
||||
```
|
||||
|
||||
### 8. Scale Up the Workload
|
||||
|
||||
```bash
|
||||
# From your workstation:
|
||||
kubectl scale deployment/<workload-name> -n <namespace> --replicas=1
|
||||
|
||||
# Or StatefulSet:
|
||||
kubectl scale statefulset/<workload-name> -n <namespace> --replicas=1
|
||||
|
||||
# Wait for pod to be ready
|
||||
kubectl wait --for=condition=Ready pod -l app=<workload-name> -n <namespace> --timeout=120s
|
||||
```
|
||||
|
||||
### 9. Verify
|
||||
|
||||
```bash
|
||||
# Check pod logs for startup errors
|
||||
kubectl logs -n <namespace> -l app=<workload-name> --tail=20
|
||||
|
||||
# Test application functionality (service-specific)
|
||||
curl -s -o /dev/null -w "%{http_code}" https://<service>.viktorbarzin.me/
|
||||
```
|
||||
|
||||
## Example: Full Vaultwarden Restore
|
||||
|
||||
```bash
|
||||
# 1. List backups
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Scale down
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
|
||||
kubectl wait --for=delete pod -l app=vaultwarden -n vaultwarden --timeout=120s
|
||||
|
||||
# 3. Find LV name
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
|
||||
# Output: pvc-xyz vaultwarden-data-proxmox local-lvm:vm-105-pvc-xyz456
|
||||
|
||||
# 4. Mount and restore
|
||||
ssh root@192.168.1.127
|
||||
lvchange -ay pve/vm-105-pvc-xyz456
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/vm-105-pvc-xyz456 /mnt/restore-temp
|
||||
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/vm-105-pvc-xyz456
|
||||
|
||||
# 5. Scale up
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
kubectl wait --for=condition=Ready pod -l app=vaultwarden -n vaultwarden --timeout=120s
|
||||
|
||||
# 6. Test
|
||||
curl -s -o /dev/null -w "%{http_code}" https://vaultwarden.viktorbarzin.me/
|
||||
```
|
||||
|
||||
## Database-Specific Notes
|
||||
|
||||
For databases (MySQL, PostgreSQL), prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`) unless:
|
||||
- You need a very recent point-in-time that predates the last dump
|
||||
- The database dump is corrupt or missing
|
||||
- You're restoring a non-SQL database (e.g., Redis RDB)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| "LV is active" during mount | Workload pod still running or stuck | Run `kubectl describe pvc <pvc-name> -n <namespace>` and check `Used By`; delete the pod if stuck |
|
||||
| "No such file or directory" in backup | PVC not backed up (in excluded namespace) | Check `weekly-backup` script EXCLUDE_NAMESPACES |
|
||||
| rsync shows 0 files transferred | Wrong backup week or PVC name | Double-check paths: `ls /mnt/backup/pvc-data/<week>/<ns>/<pvc>/` |
|
||||
| Pod stuck in ContainerCreating after restore | LV still active on PVE host | `lvchange -an pve/<lv-name>`, wait 30s, check pod again |
|
||||
| Backup week missing | Weekly backup hasn't run for that week | Check `systemctl status weekly-backup.service`, verify retention |
|
||||
|
||||
## Restore from Synology (if PVE host sda is unavailable)
|
||||
|
||||
If the PVE host sda backup disk is unavailable or corrupt:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/pve-backup/pvc-data/
|
||||
|
||||
# 3. Find the PVC backup
|
||||
ls -l 2026-14/<namespace>/<pvc-name>/
|
||||
|
||||
# 4. Copy to a temporary location accessible from cluster
|
||||
# Option A: Restore sda on PVE host first
|
||||
# Option B: rsync to a surviving node's local disk
|
||||
# Option C: Mount Synology NFS share on a pod (if network accessible)
|
||||
```
|
||||
|
||||
## Estimated Time
|
||||
|
||||
- Small PVC (<1GB): ~5 minutes
|
||||
- Medium PVC (1-10GB): ~10-15 minutes
|
||||
- Large PVC (>10GB): ~30+ minutes (depends on size and network)
|
||||
|
||||
## Related
|
||||
|
||||
- **`restore-lvm-snapshot.md`** — Fast restore for recent changes (<7 days)
|
||||
- **`restore-full-cluster.md`** — Disaster recovery procedure (uses this runbook in Phase 3.5)
|
||||
- **`docs/architecture/backup-dr.md`** — Backup architecture overview
|
||||
|
|
@ -1,5 +1,7 @@
|
|||
# Restore Vault (Raft)
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
|
||||
|
|
@ -8,8 +10,9 @@
|
|||
|
||||
## Backup Location
|
||||
- NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
|
||||
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
|
||||
- Retention: 30 days
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
|
||||
- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
|
||||
- Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
|
||||
|
||||
## CRITICAL: Vault is a dependency for many services
|
||||
|
|
@ -88,6 +91,45 @@ kubectl rollout restart deployment -n external-secrets
|
|||
kubectl get externalsecrets -A | grep -v "SecretSynced"
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda Backup
|
||||
|
||||
If TrueNAS NFS is unavailable but the PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest snapshot
|
||||
ls -lt /mnt/backup/nfs-mirror/vault-backup/
|
||||
|
||||
# 3. Copy snapshot to a location accessible from cluster
|
||||
# Port-forward to Vault and restore
|
||||
kubectl port-forward svc/vault-active -n vault 8200:8200 &
|
||||
export VAULT_ADDR=http://127.0.0.1:8200
|
||||
export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)
|
||||
|
||||
# Copy snapshot from PVE host to local workstation, then restore
|
||||
scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
|
||||
```
|
||||
|
||||
## Alternative: Restore from Synology (if PVE host is down)
|
||||
|
||||
If both TrueNAS and PVE host are unavailable:
|
||||
|
||||
```bash
|
||||
# 1. SSH to Synology NAS
|
||||
ssh Administrator@192.168.1.13
|
||||
|
||||
# 2. Navigate to backup directory
|
||||
cd /volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/
|
||||
|
||||
# 3. Copy snapshot to local workstation
|
||||
scp Administrator@192.168.1.13:/volume1/Backup/Viki/pve-backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./
|
||||
|
||||
# 4. Restore via port-forward (same as above)
|
||||
```
|
||||
|
||||
## Full Vault Rebuild (from zero)
|
||||
If Vault needs to be rebuilt from scratch:
|
||||
1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
|
||||
|
|
|
|||
|
|
@ -1,5 +1,7 @@
|
|||
# Restore Vaultwarden
|
||||
|
||||
Last updated: 2026-04-06
|
||||
|
||||
## Prerequisites
|
||||
- `kubectl` access to the cluster
|
||||
- Backup available on NFS at `/mnt/main/vaultwarden-backup/`
|
||||
|
|
@ -7,8 +9,10 @@
|
|||
## Backup Location
|
||||
- NFS: `/mnt/main/vaultwarden-backup/YYYY_MM_DD_HH_MM/` (directory per backup)
|
||||
- Each backup contains: `db.sqlite3`, `rsa_key.pem`, `rsa_key.pub.pem`, `attachments/`, `sends/`, `config.json`
|
||||
- Replicated to Synology NAS (192.168.1.13) via TrueNAS ZFS replication
|
||||
- Retention: 30 days
|
||||
- Mirrored to sda: `/mnt/backup/nfs-mirror/vaultwarden-backup/` (PVE host 192.168.1.127)
|
||||
- PVC file backup (alternative): `/mnt/backup/pvc-data/<YYYY-WW>/vaultwarden/vaultwarden-data-proxmox/`
|
||||
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vaultwarden-backup/`
|
||||
- Retention: 30 days (on NFS), latest only (on sda nfs-mirror), 4 weeks (on sda pvc-data), unlimited (on Synology)
|
||||
- Schedule: Every 6 hours (00:00, 06:00, 12:00, 18:00)
|
||||
- Integrity check: Both source and backup are verified before/after each backup
|
||||
|
||||
|
|
@ -69,6 +73,56 @@ Log in to the Vaultwarden web UI and verify:
|
|||
- [ ] Attachments are accessible
|
||||
- [ ] TOTP codes are generating correctly
|
||||
|
||||
## Alternative: Restore from PVC File Backup
|
||||
|
||||
If the NFS backup is unavailable or corrupt, restore from the weekly PVC file backup on sda:
|
||||
|
||||
```bash
|
||||
# 1. List available backup weeks
|
||||
ssh root@192.168.1.127
|
||||
ls -l /mnt/backup/pvc-data/
|
||||
|
||||
# 2. Scale down Vaultwarden
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=0
|
||||
|
||||
# 3. Mount the live PVC LV on PVE host
|
||||
# Find the LV name first:
|
||||
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,HANDLE:.spec.csi.volumeHandle' | grep vaultwarden-data-proxmox
|
||||
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
|
||||
LV_NAME="vm-999-pvc-abc123"
|
||||
|
||||
lvchange -ay pve/$LV_NAME
|
||||
mkdir -p /mnt/restore-temp
|
||||
mount /dev/pve/$LV_NAME /mnt/restore-temp
|
||||
|
||||
# 4. Restore from backup (pick a week)
|
||||
rsync -avP --delete /mnt/backup/pvc-data/2026-14/vaultwarden/vaultwarden-data-proxmox/ /mnt/restore-temp/
|
||||
|
||||
# 5. Unmount and scale up
|
||||
umount /mnt/restore-temp
|
||||
lvchange -an pve/$LV_NAME
|
||||
kubectl scale deployment vaultwarden -n vaultwarden --replicas=1
|
||||
```
|
||||
|
||||
## Alternative: Restore from sda NFS Mirror
|
||||
|
||||
If TrueNAS NFS is unavailable but PVE host is accessible:
|
||||
|
||||
```bash
|
||||
# 1. SSH to PVE host
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# 2. Find the latest backup
|
||||
ls -lt /mnt/backup/nfs-mirror/vaultwarden-backup/
|
||||
|
||||
# 3. Mount sda backup on a pod (hostPath resolves on the node in nodeName, so the path must exist there)
|
||||
BACKUP_DIR="YYYY_MM_DD_HH_MM" # Set to desired backup
|
||||
|
||||
kubectl run vw-restore --rm -it --image=alpine \
|
||||
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/vaultwarden-backup"}},{"name":"data","persistentVolumeClaim":{"claimName":"vaultwarden-data-proxmox"}}],"containers":[{"name":"vw-restore","image":"alpine","volumeMounts":[{"name":"backup","mountPath":"/backup"},{"name":"data","mountPath":"/data"}],"command":["/bin/sh","-c","cp /backup/'$BACKUP_DIR'/db.sqlite3 /data/db.sqlite3 && cp /backup/'$BACKUP_DIR'/rsa_key.pem /data/ && cp /backup/'$BACKUP_DIR'/rsa_key.pub.pem /data/ && cp -a /backup/'$BACKUP_DIR'/attachments /data/ 2>/dev/null; echo Restore complete"]}],"nodeName":"k8s-master"}}' \
|
||||
-n vaultwarden
|
||||
```
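The restore command prints "Restore complete" even when an earlier copy fails, so it can help to confirm that the chosen backup directory is complete before running it. A sketch that checks for the files the command copies; `check_vw_backup` is a hypothetical helper, and `attachments/` is left out because the command already treats it as optional.

```bash
# Hypothetical helper: check that a Vaultwarden backup directory holds the
# files the restore command copies. Prints each missing or empty file and
# returns non-zero if any is absent.
check_vw_backup() {
  dir="$1"
  missing=0
  for f in db.sqlite3 rsa_key.pem rsa_key.pub.pem; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

On the PVE host: `check_vw_backup /mnt/backup/nfs-mirror/vaultwarden-backup/$BACKUP_DIR`.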
|
||||
|
||||
## Estimated Time
|
||||
- Restore: ~5 minutes
|
||||
- Verification: ~5 minutes
|
||||
|
|
|
|||