infra/docs/runbooks/restore-lvm-snapshot.md
Viktor Barzin b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00

# Runbook: Restore PVC from LVM Thin Snapshot
Last updated: 2026-04-06
## When to Use
- Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
- Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
- Fast recovery for data changed within the last 7 days
## Prerequisites
- SSH access to PVE host (192.168.1.127)
- The `lvm-pvc-snapshot` script at `/usr/local/bin/lvm-pvc-snapshot`
- kubectl configured on PVE host (`/root/.kube/config`)
## Snapshot Retention
- **Daily snapshots**: Created at 03:00 via systemd timer
- **Retention**: 7 days (older snapshots automatically pruned)
- **Coverage**: All proxmox-lvm PVCs except those in the `dbaas` and `monitoring` namespaces
**If you need data older than 7 days**, see "Alternative: Restore from sda Backup" below.
## Procedure
### 1. List Available Snapshots
```bash
ssh root@192.168.1.127 lvm-pvc-snapshot list
```
Output shows all snapshots with their original LV, age, and data divergence percentage.
### 2. Identify the PVC LV Name
Find the LV name for your PVC:
```bash
# From your workstation (with kubectl):
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'
# The HANDLE column shows "local-lvm:<lv-name>"
```
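
The LV name can also be extracted from a handle in a script. This is a minimal sketch, assuming the handle has the `local-lvm:<lv-name>` form described above; `lv_from_handle` is a hypothetical helper name and not part of `lvm-pvc-snapshot`.

```bash
# Hypothetical helper: strip everything up to the last ':' from a volume handle.
# Assumes the handle looks like "local-lvm:<lv-name>"; adjust if yours differs.
lv_from_handle() {
  local handle="$1"
  printf '%s\n' "${handle##*:}"
}

# Example usage (where kubectl is configured; <pv-name> is a placeholder):
#   lv_from_handle "$(kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}')"
```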
### 3. Run the Restore
```bash
ssh root@192.168.1.127
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>
```
The script will:
1. Look up the K8s PV/PVC/workload for the LV
2. Show a dry-run of all actions
3. Ask for confirmation (type `yes`)
4. Scale down the workload (Deployment or StatefulSet)
5. Rename the current LV to `<name>_pre_restore_<timestamp>`
6. Rename the snapshot LV to the original name
7. Scale the workload back up
8. Wait for the pod to become Ready
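
Steps 5 and 6 amount to a rename swap. The sketch below only prints the two `lvrename` commands so they can be reviewed before anything is changed. It assumes the `pve` volume group used elsewhere in this runbook; `print_restore_plan` is a hypothetical name, the timestamp is passed in by the caller, and the script's real internals may differ.

```bash
# Hypothetical sketch: print (do not run) the two renames a restore performs.
# $1 = original LV, $2 = snapshot LV, $3 = timestamp suffix chosen by the caller.
print_restore_plan() {
  local original="$1" snapshot="$2" stamp="$3"
  # Keep the current data under a new name so the restore can be undone.
  echo "lvrename pve ${original} ${original}_pre_restore_${stamp}"
  # Promote the snapshot to the name the PV expects.
  echo "lvrename pve ${snapshot} ${original}"
}

# Example usage:
#   print_restore_plan <original-lv> <snapshot-lv> "$(date +%Y%m%d_%H%M)"
```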
### 4. Verify
```bash
# Check pod is running
kubectl get pods -n <namespace> -l app=<workload>
# Check the application is working correctly
# (service-specific verification)
```
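
To block until the pod is Ready instead of polling by hand, `kubectl wait` can be wrapped as below. This is a sketch that assumes the pods carry the `app=<workload>` label used in the check above; `wait_ready` and the 120s default timeout are illustrative choices.

```bash
# Hypothetical wrapper: wait for pods of a workload to report Ready.
# $1 = namespace, $2 = value of the "app" label, $3 = optional timeout.
wait_ready() {
  local ns="$1" workload="$2" timeout="${3:-120s}"
  kubectl wait --for=condition=Ready pod -l "app=${workload}" -n "$ns" --timeout="$timeout"
}

# Example usage:
#   wait_ready <namespace> <workload>
```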
### 5. Clean Up
Once you've verified the restore is correct, remove the pre-restore backup:
```bash
ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
```
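
Pre-restore LVs left over from earlier restores can be found by filtering the LV list. This is a minimal sketch, assuming the `_pre_restore_` naming shown above; `list_pre_restore` is a hypothetical helper.

```bash
# Hypothetical filter: read LV names on stdin and print the ones
# that look like pre-restore backups (leading whitespace is dropped).
list_pre_restore() {
  awk '/_pre_restore_/ { print $1 }'
}

# Example usage on the PVE host:
#   lvs --noheadings -o lv_name pve | list_pre_restore
```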
## Manual Restore (if script fails)
If the automated restore fails, perform these steps manually:
```bash
# 1. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=0
# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
# 3. SSH to PVE host
ssh root@192.168.1.127
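# 4. Verify the LV is inactive (lv_active must not show "active")
lvs -o lv_name,lv_active pve | grep <lv-name>
# 5. Rename LVs: keep the current LV as a pre-restore backup, then promote the snapshot
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
lvrename pve <snapshot-lv> <original-lv>
# 6. Scale back up (use the workload's original replica count)
kubectl scale deployment/<name> -n <ns> --replicas=1
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=1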
```
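
When scaling by hand, it helps to record the replica count before scaling to zero so the same value can be restored afterwards. This is a sketch only; `current_replicas` is a hypothetical helper built on `kubectl get` with a JSONPath query.

```bash
# Hypothetical helper: print the configured replica count of a workload.
# $1 = kind ("deployment" or "statefulset"), $2 = name, $3 = namespace.
current_replicas() {
  local kind="$1" name="$2" ns="$3"
  kubectl get "$kind" "$name" -n "$ns" -o jsonpath='{.spec.replicas}'
}

# Example usage:
#   replicas="$(current_replicas deployment <name> <ns>)"
#   kubectl scale deployment/<name> -n <ns> --replicas=0
#   ... rename LVs ...
#   kubectl scale deployment/<name> -n <ns> --replicas="$replicas"
```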
## Database-Specific Notes
- **MySQL InnoDB**: After restore, InnoDB will replay redo logs automatically on startup. Check `SHOW ENGINE INNODB STATUS` for recovery progress.
- **PostgreSQL**: WAL replay happens automatically. Check `pg_is_in_recovery()` and PostgreSQL logs.
- **Redis**: Redis loads the RDB file on startup. Check `INFO persistence` for load status.
For databases, prefer the app-level backup restore (see `restore-mysql.md`, `restore-postgresql.md`). Use an LVM snapshot only when you need a point in time that the available dumps do not cover.
## Alternative: Restore from sda Backup
If no LVM snapshot predates the problem (for example, the data was lost more than 7 days ago) or the snapshots are missing, use the weekly file-level backup on sda:
**Location**: `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/` on PVE host
**Retention**: 4 weekly versions (weeks 0-3)
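
Not every week directory necessarily contains every PVC, so it can help to find the newest week that does. This is a minimal sketch, assuming the directory layout above and zero-padded `YYYY-WW` names that sort lexically; `latest_backup_week` is a hypothetical helper.

```bash
# Hypothetical helper: print the newest week directory that contains the PVC.
# $1 = backup root, $2 = namespace, $3 = PVC name. Returns 1 if none is found.
latest_backup_week() {
  local root="$1" ns="$2" pvc="$3" dir latest=""
  # Globs expand in sorted order, so the last existing match is the newest week.
  for dir in "$root"/*/"$ns"/"$pvc"/; do
    [ -d "$dir" ] && latest="$dir"
  done
  [ -n "$latest" ] || return 1
  latest="${latest#"$root"/}"      # drop the root prefix
  printf '%s\n' "${latest%%/*}"    # keep only the week component
}

# Example usage on the PVE host:
#   latest_backup_week /mnt/backup/pvc-data <namespace> <pvc-name>
```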
### Procedure
```bash
# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
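# 2. Identify the PVC backup directory (replace <YYYY-WW> with a week from step 1, e.g. 2026-14)
ls -l /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/
# 3. Scale down the workload (use statefulset/<name> for StatefulSets) and wait for pods to terminate
kubectl scale deployment/<name> -n <ns> --replicas=0
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s
# 4. Mount the live PVC LV on PVE host
lvchange -ay pve/<pvc-lv-name>
mkdir -p /mnt/restore-temp
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp
# 5. Restore from backup (--delete removes files on the live volume that are not in the backup)
rsync -avP --delete /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ /mnt/restore-temp/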
# 6. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/<pvc-lv-name>
kubectl scale deployment/<name> -n <ns> --replicas=1
```
See `restore-pvc-from-backup.md` for detailed walkthrough.
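
Because `rsync --delete` makes the target match the source, an empty or mistyped source directory would wipe the live volume. The guard below is a sketch of a pre-check; `backup_dir_ok` is a hypothetical helper, and `rsync -n` (dry run) can be used afterwards to preview the changes.

```bash
# Hypothetical guard: succeed only if the backup directory exists and is non-empty.
backup_dir_ok() {
  local dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]
}

# Example usage on the PVE host (-n previews without changing anything):
#   src=/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>
#   backup_dir_ok "$src" && rsync -avn --delete "$src"/ /mnt/restore-temp/
```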
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: `systemctl status lvm-pvc-snapshot.service` |
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or `lvchange -an pve/<lv>` |
| Pod stuck in ContainerCreating | Volume not attached to node | `kubectl describe pod` — check events for attach errors |
| No PV found for volume handle | LV name doesn't match any PV | Check `kubectl get pv -o yaml` for the correct volumeHandle format |
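
For the "LV still active" case, the wait can be scripted as a short polling loop. This is a sketch under two assumptions: the volume group is `pve`, and `lvs` reports an empty `lv_active` field for an inactive LV. `wait_lv_inactive` and the 5-second interval are illustrative.

```bash
# Hypothetical helper: poll until the LV is no longer active.
# $1 = LV name, $2 = number of attempts (default 6, 5 seconds apart).
wait_lv_inactive() {
  local lv="$1" tries="${2:-6}" i=0 state
  while [ "$i" -lt "$tries" ]; do
    state="$(lvs --noheadings -o lv_active "pve/${lv}" | tr -d '[:space:]')"
    if [ -z "$state" ]; then
      return 0   # empty field: the LV is inactive
    fi
    sleep 5
    i=$((i + 1))
  done
  return 1       # still active after all attempts
}

# Example usage on the PVE host:
#   wait_lv_inactive <lv-name> || lvchange -an pve/<lv-name>
```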