infra/docs/runbooks/restore-lvm-snapshot.md
Viktor Barzin b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00

Runbook: Restore PVC from LVM Thin Snapshot

Last updated: 2026-04-06

When to Use

  • Rolling back a PVC to a previous state after a bad migration, data corruption, or accidental deletion
  • Pre-upgrade safety: snapshot before upgrade, restore if upgrade fails
  • Fast recovery for data changed within the last 7 days

Prerequisites

  • SSH access to PVE host (192.168.1.127)
  • The lvm-pvc-snapshot script at /usr/local/bin/lvm-pvc-snapshot
  • kubectl configured on PVE host (/root/.kube/config)

Snapshot Retention

  • Daily snapshots: Created at 03:00 via systemd timer
  • Retention: 7 days (older snapshots automatically pruned)
  • Coverage: All proxmox-lvm PVCs except those in the dbaas and monitoring namespaces

If you need data older than 7 days, see "Alternative: Restore from sda Backup" below.
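The choice between the two restore paths comes down to simple date arithmetic. The sketch below assumes GNU date and treats the 7-day window as inclusive; the exact pruning cutoff is not stated here, so treat the boundary day as uncertain. The function names are hypothetical and not part of lvm-pvc-snapshot.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of lvm-pvc-snapshot. Requires GNU date.

# Whole days between two dates given as YYYY-MM-DD.
days_between() {
  local from_s to_s
  from_s=$(date -u -d "$1" +%s) || return 1
  to_s=$(date -u -d "$2" +%s) || return 1
  echo $(( (to_s - from_s) / 86400 ))
}

# Succeeds if the state you want to roll back to is at most 7 days old,
# i.e. a daily LVM snapshot should still exist (boundary is an assumption).
within_snapshot_window() {
  local target=$1 today=${2:-$(date -u +%F)}
  [ "$(days_between "$target" "$today")" -le 7 ]
}
```

For example, `within_snapshot_window 2026-04-01 2026-04-06` succeeds, so the snapshot procedure applies; for a target date in mid-March it fails and the sda backup is the route to take.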

Procedure

1. List Available Snapshots

ssh root@192.168.1.127 lvm-pvc-snapshot list

Output shows all snapshots with their original LV, age, and data divergence percentage.

2. Identify the PVC LV Name

Find the LV name for your PVC:

# From your workstation (with kubectl):
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle'

# The HANDLE column shows "local-lvm:<lv-name>"
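To look up a single PVC rather than scanning the whole table, the handle can be fetched and the storage prefix stripped. This is a minimal sketch that assumes the handle really has the "local-lvm:<lv-name>" form described above; the function names and the LV name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration.

# Strip the storage prefix: "local-lvm:<lv-name>" -> "<lv-name>".
# A handle without a colon is returned unchanged.
lv_from_handle() {
  printf '%s\n' "${1##*:}"
}

# Print the CSI volume handle backing one PVC (requires a configured kubectl).
pvc_volume_handle() {
  local ns=$1 pvc=$2 pv
  pv=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}') || return 1
  kubectl get pv "$pv" -o jsonpath='{.spec.csi.volumeHandle}'
}
```

Usage: `lv_from_handle "$(pvc_volume_handle <namespace> <pvc-name>)"` prints the LV name to pass to the restore command.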

3. Run the Restore

ssh root@192.168.1.127
lvm-pvc-snapshot restore <pvc-lv-name> <snapshot-lv-name>

The script will:

  1. Look up the K8s PV/PVC/workload for the LV
  2. Show a dry-run of all actions
  3. Ask for confirmation (type yes)
  4. Scale down the workload (Deployment or StatefulSet)
  5. Rename the current LV to <name>_pre_restore_<timestamp>
  6. Rename the snapshot LV to the original name
  7. Scale the workload back up
  8. Wait for pod to become Ready
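The heart of the restore is the two renames in steps 5 and 6: the live LV is moved aside first, then the snapshot takes its name. The sketch below only prints those commands so the order can be reviewed; it is not the script's actual code, and the timestamp format is an assumption borrowed from the manual procedure in this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the rename commands, runs nothing.
# Arguments: <original-lv> <snapshot-lv> [timestamp]
restore_plan() {
  local orig=$1 snap=$2 ts=${3:-$(date +%Y%m%d_%H%M)}
  # Step 5: keep the current data under a pre-restore name.
  printf 'lvrename pve %s %s_pre_restore_%s\n' "$orig" "$orig" "$ts"
  # Step 6: the snapshot becomes the LV that the PV points at.
  printf 'lvrename pve %s %s\n' "$snap" "$orig"
}
```

Because the PV references the LV by name, swapping names leaves the Kubernetes objects untouched, and the renamed original remains available until it is removed.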

4. Verify

# Check pod is running
kubectl get pods -n <namespace> -l app=<workload>

# Check the application is working correctly
# (service-specific verification)

5. Clean Up

Once you've verified the restore is correct, remove the pre-restore backup:

ssh root@192.168.1.127 lvremove -f pve/<original-lv>_pre_restore_<timestamp>
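Pre-restore copies are easy to forget and continue to occupy thin-pool space. A small filter, sketched here under the assumption that the script always uses the `_pre_restore_` infix shown above, lists the leftovers; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads LV names (one per line, leading spaces allowed)
# on stdin and prints only the pre-restore copies.
filter_pre_restore() {
  awk '$1 ~ /_pre_restore_/ { print $1 }'
}

# Usage on the PVE host:
#   lvs --noheadings -o lv_name pve | filter_pre_restore
```

Review the list before removing anything, since `lvremove -f` does not ask for confirmation.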

Manual Restore (if script fails)

If the automated restore fails, perform these steps manually:

# 1. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=0

# 2. Wait for pods to terminate
kubectl wait --for=delete pod -l app=<name> -n <ns> --timeout=120s

# 3. SSH to PVE host
ssh root@192.168.1.127

# 4. Verify LV is inactive
lvs -o lv_name,lv_active pve | grep <lv-name>

# 5. Rename LVs
lvrename pve <original-lv> <original-lv>_pre_restore_$(date +%Y%m%d_%H%M)
lvrename pve <snapshot-lv> <original-lv>

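# 6. Scale back up (use the replica count the workload had before step 1)
kubectl scale deployment/<name> -n <ns> --replicas=1
# or for StatefulSets:
kubectl scale statefulset/<name> -n <ns> --replicas=1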
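A plain `grep <lv-name>` in step 4 also matches snapshots and pre-restore copies whose names contain the LV name. The sketch below does an exact-name match instead. It assumes the `lv_active` column reads `active` for an active LV and is empty otherwise; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lvs --noheadings -o lv_name,lv_active pve`
# output on stdin and checks one LV by exact name.
# Exit status: 0 = active, 1 = inactive, 2 = LV not listed.
lv_is_active() {
  awk -v name="$1" '
    $1 == name { found = 1; if ($2 == "active") active = 1 }
    END { if (!found) exit 2; exit (active ? 0 : 1) }
  '
}

# Usage on the PVE host:
#   lvs --noheadings -o lv_name,lv_active pve | lv_is_active <lv-name>
```

Proceed to the renames in step 5 only when the check exits with status 1.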

Database-Specific Notes

  • MySQL InnoDB: After restore, InnoDB replays its redo logs automatically on startup. Watch the MySQL error log for recovery progress; once the server accepts connections, SHOW ENGINE INNODB STATUS confirms the engine state.
  • PostgreSQL: WAL replay happens automatically. Check pg_is_in_recovery() and PostgreSQL logs.
  • Redis: Redis loads the RDB file on startup. Check INFO persistence for load status.

For databases, prefer the app-level backup restore (see restore-mysql.md, restore-postgresql.md) unless you need a very recent point-in-time that predates the last dump.

Alternative: Restore from sda Backup

If no LVM snapshot covers the state you need (for example, the data was lost more than 7 days ago, or the snapshots are missing), use the weekly file-level backup on sda:

Location: /mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/ on PVE host
Retention: 4 weekly versions (weeks 0-3)
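Given that layout, the newest backup week and the path for one PVC can be derived mechanically. This sketch assumes week directories are always named with a four-digit year and a zero-padded two-digit week, as in 2026-14, so that a plain sort orders them chronologically; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration.

# Pick the newest "<YYYY-WW>" entry from names on stdin; other entries
# (for example lost+found) are ignored.
latest_backup_week() {
  grep -E '^[0-9]{4}-[0-9]{2}$' | sort | tail -n 1
}

# Build the backup directory for one PVC: <week> <namespace> <pvc-name>.
pvc_backup_dir() {
  printf '/mnt/backup/pvc-data/%s/%s/%s/\n' "$1" "$2" "$3"
}

# Usage on the PVE host:
#   week=$(ls -1 /mnt/backup/pvc-data/ | latest_backup_week)
#   pvc_backup_dir "$week" <namespace> <pvc-name>
```

The trailing slash in the printed path matters for rsync, which then copies the directory contents rather than the directory itself.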

Procedure

# 1. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/

# 2. Identify the PVC backup directory
ls -l /mnt/backup/pvc-data/2026-14/<namespace>/

# 3. Scale down the workload
kubectl scale deployment/<name> -n <ns> --replicas=0

# 4. Mount the live PVC LV on PVE host
lvchange -ay pve/<pvc-lv-name>
mkdir -p /mnt/restore-temp
mount /dev/pve/<pvc-lv-name> /mnt/restore-temp

# 5. Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/2026-14/<namespace>/<pvc-name>/ /mnt/restore-temp/

# 6. Unmount and scale up
umount /mnt/restore-temp
lvchange -an pve/<pvc-lv-name>
kubectl scale deployment/<name> -n <ns> --replicas=1

See restore-pvc-from-backup.md for a detailed walkthrough.

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| "Another instance is running" | Concurrent snapshot/restore | Wait for timer to finish: systemctl status lvm-pvc-snapshot.service |
| LV still active after scale-down | Proxmox CSI hasn't detached | Wait 30s, or lvchange -an pve/<lv> |
| Pod stuck in ContainerCreating | Volume not attached to node | kubectl describe pod; check events for attach errors |
| No PV found for volume handle | LV name doesn't match any PV | Check kubectl get pv -o yaml for the correct volumeHandle format |