backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
# Full Cluster Rebuild
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Last updated: 2026-04-06
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
## When to Use
- Complete cluster failure (all VMs lost)
- etcd corruption requiring full rebuild
- Proxmox host failure requiring fresh VM provisioning
## Prerequisites
2026-04-19 16:55:43 +00:00
- Proxmox host (192.168.1.127) accessible, with NFS exports on `/srv/nfs` and `/srv/nfs-ssd`
- Synology NAS (192.168.1.13) accessible for offsite backup restore if the PVE host backup disk is also lost
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
- sda backup disk mounted at `/mnt/backup` on PVE host (or restore from Synology first)
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
- Git repo with infra code
- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt` )
- Vault unseal keys (emergency kit)
## Rebuild Order
The rebuild must follow dependency order. Each layer depends on the one before it.
### Phase 1: Infrastructure (Proxmox VMs)
```bash
# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra
# 2. Wait for VMs to boot and be reachable
docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.
Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB
Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading
Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
# k8s-master, k8s-node3, k8s-node4, k8s-node5
# (node1 has GPU workloads, node2 excluded from MySQL anti-affinity only — both are active cluster members)
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
```
### Phase 2: Kubernetes Control Plane
```bash
# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
# 4. Join worker nodes
# Get join command from master, run on each node
# 5. OR restore etcd from snapshot (see restore-etcd.md)
# This restores all K8s objects from the snapshot time
```
### Phase 3: Storage Layer
```bash
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
# 6. Deploy CSI drivers (NFS + Proxmox)
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
scripts/tg apply stacks/nfs-csi
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
scripts/tg apply stacks/proxmox-csi
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
```
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
### Phase 3.5: Restore PVC Data from sda Backup
After storage layer is deployed, restore PVC data from the sda backup disk:
```bash
# 8a. List available backup weeks
ssh root@192 .168.1.127
ls -l /mnt/backup/pvc-data/
# 8b. For each critical PVC, restore files:
# Example: vaultwarden-data-proxmox
WEEK="2026-14" # Use most recent week
NAMESPACE="vaultwarden"
PVC_NAME="vaultwarden-data-proxmox"
# Find the PV LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep $PVC_NAME
# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
# Mount the LV
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp
# Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/
# Unmount
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
# 8c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)
```
**Note on pfSense restore**: If pfSense needs restoration, restore `config.xml` from `/mnt/backup/pfsense/<week>/config.xml` via web UI, or full filesystem tar for custom scripts.
2026-04-13 18:37:04 +00:00
**Note on PVE config restore**: If custom scripts/timers are lost, restore from `/mnt/backup/pve-config/` (daily-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
### Phase 4: Vault (secrets foundation)
```bash
# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault
# 9. Initialize/unseal/restore raft snapshot
# 10. Verify ESO can connect
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A
```
### Phase 5: Platform Services
```bash
# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform
# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
```
### Phase 6: Databases
```bash
# 13. Deploy database stack
scripts/tg apply stacks/dbaas
# 14. Wait for CNPG and InnoDB clusters to initialize
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s
# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)
```
### Phase 7: Application Services
```bash
# 17. Deploy remaining stacks in any order
docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.
Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB
Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading
Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
for stack in vaultwarden immich nextcloud linkwarden health; do
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
scripts/tg apply stacks/$stack
done
# 18. Restore Vaultwarden (see restore-vaultwarden.md)
```
### Phase 8: Verification
```bash
# 19. Check all pods are running
kubectl get pods -A | grep -v Running | grep -v Completed
# 20. Check all ingresses respond
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read host; do
code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
echo "$host: $code"
done
# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/
# 22. Run backup CronJobs manually to establish baseline
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
```
## Dependency Graph
```
update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
PVC file-level copy from LVM snapshots, pfsense backup, two offsite
paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
↓
Restore DB dumps from
/mnt/backup/nfs-mirror
or Synology/pve-backup
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert
Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days
Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef
Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
```
## Estimated Time
- Full cluster rebuild from scratch: ~2-4 hours
- With etcd restore (objects preserved): ~1-2 hours
- Individual service restore: ~10-30 minutes each