infra/docs/runbooks/restore-full-cluster.md
Viktor Barzin af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
2026-03-19 20:34:33 +00:00

Full Cluster Rebuild

When to Use

  • Complete cluster failure (all VMs lost)
  • etcd corruption requiring full rebuild
  • Proxmox host failure requiring fresh VM provisioning

Prerequisites

  • Proxmox host (192.168.1.127) accessible
  • TrueNAS NFS server (192.168.1.2) accessible, or the Synology NAS (192.168.1.13) for backups
  • Git repo with infra code
  • SOPS age keys for state decryption (~/.config/sops/age/keys.txt)
  • Vault unseal keys (emergency kit)

Rebuild Order

The rebuild must follow dependency order. Each layer depends on the one before it.

Phase 1: Infrastructure (Proxmox VMs)

# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra

# 2. Wait for VMs to boot and be reachable
# k8s-master, k8s-node3, k8s-node4, k8s-node5 (node1/2 excluded)
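
Step 2 does not say how to wait. One possible approach is a small retry helper, sketched below. It assumes the node names above resolve from the machine running the script, that each node accepts SSH on port 22, and that an `nc` with `-z` and `-w` support is installed. None of this is stated in the runbook, and the function names are illustrative.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND (at least once) until it succeeds or ATTEMPTS tries have failed.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# wait_for_nodes NODE...
# Assumption: every node listens for SSH on port 22 once it has booted.
wait_for_nodes() {
  for node in "$@"; do
    if retry_until 60 5 nc -z -w 2 "$node" 22; then
      echo "$node: reachable"
    else
      echo "$node: NOT reachable" >&2
      return 1
    fi
  done
}
```

Usage would be `wait_for_nodes k8s-master k8s-node3 k8s-node4 k8s-node5`, which gives each node up to roughly five minutes before giving up.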

Phase 2: Kubernetes Control Plane

# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml

# 4. Join worker nodes
# Get join command from master, run on each node

# 5. Alternatively, instead of a fresh init, restore etcd from a snapshot (see restore-etcd.md)
# This restores all K8s objects as they were at snapshot time
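
Step 4 can be scripted along these lines. This is a sketch that assumes passwordless SSH and sudo on every node, which the runbook does not state. `kubeadm token create --print-join-command` prints a complete `kubeadm join ...` line with a fresh token. The `SSH` variable and `join_workers` function are hypothetical names.

```shell
# Command used to reach the nodes; overridable for dry runs (hypothetical knob).
SSH=${SSH:-ssh}

# join_workers MASTER WORKER...
# Assumption: passwordless SSH and sudo work on every node.
join_workers() {
  master=$1
  shift
  # Ask the control plane for a fresh token and the matching join command.
  join_cmd=$("$SSH" "$master" "sudo kubeadm token create --print-join-command") || return 1
  for worker in "$@"; do
    echo "joining $worker"
    "$SSH" "$worker" "sudo $join_cmd" || return 1
  done
}
```

For example: `join_workers k8s-master k8s-node3 k8s-node4 k8s-node5`, followed by `kubectl get nodes` to confirm that each worker registered.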

Phase 3: Storage Layer

# 6. Deploy CSI drivers (NFS + iSCSI)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/iscsi-csi

# 7. Verify PVs are accessible
kubectl get pv
# Show only claims that are not Bound (header-only output means every claim is bound)
kubectl get pvc -A | grep -v Bound
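
For scripted use, the step 7 check can return an exit status instead of relying on a visual scan. The sketch below parses the default `kubectl get pvc -A` table, where the first three columns are NAMESPACE, NAME, and STATUS. The function name `unbound_pvcs` is illustrative.

```shell
# unbound_pvcs: reads `kubectl get pvc -A` output on stdin, prints
# namespace/name (status) for every claim whose STATUS is not "Bound",
# and exits non-zero if any such claim exists.
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2 " (" $3 ")"; n++ }
       END { exit (n > 0) }'
}
```

Usage: `kubectl get pvc -A | unbound_pvcs || echo "storage layer not ready"`.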

Phase 4: Vault (secrets foundation)

# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault

# 9. Initialize/unseal/restore raft snapshot
# 10. Deploy ESO and verify it can connect to Vault
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A

Phase 5: Platform Services

# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform

# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
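
Step 12 prints a status code but leaves its interpretation to the operator. A minimal sketch of a pass/fail wrapper follows. It treats 2xx and 3xx as healthy, which is an assumption: a service behind authentication may legitimately answer 401 or 403. curl reports `000` when no HTTP response was received at all, for example after a DNS, TCP, or TLS failure. The function names are illustrative.

```shell
# http_ok CODE: succeeds for 2xx/3xx, fails for anything else (including 000).
http_ok() {
  case $1 in
    2[0-9][0-9] | 3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# check_url URL: prints "URL: CODE" and returns non-zero when the code is not healthy.
check_url() {
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1" 2>/dev/null)
  echo "$1: $code"
  http_ok "$code"
}
```

Usage: `check_url https://viktorbarzin.me/ || echo "ingress is not serving yet"`.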

Phase 6: Databases

# 13. Deploy database stack
scripts/tg apply stacks/dbaas

# 14. Wait for the CNPG and InnoDB clusters to initialize
# (the command below waits on the CNPG cluster only; check the InnoDB cluster separately)
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s

# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)

Phase 7: Application Services

# 17. Deploy remaining stacks in any order
for stack in vaultwarden immich nextcloud linkwarden trading health; do
  scripts/tg apply "stacks/$stack"
done

# 18. Restore Vaultwarden (see restore-vaultwarden.md)
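
The loop in step 17 prints no summary of which stacks failed. Because these stacks can be applied in any order, one option is to keep going and report the failures at the end, as sketched below. The `TG` variable and `apply_stacks` function are hypothetical names. The default runner is the `scripts/tg` wrapper used throughout this runbook.

```shell
# Command that applies one stack; defaults to the wrapper used in this runbook.
TG=${TG:-scripts/tg}

# apply_stacks STACK...
# Applies every stack, continues past failures, and lists the failed ones at the end.
apply_stacks() {
  failed=""
  for stack in "$@"; do
    if "$TG" apply "stacks/$stack"; then
      echo "ok: $stack"
    else
      echo "FAILED: $stack" >&2
      failed="$failed $stack"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed stacks:$failed" >&2
    return 1
  fi
}
```

Usage: `apply_stacks vaultwarden immich nextcloud linkwarden trading health`. A non-zero exit status means at least one stack needs another look before moving on to verification.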

Phase 8: Verification

# 19. Check all pods are running
kubectl get pods -A | grep -Ev 'Running|Completed'

# 20. Check all ingresses respond (first host rule of each ingress only)
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read -r host; do
  [ -n "$host" ] || continue  # skip ingresses without a host rule
  code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://$host/" 2>/dev/null)
  echo "$host: $code"
done

# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/

# 22. Run backup CronJobs manually to establish baseline
# (Job names must be unique per namespace; delete any earlier manual-* Jobs before re-running)
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
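
The filter in step 19 matches on the STATUS text, so a pod that reports `Running` while its containers are not all ready (READY `0/1`) is not listed. A stricter check could look like the sketch below. It assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is illustrative.

```shell
# unhealthy_pods: reads `kubectl get pods -A --no-headers` on stdin and prints
# every pod that is neither Completed nor Running with all containers ready.
# Exits non-zero if any such pod is found.
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }
    {
      split($3, ready, "/")   # READY column looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) {
        print $1 "/" $2 ": " $4 " " $3
        bad++
      }
    }
    END { exit (bad > 0) }
  '
}
```

Usage: `kubectl get pods -A --no-headers | unhealthy_pods && echo "all pods healthy"`.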

Dependency Graph

etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
                                                              ↓
                                                        Restore data from
                                                        NFS/Synology backups

Estimated Time

  • Full cluster rebuild from scratch: ~2-4 hours
  • With etcd restore (objects preserved): ~1-2 hours
  • Individual service restore: ~10-30 minutes each