Viktor Barzin af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks

Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild

2026-03-19 20:34:33 +00:00


Full Cluster Rebuild

When to Use

Complete cluster failure (all VMs lost)
etcd corruption requiring full rebuild
Proxmox host failure requiring fresh VM provisioning

Prerequisites

Proxmox host (192.168.1.127) accessible
TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
Git repo with infra code
SOPS age keys for state decryption (~/.config/sops/age/keys.txt)
Vault unseal keys (emergency kit)

Rebuild Order

The rebuild must follow dependency order. Each layer depends on the one before it.

Phase 1: Infrastructure (Proxmox VMs)

# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra

# 2. Wait for VMs to boot and be reachable
# k8s-master, k8s-node3, k8s-node4, k8s-node5 (node1/2 excluded)
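
Step 2 gives no concrete wait mechanism. One possible approach is a small retry helper, sketched below. The helper name `retry_until`, the attempt count, the delay, and the use of SSH as the reachability probe are all assumptions for illustration, not part of the original runbook.

```shell
# Sketch only: retry a command until it succeeds or attempts run out.
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (assumes SSH key access to the nodes; not executed here):
#   for node in k8s-master k8s-node3 k8s-node4 k8s-node5; do
#     retry_until 60 5 ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
#       || { echo "$node is not reachable" >&2; exit 1; }
#   done
```

With 60 attempts and a 5-second delay, each node gets roughly five minutes to come up before the loop gives up.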

Phase 2: Kubernetes Control Plane

# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml

# 4. Join worker nodes
# Get join command from master, run on each node

# 5. Alternative to a fresh init: restore etcd from snapshot (see restore-etcd.md)
# This restores all K8s objects as they were at snapshot time

Phase 3: Storage Layer

# 6. Deploy CSI drivers (NFS + iSCSI)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/iscsi-csi

# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
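
The `grep -v Bound` check above relies on someone reading the output. For scripted runs, a filter that exits non-zero when any claim is not Bound may be more convenient. The sketch below assumes the default column layout of `kubectl get pvc -A --no-headers` (namespace, name, status first). The function name `unbound_pvcs` is hypothetical.

```shell
# Sketch only: read `kubectl get pvc -A --no-headers` output on stdin,
# print every claim whose STATUS column is not "Bound", and exit 1 if any exist.
unbound_pvcs() {
  awk 'NF && $3 != "Bound" { print $1 "/" $2 " (" $3 ")"; bad = 1 }
       END { exit bad + 0 }'
}

# Usage (not executed here):
#   kubectl get pvc -A --no-headers | unbound_pvcs \
#     || echo "some PVCs are not bound yet" >&2
```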

Phase 4: Vault (secrets foundation)

# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault

# 9. Initialize and unseal Vault, then restore the raft snapshot

# 10. Deploy ESO and verify it can connect to Vault
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A

Phase 5: Platform Services

# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform

# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
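
The curl command prints a status code but leaves the pass/fail decision to the reader. A minimal sketch of that decision follows. It assumes that any 2xx or 3xx response means ingress is working, which may be too strict for hosts that answer 401 or 403 behind authentication. The function name `http_ok` is hypothetical.

```shell
# Sketch only: succeed for 2xx/3xx status codes, fail for everything else.
# curl reports 000 when no HTTP response was received at all.
http_ok() {
  case $1 in
    2[0-9][0-9] | 3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not executed here):
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/)
#   http_ok "$code" || echo "ingress check failed with status $code" >&2
```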

Phase 6: Databases

# 13. Deploy database stack
scripts/tg apply stacks/dbaas

# 14. Wait for CNPG and InnoDB clusters to initialize (the command below waits on the CNPG cluster only)
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s

# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)

Phase 7: Application Services

# 17. Deploy remaining stacks in any order
for stack in vaultwarden immich nextcloud linkwarden trading health; do
  scripts/tg apply stacks/$stack
done

# 18. Restore Vaultwarden (see restore-vaultwarden.md)
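
The loop in step 17 keeps going when one stack fails, so a failure can scroll past unnoticed. Since the application stacks can be applied in any order, one option is to apply them all and report the failures at the end. In the sketch below, the `TG` override variable and the function name `apply_stacks` are assumptions added for illustration.

```shell
# Sketch only: apply every stack, remember which ones failed, report at the end.
# TG is a hypothetical override so the wrapper can be swapped out (e.g. TG=echo).
TG=${TG:-scripts/tg}

apply_stacks() {
  failed=""
  for stack in "$@"; do
    if ! "$TG" apply "stacks/$stack"; then
      failed="$failed $stack"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed stacks:$failed" >&2
    return 1
  fi
}

# Usage (not executed here):
#   apply_stacks vaultwarden immich nextcloud linkwarden trading health
```

Failed stacks can then be re-applied individually once the cause is fixed.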

Phase 8: Verification

# 19. Check all pods are running
kubectl get pods -A | grep -v Running | grep -v Completed

# 20. Check all ingresses respond
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read -r host; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
  echo "$host: $code"
done

# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/

# 22. Run backup CronJobs manually to establish baseline
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
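
Creating the jobs does not confirm that they finished. A baseline exists only once each job completes, so it may help to wait on them. The sketch below uses `kubectl wait --for=condition=complete`. The 900-second default timeout and the function name `wait_for_backup_jobs` are assumptions. Note that a job that fails outright is only detected when the timeout expires.

```shell
# Sketch only: read "NAMESPACE JOB" pairs on stdin and wait for each job to complete.
# Returns 1 if any job does not reach the Complete condition within the timeout.
wait_for_backup_jobs() {
  rc=0
  while read -r ns job; do
    [ -n "$ns" ] || continue
    # </dev/null keeps kubectl from consuming the loop's input.
    if ! kubectl wait --for=condition=complete "job/$job" -n "$ns" \
        --timeout="${TIMEOUT:-900s}" </dev/null; then
      echo "backup job $ns/$job did not complete" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not executed here):
#   wait_for_backup_jobs <<'EOF'
#   default manual-etcd-backup
#   dbaas manual-pg-backup
#   dbaas manual-mysql-backup
#   vault manual-vault-backup
#   vaultwarden manual-vw-backup
#   EOF
```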

Dependency Graph

etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
                                                              ↓
                                                        Restore data from
                                                        NFS/Synology backups
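
For a partial rebuild it can be useful to ask which layers must already be healthy before a given one is touched. The sketch below encodes the chain from the graph above as a delimited string and prints everything upstream of a named layer. The variable `LAYERS` and the function `prereqs_for` are hypothetical names, and the data-restore branch of the graph is not modelled.

```shell
# Sketch only: the dependency chain from the graph, in order, separated by "|".
LAYERS='etcd|K8s API|CSI Drivers|Vault|ESO|Platform|Databases|Apps'

# prereqs_for LAYER
# Prints every layer that must be up before LAYER, one per line.
# Exits 1 (printing nothing) if LAYER is not in the chain.
prereqs_for() {
  printf '%s\n' "$LAYERS" | tr '|' '\n' | awk -v target="$1" '
    $0 == target { printf "%s", buf; found = 1; exit }
    { buf = buf $0 "\n" }
    END { exit !found }'
}

# Usage:
#   prereqs_for Vault    # prints etcd, K8s API, CSI Drivers on separate lines
```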

Estimated Time

Full cluster rebuild from scratch: ~2-4 hours
With etcd restore (objects preserved): ~1-2 hours
Individual service restore: ~10-30 minutes each
