backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks

Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00 · 2026-03-19 20:34:33 +00:00 · af2222fce8
commit af2222fce8
parent 62d42657e6
9 changed files with 657 additions and 4 deletions
--- a/docs/runbooks/restore-full-cluster.md
+++ b/docs/runbooks/restore-full-cluster.md
@@ -0,0 +1,128 @@
+# Full Cluster Rebuild
+
+## When to Use
+- Complete cluster failure (all VMs lost)
+- etcd corruption requiring full rebuild
+- Proxmox host failure requiring fresh VM provisioning
+
+## Prerequisites
+- Proxmox host (192.168.1.127) accessible
+- TrueNAS NFS server (192.168.1.2) accessible — or Synology NAS (192.168.1.13) for backups
+- Git repo with infra code
+- SOPS age keys for state decryption (`~/.config/sops/age/keys.txt`)
+- Vault unseal keys (emergency kit)
+
+## Rebuild Order
+
+The rebuild must follow dependency order. Each layer depends on the one before it.
+
+### Phase 1: Infrastructure (Proxmox VMs)
+```bash
+# 1. Provision VMs via Terraform
+cd infra
+scripts/tg apply stacks/infra
+
+# 2. Wait for VMs to boot and be reachable
+# k8s-master, k8s-node3, k8s-node4, k8s-node5 (node1/2 excluded)
+```
+
+### Phase 2: Kubernetes Control Plane
+```bash
+# 3. Initialize kubeadm on master (if starting fresh)
+sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
+
+# 4. Join worker nodes
+# On the master, print the join command with: kubeadm token create --print-join-command; then run it on each worker
+
+# 5. OR restore etcd from snapshot (see restore-etcd.md)
+# This restores all K8s objects from the snapshot time
+```
+
+### Phase 3: Storage Layer
+```bash
+# 6. Deploy CSI drivers (NFS + iSCSI)
+scripts/tg apply stacks/nfs-csi
+scripts/tg apply stacks/iscsi-csi
+
+# 7. Verify PVs are accessible
+kubectl get pv
+kubectl get pvc -A | grep -v Bound
+```
+
+### Phase 4: Vault (secrets foundation)
+```bash
+# 8. Deploy Vault (see restore-vault.md for full procedure)
+scripts/tg apply stacks/vault
+
+# 9. Initialize/unseal/restore raft snapshot
+# 10. Deploy External Secrets Operator (ESO) and verify it can connect to Vault
+scripts/tg apply stacks/external-secrets
+kubectl get externalsecrets -A
+```
+
+### Phase 5: Platform Services
+```bash
+# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
+scripts/tg apply stacks/platform
+
+# 12. Verify ingress is working
+curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/
+```
+
+### Phase 6: Databases
+```bash
+# 13. Deploy database stack
+scripts/tg apply stacks/dbaas
+
+# 14. Wait for CNPG and InnoDB clusters to initialize
+kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s
+
+# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
+# 16. Restore MySQL from dump (see restore-mysql.md)
+```
+
+### Phase 7: Application Services
+```bash
+# 17. Deploy remaining stacks in any order
+for stack in vaultwarden immich nextcloud linkwarden trading health; do
+  scripts/tg apply stacks/$stack
+done
+
+# 18. Restore Vaultwarden (see restore-vaultwarden.md)
+```
+
+### Phase 8: Verification
+```bash
+# 19. Check all pods are running
+kubectl get pods -A | grep -Ev 'Running|Completed'
+
+# 20. Check all ingresses respond
+kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read -r host; do
+  code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "https://$host/" 2>/dev/null)
+  echo "$host: $code"
+done
+
+# 21. Check monitoring
+# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
+# Verify Alertmanager: https://alertmanager.viktorbarzin.me/
+
+# 22. Run backup CronJobs manually to establish baseline
+kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
+kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
+kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
+kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
+kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
+```
+
+## Dependency Graph
+```
+etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
+                                                              ↓
+                                                        Restore data from
+                                                        NFS/Synology backups
+```
+
+## Estimated Time
+- Full cluster rebuild from scratch: ~2-4 hours
+- With etcd restore (objects preserved): ~1-2 hours
+- Individual service restore: ~10-30 minutes each
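
The dependency order in the runbook above (CSI drivers, then Vault, ESO, platform, databases) could be enforced with a small wrapper that applies stacks in sequence and stops at the first failure, so a later layer is never applied on top of a broken one. This is only a sketch: `apply_in_order` and `tg_apply` are hypothetical helper names that do not exist in the repository, and it assumes `scripts/tg apply stacks/<name>` exits non-zero on failure, which the runbook does not state.

```bash
#!/usr/bin/env bash
# Sketch only: apply_in_order and tg_apply are hypothetical helpers,
# not part of the repository.

# Thin wrapper around the runbook's apply command.
# Assumption: scripts/tg exits non-zero when an apply fails.
tg_apply() {
  scripts/tg apply "$1"
}

# Apply each named stack in the order given; stop at the first failure.
apply_in_order() {
  local stack
  for stack in "$@"; do
    echo "applying stacks/$stack"
    if ! tg_apply "stacks/$stack"; then
      echo "apply failed at stacks/$stack; later stacks skipped" >&2
      return 1
    fi
  done
}

# Example (run from the infra directory), following Phases 3-6:
# apply_in_order nfs-csi iscsi-csi vault external-secrets platform dbaas
```

The manual steps between phases (Vault unseal, raft snapshot restore, database dump restores) still have to happen at the points the runbook names, so in practice the wrapper would likely be called once per phase rather than once for the whole rebuild.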