backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks

Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
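
The Phase 2 rotation pattern (a new timestamped file per run, then an age-based purge) can be sketched in isolation. This is a minimal illustration rather than the code in this commit: the temporary `BACKUP_DIR`, the `prune_old_backups` helper, the placeholder payload, and the back-dated file are all assumptions made so the behaviour can be observed without a Redis instance.

```shell
#!/bin/sh
# Sketch: timestamped backups plus age-based pruning.
# BACKUP_DIR, prune_old_backups and the demo files are hypothetical,
# introduced only for illustration.
set -eu

BACKUP_DIR=$(mktemp -d)
TIMESTAMP=$(date +%Y%m%d-%H%M)

# Delete files matching redis-*.rdb in directory $1 whose age exceeds $2 days.
# find rounds file age down to whole days, so "-mtime +7" only matches
# files that are at least 8 full days old.
prune_old_backups() {
    find "$1" -name 'redis-*.rdb' -type f -mtime +"$2" -delete
}

# Stand-in for `redis-cli --rdb`: each run writes a new file
# instead of overwriting a single dump.rdb.
echo "demo payload" > "$BACKUP_DIR/redis-$TIMESTAMP.rdb"

# Simulate a stale backup by back-dating its modification time.
touch -t 202001010000 "$BACKUP_DIR/redis-20200101-0000.rdb"

prune_old_backups "$BACKUP_DIR" 7

# Only the fresh backup should remain.
ls "$BACKUP_DIR"
```

Because the purge runs after the new file is written, a failed copy under `set -e` stops the script before any older backup is deleted.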
Viktor Barzin 2026-03-19 20:34:33 +00:00
parent 62d42657e6
commit af2222fce8
9 changed files with 657 additions and 4 deletions

@@ -283,12 +283,15 @@ resource "kubernetes_cron_job_v1" "redis-backup" {
 image = "redis:7-alpine"
 command = ["/bin/sh", "-c", <<-EOT
 set -eux
+TIMESTAMP=$(date +%Y%m%d-%H%M)
 # Trigger a fresh RDB save on the master
 redis-cli -h redis.redis BGSAVE
 sleep 5
 # Copy the RDB via redis-cli --rdb
-redis-cli -h redis.redis --rdb /backup/dump.rdb
-echo "Backup complete: $(ls -lh /backup/dump.rdb)"
+redis-cli -h redis.redis --rdb /backup/redis-$TIMESTAMP.rdb
+# Rotate 7-day retention
+find /backup -name 'redis-*.rdb' -type f -mtime +7 -delete
+echo "Backup complete: redis-$TIMESTAMP.rdb"
 EOT
 ]
 volume_mount {