
Full Cluster Rebuild

Last updated: 2026-04-06

When to Use

  • Complete cluster failure (all VMs lost)
  • etcd corruption requiring full rebuild
  • Proxmox host failure requiring fresh VM provisioning

Prerequisites

  • Proxmox host (192.168.1.127) accessible
  • TrueNAS NFS server (10.0.10.15) accessible — or Synology NAS (192.168.1.13) for backups
  • sda backup disk mounted at /mnt/backup on PVE host (or restore from Synology first)
  • Git repo with infra code
  • SOPS age keys for state decryption (~/.config/sops/age/keys.txt)
  • Vault unseal keys (emergency kit)
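
Before starting, it helps to confirm the required tooling and key material exist on the machine driving the rebuild. A minimal preflight sketch; the command list and key path are assumptions based on the steps in this runbook:

```shell
# Preflight: check required commands and the SOPS age key.
# Command list and key path are assumptions drawn from this runbook.
have() { command -v "$1" >/dev/null 2>&1 && echo "ok: $1" || echo "MISSING: $1"; }

for cmd in ssh kubectl rsync curl sops; do
  have "$cmd"
done

[ -f "$HOME/.config/sops/age/keys.txt" ] \
  && echo "ok: age key" \
  || echo "MISSING: ~/.config/sops/age/keys.txt"
```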

Rebuild Order

The rebuild must follow dependency order. Each layer depends on the one before it.

Phase 1: Infrastructure (Proxmox VMs)

# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra

# 2. Wait for VMs to boot and be reachable
# k8s-master, k8s-node3, k8s-node4, k8s-node5
# (node1 has GPU workloads, node2 excluded from MySQL anti-affinity only — both are active cluster members)

Phase 2: Kubernetes Control Plane

# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml

# 4. Join worker nodes
# Get join command from master, run on each node

# 5. OR restore etcd from snapshot (see restore-etcd.md)
# This restores all K8s objects from the snapshot time

Phase 3: Storage Layer

# 6. Deploy CSI drivers (NFS + Proxmox)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/proxmox-csi

# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
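
Note that `grep -v Bound` also lets the header row through and matches "Bound" anywhere on the line; an awk variant keyed on the STATUS column (column 3 of `kubectl get pvc -A` output) is more precise:

```shell
# Print only PVCs whose STATUS column is not "Bound" (column 3),
# skipping the header row of `kubectl get pvc -A` output.
unbound_pvcs() { awk 'NR > 1 && $3 != "Bound"'; }

# Usage: kubectl get pvc -A | unbound_pvcs
```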

Phase 3.5: Restore PVC Data from sda Backup

After the storage layer is deployed, restore PVC data from the sda backup disk:

# 7a. List available backup weeks
ssh root@192.168.1.127
ls -l /mnt/backup/pvc-data/
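
The per-week directory names appear to follow an ISO year-week pattern (an assumption inferred from the `WEEK="2026-14"` example below). With GNU date the label for any date can be computed rather than guessed:

```shell
# Compute the ISO year-week backup label for a given date.
# Naming scheme (YYYY-WW) is an assumption inferred from WEEK="2026-14";
# requires GNU date (-d).
week_label() { date -d "$1" +%G-%V; }

week_label 2026-04-06   # prints 2026-15
```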

# 7b. For each critical PVC, restore files:
# Example: vaultwarden-data-proxmox
WEEK="2026-14"  # Use most recent week
NAMESPACE="vaultwarden"
PVC_NAME="vaultwarden-data-proxmox"

# Find the PV LV name
kubectl get pv -o custom-columns='PV:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace,HANDLE:.spec.csi.volumeHandle' | grep "$PVC_NAME"

# Assuming volumeHandle is "local-lvm:vm-999-pvc-abc123"
LV_NAME="vm-999-pvc-abc123"
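
The LV name is everything after the storage ID in the volumeHandle, so it can be extracted with shell parameter expansion instead of copied by hand (handle value below is the hypothetical example from above):

```shell
# Extract the LV name from a Proxmox CSI volumeHandle ("<storage>:<lv>").
HANDLE="local-lvm:vm-999-pvc-abc123"   # hypothetical example handle
LV_NAME="${HANDLE#*:}"                 # strip up to and including the first ':'
echo "$LV_NAME"                        # prints vm-999-pvc-abc123
```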

# Mount the LV
lvchange -ay pve/$LV_NAME
mkdir -p /mnt/restore-temp
mount /dev/pve/$LV_NAME /mnt/restore-temp

# Restore from backup
rsync -avP --delete /mnt/backup/pvc-data/$WEEK/$NAMESPACE/$PVC_NAME/ /mnt/restore-temp/

# Unmount
umount /mnt/restore-temp
lvchange -an pve/$LV_NAME
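
Since this sequence is repeated for several PVCs, the activate/mount/rsync/unmount steps can be wrapped in a helper. A dry-run sketch that only prints the commands it would execute (remove the `echo` prefixes to run for real; the helper name and argument order are hypothetical):

```shell
# Dry-run wrapper for the per-PVC restore sequence above; prints each
# command instead of running it. Remove the 'echo' prefixes to execute.
# (Helper name and argument order are hypothetical.)
restore_pvc() {
  local week="$1" ns="$2" pvc="$3" lv="$4"
  echo lvchange -ay "pve/$lv"
  echo mkdir -p /mnt/restore-temp
  echo mount "/dev/pve/$lv" /mnt/restore-temp
  echo rsync -avP --delete "/mnt/backup/pvc-data/$week/$ns/$pvc/" /mnt/restore-temp/
  echo umount /mnt/restore-temp
  echo lvchange -an "pve/$lv"
}

restore_pvc 2026-14 vaultwarden vaultwarden-data-proxmox vm-999-pvc-abc123
```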

# 7c. Repeat for all critical PVCs (prioritize: vaultwarden, vault, redis, nextcloud)

Note on pfSense restore: If pfSense needs restoration, restore config.xml from /mnt/backup/pfsense/<week>/config.xml via the web UI, or use the full filesystem tar if custom scripts also need recovering.

Note on PVE config restore: If custom scripts/timers are lost, restore from /mnt/backup/pve-config/ (daily-backup, offsite-sync-backup, lvm-pvc-snapshot scripts + timers).

Phase 4: Vault (secrets foundation)

# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault

# 9. Initialize/unseal/restore raft snapshot
# 10. Verify ESO can connect
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A

Phase 5: Platform Services

# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform

# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me/

Phase 6: Databases

# 13. Deploy database stack
scripts/tg apply stacks/dbaas

# 14. Wait for CNPG and InnoDB clusters to initialize
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s

# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)

Phase 7: Application Services

# 17. Deploy remaining stacks in any order
for stack in vaultwarden immich nextcloud linkwarden health; do
  scripts/tg apply stacks/$stack
done

# 18. Restore Vaultwarden (see restore-vaultwarden.md)

Phase 8: Verification

# 19. Check all pods are running
kubectl get pods -A | grep -v Running | grep -v Completed

# 20. Check all ingresses respond
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read host; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
  echo "$host: $code"
done
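
To make the sweep easier to scan at a glance, a small (hypothetical) helper can flag anything outside the 2xx/3xx range, including the `000` curl emits on connection failure:

```shell
# Classify an HTTP status code from the ingress sweep above.
# 2xx/3xx count as OK; anything else (including curl's 000) as FAIL.
check_code() {
  case "$1" in
    2??|3??) echo "OK   $1" ;;
    *)       echo "FAIL $1" ;;
  esac
}
check_code 200   # prints: OK   200
check_code 000   # prints: FAIL 000
```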

# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/

# 22. Run backup CronJobs manually to establish a baseline
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
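
The five manual runs follow one pattern, so they can also be generated from a cronjob:namespace list. A dry-run sketch that prints each command (pairs taken from the steps above; the mechanically derived job names differ slightly from the one-off names used there):

```shell
# Dry-run: print the manual-run command for each backup CronJob.
# cronjob:namespace pairs are taken from the steps above; generated job
# names (manual-<cronjob>) differ slightly from the hand-written ones.
for entry in backup-etcd:default postgresql-backup:dbaas mysql-backup:dbaas \
             vault-raft-backup:vault vaultwarden-backup:vaultwarden; do
  cj="${entry%%:*}" ns="${entry##*:}"
  echo kubectl create job --from="cronjob/$cj" "manual-$cj" -n "$ns"
done
```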

Dependency Graph

etcd → K8s API → CSI Drivers → Restore PVC data from sda → Vault → ESO → Platform → Databases → Apps
                                                                                          ↓
                                                                                    Restore DB dumps from
                                                                                    /mnt/backup/nfs-mirror
                                                                                    or Synology/pve-backup

Estimated Time

  • Full cluster rebuild from scratch: ~2-4 hours
  • With etcd restore (objects preserved): ~1-2 hours
  • Individual service restore: ~10-30 minutes each