Viktor Barzin fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]

Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates

2026-04-06 13:21:05 +03:00


Full Cluster Rebuild

When to Use

Complete cluster failure (all VMs lost)
etcd corruption requiring full rebuild
Proxmox host failure requiring fresh VM provisioning

Prerequisites

Proxmox host (192.168.1.127) accessible
TrueNAS NFS server (192.168.1.2) accessible, or the Synology NAS (192.168.1.13) as the alternative source for backups
Git repo with infra code
SOPS age keys for state decryption (~/.config/sops/age/keys.txt)
Vault unseal keys (emergency kit)
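
The reachability and key-file prerequisites above can be checked before starting. The following is a minimal sketch: the IP addresses and the key path come from the list above, while the function names and the use of ping as the probe are illustrative assumptions. It also assumes that either storage box is enough to proceed, which is one reading of the second item. The repo checkout and the Vault unseal keys still have to be confirmed by hand.

```shell
# Preflight sketch. check_host, check_file and preflight are hypothetical
# helper names; ping is an assumed reachability probe.
check_host() {
  if ping -c 1 "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "UNREACHABLE: $1"
    return 1
  fi
}

check_file() {
  if [ -r "$1" ]; then
    echo "ok: $1"
  else
    echo "MISSING: $1"
    return 1
  fi
}

preflight() {
  failed=0
  check_host 192.168.1.127 || failed=1   # Proxmox host
  # Assumption: one of TrueNAS or Synology answering is sufficient.
  check_host 192.168.1.2 || check_host 192.168.1.13 || failed=1
  check_file "$HOME/.config/sops/age/keys.txt" || failed=1
  return "$failed"
}
```

Running `preflight` prints one line per check and returns non-zero if any required item is missing.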

Rebuild Order

The rebuild must follow dependency order. Each layer depends on the one before it.

Phase 1: Infrastructure (Proxmox VMs)

# 1. Provision VMs via Terraform
cd infra
scripts/tg apply stacks/infra

# 2. Wait for VMs to boot and be reachable:
#    k8s-master, k8s-node3, k8s-node4, k8s-node5
#    k8s-node1 (GPU workloads) and k8s-node2 (excluded from MySQL anti-affinity only)
#    are also active cluster members.
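
Step 2 can be scripted as a polling loop. This is a sketch under assumptions: the probe tests TCP port 22 with nc, the hostnames are taken to resolve from the machine running the rebuild, and `probe_host` and `wait_for_hosts` are hypothetical helper names.

```shell
# Assumed probe: succeed once the host accepts TCP connections on port 22.
probe_host() { nc -z -w 2 "$1" 22 >/dev/null 2>&1; }

# wait_for_hosts RETRIES DELAY_SECONDS HOST...
# Polls each host in turn; gives up on a host after RETRIES failed probes.
wait_for_hosts() {
  retries=$1; delay=$2; shift 2
  for host in "$@"; do
    n=0
    until probe_host "$host"; do
      n=$((n + 1))
      if [ "$n" -ge "$retries" ]; then
        echo "timeout: $host" >&2
        return 1
      fi
      sleep "$delay"
    done
    echo "reachable: $host"
  done
}
```

For example, `wait_for_hosts 60 5 k8s-master k8s-node3 k8s-node4 k8s-node5` waits up to about five minutes per host.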

Phase 2: Kubernetes Control Plane

# 3. Initialize kubeadm on master (if starting fresh)
sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yaml

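# 4. Join worker nodes
# On the master, print a fresh join command:
sudo kubeadm token create --print-join-command
# Then run the printed "kubeadm join ..." command with sudo on each worker node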

# 5. OR restore etcd from snapshot (see restore-etcd.md)
# This restores all K8s objects from the snapshot time
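
Before moving on to storage, it helps to confirm that every node has registered and is Ready. A small sketch, assuming kubectl is already pointed at the rebuilt cluster; `not_ready_nodes` is a hypothetical helper name.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is reported too, on purpose.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```

Empty output means all joined nodes are Ready. Nodes that never joined do not appear at all, so the node count still needs a manual look.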

Phase 3: Storage Layer

# 6. Deploy CSI drivers (NFS + iSCSI)
scripts/tg apply stacks/nfs-csi
scripts/tg apply stacks/iscsi-csi

# 7. Verify PVs are accessible
kubectl get pv
kubectl get pvc -A | grep -v Bound
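
The grep in step 7 always exits with the header line in its output, which makes it awkward to use as a gate in a script. A sketch of a stricter variant follows; `unbound_pvcs` is a hypothetical helper name, and it relies on STATUS being the third column of `kubectl get pvc -A` output.

```shell
# List PVCs that are not Bound as "namespace/name STATUS".
# Exit status is 1 if any such PVC exists, 0 otherwise.
unbound_pvcs() {
  kubectl get pvc -A --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2 " " $3; bad = 1 } END { exit bad }'
}
```

This allows `unbound_pvcs || echo "storage layer not ready"` before continuing to Phase 4.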

Phase 4: Vault (secrets foundation)

# 8. Deploy Vault (see restore-vault.md for full procedure)
scripts/tg apply stacks/vault

# 9. Initialize/unseal/restore raft snapshot
# 10. Verify ESO can connect
scripts/tg apply stacks/external-secrets
kubectl get externalsecrets -A
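
ESO cannot sync anything while Vault is sealed, so checking the seal state between steps 9 and 10 saves some confusion. The sketch below leans on the exit codes of `vault status` (0 unsealed, 2 sealed, 1 error). The pod name `vault-0` is an assumption, and the helper names are hypothetical; the namespace `vault` matches the backup command in Phase 8.

```shell
# Map a `vault status` exit code to a word.
describe_vault_status() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Ask the Vault pod for its status and print the seal state.
vault_seal_state() {
  rc=0
  # vault-0 is an assumed pod name; adjust to the actual StatefulSet pod.
  kubectl exec -n vault vault-0 -- vault status >/dev/null 2>&1 || rc=$?
  describe_vault_status "$rc"
}
```

If `vault_seal_state` prints `sealed`, follow restore-vault.md before applying the external-secrets stack.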

Phase 5: Platform Services

# 11. Deploy platform stack (Traefik, monitoring, Kyverno, etc.)
scripts/tg apply stacks/platform

# 12. Verify ingress is working
curl -s -o /dev/null -w "%{http_code}\n" https://viktorbarzin.me/
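
Ingress rarely answers on the first try after a fresh deploy, so step 12 is easier as a retry loop. A minimal sketch: treating any 2xx or 3xx code as healthy is an assumption, and `is_healthy_code` and `wait_for_ingress` are hypothetical helper names.

```shell
# Succeed for 2xx and 3xx codes. curl reports 000 when it got no HTTP response.
is_healthy_code() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_ingress URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_ingress() {
  url=$1; attempts=${2:-30}; delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) || true
    if is_healthy_code "$code"; then
      echo "$url: $code"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$url: last code ${code:-none}" >&2
  return 1
}
```

For example, `wait_for_ingress https://viktorbarzin.me/` retries for up to about five minutes with the defaults.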

Phase 6: Databases

# 13. Deploy database stack
scripts/tg apply stacks/dbaas

# 14. Wait for CNPG and InnoDB clusters to initialize
# (the command below waits only for the CNPG cluster pg-cluster)
kubectl wait --for=condition=Ready cluster/pg-cluster -n dbaas --timeout=600s

# 15. Restore PostgreSQL from dump (see restore-postgresql.md)
# 16. Restore MySQL from dump (see restore-mysql.md)

Phase 7: Application Services

# 17. Deploy remaining stacks in any order
for stack in vaultwarden immich nextcloud linkwarden health; do
  scripts/tg apply "stacks/$stack"
done

# 18. Restore Vaultwarden (see restore-vaultwarden.md)
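
Because the application stacks can be applied in any order, one failure does not need to stop the rest. The sketch below keeps going and reports the failures at the end. `apply_stacks` and the `TG` override variable are hypothetical; the default points at the same `scripts/tg` wrapper used above and assumes the working directory is `infra`.

```shell
# Path to the apply wrapper; overridable for dry runs.
TG=${TG:-scripts/tg}

# apply_stacks STACK...
# Applies every stack, collects the ones that fail, returns 1 if any did.
apply_stacks() {
  failed=""
  for stack in "$@"; do
    echo "==> applying stacks/$stack"
    if ! "$TG" apply "stacks/$stack"; then
      failed="$failed $stack"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed" >&2
    return 1
  fi
}
```

For example, `apply_stacks vaultwarden immich nextcloud linkwarden health` covers the stacks named in step 17.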

Phase 8: Verification

# 19. Check all pods are running
kubectl get pods -A | grep -Ev 'Running|Completed'

# 20. Check all ingresses respond
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[0].host}{"\n"}{end}' | while read -r host; do
  [ -n "$host" ] || continue  # skip ingresses that define no host
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://$host/" 2>/dev/null)
  echo "$host: $code"
done

# 21. Check monitoring
# Verify Prometheus targets: https://prometheus.viktorbarzin.me/targets
# Verify Alertmanager: https://alertmanager.viktorbarzin.me/

# 22. Run backup CronJobs manually to establish baseline
kubectl create job --from=cronjob/backup-etcd manual-etcd-backup -n default
kubectl create job --from=cronjob/postgresql-backup manual-pg-backup -n dbaas
kubectl create job --from=cronjob/mysql-backup manual-mysql-backup -n dbaas
kubectl create job --from=cronjob/vault-raft-backup manual-vault-backup -n vault
kubectl create job --from=cronjob/vaultwarden-backup manual-vw-backup -n vaultwarden
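
Step 22 only creates the jobs; it does not confirm they finish. A sketch that creates each job and waits for it follows. The namespace/CronJob pairs are copied from the commands above, while `run_baseline_backups`, the timestamp suffix on job names, and the default 30 minute timeout are assumptions.

```shell
# namespace/cronjob pairs, taken from the step 22 commands.
BACKUPS="default/backup-etcd dbaas/postgresql-backup dbaas/mysql-backup vault/vault-raft-backup vaultwarden/vaultwarden-backup"

# run_baseline_backups [TIMEOUT]
# Runs the backups one at a time and stops at the first failure or timeout.
run_baseline_backups() {
  suffix=$(date +%s)   # keeps job names unique across reruns
  for entry in $BACKUPS; do
    ns=${entry%%/*}
    cj=${entry#*/}
    job="manual-$cj-$suffix"
    kubectl create job --from="cronjob/$cj" "$job" -n "$ns" || return 1
    kubectl wait --for=condition=complete "job/$job" -n "$ns" \
      --timeout="${1:-30m}" || return 1
  done
}
```

Note that `kubectl wait --for=condition=complete` does not return early when a job fails; a failed job shows up as a timeout, so check the job's pods in that case.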

Dependency Graph

etcd → K8s API → CSI Drivers → Vault → ESO → Platform → Databases → Apps
                                                              ↓
                                                        Restore data from
                                                        NFS/Synology backups
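
For a partial outage, the chain above tells you where to resume: everything downstream of the failed layer has to be checked or redeployed. A small sketch that encodes the chain; the lowercase layer labels and the `layers_from` helper are hypothetical names chosen for this example.

```shell
# The dependency chain from the graph, in rebuild order.
LAYERS="etcd k8s-api csi-drivers vault eso platform databases apps"

# layers_from LAYER
# Prints LAYER and every layer that depends on it, in rebuild order.
layers_from() {
  found=0
  result=""
  for layer in $LAYERS; do
    if [ "$layer" = "$1" ]; then found=1; fi
    if [ "$found" -eq 1 ]; then result="$result $layer"; fi
  done
  if [ "$found" -eq 0 ]; then
    echo "unknown layer: $1" >&2
    return 1
  fi
  echo "${result# }"
}
```

For example, `layers_from vault` prints `vault eso platform databases apps`.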

Estimated Time

Full cluster rebuild from scratch: ~2-4 hours
With etcd restore (objects preserved): ~1-2 hours
Individual service restore: ~10-30 minutes each
