daily-backup ran out of its 1h budget and SIGTERMed for 10 days straight (Apr 30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount, which blocked the next run from completing — root cause of the WeeklyBackupStale alert going silent (the metric never reached its end-of-script push). Fixes: - TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting the wall during week 18 runs) - Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as belt-and-braces for any inherited stuck state from a prior crashed run - TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of the alert going blind on systemd kills - pfsense metric pushed in BOTH success and failure paths (was only on success; any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert threshold expired) Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to /srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end: 3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql image is stripped (no curl/wget/python) — switched to docker.io/library/postgres matching the dbaas/postgresql-backup pattern with apt-installed curl. Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed backup_weekly_last_success_timestamp but the script pushes daily_backup_last_run_timestamp). Updated to match what's actually emitted, and added a "default-covered" footnote to the Service Protection Matrix so the ~40 services with PVCs not enumerated in the table are no longer ambiguous. Manual PVE-host actions (out-of-band, not in TF): - unmounted 6 stacked snapshots from /tmp/pvc-mount - pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the loop got SIGTERMed against repeatedly, so prune kept failing) - created /srv/nfs/postiz-backup directory - triggered a one-shot daily-backup run with the new TimeoutStartSec to validate the fix end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| architecture | ||
| plans | ||
| post-mortems | ||
| runbooks | ||
| README.md | ||
Infrastructure Documentation
This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.
Quick Reference
Network Ranges
- Physical Network:
192.168.1.0/24- Physical devices and host network - Management VLAN 10:
10.0.10.0/24- Infrastructure VMs and management - Kubernetes VLAN 20:
10.0.20.0/24- Kubernetes cluster network
Key URLs
- Public:
viktorbarzin.me - Internal:
viktorbarzin.lan
Architecture Documentation
| Document | Description |
|---|---|
| Overview | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| Networking | Network topology, VLANs, routing, and firewall rules |
| VPN | Headscale mesh VPN and Cloudflare Tunnel configuration |
| Storage | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
| Authentication | Authentik SSO, OIDC flows, and service integration |
| Security | CrowdSec IPS, Kyverno policies, and security controls |
| Monitoring | Prometheus, Grafana, Loki, and observability stack |
| Secrets Management | HashiCorp Vault integration and secret rotation |
| CI/CD | Woodpecker CI pipeline and deployment automation |
| Backup & DR | Backup strategy, disaster recovery, and restore procedures |
| Compute | Proxmox VMs, GPU passthrough, K8s resource management, and VPA |
| Databases | PostgreSQL, MySQL, Redis, and database operators |
| Multi-tenancy | Namespace isolation, tier system, and resource quotas |
Operations
- Runbooks - Step-by-step operational procedures
- Plans - Infrastructure change plans and rollout strategies
Getting Started
- Review the Overview for a high-level understanding
- Read the Networking doc to understand connectivity
- Check Compute for resource management patterns
- Explore individual architecture docs based on your area of interest