viktor/infra

Viktor Barzin ead876ec65	2026-06-21 16:57:44 +00:00

k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases

Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.` to `k8s-upgrade-(preflight\|master\|worker\|postflight)-.` so the new report job (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today), surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
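
The effect of scoping the K8sUpgradeChainJobFailed regex can be illustrated with a small Python sketch. The two patterns are copied from the commit message, with two assumptions: the `\|` is treated as an escaped `|` (plain alternation), and matching is done on the name prefix only. Prometheus regex matchers are fully anchored, so the real rule may differ in its trailing wildcard. The job names and the `trips_alert` helper are hypothetical, chosen only for illustration.

```python
import re

# Patterns as quoted in the commit message. Assumption: "\|" there stands
# for plain alternation, so "|" is used here.
BROAD = re.compile(r"k8s-upgrade-.")
SCOPED = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.")


def trips_alert(pattern: re.Pattern, job_name: str) -> bool:
    """Return True if job_name is selected by the pattern.

    Simplification: re.match anchors only at the start of the name,
    which is enough to show the difference between the two patterns.
    """
    return pattern.match(job_name) is not None


# Hypothetical job names: one per upgrade phase, plus a run of the new
# nightly report CronJob.
phase_jobs = [
    "k8s-upgrade-preflight-abc12",
    "k8s-upgrade-master-abc12",
    "k8s-upgrade-worker-abc12",
    "k8s-upgrade-postflight-abc12",
]
report_job = "k8s-upgrade-nightly-report-29171234"

for name in phase_jobs + [report_job]:
    print(f"{name}: broad={trips_alert(BROAD, name)} "
          f"scoped={trips_alert(SCOPED, name)}")
```

Under these assumptions the broad pattern selects the report job as well as the four phase jobs, while the scoped pattern selects only the phase jobs.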
adr	homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo	2026-06-21 10:45:32 +00:00
architecture	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases	2026-06-21 16:57:44 +00:00
benchmarks	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
plans	docs: t3-migrate-idle runbook section + service-catalog + design status	2026-06-21 12:40:46 +00:00
post-mortems	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror	2026-06-16 22:32:43 +00:00
runbooks	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases	2026-06-21 16:57:44 +00:00
known-issues.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
README.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00

README.md

Infrastructure Documentation

This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.

Quick Reference

Network Ranges

Physical Network: 192.168.1.0/24 - Physical devices and host network
Management VLAN 10: 10.0.10.0/24 - Infrastructure VMs and management
Kubernetes VLAN 20: 10.0.20.0/24 - Kubernetes cluster network
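
As a quick way to tell which of the ranges above an address belongs to, here is a minimal Python sketch using the standard-library `ipaddress` module. The CIDR ranges are taken from the list above; the `classify` helper and the sample addresses are hypothetical and not part of this repository.

```python
import ipaddress
from typing import Optional

# Ranges copied from the Network Ranges list above.
NETWORKS = {
    "Physical Network": ipaddress.ip_network("192.168.1.0/24"),
    "Management VLAN 10": ipaddress.ip_network("10.0.10.0/24"),
    "Kubernetes VLAN 20": ipaddress.ip_network("10.0.20.0/24"),
}


def classify(address: str) -> Optional[str]:
    """Return the name of the range containing address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in NETWORKS.items():
        if ip in network:
            return name
    return None


# Sample addresses are illustrative only.
for sample in ("192.168.1.1", "10.0.10.5", "10.0.20.15", "10.0.30.1"):
    print(sample, "->", classify(sample))
```

An address outside all three ranges, such as the last sample, yields `None`.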

Key URLs

Public: viktorbarzin.me
Internal: viktorbarzin.lan

Architecture Documentation

Document	Description
Overview	Infrastructure overview, hardware specs, VM inventory, and service catalog
Networking	Network topology, VLANs, routing, and firewall rules
VPN	Headscale mesh VPN and Cloudflare Tunnel configuration
Storage	Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management
Authentication	Authentik SSO, OIDC flows, and service integration
Security	CrowdSec IPS, Kyverno policies, and security controls
Monitoring	Prometheus, Grafana, Loki, and observability stack
Secrets Management	HashiCorp Vault integration and secret rotation
CI/CD	Woodpecker CI pipeline and deployment automation
Backup & DR	Backup strategy, disaster recovery, and restore procedures
Compute	Proxmox VMs, GPU passthrough, K8s resource management, and VPA
Databases	PostgreSQL, MySQL, Redis, and database operators
Multi-tenancy	Namespace isolation, tier system, and resource quotas

Operations

Runbooks - Step-by-step operational procedures
Plans - Infrastructure change plans and rollout strategies

Getting Started

1. Review the Overview for a high-level understanding
2. Read the Networking doc to understand connectivity
3. Check Compute for resource management patterns
4. Explore individual architecture docs based on your area of interest