viktor/infra

Viktor Barzin 448bc0c0f6	2026-05-22 14:16:45 +00:00

k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

preflight (k8s-node1)
→ master (k8s-node1) drains k8s-master
→ worker × 4 (k8s-node1) drains k8s-node{4,3,2}
→ worker (k8s-master + control-plane toleration) drains k8s-node1
→ postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent: a failed Job can be re-created without duplicating downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance; discovered the hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
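The naming scheme and node pinning described in the commit message above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the real pipeline uses scripts/upgrade-step.sh and job-template.yaml, and the names `Step`, `CHAIN`, `job_name`, `planned_job_names`, and `self_draining_steps` are hypothetical. The sketch also assumes that the optional `<node>` suffix is the node being drained and is used only for worker phases, and that the target version is written like `v1.34.7`; the commit message confirms neither.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    phase: str
    runs_on: Optional[str]  # node the Job is pinned to (None = no pinning)
    drains: Optional[str]   # node the Job drains (None = drains nothing)


# Order and pinning follow the commit message.
CHAIN: List[Step] = [
    Step("preflight", "k8s-node1", None),
    Step("master", "k8s-node1", "k8s-master"),
    Step("worker", "k8s-node1", "k8s-node4"),
    Step("worker", "k8s-node1", "k8s-node3"),
    Step("worker", "k8s-node1", "k8s-node2"),
    Step("worker", "k8s-master", "k8s-node1"),
    Step("postflight", None, None),
]


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build k8s-upgrade-<phase>-<target_version>[-<node>]."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


def planned_job_names(chain: List[Step], target_version: str) -> List[str]:
    """Name every Job in the chain (assumption: only worker Jobs get a node suffix)."""
    return [
        job_name(s.phase, target_version, s.drains if s.phase == "worker" else None)
        for s in chain
    ]


def self_draining_steps(chain: List[Step]) -> List[Step]:
    """Return steps pinned to the node they drain, i.e. the self-preemption case."""
    return [s for s in chain if s.drains is not None and s.runs_on == s.drains]


if __name__ == "__main__":
    for name in planned_job_names(CHAIN, "v1.34.7"):
        print(name)
    print("self-draining steps:", self_draining_steps(CHAIN))
```

Because each name is a pure function of phase, target version, and node, re-applying a manifest for the same step addresses the same Job object instead of creating a second one. The `self_draining_steps` check captures the failure the commit fixes: no Job in the chain runs on the node it drains.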
architecture	k8s-version-upgrade: decompose into Job chain to fix self-preemption	2026-05-22 14:16:45 +00:00
benchmarks	infra/llama-cpp: benchmark report + -fa flag fix	2026-05-22 14:16:41 +00:00
plans	docs/plans: 2026-04-20 infra audit design (post-research, post-challenge)	2026-05-22 14:16:41 +00:00
post-mortems	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items	2026-05-22 14:16:41 +00:00
runbooks	k8s-version-upgrade: decompose into Job chain to fix self-preemption	2026-05-22 14:16:45 +00:00
README.md	[docs] TrueNAS decommission cleanup — remove references from active docs	2026-04-19 16:55:43 +00:00

README.md

Infrastructure Documentation

This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.

Quick Reference

Network Ranges

Physical Network: 192.168.1.0/24 - Physical devices and host network
Management VLAN 10: 10.0.10.0/24 - Infrastructure VMs and management
Kubernetes VLAN 20: 10.0.20.0/24 - Kubernetes cluster network
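As a quick way to see which of the ranges above an address belongs to, here is a minimal sketch using Python's standard `ipaddress` module. The `NETWORKS` table and the `classify` helper are hypothetical names introduced for illustration, and the sample addresses are made up rather than taken from the actual inventory.

```python
import ipaddress
from typing import Optional

# The three ranges listed in this README.
NETWORKS = {
    "Physical Network": ipaddress.ip_network("192.168.1.0/24"),
    "Management VLAN 10": ipaddress.ip_network("10.0.10.0/24"),
    "Kubernetes VLAN 20": ipaddress.ip_network("10.0.20.0/24"),
}


def classify(address: str) -> Optional[str]:
    """Return the name of the range containing `address`, or None if none match."""
    ip = ipaddress.ip_address(address)
    for name, network in NETWORKS.items():
        if ip in network:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical example addresses.
    for addr in ("192.168.1.1", "10.0.10.5", "10.0.20.15", "10.0.30.1"):
        print(addr, "->", classify(addr))
```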

Key URLs

Public: viktorbarzin.me
Internal: viktorbarzin.lan

Architecture Documentation

Document	Description
Overview	Infrastructure overview, hardware specs, VM inventory, and service catalog
Networking	Network topology, VLANs, routing, and firewall rules
VPN	Headscale mesh VPN and Cloudflare Tunnel configuration
Storage	Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management
Authentication	Authentik SSO, OIDC flows, and service integration
Security	CrowdSec IPS, Kyverno policies, and security controls
Monitoring	Prometheus, Grafana, Loki, and observability stack
Secrets Management	HashiCorp Vault integration and secret rotation
CI/CD	Woodpecker CI pipeline and deployment automation
Backup & DR	Backup strategy, disaster recovery, and restore procedures
Compute	Proxmox VMs, GPU passthrough, K8s resource management, and VPA
Databases	PostgreSQL, MySQL, Redis, and database operators
Multi-tenancy	Namespace isolation, tier system, and resource quotas

Operations

Runbooks - Step-by-step operational procedures
Plans - Infrastructure change plans and rollout strategies

Getting Started

Review the Overview for a high-level understanding
Read the Networking doc to understand connectivity
Check Compute for resource management patterns
Explore individual architecture docs based on your area of interest