infra/docs
Viktor Barzin 7dfe89a6e0 [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes
Five compounding factors produced the 2026-04-22 flap cascade: soft
anti-affinity let 2/3 pods co-locate on k8s-node3 (which bounced
NotReady→Ready at 11:42Z and took quorum), aggressive sentinel/probe
timing amplified LUKS-encrypted LVM I/O stalls into spurious
+switch-master loops, HAProxy's 1s polling raced sentinel failovers
and routed writes to demoted masters, publish_not_ready_addresses=true
fed not-yet-ready pods into HAProxy DNS, and realestate-crawler-celery
CrashLoopBackOff closed the feedback loop.

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false

Post-rollout verification clean: 0 flaps, 0 +switch-master events,
0 celery ReadOnlyError in the 60s window after settle. Docs updated.
2026-04-22 15:59:00 +00:00
..
architecture [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes 2026-04-22 15:59:00 +00:00
plans [docs] Update anti-AI and rybbit docs after rewrite-body removal 2026-04-17 21:43:13 +00:00
post-mortems post-mortem 2026-04-22: full timeline — second regression + node4 reboot 2026-04-22 11:44:56 +00:00
runbooks vault runbook + raft/HA stuck-leader alerts 2026-04-22 12:44:46 +00:00
README.md [docs] TrueNAS decommission cleanup — remove references from active docs 2026-04-19 16:55:43 +00:00

Infrastructure Documentation

This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.

Quick Reference

Network Ranges

  • Physical Network: 192.168.1.0/24 - Physical devices and host network
  • Management VLAN 10: 10.0.10.0/24 - Infrastructure VMs and management
  • Kubernetes VLAN 20: 10.0.20.0/24 - Kubernetes cluster network

Key URLs

  • Public: viktorbarzin.me
  • Internal: viktorbarzin.lan

Architecture Documentation

Document Description
Overview Infrastructure overview, hardware specs, VM inventory, and service catalog
Networking Network topology, VLANs, routing, and firewall rules
VPN Headscale mesh VPN and Cloudflare Tunnel configuration
Storage Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management
Authentication Authentik SSO, OIDC flows, and service integration
Security CrowdSec IPS, Kyverno policies, and security controls
Monitoring Prometheus, Grafana, Loki, and observability stack
Secrets Management HashiCorp Vault integration and secret rotation
CI/CD Woodpecker CI pipeline and deployment automation
Backup & DR Backup strategy, disaster recovery, and restore procedures
Compute Proxmox VMs, GPU passthrough, K8s resource management, and VPA
Databases PostgreSQL, MySQL, Redis, and database operators
Multi-tenancy Namespace isolation, tier system, and resource quotas

Operations

  • Runbooks - Step-by-step operational procedures
  • Plans - Infrastructure change plans and rollout strategies

Getting Started

  1. Review the Overview for a high-level understanding
  2. Read the Networking doc to understand connectivity
  3. Check Compute for resource management patterns
  4. Explore individual architecture docs based on your area of interest