viktor/infra

Viktor Barzin 9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. Technitium (WS A) - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). CoreDNS (WS B) - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. Observability (WS G) - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. Post-apply readiness gate (WS H) - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. Docs (WS I) - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>		2026-04-19 14:53:41 +00:00
..
architecture	[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate	2026-04-19 14:53:41 +00:00
plans	[docs] Update anti-AI and rybbit docs after rewrite-body removal	2026-04-17 21:43:13 +00:00
post-mortems	[docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha	2026-04-18 13:23:14 +00:00
runbooks	[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate	2026-04-19 14:53:41 +00:00
README.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00

README.md

Infrastructure Documentation

This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.

Quick Reference

Network Ranges

Physical Network: 192.168.1.0/24 - Physical devices and host network
Management VLAN 10: 10.0.10.0/24 - Infrastructure VMs and management
Kubernetes VLAN 20: 10.0.20.0/24 - Kubernetes cluster network

Key URLs

Public: viktorbarzin.me
Internal: viktorbarzin.lan

Architecture Documentation

Document	Description
Overview	Infrastructure overview, hardware specs, VM inventory, and service catalog
Networking	Network topology, VLANs, routing, and firewall rules
VPN	Headscale mesh VPN and Cloudflare Tunnel configuration
Storage	TrueNAS NFS, democratic-csi, and persistent volume management
Authentication	Authentik SSO, OIDC flows, and service integration
Security	CrowdSec IPS, Kyverno policies, and security controls
Monitoring	Prometheus, Grafana, Loki, and observability stack
Secrets Management	HashiCorp Vault integration and secret rotation
CI/CD	Woodpecker CI pipeline and deployment automation
Backup & DR	Backup strategy, disaster recovery, and restore procedures
Compute	Proxmox VMs, GPU passthrough, K8s resource management, and VPA
Databases	PostgreSQL, MySQL, Redis, and database operators
Multi-tenancy	Namespace isolation, tier system, and resource quotas

Operations

Runbooks - Step-by-step operational procedures
Plans - Infrastructure change plans and rollout strategies

Getting Started

Review the Overview for a high-level understanding
Read the Networking doc to understand connectivity
Check Compute for resource management patterns
Explore individual architecture docs based on your area of interest