infra

Author	SHA1	Message	Date
Viktor Barzin	0cce9d350a	[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA - Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2 - Architecture: replication with Sentinel (1 master + 1 replica + sentinels) - Automatic failover via Sentinel (quorum=2, masterSet=mymaster) - Service 'redis.redis' always points at current master (transparent to clients) - 120 clients connected immediately after deployment - Sentinel confirmed tracking redis-node-0 as master - Local-path PVCs for persistence (2Gi per node) - Auth disabled (matches previous setup) - Hourly RDB backup CronJob to NFS preserved - OCI chart pulled via pull-through cache (10.0.20.10:5000)	2026-02-28 19:59:58 +00:00
Viktor Barzin	00717d0c7e	[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue	2026-02-28 19:44:16 +00:00
Viktor Barzin	fdd4e3e467	[ci skip] Phase 2: migrate Redis from NFS to local disk - Switch from redis/redis-stack:latest to redis:7-alpine (modules were completely unused — zero module commands in stats) - Move data from NFS (/mnt/main/redis) to local-path PVC (RDB saves: 39s on NFS → <1s on local disk) - Start fresh (old RDB had redis-stack module data incompatible with plain redis; all Redis data is transient — queues and caches rebuild automatically) - Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline - Remove RedisInsight UI ingress (port 8001, only in redis-stack) - Add redis-backup to NFS exports - 110 clients reconnected immediately after switchover - Memory savings: ~100MB from dropping unused modules	2026-02-28 19:44:08 +00:00
Viktor Barzin	5d745376bf	[ci skip] set network observability dashboard auto-refresh to 1h	2026-02-28 19:32:49 +00:00
Viktor Barzin	849116d08a	[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors	2026-02-28 19:25:52 +00:00
Viktor Barzin	5be70fb955	[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG. Phase 1 complete: PostgreSQL fully migrated off NFS. dbaas module changes: - Replace old kubernetes_deployment.postgres with null_resource.pg_cluster (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues) - Update postgresql Service selector: app=postgresql → cnpg primary - Update backup CronJob: use postgres user + read password from CNPG secret (pg-cluster-superuser) instead of hardcoded root password - Add kube_config_path variable for kubectl in null_resource - Old deployment deleted from cluster (was scaled to 0). CNPG cluster status: - 2 instances: primary (k8s-node4), replica (k8s-node2) - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) - 20Gi local-path storage per instance - All 13 dependent services verified running - Backup CronJob verified working with new endpoint	2026-02-28 19:23:36 +00:00
Viktor Barzin	39e3fae488	[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map	2026-02-28 19:14:20 +00:00
Viktor Barzin	5e7dc5ba4a	[ci skip] combine caretta and goflow2 into unified network observability dashboard	2026-02-28 19:04:53 +00:00
Viktor Barzin	c24ae7e8df	[ci skip] fix caretta helm values and goflow2 transport args	2026-02-28 18:51:02 +00:00
Viktor Barzin	205d9f9fc4	fix: use plain string for cache_from/cache_to and fix caretta helm_release - cache_from/cache_to must be plain strings, not YAML lists — the plugin-docker-buildx treats them as single string values and the Woodpecker settings layer was splitting comma-separated list items into separate --cache-from flags (type=registry and ref=... separately) - caretta.tf: replace deprecated set{} blocks with values=[yamlencode()] to fix Terraform plan error with newer Helm provider	2026-02-28 18:47:20 +00:00
Viktor Barzin	22ffd5001d	[ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module	2026-02-28 18:30:20 +00:00
Viktor Barzin	ca6c7c865a	[ci skip] add goflow2 netflow collector to monitoring module	2026-02-28 18:29:07 +00:00
Viktor Barzin	be89d3c48f	[ci skip] add caretta eBPF pod topology to monitoring module	2026-02-28 18:28:09 +00:00
Viktor Barzin	3f558bd4da	[ci skip] install CloudNativePG operator as platform module - CNPG v0.27.1 operator in cnpg-system namespace - CRDs installed: clusters, backups, poolers, databases, etc. - local-path StorageClass already exists (from cloud-init template) - Prerequisite for PostgreSQL migration off NFS	2026-02-28 17:22:53 +00:00
Viktor Barzin	5318761336	[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules	2026-02-28 17:03:33 +00:00
Viktor Barzin	6aa29e9f77	[ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication. Secondary instance on a separate node replicates all zones from primary via zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at least 1 DNS pod survives voluntary disruptions. Setup job automates zone transfer enablement and secondary zone creation via Technitium REST API.	2026-02-28 14:14:20 +00:00
Viktor Barzin	16343475b1	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	a1b669ae97	[ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing. Add Vertical Pod Autoscaler (recommender, updater, admission-controller) and Goldilocks dashboard to monitor resource recommendations across all namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.	2026-02-25 21:54:01 +00:00
Viktor Barzin	ef158ed431	[ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies - Scale admission controller to 2 replicas with topology spread across nodes - Rewrite inject-priority-class-from-tier: use namespaceSelector instead of API call per pod admission (eliminates Kyverno→API server round-trip) - Rewrite sync-tier-label-from-namespace: same namespaceSelector approach - Extract governance_tiers local to DRY up tier definitions	2026-02-24 23:09:56 +00:00
Viktor Barzin	882df4cc5c	[ci skip] kyverno: fix crash loop (failurePolicy Ignore, increase memory, pin chart). Admission controller was restarting every ~5min due to API server timeouts causing leader election loss. failurePolicy:Fail meant the webhook blocked all pod creation cluster-wide when Kyverno was unavailable.	2026-02-24 23:00:45 +00:00
Viktor Barzin	c06cca288a	[ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert - Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments - Add node_selector gpu=true to ollama deployment - Pass nfs_server variable through to actualbudget factory modules - Fix AuthentikDown alert to match actual deployment name (goauthentik-server)	2026-02-24 22:55:58 +00:00
Viktor Barzin	18a873a630	[ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders. Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate per endpoint. Fix bar gauge position to sit next to timeseries. Add sort transformation to "Top Offenders (Avg Duration)" panel.	2026-02-24 22:31:41 +00:00
Viktor Barzin	85f88bf167	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs. Prevents Terraform from reverting the Kyverno inject-ndots mutation on every apply. 23 pod specs across 19 platform module files.	2026-02-23 22:43:05 +00:00
Viktor Barzin	a2a83d30aa	[ci skip] monitoring: increase resource quota limits. Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide headroom. Previous values were at 87% and 92% utilization.	2026-02-23 22:42:30 +00:00
Viktor Barzin	e982a8ad81	[ci skip] fix redis OOMKilled: increase memory limits to 2Gi. Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was insufficient for 650MB RDB dataset plus redis-stack modules.	2026-02-23 22:37:56 +00:00
Viktor Barzin	2d919c4d34	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability. Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs. Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb. Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts. Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi. Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	48083bb1fd	Reorder realestate-crawler Grafana dashboard sections. Move API Performance and Per-Endpoint Latency to the top. Move Scraping Overview, Scraping Activity, and Throttling & Errors to the bottom. Keeps the most operationally relevant panels visible first.	2026-02-23 22:03:27 +00:00
Viktor Barzin	449937e22e	Sync realestate-crawler Grafana dashboard with per-endpoint latency panels	2026-02-23 21:31:01 +00:00
Viktor Barzin	8985cd60cc	[ci skip] mailserver: fix Rspamd DKIM signing key path. Mount DKIM private key at Rspamd-expected path (/tmp/docker-mailserver/rspamd/dkim/viktorbarzin.me/mail.private) and add dkim_signing.conf override for domain/selector config. Rspamd does not auto-detect keys from the OpenDKIM path.	2026-02-23 21:01:29 +00:00
Viktor Barzin	04db99fde2	docs: map existing codebase	2026-02-23 20:54:27 +00:00
Viktor Barzin	e95ef07b04	[ci skip] mailserver: tighten DMARC policy to quarantine. Move DMARC enforcement from p=none (monitoring only) to p=quarantine so spoofed emails from viktorbarzin.me are quarantined by recipients.	2026-02-23 20:30:30 +00:00
Viktor Barzin	ce03bc25a9	[ci skip] mailserver: add Postfix rate limiting. Add connection and message rate limits to protect against brute-force attacks on SMTP/IMAP ports. 10 connections and 30 messages per minute per client IP.	2026-02-23 20:29:45 +00:00
Viktor Barzin	74948a8af3	[ci skip] roundcubemail: pin to 1.6-apache, disable debug logging. Pin Roundcubemail to stable 1.6-apache tag instead of :latest to prevent unexpected breakage. Disable SMTP debug and reduce debug level from 6 to 1 for production use.	2026-02-23 20:29:39 +00:00
Viktor Barzin	b7ccae69bc	[ci skip] monitoring: enable mailserver-down Prometheus alert. Uncomment the mailserver availability alert so we get paged if the mail server pod has no available replicas for 5 minutes.	2026-02-23 20:29:33 +00:00
Viktor Barzin	75f5cb2001	[ci skip] mailserver: enable Rspamd, disable OpenDKIM. Enable Rspamd for spam filtering and DKIM signing, replacing OpenDKIM. Rspamd reads existing DKIM keys from the same mount path.	2026-02-23 20:29:32 +00:00
Viktor Barzin	6ca4a1a081	Sync realestate-crawler dashboard with navigation & usage metrics panels	2026-02-23 20:28:55 +00:00
Viktor Barzin	b45688646d	Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP] - Switch from custom clone override to woodpeckerci/plugin-git built-in clone (handles auth automatically via netrc from GitHub OAuth token) - Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense (fixes intermittent DNS timeouts causing clone failures) - Fix missing comma after heredoc in audit-policy.tf (syntax error)	2026-02-23 00:08:42 +00:00
Viktor Barzin	d870a63130	[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug. Change cluster-healthcheck CronJob from every 30min to every 8h. Replace fragile sed-based audit config in apiserver manifest with idempotent Python script that deduplicates by name/mountPath, preventing the duplicate volume entries that crashed the API server.	2026-02-22 23:18:30 +00:00
Viktor Barzin	860077a126	[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces. Add resource-governance/custom-quota=true label to both namespaces so Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.	2026-02-22 23:14:53 +00:00
Viktor Barzin	cf67e02135	[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs - Add gpu=true label to Terraform (nvidia null_resource alongside taint) - Improve API server OIDC config to detect value changes, not just flag presence - Add policy_hash trigger to audit-policy so rule changes auto-reapply - Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook - Document full node rebuild procedure in CLAUDE.md - Save Talos Linux migration evaluation for future reference	2026-02-22 22:59:38 +00:00
Viktor Barzin	a92fbb8ca5	[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)	2026-02-22 19:49:32 +00:00
Viktor Barzin	a9f96e2e53	[ci skip] Increase authentik ResourceQuota limits. Authentik is a critical auth service that was at 83% CPU/memory quota utilization. Double all limits to prevent throttling.	2026-02-22 17:28:41 +00:00
Viktor Barzin	b692eb0c34	[ci skip] Flatten module wrappers into stack roots. Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure. Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00
Viktor Barzin	e225e81ebf	[ci skip] Move Terraform modules into stack directories. Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00

44 commits
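
Commit d870a63130 mentions an idempotent Python script that deduplicates apiserver manifest entries by name/mountPath. The script itself is not shown in this log, so the following is only a minimal sketch of that idea under stated assumptions: the manifest is assumed to be already parsed into plain dicts, and the function names (`merge_unique`, `ensure_audit_mounts`) and the example volume name and path are hypothetical, not taken from the repository.

```python
def merge_unique(existing, wanted, key):
    """Return existing entries followed by wanted ones, keeping the first entry per key."""
    seen = set()
    merged = []
    for entry in list(existing) + list(wanted):
        value = entry[key]
        if value in seen:
            continue  # drop duplicates, including ones already present in `existing`
        seen.add(value)
        merged.append(entry)
    return merged


def ensure_audit_mounts(pod_spec, container, volume, mount):
    """Add the volume and its mount once; repeated calls leave the spec unchanged."""
    # Volumes are deduplicated by "name", volumeMounts by "mountPath".
    pod_spec["volumes"] = merge_unique(pod_spec.get("volumes", []), [volume], "name")
    container["volumeMounts"] = merge_unique(
        container.get("volumeMounts", []), [mount], "mountPath"
    )


# Hypothetical example values, chosen only for illustration.
EXAMPLE_VOLUME = {"name": "audit-policy", "hostPath": {"path": "/etc/kubernetes/audit-policy.yaml"}}
EXAMPLE_MOUNT = {"name": "audit-policy", "mountPath": "/etc/kubernetes/audit-policy.yaml"}
```

Because the merge keys on name and mountPath instead of appending text, running it any number of times yields the same lists, which is the property a sed-based append lacks.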