infra

Author	SHA1	Message	Date
Viktor Barzin	c07870d068	[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster lacks quorum. Changed service selector from mysqlrouter to mysqld with publishNotReadyAddresses=true to bypass the operator's readiness gate. Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.	2026-03-01 12:16:28 +00:00
Viktor Barzin	1101242036	[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator - MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane) - InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage - Group Replication with automatic failover via MySQL Router - Compatibility service: mysql.dbaas:3306 → Router port 6446 - Images from container-registry.oracle.com (not Docker Hub) - Init containers are slow (~20 min) due to mysqlsh plugin loading - Data restore from mysqldump pending after cluster is ONLINE	2026-03-01 03:00:21 +00:00
Viktor Barzin	6139052104	[ci skip] add graceful degradation to CrowdSec bouncer middleware P0: Set updateMaxFailure=-1 (fail-open) Previously defaulted to 0 which blocked ALL traffic on first LAPI failure. Now serves from cached decisions when LAPI is unreachable. P1: Enable Redis cache for CrowdSec decisions Decisions are now shared across all 3 Traefik replicas and survive pod restarts. redisCacheUnreachableBlock=false prevents Redis from becoming another SPOF. P1: Add clientTrustedIPs for internal cluster traffic Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass CrowdSec entirely, preventing internal cascade failures.	2026-03-01 02:36:53 +00:00
Viktor Barzin	c20312e41f	[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub Bitnami MySQL images can't be pulled (not found on Docker Hub, likely moved to a different registry). Reverted MySQL to single instance on NFS as the known-working state. MySQL replication to be revisited once image availability is resolved. PostgreSQL and Redis remain on local disk with replication.	2026-02-28 22:53:33 +00:00
Viktor Barzin	f9a4823ccc	[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility VPA Auto mode modifies Deployment specs at runtime, causing conflicts with Terraform on every apply (drift -> reset -> VPA evict loop). Initial mode only mutates Pod resource requests at creation time via the admission webhook, leaving the Deployment spec unchanged. This means terraform plan shows no drift while pods still get VPA-optimized resources on every restart. - 171 VPAs switched from Auto to Initial - 20 VPAs remain Off (tier-0 critical services) - Goldilocks dashboard continues to show recommendations	2026-02-28 22:43:29 +00:00
Viktor Barzin	f64c979ba5	[ci skip] tune resource limits and requests across 10 services Critical OOM fixes (add/increase limits): - netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi) - speedtest: add 512Mi limit (was at 80.9%) - meshcentral: add 384Mi limit (was at 72.7%) - ytdlp: uncomment resources, set 512Mi limit (was at 74.6%) Over-provisioned (reduce limits): - dashy: 2Gi → 512Mi (was using 135Mi) - redis master: 2Gi → 256Mi (was using 14Mi) - redis replica: 1Gi → 256Mi (was using 12Mi) - resume printer: 2Gi → 512Mi (was using 108Mi) - resume app: 1Gi → 384Mi (was using 125Mi) - openclaw: 4Gi → 1Gi (was using 372Mi) Under-provisioned requests (increase): - authentik server: 256Mi → 512Mi request (actual ~560Mi) - authentik worker: 256Mi → 384Mi request (actual ~400Mi) New explicit resources (previously Kyverno defaults): - forgejo: add 512Mi limit, 64Mi request	2026-02-28 21:59:08 +00:00
Viktor Barzin	ac482b5324	[ci skip] Phase 3: migrate MySQL from NFS to local disk - Add local-path PVC for MySQL data (10Gi, actual usage ~27GB) - Init container seeds data from NFS on first run (cp -a) - NFS volume kept as read-only seed source in init container - MySQL 9.2.0 running on local disk with proper fsync - All dependent services verified running: hackmd, speedtest, onlyoffice, paperless-ngx - mysqldump backup taken before migration - Existing daily mysqldump CronJob unchanged (writes to NFS)	2026-02-28 20:41:07 +00:00
Viktor Barzin	58644e036f	[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA - Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2 - Architecture: replication with Sentinel (1 master + 1 replica + sentinels) - Automatic failover via Sentinel (quorum=2, masterSet=mymaster) - Service 'redis.redis' always points at current master (transparent to clients) - 120 clients connected immediately after deployment - Sentinel confirmed tracking redis-node-0 as master - Local-path PVCs for persistence (2Gi per node) - Auth disabled (matches previous setup) - Hourly RDB backup CronJob to NFS preserved - OCI chart pulled via pull-through cache (10.0.20.10:5000)	2026-02-28 19:59:58 +00:00
Viktor Barzin	3974827ea9	[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue	2026-02-28 19:44:16 +00:00
Viktor Barzin	2b22c90a56	[ci skip] Phase 2: migrate Redis from NFS to local disk - Switch from redis/redis-stack:latest to redis:7-alpine (modules were completely unused — zero module commands in stats) - Move data from NFS (/mnt/main/redis) to local-path PVC (RDB saves: 39s on NFS → <1s on local disk) - Start fresh (old RDB had redis-stack module data incompatible with plain redis; all Redis data is transient — queues and caches rebuild automatically) - Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline - Remove RedisInsight UI ingress (port 8001, only in redis-stack) - Add redis-backup to NFS exports - 110 clients reconnected immediately after switchover - Memory savings: ~100MB from dropping unused modules	2026-02-28 19:44:08 +00:00
Viktor Barzin	c229f6ccab	[ci skip] set network observability dashboard auto-refresh to 1h	2026-02-28 19:32:49 +00:00
Viktor Barzin	11eb256511	[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors	2026-02-28 19:25:52 +00:00
Viktor Barzin	882350250a	[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG Phase 1 complete — PostgreSQL fully migrated off NFS: dbaas module changes: - Replace old kubernetes_deployment.postgres with null_resource.pg_cluster (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues) - Update postgresql Service selector: app=postgresql → cnpg primary - Update backup CronJob: use postgres user + read password from CNPG secret (pg-cluster-superuser) instead of hardcoded root password - Add kube_config_path variable for kubectl in null_resource - Old deployment deleted from cluster (was scaled to 0) CNPG cluster status: - 2 instances: primary (k8s-node4), replica (k8s-node2) - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) - 20Gi local-path storage per instance - All 13 dependent services verified running - Backup CronJob verified working with new endpoint	2026-02-28 19:23:36 +00:00
Viktor Barzin	79775fa2cc	[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map	2026-02-28 19:14:20 +00:00
Viktor Barzin	5e177a8889	[ci skip] combine caretta and goflow2 into unified network observability dashboard	2026-02-28 19:04:53 +00:00
Viktor Barzin	c5c3b092e5	[ci skip] fix caretta helm values and goflow2 transport args	2026-02-28 18:51:02 +00:00
Viktor Barzin	87fc11121d	fix: use plain string for cache_from/cache_to and fix caretta helm_release - cache_from/cache_to must be plain strings, not YAML lists — the plugin-docker-buildx treats them as single string values and the Woodpecker settings layer was splitting comma-separated list items into separate --cache-from flags (type=registry and ref=... separately) - caretta.tf: replace deprecated set{} blocks with values=[yamlencode()] to fix Terraform plan error with newer Helm provider	2026-02-28 18:47:20 +00:00
Viktor Barzin	e8997ec430	[ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module	2026-02-28 18:30:20 +00:00
Viktor Barzin	bf3404bf6b	[ci skip] add goflow2 netflow collector to monitoring module	2026-02-28 18:29:07 +00:00
Viktor Barzin	9d52acd286	[ci skip] add caretta eBPF pod topology to monitoring module	2026-02-28 18:28:09 +00:00
Viktor Barzin	3633c195cf	[ci skip] install CloudNativePG operator as platform module - CNPG v0.27.1 operator in cnpg-system namespace - CRDs installed: clusters, backups, poolers, databases, etc. - local-path StorageClass already exists (from cloud-init template) - Prerequisite for PostgreSQL migration off NFS	2026-02-28 17:22:53 +00:00
Viktor Barzin	eb32190461	[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules	2026-02-28 17:03:33 +00:00
Viktor Barzin	0274cc0722	[ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication Secondary instance on a separate node replicates all zones from primary via zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at least 1 DNS pod survives voluntary disruptions. Setup job automates zone transfer enablement and secondary zone creation via Technitium REST API.	2026-02-28 14:14:20 +00:00
Viktor Barzin	69c4c0c76e	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	250f805c32	[ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing Add Vertical Pod Autoscaler (recommender, updater, admission-controller) and Goldilocks dashboard to monitor resource recommendations across all namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.	2026-02-25 21:54:01 +00:00
Viktor Barzin	7bc975aa16	[ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies - Scale admission controller to 2 replicas with topology spread across nodes - Rewrite inject-priority-class-from-tier: use namespaceSelector instead of API call per pod admission (eliminates Kyverno→API server round-trip) - Rewrite sync-tier-label-from-namespace: same namespaceSelector approach - Extract governance_tiers local to DRY up tier definitions	2026-02-24 23:09:56 +00:00
Viktor Barzin	e7e4faa57a	[ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart Admission controller was restarting every ~5min due to API server timeouts causing leader election loss. failurePolicy:Fail meant the webhook blocked all pod creation cluster-wide when Kyverno was unavailable.	2026-02-24 23:00:45 +00:00
Viktor Barzin	c35bef2fd8	[ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert - Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments - Add node_selector gpu=true to ollama deployment - Pass nfs_server variable through to actualbudget factory modules - Fix AuthentikDown alert to match actual deployment name (goauthentik-server)	2026-02-24 22:55:58 +00:00
Viktor Barzin	4fab38da1f	[ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate per endpoint. Fix bar gauge position to sit next to timeseries. Add sort transformation to "Top Offenders (Avg Duration)" panel.	2026-02-24 22:31:41 +00:00
Viktor Barzin	0a1d53b6dd	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs Prevents Terraform from reverting the Kyverno inject-ndots mutation on every apply. 23 pod specs across 19 platform module files.	2026-02-23 22:43:05 +00:00
Viktor Barzin	a0df23f565	[ci skip] monitoring: increase resource quota limits Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide headroom. Previous values were at 87% and 92% utilization.	2026-02-23 22:42:30 +00:00
Viktor Barzin	83cc053742	[ci skip] fix redis OOMKilled: increase memory limits to 2Gi Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was insufficient for 650MB RDB dataset plus redis-stack modules.	2026-02-23 22:37:56 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	1b4737c90c	Reorder realestate-crawler Grafana dashboard sections Move API Performance and Per-Endpoint Latency to the top. Move Scraping Overview, Scraping Activity, and Throttling & Errors to the bottom. Keeps the most operationally relevant panels visible first.	2026-02-23 22:03:27 +00:00
Viktor Barzin	5fdd9d7f04	Sync realestate-crawler Grafana dashboard with per-endpoint latency panels	2026-02-23 21:31:01 +00:00
Viktor Barzin	15157b50a2	[ci skip] mailserver: fix Rspamd DKIM signing key path Mount DKIM private key at Rspamd-expected path (/tmp/docker-mailserver/rspamd/dkim/viktorbarzin.me/mail.private) and add dkim_signing.conf override for domain/selector config. Rspamd does not auto-detect keys from the OpenDKIM path.	2026-02-23 21:01:29 +00:00
Viktor Barzin	c8e9c41afc	docs: map existing codebase	2026-02-23 20:54:27 +00:00
Viktor Barzin	275eb5aec8	[ci skip] mailserver: tighten DMARC policy to quarantine Move DMARC enforcement from p=none (monitoring only) to p=quarantine so spoofed emails from viktorbarzin.me are quarantined by recipients.	2026-02-23 20:30:30 +00:00
Viktor Barzin	00e1682ec8	[ci skip] mailserver: add Postfix rate limiting Add connection and message rate limits to protect against brute-force attacks on SMTP/IMAP ports. 10 connections and 30 messages per minute per client IP.	2026-02-23 20:29:45 +00:00
Viktor Barzin	ed6d505433	[ci skip] roundcubemail: pin to 1.6-apache, disable debug logging Pin Roundcubemail to stable 1.6-apache tag instead of :latest to prevent unexpected breakage. Disable SMTP debug and reduce debug level from 6 to 1 for production use.	2026-02-23 20:29:39 +00:00
Viktor Barzin	b0aaa7b813	[ci skip] monitoring: enable mailserver-down Prometheus alert Uncomment the mailserver availability alert so we get paged if the mail server pod has no available replicas for 5 minutes.	2026-02-23 20:29:33 +00:00
Viktor Barzin	491f9f4d49	[ci skip] mailserver: enable Rspamd, disable OpenDKIM Enable Rspamd for spam filtering and DKIM signing, replacing OpenDKIM. Rspamd reads existing DKIM keys from the same mount path.	2026-02-23 20:29:32 +00:00
Viktor Barzin	65ca327ed0	Sync realestate-crawler dashboard with navigation & usage metrics panels	2026-02-23 20:28:55 +00:00
Viktor Barzin	ebecaaee5c	Woodpecker CI: use built-in clone, fix CoreDNS DNS resolution [CI SKIP] - Switch from custom clone override to woodpeckerci/plugin-git built-in clone (handles auth automatically via netrc from GitHub OAuth token) - Add 8.8.8.8 and 1.1.1.1 as CoreDNS upstream resolvers alongside pfSense (fixes intermittent DNS timeouts causing clone failures) - Fix missing comma after heredoc in audit-policy.tf (syntax error)	2026-02-23 00:08:42 +00:00
Viktor Barzin	ddb293b2b7	[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug Change cluster-healthcheck CronJob from every 30min to every 8h. Replace fragile sed-based audit config in apiserver manifest with idempotent Python script that deduplicates by name/mountPath, preventing the duplicate volume entries that crashed the API server.	2026-02-22 23:18:30 +00:00
Viktor Barzin	27dc486a4d	[ci skip] Remove ResourceQuota limits from nvidia and realestate-crawler namespaces Add resource-governance/custom-quota=true label to both namespaces so Kyverno skips auto-generating ResourceQuotas that were causing CPU pressure.	2026-02-22 23:14:53 +00:00
Viktor Barzin	cc7f119578	[ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs - Add gpu=true label to Terraform (nvidia null_resource alongside taint) - Improve API server OIDC config to detect value changes, not just flag presence - Add policy_hash trigger to audit-policy so rule changes auto-reapply - Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook - Document full node rebuild procedure in CLAUDE.md - Save Talos Linux migration evaluation for future reference	2026-02-22 22:59:38 +00:00
Viktor Barzin	fd9b06266d	[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)	2026-02-22 19:49:32 +00:00
Viktor Barzin	5501b5cfbf	[ci skip] Increase authentik ResourceQuota limits Authentik is a critical auth service that was at 83% CPU/memory quota utilization. Double all limits to prevent throttling.	2026-02-22 17:28:41 +00:00
Viktor Barzin	c7c7047f1c	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00

51 commits