infra

Author	SHA1	Message	Date
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars, a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. BEFORE: config.tfvars (manual list) → stacks/cloudflared/ (for_each = list, cloudflare_record, tunnel per-hostname). AFTER: stacks/<svc>/main.tf with module "ingress" { dns_type = "proxied" } → auto-creates cloudflare_record + annotation. ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	bd41bb9230	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 - Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore and stepped migration. Switch to existingSecret, PgBouncer session mode. - Mailserver: migrate email roundtrip probe from Mailgun to Brevo API - Redis: fix HAProxy tcp-check regex (rstring), faster health intervals - Nextcloud: fix Redis fallback to HAProxy service, update dependency - MeshCentral: fix TLSOffload + certUrl init container for first-run - Monitoring: remove authentik from latency alert exclusion - Diun: simplify to webhook notifier, remove git auto-update [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:41:56 +00:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	8a5a53a832	fix alerts and reduce Prometheus disk write rate - linkwarden: add Reloader match annotation to DB secret so pods auto-restart on Vault credential rotation (was causing 100% 5xx) - authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi) to prevent OOM kills - prometheus: drop 113k high-cardinality series to reduce HDD write rate from ~8.8 to ~6.0 MB/s (31% reduction): - drop all traefik/apiserver/etcd histogram bucket metrics - drop goflow2_flow_process_nf_templates_total (9.3k series) - drop container_tasks_state and container_memory_failures_total - rewrite HighServiceLatency alert to use avg latency (_sum/_count) - update cluster_health dashboard to match - raise KubeletRuntimeOperationsLatency threshold from 30s to 60s	2026-03-28 15:42:14 +02:00
Viktor Barzin	ad689076d8	scale down non-critical services to free cluster memory - authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1 - tuya-bridge: 3→1 - realestate-crawler-api: 2→1 - claude-memory: 2→1 - grafana: 2→1 (config only, apply pending) - alertmanager: 2→1 (config only, apply pending) Estimated savings: ~1.2 Gi total	2026-03-22 03:10:12 +02:00
Viktor Barzin	3c804aedf8	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-17 18:11:53 +00:00

7 commits
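
The two-tier state split described in commit e80b2f026f (Tier 0 stacks keep local SOPS-encrypted state, all other stacks use the PostgreSQL backend) can be sketched roughly as below. This is an illustration only, not the repository's terragrunt.hcl or scripts/tg logic: the function name `backend_for`, the returned dictionary shape, and the per-stack `schema_name` are assumptions. The Tier 0 list and the database address come from the commit message, and credentials (fetched from Vault in the real setup) are left out.

```python
# Hypothetical sketch of tier detection; names and return shape are assumptions.

# Tier 0 stack names as listed in the commit message (local state, SOPS-encrypted).
TIER0_STACKS = frozenset(
    {"infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"}
)

# Address taken from the commit message. Credentials are deliberately omitted;
# the commit says they are fetched from Vault at run time.
PG_CONN_STR = "postgres://10.0.20.200:5432/terraform_state"


def backend_for(stack: str) -> dict:
    """Return an illustrative backend description for a stack name."""
    if stack in TIER0_STACKS:
        # Tier 0: needed for bootstrap, so it must not depend on the database.
        return {"type": "local", "tier": 0, "config": {}}
    # Tier 1: every other stack uses the PostgreSQL backend.
    return {
        "type": "pg",
        "tier": 1,
        "config": {
            "conn_str": PG_CONN_STR,
            # Assumption: one schema per stack. The commit does not say how
            # stacks are separated inside the terraform_state database.
            "schema_name": stack,
        },
    }


if __name__ == "__main__":
    for name in ("vault", "authentik"):
        print(name, backend_for(name))
```

Keeping the Tier 0 names in a single set mirrors the commit's "conditional backend (local vs pg) based on tier0 list": any stack not named there falls through to the PostgreSQL backend by default.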