infra

Viktor Barzin 20774f794d dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts	2026-05-22 14:16:44 +00:00

Cluster grew past the 100-conn default: steady-state idle was 90/100, leaving zero headroom for terragrunt applies or transient surges. The ceiling was being discovered by Terraform crashing (pq: "remaining connection slots are reserved for roles with the SUPERUSER attribute"), not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers + ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply, which rolls the cluster (standby first, then primary failover).

monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled cnpg.io/podRole=instance, port name=metrics (9187). Relabels add cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m): heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m): last call before refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections metric=200, alert ratio 0.42 → both alerts inactive.
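
The alert arithmetic in the commit message can be checked with a minimal Python sketch. It is not the actual Prometheus rule: the function and constant names are hypothetical, only the 85% and 95% thresholds and the 84/200 and 90/100 figures come from the commit message, and the `for` durations (10m, 3m) are deliberately ignored.

```python
# Hypothetical names; thresholds are the ones quoted in the commit message.
WARNING_RATIO = 0.85   # PGConnectionsHigh fires above this (after 10m, not modeled here)
CRITICAL_RATIO = 0.95  # PGConnectionsCritical fires above this (after 3m, not modeled here)


def connection_ratio(backends: int, max_connections: int) -> float:
    """Fraction of connection slots in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return backends / max_connections


def classify(ratio: float) -> str:
    """Map a usage ratio to the severity it would trigger, ignoring durations."""
    if ratio > CRITICAL_RATIO:
        return "critical"
    if ratio > WARNING_RATIO:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # After the bump: 84 backends against max_connections=200.
    after = connection_ratio(84, 200)
    print(f"after bump: {after:.2f} -> {classify(after)}")

    # Before the bump: steady-state 90 backends against the 100-conn default.
    before = connection_ratio(90, 100)
    print(f"before bump: {before:.2f} -> {classify(before)}")
```

Under these assumptions, the pre-bump steady state of 90/100 would already have sat above the warning threshold, which is consistent with the commit's point that the ceiling went unnoticed only because nothing was scraping it.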
..
modules/dbaas	dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts	2026-05-22 14:16:44 +00:00
main.tf	[dbaas] Declare forgejo + roundcubemail MySQL users in Terraform	2026-04-17 22:06:23 +00:00
secrets	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]	2026-03-17 18:11:53 +00:00
terragrunt.hcl	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]	2026-03-17 18:11:53 +00:00