infra

Viktor Barzin 6377a8b85b Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards		2026-03-14 10:25:31 +00:00

Noise reduction (8 alerts tuned):
- PoisonFountainDown: 2m→5m, critical→warning (fail-open service)
- NodeExporterDown: 2m→5m (flaps during node restarts)
- PowerOutage: add for:1m (debounce transient voltage dips)
- New Tailscale client: add for:5m (debounce headscale reauths)
- NoNodeLoadData: use absent() instead of OR vector(0)==0
- NodeHighCPUUsage: 30%→60% (normal for 70+ services)
- HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading)
- PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB)

Alert regrouping:
- Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health"
- Move New Tailscale client → "Infrastructure Health"

New alerts (14):
- Networking: Cloudflared (2), MetalLB (2), Technitium DNS
- Storage: NFS CSI, iSCSI CSI controllers
- Critical Services: PgBouncer, CNPG operator, MySQL operator
- Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker

Inhibit rules:
- Consolidate 3 NodeDown rules into 1 comprehensive rule
- Extend NFS rule to suppress NFS-dependent services
- Add PowerOutage → downstream suppression

Dashboard loading:
- Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards
- Remove duplicate caretta dashboard ConfigMap from caretta.tf
dashboards	Add node hang instrumentation and scale down chromium services	2026-03-13 22:20:28 +00:00
server-power-cycle	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
alloy.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-14 08:51:45 +00:00
caretta.tf	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards	2026-03-14 10:25:31 +00:00
Dockerfile	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
goflow2.tf	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-14 08:51:45 +00:00
grafana.tf	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards	2026-03-14 10:25:31 +00:00
grafana_chart_values.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-14 08:51:45 +00:00
idrac.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
loki.tf	feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-13 08:41:23 +00:00
loki.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-14 08:51:45 +00:00
main.tf	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-14 08:51:45 +00:00
prometheus.tf	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history	2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards	2026-03-14 10:25:31 +00:00
prometheus_snmp_chart_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
pve_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
snmp_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00