infra/stacks/platform/modules
Viktor Barzin 6377a8b85b Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards
Noise reduction (8 alerts tuned):
- PoisonFountainDown: 2m→5m, critical→warning (fail-open service)
- NodeExporterDown: 2m→5m (flaps during node restarts)
- PowerOutage: add for:1m (debounce transient voltage dips)
- New Tailscale client: add for:5m (debounce headscale reauths)
- NoNodeLoadData: use absent() instead of OR vector(0)==0
- NodeHighCPUUsage: 30%→60% (normal for 70+ services)
- HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading)
- PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB)

Alert regrouping:
- Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health"
- Move New Tailscale client → "Infrastructure Health"

New alerts (14):
- Networking: Cloudflared (2), MetalLB (2), Technitium DNS
- Storage: NFS CSI, iSCSI CSI controllers
- Critical Services: PgBouncer, CNPG operator, MySQL operator
- Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker

Inhibit rules:
- Consolidate 3 NodeDown rules into 1 comprehensive rule
- Extend NFS rule to suppress NFS-dependent services
- Add PowerOutage → downstream suppression

Dashboard loading:
- Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards
- Remove duplicate caretta dashboard ConfigMap from caretta.tf
2026-03-14 10:25:31 +00:00
..
authentik Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
cloudflared Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
cnpg Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
crowdsec Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
dbaas Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
headscale Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
infra-maintenance [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup 2026-03-06 19:54:21 +00:00
iscsi-csi Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
k8s-portal Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
kyverno Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
mailserver Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
metallb [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
metrics-server [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
monitoring Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
nfs-csi Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
nvidia Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
rbac Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
redis Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
reverse_proxy Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
sealed-secrets Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
technitium Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
traefik Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
uptime-kuma Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
vaultwarden Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
vpa Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
wireguard Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
xray Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00