infra/stacks/platform/modules/monitoring
Viktor Barzin e6c0c39ae7
reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts
- NodeDown now suppresses workload and service alerts (PodCrashLooping,
  DeploymentReplicasMismatch, StatefulSetReplicasMismatch, etc.)
- NFSServerUnresponsive suppresses pod-level alerts
- Increased for durations on transient alerts (e.g. 15m→30m for replica mismatches)
- NodeDown for: 1m→3m to avoid flapping
- Removed all 3 Loki log-based alerts (duplicated Prometheus alerts)
- Downgraded HeadscaleDown critical→warning, mail server page→warning
2026-03-08 21:13:16 +00:00
..
dashboards [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability 2026-02-23 22:05:28 +00:00
caretta.tf resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
grafana.tf [ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) 2026-03-02 20:23:36 +00:00
grafana_chart_values.yaml resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts 2026-03-08 21:13:16 +00:00
loki.yaml [ci skip] migrate Redis, Prometheus, Loki storage to iSCSI 2026-03-06 20:50:55 +00:00
main.tf resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts 2026-03-08 21:13:16 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
snmp_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00