infra/stacks/platform/modules/monitoring
Viktor Barzin 17065304dc Fix NFSServerUnresponsive false positives
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
  false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
2026-03-14 11:28:17 +00:00
..
dashboards Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
server-power-cycle [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
alloy.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
caretta.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
Dockerfile [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
goflow2.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
grafana.tf Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
grafana_chart_values.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
idrac.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
loki.tf feat(monitoring): Disable Loki centralized logging while preserving configuration 2026-03-13 08:41:23 +00:00
loki.yaml Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
main.tf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
prometheus.tf [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history 2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl Fix NFSServerUnresponsive false positives 2026-03-14 11:28:17 +00:00
prometheus_snmp_chart_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
pve_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
snmp_exporter.tf [ci skip] platform: add ndots=2 dns_config to all deployment pod specs 2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00