Fix NFSServerUnresponsive false positives

Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate(): returns a value even when the window holds only a single sample
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
  false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
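
The logic of the new expression can be sketched as a small Python model. This is an illustration only, not how Prometheus evaluates the rule: the function names and the dictionary layout (instance name mapped to a list of per-series sample lists from the 15m window) are assumptions made for the example, and the `for: 10m` hold time is not modelled.

```python
from typing import Dict, List

STARTUP_GUARD_SECONDS = 900  # mirrors the "> 900" startup guard in the rule
MIN_ACTIVE_NODES = 2         # mirrors the "< 2" threshold in the rule


def changes(samples: List[float]) -> int:
    """Count value changes between consecutive samples, similar to PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def active_nodes(window: Dict[str, List[List[float]]]) -> int:
    """Count instances where at least one NFS counter series changed in the window."""
    return sum(
        1
        for series_list in window.values()
        if sum(changes(s) for s in series_list) > 0
    )


def condition_holds(
    window: Dict[str, List[List[float]]], now: float, prometheus_start: float
) -> bool:
    """True when the modelled expression would return a result (before 'for: 10m')."""
    past_startup = (now - prometheus_start) > STARTUP_GUARD_SECONDS
    # An empty window counts as 0 active nodes, like the "or on() vector(0)" fallback.
    return active_nodes(window) < MIN_ACTIVE_NODES and past_startup
```

In this model a single quiet or restarted node leaves the count at 2 or more on a multi-node cluster, so the condition stays false, while a window in which no instance shows a counter change yields 0 and the condition holds once the startup guard has passed.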
Viktor Barzin 2026-03-14 11:28:17 +00:00 committed by Viktor Barzin
parent df44601a36
commit 8557d492db

@@ -354,12 +354,18 @@ serverFiles:
 annotations:
 summary: "PV {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} predicted to fill within 24h"
 - alert: NFSServerUnresponsive
-expr: sum(rate(node_nfs_requests_total[5m])) == 0
+expr: |
+(
+count by () (
+sum by (instance) (changes(node_nfs_requests_total[15m])) > 0
+) or on() vector(0)
+) < 2
+and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
 for: 10m
 labels:
 severity: critical
 annotations:
-summary: "All NFS operations across the cluster are zero for 10m — TrueNAS (10.0.10.15) may be down"
+summary: "Fewer than 2 nodes have NFS activity for 10m — TrueNAS (10.0.10.15) may be down"
 - name: K8s Health
 rules:
 - alert: PodCrashLooping