Fix NFSServerUnresponsive false positives

Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate(): works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
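
To see how the pieces of the new expression fit together, each stage can be run on its own in the Prometheus expression browser. This is only a sketch for inspection: the queries are the expression from this commit split into steps, and the comments describe general PromQL behavior rather than anything measured on this cluster.

```promql
# Stage 1: per-node number of NFS counter value changes over the last 15m.
# changes() yields 0 (not an empty result) for a series with a single sample,
# whereas rate() needs at least two samples in the window.
sum by (instance) (changes(node_nfs_requests_total[15m]))

# Stage 2: keep only the nodes that saw at least one change.
sum by (instance) (changes(node_nfs_requests_total[15m])) > 0

# Stage 3: number of active nodes. count() over an empty vector returns
# nothing, so "or on() vector(0)" supplies an explicit 0 in that case.
count by () (
  sum by (instance) (changes(node_nfs_requests_total[15m])) > 0
) or on() vector(0)

# Stage 4: seconds since the Prometheus process started (startup guard input).
time() - process_start_time_seconds{job="prometheus"}
```

Because comparison operators bind more tightly than `and`, the full rule reads as "(active nodes < 2) and (uptime > 900s)" without extra parentheses.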
parent df44601a36
commit 8557d492db
1 changed file with 8 additions and 2 deletions
@@ -354,12 +354,18 @@ serverFiles:
             annotations:
               summary: "PV {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} predicted to fill within 24h"
           - alert: NFSServerUnresponsive
-            expr: sum(rate(node_nfs_requests_total[5m])) == 0
+            expr: |
+              (
+                count by () (
+                  sum by (instance) (changes(node_nfs_requests_total[15m])) > 0
+                ) or on() vector(0)
+              ) < 2
+              and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
             for: 10m
             labels:
               severity: critical
             annotations:
-              summary: "All NFS operations across the cluster are zero for 10m — TrueNAS (10.0.10.15) may be down"
+              summary: "Fewer than 2 nodes have NFS activity for 10m — TrueNAS (10.0.10.15) may be down"
       - name: K8s Health
         rules:
           - alert: PodCrashLooping
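
One way to guard this rule against regressions is a promtool unit test. The sketch below makes several assumptions that are not part of this commit: the rule has been copied out of the Helm values into a standalone rule file named nfs_rules.yml (hypothetical name) under a `groups:` key, the node names node-a and node-b are made up, and the input series carry only an `instance` label. promtool starts its evaluation clock at Unix time 0, so a `process_start_time_seconds` value of 0 makes the startup guard pass once the evaluation time exceeds 900s.

```yaml
# nfs_rules_test.yml (hypothetical file name)
# Run with: promtool test rules nfs_rules_test.yml
rule_files:
  - nfs_rules.yml   # hypothetical standalone copy of the NFSServerUnresponsive rule

evaluation_interval: 1m

tests:
  # Healthy cluster: both nodes keep incrementing their NFS counters,
  # so two nodes count as active and the "< 2" condition is never met.
  - interval: 1m
    input_series:
      - series: 'process_start_time_seconds{job="prometheus"}'
        values: '0+0x60'
      - series: 'node_nfs_requests_total{instance="node-a"}'
        values: '0+5x60'
      - series: 'node_nfs_requests_total{instance="node-b"}'
        values: '0+5x60'
    alert_rule_test:
      - eval_time: 40m
        alertname: NFSServerUnresponsive
        exp_alerts: []

  # Outage: counters are still scraped but stop changing on every node,
  # so zero nodes count as active and the alert fires after the guard and "for".
  - interval: 1m
    input_series:
      - series: 'process_start_time_seconds{job="prometheus"}'
        values: '0+0x60'
      - series: 'node_nfs_requests_total{instance="node-a"}'
        values: '100+0x60'
      - series: 'node_nfs_requests_total{instance="node-b"}'
        values: '100+0x60'
    alert_rule_test:
      - eval_time: 40m
        alertname: NFSServerUnresponsive
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Fewer than 2 nodes have NFS activity for 10m — TrueNAS (10.0.10.15) may be down"
```

The 40m evaluation time is chosen to sit well past the 15m startup guard plus the 10m `for` duration; a tighter bound would work but makes the test more sensitive to evaluation-interval alignment.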
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue