Fix NFSServerUnresponsive false positives

Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate(): works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
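
The decision logic described above can be modelled outside Prometheus to sanity-check the threshold and the startup guard. The Python sketch below is not part of this commit: the function names and the input shape (an instance name mapped to the samples of each of its NFS counter series in the 15m window) are assumptions made for illustration. It only approximates the PromQL semantics.

```python
from typing import Dict, List

STARTUP_GUARD_S = 900   # mirrors the "> 900" uptime guard in the rule
MIN_ACTIVE_NODES = 2    # mirrors the "< 2" threshold in the rule


def changes(samples: List[float]) -> int:
    """Number of times the value changes between consecutive samples.

    A window holding a single sample yields 0 rather than no result.
    """
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def alert_condition(window: Dict[str, List[List[float]]],
                    prometheus_uptime_s: float) -> bool:
    """True when the modelled expression would return a result.

    `window` is a hypothetical input shape: instance name -> list of
    series, each series being its samples over the 15m window.
    """
    # sum by (instance) (changes(...)) > 0, then count the survivors.
    # An empty mapping counts as 0, mirroring "or on() vector(0)".
    active_nodes = sum(
        1
        for series_list in window.values()
        if sum(changes(s) for s in series_list) > 0
    )
    return (active_nodes < MIN_ACTIVE_NODES
            and prometheus_uptime_s > STARTUP_GUARD_S)


if __name__ == "__main__":
    idle = [[100.0, 100.0, 100.0]]   # counter never moves
    busy = [[100.0, 104.0, 109.0]]   # counter moves twice

    print(alert_condition({"node1": idle, "node2": idle}, 3600))  # True
    print(alert_condition({"node1": busy, "node2": busy}, 3600))  # False
    print(alert_condition({"node1": idle, "node2": idle}, 300))   # False
```

The sketch covers only the expression itself. In the rule, the condition must additionally hold for the full `for: 10m` period before the alert fires.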
2026-03-14 11:28:17 +00:00
commit 8557d492db
parent df44601a36
1 changed file with 8 additions and 2 deletions
--- a/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/platform/modules/monitoring/prometheus_chart_values.tpl
@@ -354,12 +354,18 @@ serverFiles:
            annotations:
              summary: "PV {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} predicted to fill within 24h"
          - alert: NFSServerUnresponsive
-            expr: sum(rate(node_nfs_requests_total[5m])) == 0
+            expr: |
+              (
+                count by () (
+                  sum by (instance) (changes(node_nfs_requests_total[15m])) > 0
+                ) or on() vector(0)
+              ) < 2
+              and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
            for: 10m
            labels:
              severity: critical
            annotations:
-              summary: "All NFS operations across the cluster are zero for 10m — TrueNAS (10.0.10.15) may be down"
+              summary: "Fewer than 2 nodes have NFS activity for 10m — TrueNAS (10.0.10.15) may be down"
      - name: K8s Health
        rules:
          - alert: PodCrashLooping