Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile: - rate() returns nothing after Prometheus restarts (needs 2 scrapes) - Individual nodes show zero NFS rate during scrape gaps or low activity - The sum() could hit zero during quiet hours + scrape gaps New expression uses: - changes() instead of rate() — works with a single scrape - Per-instance aggregation: count nodes with any NFS counter change - Threshold < 2 nodes: single-node restarts won't trigger, real NFS outage (all nodes affected) will - Prometheus startup guard: skip first 15m after restart to avoid false positives from empty TSDB - Wider 15m changes() window to smooth out scrape gaps |
||
|---|---|---|
| .. | ||
| modules | ||
| .gitkeep | ||
| .terraform.lock.hcl | ||
| backend.tf | ||
| main.tf | ||
| providers.tf | ||
| redis-25.3.2.tgz | ||
| secrets | ||
| terragrunt.hcl | ||
| tiers.tf | ||