infra/stacks/platform/modules
Viktor Barzin 17065304dc Fix NFSServerUnresponsive false positives
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
  false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
2026-03-14 11:28:17 +00:00
..
authentik Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
cloudflared Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
cnpg Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
crowdsec Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
dbaas Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
headscale Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
infra-maintenance [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup 2026-03-06 19:54:21 +00:00
iscsi-csi Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
k8s-portal Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
kyverno Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
mailserver Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
metallb [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
metrics-server [ci skip] Move Terraform modules into stack directories 2026-02-22 14:38:14 +00:00
monitoring Fix NFSServerUnresponsive false positives 2026-03-14 11:28:17 +00:00
nfs-csi Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
nvidia Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
rbac Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
redis Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
reverse_proxy Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
sealed-secrets Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
technitium Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
traefik Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
uptime-kuma Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
vaultwarden Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
vpa Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
wireguard Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
xray Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00