monitoring: TraefikReplicaConfigStale — drop false-positive on stale series

The initial formulation used clamp_min(min(rate[2h]), 0.0001), so a
recently-deleted pod's lingering rate=0 series pinned the denominator
at the 0.0001 floor and inflated the ratio for up to 2h, until the
stale series aged out of the rate window. With for: 2h, this was a
near-miss for spurious firing in the immediate aftermath of restarting
the bad replica (our remediation path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy, `(max/min) > 5 and min > 0.0005` returns no
results against the live cluster, whose rates are tightly grouped
(~0.00065–0.0007/s across all three Traefik replicas).
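
The arithmetic behind both formulations can be sketched in Python. This is
an illustration only, not the rule itself: the function names are
hypothetical, the healthy rates are picked from the ~0.00065–0.0007/s range
quoted above, and the healthy-replica rates in the "incident" case are
assumed values (only the ~0.00076 figure comes from this commit).

```python
def old_ratio(rates, clamp=0.0001):
    """Ratio from the original rule: max / clamp_min(min, 0.0001)."""
    return max(rates) / max(min(rates), clamp)


def new_rule_matches(rates, ratio_threshold=5.0, min_floor=0.0005):
    """New rule: (max / min) > 5 and min > 0.0005.

    The floor is checked first, so a zero minimum never reaches the division.
    """
    lo, hi = min(rates), max(rates)
    return lo > min_floor and hi / lo > ratio_threshold


healthy = [0.00065, 0.00068, 0.0007]      # live-cluster spread from this message
with_stale = healthy + [0.0]              # deleted pod's series lingering at rate=0
incident = [0.00076, 0.0045, 0.0046]      # 0.00076 from the incident; others ASSUMED

print(old_ratio(with_stale) > 5)          # stale zero pushes the old ratio past 5
print(new_rule_matches(with_stale))       # floor filters the stale series
print(new_rule_matches(healthy))          # tight spread, ratio close to 1
print(new_rule_matches(incident))         # sustained drift still matches
```

With the stale zero present, the old ratio is roughly 0.0007 / 0.0001, or about
7, which clears the `> 5` threshold; the new floor discards that case before
the ratio is considered.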
Viktor Barzin 2026-05-12 08:37:17 +00:00
parent 700f7ae49c
commit cbb1184ee4

@@ -1964,17 +1964,21 @@ serverFiles:
 # Remediation: `kubectl -n traefik delete pod <stale-pod>` — the
 # Deployment recreates it with a fresh informer cache. PDB
 # minAvailable=2 keeps the other two replicas serving.
-# Window 2h + for 2h tolerates pod restarts (rate normalizes within
-# an hour); the bug pattern persists indefinitely until restart.
+# 30m rate window lets stale (deleted) pod series age out quickly;
+# `for: 1h` tolerates startup ramp-up but catches sustained drift.
+# The `min(rate) > 0.0005` guard filters out both stale-zero series
+# (recently-deleted pods linger with rate=0) and fresh pods that
+# haven't accumulated samples — bug pattern rate (~0.00076 ≈ 2.75/hr
+# in the 2026-05-12 incident) sits comfortably above the floor.
 - alert: TraefikReplicaConfigStale
 expr: |
 (
-max(rate(traefik_config_reloads_total[2h]))
+max(rate(traefik_config_reloads_total[30m]))
 /
-clamp_min(min(rate(traefik_config_reloads_total[2h])), 0.0001)
+min(rate(traefik_config_reloads_total[30m]))
 ) > 5
-and max(rate(traefik_config_reloads_total[2h])) > 0.001
-for: 2h
+and min(rate(traefik_config_reloads_total[30m])) > 0.0005
+for: 1h
 labels:
 severity: warning
 annotations: