monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made a recently-deleted pod's lingering rate=0 drive the ratio toward infinity for up to 2h until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path).

Tighter formulation:

* 30m rate window: stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor: filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip
* `for: 1h`: fast enough to catch the next incident, long enough that short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0 results with the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).
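The difference between the two formulations can be sketched in plain Python. This is only a model of the threshold arithmetic, not of Prometheus itself: the function names are hypothetical, each list stands in for the per-replica `rate()` values at one evaluation instant, and every sample rate except the ~0.00065–0.0007/s healthy spread and the ~0.00076 incident rate is an illustrative assumption.

```python
# Minimal model of the two alert expressions' threshold logic.
# Function names and most sample values are hypothetical; only the
# thresholds (0.0001, 0.001, 0.0005, 5) come from the rule itself.

def old_expr_fires(rates):
    """(max / clamp_min(min, 0.0001)) > 5 and max > 0.001"""
    hi, lo = max(rates), min(rates)
    return hi / max(lo, 0.0001) > 5 and hi > 0.001


def new_expr_fires(rates):
    """(max / min) > 5 and min > 0.0005"""
    hi, lo = max(rates), min(rates)
    # In PromQL, max/0 yields +Inf and the `min > 0.0005` guard then
    # drops the result; here the guard is checked first so Python
    # never divides by zero.
    if lo <= 0.0005:
        return False
    return hi / lo > 5


# Healthy cluster: tight spread, neither expression matches.
healthy = [0.00065, 0.00068, 0.0007]

# Assumed post-restart state: the deleted pod's series lingers at
# rate=0 while the live replicas happen to be reloading faster.
stale_zero = [0.0012, 0.0011, 0.0]

# Assumed true positive: one replica at the incident's ~0.00076/s,
# the other two (hypothetically) more than 5x faster.
drift = [0.0040, 0.0039, 0.00076]

for name, rates in [("healthy", healthy),
                    ("stale_zero", stale_zero),
                    ("drift", drift)]:
    print(name, old_expr_fires(rates), new_expr_fires(rates))
```

Under these assumed inputs, the stale-zero case satisfies the old expression but is filtered by the new floor, while the drift case still matches the new expression.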
parent 700f7ae49c
commit cbb1184ee4
1 changed files with 10 additions and 6 deletions
@@ -1964,17 +1964,21 @@ serverFiles:
 # Remediation: `kubectl -n traefik delete pod <stale-pod>` — the
 # Deployment recreates it with a fresh informer cache. PDB
 # minAvailable=2 keeps the other two replicas serving.
-# Window 2h + for 2h tolerates pod restarts (rate normalizes within
-# an hour); the bug pattern persists indefinitely until restart.
+# 30m rate window lets stale (deleted) pod series age out quickly;
+# `for: 1h` tolerates startup ramp-up but catches sustained drift.
+# The `min(rate) > 0.0005` guard filters out both stale-zero series
+# (recently-deleted pods linger with rate=0) and fresh pods that
+# haven't accumulated samples — bug pattern rate (~0.00076 ≈ 2.75/hr
+# in the 2026-05-12 incident) sits comfortably above the floor.
 - alert: TraefikReplicaConfigStale
   expr: |
     (
-      max(rate(traefik_config_reloads_total[2h]))
+      max(rate(traefik_config_reloads_total[30m]))
       /
-      clamp_min(min(rate(traefik_config_reloads_total[2h])), 0.0001)
+      min(rate(traefik_config_reloads_total[30m]))
     ) > 5
-    and max(rate(traefik_config_reloads_total[2h])) > 0.001
-  for: 2h
+    and min(rate(traefik_config_reloads_total[30m])) > 0.0005
+  for: 1h
   labels:
     severity: warning
   annotations: