monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001). A recently-deleted pod's lingering rate=0 pinned the denominator at the 0.0001 clamp and inflated the ratio for up to 2h, until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path).

Tighter formulation:

* 30m rate window: a stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor: filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip
* `for: 1h`: fast enough to catch the next incident, long enough that short rate dips don't flap

Verified: post-deploy, `(max/min) > 5 AND min > 0.0005` returns no results given the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).
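To make the difference between the two formulations concrete, here is a minimal Python sketch that evaluates both conditions as plain arithmetic on per-pod rates (reloads per second). It is not how Prometheus evaluates the rule; the function names are made up for illustration. The thresholds come from the commit, and the `busy_with_stale` values and the healthy-replica rates in `incident` are invented examples.

```python
# Sketch only: the old and new alert conditions as plain arithmetic.
# Inputs are per-pod rate() values in reloads/second.

def old_rule(rates):
    """Old condition: max / clamp_min(min, 0.0001) > 5 and max > 0.001."""
    hi, lo = max(rates), min(rates)
    return hi / max(lo, 0.0001) > 5 and hi > 0.001

def new_rule(rates):
    """New condition: max / min > 5 and min > 0.0005."""
    hi, lo = max(rates), min(rates)
    # Check the floor first so a zero minimum never reaches the division.
    return lo > 0.0005 and hi / lo > 5

# Live-cluster spread quoted in the commit message.
healthy = [0.00065, 0.00068, 0.0007]
# Same cluster plus a recently-deleted pod whose series lingers at rate 0.
with_stale = healthy + [0.0]
# Hypothetical: slightly busier cluster plus a stale-zero series.
busy_with_stale = [0.0011, 0.0012, 0.0012, 0.0]
# 0.00076 is the incident rate from the commit; 0.005 is an assumed
# rate for the two healthy replicas, chosen only for illustration.
incident = [0.00076, 0.005, 0.005]

for name, rates in [
    ("healthy", healthy),
    ("with_stale", with_stale),
    ("busy_with_stale", busy_with_stale),
    ("incident", incident),
]:
    print(f"{name:<18} old={old_rule(rates)} new={new_rule(rates)}")
```

In this sketch the `with_stale` case is the near-miss: the old ratio term is already above 5 and only the `max > 0.001` guard holds the alert back. Once the maximum crosses that guard (`busy_with_stale`), the old condition fires while the new one stays quiet, and both still trip on the incident-shaped input.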
parent 700f7ae49c
commit cbb1184ee4
1 changed file with 10 additions and 6 deletions
@@ -1964,17 +1964,21 @@ serverFiles:
 # Remediation: `kubectl -n traefik delete pod <stale-pod>` — the
 # Deployment recreates it with a fresh informer cache. PDB
 # minAvailable=2 keeps the other two replicas serving.
-# Window 2h + for 2h tolerates pod restarts (rate normalizes within
-# an hour); the bug pattern persists indefinitely until restart.
+# 30m rate window lets stale (deleted) pod series age out quickly;
+# `for: 1h` tolerates startup ramp-up but catches sustained drift.
+# The `min(rate) > 0.0005` guard filters out both stale-zero series
+# (recently-deleted pods linger with rate=0) and fresh pods that
+# haven't accumulated samples — bug pattern rate (~0.00076 ≈ 2.75/hr
+# in the 2026-05-12 incident) sits comfortably above the floor.
 - alert: TraefikReplicaConfigStale
   expr: |
     (
-      max(rate(traefik_config_reloads_total[2h]))
+      max(rate(traefik_config_reloads_total[30m]))
       /
-      clamp_min(min(rate(traefik_config_reloads_total[2h])), 0.0001)
+      min(rate(traefik_config_reloads_total[30m]))
     ) > 5
-    and max(rate(traefik_config_reloads_total[2h])) > 0.001
+    and min(rate(traefik_config_reloads_total[30m])) > 0.0005
   for: 2h
   labels:
     severity: warning
   annotations:
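The thresholds in the rule are per-second rates, which are hard to read at a glance. As a small aid, the following Python sketch converts the figures quoted in the commit to reloads per hour. The helper name `per_hour` is made up for this example; the rates are the ones stated in the commit message and rule comment.

```python
# Sketch only: convert per-second reload rates to per-hour figures.
SECONDS_PER_HOUR = 3600

def per_hour(rate_per_second):
    """Return the equivalent number of events per hour."""
    return rate_per_second * SECONDS_PER_HOUR

figures = [
    ("floor", 0.0005),             # min(rate) guard in the new rule
    ("live cluster low", 0.00065), # post-deploy spread, lower end
    ("live cluster high", 0.0007), # post-deploy spread, upper end
    ("incident rate", 0.00076),    # 2026-05-12 incident
]

for label, rate in figures:
    print(f"{label:<18} {rate:.5f}/s = {per_hour(rate):.2f}/hr")
```

The floor works out to 1.8 reloads per hour and the incident rate to roughly 2.74 per hour, which the rule comment rounds to 2.75/hr.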