From cbb1184ee450020164d38ec0eb76984d8b601d08 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 12 May 2026 08:37:17 +0000
Subject: [PATCH] =?UTF-8?q?monitoring:=20TraefikReplicaConfigStale=20?=
 =?UTF-8?q?=E2=80=94=20drop=20false-positive=20on=20stale=20series?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made
a recently-deleted pod's lingering rate=0 drive the ratio toward infinity
for up to 2h until the stale series aged out of the rate window. With
for: 2h, this was a near-miss for spurious firing in the immediate
aftermath of restarting the bad replica (our remediation path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to
0 results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
---
 .../monitoring/prometheus_chart_values.tpl       | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index ef4ac851..fb39a942 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1964,17 +1964,21 @@ serverFiles:
           # Remediation: `kubectl -n traefik delete pod ` — the
           # Deployment recreates it with a fresh informer cache. PDB
           # minAvailable=2 keeps the other two replicas serving.
-          # Window 2h + for 2h tolerates pod restarts (rate normalizes within
-          # an hour); the bug pattern persists indefinitely until restart.
+          # 30m rate window lets stale (deleted) pod series age out quickly;
+          # `for: 1h` tolerates startup ramp-up but catches sustained drift.
+          # The `min(rate) > 0.0005` guard filters out both stale-zero series
+          # (recently-deleted pods linger with rate=0) and fresh pods that
+          # haven't accumulated samples — bug pattern rate (~0.00076 ≈ 2.75/hr
+          # in the 2026-05-12 incident) sits comfortably above the floor.
           - alert: TraefikReplicaConfigStale
             expr: |
               (
-                max(rate(traefik_config_reloads_total[2h]))
+                max(rate(traefik_config_reloads_total[30m]))
                 /
-                clamp_min(min(rate(traefik_config_reloads_total[2h])), 0.0001)
+                min(rate(traefik_config_reloads_total[30m]))
               ) > 5
-              and max(rate(traefik_config_reloads_total[2h])) > 0.001
-            for: 2h
+              and min(rate(traefik_config_reloads_total[30m])) > 0.0005
+            for: 1h
             labels:
               severity: warning
             annotations:
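
Review note: the difference between the two expressions can be checked with a small model. The sketch below is plain Python, not PromQL. It assumes each series contributes a single per-second rate and ignores the `for:` duration and Prometheus staleness handling. The function names are illustrative. Only the healthy spread (~0.00065 to 0.0007/s) and the ~0.00076 incident rate come from the commit message; every other sample value is hypothetical. Note that with `clamp_min` the old ratio is capped at `max / 0.0001` rather than truly unbounded, which is still about 7 at the healthy rates and so exceeds the threshold of 5.

```python
def old_rule(rates):
    """Old expr: max / clamp_min(min, 0.0001) > 5 and max > 0.001."""
    hi, lo = max(rates), min(rates)
    return hi / max(lo, 0.0001) > 5 and hi > 0.001


def new_rule(rates):
    """New expr: max / min > 5 and min > 0.0005."""
    hi, lo = max(rates), min(rates)
    # Checking the floor first also avoids dividing by a zero rate.
    return lo > 0.0005 and hi / lo > 5


# Healthy cluster: tight spread reported in the commit message.
healthy = [0.00065, 0.00068, 0.0007]

# Deleted pod lingering at rate=0 next to the healthy replicas.
stale_near_miss = [0.0, 0.00065, 0.00068, 0.0007]

# Hypothetical: same stale series, but one replica reloads slightly faster.
stale_busy = [0.0, 0.0007, 0.0011]

# Hypothetical incident shape: slowest replica at ~0.00076/s, others >5x higher.
incident = [0.00076, 0.0045, 0.0046]

for name, rates in [
    ("healthy", healthy),
    ("stale_near_miss", stale_near_miss),
    ("stale_busy", stale_busy),
    ("incident", incident),
]:
    print(f"{name:16s} old={old_rule(rates)!s:5s} new={new_rule(rates)}")
```

In this model the stale series alone pushes the old ratio over 5, and only the `max > 0.001` guard keeps the old rule quiet; a modest rise in any replica's rate (the `stale_busy` case) would have fired it. The new rule rejects any set whose minimum is at or below 0.0005, so both stale cases stay quiet while the incident-shaped case still trips.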