Three improvements identified in the 7d alert-noise review:
A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
node-level pull error rate, which doesn't catch a single pod stuck
in ImagePullBackOff — council-complaints sat broken for ~10h on
2026-05-12 without paging. The new rule fires per-pod after 30m.
B. Two new inhibit_rules:
- PVFillingUp (95% used, critical) suppresses PVPredictedFull
(linear projection, warning) on the same PVC. Pair was producing
~24h of redundant firing per 7d.
- EmailRoundtripFailing (active probe failure) suppresses
EmailRoundtripStale (derivative >60min no-success). Same outage
windows, ~14.5h of duplicate firing per 7d.
C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
30-minute window paged on the first failed iteration before the
next run could recover. 2h means "still failing across at least
two cron iterations" — much more actionable.
Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).