Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.
Changes (all in the monitoring module):
* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
once, then only on a membership change or resolve); critical 1h -> 6h
(a slow nag, not an hourly drip). send_resolved stays on. The bulk of
the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
continuously for ~24h, re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
08:00 Europe/London: the full current board grouped by severity + what
resolved in the last 24h. This is the standing-state safety net for the
alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
(no pip/apk at runtime -> none of the per-run disk-write footprint that
disabled status-page-pusher). Reuses the existing Alertmanager Slack
webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
PodImagePullBackOff uninhibited because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
for the same leg — two alerts described one condition and were the #1
noise source (~3,400 alert-minutes over 24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
completed CronJob pods that linger in EndpointSlices as NotReady
addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
transient Pushgateway/scrape blip no longer fires-and-resolves.
* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
annotation, so notification volume was unmeasurable — now we can verify
this change worked (alertmanager_notifications_total et al.).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>