monitoring: reduce Slack alert noise (alert-on-change + daily digest)

Reviewed the last 24h of Slack alerts after the midday node-pressure blip: the volume came far less from the outage than from (a) alerts re-pinging every few hours while nothing changed and (b) a pod cascade that fired uninhibited. This hardens the alerting *system* so recurrences are quiet, rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify once, then only on a membership change or resolve); critical 1h -> 6h (a slow nag, not an hourly drip). send_resolved stays on. The bulk of the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired continuously for ~24h, re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at 08:00 Europe/London: the full current board grouped by severity + what resolved in the last 24h. This is the standing-state safety net for the alert-on-change model. Stock python:3.12-alpine, pure-stdlib script (no pip/apk at runtime -> none of the per-run disk-write footprint that disabled status-page-pusher). Reuses the existing Alertmanager Slack webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff, PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...). The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14 PodImagePullBackOff uninhibited because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst for the same leg: two alerts described one condition and were the #1 noise source (~3,400 alert-minutes over 24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so completed CronJob pods that linger in EndpointSlices as NotReady addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single transient Pushgateway/scrape blip no longer fires-and-resolves.
* Added an Alertmanager scrape target: it carried no prometheus.io/scrape annotation, so notification volume was unmeasurable. Now we can verify this change worked (alertmanager_notifications_total et al.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
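
The daily digest described above can be pictured with a short stdlib-only sketch. This is not the contents of `alert_digest.py`, which this page does not show. The function names, the message layout, and the `severity` ordering are assumptions made for illustration, and the resolved-in-24h section (which the real job reads from Prometheus) is left out. It relies only on Alertmanager's v2 `/api/v2/alerts` endpoint and on a Slack incoming webhook accepting a JSON `text` field.

```python
import json
import urllib.request
from collections import defaultdict

# Assumed severity label values, in the order the board should list them.
SEVERITY_ORDER = ("critical", "warning", "info")


def fetch_alerts(base_url):
    """Fetch currently active alerts from the Alertmanager v2 API (network call)."""
    url = f"{base_url}/api/v2/alerts?active=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def build_digest(alerts):
    """Group alerts by their severity label and render a plain-text board."""
    by_severity = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        by_severity[severity].append(labels.get("alertname", "?"))

    lines = [f"Daily alert digest: {len(alerts)} firing"]
    known = [s for s in SEVERITY_ORDER if s in by_severity]
    other = sorted(s for s in by_severity if s not in SEVERITY_ORDER)
    for severity in known + other:
        names = sorted(by_severity[severity])
        lines.append(f"{severity} ({len(names)}): {', '.join(names)}")
    return "\n".join(lines)


def post_to_slack(webhook_url, text):
    """POST the digest text to a Slack incoming webhook (network call)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping `build_digest` free of network calls means the grouping logic can be checked against a hand-written alert list without a running Alertmanager.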
2026-06-12 20:33:49 +00:00 · 97dcf49b8e
commit 97dcf49b8e
parent 87a8a393fe
4 changed files with 434 additions and 12 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -165,8 +165,11 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |

 ## Monitoring & Alerting
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
- Exclude completed CronJob pods from "pod not ready" alerts.
+- **Alert-on-change routing** (alert-noise-reduction 2026-06-12, `route` block in `prometheus_chart_values.tpl`): warning/info notify ONCE then stay quiet while firing (`repeat_interval: 8760h` ≈ off); criticals re-ping every 6h (was 1h); `send_resolved` on. Standing state is reviewed via the daily digest, not re-pings.
+- **Daily alert digest**: CronJob `alert-digest` (monitoring ns, `alert_digest.tf` + `alert_digest.py`) posts the full current board grouped by severity + resolved-in-24h to `#alerts` at 08:00 Europe/London. Stock `python:3.12-alpine`, pure-stdlib (no pip/apk at runtime — avoids the status-page-pusher disk anti-pattern, id=559); reads Alertmanager v2 + Prometheus; reuses the Alertmanager Slack webhook via the `alert-digest` Secret. Safety net for alert-on-change.
+- **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown).
+- **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires.
+- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
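
The difference between the node-level inhibitions (no `equal`, so they apply cluster-wide) and the T3 probe inhibition (scoped to the same `leg`) can be modelled in a few lines. This is a simplified toy model of inhibit-rule matching, not Alertmanager's implementation and not the actual `inhibit_rules` from `prometheus_chart_values.tpl`. The rule dictionaries, the flat label dicts, and the leg values `a`/`b` are hypothetical.

```python
# Toy model: each alert is a flat dict of labels; each rule names source
# alerts, target alerts, and the labels that must match between the two.
RULES = [
    {
        "sources": {"NodeDown", "NodeConditionBad", "NodeDiskPressure"},
        "targets": {
            "PodCrashLooping",
            "PodImagePullBackOff",
            "PodsStuckContainerCreating",
            "ScrapeTargetDown",
        },
        "equal": (),  # no `equal`: any firing source silences targets cluster-wide
    },
    {
        "sources": {"T3ProbeLegDown"},
        "targets": {"T3ProbeDropBurst"},
        "equal": ("leg",),  # only silence the burst alert for the same leg
    },
]


def is_inhibited(target, firing, rules):
    """Return True if any firing source alert inhibits `target` under `rules`."""
    for rule in rules:
        if target["alertname"] not in rule["targets"]:
            continue
        for source in firing:
            if source["alertname"] not in rule["sources"]:
                continue
            # Every label in `equal` must carry the same value on both alerts.
            if all(source.get(k) == target.get(k) for k in rule["equal"]):
                return True
    return False


firing = [
    {"alertname": "NodeDiskPressure", "node": "n1"},
    {"alertname": "T3ProbeLegDown", "leg": "a"},
]
pod_alert = {"alertname": "PodCrashLooping", "pod": "x"}  # carries no node label
print(is_inhibited(pod_alert, firing, RULES))
print(is_inhibited({"alertname": "T3ProbeDropBurst", "leg": "b"}, firing, RULES))
```

Because the pod alert has no `node` label, requiring `equal: [node]` would only make sense if both sides carried it. Leaving `equal` empty trades precision for coverage: a single bad node quiets pod-churn alerts everywhere, the same trade-off the diff notes for `NodeDown`.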