infra/stacks/monitoring/modules/monitoring/alert_digest.tf
Viktor Barzin 97dcf49b8e
monitoring: reduce Slack alert noise (alert-on-change + daily digest)
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for the
  alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
  (no pip/apk at runtime -> none of the per-run disk-write footprint that
  disabled status-page-pusher). Reuses the existing Alertmanager Slack
  webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg — two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
  pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
  annotation, so notification volume was unmeasurable — now we can verify
  this change worked (alertmanager_notifications_total et al.).
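
The routing and inhibition changes above can be sketched as Terraform locals that would be rendered into the Alertmanager config with yamlencode(). This is a minimal illustration, not the module's actual configuration: the local names, the receiver name, the matcher strings, and the `equal` labels are all assumptions. Only the intervals and alert names come from the list above.

```hcl
# Hypothetical sketch only. Local names, receiver name, and label names are
# assumptions; intervals and alert names are taken from the commit message.
locals {
  alertmanager_route_sketch = {
    receiver = "slack-alerts" # assumed receiver name
    routes = [
      {
        # Criticals re-notify every 6h instead of hourly.
        matchers        = ["severity=\"critical\""]
        repeat_interval = "6h"
      },
      {
        # Warning/info: one year, i.e. effectively notify once, then only
        # on a group membership change or resolve.
        matchers        = ["severity=~\"warning|info\""]
        repeat_interval = "8760h"
      },
    ]
  }

  alertmanager_inhibit_rules_sketch = [
    {
      # Node-level conditions suppress the downstream pod-churn alerts.
      source_matchers = ["alertname=~\"NodeConditionBad|NodeDiskPressure\""]
      target_matchers = ["alertname=~\"PodCrashLooping|PodImagePullBackOff|PodsStuckContainerCreating|ScrapeTargetDown\""]
      equal           = ["node"] # assumption: the commit does not say which labels must match
    },
    {
      # One alert per probe leg instead of two describing the same condition.
      source_matchers = ["alertname=\"T3ProbeLegDown\""]
      target_matchers = ["alertname=\"T3ProbeDropBurst\""]
      equal           = ["leg"] # assumed label name for "the same leg"
    },
  ]

  # Rendered fragment that could be merged into the Alertmanager config.
  alertmanager_sketch_yaml = yamlencode({
    route         = local.alertmanager_route_sketch
    inhibit_rules = local.alertmanager_inhibit_rules_sketch
  })
}
```

send_resolved is a per-receiver setting (for example under slack_configs), so it is not shown in the route sketch.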

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:35:56 +00:00


# =============================================================================
# Daily alert digest -> #alerts Slack
# =============================================================================
# Companion to the "alert on change" routing model (alert-noise-reduction
# 2026-06-12). Warning/info alerts no longer re-notify while they stay firing
# (repeat_interval is effectively off) and criticals only re-ping every 6h, so
# Slack reflects *changes*, not steady state. This CronJob is the safety net:
# once a day it posts the full current board of firing alerts grouped by
# severity (+ what cleared in the last 24h) so the standing state is reviewed
# on a schedule, the way the #security lane is skimmed daily.
#
# Implementation: stock python:3.12-alpine running a pure-stdlib script
# (alert_digest.py, mounted from a ConfigMap). NO pip/apk at runtime — once a
# day, zero per-run package-install disk writes (the footprint that got
# status-page-pusher disabled, memory id=559). Queries the in-cluster
# Alertmanager v2 API for the current board (respects silences + inhibitions)
# and Prometheus for the resolved-in-24h line.
# =============================================================================
resource "kubernetes_config_map" "alert_digest_script" {
  metadata {
    name      = "alert-digest-script"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "alert_digest.py" = file("${path.module}/alert_digest.py")
  }
}
# Reuses the same Slack incoming-webhook the Alertmanager receivers post with
# (var.alertmanager_slack_api_url) — no new webhook, just a namespaced Secret so
# the URL isn't a literal in the pod spec.
resource "kubernetes_secret" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  type = "Opaque"
  data = {
    SLACK_WEBHOOK_URL = var.alertmanager_slack_api_url
  }
}
resource "kubernetes_cron_job_v1" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app  = "alert-digest"
      tier = var.tier
    }
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3
    schedule                      = "0 8 * * *"
    timezone                      = "Europe/London"
    starting_deadline_seconds     = 600
    job_template {
      metadata {}
      spec {
        backoff_limit              = 2
        ttl_seconds_after_finished = 86400
        template {
          metadata {
            labels = {
              app = "alert-digest"
            }
          }
          spec {
            restart_policy = "OnFailure"
            container {
              name              = "alert-digest"
              image             = "docker.io/library/python:3.12-alpine"
              image_pull_policy = "IfNotPresent"
              command           = ["python3", "/scripts/alert_digest.py"]
              env {
                name = "SLACK_WEBHOOK_URL"
                value_from {
                  secret_key_ref {
                    name = kubernetes_secret.alert_digest.metadata[0].name
                    key  = "SLACK_WEBHOOK_URL"
                  }
                }
              }
              env {
                name  = "SLACK_CHANNEL"
                value = "#alerts"
              }
              volume_mount {
                name       = "script"
                mount_path = "/scripts"
                read_only  = true
              }
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "48Mi"
                }
                limits = {
                  memory = "96Mi"
                }
              }
            }
            volume {
              name = "script"
              config_map {
                name = kubernetes_config_map.alert_digest_script.metadata[0].name
              }
            }
            dns_config {
              option {
                name  = "ndots"
                value = "2"
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
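
The resources above reference two module inputs, var.alertmanager_slack_api_url and var.tier, which are declared elsewhere in the module and not shown in this file. A minimal sketch of what those declarations might look like follows. The types, descriptions, and the sensitive flag are assumptions rather than the module's actual definitions.

```hcl
# Hypothetical declarations for the inputs used in alert_digest.tf.
# Only the variable names come from the file; everything else is assumed.
variable "alertmanager_slack_api_url" {
  description = "Slack incoming-webhook URL shared by the Alertmanager receivers and the daily digest."
  type        = string
  sensitive   = true # assumption: keeps the URL out of plan output
}

variable "tier" {
  description = "Value for the tier label applied to monitoring workloads."
  type        = string
}
```

Marking the webhook URL as sensitive fits the stated intent of keeping it out of the pod spec, though the value would still be stored in Terraform state.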