infra/stacks/monitoring/modules/monitoring/alert_digest.tf

# =============================================================================
# Daily alert digest -> #alerts Slack
# =============================================================================
# Companion to the "alert on change" routing model (alert-noise-reduction
# 2026-06-12). Warning/info alerts no longer re-notify while they stay firing
# (repeat_interval is effectively off) and criticals only re-ping every 6h, so
# Slack reflects *changes*, not steady state. This CronJob is the safety net:
# once a day it posts the full current board of firing alerts grouped by
# severity (+ what cleared in the last 24h) so the standing state is reviewed
# on a schedule, the way the #security lane is skimmed daily.
#
# Implementation: stock python:3.12-alpine running a pure-stdlib script
# (alert_digest.py, mounted from a ConfigMap). No pip/apk at runtime: the job
# runs once a day with zero per-run package-install disk writes (the footprint
# that got status-page-pusher disabled, memory id=559). Queries the in-cluster
# Alertmanager v2 API for the current board (respects silences + inhibitions)
# and Prometheus for the resolved-in-24h line.
# =============================================================================

resource "kubernetes_config_map" "alert_digest_script" {
  metadata {
    name      = "alert-digest-script"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "alert_digest.py" = file("${path.module}/alert_digest.py")
  }
}

# Reuses the same Slack incoming-webhook the Alertmanager receivers post with
# (var.alertmanager_slack_api_url) — no new webhook, just a namespaced Secret so
# the URL isn't a literal in the pod spec.
resource "kubernetes_secret" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  type = "Opaque"
  data = {
    SLACK_WEBHOOK_URL = var.alertmanager_slack_api_url
  }
}

resource "kubernetes_cron_job_v1" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app  = "alert-digest"
      tier = var.tier
    }
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3
    schedule                      = "0 8 * * *"
    timezone                      = "Europe/London"
    starting_deadline_seconds     = 600
    job_template {
      metadata {}
      spec {
        backoff_limit              = 2
        ttl_seconds_after_finished = 86400
        template {
          metadata {
            labels = {
              app = "alert-digest"
            }
          }
          spec {
            restart_policy = "OnFailure"
            container {
              name              = "alert-digest"
              image             = "docker.io/library/python:3.12-alpine"
              image_pull_policy = "IfNotPresent"
              command           = ["python3", "/scripts/alert_digest.py"]
              env {
                name = "SLACK_WEBHOOK_URL"
                value_from {
                  secret_key_ref {
                    name = kubernetes_secret.alert_digest.metadata[0].name
                    key  = "SLACK_WEBHOOK_URL"
                  }
                }
              }
              env {
                name  = "SLACK_CHANNEL"
                value = "#alerts"
              }
              volume_mount {
                name       = "script"
                mount_path = "/scripts"
                read_only  = true
              }
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "48Mi"
                }
                limits = {
                  memory = "96Mi"
                }
              }
            }
            volume {
              name = "script"
              config_map {
                name = kubernetes_config_map.alert_digest_script.metadata[0].name
              }
            }
            dns_config {
              option {
                name  = "ndots"
                value = "2"
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
monitoring: reduce Slack alert noise (alert-on-change + daily digest)

Reviewed the last 24h of Slack alerts after the midday node-pressure
blip: the volume came far less from the outage than from (a) alerts
re-pinging every few hours while nothing changed and (b) a pod cascade
that fired uninhibited. This hardens the alerting system so recurrences
are quiet, rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for
  the alert-on-change model. Stock python:3.12-alpine, pure-stdlib
  script (no pip/apk at runtime -> none of the per-run disk-write
  footprint that disabled status-page-pusher). Reuses the existing
  Alertmanager Slack webhook via a namespaced Secret; reads
  Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg: two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A
  Ready pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no
  prometheus.io/scrape annotation, so notification volume was
  unmeasurable. Now we can verify this change worked
  (alertmanager_notifications_total et al.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-12 20:33:49 +00:00
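
The CronJob above runs alert_digest.py, which is not shown on this page. The following is a minimal stdlib-only sketch of the shape such a script could take, not the actual file. It reads the Alertmanager v2 board with silenced and inhibited alerts filtered out, groups alert names by severity, and posts one message to the Slack incoming webhook. The function names, the `ALERTMANAGER_URL` value, and the message layout are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

# Hypothetical in-cluster address; the real service name is not in the source.
ALERTMANAGER_URL = "http://alertmanager.monitoring.svc:9093"
# Severities listed first in the digest; any other value is appended after.
SEVERITY_ORDER = ["critical", "warning", "info"]


def fetch_active_alerts(base_url=ALERTMANAGER_URL, timeout=10):
    """Return firing alerts from the v2 API, skipping silenced/inhibited ones."""
    query = urllib.parse.urlencode(
        {"active": "true", "silenced": "false", "inhibited": "false"}
    )
    url = f"{base_url}/api/v2/alerts?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def group_by_severity(alerts):
    """Map severity label -> list of alert names ('unknown' if unlabeled)."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        groups[severity].append(labels.get("alertname", "?"))
    return groups


def format_digest(alerts, resolved_names=()):
    """Build the plain-text digest: a header, one line per severity, resolved."""
    groups = group_by_severity(alerts)
    lines = [f"Daily alert digest: {len(alerts)} firing"]
    ordered = SEVERITY_ORDER + sorted(set(groups) - set(SEVERITY_ORDER))
    for severity in ordered:
        names = sorted(groups.get(severity, []))
        if names:
            lines.append(f"{severity} ({len(names)}): {', '.join(names)}")
    if resolved_names:
        resolved = ", ".join(sorted(resolved_names))
        lines.append(f"resolved in last 24h: {resolved}")
    return "\n".join(lines)


def post_to_slack(webhook_url, text, timeout=10):
    """POST the digest as a JSON body to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

In the pod, a script like this would read `SLACK_WEBHOOK_URL` from the environment set by the CronJob and call `post_to_slack(url, format_digest(fetch_active_alerts()))`. The resolved-in-24h names would come from a separate Prometheus query, which is left out here because the source does not show it.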