monitoring: reduce Slack alert noise (alert-on-change + daily digest)

Reviewed the last 24h of Slack alerts after the midday node-pressure
blip: the volume came far less from the outage than from (a) alerts
re-pinging every few hours while nothing changed and (b) a pod cascade
that fired uninhibited. This hardens the alerting *system* so
recurrences are quiet, rather than just clearing today's broken
services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h
  (notify once, then only on a membership change or resolve);
  critical 1h -> 6h (a slow nag, not an hourly drip). send_resolved
  stays on. The bulk of the 24h volume was these re-pings
  (RpiSofiaUndervoltage alone fired continuously for ~24h,
  re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts
  at 08:00 Europe/London: the full current board grouped by severity
  + what resolved in the last 24h. This is the standing-state safety
  net for the alert-on-change model. Stock python:3.12-alpine,
  pure-stdlib script (no pip/apk at runtime -> none of the per-run
  disk-write footprint that disabled status-page-pusher). Reuses the
  existing Alertmanager Slack webhook via a namespaced Secret; reads
  Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress
  the downstream pod-churn alerts (PodCrashLooping,
  PodImagePullBackOff, PodsStuckContainerCreating, ScrapeTargetDown,
  *ReplicasMismatch, ...). The midday DiskPressure event on 4 nodes
  fired 25 PodCrashLooping + 14 PodImagePullBackOff uninhibited
  because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits
  T3ProbeDropBurst for the same leg: two alerts described one
  condition and were the #1 noise source (~3,400 alert-minutes over
  24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A
  Ready pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags
  (LVM/Weekly/Offsite/NfsMirror/Vzdump *Failing) and DNS spike
  detectors, so a single transient Pushgateway/scrape blip no longer
  fires-and-resolves.
* Added an Alertmanager scrape target: it carried no
  prometheus.io/scrape annotation, so notification volume was
  unmeasurable; now we can verify this change worked
  (alertmanager_notifications_total et al.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

parent 87a8a393fe
commit 97dcf49b8e
4 changed files with 434 additions and 12 deletions

@@ -165,8 +165,11 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
 
 ## Monitoring & Alerting
-- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
-- Exclude completed CronJob pods from "pod not ready" alerts.
+- **Alert-on-change routing** (alert-noise-reduction 2026-06-12, `route` block in `prometheus_chart_values.tpl`): warning/info notify ONCE then stay quiet while firing (`repeat_interval: 8760h` ≈ off); criticals re-ping every 6h (was 1h); `send_resolved` on. Standing state is reviewed via the daily digest, not re-pings.
+- **Daily alert digest**: CronJob `alert-digest` (monitoring ns, `alert_digest.tf` + `alert_digest.py`) posts the full current board grouped by severity + resolved-in-24h to `#alerts` at 08:00 Europe/London. Stock `python:3.12-alpine`, pure-stdlib (no pip/apk at runtime — avoids the status-page-pusher disk anti-pattern, id=559); reads Alertmanager v2 + Prometheus; reuses the Alertmanager Slack webhook via the `alert-digest` Secret. Safety net for alert-on-change.
+- **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown).
+- **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires.
+- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).

stacks/monitoring/modules/monitoring/alert_digest.py (new file, 230 lines)

@@ -0,0 +1,230 @@
#!/usr/bin/env python3
"""Daily alert digest -> Slack.

Posts a once-a-day "state of the lab" summary to the #alerts Slack channel:
the full current board of firing alerts grouped by severity, plus a one-line
list of what fired-and-cleared in the last 24h.

This is the safety net for the "alert on change" routing model (warnings/info
no longer re-notify while firing; criticals re-ping slowly). The digest is the
recurring reminder of everything still firing, reviewed once each morning.

Pure stdlib (urllib + json) on purpose: the CronJob runs stock python:alpine
with NO pip/apk install at runtime, so it has none of the per-run disk-write
footprint that got status-page-pusher disabled (infra memory id=559).

Sources:
  * Current board   -> Alertmanager v2 (/api/v2/alerts): active, not silenced,
    not inhibited == exactly what a human would otherwise be paged about, with
    the human-readable `summary` annotation. Falls back to Prometheus ALERTS
    if Alertmanager is unreachable.
  * Resolved-in-24h -> Prometheus (alertnames seen firing in the last 24h that
    are not firing now). Best-effort; skipped silently if Prometheus errors.

Env (all have in-cluster defaults):
  ALERTMANAGER_URL   default http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
  PROMETHEUS_URL     default http://prometheus-server.monitoring.svc.cluster.local:80
  SLACK_WEBHOOK_URL  Slack incoming-webhook URL. If empty (or DRY_RUN set),
                     the payload is printed to stdout instead of posted.
  SLACK_CHANNEL      default "#alerts"
  DRY_RUN            if set (any value), print instead of posting.
"""
import datetime
import json
import os
import sys
import urllib.parse
import urllib.request

ALERTMANAGER_URL = os.environ.get(
    "ALERTMANAGER_URL",
    "http://prometheus-alertmanager.monitoring.svc.cluster.local:9093",
).rstrip("/")
PROMETHEUS_URL = os.environ.get(
    "PROMETHEUS_URL",
    "http://prometheus-server.monitoring.svc.cluster.local:80",
).rstrip("/")
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "").strip()
SLACK_CHANNEL = os.environ.get("SLACK_CHANNEL", "#alerts")
DRY_RUN = bool(os.environ.get("DRY_RUN", "")) or not SLACK_WEBHOOK_URL

SEV_ORDER = ["critical", "warning", "info"]
SEV_EMOJI = {"critical": ":red_circle:", "warning": ":large_yellow_circle:", "info": ":large_blue_circle:"}


def _get_json(url, timeout=30):
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def _humanize(seconds):
    seconds = int(max(seconds, 0))
    d, rem = divmod(seconds, 86400)
    h, rem = divmod(rem, 3600)
    m, _ = divmod(rem, 60)
    if d:
        return "%dd%dh" % (d, h)
    if h:
        return "%dh%dm" % (h, m)
    if m:
        return "%dm" % m
    return "<1m"


def _now_utc():
    return datetime.datetime.now(datetime.timezone.utc)


def _age(starts_at):
    if not starts_at:
        return ""
    try:
        ts = starts_at.replace("Z", "+00:00")
        # Nothing is trimmed here: fromisoformat() on the image's Python
        # (3.11+) parses the fractional seconds itself. An unparseable
        # timestamp raises ValueError and yields an empty age.
        started = datetime.datetime.fromisoformat(ts)
        return _humanize((_now_utc() - started).total_seconds())
    except ValueError:
        return ""


def fetch_current_from_alertmanager():
    """Active, non-silenced, non-inhibited alerts with their summaries."""
    q = urllib.parse.urlencode(
        {"active": "true", "silenced": "false", "inhibited": "false", "unprocessed": "false"}
    )
    data = _get_json("%s/api/v2/alerts?%s" % (ALERTMANAGER_URL, q))
    alerts = []
    for a in data:
        if a.get("status", {}).get("state") != "active":
            continue
        labels = a.get("labels", {})
        ann = a.get("annotations", {})
        alerts.append(
            {
                "alertname": labels.get("alertname", "?"),
                "severity": (labels.get("severity") or "info").lower(),
                "lane": labels.get("lane", ""),
                "summary": ann.get("summary", ""),
                "age": _age(a.get("startsAt", "")),
            }
        )
    return alerts


def fetch_current_from_prometheus():
    """Fallback: firing alerts from Prometheus (no summaries, includes inhibited)."""
    url = "%s/api/v1/query?%s" % (
        PROMETHEUS_URL,
        urllib.parse.urlencode({"query": 'ALERTS{alertstate="firing"}'}),
    )
    data = _get_json(url)
    alerts = []
    for s in data.get("data", {}).get("result", []):
        m = s.get("metric", {})
        alerts.append(
            {
                "alertname": m.get("alertname", "?"),
                "severity": (m.get("severity") or "info").lower(),
                "lane": m.get("lane", ""),
                "summary": "",
                "age": "",
            }
        )
    return alerts


def fetch_resolved_last_24h(active_names):
    """Alertnames that fired in the last 24h but are not firing now."""
    try:
        url = "%s/api/v1/query?%s" % (
            PROMETHEUS_URL,
            urllib.parse.urlencode(
                {"query": 'count by (alertname) (max_over_time(ALERTS{alertstate="firing"}[24h]))'}
            ),
        )
        data = _get_json(url)
        seen = {s["metric"].get("alertname", "?") for s in data.get("data", {}).get("result", [])}
        return sorted(seen - active_names)
    except Exception:
        return []


def build_message(alerts, resolved):
    today = _now_utc().strftime("%a %d %b %Y")
    by_sev = {s: [] for s in SEV_ORDER}
    for a in alerts:
        by_sev.setdefault(a["severity"], []).append(a)

    n = len(alerts)
    counts = " ".join("%s %d" % (s, len(by_sev.get(s, []))) for s in SEV_ORDER if by_sev.get(s))

    if n == 0:
        header = ":white_check_mark: *Daily alert digest* — %s\nAll clear: nothing firing." % today
    else:
        header = ":bar_chart: *Daily alert digest* — %s\nFiring now: *%d*%s" % (
            today,
            n,
            (" (" + counts + ")") if counts else "",
        )

    lines = [header]
    for sev in SEV_ORDER:
        items = sorted(by_sev.get(sev, []), key=lambda a: a["alertname"])
        if not items:
            continue
        lines.append("")
        lines.append("%s *%s (%d)*" % (SEV_EMOJI.get(sev, ""), sev.capitalize(), len(items)))
        for a in items:
            lock = ":lock: " if a["lane"] == "security" else ""
            age = (" _(%s)_" % a["age"]) if a["age"] else ""
            summary = (" — %s" % a["summary"]) if a["summary"] else ""
            lines.append("• %s*%s*%s%s" % (lock, a["alertname"], summary, age))

    # Any non-standard severities (defensive — shouldn't happen).
    extra = [s for s in by_sev if s not in SEV_ORDER and by_sev[s]]
    for sev in sorted(extra):
        lines.append("")
        lines.append("*%s (%d)*" % (sev, len(by_sev[sev])))
        for a in sorted(by_sev[sev], key=lambda a: a["alertname"]):
            lines.append("• *%s* — %s" % (a["alertname"], a["summary"]))

    if resolved:
        lines.append("")
        lines.append(":white_check_mark: Resolved in last 24h (%d): %s" % (len(resolved), ", ".join(resolved)))

    return "\n".join(lines)


def post_to_slack(text):
    payload = {"channel": SLACK_CHANNEL, "text": text}
    if DRY_RUN:
        print("[DRY_RUN] would POST to Slack channel %s:\n%s" % (SLACK_CHANNEL, text))
        return
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        if resp.status >= 300:
            raise RuntimeError("Slack POST failed: HTTP %d" % resp.status)


def main():
    try:
        alerts = fetch_current_from_alertmanager()
        source = "alertmanager"
    except Exception as e:
        sys.stderr.write("alertmanager fetch failed (%s); falling back to prometheus\n" % e)
        alerts = fetch_current_from_prometheus()
        source = "prometheus-fallback"

    active_names = {a["alertname"] for a in alerts}
    resolved = fetch_resolved_last_24h(active_names)
    text = build_message(alerts, resolved)
    sys.stderr.write("digest: %d firing (source=%s), %d resolved-24h\n" % (len(alerts), source, len(resolved)))
    post_to_slack(text)


if __name__ == "__main__":
    main()

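The resolved-in-24h line in `alert_digest.py` is computed per alertname, not per alert instance. The sketch below replays that set difference on made-up data; `resolved_names` is a hypothetical helper written for this illustration and the alert names are only sample inputs. It shows one consequence: an alertname counts as resolved only when none of its instances is still on the current board.

```python
def resolved_names(seen_last_24h, active_now):
    """Same idea as `sorted(seen - active_names)` in the script, on plain sets."""
    return sorted(set(seen_last_24h) - set(active_now))


# Hypothetical sample: PodCrashLooping fired for several pods and at least
# one of them is still firing, so the name stays off the "resolved" line.
seen = {"PodCrashLooping", "NodeDiskPressure", "DNSQuerySpike"}
active = {"PodCrashLooping"}

print(", ".join(resolved_names(seen, active)))
```

If per-instance resolution ever matters, the comparison would have to key on the full label set rather than on `alertname` alone.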
stacks/monitoring/modules/monitoring/alert_digest.tf (new file, 127 lines)

@@ -0,0 +1,127 @@
# =============================================================================
# Daily alert digest -> #alerts Slack
# =============================================================================
# Companion to the "alert on change" routing model (alert-noise-reduction
# 2026-06-12). Warning/info alerts no longer re-notify while they stay firing
# (repeat_interval is effectively off) and criticals only re-ping every 6h, so
# Slack reflects *changes*, not steady state. This CronJob is the safety net:
# once a day it posts the full current board of firing alerts grouped by
# severity (+ what cleared in the last 24h) so the standing state is reviewed
# on a schedule, the way the #security lane is skimmed daily.
#
# Implementation: stock python:3.12-alpine running a pure-stdlib script
# (alert_digest.py, mounted from a ConfigMap). NO pip/apk at runtime — once a
# day, zero per-run package-install disk writes (the footprint that got
# status-page-pusher disabled, memory id=559). Queries the in-cluster
# Alertmanager v2 API for the current board (respects silences + inhibitions)
# and Prometheus for the resolved-in-24h line.
# =============================================================================

resource "kubernetes_config_map" "alert_digest_script" {
  metadata {
    name      = "alert-digest-script"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  data = {
    "alert_digest.py" = file("${path.module}/alert_digest.py")
  }
}

# Reuses the same Slack incoming-webhook the Alertmanager receivers post with
# (var.alertmanager_slack_api_url) — no new webhook, just a namespaced Secret so
# the URL isn't a literal in the pod spec.
resource "kubernetes_secret" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }
  type = "Opaque"
  data = {
    SLACK_WEBHOOK_URL = var.alertmanager_slack_api_url
  }
}

resource "kubernetes_cron_job_v1" "alert_digest" {
  metadata {
    name      = "alert-digest"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      app  = "alert-digest"
      tier = var.tier
    }
  }
  spec {
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3
    schedule                      = "0 8 * * *"
    timezone                      = "Europe/London"
    starting_deadline_seconds     = 600
    job_template {
      metadata {}
      spec {
        backoff_limit              = 2
        ttl_seconds_after_finished = 86400
        template {
          metadata {
            labels = {
              app = "alert-digest"
            }
          }
          spec {
            restart_policy = "OnFailure"
            container {
              name              = "alert-digest"
              image             = "docker.io/library/python:3.12-alpine"
              image_pull_policy = "IfNotPresent"
              command           = ["python3", "/scripts/alert_digest.py"]
              env {
                name = "SLACK_WEBHOOK_URL"
                value_from {
                  secret_key_ref {
                    name = kubernetes_secret.alert_digest.metadata[0].name
                    key  = "SLACK_WEBHOOK_URL"
                  }
                }
              }
              env {
                name  = "SLACK_CHANNEL"
                value = "#alerts"
              }
              volume_mount {
                name       = "script"
                mount_path = "/scripts"
                read_only  = true
              }
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "48Mi"
                }
                limits = {
                  memory = "96Mi"
                }
              }
            }
            volume {
              name = "script"
              config_map {
                name = kubernetes_config_map.alert_digest_script.metadata[0].name
              }
            }
            dns_config {
              option {
                name  = "ndots"
                value = "2"
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}

@@ -51,7 +51,12 @@ alertmanager:
 group_by: ["alertname"]
 group_wait: 30s
 group_interval: 5m
-repeat_interval: 4h
+# alert-on-change (2026-06-12): warning/info no longer re-notify while they
+# stay firing — a 1y repeat is effectively "notify once, then only on a
+# membership change or resolve". The daily alert-digest CronJob
+# (alert_digest.tf) carries standing state. Criticals keep a slow re-ping
+# (see the severity=critical child route).
+repeat_interval: 8760h
 receiver: slack-warning
 routes:
   # Wave 1 security lane — matches alerts that set `lane = "security"`
@@ -68,14 +73,19 @@ alertmanager:
 - receiver: slack-critical
   group_wait: 10s
   group_interval: 1m
-  repeat_interval: 1h
+  # alert-on-change tier (2026-06-12): criticals still re-ping, but every
+  # 6h not 1h — a slow nag so a real prod-down issue isn't forgotten,
+  # minus the hourly drip. Warnings/info don't re-ping at all.
+  repeat_interval: 6h
   matchers:
     - severity = critical
   continue: false
 - receiver: slack-info
   group_wait: 5m
   group_interval: 30m
-  repeat_interval: 12h
+  # alert-on-change (2026-06-12): info never re-pings while firing; the
+  # daily digest carries standing state.
+  repeat_interval: 8760h
   matchers:
     - severity = info
   continue: false
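To make the effect of the `repeat_interval` changes concrete, here is a rough back-of-the-envelope sketch. It assumes one alert group that fires continuously for 24h with no membership change, and it ignores `group_interval` alignment and resolve notifications, so the numbers are approximations and not measured values. `notifications_in_window` is a hypothetical helper, not part of the repository.

```python
import math


def notifications_in_window(window_hours, repeat_interval_hours):
    """Count notifications sent at t = 0, r, 2r, ... that fall inside [0, window)."""
    return math.ceil(window_hours / repeat_interval_hours)


WINDOW_H = 24
# (route, old repeat_interval in hours, new repeat_interval in hours),
# taken from the hunks above.
ROUTES = [("warning", 4, 8760), ("critical", 1, 6), ("info", 12, 8760)]

for route, old_h, new_h in ROUTES:
    before = notifications_in_window(WINDOW_H, old_h)
    after = notifications_in_window(WINDOW_H, new_h)
    print("%-8s %2d -> %d notifications per 24h" % (route, before, after))
```

Under these assumptions a warning that never clears drops from six pings a day to a single one, which is why the daily digest is needed to carry standing state.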
@@ -175,6 +185,28 @@ alertmanager:
       - alertname = KubeletImagePullErrors
     target_matchers:
       - alertname =~ "PodsStuckContainerCreating|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods"
+  # A node under DiskPressure/MemoryPressure/PIDPressure evicts pods and
+  # stalls image GC + pulls — the resulting pod-churn alerts are downstream
+  # symptoms. The 2026-06-12 midday cascade fired 25 PodCrashLooping + 14
+  # PodImagePullBackOff uninhibited because only NodeDown (not node pressure)
+  # was a source. NodeConditionBad already matches all three pressures; keep
+  # NodeDiskPressure too for redundancy. No `equal` (matches the NodeDown
+  # rule above): the pod-churn alertnames carry no `node` label, and a
+  # multi-node pressure event is exactly when cluster-wide suppression helps.
+  - source_matchers:
+      - alertname =~ "NodeConditionBad|NodeDiskPressure"
+    target_matchers:
+      - alertname =~ "PodCrashLooping|ContainerOOMKilled|PodImagePullBackOff|PodsStuckContainerCreating|ScrapeTargetDown|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods"
+  # Two t3 path-probe alerts describe one leg-down condition: T3ProbeLegDown
+  # (connected==0, root) and T3ProbeDropBurst (>6 disconnects/15m, symptom).
+  # A down leg both reads disconnected AND disconnects repeatedly — suppress
+  # the burst for that same leg so one alert fires, not two. ~3,400 alert-min
+  # of overlap in the 24h before this (the #1 noise source).
+  - source_matchers:
+      - alertname = T3ProbeLegDown
+    target_matchers:
+      - alertname = T3ProbeDropBurst
+    equal: [leg]
 receivers:
   - name: slack-critical
     slack_configs:
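The difference between the two new inhibit rules is the `equal` list. The following is a simplified model of how an inhibit rule is evaluated, written only to illustrate that difference: it is not Alertmanager's implementation, the target regex is shortened, and the `node` and `leg` values are invented sample labels.

```python
import re


def matches(labels, matchers):
    """matchers: list of (label, op, value); op is '=' or '=~' (fully anchored regex)."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        else:
            raise ValueError("unsupported operator: %s" % op)
        if not ok:
            return False
    return True


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses `target` under `rule`."""
    if not matches(target, rule["target_matchers"]):
        return False
    for source in firing_alerts:
        if not matches(source, rule["source_matchers"]):
            continue
        # Every label in `equal` must carry the same value on both alerts;
        # a missing label is treated like an empty one.
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


NODE_RULE = {  # no `equal`: suppression is cluster-wide
    "source_matchers": [("alertname", "=~", "NodeConditionBad|NodeDiskPressure")],
    "target_matchers": [("alertname", "=~", "PodCrashLooping|PodImagePullBackOff")],
}
T3_RULE = {  # `equal: [leg]`: only the same leg is suppressed
    "source_matchers": [("alertname", "=", "T3ProbeLegDown")],
    "target_matchers": [("alertname", "=", "T3ProbeDropBurst")],
    "equal": ["leg"],
}

firing = [
    {"alertname": "NodeDiskPressure", "node": "node-1"},
    {"alertname": "T3ProbeLegDown", "leg": "leg-a"},
]

print(is_inhibited({"alertname": "PodCrashLooping"}, firing, NODE_RULE))
print(is_inhibited({"alertname": "T3ProbeDropBurst", "leg": "leg-b"}, firing, T3_RULE))
```

The trade-off in the node rule is visible here: because pod alerts have no `node` label to compare, pressure on one node silences pod-churn alerts from every node while it lasts.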
@@ -501,6 +533,16 @@ serverFiles:
   regex: true
   source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
+# Only scrape Ready endpoints. Completed CronJob pods linger in the
+# service's EndpointSlice as NotReady addresses (their labels still
+# match the Service selector) and were scraped as down targets,
+# firing ScrapeTargetDown false positives (tts/tripit/beads, memory
+# id=4895). Dropping NotReady here kills that whole class; a Ready pod
+# whose metrics endpoint is genuinely broken still fires.
+- action: keep
+  regex: true
+  source_labels:
+    - __meta_kubernetes_endpoint_ready
 - action: replace
   regex: (https?)
   source_labels:
@@ -554,6 +596,12 @@ serverFiles:
   regex: true
   source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
+# Only scrape Ready endpoints — see kubernetes-service-endpoints above
+# (drops lingering completed-CronJob NotReady pods → no ScrapeTargetDown FP).
+- action: keep
+  regex: true
+  source_labels:
+    - __meta_kubernetes_endpoint_ready
 - action: replace
   regex: (https?)
   source_labels:
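The `keep` relabel step added to both jobs can be pictured as a filter over discovered targets. This is a minimal sketch of the relabel semantics (source label values joined with `;`, regex fully anchored); the pod names are invented and `keep_target` is a hypothetical helper for illustration only.

```python
import re


def keep_target(meta_labels, source_labels, regex, separator=";"):
    """Model of `action: keep`: the target survives only if the joined
    source label values match the (fully anchored) regex."""
    joined = separator.join(meta_labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


# Hypothetical discovery result: one serving pod and one completed
# CronJob pod that lingers in the EndpointSlice as NotReady.
endpoints = [
    {"pod": "tts-web-abc", "__meta_kubernetes_endpoint_ready": "true"},
    {"pod": "tts-cron-29140-xyz", "__meta_kubernetes_endpoint_ready": "false"},
]

kept = [
    e["pod"]
    for e in endpoints
    if keep_target(e, ["__meta_kubernetes_endpoint_ready"], "true")
]
print(kept)
```

A dropped target never becomes a scrape target at all, so it produces no `up == 0` sample for ScrapeTargetDown to fire on.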
@@ -1634,7 +1682,9 @@ serverFiles:
     summary: "LVM PVC snapshot job has never reported metrics to Pushgateway"
 - alert: LVMSnapshotFailing
   expr: lvm_snapshot_last_status{job="lvm-pvc-snapshot"} != 0
-  for: 0m
+  # for: 5m (was 0m) — ride out a single Pushgateway status flip; a real
+  # failure persists until the next run overwrites it (alert-noise 2026-06-12)
+  for: 5m
   labels:
     severity: critical
   annotations:
@@ -1656,7 +1706,7 @@ serverFiles:
     summary: "Daily backup is {{ $value | humanizeDuration }} old (threshold: 9d)"
 - alert: WeeklyBackupFailing
   expr: daily_backup_last_status{job="daily-backup"} != 0
-  for: 0m
+  for: 5m
   labels:
     severity: warning
   annotations:
@@ -1677,7 +1727,7 @@ serverFiles:
     summary: "Offsite backup sync is {{ $value | humanizeDuration }} old (threshold: 9d)"
 - alert: OffsiteBackupSyncFailing
   expr: offsite_sync_last_status{job="offsite-backup-sync"} != 0
-  for: 0m
+  for: 5m
   labels:
     severity: warning
   annotations:
@@ -1691,7 +1741,7 @@ serverFiles:
     summary: "NFS local mirror to sda is {{ $value | humanizeDuration }} old (threshold: 16d / 2 weekly cycles)"
 - alert: NfsMirrorFailing
   expr: nfs_mirror_last_status{job="nfs-mirror"} != 0
-  for: 0m
+  for: 5m
   labels:
     severity: warning
   annotations:
@@ -1713,7 +1763,7 @@ serverFiles:
     summary: "vzdump VM image backup job has never reported metrics to Pushgateway"
 - alert: VzdumpBackupFailing
   expr: vzdump_last_status{job="vzdump-backup"} != 0
-  for: 0m
+  for: 5m
   labels:
     severity: warning
   annotations:
@@ -2724,7 +2774,9 @@ serverFiles:
 # current value (fresh /tmp each run), so the alert could never fire.
 - alert: DNSQuerySpike
   expr: dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m) and dns_anomaly_total_queries > 1000
-  for: 0m
+  # for: 5m (was 0m) — a single 15-min scrape spike self-clears; a real
+  # spike persists across scrapes (alert-noise 2026-06-12)
+  for: 5m
   labels:
     severity: warning
   annotations:
@@ -2738,7 +2790,7 @@ serverFiles:
     summary: "DNS query volume dropped: {{ $value | printf \"%.0f\" }} queries (<50% of 1h avg) — upstream clients may be failing to reach Technitium"
 - alert: DNSHighErrorRate
   expr: dns_anomaly_server_failure > 100
-  for: 0m
+  for: 5m
   labels:
     severity: warning
   annotations:
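The `for: 0m -> 5m` changes rely on the rule staying pending until its expression has held continuously for the whole `for` duration. The sketch below simulates that state machine on a list of evaluation results. The 60-second evaluation interval is an assumption chosen for illustration (the actual interval is not shown in this diff), and `firing_evaluations` is a hypothetical helper.

```python
def firing_evaluations(condition, interval_s, for_s):
    """Return the evaluation indices at which the alert is firing.

    condition: one boolean per rule evaluation (True = expr returned a sample).
    The alert turns pending on the first True and fires once the condition
    has held continuously for at least `for_s` seconds.
    """
    firing = []
    active_since = None
    for i, holds in enumerate(condition):
        now = i * interval_s
        if not holds:
            active_since = None  # any gap resets the pending timer
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_s:
            firing.append(i)
    return firing


blip = [False, True, False, False, False, False, False, False]
sustained = [True] * 8

print("blip, for=0m:", firing_evaluations(blip, 60, 0))
print("blip, for=5m:", firing_evaluations(blip, 60, 300))
print("sustained, for=5m:", firing_evaluations(sustained, 60, 300))
```

The cost of the change is bounded: a genuine failure is reported at most five minutes later than before, while a one-evaluation flip never reaches Slack.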
@@ -3085,6 +3137,16 @@ serverFiles:
           description: "The must-stay-public URL {{ $labels.instance }} (carve-out `{{ $labels.service }}`) is failing its blackbox probe — most likely it now 302-redirects to Authentik SSO. A path-scoped `auth = \"none\"` carve-out probably regressed (TF revert / deploy / ingress_factory auth default flipping back to \"required\"). Native-client / public / webhook / WebSocket / SPA-XHR traffic to this endpoint is broken for strangers and machines. Check the owning stack's ingress_factory `auth` + `ingress_path`, and curl the URL: `curl -sI '{{ $labels.instance }}'` — a Location to authentik.viktorbarzin.me confirms the regression. Probe config + target list: stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf."
 
 extraScrapeConfigs: |
+  # Alertmanager self-metrics. The bundled Alertmanager Service carries no
+  # prometheus.io/scrape annotation, so kubernetes-service-endpoints never
+  # discovered it — there was NO alertmanager_notifications_total / _alerts /
+  # _notifications_failed_total series until this job (alert-noise-reduction
+  # 2026-06-12), i.e. notification volume was unmeasurable. Static target = the
+  # in-cluster Alertmanager service.
+  - job_name: 'alertmanager'
+    static_configs:
+      - targets:
+          - 'prometheus-alertmanager.monitoring.svc.cluster.local:9093'
   # Authentik walling-off guard. Probes each must-stay-public carve-out URL via
   # blackbox-exporter's `http_no_authentik_redirect` module (no_follow_redirects +
   # fail_if_header_matches on a Location -> Authentik). probe_success == 0 for a
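Once the `alertmanager` job is scraped, notification volume can be checked with an instant query against Prometheus. The sketch below only builds the query URL, in the same stdlib style as `alert_digest.py`. The PromQL expression and the grouping by the `integration` label are an assumed way to read `alertmanager_notifications_total`, not something taken from this commit, and `notification_volume_url` is a hypothetical helper.

```python
import urllib.parse


def notification_volume_url(prometheus_url, window="24h"):
    """Build an instant-query URL for notifications sent over `window`,
    grouped by integration (assumed label, e.g. slack)."""
    query = (
        "sum by (integration) "
        "(increase(alertmanager_notifications_total[%s]))" % window
    )
    return "%s/api/v1/query?%s" % (
        prometheus_url.rstrip("/"),
        urllib.parse.urlencode({"query": query}),
    )


# Default in-cluster Prometheus address, as used by alert_digest.py.
print(notification_volume_url("http://prometheus-server.monitoring.svc.cluster.local:80"))
```

Because the series only exist from the first scrape onward, a before/after comparison needs at least one full window of data collected after this change is deployed.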