monitoring: consolidate all Slack alerting to #alerts, abandon #security

The dedicated #security Slack channel was unreachable: the shared incoming webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a Slack app that isn't a member of #security, so any channel override on it returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently failing for that reason.

Per request ("dump the security channel, post in an existing one"), route everything to #alerts instead:

- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>] title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook, .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the "invite the app / flip back to #security" caveats and state the #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified: slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00 · 2026-06-26 13:29:44 +00:00 · fd33d1a447
commit fd33d1a447
parent 196d0db4bd
9 changed files with 32 additions and 28 deletions
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@@ -318,7 +318,7 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
-- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

 #### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -272,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**

 The block below documents the locked design.

-Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
+Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.

 #### Detection sources

@@ -285,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne

 #### Alert rules (16 total)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.

 **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**

@@ -413,8 +413,10 @@ refined by a `service-identity` label in the few multi-Service namespaces
   private key into TF state — **re-apply the stack if the operator rotates that
   Secret**.
 3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
-   **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
-   that webhook's Slack app isn't a member of `#security`; see runbook).
+   **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
+   `#alerts`; the `#security` channel was abandoned 2026-06-25 because that
+   webhook's Slack app isn't a member of it (a `#security` override 404s). See
+   runbook.

 The trail is **attribution-grade, not cryptographic** (reconstructs events in a
 trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model