monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.
Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
.claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
"invite the app / flip back to #security" caveats and state the
#security abandonment + #alerts consolidation as the current routing.
Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
196d0db4bd
commit
fd33d1a447
9 changed files with 32 additions and 28 deletions
|
|
@ -233,7 +233,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
||||
|
||||
## Security Posture (Wave 1 — locked 2026-05-18)
|
||||
|
||||
|
|
@ -241,10 +241,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
|
|||
|
||||
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
|
||||
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
|
||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
|
||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
|
||||
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
|
||||
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
|
||||
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
|
||||
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
|
||||
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
|
||||
|
||||
## Storage & Backup Architecture
|
||||
|
|
|
|||
|
|
@ -39,7 +39,7 @@
|
|||
## Active Use
|
||||
| Service | Description | Stack |
|
||||
|---------|-------------|-------|
|
||||
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#security`. mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
|
||||
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
|
||||
| mailserver | Email (docker-mailserver) | mailserver |
|
||||
| shadowsocks | Proxy | shadowsocks |
|
||||
| webhook_handler | Webhook processing | webhook_handler |
|
||||
|
|
|
|||
|
|
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
|
|||
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
|
||||
|
||||
**Goldmane / Whisker**:
|
||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#security` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`.
|
||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
|
||||
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
|
||||
|
||||
### Storage
|
||||
|
|
|
|||
|
|
@ -30,6 +30,6 @@ As the Service count grows we want an audit-grade record of which Service talks
|
|||
|
||||
## As-built (2026-06-25)
|
||||
|
||||
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
|
||||
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
|
||||
|
||||
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
|
||||
|
|
|
|||
|
|
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
|
|||
|
||||
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||
|
||||
| # | Source | Event | Severity |
|
||||
|---|---|---|---|
|
||||
|
|
@ -318,7 +318,7 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
|
|||
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
|
||||
|
||||
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
|
||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
|
||||
|
||||
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
|
||||
|
|
|
|||
|
|
@ -272,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
|
|||
|
||||
The block below documents the locked design.
|
||||
|
||||
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||
|
||||
#### Detection sources
|
||||
|
||||
|
|
@ -285,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
|
|||
|
||||
#### Alert rules (16 total)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
|
||||
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
|
||||
|
||||
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
||||
|
||||
|
|
@ -413,8 +413,10 @@ refined by a `service-identity` label in the few multi-Service namespaces
|
|||
private key into TF state — **re-apply the stack if the operator rotates that
|
||||
Secret**.
|
||||
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
|
||||
**`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
|
||||
that webhook's Slack app isn't a member of `#security`; see runbook).
|
||||
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
|
||||
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
|
||||
webhook's Slack app isn't a member of it (a `#security` override 404s). See
|
||||
runbook.
|
||||
|
||||
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
|
||||
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
|
||||
|
|
|
|||
|
|
@ -67,7 +67,7 @@ small no matter how much traffic flows.
|
|||
|
||||
### Slack `#alerts` — daily digest
|
||||
|
||||
> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
|
||||
> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
|
||||
|
||||
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
|
||||
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
|
||||
|
|
@ -282,7 +282,7 @@ ExternalSecret resolved. A dry run / smoke test: run the image with `args:
|
|||
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
|
||||
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
|
||||
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
|
||||
> `aggregate` Deployment; only the `#security` notification is affected.
|
||||
> `aggregate` Deployment; only the `#alerts` digest notification is affected.
|
||||
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
|
||||
|
||||
**No edges at all in the table.** Confirm Goldmane is enabled
|
||||
|
|
|
|||
|
|
@ -456,14 +456,11 @@ resource "kubernetes_cron_job_v1" "digest" {
|
|||
}
|
||||
env {
|
||||
name = "SLACK_CHANNEL"
|
||||
# The shared alertmanager_slack_api_url incoming webhook's Slack
|
||||
# app is NOT a member of #security, so overriding the channel to
|
||||
# it returns HTTP 404 channel_not_found (verified 2026-06-25).
|
||||
# alertmanager's own slack-security receiver shares this webhook
|
||||
# and almost certainly hits the same wall. Post to #alerts (the
|
||||
# webhook's working channel, same as alert-digest) until the app
|
||||
# is invited to #security, then flip this back. See
|
||||
# docs/runbooks/goldmane-flow-trail.md.
|
||||
# Posts to #alerts. The dedicated #security channel was abandoned
|
||||
# 2026-06-25 — the shared alertmanager_slack_api_url webhook's
|
||||
# Slack app isn't a member of it (channel override 404s), so all
|
||||
# Slack (incl. alertmanager's security-lane alerts) consolidated
|
||||
# to #alerts. See docs/runbooks/goldmane-flow-trail.md.
|
||||
value = "#alerts"
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -60,9 +60,10 @@ alertmanager:
|
|||
receiver: slack-warning
|
||||
routes:
|
||||
# Wave 1 security lane — matches alerts that set `lane = "security"`
|
||||
# (K2-K9, V1-V7, S1 from Loki ruler). Routes to dedicated #security
|
||||
# channel regardless of severity. Defined first + continue: false so
|
||||
# security alerts never fall through to the generic #alerts channel.
|
||||
# (K2-K9, V1-V7, S1 from Loki ruler). Posts via the slack-security
|
||||
# receiver (distinct [SECURITY] styling) to #alerts; the dedicated
|
||||
# #security channel was abandoned 2026-06-25 (shared webhook can't reach
|
||||
# it). continue: false so they get the security-styled receiver.
|
||||
- receiver: slack-security
|
||||
group_wait: 10s
|
||||
group_interval: 1m
|
||||
|
|
@ -235,7 +236,10 @@ alertmanager:
|
|||
- name: slack-security
|
||||
slack_configs:
|
||||
- send_resolved: true
|
||||
channel: "#security"
|
||||
# #security was abandoned 2026-06-25 — the shared incoming webhook's
|
||||
# Slack app isn't a member of it (channel override 404s). Security-lane
|
||||
# alerts keep their distinct [SECURITY] styling but post to #alerts.
|
||||
channel: "#alerts"
|
||||
color: '{{ if eq .Status "firing" }}{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
|
||||
fallback: '{{ if eq .Status "firing" }}[SECURITY-{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }}: {{ .GroupLabels.alertname }}'
|
||||
title: '{{ if eq .Status "firing" }}[SECURITY/{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }})'
|
||||
|
|
@ -1504,7 +1508,7 @@ serverFiles:
|
|||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #security"
|
||||
summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #alerts"
|
||||
description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
|
||||
- name: Infrastructure Health
|
||||
rules:
|
||||
|
|
@ -3246,7 +3250,8 @@ serverFiles:
|
|||
# means blackbox's fail_if_header_matches caught a Location -> Authentik:
|
||||
# a path-scoped `auth = "none"` carve-out was clobbered (TF revert, deploy,
|
||||
# ingress_factory default flipping back to auth="required"). lane=security
|
||||
# routes it to the #security Slack receiver (Slack-only, no paging).
|
||||
# routes it to the slack-security receiver, which posts to #alerts
|
||||
# (#security abandoned 2026-06-25; Slack-only, no paging).
|
||||
- name: Authentik Walling Off
|
||||
rules:
|
||||
- alert: AuthentikWallingOffPublicPath
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue