goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts

Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver; flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00 · 6c5288998f
commit 6c5288998f
parent 306cdd4cb3
17 changed files with 626 additions and 11 deletions
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
+
+## As-built (2026-06-25)
+
+Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
+
+Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
 - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

+#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
+
+Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Both alerts are defined in the alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl` and route via the default `slack-warning` receiver → **`#alerts`**.
+
+| Alert | Expr (abridged) | For | Severity |
+|---|---|---|---|
+| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
+| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
+
+The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -364,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).

+#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
+
+The durable **east-west flow trail** (below) is now the preferred data source for
+the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
+faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
+(ADR-0014: "Enforcement gains a better data source"). The unique observed
+namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
+namespaces a source is observed talking to (the `allow` set that seeds its
+NetworkPolicy):
+
+```sql
+SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
+```
+
+The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
+observation caveat) is in
+[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
+**External / public-internet egress is NOT in this table** (empty-namespace flows
+are dropped) — for those destinations keep using the Calico flow-log observation
+(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
+existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
+out of scope** of the trail — it is observe-and-derive only.
+
+### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
+
+The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
+carried no identity). **Service identity = the workload's namespace** (primary),
+refined by a `service-identity` label in the few multi-Service namespaces
+(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
+
+1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
+   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
+   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
+   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
+   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
+   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
+   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
+   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
+   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
+2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
+   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
+   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
+   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
+   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
+   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
+   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
+   private key into TF state — **re-apply the stack if the operator rotates that
+   Secret**.
+3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
+   **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
+   that webhook's Slack app isn't a member of `#security`; see runbook).
+
+The trail is **attribution-grade, not cryptographic** (reconstructs events in a
+trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
+limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
+the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
+(see monitoring.md). Full as-built, query recipes, and troubleshooting:
+[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
+`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
+
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@@ -0,0 +1,301 @@
+# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
+
+> As-built runbook for the Calico Goldmane + Whisker flow plane and the
+> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
+> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
+> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
+> (monitoring), #62 (egress allowlist queries), #63 (these docs).
+
+## What the trail is
+
+Three layers turn raw east-west traffic into a queryable, durable record of
+which Service talks to which. **Service identity = the workload's namespace**
+(primary), refined by a `service-identity` label in the few multi-Service
+namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
+
+| Layer | Component | Lifetime | Where it lives |
+|---|---|---|---|
+| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
+| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
+| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
+
+**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
+labels + allow-deny + policy-trace) streamed from Felix (the existing
+`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
+**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
+drove the whole design). **Whisker** is its live web UI. Because the ring
+buffer is *not* a trail (a Goldmane restart loses the window), the
+`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
+mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
+CronJob posts first-seen edges to Slack.
+
+The edge set is deliberately **low-cardinality** — one row per
+`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
+small no matter how much traffic flows.
+
+## Where the data lives
+
+### Whisker UI — live, ~60 min
+- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
+  login; `auth = "required"`). Shows the live flow stream + a service graph for
+  roughly the last hour. Use it for "what is talking right now"; it is **not**
+  history.
+- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
+  (HTTP), both in `calico-system`.
+
+### CNPG `goldmane_edges` — durable
+- Postgres DB `goldmane_edges` on the CNPG cluster
+  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
+
+  ```
+  edge(src_ns text, dst_ns text, action text,
+       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
+       PRIMARY KEY (src_ns, dst_ns, action))
+  ```
+
+  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
+    action).
+  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
+    / public-internet) are **dropped** — the trail is about in-cluster service
+    relationships only. (Egress to the public internet is therefore NOT in this
+    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
+  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
+  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
+    is created idempotently by the aggregator at startup (canonical DDL also in
+    the repo at `migrations/0001_edge.sql`).
+
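The upsert behaviour described for the `edge` table can be sketched with an in-memory SQLite table standing in for Postgres. This is a minimal illustration under stated assumptions, not the aggregator's actual code: the `upsert_edge` helper, its signature, and the choice to bump `flow_count` by one per observed flow are hypothetical.

```python
import sqlite3

# Stand-in for the Postgres `edge` table (SQLite, in-memory, timestamps as text).
DDL = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns TEXT NOT NULL, dst_ns TEXT NOT NULL, action TEXT NOT NULL,
    first_seen TEXT NOT NULL, last_seen TEXT NOT NULL, flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)
)
"""

# A conflict on the primary key keeps first_seen and refreshes the rest.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen = excluded.last_seen,
    flow_count = edge.flow_count + 1
"""


def upsert_edge(conn, src_ns, dst_ns, action, seen_at):
    """Record one observed flow (hypothetical helper); False if it is dropped."""
    # Self-edges and empty-namespace flows are not part of the trail.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
upsert_edge(conn, "monitoring", "dbaas", "allow", "2026-06-25T08:00:00Z")
upsert_edge(conn, "monitoring", "dbaas", "allow", "2026-06-25T09:00:00Z")
upsert_edge(conn, "dbaas", "dbaas", "allow", "2026-06-25T09:00:00Z")  # self-edge
upsert_edge(conn, "traefik", "", "allow", "2026-06-25T09:00:00Z")  # empty namespace
rows = conn.execute("SELECT * FROM edge").fetchall()
```

Two observations of the same `(src_ns, dst_ns, action)` collapse into one row whose `first_seen` stays fixed while `last_seen` and `flow_count` advance, which is why the table stays small regardless of traffic volume.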
+### Slack `#alerts` — daily digest
+
+> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
+
+- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
+  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
+  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
+  — no new webhook was created.
+
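The "new edge" rule and the quiet-when-empty behaviour can be sketched as follows. This is an illustration only: `new_edges`, `format_digest`, and the message layout are hypothetical and are not taken from the digest binary.

```python
from datetime import datetime, timedelta, timezone


def new_edges(edges, now, window=timedelta(hours=24)):
    """Return edges whose first_seen falls inside the digest window."""
    cutoff = now - window
    return [e for e in edges if cutoff < e["first_seen"] <= now]


def format_digest(edges):
    """Build a plain-text digest, or None when there is nothing to report."""
    if not edges:
        return None  # quiet day: nothing is posted
    lines = [f"{len(edges)} new edge(s) in the last 24h:"]
    for e in sorted(edges, key=lambda e: e["first_seen"], reverse=True):
        lines.append(f"{e['src_ns']} -> {e['dst_ns']} ({e['action']})")
    return "\n".join(lines)


now = datetime(2026, 6, 25, 8, 0, tzinfo=timezone.utc)
edges = [
    {"src_ns": "monitoring", "dst_ns": "dbaas", "action": "allow",
     "first_seen": now - timedelta(hours=3)},
    {"src_ns": "traefik", "dst_ns": "authentik", "action": "allow",
     "first_seen": now - timedelta(days=10)},
]
fresh = new_edges(edges, now)
message = format_digest(fresh)
quiet = format_digest(new_edges(edges[1:], now))
```

Here only the three-hour-old edge is reported; the ten-day-old edge is already known, so a run that sees nothing else produces no message at all.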
+## How to enable / disable
+
+### Goldmane + Whisker (the flow plane)
+Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
+flags (those stay `false`; the operator's own `installation`/`apiServer` are
+operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
+
+- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
+  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
+  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
+  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
+  goldmane:7443`.
+- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
+  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
+
+**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
+toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
+ADR-0014).
+
+### Whisker public ingress (infra #57)
+Also in `stacks/calico/main.tf`:
+- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
+  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
+- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
+  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
+  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
+  This additive NP ORs in an allow for `namespaceSelector
+  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
+
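Why an additive policy is enough, rather than editing the operator-owned one, follows from how NetworkPolicies combine: a pod selected by any Ingress policy is default-deny, and traffic passes if any rule of any selecting policy matches. The toy model below illustrates only that OR semantics; it ignores pod selectors, `ipBlock`, and protocols, and the policy name `whisker-allow-traefik` is a hypothetical label for the Terraform-managed policy.

```python
def ingress_allowed(policies, src_namespace, port):
    """Toy model of NetworkPolicy ingress evaluation for a single pod.

    `policies` are the policies that select the pod. Each is a dict with an
    `ingress` list of rules; a rule is {"namespaces": set, "ports": set}.
    """
    if not policies:
        return True  # no policy selects the pod: ingress is unrestricted
    # Policies are additive: traffic passes if ANY rule of ANY policy matches.
    return any(
        src_namespace in rule["namespaces"] and port in rule["ports"]
        for policy in policies
        for rule in policy["ingress"]
    )


# Operator-owned policy: Ingress policy type with no rules (default-deny).
operator_np = {"name": "whisker", "ingress": []}
# Additive policy: allow the traefik namespace on TCP 8081.
allow_traefik = {
    "name": "whisker-allow-traefik",  # hypothetical name
    "ingress": [{"namespaces": {"traefik"}, "ports": {8081}}],
}

before = ingress_allowed([operator_np], "traefik", 8081)
after = ingress_allowed([operator_np, allow_traefik], "traefik", 8081)
other = ingress_allowed([operator_np, allow_traefik], "monitoring", 8081)
```

With only the operator's policy, Traefik is blocked (the 502 case); adding the second policy opens exactly the Traefik path and nothing else.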
+### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
+A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply it with
+`scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
+the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
+ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
+the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
+without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
+0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
+
+Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
+`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
+allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
+`local.ghcr_private_namespaces`), or image pulls fail with HTTP 401. Code repo:
+`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
+
+## mTLS cert — the REUSE decision (cert-reuse gotcha)
+
+The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
+client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
+identity** — any Tigera-CA-signed cert is accepted.
+
+Rather than copy the Tigera CA **private key** into Terraform state to mint our
+own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
+with this repo's global generate-providers/lockfile pattern), the stack
+**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
+Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
+`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
+verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
+`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
+cross-namespace-mounted).
+
+> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
+> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
+> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
+> and no `last_seen` updates land in the `edge` table. Hardening follow-up
+> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
+> removed (which would delete the reused source Secret).
+
+The Deployment leaves `GOLDMANE_HOST` at `goldmane.calico-system.svc.cluster.local:7443`
+and keeps the default cert/CA paths. The default ServerName (the host without the port) is a SAN
+on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
+`GOLDMANE_TLS_INSECURE` override is needed.
+
+## How to query who-talks-to-whom
+
+`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
+exec a CNPG pod). All queries are against the single `edge` table.
+
+```sql
+-- Everything talking to a namespace (inbound), most-active first
+SELECT src_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
+
+-- Everything a namespace talks TO (outbound)
+SELECT dst_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
+
+-- New edges in the last 24h (what the digest reports)
+SELECT src_ns, dst_ns, action, flow_count, first_seen
+FROM edge WHERE first_seen > now() - interval '24 hours'
+ORDER BY first_seen DESC;
+
+-- Any DENIED edges (policy is dropping this pair)
+SELECT src_ns, dst_ns, flow_count, last_seen
+FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
+
+-- Full edge set as a graph adjacency list
+SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
+```
+
+For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
+the `edge` table intentionally aggregates that away.
+
+## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
+
+The durable edge set is a faster, identity-stamped data source for the existing
+**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
+`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
+iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
+a better data source"). It replaces the *internal* (namespace-to-namespace) leg
+of the allowlist; **external/public-internet egress is NOT in this table** (empty
+dst namespace, dropped) — for those destinations keep using the Calico flow-log
+path described in security.md.
+
+**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
+given source is *observed* talking to with `action='allow'`:
+
+```sql
+-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
+SELECT DISTINCT dst_ns
+FROM edge
+WHERE src_ns = '<ns>' AND action = 'allow'
+ORDER BY dst_ns;
+```
+
+```sql
+-- Full internal egress matrix for all namespaces at once
+SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
+FROM edge
+WHERE action = 'allow'
+GROUP BY src_ns
+ORDER BY src_ns;
+```
+
+```sql
+-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
+-- before tightening further)
+SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
+```
+
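For rows already fetched from the `edge` table by any Postgres client, the same matrix can be built in a few lines of Python. This is a sketch, not part of the stack: `egress_matrix` is a hypothetical helper and the sample namespaces are illustrative.

```python
from collections import defaultdict


def egress_matrix(rows):
    """Map each source namespace to its sorted, de-duplicated allow set.

    `rows` are (src_ns, dst_ns, action) tuples as read from the edge table.
    """
    matrix = defaultdict(set)
    for src_ns, dst_ns, action in rows:
        if action == "allow":  # deny/pass/unspecified never seed an allowlist
            matrix[src_ns].add(dst_ns)
    return {src: sorted(dsts) for src, dsts in sorted(matrix.items())}


rows = [
    ("monitoring", "dbaas", "allow"),
    ("monitoring", "authentik", "allow"),
    ("monitoring", "vault", "deny"),
    ("traefik", "authentik", "allow"),
]
matrix = egress_matrix(rows)
```

The `deny` row is ignored, so `monitoring` maps to `authentik` and `dbaas` only, matching what the `array_agg` query would return for the same data.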
+**How this feeds enforcement (scope):** the derived `dst_ns` set is the
+*internal* half of a namespace's egress allowlist — it tells you which
+in-cluster namespaces to permit before flipping that namespace to default-deny.
+The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
+the external destinations still come from the Wave-1 observation snapshot.
+**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
+the phased per-namespace default-deny rollout (starting `recruiter-responder`)
+is tracked under `code-8ywc`. Cross-links:
+[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
+[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+
+> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
+> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
+> collect ≥7 days of edges before treating a namespace's `allow` set as
+> complete. The `first_seen` column tells you how long an edge has been known;
+> the digest surfaces brand-new ones daily.
+
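One way to apply the ≥7-day caveat mechanically is to treat the age of a namespace's oldest `allow` edge as a conservative proxy for how long it has been observed. The sketch below assumes that heuristic; `observation_span`, `allow_set_is_mature`, and `MIN_OBSERVATION` are hypothetical names, and a namespace that only recently started talking will be flagged even if the aggregator has run for longer.

```python
from datetime import datetime, timedelta, timezone

MIN_OBSERVATION = timedelta(days=7)  # the ">= 7 days" rule of thumb


def observation_span(edges, src_ns, now):
    """Age of the oldest allow edge from src_ns (zero if none is known)."""
    seen = [e["first_seen"] for e in edges
            if e["src_ns"] == src_ns and e["action"] == "allow"]
    return now - min(seen) if seen else timedelta(0)


def allow_set_is_mature(edges, src_ns, now):
    """True once src_ns has at least MIN_OBSERVATION of observed history."""
    return observation_span(edges, src_ns, now) >= MIN_OBSERVATION


now = datetime(2026, 6, 25, tzinfo=timezone.utc)
edges = [
    {"src_ns": "monitoring", "dst_ns": "dbaas", "action": "allow",
     "first_seen": now - timedelta(days=12)},
    {"src_ns": "monitoring", "dst_ns": "authentik", "action": "allow",
     "first_seen": now - timedelta(hours=5)},
    {"src_ns": "dbaas", "dst_ns": "monitoring", "action": "allow",
     "first_seen": now - timedelta(days=2)},
]
mature = allow_set_is_mature(edges, "monitoring", now)
immature = allow_set_is_mature(edges, "dbaas", now)
unknown = allow_set_is_mature(edges, "traefik", now)
```

A namespace with no edges at all is reported as not mature, which errs on the side of collecting more data before anything is derived from it.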
+## Monitoring & health (infra #61)
+
+The aggregator pod has **no `/metrics` endpoint** — health is inferred from
+kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
+see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
+
+| Signal | What | Where |
+|---|---|---|
+| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
+| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
+| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
+
+The two alert layers are deliberately complementary: `AggregatorDown` →
+**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
+is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
+is the agreed floor.
+
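The decision logic described for cluster-health check #48 (missing or not-Available means FAIL) can be expressed over a Deployment object parsed from JSON. The real check is a shell function in `scripts/cluster_healthcheck.sh`; this Python rendering, including the `deployment_available` name and the reason strings, is a hypothetical sketch of the same rule.

```python
def deployment_available(deployment):
    """Return (ok, reason) for a Deployment dict parsed from JSON.

    `deployment` is None when the Deployment does not exist.
    """
    if deployment is None:
        return False, "deployment missing"
    conditions = deployment.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Available":
            if cond.get("status") == "True":
                return True, "available"
            return False, "not available"
    return False, "no Available condition reported"


healthy = {"status": {"conditions": [
    {"type": "Progressing", "status": "True"},
    {"type": "Available", "status": "True"},
]}}
degraded = {"status": {"conditions": [{"type": "Available", "status": "False"}]}}
results = [deployment_available(d) for d in (healthy, degraded, None)]
```

The `None` case is the one the `< 1` alert expression cannot see, since a deleted Deployment leaves no metric series behind.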
+## Troubleshooting
+
+**Whisker UI 502 / unreachable.** The additive
+`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
+operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
+brand-new ingress host is also invisible to LAN split-horizon until the hourly
+`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
+`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
+(expect a 302 to Authentik — the gate working).
+
+**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
+pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
+Common causes, in order:
+1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
+   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
+   handshake / `Flows.Stream` errors.
+2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
+   the pod kept the old one. The Deployment carries
+   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
+   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
+3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
+   reconnects automatically and resumes upserting. No data loss in the DB
+   (only the sub-hour live window in Whisker is gone).
+
+**Digest never posts / `DigestFailing` firing.** Inspect the most recent
+`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
+`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's `ttl_seconds_after_finished=86400`
+garbage-collects finished Jobs and their pods after a day, so check soon after a failed run. A
+`404 channel_not_found` response from Slack means the webhook's Slack app is not a member of the target channel (see the channel note above). With `SLACK_WEBHOOK_URL`
+empty the binary forces a dry-run (no post), so verify the `goldmane-edges-slack` ExternalSecret resolved.
+For a deliberate dry run / smoke test, run the image with `args: ["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
+> Known state (2026-06-25): the digest CronJob's first Job **failed** (`lastSuccessfulTime` was empty at the time of writing)
+> and `DigestFailing` caught it. The digest has since been rerouted from `#security`, where the shared webhook returns
+> `404 channel_not_found`, to `#alerts`, and a real digest run was verified to post cleanly (360 edges).
+> Edges land in the DB via the `aggregate` Deployment regardless; only the Slack notification leg was affected.
+> Any further investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
+
+**No edges at all in the table.** Confirm Goldmane is enabled
+(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
+`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
+completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
+(ghcr allowlist).
+
+## Related
+- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
+- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
+- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
+- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
+- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
+- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
+  `stacks/goldmane-edge-aggregator`, `stacks/calico`