diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 9c873a07..7dec9d96 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -243,7 +243,8 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. -- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. +- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. 
**The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`). +- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.) - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2). 
## Storage & Backup Architecture diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index cd7b5274..ce8d7abb 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -13,6 +13,8 @@ | authentik | Identity provider (SSO) | authentik | | cloudflared | Cloudflare tunnel | cloudflared | | authelia | Auth middleware (may be merged into ebooks or removed) | platform | +| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico | +| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico | | monitoring | Prometheus/Grafana/Loki stack | monitoring | ## Storage & Security (Tier: cluster) @@ -37,6 +39,7 @@ ## Active Use | Service | Description | Stack | |---------|-------------|-------| +| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts`. mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). 
| goldmane-edge-aggregator | | mailserver | Email (docker-mailserver) | mailserver | | shadowsocks | Proxy | shadowsocks | | webhook_handler | Webhook processing | webhook_handler | @@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`: | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) | | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) | | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) | +| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) | diff --git a/CONTEXT.md b/CONTEXT.md index 2b9bb8b3..368f8e59 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object. **Goldmane / Whisker**: -Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. +Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. 
The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`. _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh). ### Storage diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md index 5eb1c83a..b782ee30 100644 --- a/docs/adr/0014-service-identity-and-east-west-observability.md +++ b/docs/adr/0014-service-identity-and-east-west-observability.md @@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency. - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**. - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary. + +## As-built (2026-06-25) + +Implemented across infra issues #57–#63. 
**One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48. + +Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`. diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index 3c75a345..f4cc9012 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). 
`probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' ''`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.) +#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014) + +Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**. + +| Alert | Expr (abridged) | For | Severity | +|---|---|---|---| +| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning | +| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning | + +The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. 
(`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`). + #### Backup Alerts - **PostgreSQLBackupStale**: >36h since last backup - **MySQLBackupStale**: >36h since last backup diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 7d3043ea..f1acf6bd 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -364,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.** - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs. - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972). +#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7) + +The durable **east-west flow trail** (below) is now the preferred data source for +the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist — +faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path +(ADR-0014: "Enforcement gains a better data source"). The unique observed +namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. 
To derive the +namespaces a source is observed talking to (the `allow` set that seeds its +NetworkPolicy): + +```sql +SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow' ORDER BY dst_ns; +``` + +The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day +observation caveat) is in +[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62). +**External / public-internet egress is NOT in this table** (empty-namespace flows +are dropped) — for those destinations keep using the Calico flow-log observation +(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the +existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain +out of scope** of the trail — it is observe-and-derive only. + +### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014) + +The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which +carried no identity). **Service identity = the workload's namespace** (primary), +refined by a `service-identity` label in the few multi-Service namespaces +(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers: + +1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates + identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace) + streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no + etcd/API writes — the etcd-cost constraint that drove the design). **Whisker** + is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated, + `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs + Traefik past the operator's default-deny `whisker` NP). The ring buffer is + **not** a trail (lost on Goldmane restart). Enabled via operator CRs in + `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview). +2. 
**`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams + Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality + namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen, + flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace + (public-internet) flows are dropped — in-cluster relationships only. The mTLS + client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** + (Goldmane verifies CA-chain only, not identity) rather than copying the CA + private key into TF state — **re-apply the stack if the operator rotates that + Secret**. +3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to + **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s — + that webhook's Slack app isn't a member of `#security`; see runbook). + +The trail is **attribution-grade, not cryptographic** (reconstructs events in a +trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model +limit; east-west stays plaintext, no mTLS between app pods). Health is covered by +the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48 +(see monitoring.md). Full as-built, query recipes, and troubleshooting: +[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision: +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary +`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. + ### TLS & HTTP/3 **Traefik** handles TLS termination: diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md new file mode 100644 index 00000000..0ab27c43 --- /dev/null +++ b/docs/runbooks/goldmane-flow-trail.md @@ -0,0 +1,301 @@ +# Goldmane Flow Trail — east-west "who-talks-to-whom" observability + +> As-built runbook for the Calico Goldmane + Whisker flow plane and the +> `goldmane-edge-aggregator` durable audit trail. 
Design + rationale: +> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). +> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. +> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61 +> (monitoring), #62 (egress allowlist queries), #63 (these docs). + +## What the trail is + +Three layers turn raw east-west traffic into a queryable, durable record of +which Service talks to which. **Service identity = the workload's namespace** +(primary), refined by a `service-identity` label in the few multi-Service +namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014. + +| Layer | Component | Lifetime | Where it lives | +|---|---|---|---| +| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` | +| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` | +| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` | + +**Goldmane** aggregates identity-stamped flows (namespace / pod / workload / +labels + allow-deny + policy-trace) streamed from Felix (the existing +`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — +**nothing is written to etcd or the K8s API** (the etcd-cost constraint that +drove the whole design). **Whisker** is its live web UI. Because the ring +buffer is *not* a trail (a Goldmane restart loses the window), the +`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over +mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily +CronJob posts first-seen edges to Slack. + +The edge set is deliberately **low-cardinality** — one row per +`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays +small no matter how much traffic flows. 
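
A minimal sketch of the fold that keeps the edge set low-cardinality, written in Python purely for illustration (the real aggregator is a Go service that upserts into Postgres, and its code is not reproduced here). The `Edge` class, the `upsert_edge` helper, and the in-memory dict standing in for the `edge` table are all hypothetical, and incrementing `flow_count` by one per observed flow is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Edge:
    """Hypothetical in-memory stand-in for one row of the `edge` table."""
    first_seen: datetime
    last_seen: datetime
    flow_count: int


def upsert_edge(edges, src_ns, dst_ns, action, seen_at):
    """Fold one observed flow into the edge set.

    Returns True if the flow was recorded, False if it was dropped.
    """
    # Empty-namespace flows (host-endpoint / public-internet) and
    # self-edges are not part of the trail, so they are dropped.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False

    # One row per (src_ns, dst_ns, action): pods and ports are aggregated away.
    key = (src_ns, dst_ns, action)
    edge = edges.get(key)
    if edge is None:
        edges[key] = Edge(first_seen=seen_at, last_seen=seen_at, flow_count=1)
    else:
        edge.first_seen = min(edge.first_seen, seen_at)
        edge.last_seen = max(edge.last_seen, seen_at)
        edge.flow_count += 1  # assumption: one increment per observed flow
    return True


if __name__ == "__main__":
    edges = {}
    t0 = datetime(2026, 6, 25, 8, 0, tzinfo=timezone.utc)
    upsert_edge(edges, "monitoring", "dbaas", "allow", t0)
    upsert_edge(edges, "monitoring", "dbaas", "allow", t0)
    print(len(edges), edges[("monitoring", "dbaas", "allow")].flow_count)
```

However many flows repeat the same namespace pair and action, the dict holds a single entry for it, which is why the table size tracks the number of distinct relationships and not traffic volume.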
+ +## Where the data lives + +### Whisker UI — live, ~60 min +- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own + login; `auth = "required"`). Shows the live flow stream + a service graph for + roughly the last hour. Use it for "what is talking right now"; it is **not** + history. +- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` + (HTTP), both in `calico-system`. + +### CNPG `goldmane_edges` — durable +- Postgres DB `goldmane_edges` on the CNPG cluster + (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table: + + ``` + edge(src_ns text, dst_ns text, action text, + first_seen timestamptz, last_seen timestamptz, flow_count bigint, + PRIMARY KEY (src_ns, dst_ns, action)) + ``` + + - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane + action). + - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint + / public-internet) are **dropped** — the trail is about in-cluster service + relationships only. (Egress to the public internet is therefore NOT in this + table; it lives in the Wave-1 Calico flow-log path — see security.md.) + - A **"new edge"** = a row whose `first_seen` falls inside the digest window. + - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table + is created idempotently by the aggregator at startup (canonical DDL also in + the repo at `migrations/0001_edge.sql`). + +### Slack `#alerts` — daily digest + +> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply. 
+ +- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen + in the last 24h. Quiet when there are none. Reuses the existing alert-digest + Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`) + — no new webhook was created. + +## How to enable / disable + +### Goldmane + Whisker (the flow plane) +Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker` +flags (those stay `false`; the operator's own `installation`/`apiServer` are +operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs): + +- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator + re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the + operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a + supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service + goldmane:7443`. +- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane; + `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`. + +**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible +toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per +ADR-0014). + +### Whisker public ingress (infra #57) +Also in `stacks/calico/main.tf`: +- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`, + `dns_type = "proxied"`) → `whisker.viktorbarzin.me`. +- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the + ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR) + is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod. + This additive NP ORs in an allow for `namespaceSelector + kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s. + +### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator` +A Tier-1 stack (PG state) mirroring the claude-memory pattern. 
`scripts/tg +apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace, +the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL` +ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret, +the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail +without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to +0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running. + +Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the +`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno +allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`, +`local.ghcr_private_namespaces`) or pulls 401. Code repo: +`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`). + +## mTLS cert — the REUSE decision (cert-reuse gotcha) + +The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the +client cert to chain to the **Tigera CA**, but it does **NOT authorize by client +identity** — any Tigera-CA-signed cert is accepted. + +Rather than copy the Tigera CA **private key** into Terraform state to mint our +own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes +with this repo's global generate-providers/lockfile pattern), the stack +**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair` +Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the +`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that +verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key +`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be +cross-namespace-mounted). + +> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply +> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. 
Symptom of a +> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures +> and no `last_seen` updates land in the `edge` table. Hardening follow-up +> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever +> removed (which would delete the reused source Secret). + +The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443` +and the default cert/CA paths; the default ServerName (host sans port) is a SAN +on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` / +`GOLDMANE_TLS_INSECURE` override is needed. + +## How to query who-talks-to-whom + +`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or +exec a CNPG pod). All queries are against the single `edge` table. + +```sql +-- Everything talking to a namespace (inbound), most-active first +SELECT src_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE dst_ns = '' ORDER BY flow_count DESC; + +-- Everything a namespace talks TO (outbound) +SELECT dst_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE src_ns = '' ORDER BY last_seen DESC; + +-- New edges in the last 24h (what the digest reports) +SELECT src_ns, dst_ns, action, flow_count, first_seen +FROM edge WHERE first_seen > now() - interval '24 hours' +ORDER BY first_seen DESC; + +-- Any DENIED edges (policy is dropping this pair) +SELECT src_ns, dst_ns, flow_count, last_seen +FROM edge WHERE action = 'deny' ORDER BY last_seen DESC; + +-- Full edge set as a graph adjacency list +SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns; +``` + +For the **live** (sub-hour) view including pod/port detail, use the Whisker UI — +the `edge` table intentionally aggregates that away. 
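
For post-processing an exported result set outside `psql`, the "new edges in the last 24h" query can be mirrored in a few lines of Python. This is a sketch under stated assumptions: `new_edges` is a hypothetical helper, rows are assumed to be dicts keyed by the `edge` column names, and it is not the digest binary's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def new_edges(rows, now, window=timedelta(hours=24)):
    """Return rows whose first_seen falls inside the window, newest first.

    Mirrors: WHERE first_seen > now() - interval '24 hours'
             ORDER BY first_seen DESC
    """
    cutoff = now - window
    fresh = [row for row in rows if row["first_seen"] > cutoff]
    return sorted(fresh, key=lambda row: row["first_seen"], reverse=True)


if __name__ == "__main__":
    now = datetime(2026, 6, 25, 8, 0, tzinfo=timezone.utc)
    # Hypothetical sample rows; the namespace names are placeholders.
    rows = [
        {"src_ns": "a", "dst_ns": "b", "action": "allow",
         "first_seen": now - timedelta(hours=30)},
        {"src_ns": "a", "dst_ns": "c", "action": "allow",
         "first_seen": now - timedelta(hours=2)},
    ]
    for row in new_edges(rows, now):
        print(row["src_ns"], "->", row["dst_ns"], row["action"])
```

The window is a parameter so the same helper can be reused with a longer look-back, for example when reviewing a week of first-seen edges.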
+ +## Deriving the Wave-1 egress allowlist from the edge table (infra #62) + +The durable edge set is a faster, identity-stamped data source for the existing +**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot +`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original +iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains +a better data source"). It replaces the *internal* (namespace-to-namespace) leg +of the allowlist; **external/public-internet egress is NOT in this table** (empty +dst namespace, dropped) — for those destinations keep using the Calico flow-log +path described in security.md. + +**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a +given source is *observed* talking to with `action='allow'`: + +```sql +-- Internal egress allowlist for one namespace (feeds its NetworkPolicy) +SELECT DISTINCT dst_ns +FROM edge +WHERE src_ns = '' AND action = 'allow' +ORDER BY dst_ns; +``` + +```sql +-- Full internal egress matrix for all namespaces at once +SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns +FROM edge +WHERE action = 'allow' +GROUP BY src_ns +ORDER BY src_ns; +``` + +```sql +-- Sanity: namespaces with a DENY edge already (policy is biting; investigate +-- before tightening further) +SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny'; +``` + +**How this feeds enforcement (scope):** the derived `dst_ns` set is the +*internal* half of a namespace's egress allowlist — it tells you which +in-cluster namespaces to permit before flipping that namespace to default-deny. +The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and +the external destinations still come from the Wave-1 observation snapshot. +**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only; +the phased per-namespace default-deny rollout (starting `recruiter-responder`) +is tracked under `code-8ywc`. 
Cross-links: +[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34), +[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md), +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). + +> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was +> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet — +> collect ≥7 days of edges before treating a namespace's `allow` set as +> complete. The `first_seen` column tells you how long an edge has been known; +> the digest surfaces brand-new ones daily. + +## Monitoring & health (infra #61) + +The aggregator pod has **no `/metrics` endpoint** — health is inferred from +kube-state-metrics. Three complementary signals (memory ids 6598, 6599; +see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)): + +| Signal | What | Where | +|---|---|---| +| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` | +| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` | +| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) | + +The two alert layers are deliberately complementary: `AggregatorDown` → +**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody +is told**. 
A freshness probe (#61b) was intentionally skipped — `AggregatorDown` +is the agreed floor. + +## Troubleshooting + +**Whisker UI 502 / unreachable.** The additive +`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the +operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A +brand-new ingress host is also invisible to LAN split-horizon until the hourly +`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with +`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` +(expect a 302 to Authentik — the gate working). + +**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` +pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). +Common causes, in order: +1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply + `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS + handshake / `Flows.Stream` errors. +2. **Stale DB password** — the 7-day Vault rotation bounced the credential but + the pod kept the old one. The Deployment carries + `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not + restarting on rotation, verify the Reloader annotation and the ExternalSecret. +3. **Goldmane restarted** — the in-memory window was lost (expected); the stream + reconnects automatically and resumes upserting. No data loss in the DB + (only the sub-hour live window in Whisker is gone). + +**Digest never posts / `DigestFailing` firing.** Inspect the most recent +`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`; +`kubectl logs job/`). The CronJob's `ttl_seconds_after_finished=86400` GCs +pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL` +empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack` +ExternalSecret resolved. 
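
The "inspect the most recent Job" step can be sketched as a small filter. This is a hedged helper, not something shipped in the repo: `latest_digest_job` is a hypothetical name, and it assumes the Job list is fed in oldest-first, as `kubectl get jobs --sort-by=.metadata.creationTimestamp -o name` emits it.

```shell
# Sketch only: pick the newest digest Job from `kubectl get jobs -o name` output.
# Expects names like "job.batch/goldmane-edges-digest-..." on stdin, oldest first.
# Prints the last matching Job, or nothing if no digest Job is present.
latest_digest_job() {
  grep '^job\.batch/goldmane-edges-digest' | tail -n 1
}

# Example use (commented out so sourcing this file never touches a cluster):
# job="$(kubectl get jobs -n goldmane-edge-aggregator \
#          --sort-by=.metadata.creationTimestamp -o name | latest_digest_job)"
# [ -n "$job" ] && kubectl logs -n goldmane-edge-aggregator "$job"
```

Because finished Jobs are garbage-collected after a day, an empty result usually means the failed run has already been cleaned up, not that the digest is healthy.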
A dry run / smoke test: run the image with `args: +["digest"]` + `DRY_RUN=1` to print the message instead of POSTing. +> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has +> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the +> live gap; `DigestFailing` is catching it. Edges still land in the DB via the +> `aggregate` Deployment; only the `#security` notification is affected. +> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring. + +**No edges at all in the table.** Confirm Goldmane is enabled +(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the +`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job +completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff` +(ghcr allowlist). + +## Related +- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md) +- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md) +- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md) +- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md) +- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker** +- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks + `stacks/goldmane-edge-aggregator`, `stacks/calico` diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh index 51a13b5d..a5088137 100755 --- a/scripts/cluster_healthcheck.sh +++ b/scripts/cluster_healthcheck.sh @@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}" [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config" KUBECTL="" JSON_RESULTS=() -TOTAL_CHECKS=47 +TOTAL_CHECKS=48 # Parallel execution settings. 
Each check function is self-contained — it # only reads cluster state and mutates the in-memory counters / JSON_RESULTS @@ -3156,6 +3156,44 @@ PYEOF esac } +# --- 48. Goldmane edge-aggregator availability --- +# +# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico +# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom +# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped; +# this check reads the Deployment's Available condition directly so the trail +# silently dying surfaces in the health board (mirrors the AggregatorDown +# Prometheus alert). Missing Deployment / not-Available -> FAIL. +check_goldmane_aggregator() { + section 48 "Goldmane Edge-Aggregator" + local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator" + local avail desired ready + + # One get; absent Deployment is a hard fail (the trail isn't deployed). + if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running" + json_add "goldmane_aggregator" "FAIL" "deployment missing" + return 0 + fi + + avail=$($KUBECTL get deploy "$dep" -n "$ns" \ + -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) + ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null) + desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null) + ready=${ready:-0} + desired=${desired:-0} + + if [[ "$avail" == "True" ]]; then + pass "Edge-aggregator Available ($ready/$desired ready)" + json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready" + else + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording" + json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; 
Available=${avail:-unknown}" + fi +} + # --- Summary --- print_summary() { if [[ "$JSON" == true ]]; then @@ -3224,7 +3262,7 @@ main() { check_monitoring_prom_am check_monitoring_vault check_monitoring_css check_external_replicas check_external_divergence check_pve_thermals check_pve_load check_external_traefik_5xx check_ha_status_dashboard - check_immich_search check_csi_ghost_drift + check_immich_search check_csi_ghost_drift check_goldmane_aggregator ) # Auto-fix mutates cluster state inside individual checks — keep that diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf index 39550024..1354190e 100644 --- a/stacks/calico/main.tf +++ b/stacks/calico/main.tf @@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" { spec = { notifications = "Disabled" } }) } + +# --------------------------------------------------------------------------- +# Gated public ingress for the Whisker UI (infra #57 / ADR-0014). +# +# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required": +# Whisker ships NO own login — it's an admin observability UI, so Authentik +# forward-auth is the only gate between strangers and the flow view). The +# operator replicated `tls-secret` into calico-system already. +# +# TWO coupled pieces are required because the operator's own `whisker` +# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress] +# with NO ingress rules => default-deny on ingress to the whisker pod. The +# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive +# across policies selecting the same pod), so we never edit the operator NP. 
+module "ingress_whisker" { + source = "../../modules/kubernetes/ingress_factory" + dns_type = "proxied" + namespace = "calico-system" + name = "whisker" + service_name = "whisker" + port = 8081 + auth = "required" + tls_secret_name = "tls-secret" + extra_annotations = { + "gethomepage.dev/enabled" = "true" + "gethomepage.dev/name" = "Whisker" + "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)" + "gethomepage.dev/icon" = "calico.png" + "gethomepage.dev/group" = "Infrastructure" + } +} + +# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the +# operator's default-deny `whisker` NP (selecting the same pod) so Traefik +# can reach the UI without touching the operator-owned policy. +resource "kubernetes_network_policy_v1" "whisker_allow_traefik" { + metadata { + name = "whisker-allow-traefik" + namespace = "calico-system" + } + spec { + pod_selector { + match_labels = { + "app.kubernetes.io/name" = "whisker" + } + } + policy_types = ["Ingress"] + ingress { + from { + namespace_selector { + match_labels = { + "kubernetes.io/metadata.name" = "traefik" + } + } + } + ports { + port = "8081" + protocol = "TCP" + } + } + } +} diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 479263ed..d940f642 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" { labels = { "app" = "phpmyadmin" tier = var.tier - + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. Value = the fronting + # Service name (kubernetes_service.phpmyadmin is named "pma"). 
+ "service-identity" = "pma" } annotations = { "reloader.stakater.com/search" = "true" @@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" { metadata { labels = { "app" = "phpmyadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pma" } } spec { @@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the + # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl. + # the daily drift plan) doesn't fight them or revert the live image — + # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service. + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } @@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" { } labels = { tier = var.tier + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. Value = the fronting + # Service name (kubernetes_service.pgadmin is named "pgadmin"). 
+ "service-identity" = "pgadmin" } } spec { @@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" { metadata { labels = { app = "pgadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pgadmin" } } spec { @@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has + # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno + # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift + # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's + # annotations — canonical guard, matches linkwarden/chrome-service. 
+ metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } resource "kubernetes_service" "pgadmin" { diff --git a/stacks/goldmane-edge-aggregator/main.tf b/stacks/goldmane-edge-aggregator/main.tf index 2e71885e..9d1e8cdd 100644 --- a/stacks/goldmane-edge-aggregator/main.tf +++ b/stacks/goldmane-edge-aggregator/main.tf @@ -449,8 +449,16 @@ resource "kubernetes_cron_job_v1" "digest" { } } env { - name = "SLACK_CHANNEL" - value = "#security" + name = "SLACK_CHANNEL" + # The shared alertmanager_slack_api_url incoming webhook's Slack + # app is NOT a member of #security, so overriding the channel to + # it returns HTTP 404 channel_not_found (verified 2026-06-25). + # alertmanager's own slack-security receiver shares this webhook + # and almost certainly hits the same wall. Post to #alerts (the + # webhook's working channel, same as alert-digest) until the app + # is invited to #security, then flip this back. See + # docs/runbooks/goldmane-flow-trail.md. + value = "#alerts" } resources { diff --git a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf index 5244db0e..429dadc0 100644 --- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf +++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf @@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" { labels = { app = "blackbox-exporter" tier = var.tier + # ADR-0014 service identity: monitoring is a multi-Service namespace, so + # the namespace alone can't attribute Goldmane flows. 
Value = the + # fronting Service name (kubernetes_service.blackbox_exporter is named + # "blackbox-exporter"). + "service-identity" = "blackbox-exporter" } annotations = { "reloader.stakater.com/search" = "true" @@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" { metadata { labels = { app = "blackbox-exporter" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "blackbox-exporter" } } spec { diff --git a/stacks/monitoring/modules/monitoring/goflow2.tf b/stacks/monitoring/modules/monitoring/goflow2.tf index 6c9cb214..5d5829be 100644 --- a/stacks/monitoring/modules/monitoring/goflow2.tf +++ b/stacks/monitoring/modules/monitoring/goflow2.tf @@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" { labels = { app = "goflow2" tier = var.tier + # ADR-0014 service identity: monitoring is a multi-Service namespace, so + # the namespace alone can't attribute Goldmane flows. Value = the + # fronting Service name (kubernetes_service.goflow2 — the metrics svc; the + # goflow2-netflow NodePort is the same pod by another name). + "service-identity" = "goflow2" } } spec { @@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" { metadata { labels = { app = "goflow2" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. 
+ "service-identity" = "goflow2" } } spec { diff --git a/stacks/monitoring/modules/monitoring/idrac.tf b/stacks/monitoring/modules/monitoring/idrac.tf index d25e52aa..02f81598 100644 --- a/stacks/monitoring/modules/monitoring/idrac.tf +++ b/stacks/monitoring/modules/monitoring/idrac.tf @@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" { labels = { app = "idrac-redfish-exporter" tier = var.tier + # ADR-0014 service identity: monitoring is a multi-Service namespace, so + # the namespace alone can't attribute Goldmane flows. Value = the + # fronting Service name (kubernetes_service.idrac-redfish-exporter). + "service-identity" = "idrac-redfish-exporter" } annotations = { "reloader.stakater.com/search" = "true" @@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" { metadata { labels = { app = "idrac-redfish-exporter" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "idrac-redfish-exporter" } } spec { diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index a504cedc..f526e7ac 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -1450,6 +1450,49 @@ serverFiles: Remediation: right-size top reservers via Goldilocks (immich-server, frigate, prometheus, pg-cluster, paperless) or bump VM RAM on k8s-node2/k8s-node3 from 32GB → 48GB to match node1. + # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable + # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint, + # so its health is inferred from kube-state-metrics signals — the trail + # must not silently die. 
Two failure modes are covered: + # - the aggregate Deployment stops consuming Goldmane's flow stream + # (AggregatorDown) → no new edges ever land in the goldmane_edges DB + # - the daily digest CronJob can't post new edges to Slack + # (DigestFailing) → edges still land but nobody is told. + # A freshness probe (max(last_seen) staleness) is intentionally NOT here: + # AggregatorDown is the agreed floor and needs no extra moving parts. + - name: Network Observability (Goldmane) + rules: + # Deployment has <1 available replica for 15m. kube-state-metrics + # keeps `kube_deployment_status_replicas_available` (metric-keep list + # in serverFiles below). The 15m window rides out a normal rollout / + # node drain without paging; a genuinely-dead aggregator means the + # edge trail has stopped recording and stays down. + - alert: AggregatorDown + expr: | + kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1 + and on() (time() - process_start_time_seconds{job="prometheus"}) > 900 + for: 15m + labels: + severity: warning + annotations: + summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording" + description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable." + # The goldmane-edges-digest CronJob has a failed Job that started in + # the last 24h. Mirrors the generic JobFailed shape but scoped to the + # digest so it routes here. `for: 30m` rides out the apply/scrape + # transient; the digest runs daily so a real failure won't self-heal + # until the next run — surface it same-day rather than waiting 24h. 
+ - alert: DigestFailing + expr: | + kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0 + and on(namespace, job_name) + (time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400 + for: 30m + labels: + severity: warning + annotations: + summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #security" + description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`." - name: Infrastructure Health rules: - alert: HomeAssistantDown diff --git a/stacks/monitoring/modules/monitoring/pve_exporter.tf b/stacks/monitoring/modules/monitoring/pve_exporter.tf index 2d7a0b1a..eba9e13c 100644 --- a/stacks/monitoring/modules/monitoring/pve_exporter.tf +++ b/stacks/monitoring/modules/monitoring/pve_exporter.tf @@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" { namespace = kubernetes_namespace.monitoring.metadata[0].name labels = { tier = var.tier + # ADR-0014 service identity: monitoring is a multi-Service namespace, so + # the namespace alone can't attribute Goldmane flows. Value = the + # fronting Service name (kubernetes_service.proxmox-exporter). + "service-identity" = "proxmox-exporter" } } @@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" { metadata { labels = { app = "proxmox-exporter" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. 
+ "service-identity" = "proxmox-exporter" } } diff --git a/stacks/monitoring/modules/monitoring/snmp_exporter.tf b/stacks/monitoring/modules/monitoring/snmp_exporter.tf index 0fc60439..d8e5061c 100644 --- a/stacks/monitoring/modules/monitoring/snmp_exporter.tf +++ b/stacks/monitoring/modules/monitoring/snmp_exporter.tf @@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" { labels = { app = "snmp-exporter" tier = var.tier + # ADR-0014 service identity: monitoring is a multi-Service namespace, so + # the namespace alone can't attribute Goldmane flows. Value = the + # fronting Service name (kubernetes_service.snmp-exporter). + "service-identity" = "snmp-exporter" } annotations = { "reloader.stakater.com/search" = "true" @@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" { metadata { labels = { app = "snmp-exporter" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "snmp-exporter" } } spec {