calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP

Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog added in 8d1d2fb9 was treating a symptom).

The tigera operator's own `whisker` NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves fine; a test pod with the operator's podSelector-only egress rule reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to 100% ok.

whisker-backend resolves goldmane once in the brief startup window before the policy programs, holds its long-lived gRPC stream, and only re-resolves when that stream breaks (e.g. a node-reboot blip); then the blocked ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip (whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop (repurposed comment).

Applied + verified: ClusterIP DNS from the whisker netns now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:21 +00:00 · 2026-06-28 09:32:21 +00:00 · a3eb309e26
commit a3eb309e26
parent b84b0021c2
3 changed files with 88 additions and 32 deletions
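
The ClusterIP vs pod-IP comparison described in the commit message can be sketched as a small shell helper. This is a minimal sketch, not the procedure actually used during the incident: the function names `probe_dns` and `classify_dns`, the 5-second timeout, and the reliance on `nslookup` returning a non-zero exit status on failure are all assumptions. It also assumes the shell is already running inside the whisker pod's network namespace; how to get there is not covered.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names and the timeout are hypothetical.
# Run from inside the whisker pod's network namespace.

# probe_dns NAME SERVER
# Prints "ok" if NAME resolves against SERVER within 5 seconds, else "fail".
# Assumes the local nslookup exits non-zero when the lookup fails.
probe_dns() {
  local name="$1" server="$2"
  if timeout 5 nslookup "$name" "$server" >/dev/null 2>&1; then
    echo ok
  else
    echo fail
  fi
}

# classify_dns CLUSTERIP_RESULT PODIP_RESULT
# Turns the two probe results into a one-word diagnosis.
classify_dns() {
  case "$1/$2" in
    ok/ok)   echo healthy ;;            # both paths resolve
    fail/ok) echo clusterip-blocked ;;  # the signature described in this commit
    *)       echo inconclusive ;;       # pod-IP path failed: wrong IP or DNS is down
  esac
}

# Example invocation (not executed here; KUBE_DNS_POD_IP is a placeholder
# that must be set to the IP of a running kube-dns pod):
#   name=goldmane.calico-system.svc.cluster.local
#   classify_dns "$(probe_dns "$name" 10.96.0.10)" \
#                "$(probe_dns "$name" "$KUBE_DNS_POD_IP")"
```

A result of `clusterip-blocked` matches the failure described above and suggests the `whisker-allow-dns-clusterip` policy is missing or not yet programmed; `inconclusive` means the pod-IP probe itself failed, so the comparison says nothing about the policy.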
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -245,7 +245,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
 - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
-- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA**: whisker-backend has no operator liveness probe, so a transient CNI/DNS blip (e.g. a node reboot/upgrade) can wedge its Goldmane gRPC stream and leave the UI **empty** indefinitely (the aggregator, a separate pod, is unaffected) — the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) auto-restarts it; manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).

 ## Storage & Backup Architecture
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@@ -43,9 +43,11 @@ small no matter how much traffic flows.
  history.
 - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
  (HTTP), both in `calico-system`.
-- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
-  restarts whisker if its backend's Goldmane stream wedges (the operator gives
-  whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
+- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
+  by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
+  empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
+  The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
+  whisker if its backend ever wedges for another reason.

 ### CNPG `goldmane_edges` — durable
 - Postgres DB `goldmane_edges` on the CNPG cluster
@@ -261,23 +263,30 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
 `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
 (expect a 302 to Authentik — the gate working).

-**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
-gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
-a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
-upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
-the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
-flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
-The operator ships whisker-backend with **no liveness probe**, so nothing
-restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
-auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
-*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
-real Goldmane outage). To heal immediately:
-`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
-it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
-and is unaffected — only the live UI goes blank. Confirm the diagnosis with
-`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
-the node's own DNS is usually fine (test with a throwaway pod pinned there:
-`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
+**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
+2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
+policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
+*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
+`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
+**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
+Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
+kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
+whisker-backend resolves goldmane ONCE in the brief startup window before the
+policy programs, holds its long-lived gRPC stream, and only re-resolves when that
+stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
+DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
+... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
+SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
+
+FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
+(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
+ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
+the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
+the pod if it ever wedges for another reason. Immediate manual heal:
+`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
+from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
+10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
+query aimed at a kube-dns *pod IP* (always works).

 **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
 pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@@ -275,20 +275,67 @@ resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
  }
 }

+# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
+#
+# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
+# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
+# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
+# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
+# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
+# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
+# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
+# fine). whisker-backend resolves once in the brief startup window before the
+# policy programs, establishes its long-lived gRPC stream, and only re-resolves
+# when that stream breaks — at which point the blocked ClusterIP DNS wedges its
+# Go resolver and the UI goes empty (the durable aggregator, in its own
+# unrestricted namespace, is unaffected). k8s egress policies are additive, so
+# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
+# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
+# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
+resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
+  metadata {
+    name      = "whisker-allow-dns-clusterip"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Egress"]
+    egress {
+      # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR
+      # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
+      to {
+        ip_block {
+          cidr = "10.96.0.10/32"
+        }
+      }
+      ports {
+        port     = "53"
+        protocol = "UDP"
+      }
+      ports {
+        port     = "53"
+        protocol = "TCP"
+      }
+    }
+  }
+}
+
 # ---------------------------------------------------------------------------
 # Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
 #
-# FAILURE MODE: whisker-backend dials goldmane:7443 over a long-lived gRPC
-# stream. When that stream drops during a transient CNI/DNS blip (observed
-# 2026-06-28 right after k8s-node5's v1.35.6 upgrade settled — the pod's
-# resolver started timing out on the kube-dns ClusterIP), the Go client's
-# resolver gets WEDGED: it spams `failed to stream flows` /
-# `code = Unavailable: dns ... i/o timeout` forever and never reconnects, so
-# the Whisker UI shows EMPTY while the durable aggregator (a separate pod, same
-# Goldmane source) is unaffected. The operator ships whisker-backend with NO
-# liveness/readiness probe, so nothing restarts it — it sat broken until a
-# manual `kubectl delete pod`. Whisker is operator-managed (Whisker CR), so we
-# can't inject a probe; this watchdog is the supported-pattern alternative.
+# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
+# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
+# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
+# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
+# resolver spams `failed to stream flows` / `code = Unavailable` and never
+# reconnects -> empty UI, while the durable aggregator in its own namespace is
+# unaffected), nothing else would restart it. Whisker is operator-managed
+# (Whisker CR) so we can't inject a probe; this is the supported-pattern
+# alternative. With the DNS fix in place it should rarely, if ever, fire.
 #
 # It restarts the pod ONLY when the wedged signature is present AND Goldmane is
 # Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod