calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP

Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker` NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule. Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves fine; a test pod with the operator's podSelector-only egress rule reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to 100% ok. whisker-backend resolves goldmane once in the brief startup window before the policy programs, holds its long-lived gRPC stream, and only re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable aggregator (separate pod, unrestricted namespace) was never affected. Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip (whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop (repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace list. Docs (runbook + CLAUDE.md) updated to the real root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:21 +00:00 · 2026-06-28 09:32:21 +00:00 · a3eb309e26
commit a3eb309e26
parent b84b0021c2
3 changed files with 88 additions and 32 deletions
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -275,20 +275,67 @@ resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
  }
 }

+# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
+#
+# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
+# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
+# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
+# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
+# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
+# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
+# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
+# fine). whisker-backend resolves once in the brief startup window before the
+# policy programs, establishes its long-lived gRPC stream, and only re-resolves
+# when that stream breaks — at which point the blocked ClusterIP DNS wedges its
+# Go resolver and the UI goes empty (the durable aggregator, in its own
+# unrestricted namespace, is unaffected). k8s egress policies are additive, so
+# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
+# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
+# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
+resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
+  metadata {
+    name      = "whisker-allow-dns-clusterip"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Egress"]
+    egress {
+      # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR
+      # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
+      to {
+        ip_block {
+          cidr = "10.96.0.10/32"
+        }
+      }
+      ports {
+        port     = "53"
+        protocol = "UDP"
+      }
+      ports {
+        port     = "53"
+        protocol = "TCP"
+      }
+    }
+  }
+}
+
 # ---------------------------------------------------------------------------
 # Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
 #
-# FAILURE MODE: whisker-backend dials goldmane:7443 over a long-lived gRPC
-# stream. When that stream drops during a transient CNI/DNS blip (observed
-# 2026-06-28 right after k8s-node5's v1.35.6 upgrade settled — the pod's
-# resolver started timing out on the kube-dns ClusterIP), the Go client's
-# resolver gets WEDGED: it spams `failed to stream flows` /
-# `code = Unavailable: dns ... i/o timeout` forever and never reconnects, so
-# the Whisker UI shows EMPTY while the durable aggregator (a separate pod, same
-# Goldmane source) is unaffected. The operator ships whisker-backend with NO
-# liveness/readiness probe, so nothing restarts it — it sat broken until a
-# manual `kubectl delete pod`. Whisker is operator-managed (Whisker CR), so we
-# can't inject a probe; this watchdog is the supported-pattern alternative.
+# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
+# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
+# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
+# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
+# resolver spams `failed to stream flows` / `code = Unavailable` and never
+# reconnects -> empty UI, while the durable aggregator in its own namespace is
+# unaffected), nothing else would restart it. Whisker is operator-managed
+# (Whisker CR) so we can't inject a probe; this is the supported-pattern
+# alternative. With the DNS fix in place it should rarely, if ever, fire.
 #
 # It restarts the pod ONLY when the wedged signature is present AND Goldmane is
 # Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod