Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix

2026-06-28 09:55:25 +00:00 · 2026-06-28 09:55:25 +00:00 · b3c419e108
commit b3c419e108
parent 385dfff0e7 9a1ab6247b
12 changed files with 564 additions and 41 deletions
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -43,9 +43,11 @@ small no matter how much traffic flows.
  history.
 - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
  (HTTP), both in `calico-system`.
- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
-  restarts whisker if its backend's Goldmane stream wedges (the operator gives
-  whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
+- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
+  by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
+  empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
+  The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
+  whisker if its backend ever wedges for another reason.

 ### CNPG `goldmane_edges` — durable
 - Postgres DB `goldmane_edges` on the CNPG cluster
@ -151,8 +153,22 @@ on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /

 ## How to query who-talks-to-whom

-`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
-exec a CNPG pod). All queries are against the single `edge` table.
+**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
+SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
+
+```
+homelab edges --ns <ns>         # edges touching <ns> (either direction)
+homelab edges --peers-of <ns>   # <ns>'s distinct peer namespaces
+homelab edges --src <ns>        # <ns>'s egress peers   (--dst <ns> for ingress)
+homelab edges --new-since 24h   # edges first seen in the last day (or a date)
+homelab edges --denied          # blocked / lateral-movement attempts
+homelab edges --json [...]      # machine-readable, for agents/pipelines
+homelab edges --help            # full flag list
+```
+
+For ad-hoc SQL, `psql` into the DB (creds: Vault static role
+`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
+the single `edge` table.

 ```sql
 -- Everything talking to a namespace (inbound), most-active first
@ -261,23 +277,30 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
 `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
 (expect a 302 to Authentik — the gate working).

-**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
-gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
-a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
-upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
-the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
-flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
-The operator ships whisker-backend with **no liveness probe**, so nothing
-restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
-auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
-*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
-real Goldmane outage). To heal immediately:
-`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
-it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
-and is unaffected — only the live UI goes blank. Confirm the diagnosis with
-`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
-the node's own DNS is usually fine (test with a throwaway pod pinned there:
-`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
+**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
+2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
+policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
+*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
+`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
+**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
+Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
+kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
+whisker-backend resolves goldmane ONCE in the brief startup window before the
+policy programs, holds its long-lived gRPC stream, and only re-resolves when that
+stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
+DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
+... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
+SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
+
+FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
+(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
+ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
+the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
+the pod if it ever wedges for another reason. Immediate manual heal:
+`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
+from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
+10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
+query aimed at a kube-dns *pod IP* (always works).

 **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
 pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).