calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.
Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.
Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
b84b0021c2
commit
a3eb309e26
3 changed files with 88 additions and 32 deletions
|
|
@ -43,9 +43,11 @@ small no matter how much traffic flows.
|
|||
history.
|
||||
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
|
||||
(HTTP), both in `calico-system`.
|
||||
- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
|
||||
restarts whisker if its backend's Goldmane stream wedges (the operator gives
|
||||
whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
|
||||
- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
|
||||
by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
|
||||
empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
|
||||
The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
|
||||
whisker if its backend ever wedges for another reason.
|
||||
|
||||
### CNPG `goldmane_edges` — durable
|
||||
- Postgres DB `goldmane_edges` on the CNPG cluster
|
||||
|
|
@ -261,23 +263,30 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
|
|||
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
|
||||
(expect a 302 to Authentik — the gate working).
|
||||
|
||||
**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
|
||||
gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
|
||||
a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
|
||||
upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
|
||||
the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
|
||||
flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
|
||||
The operator ships whisker-backend with **no liveness probe**, so nothing
|
||||
restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
|
||||
auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
|
||||
*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
|
||||
real Goldmane outage). To heal immediately:
|
||||
`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
|
||||
it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
|
||||
and is unaffected — only the live UI goes blank. Confirm the diagnosis with
|
||||
`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
|
||||
the node's own DNS is usually fine (test with a throwaway pod pinned there:
|
||||
`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
|
||||
**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
|
||||
2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
|
||||
policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
|
||||
*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
|
||||
`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
|
||||
**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
|
||||
Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
|
||||
kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
|
||||
whisker-backend resolves goldmane ONCE in the brief startup window before the
|
||||
policy programs, holds its long-lived gRPC stream, and only re-resolves when that
|
||||
stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
|
||||
DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
|
||||
... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
|
||||
SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
|
||||
|
||||
FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
|
||||
(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
|
||||
ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
|
||||
the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
|
||||
the pod if it ever wedges for another reason. Immediate manual heal:
|
||||
`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
|
||||
from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
|
||||
10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
|
||||
query aimed at a kube-dns *pod IP* (always works).
|
||||
|
||||
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
|
||||
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue