calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend

Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 08:59:07 +00:00
parent c70810a51b
commit 8d1d2fb999
3 changed files with 148 additions and 9 deletions

@@ -43,6 +43,9 @@ small no matter how much traffic flows.
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
restarts whisker if its backend's Goldmane stream wedges (the operator gives
whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
@@ -258,6 +261,24 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
The operator ships whisker-backend with **no liveness probe**, so nothing
restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
real Goldmane outage). To heal immediately:
`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
and is unaffected — only the live UI goes blank. Confirm the diagnosis with
`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
DNS from a fresh pod on the same node usually works, so the fault is the wedged resolver, not the node (test with a throwaway pod pinned there:
`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
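The watchdog's decision rule described above can be sketched in shell. This is a minimal illustration, not the script shipped in `stacks/calico`: the function names, the grep patterns, and the `k8s-app=goldmane` label selector are assumptions. Only the threshold (≥10 errors), the 11m window, and the Goldmane-Ready guard come from this entry.

```shell
#!/bin/sh
# Hypothetical sketch of the whisker-watchdog decision; NOT the shipped script.
THRESHOLD=10   # runbook: restart at >=10 errors
WINDOW=11m     # runbook: look back 11 minutes

# Count stdin log lines matching the wedge signature (patterns are assumed).
count_stream_errors() {
  # grep -c exits 1 on zero matches but still prints 0, hence "|| true".
  grep -c -e 'failed to stream flows' -e 'code = Unavailable' || true
}

# $1 = error count, $2 = Goldmane Ready condition ("True" or anything else).
# Succeeds only when both the threshold and the Goldmane-Ready guard hold.
should_restart() {
  [ "$1" -ge "$THRESHOLD" ] && [ "$2" = "True" ]
}

# Wiring against a live cluster; defined but not invoked here.
watchdog_main() {
  errors=$(kubectl -n calico-system logs -l k8s-app=whisker \
    -c whisker-backend --since="$WINDOW" --tail=-1 | count_stream_errors)
  # Assumption: Goldmane pods carry the label k8s-app=goldmane.
  ready=$(kubectl -n calico-system get pods -l k8s-app=goldmane \
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}')
  if should_restart "$errors" "$ready"; then
    kubectl -n calico-system delete pod -l k8s-app=whisker
  else
    echo "no restart: errors=$errors goldmane_ready=$ready"
  fi
}
```

Calling `watchdog_main` would perform one check. The 11m window is slightly longer than the 10-minute schedule, so consecutive runs overlap and errors near a run boundary should not be missed.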
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
@@ -279,11 +300,12 @@ pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
> `aggregate` Deployment; only the `#alerts` digest notification is affected.
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
> the `#security` channel override returning HTTP 404 — the shared
> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
> consolidating all Slack output to `#alerts` fixed it.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the