calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ... i/o timeout" forever, never reconnecting. The operator ships whisker-backend with NO liveness probe, so nothing restarted it; the live UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is a separate pod and was unaffected — only Whisker's ~60-min live view went dark.) Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe. Instead add a watchdog so this never needs a manual restart again: - whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding (calico-system only: pods get/list/delete, pods/log get). - It restarts the whisker pod only when whisker-backend logs >=10 goldmane- connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage). - Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors" and does not restart. Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal note; the stale 2026-06-25 "digest never posted" known-state block is updated to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md flow-trail bullet gains the whisker-wedge gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
c70810a51b
commit
8d1d2fb999
3 changed files with 148 additions and 9 deletions
|
|
@ -43,6 +43,9 @@ small no matter how much traffic flows.
|
|||
history.
|
||||
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
|
||||
(HTTP), both in `calico-system`.
|
||||
- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
|
||||
restarts whisker if its backend's Goldmane stream wedges (the operator gives
|
||||
whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
|
||||
|
||||
### CNPG `goldmane_edges` — durable
|
||||
- Postgres DB `goldmane_edges` on the CNPG cluster
|
||||
|
|
@ -258,6 +261,24 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
|
|||
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
|
||||
(expect a 302 to Authentik — the gate working).
|
||||
|
||||
**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
|
||||
gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
|
||||
a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
|
||||
upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
|
||||
the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
|
||||
flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
|
||||
The operator ships whisker-backend with **no liveness probe**, so nothing
|
||||
restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
|
||||
auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
|
||||
*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
|
||||
real Goldmane outage). To heal immediately:
|
||||
`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
|
||||
it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
|
||||
and is unaffected — only the live UI goes blank. Confirm the diagnosis with
|
||||
`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
|
||||
the node's own DNS is usually fine (test with a throwaway pod pinned there:
|
||||
`kubectl run dns-test --image=busybox:1.36 --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
|
||||
|
||||
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
|
||||
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
|
||||
Common causes, in order:
|
||||
|
|
@ -279,11 +300,12 @@ pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
|
|||
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
|
||||
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
|
||||
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
|
||||
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
|
||||
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
|
||||
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
|
||||
> `aggregate` Deployment; only the `#alerts` digest notification is affected.
|
||||
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
|
||||
> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
|
||||
> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
|
||||
> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
|
||||
> the `#security` channel override returning HTTP 404 — the shared
|
||||
> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
|
||||
> consolidating all Slack output to `#alerts` fixed it.
|
||||
|
||||
**No edges at all in the table.** Confirm Goldmane is enabled
|
||||
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue