calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend

Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ... i/o timeout" forever, never reconnecting. The operator ships whisker-backend with NO liveness probe, so nothing restarted it; the live UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is a separate pod and was unaffected — only Whisker's ~60-min live view went dark.) Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe. Instead add a watchdog so this never needs a manual restart again: - whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding (calico-system only: pods get/list/delete, pods/log get). - It restarts the whisker pod only when whisker-backend logs >=10 goldmane- connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage). - Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors" and does not restart. Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal note; the stale 2026-06-25 "digest never posted" known-state block is updated to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md flow-trail bullet gains the whisker-wedge gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:07 +00:00 · 2026-06-28 08:59:07 +00:00 · 8d1d2fb999
commit 8d1d2fb999
parent c70810a51b
3 changed files with 148 additions and 9 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@ -245,7 +245,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
 - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA**: whisker-backend has no operator liveness probe, so a transient CNI/DNS blip (e.g. a node reboot/upgrade) can wedge its Goldmane gRPC stream and leave the UI **empty** indefinitely (the aggregator, a separate pod, is unaffected) — the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) auto-restarts it; manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).

 ## Storage & Backup Architecture