From 8d1d2fb9999aee2aefaf7929a581fc22264e8ed5 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Sun, 28 Jun 2026 08:59:07 +0000
Subject: [PATCH] calico: add whisker-watchdog CronJob to self-heal a wedged
 whisker-backend
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream. That stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, the
pod's resolver briefly timed out on the kube-dns ClusterIP), and the Go gRPC
resolver then got WEDGED: it spammed "failed to stream flows" / "code =
Unavailable: dns ... i/o timeout" forever and never reconnected. The operator
ships whisker-backend with NO liveness probe, so nothing restarted it; the live
UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is
a separate pod and was unaffected; only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead, add a watchdog so a wedged backend no longer needs a manual restart:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend has logged >=10
  goldmane-connection errors in the last 11m AND Goldmane is Ready (the
  Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.
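
The decision rule in the bullets above can be sketched in isolation from
kubectl, so it can be tried on canned log lines. This is a minimal sketch, not
the script shipped in the CronJob: `count_wedge_errors` and `decide` are
hypothetical helper names, and the Ready status is passed in as a plain
True/False string rather than read from the cluster.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative and do not appear in the patch.

# Count the lines on stdin that match the goldmane-connection error signature.
# grep -c counts matching lines, so a line matching two patterns counts once.
count_wedge_errors() {
  grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true
}

# decide <error_count> <goldmane_ready>
# Prints "skip" when Goldmane is not Ready, "restart" at >=10 errors,
# and "healthy" otherwise.
decide() {
  errs=$1
  ready=$2
  if [ "$ready" != "True" ]; then
    echo skip
  elif [ "$errs" -ge 10 ]; then
    echo restart
  else
    echo healthy
  fi
}
```

Checking Goldmane readiness first means a real Goldmane outage yields "skip"
regardless of how many errors whisker-backend has logged.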

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .claude/CLAUDE.md                    |   2 +-
 docs/runbooks/goldmane-flow-trail.md |  32 +++++--
 stacks/calico/main.tf                | 123 ++++++++++++++++++++++++++-
 3 files changed, 148 insertions(+), 9 deletions(-)
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 4cd12d6c..117f4fe1 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -245,7 +245,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
 - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
-- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA**: whisker-backend has no operator liveness probe, so a transient CNI/DNS blip (e.g. a node reboot/upgrade) can wedge its Goldmane gRPC stream and leave the UI **empty** indefinitely (the aggregator, a separate pod, is unaffected) — the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min) auto-restarts it; manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
 
 ## Storage & Backup Architecture
diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md
index 51adaa8f..f6a93bc3 100644
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@@ -43,6 +43,9 @@ small no matter how much traffic flows.
   history.
 - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
   (HTTP), both in `calico-system`.
+- **Self-heal:** the `whisker-watchdog` CronJob (`stacks/calico`, every 10 min)
+  restarts whisker if its backend's Goldmane stream wedges (the operator gives
+  whisker-backend no liveness probe) — see Troubleshooting → "Whisker UI empty".
 
 ### CNPG `goldmane_edges` — durable
 - Postgres DB `goldmane_edges` on the CNPG cluster
@@ -258,6 +261,24 @@ brand-new ingress host is also invisible to LAN split-horizon until the hourly
 `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
 (expect a 302 to Authentik — the gate working).
 
+**Whisker UI empty (but reachable — 302s to Authentik fine).** whisker-backend's
+gRPC stream to `goldmane:7443` wedged. A transient CNI/DNS blip (e.g. right after
+a node reboot/upgrade — observed 2026-06-28 as k8s-node5 settled post-1.35.6
+upgrade: the pod's resolver started timing out on the kube-dns ClusterIP) drops
+the stream, and the Go gRPC resolver gets STUCK — it spams `failed to stream
+flows` / `code = Unavailable: dns ... i/o timeout` forever and never reconnects.
+The operator ships whisker-backend with **no liveness probe**, so nothing
+restarts it. The **`whisker-watchdog` CronJob** (`stacks/calico`, every 10 min)
+auto-heals this — it deletes the whisker pod when it sees ≥10 such errors in 11m
+*and* Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a
+real Goldmane outage). To heal immediately:
+`kubectl -n calico-system delete pod -l k8s-app=whisker` (the Deployment recreates
+it; a fresh pod reconnects cleanly). The durable **aggregator is a SEPARATE pod**
+and is unaffected — only the live UI goes blank. Confirm the diagnosis with
+`kubectl -n calico-system logs -l k8s-app=whisker -c whisker-backend --tail=20`;
+the node's own DNS is usually fine (test with a throwaway pod pinned there:
+`kubectl run dns-test --image=busybox:1.36 --restart=Never --overrides='{"spec":{"nodeName":"<node>"}}' --rm -it -- nslookup goldmane.calico-system.svc.cluster.local`).
+
 **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
 pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
 Common causes, in order:
@@ -279,11 +300,12 @@ pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
 empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
 ExternalSecret resolved. A dry run / smoke test: run the image with `args:
 ["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
-> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
-> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
-> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
-> `aggregate` Deployment; only the `#alerts` digest notification is affected.
-> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
+> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
+> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
+> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
+> the `#security` channel override returning HTTP 404 — the shared
+> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
+> consolidating all Slack output to `#alerts` fixed it.
 
 **No edges at all in the table.** Confirm Goldmane is enabled
 (`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf
index 1354190e..3c411ecb 100644
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@@ -22,7 +22,7 @@ resource "kubernetes_namespace" "calico_system" {
     name = "calico-system"
     labels = {
       name = "calico-system"
-# calico-system namespace is managed by tigera-operator — auto-update is
+      # calico-system namespace is managed by tigera-operator — auto-update is
       # incompatible (operator reverts DaemonSet image from its Installation CR).
       # "keel.sh/enrolled" = "true"
     }
@@ -161,8 +161,8 @@ resource "helm_release" "tigera_operator" {
     # render before their crds/ (which helm skips on upgrade) -> "ensure CRDs
     # are installed first". We instead enable them via the operator CRs applied
     # directly below (kubectl_manifest) now that the CRDs exist — see ADR-0014.
-    goldmane  = { enabled = false }
-    whisker   = { enabled = false }
+    goldmane = { enabled = false }
+    whisker  = { enabled = false }
     # 512Mi (was 256Mi): the operator idles at ~38Mi but its STARTUP spike
     # (re-listing resources to build informer caches) exceeded 256Mi and
     # OOM-crashlooped on 2026-06-23 the first time the pod restarted (a latent
@@ -274,3 +274,120 @@ resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
     }
   }
 }
+
+# ---------------------------------------------------------------------------
+# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
+#
+# FAILURE MODE: whisker-backend dials goldmane:7443 over a long-lived gRPC
+# stream. When that stream drops during a transient CNI/DNS blip (observed
+# 2026-06-28 right after k8s-node5's v1.35.6 upgrade settled — the pod's
+# resolver started timing out on the kube-dns ClusterIP), the Go client's
+# resolver gets WEDGED: it spams `failed to stream flows` /
+# `code = Unavailable: dns ... i/o timeout` forever and never reconnects, so
+# the Whisker UI shows EMPTY while the durable aggregator (a separate pod, same
+# Goldmane source) is unaffected. The operator ships whisker-backend with NO
+# liveness/readiness probe, so nothing restarts it — it sat broken until a
+# manual `kubectl delete pod`. Whisker is operator-managed (Whisker CR), so we
+# can't inject a probe; this watchdog CronJob restarts the pod instead.
+#
+# It restarts the pod ONLY when the wedged signature is present AND Goldmane is
+# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod
+# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md.
+resource "kubernetes_service_account" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+}
+
+# Namespaced Role (least privilege — only calico-system): read pod logs to
+# detect the wedge, delete the whisker pod to heal it.
+resource "kubernetes_role" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods"]
+    verbs      = ["get", "list", "delete"]
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods/log"]
+    verbs      = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.whisker_watchdog.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.whisker_watchdog.metadata[0].name
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+}
+
+resource "kubernetes_cron_job_v1" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  spec {
+    schedule                      = "*/10 * * * *"
+    successful_jobs_history_limit = 1
+    failed_jobs_history_limit     = 1
+    concurrency_policy            = "Forbid"
+    job_template {
+      metadata {
+        name = "whisker-watchdog"
+      }
+      spec {
+        template {
+          metadata {
+            name = "whisker-watchdog"
+          }
+          spec {
+            service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name
+            container {
+              name  = "watchdog"
+              image = "bitnami/kubectl:latest"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -eu
+                NS=calico-system
+                # Don't thrash if Goldmane itself is down — that's not a whisker bug.
+                if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \
+                     -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then
+                  echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0
+                fi
+                ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \
+                  | grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true)
+                ERRS=$${ERRS:-0}
+                if [ "$ERRS" -ge 10 ]; then
+                  echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod"
+                  kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found
+                else
+                  echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m"
+                fi
+              EOT
+              ]
+            }
+            restart_policy = "Never"
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}