infra

Viktor Barzin 8d1d2fb999	calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend	2026-06-28 08:59:31 +00:00
CI: ci/woodpecker/push/default pipeline was successful

Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver got WEDGED, spamming "failed to stream flows" / "code = Unavailable: dns ... i/o timeout" forever and never reconnecting. The operator ships whisker-backend with NO liveness probe, so nothing restarted it; the live UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is a separate pod and was unaffected; only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe. Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors" and does not restart.

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal note; the stale 2026-06-25 "digest never posted" known-state block is updated to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
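The restart rule described in the commit message (at least 10 goldmane-connection errors in the log window AND Goldmane Ready) can be sketched as a small pure function. This is an illustration only, not the watchdog's actual script, which this listing does not show. The names `ERROR_PATTERN`, `count_goldmane_errors`, and `should_restart` are hypothetical, and the match pattern is an assumption built from the two error strings quoted in the message.

```python
import re

# Assumption: the real watchdog's match patterns are not shown in the source.
# These two signatures are the error strings quoted in the commit message.
ERROR_PATTERN = re.compile(r"failed to stream flows|code = Unavailable")

# ">=10 goldmane-connection errors" per the commit message.
ERROR_THRESHOLD = 10


def count_goldmane_errors(log_text: str) -> int:
    """Count log lines that look like goldmane-connection errors."""
    return sum(1 for line in log_text.splitlines() if ERROR_PATTERN.search(line))


def should_restart(log_text: str, goldmane_ready: bool,
                   threshold: int = ERROR_THRESHOLD) -> bool:
    """Restart only when errors reach the threshold AND Goldmane is Ready.

    The Ready guard mirrors the commit message: during a real Goldmane
    outage the errors are expected, so restarting whisker would only thrash.
    """
    return goldmane_ready and count_goldmane_errors(log_text) >= threshold
```

Here `log_text` is assumed to hold roughly the last 11 minutes of whisker-backend logs (for example, fetched with `kubectl logs --since=11m`), and `goldmane_ready` the Goldmane pod's Ready condition. Keeping the decision separate from the cluster calls makes the threshold and the guard easy to check without a cluster.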
agents	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
commands	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
reference	monitoring: consolidate all Slack alerting to #alerts, abandon #security	2026-06-26 13:29:44 +00:00
scripts	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
skills	home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)	2026-06-24 22:03:15 +00:00
calendar-query.py	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
CLAUDE.md	calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend	2026-06-28 08:59:31 +00:00
home-assistant-sofia.py	homelab CLI v0.7: add `ha token` + `ha ssh` for Home Assistant	2026-06-20 23:46:09 +00:00
home-assistant.py	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
pfsense.py	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
settings.json	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00