infra

Author	SHA1	Message	Date
Viktor Barzin	9dcd3b0d5d	Merge remote-tracking branch 'forgejo/master' into wizard/stem95su-cutover All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-03 15:27:04 +00:00
Viktor Barzin	5367d4a055	paperless-mail-ingest: rules process inline attachments (Apple Mail lesson) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor's first real forward carried the invoice PDF with Content-Disposition: inline (Apple Mail does this for real documents), and the attachments-only rules consumed nothing — recorded PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5 rules to attachment_type=2 (process inline) via the API and documented the trade-off + the ProcessedMail unblock step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 15:25:44 +00:00
Viktor Barzin	21c6e7112e	stem95su: retire the in-cluster serving stack — now a Valia site on Pages Completes the ADR-0018 cutover. The stack is emptied to a tombstone so CI destroys nginx, the NFS content volume, the ingress, the per-site gdrive-sync CronJob and the namespace; serving + sync are owned by stacks/valia-sites since the cutover commits. Catalog + runbook updated to the migrated state (incl. the one-time 42.9→21.4MB video compression Viktor approved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 15:22:32 +00:00
Viktor Barzin	c1ee6863b3	mailserver docs: troubleshooting entry for the postsrsd 100%-CPU spin All checks were successful ci/woodpecker/push/default Pipeline was successful Details Hit during the docs@ rollout: after a pod restart postsrsd came up spinning without binding its TCP ports, so postfix cleanup tempfailed every message with 451 queue file write error. Document the signature and the supervisorctl-restart / pod-recreate fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:39:13 +00:00
Viktor Barzin	68b9858eff	paperless-mail-ingest runbook: manual mail_fetcher must drop to the paperless user All checks were successful ci/woodpecker/push/default Pipeline was successful Details A root-run kubectl exec mail_fetcher downloads attachments root-owned into the scratch dir and the celery consumer (uid 1000) fails with PermissionError — found during the build E2E. Document s6-setuidgid usage and the recovery step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:26:12 +00:00
Viktor Barzin	77fcb08e8e	mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist) Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor asked to forward arbitrary emails with PDF attachments into paperless-ngx, with the forwarding sender mapping 1:1 to the paperless account that owns the document. paperless-ngx's built-in IMAP consumer already does the sender->owner mapping, so the infra half is a dedicated real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain catch-all would otherwise divert it into the TripIt-swept spam@ mailbox, whose sweeper LLM-parses and auto-replies to mail from linked senders) plus a per-user Dovecot sieve that discards non-family senders at delivery (chosen behaviour for unmatched senders: ignore and delete; also keeps spam out of the guessable address). The mailbox credential was added to Vault secret/platform.mailserver_accounts. Paperless-side mail account + 5 per-sender rules are DB state, configured via the API per the new runbook docs/runbooks/paperless-mail-ingest.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 14:06:19 +00:00
Viktor Barzin	f5187806f9	ADR-0017: replace ASCII trunk diagram with excalidraw VLAN-tagging diagram All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wants the traffic-flow view as a colored excalidraw instead of the ASCII block (which was the only thing rendering after the earlier VLAN-tagging SVG commit failed to push — a locally-masked non-fast-forward this session, not a merge clobber). Ships both the editable .excalidraw scene and a hand-drawn-style SVG export embedded in the Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag is added, carried (only P5<->vmbr0) and stripped, L2 membership drops vs L3 firewall verdicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 13:21:59 +00:00
Viktor Barzin	316cdb7441	docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries All checks were successful ci/woodpecker/push/default Pipeline was successful Details Runbook covers add/update/retire (one map entry; internal DNS now cleans up after itself), content rules for Valia's folders, and the failure modes incl. both token re-mint paths. dns.md superset-rule paragraph now describes the declarative ConfigMap reconcile instead of hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row notes its Pages cutover is parked on the 42.9MB stem_video.mp4 exceeding the 25MB Pages per-file cap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:46:24 +00:00
Viktor Barzin	348f64d34d	ADR-0017: add physical-cabling diagram (wires only) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked for one diagram showing just the physical connections between nodes, separate from the logical/VLAN topology: ISP->AX6000, the in-wall apartment->garage run into P1, 4G router (cellular OOB), UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that everything else on the R730 is virtual. Referenced from the ADR next to the logical SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:40:29 +00:00
Viktor Barzin	126cf4c88e	Merge origin/master into wizard/cctv-adr-trunk All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-03 12:32:00 +00:00
Viktor Barzin	5d16a18cf4	ADR-0017: document trunk traffic semantics + ASCII topology While reviewing the single-switch design Viktor asked whether both the home LAN and the camera VLAN 'go via pfSense which forwards upstream' - a natural misreading a future reader would repeat. Added a section spelling out the vmbr0 fork: untagged home LAN is L2-bridged past pfSense (gateway stays the AX6000, rack outage does not affect it, OOB via 4G survives), while tagged-30 can only land on the dCCTV interface, making a pfSense bypass impossible by construction. Includes a compact ASCII topology for terminal readers alongside the SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:31:48 +00:00
Viktor Barzin	5c42155b81	docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync) Grill session with Viktor: his mother Valia will keep asking for 1-page site hosting, so the pattern is being made repeatable. Decisions: all Valia sites serve off-infra on Cloudflare Pages (survive homelab outages); one shared in-cluster CronJob mirrors her Drive folders every 10 min and redeploys on change; English subdomain names picked by Viktor; failed-Job-only visibility; stem95su migrates onto the pattern. CONTEXT.md gains Valia site / Content folder / Entry file; full rationale and rejected options in ADR-0018. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:17:45 +00:00
Viktor Barzin	e1bd111562	rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to rename the 'мост' school static site to 'bridge'. New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already deployed and the custom domain attached; this renames the public CNAME (TF resource most_pages -> bridge_pages, destroy+create swaps the record) and the internal split-horizon static CNAME in the ingress-dns-sync CronJob. The old 'most' Pages project and the stale internal 'most' record are removed out-of-band after this applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:52:30 +00:00
Viktor Barzin	7dd80b6c7c	technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The internal split-horizon zone is authoritative for viktorbarzin.me, so the new Cloudflare Pages site (most.viktorbarzin.me, added for Viktor's 'мост' school static site) NXDOMAINed for every internal client — LAN, VLANs and pods — while resolving fine externally. Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev) in the ingress-dns-sync CronJob next to the mail-auth records, and document the off-infra-site case in dns.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:10:46 +00:00
Viktor Barzin	217a54be9d	cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to host a static HTML site (the 'мост' school project, ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages with a custom domain, as a try-out of Pages hosting. The site content is deployed off-infra via wrangler to the Pages project 'most' (most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it. The custom domain is already attached to the Pages project and is waiting on this DNS record to validate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:06:33 +00:00
Viktor Barzin	be80ef23bb	ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable Viktor prefers not running two switches, so the TL-SG105PE takes over all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV segment moves onto a managed tagged trunk over the existing LAN1 cable: pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same MAC so vtnet3/dCCTV survived untouched). This is safe where the original 802.1Q rejection was not, because the managed switch is the only device on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the documented fallback. Old SG105E retires to cold spare; PE inherits 192.168.1.6. Glossary Segment term updated (all three segments are now bridge-tags feeding untagged pfSense vNICs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 09:15:52 +00:00
Viktor Barzin	e11bd6e893	ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere Viktor asked to verify free ports on the garage switch (192.168.1.6) before finalizing. Logging into it showed it is NOT the TL-SG105PE from the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use (apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch port-VLAN design written earlier today was based on conflating the two devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2 uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched, and no VLAN config exists anywhere. ADR, topology SVG and networking.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 08:37:15 +00:00
Viktor Barzin	b761701994	ADR-0017: add network topology diagram (SVG) next to the decision All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked for a reviewable network visualization committed alongside the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 -> pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP, NTP-only egress, default deny), and the dashed camera-day steps (patch cable, cat6 run, AX6000 static route). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:25:28 +00:00
Viktor Barzin	248e186dce	CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor and emo are adding the first owned camera at the Sofia site (HiLook IPC-T241H-C watching the garage / server rack). Viktor asked to finalize emo's plan; the grilling session resolved emo's five open decisions and replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24), port-based VLAN split on the shared TL-SG105PE, camera default-deny with NTP-only egress, Frigate + ha-sofia as the only consumers. The PVE bridge, pfSense interface, Kea subnet and firewall rules were applied live this session (hand-managed hosts, backed up). This commit records the decision (ADR-0017), the glossary terms (Segment / CCTV segment), the as-built architecture doc, and bumps Frigate's ADR-0016 VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:01:45 +00:00
Viktor Barzin	4c532dbf97	devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned 12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G) band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel throttled every task in the cgroup indefinitely (memory.pressure full ~80%, oom_kill never fired) - the t3 event loop starved, the accept queue rotted, and the terminal was dead until the hog was SIGKILLed by hand. The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog that stabilises between high and max never OOMs, so the throttle band is a livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh is now explicitly infinity on all three work cgroup definitions (t3-serve@ unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3 server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged. Applied live to the devvm the same day (daemon-reload + runtime set-property on running cgroups, no session restarts). Post-mortem addendum + runbook updated in the same commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 16:59:38 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	8fc657f431	excalidraw: migrate image build to GHA -> private ghcr (ADR-0002) The image was still built by hand and pushed to DockerHub (v1..v4), predating the all-builds-off-infra doctrine; Viktor chose to move it onto the standard pipeline while shipping the export/rename feature rather than keep the manual flow. Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml (go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns added to the Kyverno ghcr-credentials allowlist (package is PRIVATE), deployment now pins ghcr :latest with pullPolicy Always + pull secret, Keel force/match-tag/5m annotations seed the metadata (live values win via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image lists updated (also backfilled the missing k8s-portal rows in ci-cd.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:23 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	88c86e2109	ci: Slack-notify failed pipeline runs only All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor doesn't want a Slack message for every CI run — only failures. The infra apply pipeline posted a status line to #general on every push, and the renew-tls / postmortem-todos / registry-config-sync / pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine messages a week). Now: the apply pipeline's success post is gone (notify-failure already covers failures), all cron notifies are status:[failure] with explicit FAILED texts, and drift-detection is silent when all stacks are clean (still posts drift findings and errors, and gains a hard-failure catch step it previously lacked). Kept: notify-nonadmin-push (org audit feed) and the actionable provision-user post. Per-app deploy template in ci-cd.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:27:43 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/ + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	308a174ad6	docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation All checks were successful ci/woodpecker/push/default Pipeline was successful Details PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP (10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc still listed only four IPs in use / three dedicated. Add the .204 row to the allocation table, bump the counts (five in use, four dedicated, 5-IP layout), and add a LB-IP renumber-checklist entry for the out-of-band consumers (the go2rtc WebRTC candidate on the frigate config PVC and the HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE candidates, so the Service annotation is the single source of truth. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:42:27 +00:00
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes Some checks failed ci/woodpecker/push/default Pipeline was canceled Details On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
Viktor Barzin	279b88d2bc	docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk) All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status CR (immutable status.node) flapped the PG load-balancer VIP and silently broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error "Cannot read PG creds" masked the real cause for ~25 days). Written when the incident closed (beads code-aoxk, 2026-05-26) but never committed; landing it so the RCA + stuck-CR cleanup procedure live in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:25:10 +00:00
Viktor Barzin	2e50c1235c	chrome-service: grant emo shared browser access (noVNC + homelab browser CLI) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to give emo access to the cluster's headed Chrome so he can fill in forms and get past anti-bot / captcha pages. emo was deliberately locked out of chrome-service (noVNC Authentik allowlist was Viktor-only + his power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE his existing browser rather than stand up an isolated per-user instance, accepting that emo can therefore reach Viktor's warmed logged-in sessions (CDP has no per-context auth, so the single shared persistent profile is reachable by anyone who can drive the browser). emo's CLI use is hands-off (his agent can run it unattended). - authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED so the admin-services-restriction policy admits him to chrome.viktorbarzin.me (noVNC). Reverses the prior Viktor-only lock; comment updated to record why. - chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token (dashboard-sa.tf pattern), a chrome-service-portforward Role granting pods/portforward, and a cluster read-only binding (oidc-power-user-readonly) so the SA can resolve the Service and emo's normal read access doesn't regress. - t3-provision-users.sh: install_browser_kubeconfig installs a dual-context kubeconfig for any user with a <user>-browser SA — SA token as the default context (non-interactive, works headless), personal OIDC retained as the oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the headless agent session that homelab browser needs. - docs/architecture/chrome-service.md: document the shared-browser multi-user access model, the session-exposure trade-off, and how to grant/revoke a user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:20:07 +00:00
Viktor Barzin	250d0fc334	docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Old-browser users on the SFE who have a password but no MFA device hit the default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error). emo (Google-only, iPadOS 15) hit this on the password path. Document the two no-MFA-downgrade fixes: (1) social login, whose source flow (default-source-authentication) has no MFA stage, so the SFE's social button always completes; (2) enrolling TOTP, which the SFE can validate (unlike WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was enrolled for emo and stored in his Vaultwarden authentik item; verified end-to-end (a Bitwarden-generated code is accepted by authentik). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:24:40 +00:00
Viktor Barzin	e518ada3d4	authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs All checks were successful ci/woodpecker/push/default Pipeline was successful Details global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets the SFE too, and the SFE login shows social-login buttons (emo is Google-only with no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md + authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:26 +00:00
Viktor Barzin	6ba60cbb2d	authentik: repoint to overlay patch2 (SFE for old Safari) + docs All checks were successful ci/woodpecker/push/default Pipeline was successful Details global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:39:29 +00:00
Viktor Barzin	ec681ba6e1	ci(infra): stop double-apply + stop counting PG lock-waits as failures All checks were successful ci/woodpecker/push/default Pipeline was successful Details The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going red ~20% of the time. Root causes (verified from the failure logs, not guessed): 1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82) AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every push. The two applies race each other for the per-stack PG state lock → "Error acquiring the state lock" failures + push-supersede "killed" runs. 2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state lock") fell through and was counted as a hard FAILURE. 3. Transient provider-registry download timeouts (and Vault 5xx) failed the whole pipeline with no retry. Fixes (all in default.yml): - Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons (they live on repo 1), so we de-dup the apply without deactivating the registration. Fail-open on unknown forge. - Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED. - Bounded retry (3x) ONLY on transient signatures (provider download timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast. Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock failures; reproduced `terraform validate` passing the exact stacks that fail at apply) and lock-reaping/force-unlock (PG advisory locks are session-scoped + auto-release; force-unlock can't free them and would corrupt a live concurrent apply). Shell logic + the classification regexes were unit-tested locally against the real decoded error strings (#359 PG lock, #353 provider timeout, #360 missing-arg, helm atomic timeout); `bash -n` clean; YAML parses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:37:18 +00:00
Viktor Barzin	e03e4719ad	vault: distinguish Vaultwarden vs HashiCorp Vault, add `vault kv` `homelab vault` only spoke to Vaultwarden (the password manager), but the name reads as HashiCorp Vault (the infra secrets store — actually OpenBao here). Make the two unmistakable and support both. Distinction (no breakage — the existing Vaultwarden verbs are unchanged): - bare `homelab vault` help now LEADS with the two-stores split; - every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`; - HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group. New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store): - `kv get <path> [--field K]` — read; --field → one value (TTY-aware clipboard/stdout), no field → full secret JSON (refuses a bare TTY). - `kv list <path>` — list sub-paths (no values). - `kv put <path> <key>` — write one key; value via stdin (piped or no-echo prompt, never argv); creates the path or merges (never clobbers siblings; uses kv patch -method=rw so no `patch` cap needed). Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token / $VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR but never inject the scoped token. Access is whatever the policy grants. Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg builders/kvGet/List/Put; file header documents the credential split). CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2 envelope strip, help names both systems. Verified e2e against live Vault (read key-names-only + a scratch put/merge/cleanup). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:09:33 +00:00
Viktor Barzin	3d948c7033	Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held	2026-06-28 10:09:42 +00:00
Viktor Barzin	2880fe1c29	docs: update k8s-version-upgrade runbook for actionable-vs-held gate Reflect the classification change in the operational runbook: the gate's three refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:09:34 +00:00
Viktor Barzin	eebb6c8594	k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case The nightly upgrade chain detected 1.36, the preflight compat-gate refused it, and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY night — even though the block is unactionable (no kyverno/ESO release supports 1.36 yet, and gpu-operator is pinned to its current version because bumping it needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. compat-gate.py now classifies each blocker: - ACTIONABLE: a newer addon version in addon-compat.json supports the target -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the nightly report). - WAITING-on-upstream: no released version supports the target yet -> held. - PINNED: addon marked pinned in the matrix (gpu-operator) -> held. Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert. Tidy the block path (Viktor's scope choice): deliberate gate decisions now make the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge is pushed definitively once per run (no 1->0->1 flap that re-notifies). The detector re-spawns a refused-but-Complete preflight nightly (silently) so a standing hold still re-evaluates, and only announces genuine new/Failed spawns. nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class. gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason). Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned; Calico the lone actionable piece) — no nightly Failed Job, no alert, just the morning report's HELD line. 
Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:08:20 +00:00
Viktor Barzin	ccee443790	vault: add `get --all` to browse every field of an item `homelab vault get` could only fetch one of five allow-listed fields and had no way to see what fields an item even has — in particular it could not reach arbitrary user-defined custom fields. Add a `--all` flag that dumps the whole item as a normalized JSON object (`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a Claude session can discover and read every field, custom ones included, in a single call. Security model preserved: - Like `get --json`, the dump is all secret values, so it refuses a bare TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout. - The TOTP seed is reduced to a presence flag (`"totp": true`) and never emitted — the seed is more powerful than a one-time code, so the only seed-derived path stays the specially-audited `vault code`. Tests assert the seed and password-history never appear in the dump. - Op-log uses a distinct `get-all` verb (item name still never logged) so a bulk dump is distinguishable from a single-field read. `normalizeItem` is a pure, unit-tested core; `getItem` is the session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog, onboarding runbook, design spec §16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:49 +00:00
Viktor Barzin	afcd463f39	k8s-upgrade: design doc for actionable-vs-held compat-gate classification The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked every night for the 1.36 target, even though the block is unactionable: no kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned (NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. This documents the design: classify each blocker as actionable / waiting-upstream / pinned, keep the alert only for actionable, quiet the held case to the nightly report, and make deliberate gate decisions Complete cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:01:36 +00:00
Viktor Barzin	b3c419e108	Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix	2026-06-28 09:55:25 +00:00
Viktor Barzin	9a1ab6247b	cli: add `homelab edges` — who-talks-to-whom investigation helper (v0.9.0) Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident investigations without remembering the DB/creds/SQL. New top-level verb: homelab edges --ns <ns> edges touching <ns> (either direction) homelab edges --src/--dst <ns> directional egress / ingress peers homelab edges --peers-of <ns> distinct peer namespaces of <ns> homelab edges --new-since 24h first seen since a duration or date (YYYY-MM-DD) homelab edges --denied only action='deny' (blocked / lateral movement) homelab edges --json --limit N machine-readable / row cap (default 200) Filters render to a single read-only SELECT against the `edge` table, run via the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are validated to the k8s name charset (injection guard) before they reach SQL. TDD: edges_test.go covers flag parsing, query building (each filter, AND combination, peers-of shape, JSON wrapper), the new-since duration/date parser, and namespace-validation / injection rejection. Smoke-tested live: --peers-of, --new-since 24h, --denied, and --json all return correct rows. Docs: runbook query section now leads with the CLI; cli/README gains a v0.9 section. VERSION v0.8.2 -> v0.9.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:51:41 +00:00
Viktor Barzin	a3eb309e26	calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog added in `8d1d2fb9` was treating a symptom). The tigera operator's own `whisker` NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the kube-dns pods (podSelector k8s-app=kube-dns). But whisker-backend resolves goldmane.calico-system.svc via the kube-dns ClusterIP (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule. Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves fine; a test pod with the operator's podSelector-only egress rule reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to 100% ok. whisker-backend resolves goldmane once in the brief startup window before the policy programs, holds its long-lived gRPC stream, and only re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable aggregator (separate pod, unrestricted namespace) was never affected. Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip (whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop (repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace list. Docs (runbook + CLAUDE.md) updated to the real root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:32:28 +00:00
Viktor Barzin	385dfff0e7	authentik: fix episodic blank-screen + 30s-hang login (reliability R2) The login screen would sometimes hang/blank for everyone for ~30s at a time. Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3 goauthentik-server pods dropped out of the Service at once, so Traefik had no healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` — so live ran the chart-default 25%/25% and dropped a pod out of rotation on every roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on PostgreSQL and request-serving is coupled to PG — verified there is no external-cache option to put back, so a SHORT transient is now survived but a total CNPG outage still takes authentik down.) Reliability package (R2, approved): - readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover reconnect without dropping the whole fleet from the Service. - rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key) and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready. - gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9 workers' recycles don't cluster on a DB blip. - / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000) from the previous commit (skip_default_rate_limit) — fixes the cold-load 429 blank screen. Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200, so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md (also corrected a stale "60s persistent DB connections" note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:17:05 +00:00
Viktor Barzin	b84b0021c2	authentik: dedicated rate-limit carve-out + per-router 5xx observability Unauthenticated users were getting a blank login screen (and the screen would sometimes just hang). Root-caused via a read-only fan-out + adversarial verify: the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was the only first-party SPA still on the default limiter (8 siblings already have a carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket). - traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000, mirroring the existing health/tripit carve-outs). The authentik / and /static ingresses switch to it in the authentik-stack commit. - monitoring: the `traefik` scrape job's drop-regex was a blanket `traefik_router_.`, which also dropped `traefik_router_requests_total` — so per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable. Narrowed it to keep the counter while still dropping the high-cardinality `_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh` for the episodic all-3-server-pods-NotReady 502/503/504 cascade. Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:10:34 +00:00
Viktor Barzin	65a09dcbc4	docs(homelab-vault): rebuild snippet uses cli/VERSION, not git describe The onboarding runbook's "rebuild the binary" command stamped the version from `git describe --tags --always`, but setup-devvm.sh stamps it from `cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the describe form silently produced a bare commit sha — diverging from what a provisioner reconcile stamps. Match the canonical source. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:05:49 +00:00
Viktor Barzin	c53e7839e1	Merge remote-tracking branch 'origin/master' into wizard/vault-addr-default	2026-06-28 09:04:43 +00:00
Viktor Barzin	0525f0b12d	homelab vault: self-default VAULT_ADDR + prefer scoped token over ~/.vault-token Setting up emo's Bitwarden access via `homelab vault`, his one-time `homelab vault setup` failed with an opaque "exit status 2". Two latent CLI bugs, both of which any non-admin AFK invocation can hit: 1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient value. It IS in /etc/environment (login shells), but emo runs his agents from long-lived tmux / non-login shells that never sourced it, so every `vault` child hit the 127.0.0.1:8200 default -> connection refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI now does the same. 2. Token precedence was env > ~/.vault-token > scoped. A power-user who ran `vault login -method=oidc` carries a read-only ~/.vault-token (policy `default`, capability `deny` on their workstation path), which shadowed the purpose-built scoped token -> 403 permission denied on the user's OWN path. This tool only ever touches secret/workstation/claude-users/<user>, which the scoped token covers exactly, so precedence is now env > scoped > ~/.vault-token. Verified the scoped tokens for both emo and wizard hold create/read/update on their own paths, so admins are unaffected. Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry the real message (connection refused / permission denied) instead of a bare "exit status N" — without that, (1) and (2) were indistinguishable. Verified end-to-end as emo (VAULT_ADDR unset + his read-only ~/.vault-token present): writeCreds now succeeds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:04:28 +00:00
Viktor Barzin	8d1d2fb999	calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ... i/o timeout" forever, never reconnecting. The operator ships whisker-backend with NO liveness probe, so nothing restarted it; the live UI stayed blank until a manual `kubectl delete pod`. (The durable aggregator is a separate pod and was unaffected — only Whisker's ~60-min live view went dark.) Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe. Instead add a watchdog so this never needs a manual restart again: - whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding (calico-system only: pods get/list/delete, pods/log get). - It restarts the whisker pod only when whisker-backend logs >=10 goldmane-connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard avoids restart-thrash during a real Goldmane outage). - Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors" and does not restart. Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal note; the stale 2026-06-25 "digest never posted" known-state block is updated to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md flow-trail bullet gains the whisker-wedge gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 08:59:31 +00:00
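
The actionable-vs-held classification described in eebb6c8594 can be sketched roughly as follows. This is a minimal illustration, not the contents of compat-gate.py. The `Addon` fields, the `classify` and `gate_exit_code` names, the pinned-before-actionable ordering, and exit code 0 for a clean pass are all assumptions made for the example. Only the three classes, exit 2 for actionable, exit 4 for held, and "held wins on a mix" come from the commit message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class BlockClass(Enum):
    ACTIONABLE = "actionable"  # a newer addon version supports the target
    WAITING = "waiting"        # no released version supports the target yet
    PINNED = "pinned"          # addon deliberately pinned in the matrix


@dataclass
class Addon:
    # Hypothetical shape; the real addon-compat.json schema is not shown in the log.
    name: str
    current_supports_target: bool
    newer_supports_target: bool = False
    pinned: bool = False


def classify(addon: Addon) -> Optional[BlockClass]:
    """Return the blocker class, or None if the addon does not block the upgrade."""
    if addon.current_supports_target:
        return None
    if addon.pinned:  # assumption: a pin outranks an available newer version
        return BlockClass.PINNED
    if addon.newer_supports_target:
        return BlockClass.ACTIONABLE
    return BlockClass.WAITING


def gate_exit_code(addons: List[Addon]) -> int:
    """0 = proceed (assumed), 2 = actionable block, 4 = held (held wins on a mix)."""
    classes = [c for c in map(classify, addons) if c is not None]
    if not classes:
        return 0
    if any(c in (BlockClass.WAITING, BlockClass.PINNED) for c in classes):
        return 4
    return 2


# The 1.36 situation from the commit message: kyverno/ESO waiting,
# gpu-operator pinned, Calico the lone actionable piece.
addons_1_36 = [
    Addon("kyverno", current_supports_target=False),
    Addon("external-secrets", current_supports_target=False),
    Addon("gpu-operator", current_supports_target=False, pinned=True),
    Addon("calico", current_supports_target=False, newer_supports_target=True),
]
print(gate_exit_code(addons_1_36))  # prints 4: held wins on the mix
```

Returning a single exit code per run, rather than one per blocker, matches the commit's note that the gauge is pushed once per run so alerts do not flap.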

456 commits