monitoring: re-trigger apply to persist state after CI cancel-race

No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`. The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources applied (probes green, rules loaded) but the Terraform state write and the helm release finalize were lost, leaving the prometheus release stuck in pending-upgrade (manually unstuck). This commit re-applies the unchanged monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fire-planner: solve FIRE targets to age 100 (horizon 60→72)
2026-06-28 16:58:49 +00:00 · 2026-06-28 16:49:20 +00:00 · 2026-06-28 16:46:30 +00:00 · 2026-06-28 16:25:10 +00:00 · 2026-06-28 16:14:42 +00:00 · 2026-06-28 15:39:17 +00:00
163 changed files with 12834 additions and 4484 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -16,6 +16,7 @@
 **ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.

 - **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
+- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
 - **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
 - **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
 - **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@@ -203,7 +204,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
 - **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
-- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
 - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
 - **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
@@ -218,7 +219,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
-| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
+| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
 | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@@ -231,9 +232,10 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
-- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
-- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
+- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
+- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).

 ## Security Posture (Wave 1 — locked 2026-05-18)

@@ -241,9 +243,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se

 - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
-- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).

 ## Storage & Backup Architecture
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,6 +13,8 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
+| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
+| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |

 ## Storage & Security (Tier: cluster)
@@ -37,6 +39,7 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
+| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
+| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---

 # Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).

 ### SSH Access
 ```bash
--- a/.github/workflows/build-authentik.yml
+++ b/.github/workflows/build-authentik.yml
@ -0,0 +1,39 @@
+name: Build Custom Authentik Image
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
+# Thin SLOW-1a overlay over the official authentik server (narrows the login
+# identification stage's select_subclasses() to the login-capable source subtypes;
+# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
+# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
+# in modules/authentik/values.yaml together.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/authentik/Dockerfile'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/authentik
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
+            ghcr.io/viktorbarzin/authentik-server:latest
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@ -65,6 +65,21 @@ steps:
      # don't need explicit token propagation.
      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
    commands:
+      # ── Forge guard: apply ONLY on the canonical Forgejo forge ──
+      # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
+      # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
+      # guard both run `terragrunt apply` on every push and race each other for
+      # the per-stack PG state lock — the dominant cause of the "Error acquiring
+      # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
+      # registration keeps running the CRONS (drift-detection, renew-tls, …) — only
+      # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
+      # env var set) still applies, preserving prior behaviour.
+      - |
+        if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
+          echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
+          exit 0
+        fi
+
      # ── Skip CI commits ──
      - |
        if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -213,23 +228,40 @@ steps:
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
+            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
+            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
+            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
+            # (so the app-stack detector still excludes it) but skipped here.
+            # (2026-06-27 — see docs/architecture/ci-cd.md)
+            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
            echo "[$stack] Starting apply..."
+            ATTEMPT=0
+            while :; do
+              ATTEMPT=$((ATTEMPT + 1))
              set +e
              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
              EXIT=$?
              set -e
-            if [ $EXIT -ne 0 ]; then
-              if echo "$OUTPUT" | grep -q "is locked by"; then
-                echo "[$stack] SKIPPED (locked by another session)"
-              else
-                echo "$OUTPUT" | tail -50
-                echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
+              if [ $EXIT -eq 0 ]; then
+                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
-            else
-              echo "$OUTPUT" | tail -3
-              echo "[$stack] OK"
+              # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
+              # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
+              # ("Error acquiring the state lock" / "already locked"). The PG case
+              # was previously counted as a failure — the #1 source of false reds.
+              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
+                echo "[$stack] SKIPPED (locked by another session/run)"; break
              fi
+              # Transient: provider-registry download timeout / Vault 5xx → bounded
+              # retry. Deliberately NOT helm atomic-timeouts or config errors
+              # (missing arg, invalid index) — those must fail fast, retry can't fix
+              # them and can worsen a stuck helm release.
+              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
+                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
+              fi
+              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
+              FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
+            done
          done < .platform_apply
        fi
        # Deferred until after app stacks so both lists get a chance to run.
@ -242,22 +274,27 @@ steps:
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
            echo "[$stack] Starting apply..."
+            ATTEMPT=0
+            while :; do
+              ATTEMPT=$((ATTEMPT + 1))
              set +e
              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
              EXIT=$?
              set -e
-            if [ $EXIT -ne 0 ]; then
-              if echo "$OUTPUT" | grep -q "is locked by"; then
-                echo "[$stack] SKIPPED (locked by another session)"
-              else
-                echo "$OUTPUT" | tail -50
-                echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
+              if [ $EXIT -eq 0 ]; then
+                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
-            else
-              echo "$OUTPUT" | tail -3
-              echo "[$stack] OK"
+              # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
+              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
+                echo "[$stack] SKIPPED (locked by another session/run)"; break
              fi
+              # Transient provider-download / Vault 5xx → bounded retry (see platform loop).
+              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
+                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
+              fi
+              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
+              FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
+            done
          done < .app_apply
        fi
        # Fail the step loudly so the pipeline `default` workflow state
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@ -85,6 +85,13 @@ steps:
          stack=$(basename "$stack_dir")
          [ -f "$stack_dir/terragrunt.hcl" ] || continue

+          # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
+          # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
+          # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
+          # run. Skip it — drift on Tier-0 vault is caught at human apply time.
+          # (2026-06-27)
+          [ "$stack" = "vault" ] && continue
+
          echo -n "[$stack] planning... "
          OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
          EXIT=$?
--- a/AGENTS.md
+++ b/AGENTS.md
@ -273,8 +273,11 @@ To land a finished change from such a clone:
   Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
 4. Leave the clone on clean `master` so auto-refresh keeps working.
 5. Tell the user in plain language what happened. Stack changes are
-   auto-applied by CI — verify the live result with the user's read-only
-   kubectl before saying "it's live".
+   auto-applied by CI on push — or, with apply access, applied locally yourself
+   (`scripts/tg apply`, from the main checkout, not a worktree); either path is
+   fine, but the change must always be committed here, never applied
+   uncommitted. Verify the live result with the user's read-only kubectl before
+   saying "it's live".

 If a push to `master` is rejected by branch protection (user not on the
 whitelist — e.g. new users before Viktor grants it), fall back to a
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
 _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.

 **Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
 _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).

 ### Storage
--- a/cli/README.md
+++ b/cli/README.md
@ -202,6 +202,69 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove
 CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
 and `docs/adr/0013`.

+### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
+
+Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
+filters render to a single safe `SELECT` (namespace values validated to the k8s
+name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
+| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
+| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
+| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
+| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
+| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
+
+### v0.10 — `vault get --all` (browse every field)
+
+`vault get <name> --all` returns the **whole item** as a normalized JSON object,
+so an agent can discover and read fields the single-field `--field` allowlist
+can't reach — notably arbitrary **custom fields**.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
+
+Shape notes: only populated standard fields appear (empty ones are omitted); `fields` is a
+custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
+The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the
+only seed-derived path stays the specially-audited `vault code`. Like
+`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
+it (`homelab vault get <name> --all | jq`).
+
+### v0.10.1 — vault reads run `bw sync` first (always fresh)
+
+Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
+sync` when opening its session, so it reflects the latest server-side values.
+`bw unlock` only decrypts the *local* cache, so without this a persisted
+(already-logged-in) session served stale data — a password changed in the web
+vault wouldn't show up until the next login. The sync is **best-effort**: a
+transient failure warns on stderr and falls back to the cached vault rather than
+failing the read.
+
+### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
+
+`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
+`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
+
+- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
+- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
+| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
+| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
+
+**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
+(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
+(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
+handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
+its own path). Access is whatever your policy grants. Writes are merge-only:
+full-replace writes and `delete` are out of scope, so use the raw `vault` CLI.
+
 ## Build / install

 Built from source to `/usr/local/bin/homelab` during devvm provisioning
--- a/cli/VERSION
+++ b/cli/VERSION
@ -1 +1 @@
-v0.8.1
+v0.11.0
--- a/cli/cmd_edges.go
+++ b/cli/cmd_edges.go
@ -0,0 +1,69 @@
+package main
+
+import "fmt"
+
+func edgesCommands() []Command {
+	return []Command{
+		{Path: []string{"edges"}, Tier: TierRead,
+			Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
+			Run:     edgesRun},
+	}
+}
+
+// edgesRun renders the filter flags to SQL and runs it read-only against the
+// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
+func edgesRun(args []string) error {
+	for _, a := range args {
+		if a == "-h" || a == "--help" {
+			fmt.Print(edgesUsage())
+			return nil
+		}
+	}
+	o, err := parseEdgesArgs(args)
+	if err != nil {
+		return fmt.Errorf("%w\n\n%s", err, edgesUsage())
+	}
+	sql, err := buildEdgesQuery(o)
+	if err != nil {
+		return err
+	}
+	// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
+	pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
+		"-o", "jsonpath={.items[0].metadata.name}")
+	if err != nil || pod == "" {
+		return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
+	}
+	exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
+	if o.asJSON {
+		exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
+	} else {
+		exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
+	}
+	return kubectlStream("dbaas", exec...)
+}
+
+func edgesUsage() string {
+	return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
+
+Usage: homelab edges [filters]
+
+Filters (AND-combined; namespace values are validated to the k8s name charset):
+  --ns NAME         edges touching NAME (either direction)
+  --src NAME        edges where source namespace = NAME
+  --dst NAME        edges where destination namespace = NAME
+  --peers-of NAME   distinct peer namespaces of NAME (both directions)
+  --new-since SPEC  first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
+  --denied          only denied (action='deny') edges — blocked / lateral-movement attempts
+  --json            output a JSON array (for agents/pipelines)
+  --limit N         cap rows (default 200)
+
+Examples:
+  homelab edges --ns immich                # everything immich talks to / is talked to by
+  homelab edges --peers-of authentik       # authentik's peer namespaces
+  homelab edges --src recruiter-responder  # that namespace's egress peers
+  homelab edges --new-since 24h            # edges first seen in the last day
+  homelab edges --denied --json            # blocked flows, machine-readable
+
+Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
+`
+}
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error {
 		return nil
 	}
 	for _, m := range r.Memories {
-		c := strings.ReplaceAll(m.Content, "\n", " ")
-		if len(c) > 240 {
-			c = c[:240] + "…"
-		}
+		c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
 		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
 		if m.Tags != "" {
 			fmt.Printf("       tags: %s\n", m.Tags)
@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error {
 	return nil
 }

+// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
+// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
+// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
+// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
+// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
+// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
+// hook error" for Cyrillic-language users.
+func truncatePreview(s string, maxRunes int) string {
+	r := []rune(s)
+	if len(r) <= maxRunes {
+		return s
+	}
+	return string(r[:maxRunes]) + "…"
+}
+
 func memoryRecall(args []string) error {
 	req := memRecallReq{}
 	jsonOut := false
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -4,6 +4,7 @@ import (
 	"bufio"
 	"encoding/base64"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"os"
 	"os/exec"
@ -15,43 +16,60 @@ import (
 // Identity is the kernel UID; per-user creds live in that user's isolated Vault
 // path (secret/workstation/claude-users/<user>) read via their scoped token, and
 // decryption is done by the official `bw` CLI. See
-// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
+// docs/runbooks/homelab-vault-onboarding.md.
 func vaultCommands() []Command {
-	return []Command{
+	cmds := []Command{
+		// Vaultwarden — your personal password manager (logins/passwords/TOTP).
 		{Path: []string{"vault", "setup"}, Tier: TierWrite,
-			Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
+			Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
 		{Path: []string{"vault", "status"}, Tier: TierRead,
-			Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
+			Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
 		{Path: []string{"vault", "list"}, Tier: TierRead,
-			Summary: "list your item names: vault list [--search Q]", Run: vaultList},
+			Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
 		{Path: []string{"vault", "get"}, Tier: TierRead,
-			Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
+			Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
 		{Path: []string{"vault", "search"}, Tier: TierRead,
-			Summary: "search your item names: vault search <query>", Run: vaultSearch},
+			Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
 		{Path: []string{"vault", "code"}, Tier: TierRead,
-			Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
+			Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
 		{Path: []string{"vault", "lock"}, Tier: TierWrite,
-			Summary: "lock/log out the local bw session", Run: vaultLock},
+			Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
 		{Path: []string{"vault"}, Tier: TierRead,
-			Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
+			Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
 			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
 	}
+	// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
+	return append(cmds, vaultKVCommands()...)
 }

-// vaultHelp is shown for bare `homelab vault`.
+// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
+// between the two unrelated "vaults" this command fronts, because the name
+// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
+// infra secrets store).
 func vaultHelp() string {
-	return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
+	return `homelab vault — two different secret stores under one command:

+  • Vaultwarden               your personal PASSWORD MANAGER (logins / passwords / TOTP)
+  • HashiCorp Vault / OpenBao  homelab INFRA secrets (the secret/… KV store)  → 'vault kv …'
+
+── Vaultwarden  (reads YOUR OWN vault; no-HITL after one-time setup) ──
  homelab vault setup             one-time: store your master password + API key in your Vault path
  homelab vault status            configured / unlocked / reachable (no secrets)
  homelab vault list [--search Q] list your item names (no secrets)
  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
                                  TTY → clipboard (auto-clears); piped → stdout
+  homelab vault get <name> --all  all fields (incl. custom) as JSON; piped only.
+                                  TOTP shown as presence flag — use 'vault code' for a code.
  homelab vault code <name>       current TOTP code
  homelab vault lock              lock / log out the local bw session

-Creds live only in your own Vault path; the admin never sees them. Identity is
-your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
+── HashiCorp Vault / OpenBao  (infra secrets; uses your own OIDC vault token) ──
+  homelab vault kv get <path> [--field K]   read an infra KV secret
+  homelab vault kv list <path>              list sub-paths
+  homelab vault kv put <path> <key>         write one key (value via stdin)
+
+Vaultwarden creds live only in your own Vault path; the admin never sees them.
+Security model: docs/runbooks/homelab-vault-onboarding.md
 (note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
 `
 }
@ -79,7 +97,33 @@ func realRunner(name string, argv, envv []string) (string, error) {
 	out, err := cmd.Output()
 	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
 	// fetched secret with significant leading/trailing spaces is preserved.
-	return strings.TrimRight(string(out), "\r\n"), err
+	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
+}
+
+// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
+// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
+// write the actionable message there — "connection refused", "permission
+// denied" — which the caller would otherwise never see behind a bare
+// "exit status N".
+func exitStderr(err error) []byte {
+	var ee *exec.ExitError
+	if errors.As(err, &ee) {
+		return ee.Stderr
+	}
+	return nil
+}
+
+// augmentErr appends captured stderr to an error so failures are diagnosable
+// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
+// when there's no stderr; preserves the wrapped error for errors.Is/As.
+func augmentErr(err error, stderr []byte) error {
+	if err == nil {
+		return nil
+	}
+	if s := strings.TrimSpace(string(stderr)); s != "" {
+		return fmt.Errorf("%w: %s", err, s)
+	}
+	return err
 }

 // realRunnerStdin runs a command feeding `stdin` to it, for secret values that
@ -92,7 +136,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er
 	}
 	cmd.Stdin = strings.NewReader(stdin)
 	out, err := cmd.Output()
-	return strings.TrimRight(string(out), "\r\n"), err
+	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
 }

 func vwCredsPath(user string) string { return vwUserPathPrefix + user }
@ -128,6 +172,89 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
 var vaultCurrentUser = func() string { return os.Getenv("USER") }
 var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }

+// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
+// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
+func scopedTokenPath(home string) string {
+	return home + "/.config/claude-auth-sync/vault-token"
+}
+
+// vaultTokenSource decides which Vault token the `vault` child processes should
+// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
+// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
+// (policy workstation-claude-<user>, which grants exactly the create/read/update
+// this tool needs on the user's own path), then a native ~/.vault-token.
+//
+// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
+// caller's own secret/workstation/claude-users/<user> path, and a power-user who
+// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
+// capability on that path is `deny` — letting it win shadows the scoped token
+// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
+// right credential when there is no scoped token (admins). Returns the token to
+// export — "" when the vault CLI should read the ambient/native credential —
+// plus a source tag for tests/logging.
+func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
+	switch {
+	case envToken != "":
+		return "", "env"
+	case strings.TrimSpace(scopedToken) != "":
+		return strings.TrimSpace(scopedToken), "scoped"
+	case haveVaultTokenFile:
+		return "", "file"
+	default:
+		return "", "none"
+	}
+}
+
+// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
+// is likewise hardcoded (openSession), so a sane default here is consistent.
+const vaultAddrDefault = "https://vault.viktorbarzin.me"
+
+// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
+// doesn't already set one, else "". homelab vault is invoked by AFK agent
+// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
+// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
+// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
+// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
+func vaultAddrToSet(envAddr string) string {
+	if strings.TrimSpace(envAddr) == "" {
+		return vaultAddrDefault
+	}
+	return ""
+}
+
+// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
+// child processes reach the cluster Vault regardless of the caller's shell. An
+// explicit VAULT_ADDR (admins, CI) is left untouched.
+func ensureVaultAddr() {
+	if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
+		os.Setenv("VAULT_ADDR", a)
+	}
+}
+
+// fileNonEmpty reports whether path exists and has content.
+func fileNonEmpty(path string) bool {
+	fi, err := os.Stat(path)
+	return err == nil && fi.Size() > 0
+}
+
+// ensureVaultToken wires vaultTokenSource to the real environment: unless the
+// caller set an explicit $VAULT_TOKEN, it exports the claude-auth-sync scoped
+// token so the `vault` child processes authenticate as workstation-claude-<user>.
+// It is idempotent and safe for admins: an explicit $VAULT_TOKEN always wins, and
+// when no scoped token exists ~/.vault-token is left for the vault CLI to read.
+func ensureVaultToken() {
+	// Every vault verb funnels through here, so this is the one place that also
+	// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
+	// assumed from the caller's shell).
+	ensureVaultAddr()
+	home := os.Getenv("HOME")
+	scoped, _ := os.ReadFile(scopedTokenPath(home))
+	tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
+	if src == "scoped" {
+		os.Setenv("VAULT_TOKEN", tok)
+	}
+}
+
 // bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
 // do NOT inherit the full parent env (keeps stray secrets out of the child).
 func bwBaseEnv(appdata string) []string {
@ -160,7 +287,9 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string {
 func bwLoginArgs() []string                 { return []string{"login", "--apikey"} }
 func bwUnlockArgs() []string                { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
 func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
+func bwItemArgs(name string) []string       { return []string{"get", "item", name} }
 func bwStatusArgs() []string                { return []string{"status"} }
+func bwSyncArgs() []string                  { return []string{"sync"} }

 // bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
 // required. Unparseable/empty output → true (safer to attempt login).
@ -327,13 +456,23 @@ func openSession(run cmdRunner, user, uid string) (session, error) {
 	if err != nil {
 		return session{}, err
 	}
-	return session{env: bwSecretEnv(appdata, creds, sess)}, nil
+	sessEnv := bwSecretEnv(appdata, creds, sess)
+	// Pull the latest server-side state so reads reflect current values. `bw
+	// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
+	// session would otherwise serve stale data until the next login. Best-effort:
+	// a transient sync failure must not break a read — fall back to the cached
+	// vault and warn (status reports reachability separately).
+	if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
+		fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
+	}
+	return session{env: sessEnv}, nil
 }

 type getOpts struct {
 	name  string
 	field string
 	json  bool
+	all   bool // dump every field (incl. custom) as normalized JSON
 }

 var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
@ -345,6 +484,8 @@ func parseGetArgs(args []string) (getOpts, error) {
 		switch {
 		case a == "--json":
 			o.json = true
+		case a == "--all":
+			o.all = true
 		case a == "--field" && i+1 < len(args):
 			o.field = args[i+1]
 			i++
@ -355,9 +496,10 @@ func parseGetArgs(args []string) (getOpts, error) {
 		}
 	}
 	if o.name == "" {
-		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
+		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
 	}
-	if !validGetFields[o.field] {
+	// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
+	if !o.all && !validGetFields[o.field] {
 		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
 	}
 	return o, nil
@ -373,6 +515,81 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
 	return bwGet(run, s.env, o.field, o.name)
 }

+// getItem opens a session and returns the whole item as raw `bw get item` JSON.
+// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
+func getItem(run cmdRunner, user, uid, name string) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return run("bw", bwItemArgs(name), s.env)
+}
+
+// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
+// standard login fields that are present, notes, and a flat map of custom field
+// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
+// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
+// stays the specially-audited `vault code` (see the design §10/§16).
+type normalizedItem struct {
+	Name     string            `json:"name"`
+	Username string            `json:"username,omitempty"`
+	Password string            `json:"password,omitempty"`
+	URIs     []string          `json:"uris,omitempty"`
+	TOTP     bool              `json:"totp,omitempty"` // presence only, never the seed
+	Notes    string            `json:"notes,omitempty"`
+	Fields   map[string]string `json:"fields,omitempty"` // custom field name→value
+}
+
+// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
+// references another field and carries a null value, so it is not real data.
+const bwFieldLinked = 3
+
+// normalizeItem parses a `bw get item` payload into the browse projection. It is
+// pure (no I/O), so it is the unit-tested heart of `get --all`.
+func normalizeItem(raw string) (normalizedItem, error) {
+	var it struct {
+		Name  string `json:"name"`
+		Notes string `json:"notes"`
+		Login *struct {
+			Username string `json:"username"`
+			Password string `json:"password"`
+			Totp     string `json:"totp"`
+			URIs     []struct {
+				URI string `json:"uri"`
+			} `json:"uris"`
+		} `json:"login"`
+		Fields []struct {
+			Name  string `json:"name"`
+			Value string `json:"value"`
+			Type  int    `json:"type"`
+		} `json:"fields"`
+	}
+	if err := json.Unmarshal([]byte(raw), &it); err != nil {
+		return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
+	}
+	n := normalizedItem{Name: it.Name, Notes: it.Notes}
+	if it.Login != nil {
+		n.Username = it.Login.Username
+		n.Password = it.Login.Password
+		n.TOTP = it.Login.Totp != ""
+		for _, u := range it.Login.URIs {
+			if u.URI != "" {
+				n.URIs = append(n.URIs, u.URI)
+			}
+		}
+	}
+	for _, f := range it.Fields {
+		if f.Type == bwFieldLinked {
+			continue // references another field, no value of its own
+		}
+		if n.Fields == nil {
+			n.Fields = map[string]string{}
+		}
+		n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
+	}
+	return n, nil
+}
+
 // clipboardDecision picks how to return a secret value. "stdout" prints it (a
 // pipe/agent — the intended machine path); "clipboard" copies via OSC52;
 // "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
@ -443,6 +660,7 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {

 func vaultList(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	search := ""
 	for i := 0; i < len(args); i++ {
 		if args[i] == "--search" && i+1 < len(args) {
@ -477,6 +695,7 @@ func vaultSearch(args []string) error {

 func vaultCode(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	if len(args) == 0 {
 		return fmt.Errorf("usage: homelab vault code <name>")
 	}
@ -508,7 +727,9 @@ func statusSummary(run cmdRunner, user, uid string) string {
 	if err != nil {
 		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
 	}
-	if _, err := run("bw", []string{"sync"}, s.env); err != nil {
+	// openSession already did a best-effort sync; status re-runs it explicitly so
+	// a reachability failure surfaces in this report rather than only on stderr.
+	if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
 		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
 	}
 	return "vault: configured, unlocked, reachable ✓"
@ -516,6 +737,7 @@ func statusSummary(run cmdRunner, user, uid string) string {

 func vaultStatus(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	uid := vaultCurrentUID()
 	unlock, err := withUserLock(uid)
 	if err != nil {
@ -542,32 +764,61 @@ func vaultLock(args []string) error {
 	return nil // lock/logout best-effort; never error the caller
 }

-// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
+// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
+// (read-modify-write: needs only read+update, NOT the `patch` capability the
+// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
+// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
+// (creates the path on first use, before any sibling keys exist).
+func kvWriteVerb(merge bool) []string {
+	if merge {
+		return []string{"kv", "patch", "-method=rw"}
+	}
+	return []string{"kv", "put"}
+}
+
+// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
 // email nor the API client_id is a usable credential on its own.
-func vaultPatchPublicArgs(user, email, clientID string) []string {
-	return []string{"kv", "patch", vwCredsPath(user),
+func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user),
 		"vaultwarden_email="+email,
 		"vaultwarden_client_id="+clientID,
-	}
+	)
 }

-// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
-// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
-// on stdin by realRunnerStdin.
-func vaultPatchSecretArgs(user, key string) []string {
-	return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
+// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
+// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
+// realRunnerStdin.
+func vaultWriteSecretArgs(merge bool, user, key string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
 }

-// writeCreds stores all four fields in the user's Vault path. The two real
-// secrets (master password, API client_secret) go via stdin — never argv.
-func writeCreds(user string, c vwCreds) error {
-	if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
+// credsPathExists reports whether the user's KV path already holds data. Used to
+// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
+// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
+// user could run `homelab vault setup` before that ever happens.
+func credsPathExists(run cmdRunner, user string) bool {
+	_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
+	return err == nil
+}
+
+// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
+type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
+
+// writeCreds stores all four fields in the user's Vault path using only the
+// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
+// first (public) write creates the path when absent; the two real secrets then
+// merge in via read-modify-write so the public keys — and any claude-auth-sync
+// keys already present — survive. Secret values travel on stdin, never argv.
+func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
+	merge := credsPathExists(run, user)
+	if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
 		return err
 	}
-	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+	// The path now exists regardless of the branch above → merge the secrets in.
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
 		return err
 	}
-	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
 		return err
 	}
 	return nil
@ -593,6 +844,7 @@ func promptLine(prompt string) (string, error) {

 func vaultSetup(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
 	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
 	email, err := promptLine("Vaultwarden email: ")
@ -615,7 +867,7 @@ func vaultSetup(args []string) error {
 		return fmt.Errorf("all fields are required")
 	}
 	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
-	if err := writeCreds(vaultCurrentUser(), c); err != nil {
+	if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
 		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
 	}
 	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
@ -634,6 +886,7 @@ func vaultSetup(args []string) error {

 func vaultGet(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	o, err := parseGetArgs(args)
 	if err != nil {
 		return err
@ -645,6 +898,9 @@ func vaultGet(args []string) error {
 	}
 	defer unlock()
 	user := vaultCurrentUser()
+	if o.all {
+		return getAllFields(user, uid, o.name)
+	}
 	val, err := getValue(realRunner, user, uid, o)
 	if err != nil {
 		return err
@ -661,3 +917,28 @@ func vaultGet(args []string) error {
 	return nil
 }

+// getAllFields prints every field of one item as normalized JSON. Like
+// `get --json`, the payload is all secret values, so it refuses a terminal
+// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
+// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
+// distinguishable from a single-field get (the item name is still never logged).
+func getAllFields(user, uid, name string) error {
+	if !jsonToStdoutOK(stdoutIsTTY()) {
+		return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
+	}
+	raw, err := getItem(realRunner, user, uid, name)
+	if err != nil {
+		return err
+	}
+	item, err := normalizeItem(raw)
+	if err != nil {
+		return err
+	}
+	out, err := json.Marshal(item)
+	if err != nil {
+		return err
+	}
+	writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
+	fmt.Println(string(out))
+	return nil
+}
--- a/cli/cmd_vault_kv.go
+++ b/cli/cmd_vault_kv.go
@ -0,0 +1,248 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"io"
+	"os"
+	"strings"
+)
+
+// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
+// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
+// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
+// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
+// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
+//
+// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
+// token (bound only to secret/workstation/claude-users/<user>). A general kv read
+// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
+// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
+// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
+// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
+// injects the scoped token). Access is then whatever the caller's policy grants.
+func vaultKVCommands() []Command {
+	return []Command{
+		{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
+		{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
+		{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
+			Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
+		{Path: []string{"vault", "kv"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
+			Run:     func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
+	}
+}
+
+func vaultKVHelp() string {
+	return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store)
+
+  homelab vault kv get <path> [--field K]   read a secret
+                                  --field K  → one value (TTY → clipboard; piped → stdout)
+                                  no --field → all fields as JSON (piped only)
+  homelab vault kv list <path>    list sub-paths under <path> (no values)
+  homelab vault kv put <path> <key>   write one key; value read from stdin
+                                  (piped, or no-echo prompt); merges — never clobbers siblings
+
+Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is
+whatever your policy grants. This is NOT Vaultwarden — for your personal logins
+use 'homelab vault get' (see 'homelab vault').
+`
+}
+
+// --- arg builders (pure; values never travel via argv) --------------------
+
+func vaultKVGetFieldArgs(path, field string) []string {
+	return []string{"kv", "get", "-field=" + field, path}
+}
+func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
+func vaultKVListArgs(path string) []string    { return []string{"kv", "list", "-format=json", path} }
+
+// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
+// (read-modify-write: merges, needs only read+update — not the `patch` capability
+// — and preserves sibling keys); merge=false → `kv put` (creates the path on
+// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
+// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
+func vaultKVPutArgs(merge bool, path, key string) []string {
+	return append(kvWriteVerb(merge), path, key+"=-")
+}
+
+// --- pure parsers ----------------------------------------------------------
+
+// extractKVData returns the inner secret object from a `vault kv get -format=json`
+// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
+// wrapper so only the secret's own key→value data is emitted.
+func extractKVData(jsonOut string) (string, error) {
+	var env struct {
+		Data struct {
+			Data json.RawMessage `json:"data"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
+		return "", fmt.Errorf("parse vault kv json: %w", err)
+	}
+	if len(env.Data.Data) == 0 {
+		return "", fmt.Errorf("no secret data at that path")
+	}
+	return string(env.Data.Data), nil
+}
+
+// parseKVList parses the JSON array `vault kv list -format=json` prints.
+func parseKVList(jsonOut string) ([]string, error) {
+	var keys []string
+	if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
+		return nil, fmt.Errorf("parse vault kv list json: %w", err)
+	}
+	return keys, nil
+}
+
+// --- testable cores (injected cmdRunner) -----------------------------------
+
+func kvGetField(run cmdRunner, path, field string) (string, error) {
+	return run("vault", vaultKVGetFieldArgs(path, field), nil)
+}
+
+func kvGetJSON(run cmdRunner, path string) (string, error) {
+	out, err := run("vault", vaultKVGetJSONArgs(path), nil)
+	if err != nil {
+		return "", err
+	}
+	return extractKVData(out)
+}
+
+func kvList(run cmdRunner, path string) ([]string, error) {
+	out, err := run("vault", vaultKVListArgs(path), nil)
+	if err != nil {
+		return nil, err
+	}
+	return parseKVList(out)
+}
+
+// kvPathExists reports whether the KV path is readable and holds data, to pick
+// create (`kv put`) vs merge (`kv patch -method=rw`). Any read error (including
+// a denied or failed read) counts as absent, so the caller then uses `kv put`.
+func kvPathExists(run cmdRunner, path string) bool {
+	_, err := run("vault", vaultKVGetJSONArgs(path), nil)
+	return err == nil
+}
+
+// kvPut writes one key, creating the path when absent and merging when present.
+// The value travels on stdin only (never argv).
+func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
+	merge := kvPathExists(run, path)
+	_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
+	return err
+}
+
+// --- handlers --------------------------------------------------------------
+
+func vaultKVGet(args []string) error {
+	hardenProcess()
+	ensureVaultAddr() // own token, NOT the scoped one (see file header)
+	var path, field string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--field" && i+1 < len(args):
+			field = args[i+1]
+			i++
+		case strings.HasPrefix(a, "--field="):
+			field = strings.TrimPrefix(a, "--field=")
+		case !strings.HasPrefix(a, "-") && path == "":
+			path = a
+		}
+	}
+	if path == "" {
+		return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
+	}
+	if field != "" {
+		val, err := kvGetField(realRunner, path, field)
+		if err != nil {
+			return err
+		}
+		emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
+		return nil
+	}
+	// No --field → the whole secret. All values, so refuse a bare TTY (like
+	// `vault get --json`): pick a --field for the clipboard path, or pipe it.
+	if !jsonToStdoutOK(stdoutIsTTY()) {
+		return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
+	}
+	out, err := kvGetJSON(realRunner, path)
+	if err != nil {
+		return err
+	}
+	fmt.Println(out)
+	return nil
+}
+
+func vaultKVList(args []string) error {
+	ensureVaultAddr()
+	var path string
+	for _, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			path = a
+			break
+		}
+	}
+	if path == "" {
+		return fmt.Errorf("usage: homelab vault kv list <path>")
+	}
+	keys, err := kvList(realRunner, path)
+	if err != nil {
+		return err
+	}
+	for _, k := range keys {
+		fmt.Println(k)
+	}
+	return nil
+}
+
+func vaultKVPut(args []string) error {
+	hardenProcess()
+	ensureVaultAddr()
+	var path, key string
+	for _, a := range args {
+		if strings.HasPrefix(a, "-") {
+			continue
+		}
+		switch {
+		case path == "":
+			path = a
+		case key == "":
+			key = a
+		}
+	}
+	if path == "" || key == "" {
+		return fmt.Errorf("usage: homelab vault kv put <path> <key>   (value read from stdin)")
+	}
+	value, err := readSecretValue("Value for " + key + ": ")
+	if err != nil {
+		return err
+	}
+	if value == "" {
+		return fmt.Errorf("empty value; aborting (nothing written)")
+	}
+	if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
+		return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
+	}
+	fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
+	return nil
+}
+
+// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
+// is read in full (all trailing CR/LF trimmed, internal newlines preserved so
+// multi-line values like PEM keys survive); an interactive TTY is prompted
+// without echo.
+func readSecretValue(prompt string) (string, error) {
+	fi, err := os.Stdin.Stat()
+	if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
+		b, rerr := io.ReadAll(os.Stdin)
+		if rerr != nil {
+			return "", rerr
+		}
+		return strings.TrimRight(string(b), "\r\n"), nil
+	}
+	return promptNoEcho(prompt)
+}
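
Reviewer note on `extractKVData` in this file: a small standalone sketch of the KV-v2 envelope unwrap, run against a made-up payload. The function name `unwrapKV` and the sample JSON (including its `request_id` and key names) are assumptions for illustration only; real `vault kv get -format=json` output carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unwrapKV follows the shape described for extractKVData: keep only the inner
// data.data object and drop the metadata/request wrapper around it.
func unwrapKV(jsonOut string) (string, error) {
	var env struct {
		Data struct {
			Data json.RawMessage `json:"data"`
		} `json:"data"`
	}
	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
		return "", fmt.Errorf("parse vault kv json: %w", err)
	}
	if len(env.Data.Data) == 0 {
		return "", fmt.Errorf("no secret data at that path")
	}
	return string(env.Data.Data), nil
}

func main() {
	// Hypothetical envelope; only the nesting matters for this sketch.
	sample := `{"request_id":"hypothetical","data":{"data":{"api_key":"s3cr3t"},"metadata":{"version":3}}}`
	out, err := unwrapKV(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Using `json.RawMessage` for the inner object means the secret's own keys are passed through as raw JSON rather than being decoded and re-encoded.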
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
@ -2,6 +2,8 @@ package main

 import (
 	"encoding/base64"
+	"encoding/json"
+	"errors"
 	"fmt"
 	"os"
 	"reflect"
@ -233,12 +235,181 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
 	}
 }

-func TestVaultPatchPublicArgs(t *testing.T) {
-	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
-	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
+func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
+	dir := t.TempDir()
+	cfg := dir + "/.config/claude-auth-sync"
+	if err := os.MkdirAll(cfg, 0o700); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "") // no ambient token
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
+		t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
+	}
+}
+
+func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
+	dir := t.TempDir()
+	cfg := dir + "/.config/claude-auth-sync"
+	if err := os.MkdirAll(cfg, 0o700); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
+		t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
+	}
+}
+
+func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) {
+	// Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the
+	// purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28).
+	dir := t.TempDir()
+	cfg := dir + "/.config/claude-auth-sync"
+	if err := os.MkdirAll(cfg, 0o700); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "")
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
+		t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got)
+	}
+}
+
+func TestScopedTokenPath(t *testing.T) {
+	if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
+		t.Fatalf("scopedTokenPath = %q", got)
+	}
+}
+
+func TestVaultTokenSource(t *testing.T) {
+	// Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped
+	// token > a native ~/.vault-token. Scoped beats the file so a power-user's
+	// read-only OIDC ~/.vault-token can't shadow the scoped token on the user's
+	// own path (emo, 2026-06-28).
+	cases := []struct {
+		name             string
+		env              string
+		haveVaultToken   bool
+		scoped           string
+		wantTok, wantSrc string
+	}{
+		{"explicit env wins", "abc", true, "S", "", "env"},
+		{"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"},
+		{"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"},
+		{"native ~/.vault-token only when no scoped", "", true, "", "", "file"},
+		{"scoped value is trimmed", "", false, "  S-TOK\n", "S-TOK", "scoped"},
+		{"whitespace-only scoped falls back to file", "", true, "  \n", "", "file"},
+		{"nothing configured", "", false, "", "", "none"},
+	}
+	for _, c := range cases {
+		tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
+		if tok != c.wantTok || src != c.wantSrc {
+			t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
+				c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
+		}
+	}
+}
+
+func TestVaultAddrToSet(t *testing.T) {
+	// homelab vault is invoked by AFK agent sessions (non-login shells that
+	// never sourced /etc/environment), so the CLI must self-default VAULT_ADDR
+	// rather than rely on the ambient env — else every `vault` child hits the
+	// 127.0.0.1:8200 default and fails "connection refused" (exit 2).
+	cases := []struct {
+		name, env, want string
+	}{
+		{"unset -> default", "", vaultAddrDefault},
+		{"whitespace-only -> default", "  \n", vaultAddrDefault},
+		{"explicit kept (empty = leave alone)", "https://vault.example.com", ""},
+	}
+	for _, c := range cases {
+		if got := vaultAddrToSet(c.env); got != c.want {
+			t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want)
+		}
+	}
+}
+
+func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) {
+	dir := t.TempDir() // no scoped token, no ~/.vault-token
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "")
+	t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault {
+		t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault)
+	}
+}
+
+func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) {
+	dir := t.TempDir()
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "")
+	t.Setenv("VAULT_ADDR", "https://vault.example.com")
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" {
+		t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got)
+	}
+}
+
+func TestAugmentErrSurfacesStderr(t *testing.T) {
+	if got := augmentErr(nil, []byte("ignored")); got != nil {
+		t.Fatalf("augmentErr(nil, …) = %v, want nil", got)
+	}
+	base := errors.New("exit status 2")
+	got := augmentErr(base, []byte("  dial tcp 127.0.0.1:8200: connect: connection refused\n"))
+	if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") {
+		t.Fatalf("augmentErr did not surface stderr: %v", got)
+	}
+	if !errors.Is(got, base) {
+		t.Fatal("augmentErr lost the wrapped error (errors.Is failed)")
+	}
+	if got := augmentErr(base, []byte("   ")); got != base {
+		t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got)
+	}
+}
+
+func TestKvWriteVerb(t *testing.T) {
+	// merge=true → read-modify-write patch (needs only read+update, NOT the
+	// `patch` capability the scoped workstation policy lacks).
+	if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
+		t.Fatalf("kvWriteVerb(true) = %v", got)
+	}
+	// merge=false → put (creates the path on first use)
+	if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
+		t.Fatalf("kvWriteVerb(false) = %v", got)
+	}
+}
+
+func TestVaultWritePublicArgs(t *testing.T) {
+	got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
+	want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
 		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
 	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("vaultPatchPublicArgs = %v", got)
+		t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
+	}
+	if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
+		t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
 	}
 	for _, a := range got {
 		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
@ -247,12 +418,12 @@ func TestVaultPatchPublicArgs(t *testing.T) {
 	}
 }

-func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
+func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
 	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
-		got := vaultPatchSecretArgs("emo", key)
-		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
+		got := vaultWriteSecretArgs(true, "emo", key)
+		want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
 		if !reflect.DeepEqual(got, want) {
-			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
+			t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
 		}
 		if got[len(got)-1] != key+"=-" {
 			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
@ -260,6 +431,90 @@ func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
 	}
 }

+// recStdin records a stdin-bearing call for assertions.
+type recStdin struct {
+	argv  []string
+	stdin string
+}
+
+// TestWriteCredsCreatesThenMerges: when the path is ABSENT, the first (public)
+// write must `kv put` (create); the two secrets then merge via `kv patch -method=rw`
+// with values on stdin only. Plain `kv patch` (needs the `patch` capability) is never used.
+func TestWriteCredsCreatesThenMerges(t *testing.T) {
+	var calls [][]string
+	var stdinCalls []recStdin
+	run := func(name string, argv, envv []string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
+			return "", fmt.Errorf("no value found") // path absent
+		}
+		return "", nil
+	}
+	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
+		stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
+		return "", nil
+	}
+	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
+	if err := writeCreds(run, runStdin, "emo", c); err != nil {
+		t.Fatalf("writeCreds: %v", err)
+	}
+	var sawPut, sawPlainPatch bool
+	for _, cl := range calls {
+		j := strings.Join(cl, " ")
+		if strings.Contains(j, "kv put") {
+			sawPut = true
+		}
+		if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
+			sawPlainPatch = true
+		}
+	}
+	if !sawPut {
+		t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
+	}
+	if sawPlainPatch {
+		t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
+	}
+	if len(stdinCalls) != 2 {
+		t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
+	}
+	for _, sc := range stdinCalls {
+		if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
+			t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
+		}
+		for _, a := range sc.argv {
+			if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
+				t.Errorf("secret leaked into argv: %v", sc.argv)
+			}
+		}
+	}
+	if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
+		t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
+	}
+}
+
+// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
+// (`kv patch -method=rw`); a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
+func TestWriteCredsMergesWhenPresent(t *testing.T) {
+	var calls [][]string
+	run := func(name string, argv, envv []string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		return "{}", nil // get succeeds → path exists
+	}
+	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		return "", nil
+	}
+	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
+	if err := writeCreds(run, runStdin, "emo", c); err != nil {
+		t.Fatalf("writeCreds: %v", err)
+	}
+	for _, cl := range calls {
+		if strings.Contains(strings.Join(cl, " "), "kv put") {
+			t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
+		}
+	}
+}
+
 // TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
 // whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
 // value may appear in any command's argv — secrets travel via env/stdin only.
@ -366,3 +621,437 @@ func TestGetValueFlow(t *testing.T) {
 		t.Fatalf("getValue = %q, %v", val, err)
 	}
 }
+
+// --- vault get --all (browse all fields) ----------------------------------
+
+func TestParseGetArgsAll(t *testing.T) {
+	o, err := parseGetArgs([]string{"github", "--all"})
+	if err != nil || o.name != "github" || !o.all {
+		t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err)
+	}
+	// --all must skip --field validation (field is irrelevant for a full dump).
+	if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil {
+		t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err)
+	}
+	// A name is still required.
+	if _, err := parseGetArgs([]string{"--all"}); err == nil {
+		t.Fatal("get --all with no name must error")
+	}
+	// Without --all, the field allowlist still applies.
+	if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil {
+		t.Fatal("invalid --field without --all must still error")
+	}
+}
+
+func TestBwItemArgs(t *testing.T) {
+	argv := bwItemArgs("github")
+	if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) {
+		t.Fatalf("bwItemArgs = %v", argv)
+	}
+	for _, a := range argv {
+		if strings.Contains(a, "SESSION") || a == "--session" {
+			t.Fatalf("session must travel via env, not argv: %v", argv)
+		}
+	}
+}
+
+// A representative `bw get item` payload: login fields, multiple URIs, a TOTP
+// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST
+// be dropped (id/object/reprompt/passwordHistory).
+const sampleLoginItemJSON = `{
+  "object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0,
+  "name":"GitHub","notes":"my notes","favorite":false,
+  "fields":[
+    {"name":"PIN","value":"1234","type":1},
+    {"name":"endpoint","value":"https://api.gh","type":0},
+    {"name":"enabled","value":"true","type":2}
+  ],
+  "login":{
+    "username":"octocat","password":"hunter2",
+    "totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED",
+    "uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}]
+  },
+  "passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}]
+}`
+
+func TestNormalizeItemLogin(t *testing.T) {
+	n, err := normalizeItem(sampleLoginItemJSON)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" {
+		t.Fatalf("standard fields wrong: %+v", n)
+	}
+	if !n.TOTP {
+		t.Fatal("TOTP presence flag must be true when a seed exists")
+	}
+	if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) {
+		t.Fatalf("URIs = %v", n.URIs)
+	}
+	want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"}
+	if !reflect.DeepEqual(n.Fields, want) {
+		t.Fatalf("custom fields = %v want %v", n.Fields, want)
+	}
+}
+
+// The load-bearing security test: the raw TOTP seed (more powerful than a
+// one-time code) and the password history must NEVER appear in the dump.
+func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) {
+	n, err := normalizeItem(sampleLoginItemJSON)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	out, err := json.Marshal(n)
+	if err != nil {
+		t.Fatalf("marshal: %v", err)
+	}
+	for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} {
+		if strings.Contains(string(out), leak) {
+			t.Fatalf("dump leaked %q: %s", leak, out)
+		}
+	}
+}
+
+func TestNormalizeItemNoTOTP(t *testing.T) {
+	n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	if n.TOTP {
+		t.Fatal("TOTP must be false when no seed present")
+	}
+	out, _ := json.Marshal(n)
+	if strings.Contains(string(out), "totp") {
+		t.Fatalf("no-totp item must omit the totp key entirely: %s", out)
+	}
+}
+
+func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) {
+	n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	out, _ := json.Marshal(n)
+	for _, k := range []string{"username", "password", "uris", "notes", "totp"} {
+		if strings.Contains(string(out), `"`+k+`"`) {
+			t.Fatalf("empty standard field %q must be omitted: %s", k, out)
+		}
+	}
+	if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) {
+		t.Fatalf("name + custom field must survive: %s", out)
+	}
+}
+
+func TestNormalizeItemSecureNoteNullLogin(t *testing.T) {
+	// type 2 (secure note): login is null — must not panic; notes + custom fields survive.
+	n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`)
+	if err != nil {
+		t.Fatalf("normalizeItem(null login): %v", err)
+	}
+	if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" {
+		t.Fatalf("secure-note normalize wrong: %+v", n)
+	}
+	if n.Username != "" || n.Password != "" || n.TOTP {
+		t.Fatalf("login fields must be empty for a login-less item: %+v", n)
+	}
+}
+
+func TestNormalizeItemDuplicateCustomNames(t *testing.T) {
+	// Bitwarden permits duplicate custom-field names; a JSON object can't hold
+	// dups, so last-wins (documented).
+	n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	if n.Fields["k"] != "second" {
+		t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"])
+	}
+}
+
+func TestNormalizeItemLinkedFieldSkipped(t *testing.T) {
+	// type 3 (linked) fields reference another field and carry a null value —
+	// they are not real data and must be skipped.
+	n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`)
+	if err != nil {
+		t.Fatalf("normalizeItem: %v", err)
+	}
+	if _, ok := n.Fields["linked"]; ok {
+		t.Fatalf("linked field must be skipped: %v", n.Fields)
+	}
+	if n.Fields["real"] != "r" {
+		t.Fatalf("real custom field dropped: %v", n.Fields)
+	}
+}
+
+func TestNormalizeItemMalformed(t *testing.T) {
+	if _, err := normalizeItem("not json"); err == nil {
+		t.Fatal("malformed item JSON must error")
+	}
+}
+
+// TestGetItemFlow: getItem opens a session and runs `bw get item <name>`, returning raw JSON.
+func TestGetItemFlow(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
+		"bw status":          `{"status":"locked"}`,
+		"bw unlock":          "SESS",
+		"bw get item github": sampleLoginItemJSON,
+	}}
+	uid := fmt.Sprintf("%d", os.Getuid())
+	raw, err := getItem(f.run, "emo", uid, "github")
+	if err != nil || !strings.Contains(raw, `"name":"GitHub"`) {
+		t.Fatalf("getItem = %q, %v", raw, err)
+	}
+	// The session key must reach bw via env, never argv.
+	for _, call := range f.calls {
+		for _, arg := range call {
+			if strings.Contains(arg, "SESS") {
+				t.Errorf("session leaked into argv: %v", call)
+			}
+		}
+	}
+}
+
+func TestVaultHelpMentionsAll(t *testing.T) {
+	if !strings.Contains(vaultHelp(), "--all") {
+		t.Error("vault help must document --all")
+	}
+}
+
+// --- bw sync on read (freshness) ------------------------------------------
+
+func TestBwSyncArgs(t *testing.T) {
+	if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) {
+		t.Fatalf("bwSyncArgs = %v", got)
+	}
+}
+
+// Every read opens a session that first `bw sync`s, so reads reflect the latest
+// server-side values: `bw unlock` is local-only, so without a sync a persisted
+// (already-logged-in) session serves a stale local cache.
+func TestOpenSessionSyncsBeforeRead(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
+		"bw status":              `{"status":"locked"}`,
+		"bw unlock":              "SESS",
+		"bw sync":                "Syncing complete.",
+		"bw get password github": "p@ss",
+	}}
+	uid := fmt.Sprintf("%d", os.Getuid())
+	if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
+		t.Fatalf("getValue: %v", err)
+	}
+	idx := func(prefix string) int {
+		for i, c := range f.calls {
+			if strings.HasPrefix(strings.Join(c, " "), prefix) {
+				return i
+			}
+		}
+		return -1
+	}
+	syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github")
+	if syncAt < 0 || unlockAt < 0 {
+		t.Fatalf("expected `bw unlock` and `bw sync` before the read: unlock=%d sync=%d", unlockAt, syncAt)
+	}
+	if !(unlockAt < syncAt && syncAt < getAt) {
+		t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock<sync<get)", unlockAt, syncAt, getAt)
+	}
+}
+
+// Sync is best-effort: a transient sync failure must NOT fail the read — the
+// cached value is still returned (a stderr warning is emitted, not asserted here).
+func TestReadSucceedsWhenSyncFails(t *testing.T) {
+	f := &fakeRunner{
+		out: map[string]string{
+			"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
+			"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
+			"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
+			"bw status":              `{"status":"locked"}`,
+			"bw unlock":              "SESS",
+			"bw get password github": "p@ss",
+		},
+		err: map[string]error{"bw sync": errors.New("Failed to sync: network error")},
+	}
+	uid := fmt.Sprintf("%d", os.Getuid())
+	val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
+	if err != nil || val != "p@ss" {
+		t.Fatalf("read must succeed despite a sync failure: val=%q err=%v", val, err)
+	}
+}
+
+// --- vault kv (HashiCorp Vault / OpenBao infra secrets) --------------------
+
+func TestVaultKVCommandsRegistered(t *testing.T) {
+	want := map[string]Tier{
+		"vault kv get":  TierRead,
+		"vault kv list": TierRead,
+		"vault kv put":  TierWrite,
+	}
+	got := map[string]Tier{}
+	for _, c := range vaultCommands() {
+		got[c.name()] = c.Tier
+	}
+	for name, tier := range want {
+		if got[name] != tier {
+			t.Errorf("command %q: tier=%q, want %q", name, got[name], tier)
+		}
+	}
+}
+
+func TestVaultKVArgs(t *testing.T) {
+	if got := vaultKVGetFieldArgs("secret/viktor", "github_pat"); !reflect.DeepEqual(got, []string{"kv", "get", "-field=github_pat", "secret/viktor"}) {
+		t.Fatalf("vaultKVGetFieldArgs = %v", got)
+	}
+	if got := vaultKVGetJSONArgs("secret/viktor"); !reflect.DeepEqual(got, []string{"kv", "get", "-format=json", "secret/viktor"}) {
+		t.Fatalf("vaultKVGetJSONArgs = %v", got)
+	}
+	if got := vaultKVListArgs("secret/"); !reflect.DeepEqual(got, []string{"kv", "list", "-format=json", "secret/"}) {
+		t.Fatalf("vaultKVListArgs = %v", got)
+	}
+	// create (path absent) → put; merge (path present) → patch -method=rw. Either
+	// way the VALUE travels via the `key=-` stdin form, never argv.
+	create := vaultKVPutArgs(false, "secret/x", "api_key")
+	if !reflect.DeepEqual(create, []string{"kv", "put", "secret/x", "api_key=-"}) {
+		t.Fatalf("vaultKVPutArgs(create) = %v", create)
+	}
+	merge := vaultKVPutArgs(true, "secret/x", "api_key")
+	if !reflect.DeepEqual(merge, []string{"kv", "patch", "-method=rw", "secret/x", "api_key=-"}) {
+		t.Fatalf("vaultKVPutArgs(merge) = %v", merge)
+	}
+	for _, args := range [][]string{create, merge} {
+		for _, a := range args {
+			if strings.Contains(a, "SECRETVALUE") {
+				t.Fatalf("value must not appear in argv: %v", args)
+			}
+		}
+	}
+}
+
+func TestExtractKVData(t *testing.T) {
+	// `vault kv get -format=json` wraps the secret in {"data":{"data":{...},"metadata":{...}}}.
+	env := `{"request_id":"x","data":{"data":{"github_pat":"ghp_abc","email":"e@x.me"},"metadata":{"version":3}}}`
+	out, err := extractKVData(env)
+	if err != nil {
+		t.Fatalf("extractKVData: %v", err)
+	}
+	// Round-trip to a map so key order doesn't matter.
+	var m map[string]string
+	if err := json.Unmarshal([]byte(out), &m); err != nil {
+		t.Fatalf("result not a JSON object: %q (%v)", out, err)
+	}
+	if m["github_pat"] != "ghp_abc" || m["email"] != "e@x.me" {
+		t.Fatalf("extractKVData inner data wrong: %v", m)
+	}
+	// metadata must NOT leak into the output.
+	if strings.Contains(out, "metadata") || strings.Contains(out, "request_id") {
+		t.Fatalf("envelope internals leaked: %s", out)
+	}
+	if _, err := extractKVData("not json"); err == nil {
+		t.Fatal("malformed envelope must error")
+	}
+}
+
+func TestParseKVList(t *testing.T) {
+	keys, err := parseKVList(`["app1","app2/","viktor"]`)
+	if err != nil {
+		t.Fatalf("parseKVList: %v", err)
+	}
+	if !reflect.DeepEqual(keys, []string{"app1", "app2/", "viktor"}) {
+		t.Fatalf("parseKVList = %v", keys)
+	}
+	if _, err := parseKVList("not json"); err == nil {
+		t.Fatal("malformed list must error")
+	}
+}
+
+func TestKVGetFieldFlow(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=github_pat secret/viktor": "ghp_secret",
+	}}
+	val, err := kvGetField(f.run, "secret/viktor", "github_pat")
+	if err != nil || val != "ghp_secret" {
+		t.Fatalf("kvGetField = %q, %v", val, err)
+	}
+}
+
+func TestKVListFlow(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv list -format=json secret/": `["app1","app2/"]`,
+	}}
+	keys, err := kvList(f.run, "secret/")
+	if err != nil || !reflect.DeepEqual(keys, []string{"app1", "app2/"}) {
+		t.Fatalf("kvList = %v, %v", keys, err)
+	}
+}
+
+// kvPut creates the path on first write and merges thereafter, with the value on
+// stdin only (mirrors writeCreds). Never plain `kv patch` (needs the patch cap).
+func TestKVPutCreatesThenMerges(t *testing.T) {
+	for _, tc := range []struct {
+		name       string
+		exists     bool
+		wantCreate bool
+	}{
+		{"absent path → create (put)", false, true},
+		{"present path → merge (patch -rw)", true, false},
+	} {
+		t.Run(tc.name, func(t *testing.T) {
+			var stdinCalls []recStdin
+			run := func(name string, argv, envv []string) (string, error) {
+				if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
+					if tc.exists {
+						return `{"data":{"data":{}}}`, nil
+					}
+					return "", fmt.Errorf("No value found at secret/x")
+				}
+				return "", nil
+			}
+			runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
+				stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
+				return "", nil
+			}
+			if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil {
+				t.Fatalf("kvPut: %v", err)
+			}
+			if len(stdinCalls) != 1 {
+				t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls))
+			}
+			sc := stdinCalls[0]
+			joined := strings.Join(sc.argv, " ")
+			if tc.wantCreate && !strings.Contains(joined, "kv put") {
+				t.Fatalf("absent path must use `kv put`: %v", sc.argv)
+			}
+			if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") {
+				t.Fatalf("present path must merge via `kv patch -method=rw`: %v", sc.argv)
+			}
+			if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") {
+				t.Fatalf("must never use plain `kv patch`: %v", sc.argv)
+			}
+			if sc.stdin != "SECRETVALUE" {
+				t.Fatalf("value must travel via stdin, got %q", sc.stdin)
+			}
+			for _, a := range sc.argv {
+				if strings.Contains(a, "SECRETVALUE") {
+					t.Fatalf("value leaked into argv: %v", sc.argv)
+				}
+			}
+		})
+	}
+}
+
+func TestVaultHelpMentionsBothSystems(t *testing.T) {
+	h := vaultHelp()
+	for _, want := range []string{"Vaultwarden", "vault kv"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("vault help must mention %q (distinguish the two systems)", want)
+		}
+	}
+	// Must name the infra-secrets system so the distinction is unambiguous.
+	if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") {
+		t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)")
+	}
+}
--- a/cli/edges.go
+++ b/cli/edges.go
@ -0,0 +1,164 @@
+package main
+
+import (
+	"fmt"
+	"regexp"
+	"strconv"
+	"strings"
+)
+
+// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
+// investigation helper over the goldmane_edges trail; see ADR-0014).
+type edgesOpts struct {
+	ns       string // edges touching this namespace (either direction)
+	src      string // edges where src_ns = this
+	dst      string // edges where dst_ns = this
+	peersOf  string // distinct peers of this namespace (both directions)
+	newSince string // first_seen >= now() - duration (24h/7d/30m), or >= a date (YYYY-MM-DD)
+	denied   bool   // action = 'deny' only
+	asJSON   bool   // wrap result as a JSON array
+	limit    int    // row cap (default 200)
+}
+
+// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
+// typo surfaces instead of silently dumping the whole table.
+func parseEdgesArgs(args []string) (edgesOpts, error) {
+	o := edgesOpts{limit: 200}
+	i := 0
+	for i < len(args) {
+		a := args[i]
+		key, inline, hasInline := a, "", false
+		if eq := strings.IndexByte(a, '='); eq >= 0 {
+			key, inline, hasInline = a[:eq], a[eq+1:], true
+		}
+		needVal := func() (string, error) {
+			if hasInline {
+				return inline, nil
+			}
+			if i+1 < len(args) {
+				i++
+				return args[i], nil
+			}
+			return "", fmt.Errorf("flag %s needs a value", key)
+		}
+		var err error
+		switch key {
+		case "--ns":
+			o.ns, err = needVal()
+		case "--src":
+			o.src, err = needVal()
+		case "--dst":
+			o.dst, err = needVal()
+		case "--peers-of":
+			o.peersOf, err = needVal()
+		case "--new-since":
+			o.newSince, err = needVal()
+		case "--denied":
+			o.denied = true
+		case "--json":
+			o.asJSON = true
+		case "--limit":
+			var v string
+			if v, err = needVal(); err == nil {
+				if o.limit, err = strconv.Atoi(v); err != nil {
+					err = fmt.Errorf("--limit must be an integer: %q", v)
+				}
+			}
+		default:
+			return o, fmt.Errorf("unknown flag: %s", a)
+		}
+		if err != nil {
+			return o, err
+		}
+		i++
+	}
+	return o, nil
+}
+
+// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
+// injection guard — anything else is rejected rather than quoted-and-hoped.
+var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
+
+func validateNS(s string) error {
+	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
+		return fmt.Errorf("invalid namespace name: %q", s)
+	}
+	return nil
+}
+
+// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
+func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
+
+var (
+	durRE  = regexp.MustCompile(`^(\d+)([smhd])$`)
+	dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
+)
+
+// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD, with an
+// optional " HH:MM[:SS]" or "THH:MM[:SS]" time) into a first_seen predicate.
+func newSinceCond(v string) (string, error) {
+	if m := durRE.FindStringSubmatch(v); m != nil {
+		unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
+		return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
+	}
+	if dateRE.MatchString(v) {
+		return "first_seen >= " + sqlStr(v), nil
+	}
+	return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
+}
+
+// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
+func buildEdgesQuery(o edgesOpts) (string, error) {
+	limit := o.limit
+	if limit <= 0 {
+		limit = 200
+	}
+
+	// peers-of is a distinct-peer summary, a different shape from the row list.
+	if o.peersOf != "" {
+		if err := validateNS(o.peersOf); err != nil {
+			return "", err
+		}
+		p := sqlStr(o.peersOf)
+		return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
+			"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
+			"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
+			") t ORDER BY peer LIMIT %d", p, p, limit), nil
+	}
+
+	var conds []string
+	for _, f := range []struct{ val, tmpl string }{
+		{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
+		{o.src, "src_ns = %s"},
+		{o.dst, "dst_ns = %s"},
+	} {
+		if f.val == "" {
+			continue
+		}
+		if err := validateNS(f.val); err != nil {
+			return "", err
+		}
+		conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
+	}
+	if o.denied {
+		conds = append(conds, "action = 'deny'")
+	}
+	if o.newSince != "" {
+		c, err := newSinceCond(o.newSince)
+		if err != nil {
+			return "", err
+		}
+		conds = append(conds, c)
+	}
+
+	q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
+	if len(conds) > 0 {
+		q += " WHERE " + strings.Join(conds, " AND ")
+	}
+	q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
+
+	if o.asJSON {
+		q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
+	}
+	return q, nil
+}
--- a/cli/edges_test.go
+++ b/cli/edges_test.go
@ -0,0 +1,163 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestParseEdgesArgs(t *testing.T) {
+	cases := []struct {
+		name string
+		args []string
+		want edgesOpts
+	}{
+		{"defaults", nil, edgesOpts{limit: 200}},
+		{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
+		{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
+		{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
+		{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
+		{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
+		{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
+		{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got, err := parseEdgesArgs(c.args)
+			if err != nil {
+				t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
+			}
+			if got != c.want {
+				t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
+			}
+		})
+	}
+}
+
+func TestParseEdgesArgsErrors(t *testing.T) {
+	for _, args := range [][]string{
+		{"--limit", "abc"},
+		{"--bogus"},
+	} {
+		if _, err := parseEdgesArgs(args); err == nil {
+			t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
+		}
+	}
+}
+
+func TestBuildEdgesQueryDefaults(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{limit: 200})
+	if err != nil {
+		t.Fatal(err)
+	}
+	for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
+		if !strings.Contains(q, want) {
+			t.Errorf("query %q missing %q", q, want)
+		}
+	}
+	if strings.Contains(q, "WHERE") {
+		t.Errorf("no-filter query should have no WHERE: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryFilters(t *testing.T) {
+	cases := []struct {
+		name string
+		o    edgesOpts
+		want string
+	}{
+		{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
+		{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
+		{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
+		{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			q, err := buildEdgesQuery(c.o)
+			if err != nil {
+				t.Fatal(err)
+			}
+			if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
+				t.Errorf("query %q missing WHERE/%q", q, c.want)
+			}
+		})
+	}
+}
+
+func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
+		t.Errorf("combined filters not AND'd: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryPeersOf(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
+	if err != nil {
+		t.Fatal(err)
+	}
+	for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
+		if !strings.Contains(q, want) {
+			t.Errorf("peers-of query %q missing %q", q, want)
+		}
+	}
+}
+
+func TestBuildEdgesQueryJSON(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
+		t.Errorf("json query missing json_agg wrapper: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
+	for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
+		if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
+			t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
+		}
+	}
+}
+
+func TestNewSinceCond(t *testing.T) {
+	cases := []struct {
+		in   string
+		want string
+	}{
+		{"24h", "first_seen >= now() - interval '24 hours'"},
+		{"7d", "first_seen >= now() - interval '7 days'"},
+		{"30m", "first_seen >= now() - interval '30 minutes'"},
+		{"2026-06-28", "first_seen >= '2026-06-28'"},
+	}
+	for _, c := range cases {
+		got, err := newSinceCond(c.in)
+		if err != nil {
+			t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
+		}
+		if got != c.want {
+			t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
+		}
+	}
+	for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
+		if _, err := newSinceCond(bad); err == nil {
+			t.Errorf("newSinceCond(%q) expected error, got nil", bad)
+		}
+	}
+}
+
+func TestValidateNS(t *testing.T) {
+	for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
+		if err := validateNS(ok); err != nil {
+			t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
+		}
+	}
+	for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
+		if err := validateNS(bad); err == nil {
+			t.Errorf("validateNS(%q) expected error, got nil", bad)
+		}
+	}
+}
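
Reviewer sketch (not part of the change): the `validateNS` and `newSinceCond` implementations sit outside this hunk, so the following is only a minimal guess at validators that would satisfy the cases asserted above. The regular expressions, the error messages, and the `units` table are assumptions; only the two function names and the expected condition strings come from the tests.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// nsPattern is an assumed allow-list: ASCII letters, digits and '-' only.
// Anything else (quotes, spaces, ';', '/', '.', '$', empty) is rejected.
var nsPattern = regexp.MustCompile(`^[A-Za-z0-9-]+$`)

// validateNS returns an error unless ns is safe to place inside a
// single-quoted SQL literal.
func validateNS(ns string) error {
	if !nsPattern.MatchString(ns) {
		return fmt.Errorf("invalid namespace %q", ns)
	}
	return nil
}

var (
	// relPattern matches relative windows such as "24h", "7d", "30m".
	relPattern = regexp.MustCompile(`^([0-9]+)([mhd])$`)
	// datePattern matches an absolute YYYY-MM-DD date.
	datePattern = regexp.MustCompile(`^[0-9]{4}-[0-9]{2}-[0-9]{2}$`)
	// units maps the suffix to the interval unit word used in the SQL.
	units = map[string]string{"m": "minutes", "h": "hours", "d": "days"}
)

// newSinceCond turns a validated "since" value into a first_seen condition.
// Only the two shapes above are accepted; everything else is an error.
func newSinceCond(s string) (string, error) {
	if m := relPattern.FindStringSubmatch(s); m != nil {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return "", fmt.Errorf("invalid since value %q: %v", s, err)
		}
		return fmt.Sprintf("first_seen >= now() - interval '%d %s'", n, units[m[2]]), nil
	}
	if datePattern.MatchString(s) {
		return fmt.Sprintf("first_seen >= '%s'", s), nil
	}
	return "", fmt.Errorf("invalid since value %q", s)
}

func main() {
	for _, in := range []string{"24h", "2026-06-28", "1y"} {
		cond, err := newSinceCond(in)
		fmt.Printf("%q -> %q (err: %v)\n", in, cond, err)
	}
	fmt.Println(validateNS("pg-cluster-rw"), validateNS("a'b"))
}
```

Allow-listing the input before interpolation is what keeps the rendered literals safe in this sketch; whether the real code does the same is not visible from the tests alone.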
--- a/cli/homelab.go
+++ b/cli/homelab.go
@ -20,6 +20,7 @@ func buildRegistry() []Command {
 	reg = append(reg, deployCommands()...)
 	reg = append(reg, netCommands()...)
 	reg = append(reg, obsCommands()...)
+	reg = append(reg, edgesCommands()...)
 	reg = append(reg, usageCommands()...)
 	reg = append(reg, haCommands()...)
 	reg = append(reg, browserCommands()...)
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@ -5,8 +5,31 @@ import (
 	"os"
 	"strings"
 	"testing"
+	"unicode/utf8"
 )

+func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
+	// Byte-slicing long Cyrillic text at 240 can land inside a 2-byte rune and emit
+	// invalid UTF-8, which is the bug that crashed the recall hook. truncatePreview
+	// must cut on a rune boundary (its limit counts runes) and always stay valid UTF-8.
+	long := strings.Repeat("я", 300) // 300 runes / 600 bytes
+	got := truncatePreview(long, 240)
+	if !utf8.ValidString(got) {
+		t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
+	}
+	if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
+		t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
+	}
+	// Short multibyte strings pass through untouched (no ellipsis).
+	if got := truncatePreview("кратко", 240); got != "кратко" {
+		t.Fatalf("short string altered: %q", got)
+	}
+	// ASCII boundary still works.
+	if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
+		t.Fatalf("ascii truncation wrong: %q", got)
+	}
+}
+
 func TestResolveMemoryBase(t *testing.T) {
 	old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
 	defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
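
Reviewer sketch (not part of the change): `truncatePreview` itself is not shown in this hunk. A minimal rune-based version that would pass `TestTruncatePreviewKeepsValidUTF8` might look like the following; the parameter name `limit` and the negative-limit guard are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncatePreview keeps at most limit runes of s and appends "…" only when
// something was cut. Converting to []rune means the cut can never land
// inside a multi-byte character.
func truncatePreview(s string, limit int) string {
	if limit < 0 {
		limit = 0 // assumed guard; the source does not say how this case is handled
	}
	if utf8.RuneCountInString(s) <= limit {
		return s
	}
	r := []rune(s)
	return string(r[:limit]) + "…"
}

func main() {
	long := strings.Repeat("я", 300)
	got := truncatePreview(long, 240)
	fmt.Println(utf8.ValidString(got), utf8.RuneCountInString(got))
	fmt.Println(truncatePreview("кратко", 240))
}
```

Counting runes costs one extra pass over the string, which is negligible for preview-sized inputs.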
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
 Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:

 - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
 - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
 - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
 adding next* — with data instead of one maintainer's habits (the earlier mining
 covered a single user's ~51k commands, so the surface is shaped to that user).

+> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
+> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
+> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
+> owner in-session") no longer holds: the managed-settings policy now **defers
+> to OS/sudo authorization**. The `usage top` telemetry design itself is
+> unchanged and still current — only the "never read homes" framing in the
+> third decision below is overtaken.
+
 ## Decisions

 - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
+
+## As-built (2026-06-25)
+
+Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
+
+Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/adr/0015-os-is-the-authorization-boundary.md
+++ b/docs/adr/0015-os-is-the-authorization-boundary.md
@ -0,0 +1,57 @@
+# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
+
+Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
+carried and that ADR-0011 leaned on ("never read another user's home /
+`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
+subject — `usage top` telemetry and its emit design — is unchanged and still
+current; only the privacy prohibition it referenced is superseded here.
+
+## Context
+
+The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
+`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
+"you are not the admin, do not escalate privileges" and "never read another
+user's home directory, credentials, tokens, or `~/.claude`." The OS told a
+different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
+The kernel had already granted total read access; the policy was layering an
+artificial refusal on top of an authorization the OS already permits, and the
+"not the admin" framing was factually wrong for a NOPASSWD-root user.
+
+Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
+or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
+for analytics/debugging across the shared box.
+
+## Decision
+
+- **Authorization follows the OS, not this policy.** Agents may access whatever
+  their OS user can access — directly or via `sudo` where they hold sudo rights
+  — and must not impose restrictions stricter than the OS. On this box that
+  includes other users' home directories and `~/.claude` for users who hold
+  broad sudo.
+- **No separate prompt or carve-out** for OS-authorized access. The Unix
+  permission model + sudoers is the single source of truth for who may read
+  what. Other users' home directories are mode `0750`, so a cross-home read
+  necessarily transits `sudo` and is therefore captured in the sudo/auth audit log.
+- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
+  stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
+  file access, not a licence to exceed cluster RBAC.
+- **Scope is symmetric and multi-user.** The rule lives in the *shared*
+  managed-settings, so every user's agents defer to that user's own sudo grant.
+  Any user with broad sudo gets the same cross-home read capability over other
+  users' files. Accepted by the owner with that understanding; emo's and
+  ancamilea's `~/.claude` directories are now agent-readable by sudo-holders.
+- **Takes effect in a fresh session.** managed-settings loads at session start;
+  the session that made the change keeps running under the old policy.
+
+## Consequences
+
+- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
+  "cross-user analytics without reading homes" answer) remains useful but is no
+  longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
+- Larger blast radius: if an agent session running as a sudo-holder is
+  prompt-injected or otherwise compromised, it can now read every user's secrets
+  with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
+  is the remaining accountability control.
+- Reversible: restore the prior `claudeMd` bullets (backup kept at
+  `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
+  session.
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
 - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
-  15m policy cache, 60s persistent DB connections.
+  15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
+  hardening — decorrelates the 9 workers' recycles from PG blips). **No
+  `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
+  1:1 and saturate the session-mode pool (reverted 2026-06-10).
 - **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
+- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
+  `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
+  login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
+  burst 429'd the tail and a failed ES-module import left a blank login screen.
+- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
+  (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
+  DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
+  3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
+  blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+  + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
+  option), so request-serving is coupled to PG — this survives a short transient,
+  not a total CNPG outage.
+- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
+  (the repo's old `strategy:` key was silently inert → live ran the chart-default
+  25%/25% and dropped a server pod out of rotation on every roll). Now
+  `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
+- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
+  and renders a **blank login** on Safari/WebKit ≤16.3. Every iOS browser shares
+  the system WebKit (Chrome/Firefox on iOS are WebKit skins), so switching
+  browsers does not help on, e.g., iPadOS ≤15. The overlay image patches
+  `flows/views/interface.py::compat_needs_sfe()` to also serve authentik's
+  built-in legacy-browser **Simplified Flow Executor** (SFE, ES5) to old Safari
+  **and any iOS browser** on iOS ≤16.3,
+  so those clients get the *real* authentik login (password + MFA + reputation;
+  no auth downgrade). The SFE can't render Identification-stage **sources**
+  (authentik limitation), so the patch also injects static social-login `<a>`
+  links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
+  required for password-less accounts (e.g. Google-only users). A Traefik
+  basic-auth fallback was rejected: it would have put a single spoofable-UA
+  password in front of `vbarzin→wizard` (passwordless root on the devvm). See
+  `stacks/authentik/patch-compat-sfe.py`.
+- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
+  MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
+  a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
+  **cannot render WebAuthn** (enrol *or* validate), so that user gets
+  `unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
+  downgrade**: (1) **social login** — sources run `default-source-authentication`
+  (UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
+  button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
+  ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
+  runtime data (not Terraform): enrol via `ak shell`
+  (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
+  user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
+  his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
 - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
 - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
  TCP setup on the forward-auth subrequest path.
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
 wrapper in `main.tf` (so it applies deterministically even though the image is
 `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
 as the android-emulator stack.
+
+### noVNC black after a browser-container restart (x11vnc supervision)
+
+A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
+but the view is **black**, and the novnc container logs spew
+`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
+refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
+in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
+container's Xvfb over `localhost:6099` (shared pod network). When the browser
+container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
+Xvfb vanishes and x11vnc loses its X connection and exits.
+
+`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
+background children and `wait -n`s on them, exiting non-zero if **either** dies, so
+the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
+relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
+(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
+websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
+`<defunct>` zombie — and the view black until a manual pod restart. Same
+supervision pattern as the android-emulator stack's entrypoint.)
+
+**Diagnose:** `kubectl exec -n chrome-service deploy/chrome-service -c novnc -- ps aux | grep x11vnc`
+(a `<defunct>`/Z entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
+"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
+— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
+recovery** (no image change): restart just the novnc container with `kubectl exec
+-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
+and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
+
+> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
+> (`keel.sh/policy=never`, because the browser container's playwright image is
+> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
+> rebuilt `:latest` will **not** redeploy on its own. After the
+> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
+> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
+> and rollout (the novnc image is TF-managed — not in the deployment's
+> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -256,6 +293,42 @@ Key facts:
  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
  CLI's stealth never diverges from the in-cluster callers'.

+## Multi-user access (sharing the browser)
+
+There is ONE chrome-service browser with ONE persistent profile, warmed with
+**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
+drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
+reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
+sessions. Access is gated accordingly, per user.
+
+**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
+Viktor's browser for form-filling + captcha solving, rather than getting an
+isolated instance. The session-exposure trade-off above was explicitly accepted.
+
+Two independent grants make up "browser access" for a user:
+
+1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
+   `admin-services-restriction` policy: the `CHROME_ALLOWED` set
+   (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
+   username OR email. Add the user there. No kubeconfig/RBAC needed.
+2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
+   in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
+   kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
+   session). Provided by a per-user **ServiceAccount** with a long-lived token
+   (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
+   this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
+   resolve the Service and doesn't regress the user's normal read). The devvm
+   provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
+   reads that token and installs it as the user's DEFAULT kubeconfig context
+   (`<user>-browser@homelab`), keeping their personal OIDC login as the
+   `oidc@homelab` named context. The SA's existence is the source of truth for who
+   gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
+
+**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
+`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
+the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
+a token by deleting its `<user>-browser-token` Secret).
+
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@ -115,9 +115,67 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
 instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
 fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
 pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
-k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+k8s-portal, apple-health-data, audiblez-web, insta2spotify,
 audiobook-search) now also land on ghcr.

+**plotting-book** is a special case (a GitHub-first repo owned by Anca,
+ADR-0003): the build runs in *her* GitHub repo
+(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
+`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
+not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
+PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
+`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
+read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
+2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
+unchanged. Flow:
+
+```text
+ DEVELOP ───────────────────────────────────────────────────────────────────────
+   Anca (Codex / t3 web agent)
+        │  git push → main
+        ▼
+ ┌──────────────────────────────────────────────────────────────┐
+ │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│  ← canonical
+ │   .github/workflows/build-and-deploy.yml     on: push → main  │
+ └───────────────────────────┬──────────────────────────────────┘
+                             │  GitHub Actions runner (off-infra build · ADR-0002)
+        ┌────────────────────┴─────────────────────────────────┐
+        ▼                                                        ▼
+ ┌─────────────────────────────────────────────┐      ╔═══════════════════════════════════════╗
+ │ build job                                   │ push ║  GHCR · PRIVATE package                ║
+ │  • svu next --always → tag vX.Y.Z (→ repo)  │═════▶║  ghcr.io/passionprojectsanca/         ║
+ │  • buildx linux/amd64, provenance:false     │ tags ║       book-plotter  :vX.Y.Z  :latest  ║
+ │  • login ghcr (GITHUB_TOKEN, packages:write)│      ╚═══════════════════╤═══════════════════╝
+ │  • delete-package-versions (keep newest 10) │                          │
+ └───────────────────────┬─────────────────────┘                          │ pull (private,
+                         ▼  deploy job  [gate: repo var DEPLOY_ENABLED ≠ "false"]  via secret)
+   POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME}         │
+                         ▼                                                         │
+ ┌─────────────────────────────────────────────────────────────┐                 │
+ │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual)  │                 │
+ │   kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │                 │
+ │   kubectl rollout status                                     │                 │
+ └───────────────────────────┬─────────────────────────────────┘                 │
+                             ▼                                                     │
+ ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════      │
+ ┌─────────────────────────────────────────────────────────────┐                 │
+ │ Deployment plotting-book  (Recreate · image = ignore_changes)│                 │
+ │   imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
+ │   Pod → Express :3001  +  SQLite on PVC (proxmox-lvm)        │
+ └─────────────────────────────────────────────────────────────┘
+   guards / supporting:
+     • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED   (admission)
+     • Keel policy=patch @1h → watches GHCR via ghcr-credentials          (backstop)
+     • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
+
+ ═══════════════ Serving path (unchanged) ══════════════════════════════════
+   Browser ─▶ plotting-book.viktorbarzin.me  (non-proxied DNS → Traefik .203)
+           ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
+```
+
+Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
+`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
+
 ### Infra-owned images (issues #29 / #30)

 Images owned by the infra repo build on GHA workflows **in the infra repo's own
@ -163,9 +221,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
 | Pipeline | File | Purpose |
 |----------|------|---------|
 | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
-| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
 | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
 | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -176,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:

 **No build/test pipeline exists on any repo.** Do not (re)introduce one.

+### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
+
+infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
+and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
+push**. Left unguarded, two `terragrunt apply` runs race each other for the
+per-stack PG state lock — historically the #1 source of `Error acquiring the
+state lock` failures and push-supersede "killed" runs.
+
+- **Forge guard** (first command in the `apply` step): the push-apply runs **only
+  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
+  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
+  skip. Fail-open (unknown forge still applies). The mirror keeps running the
+  **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
+  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
+  have killed them.)
+- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
+  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
+  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
+  locked`) — the PG case was previously miscounted as a hard failure.
+- **Transient retry** (bounded, 3 attempts): only provider-registry download
+  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
+  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
+  are NOT retried — they fail fast.
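
Reviewer sketch (not part of the change): the three rules above live in the shell of `default.yml`, which is not shown here. The following Go program only illustrates the decision logic under stated assumptions: the marker strings are the ones quoted in the bullets, while the function names, the check order (lock before transient), and the `verdict` type are hypothetical. Vault 5xx is also retried per the text, but its log pattern is not given, so it is left out rather than guessed.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type verdict string

const (
	applied verdict = "applied"
	skipped verdict = "skipped" // state lock held elsewhere: skip, do not fail
	retry   verdict = "retry"   // transient: worth another attempt
	failed  verdict = "failed"  // config error, helm timeout, etc.: fail fast
)

// Marker strings quoted in the document.
var lockMarkers = []string{"is locked by", "Error acquiring the state lock", "already locked"}
var transientMarkers = []string{"Failed to install provider", "Client.Timeout"}

// errApply stands in for a non-zero exit of the apply command.
var errApply = errors.New("apply exited non-zero")

// onMirror reports whether the run is on the GitHub mirror. Empty or unknown
// URLs return false, so the guard fails open and the apply proceeds.
func onMirror(repoURL, forgeURL string) bool {
	return strings.Contains(repoURL, "github.com") || strings.Contains(forgeURL, "github.com")
}

func containsAny(s string, markers []string) bool {
	for _, m := range markers {
		if strings.Contains(s, m) {
			return true
		}
	}
	return false
}

// classify maps the output of a failed apply to a verdict.
func classify(output string) verdict {
	switch {
	case containsAny(output, lockMarkers):
		return skipped
	case containsAny(output, transientMarkers):
		return retry
	default:
		return failed
	}
}

// applyWithRetry runs the apply at most maxAttempts times and retries only
// when the failure classifies as transient. It returns the final verdict and
// the number of attempts used.
func applyWithRetry(run func() (string, error), maxAttempts int) (verdict, int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := run()
		if err == nil {
			return applied, attempt
		}
		if v := classify(out); v != retry {
			return v, attempt
		}
	}
	return failed, maxAttempts
}

func main() {
	if onMirror("https://github.com/example/infra", "https://github.com") {
		fmt.Println("[forge-guard] mirror push: skipping apply")
	}
	calls := 0
	flaky := func() (string, error) {
		calls++
		if calls < 3 {
			return "Client.Timeout exceeded while awaiting headers", errApply
		}
		return "Apply complete!", nil
	}
	v, n := applyWithRetry(flaky, 3)
	fmt.Println(v, n)
}
```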
+
+A pre-apply off-infra validate gate was evaluated and rejected: `terraform
+validate` runs without state but catches ~0 of the observed failures (they are
+provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
+lock contention — all invisible to static validate), and `plan` cannot run
+off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
+phase without mutating on config errors, so a separate in-pipeline plan-gate was
+also dropped as redundant.
+
 ### Woodpecker API

 Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

+#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
+
+Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.
+
+| Alert | Expr (abridged) | For | Severity |
+|---|---|---|---|
+| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
+| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
+
+The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.

-**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
+**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

 **Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)

--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@ -261,7 +261,7 @@ Traefik chain:

 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
-3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
+3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
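
The burst arithmetic behind the dedicated rate-limit middlewares in step 3 can be checked with a toy token bucket. This is a minimal sketch, assuming the limiter behaves like a plain token bucket that starts full and that all ~70 requests arrive at the same instant; `TokenBucket` and `rejected` are hypothetical names for illustration, not Traefik code.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy token bucket: refills `average` tokens per second, holds at most `burst`."""

    average: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Assumption: the bucket starts full.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst capacity.
        self.tokens = min(float(self.burst), self.tokens + (now - self.last) * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # this request would be answered with 429


def rejected(average: float, burst: int, parallel_requests: int) -> int:
    """Count requests refused when `parallel_requests` all arrive at t=0."""
    bucket = TokenBucket(average=average, burst=burst)
    return sum(not bucket.allow(now=0.0) for _ in range(parallel_requests))


if __name__ == "__main__":
    print("shared default 10/50:", rejected(10, 50, 70))
    print("actualbudget 50/300:", rejected(50, 300, 70))
    print("authentik 100/1000:", rejected(100, 1000, 70))
```

Under these assumptions the shared default refuses the tail of a 70-request cold load, while the two dedicated limits admit all of it, which matches the symptom described above.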

 Additional middleware:
@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che

 **Diagnosis**: Check Traefik middleware config for the affected IngressRoute.

-**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
+**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).

 ### Large Downloads or Uploads Truncate / Fail Partway

--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -132,6 +132,13 @@ for the supersession history — there is no longer an inline Traefik bouncer.)
  account hard-limits to **one** list), and CAPI is already covered in-kernel on
  direct hosts and by Cloudflare's own managed protections on proxied hosts.
  Registered bouncer key: **`kvsync`**.
+- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint
+  is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0`
+  (one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF
+  `429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it
+  uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and
+  escalated the throttle into a stuck state that left the list empty — a
+  self-inflicted DoS that this change prevents.
 - **Block-only**: the single-list limit precludes a separate
  captcha/managed-challenge list, so both ban and captcha decisions are enforced
  as a plain block at the edge.
@ -272,7 +279,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**

 The block below documents the locked design.

-Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
+Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.

 #### Detection sources

@ -285,7 +292,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne

 #### Alert rules (16 total)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.

 **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**

@ -364,6 +371,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).

+#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
+
+The durable **east-west flow trail** (below) is now the preferred data source for
+the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
+faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
+(ADR-0014: "Enforcement gains a better data source"). The unique observed
+namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
+namespaces a source is observed talking to (the `allow` set that seeds its
+NetworkPolicy):
+
+```sql
+SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
+```
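
The same lookup can be wrapped in a small helper. This is a sketch only: it uses an in-memory SQLite database as a stand-in for the CNPG `goldmane_edges` database, the sample rows and the `deny` action value are made up, and `allowed_destinations` and `demo_connection` are hypothetical names.

```python
import sqlite3
from typing import List


def allowed_destinations(conn: sqlite3.Connection, src_ns: str) -> List[str]:
    """Return the distinct namespaces `src_ns` was observed talking to (action='allow')."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (src_ns,),
    )
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway table with the edge columns and illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (src_ns TEXT, dst_ns TEXT, action TEXT, "
        "first_seen TEXT, last_seen TEXT, flow_count INTEGER)"
    )
    sample = [  # made-up data, not real cluster observations
        ("authentik", "dbaas", "allow", "2026-06-20", "2026-06-27", 120),
        ("authentik", "kube-system", "allow", "2026-06-20", "2026-06-27", 45),
        ("authentik", "monitoring", "deny", "2026-06-21", "2026-06-21", 1),
        ("monitoring", "dbaas", "allow", "2026-06-20", "2026-06-27", 300),
    ]
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)", sample)
    return conn


if __name__ == "__main__":
    print(allowed_destinations(demo_connection(), "authentik"))
```

The parameterised `?` placeholder is SQLite syntax; a PostgreSQL driver would typically use its own placeholder style for the same query.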
+
+The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
+observation caveat) is in
+[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
+**External / public-internet egress is NOT in this table** (empty-namespace flows
+are dropped) — for those destinations keep using the Calico flow-log observation
+(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
+existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
+out of scope** of the trail — it is observe-and-derive only.
+
+### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
+
+The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
+carried no identity). **Service identity = the workload's namespace** (primary),
+refined by a `service-identity` label in the few multi-Service namespaces
+(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
+
+1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
+   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
+   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
+   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
+   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
+   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
+   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
+   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
+   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
+2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
+   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
+   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
+   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
+   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
+   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
+   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
+   private key into TF state — **re-apply the stack if the operator rotates that
+   Secret**.
+3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
+   **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
+   `#alerts`; the `#security` channel was abandoned 2026-06-25 because that
+   webhook's Slack app isn't a member of it (a `#security` override 404s). See
+   runbook.
+
+The trail is **attribution-grade, not cryptographic** (reconstructs events in a
+trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
+limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
+the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
+(see monitoring.md). Full as-built, query recipes, and troubleshooting:
+[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
+`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
+
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
+++ b/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
@ -0,0 +1,117 @@
+# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
+
+**Date:** 2026-06-28
+**Status:** design → implementation
+**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
+
+## Problem
+
+The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
+next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
+it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
+deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
+release we're not ready for). The result, **every single night**:
+
+- a **Failed** preflight Job (`block()` exits 1), and
+- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
+
+But this block is **not actionable** — there's nothing we can upgrade to clear
+it; we can only wait for upstream (kyverno/ESO) and, separately, do the
+gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
+signal that's indistinguishable from a block we could actually fix.
+
+## Goal
+
+Make the gate **classify** each blocker and behave accordingly:
+
+| Class | Definition | Behaviour |
+|-------|-----------|-----------|
+| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
+| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
+| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
+
+Removed-API and containerd blocks are always **actionable**. **Held wins:** if
+*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
+acting on the actionable blockers wouldn't unblock it yet. The nightly report
+still lists everything so the full eventual scope is visible.
+
+Also (scope decision: "tidy the block path"): deliberate gate decisions
+(actionable-block **and** held) now make the preflight Job **Complete cleanly**
+(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
+exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
+1 → `K8sUpgradeChainJobFailed`.
+
+## Design
+
+### `compat-gate.py`
+- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
+- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
+- `check_addons`: when an addon blocks, decide its class:
+  - `pinned: true` in its matrix entry → `[PINNED]`.
+  - else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
+  - else → `[WAITING]` (`no released X version supports k8s T yet`).
+  - unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
+- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
+- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
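
The classification and exit-code rules above could look roughly like the following. This is a sketch under stated assumptions, not the actual `compat-gate.py`: the matrix entry shape (`{"pinned": bool, "versions": {addon_version: max_k8s}}`), the purely numeric dotted versions, the helper names, and the addon names in the usage example are all hypothetical.

```python
from __future__ import annotations

from typing import Optional

EXIT_SAFE, EXIT_ACTIONABLE, EXIT_HELD = 0, 2, 4
HELD_TAGS = ("[WAITING]", "[PINNED]")


def version_key(version: str) -> tuple[int, ...]:
    """Turn '26.3' into (26, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def k8s_minor(version: str) -> tuple[int, ...]:
    """Compare k8s versions on major.minor only ('1.36.2' -> (1, 36))."""
    return version_key(version)[:2]


def classify_addon(name: str, running: str, entry: dict, target: str) -> Optional[str]:
    """Return a tagged reason line if the addon blocks `target`, else None."""
    versions = entry.get("versions", {})
    if running not in versions:
        # Fail-safe: an unknown running version needs a human to look.
        return f"[ACTIONABLE] {name}: running version {running} is not in the matrix"
    if k8s_minor(versions[running]) >= k8s_minor(target):
        return None  # the running version already supports the target
    if entry.get("pinned"):
        return f"[PINNED] {name} is pinned at {running}"
    newer = sorted(
        (v for v, max_k8s in versions.items()
         if version_key(v) > version_key(running) and k8s_minor(max_k8s) >= k8s_minor(target)),
        key=version_key,
    )
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {newer[0]}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons: list[str]) -> int:
    """0 if nothing blocks; 4 if any blocker is held (held wins); else 2."""
    if not reasons:
        return EXIT_SAFE
    if any(reason.startswith(HELD_TAGS) for reason in reasons):
        return EXIT_HELD
    return EXIT_ACTIONABLE


if __name__ == "__main__":
    matrix = {  # illustrative entries only
        "addon-waiting": {"versions": {"1.0": "1.35"}},
        "addon-pinned": {"pinned": True, "versions": {"2.0": "1.35", "2.1": "1.36"}},
        "addon-actionable": {"versions": {"3.0": "1.35", "3.1": "1.36"}},
    }
    running = {"addon-waiting": "1.0", "addon-pinned": "2.0", "addon-actionable": "3.0"}
    reasons = [r for n, e in matrix.items() if (r := classify_addon(n, running[n], e, "1.36.2"))]
    print("\n".join(reasons))
    print("exit code:", exit_code(reasons))
```

Exit code `3` (gate-error) is left out because it is not derived from the reason list; where the unknown-version check sits relative to the `pinned` check is a choice made for this sketch.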
+
+### `upgrade-step.sh`
+- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
+- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
+  set `HALT_CHAIN=1`, **do not exit**.
+- `phase_preflight` gate handling routes on the gate's exit code:
+  - `0` → push `blocked=0`+`held=0`, proceed.
+  - `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
+  - `4` → `record_held`, `return 0` (Job Completes, **no alert**).
+- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
+  at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
+- postflight also resets the held gauge (`held=0`) alongside the existing gauge resets.
+
+### detector (`main.tf`, the `k8s-version-check` CronJob)
+- Consequence of the tidy change: refusals now **Complete** instead of Failing,
+  so the old "re-spawn only a *Failed* preflight" idempotency would skip a
+  refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
+  preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
+  gate refused — chain never advanced) — **silently** (no Slack), so a standing
+  hold re-evaluates each night without noise.
+- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
+  Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
+  flag), not for silent re-evaluations — killing the last nightly-noise source.
+
+### `addon-compat.json`
+- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
+  `26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
+  the `pinned` flag in `_comment`. Unpinning later = delete two keys.
+
+### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
+- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
+  actionable-only; reword annotation (reasons are in the nightly report, not a
+  per-run chain Slack).
+- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
+  clause — deliberate blocks no longer create Failed Jobs, so the alert again
+  means a genuine wedge.
+- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
+  nightly report surfaces it). Add a comment recording this.
+
+### `nightly-report.py`
+- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
+- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
+  (fallback bullets for untagged lines, so older reason strings still render).
+- Fetch reasons when an upgrade is available AND it is (blocked OR held).
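
Grouping the tagged reason lines, with a fallback for untagged ones, might be done along these lines. This is an illustrative sketch rather than the real `nightly-report.py`: `group_reasons` and the `Other` fallback heading are hypothetical, and the three section titles are the ones listed above.

```python
from typing import Dict, Iterable, List

SECTIONS = {
    "[ACTIONABLE]": "Action needed",
    "[WAITING]": "Waiting on upstream",
    "[PINNED]": "Pinned (held by us)",
}
OTHER = "Other"  # hypothetical heading for untagged (older) reason strings


def group_reasons(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket reason lines by their leading tag; untagged lines fall back to OTHER."""
    groups: Dict[str, List[str]] = {title: [] for title in SECTIONS.values()}
    groups[OTHER] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        for tag, title in SECTIONS.items():
            if text.startswith(tag):
                groups[title].append(text[len(tag):].strip())
                break
        else:
            # No known tag matched, so keep the line verbatim.
            groups[OTHER].append(text)
    # Drop empty sections so the report only shows headings with content.
    return {title: items for title, items in groups.items() if items}


if __name__ == "__main__":
    sample = [
        "[WAITING] no released kyverno version supports k8s 1.36 yet",
        "[PINNED] gpu-operator is pinned",
        "legacy untagged reason",
    ]
    for title, items in group_reasons(sample).items():
        print(title)
        for item in items:
            print("  -", item)
```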
+
+## Net effect on 1.36 today
+**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
+Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
+just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
+kyverno/ESO ship support **and** gpu-operator is unpinned.
+
+## Tests (TDD)
+- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
+  removed-API & containerd are actionable, exit_code mapping, + existing
+  patch/safe cases stay green.
+- `nightly-report`: held headline + grouped reasons; existing tests stay green.
+- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
+  (bash, not unit-tested).
+
+## Out of scope (separate follow-up)
+Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
+addon-readiness probe). This change only *consumes* the matrix.
--- a/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
+++ b/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
@ -0,0 +1,128 @@
+# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
+| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
+| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
+| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
+| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
+| **Status** | Closed |
+
+## Summary
+
+The failure surfaced in Woodpecker CI as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
+
+1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
+2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
+
+Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message.
+
+Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
+
+## Impact
+
+- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks.
+- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
+- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. |
+| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. |
+| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
+| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. |
+| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. |
+| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. |
+| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
+| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). Beads task closed. |
+
+## Root Cause
+
+The `metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after it is first set. When the L2 announcer's leader election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion of the CR, the reconciler does not progress.
+
+Why it manifested as Vault credential errors:
+
+1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds.
+2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from.
+3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions that terminated on the previous announcer's node are reset (RST) immediately.
+4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
+5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below).
+
+## Detection
+
+We did not have any of:
+- A direct alert for "MetalLB ServiceL2Status reconciler errors".
+- An alert for "PG LB VIP node changed N times in M minutes".
+- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`).
+
+The detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. It took 25 days from the first observed symptom (2026-04-21) to the RCA (2026-05-16).
+
+## Fixes & Mitigations
+
+### 1. Surface real error from `scripts/tg` (DONE)
+
+The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:
+
+```sh
+# scripts/tg lines 79-89 (current)
+if ! command -v vault >/dev/null 2>&1; then
+  echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
+  exit 1
+fi
+VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
+  echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
+  echo "$VAULT_OUT" >&2
+  echo "" >&2
+  echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
+  exit 1
+}
+```
+
+A comment in the code explicitly references this incident.
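
The same capture-and-report pattern can be shown in isolation. This is a minimal sketch, not code from `scripts/tg`: the helper name `run_or_report` and its message wording are assumptions made for illustration. It keeps a command's combined output and prints it only when the command fails, instead of discarding it with `2>/dev/null`.

```sh
#!/bin/sh
# Hypothetical helper, not part of scripts/tg: run a command, keep its combined
# stdout+stderr, and print that output to stderr only if the command fails.
run_or_report() {
  label=$1
  shift
  if out=$("$@" 2>&1); then
    # Success: pass the captured output through on stdout.
    printf '%s\n' "$out"
  else
    status=$?
    echo "ERROR: $label failed (exit $status). Command output follows:" >&2
    printf '%s\n' "$out" >&2
    return "$status"
  fi
}
```

Called as `run_or_report "vault read" vault read -format=json database/static-creds/pg-terraform-state`, the wrapper would name the step that failed and still show what the underlying tool actually said.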
+
+### 2. Stuck-CR cleanup procedure (DOCUMENTED)
+
+Reproduction check for future sessions (also in `code-aoxk` beads notes):
+
+```sh
+kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
+# If matches found → same root cause. Delete the stuck CR:
+kubectl get servicel2status -n metallb-system
+kubectl delete servicel2status.metallb.io <name> -n metallb-system
+```
+
+Speaker recreates the CR cleanly within seconds.
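
For a scripted version of the check above, the grep can be wrapped so it works on any log text piped into it. This is a sketch under assumptions: the function names `count_immutable_errors` and `needs_cr_cleanup` are hypothetical, and the only thing matched is the same `Invalid value.*immutable` pattern used in the manual procedure.

```sh
#!/bin/sh
# Hypothetical helpers: both read speaker log text on stdin, so they can be fed
# from `kubectl logs` or from a saved log file.

# Print the number of log lines that report an immutable-field rejection.
count_immutable_errors() {
  # grep -c exits 1 when the count is 0; `|| true` keeps the status clean.
  grep -ciE 'Invalid value.*immutable' || true
}

# Succeed (exit 0) only when at least one such line is present.
needs_cr_cleanup() {
  count=$(count_immutable_errors)
  [ "$count" -gt 0 ]
}
```

Usage would be along the lines of `kubectl logs -n metallb-system -l component=speaker --tail=200 | needs_cr_cleanup && echo "stuck ServiceL2Status suspected"`. The delete itself stays a manual step.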
+
+### 3. Long-term MetalLB controller fix (DEFERRED)
+
+The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:
+
+- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs).
+- **File upstream issue / patch** with reproducer.
+
+Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s).
+
+### 4. Alerting (DEFERRED)
+
+Suggested but not implemented:
+- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate.
+- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from CI namespace, alert if it ever fails.
+
+Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
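
The synthetic probe suggested above could look roughly like the following. It is a sketch, not an implemented check: the function name `probe_state_lock`, the `PSQL_BIN` override, the lock key `424242`, and the optional hold time are all assumptions. It takes a session-level advisory lock, optionally holds it for a few seconds so a VIP move has a chance to hit the open connection, then releases it. Any non-zero exit status would be the alert condition.

```sh
#!/bin/sh
# Hypothetical probe: acquire and release a session-level advisory lock.
#   $1 = libpq connection string (e.g. pointing at the PG VIP)
#   $2 = advisory lock key (assumed value, defaults to 424242)
#   $3 = seconds to hold the lock before releasing (defaults to 0)
probe_state_lock() {
  dsn=$1
  key=${2:-424242}
  hold=${3:-0}
  # All -c commands run in order within one psql session, so the lock taken by
  # the first is still held when the last one releases it. ON_ERROR_STOP makes
  # psql exit non-zero on the first SQL error. Stderr is left visible on purpose.
  PGCONNECT_TIMEOUT="${PGCONNECT_TIMEOUT:-5}" "${PSQL_BIN:-psql}" \
    -X -q -v ON_ERROR_STOP=1 -d "$dsn" \
    -c "SELECT pg_advisory_lock($key)" \
    -c "SELECT pg_sleep($hold)" \
    -c "SELECT pg_advisory_unlock($key)" >/dev/null
}
```

A CronJob wrapper, the credentials it would use, and the alert routing are out of scope for this sketch.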
+
+## Lessons
+
+1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
+2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
+3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
+4. **CNPG topology changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up as a third replica, not a primary change. Worth flagging for any future CNPG topology changes.
+
+## References
+
+- Beads: `code-aoxk` — closed 2026-05-26.
+- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
+- `kubectl get servicel2status -A` — current state, single allocation per service.
+- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@ -0,0 +1,97 @@
+# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
+
+> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
+> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
+> drift was a real *separate* latent bug fixed in the same change.
+
+**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
+the master control-plane phase for the first time — preflight passed, etcd
+snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
+kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
+static-pod-hash window across all internal retries, then auto-rolled-back to
+v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
+the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
+No data loss; no user-facing outage (the master carries control-plane taints, so
+no workloads were displaced).
+
+**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
+first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
+static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
+
+## Root cause — etcd IO starvation on the shared HDD
+
+The new kube-apiserver could not establish/keep a working connection to etcd
+during the upgrade because **etcd was IO-starved**. etcd's surviving container log
+from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
+
+- **1,180** `apply request took too long` warnings in 16 minutes;
+- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
+  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
+  to bring the new apiserver up.
+
+A reproduced 1.35.6 apiserver with no etcd dies with
+`F instance.go:233 Error creating leases: error creating storage factory: context
+deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
+lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
+shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
+that spindle:
+
+1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
+2. kubeadm dumping a full **~400MB etcd DB backup** to
+   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
+   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
+   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
+   image-GC threshold, so image GC churned during the drain too;
+3. master-drain pod evictions.
+
+### Correction — it was NOT the OIDC flag swap
+
+`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
+`--authentication-config` (structured multi-issuer OIDC) back to legacy
+single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
+was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
+those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
+(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
+etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
+the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
+were also ruled out.
+
+## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
+
+apiserver auth is configured in three places that must agree:
+(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+and `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
+(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
+which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
+the manifest from (3), so it would have reverted structured auth → **dashboard +
+kubectl SSO break after a successful upgrade** (recoverable: the chain's
+post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
+
+## Resolution
+
+1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
+2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
+3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
+
+## Prevention (landed in this change)
+
+| Gap | Fix |
+|-----|-----|
+| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
+| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
+| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
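
The pruning rule in the first row can be sketched as below. This is not the actual `upgrade-step.sh` code: the function name `prune_kubeadm_scratch` and its arguments are assumptions, and only the two path patterns and the 3-day age come from the table. It is written to be best-effort, so a missing directory or a failed delete never returns an error to the caller.

```sh
#!/bin/sh
# Hypothetical best-effort prune of kubeadm's leftover scratch directories.
#   $1 = scratch directory (defaults to /etc/kubernetes/tmp)
#   $2 = age threshold in days (defaults to 3)
prune_kubeadm_scratch() {
  dir=${1:-/etc/kubernetes/tmp}
  days=${2:-3}
  # Nothing to do if the directory does not exist.
  [ -d "$dir" ] || return 0
  # Only direct children matching the two kubeadm patterns are considered.
  # Note: find counts age in whole days, so "+3" matches entries whose mtime is
  # at least four full days old.
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mtime +"$days" -exec rm -rf -- {} + || true
  # Never abort the caller, whatever happened above.
  return 0
}
```

Because the function always returns 0, a preflight that calls it cannot be failed by the cleanup itself. Errors from `find` or `rm` still reach stderr.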
+
+## Lessons
+
+- **Capture the failing component's own logs before concluding.** The `kubeadm
+  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
+  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
+  "what config changes," not "why it crashed."
+- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
+  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
+  backup copy + drain) onto that spindle. code-oflt is the real fix.
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's
+  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
+  GC'd; 28GB had silently accumulated.
+- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
+  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/claude-auth-renew-workstation.md
+++ b/docs/runbooks/claude-auth-renew-workstation.md
@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
 secret/workstation/claude-users/<os-user>
 ```

+The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
+`kv put` only when the path does not exist yet), so keys that other tools
+co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
+A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
+
 The user's unrelated `mcpOAuth` credentials never leave their home directory.
 Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
 `~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
@ -75,8 +80,64 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
 ```

 Never copy another user's `.credentials.json` or scoped Vault token. Never restore
-the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
-login and would silently collapse all users onto one identity.
+a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
+outrank per-user login and would silently collapse all users onto one identity.
+(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
+identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
+
+## Long-lived per-user token (heavy concurrent-agent users)
+
+The six-hourly renewal above assumes Claude owns refresh-token rotation in a
+single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
+sessions** (interactive tmux panes + their `t3-serve` instance + always-on
+`start-claude.sh` agents) breaks that assumption: when the shared access token
+expires, the processes refresh **simultaneously**, the OAuth server rotates the
+refresh token, and the losing writer persists an **empty** refresh token —
+logging the user out roughly every access-token lifetime (~8h). Re-issuing the
+credential does not help; the race recurs.
+
+The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
+**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
+never touches `.credentials.json` — so there is nothing to race on. This is the
+user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
+client-side and unaffected), stored only in their OWN Vault path — **NOT** the
+forbidden shared token, and it never crosses OS users.
+
+**Enable it (one-time, per user):**
+
+1. The user mints their own token (interactive Enterprise SSO):
+
+   ```bash
+   claude setup-token        # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
+   ```
+
+2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
+   like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
+
+   ```bash
+   vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
+     setup_token=sk-ant-oat01-…
+   ```
+
+3. Materialize + activate (or just wait ≤6h for the timer):
+
+   ```bash
+   systemctl start claude-auth-sync@<os-user>.service
+   ```
+
+   `claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
+   (`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
+   the rotating-credential validate/backup/restore (so no false
+   `WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
+   that env file. **Sessions started before activation keep the old credential
+   until relaunched** — the user must restart their agents / `t3-serve` to cut over.
+
+**Disable it:** clear the field (`vault kv patch -method=rw
+secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
+the env file and the user reverts to the per-user SSO credential flow.
+
+**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
+re-store (step 2); the env file refreshes on the next sync.

 ## Verification

--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -0,0 +1,346 @@
+# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
+
+> As-built runbook for the Calico Goldmane + Whisker flow plane and the
+> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
+> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
+> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
+> (monitoring), #62 (egress allowlist queries), #63 (these docs).
+
+## What the trail is
+
+Three layers turn raw east-west traffic into a queryable, durable record of
+which Service talks to which. **Service identity = the workload's namespace**
+(primary), refined by a `service-identity` label in the few multi-Service
+namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
+
+| Layer | Component | Lifetime | Where it lives |
+|---|---|---|---|
+| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
+| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
+| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
+
+**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
+labels + allow-deny + policy-trace) streamed from Felix (the existing
+`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
+**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
+drove the whole design). **Whisker** is its live web UI. Because the ring
+buffer is *not* a trail (a Goldmane restart loses the window), the
+`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
+mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
+CronJob posts first-seen edges to Slack.
+
+The edge set is deliberately **low-cardinality** — one row per
+`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
+small no matter how much traffic flows.
+
+## Where the data lives
+
+### Whisker UI — live, ~60 min
+- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
+  login; `auth = "required"`). Shows the live flow stream + a service graph for
+  roughly the last hour. Use it for "what is talking right now"; it is **not**
+  history.
+- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
+  (HTTP), both in `calico-system`.
+- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
+  by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
+  empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
+  The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
+  whisker if its backend ever wedges for another reason.
+
+### CNPG `goldmane_edges` — durable
+- Postgres DB `goldmane_edges` on the CNPG cluster
+  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
+
+  ```
+  edge(src_ns text, dst_ns text, action text,
+       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
+       PRIMARY KEY (src_ns, dst_ns, action))
+  ```
+
+  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
+    action).
+  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
+    / public-internet) are **dropped** — the trail is about in-cluster service
+    relationships only. (Egress to the public internet is therefore NOT in this
+    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
+  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
+  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
+    is created idempotently by the aggregator at startup (canonical DDL also in
+    the repo at `migrations/0001_edge.sql`).
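
The reduction from raw flows to the edge set can be illustrated with a few lines of shell. This is a sketch only, not the aggregator's implementation: the function name `reduce_flows_to_edges` and the comma-separated input format are assumptions, and `first_seen` / `last_seen` tracking is left out. It applies the two drop rules described above (empty namespace, self-edge) and counts flows per `(src_ns, dst_ns, action)`.

```sh
#!/bin/sh
# Hypothetical illustration of the edge-set reduction.
# Input (stdin):  one flow per line as "src_ns,dst_ns,action" (made-up format).
# Output:         one line per surviving edge as "src_ns,dst_ns,action,flow_count".
reduce_flows_to_edges() {
  awk -F, '
    $1 == "" || $2 == "" { next }   # empty namespace: host-endpoint / public-internet
    $1 == $2             { next }   # self-edge: same source and destination namespace
    { count[$1 "," $2 "," $3]++ }   # one counter per (src_ns, dst_ns, action)
    END { for (k in count) print k "," count[k] }
  ' | LC_ALL=C sort
}
```

However many flows are fed in, the output has at most one line per namespace pair and action, which is why the table stays small.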
+
+### Slack `#alerts` — daily digest
+
+> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
+
+- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
+  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
+  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
+  — no new webhook was created.
+
+## How to enable / disable
+
+### Goldmane + Whisker (the flow plane)
+Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
+flags (those stay `false`; the operator's own `installation`/`apiServer` are
+operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
+
+- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
+  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
+  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
+  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
+  goldmane:7443`.
+- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
+  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
+
+**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
+toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
+ADR-0014).
+
+### Whisker public ingress (infra #57)
+Also in `stacks/calico/main.tf`:
+- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
+  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
+- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
+  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
+  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
+  This additive NP ORs in an allow for `namespaceSelector
+  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
+
+### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
+A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
+apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
+the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
+ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
+the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
+without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
+0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
+
+Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
+`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
+allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
+`local.ghcr_private_namespaces`) or pulls 401. Code repo:
+`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
+
+## mTLS cert — the REUSE decision (cert-reuse gotcha)
+
+The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
+client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
+identity** — any Tigera-CA-signed cert is accepted.
+
+Rather than copy the Tigera CA **private key** into Terraform state to mint our
+own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
+with this repo's global generate-providers/lockfile pattern), the stack
+**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
+Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
+`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
+verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
+`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
+cross-namespace-mounted).
+
+> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
+> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
+> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
+> and no `last_seen` updates land in the `edge` table. Hardening follow-up
+> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
+> removed (which would delete the reused source Secret).
+
+The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
+and the default cert/CA paths; the default ServerName (host sans port) is a SAN
+on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
+`GOLDMANE_TLS_INSECURE` override is needed.
+
+## How to query who-talks-to-whom
+
+**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
+SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
+
+```
+homelab edges --ns <ns>         # edges touching <ns> (either direction)
+homelab edges --peers-of <ns>   # <ns>'s distinct peer namespaces
+homelab edges --src <ns>        # <ns>'s egress peers   (--dst <ns> for ingress)
+homelab edges --new-since 24h   # edges first seen in the last day (or a date)
+homelab edges --denied          # blocked / lateral-movement attempts
+homelab edges --json [...]      # machine-readable, for agents/pipelines
+homelab edges --help            # full flag list
+```
+
+For ad-hoc SQL, `psql` into the DB (creds: Vault static role
+`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
+the single `edge` table.
+
+```sql
+-- Everything talking to a namespace (inbound), most-active first
+SELECT src_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
+
+-- Everything a namespace talks TO (outbound)
+SELECT dst_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
+
+-- New edges in the last 24h (what the digest reports)
+SELECT src_ns, dst_ns, action, flow_count, first_seen
+FROM edge WHERE first_seen > now() - interval '24 hours'
+ORDER BY first_seen DESC;
+
+-- Any DENIED edges (policy is dropping this pair)
+SELECT src_ns, dst_ns, flow_count, last_seen
+FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
+
+-- Full edge set as a graph adjacency list
+SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
+```
+
+For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
+the `edge` table intentionally aggregates that away.
+
+## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
+
+The durable edge set is a faster, identity-stamped data source for the existing
+**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
+`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
+iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
+a better data source"). It replaces the *internal* (namespace-to-namespace) leg
+of the allowlist; **external/public-internet egress is NOT in this table** (empty
+dst namespace, dropped) — for those destinations keep using the Calico flow-log
+path described in security.md.
+
+**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
+given source is *observed* talking to with `action='allow'`:
+
+```sql
+-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
+SELECT DISTINCT dst_ns
+FROM edge
+WHERE src_ns = '<ns>' AND action = 'allow'
+ORDER BY dst_ns;
+```
+
+```sql
+-- Full internal egress matrix for all namespaces at once
+SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
+FROM edge
+WHERE action = 'allow'
+GROUP BY src_ns
+ORDER BY src_ns;
+```
+
+```sql
+-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
+-- before tightening further)
+SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
+```
+
+**How this feeds enforcement (scope):** the derived `dst_ns` set is the
+*internal* half of a namespace's egress allowlist — it tells you which
+in-cluster namespaces to permit before flipping that namespace to default-deny.
+The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
+the external destinations still come from the Wave-1 observation snapshot.
+**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
+the phased per-namespace default-deny rollout (starting `recruiter-responder`)
+is tracked under `code-8ywc`. Cross-links:
+[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
+[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+
+> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
+> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
+> collect ≥7 days of edges before treating a namespace's `allow` set as
+> complete. The `first_seen` column tells you how long an edge has been known;
+> the digest surfaces brand-new ones daily.
+
+## Monitoring & health (infra #61)
+
+The aggregator pod has **no `/metrics` endpoint** — health is inferred from
+kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
+see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
+
+| Signal | What | Where |
+|---|---|---|
+| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
+| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
+| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
+
+The two alert layers are deliberately complementary: `AggregatorDown` →
+**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
+is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
+is the agreed floor.
+
+## Troubleshooting
+
+**Whisker UI 502 / unreachable.** The additive
+`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
+operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
+brand-new ingress host is also invisible to LAN split-horizon until the hourly
+`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
+`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
+(expect a 302 to Authentik — the gate working).
+
+**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
+2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
+policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
+*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
+`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
+**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
+Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
+kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
+whisker-backend resolves goldmane ONCE in the brief startup window before the
+policy programs, holds its long-lived gRPC stream, and only re-resolves when that
+stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
+DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
+... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
+SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
+
+FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
+(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
+ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
+the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
+the pod if it ever wedges for another reason. Immediate manual heal:
+`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
+from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
+10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
+query aimed at a kube-dns *pod IP* (always works).
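+
+A minimal sketch of what such an additive policy could look like in Terraform.
+It assumes the pod label `k8s-app=whisker` (taken from the manual-heal command
+above) and a hypothetical object name; the actual resource in `stacks/calico`
+may differ:
+
+```hcl
+# Sketch only: the metadata name and the pod label are assumptions, not copied from the repo.
+resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
+  metadata {
+    name      = "whisker-allow-dns-clusterip" # hypothetical object name
+    namespace = "calico-system"
+  }
+
+  spec {
+    # Select the whisker pod; label assumed from `-l k8s-app=whisker`.
+    pod_selector {
+      match_labels = {
+        "k8s-app" = "whisker"
+      }
+    }
+
+    policy_types = ["Egress"]
+
+    # Allow DNS to the kube-dns ClusterIP itself, not only to the kube-dns pods.
+    egress {
+      to {
+        ip_block {
+          cidr = "10.96.0.10/32"
+        }
+      }
+      ports {
+        port     = "53"
+        protocol = "UDP"
+      }
+      ports {
+        port     = "53"
+        protocol = "TCP"
+      }
+    }
+  }
+}
+```
+
+Because the rule is expressed as an `ip_block` for the ClusterIP, it adds to the
+operator's podSelector-only DNS rule rather than replacing it.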
+
+**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
+pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
+Common causes, in order:
+1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
+   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
+   handshake / `Flows.Stream` errors.
+2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
+   the pod kept the old one. The Deployment carries
+   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
+   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
+3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
+   reconnects automatically and resumes upserting. No data loss in the DB
+   (only the sub-hour live window in Whisker is gone).
+
+**Digest never posts / `DigestFailing` firing.** Inspect the most recent
+`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
+`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
+`ttl_seconds_after_finished=86400` GCs finished Jobs and their pods after a day,
+so check soon after a failed run. With `SLACK_WEBHOOK_URL` empty the binary
+forces a dry-run (no post), so verify the `goldmane-edges-slack` ExternalSecret
+resolved. Smoke test: run the image with `args: ["digest"]` + `DRY_RUN=1` to
+print the message instead of POSTing.
+> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
+> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
+> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
+> the `#security` channel override returning HTTP 404 — the shared
+> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
+> consolidating all Slack output to `#alerts` fixed it.
+
+**No edges at all in the table.** Confirm Goldmane is enabled
+(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
+`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
+completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
+(ghcr allowlist).
+
+## Related
+- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
+- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
+- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
+- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
+- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
+- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
+  `stacks/goldmane-edge-aggregator`, `stacks/calico`
--- a/docs/runbooks/homelab-vault-onboarding.md
+++ b/docs/runbooks/homelab-vault-onboarding.md
@ -0,0 +1,164 @@
+# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets)
+
+## Scope
+
+`homelab vault` fronts **two unrelated secret stores** — the name collides, so
+the command keeps them clearly separated:
+
+- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP).
+  The verbs below give each devvm roster user no-HITL access to **their own**
+  Vaultwarden vault (and any Organization Collection shared with their account).
+  It shells out to the official `bw` CLI; the user's Vaultwarden credentials live
+  only in their isolated Vault path `secret/workstation/claude-users/<os-user>`
+  and are decrypted as that OS user — the admin never sees them.
+- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the
+  `secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`.
+  The `kv` verbs use the caller's **own** Vault token (`vault login -method=oidc` →
+  `~/.vault-token`), **not** the per-user scoped token behind the Vaultwarden verbs
+  (which only covers the `claude-users/<user>` path); access is whatever your Vault
+  policy grants.
+
+```text
+# Vaultwarden (password manager)
+homelab vault setup             one-time: store VW email + master password + API key
+homelab vault status            configured / unlocked / reachable (no secrets)
+homelab vault list [--search Q]  item names (no secrets)
+homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
+homelab vault get <name> --all  all fields (incl. custom) as JSON; pipe it (| jq)
+homelab vault code <name>       current TOTP code
+homelab vault lock              lock / log out the local bw session
+
+# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token)
+homelab vault kv get <path> [--field K]   read an infra KV secret
+homelab vault kv list <path>              list sub-paths
+homelab vault kv put <path> <key>         write one key (value via stdin; merges)
+```
+
+## How auth works (why a non-admin can use it)
+
+`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
+this order (`ensureVaultToken`, `cli/cmd_vault.go`):
+
+1. an explicit `$VAULT_TOKEN` (a deliberate override), then
+2. the per-user **scoped token** that `claude-auth-sync` maintains at
+   `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`), then
+3. a native `~/.vault-token` (admins who carry one; non-admins usually don't).
+
+**The scoped token deliberately beats `~/.vault-token`.** This tool only touches
+your own `secret/workstation/claude-users/<user>` path, and a power-user who ran
+`vault login -method=oidc` carries a read-only `~/.vault-token` (capability
+`deny` on that path); letting it win would shadow the scoped token and fail every
+op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The
+CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when
+unset, so it works from non-login shells (tmux panes, AFK agent subprocesses)
+that never sourced `/etc/environment` — otherwise every `vault` child hits the
+`127.0.0.1:8200` default and fails `connection refused` (exit 2).
+
+That scoped policy grants exactly `create`/`read`/`update` on the user's own
+`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
+tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
+`kv put` only when the path does not exist yet. This preserves the
+`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
+co-locates there. (The admin-only bugs were fixed 2026-06-27; the
+`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.)
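+
+For orientation, a policy with those capabilities might be rendered roughly as
+below. This is an illustrative sketch, not the policy generated by the vault
+stack: `<user>` is a placeholder, and the KV v2 `secret/data/...` form is taken
+from the path shown in Troubleshooting.
+
+```hcl
+# Hypothetical rendering of workstation-claude-<user>.
+# No "patch" capability is granted, which is why the CLI uses the
+# read-modify-write form (`vault kv patch -method=rw`) instead of an HTTP PATCH.
+path "secret/data/workstation/claude-users/<user>" {
+  capabilities = ["create", "read", "update"]
+}
+```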
+
+## Prerequisites (per user)
+
+- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
+  been applied → their `workstation-claude-<user>` policy exists.
+- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
+  token exists at `~/.config/claude-auth-sync/vault-token`.
+- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
+- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
+  (self-service signup is open; admin panel is disabled).
+
+## One-time admin steps (devvm)
+
+`bw` must be system-wide so every user resolves it (it is a Node script, and
+`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
+to the npm `/usr` prefix; the guard checks the **system** path, not
+`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
+install, leaving non-admins with no backend). To install on a running box:
+
+```bash
+sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
+/usr/bin/bw --version   # call the system path directly; a personal ~/.local/bin/bw could mask it
+```
+
+After landing a `cli/` change, rebuild the binary so users pick it up:
+
+```bash
+# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it
+sudo bash -c 'cd /home/wizard/code/infra/cli && \
+  go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \
+  -o /usr/local/bin/homelab .'
+```
+
+(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)
+
+## User onboarding
+
+The user runs these as themselves. The master password / API key are entered
+interactively (never on the command line) and stored only in the user's Vault
+path.
+
+1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
+   copy the `client_id` (`user.xxxx`) and `client_secret`.
+2. Configure:
+
+   ```bash
+   homelab vault setup        # prompts: VW email, API client_id/secret, master password
+   homelab vault status       # → "vault: configured, unlocked, reachable ✓"
+   homelab vault list         # item names (own vault + any shared Collections)
+   ```
+
+## Shared-Collection access (sharing passwords with a user)
+
+`homelab vault` surfaces Organization Collection items automatically once the
+user's Vaultwarden account is a confirmed member. These steps are done by the
+vault owner in the **Vaultwarden web UI** (they need the owner's master
+password — not an infra/Terraform operation):
+
+1. Create or reuse an **Organization** and a **Collection** of shared logins.
+2. **Invite** the user's Vaultwarden account to the Organization, granting
+   **"Can view"** on that Collection (least privilege).
+3. The user accepts the email invite and confirms membership.
+4. The user runs `homelab vault list` — the shared items now appear alongside
+   their own (a `homelab vault status` sync picks them up).
+
+## Security model (the no-HITL trade)
+
+Identity is the kernel UID. Anything running as the user can decrypt the user's
+vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
+never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
+fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
+(auto-clearing) rather than scrollback. The admin's Vault token is never used by
+a non-admin: each user authenticates with their own scoped token.
+
+## Verification
+
+```bash
+# the scoped token carries the right policy
+VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
+  vault token lookup -format=json | jq '.data.display_name, .data.policies'
+#   → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]
+
+sudo -u <user> -i bw --version        # /usr/bin/bw resolves for the user
+sudo -u <user> -i homelab vault status
+```
+
+## Troubleshooting
+
+**`homelab vault setup` (or any verb) fails with `exit status 2`** — older
+binaries swallowed the underlying `vault` error; the message now includes it.
+Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis):
+
+- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the
+  caller's shell. The CLI now self-defaults it, but if you see this on an old
+  binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`.
+- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/<user>`
+  → a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`,
+  policy `default`, capability `deny` on that path) was shadowing the scoped
+  token. The CLI now prefers the scoped token; on an old binary, `rm
+  ~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with
+  `VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/<user>`
+  → must be `create, read, update`.
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -36,11 +36,13 @@ envsubst on /template/job-template.yaml  | kubectl apply -f -
  ▼

 Job 0 — preflight       (pinned: k8s-node1)
-  ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
+  ├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
  ├── All nodes Ready + no Mem/Disk pressure
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
+  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
+  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@ -112,18 +114,36 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).

 This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.

-**On a block**, the gate:
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
-  Prometheus alert),
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
-  this is not a failure). Because the block happens **before any mutation, no
-  rollback is involved**; nothing was changed.
+**The gate classifies each refusal** (2026-06-28) so it only alerts when there's
+something to do. `compat-gate.py` returns a per-class exit code and puts a `[TAG]` on every reason:

-**To clear a block**: upgrade the named addon (or migrate the API caller off the
-deprecated group/version, or bump containerd on the named node) so the offending
-condition no longer holds. The **next nightly run then proceeds automatically** —
-no manual chain restart needed.
+- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in
+  the compat matrix** and upgrading it would clear the block (or an in-use
+  deprecated API must be migrated / a node's containerd bumped).
+- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the
+  target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream
+  release can clear it.
+- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is
+  **deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator,
+  whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel).
+- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is
+  held — acting on the actionable ones wouldn't unblock it yet.
+
+**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1`
+for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
+doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
+decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). The refusal
+happens before any mutation, so there is nothing to roll back. Reasons (grouped by
+class) appear in the **morning nightly report**, not in a per-run Slack message.
+
+- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
+  it by doing the named upgrade/migration; the next nightly run proceeds.
+- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD`
+  line, because it can't be actioned now (a nightly alert would cry wolf). It
+  clears itself once upstream ships support (refresh `addon-compat.json`) or the
+  pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
+  night, silently re-spawning the refused-but-Complete preflight (so a cleared
+  block is picked up next run, not after the 7d Job TTL).

 The **compat matrix** lives in
 `stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
@ -163,6 +183,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 | `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
 | `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
 | `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
+| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
+| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
 | `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
 | `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |

@ -171,8 +193,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
+- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
+- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
 - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.

 ### Nightly upgrade report (Slack)
@ -181,8 +203,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
 default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
 alert-digest) posts ONE Slack summary each morning of the previous night's run:
 running version, detector freshness, detected target + kind, the outcome
-(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
-🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
+(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
+🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
 the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
 blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
 Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
@ -222,22 +244,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names

 ## Common Operations

-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)

 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
-401). This used to require a manual re-apply after **every** control-plane bump.
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
+still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this — verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
+etcd IO starvation**, not this drift; post-mortem:
+`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.

-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-crashloop the operator). It's idempotent, health-gates `/livez` with
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
-apply (the version upgrade itself already succeeded). So a chain-driven
-control-plane bump no longer breaks SSO. The master phase self-skips when master
-is already at target, so this only runs when master was actually upgraded.
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
+image change. Zero live impact (the kubeadm-config ConfigMap is consulted only during an upgrade).
+
+**Backstops:**
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+  NOT block — the drift only breaks SSO, which is recoverable) if
+  `--authentication-config` would still be dropped.
+- The `rbac` stack still publishes its restore script to the
+  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
+  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
+  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
+  re-reconciles kubeadm-config. Self-skips when master is already at target.

 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/docs/runbooks/pfsense-egress.md
+++ b/docs/runbooks/pfsense-egress.md
@ -0,0 +1,72 @@
+# Runbook: pfSense WAN / egress outage
+
+**Scope:** the cluster (and home) loses **internet egress** while pfSense is
+otherwise alive — internal VLAN routing and DNS keep working. This is the
+**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
+IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
+stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
+probe existed; the cloudflared replica metric stayed green). The alerts +
+probes below close that gap. Incident detail: memory ids #6715–#6723.
+
+pfSense is a **single point of failure** (no HA): it is the k8s default gateway
+(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
+**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
+Archer AX6000). `WANGW` is the sole IPv4 default gateway; there is no gateway group or failover.
+
+## Alerts (all in `stacks/monitoring/modules/monitoring/`)
+
+| Alert | Signal | Means |
+|-------|--------|-------|
+| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
+| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
+| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
+| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
+| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
+| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
+
+Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
+NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
+/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
+alert pages, not a storm.
+
+`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
+the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
+metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
+
+## Diagnose (read-only first)
+
+1. **Confirm scope** — is it egress-only or total?
+   - In Grafana (`monitoring` namespace), query `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
+   - Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
+2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
+   ```
+   ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1      # devvm wizard key (id #6784)
+   clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss'   # dpinger gateway alarms
+   clog /var/log/routing.log  | grep -iE 'default|route'              # default-route add/delete
+   clog /var/log/system.log   | tail -200
+   netstat -rn | head                                                 # is the default route present?
+   ls -la /var/crash/                                                 # panic/textdump?
+   ```
+   (If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
+   config.xml — re-add the key via console or WebGUI; see id #6718.)
+3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
+   clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
+   fault is unlikely; a reboot fixing it points at **pfSense-side state**.
+
+## Recover
+
+- **Fast path (known fix):** reboot pfSense, which re-adds the default route, re-arms
+  dpinger, and flushes pf state. **Capture the logs above FIRST**: the log files
+  persist on disk, but a reboot wipes volatile state such as the live routing table
+  (`netstat -rn`) that may be needed to find the real mechanism.
+- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
+  WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
+  re-eval. Confirm `netstat -rn` shows the default route restored.
+
+## Prevent / harden (deferred, needs a live-pfSense change)
+
+Not done in this monitoring change — tracked for a follow-up with hands-on
+pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
+instead of an external IP + widen thresholds; disable `gw_down_kill_states` for
+the single WAN; add a failover gateway group; a 60s auto-recovery watchdog;
+ship pfSense system/gateway/routing syslog to the cluster so these logs become
+centrally queryable.
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48

 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
    esac
 }

+# --- 48. Goldmane edge-aggregator availability ---
+#
+# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
+# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
+# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
+# this check reads the Deployment's Available condition directly so the trail
+# silently dying surfaces in the health board (mirrors the AggregatorDown
+# Prometheus alert). Missing Deployment / not-Available -> FAIL.
+check_goldmane_aggregator() {
+    section 48 "Goldmane Edge-Aggregator"
+    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
+    local avail desired ready
+
+    # Existence probe first; an absent Deployment is a hard fail (the trail isn't deployed).
+    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
+        json_add "goldmane_aggregator" "FAIL" "deployment missing"
+        return 0
+    fi
+
+    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
+        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
+    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
+    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
+    ready=${ready:-0}
+    desired=${desired:-0}
+
+    if [[ "$avail" == "True" ]]; then
+        pass "Edge-aggregator Available ($ready/$desired ready)"
+        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
+    else
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
+        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
+    fi
+}
+
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift
+        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
    )

    # Auto-fix mutates cluster state inside individual checks — keep that
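
Review note (sketch only, not part of this change): check 48 issues one `kubectl get` per field. If the extra API round-trips ever matter, the three fields could be fetched in one call and split locally. The helper below is hypothetical and assumes the status arrives as a single `Available|ready|desired` string; it mirrors the PASS/FAIL wording of the check but does not call `pass`, `fail` or `json_add`.

```sh
# Hypothetical helper: turn "Available|readyReplicas|replicas" into a verdict.
# Empty fields (e.g. readyReplicas absent at 0 ready) default the same way the
# check does: counts fall back to 0, the condition to "unknown".
deploy_verdict() (
    IFS='|' read -r avail ready desired <<EOF
$1
EOF
    ready=${ready:-0}
    desired=${desired:-0}
    if [ "$avail" = "True" ]; then
        echo "PASS $ready/$desired ready"
    else
        echo "FAIL $ready/$desired ready; Available=${avail:-unknown}"
    fi
)

# Illustrative single call (assumes kubectl copies the literal "|" between
# jsonpath expressions through unchanged):
#   status=$(kubectl get deploy "$dep" -n "$ns" -o \
#     jsonpath='{.status.conditions[?(@.type=="Available")].status}|{.status.readyReplicas}|{.spec.replicas}')
#   deploy_verdict "$status"
```

A failed or empty `kubectl` result degrades to the FAIL branch, which matches the check's intent that a silently dead trail must surface.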
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@ -240,6 +240,79 @@ EOF
  log "wrote OIDC kubeconfig -> $user:~/.kube/config"
 }

+# Hands-off chrome-service browser credential. For a user who has a
+# `<os_user>-browser` ServiceAccount in the chrome-service namespace (created in
+# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT
+# context authenticates with that SA's long-lived token — so `homelab browser`
+# (which shells out to `kubectl port-forward -n chrome-service`) works
+# non-interactively, even from a headless agent session (the user's interactive
+# OIDC login can't authenticate a headless kubectl). The user's personal OIDC
+# identity is retained as the `oidc@homelab` named context
+# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of
+# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA
+# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts).
+install_browser_kubeconfig() {
+  local user="$1" home kc sa secret token server ca tmp
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -z "$home" ]] && return 0
+  sa="${user}-browser"
+  secret="${sa}-token"
+  [[ -r "$ADMIN_KUBECONFIG" ]] || return 0
+  # Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read.
+  KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0
+  token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)"
+  [[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; }
+  server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')"
+  ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
+  [[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; }
+  kc="$home/.kube/config"
+  tmp="$(mktemp)"
+  cat > "$tmp" <<EOF
+apiVersion: v1
+kind: Config
+clusters:
+- name: homelab
+  cluster:
+    server: $server
+    certificate-authority-data: $ca
+contexts:
+- name: ${sa}@homelab
+  context:
+    cluster: homelab
+    user: $sa
+- name: oidc@homelab
+  context:
+    cluster: homelab
+    user: oidc
+current-context: ${sa}@homelab
+users:
+- name: $sa
+  user:
+    token: $token
+- name: oidc
+  user:
+    exec:
+      apiVersion: client.authentication.k8s.io/v1beta1
+      command: kubectl
+      args:
+      - oidc-login
+      - get-token
+      - --oidc-issuer-url=$OIDC_ISSUER
+      - --oidc-client-id=kubernetes
+      - --oidc-extra-scope=email
+      - --oidc-extra-scope=profile
+      - --oidc-extra-scope=groups
+      interactiveMode: IfAvailable
+EOF
+  if cmp -s "$tmp" "$kc" 2>/dev/null; then rm -f "$tmp"; return 0; fi   # already current -> no churn
+  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi
+  install -d -o "$user" -g "$user" -m 0700 "$home/.kube"
+  install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; }
+  rm -f "$tmp"
+  log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config"
+  return 0
+}
+
 # Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
 # T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
 env_set() {
@ -594,6 +667,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
      refresh_user_clone   "$os_user" code
    fi
    install_user_kubeconfig "$os_user"
+    install_browser_kubeconfig "$os_user"    # hands-off chrome-service CLI cred (no-op unless the user has a browser SA)
    deploy_user_launcher "$os_user"          # keep ~/start-claude.sh current (skel only seeds new accounts)
  fi
  refresh_codex_mirror "$os_user"            # all tiers — mirror of the managed claudeMd
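
Review note (sketch only, not part of this change): `install_browser_kubeconfig` stays idempotent by rendering to a temp file and comparing it with the installed copy before writing. Reduced to its core, and leaving out the ownership flags and dry-run branch, the pattern is roughly the following. The function name `install_if_changed` and its `written` / `unchanged` output are hypothetical.

```sh
# Hypothetical helper: install src over dst only when the contents differ.
# `cmp -s` exits non-zero when dst is missing or different, so both cases
# fall through to the install.
install_if_changed() (
    src=$1
    dst=$2
    if cmp -s "$src" "$dst" 2>/dev/null; then
        echo unchanged
    else
        install -m 0600 "$src" "$dst" && echo written
    fi
)
```

Because an unchanged file is never rewritten, its mtime stays stable across reconcile runs, which is what keeps the hourly run from churning a kubeconfig whose SA token has not changed.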
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@ -11,6 +11,12 @@ Environment=HOME=/home/%i
 Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
 Environment=NODE_ENV=production
 EnvironmentFile=/etc/t3-serve/%i.env
+# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
+# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
+# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
+# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
+# users on the normal per-user Enterprise-SSO credential flow).
+EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
 WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
--- a/scripts/test-claude-auth-sync.sh
+++ b/scripts/test-claude-auth-sync.sh
@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
 no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
 no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca

+# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
+# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
+# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
+# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
+fakebin="$tmp/bin"; mkdir -p "$fakebin"
+store="$tmp/vault-store.json"
+cat > "$fakebin/vault" <<'FAKE'
+#!/usr/bin/env bash
+# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
+[[ "$1" == kv ]] || { echo '{}'; exit 0; }   # token lookup etc. -> ignore
+op="$2"; shift 2
+store="$VAULT_FAKE_STORE"
+case "$op" in
+  get)
+    for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
+    if [[ "$*" == *-format=json* ]]; then
+      [[ -f "$store" ]] || { echo "No value found"; exit 2; }
+      jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
+    fi
+    [[ -f "$store" ]] || exit 2                # bare get == existence check
+    if [[ -n "${field:-}" ]]; then
+      v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
+      printf '%s' "$v"; exit 0
+    fi
+    exit 0 ;;
+  put)   echo '{}' > "$store" ;;                          # full replace
+  patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;;  # merge (rw)
+  *) exit 1 ;;
+esac
+for a in "$@"; do
+  case "$a" in
+    -*|secret/*) continue ;;                  # flags + the path arg
+    *=*) k="${a%%=*}"; v="${a#*=}"
+         t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
+  esac
+done
+exit 0
+FAKE
+chmod +x "$fakebin/vault"
+
+CAS_VAULT_PATH="secret/workstation/claude-users/test"
+CAS_CREDENTIALS="$tmp/credentials.json"
+CAS_STATE_DIR="$tmp/state"
+_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
+
+printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store"   # pretend `homelab vault setup` ran
+ok "backup succeeds (existing doc)"   cas_backup
+eq "merge preserves sibling key"      keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
+eq "merge writes claude oauth"        access  "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
+
+rm -f "$store"                                                    # fresh user: no doc yet
+ok "backup succeeds (creates doc)"    cas_backup
+eq "create writes claude oauth"       access  "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
+
+PATH="$_oldpath"; unset VAULT_FAKE_STORE
+
 printf '\n%d passed, %d failed\n' "$pass" "$fail"
 (( fail == 0 ))
--- a/scripts/workstation/claude-auth-sync.sh
+++ b/scripts/workstation/claude-auth-sync.sh
@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
 CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
 CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
 CAS_LOG="$CAS_STATE_DIR/sync.log"
+# Where a long-lived per-user setup-token is materialized as an env file
+# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
+# already-ReadWritePaths config dir so the sandboxed service may write it.
+CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"

 cas_log() {
  mkdir -p "$CAS_STATE_DIR"
@ -82,7 +86,17 @@ cas_backup() {
    return 1
  }
  expires="$(jq -r '.expiresAt' <<<"$oauth")"
-  vault kv put "$CAS_VAULT_PATH" \
+  # MERGE into the shared path so sibling keys other tools co-locate there
+  # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
+  # is read+update (needs no `patch` capability) but requires the secret to
+  # already exist, so create it with `kv put` on the very first backup only.
+  local -a write_cmd
+  if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
+    write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
+  else
+    write_cmd=(vault kv put "$CAS_VAULT_PATH")
+  fi
+  "${write_cmd[@]}" \
    claude_ai_oauth_json="$oauth" \
    credential_expires_at_ms="$expires" \
    backed_up_at="$(date -Is)" >/dev/null || {
@ -123,6 +137,41 @@ cas_restore() {
  cas_log "RECOVERED restored Claude OAuth state from Vault"
 }

+# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
+# be stored in this user's OWN Vault path (field `setup_token`). When present it
+# is the authoritative credential: it bypasses the shared
+# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
+# users running many concurrent Claude sessions (interactive + t3-serve + always-on
+# agents) that otherwise race on refresh and wipe each other's refresh token.
+# We materialize it to a user-owned env file that start-claude.sh and
+# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
+# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
+# OS users. Returns 0 when a token is active, so the caller skips the
+# rotating-credential validate/backup/restore (probing the now-vestigial
+# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
+cas_sync_setup_token() {
+  local token desired tmp
+  token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
+  if [[ "$token" != sk-ant-oat01-* ]]; then
+    if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
+      rm -f "$CAS_TOKEN_ENV_FILE"
+      cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
+    fi
+    return 1
+  fi
+  desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
+  if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
+    cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
+    return 0
+  fi
+  tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
+  printf '%s\n' "$desired" > "$tmp" &&
+    chmod 0600 "$tmp" &&
+    mv "$tmp" "$CAS_TOKEN_ENV_FILE" || { rm -f "$tmp"; cas_log "FAIL could not write token env file"; return 1; }
+  cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
+  return 0
+}
+
 cas_main() {
  umask 077
  for bin in jq vault claude timeout flock; do
@ -133,6 +182,11 @@ cas_main() {
  flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }

  cas_prepare_vault || return 1
+  # A long-lived per-user setup-token, if provisioned, is authoritative and
+  # non-rotating — materialize it and skip the rotating-credential dance.
+  if cas_sync_setup_token; then
+    return 0
+  fi
  if cas_live_auth_ok; then
    cas_backup
    return
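
Review note (sketch only, not part of this change): `cas_sync_setup_token` materializes the env file by staging it next to the target and renaming it into place, so a reader such as `t3-serve@.service` or `start-claude.sh` never sees a half-written token. A stand-alone version of that staging step, assuming the target directory already exists, might look like this. The name `write_env_file` is hypothetical.

```sh
# Hypothetical helper: write a single KEY=VALUE line to target atomically.
# The temp file is created in the target's directory so the final mv is a
# same-filesystem rename; any failure removes the staged file.
write_env_file() (
    target=$1
    line=$2
    tmp=$(mktemp "${target}.XXXXXX") || exit 1
    if printf '%s\n' "$line" > "$tmp" && chmod 0600 "$tmp" && mv "$tmp" "$target"; then
        exit 0
    fi
    rm -f "$tmp"
    exit 1
)
```

Checking every step before reporting success matters here because the caller skips the rotating-credential path entirely once the token is reported active.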
--- a/scripts/workstation/claude-hooks/homelab-memory-recall.py
+++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py
@ -45,9 +45,15 @@ def main() -> None:
    try:
        res = subprocess.run(
            [homelab, "memory", "recall", prompt, "--limit", "5"],
-            capture_output=True, text=True, timeout=4, env=os.environ,
+            capture_output=True, text=True, errors="replace", timeout=4,
+            env=os.environ,
        )
-    except (subprocess.TimeoutExpired, OSError):
+    except Exception:
+        # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
+        # truncated multibyte (Cyrillic) output — must silently skip recall this
+        # turn, exactly like the MCP being unavailable. errors="replace" above
+        # also keeps a mid-rune-truncated payload from raising here at all. Never
+        # let this hook surface a "UserPromptSubmit hook error".
        return

    out = (res.stdout or "").strip()
--- a/scripts/workstation/claude-skills/README.md
+++ b/scripts/workstation/claude-skills/README.md
@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.

 - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
 - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
+- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an
+  **emo-specific variant**, not a copy of the canonical skill. It started as a
+  copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
+  2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
+  in `SKILL_USERS`, a read-only power-user). The canonical admin skill
+  (`.claude/skills/cluster-health/`) is the full 47-check version and is left
+  untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
+  clobber the personalization. Maintain the two independently.

 ## Refreshing

-Re-snapshot from a current install and commit the diff:
+Re-snapshot the upstream skills from a current install and commit the diff:

 ```sh
 cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
 ```

-Snapshot taken 2026-06-23.
+`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
+`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
+place here when emo's needs change, then refresh his live copy (the provisioner's
+`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
+copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
+`chown emo:emo`, or remove emo's copy and re-run the reconcile).
+
+Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
+personalized for emo 2026-06-26.
--- a/scripts/workstation/claude-skills/cluster-health/SKILL.md
+++ b/scripts/workstation/claude-skills/cluster-health/SKILL.md
@ -0,0 +1,146 @@
+---
+name: cluster-health
+description: |
+  Personalized for emo. Check whether the homelab Kubernetes cluster is
+  affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
+  the MPPT ATS, lights, climate, security, irrigation). Use when:
+  (1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
+  (2) "is the cluster affecting Sofia / my devices",
+  (3) "check the cluster", "cluster health", "is everything running",
+  (4) a device on the Барзини → Статус dashboard looks offline.
+  Runs the cluster-wide healthcheck read-only and triages it by what
+  ha-sofia actually depends on; the rest of the cluster is the admin's area.
+author: Claude Code
+version: 3.0.0-emo
+date: 2026-06-26
+---
+
+# Cluster Health — personalized for emo (ha-sofia focus)
+
+## What you actually care about
+
+You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
+the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
+irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
+cluster matters to you **only when it's breaking something ha-sofia or your
+devices depend on.** Anything else is the admin's (wizard's) area — note it in
+one line and move on; don't chase it.
+
+You have **read-only** cluster access. You can SEE everything but change
+nothing — so when something on your chain is broken, the job is to confirm it
+and hand it off, not to repair it.
+
+## How ha-sofia depends on the cluster
+
+ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me),
+**not** in the cluster. It depends on the cluster through exactly two things:
+
+1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
+   every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
+   and the ATS stop responding. **This is the #1 thing to check.**
+2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
+   reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
+   for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
+   Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
+   you can't reach ha-sofia remotely.
+
+Everything else in the cluster is unrelated to you unless it's hosting one of
+those pods.
+
+## Step 1 — run the healthcheck (read-only, with your HA token)
+
+Your account can't read Vault, so load your own ha-sofia token first (it was
+minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
+the script from YOUR clone, read-only:
+
+```bash
+cd /home/emo/code
+export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
+bash scripts/cluster_healthcheck.sh --no-fix --quiet
+# machine-readable instead:
+# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
+```
+
+- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
+  will fail.
+- Exit codes: `0` healthy, `1` warnings, `2` failures.
+
+With the token exported, the **ha-sofia checks run for you**:
+26 Entity Availability · 27 Integration Health · 28 Automation Status ·
+29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
+classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
+IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
+covers the **tuya** exporter.
+
+## Step 2 — triage the output by relevance to YOU
+
+Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
+
+- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
+  `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
+  hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
+  **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
+- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
+  cluster issues (admin's area)" and don't investigate.
+
+## Step 3 — read-only checks for your chain
+
+All of these work with your read-only access:
+
+```bash
+# tuya-bridge — your devices + the ATS
+kubectl get pods -n tuya-bridge
+kubectl rollout status deploy/tuya-bridge -n tuya-bridge
+kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
+
+# the reachability path ha-sofia uses
+kubectl get pods -n cloudflared
+kubectl get pods -n traefik
+kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
+
+# whole external path in one shot (DNS + tunnel + Traefik + cert):
+curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
+#   reachable  -> HTTP/2 200 / 401 / 403  (any HTTP response = path is up)
+#   broken     -> curl: timeout / could not resolve host
+```
+
+The fastest **device-level** signal is your own dashboard: open
+**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
+Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
+house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
+
+## Step 4 — if something on your chain is broken
+
+You can't fix the cluster (read-only), so **capture + hand off**:
+
+```bash
+kubectl describe pod -n tuya-bridge <pod>
+kubectl logs -n tuya-bridge <pod> --previous --tail=200
+```
+
+Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
+Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
+above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
+alerting is already firing, but file it so it's tracked from your side too.
+
+## What will skip for you (expected — not failures)
+
+A few checks need access your account doesn't have. They warn/skip — that's
+normal, and **none of them are on your ha-sofia chain**:
+
+- **Uptime Kuma (14)** — needs an admin password from Vault.
+- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
+  and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
+- **`--fix`** — pod deletion (a write); not available to you.
+
+(The ha-sofia checks are **not** in this list — your token makes them work.)
+
+## Your ha-sofia token
+
+- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
+- It's a **dedicated** long-lived token, named `emo-cluster-health` under
+  ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
+  affects only you.
+- It currently carries admin-level HA scope (Home Assistant only lets a token
+  be minted for the account that created it, and it was minted via the admin
+  account). If it ever stops working, tell wizard and a fresh one can be minted.
--- a/scripts/workstation/managed-settings.json
+++ b/scripts/workstation/managed-settings.json
@ -1,4 +1,4 @@
 {
-  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n  - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n  - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n  - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n  - Keep every clone on a clean master when done; tell the user in plain words what happened.\n  - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
+  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n  - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n  - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n  - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n  - Keep every clone on a clean master when done; tell the user in plain words what happened.\n  - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
  "model": "claude-opus-4-8"
 }
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@ -72,11 +72,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
 fi

 # 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
-#     npm-global so every user's PATH resolves it. Pinned major; best-effort (a
-#     failure only disables `homelab vault`, nothing else on the box).
-if ! command -v bw >/dev/null; then
-  log "npm: installing @bitwarden/cli (homelab vault backend)"
-  npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
+#     Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
+#     resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
+#     latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
+#     system install, leaving non-admins (emo, anca, …) with no backend. Pinned
+#     major; best-effort (a failure only disables `homelab vault`).
+if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
+  log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
+  npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
 fi

 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@ -93,6 +93,15 @@ ensure_onboarding() {
 }
 ensure_onboarding

+# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
+# materialized one from this user's own Vault path. A non-rotating setup-token
+# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
+# logs out users running many concurrent agents (interactive + t3 + always-on).
+# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
+# token; never shared between OS users.
+_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
+if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
+
 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
 # isn't destroyed-and-recreated in a ttyd auto-reconnect loop.
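
Review note (sketch only, not part of this change): the `set -a; . file; set +a` idiom added to `start-claude.sh` works because `set -a` marks every variable assigned while it is active for export, so a plain `KEY=VALUE` file becomes visible to child processes without an explicit `export`. A small self-contained version, with the hypothetical names `load_env_file` and `DEMO_TOKEN`, is:

```sh
# Hypothetical helper: export every KEY=VALUE assignment from an env file.
# An unreadable or absent file is a no-op, matching the optional-token flow.
# Pass an absolute path: POSIX `.` searches PATH for names without a slash.
load_env_file() {
    [ -r "$1" ] || return 0
    set -a          # auto-export everything assigned from here on
    . "$1"
    set +a          # stop auto-exporting
}

# Example (illustrative):
#   load_env_file "$HOME/.config/claude-auth-sync/claude-oauth.env"
```

Note that the file is sourced as shell, so it must contain only trusted assignments; that holds here because claude-auth-sync writes it with mode 0600 in the user's own config directory.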
--- a/stacks/actualbudget/main.tf
+++ b/stacks/actualbudget/main.tf
@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/affine/main.tf
+++ b/stacks/affine/main.tf
@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides DATABASE_URL that auto-updates when password rotates
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/authentik/Dockerfile
+++ b/stacks/authentik/Dockerfile
@ -0,0 +1,46 @@
+# SLOW-1a overlay over the official authentik server image.
+#
+# The login flow's identification stage renders each enabled source's UI login
+# button. Upstream authentik/stages/identification/stage.py does:
+#     current_stage.sources.filter(enabled=True).order_by("name").select_subclasses()
+# The bare no-arg select_subclasses() (django-model-utils InheritanceManager)
+# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s
+# (verified live on 2026.2.4: 1527ms vs 14ms). Passing only the subtypes that
+# actually render a UI login button — every concrete Source type that overrides
+# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim —
+# is ~100x faster with BYTE-IDENTICAL output (verified: concrete types + rendered
+# buttons match). django-model-utils accepts the lowercase subclass *accessor
+# names* as strings, so no new import is needed (no circular-import risk) — the
+# patch is a single, reviewable line edit.
+#
+# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in
+# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if
+# the upstream target line moved. If a future authentik version adds a NEW
+# login-capable source type, add its lowercase accessor to the list below.
+# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of
+# 2026-06-28) — drop this overlay once upstream narrows the query.
+FROM ghcr.io/goauthentik/server:2026.2.4
+
+USER root
+RUN set -eux; \
+    F=/authentik/stages/identification/stage.py; \
+    grep -q 'order_by("name").select_subclasses()' "$F"; \
+    sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \
+    grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \
+    PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \
+    rm -f /authentik/stages/identification/__pycache__/stage.*.pyc
+
+# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and
+# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already
+# ships a no-modern-JS Simplified Flow Executor (SFE, ES5) but only serves it to
+# IE/old-Edge/PKeyAuth. patch-compat-sfe.py (a) extends compat_needs_sfe() to
+# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS —
+# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login
+# <a> links into the SFE shell (the SFE can't render Identification-stage sources;
+# needed for password-less Google-only accounts). Clients get the REAL authentik
+# login (password + MFA + reputation, NO auth downgrade) instead of a blank page.
+# The script is guarded (asserts both upstream anchors + ast-parses) so the build
+# fails loudly if upstream moves — re-verify on every authentik bump.
+COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py
+RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py
+USER authentik
--- a/stacks/authentik/admin-services-restriction.tf
+++ b/stacks/authentik/admin-services-restriction.tf
@ -49,14 +49,15 @@ resource "authentik_policy_expression" "admin_services_restriction" {

    host = request.context.get("host", "")

-    # chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
-    # logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
-    # "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a
-    # member. akadmin kept as break-glass. The homelab-browser CDP path is
-    # already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
-    # so this closes the only remaining, human, noVNC path. Match username OR
-    # email so neither attribute alone can lock Viktor out.
-    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
+    # chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser
+    # sessions from the SHARED persistent profile. Originally Viktor-only.
+    # 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo
+    # (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling +
+    # captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed
+    # sessions (the CLI half is the emo-browser ServiceAccount in
+    # stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR
+    # email so neither attribute alone can lock anyone out.
+    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"}
    if host == "chrome.viktorbarzin.me":
        return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED

--- a/stacks/authentik/email-secret.tf
+++ b/stacks/authentik/email-secret.tf
@ -6,6 +6,9 @@
 # are non-secret and live in values.yaml. The reloader annotation rolls the
 # authentik pods if the password ever changes.
 resource "kubernetes_manifest" "authentik_email_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/authentik/modules/authentik/main.tf
+++ b/stacks/authentik/modules/authentik/main.tf
@ -29,7 +29,12 @@ resource "kubernetes_namespace" "authentik" {
    labels = {
      tier                               = var.tier
      "resource-governance/custom-quota" = "true"
-      "keel.sh/enrolled"                 = "true"
+      # Keel intentionally NOT enrolled: server+worker run our custom overlay image
+      # (ghcr.io/viktorbarzin/authentik-server — see values.yaml global.image +
+      # stacks/authentik/Dockerfile). The tag is pinned explicitly and bumped
+      # manually (rebuild the overlay FROM the new authentik version + repoint), so
+      # a Keel auto-bump would only risk re-introducing the upstream tag / the
+      # 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped.
    }
  }
  lifecycle {
@ -82,6 +87,11 @@ module "ingress" {
  service_name     = "goauthentik-server"
  tls_secret_name  = var.tls_secret_name
  anti_ai_scraping = false
+  # Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
+  # the login SPA + flow-executor API burst on a cold load otherwise 429s into
+  # a blank screen (see traefik middleware "authentik-rate-limit").
+  skip_default_rate_limit = true
+  extra_middlewares       = ["traefik-authentik-rate-limit@kubernetescrd"]
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Authentik"
@ -149,5 +159,12 @@ module "ingress-static" {
  tls_secret_name  = var.tls_secret_name
  anti_ai_scraping = false
  homepage_enabled = false
-  extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
+  # /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
+  # cold-load fan-out → blank screen. Dedicated 100/1000 carve-out (note the two
+  # namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
+  skip_default_rate_limit = true
+  extra_middlewares = [
+    "authentik-static-cache-headers@kubernetescrd",
+    "traefik-authentik-rate-limit@kubernetescrd",
+  ]
 }
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@ -39,6 +39,16 @@ server:
      value: "3"
    - name: AUTHENTIK_WEB__THREADS
      value: "4"
+    # Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
+    # A worker recycle that coincides with a transient PG/pgbouncer blip stalls
+    # in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
+    # in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
+    # recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
+    # 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
+    - name: AUTHENTIK_WEB__MAX_REQUESTS
+      value: "10000"
+    - name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
+      value: "1000"
    # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
    # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
    # SELECT — but a single indexed lookup beats re-planning the flow
@ -87,11 +97,28 @@ server:
  livenessProbe:
    failureThreshold: 6
    timeoutSeconds: 5
-  strategy:
+  # Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
+  # readiness probe (/-/health/ready/) queries the DB, so during a PG/pgbouncer
+  # transient under ~60s it would 503 and drop ALL 3 server pods from the Service
+  # at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
+  # screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
+  # still reaps a truly hung pod. Partial override — the chart deep-merges the
+  # httpGet path /-/health/ready/ (same as the livenessProbe override above).
+  readinessProbe:
+    failureThreshold: 8
+    periodSeconds: 10
+    timeoutSeconds: 5
+  # RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
+  # (authentik.server reads .Values.server.deploymentStrategy) — the old
+  # `strategy:` key was silently ignored, so live ran the chart default 25%/25%
+  # and every rolling event dropped a server pod out of rotation, amplifying the
+  # NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
+  # a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
+  deploymentStrategy:
    type: RollingUpdate
    rollingUpdate:
-      maxSurge: 0
-      maxUnavailable: 1
+      maxSurge: 1
+      maxUnavailable: 0
  resources:
    requests:
      cpu: 100m
@ -118,15 +145,23 @@ server:
 global:
  addPrometheusAnnotations: true
  image:
-    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
-    # namespace) bumps the IMAGE between chart releases, while helm defaults
-    # the tag to the chart appVersion — so any helm upgrade silently
-    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
-    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
-    # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
-    # boot-storm.md). Keep this tag in sync with what Keel has deployed when
-    # touching this chart; clear it only when bumping the chart version itself.
-    tag: "2026.2.4"
+    # CUSTOM OVERLAY: two thin patches over the official authentik server image
+    # (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow
+    # select_subclasses() query, ~1.5s -> ~14ms; (2) serve authentik's no-modern-JS SFE
+    # login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on
+    # iOS<=16.3 so old devices (e.g. iPadOS<=15) get a working login instead of a
+    # blank page, and injects social-login links into the SFE (it can't render
+    # sources; needed for password-less Google-only accounts). Built by
+    # .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server
+    # (public package, anonymous pull — no imagePullSecret needed, like the
+    # upstream goauthentik image). Keel is NO LONGER enrolled for this namespace
+    # (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag
+    # to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned
+    # explicitly here to prevent the 2026-06-10 downgrade-boot-storm class.
+    # UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. ->
+    # 2026.3.0-patch1), let GHA rebuild, then apply.
+    repository: ghcr.io/viktorbarzin/authentik-server
+    tag: "2026.2.4-patch3"

 worker:
  # 2 replicas: workers handle background tasks (LDAP sync, email,
@ -166,7 +201,10 @@ worker:
        secretKeyRef:
          name: authentik-email
          key: AUTHENTIK_EMAIL__PASSWORD
-  strategy:
+  # Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
+  # serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
+  # the dead-key cleanup so the declared intent actually takes effect.
+  deploymentStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
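
A small arithmetic sketch of the probe and recycle numbers quoted in the values.yaml comments above. It assumes the kubelet marks a pod unready after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, and that gunicorn adds a random 0..jitter to `max_requests` for each worker. The function names are hypothetical and are not part of the chart or of authentik.

```python
def readiness_tolerance_seconds(failure_threshold, period_seconds):
    """Approximate time a pod can keep failing readiness before it is marked unready."""
    return failure_threshold * period_seconds


def recycle_window(max_requests, jitter):
    """Lowest and highest per-worker request limit, assuming limit = max_requests + randint(0, jitter)."""
    return (max_requests, max_requests + jitter)


# Chart default readiness (3 failures x 10s) versus the widened override (8 x 10s).
default_tolerance = readiness_tolerance_seconds(3, 10)
widened_tolerance = readiness_tolerance_seconds(8, 10)

# Gunicorn defaults quoted in the comment (1000 / 50) versus the hardened values.
default_window = recycle_window(1000, 50)
hardened_window = recycle_window(10000, 1000)
```

The wider window is what spreads recycles out: under these assumptions each worker picks its own limit somewhere in a 1000-request range instead of a 50-request range.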
--- a/stacks/authentik/patch-compat-sfe.py
+++ b/stacks/authentik/patch-compat-sfe.py
@ -0,0 +1,96 @@
+#!/usr/bin/env python3
+"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA).
+
+authentik's modern flow SPA is ES2022 (static{} init blocks), which hard-fails on
+Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and renders a COMPLETELY BLANK login.
+authentik ships a no-modern-JS Simplified Flow Executor (SFE, ES5) but only serves it to
+IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage
+sources (social-login buttons) — authentik docs list "Sources" as unsupported.
+
+This patch does TWO things, both guarded (assert the upstream anchor + verify the
+result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every
+authentik upgrade.
+
+  1. flows/views/interface.py::compat_needs_sfe() -> also return True for old
+     Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode
+     iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS,
+     Firefox/FxiOS, Edge — all share the system WebKit) on iOS <= 16.3. So old
+     iPads get the SFE on EVERY browser, not just Safari.
+
+  2. flows/templates/if/flow-sfe.html -> inject static social-login <a> links
+     (plain redirects to /source/oauth/login/<slug>/, work on ANY browser) so SFE
+     users (who otherwise see only username/password) can use social login —
+     required for accounts with no password (e.g. Google-only users like emo).
+"""
+import ast
+import glob
+import os
+
+# --- Patch 1: compat_needs_sfe() UA gate -------------------------------------
+INTERFACE = "/authentik/flows/views/interface.py"
+ANCHOR = (
+    '        if "PKeyAuth" in ua["string"]:\n'
+    "            return True\n"
+    "        return False"
+)
+REPLACEMENT = (
+    '        if "PKeyAuth" in ua["string"]:\n'
+    "            return True\n"
+    "        # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n"
+    "        # login) -> serve the SFE (real authentik login). (a) desktop-mode\n"
+    "        # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n"
+    '        if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n'
+    "            try:\n"
+    '                _maj = int(ua["user_agent"]["major"] or 0)\n'
+    '                _min = int(ua["user_agent"]["minor"] or 0)\n'
+    "            except (TypeError, ValueError):\n"
+    "                _maj = _min = 0\n"
+    "            if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n"
+    "                return True\n"
+    "        # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n"
+    "        # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n"
+    '        if ua["os"]["family"] == "iOS":\n'
+    "            try:\n"
+    '                _omaj = int(ua["os"]["major"] or 0)\n'
+    '                _omin = int(ua["os"]["minor"] or 0)\n'
+    "            except (TypeError, ValueError):\n"
+    "                _omaj = _omin = 0\n"
+    "            if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n"
+    "                return True\n"
+    "        return False"
+)
+src = open(INTERFACE).read()
+assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed"
+assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}"
+src = src.replace(ANCHOR, REPLACEMENT)
+ast.parse(src)  # validate BEFORE writing so a syntactically broken patch never lands on disk
+open(INTERFACE, "w").write(src)
+assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read()
+for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"):
+    os.remove(pyc)
+
+# --- Patch 2: social-login links on the SFE shell ----------------------------
+SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html"
+HTML_ANCHOR = (
+    "        </main>\n"
+    "        <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
+)
+HTML_REPLACEMENT = (
+    "        </main>\n"
+    "        <!-- OVERLAY: the SFE can't render Identification-stage sources, so add\n"
+    "             static social-login links (plain redirects, work on any browser).\n"
+    "             Re-verify slugs on source changes; shown on all SFE flows. -->\n"
+    '        <div class="form-signin w-100 m-auto pt-2 mt-2 border-top">\n'
+    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/google/">Continue with Google</a>\n'
+    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/github/">Continue with GitHub</a>\n'
+    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/facebook/">Continue with Facebook</a>\n'
+    "        </div>\n"
+    "        <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
+)
+html = open(SFE_HTML).read()
+assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}"
+html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT)
+open(SFE_HTML, "w").write(html)
+assert "Continue with Google" in open(SFE_HTML).read()
+
+print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE")
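
The version gate that Patch 1 injects can be exercised in isolation. The following is a minimal sketch, assuming a user-agent dict with the same `user_agent` / `os` keys the patch reads; `at_most` and `needs_sfe` are hypothetical names covering only the two overlay branches, not authentik's full `compat_needs_sfe()`.

```python
def at_most(major, minor, max_major=16, max_minor=3):
    """True when major.minor parses to a known version <= max_major.max_minor."""
    try:
        maj = int(major or 0)
        mnr = int(minor or 0)
    except (TypeError, ValueError):
        return False
    # maj == 0 means the version is unknown, so do not force the SFE.
    return bool(maj) and (maj, mnr) <= (max_major, max_minor)


def needs_sfe(ua):
    """Mirror of the two overlay branches: old Safari, or any browser on old iOS."""
    browser = ua.get("user_agent", {})
    os_info = ua.get("os", {})
    if browser.get("family") in ("Safari", "Mobile Safari") and at_most(
        browser.get("major"), browser.get("minor")
    ):
        return True
    if os_info.get("family") == "iOS" and at_most(
        os_info.get("major"), os_info.get("minor")
    ):
        return True
    return False


old_ipad_chrome = {
    "user_agent": {"family": "Chrome Mobile iOS", "major": "120", "minor": "0"},
    "os": {"family": "iOS", "major": "15", "minor": "8"},
}
new_mac_safari = {
    "user_agent": {"family": "Safari", "major": "17", "minor": "1"},
    "os": {"family": "Mac OS X", "major": "10", "minor": "15"},
}
```

The tuple comparison `(maj, mnr) <= (16, 3)` is equivalent to the patch's `maj < 16 or (maj == 16 and mnr <= 3)`; the family strings in the sample dicts are illustrative values, not verified parser output.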
--- a/stacks/beads-server/main.tf
+++ b/stacks/beads-server/main.tf
@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" {
 # Pulls the claude-agent-service bearer token from Vault so BeadBoard can
 # dispatch agent jobs via the in-cluster HTTP API.
 resource "kubernetes_manifest" "beadboard_agent_service_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/broker-sync/main.tf
+++ b/stacks/broker-sync/main.tf
@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" {
 #   trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency}
 #   imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -212,3 +212,229 @@ resource "kubectl_manifest" "whisker" {
    spec       = { notifications = "Disabled" }
  })
 }
+
+# ---------------------------------------------------------------------------
+# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
+#
+# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
+# Whisker ships NO own login — it's an admin observability UI, so Authentik
+# forward-auth is the only gate between strangers and the flow view). The
+# operator replicated `tls-secret` into calico-system already.
+#
+# TWO coupled pieces are required because the operator's own `whisker`
+# NetworkPolicy (owned by the Whisker CR above) lists Ingress in policyTypes
+# with NO ingress rules => default-deny on ingress to the whisker pod. The
+# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
+# across policies selecting the same pod), so we never edit the operator NP.
+module "ingress_whisker" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  dns_type        = "proxied"
+  namespace       = "calico-system"
+  name            = "whisker"
+  service_name    = "whisker"
+  port            = 8081
+  auth            = "required"
+  tls_secret_name = "tls-secret"
+  extra_annotations = {
+    "gethomepage.dev/enabled"     = "true"
+    "gethomepage.dev/name"        = "Whisker"
+    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
+    "gethomepage.dev/icon"        = "calico.png"
+    "gethomepage.dev/group"       = "Infrastructure"
+  }
+}
+
+# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
+# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
+# can reach the UI without touching the operator-owned policy.
+resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
+  metadata {
+    name      = "whisker-allow-traefik"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Ingress"]
+    ingress {
+      from {
+        namespace_selector {
+          match_labels = {
+            "kubernetes.io/metadata.name" = "traefik"
+          }
+        }
+      }
+      ports {
+        port     = "8081"
+        protocol = "TCP"
+      }
+    }
+  }
+}
+
+# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
+#
+# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
+# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
+# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
+# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
+# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
+# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
+# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
+# fine). whisker-backend resolves once in the brief startup window before the
+# policy programs, establishes its long-lived gRPC stream, and only re-resolves
+# when that stream breaks — at which point the blocked ClusterIP DNS wedges its
+# Go resolver and the UI goes empty (the durable aggregator, in its own
+# unrestricted namespace, is unaffected). k8s egress policies are additive, so
+# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
+# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
+# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
+resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
+  metadata {
+    name      = "whisker-allow-dns-clusterip"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Egress"]
+    egress {
+      # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR
+      # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
+      to {
+        ip_block {
+          cidr = "10.96.0.10/32"
+        }
+      }
+      ports {
+        port     = "53"
+        protocol = "UDP"
+      }
+      ports {
+        port     = "53"
+        protocol = "TCP"
+      }
+    }
+  }
+}
+
+# ---------------------------------------------------------------------------
+# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
+#
+# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
+# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
+# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
+# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
+# resolver spams `failed to stream flows` / `code = Unavailable` and never
+# reconnects -> empty UI, while the durable aggregator in its own namespace is
+# unaffected), nothing else would restart it. Whisker is operator-managed
+# (Whisker CR) so we can't inject a probe; this is the supported-pattern
+# alternative. With the DNS fix in place it should rarely, if ever, fire.
+#
+# It restarts the pod ONLY when the wedged signature is present AND Goldmane is
+# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod
+# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md.
+resource "kubernetes_service_account" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+}
+
+# Namespaced Role (least privilege — only calico-system): read pod logs to
+# detect the wedge, delete the whisker pod to heal it.
+resource "kubernetes_role" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods"]
+    verbs      = ["get", "list", "delete"]
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods/log"]
+    verbs      = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.whisker_watchdog.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.whisker_watchdog.metadata[0].name
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+}
+
+resource "kubernetes_cron_job_v1" "whisker_watchdog" {
+  metadata {
+    name      = "whisker-watchdog"
+    namespace = kubernetes_namespace.calico_system.metadata[0].name
+  }
+  spec {
+    schedule                      = "*/10 * * * *"
+    successful_jobs_history_limit = 1
+    failed_jobs_history_limit     = 1
+    concurrency_policy            = "Forbid"
+    job_template {
+      metadata {
+        name = "whisker-watchdog"
+      }
+      spec {
+        template {
+          metadata {
+            name = "whisker-watchdog"
+          }
+          spec {
+            service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name
+            container {
+              name  = "watchdog"
+              image = "bitnami/kubectl:latest"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -eu
+                NS=calico-system
+                # Don't thrash if Goldmane itself is down — that's not a whisker bug.
+                if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \
+                     -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then
+                  echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0
+                fi
+                ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \
+                  | grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true)
+                ERRS=$${ERRS:-0}
+                if [ "$ERRS" -ge 10 ]; then
+                  echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod"
+                  kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found
+                else
+                  echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m"
+                fi
+              EOT
+              ]
+            }
+            restart_policy = "Never"
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
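
The watchdog's restart decision can be read as a pure function of two inputs: whether Goldmane is Ready, and how many recent whisker-backend log lines carry a wedge signature. This is a minimal sketch of that decision only, assuming the same four patterns and the threshold of 10 used in the CronJob script; the names are hypothetical and nothing here talks to a cluster.

```python
import re

# Same alternation the shell script passes to `grep -cE`.
WEDGE_PATTERN = re.compile(
    r"failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout"
)


def count_wedge_lines(log_text):
    """Count matching lines, like `grep -c` (a line with two signatures counts once)."""
    return sum(1 for line in log_text.splitlines() if WEDGE_PATTERN.search(line))


def should_restart(goldmane_ready, log_text, threshold=10):
    """Restart only when Goldmane is Ready AND the wedge signature is present."""
    if not goldmane_ready:
        # A real Goldmane outage is not a whisker problem; avoid restart-thrash.
        return False
    return count_wedge_lines(log_text) >= threshold


wedged_log = "\n".join(
    ["failed to stream flows: rpc error: code = Unavailable"] * 10
)
healthy_log = "\n".join(["flow stream established"] * 50)
```

Keeping the Goldmane check first reproduces the script's early `exit 0`, so a noisy log during a Goldmane outage never triggers a delete.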
--- a/stacks/changedetection/main.tf
+++ b/stacks/changedetection/main.tf
@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/chrome-service/files/novnc/entrypoint.sh
+++ b/stacks/chrome-service/files/novnc/entrypoint.sh
@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
  sleep 2
 done

-# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
-# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
-# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
-# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
+# Both x11vnc and websockify run as supervised children of this entrypoint (PID
+# 1) so their logs land on container stdout and the `wait -n` at the end can catch
+# either one dying. `-noshm` skips MIT-SHM probes that fail across container
+# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
+# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
 echo "starting x11vnc -> :5900"
 x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
       -forever -shared -noshm -noxdamage -quiet 2>&1 &
-X11VNC_PID=$!

 for i in 1 2 3 4 5 6 7 8 9 10; do
  if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
 fi

 echo "starting websockify -> :6080"
-exec websockify --web=/usr/share/novnc 6080 localhost:5900
+# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
+# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
+# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
+# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
+# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
+# the noVNC view went black until a manual pod restart. Now if EITHER process
+# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
+# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
+# across browser-container restarts. (Same supervision pattern as the
+# android-emulator stack's entrypoint.)
+websockify --web=/usr/share/novnc 6080 localhost:5900 &
+
+wait -n || true
+echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
+exit 1
--- a/stacks/chrome-service/main.tf
+++ b/stacks/chrome-service/main.tf
@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" {
 # --- Secrets (single-key extract: api_bearer_token) ---

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -330,15 +333,23 @@ resource "kubernetes_deployment" "chrome_service" {
        container {
          name = "novnc"
          # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
-          image             = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
+          # SHA-pinned (not :latest): Keel is OFF for this deployment
+          # (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a
+          # rebuilt image, so a new noVNC entrypoint only deploys when this digest
+          # is bumped here. Bump after build-chrome-service-novnc.yml pushes a new
+          # SHA tag — then WAIT for that apply pipeline to finish before pushing
+          # anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply
+          # mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got
+          # killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix
+          # (noVNC went black after a browser-container restart; see
+          # docs/architecture/chrome-service.md "x11vnc supervision").
+          image             = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40"
          image_pull_policy = "IfNotPresent"
          # Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
          # nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
          # so every VNC connection hangs on "Connecting" until it times out
-          # (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
-          # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
-          # isn't guaranteed to be pulled — this wrapper applies the cap
-          # deterministically on every rollout off the cached image.
+          # (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this;
+          # the wrapper keeps the cap deterministic even off a cached image.
          command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
          port {
            name           = "http"
@ -348,9 +359,13 @@ resource "kubernetes_deployment" "chrome_service" {
          # x11vnc connects to the chrome-service container's Xvfb over
          # localhost TCP (shared pod network). Same uid 1000 as chrome
          # container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
+          # 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under
+          # ACTIVE VNC use — x11vnc + websockify framebuffer/encode buffers spike
+          # well past idle (~37Mi) when a client streams the 1280x720 screen, so the
+          # noVNC view froze/hung on connect. Bumped 2026-06-28.
          resources {
-            requests = { cpu = "10m", memory = "32Mi" }
-            limits   = { memory = "96Mi" }
+            requests = { cpu = "10m", memory = "64Mi" }
+            limits   = { memory = "256Mi" }
          }
        }

--- a/stacks/chrome-service/rbac.tf
+++ b/stacks/chrome-service/rbac.tf
@ -0,0 +1,95 @@
+# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC.
+#
+# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single
+# chrome-service browser rather than getting an isolated instance. The noVNC half of
+# that grant is the Authentik allowlist in
+# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the
+# CLI half — it lets emo's `homelab browser` reach the headed Chrome over CDP.
+#
+# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service`
+# (cli/browser.go). emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and
+# can't authenticate a headless agent session, and his power-user tier has no
+# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token
+# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT
+# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his
+# personal OIDC login stays available as the `oidc@homelab` named context.
+#
+# TRADE-OFF (accepted): CDP access == full control of the shared browser, including
+# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live.
+# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent
+# to sharing one browser (the isolated per-user instance was declined).
+# See docs/architecture/chrome-service.md "Multi-user access".
+
+resource "kubernetes_service_account" "emo_browser" {
+  metadata {
+    name      = "emo-browser"
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+  }
+}
+
+# Long-lived (non-expiring) token for the SA — the devvm provisioner reads this and
+# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf.
+resource "kubernetes_secret" "emo_browser_token" {
+  metadata {
+    name      = "emo-browser-token"
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+    annotations = {
+      "kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name
+    }
+  }
+  type                           = "kubernetes.io/service-account-token"
+  wait_for_service_account_token = true
+}
+
+# The ONLY permission emo's SA lacks for `kubectl port-forward svc/chrome-service`:
+# `create` on the pods/portforward subresource. (get/list of pods + services +
+# endpoints comes from the cluster-read binding below.) Namespace-scoped to chrome-service.
+resource "kubernetes_role" "browser_portforward" {
+  metadata {
+    name      = "chrome-service-portforward"
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+  }
+  rule {
+    api_groups = [""]
+    resources  = ["pods/portforward"]
+    verbs      = ["create"]
+  }
+}
+
+resource "kubernetes_role_binding" "emo_browser_portforward" {
+  metadata {
+    name      = "emo-browser-portforward"
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.browser_portforward.metadata[0].name
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.emo_browser.metadata[0].name
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+  }
+}
+
+# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound
+# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without
+# this his everyday `kubectl get ...` would regress — AND port-forward itself needs
+# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly).
+# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name.
+resource "kubernetes_cluster_role_binding" "emo_browser_readonly" {
+  metadata {
+    name = "emo-browser-readonly"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "ClusterRole"
+    name      = "oidc-power-user-readonly"
+  }
+  subject {
+    kind      = "ServiceAccount"
+    name      = kubernetes_service_account.emo_browser.metadata[0].name
+    namespace = kubernetes_namespace.chrome_service.metadata[0].name
+  }
+}
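One way to sanity-check the grants in rbac.tf is `kubectl auth can-i` with impersonation. This is a sketch, assuming the namespace is named `chrome-service` and that the caller is allowed to impersonate ServiceAccounts; the helper function names are hypothetical, and the functions only print the commands rather than run them.

```shell
# Sketch only: prints the checks instead of running them, so it is safe
# without cluster access. Pipe a line to `sh` to actually run it.
sa="system:serviceaccount:chrome-service:emo-browser"

portforward_check() {
  echo "kubectl auth can-i create pods --subresource=portforward -n chrome-service --as=$sa"
}

secrets_check() {
  echo "kubectl auth can-i get secrets -n chrome-service --as=$sa"
}

portforward_check   # intended answer once the Role is bound: yes
secrets_check       # intended answer, given the "NO secrets" comment: no
```

`--subresource=portforward` is used because `can-i` reads `pods/portforward` as a pod named `portforward` rather than as the subresource.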
--- a/stacks/ci-pipeline-health/main.tf
+++ b/stacks/ci-pipeline-health/main.tf
@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
 # billing on PRIVATE mirrors, which a future scoped read:packages rotation of
 # the alias could not do. Blast radius = this single-CronJob namespace.
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-agent-service/main.tf
+++ b/stacks/claude-agent-service/main.tf
@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" {
 # --- Secrets ---

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-breakglass/main.tf
+++ b/stacks/claude-breakglass/main.tf
@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" {
 # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
 # pod can never read it.
 resource "kubernetes_manifest" "external_secret_ssh" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" {
 # Env secrets: the Anthropic OAuth token (shared with claude-agent-service —
 # same account) and the app bearer token (in-cluster/CLI fallback caller auth).
 resource "kubernetes_manifest" "external_secret_env" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-memory/main.tf
+++ b/stacks/claude-memory/main.tf
@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {

 # DB credentials from Vault database engine (rotated every 24h)
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/coturn/main.tf
+++ b/stacks/coturn/main.tf
@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "public_ip" { type = string }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/dawarich/main.tf
+++ b/stacks/dawarich/main.tf
@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
    labels = {
      "app" = "phpmyadmin"
      tier  = var.tier
-
+      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
+      # namespace alone can't attribute Goldmane flows. Value = the fronting
+      # Service name (kubernetes_service.phpmyadmin is named "pma").
+      "service-identity" = "pma"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
      metadata {
        labels = {
          "app" = "phpmyadmin"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector → no replace.
+          "service-identity" = "pma"
        }
      }
      spec {
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
+      # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
+      # the daily drift plan) doesn't fight them or revert the live image —
+      # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
  }
 }

@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
    }
    labels = {
      tier = var.tier
+      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
+      # namespace alone can't attribute Goldmane flows. Value = the fronting
+      # Service name (kubernetes_service.pgadmin is named "pgadmin").
+      "service-identity" = "pgadmin"
    }
  }
  spec {
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
      metadata {
        labels = {
          app = "pgadmin"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector → no replace.
+          "service-identity" = "pgadmin"
        }
      }
      spec {
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
+      # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
+      # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
+      # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
+      # annotations — canonical guard, matches linkwarden/chrome-service.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
  }
 }
 resource "kubernetes_service" "pgadmin" {
--- a/stacks/diun/main.tf
+++ b/stacks/diun/main.tf
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/ebooks/main.tf
+++ b/stacks/ebooks/main.tf
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" {

 # ExternalSecrets for all three sources
 resource "kubernetes_manifest" "calibre_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" {
 }

 resource "kubernetes_manifest" "audiobookshelf_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
 }

 resource "kubernetes_manifest" "servarr_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@ -33,6 +33,9 @@ resource "kubernetes_namespace" "f1-stream" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" {
 # Pull the chrome-service bearer token into this namespace as a separate
 # Secret so the verifier can reach the in-cluster Playwright pool.
 resource "kubernetes_manifest" "chrome_service_client_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/fire-planner/main.tf
+++ b/stacks/fire-planner/main.tf
@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" {
 # Seed before applying:
 #   secret/fire-planner -> property `recompute_bearer_token`
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" {
 # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
 # as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" {
 # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
 # fire-planner ingest reads those tables via this role.
 resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -450,6 +459,90 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
  ]
 }

+# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after
+# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE
+# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which
+# the wealth Grafana dashboard's "FIRE Countdown" section reads.
+resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" {
+  metadata {
+    name      = "fire-planner-fire-targets"
+    namespace = kubernetes_namespace.fire_planner.metadata[0].name
+  }
+  spec {
+    schedule                      = "0 10 2 * *"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 5
+    starting_deadline_seconds     = 600
+
+    job_template {
+      metadata {
+        labels = local.labels
+      }
+      spec {
+        backoff_limit              = 1
+        ttl_seconds_after_finished = 86400
+        # The full country sweep is CPU-bound (binary search × ~22 cities ×
+        # 3 cases). Give it room rather than letting it run forever.
+        active_deadline_seconds = 3600
+        template {
+          metadata {
+            labels = local.labels
+          }
+          spec {
+            restart_policy = "OnFailure"
+            image_pull_secrets {
+              name = "registry-credentials"
+            }
+            image_pull_secrets {
+              name = "ghcr-credentials"
+            }
+            container {
+              name  = "fire-targets"
+              image = local.image
+              # --horizon 72: Viktor retires ~age 28 and plans to live to 100, so
+              # the portfolio must last 72 years (was the 60y default ≈ to age 88).
+              command = ["python", "-m", "fire_planner", "recompute-fire-targets",
+              "--countries", "all", "--horizon", "72"]
+
+              env_from {
+                secret_ref {
+                  name = "fire-planner-secrets"
+                }
+              }
+              env_from {
+                secret_ref {
+                  name = "fire-planner-db-creds"
+                }
+              }
+
+              resources {
+                requests = {
+                  cpu    = "500m"
+                  memory = "1Gi"
+                }
+                limits = {
+                  memory = "2Gi"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+
+  depends_on = [
+    kubernetes_manifest.external_secret,
+    kubernetes_manifest.db_external_secret,
+  ]
+}
+
 # Weekly refresh of the COL cache: walks col_snapshot for rows
 # expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
 # the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
@ -569,16 +662,53 @@ module "ingress_api" {
  auth = "none"
 }

-# Plan-time read of the ESO-created K8s Secret for Grafana datasource
-# password. First-apply gotcha: must
-# `terragrunt apply -target=kubernetes_manifest.db_external_secret` so
-# the Secret exists before this data source plans.
-data "kubernetes_secret" "fire_planner_db_creds" {
-  metadata {
-    name      = "fire-planner-db-creds"
-    namespace = kubernetes_namespace.fire_planner.metadata[0].name
+# ExternalSecret in the monitoring namespace mirroring the rotating
+# fire_planner DB password. Grafana mounts this via envFromSecrets in
+# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below
+# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts
+# Grafana whenever ESO updates this secret (on the 7d static-role
+# rotation), so the provisioned datasource never goes stale — replaces
+# the old plan-time `data.kubernetes_secret` bake that broke weekly.
+# Mirrors the wealth-pg / payslips-pg pattern.
+resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "grafana-fire-planner-pg-creds"
+      namespace = "monitoring"
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-database"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "grafana-fire-planner-pg-creds"
+        template = {
+          metadata = {
+            annotations = {
+              "reloader.stakater.com/match" = "true"
+            }
+          }
+          data = {
+            FIRE_PLANNER_PG_PASSWORD = "{{ .password }}"
+          }
+        }
+      }
+      data = [{
+        secretKey = "password"
+        remoteRef = {
+          key      = "static-creds/pg-fire-planner"
+          property = "password"
+        }
+      }]
+    }
  }
-  depends_on = [kubernetes_manifest.db_external_secret]
 }

 # Grafana datasource for fire_planner PostgreSQL DB.
@ -615,12 +745,15 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
          timescaledb     = false
        }
        secureJsonData = {
-          password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"]
+          # Live env from grafana-fire-planner-pg-creds (above), injected into
+          # Grafana via envFromSecrets; reloader refreshes it on rotation.
+          password = "$__env{FIRE_PLANNER_PG_PASSWORD}"
        }
        editable = true
      }]
    })
  }
+  depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds]
 }

 # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
@ -661,6 +794,9 @@ variable "run_examples_bulk_ingest" {

 # Reddit OAuth creds pulled from Vault secret/viktor.
 resource "kubernetes_manifest" "external_secret_examples_reddit" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -701,6 +837,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
 # claude-agent-service bearer pulled separately so its rotation cadence
 # is decoupled from the Reddit creds.
 resource "kubernetes_manifest" "external_secret_examples_claude" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
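The `--horizon 72` value passed to the fire-targets CronJob in this file follows from the ages given in its comment. A small arithmetic check, using only those stated numbers (the variable names are illustrative):

```shell
retire_age=28    # "retires ~age 28" per the CronJob comment
target_age=100   # "plans to live to 100"
old_horizon=60   # previous default horizon in years

horizon=$((target_age - retire_age))
old_end_age=$((retire_age + old_horizon))
echo "horizon=${horizon}y old_default_lasts_to_age=${old_end_age}"
```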
--- a/stacks/forgejo/email-secret.tf
+++ b/stacks/forgejo/email-secret.tf
@ -6,6 +6,9 @@
 # (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
 # reloader annotation rolls the Forgejo pod if the password is ever rotated.
 resource "kubernetes_manifest" "forgejo_email_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/freedify/main.tf
+++ b/stacks/freedify/main.tf
@ -3,6 +3,9 @@ variable "tls_secret_name" {
  sensitive = true
 }
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/freshrss/main.tf
+++ b/stacks/freshrss/main.tf
@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@ -0,0 +1,499 @@
+# =============================================================================
+# goldmane-edge-aggregator — durable who-talks-to-whom audit trail (ADR-0014 / #58)
+# =============================================================================
+# A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
+# upserts the unique service-to-service edge set into Postgres, plus a daily
+# Slack digest CronJob of first-seen edges. Code lives in the standalone
+# `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
+# DEPLOY.md. This stack is the infra side of that spec.
+#
+# Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
+# via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
+# the operator CRs — this service IS the durable trail.
+#
+# Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
+# per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
+# ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
+# Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
+# the Goldmane mTLS client cert, reused from the operator's key pair (section 2).
+#
+# IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
+# MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
+# Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
+# local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
+# is cloned into this namespace — otherwise the pulls 401. The imagePullSecrets
+# reference below assumes that entry exists.
+# =============================================================================
+
+variable "postgresql_host" { type = string }
+
+# Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
+data "vault_kv_secret_v2" "secrets" {
+  mount = "secret"
+  name  = "goldmane-edge-aggregator"
+}
+
+# -----------------------------------------------------------------------------
+# 1. Namespace
+# -----------------------------------------------------------------------------
+resource "kubernetes_namespace" "goldmane_edge_aggregator" {
+  metadata {
+    name = "goldmane-edge-aggregator"
+    labels = {
+      name = "goldmane-edge-aggregator"
+      # Tier 4-aux: a small off-path consumer service, like claude-memory.
+      tier               = local.tiers.aux
+      "keel.sh/enrolled" = "true"
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+# -----------------------------------------------------------------------------
+# 2. Goldmane mTLS client certificate (reused operator-minted, Tigera-CA-signed)
+# -----------------------------------------------------------------------------
+# The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires mutual
+# TLS on :7443 and verifies that the client cert chains to the Tigera CA (the
+# same CA that issues Goldmane's serving cert). It does NOT authorize by client
+# identity, so ANY Tigera-CA-signed cert is accepted. Rather than copy the
+# Tigera CA PRIVATE KEY into TF state to mint our own cert (a needless CA-key
+# exposure; the hashicorp/tls provider is also incompatible with this repo's
+# global generate-providers/lockfile pattern), we REUSE the operator-minted,
+# Tigera-CA-signed client cert `whisker-backend-key-pair` (calico-system). We
+# never touch the CA key. Trade-off: if the operator rotates that cert, re-apply
+# to re-sync (hardening follow-up: mint an own-identity cert in-namespace if
+# Whisker is ever removed).
+data "kubernetes_secret" "whisker_backend" {
+  metadata {
+    name      = "whisker-backend-key-pair"
+    namespace = "calico-system"
+  }
+}
+
+# The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
+# calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present —
+# `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
+# it and recreate it as a ConfigMap in this namespace so the pod can mount it
+# (a ConfigMap cannot be cross-namespace-mounted).
+data "kubernetes_config_map" "tigera_ca_bundle" {
+  metadata {
+    name      = "tigera-ca-bundle"
+    namespace = "calico-system"
+  }
+}
+
+resource "kubernetes_config_map" "tigera_ca_bundle" {
+  metadata {
+    name      = "tigera-ca-bundle"
+    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+  }
+  # Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
+  # at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
+  # CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
+  data = data.kubernetes_config_map.tigera_ca_bundle.data
+}
+
+# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
+# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
+# Sourced verbatim from the operator's whisker-backend client key-pair (read
+# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
+# is touched and no cross-namespace CA RBAC is needed.
+resource "kubernetes_secret" "goldmane_client_tls" {
+  metadata {
+    name      = "goldmane-client-tls"
+    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+  }
+  type = "Opaque"
+  data = {
+    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
+    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+  }
+}
+
+# -----------------------------------------------------------------------------
+# 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
+# -----------------------------------------------------------------------------
+# Idempotent create of the role + DB using the CNPG root creds from Vault
+# (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
+# service creates the `edge` table itself at startup (migrations/0001_edge.sql),
+# so no migration Job is needed.
+resource "kubernetes_job" "db_init" {
+  metadata {
+    name      = "goldmane-edges-db-init"
+    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+  }
+  spec {
+    template {
+      metadata {}
+      spec {
+        container {
+          name  = "db-init"
+          image = "postgres:16-alpine"
+          command = [
+            "sh", "-c",
+            <<-EOT
+              set -e
+              # -d postgres: psql defaults the database name to the username;
+              # the root user has no root-named database, so be explicit.
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
+                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
+                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
+              echo "Database init complete"
+            EOT
+          ]
+        }
+        restart_policy = "Never"
+      }
+    }
+    backoff_limit = 3
+  }
+  wait_for_completion = true
+  timeouts {
+    create = "2m"
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
+    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
+# Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
+# place in the CNPG connection allowlist are added in stacks/vault/main.tf
+# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
+resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "goldmane-edges-db-creds"
+      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-database"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "goldmane-edges-db-creds"
+        template = {
+          data = {
+            DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
+          }
+        }
+      }
+      data = [{
+        secretKey = "password"
+        remoteRef = {
+          key      = "static-creds/pg-goldmane-edges"
+          property = "password"
+        }
+      }]
+    }
+  }
+  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
+}
+
+# -----------------------------------------------------------------------------
+# 4. Slack webhook (reuse the alert-digest incoming webhook)
+# -----------------------------------------------------------------------------
+# The monitoring alert-digest CronJob posts with the Slack incoming webhook at
+# Vault secret/monitoring -> key `alertmanager_slack_api_url`
+# (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
+# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
+# webhook). The digest CronJob below sets SLACK_CHANNEL to #alerts.
+resource "kubernetes_manifest" "slack_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
+  manifest = {
+    apiVersion = "external-secrets.io/v1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "goldmane-edges-slack"
+      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+    }
+    spec = {
+      refreshInterval = "1h"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "goldmane-edges-slack"
+      }
+      data = [{
+        secretKey = "SLACK_WEBHOOK_URL"
+        remoteRef = {
+          key      = "viktor"
+          property = "alertmanager_slack_api_url"
+        }
+      }]
+    }
+  }
+  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
+}
+
+# -----------------------------------------------------------------------------
+# 5. aggregate — Deployment (long-running gRPC stream -> Postgres upserts)
+# -----------------------------------------------------------------------------
+resource "kubernetes_deployment" "aggregate" {
+  depends_on = [
+    kubernetes_job.db_init,
+    kubernetes_manifest.db_external_secret,
+  ]
+  metadata {
+    name      = "goldmane-edge-aggregator"
+    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+    labels = {
+      app  = "goldmane-edge-aggregator"
+      tier = local.tiers.aux
+    }
+    annotations = {
+      # Credential is env-injected and read only at startup; the 7-day rotation
+      # must bounce the pod or it keeps the stale password and silently fails
+      # DB auth (infra CLAUDE.md Reloader rule).
+      "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
+    }
+  }
+  spec {
+    # 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
+    # action); a second replica only doubles writes for no benefit (Goldmane
+    # streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
+    replicas = 1
+    selector {
+      match_labels = {
+        app = "goldmane-edge-aggregator"
+      }
+    }
+    template {
+      metadata {
+        labels = {
+          app = "goldmane-edge-aggregator"
+        }
+      }
+      spec {
+        # PRIVATE ghcr image: its pull secret is cloned into this namespace by the
+        # Kyverno sync-ghcr-credentials allowlist policy (add this ns to that list).
+        image_pull_secrets {
+          name = "ghcr-credentials"
+        }
+        container {
+          name = "aggregate"
+          # CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
+          # the image tag is in ignore_changes below so the SHA sticks across
+          # `terragrunt apply` (fleet image-pin convention). Placeholder :latest
+          # until the deploy pipeline runs.
+          image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
+          args  = ["aggregate"]
+
+          # Goldmane mTLS. ServerName defaults to the GOLDMANE_HOST host without its
+          # port, i.e. "goldmane.calico-system.svc.cluster.local", which is a SAN
+          # on the live Goldmane serving cert (verified 2026-06-24:
+          # DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
+          # GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed.
+          env {
+            name  = "GOLDMANE_HOST"
+            value = "goldmane.calico-system.svc.cluster.local:7443"
+          }
+          # TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
+          # defaults (/etc/goldmane-client-tls/tls.{crt,key} and
+          # /etc/tigera-ca/tigera-ca-bundle.crt) — the mounts below match them.
+
+          env {
+            name = "DATABASE_URL"
+            value_from {
+              secret_key_ref {
+                name = "goldmane-edges-db-creds"
+                key  = "DATABASE_URL"
+              }
+            }
+          }
+
+          volume_mount {
+            name       = "goldmane-client-tls"
+            mount_path = "/etc/goldmane-client-tls"
+            read_only  = true
+          }
+          volume_mount {
+            name       = "tigera-ca"
+            mount_path = "/etc/tigera-ca"
+            read_only  = true
+          }
+
+          resources {
+            # Idles low: a single gRPC stream + periodic upserts. requests=limits
+            # per the repo memory rule; no CPU limit (CFS throttling). Right-size
+            # later with krr.
+            requests = {
+              cpu    = "10m"
+              memory = "64Mi"
+            }
+            limits = {
+              memory = "64Mi"
+            }
+          }
+        }
+
+        volume {
+          name = "goldmane-client-tls"
+          secret {
+            secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
+          }
+        }
+        volume {
+          name = "tigera-ca"
+          config_map {
+            name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [
+      # CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
+      spec[0].template[0].spec[0].container[0].image,
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
+  }
+}
+
+# -----------------------------------------------------------------------------
+# 6. digest — daily CronJob (first-seen edges -> Slack)
+# -----------------------------------------------------------------------------
+resource "kubernetes_cron_job_v1" "digest" {
+  depends_on = [
+    kubernetes_job.db_init,
+    kubernetes_manifest.db_external_secret,
+    kubernetes_manifest.slack_external_secret,
+  ]
+  metadata {
+    name      = "goldmane-edges-digest"
+    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
+    labels = {
+      app  = "goldmane-edge-aggregator"
+      tier = local.tiers.aux
+    }
+  }
+  spec {
+    # Daily 08:00 Europe/London — aligns with the alert-digest cadence.
+    schedule                      = "0 8 * * *"
+    timezone                      = "Europe/London"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 3
+    starting_deadline_seconds     = 600
+
+    job_template {
+      metadata {
+        labels = {
+          app = "goldmane-edge-aggregator"
+        }
+        annotations = {
+          # 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
+          "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
+        }
+      }
+      spec {
+        backoff_limit              = 2
+        active_deadline_seconds    = 300
+        ttl_seconds_after_finished = 86400
+
+        template {
+          metadata {
+            labels = {
+              app = "goldmane-edge-aggregator"
+            }
+          }
+          spec {
+            restart_policy = "OnFailure"
+            image_pull_secrets {
+              name = "ghcr-credentials"
+            }
+            container {
+              name = "digest"
+              # CronJobs track :latest + imagePullPolicy: Always (fleet
+              # convention) so the daily run picks up the current image.
+              image             = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
+              image_pull_policy = "Always"
+              args              = ["digest"]
+
+              env {
+                name = "DATABASE_URL"
+                value_from {
+                  secret_key_ref {
+                    name = "goldmane-edges-db-creds"
+                    key  = "DATABASE_URL"
+                  }
+                }
+              }
+              env {
+                name = "SLACK_WEBHOOK_URL"
+                value_from {
+                  secret_key_ref {
+                    name = "goldmane-edges-slack"
+                    key  = "SLACK_WEBHOOK_URL"
+                  }
+                }
+              }
+              env {
+                name = "SLACK_CHANNEL"
+                # Posts to #alerts. The dedicated #security channel was abandoned
+                # 2026-06-25 — the shared alertmanager_slack_api_url webhook's
+                # Slack app isn't a member of it (channel override 404s), so all
+                # Slack posts (incl. alertmanager's security-lane alerts) were
+                # consolidated to #alerts. See docs/runbooks/goldmane-flow-trail.md.
+                value = "#alerts"
+              }
+
+              resources {
+                requests = {
+                  cpu    = "10m"
+                  memory = "64Mi"
+                }
+                limits = {
+                  memory = "64Mi"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+}
+
+# -----------------------------------------------------------------------------
+# 7. Egress (default-deny consideration)
+# -----------------------------------------------------------------------------
+# Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
+# nothing is needed on the Goldmane side. No egress policy is declared here:
+# this namespace is default-allow egress today. IF/WHEN it is brought under the
+# wave-1 default-deny egress enforcement (per-namespace allowlists), add
+# (Global)NetworkPolicy egress rules permitting:
+#   - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
+#   - pg-cluster-rw.dbaas.svc.cluster.local:5432    (Postgres)
+#   - hooks.slack.com:443                            (digest -> Slack, internet)
+#   - kube-dns / CoreDNS :53                         (DNS, every namespace)
--- a/stacks/goldmane-edge-aggregator/terragrunt.hcl
+++ b/stacks/goldmane-edge-aggregator/terragrunt.hcl
@ -0,0 +1,24 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+# Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
+# (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
+# cloudflare_provider.tf and tiers.tf automatically — do NOT hand-write those.
+# This stack adds the hashicorp/tls provider via a local versions.tf (merged
+# into the generated required_providers).
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+
+dependency "vault" {
+  config_path  = "../vault"
+  skip_outputs = true
+}
+
+# The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
+# connection allowlist entry live in the vault stack (stacks/vault/main.tf).
+# The vault dependency above orders this stack after it so the ExternalSecret
+# can materialize the rotated credential on first apply.
--- a/stacks/grampsweb/main.tf
+++ b/stacks/grampsweb/main.tf
@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/hackmd/main.tf
+++ b/stacks/hackmd/main.tf
@ -208,6 +208,9 @@ module "ingress" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/health/main.tf
+++ b/stacks/health/main.tf
@ -250,6 +250,9 @@ module "ingress_test" {
 }

 resource "kubernetes_manifest" "external_secret_db" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" {
 }

 resource "kubernetes_manifest" "external_secret_kv" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/hermes-agent/main.tf
+++ b/stacks/hermes-agent/main.tf
@ -37,6 +37,9 @@ module "tls_secret" {
 # --- Secrets (ESO from Vault) ---

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/immich/frame-emo.tf
+++ b/stacks/immich/frame-emo.tf
@ -0,0 +1,155 @@
+# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned
+# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia
+# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's
+# Portal Mini (Sofia) via the portal-immich-frame app.
+# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account).
+
+resource "kubernetes_config_map" "frame_config_emo" {
+  metadata {
+    name      = "config-emo"
+    namespace = "immich"
+
+    labels = {
+      app = "frame-config-emo"
+    }
+    annotations = {
+      "reloader.stakater.com/match" = "true"
+    }
+  }
+
+  data = {
+    "Settings.yml" = <<-EOF
+    General:
+        Layout: single
+        Interval: 45
+        ImageZoom: true
+        ShowAlbumName: false
+        ShowProgressBar: false
+        ClockFormat: "HH:mm"
+        PhotoDateFormat: "dd/MM/yyyy"
+        WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]}
+        UnitSystem: metric
+        WeatherLatLong: "42.6977,23.3219"
+        Language: en
+    Accounts:
+        - ImmichServerUrl: http://immich.viktorbarzin.me
+          ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
+          ImagesFromDays: 730
+    EOF
+  }
+}
+
+
+resource "kubernetes_deployment" "immich-frame-emo" {
+  metadata {
+    name      = "immich-frame-emo"
+    namespace = "immich"
+    annotations = {
+      "reloader.stakater.com/search" = "true"
+    }
+    labels = {
+      tier = local.tiers.gpu
+    }
+  }
+
+  spec {
+    replicas = 1
+    selector {
+      match_labels = {
+        app = "immich-frame-emo"
+      }
+    }
+    strategy {
+      type = "RollingUpdate"
+    }
+    template {
+      metadata {
+        labels = {
+          app = "immich-frame-emo"
+        }
+        annotations = {
+          "dependency.kyverno.io/wait-for" = "immich-server.immich:2283"
+        }
+      }
+      spec {
+        container {
+          image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
+          name  = "immich-frame-emo"
+          resources {
+            requests = {
+              cpu    = "10m"
+              memory = "64Mi"
+            }
+            limits = {
+              memory = "128Mi"
+            }
+          }
+          port {
+            container_port = 8080
+            protocol       = "TCP"
+            name           = "http"
+          }
+          volume_mount {
+            name       = "config"
+            mount_path = "/app/Config"
+            read_only  = true
+          }
+        }
+        volume {
+          name = "config"
+          config_map {
+            name = "config-emo"
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image,                     # KEEL_IGNORE_IMAGE
+    ]
+  }
+}
+
+
+resource "kubernetes_service" "immich-frame-emo" {
+  metadata {
+    name      = "immich-frame-emo"
+    namespace = "immich"
+    labels = {
+      "app" = "immich-frame-emo"
+    }
+  }
+
+  spec {
+    selector = {
+      app = "immich-frame-emo"
+    }
+    port {
+      port        = 80
+      target_port = 8080
+    }
+  }
+}
+
+module "ingress_emo" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # Photo-frame kiosk display on Emo's Portal — headless browser pulling images
+  # via an Immich API key (no user login). Forward-auth would 302 the device to
+  # Authentik with no way to complete login.
+  # auth = "none": photo-frame kiosk; headless browser with API key; no user login.
+  auth            = "none"
+  dns_type        = "proxied"
+  namespace       = "immich"
+  name            = "highlights-immich-emo"
+  tls_secret_name = var.tls_secret_name
+  service_name    = "immich-frame-emo"
+}
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/insta2spotify/main.tf
+++ b/stacks/insta2spotify/main.tf
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
+  # The external-secrets controller takes server-side-apply ownership of
+  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
+  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
+  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
+  # the ESO v1 migration (the scale-to-0 push).
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
+  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
+  # lets the TF apply win instead of erroring on the field-manager conflict.
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
  }

  spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+    # ExternalSecret is dead (missing ig_graph_long_lived_token /
+    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+    # after minting a Meta long-lived token and populating those keys.
+    replicas = 0
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/job-hunter/main.tf
+++ b/stacks/job-hunter/main.tf
@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" {
 #     digest_to_address     — where the weekly digest goes
 #     digest_from_address   — From: header for the digest
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (7-day rotation).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" {
 # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/k8s-dashboard/oauth2_proxy.tf
+++ b/stacks/k8s-dashboard/oauth2_proxy.tf
@ -5,6 +5,9 @@
 # -----------------------------------------------------------------------------

 resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
@ -5,9 +5,11 @@
 <main>
 	<h1>Kubernetes Access Portal</h1>

-	<div class="callout warning">
-		<strong>VPN Required</strong> — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
-		<a href="/onboarding">See the Getting Started guide</a> for VPN setup instructions.
+	<div class="callout info">
+		<strong>Fastest way in:</strong> open the <a href="https://t3.viktorbarzin.me">web terminal</a> or the
+		<a href="https://k8s.viktorbarzin.me">dashboard</a> and sign in — no install, no VPN needed. Prefer your
+		own machine? The <a href="/onboarding#path-laptop">local-setup guide</a> covers VPN + kubectl, and the
+		<a href="/onboarding">Getting Started page</a> compares all three access paths.
 	</div>

 	<section>
@ -26,6 +28,7 @@
 			<p><strong>Assigned namespaces:</strong> {data.namespaces.join(', ')}</p>

 			<h3>Quick Commands</h3>
+			<p>Run these as-is in the <a href="https://t3.viktorbarzin.me">web terminal</a> — it's already signed in as you.</p>
 			<pre>
 # Check your pods
 kubectl get pods -n {data.namespaces[0]}
@ -47,16 +50,23 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \

 	<section>
 		<h2>Get Started</h2>
+		<h3>No setup — start now</h3>
+		<ol>
+			<li><a href="https://t3.viktorbarzin.me">Open the web terminal</a> — a ready shell with kubectl, Vault and your repos already set up</li>
+			<li><a href="https://k8s.viktorbarzin.me">Open the dashboard</a> — point-and-click view of your workloads</li>
+		</ol>
+		<h3>On your own machine</h3>
 		<ol>
 			{#if data.role === 'namespace-owner'}
-				<li><a href="/onboarding?role=namespace-owner">Complete the namespace-owner onboarding guide</a></li>
+				<li><a href="/onboarding?role=namespace-owner#path-laptop">Follow the namespace-owner setup</a> (VPN, kubectl, Vault, encrypted state)</li>
 			{:else}
-				<li><a href="/onboarding">Complete the onboarding guide</a> (VPN, kubectl, git)</li>
+				<li><a href="/onboarding#path-laptop">Follow the local setup</a> (VPN, kubectl, git)</li>
 			{/if}
 			<li><a href="/setup">Install kubectl and kubelogin</a></li>
 			<li><a href="/download">Download your kubeconfig</a></li>
 			<li>Run <code>kubectl get namespaces</code> to verify access</li>
 		</ol>
+		<p><a href="/onboarding">Compare all three access paths →</a></p>
 	</section>

 	<section>
@ -91,12 +101,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
 		border-radius: 6px;
 		margin: 1rem 0;
 	}
-	.callout.warning {
-		background: #fff3cd;
-		border-left: 4px solid #ffc107;
+	.callout.info {
+		background: #e8f4fd;
+		border-left: 4px solid #2196f3;
 	}
 	.callout a {
-		color: #856404;
+		color: #0d47a1;
 		font-weight: 600;
 	}
 </style>
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
@ -5,22 +5,123 @@

 <main class="content">
 	<h1>Getting Started</h1>
-	<p>Welcome! Follow these steps to get access to the home Kubernetes cluster.</p>
-
-	<div class="role-tabs">
-		<a href="/onboarding" class:active={!showNamespaceOwner}>General User</a>
-		<a href="/onboarding?role=namespace-owner" class:active={showNamespaceOwner}>Namespace Owner</a>
-	</div>
+	<p>
+		Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
+		the first two need <strong>zero setup</strong> and open right in your browser.
+	</p>

 	<section>
-		<h2>Step 0 — Join the VPN</h2>
-		<p>The cluster is on a private network (<code>10.0.20.0/24</code>). You need VPN access first.</p>
+		<h2>Three ways in</h2>
+		<table>
+			<thead><tr><th>Path</th><th>Best for</th><th>Setup</th></tr></thead>
+			<tbody>
+				<tr>
+					<td><a href="#path-terminal"><strong>A — Web terminal</strong></a></td>
+					<td>Just want to start working now</td>
+					<td>None — opens in your browser</td>
+				</tr>
+				<tr>
+					<td><a href="#path-dashboard"><strong>B — Web dashboard</strong></a></td>
+					<td>Click around, watch your app, read logs</td>
+					<td>None — opens in your browser</td>
+				</tr>
+				<tr>
+					<td><a href="#path-laptop"><strong>C — Your own machine</strong></a></td>
+					<td>kubectl / Terraform locally, full control</td>
+					<td>VPN + one-line installer</td>
+				</tr>
+			</tbody>
+		</table>
+		<div class="callout info">
+			<strong>Not sure?</strong> Start with the <a href="#path-terminal">web terminal (Path A)</a>.
+			Everything is already installed and your repos are already cloned — you can run your first
+			<code>kubectl</code> command within a minute, from any device.
+		</div>
+	</section>
+
+	<section id="path-terminal" class="path">
+		<h2>Path A — Web terminal <span class="badge rec">Recommended</span> <span class="badge none">No setup</span></h2>
+		<p>
+			A full terminal that runs in your browser — nothing to install, works from any device
+			(even a tablet). It drops you into your own account on the shared workstation, with every
+			tool already set up.
+		</p>
+		<ol>
+			<li>Open <a href="https://t3.viktorbarzin.me" target="_blank">t3.viktorbarzin.me</a></li>
+			<li>Sign in with your Authentik account (the same SSO login as this portal)</li>
+			<li>You land in a ready-to-use shell. Try it:
+				<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
+			</li>
+		</ol>
+		<div class="callout info">
+			<strong>Already done for you</strong> on the workstation:
+			<ul>
+				<li><code>kubectl</code> + your kubeconfig, scoped to your namespaces (no login dance)</li>
+				<li><code>vault</code>, <code>terragrunt</code>, <code>terraform</code>, <code>sops</code>, <code>kubeseal</code></li>
+				<li>Your repos cloned under <code>~/code</code> — the <code>infra</code> repo plus your own project repos</li>
+				<li>Claude Code, ready to pair with you on changes</li>
+			</ul>
+		</div>
+		<div class="callout warning">
+			<strong>No access yet?</strong> The workstation is provisioned per person. If
+			<code>t3.viktorbarzin.me</code> says you're not authorized, ask Viktor to add you
+			(<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a> or Slack).
+		</div>
+	</section>
+
+	<section id="path-dashboard" class="path">
+		<h2>Path B — Web dashboard <span class="badge none">No setup</span></h2>
+		<p>
+			A point-and-click view of the cluster — browse your pods, read logs, restart a deployment,
+			check events. Nothing to install.
+		</p>
+		<ol>
+			<li>Open <a href="https://k8s.viktorbarzin.me" target="_blank">k8s.viktorbarzin.me</a></li>
+			<li>Sign in with your Authentik account</li>
+			<li>
+				You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
+				<strong>no token to paste</strong>. The portal injects your personal access token for you.
+			</li>
+		</ol>
+		<div class="callout info">
+			Scoped to your namespace(s): you can see and manage your own workloads, but not other
+			tenants'. This path uses a per-user token that does <em>not</em> depend on CLI login, so it
+			keeps working even if <code>kubectl</code> OIDC login is having a bad day — making it the
+			reliable fallback for Path C.
+		</div>
+	</section>
+
+	<section id="path-laptop" class="path c">
+		<h2>Path C — From your own machine</h2>
+		<p>
+			For running <code>kubectl</code>, <code>vault</code> and Terraform locally. This is the most
+			powerful path and the one to use for infrastructure changes — it just needs a bit more setup
+			because the cluster API lives on a private network.
+		</p>
+
+		<div class="role-tabs">
+			<a href="/onboarding?role=general#path-laptop" class:active={!showNamespaceOwner}>General User</a>
+			<a href="/onboarding?role=namespace-owner#path-laptop" class:active={showNamespaceOwner}>Namespace Owner</a>
+		</div>
+		<p class="prereq">
+			{#if showNamespaceOwner}
+				Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy
+				your own app stacks.
+			{:else}
+				General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the
+				<strong>Namespace Owner</strong> tab above.)
+			{/if}
+		</p>
+
+		<section>
+			<h3>Step 1 — Join the VPN</h3>
+			<p>The cluster API is on a private network (<code>10.0.20.0/24</code>), so you need VPN access first.</p>
 			<ol>
 				<li>Install <a href="https://tailscale.com/download" target="_blank">Tailscale</a> for your OS</li>
 				<li>Run this in your terminal:
 					<pre>tailscale login --login-server https://headscale.viktorbarzin.me</pre>
 				</li>
-			<li>A browser window will open with a registration URL</li>
+				<li>A browser window opens with a registration URL</li>
 				<li>Send that URL to Viktor via email (<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a>) or Slack</li>
 				<li>Wait for approval (usually within a few hours)</li>
 				<li>Once approved, test: <pre>ping 10.0.20.100</pre></li>
@ -28,62 +129,49 @@
 		</section>

 		<section>
-		<h2>Step 1 — Log in to the portal</h2>
-		<p>Visit <a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a> and sign in with your Authentik account.</p>
-		<p>If you don't have an account yet, ask Viktor to create one.</p>
+			<h3>Step 2 — Install the tools</h3>
+			<p>Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to <code>~/.kube/config-home</code>:</p>
+			<h4>macOS</h4>
+			<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
+			<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
+			<h4>Linux</h4>
+			<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
+			<h4>Windows</h4>
+			<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
 		</section>

 		<section>
-		<h2>Step 2 — Set up kubectl</h2>
-		<p>Run one of these commands in your terminal to install everything automatically:</p>
-		<h3>macOS</h3>
-		<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
-		<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
-		<h3>Linux</h3>
-		<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
-		<h3>Windows</h3>
-		<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
+			<h3>Step 3 — Verify access</h3>
+			<p>Run this. The first time, it opens your browser for SSO login:</p>
+			<pre>kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}</pre>
+			<p>You should see your resources (or an empty list if you haven't deployed anything yet).</p>
+			<div class="callout warning">
+				<strong>Browser login loops, or kubectl says "Unauthorized"?</strong> Command-line SSO
+				(OIDC) can occasionally be unavailable. When that happens, use the
+				<a href="#path-dashboard">web dashboard (Path B)</a> or the
+				<a href="#path-terminal">web terminal (Path A)</a> — both authenticate a different way and
+				keep working — and let Viktor know.
+			</div>
+			<p class="prereq">Connection error instead? Make sure the VPN is up: <code>tailscale status</code>.</p>
 		</section>

 		{#if showNamespaceOwner}
 			<section>
-			<h2>Step 3 — Log into Vault</h2>
+				<h3>Step 4 — Log into Vault</h3>
 				<p>Vault manages your secrets and issues dynamic Kubernetes credentials.</p>
 				<pre>vault login -method=oidc</pre>
 				<p>This opens your browser for Authentik SSO. After login, your token is saved to <code>~/.vault-token</code>.</p>
 			</section>

 			<section>
-			<h2>Step 4 — Verify kubectl access</h2>
-			<p>Run this command. It will open your browser for OIDC login the first time:</p>
-			<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
-			<p>You should see an empty list (no resources) or your running pods.</p>
-		</section>
-
-		<section>
-			<h2>Step 5 — Clone the infra repo</h2>
+				<h3>Step 5 — Clone the infra repo</h3>
 				<pre>git clone https://github.com/ViktorBarzin/infra.git
 cd infra</pre>
 				<p>This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.</p>
 			</section>

 			<section>
-			<h2>Step 6 — Install tools</h2>
-			<p>You need <code>sops</code> and <code>terragrunt</code> to work with infrastructure state:</p>
-			<h3>macOS</h3>
-			<pre>brew install sops terragrunt</pre>
-			<h3>Linux</h3>
-			<pre># sops
-curl -LO https://github.com/getsops/sops/releases/latest/download/sops-v3.9.4.linux.amd64
-sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
-
-# terragrunt
-curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
-sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt</pre>
-		</section>
-
-		<section>
-			<h2>Step 7 — Decrypt your state</h2>
+				<h3>Step 6 — Decrypt your state</h3>
 				<p>Terraform state is encrypted with SOPS. Your Vault login gives you access to <strong>only your stacks</strong>.</p>
 				<pre># Make sure you're logged into Vault
 vault login -method=oidc
@ -132,7 +220,7 @@ cd stacks/YOUR_NAMESPACE
 			</section>

 			<section>
-			<h2>Step 8 — Create your first app stack</h2>
+				<h3>Step 7 — Create your first app stack</h3>
 				<ol>
 					<li>Copy the template: <pre>cp -r stacks/_template stacks/myapp
 mv stacks/myapp/main.tf.example stacks/myapp/main.tf</pre></li>
@ -153,7 +241,7 @@ git push</pre>
 			</section>

 			<section>
-			<h2>Architecture Overview</h2>
+				<h3>Architecture Overview</h3>
 				<p>Here's how your changes flow through the system:</p>

 				<div class="diagram">
@ -204,31 +292,18 @@ git push</pre>
 			</section>
 		{:else}
 			<section>
-			<h2>Step 3 — Verify access</h2>
-			<p>Run this command. It will open your browser for login the first time:</p>
-			<pre>kubectl get namespaces</pre>
-			<p>You should see output like:</p>
-			<pre class="output">NAME              STATUS   AGE
-default           Active   200d
-kube-system       Active   200d
-monitoring        Active   200d
-...</pre>
-			<p>If you get a connection error, make sure your VPN is connected (<code>tailscale status</code>).</p>
-		</section>
-
-		<section>
-			<h2>Step 4 — Clone the repo</h2>
+				<h3>Step 4 — Clone the repo</h3>
 				<pre>git clone https://github.com/ViktorBarzin/infra.git
 cd infra</pre>
 				<p>This is where all the infrastructure configuration lives.</p>
 			</section>

 			<section>
-			<h2>Step 5 — Your first change</h2>
+				<h3>Step 5 — Your first change</h3>
 				<ol>
 					<li>Create a branch: <pre>git checkout -b my-first-change</pre></li>
 					<li>Edit a service file (e.g., change an image tag in <code>stacks/echo/main.tf</code>)</li>
-				<li>Commit and push: <pre>git add . && git commit -m "my first change" && git push -u origin my-first-change</pre></li>
+					<li>Commit and push: <pre>git add . &amp;&amp; git commit -m "my first change" &amp;&amp; git push -u origin my-first-change</pre></li>
 					<li>Open a Pull Request on GitHub</li>
 					<li>Viktor reviews and merges</li>
 					<li>Woodpecker CI automatically applies the change to the cluster</li>
@ -236,19 +311,29 @@ cd infra</pre>
 				</ol>
 			</section>
 		{/if}
+	</section>
 </main>

 <style>
 	.content { max-width: 768px; margin: 2rem auto; padding: 0 1rem; font-family: system-ui, -apple-system, sans-serif; line-height: 1.6; }
 	.content h1 { border-bottom: 1px solid #e0e0e0; padding-bottom: 0.5rem; }
 	.content h2 { margin-top: 2rem; color: #333; }
-	.content h3 { color: #666; margin: 1rem 0 0.25rem; }
+	.content h3 { color: #444; margin: 1.25rem 0 0.25rem; }
+	.content h4 { color: #666; margin: 0.75rem 0 0.25rem; }
 	.content pre { background: #1e1e1e; color: #d4d4d4; padding: 1rem; border-radius: 6px; overflow-x: auto; }
-	.content pre.output { background: #f5f5f5; color: #333; }
 	.content code { background: #f0f0f0; padding: 2px 6px; border-radius: 3px; }
 	.content .prereq { font-size: 0.9rem; color: #666; font-style: italic; }
 	section { margin: 2rem 0; }
-	.role-tabs { display: flex; gap: 0; margin: 1.5rem 0; border-bottom: 2px solid #e0e0e0; }
+	section section { margin: 1.25rem 0; }
+
+	.path { border-left: 4px solid #4fc3f7; padding-left: 1.25rem; scroll-margin-top: 4rem; }
+	.path.c { border-left-color: #bbb; }
+
+	.badge { display: inline-block; font-size: 0.65rem; font-weight: 700; text-transform: uppercase; letter-spacing: 0.5px; padding: 0.15rem 0.5rem; border-radius: 4px; vertical-align: middle; margin-left: 0.4rem; }
+	.badge.rec { background: #d4f8d4; color: #1b5e20; }
+	.badge.none { background: #e3f2fd; color: #0d47a1; }
+
+	.role-tabs { display: flex; gap: 0; margin: 1.5rem 0 0.5rem; border-bottom: 2px solid #e0e0e0; }
 	.role-tabs a { padding: 0.5rem 1.5rem; text-decoration: none; color: #666; border-bottom: 2px solid transparent; margin-bottom: -2px; }
 	.role-tabs a.active { color: #333; border-bottom-color: #333; font-weight: 600; }
 	table { border-collapse: collapse; width: 100%; margin: 0.5rem 0; }
@ -258,6 +343,7 @@ cd infra</pre>
 	.callout { padding: 1rem; border-radius: 6px; margin: 1rem 0; }
 	.callout.info { background: #e8f4fd; border-left: 4px solid #2196f3; }
 	.callout.warning { background: #fff3cd; border-left: 4px solid #ffc107; }
+	.callout ul { margin: 0.5rem 0 0; padding-left: 1.25rem; }

 	.diagram { background: #fafafa; border: 1px solid #e0e0e0; border-radius: 8px; padding: 1.5rem; margin: 1.5rem 0; }
 	.diagram h3 { margin: 0 0 1rem 0; color: #333; font-size: 0.95rem; text-transform: uppercase; letter-spacing: 0.5px; }
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/services/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/services/+page.svelte
@ -2,6 +2,19 @@
 	<h1>Service Catalog</h1>
 	<p>70+ services running on the cluster. Here are the most commonly used:</p>

+	<section>
+		<h2>Cluster Access</h2>
+		<table>
+			<thead><tr><th>Service</th><th>URL</th><th>Description</th></tr></thead>
+			<tbody>
+			<tr><td>Web Terminal</td><td><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a></td><td>Browser shell on the shared workstation — kubectl, Vault &amp; your repos preinstalled (zero setup)</td></tr>
+			<tr><td>Kubernetes Dashboard</td><td><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a></td><td>Point-and-click view of your workloads, auto-authenticated (zero setup)</td></tr>
+			<tr><td>Access Portal</td><td><a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a></td><td>This portal — onboarding, kubeconfig download, setup script</td></tr>
+			<tr><td>Vault</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Secrets &amp; dynamic credentials — <code>vault login -method=oidc</code></td></tr>
+			</tbody>
+		</table>
+	</section>
+
 	<section>
 		<h2>Core Services</h2>
 		<table>
@ -22,7 +35,7 @@
 			<tbody>
 			<tr><td>Nextcloud</td><td><a href="https://nextcloud.viktorbarzin.me">nextcloud.viktorbarzin.me</a></td><td>File storage, calendar, contacts</td></tr>
 			<tr><td>Immich</td><td><a href="https://immich.viktorbarzin.me">immich.viktorbarzin.me</a></td><td>Photo library (Google Photos alternative)</td></tr>
-			<tr><td>Vaultwarden</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Password manager</td></tr>
+			<tr><td>Vaultwarden</td><td><a href="https://vaultwarden.viktorbarzin.me">vaultwarden.viktorbarzin.me</a></td><td>Password manager</td></tr>
 			<tr><td>Paperless-ngx</td><td><a href="https://pdf.viktorbarzin.me">pdf.viktorbarzin.me</a></td><td>Document management</td></tr>
 			<tr><td>Navidrome</td><td><a href="https://music.viktorbarzin.me">music.viktorbarzin.me</a></td><td>Music streaming</td></tr>
 			<tr><td>Tandoor</td><td><a href="https://recipes.viktorbarzin.me">recipes.viktorbarzin.me</a></td><td>Recipe manager</td></tr>
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/troubleshooting/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/troubleshooting/+page.svelte
@ -11,6 +11,26 @@
 		</ol>
 	</section>

+	<section>
+		<h2>Browser login loops, or kubectl says "Unauthorized"</h2>
+		<p>Command-line SSO (OIDC) login can occasionally be unavailable. You don't have to wait for it — these authenticate a different way and keep working:</p>
+		<ul>
+			<li><a href="https://k8s.viktorbarzin.me">Web dashboard</a> — auto-authenticated, no token to paste</li>
+			<li><a href="https://t3.viktorbarzin.me">Web terminal</a> — its kubectl is already wired up</li>
+		</ul>
+		<p>Let Viktor know so the CLI login path gets fixed.</p>
+	</section>
+
+	<section>
+		<h2>Don't want to set up a local machine at all?</h2>
+		<p>Skip the VPN and CLI install entirely:</p>
+		<ul>
+			<li><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a> — a browser shell with everything preinstalled</li>
+			<li><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a> — a point-and-click dashboard</li>
+		</ul>
+		<p>Both just need your Authentik login. See the <a href="/onboarding">Getting Started</a> guide.</p>
+	</section>
+
 	<section>
 		<h2>"Forbidden" or "Permission denied"</h2>
 		<p>You may not have access to that namespace. Your access is scoped to specific namespaces.</p>
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@ -483,31 +483,49 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  exit 0
                fi

-                slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
+                echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"

                if [ "$DRY_RUN" = "true" ]; then
-                  slack "DRY_RUN — not spawning preflight Job"
+                  slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
                  exit 0
                fi

                # 7. Spawn Job 0 (preflight) via envsubst on the job-template
                #    Idempotency: deterministic name reconciles via `apply`.
                JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
+                MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
+                ANNOUNCE=yes   # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.

-                # Retry-on-failure idempotency: skip only if an existing preflight
-                # Job is Active/Complete. A *Failed* preflight (aborted on a
-                # transient gate, e.g. a spurious critical alert) is deleted and
-                # re-spawned — otherwise its deterministic name + 7d TTL wedges
-                # the entire pipeline until it ages out. (Stuck-pipeline fix
-                # 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
+                # Idempotency + nightly re-evaluation:
+                #   - FAILED preflight (transient gate abort, e.g. a spurious
+                #     critical alert / unhealthy node) -> delete + re-spawn, announced.
+                #   - COMPLETE preflight but NO master Job spawned -> the compat
+                #     gate REFUSED the target (a blocked/held preflight now Completes
+                #     cleanly rather than Failing). Re-spawn SILENTLY so the gate re-checks
+                #     nightly (the refusal may have cleared: addon upgraded / matrix
+                #     updated / upstream shipped) WITHOUT nightly Slack noise for a
+                #     standing refusal — the morning report (+ K8sUpgradeBlocked for
+                #     actionable) is the signal.
+                #   - Otherwise (Active, or Complete with the chain advanced) -> skip.
+                # The old "Failed-only re-spawn" left a refused-but-Complete preflight
+                # skipped until its 7d TTL — too slow now that refusals Complete
+                # instead of Failing (2026-06-28). Deterministic names; `apply`
+                # reconciles. (Stuck-pipeline history: a transient critical alert
+                # wedged 1.34.9 for 5 days, 2026-06-17 — hence Failed always re-spawns.)
                if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
                  JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
                    -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
+                  JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
+                    -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
                  if [ "$JOB_FAILED" = "True" ]; then
                    slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
                    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
+                  elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
+                    echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
+                    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
+                    ANNOUNCE=no
                  else
-                    slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
+                    echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
                    exit 0
                  fi
                fi
@ -521,7 +539,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  < /template/job-template.yaml \
                  | /usr/local/bin/kubectl apply -f -

+                if [ "$ANNOUNCE" = "yes" ]; then
                  slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
+                fi
              EOT
              ]
              env {
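Review note: the re-spawn rules in the detector script above reduce to a small decision table. The following is a minimal Python sketch of that table, not the implementation; the real logic is the shell embedded in `main.tf`. The names `PreflightState` and `decide` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PreflightState:
    """Hypothetical snapshot of what the detector reads via kubectl."""
    exists: bool         # a preflight Job with the deterministic name exists
    failed: bool         # its Failed condition is "True"
    complete: bool       # its Complete condition is "True"
    master_exists: bool  # the master Job for the same target exists


def decide(state: PreflightState) -> Tuple[str, bool]:
    """Return (action, announce), mirroring the shell branches.

    action is "spawn", "respawn" or "skip"; announce says whether the
    spawn would be posted to Slack (the ANNOUNCE variable in the script).
    """
    if not state.exists:
        return "spawn", True        # first detection of this target
    if state.failed:
        return "respawn", True      # transient gate abort: delete + re-spawn
    if state.complete and not state.master_exists:
        return "respawn", False     # gate refused: silent nightly re-check
    return "skip", False            # active, or the chain already advanced


if __name__ == "__main__":
    refused = PreflightState(exists=True, failed=False,
                             complete=True, master_exists=False)
    print(decide(refused))
```

The third branch is the new behaviour in this change: a Complete preflight with no master Job is treated as a standing refusal and re-evaluated without a Slack message.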
--- a/stacks/k8s-version-upgrade/scripts/addon-compat.json
+++ b/stacks/k8s-version-upgrade/scripts/addon-compat.json
@ -1,5 +1,5 @@
 {
-  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
+  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports. An addon entry may also set \"pinned\": true (+ \"pin_reason\") to mark it deliberately held: the gate classifies its block as PINNED/held (quiet — no alert, nightly report only) even if a supporting version exists, for upgrades coupled to other work we're not ready for (e.g. gpu-operator's NVIDIA-driver/Ubuntu coupling). A block with NO supporting version in the matrix is WAITING (also quiet); a block a newer matrix version would clear is ACTIONABLE (alerts).",
  "addons": [
    {
      "name": "calico",
@ -48,7 +48,9 @@
      "max_k8s": {
        "25.10": "1.35",
        "26.3": "1.36"
-      }
+      },
+      "pinned": true,
+      "pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu/kernel; held until the driver/OS path is ready. Unpin = delete pinned + pin_reason."
    }
  ],
  "containerd_min": {
--- a/stacks/k8s-version-upgrade/scripts/compat-gate.py
+++ b/stacks/k8s-version-upgrade/scripts/compat-gate.py
@ -14,9 +14,20 @@ classes of blocker:
  3. containerd    — every node's containerd >= the target's floor, if the matrix
                     declares one (e.g. the 1.7.x -> k8s 1.37 cliff)

+Each reason line is tagged with its class so the caller can act differently:
+  [ACTIONABLE]  a newer addon version (present in the matrix) supports the
+                target — upgrading it clears the block. Also covers removed-API
+                / containerd blocks and the unreadable-version fail-safe.
+  [WAITING]     no released addon version supports the target yet — only an
+                upstream release can clear it (e.g. kyverno/ESO behind a new k8s).
+  [PINNED]      a supporting version exists but the addon is deliberately held
+                (matrix `pinned: true`, e.g. gpu-operator's driver/OS coupling).
+
 Exit 0  = safe, proceed.
-Exit 2  = BLOCKED — prints one human reason per line (caller pushes
-          k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
+Exit 2  = BLOCKED, actionable — >=1 blocker, none held. Caller pushes
+          k8s_upgrade_blocked=1 (-> K8sUpgradeBlocked alert) and halts.
+Exit 4  = HELD — >=1 waiting-upstream/pinned blocker (held wins over actionable).
+          Caller pushes k8s_upgrade_held=1 (no alert; nightly report only) and halts.
 Exit 3  = the gate itself errored — caller treats as a block (fail safe).

 Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
@ -62,6 +73,20 @@ def running_minor():
    return min(minors) if minors else None


+def _addon_resolution(a, tgt, running_ver):
+    """For a BLOCKING addon, decide whether a newer matrix version would clear
+    the block. Returns ("actionable", hint) when some version key has
+    max_k8s >= target AND is newer than the running version (upgrading it clears
+    the block); otherwise ("waiting", hint) — nothing released supports the
+    target yet, so only an upstream release can clear it."""
+    sufficient = [floor for floor, mk in a["max_k8s"].items()
+                  if minor(mk) and minor(mk) >= tgt and minor(floor) > minor(running_ver)]
+    if sufficient:
+        best = min(sufficient, key=minor)  # smallest sufficient upgrade
+        return "actionable", f"upgrade {a['name']} to >= {best}"
+    return "waiting", f"no released {a['name']} version supports k8s {tgt[0]}.{tgt[1]} yet"
+
+
 def check_addons(matrix, tgt, running):
    # A target at or below the RUNNING minor (a patch, or a same/lower minor)
    # crosses into no new k8s minor, so every installed addon is already
@ -77,25 +102,36 @@ def check_addons(matrix, tgt, running):
                    "-o", "jsonpath={.spec.template.spec.containers[*].image}"])
        m = re.search(a["image_re"], img or "")
        if not m:
-            # Fail safe: if we can't read the running version, don't upgrade blind.
-            reasons.append(f"addon {a['name']}: could not read running version "
-                           f"(img='{img or 'not found'}') — refusing to upgrade blind")
+            # Fail safe: can't read the running version → block; a human must
+            # look (ACTIONABLE), never upgrade blind.
+            reasons.append(f"[ACTIONABLE] addon {a['name']}: could not read running "
+                           f"version (img='{img or 'not found'}') — refusing to upgrade blind")
            continue
-        running = m.group(1)  # e.g. "3.26"
+        running_ver = m.group(1)  # e.g. "3.26"
        # max_k8s maps an addon-version floor -> highest supported k8s minor.
        # Pick the highest floor that is <= the running version.
        max_k8s = None
        for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
-            if minor(running) >= minor(floor):
+            if minor(running_ver) >= minor(floor):
                max_k8s = mk
                break
        if max_k8s is None:
-            reasons.append(f"addon {a['name']} v{running}: below the lowest version "
-                           f"in the compat matrix — unknown k8s support")
+            reasons.append(f"[ACTIONABLE] addon {a['name']} v{running_ver}: below the lowest "
+                           f"version in the compat matrix — unknown k8s support")
            continue
        if tgt > minor(max_k8s):
-            reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
-                           f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
+            base = (f"addon {a['name']} v{running_ver} supports k8s <= {max_k8s}; "
+                    f"target {tgt[0]}.{tgt[1]} exceeds it")
+            # A deliberately-pinned addon is HELD even if a newer version exists
+            # (e.g. gpu-operator 26.3 supports 1.36 but its driver/OS coupling
+            # means we don't take it yet) — the pin overrides actionable.
+            if a.get("pinned"):
+                why = a.get("pin_reason", "deliberately pinned")
+                reasons.append(f"[PINNED] {base} — pinned ({why}); holding")
+            else:
+                kind, hint = _addon_resolution(a, tgt, running_ver)
+                tag = "ACTIONABLE" if kind == "actionable" else "WAITING"
+                reasons.append(f"[{tag}] {base} — {hint}")
    return reasons


@ -109,11 +145,11 @@ def check_removed_apis(tgt):
            rr = lbl.get("removed_release", "")
            if rr and minor(rr) and tgt >= minor(rr):
                g = lbl.get("group") or "core"
-                reasons.append(f"deprecated API {g}/{lbl.get('version')} "
+                reasons.append(f"[ACTIONABLE] deprecated API {g}/{lbl.get('version')} "
                               f"{lbl.get('resource')} is in use and is removed in "
                               f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
    except Exception as e:
-        reasons.append(f"removed-API check could not query Prometheus ({e}) — "
+        reasons.append(f"[ACTIONABLE] removed-API check could not query Prometheus ({e}) — "
                       f"refusing to upgrade blind")
    return reasons

@ -132,11 +168,28 @@ def check_containerd(matrix, tgt):
        name, _, ver = line.partition(" ")
        cv = ver.replace("containerd://", "")
        if minor(cv) and minor(cv) < minor(floor):
-            reasons.append(f"node {name} containerd {cv} < required {floor} "
+            reasons.append(f"[ACTIONABLE] node {name} containerd {cv} < required {floor} "
                           f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
    return reasons


+def held_reason(r):
+    """True for a blocker the cluster cannot act on now: no released version
+    supports the target (WAITING) or the addon is deliberately pinned (PINNED).
+    These are quiet (no alert) — only an upstream release / a manual unpin clears
+    them, so a nightly 'needs attention' alert would be crying wolf."""
+    return r.startswith("[WAITING]") or r.startswith("[PINNED]")
+
+
+def exit_code(reasons):
+    """Map reasons to the gate verdict: 0 safe · 2 actionable block · 4 held.
+    Held WINS over actionable on a mix — if anything is waiting/pinned the target
+    can't proceed yet, so acting on the actionable blockers would be premature."""
+    if not reasons:
+        return 0
+    return 4 if any(held_reason(r) for r in reasons) else 2
+
+
 def main():
    if len(sys.argv) < 2:
        print("usage: compat-gate.py <target-k8s-version>  (matrix JSON on stdin)")
@ -158,9 +211,9 @@ def main():
    if reasons:
        for r in reasons:
            print(r)
-        sys.exit(2)
+    else:
        print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
-    sys.exit(0)
+    sys.exit(exit_code(reasons))


 if __name__ == "__main__":
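Review note: the classification and exit-code mapping above can be exercised without a cluster. The sketch below is a simplified stand-in under stated assumptions: `minor` is assumed to parse `X.Y` into a tuple, `classify` is a hypothetical condensation of `check_addons` plus `_addon_resolution`, and the calico matrix values are illustrative rather than copied from `addon-compat.json`.

```python
import re


def minor(version):
    """Parse 'X.Y...' into (X, Y); None if unparseable (assumed behaviour)."""
    m = re.match(r"v?(\d+)\.(\d+)", version or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def classify(addon, target, running_ver):
    """Return a tagged reason line, or None when the addon does not block."""
    tgt, run = minor(target), minor(running_ver)
    # The highest floor <= the running version decides the supported ceiling.
    floors = sorted(addon["max_k8s"].items(),
                    key=lambda kv: minor(kv[0]), reverse=True)
    ceiling = next((mk for floor, mk in floors if run >= minor(floor)), None)
    if ceiling is None:
        return f"[ACTIONABLE] {addon['name']} v{running_ver}: below the matrix"
    if tgt <= minor(ceiling):
        return None
    if addon.get("pinned"):  # the pin overrides actionable
        return f"[PINNED] {addon['name']} v{running_ver}: held"
    newer = [f for f, mk in addon["max_k8s"].items()
             if minor(mk) >= tgt and minor(f) > run]
    if newer:
        best = min(newer, key=minor)  # smallest sufficient upgrade
        return f"[ACTIONABLE] {addon['name']}: upgrade to >= {best}"
    return f"[WAITING] {addon['name']}: no released version supports k8s {target}"


def exit_code(reasons):
    """0 safe, 2 actionable block, 4 held (held wins on a mix)."""
    if not reasons:
        return 0
    held = any(r.startswith(("[WAITING]", "[PINNED]")) for r in reasons)
    return 4 if held else 2


# Illustrative matrices; the calico values are an assumption for the example.
CALICO = {"name": "calico", "max_k8s": {"3.30": "1.35", "3.32": "1.36"}}
KYVERNO = {"name": "kyverno", "max_k8s": {"1.16": "1.34", "1.18": "1.35"}}
GPU = {"name": "gpu-operator", "max_k8s": {"25.10": "1.35", "26.3": "1.36"},
       "pinned": True}

if __name__ == "__main__":
    reasons = [r for r in (classify(CALICO, "1.36", "3.30"),
                           classify(KYVERNO, "1.36", "1.18"),
                           classify(GPU, "1.36", "25.10")) if r]
    print("\n".join(reasons))
    print("exit", exit_code(reasons))
```

With these inputs calico is actionable, kyverno is waiting on upstream and gpu-operator is pinned, so the mix resolves to the held verdict rather than an alerting block.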
--- a/stacks/k8s-version-upgrade/scripts/nightly-report.py
+++ b/stacks/k8s-version-upgrade/scripts/nightly-report.py
@ -69,6 +69,29 @@ def fmt_age(seconds):
    return f"{seconds / 86400:.1f}d ago"


+def _render_reasons(blocker_reasons):
+    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
+    tag into labelled sections, stripping the tag from each bullet. Untagged
+    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
+    Returns a list of message lines."""
+    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
+    out, shown = [], set()
+    for title, tag in (("Action needed", "[ACTIONABLE]"),
+                       ("Waiting on upstream", "[WAITING]"),
+                       ("Pinned (held by us)", "[PINNED]")):
+        sub = [l for l in lines if l.startswith(tag)]
+        if sub:
+            out.append(f"{title}:")
+            for l in sub:
+                shown.add(l)
+                out.append(f"  • {l[len(tag):].strip()}")
+    rest = [l for l in lines if l not in shown]
+    if rest:
+        out.append("Blockers:")
+        out.extend(f"  • {l}" for l in rest)
+    return out
+
+
 def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
    """Build the Slack message text from gathered facts. PURE.

@ -98,6 +121,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
+    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))

    if avail:
        lbl = avail[0][0]
@ -105,7 +129,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
        kind = lbl.get("kind", "?")
        tgt_line = f"Detected target: *{target}* ({kind})"
        if blocked:
-            headline = f"🔴 BLOCKED — compat gate refused {target}"
+            # actionable block: something we can fix now, e.g. an addon upgrade (K8sUpgradeBlocked fired)
+            headline = f"🔴 BLOCKED (action needed) — {target}"
+        elif held:
+            # waiting on upstream and/or a pinned addon — nothing to do but wait;
+            # intentionally NO alert, this nightly line is the only signal
+            headline = f"⏸️ HELD — {target} not yet upgradable"
        elif len(versions) == 1 and target == versions[0]:
            headline = f"🟢 UPGRADED — all nodes now on {target}"
        else:
@ -120,12 +149,8 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):

    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]

-    if blocked and blocker_reasons:
-        msg.append("Blockers (live):")
-        for r in blocker_reasons.splitlines():
-            r = r.strip()
-            if r:
-                msg.append(f"  • {r}")
+    if (blocked or held) and blocker_reasons:
+        msg.extend(_render_reasons(blocker_reasons))

    if jobs:
        msg.append("Chain jobs (recent):")
@ -213,7 +238,8 @@ def main():

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
-    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
+    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
+    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None

    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
    post_slack(msg)
--- a/stacks/k8s-version-upgrade/scripts/test_compat_gate.py
+++ b/stacks/k8s-version-upgrade/scripts/test_compat_gate.py
@@ -95,3 +95,121 @@ def test_running_minor_from_kubectl(monkeypatch):
    # oldest kubelet wins (mirrors the detector): node2 on 1.33 is the floor.
    monkeypatch.setattr(cg, "kget", lambda args: "v1.34.9\nv1.33.5\nv1.34.9")
    assert cg.running_minor() == (1, 33)
+
+
+# --- block classification: actionable / waiting-upstream / pinned ----------
+# A block is ACTIONABLE if a newer addon version in the matrix supports the
+# target (we can upgrade to clear it), WAITING if no released version supports
+# the target yet (only upstream can clear it), or PINNED if a version exists but
+# we deliberately hold the addon. Held (waiting|pinned) is quiet; actionable
+# alerts.
+KYVERNO_MATRIX = {
+    "addons": [{
+        "name": "kyverno",
+        "namespace": "kyverno",
+        "kind": "deployment",
+        "resource": "kyverno-admission-controller",
+        "image_re": r"kyverno:v(\d+\.\d+)",
+        "max_k8s": {"1.16": "1.34", "1.18": "1.35"},
+    }]
+}
+GPU_MATRIX = {
+    "addons": [{
+        "name": "gpu-operator",
+        "namespace": "nvidia",
+        "kind": "deployment",
+        "resource": "gpu-operator",
+        "image_re": r"gpu-operator:v(\d+\.\d+)",
+        "max_k8s": {"25.10": "1.35", "26.3": "1.36"},
+        "pinned": True,
+        "pin_reason": "needs newer NVIDIA driver + Ubuntu release",
+    }]
+}
+
+
+def test_actionable_when_higher_version_supports_target(monkeypatch):
+    # calico 3.30 (ceiling 1.35), target 1.36, matrix has 3.32 -> 1.36:
+    # upgrading calico WOULD clear it -> ACTIONABLE, with a remediation hint.
+    _img(monkeypatch, "quay.io/calico/node:v3.30.7")
+    reasons = cg.check_addons(CALICO_MATRIX, (1, 36), (1, 35))
+    assert len(reasons) == 1, reasons
+    assert reasons[0].startswith("[ACTIONABLE]"), reasons
+    assert "3.32" in reasons[0] and "calico" in reasons[0]
+
+
+def test_waiting_when_no_version_supports_target(monkeypatch):
+    # kyverno 1.18 is the matrix ceiling (k8s 1.35); target 1.36 has NO
+    # supporting version -> WAITING on upstream (nothing to upgrade to).
+    _img(monkeypatch, "kyverno/kyverno:v1.18.1")
+    reasons = cg.check_addons(KYVERNO_MATRIX, (1, 36), (1, 35))
+    assert len(reasons) == 1, reasons
+    assert reasons[0].startswith("[WAITING]"), reasons
+    assert "kyverno" in reasons[0]
+
+
+def test_pinned_addon_is_held_not_actionable(monkeypatch):
+    # gpu-operator 25.10, target 1.36; 26.3 supports 1.36 BUT the entry is
+    # pinned -> classified PINNED (held), never ACTIONABLE.
+    _img(monkeypatch, "nvcr.io/nvidia/gpu-operator:v25.10.0")
+    reasons = cg.check_addons(GPU_MATRIX, (1, 36), (1, 35))
+    assert len(reasons) == 1, reasons
+    assert reasons[0].startswith("[PINNED]"), reasons
+    assert "gpu-operator" in reasons[0]
+
+
+def test_unreadable_addon_tagged_actionable(monkeypatch):
+    # fail-safe block on an unreadable image is ACTIONABLE (a human must look).
+    _img(monkeypatch, "")
+    reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
+    assert reasons and reasons[0].startswith("[ACTIONABLE]"), reasons
+
+
+def test_existing_reasons_are_tagged(monkeypatch):
+    # the legacy "ceiling below target, newer version exists" case is ACTIONABLE.
+    _img(monkeypatch, "external-secrets/external-secrets:v0.12.1")
+    reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
+    assert reasons[0].startswith("[ACTIONABLE]"), reasons
+
+
+def test_held_reason_classifier():
+    assert cg.held_reason("[WAITING] x")
+    assert cg.held_reason("[PINNED] x")
+    assert not cg.held_reason("[ACTIONABLE] x")
+    assert not cg.held_reason("untagged")
+
+
+def test_exit_code_mapping():
+    assert cg.exit_code([]) == 0
+    assert cg.exit_code(["[ACTIONABLE] x"]) == 2
+    assert cg.exit_code(["[WAITING] x"]) == 4
+    assert cg.exit_code(["[PINNED] x"]) == 4
+    # held wins on a mix: an upstream/pinned wait can't be cleared by acting now
+    assert cg.exit_code(["[ACTIONABLE] x", "[WAITING] y"]) == 4
+
+
+def test_real_matrix_136_is_held(monkeypatch):
+    """Regression guard on the SHIPPED addon-compat.json: at today's running
+    versions a 1.36 jump must be HELD (exit 4) — calico ACTIONABLE (3.32 in the
+    matrix), ESO+kyverno WAITING (no 1.36 release), gpu-operator PINNED. Catches
+    a matrix edit that silently turns the quiet held state into a nightly alert."""
+    import json as _json
+    matrix = _json.loads((HERE / "addon-compat.json").read_text())
+    running_imgs = {
+        "calico-system": "quay.io/calico/node:v3.30.7",
+        "external-secrets": "ghcr.io/external-secrets/external-secrets:v2.6.0",
+        "kyverno": "ghcr.io/kyverno/kyverno:v1.18.1",
+        "nvidia": "nvcr.io/nvidia/gpu-operator:v25.10.0",
+    }
+
+    def fake_kget(args):
+        ns = args[args.index("-n") + 1] if "-n" in args else ""
+        return running_imgs.get(ns, "")
+
+    monkeypatch.setattr(cg, "kget", fake_kget)
+    reasons = cg.check_addons(matrix, (1, 36), (1, 35))
+    def pick(name):
+        # first reason line that mentions the addon name
+        return next(r for r in reasons if name in r)
+    assert pick("calico").startswith("[ACTIONABLE]"), reasons
+    assert pick("external-secrets").startswith("[WAITING]"), reasons
+    assert pick("kyverno").startswith("[WAITING]"), reasons
+    assert pick("gpu-operator").startswith("[PINNED]"), reasons
+    assert cg.exit_code(reasons) == 4  # held wins
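
The tagging contract exercised by the tests above can be sketched as follows. This is a minimal illustration inferred only from the assertions in `test_held_reason_classifier` and `test_exit_code_mapping`. It assumes reasons are plain strings prefixed with `[ACTIONABLE]`, `[WAITING]`, or `[PINNED]`, and that an untagged reason counts as actionable (the tests do not cover that case for `exit_code`). The function bodies are hypothetical and are not the shipped `compat_gate.py`.

```python
# Hypothetical sketch of the reason-tag contract; bodies inferred from the tests.
HELD_TAGS = ("[WAITING]", "[PINNED]")


def held_reason(reason):
    """True if the reason is a quiet hold (waiting on upstream or pinned)."""
    return reason.startswith(HELD_TAGS)


def exit_code(reasons):
    """Map a list of tagged reasons to the gate's exit code.

    0 = nothing blocks the upgrade
    2 = blocked, and every reason is something we can act on now
    4 = held: at least one reason can only clear by waiting, so it wins
    """
    if not reasons:
        return 0
    if any(held_reason(r) for r in reasons):
        return 4
    # Assumption: untagged reasons fall through here and count as actionable.
    return 2
```

Checking for a held reason before the actionable fallthrough is what makes "held wins on a mix" hold: one `[WAITING]` or `[PINNED]` entry is enough to keep the state quiet, even when other addons could be upgraded today.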
--- a/stacks/k8s-version-upgrade/scripts/test_nightly_report.py
+++ b/stacks/k8s-version-upgrade/scripts/test_nightly_report.py
@@ -79,3 +79,41 @@ def test_compose_includes_recent_jobs():
    jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
    assert "k8s-upgrade-preflight-1-35-6: Failed" in out
+
+
+# --- held (waiting-upstream / pinned) vs actionable-blocked rendering -------
+METRICS_HELD = f"""# TYPE k8s_upgrade_available gauge
+k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.35.6",target="1.36.2"}} 1
+k8s_upgrade_held{{instance="",job="k8s-version-upgrade"}} 1
+k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 0
+k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
+"""
+NODES_135 = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
+
+
+def test_compose_held_headline_and_grouped_reasons():
+    m = nr.parse_metrics(METRICS_HELD)
+    reasons = (
+        "[WAITING] addon kyverno v1.18 supports k8s <= 1.35; target 1.36 exceeds it — no released kyverno version supports k8s 1.36 yet\n"
+        "[PINNED] addon gpu-operator v25.10 supports k8s <= 1.35; target 1.36 exceeds it — pinned (driver/OS); holding\n"
+        "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
+    )
+    out = nr.compose_report(LAST_RUN + 30000, NODES_135, m, reasons, [])
+    # held headline, NOT a red actionable block
+    assert "⏸️ HELD" in out and "1.36.2" in out
+    assert "🔴 BLOCKED" not in out
+    # grouped by class
+    assert "Waiting on upstream" in out and "kyverno" in out
+    assert "Pinned" in out and "gpu-operator" in out
+    # the lone actionable piece is still listed so eventual scope is visible
+    assert "calico" in out
+    # tags are stripped from the rendered bullets (no raw "[WAITING]")
+    assert "[WAITING]" not in out
+
+
+def test_compose_blocked_groups_actionable():
+    m = nr.parse_metrics(METRICS_BLOCKED)  # blocked=1
+    reasons = "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
+    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
+    assert "🔴 BLOCKED" in out
+    assert "Action needed" in out and "calico" in out
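
The grouped rendering these two tests check could look roughly like the sketch below. It is a guess at the shape, not the `_render_reasons` helper from this commit. The name `render_reasons_sketch`, the exact header strings with trailing colons, the group order, and the handling of untagged lines as actionable are all assumptions. Only the substrings `Action needed`, `Waiting on upstream`, and `Pinned`, plus the tag stripping, come from the assertions.

```python
# Hypothetical sketch of grouping tagged reasons for the Slack message.
GROUPS = (
    ("[ACTIONABLE]", "Action needed:"),
    ("[WAITING]", "Waiting on upstream:"),
    ("[PINNED]", "Pinned:"),
)


def render_reasons_sketch(blocker_reasons):
    """Turn newline-separated tagged reasons into grouped bullet lines."""
    buckets = {tag: [] for tag, _ in GROUPS}
    for line in blocker_reasons.splitlines():
        line = line.strip()
        if not line:
            continue
        for tag, _ in GROUPS:
            if line.startswith(tag):
                # Strip the raw tag so "[WAITING]" never reaches the message.
                buckets[tag].append(line[len(tag):].strip())
                break
        else:
            # Assumption: an untagged line is shown under "Action needed".
            buckets["[ACTIONABLE]"].append(line)

    out = []
    for tag, title in GROUPS:
        if buckets[tag]:
            out.append(title)
            out.extend(f"  • {r}" for r in buckets[tag])
    return out
```

Empty groups are skipped, so a purely actionable block renders a single `Action needed:` section, while a held report still lists any actionable addon alongside the waiting and pinned ones.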