WIP: goldmane-edge-aggregator deploy stack + vault role + ghcr allowlist (infra #58)

NOT APPLIED. Staged for a fresh-session finish (see memory runbook).

Contains:
- stacks/goldmane-edge-aggregator/{main.tf,terragrunt.hcl}: namespace, TF-minted mTLS client cert from tigera-ca-private, goldmane_edges PG DB-init Job, db + slack ExternalSecrets, aggregate Deployment + digest CronJob.
- stacks/vault/main.tf: pg-goldmane-edges static rotation role (Tier-0).
- stacks/kyverno/.../ghcr-credentials.tf: ns added to the private-image allowlist.

KNOWN BLOCKER: the stack uses the hashicorp/tls provider (cert minting) but the root terragrunt.hcl generate "k8s_providers" block doesn't declare it, and a second required_providers (the removed versions.tf) is illegal. FIX = add tls to that global block (mirrors proxmox/kubectl).

Then apply order: db_init (creates goldmane_edges role) -> kyverno -> vault (Tier-0, plan-review) -> stack ExternalSecrets (targeted, first-apply) -> stack full -> verify mTLS to goldmane:7443.

Vault KV secret/goldmane-edge-aggregator already created.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 13:01:37 +00:00
162 changed files with 4565 additions and 12322 deletions
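One possible shape for the KNOWN BLOCKER fix from the commit message, sketched as a root `terragrunt.hcl` fragment. Only the block name `k8s_providers` and the `hashicorp/tls` source come from the message; the generated file name, the `if_exists` mode, and the version constraint are assumptions, and the block's existing provider entries (the message mentions proxmox and kubectl) are elided rather than guessed.

```hcl
# Hypothetical sketch of the root terragrunt.hcl generate block.
generate "k8s_providers" {
  path      = "k8s_providers.tf"     # assumed output file name
  if_exists = "overwrite_terragrunt" # assumed overwrite mode
  contents  = <<-EOF
    terraform {
      required_providers {
        # ... existing provider entries stay as they are ...
        tls = {
          source  = "hashicorp/tls"
          version = "~> 4.0" # assumed version constraint
        }
      }
    }
  EOF
}
```

Declaring the provider once in the generated block keeps each stack to a single `required_providers` block, which is the constraint the commit message calls out.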
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -16,7 +16,6 @@
 **ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.

 - **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
-- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
 - **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
 - **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
 - **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@@ -204,7 +203,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
 - **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
-- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
 - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
 - **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
@@ -219,7 +218,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
-| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
+| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
 | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@@ -232,10 +231,9 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
-- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
-- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
-- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
+- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.

 ## Security Posture (Wave 1 — locked 2026-05-18)

@@ -243,10 +241,9 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se

 - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
-- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
-- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).

 ## Storage & Backup Architecture
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,8 +13,6 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
-| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
-| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |

 ## Storage & Security (Tier: cluster)
@@ -39,7 +37,6 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
-| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -164,4 +161,3 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
-| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.1.0
-date: 2026-06-24
+version: 2.0.0
+date: 2026-02-07
 ---

 # Home Assistant Control
@@ -395,27 +395,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
+- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS
-- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
-- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/`
+- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/` (requires `sudo` for file access)
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

-### Dashboards (redesigned 2026-06-24)
-**Glossary** (HA terms — keep distinct):
-- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
-- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
-- **Card** = a widget inside a view.
-
-- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
-  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
-  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
-- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
-- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
-
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -437,15 +424,10 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
-Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
-- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
-- `sensor.classic_performance_remaining_range`: Range km
-- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
-- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
-- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
-- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
-- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
+#### 3. Cowboy E-Bike
+- `sensor.bike_state_of_charge`: Battery %
+- `sensor.bike_total_distance`: Total km
+- `sensor.bike_total_co2_saved`: CO2 saved (grams)

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@ -464,17 +446,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components (HACS integrations)
-- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
-- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
-
-### HACS frontend cards (plugins)
-- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
+### Custom Components
+- **cowboy**: Cowboy e-bike integration (HACS)
+- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
-- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
-- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -489,8 +466,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Platform (HAOS — ignore any legacy `docker run` snippet)
-ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
+### Docker Setup
+```bash
+docker run -d --name homeassistant --privileged \
+  -e TZ=Europe/London \
+  -v /home/pi/docker/homeAssistant:/config \
+  -v /run/dbus:/run/dbus:ro \
+  --network=host --restart=unless-stopped \
+  homeassistant/home-assistant:2025.9
+```

 ### SSH Access
 ```bash
--- a/.github/workflows/build-authentik.yml
+++ b/.github/workflows/build-authentik.yml
@ -1,39 +0,0 @@
-name: Build Custom Authentik Image
-
-# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
-# Thin SLOW-1a overlay over the official authentik server (narrows the login
-# identification stage's select_subclasses() to the login-capable source subtypes;
-# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
-# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
-# in modules/authentik/values.yaml together.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/authentik/Dockerfile'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/authentik
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
-            ghcr.io/viktorbarzin/authentik-server:latest
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@ -65,21 +65,6 @@ steps:
      # don't need explicit token propagation.
      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
    commands:
-      # ── Forge guard: apply ONLY on the canonical Forgejo forge ──
-      # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
-      # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
-      # guard both run `terragrunt apply` on every push and race each other for
-      # the per-stack PG state lock — the dominant cause of the "Error acquiring
-      # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
-      # registration keeps running the CRONS (drift-detection, renew-tls, …) — only
-      # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
-      # env var set) still applies, preserving prior behaviour.
-      - |
-        if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
-          echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
-          exit 0
-        fi
-
      # ── Skip CI commits ──
      - |
        if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -228,40 +213,23 @@ steps:
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
-            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
-            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
-            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
-            # (so the app-stack detector still excludes it) but skipped here.
-            # (2026-06-27 — see docs/architecture/ci-cd.md)
-            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
            echo "[$stack] Starting apply..."
-            ATTEMPT=0
-            while :; do
-              ATTEMPT=$((ATTEMPT + 1))
            set +e
            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
            EXIT=$?
            set -e
-              if [ $EXIT -eq 0 ]; then
-                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
+            if [ $EXIT -ne 0 ]; then
+              if echo "$OUTPUT" | grep -q "is locked by"; then
+                echo "[$stack] SKIPPED (locked by another session)"
+              else
+                echo "$OUTPUT" | tail -50
+                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
              fi
-              # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
-              # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
-              # ("Error acquiring the state lock" / "already locked"). The PG case
-              # was previously counted as a failure — the #1 source of false reds.
-              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
-                echo "[$stack] SKIPPED (locked by another session/run)"; break
+            else
+              echo "$OUTPUT" | tail -3
+              echo "[$stack] OK"
            fi
-              # Transient: provider-registry download timeout / Vault 5xx → bounded
-              # retry. Deliberately NOT helm atomic-timeouts or config errors
-              # (missing arg, invalid index) — those must fail fast, retry can't fix
-              # them and can worsen a stuck helm release.
-              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
-                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
-              fi
-              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
-              FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
-            done
          done < .platform_apply
        fi
        # Deferred until after app stacks so both lists get a chance to run.
@ -274,27 +242,22 @@ steps:
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
            echo "[$stack] Starting apply..."
-            ATTEMPT=0
-            while :; do
-              ATTEMPT=$((ATTEMPT + 1))
            set +e
            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
            EXIT=$?
            set -e
-              if [ $EXIT -eq 0 ]; then
-                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
+            if [ $EXIT -ne 0 ]; then
+              if echo "$OUTPUT" | grep -q "is locked by"; then
+                echo "[$stack] SKIPPED (locked by another session)"
+              else
+                echo "$OUTPUT" | tail -50
+                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
              fi
-              # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
-              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
-                echo "[$stack] SKIPPED (locked by another session/run)"; break
+            else
+              echo "$OUTPUT" | tail -3
+              echo "[$stack] OK"
            fi
-              # Transient provider-download / Vault 5xx → bounded retry (see platform loop).
-              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
-                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
-              fi
-              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
-              FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
-            done
          done < .app_apply
        fi
        # Fail the step loudly so the pipeline `default` workflow state
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@ -85,13 +85,6 @@ steps:
          stack=$(basename "$stack_dir")
          [ -f "$stack_dir/terragrunt.hcl" ] || continue

-          # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
-          # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
-          # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
-          # run. Skip it — drift on Tier-0 vault is caught at human apply time.
-          # (2026-06-27)
-          [ "$stack" = "vault" ] && continue
-
          echo -n "[$stack] planning... "
          OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
          EXIT=$?
--- a/AGENTS.md
+++ b/AGENTS.md
@ -273,11 +273,8 @@ To land a finished change from such a clone:
   Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
 4. Leave the clone on clean `master` so auto-refresh keeps working.
 5. Tell the user in plain language what happened. Stack changes are
-   auto-applied by CI on push — or, with apply access, applied locally yourself
-   (`scripts/tg apply`, from the main checkout, not a worktree); either path is
-   fine, but the change must always be committed here, never applied
-   uncommitted. Verify the live result with the user's read-only kubectl before
-   saying "it's live".
+   auto-applied by CI — verify the live result with the user's read-only
+   kubectl before saying "it's live".

 If a push to `master` is rejected by branch protection (user not on the
 whitelist — e.g. new users before Viktor grants it), fall back to a
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
 _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.

 **Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
 _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).

 ### Storage
--- a/cli/README.md
+++ b/cli/README.md
@ -202,69 +202,6 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove
 CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
 and `docs/adr/0013`.

-### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
-
-Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
-filters render to a single safe `SELECT` (namespace values validated to the k8s
-name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
-| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
-| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
-| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
-| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
-| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
-
-### v0.10 — `vault get --all` (browse every field)
-
-`vault get <name> --all` returns the **whole item** as a normalized JSON object,
-so an agent can discover and read fields the single-field `--field` allowlist
-can't reach — notably arbitrary **custom fields**.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
-
-Shape notes: present standard fields only (empty ones omitted); `fields` is a
-custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
-The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the
-only seed-derived path stays the specially-audited `vault code`. Like
-`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
-it (`homelab vault get <name> --all | jq`).
-
-### v0.10.1 — reads `bw sync` first (always fresh)
-
-Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
-sync` when opening its session, so it reflects the latest server-side values.
-`bw unlock` only decrypts the *local* cache, so without this a persisted
-(already-logged-in) session served stale data — a password changed in the web
-vault wouldn't show up until the next login. The sync is **best-effort**: a
-transient failure warns on stderr and falls back to the cached vault rather than
-failing the read.
-
-### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
-
-`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
-`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
-
-- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
-- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
-| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
-| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
-
-**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
-(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
-(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
-handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
-its own path). Access is whatever your policy grants. Writes are merge-only;
-`put` (replace) / `delete` are out of scope — use the raw `vault` CLI.
-
 ## Build / install

 Built from source to `/usr/local/bin/homelab` during devvm provisioning
--- a/cli/VERSION
+++ b/cli/VERSION
@ -1 +1 @@
-v0.11.0
+v0.8.1
--- a/cli/cmd_edges.go
+++ b/cli/cmd_edges.go
@ -1,69 +0,0 @@
-package main
-
-import "fmt"
-
-func edgesCommands() []Command {
-	return []Command{
-		{Path: []string{"edges"}, Tier: TierRead,
-			Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
-			Run:     edgesRun},
-	}
-}
-
-// edgesRun renders the filter flags to SQL and runs it read-only against the
-// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
-func edgesRun(args []string) error {
-	for _, a := range args {
-		if a == "-h" || a == "--help" {
-			fmt.Print(edgesUsage())
-			return nil
-		}
-	}
-	o, err := parseEdgesArgs(args)
-	if err != nil {
-		return fmt.Errorf("%w\n\n%s", err, edgesUsage())
-	}
-	sql, err := buildEdgesQuery(o)
-	if err != nil {
-		return err
-	}
-	// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
-	pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
-		"-o", "jsonpath={.items[0].metadata.name}")
-	if err != nil || pod == "" {
-		return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
-	}
-	exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
-	if o.asJSON {
-		exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
-	} else {
-		exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
-	}
-	return kubectlStream("dbaas", exec...)
-}
-
-func edgesUsage() string {
-	return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
-
-Usage: homelab edges [filters]
-
-Filters (AND-combined; namespace values are validated to the k8s name charset):
-  --ns NAME         edges touching NAME (either direction)
-  --src NAME        edges where source namespace = NAME
-  --dst NAME        edges where destination namespace = NAME
-  --peers-of NAME   distinct peer namespaces of NAME (both directions)
-  --new-since SPEC  first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
-  --denied          only denied (action='deny') edges — blocked / lateral-movement attempts
-  --json            output a JSON array (for agents/pipelines)
-  --limit N         cap rows (default 200)
-
-Examples:
-  homelab edges --ns immich                # everything immich talks to / is talked to by
-  homelab edges --peers-of authentik       # authentik's peer namespaces
-  homelab edges --src recruiter-responder  # that namespace's egress peers
-  homelab edges --new-since 24h            # edges first seen in the last day
-  homelab edges --denied --json            # blocked flows, machine-readable
-
-Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
-`
-}
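The removed `edges` verb relies on validating namespace values before rendering them into a single `SELECT`. A minimal sketch of that idea follows. It is not the removed `parseEdgesArgs`/`buildEdgesQuery` code (which this diff does not show): `validNamespace` and `sketchEdgesQuery` are hypothetical names, the table and column names are assumed from the `edge(src_ns,dst_ns,...)` schema quoted in CLAUDE.md, the `ORDER BY` is an assumption, and only the `--ns` and `--limit` filters are covered.

```go
package main

import (
	"fmt"
	"regexp"
)

// nsPattern approximates the Kubernetes namespace charset (an RFC 1123 label):
// lowercase alphanumerics and '-', starting and ending with an alphanumeric.
var nsPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace reports whether ns is a plausible namespace name and
// therefore contains no quote or other SQL metacharacter.
func validNamespace(ns string) bool {
	return len(ns) <= 63 && nsPattern.MatchString(ns)
}

// sketchEdgesQuery is a hypothetical stand-in for buildEdgesQuery. It renders
// the --ns filter (edges touching ns in either direction) plus a row cap.
// Table and column names are assumptions taken from the documented schema.
func sketchEdgesQuery(ns string, limit int) (string, error) {
	if !validNamespace(ns) {
		return "", fmt.Errorf("invalid namespace %q", ns)
	}
	if limit <= 0 {
		limit = 200 // the documented default row cap
	}
	return fmt.Sprintf(
		"SELECT src_ns, dst_ns, action, first_seen, last_seen, flow_count FROM edge "+
			"WHERE src_ns = '%s' OR dst_ns = '%s' ORDER BY last_seen DESC LIMIT %d",
		ns, ns, limit), nil
}

func main() {
	q, err := sketchEdgesQuery("immich", 0)
	fmt.Println(q, err)

	// A value outside the charset is rejected before any SQL is built.
	_, err = sketchEdgesQuery("x'; DROP TABLE edge; --", 0)
	fmt.Println(err)
}
```

Validating against a strict allowlist is what makes string interpolation tolerable here: the query is passed to `psql -c` through `kubectl exec`, so bind parameters are not readily available on that path.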
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@ -54,7 +54,10 @@ func printMemories(raw []byte, jsonOut bool) error {
 		return nil
 	}
 	for _, m := range r.Memories {
-		c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
+		c := strings.ReplaceAll(m.Content, "\n", " ")
+		if r := []rune(c); len(r) > 240 { // count runes, not bytes, so a multibyte UTF-8 character is never split
+			c = string(r[:240]) + "…"
+		}
 		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
 		if m.Tags != "" {
 			fmt.Printf("       tags: %s\n", m.Tags)
@ -63,21 +66,6 @@ func printMemories(raw []byte, jsonOut bool) error {
 	return nil
 }

-// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
-// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
-// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
-// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
-// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
-// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
-// hook error" for Cyrillic-language users.
-func truncatePreview(s string, maxRunes int) string {
-	r := []rune(s)
-	if len(r) <= maxRunes {
-		return s
-	}
-	return string(r[:maxRunes]) + "…"
-}
-
 func memoryRecall(args []string) error {
 	req := memRecallReq{}
 	jsonOut := false
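The comment on the removed `truncatePreview` explains why the preview must be cut by runes rather than bytes. The following self-contained sketch reproduces the difference; `truncateRunes` is an illustrative name that mirrors the removed helper, and the sample string is an assumption chosen because each Cyrillic letter occupies two bytes in UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes keeps at most maxRunes runes of s and appends "…" when it
// trims, so the result is always valid UTF-8.
func truncateRunes(s string, maxRunes int) string {
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return string(r[:maxRunes]) + "…"
}

func main() {
	s := "здравей" // 7 runes, 14 bytes

	// A byte slice of length 3 keeps "з" plus the lead byte of "д".
	byteCut := s[:3]
	fmt.Println(utf8.ValidString(byteCut)) // false

	runeCut := truncateRunes(s, 3)
	fmt.Println(utf8.ValidString(runeCut)) // true
	fmt.Println(runeCut)                   // здр…
}
```

The rune conversion allocates a copy of the string, which is negligible for a preview capped at 240 characters.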
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -4,7 +4,6 @@ import (
 	"bufio"
 	"encoding/base64"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"os/exec"
@ -16,60 +15,43 @@ import (
 // Identity is the kernel UID; per-user creds live in that user's isolated Vault
 // path (secret/workstation/claude-users/<user>) read via their scoped token, and
 // decryption is done by the official `bw` CLI. See
-// docs/runbooks/homelab-vault-onboarding.md.
+// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
 func vaultCommands() []Command {
-	cmds := []Command{
-		// Vaultwarden — your personal password manager (logins/passwords/TOTP).
+	return []Command{
 		{Path: []string{"vault", "setup"}, Tier: TierWrite,
-			Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
+			Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
 		{Path: []string{"vault", "status"}, Tier: TierRead,
-			Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
+			Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
 		{Path: []string{"vault", "list"}, Tier: TierRead,
-			Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
+			Summary: "list your item names: vault list [--search Q]", Run: vaultList},
 		{Path: []string{"vault", "get"}, Tier: TierRead,
-			Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
+			Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
 		{Path: []string{"vault", "search"}, Tier: TierRead,
-			Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
+			Summary: "search your item names: vault search <query>", Run: vaultSearch},
 		{Path: []string{"vault", "code"}, Tier: TierRead,
-			Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
+			Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
 		{Path: []string{"vault", "lock"}, Tier: TierWrite,
-			Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
+			Summary: "lock/log out the local bw session", Run: vaultLock},
 		{Path: []string{"vault"}, Tier: TierRead,
-			Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
+			Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
 			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
 	}
-	// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
-	return append(cmds, vaultKVCommands()...)
 }

-// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
-// between the two unrelated "vaults" this command fronts, because the name
-// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
-// infra secrets store).
+// vaultHelp is shown for bare `homelab vault`.
 func vaultHelp() string {
-	return `homelab vault — two different secret stores under one command:
+	return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)

-  • Vaultwarden               your personal PASSWORD MANAGER (logins / passwords / TOTP)
-  • HashiCorp Vault / OpenBao  homelab INFRA secrets (the secret/… KV store)  → 'vault kv …'
-
-── Vaultwarden  (reads YOUR OWN vault; no-HITL after one-time setup) ──
  homelab vault setup             one-time: store your master password + API key in your Vault path
  homelab vault status            configured / unlocked / reachable (no secrets)
  homelab vault list [--search Q] list your item names (no secrets)
  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
                                  TTY → clipboard (auto-clears); piped → stdout
-  homelab vault get <name> --all  all fields (incl. custom) as JSON; piped only.
-                                  TOTP shown as presence flag — use 'vault code' for a code.
  homelab vault code <name>       current TOTP code
  homelab vault lock              lock / log out the local bw session

-── HashiCorp Vault / OpenBao  (infra secrets; uses your own OIDC vault token) ──
-  homelab vault kv get <path> [--field K]   read an infra KV secret
-  homelab vault kv list <path>              list sub-paths
-  homelab vault kv put <path> <key>         write one key (value via stdin)
-
-Vaultwarden creds live only in your own Vault path; the admin never sees them.
-Security model: docs/runbooks/homelab-vault-onboarding.md
+Creds live only in your own Vault path; the admin never sees them. Identity is
+your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
 (note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
 `
 }
@ -97,33 +79,7 @@ func realRunner(name string, argv, envv []string) (string, error) {
 	out, err := cmd.Output()
 	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
 	// fetched secret with significant leading/trailing spaces is preserved.
-	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
-}
-
-// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
-// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
-// write the actionable message there — "connection refused", "permission
-// denied" — which the caller would otherwise never see behind a bare
-// "exit status N".
-func exitStderr(err error) []byte {
-	var ee *exec.ExitError
-	if errors.As(err, &ee) {
-		return ee.Stderr
-	}
-	return nil
-}
-
-// augmentErr appends captured stderr to an error so failures are diagnosable
-// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
-// when there's no stderr; preserves the wrapped error for errors.Is/As.
-func augmentErr(err error, stderr []byte) error {
-	if err == nil {
-		return nil
-	}
-	if s := strings.TrimSpace(string(stderr)); s != "" {
-		return fmt.Errorf("%w: %s", err, s)
-	}
-	return err
+	return strings.TrimRight(string(out), "\r\n"), err
 }

 // realRunnerStdin runs a command feeding `stdin` to it, for secret values that
@ -136,7 +92,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er
 	}
 	cmd.Stdin = strings.NewReader(stdin)
 	out, err := cmd.Output()
-	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
+	return strings.TrimRight(string(out), "\r\n"), err
 }

 func vwCredsPath(user string) string { return vwUserPathPrefix + user }
@@ -172,89 +128,6 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
 var vaultCurrentUser = func() string { return os.Getenv("USER") }
 var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }

-// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
-// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
-func scopedTokenPath(home string) string {
-	return home + "/.config/claude-auth-sync/vault-token"
-}
-
-// vaultTokenSource decides which Vault token the `vault` child processes should
-// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
-// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
-// (policy workstation-claude-<user>, which grants exactly the create/read/update
-// this tool needs on the user's own path), then a native ~/.vault-token.
-//
-// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
-// caller's own secret/workstation/claude-users/<user> path, and a power-user who
-// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
-// capability on that path is `deny` — letting it win shadows the scoped token
-// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
-// right credential when there is no scoped token (admins). Returns the token to
-// export — "" when the vault CLI should read the ambient/native credential —
-// plus a source tag for tests/logging.
-func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
-	switch {
-	case envToken != "":
-		return "", "env"
-	case strings.TrimSpace(scopedToken) != "":
-		return strings.TrimSpace(scopedToken), "scoped"
-	case haveVaultTokenFile:
-		return "", "file"
-	default:
-		return "", "none"
-	}
-}
-
-// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
-// is likewise hardcoded (openSession), so a sane default here is consistent.
-const vaultAddrDefault = "https://vault.viktorbarzin.me"
-
-// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
-// doesn't already set one, else "". homelab vault is invoked by AFK agent
-// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
-// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
-// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
-// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
-func vaultAddrToSet(envAddr string) string {
-	if strings.TrimSpace(envAddr) == "" {
-		return vaultAddrDefault
-	}
-	return ""
-}
-
-// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
-// child processes reach the cluster Vault regardless of the caller's shell. An
-// explicit VAULT_ADDR (admins, CI) is left untouched.
-func ensureVaultAddr() {
-	if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
-		os.Setenv("VAULT_ADDR", a)
-	}
-}
-
-// fileNonEmpty reports whether path exists and has content.
-func fileNonEmpty(path string) bool {
-	fi, err := os.Stat(path)
-	return err == nil && fi.Size() > 0
-}
-
-// ensureVaultToken wires vaultTokenSource to the real environment: when the user
-// has no ambient Vault credential, it exports the claude-auth-sync scoped token
-// so the `vault` child processes authenticate as workstation-claude-<user>. It
-// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
-// take precedence and are left untouched.
-func ensureVaultToken() {
-	// Every vault verb funnels through here, so this is the one place that also
-	// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
-	// assumed from the caller's shell).
-	ensureVaultAddr()
-	home := os.Getenv("HOME")
-	scoped, _ := os.ReadFile(scopedTokenPath(home))
-	tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
-	if src == "scoped" {
-		os.Setenv("VAULT_TOKEN", tok)
-	}
-}
-
 // bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
 // do NOT inherit the full parent env (keeps stray secrets out of the child).
 func bwBaseEnv(appdata string) []string {
@@ -287,9 +160,7 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string {
 func bwLoginArgs() []string  { return []string{"login", "--apikey"} }
 func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
 func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
-func bwItemArgs(name string) []string       { return []string{"get", "item", name} }
 func bwStatusArgs() []string { return []string{"status"} }
-func bwSyncArgs() []string                  { return []string{"sync"} }

 // bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
 // required. Unparseable/empty output → true (safer to attempt login).
@@ -456,23 +327,13 @@ func openSession(run cmdRunner, user, uid string) (session, error) {
 	if err != nil {
 		return session{}, err
 	}
-	sessEnv := bwSecretEnv(appdata, creds, sess)
-	// Pull the latest server-side state so reads reflect current values. `bw
-	// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
-	// session would otherwise serve stale data until the next login. Best-effort:
-	// a transient sync failure must not break a read — fall back to the cached
-	// vault and warn (status reports reachability separately).
-	if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
-		fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
-	}
-	return session{env: sessEnv}, nil
+	return session{env: bwSecretEnv(appdata, creds, sess)}, nil
 }

 type getOpts struct {
 	name  string
 	field string
 	json  bool
-	all   bool // dump every field (incl. custom) as normalized JSON
 }

 var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
@@ -484,8 +345,6 @@ func parseGetArgs(args []string) (getOpts, error) {
 		switch {
 		case a == "--json":
 			o.json = true
-		case a == "--all":
-			o.all = true
 		case a == "--field" && i+1 < len(args):
 			o.field = args[i+1]
 			i++
@@ -496,10 +355,9 @@ func parseGetArgs(args []string) (getOpts, error) {
 		}
 	}
 	if o.name == "" {
-		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
+		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
 	}
-	// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
-	if !o.all && !validGetFields[o.field] {
+	if !validGetFields[o.field] {
 		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
 	}
 	return o, nil
@@ -515,81 +373,6 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
 	return bwGet(run, s.env, o.field, o.name)
 }

-// getItem opens a session and returns the whole item as raw `bw get item` JSON.
-// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
-func getItem(run cmdRunner, user, uid, name string) (string, error) {
-	s, err := openSession(run, user, uid)
-	if err != nil {
-		return "", err
-	}
-	return run("bw", bwItemArgs(name), s.env)
-}
-
-// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
-// standard login fields that are present, notes, and a flat map of custom field
-// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
-// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
-// stays the specially-audited `vault code` (see the design §10/§16).
-type normalizedItem struct {
-	Name     string            `json:"name"`
-	Username string            `json:"username,omitempty"`
-	Password string            `json:"password,omitempty"`
-	URIs     []string          `json:"uris,omitempty"`
-	TOTP     bool              `json:"totp,omitempty"` // presence only, never the seed
-	Notes    string            `json:"notes,omitempty"`
-	Fields   map[string]string `json:"fields,omitempty"` // custom field name→value
-}
-
-// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
-// references another field and carries a null value, so it is not real data.
-const bwFieldLinked = 3
-
-// normalizeItem parses a `bw get item` payload into the browse projection. It is
-// pure (no I/O), so it is the unit-tested heart of `get --all`.
-func normalizeItem(raw string) (normalizedItem, error) {
-	var it struct {
-		Name  string `json:"name"`
-		Notes string `json:"notes"`
-		Login *struct {
-			Username string `json:"username"`
-			Password string `json:"password"`
-			Totp     string `json:"totp"`
-			URIs     []struct {
-				URI string `json:"uri"`
-			} `json:"uris"`
-		} `json:"login"`
-		Fields []struct {
-			Name  string `json:"name"`
-			Value string `json:"value"`
-			Type  int    `json:"type"`
-		} `json:"fields"`
-	}
-	if err := json.Unmarshal([]byte(raw), &it); err != nil {
-		return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
-	}
-	n := normalizedItem{Name: it.Name, Notes: it.Notes}
-	if it.Login != nil {
-		n.Username = it.Login.Username
-		n.Password = it.Login.Password
-		n.TOTP = it.Login.Totp != ""
-		for _, u := range it.Login.URIs {
-			if u.URI != "" {
-				n.URIs = append(n.URIs, u.URI)
-			}
-		}
-	}
-	for _, f := range it.Fields {
-		if f.Type == bwFieldLinked {
-			continue // references another field, no value of its own
-		}
-		if n.Fields == nil {
-			n.Fields = map[string]string{}
-		}
-		n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
-	}
-	return n, nil
-}
-
 // clipboardDecision picks how to return a secret value. "stdout" prints it (a
 // pipe/agent — the intended machine path); "clipboard" copies via OSC52;
 // "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
@@ -660,7 +443,6 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {

 func vaultList(args []string) error {
 	hardenProcess()
-	ensureVaultToken()
 	search := ""
 	for i := 0; i < len(args); i++ {
 		if args[i] == "--search" && i+1 < len(args) {
@@ -695,7 +477,6 @@ func vaultSearch(args []string) error {

 func vaultCode(args []string) error {
 	hardenProcess()
-	ensureVaultToken()
 	if len(args) == 0 {
 		return fmt.Errorf("usage: homelab vault code <name>")
 	}
@@ -727,9 +508,7 @@ func statusSummary(run cmdRunner, user, uid string) string {
 	if err != nil {
 		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
 	}
-	// openSession already did a best-effort sync; status re-runs it explicitly so
-	// a reachability failure surfaces in this report rather than only on stderr.
-	if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
+	if _, err := run("bw", []string{"sync"}, s.env); err != nil {
 		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
 	}
 	return "vault: configured, unlocked, reachable ✓"
@@ -737,7 +516,6 @@ func statusSummary(run cmdRunner, user, uid string) string {

 func vaultStatus(args []string) error {
 	hardenProcess()
-	ensureVaultToken()
 	uid := vaultCurrentUID()
 	unlock, err := withUserLock(uid)
 	if err != nil {
@@ -764,61 +542,32 @@ func vaultLock(args []string) error {
 	return nil // lock/logout best-effort; never error the caller
 }

-// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
-// (read-modify-write: needs only read+update, NOT the `patch` capability the
-// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
-// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
-// (creates the path on first use, before any sibling keys exist).
-func kvWriteVerb(merge bool) []string {
-	if merge {
-		return []string{"kv", "patch", "-method=rw"}
-	}
-	return []string{"kv", "put"}
-}
-
-// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
+// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
 // email nor the API client_id is a usable credential on its own.
-func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
-	return append(kvWriteVerb(merge), vwCredsPath(user),
-		"vaultwarden_email="+email,
-		"vaultwarden_client_id="+clientID,
-	)
+func vaultPatchPublicArgs(user, email, clientID string) []string {
+	return []string{"kv", "patch", vwCredsPath(user),
+		"vaultwarden_email=" + email,
+		"vaultwarden_client_id=" + clientID,
+	}
 }

-// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
-// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
-// realRunnerStdin.
-func vaultWriteSecretArgs(merge bool, user, key string) []string {
-	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
+// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
+// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
+// on stdin by realRunnerStdin.
+func vaultPatchSecretArgs(user, key string) []string {
+	return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
 }

-// credsPathExists reports whether the user's KV path already holds data. Used to
-// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
-// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
-// user could run `homelab vault setup` before that ever happens.
-func credsPathExists(run cmdRunner, user string) bool {
-	_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
-	return err == nil
-}
-
-// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
-type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
-
-// writeCreds stores all four fields in the user's Vault path using only the
-// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
-// first (public) write creates the path when absent; the two real secrets then
-// merge in via read-modify-write so the public keys — and any claude-auth-sync
-// keys already present — survive. Secret values travel on stdin, never argv.
-func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
-	merge := credsPathExists(run, user)
-	if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
+// writeCreds stores all four fields in the user's Vault path. The two real
+// secrets (master password, API client_secret) go via stdin — never argv.
+func writeCreds(user string, c vwCreds) error {
+	if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
 		return err
 	}
-	// The path now exists regardless of the branch above → merge the secrets in.
-	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
 		return err
 	}
-	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
 		return err
 	}
 	return nil
@@ -844,7 +593,6 @@ func promptLine(prompt string) (string, error) {

 func vaultSetup(args []string) error {
 	hardenProcess()
-	ensureVaultToken()
 	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
 	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
 	email, err := promptLine("Vaultwarden email: ")
@@ -867,7 +615,7 @@ func vaultSetup(args []string) error {
 		return fmt.Errorf("all fields are required")
 	}
 	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
-	if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
+	if err := writeCreds(vaultCurrentUser(), c); err != nil {
 		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
 	}
 	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
@@ -886,7 +634,6 @@ func vaultSetup(args []string) error {

 func vaultGet(args []string) error {
 	hardenProcess()
-	ensureVaultToken()
 	o, err := parseGetArgs(args)
 	if err != nil {
 		return err
@@ -898,9 +645,6 @@ func vaultGet(args []string) error {
 	}
 	defer unlock()
 	user := vaultCurrentUser()
-	if o.all {
-		return getAllFields(user, uid, o.name)
-	}
 	val, err := getValue(realRunner, user, uid, o)
 	if err != nil {
 		return err
@@ -917,28 +661,3 @@ func vaultGet(args []string) error {
 	return nil
 }

-// getAllFields prints every field of one item as normalized JSON. Like
-// `get --json`, the payload is all secret values, so it refuses a terminal
-// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
-// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
-// distinguishable from a single-field get (the item name is still never logged).
-func getAllFields(user, uid, name string) error {
-	if !jsonToStdoutOK(stdoutIsTTY()) {
-		return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
-	}
-	raw, err := getItem(realRunner, user, uid, name)
-	if err != nil {
-		return err
-	}
-	item, err := normalizeItem(raw)
-	if err != nil {
-		return err
-	}
-	out, err := json.Marshal(item)
-	if err != nil {
-		return err
-	}
-	writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
-	fmt.Println(string(out))
-	return nil
-}
--- a/cli/cmd_vault_kv.go
+++ b/cli/cmd_vault_kv.go
@@ -1,248 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"io"
-	"os"
-	"strings"
-)
-
-// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
-// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
-// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
-// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
-// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
-//
-// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
-// token (bound only to secret/workstation/claude-users/<user>). A general kv read
-// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
-// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
-// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
-// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
-// injects the scoped token). Access is then whatever the caller's policy grants.
-func vaultKVCommands() []Command {
-	return []Command{
-		{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
-		{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
-		{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
-			Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
-		{Path: []string{"vault", "kv"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
-			Run:     func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
-	}
-}
-
-func vaultKVHelp() string {
-	return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store)
-
-  homelab vault kv get <path> [--field K]   read a secret
-                                  --field K  → one value (TTY → clipboard; piped → stdout)
-                                  no --field → all fields as JSON (piped only)
-  homelab vault kv list <path>    list sub-paths under <path> (no values)
-  homelab vault kv put <path> <key>   write one key; value read from stdin
-                                  (piped, or no-echo prompt); merges — never clobbers siblings
-
-Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is
-whatever your policy grants. This is NOT Vaultwarden — for your personal logins
-use 'homelab vault get' (see 'homelab vault').
-`
-}
-
-// --- arg builders (pure; values never travel via argv) --------------------
-
-func vaultKVGetFieldArgs(path, field string) []string {
-	return []string{"kv", "get", "-field=" + field, path}
-}
-func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
-func vaultKVListArgs(path string) []string    { return []string{"kv", "list", "-format=json", path} }
-
-// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
-// (read-modify-write: merges, needs only read+update — not the `patch` capability
-// — and preserves sibling keys); merge=false → `kv put` (creates the path on
-// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
-// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
-func vaultKVPutArgs(merge bool, path, key string) []string {
-	return append(kvWriteVerb(merge), path, key+"=-")
-}
-
-// --- pure parsers ----------------------------------------------------------
-
-// extractKVData returns the inner secret object from a `vault kv get -format=json`
-// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
-// wrapper so only the secret's own key→value data is emitted.
-func extractKVData(jsonOut string) (string, error) {
-	var env struct {
-		Data struct {
-			Data json.RawMessage `json:"data"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
-		return "", fmt.Errorf("parse vault kv json: %w", err)
-	}
-	if len(env.Data.Data) == 0 {
-		return "", fmt.Errorf("no secret data at that path")
-	}
-	return string(env.Data.Data), nil
-}
-
-// parseKVList parses the JSON array `vault kv list -format=json` prints.
-func parseKVList(jsonOut string) ([]string, error) {
-	var keys []string
-	if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
-		return nil, fmt.Errorf("parse vault kv list json: %w", err)
-	}
-	return keys, nil
-}
-
-// --- testable cores (injected cmdRunner) -----------------------------------
-
-func kvGetField(run cmdRunner, path, field string) (string, error) {
-	return run("vault", vaultKVGetFieldArgs(path, field), nil)
-}
-
-func kvGetJSON(run cmdRunner, path string) (string, error) {
-	out, err := run("vault", vaultKVGetJSONArgs(path), nil)
-	if err != nil {
-		return "", err
-	}
-	return extractKVData(out)
-}
-
-func kvList(run cmdRunner, path string) ([]string, error) {
-	out, err := run("vault", vaultKVListArgs(path), nil)
-	if err != nil {
-		return nil, err
-	}
-	return parseKVList(out)
-}
-
-// kvPathExists reports whether the KV path already holds data, to pick create
-// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
-// sibling keys on an existing path.
-func kvPathExists(run cmdRunner, path string) bool {
-	_, err := run("vault", vaultKVGetJSONArgs(path), nil)
-	return err == nil
-}
-
-// kvPut writes one key, creating the path when absent and merging when present.
-// The value travels on stdin only (never argv).
-func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
-	merge := kvPathExists(run, path)
-	_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
-	return err
-}
-
-// --- handlers --------------------------------------------------------------
-
-func vaultKVGet(args []string) error {
-	hardenProcess()
-	ensureVaultAddr() // own token, NOT the scoped one (see file header)
-	var path, field string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--field" && i+1 < len(args):
-			field = args[i+1]
-			i++
-		case strings.HasPrefix(a, "--field="):
-			field = strings.TrimPrefix(a, "--field=")
-		case !strings.HasPrefix(a, "-") && path == "":
-			path = a
-		}
-	}
-	if path == "" {
-		return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
-	}
-	if field != "" {
-		val, err := kvGetField(realRunner, path, field)
-		if err != nil {
-			return err
-		}
-		emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
-		return nil
-	}
-	// No --field → the whole secret. All values, so refuse a bare TTY (like
-	// `vault get --json`): pick a --field for the clipboard path, or pipe it.
-	if !jsonToStdoutOK(stdoutIsTTY()) {
-		return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
-	}
-	out, err := kvGetJSON(realRunner, path)
-	if err != nil {
-		return err
-	}
-	fmt.Println(out)
-	return nil
-}
-
-func vaultKVList(args []string) error {
-	ensureVaultAddr()
-	var path string
-	for _, a := range args {
-		if !strings.HasPrefix(a, "-") {
-			path = a
-			break
-		}
-	}
-	if path == "" {
-		return fmt.Errorf("usage: homelab vault kv list <path>")
-	}
-	keys, err := kvList(realRunner, path)
-	if err != nil {
-		return err
-	}
-	for _, k := range keys {
-		fmt.Println(k)
-	}
-	return nil
-}
-
-func vaultKVPut(args []string) error {
-	hardenProcess()
-	ensureVaultAddr()
-	var path, key string
-	for _, a := range args {
-		if strings.HasPrefix(a, "-") {
-			continue
-		}
-		switch {
-		case path == "":
-			path = a
-		case key == "":
-			key = a
-		}
-	}
-	if path == "" || key == "" {
-		return fmt.Errorf("usage: homelab vault kv put <path> <key>   (value read from stdin)")
-	}
-	value, err := readSecretValue("Value for " + key + ": ")
-	if err != nil {
-		return err
-	}
-	if value == "" {
-		return fmt.Errorf("empty value; aborting (nothing written)")
-	}
-	if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
-		return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
-	}
-	fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
-	return nil
-}
-
-// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
-// is read verbatim (trailing newline trimmed, internal newlines preserved so
-// multi-line values like PEM keys survive); an interactive TTY is prompted
-// without echo.
-func readSecretValue(prompt string) (string, error) {
-	fi, err := os.Stdin.Stat()
-	if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
-		b, rerr := io.ReadAll(os.Stdin)
-		if rerr != nil {
-			return "", rerr
-		}
-		return strings.TrimRight(string(b), "\r\n"), nil
-	}
-	return promptNoEcho(prompt)
-}
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
@@ -2,8 +2,6 @@ package main

 import (
 	"encoding/base64"
-	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"reflect"
@@ -235,181 +233,12 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
 	}
 }

-func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
-	dir := t.TempDir()
-	cfg := dir + "/.config/claude-auth-sync"
-	if err := os.MkdirAll(cfg, 0o700); err != nil {
-		t.Fatal(err)
-	}
-	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
-		t.Fatal(err)
-	}
-	t.Setenv("HOME", dir)
-	t.Setenv("VAULT_TOKEN", "") // no ambient token
-
-	ensureVaultToken()
-	if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
-		t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
-	}
-}
-
-func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
-	dir := t.TempDir()
-	cfg := dir + "/.config/claude-auth-sync"
-	if err := os.MkdirAll(cfg, 0o700); err != nil {
-		t.Fatal(err)
-	}
-	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
-		t.Fatal(err)
-	}
-	t.Setenv("HOME", dir)
-	t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
-
-	ensureVaultToken()
-	if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
-		t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
-	}
-}
-
-func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) {
-	// Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the
-	// purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28).
-	dir := t.TempDir()
-	cfg := dir + "/.config/claude-auth-sync"
-	if err := os.MkdirAll(cfg, 0o700); err != nil {
-		t.Fatal(err)
-	}
-	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
-		t.Fatal(err)
-	}
-	if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil {
-		t.Fatal(err)
-	}
-	t.Setenv("HOME", dir)
-	t.Setenv("VAULT_TOKEN", "")
-
-	ensureVaultToken()
-	if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
-		t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got)
-	}
-}
-
-func TestScopedTokenPath(t *testing.T) {
-	if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
-		t.Fatalf("scopedTokenPath = %q", got)
-	}
-}
-
-func TestVaultTokenSource(t *testing.T) {
-	// Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped
-	// token > a native ~/.vault-token. Scoped beats the file so a power-user's
-	// read-only OIDC ~/.vault-token can't shadow the scoped token on the user's
-	// own path (emo, 2026-06-28).
-	cases := []struct {
-		name             string
-		env              string
-		haveVaultToken   bool
-		scoped           string
-		wantTok, wantSrc string
-	}{
-		{"explicit env wins", "abc", true, "S", "", "env"},
-		{"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"},
-		{"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"},
-		{"native ~/.vault-token only when no scoped", "", true, "", "", "file"},
-		{"scoped value is trimmed", "", false, "  S-TOK\n", "S-TOK", "scoped"},
-		{"whitespace-only scoped falls back to file", "", true, "  \n", "", "file"},
-		{"nothing configured", "", false, "", "", "none"},
-	}
-	for _, c := range cases {
-		tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
-		if tok != c.wantTok || src != c.wantSrc {
-			t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
-				c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
-		}
-	}
-}
-
-func TestVaultAddrToSet(t *testing.T) {
-	// homelab vault is invoked by AFK agent sessions (non-login shells that
-	// never sourced /etc/environment), so the CLI must self-default VAULT_ADDR
-	// rather than rely on the ambient env — else every `vault` child hits the
-	// 127.0.0.1:8200 default and fails "connection refused" (exit 2).
-	cases := []struct {
-		name, env, want string
-	}{
-		{"unset -> default", "", vaultAddrDefault},
-		{"whitespace-only -> default", "  \n", vaultAddrDefault},
-		{"explicit kept (empty = leave alone)", "https://vault.example.com", ""},
-	}
-	for _, c := range cases {
-		if got := vaultAddrToSet(c.env); got != c.want {
-			t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want)
-		}
-	}
-}
-
-func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) {
-	dir := t.TempDir() // no scoped token, no ~/.vault-token
-	t.Setenv("HOME", dir)
-	t.Setenv("VAULT_TOKEN", "")
-	t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation
-
-	ensureVaultToken()
-	if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault {
-		t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault)
-	}
-}
-
-func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) {
-	dir := t.TempDir()
-	t.Setenv("HOME", dir)
-	t.Setenv("VAULT_TOKEN", "")
-	t.Setenv("VAULT_ADDR", "https://vault.example.com")
-
-	ensureVaultToken()
-	if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" {
-		t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got)
-	}
-}
-
-func TestAugmentErrSurfacesStderr(t *testing.T) {
-	if got := augmentErr(nil, []byte("ignored")); got != nil {
-		t.Fatalf("augmentErr(nil, …) = %v, want nil", got)
-	}
-	base := errors.New("exit status 2")
-	got := augmentErr(base, []byte("  dial tcp 127.0.0.1:8200: connect: connection refused\n"))
-	if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") {
-		t.Fatalf("augmentErr did not surface stderr: %v", got)
-	}
-	if !errors.Is(got, base) {
-		t.Fatal("augmentErr lost the wrapped error (errors.Is failed)")
-	}
-	if got := augmentErr(base, []byte("   ")); got != base {
-		t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got)
-	}
-}
-
-func TestKvWriteVerb(t *testing.T) {
-	// merge=true → read-modify-write patch (needs only read+update, NOT the
-	// `patch` capability the scoped workstation policy lacks).
-	if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
-		t.Fatalf("kvWriteVerb(true) = %v", got)
-	}
-	// merge=false → put (creates the path on first use)
-	if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
-		t.Fatalf("kvWriteVerb(false) = %v", got)
-	}
-}
-
-func TestVaultWritePublicArgs(t *testing.T) {
-	got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
-	want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
+func TestVaultPatchPublicArgs(t *testing.T) {
+	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
+	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
 		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
 	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
-	}
-	if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
-		t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
+		t.Fatalf("vaultPatchPublicArgs = %v", got)
 	}
 	for _, a := range got {
 		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
@@ -418,12 +247,12 @@ func TestVaultWritePublicArgs(t *testing.T) {
 	}
 }

-func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
+func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
 	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
-		got := vaultWriteSecretArgs(true, "emo", key)
-		want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
+		got := vaultPatchSecretArgs("emo", key)
+		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
 		if !reflect.DeepEqual(got, want) {
-			t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
+			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
 		}
 		if got[len(got)-1] != key+"=-" {
 			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
@@ -431,90 +260,6 @@ func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
 	}
 }

-// recStdin records a stdin-bearing call for assertions.
-type recStdin struct {
-	argv  []string
-	stdin string
-}
-
-// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public)
-// write must `kv put` (create), and the two secrets must merge via patch -rw
-// with values on stdin only — never the buggy plain `kv patch` (needs `patch`).
-func TestWriteCredsCreatesThenMerges(t *testing.T) {
-	var calls [][]string
-	var stdinCalls []recStdin
-	run := func(name string, argv, envv []string) (string, error) {
-		calls = append(calls, append([]string{name}, argv...))
-		if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
-			return "", fmt.Errorf("no value found") // path absent
-		}
-		return "", nil
-	}
-	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
-		stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
-		return "", nil
-	}
-	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
-	if err := writeCreds(run, runStdin, "emo", c); err != nil {
-		t.Fatalf("writeCreds: %v", err)
-	}
-	var sawPut, sawPlainPatch bool
-	for _, cl := range calls {
-		j := strings.Join(cl, " ")
-		if strings.Contains(j, "kv put") {
-			sawPut = true
-		}
-		if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
-			sawPlainPatch = true
-		}
-	}
-	if !sawPut {
-		t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
-	}
-	if sawPlainPatch {
-		t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
-	}
-	if len(stdinCalls) != 2 {
-		t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
-	}
-	for _, sc := range stdinCalls {
-		if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
-			t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
-		}
-		for _, a := range sc.argv {
-			if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
-				t.Errorf("secret leaked into argv: %v", sc.argv)
-			}
-		}
-	}
-	if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
-		t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
-	}
-}
-
-// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
-// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
-func TestWriteCredsMergesWhenPresent(t *testing.T) {
-	var calls [][]string
-	run := func(name string, argv, envv []string) (string, error) {
-		calls = append(calls, append([]string{name}, argv...))
-		return "{}", nil // get succeeds → path exists
-	}
-	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
-		calls = append(calls, append([]string{name}, argv...))
-		return "", nil
-	}
-	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
-	if err := writeCreds(run, runStdin, "emo", c); err != nil {
-		t.Fatalf("writeCreds: %v", err)
-	}
-	for _, cl := range calls {
-		if strings.Contains(strings.Join(cl, " "), "kv put") {
-			t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
-		}
-	}
-}
-
 // TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
 // whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
 // value may appear in any command's argv — secrets travel via env/stdin only.
@@ -621,437 +366,3 @@ func TestGetValueFlow(t *testing.T) {
 		t.Fatalf("getValue = %q, %v", val, err)
 	}
 }
-
-// --- vault get --all (browse all fields) ----------------------------------
-
-func TestParseGetArgsAll(t *testing.T) {
-	o, err := parseGetArgs([]string{"github", "--all"})
-	if err != nil || o.name != "github" || !o.all {
-		t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err)
-	}
-	// --all must skip --field validation (field is irrelevant for a full dump).
-	if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil {
-		t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err)
-	}
-	// A name is still required.
-	if _, err := parseGetArgs([]string{"--all"}); err == nil {
-		t.Fatal("get --all with no name must error")
-	}
-	// Without --all, the field allowlist still applies.
-	if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil {
-		t.Fatal("invalid --field without --all must still error")
-	}
-}
-
-func TestBwItemArgs(t *testing.T) {
-	argv := bwItemArgs("github")
-	if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) {
-		t.Fatalf("bwItemArgs = %v", argv)
-	}
-	for _, a := range argv {
-		if strings.Contains(a, "SESSION") || a == "--session" {
-			t.Fatalf("session must travel via env, not argv: %v", argv)
-		}
-	}
-}
-
-// a representative `bw get item` payload: login fields, multiple URIs, a TOTP
-// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST
-// be dropped (id/object/reprompt/passwordHistory).
-const sampleLoginItemJSON = `{
-  "object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0,
-  "name":"GitHub","notes":"my notes","favorite":false,
-  "fields":[
-    {"name":"PIN","value":"1234","type":1},
-    {"name":"endpoint","value":"https://api.gh","type":0},
-    {"name":"enabled","value":"true","type":2}
-  ],
-  "login":{
-    "username":"octocat","password":"hunter2",
-    "totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED",
-    "uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}]
-  },
-  "passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}]
-}`
-
-func TestNormalizeItemLogin(t *testing.T) {
-	n, err := normalizeItem(sampleLoginItemJSON)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" {
-		t.Fatalf("standard fields wrong: %+v", n)
-	}
-	if !n.TOTP {
-		t.Fatal("TOTP presence flag must be true when a seed exists")
-	}
-	if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) {
-		t.Fatalf("URIs = %v", n.URIs)
-	}
-	want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"}
-	if !reflect.DeepEqual(n.Fields, want) {
-		t.Fatalf("custom fields = %v want %v", n.Fields, want)
-	}
-}
-
-// The load-bearing security test: the raw TOTP seed (more powerful than a
-// one-time code) and the password history must NEVER appear in the dump.
-func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) {
-	n, err := normalizeItem(sampleLoginItemJSON)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	out, err := json.Marshal(n)
-	if err != nil {
-		t.Fatalf("marshal: %v", err)
-	}
-	for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} {
-		if strings.Contains(string(out), leak) {
-			t.Fatalf("dump leaked %q: %s", leak, out)
-		}
-	}
-}
-
-func TestNormalizeItemNoTOTP(t *testing.T) {
-	n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	if n.TOTP {
-		t.Fatal("TOTP must be false when no seed present")
-	}
-	out, _ := json.Marshal(n)
-	if strings.Contains(string(out), "totp") {
-		t.Fatalf("no-totp item must omit the totp key entirely: %s", out)
-	}
-}
-
-func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) {
-	n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	out, _ := json.Marshal(n)
-	for _, k := range []string{"username", "password", "uris", "notes", "totp"} {
-		if strings.Contains(string(out), `"`+k+`"`) {
-			t.Fatalf("empty standard field %q must be omitted: %s", k, out)
-		}
-	}
-	if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) {
-		t.Fatalf("name + custom field must survive: %s", out)
-	}
-}
-
-func TestNormalizeItemSecureNoteNullLogin(t *testing.T) {
-	// type 2 (secure note): login is null — must not panic; notes + custom fields survive.
-	n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`)
-	if err != nil {
-		t.Fatalf("normalizeItem(null login): %v", err)
-	}
-	if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" {
-		t.Fatalf("secure-note normalize wrong: %+v", n)
-	}
-	if n.Username != "" || n.Password != "" || n.TOTP {
-		t.Fatalf("login fields must be empty for a login-less item: %+v", n)
-	}
-}
-
-func TestNormalizeItemDuplicateCustomNames(t *testing.T) {
-	// Bitwarden permits duplicate custom-field names; a JSON object can't hold
-	// dups, so last-wins (documented).
-	n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	if n.Fields["k"] != "second" {
-		t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"])
-	}
-}
-
-func TestNormalizeItemLinkedFieldSkipped(t *testing.T) {
-	// type 3 (linked) fields reference another field and carry a null value —
-	// they are not real data and must be skipped.
-	n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`)
-	if err != nil {
-		t.Fatalf("normalizeItem: %v", err)
-	}
-	if _, ok := n.Fields["linked"]; ok {
-		t.Fatalf("linked field must be skipped: %v", n.Fields)
-	}
-	if n.Fields["real"] != "r" {
-		t.Fatalf("real custom field dropped: %v", n.Fields)
-	}
-}
-
-func TestNormalizeItemMalformed(t *testing.T) {
-	if _, err := normalizeItem("not json"); err == nil {
-		t.Fatal("malformed item JSON must error")
-	}
-}
-
-// getItem opens a session and runs `bw get item <name>`, returning raw JSON.
-func TestGetItemFlow(t *testing.T) {
-	f := &fakeRunner{out: map[string]string{
-		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
-		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
-		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
-		"bw status":          `{"status":"locked"}`,
-		"bw unlock":          "SESS",
-		"bw get item github": sampleLoginItemJSON,
-	}}
-	uid := fmt.Sprintf("%d", os.Getuid())
-	raw, err := getItem(f.run, "emo", uid, "github")
-	if err != nil || !strings.Contains(raw, `"name":"GitHub"`) {
-		t.Fatalf("getItem = %q, %v", raw, err)
-	}
-	// The session key must reach bw via env, never argv.
-	for _, call := range f.calls {
-		for _, arg := range call {
-			if strings.Contains(arg, "SESS") {
-				t.Errorf("session leaked into argv: %v", call)
-			}
-		}
-	}
-}
-
-func TestVaultHelpMentionsAll(t *testing.T) {
-	if !strings.Contains(vaultHelp(), "--all") {
-		t.Error("vault help must document --all")
-	}
-}
-
-// --- bw sync on read (freshness) ------------------------------------------
-
-func TestBwSyncArgs(t *testing.T) {
-	if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) {
-		t.Fatalf("bwSyncArgs = %v", got)
-	}
-}
-
-// Every read opens a session that first `bw sync`s, so reads reflect the latest
-// server-side values: `bw unlock` is local-only, so without a sync a persisted
-// (already-logged-in) session serves a stale local cache.
-func TestOpenSessionSyncsBeforeRead(t *testing.T) {
-	f := &fakeRunner{out: map[string]string{
-		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
-		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
-		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
-		"bw status":              `{"status":"locked"}`,
-		"bw unlock":              "SESS",
-		"bw sync":                "Syncing complete.",
-		"bw get password github": "p@ss",
-	}}
-	uid := fmt.Sprintf("%d", os.Getuid())
-	if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
-		t.Fatalf("getValue: %v", err)
-	}
-	idx := func(prefix string) int {
-		for i, c := range f.calls {
-			if strings.HasPrefix(strings.Join(c, " "), prefix) {
-				return i
-			}
-		}
-		return -1
-	}
-	syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github")
-	if syncAt < 0 {
-		t.Fatal("expected a `bw sync` before the read")
-	}
-	if !(unlockAt < syncAt && syncAt < getAt) {
-		t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock<sync<get)", unlockAt, syncAt, getAt)
-	}
-}
-
-// Sync is best-effort: a transient sync failure must NOT fail the read — the
-// cached value is still returned (a stderr warning is emitted, not asserted here).
-func TestReadSucceedsWhenSyncFails(t *testing.T) {
-	f := &fakeRunner{
-		out: map[string]string{
-			"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
-			"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.x",
-			"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "cs",
-			"bw status":              `{"status":"locked"}`,
-			"bw unlock":              "SESS",
-			"bw get password github": "p@ss",
-		},
-		err: map[string]error{"bw sync": errors.New("Failed to sync: network error")},
-	}
-	uid := fmt.Sprintf("%d", os.Getuid())
-	val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
-	if err != nil || val != "p@ss" {
-		t.Fatalf("read must succeed despite a sync failure: val=%q err=%v", val, err)
-	}
-}
-
-// --- vault kv (HashiCorp Vault / OpenBao infra secrets) --------------------
-
-func TestVaultKVCommandsRegistered(t *testing.T) {
-	want := map[string]Tier{
-		"vault kv get":  TierRead,
-		"vault kv list": TierRead,
-		"vault kv put":  TierWrite,
-	}
-	got := map[string]Tier{}
-	for _, c := range vaultCommands() {
-		got[c.name()] = c.Tier
-	}
-	for name, tier := range want {
-		if got[name] != tier {
-			t.Errorf("command %q: tier=%q, want %q", name, got[name], tier)
-		}
-	}
-}
-
-func TestVaultKVArgs(t *testing.T) {
-	if got := vaultKVGetFieldArgs("secret/viktor", "github_pat"); !reflect.DeepEqual(got, []string{"kv", "get", "-field=github_pat", "secret/viktor"}) {
-		t.Fatalf("vaultKVGetFieldArgs = %v", got)
-	}
-	if got := vaultKVGetJSONArgs("secret/viktor"); !reflect.DeepEqual(got, []string{"kv", "get", "-format=json", "secret/viktor"}) {
-		t.Fatalf("vaultKVGetJSONArgs = %v", got)
-	}
-	if got := vaultKVListArgs("secret/"); !reflect.DeepEqual(got, []string{"kv", "list", "-format=json", "secret/"}) {
-		t.Fatalf("vaultKVListArgs = %v", got)
-	}
-	// create (path absent) → put; merge (path present) → patch -method=rw. Either
-	// way the VALUE travels via the `key=-` stdin form, never argv.
-	create := vaultKVPutArgs(false, "secret/x", "api_key")
-	if !reflect.DeepEqual(create, []string{"kv", "put", "secret/x", "api_key=-"}) {
-		t.Fatalf("vaultKVPutArgs(create) = %v", create)
-	}
-	merge := vaultKVPutArgs(true, "secret/x", "api_key")
-	if !reflect.DeepEqual(merge, []string{"kv", "patch", "-method=rw", "secret/x", "api_key=-"}) {
-		t.Fatalf("vaultKVPutArgs(merge) = %v", merge)
-	}
-	for _, args := range [][]string{create, merge} {
-		for _, a := range args {
-			if strings.Contains(a, "SECRETVALUE") || strings.HasSuffix(a, "=SECRETVALUE") {
-				t.Fatalf("value must not appear in argv: %v", args)
-			}
-		}
-	}
-}
-
-func TestExtractKVData(t *testing.T) {
-	// `vault kv get -format=json` wraps the secret in {"data":{"data":{...},"metadata":{...}}}.
-	env := `{"request_id":"x","data":{"data":{"github_pat":"ghp_abc","email":"e@x.me"},"metadata":{"version":3}}}`
-	out, err := extractKVData(env)
-	if err != nil {
-		t.Fatalf("extractKVData: %v", err)
-	}
-	// Round-trip to a map so key order doesn't matter.
-	var m map[string]string
-	if err := json.Unmarshal([]byte(out), &m); err != nil {
-		t.Fatalf("result not a JSON object: %q (%v)", out, err)
-	}
-	if m["github_pat"] != "ghp_abc" || m["email"] != "e@x.me" {
-		t.Fatalf("extractKVData inner data wrong: %v", m)
-	}
-	// metadata must NOT leak into the output.
-	if strings.Contains(out, "metadata") || strings.Contains(out, "request_id") {
-		t.Fatalf("envelope internals leaked: %s", out)
-	}
-	if _, err := extractKVData("not json"); err == nil {
-		t.Fatal("malformed envelope must error")
-	}
-}
-
-func TestParseKVList(t *testing.T) {
-	keys, err := parseKVList(`["app1","app2/","viktor"]`)
-	if err != nil {
-		t.Fatalf("parseKVList: %v", err)
-	}
-	if !reflect.DeepEqual(keys, []string{"app1", "app2/", "viktor"}) {
-		t.Fatalf("parseKVList = %v", keys)
-	}
-	if _, err := parseKVList("not json"); err == nil {
-		t.Fatal("malformed list must error")
-	}
-}
-
-func TestKVGetFieldFlow(t *testing.T) {
-	f := &fakeRunner{out: map[string]string{
-		"vault kv get -field=github_pat secret/viktor": "ghp_secret",
-	}}
-	val, err := kvGetField(f.run, "secret/viktor", "github_pat")
-	if err != nil || val != "ghp_secret" {
-		t.Fatalf("kvGetField = %q, %v", val, err)
-	}
-}
-
-func TestKVListFlow(t *testing.T) {
-	f := &fakeRunner{out: map[string]string{
-		"vault kv list -format=json secret/": `["app1","app2/"]`,
-	}}
-	keys, err := kvList(f.run, "secret/")
-	if err != nil || !reflect.DeepEqual(keys, []string{"app1", "app2/"}) {
-		t.Fatalf("kvList = %v, %v", keys, err)
-	}
-}
-
-// kvPut creates the path on first write and merges thereafter, with the value on
-// stdin only (mirrors writeCreds). Never plain `kv patch` (needs the patch cap).
-func TestKVPutCreatesThenMerges(t *testing.T) {
-	for _, tc := range []struct {
-		name       string
-		exists     bool
-		wantCreate bool
-	}{
-		{"absent path → create (put)", false, true},
-		{"present path → merge (patch -rw)", true, false},
-	} {
-		t.Run(tc.name, func(t *testing.T) {
-			var stdinCalls []recStdin
-			run := func(name string, argv, envv []string) (string, error) {
-				if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
-					if tc.exists {
-						return `{"data":{"data":{}}}`, nil
-					}
-					return "", fmt.Errorf("No value found at secret/x")
-				}
-				return "", nil
-			}
-			runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
-				stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
-				return "", nil
-			}
-			if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil {
-				t.Fatalf("kvPut: %v", err)
-			}
-			if len(stdinCalls) != 1 {
-				t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls))
-			}
-			sc := stdinCalls[0]
-			joined := strings.Join(sc.argv, " ")
-			if tc.wantCreate && !strings.Contains(joined, "kv put") {
-				t.Fatalf("absent path must use `kv put`: %v", sc.argv)
-			}
-			if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") {
-				t.Fatalf("present path must merge via `kv patch -method=rw`: %v", sc.argv)
-			}
-			if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") {
-				t.Fatalf("must never use plain `kv patch`: %v", sc.argv)
-			}
-			if sc.stdin != "SECRETVALUE" {
-				t.Fatalf("value must travel via stdin, got %q", sc.stdin)
-			}
-			for _, a := range sc.argv {
-				if strings.Contains(a, "SECRETVALUE") {
-					t.Fatalf("value leaked into argv: %v", sc.argv)
-				}
-			}
-		})
-	}
-}
-
-func TestVaultHelpMentionsBothSystems(t *testing.T) {
-	h := vaultHelp()
-	for _, want := range []string{"Vaultwarden", "vault kv"} {
-		if !strings.Contains(h, want) {
-			t.Errorf("vault help must mention %q (distinguish the two systems)", want)
-		}
-	}
-	// Must name the infra-secrets system so the distinction is unambiguous.
-	if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") {
-		t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)")
-	}
-}
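Reviewer note: the tests removed above pinned a create-then-merge write rule. Per their comments, an absent path is created with `kv put`, an existing path is merged with `kv patch -method=rw` (which needs only read and update, not the `patch` capability), and the secret value is passed as `key=-` so it is read from stdin rather than argv. A minimal sketch of that argv rule follows; `writeArgs` is a hypothetical name used for illustration and is not the repository's helper.

```go
package main

import "fmt"

// writeArgs is a hypothetical helper (an illustration, not the repo's code).
// It builds the argv for a single-key Vault KV write:
//   - path absent  -> "kv put" (creates the path)
//   - path present -> "kv patch -method=rw" (read-modify-write merge)
// The value is never placed in argv; "key=-" tells the vault CLI to read it
// from stdin.
func writeArgs(pathExists bool, path, key string) []string {
	verb := []string{"kv", "put"}
	if pathExists {
		verb = []string{"kv", "patch", "-method=rw"}
	}
	return append(verb, path, key+"=-")
}

func main() {
	fmt.Println(writeArgs(false, "secret/x", "api_key"))
	fmt.Println(writeArgs(true, "secret/x", "api_key"))
}
```

The caller would still have to decide `pathExists` (the removed tests do so with a preceding `kv get`) and feed the value on stdin; both are outside this sketch.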
--- a/cli/edges.go
+++ b/cli/edges.go
@@ -1,164 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"regexp"
-	"strconv"
-	"strings"
-)
-
-// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
-// investigation helper over the goldmane_edges trail; see ADR-0014).
-type edgesOpts struct {
-	ns       string // edges touching this namespace (either direction)
-	src      string // edges where src_ns = this
-	dst      string // edges where dst_ns = this
-	peersOf  string // distinct peers of this namespace (both directions)
-	newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
-	denied   bool   // action = 'deny' only
-	asJSON   bool   // wrap result as a JSON array
-	limit    int    // row cap (default 200)
-}
-
-// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
-// typo surfaces instead of silently dumping the whole table.
-func parseEdgesArgs(args []string) (edgesOpts, error) {
-	o := edgesOpts{limit: 200}
-	i := 0
-	for i < len(args) {
-		a := args[i]
-		key, inline, hasInline := a, "", false
-		if eq := strings.IndexByte(a, '='); eq >= 0 {
-			key, inline, hasInline = a[:eq], a[eq+1:], true
-		}
-		needVal := func() (string, error) {
-			if hasInline {
-				return inline, nil
-			}
-			if i+1 < len(args) {
-				i++
-				return args[i], nil
-			}
-			return "", fmt.Errorf("flag %s needs a value", key)
-		}
-		var err error
-		switch key {
-		case "--ns":
-			o.ns, err = needVal()
-		case "--src":
-			o.src, err = needVal()
-		case "--dst":
-			o.dst, err = needVal()
-		case "--peers-of":
-			o.peersOf, err = needVal()
-		case "--new-since":
-			o.newSince, err = needVal()
-		case "--denied":
-			o.denied = true
-		case "--json":
-			o.asJSON = true
-		case "--limit":
-			var v string
-			if v, err = needVal(); err == nil {
-				if o.limit, err = strconv.Atoi(v); err != nil {
-					err = fmt.Errorf("--limit must be an integer: %q", v)
-				}
-			}
-		default:
-			return o, fmt.Errorf("unknown flag: %s", a)
-		}
-		if err != nil {
-			return o, err
-		}
-		i++
-	}
-	return o, nil
-}
-
-// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
-// injection guard — anything else is rejected rather than quoted-and-hoped.
-var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
-
-func validateNS(s string) error {
-	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
-		return fmt.Errorf("invalid namespace name: %q", s)
-	}
-	return nil
-}
-
-// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
-func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
-
-var (
-	durRE  = regexp.MustCompile(`^(\d+)([smhd])$`)
-	dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
-)
-
-// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
-// into a first_seen predicate.
-func newSinceCond(v string) (string, error) {
-	if m := durRE.FindStringSubmatch(v); m != nil {
-		unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
-		return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
-	}
-	if dateRE.MatchString(v) {
-		return "first_seen >= " + sqlStr(v), nil
-	}
-	return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
-}
-
-// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
-func buildEdgesQuery(o edgesOpts) (string, error) {
-	limit := o.limit
-	if limit <= 0 {
-		limit = 200
-	}
-
-	// peers-of is a distinct-peer summary, a different shape from the row list.
-	if o.peersOf != "" {
-		if err := validateNS(o.peersOf); err != nil {
-			return "", err
-		}
-		p := sqlStr(o.peersOf)
-		return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
-			"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
-			"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
-			") t ORDER BY peer LIMIT %d", p, p, limit), nil
-	}
-
-	var conds []string
-	for _, f := range []struct{ val, tmpl string }{
-		{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
-		{o.src, "src_ns = %s"},
-		{o.dst, "dst_ns = %s"},
-	} {
-		if f.val == "" {
-			continue
-		}
-		if err := validateNS(f.val); err != nil {
-			return "", err
-		}
-		conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
-	}
-	if o.denied {
-		conds = append(conds, "action = 'deny'")
-	}
-	if o.newSince != "" {
-		c, err := newSinceCond(o.newSince)
-		if err != nil {
-			return "", err
-		}
-		conds = append(conds, c)
-	}
-
-	q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
-	if len(conds) > 0 {
-		q += " WHERE " + strings.Join(conds, " AND ")
-	}
-	q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
-
-	if o.asJSON {
-		q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
-	}
-	return q, nil
-}
--- a/cli/edges_test.go
+++ b/cli/edges_test.go
@@ -1,163 +0,0 @@
-package main
-
-import (
-	"strings"
-	"testing"
-)
-
-func TestParseEdgesArgs(t *testing.T) {
-	cases := []struct {
-		name string
-		args []string
-		want edgesOpts
-	}{
-		{"defaults", nil, edgesOpts{limit: 200}},
-		{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
-		{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
-		{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
-		{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
-		{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
-		{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
-		{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
-	}
-	for _, c := range cases {
-		t.Run(c.name, func(t *testing.T) {
-			got, err := parseEdgesArgs(c.args)
-			if err != nil {
-				t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
-			}
-			if got != c.want {
-				t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
-			}
-		})
-	}
-}
-
-func TestParseEdgesArgsErrors(t *testing.T) {
-	for _, args := range [][]string{
-		{"--limit", "abc"},
-		{"--bogus"},
-	} {
-		if _, err := parseEdgesArgs(args); err == nil {
-			t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
-		}
-	}
-}
-
-func TestBuildEdgesQueryDefaults(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{limit: 200})
-	if err != nil {
-		t.Fatal(err)
-	}
-	for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
-		if !strings.Contains(q, want) {
-			t.Errorf("query %q missing %q", q, want)
-		}
-	}
-	if strings.Contains(q, "WHERE") {
-		t.Errorf("no-filter query should have no WHERE: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryFilters(t *testing.T) {
-	cases := []struct {
-		name string
-		o    edgesOpts
-		want string
-	}{
-		{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
-		{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
-		{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
-		{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
-	}
-	for _, c := range cases {
-		t.Run(c.name, func(t *testing.T) {
-			q, err := buildEdgesQuery(c.o)
-			if err != nil {
-				t.Fatal(err)
-			}
-			if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
-				t.Errorf("query %q missing WHERE/%q", q, c.want)
-			}
-		})
-	}
-}
-
-func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
-	if err != nil {
-		t.Fatal(err)
-	}
-	if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
-		t.Errorf("combined filters not AND'd: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryPeersOf(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
-	if err != nil {
-		t.Fatal(err)
-	}
-	for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
-		if !strings.Contains(q, want) {
-			t.Errorf("peers-of query %q missing %q", q, want)
-		}
-	}
-}
-
-func TestBuildEdgesQueryJSON(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
-	if err != nil {
-		t.Fatal(err)
-	}
-	if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
-		t.Errorf("json query missing json_agg wrapper: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
-	for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
-		if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
-			t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
-		}
-	}
-}
-
-func TestNewSinceCond(t *testing.T) {
-	cases := []struct {
-		in   string
-		want string
-	}{
-		{"24h", "first_seen >= now() - interval '24 hours'"},
-		{"7d", "first_seen >= now() - interval '7 days'"},
-		{"30m", "first_seen >= now() - interval '30 minutes'"},
-		{"2026-06-28", "first_seen >= '2026-06-28'"},
-	}
-	for _, c := range cases {
-		got, err := newSinceCond(c.in)
-		if err != nil {
-			t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
-		}
-		if got != c.want {
-			t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
-		}
-	}
-	for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
-		if _, err := newSinceCond(bad); err == nil {
-			t.Errorf("newSinceCond(%q) expected error, got nil", bad)
-		}
-	}
-}
-
-func TestValidateNS(t *testing.T) {
-	for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
-		if err := validateNS(ok); err != nil {
-			t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
-		}
-	}
-	for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
-		if err := validateNS(bad); err == nil {
-			t.Errorf("validateNS(%q) expected error, got nil", bad)
-		}
-	}
-}
--- a/cli/homelab.go
+++ b/cli/homelab.go
@@ -20,7 +20,6 @@ func buildRegistry() []Command {
 	reg = append(reg, deployCommands()...)
 	reg = append(reg, netCommands()...)
 	reg = append(reg, obsCommands()...)
-	reg = append(reg, edgesCommands()...)
 	reg = append(reg, usageCommands()...)
 	reg = append(reg, haCommands()...)
 	reg = append(reg, browserCommands()...)
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@@ -5,31 +5,8 @@ import (
 	"os"
 	"strings"
 	"testing"
-	"unicode/utf8"
 )

-func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
-	// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
-	// invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must
-	// cut on a rune boundary and always stay valid UTF-8.
-	long := strings.Repeat("я", 300) // 300 runes / 600 bytes
-	got := truncatePreview(long, 240)
-	if !utf8.ValidString(got) {
-		t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
-	}
-	if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
-		t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
-	}
-	// Short multibyte strings pass through untouched (no ellipsis).
-	if got := truncatePreview("кратко", 240); got != "кратко" {
-		t.Fatalf("short string altered: %q", got)
-	}
-	// ASCII boundary still works.
-	if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
-		t.Fatalf("ascii truncation wrong: %q", got)
-	}
-}
-
 func TestResolveMemoryBase(t *testing.T) {
 	old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
 	defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
 Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:

 - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
 - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
 - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@@ -5,14 +5,6 @@ exists to answer the question that drove the whole CLI — *which verbs are worth
 adding next* — with data instead of one maintainer's habits (the earlier mining
 covered a single user's ~51k commands, so the surface is shaped to that user).

-> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
-> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
-> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
-> owner in-session") no longer holds: the managed-settings policy now **defers
-> to OS/sudo authorization**. The `usage top` telemetry design itself is
-> unchanged and still current — only the "never read homes" framing in the
-> third decision below is overtaken.
-
 ## Decisions

 - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@@ -27,9 +27,3 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
-
-## As-built (2026-06-25)
-
-Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
-
-Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/adr/0015-os-is-the-authorization-boundary.md
+++ b/docs/adr/0015-os-is-the-authorization-boundary.md
@@ -1,57 +0,0 @@
-# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
-
-Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
-carried and that ADR-0011 leaned on ("never read another user's home /
-`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
-subject — `usage top` telemetry and its emit design — is unchanged and still
-current; only the privacy prohibition it referenced is superseded here.
-
-## Context
-
-The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
-`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
-"you are not the admin, do not escalate privileges" and "never read another
-user's home directory, credentials, tokens, or `~/.claude`." The OS told a
-different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
-The kernel had already granted total read access; the policy was layering an
-artificial refusal on top of an authorization the OS already permits, and the
-"not the admin" framing was factually wrong for a NOPASSWD-root user.
-
-Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
-or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
-for analytics/debugging across the shared box.
-
-## Decision
-
- **Authorization follows the OS, not this policy.** Agents may access whatever
-  their OS user can access — directly or via `sudo` where they hold sudo rights
-  — and must not impose restrictions stricter than the OS. On this box that
-  includes other users' home directories and `~/.claude` for users who hold
-  broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
-  permission model + sudoers is the single source of truth for who may read
-  what. Other homes are `0750`-owned, so a cross-home read necessarily transits
-  `sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
-  stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
-  file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
-  managed-settings, so every user's agents defer to that user's own sudo grant.
-  Any user with broad sudo gets the same cross-home read capability over other
-  users' files. Accepted by the owner with that understanding; emo's and
-  ancamilea's `~/.claude` is now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
-  the session that made the change keeps running under the old policy.
-
-## Consequences
-
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
-  "cross-user analytics without reading homes" answer) remains useful but is no
-  longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
-  prompt-injected or otherwise compromised, it can now read every user's secrets
-  with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
-  is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
-  `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
-  session.
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@@ -86,56 +86,10 @@ Signin latency is dominated by screen count and round trips, not server time
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
 - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
-  15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
-  hardening — decorrelates the 9 workers' recycles from PG blips). **No
-  `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
-  1:1 and saturate the session-mode pool (reverted 2026-06-10).
+  15m policy cache, 60s persistent DB connections.
 - **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
-  `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
-  login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
-  burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
-  (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
-  DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
-  3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
-  blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
-  + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
-  option), so request-serving is coupled to PG — this survives a short transient,
-  not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
-  (the repo's old `strategy:` key was silently inert → live ran the chart-default
-  25%/25% and dropped a server pod out of rotation on every roll). Now
-  `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
-  and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
-  the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
-  image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
-  authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
-  **and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
-  so those clients get the *real* authentik login (password + MFA + reputation —
-  no auth downgrade). The SFE can't render Identification-stage **sources**
-  (authentik limitation), so the patch also injects static social-login `<a>`
-  links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
-  required for password-less accounts (e.g. Google-only users). A Traefik
-  basic-auth fallback was rejected: it would have put a single spoofable-UA
-  password in front of `vbarzin→wizard` (passwordless root on the devvm). See
-  `stacks/authentik/patch-compat-sfe.py`.
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
-  MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
-  a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
-  **cannot render WebAuthn** (enrol *or* validate), so that user gets
-  `unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
-  downgrade**: (1) **social login** — sources run `default-source-authentication`
-  (UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
-  button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
-  ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
-  runtime data (not Terraform): enrol via `ak shell`
-  (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
-  user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
-  his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
 - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
 - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
  TCP setup on the forward-auth subrequest path.
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@@ -205,43 +205,6 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
 wrapper in `main.tf` (so it applies deterministically even though the image is
 `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
 as the android-emulator stack.
-
-### noVNC black after a browser-container restart (x11vnc supervision)
-
-A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
-but the view is **black**, and the novnc container logs spew
-`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
-refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
-in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
-container's Xvfb over `localhost:6099` (shared pod network). When the browser
-container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
-Xvfb vanishes and x11vnc loses its X connection and exits.
-
-`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
-background children and `wait -n`s on them, exiting non-zero if **either** dies, so
-the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
-relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
-(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
-websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
-`<defunct>` zombie — and the view black until a manual pod restart. Same
-supervision pattern as the android-emulator stack's entrypoint.)
-
-**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
-entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
-"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
-— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
-recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
-and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
-
-> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
-> (`keel.sh/policy=never`, because the browser container's playwright image is
-> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
-> rebuilt `:latest` will **not** redeploy on its own. After the
-> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
-> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
-> and rollout (the novnc image is TF-managed — not in the deployment's
-> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@@ -293,42 +256,6 @@ Key facts:
  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
  CLI's stealth never diverges from the in-cluster callers'.

-## Multi-user access (sharing the browser)
-
-There is ONE chrome-service browser with ONE persistent profile, warmed with
-**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
-drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
-reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
-sessions. Access is gated accordingly, per user.
-
-**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
-Viktor's browser for form-filling + captcha solving, rather than getting an
-isolated instance. The session-exposure trade-off above was explicitly accepted.
-
-Two independent grants make up "browser access" for a user:
-
-1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
-   `admin-services-restriction` policy: the `CHROME_ALLOWED` set
-   (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
-   username OR email. Add the user there. No kubeconfig/RBAC needed.
-2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
-   in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
-   kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
-   session). Provided by a per-user **ServiceAccount** with a long-lived token
-   (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
-   this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
-   resolve the Service and doesn't regress the user's normal read). The devvm
-   provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
-   reads that token and installs it as the user's DEFAULT kubeconfig context
-   (`<user>-browser@homelab`), keeping their personal OIDC login as the
-   `oidc@homelab` named context. The SA's existence is the source of truth for who
-   gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
-
-**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
-`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
-the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
-a token by deleting its `<user>-browser-token` Secret).
-
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -115,67 +115,9 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
 instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
 fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
 pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
-k8s-portal, apple-health-data, audiblez-web, insta2spotify,
+k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
 audiobook-search) now also land on ghcr.

-**plotting-book** is a special case (a GitHub-first repo owned by Anca,
-ADR-0003): the build runs in *her* GitHub repo
-(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
-`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
-not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
-PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
-`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
-read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
-2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
-unchanged. Flow:
-
-```text
- DEVELOP ───────────────────────────────────────────────────────────────────────
-   Anca (Codex / t3 web agent)
-        │  git push → main
-        ▼
- ┌──────────────────────────────────────────────────────────────┐
- │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│  ← canonical
- │   .github/workflows/build-and-deploy.yml     on: push → main  │
- └───────────────────────────┬──────────────────────────────────┘
-                             │  GitHub Actions runner (off-infra build · ADR-0002)
-        ┌────────────────────┴─────────────────────────────────┐
-        ▼                                                        ▼
- ┌─────────────────────────────────────────────┐      ╔═══════════════════════════════════════╗
- │ build job                                   │ push ║  GHCR · PRIVATE package                ║
- │  • svu next --always → tag vX.Y.Z (→ repo)  │═════▶║  ghcr.io/passionprojectsanca/         ║
- │  • buildx linux/amd64, provenance:false     │ tags ║       book-plotter  :vX.Y.Z  :latest  ║
- │  • login ghcr (GITHUB_TOKEN, packages:write)│      ╚═══════════════════╤═══════════════════╝
- │  • delete-package-versions (keep newest 10) │                          │
- └───────────────────────┬─────────────────────┘                          │ pull (private,
-                         ▼  deploy job  [gate: repo var DEPLOY_ENABLED ≠ "false"]  via secret)
-   POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME}         │
-                         ▼                                                         │
- ┌─────────────────────────────────────────────────────────────┐                 │
- │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual)  │                 │
- │   kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │                 │
- │   kubectl rollout status                                     │                 │
- └───────────────────────────┬─────────────────────────────────┘                 │
-                             ▼                                                     │
- ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════      │
- ┌─────────────────────────────────────────────────────────────┐                 │
- │ Deployment plotting-book  (Recreate · image = ignore_changes)│                 │
- │   imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
- │   Pod → Express :3001  +  SQLite on PVC (proxmox-lvm)        │
- └─────────────────────────────────────────────────────────────┘
-   guards / supporting:
-     • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED   (admission)
-     • Keel policy=patch @1h → watches GHCR via ghcr-credentials          (backstop)
-     • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
-
- ═══════════════ Serving path (unchanged) ══════════════════════════════════
-   Browser ─▶ plotting-book.viktorbarzin.me  (non-proxied DNS → Traefik .203)
-           ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
-```
-
-Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
-`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
-
 ### Infra-owned images (issues #29 / #30)

 Images owned by the infra repo build on GHA workflows **in the infra repo's own
@@ -221,9 +163,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
 | Pipeline | File | Purpose |
 |----------|------|---------|
 | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
-| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
 | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
 | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@@ -234,38 +176,6 @@ Woodpecker is **deploy + cluster-touching steps only**:

 **No build/test pipeline exists on any repo.** Do not (re)introduce one.

-### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
-
-infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
-and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
-push**. Left unguarded, two `terragrunt apply` runs race each other for the
-per-stack PG state lock — historically the #1 source of `Error acquiring the
-state lock` failures and push-supersede "killed" runs.
-
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
-  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
-  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
-  skip. Fail-open (unknown forge still applies). The mirror keeps running the
-  **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
-  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
-  have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
-  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
-  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
-  locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
-  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
-  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
-  are NOT retried — they fail fast.
-
-A pre-apply off-infra validate gate was evaluated and rejected: `terraform
-validate` runs without state but catches ~0 of the observed failures (they are
-provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
-lock contention — all invisible to static validate), and `plan` cannot run
-off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
-phase without mutating on config errors, so a separate in-pipeline plan-gate was
-also dropped as redundant.
-
 ### Woodpecker API

 Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@@ -318,20 +318,9 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
-- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

-#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
-
-Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
-
-| Alert | Expr (abridged) | For | Severity |
-|---|---|---|---|
-| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
-| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
-
-The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
-
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
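
The wall-off check described in the **Mechanism** bullet of the monitoring.md hunk above can be sketched in Python. This is only an illustration of the matching rule, not the `blackbox-exporter` implementation: the regex is copied from the doc, while the function name `is_walled_off` and the assumption that the pattern is applied unanchored to the `Location` header are hypothetical.

```python
import re

# Regex copied from the http_no_authentik_redirect module description.
AUTHENTIK_REDIRECT = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def is_walled_off(location):
    """Return True when a Location header value points at Authentik.

    `location` is the raw header value, or None when the response carried
    no Location header (hypothetical helper, assumed unanchored matching).
    """
    return location is not None and AUTHENTIK_REDIRECT.search(location) is not None
```

A legitimate redirect (for example a short-link 302 to an unrelated host) does not match, which is why the doc lists 301/302 among the valid status codes.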
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.

-**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
+**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

 **Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)

--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@@ -261,7 +261,7 @@ Traefik chain:

 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
-3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
+3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).

 Additional middleware:
@@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che

 **Diagnosis**: Check Traefik middleware config for the affected IngressRoute.

-**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
+**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.

 ### Large Downloads or Uploads Truncate / Fail Partway

--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -132,13 +132,6 @@ for the supersession history — there is no longer an inline Traefik bouncer.)
  account hard-limits to **one** list), and CAPI is already covered in-kernel on
  direct hosts and by Cloudflare's own managed protections on proxied hosts.
  Registered bouncer key: **`kvsync`**.
-- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint
-  is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0`
-  (one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF
-  `429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it
-  uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and
-  escalated the throttle into a stuck state that left the list empty — a
-  self-inflicted DoS that this change prevents.
 - **Block-only**: the single-list limit precludes a separate
  captcha/managed-challenge list, so both ban and captcha decisions are enforced
  as a plain block at the edge.
@@ -279,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**

 The block below documents the locked design.

-Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
+Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.

 #### Detection sources

@@ -292,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#

 #### Alert rules (16 total)

-Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.

 **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**

@@ -371,69 +364,6 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).

-#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
-
-The durable **east-west flow trail** (below) is now the preferred data source for
-the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
-faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
-(ADR-0014: "Enforcement gains a better data source"). The unique observed
-namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
-namespaces a source is observed talking to (the `allow` set that seeds its
-NetworkPolicy):
-
-```sql
-SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
-```
-
-The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
-observation caveat) is in
-[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
-**External / public-internet egress is NOT in this table** (empty-namespace flows
-are dropped) — for those destinations keep using the Calico flow-log observation
-(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
-existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
-out of scope** of the trail — it is observe-and-derive only.
-
-### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
-
-The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
-carried no identity). **Service identity = the workload's namespace** (primary),
-refined by a `service-identity` label in the few multi-Service namespaces
-(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
-
-1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
-   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
-   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
-   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
-   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
-   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
-   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
-   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
-   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
-2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
-   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
-   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
-   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
-   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
-   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
-   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
-   private key into TF state — **re-apply the stack if the operator rotates that
-   Secret**.
-3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
-   **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
-   `#alerts`; the `#security` channel was abandoned 2026-06-25 because that
-   webhook's Slack app isn't a member of it (a `#security` override 404s). See
-   runbook.
-
-The trail is **attribution-grade, not cryptographic** (reconstructs events in a
-trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
-limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
-the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
-(see monitoring.md). Full as-built, query recipes, and troubleshooting:
-[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
-[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
-`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
-
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
+++ b/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
@@ -1,117 +0,0 @@
-# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
-
-**Date:** 2026-06-28
-**Status:** design → implementation
-**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
-
-## Problem
-
-The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
-next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
-it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
-deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
-release we're not ready for). The result, **every single night**:
-
-- a **Failed** preflight Job (`block()` exits 1), and
-- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
-
-But this block is **not actionable** — there's nothing we can upgrade to clear
-it; we can only wait for upstream (kyverno/ESO) and, separately, do the
-gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
-signal that's indistinguishable from a block we could actually fix.
-
-## Goal
-
-Make the gate **classify** each blocker and behave accordingly:
-
-| Class | Definition | Behaviour |
-|-------|-----------|-----------|
-| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
-| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
-| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
-
-Removed-API and containerd blocks are always **actionable**. **Held wins:** if
-*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
-acting on the actionable blockers wouldn't unblock it yet. The nightly report
-still lists everything so the full eventual scope is visible.
-
-Also (scope decision: "tidy the block path"): deliberate gate decisions
-(actionable-block **and** held) now make the preflight Job **Complete cleanly**
-(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
-exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
-1 → `K8sUpgradeChainJobFailed`.
-
-## Design
-
-### `compat-gate.py`
-- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
-- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
-- `check_addons`: when an addon blocks, decide its class:
-  - `pinned: true` in its matrix entry → `[PINNED]`.
-  - else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
-  - else → `[WAITING]` (`no released X version supports k8s T yet`).
-  - unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
-- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
-- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
-
-### `upgrade-step.sh`
-- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
-- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
-  set `HALT_CHAIN=1`, **do not exit**.
-- `phase_preflight` gate handling routes on the gate's exit code:
-  - `0` → push `blocked=0`+`held=0`, proceed.
-  - `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
-  - `4` → `record_held`, `return 0` (Job Completes, **no alert**).
-- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
-  at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
-- postflight also clears `held=0` alongside the existing gauge resets.
-
-### detector (`main.tf`, the `k8s-version-check` CronJob)
-- Consequence of the tidy change: refusals now **Complete** instead of Failing,
-  so the old "re-spawn only a *Failed* preflight" idempotency would skip a
-  refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
-  preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
-  gate refused — chain never advanced) — **silently** (no Slack), so a standing
-  hold re-evaluates each night without noise.
-- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
-  Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
-  flag), not for silent re-evaluations — killing the last nightly-noise source.
-
-### `addon-compat.json`
-- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
-  `26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
-  the `pinned` flag in `_comment`. Unpinning later = delete two keys.
-
-### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
-- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
-  actionable-only; reword annotation (reasons are in the nightly report, not a
-  per-run chain Slack).
-- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
-  clause — deliberate blocks no longer create Failed Jobs, so the alert again
-  means a genuine wedge.
-- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
-  nightly report surfaces it). Add a comment recording this.
-
-### `nightly-report.py`
-- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
-- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
-  (fallback bullets for untagged lines, so older reason strings still render).
-- Fetch reasons when avail AND (blocked OR held).
-
-## Net effect on 1.36 today
-**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
-Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
-just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
-kyverno/ESO ship support **and** gpu-operator is unpinned.
-
-## Tests (TDD)
-- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
-  removed-API & containerd are actionable, exit_code mapping, + existing
-  patch/safe cases stay green.
-- `nightly-report`: held headline + grouped reasons; existing tests stay green.
-- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
-  (bash, not unit-tested).
-
-## Out of scope (separate follow-up)
-Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
-addon-readiness probe). This change only *consumes* the matrix.
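
The classification rules in the removed plan above can be illustrated with a short Python sketch. This is not the contents of `compat-gate.py`: the tags, the exit codes, and the "held wins" rule come from the plan, but the matrix shape (`{"versions": {addon_version: max_k8s}}`), the helper names `key` and `classify_addon`, and the kyverno/calico version numbers are assumptions made up for the example.

```python
# Exit codes named in the plan: 0 safe, 2 actionable-block, 3 gate-error, 4 held.
EXIT_SAFE, EXIT_ACTIONABLE, EXIT_GATE_ERROR, EXIT_HELD = 0, 2, 3, 4
HELD_TAGS = ("[WAITING]", "[PINNED]")


def key(version):
    """Turn "3.31" into (3, 31) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def classify_addon(name, running, entry, target):
    """Return a tagged reason line, or None when the addon does not block."""
    versions = entry["versions"]  # hypothetical shape: {addon_version: max_k8s}
    want = key(target)[:2]        # compare on the k8s minor, e.g. (1, 36)
    if running not in versions:
        # Fail-safe: an unknown running version needs a human to look.
        return f"[ACTIONABLE] {name} {running} is not in the compat matrix"
    if key(versions[running])[:2] >= want:
        return None
    if entry.get("pinned"):
        return f"[PINNED] {name} is pinned at {running}"
    newer = sorted(
        (v for v, max_k8s in versions.items()
         if key(v) > key(running) and key(max_k8s)[:2] >= want),
        key=key,
    )
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {newer[0]}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons):
    """Held wins: any WAITING/PINNED reason makes the whole target held."""
    if not reasons:
        return EXIT_SAFE
    if any(reason.startswith(HELD_TAGS) for reason in reasons):
        return EXIT_HELD
    return EXIT_ACTIONABLE


# Hypothetical matrix; only the gpu-operator "26.3 -> 1.36" row is from the plan.
MATRIX = {
    "kyverno": {"versions": {"1.15": "1.35"}},
    "gpu-operator": {"pinned": True, "versions": {"25.10": "1.35", "26.3": "1.36"}},
    "calico": {"versions": {"3.30": "1.35", "3.31": "1.36"}},
}
RUNNING = {"kyverno": "1.15", "gpu-operator": "25.10", "calico": "3.30"}

reasons = [
    reason
    for name, running in RUNNING.items()
    if (reason := classify_addon(name, running, MATRIX[name], "1.36.2"))
]
```

With this made-up matrix, calico alone would be an actionable block, but adding kyverno (waiting) or gpu-operator (pinned) turns the verdict into held, which matches the "Net effect on 1.36 today" paragraph of the plan.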
--- a/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
+++ b/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
@@ -1,128 +0,0 @@
-# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
-
-| Field | Value |
-|-------|-------|
-| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
-| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
-| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
-| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
-| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
-| **Status** | Closed |
-
-## Summary
-
-Woodpecker CI surfaced as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
-
-1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
-2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
-
-Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message.
-
-Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
-
-## Impact
-
-- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks.
-- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
-- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable.
-
-## Timeline (UTC)
-
-| Time | Event |
-|------|-------|
-| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. |
-| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. |
-| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
-| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. |
-| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. |
-| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. |
-| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
-| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). Beads task closed. |
-
-## Root Cause
-
-`metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after first set. When the L2 announcer's leader-election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress.
-
-Why it manifested as Vault credential errors:
-
-1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds.
-2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from.
-3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions that landed on the previous announcer's node are reset immediately.
-4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
-5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below).
-
-## Detection
-
-We did not have any of:
- A direct alert for "MetalLB ServiceL2Status reconciler errors".
- An alert for "PG LB VIP node changed N times in M minutes".
- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`).
-
-The detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. It took 25 days from the first observed symptom to RCA.
-
-## Fixes & Mitigations
-
-### 1. Surface real error from `scripts/tg` (DONE)
-
-The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:
-
-```sh
-# scripts/tg lines 79-89 (current)
-if ! command -v vault >/dev/null 2>&1; then
-  echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
-  exit 1
-fi
-VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
-  echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
-  echo "$VAULT_OUT" >&2
-  echo "" >&2
-  echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
-  exit 1
-}
-```
-
-A comment in the code explicitly references this incident.
-
-### 2. Stuck-CR cleanup procedure (DOCUMENTED)
-
-Reproduction check for future sessions (also in `code-aoxk` beads notes):
-
-```sh
-kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
-# If matches found → same root cause. Delete the stuck CR:
-kubectl get servicel2status -n metallb-system
-kubectl delete servicel2status.metallb.io <name> -n metallb-system
-```
-
-Speaker recreates the CR cleanly within seconds.
-
-### 3. Long-term MetalLB controller fix (DEFERRED)
-
-The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:
-
- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs).
- **File upstream issue / patch** with reproducer.
-
-Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s).
-
-### 4. Alerting (DEFERRED)
-
-Suggested but not implemented:
- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate.
- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from CI namespace, alert if it ever fails.
-
-Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
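-
-A minimal sketch of what the synthetic probe's SQL body might look like, assuming it runs in a single `psql` session against the PG VIP. The lock key `4242` is a hypothetical placeholder, and the non-blocking `pg_try_advisory_lock` is swapped in for `pg_advisory_lock` so that a held lock cannot hang the probe:
-
-```sql
---- Hypothetical probe body: both statements must run in the SAME session,
---- because session-level advisory locks are released when the session ends.
---- 4242 is an arbitrary key chosen for illustration only.
-SELECT pg_try_advisory_lock(4242) AS acquired;
-SELECT pg_advisory_unlock(4242)   AS released;
-```
-
-Assuming nothing else uses that key, either column coming back false, or the connection dropping between the two statements, would be the condition to alert on.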
-
-## Lessons
-
-1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
-2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
-3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
-4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes.
-
-## References
-
- Beads: `code-aoxk` — closed 2026-05-26.
- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
- `kubectl get servicel2status -A` — current state, single allocation per service.
- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@ -1,97 +0,0 @@
-# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
-
-> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
-> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
-> drift was a real *separate* latent bug fixed in the same change.
-
-**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
-the master control-plane phase for the first time — preflight passed, etcd
-snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
-kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
-static-pod-hash window across all internal retries, then auto-rolled-back to
-v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
-the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
-No data loss; no user-facing outage (the master carries control-plane taints, so
-no workloads were displaced).
-
-**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
-first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
-static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
-
-## Root cause — etcd IO starvation on the shared HDD
-
-The new kube-apiserver could not establish/keep a working connection to etcd
-during the upgrade because **etcd was IO-starved**. etcd's surviving container log
-from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
-
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
-  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
-  to bring the new apiserver up.
-
-A reproduced 1.35.6 apiserver with no etcd dies with
-`F instance.go:233 Error creating leases: error creating storage factory: context
-deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
-lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
-shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
-that spindle:
-
-1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
-2. kubeadm dumping a full **~400MB etcd DB backup** to
-   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
-   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
-   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
-   image-GC threshold, so image GC churned during the drain too;
-3. master-drain pod evictions.
-
-### Correction — it was NOT the OIDC flag swap
-
-`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
-`--authentication-config` (structured multi-issuer OIDC) back to legacy
-single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
-was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
-those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
-(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
-etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
-the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
-were also ruled out.
-
-## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
-
-apiserver auth is configured in three places that must agree:
-(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
-+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
-(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
-which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
-the manifest from (3), so it would have reverted structured auth → **dashboard +
-kubectl SSO break after a successful upgrade** (recoverable: the chain's
-post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
-
-## Resolution
-
-1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
-2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
-3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
-
-## Prevention (landed in this change)
-
-| Gap | Fix |
-|-----|-----|
-| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
-| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
-| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
-
-## Lessons
-
- **Capture the failing component's own logs before concluding.** The `kubeadm
-  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
-  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
-  "what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
-  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
-  backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
-  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
-  GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
-  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/claude-auth-renew-workstation.md
+++ b/docs/runbooks/claude-auth-renew-workstation.md
@ -11,11 +11,6 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
 secret/workstation/claude-users/<os-user>
 ```

-The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
-`kv put` only when the path does not exist yet), so keys that other tools
-co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
-A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
-
 The user's unrelated `mcpOAuth` credentials never leave their home directory.
 Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
 `~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
@ -80,64 +75,8 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
 ```

 Never copy another user's `.credentials.json` or scoped Vault token. Never restore
-a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
-outrank per-user login and would silently collapse all users onto one identity.
-(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
-identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
-
-## Long-lived per-user token (heavy concurrent-agent users)
-
-The six-hourly renewal above assumes Claude owns refresh-token rotation in a
-single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
-sessions** (interactive tmux panes + their `t3-serve` instance + always-on
-`start-claude.sh` agents) breaks that assumption: when the shared access token
-expires, the processes refresh **simultaneously**, the OAuth server rotates the
-refresh token, and the losing writer persists an **empty** refresh token —
-logging the user out roughly every access-token lifetime (~8h). Re-issuing the
-credential does not help; the race recurs.
-
-The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
-**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
-never touches `.credentials.json` — so there is nothing to race on. This is the
-user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
-client-side and unaffected), stored only in their OWN Vault path — **NOT** the
-forbidden shared token, and it never crosses OS users.
-
-**Enable it (one-time, per user):**
-
-1. The user mints their own token (interactive Enterprise SSO):
-
-   ```bash
-   claude setup-token        # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
-   ```
-
-2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
-   like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
-
-   ```bash
-   vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
-     setup_token=sk-ant-oat01-…
-   ```
-
-3. Materialize + activate (or just wait ≤6h for the timer):
-
-   ```bash
-   systemctl start claude-auth-sync@<os-user>.service
-   ```
-
-   `claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
-   (`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
-   the rotating-credential validate/backup/restore (so no false
-   `WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
-   that env file. **Sessions started before activation keep the old credential
-   until relaunched** — the user must restart their agents / `t3-serve` to cut over.
-
-**Disable it:** clear the field (`vault kv patch -method=rw
-secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
-the env file and the user reverts to the per-user SSO credential flow.
-
-**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
-re-store (step 2); the env file refreshes on the next sync.
+the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
+login and would silently collapse all users onto one identity.

 ## Verification

--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -1,346 +0,0 @@
-# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
-
-> As-built runbook for the Calico Goldmane + Whisker flow plane and the
-> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
-> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
-> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
-> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
-> (monitoring), #62 (egress allowlist queries), #63 (these docs).
-
-## What the trail is
-
-Three layers turn raw east-west traffic into a queryable, durable record of
-which Service talks to which. **Service identity = the workload's namespace**
-(primary), refined by a `service-identity` label in the few multi-Service
-namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
-
-| Layer | Component | Lifetime | Where it lives |
-|---|---|---|---|
-| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
-| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
-| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
-
-**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
-labels + allow-deny + policy-trace) streamed from Felix (the existing
-`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
-**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
-drove the whole design). **Whisker** is its live web UI. Because the ring
-buffer is *not* a trail (a Goldmane restart loses the window), the
-`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
-mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
-CronJob posts first-seen edges to Slack.
-
-The edge set is deliberately **low-cardinality** — one row per
-`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
-small no matter how much traffic flows.
-
-## Where the data lives
-
-### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
-  login; `auth = "required"`). Shows the live flow stream + a service graph for
-  roughly the last hour. Use it for "what is talking right now"; it is **not**
-  history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
-  (HTTP), both in `calico-system`.
- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
-  by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
-  empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
-  The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
-  whisker if its backend ever wedges for another reason.
-
-### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
-  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
-
-  ```
-  edge(src_ns text, dst_ns text, action text,
-       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
-       PRIMARY KEY (src_ns, dst_ns, action))
-  ```
-
-  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
-    action).
-  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
-    / public-internet) are **dropped** — the trail is about in-cluster service
-    relationships only. (Egress to the public internet is therefore NOT in this
-    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
-  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
-  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
-    is created idempotently by the aggregator at startup (canonical DDL also in
-    the repo at `migrations/0001_edge.sql`).
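-
-The schema above implies an upsert keyed on the primary key. The aggregator's actual statement is not shown in this runbook, so the following is only a sketch of one way such an upsert could be written in Postgres; the bind parameters `$1`..`$4` (source namespace, destination namespace, action, observed flow count) are assumptions:
-
-```sql
---- Hypothetical upsert: the first sighting inserts the row, later sightings
---- only advance last_seen and accumulate flow_count; first_seen is never
---- touched on conflict, which is what keeps "new edge" detection meaningful.
-INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
-VALUES ($1, $2, $3, now(), now(), $4)
-ON CONFLICT (src_ns, dst_ns, action) DO UPDATE
-SET last_seen  = EXCLUDED.last_seen,
-    flow_count = edge.flow_count + EXCLUDED.flow_count;
-```
-
-Because the key is `(src_ns, dst_ns, action)`, an `allow` and a `deny` between the same pair of namespaces would be two separate rows under this sketch.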
-
-### Slack `#alerts` — daily digest
-
-> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
-
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
-  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
-  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
-  — no new webhook was created.
-
-## How to enable / disable
-
-### Goldmane + Whisker (the flow plane)
-Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
-flags (those stay `false`; the operator's own `installation`/`apiServer` are
-operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
-
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
-  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
-  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
-  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
-  goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
-  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
-
-**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
-toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
-ADR-0014).
-
-### Whisker public ingress (infra #57)
-Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
-  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
-  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
-  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
-  This additive NP ORs in an allow for `namespaceSelector
-  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
-
-### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
-A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply it with
-`scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
-the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
-ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
-the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
-without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
-0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
-
-Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
-`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
-allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
-`local.ghcr_private_namespaces`), otherwise image pulls fail with 401. Code repo:
-`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
-
-## mTLS cert — the REUSE decision (cert-reuse gotcha)
-
-The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
-client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
-identity** — any Tigera-CA-signed cert is accepted.
-
-Rather than copy the Tigera CA **private key** into Terraform state to mint our
-own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
-with this repo's global generate-providers/lockfile pattern), the stack
-**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
-Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
-`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
-verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
-`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
-cross-namespace-mounted).
-
-> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
-> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
-> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
-> and no `last_seen` updates land in the `edge` table. Hardening follow-up
-> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
-> removed (which would delete the reused source Secret).
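-
-One way to check the "no `last_seen` updates" symptom from the database side is a freshness query against the `edge` table. This is a sketch rather than an existing tool; the column aliases are illustrative:
-
-```sql
---- How long ago did ANY edge last receive an update?
-SELECT max(last_seen)         AS newest_update,
-       now() - max(last_seen) AS staleness
-FROM edge;
-```
-
-A large `staleness` is only a hint: it is consistent with a stale client cert, but a stopped aggregator would look the same, so the pod logs remain the confirming signal.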
-
-The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
-and the default cert/CA paths; the default ServerName (host sans port) is a SAN
-on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
-`GOLDMANE_TLS_INSECURE` override is needed.
-
-## How to query who-talks-to-whom
-
-**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
-SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
-
-```
-homelab edges --ns <ns>         # edges touching <ns> (either direction)
-homelab edges --peers-of <ns>   # <ns>'s distinct peer namespaces
-homelab edges --src <ns>        # <ns>'s egress peers   (--dst <ns> for ingress)
-homelab edges --new-since 24h   # edges first seen in the last day (or a date)
-homelab edges --denied          # blocked / lateral-movement attempts
-homelab edges --json [...]      # machine-readable, for agents/pipelines
-homelab edges --help            # full flag list
-```
-
-For ad-hoc SQL, `psql` into the DB (creds: Vault static role
-`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
-the single `edge` table.
-
-```sql
--- Everything talking to a namespace (inbound), most-active first
-SELECT src_ns, action, flow_count, first_seen, last_seen
-FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-
--- Everything a namespace talks TO (outbound)
-SELECT dst_ns, action, flow_count, first_seen, last_seen
-FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-
--- New edges in the last 24h (what the digest reports)
-SELECT src_ns, dst_ns, action, flow_count, first_seen
-FROM edge WHERE first_seen > now() - interval '24 hours'
-ORDER BY first_seen DESC;
-
--- Any DENIED edges (policy is dropping this pair)
-SELECT src_ns, dst_ns, flow_count, last_seen
-FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-
--- Full edge set as a graph adjacency list
-SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
-```
-
-For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
-the `edge` table intentionally aggregates that away.
-
-## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
-
-The durable edge set is a faster, identity-stamped data source for the existing
-**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
-`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
-iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
-a better data source"). It replaces the *internal* (namespace-to-namespace) leg
-of the allowlist; **external/public-internet egress is NOT in this table** (empty
-dst namespace, dropped) — for those destinations keep using the Calico flow-log
-path described in security.md.
-
-**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
-given source is *observed* talking to with `action='allow'`:
-
-```sql
--- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
-SELECT DISTINCT dst_ns
-FROM edge
-WHERE src_ns = '<ns>' AND action = 'allow'
-ORDER BY dst_ns;
-```
-
-```sql
--- Full internal egress matrix for all namespaces at once
-SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
-FROM edge
-WHERE action = 'allow'
-GROUP BY src_ns
-ORDER BY src_ns;
-```
-
-```sql
--- Sanity: namespaces with a DENY edge already (policy is biting; investigate
--- before tightening further)
-SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
-```
-
-**How this feeds enforcement (scope):** the derived `dst_ns` set is the
-*internal* half of a namespace's egress allowlist — it tells you which
-in-cluster namespaces to permit before flipping that namespace to default-deny.
-The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
-the external destinations still come from the Wave-1 observation snapshot.
-**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
-the phased per-namespace default-deny rollout (starting `recruiter-responder`)
-is tracked under `code-8ywc`. Cross-links:
-[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
-[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
-[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
-
-> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
-> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
-> collect ≥7 days of edges before treating a namespace's `allow` set as
-> complete. The `first_seen` column tells you how long an edge has been known;
-> the digest surfaces brand-new ones daily.
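-
-The caveat above can be turned into a quick check before deriving an allowlist. The sketch below lists source namespaces whose earliest `allow` edge is less than 7 days old. It assumes the earliest `first_seen` is an acceptable proxy for how long a namespace has been observed, which would under-report for a namespace that was simply silent at first:
-
-```sql
---- Source namespaces whose allow set is probably still incomplete
---- (earliest allow edge first seen less than 7 days ago).
-SELECT src_ns,
-       min(first_seen)         AS earliest_edge,
-       now() - min(first_seen) AS observed_for,
-       count(*)                AS edge_rows
-FROM edge
-WHERE action = 'allow'
-GROUP BY src_ns
-HAVING now() - min(first_seen) < interval '7 days'
-ORDER BY observed_for;
-```
-
-An empty result means every source namespace has at least one `allow` edge that is 7 or more days old; it does not prove that every periodic job has already fired.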
-
-## Monitoring & health (infra #61)
-
-The aggregator pod has **no `/metrics` endpoint** — health is inferred from
-kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
-see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
-
-| Signal | What | Where |
-|---|---|---|
-| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
-| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
-| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
-
-The two alert layers are deliberately complementary: `AggregatorDown` →
-**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
-is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
-is the agreed floor.
-
-## Troubleshooting
-
-**Whisker UI 502 / unreachable.** The additive
-`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
-operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
-brand-new ingress host is also invisible to LAN split-horizon until the hourly
-`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
-`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
-(expect a 302 to Authentik — the gate working).
-
-**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
-2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
-policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
-*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
-`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
-**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
-Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
-kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
-whisker-backend resolves goldmane ONCE in the brief startup window before the
-policy programs, holds its long-lived gRPC stream, and only re-resolves when that
-stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
-DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
-... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
-SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
-
-FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
-(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
-ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
-the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
-the pod if it ever wedges for another reason. Immediate manual heal:
-`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
-from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
-10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
-query aimed at a kube-dns *pod IP* (always works).
-
-**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
-pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
-Common causes, in order:
-1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
-   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
-   handshake / `Flows.Stream` errors.
-2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
-   the pod kept the old one. The Deployment carries
-   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
-   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
-3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
-   reconnects automatically and resumes upserting. No data loss in the DB
-   (only the sub-hour live window in Whisker is gone).
-
-**Digest never posts / `DigestFailing` firing.** Inspect the most recent
-`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
-`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
-`ttl_seconds_after_finished=86400` GCs pods after a day, so check soon after a
-failed run. With `SLACK_WEBHOOK_URL` empty the binary forces a dry run (no post),
-so verify that the `goldmane-edges-slack` ExternalSecret resolved. To smoke-test,
-run the image with `args: ["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
-> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
-> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
-> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
-> the `#security` channel override returning HTTP 404 — the shared
-> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
-> consolidating all Slack output to `#alerts` fixed it.
-
-**No edges at all in the table.** Confirm Goldmane is enabled
-(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
-`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
-completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
-(ghcr allowlist).
-
-## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
-  `stacks/goldmane-edge-aggregator`, `stacks/calico`
--- a/docs/runbooks/homelab-vault-onboarding.md
+++ b/docs/runbooks/homelab-vault-onboarding.md
@ -1,164 +0,0 @@
-# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets)
-
-## Scope
-
-`homelab vault` fronts **two unrelated secret stores** — the name collides, so
-the command keeps them clearly separated:
-
- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP).
-  The verbs below give each devvm roster user no-HITL access to **their own**
-  Vaultwarden vault (and any Organization Collection shared with their account).
-  It shells out to the official `bw` CLI; the user's Vaultwarden credentials live
-  only in their isolated Vault path `secret/workstation/claude-users/<os-user>`
-  and are decrypted as that OS user — the admin never sees them.
- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the
-  `secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`.
-  These use the caller's **own** Vault token (`vault login -method=oidc` →
-  `~/.vault-token`), **not** the scoped Vaultwarden token (which only reads the
-  `claude-users/<user>` path); access is whatever your Vault policy grants.
-
-```text
-# Vaultwarden (password manager)
-homelab vault setup             one-time: store VW email + master password + API key
-homelab vault status            configured / unlocked / reachable (no secrets)
-homelab vault list [--search Q]  item names (no secrets)
-homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
-homelab vault get <name> --all  all fields (incl. custom) as JSON; pipe it (| jq)
-homelab vault code <name>       current TOTP code
-homelab vault lock              lock / log out the local bw session
-
-# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token)
-homelab vault kv get <path> [--field K]   read an infra KV secret
-homelab vault kv list <path>              list sub-paths
-homelab vault kv put <path> <key>         write one key (value via stdin; merges)
-```
-
-## How auth works (why a non-admin can use it)
-
-`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
-this order (`ensureVaultToken`, `cli/cmd_vault.go`):
-
-1. an explicit `$VAULT_TOKEN` (a deliberate override), then
-2. the per-user **scoped token** that `claude-auth-sync` maintains at
-   `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`), then
-3. a native `~/.vault-token` (admins who carry one; non-admins usually don't).
-
-**The scoped token deliberately beats `~/.vault-token`.** This tool only touches
-your own `secret/workstation/claude-users/<user>` path, and a power-user who ran
-`vault login -method=oidc` carries a read-only `~/.vault-token` (capability
-`deny` on that path); letting it win would shadow the scoped token and fail every
-op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The
-CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when
-unset, so it works from non-login shells (tmux panes, AFK agent subprocesses)
-that never sourced `/etc/environment` — otherwise every `vault` child hits the
-`127.0.0.1:8200` default and fails `connection refused` (exit 2).
-
-That scoped policy grants exactly `create`/`read`/`update` on the user's own
-`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
-tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
-`kv put` only when the path does not exist yet. This preserves the
-`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
-co-locates there. (The admin-only bugs were fixed 2026-06-27; the
-`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.)
-
-## Prerequisites (per user)
-
- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
-  been applied → their `workstation-claude-<user>` policy exists.
- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
-  token exists at `~/.config/claude-auth-sync/vault-token`.
- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
-  (self-service signup is open; admin panel is disabled).
-
-## One-time admin steps (devvm)
-
-`bw` must be system-wide so every user resolves it (it is a Node script, and
-`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
-to the npm `/usr` prefix; the guard checks the **system** path, not
-`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
-install, leaving non-admins with no backend). To install on a running box:
-
-```bash
-sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
-bw --version            # confirm /usr/bin/bw resolves
-```
-
-After landing a `cli/` change, rebuild the binary so users pick it up:
-
-```bash
-# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it
-sudo bash -c 'cd /home/wizard/code/infra/cli && \
-  go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \
-  -o /usr/local/bin/homelab .'
-```
-
-(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)
-
-## User onboarding
-
-The user runs these as themselves. The master password / API key are entered
-interactively (never on the command line) and stored only in the user's Vault
-path.
-
-1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
-   copy the `client_id` (`user.xxxx`) and `client_secret`.
-2. Configure:
-
-   ```bash
-   homelab vault setup        # prompts: VW email, API client_id/secret, master password
-   homelab vault status       # → "vault: configured, unlocked, reachable ✓"
-   homelab vault list         # item names (own vault + any shared Collections)
-   ```
-
-## Shared-Collection access (sharing passwords with a user)
-
-`homelab vault` surfaces Organization Collection items automatically once the
-user's Vaultwarden account is a confirmed member. These steps are done by the
-vault owner in the **Vaultwarden web UI** (they need the owner's master
-password — not an infra/Terraform operation):
-
-1. Create or reuse an **Organization** and a **Collection** of shared logins.
-2. **Invite** the user's Vaultwarden account to the Organization, granting
-   **"Can view"** on that Collection (least privilege).
-3. The user accepts the email invite and confirms membership.
-4. The user runs `homelab vault list` — the shared items now appear alongside
-   their own (a `homelab vault status` sync picks them up).
-
-## Security model (the no-HITL trade)
-
-Identity is the kernel UID. Anything running as the user can decrypt the user's
-vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
-never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
-fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
-(auto-clearing) rather than scrollback. The admin's Vault token is never used by
-a non-admin: each user authenticates with their own scoped token.
-
-## Verification
-
-```bash
-# the scoped token carries the right policy
-VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
-  vault token lookup -format=json | jq '.data.display_name, .data.policies'
-#   → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]
-
-sudo -u <user> -i bw --version        # /usr/bin/bw resolves for the user
-sudo -u <user> -i homelab vault status
-```
-
-## Troubleshooting
-
-**`homelab vault setup` (or any verb) fails with `exit status 2`** — older
-binaries swallowed the underlying `vault` error; the message now includes it.
-Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis):
-
- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the
-  caller's shell. The CLI now self-defaults it, but if you see this on an old
-  binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`.
- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/<user>`
-  → a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`,
-  policy `default`, capability `deny` on that path) was shadowing the scoped
-  token. The CLI now prefers the scoped token; on an old binary, `rm
-  ~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with
-  `VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/<user>`
-  → must be `create, read, update`.
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -36,13 +36,11 @@ envsubst on /template/job-template.yaml  | kubectl apply -f -
  ▼

 Job 0 — preflight       (pinned: k8s-node1)
-  ├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
+  ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
  ├── All nodes Ready + no Mem/Disk pressure
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
-  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
-  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@ -114,36 +112,18 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).

 This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.

-**The gate classifies each refusal** (2026-06-28) so it only cries wolf when
-there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason:
+**On a block**, the gate:
+- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
+  Prometheus alert),
+- Slacks the **specific reasons** (which addon/API/node, current vs required), and
+- **halts the chain** and exits **non-fatal** (the upgrade simply isn't safe yet;
+  this is not a failure). Because the block happens **before any mutation, no
+  rollback is involved**; nothing was changed.

- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in
-  the compat matrix** and upgrading it would clear the block (or an in-use
-  deprecated API must be migrated / a node's containerd bumped).
- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the
-  target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream
-  release can clear it.
- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is
-  **deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator,
-  whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel).
- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is
-  held — acting on the actionable ones wouldn't unblock it yet.
-
-**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1`
-for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
-doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
-decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
-before any mutation, so no rollback. Reasons (grouped by class) appear in the
-**morning nightly report**, not a per-run Slack.
-
- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
-  it by doing the named upgrade/migration; the next nightly run proceeds.
- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD`
-  line, because it can't be actioned now (a nightly alert would cry wolf). It
-  clears itself once upstream ships support (refresh `addon-compat.json`) or the
-  pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
-  night, silently re-spawning the refused-but-Complete preflight (so a cleared
-  block is picked up next run, not after the 7d Job TTL).
+**To clear a block**: upgrade the named addon (or migrate the API caller off the
+deprecated group/version, or bump containerd on the named node) so the offending
+condition no longer holds. The **next nightly run then proceeds automatically** —
+no manual chain restart needed.

 The **compat matrix** lives in
 `stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
@ -183,8 +163,6 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 | `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
 | `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
 | `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
-| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
-| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
 | `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
 | `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |

@ -193,8 +171,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
+- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
+- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
 - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.

 ### Nightly upgrade report (Slack)
@ -203,8 +181,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
 default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
 alert-digest) posts ONE Slack summary each morning of the previous night's run:
 running version, detector freshness, detected target + kind, the outcome
-(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
-🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
+(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
+🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
 the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
 blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
 Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
@ -244,34 +222,22 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names

 ## Common Operations

-### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
+### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)

 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-from kubeadm-config**. apiserver auth uses a structured multi-issuer
-`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
-still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
-reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
-NOT crash on this — verified by isolated repro; it's recoverable via the restore
-script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
-etcd IO starvation**, not this drift; post-mortem:
-`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
+and drops the `--authentication-config` flag**, silently disabling apiserver
+OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
+401). This used to require a manual re-apply after **every** control-plane bump.

-**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
-**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
-`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
-its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
-upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
-image change. Zero live impact (the CM is read only during an upgrade).
-
-**Backstops:**
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
-  NOT block — the drift only breaks SSO, which is recoverable) if
-  `--authentication-config` would still be dropped.
- The `rbac` stack still publishes its restore script to the
-  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
-  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
-  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
-  re-reconciles kubeadm-config. Self-skips when master is already at target.
+**Now automated:** the `rbac` stack publishes its OIDC restore script to the
+`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
+`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
+(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
+crashloop the operator). It's idempotent, health-gates `/livez` with
+auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
+apply (the version upgrade itself already succeeded). So a chain-driven
+control-plane bump no longer breaks SSO. The master phase self-skips when master
+is already at target, so this only runs when master was actually upgraded.

 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/docs/runbooks/pfsense-egress.md
+++ b/docs/runbooks/pfsense-egress.md
@ -1,72 +0,0 @@
-# Runbook: pfSense WAN / egress outage
-
-**Scope:** the cluster (and home) loses **internet egress** while pfSense is
-otherwise alive — internal VLAN routing and DNS keep working. This is the
-**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
-IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
-stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
-probe existed; the cloudflared replica metric stayed green). The alerts +
-probes below close that gap. Incident detail: memory ids #6715–#6723.
-
-pfSense is a **single point of failure** (no HA): it is the k8s default gateway
-(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
-**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
-Archer AX6000). It is the sole IPv4 default gateway; there is no gateway group or failover.
-
-## Alerts (all in `stacks/monitoring/modules/monitoring/`)
-
-| Alert | Signal | Means |
-|-------|--------|-------|
-| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
-| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
-| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
-| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
-| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
-| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
-
-Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
-NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
-/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
-alert pages, not a storm.
-
-`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
-the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
-metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
-
-## Diagnose (read-only first)
-
-1. **Confirm scope** — is it egress-only or total?
-   - In Grafana (`monitoring` namespace), check `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
-   - Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
-2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
-   ```
-   ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1      # devvm wizard key (id #6784)
-   clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss'   # dpinger gateway alarms
-   clog /var/log/routing.log  | grep -iE 'default|route'              # default-route add/delete
-   clog /var/log/system.log   | tail -200
-   netstat -rn | head                                                 # is the default route present?
-   ls -la /var/crash/                                                 # panic/textdump?
-   ```
-   (If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
-   config.xml — re-add the key via console or WebGUI; see id #6718.)
-3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
-   clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
-   fault is unlikely; a reboot fixing it points at **pfSense-side state**.
-
-## Recover
-
- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms
-  dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes
-  the volatile evidence needed to find the real mechanism).
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
-  WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
-  re-eval. Confirm `netstat -rn` shows the default route restored.
-
-## Prevent / harden (deferred, needs a live-pfSense change)
-
-Not done in this monitoring change; tracked for a follow-up with hands-on
-pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
-instead of an external IP and widen its thresholds; disable `gw_down_kill_states`
-for the single WAN; add a failover gateway group; add a 60s auto-recovery
-watchdog; and ship pfSense system/gateway/routing syslog to the cluster so these
-logs become centrally queryable.
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=48
+TOTAL_CHECKS=47

 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@@ -3156,44 +3156,6 @@ PYEOF
    esac
 }

-# --- 48. Goldmane edge-aggregator availability ---
-#
-# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
-# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
-# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
-# this check reads the Deployment's Available condition directly so the trail
-# silently dying surfaces in the health board (mirrors the AggregatorDown
-# Prometheus alert). Missing Deployment / not-Available -> FAIL.
-check_goldmane_aggregator() {
-    section 48 "Goldmane Edge-Aggregator"
-    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
-    local avail desired ready
-
-    # One get; absent Deployment is a hard fail (the trail isn't deployed).
-    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
-        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
-        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
-        json_add "goldmane_aggregator" "FAIL" "deployment missing"
-        return 0
-    fi
-
-    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
-        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
-    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
-    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
-    ready=${ready:-0}
-    desired=${desired:-0}
-
-    if [[ "$avail" == "True" ]]; then
-        pass "Edge-aggregator Available ($ready/$desired ready)"
-        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
-    else
-        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
-        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
-        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
-    fi
-}
-
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@@ -3262,7 +3224,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
+        check_immich_search check_csi_ghost_drift
    )

    # Auto-fix mutates cluster state inside individual checks — keep that
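The removed check 48 comes down to a small decision over three Deployment fields. A minimal sketch of that decision in isolation, assuming the values were already read with `kubectl ... -o jsonpath`; `classify_deployment` is a hypothetical helper and not a function in `cluster_healthcheck.sh`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a Deployment from values already fetched.
#   $1 = status of the "Available" condition ("True", "False", or empty)
#   $2 = .status.readyReplicas (empty when no pod is ready)
#   $3 = .spec.replicas
classify_deployment() {
    local avail="$1" ready="${2:-0}" desired="${3:-0}"
    if [[ "$avail" == "True" ]]; then
        echo "PASS ${ready}/${desired} ready"
    else
        # An empty condition is reported as "unknown", as in the removed check.
        echo "FAIL ${ready}/${desired} ready; Available=${avail:-unknown}"
    fi
}
```

Keeping the decision separate from the `kubectl` calls makes it testable without a cluster; the removed function interleaved both.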
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@@ -240,79 +240,6 @@ EOF
  log "wrote OIDC kubeconfig -> $user:~/.kube/config"
 }

-# Hands-off chrome-service browser credential. For a user who has a
-# `<os_user>-browser` ServiceAccount in the chrome-service namespace (created in
-# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT
-# context authenticates with that SA's long-lived token — so `homelab browser`
-# (which shells out to `kubectl port-forward -n chrome-service`) works
-# non-interactively, even from a headless agent session (the user's interactive
-# OIDC login can't authenticate a headless kubectl). The user's personal OIDC
-# identity is retained as the `oidc@homelab` named context
-# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of
-# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA
-# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts).
-install_browser_kubeconfig() {
-  local user="$1" home kc sa secret token server ca tmp
-  home="$(getent passwd "$user" | cut -d: -f6)"
-  [[ -z "$home" ]] && return 0
-  sa="${user}-browser"
-  secret="${sa}-token"
-  [[ -r "$ADMIN_KUBECONFIG" ]] || return 0
-  # Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read.
-  KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0
-  token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)"
-  [[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; }
-  server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')"
-  ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
-  [[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; }
-  kc="$home/.kube/config"
-  tmp="$(mktemp)"
-  cat > "$tmp" <<EOF
-apiVersion: v1
-kind: Config
-clusters:
-- name: homelab
-  cluster:
-    server: $server
-    certificate-authority-data: $ca
-contexts:
-- name: ${sa}@homelab
-  context:
-    cluster: homelab
-    user: $sa
-- name: oidc@homelab
-  context:
-    cluster: homelab
-    user: oidc
-current-context: ${sa}@homelab
-users:
-- name: $sa
-  user:
-    token: $token
-- name: oidc
-  user:
-    exec:
-      apiVersion: client.authentication.k8s.io/v1beta1
-      command: kubectl
-      args:
-      - oidc-login
-      - get-token
-      - --oidc-issuer-url=$OIDC_ISSUER
-      - --oidc-client-id=kubernetes
-      - --oidc-extra-scope=email
-      - --oidc-extra-scope=profile
-      - --oidc-extra-scope=groups
-      interactiveMode: IfAvailable
-EOF
-  if cmp -s "$tmp" "$kc" 2>/dev/null; then rm -f "$tmp"; return 0; fi   # already current -> no churn
-  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi
-  install -d -o "$user" -g "$user" -m 0700 "$home/.kube"
-  install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; }
-  rm -f "$tmp"
-  log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config"
-  return 0
-}
-
 # Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
 # T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
 env_set() {
@@ -667,7 +594,6 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
      refresh_user_clone   "$os_user" code
    fi
    install_user_kubeconfig "$os_user"
-    install_browser_kubeconfig "$os_user"    # hands-off chrome-service CLI cred (no-op unless the user has a browser SA)
    deploy_user_launcher "$os_user"          # keep ~/start-claude.sh current (skel only seeds new accounts)
  fi
  refresh_codex_mirror "$os_user"            # all tiers — mirror of the managed claudeMd
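The removed `install_browser_kubeconfig` describes itself as idempotent because it is "cmp-guarded": it renders the file to a temp path and only installs it when the content differs. A minimal sketch of that guard on its own, assuming GNU or BSD `cmp` and `install`; `install_if_changed` is a hypothetical name, and the owner/group handling of the original is left out:

```shell
#!/usr/bin/env bash
# Hypothetical helper: install $1 over $2 only when the content differs.
# Prints "unchanged" or "written" so a caller can log what happened.
install_if_changed() {
    local src="$1" dst="$2"
    if cmp -s "$src" "$dst" 2>/dev/null; then
        echo "unchanged"          # already current, so no rewrite and no churn
        return 0
    fi
    install -m 0600 "$src" "$dst" || return 1
    echo "written"
}
```

A missing destination makes `cmp -s` fail, so the first run always takes the "written" branch.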
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@@ -11,12 +11,6 @@ Environment=HOME=/home/%i
 Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
 Environment=NODE_ENV=production
 EnvironmentFile=/etc/t3-serve/%i.env
-# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
-# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
-# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
-# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
-# users on the normal per-user Enterprise-SSO credential flow).
-EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
 WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
--- a/scripts/test-claude-auth-sync.sh
+++ b/scripts/test-claude-auth-sync.sh
@@ -28,61 +28,5 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
 no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
 no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca

-# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
-# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
-# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
-# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
-fakebin="$tmp/bin"; mkdir -p "$fakebin"
-store="$tmp/vault-store.json"
-cat > "$fakebin/vault" <<'FAKE'
-#!/usr/bin/env bash
-# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
-[[ "$1" == kv ]] || { echo '{}'; exit 0; }   # token lookup etc. -> ignore
-op="$2"; shift 2
-store="$VAULT_FAKE_STORE"
-case "$op" in
-  get)
-    for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
-    if [[ "$*" == *-format=json* ]]; then
-      [[ -f "$store" ]] || { echo "No value found"; exit 2; }
-      jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
-    fi
-    [[ -f "$store" ]] || exit 2                # bare get == existence check
-    if [[ -n "${field:-}" ]]; then
-      v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
-      printf '%s' "$v"; exit 0
-    fi
-    exit 0 ;;
-  put)   echo '{}' > "$store" ;;                          # full replace
-  patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;;  # merge (rw)
-  *) exit 1 ;;
-esac
-for a in "$@"; do
-  case "$a" in
-    -*|secret/*) continue ;;                  # flags + the path arg
-    *=*) k="${a%%=*}"; v="${a#*=}"
-         t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
-  esac
-done
-exit 0
-FAKE
-chmod +x "$fakebin/vault"
-
-CAS_VAULT_PATH="secret/workstation/claude-users/test"
-CAS_CREDENTIALS="$tmp/credentials.json"
-CAS_STATE_DIR="$tmp/state"
-_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
-
-printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store"   # pretend `homelab vault setup` ran
-ok "backup succeeds (existing doc)"   cas_backup
-eq "merge preserves sibling key"      keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
-eq "merge writes claude oauth"        access  "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
-
-rm -f "$store"                                                    # fresh user: no doc yet
-ok "backup succeeds (creates doc)"    cas_backup
-eq "create writes claude oauth"       access  "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
-
-PATH="$_oldpath"; unset VAULT_FAKE_STORE
-
 printf '\n%d passed, %d failed\n' "$pass" "$fail"
 (( fail == 0 ))
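The removed regression test exists because a full replace of a shared KV document drops keys that other tools store there, while a merge keeps them. A toy in-memory model of the two write modes, assuming bash 4.2 or later for associative arrays; `doc_put` and `doc_patch` are hypothetical names and nothing here calls the Vault CLI:

```shell
#!/usr/bin/env bash
# Toy model of one KV document held in memory; NOT the Vault CLI.
declare -gA DOC=()

# Full replace: every existing key is dropped first (the blind-put behaviour).
doc_put() {
    DOC=()
    local kv
    for kv in "$@"; do DOC["${kv%%=*}"]="${kv#*=}"; done
}

# Merge: only the named keys change, sibling keys survive (the patch behaviour).
doc_patch() {
    local kv
    for kv in "$@"; do DOC["${kv%%=*}"]="${kv#*=}"; done
}
```

After `doc_put vaultwarden_master_password=keep-me`, a `doc_patch claude_ai_oauth_json=...` leaves the sibling key in place, whereas a second `doc_put` removes it.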
--- a/scripts/workstation/claude-auth-sync.sh
+++ b/scripts/workstation/claude-auth-sync.sh
@@ -13,10 +13,6 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
 CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
 CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
 CAS_LOG="$CAS_STATE_DIR/sync.log"
-# Where a long-lived per-user setup-token is materialized as an env file
-# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
-# already-ReadWritePaths config dir so the sandboxed service may write it.
-CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"

 cas_log() {
  mkdir -p "$CAS_STATE_DIR"
@@ -86,17 +82,7 @@ cas_backup() {
    return 1
  }
  expires="$(jq -r '.expiresAt' <<<"$oauth")"
-  # MERGE into the shared path so sibling keys other tools co-locate there
-  # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
-  # is read+update (needs no `patch` capability) but requires the secret to
-  # already exist, so create it with `kv put` on the very first backup only.
-  local -a write_cmd
-  if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
-    write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
-  else
-    write_cmd=(vault kv put "$CAS_VAULT_PATH")
-  fi
-  "${write_cmd[@]}" \
+  vault kv put "$CAS_VAULT_PATH" \
    claude_ai_oauth_json="$oauth" \
    credential_expires_at_ms="$expires" \
    backed_up_at="$(date -Is)" >/dev/null || {
@@ -137,41 +123,6 @@ cas_restore() {
  cas_log "RECOVERED restored Claude OAuth state from Vault"
 }

-# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
-# be stored in this user's OWN Vault path (field `setup_token`). When present it
-# is the authoritative credential: it bypasses the shared
-# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
-# users running many concurrent Claude sessions (interactive + t3-serve + always-on
-# agents) that otherwise race on refresh and wipe each other's refresh token.
-# We materialize it to a user-owned env file that start-claude.sh and
-# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
-# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
-# OS users. Returns 0 when a token is active, so the caller skips the
-# rotating-credential validate/backup/restore (probing the now-vestigial
-# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
-cas_sync_setup_token() {
-  local token desired tmp
-  token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
-  if [[ "$token" != sk-ant-oat01-* ]]; then
-    if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
-      rm -f "$CAS_TOKEN_ENV_FILE"
-      cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
-    fi
-    return 1
-  fi
-  desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
-  if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
-    cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
-    return 0
-  fi
-  tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
-  printf '%s\n' "$desired" > "$tmp"
-  chmod 0600 "$tmp"
-  mv "$tmp" "$CAS_TOKEN_ENV_FILE"
-  cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
-  return 0
-}
-
 cas_main() {
  umask 077
  for bin in jq vault claude timeout flock; do
@@ -182,11 +133,6 @@ cas_main() {
  flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }

  cas_prepare_vault || return 1
-  # A long-lived per-user setup-token, if provisioned, is authoritative and
-  # non-rotating — materialize it and skip the rotating-credential dance.
-  if cas_sync_setup_token; then
-    return 0
-  fi
  if cas_live_auth_ok; then
    cas_backup
    return
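The removed `cas_sync_setup_token` writes its env file by staging a temp file next to the destination and renaming it into place, so a reader never sees a half-written token. A minimal sketch of that write path alone, with the Vault lookup left out; `write_env_atomic` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one KEY=VALUE line to an env file at mode 0600.
#   $1 = destination env file, $2 = the full KEY=VALUE line
write_env_atomic() {
    local dest="$1" line="$2" tmp
    # Already current: leave the file untouched.
    if [[ -r "$dest" && "$(<"$dest")" == "$line" ]]; then
        return 0
    fi
    # Stage beside the destination so the final mv is a same-directory rename.
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1
    printf '%s\n' "$line" > "$tmp"
    chmod 0600 "$tmp"
    mv "$tmp" "$dest"
}
```

Staging in the destination directory, as the original does, keeps the temp file on the same filesystem as the target.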
--- a/scripts/workstation/claude-hooks/homelab-memory-recall.py
+++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py
@@ -45,15 +45,9 @@ def main() -> None:
    try:
        res = subprocess.run(
            [homelab, "memory", "recall", prompt, "--limit", "5"],
-            capture_output=True, text=True, errors="replace", timeout=4,
-            env=os.environ,
+            capture_output=True, text=True, timeout=4, env=os.environ,
        )
-    except Exception:
-        # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
-        # truncated multibyte (Cyrillic) output — must silently skip recall this
-        # turn, exactly like the MCP being unavailable. errors="replace" above
-        # also keeps a mid-rune-truncated payload from raising here at all. Never
-        # let this hook surface a "UserPromptSubmit hook error".
+    except (subprocess.TimeoutExpired, OSError):
        return

    out = (res.stdout or "").strip()
--- a/scripts/workstation/claude-skills/README.md
+++ b/scripts/workstation/claude-skills/README.md
@@ -19,29 +19,13 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.

 - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
 - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
-- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an
-  **emo-specific variant**, not a copy of the canonical skill. It started as a
-  copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
-  2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
-  in `SKILL_USERS`, a read-only power-user). The canonical admin skill
-  (`.claude/skills/cluster-health/`) is the full 47-check version and is left
-  untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
-  clobber the personalization. Maintain the two independently.

 ## Refreshing

-Re-snapshot the upstream skills from a current install and commit the diff:
+Re-snapshot from a current install and commit the diff:

 ```sh
 cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
 ```

-`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
-`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
-place here when emo's needs change, then refresh his live copy (the provisioner's
-`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
-copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
-`chown emo:emo`, or remove emo's copy and re-run the reconcile).
-
-Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
-personalized for emo 2026-06-26.
+Snapshot taken 2026-06-23.
--- a/scripts/workstation/claude-skills/cluster-health/SKILL.md
+++ b/scripts/workstation/claude-skills/cluster-health/SKILL.md
@@ -1,146 +0,0 @@
----
-name: cluster-health
-description: |
-  Personalized for emo. Check whether the homelab Kubernetes cluster is
-  affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
-  the MPPT ATS, lights, climate, security, irrigation). Use when:
-  (1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
-  (2) "is the cluster affecting Sofia / my devices",
-  (3) "check the cluster", "cluster health", "is everything running",
-  (4) a device on the Барзини → Статус dashboard looks offline.
-  Runs the cluster-wide healthcheck read-only and triages it by what
-  ha-sofia actually depends on; the rest of the cluster is the admin's area.
-author: Claude Code
-version: 3.0.0-emo
-date: 2026-06-26
----
-
-# Cluster Health — personalized for emo (ha-sofia focus)
-
-## What you actually care about
-
-You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
-the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
-irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
-cluster matters to you **only when it's breaking something ha-sofia or your
-devices depend on.** Anything else is the admin's (wizard's) area — note it in
-one line and move on; don't chase it.
-
-You have **read-only** cluster access. You can SEE everything but change
-nothing — so when something on your chain is broken, the job is to confirm it
-and hand it off, not to repair it.
-
-## How ha-sofia depends on the cluster
-
-ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
-**not** in the cluster. The cluster reaches it through exactly two things:
-
-1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
-   every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
-   + ATS stop responding. **This is the #1 thing to check.**
-2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
-   reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
-   for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
-   Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
-   you can't reach ha-sofia remotely.
-
-Everything else in the cluster is unrelated to you unless it's hosting one of
-those pods.
-
-## Step 1 — run the healthcheck (read-only, with your HA token)
-
-Your account can't read Vault, so load your own ha-sofia token first (it was
-minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
-the script from YOUR clone, read-only:
-
-```bash
-cd /home/emo/code
-export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
-bash scripts/cluster_healthcheck.sh --no-fix --quiet
-# machine-readable instead:
-# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
-```
-
-- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
-  will fail.
-- Exit codes: `0` healthy, `1` warnings, `2` failures.
-
-With the token exported, the **ha-sofia checks run for you**:
-26 Entity Availability · 27 Integration Health · 28 Automation Status ·
-29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
-classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
-IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
-covers the **tuya** exporter.
-
-## Step 2 — triage the output by relevance to YOU
-
-Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
-
-- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
-  `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
-  hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
-  **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
-- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
-  cluster issues (admin's area)" and don't investigate.
-
-## Step 3 — read-only checks for your chain
-
-All of these work with your read-only access:
-
-```bash
-# tuya-bridge — your devices + the ATS
-kubectl get pods -n tuya-bridge
-kubectl rollout status deploy/tuya-bridge -n tuya-bridge
-kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
-
-# the reachability path ha-sofia uses
-kubectl get pods -n cloudflared
-kubectl get pods -n traefik
-kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
-
-# whole external path in one shot (DNS + tunnel + Traefik + cert):
-curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
-#   reachable  -> HTTP/2 200 / 401 / 403  (any HTTP response = path is up)
-#   broken     -> curl: timeout / could not resolve host
-```
-
-The fastest **device-level** signal is your own dashboard: open
-**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
-Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
-house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
-
-## Step 4 — if something on your chain is broken
-
-You can't fix the cluster (read-only), so **capture + hand off**:
-
-```bash
-kubectl describe pod -n tuya-bridge <pod>
-kubectl logs -n tuya-bridge <pod> --previous --tail=200
-```
-
-Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
-Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
-above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
-alerting is already firing, but file it so it's tracked from your side too.
-
-## What will skip for you (expected — not failures)
-
-A few checks need access your account doesn't have. They warn/skip — that's
-normal, and **none of them are on your ha-sofia chain**:
-
-- **Uptime Kuma (14)** — needs an admin password from Vault.
-- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
-  and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
-- **`--fix`** — pod deletion (a write); not available to you.
-
-(The ha-sofia checks are **not** in this list — your token makes them work.)
-
-## Your ha-sofia token
-
-- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
-- It's a **dedicated** long-lived token, named `emo-cluster-health` under
-  ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
-  affects only you.
-- It currently carries admin-level HA scope (Home Assistant only lets a token
-  be minted for the account that created it, and it was minted via the admin
-  account). If it ever stops working, tell wizard and a fresh one can be minted.
--- a/scripts/workstation/managed-settings.json
+++ b/scripts/workstation/managed-settings.json
@@ -1,4 +1,4 @@
 {
-  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n  - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n  - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n  - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n  - Keep every clone on a clean master when done; tell the user in plain words what happened.\n  - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
+  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n  - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n  - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n  - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n  - Keep every clone on a clean master when done; tell the user in plain words what happened.\n  - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
  "model": "claude-opus-4-8"
 }
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@@ -72,14 +72,11 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
 fi

 # 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
-#     Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
-#     resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
-#     latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
-#     system install, leaving non-admins (emo, anca, …) with no backend. Pinned
-#     major; best-effort (a failure only disables `homelab vault`).
-if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
-  log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
-  npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
+#     npm-global so every user's PATH resolves it. Pinned major; best-effort (a
+#     failure only disables `homelab vault`, nothing else on the box).
+if ! command -v bw >/dev/null; then
+  log "npm: installing @bitwarden/cli (homelab vault backend)"
+  npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
 fi

 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
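The removed comment explains why the old guard tested fixed system paths rather than `command -v bw`: a binary in one user's own `~/.local/bin` satisfies `command -v` for that user without being installed for anyone else. A minimal sketch of a path-based guard, assuming the candidate paths are passed in by the caller; `has_system_binary` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if an executable exists at one of the
# given absolute paths, regardless of what the caller's PATH resolves.
has_system_binary() {
    local p
    for p in "$@"; do
        [ -x "$p" ] && return 0
    done
    return 1
}
```

The removed guard would correspond to `has_system_binary /usr/bin/bw /usr/local/bin/bw`; this commit goes back to the PATH-based `command -v bw` test.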
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@@ -93,15 +93,6 @@ ensure_onboarding() {
 }
 ensure_onboarding

-# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
-# materialized one from this user's own Vault path. A non-rotating setup-token
-# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
-# logs out users running many concurrent agents (interactive + t3 + always-on).
-# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
-# token; never shared between OS users.
-_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
-if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
-
 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
 # isn't destroyed-and-recreated in a ttyd auto-reconnect loop.
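The removed block loads the optional token file with `set -a; . file; set +a`, which exports every assignment made while the file is sourced so child processes inherit it. A minimal sketch of that pattern as a function; `load_env_file` and the variable name in the usage note are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: export every KEY=VALUE assignment found in an env file.
# A missing or unreadable file is a silent no-op, as in the removed block.
load_env_file() {
    local file="$1"
    [ -r "$file" ] || return 0
    set -a          # auto-export every variable assigned while sourcing
    . "$file"
    set +a
}
```

For a file containing `DEMO_OAUTH_VAR=abc`, a child started after `load_env_file` sees `DEMO_OAUTH_VAR` in its environment. Sourcing runs the file as shell code, so this only suits files the user owns and controls.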
--- a/stacks/actualbudget/main.tf
+++ b/stacks/actualbudget/main.tf
@@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/affine/main.tf
+++ b/stacks/affine/main.tf
@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -45,9 +42,6 @@ data "kubernetes_secret" "eso_secrets" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides DATABASE_URL that auto-updates when password rotates
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/authentik/Dockerfile
+++ b/stacks/authentik/Dockerfile
@ -1,46 +0,0 @@
-# SLOW-1a overlay over the official authentik server image.
-#
-# The login flow's identification stage renders each enabled source's UI login
-# button. Upstream authentik/stages/identification/stage.py does:
-#     current_stage.sources.filter(enabled=True).order_by("name").select_subclasses()
-# The bare no-arg select_subclasses() (django-model-utils InheritanceManager)
-# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s
-# (verified live on 2026.2.4: 1527ms vs 14ms). Passing only the subtypes that
-# actually render a UI login button — every concrete Source type that overrides
-# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim —
-# is ~100x faster and BYTE-IDENTICAL output (verified: concrete types + rendered
-# buttons match). django-model-utils accepts the lowercase subclass *accessor
-# names* as strings, so no new import is needed (no circular-import risk) — the
-# patch is a single, reviewable line edit.
-#
-# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in
-# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if
-# the upstream target line moved. If a future authentik version adds a NEW
-# login-capable source type, add its lowercase accessor to the list below.
-# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of
-# 2026-06-28) — drop this overlay once upstream narrows the query.
-FROM ghcr.io/goauthentik/server:2026.2.4
-
-USER root
-RUN set -eux; \
-    F=/authentik/stages/identification/stage.py; \
-    grep -q 'order_by("name").select_subclasses()' "$F"; \
-    sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \
-    grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \
-    PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \
-    rm -f /authentik/stages/identification/__pycache__/stage.*.pyc
-
-# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and
-# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already
-# ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
-# IE/old-Edge/PKeyAuth. patch-compat-sfe.py (a) extends compat_needs_sfe() to
-# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS —
-# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login
-# <a> links into the SFE shell (the SFE can't render Identification-stage sources;
-# needed for password-less Google-only accounts). Clients get the REAL authentik
-# login (password + MFA + reputation, NO auth downgrade) instead of a blank page.
-# The script is guarded (asserts both upstream anchors + ast-parses) so the build
-# fails loudly if upstream moves — re-verify on every authentik bump.
-COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py
-RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py
-USER authentik
--- a/stacks/authentik/admin-services-restriction.tf
+++ b/stacks/authentik/admin-services-restriction.tf
@ -49,15 +49,14 @@ resource "authentik_policy_expression" "admin_services_restriction" {

    host = request.context.get("host", "")

-    # chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser
-    # sessions from the SHARED persistent profile. Originally Viktor-only.
-    # 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo
-    # (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling +
-    # captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed
-    # sessions (the CLI half is the emo-browser ServiceAccount in
-    # stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR
-    # email so neither attribute alone can lock anyone out.
-    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"}
+    # chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
+    # logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
+    # "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a
+    # member. akadmin kept as break-glass. The homelab-browser CDP path is
+    # already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
+    # so this closes the only remaining, human, noVNC path. Match username OR
+    # email so neither attribute alone can lock Viktor out.
+    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
    if host == "chrome.viktorbarzin.me":
        return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED

--- a/stacks/authentik/email-secret.tf
+++ b/stacks/authentik/email-secret.tf
@ -6,9 +6,6 @@
 # are non-secret and live in values.yaml. The reloader annotation rolls the
 # authentik pods if the password ever changes.
 resource "kubernetes_manifest" "authentik_email_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/authentik/modules/authentik/main.tf
+++ b/stacks/authentik/modules/authentik/main.tf
@ -29,12 +29,7 @@ resource "kubernetes_namespace" "authentik" {
    labels = {
      tier                               = var.tier
      "resource-governance/custom-quota" = "true"
-      # Keel intentionally NOT enrolled: server+worker run our custom overlay image
-      # (ghcr.io/viktorbarzin/authentik-server — see values.yaml global.image +
-      # stacks/authentik/Dockerfile). The tag is pinned explicitly and bumped
-      # manually (rebuild the overlay FROM the new authentik version + repoint), so
-      # a Keel auto-bump would only risk re-introducing the upstream tag / the
-      # 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped.
+      "keel.sh/enrolled"                 = "true"
    }
  }
  lifecycle {
@ -87,11 +82,6 @@ module "ingress" {
  service_name     = "goauthentik-server"
  tls_secret_name  = var.tls_secret_name
  anti_ai_scraping = false
-  # Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
-  # the login SPA + flow-executor API burst on a cold load otherwise 429s into
-  # a blank screen (see traefik middleware "authentik-rate-limit").
-  skip_default_rate_limit = true
-  extra_middlewares       = ["traefik-authentik-rate-limit@kubernetescrd"]
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Authentik"
@ -159,12 +149,5 @@ module "ingress-static" {
  tls_secret_name   = var.tls_secret_name
  anti_ai_scraping  = false
  homepage_enabled  = false
-  # /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
-  # cold-load fan-out → blank screen. Dedicated 100/1000 carve-out (note the two
-  # namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
-  skip_default_rate_limit = true
-  extra_middlewares = [
-    "authentik-static-cache-headers@kubernetescrd",
-    "traefik-authentik-rate-limit@kubernetescrd",
-  ]
+  extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
 }
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@ -39,16 +39,6 @@ server:
      value: "3"
    - name: AUTHENTIK_WEB__THREADS
      value: "4"
-    # Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
-    # A worker recycle that coincides with a transient PG/pgbouncer blip stalls
-    # in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
-    # in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
-    # recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
-    # 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
-    - name: AUTHENTIK_WEB__MAX_REQUESTS
-      value: "10000"
-    - name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
-      value: "1000"
    # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
    # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
    # SELECT — but a single indexed lookup beats re-planning the flow
@ -97,28 +87,11 @@ server:
  livenessProbe:
    failureThreshold: 6
    timeoutSeconds: 5
-  # Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
-  # readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer
-  # transient otherwise returns 503 and drops ALL 3 server pods from the Service
-  # at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
-  # screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
-  # still reaps a truly hung pod. Partial override — the chart deep-merges the
-  # httpGet path /-/health/ready/ (same as the livenessProbe override above).
-  readinessProbe:
-    failureThreshold: 8
-    periodSeconds: 10
-    timeoutSeconds: 5
-  # RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
-  # (authentik.server reads .Values.server.deploymentStrategy) — the old
-  # `strategy:` key was silently ignored, so live ran the chart default 25%/25%
-  # and every rolling event dropped a server pod out of rotation, amplifying the
-  # NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
-  # a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
-  deploymentStrategy:
+  strategy:
    type: RollingUpdate
    rollingUpdate:
-      maxSurge: 1
-      maxUnavailable: 0
+      maxSurge: 0
+      maxUnavailable: 1
  resources:
    requests:
      cpu: 100m
@ -145,23 +118,15 @@ server:
 global:
  addPrometheusAnnotations: true
  image:
-    # CUSTOM OVERLAY: two thin patches over the official authentik server image
-    # (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow
-    # select_subclasses() query, ~1.4s -> ~14ms; (2) serve authentik's no-JS SFE
-    # login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on
-    # iOS<=16.3 so old devices (e.g. iPadOS<=15) get a working login instead of a
-    # blank page, and injects social-login links into the SFE (it can't render
-    # sources; needed for password-less Google-only accounts). Built by
-    # .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server
-    # (public package, anonymous pull — no imagePullSecret needed, like the
-    # upstream goauthentik image). Keel is NO LONGER enrolled for this namespace
-    # (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag
-    # to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned
-    # explicitly here to prevent the 2026-06-10 downgrade-boot-storm class.
-    # UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. ->
-    # 2026.3.0-patch1), let GHA rebuild, then apply.
-    repository: ghcr.io/viktorbarzin/authentik-server
-    tag: "2026.2.4-patch3"
+    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
+    # namespace) bumps the IMAGE between chart releases, while helm defaults
+    # the tag to the chart appVersion — so any helm upgrade silently
+    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
+    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
+    # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md).
+    # Keep this tag in sync with what Keel has deployed when touching this
+    # chart; clear it only when bumping the chart version itself.
+    tag: "2026.2.4"

 worker:
  # 2 replicas: workers handle background tasks (LDAP sync, email,
@ -201,10 +166,7 @@ worker:
        secretKeyRef:
          name: authentik-email
          key: AUTHENTIK_EMAIL__PASSWORD
-  # Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
-  # serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
-  # the dead-key cleanup so the declared intent actually takes effect.
-  deploymentStrategy:
+  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
--- a/stacks/authentik/patch-compat-sfe.py
+++ b/stacks/authentik/patch-compat-sfe.py
@ -1,96 +0,0 @@
-#!/usr/bin/env python3
-"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA).
-
-authentik's modern flow SPA is ES2022 (static{} init blocks) that hard-fail on
-Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and render a COMPLETELY BLANK login.
-authentik ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
-IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage
-sources (social-login buttons) — authentik docs list "Sources" as unsupported.
-
-This patch does TWO things, both guarded (assert the upstream anchor + verify the
-result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every
-authentik upgrade.
-
-  1. flows/views/interface.py::compat_needs_sfe() -> also return True for old
-     Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode
-     iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS,
-     Firefox/FxiOS, Edge — all share the system WebKit) on iOS <= 16.3. So old
-     iPads get the SFE on EVERY browser, not just Safari.
-
-  2. flows/templates/if/flow-sfe.html -> inject static social-login <a> links
-     (plain redirects to /source/oauth/login/<slug>/, work on ANY browser) so SFE
-     users (who otherwise see only username/password) can use social login —
-     required for accounts with no password (e.g. Google-only users like emo).
-"""
-import ast
-import glob
-import os
-
-# --- Patch 1: compat_needs_sfe() UA gate -------------------------------------
-INTERFACE = "/authentik/flows/views/interface.py"
-ANCHOR = (
-    '        if "PKeyAuth" in ua["string"]:\n'
-    "            return True\n"
-    "        return False"
-)
-REPLACEMENT = (
-    '        if "PKeyAuth" in ua["string"]:\n'
-    "            return True\n"
-    "        # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n"
-    "        # login) -> serve the SFE (real authentik login). (a) desktop-mode\n"
-    "        # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n"
-    '        if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n'
-    "            try:\n"
-    '                _maj = int(ua["user_agent"]["major"] or 0)\n'
-    '                _min = int(ua["user_agent"]["minor"] or 0)\n'
-    "            except (TypeError, ValueError):\n"
-    "                _maj = _min = 0\n"
-    "            if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n"
-    "                return True\n"
-    "        # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n"
-    "        # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n"
-    '        if ua["os"]["family"] == "iOS":\n'
-    "            try:\n"
-    '                _omaj = int(ua["os"]["major"] or 0)\n'
-    '                _omin = int(ua["os"]["minor"] or 0)\n'
-    "            except (TypeError, ValueError):\n"
-    "                _omaj = _omin = 0\n"
-    "            if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n"
-    "                return True\n"
-    "        return False"
-)
-src = open(INTERFACE).read()
-assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed"
-assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}"
-src = src.replace(ANCHOR, REPLACEMENT)
-open(INTERFACE, "w").write(src)
-ast.parse(src)
-assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read()
-for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"):
-    os.remove(pyc)
-
-# --- Patch 2: social-login links on the SFE shell ----------------------------
-SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html"
-HTML_ANCHOR = (
-    "        </main>\n"
-    "        <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
-)
-HTML_REPLACEMENT = (
-    "        </main>\n"
-    "        <!-- OVERLAY: the SFE can't render Identification-stage sources, so add\n"
-    "             static social-login links (plain redirects, work on any browser).\n"
-    "             Re-verify slugs on source changes; shown on all SFE flows. -->\n"
-    '        <div class="form-signin w-100 m-auto pt-2 mt-2 border-top">\n'
-    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/google/">Continue with Google</a>\n'
-    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/github/">Continue with GitHub</a>\n'
-    '          <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/facebook/">Continue with Facebook</a>\n'
-    "        </div>\n"
-    "        <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
-)
-html = open(SFE_HTML).read()
-assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}"
-html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT)
-open(SFE_HTML, "w").write(html)
-assert "Continue with Google" in open(SFE_HTML).read()
-
-print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE")
--- a/stacks/beads-server/main.tf
+++ b/stacks/beads-server/main.tf
@ -601,9 +601,6 @@ resource "kubernetes_config_map" "beadboard_config" {
 # Pulls the claude-agent-service bearer token from Vault so BeadBoard can
 # dispatch agent jobs via the in-cluster HTTP API.
 resource "kubernetes_manifest" "beadboard_agent_service_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/broker-sync/main.tf
+++ b/stacks/broker-sync/main.tf
@ -28,9 +28,6 @@ resource "kubernetes_namespace" "broker_sync" {
 #   trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency}
 #   imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -22,7 +22,7 @@ resource "kubernetes_namespace" "calico_system" {
    name = "calico-system"
    labels = {
      name = "calico-system"
-      # calico-system namespace is managed by tigera-operator — auto-update is
+# calico-system namespace is managed by tigera-operator — auto-update is
      # incompatible (operator reverts DaemonSet image from its Installation CR).
      # "keel.sh/enrolled" = "true"
    }
@ -212,229 +212,3 @@ resource "kubectl_manifest" "whisker" {
    spec       = { notifications = "Disabled" }
  })
 }
-
-# ---------------------------------------------------------------------------
-# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
-#
-# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
-# Whisker ships NO own login — it's an admin observability UI, so Authentik
-# forward-auth is the only gate between strangers and the flow view). The
-# operator replicated `tls-secret` into calico-system already.
-#
-# TWO coupled pieces are required because the operator's own `whisker`
-# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
-# with NO ingress rules => default-deny on ingress to the whisker pod. The
-# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
-# across policies selecting the same pod), so we never edit the operator NP.
-module "ingress_whisker" {
-  source          = "../../modules/kubernetes/ingress_factory"
-  dns_type        = "proxied"
-  namespace       = "calico-system"
-  name            = "whisker"
-  service_name    = "whisker"
-  port            = 8081
-  auth            = "required"
-  tls_secret_name = "tls-secret"
-  extra_annotations = {
-    "gethomepage.dev/enabled"     = "true"
-    "gethomepage.dev/name"        = "Whisker"
-    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
-    "gethomepage.dev/icon"        = "calico.png"
-    "gethomepage.dev/group"       = "Infrastructure"
-  }
-}
-
-# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
-# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
-# can reach the UI without touching the operator-owned policy.
-resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
-  metadata {
-    name      = "whisker-allow-traefik"
-    namespace = "calico-system"
-  }
-  spec {
-    pod_selector {
-      match_labels = {
-        "app.kubernetes.io/name" = "whisker"
-      }
-    }
-    policy_types = ["Ingress"]
-    ingress {
-      from {
-        namespace_selector {
-          match_labels = {
-            "kubernetes.io/metadata.name" = "traefik"
-          }
-        }
-      }
-      ports {
-        port     = "8081"
-        protocol = "TCP"
-      }
-    }
-  }
-}
-
-# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
-#
-# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
-# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
-# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
-# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
-# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
-# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
-# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
-# fine). whisker-backend resolves once in the brief startup window before the
-# policy programs, establishes its long-lived gRPC stream, and only re-resolves
-# when that stream breaks — at which point the blocked ClusterIP DNS wedges its
-# Go resolver and the UI goes empty (the durable aggregator, in its own
-# unrestricted namespace, is unaffected). k8s egress policies are additive, so
-# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
-# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
-# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
-resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
-  metadata {
-    name      = "whisker-allow-dns-clusterip"
-    namespace = "calico-system"
-  }
-  spec {
-    pod_selector {
-      match_labels = {
-        "app.kubernetes.io/name" = "whisker"
-      }
-    }
-    policy_types = ["Egress"]
-    egress {
-      # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR
-      # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
-      to {
-        ip_block {
-          cidr = "10.96.0.10/32"
-        }
-      }
-      ports {
-        port     = "53"
-        protocol = "UDP"
-      }
-      ports {
-        port     = "53"
-        protocol = "TCP"
-      }
-    }
-  }
-}
-
-# ---------------------------------------------------------------------------
-# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
-#
-# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
-# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
-# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
-# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
-# resolver spams `failed to stream flows` / `code = Unavailable` and never
-# reconnects -> empty UI, while the durable aggregator in its own namespace is
-# unaffected), nothing else would restart it. Whisker is operator-managed
-# (Whisker CR) so we can't inject a probe; this is the supported-pattern
-# alternative. With the DNS fix in place it should rarely, if ever, fire.
-#
-# It restarts the pod ONLY when the wedged signature is present AND Goldmane is
-# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod
-# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md.
-resource "kubernetes_service_account" "whisker_watchdog" {
-  metadata {
-    name      = "whisker-watchdog"
-    namespace = kubernetes_namespace.calico_system.metadata[0].name
-  }
-}
-
-# Namespaced Role (least privilege — only calico-system): read pod logs to
-# detect the wedge, delete the whisker pod to heal it.
-resource "kubernetes_role" "whisker_watchdog" {
-  metadata {
-    name      = "whisker-watchdog"
-    namespace = kubernetes_namespace.calico_system.metadata[0].name
-  }
-  rule {
-    api_groups = [""]
-    resources  = ["pods"]
-    verbs      = ["get", "list", "delete"]
-  }
-  rule {
-    api_groups = [""]
-    resources  = ["pods/log"]
-    verbs      = ["get"]
-  }
-}
-
-resource "kubernetes_role_binding" "whisker_watchdog" {
-  metadata {
-    name      = "whisker-watchdog"
-    namespace = kubernetes_namespace.calico_system.metadata[0].name
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.whisker_watchdog.metadata[0].name
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = kubernetes_service_account.whisker_watchdog.metadata[0].name
-    namespace = kubernetes_namespace.calico_system.metadata[0].name
-  }
-}
-
-resource "kubernetes_cron_job_v1" "whisker_watchdog" {
-  metadata {
-    name      = "whisker-watchdog"
-    namespace = kubernetes_namespace.calico_system.metadata[0].name
-  }
-  spec {
-    schedule                      = "*/10 * * * *"
-    successful_jobs_history_limit = 1
-    failed_jobs_history_limit     = 1
-    concurrency_policy            = "Forbid"
-    job_template {
-      metadata {
-        name = "whisker-watchdog"
-      }
-      spec {
-        template {
-          metadata {
-            name = "whisker-watchdog"
-          }
-          spec {
-            service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name
-            container {
-              name  = "watchdog"
-              image = "bitnami/kubectl:latest"
-              command = ["/bin/sh", "-c", <<-EOT
-                set -eu
-                NS=calico-system
-                # Don't thrash if Goldmane itself is down — that's not a whisker bug.
-                if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \
-                     -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then
-                  echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0
-                fi
-                ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \
-                  | grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true)
-                ERRS=$${ERRS:-0}
-                if [ "$ERRS" -ge 10 ]; then
-                  echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod"
-                  kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found
-                else
-                  echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m"
-                fi
-              EOT
-              ]
-            }
-            restart_policy = "Never"
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
-  }
-}
--- a/stacks/changedetection/main.tf
+++ b/stacks/changedetection/main.tf
@ -19,9 +19,6 @@ resource "kubernetes_namespace" "changedetection" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/chrome-service/files/novnc/entrypoint.sh
+++ b/stacks/chrome-service/files/novnc/entrypoint.sh
@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
  sleep 2
 done

-# Both x11vnc and websockify run as supervised children of this entrypoint (PID
-# 1) so their logs land on container stdout and the `wait -n` at the end can catch
-# either one dying. `-noshm` skips MIT-SHM probes that fail across container
-# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
-# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
+# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout.
+# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
+# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
+# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
 echo "starting x11vnc -> :5900"
 x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
       -forever -shared -noshm -noxdamage -quiet 2>&1 &
+X11VNC_PID=$!

 for i in 1 2 3 4 5 6 7 8 9 10; do
  if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@ -43,18 +43,4 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
 fi

 echo "starting websockify -> :6080"
-# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
-# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
-# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
-# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
-# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
-# the noVNC view went black until a manual pod restart. Now if EITHER process
-# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
-# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
-# across browser-container restarts. (Same supervision pattern as the
-# android-emulator stack's entrypoint.)
-websockify --web=/usr/share/novnc 6080 localhost:5900 &
-
-wait -n || true
-echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
-exit 1
+exec websockify --web=/usr/share/novnc 6080 localhost:5900
--- a/stacks/chrome-service/main.tf
+++ b/stacks/chrome-service/main.tf
@ -41,9 +41,6 @@ resource "kubernetes_namespace" "chrome_service" {
 # --- Secrets (single-key extract: api_bearer_token) ---

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -333,23 +330,15 @@ resource "kubernetes_deployment" "chrome_service" {
        container {
          name = "novnc"
          # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
-          # SHA-pinned (not :latest): Keel is OFF for this deployment
-          # (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a
-          # rebuilt image, so a new noVNC entrypoint only deploys when this digest
-          # is bumped here. Bump after build-chrome-service-novnc.yml pushes a new
-          # SHA tag — then WAIT for that apply pipeline to finish before pushing
-          # anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply
-          # mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got
-          # killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix
-          # (noVNC went black after a browser-container restart; see
-          # docs/architecture/chrome-service.md "x11vnc supervision").
-          image             = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40"
+          image             = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
          image_pull_policy = "IfNotPresent"
          # Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
          # nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
          # so every VNC connection hangs on "Connecting" until it times out
-          # (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this;
-          # the wrapper keeps the cap deterministic even off a cached image.
+          # (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
+          # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
+          # isn't guaranteed to be pulled — this wrapper applies the cap
+          # deterministically on every rollout off the cached image.
          command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
          port {
            name           = "http"
@ -359,13 +348,9 @@ resource "kubernetes_deployment" "chrome_service" {
          # x11vnc connects to the chrome-service container's Xvfb over
          # localhost TCP (shared pod network). Same uid 1000 as chrome
          # container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
-          # 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under
-          # ACTIVE VNC use — x11vnc + websockify framebuffer/encode buffers spike
-          # well past idle (~37Mi) when a client streams the 1280x720 screen, so the
-          # noVNC view froze/hung on connect. Bumped 2026-06-28.
          resources {
-            requests = { cpu = "10m", memory = "64Mi" }
-            limits   = { memory = "256Mi" }
+            requests = { cpu = "10m", memory = "32Mi" }
+            limits   = { memory = "96Mi" }
          }
        }

--- a/stacks/chrome-service/rbac.tf
+++ b/stacks/chrome-service/rbac.tf
@ -1,95 +0,0 @@
-# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC.
-#
-# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single
-# chrome-service browser rather than getting an isolated instance. The noVNC half of
-# that grant is the Authentik allowlist in
-# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the
-# CLI half — it lets emo's `homelab browser` reach the headed Chrome over CDP.
-#
-# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service`
-# (cli/browser.go). emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and
-# can't authenticate a headless agent session, and his power-user tier has no
-# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token
-# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT
-# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his
-# personal OIDC login stays available as the `oidc@homelab` named context.
-#
-# TRADE-OFF (accepted): CDP access == full control of the shared browser, including
-# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live.
-# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent
-# to sharing one browser (the isolated per-user instance was declined).
-# See docs/architecture/chrome-service.md "Multi-user access".
-
-resource "kubernetes_service_account" "emo_browser" {
-  metadata {
-    name      = "emo-browser"
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-  }
-}
-
-# Long-lived (non-expiring) token for the SA — the devvm provisioner reads this and
-# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf.
-resource "kubernetes_secret" "emo_browser_token" {
-  metadata {
-    name      = "emo-browser-token"
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-    annotations = {
-      "kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name
-    }
-  }
-  type                           = "kubernetes.io/service-account-token"
-  wait_for_service_account_token = true
-}
-
-# The ONLY verb emo's SA lacks for `kubectl port-forward svc/chrome-service`: the
-# port-forward subresource. (get/list of pods + services + endpoints comes from the
-# cluster-read binding below.) Namespace-scoped to chrome-service.
-resource "kubernetes_role" "browser_portforward" {
-  metadata {
-    name      = "chrome-service-portforward"
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-  }
-  rule {
-    api_groups = [""]
-    resources  = ["pods/portforward"]
-    verbs      = ["create"]
-  }
-}
-
-resource "kubernetes_role_binding" "emo_browser_portforward" {
-  metadata {
-    name      = "emo-browser-portforward"
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.browser_portforward.metadata[0].name
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = kubernetes_service_account.emo_browser.metadata[0].name
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-  }
-}
-
-# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound
-# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without
-# this his everyday `kubectl get ...` would regress — AND port-forward itself needs
-# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly).
-# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name.
-resource "kubernetes_cluster_role_binding" "emo_browser_readonly" {
-  metadata {
-    name = "emo-browser-readonly"
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "ClusterRole"
-    name      = "oidc-power-user-readonly"
-  }
-  subject {
-    kind      = "ServiceAccount"
-    name      = kubernetes_service_account.emo_browser.metadata[0].name
-    namespace = kubernetes_namespace.chrome_service.metadata[0].name
-  }
-}
--- a/stacks/ci-pipeline-health/main.tf
+++ b/stacks/ci-pipeline-health/main.tf
@ -49,9 +49,6 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
 # billing on PRIVATE mirrors, which a future scoped read:packages rotation of
 # the alias could not do. Blast radius = this single-CronJob namespace.
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-agent-service/main.tf
+++ b/stacks/claude-agent-service/main.tf
@ -38,9 +38,6 @@ resource "kubernetes_namespace" "claude_agent" {
 # --- Secrets ---

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-breakglass/main.tf
+++ b/stacks/claude-breakglass/main.tf
@ -57,9 +57,6 @@ resource "kubernetes_service_account" "breakglass" {
 # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
 # pod can never read it.
 resource "kubernetes_manifest" "external_secret_ssh" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -85,9 +82,6 @@ resource "kubernetes_manifest" "external_secret_ssh" {
 # Env secrets: the Anthropic OAuth token (shared with claude-agent-service —
 # same account) and the app bearer token (in-cluster/CLI fallback caller auth).
 resource "kubernetes_manifest" "external_secret_env" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/claude-memory/main.tf
+++ b/stacks/claude-memory/main.tf
@ -29,9 +29,6 @@ resource "kubernetes_namespace" "claude-memory" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -60,9 +57,6 @@ resource "kubernetes_manifest" "external_secret" {

 # DB credentials from Vault database engine (rotated every 24h)
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/coturn/main.tf
+++ b/stacks/coturn/main.tf
@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "public_ip" { type = string }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/dawarich/main.tf
+++ b/stacks/dawarich/main.tf
@ -23,9 +23,6 @@ resource "kubernetes_namespace" "dawarich" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -745,10 +745,7 @@ resource "kubernetes_deployment" "phpmyadmin" {
    labels = {
      "app" = "phpmyadmin"
      tier  = var.tier
-      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
-      # namespace alone can't attribute Goldmane flows. Value = the fronting
-      # Service name (kubernetes_service.phpmyadmin is named "pma").
-      "service-identity" = "pma"
+
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -765,10 +762,6 @@ resource "kubernetes_deployment" "phpmyadmin" {
      metadata {
        labels = {
          "app" = "phpmyadmin"
-          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
-          # disambiguating identity must live on the pod template (not just
-          # the Deployment metadata above). Not in selector → no replace.
-          "service-identity" = "pma"
        }
      }
      spec {
@ -819,19 +812,8 @@ resource "kubernetes_deployment" "phpmyadmin" {
    }
  }
  lifecycle {
-    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-      # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
-      # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
-      # the daily drift plan) doesn't fight them or revert the live image —
-      # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
-      metadata[0].annotations["keel.sh/policy"],
-      metadata[0].annotations["keel.sh/trigger"],
-      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
-      metadata[0].annotations["keel.sh/match-tag"],
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
-      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
-    ]
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
 }

@ -1517,10 +1499,6 @@ resource "kubernetes_deployment" "pgadmin" {
    }
    labels = {
      tier = var.tier
-      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
-      # namespace alone can't attribute Goldmane flows. Value = the fronting
-      # Service name (kubernetes_service.pgadmin is named "pgadmin").
-      "service-identity" = "pgadmin"
    }
  }
  spec {
@ -1536,10 +1514,6 @@ resource "kubernetes_deployment" "pgadmin" {
      metadata {
        labels = {
          app = "pgadmin"
-          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
-          # disambiguating identity must live on the pod template (not just
-          # the Deployment metadata above). Not in selector → no replace.
-          "service-identity" = "pgadmin"
        }
      }
      spec {
@ -1594,20 +1568,8 @@ resource "kubernetes_deployment" "pgadmin" {
    }
  }
  lifecycle {
-    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-      # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
-      # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
-      # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
-      # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
-      # annotations — canonical guard, matches linkwarden/chrome-service.
-      metadata[0].annotations["keel.sh/policy"],
-      metadata[0].annotations["keel.sh/trigger"],
-      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
-      metadata[0].annotations["keel.sh/match-tag"],
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
-      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
-    ]
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
 }
 resource "kubernetes_service" "pgadmin" {
--- a/stacks/diun/main.tf
+++ b/stacks/diun/main.tf
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "diun" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/ebooks/main.tf
+++ b/stacks/ebooks/main.tf
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "ebooks" {

 # ExternalSecrets for all three sources
 resource "kubernetes_manifest" "calibre_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -50,9 +47,6 @@ resource "kubernetes_manifest" "calibre_external_secret" {
 }

 resource "kubernetes_manifest" "audiobookshelf_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -80,9 +74,6 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
 }

 resource "kubernetes_manifest" "servarr_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@ -33,9 +33,6 @@ resource "kubernetes_namespace" "f1-stream" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -65,9 +62,6 @@ resource "kubernetes_manifest" "external_secret" {
 # Pull the chrome-service bearer token into this namespace as a separate
 # Secret so the verifier can reach the in-cluster Playwright pool.
 resource "kubernetes_manifest" "chrome_service_client_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/fire-planner/main.tf
+++ b/stacks/fire-planner/main.tf
@ -53,9 +53,6 @@ resource "kubernetes_namespace" "fire_planner" {
 # Seed before applying:
 #   secret/fire-planner -> property `recompute_bearer_token`
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -118,9 +115,6 @@ resource "kubernetes_manifest" "external_secret" {
 # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
 # as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -165,9 +159,6 @@ resource "kubernetes_manifest" "db_external_secret" {
 # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
 # fire-planner ingest reads those tables via this role.
 resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -459,90 +450,6 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
  ]
 }

-# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after
-# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE
-# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which
-# the wealth Grafana dashboard's "FIRE Countdown" section reads.
-resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" {
-  metadata {
-    name      = "fire-planner-fire-targets"
-    namespace = kubernetes_namespace.fire_planner.metadata[0].name
-  }
-  spec {
-    schedule                      = "0 10 2 * *"
-    concurrency_policy            = "Forbid"
-    successful_jobs_history_limit = 3
-    failed_jobs_history_limit     = 5
-    starting_deadline_seconds     = 600
-
-    job_template {
-      metadata {
-        labels = local.labels
-      }
-      spec {
-        backoff_limit              = 1
-        ttl_seconds_after_finished = 86400
-        # The full country sweep is CPU-bound (binary search × ~22 cities ×
-        # 3 cases). Give it room rather than letting it run forever.
-        active_deadline_seconds = 3600
-        template {
-          metadata {
-            labels = local.labels
-          }
-          spec {
-            restart_policy = "OnFailure"
-            image_pull_secrets {
-              name = "registry-credentials"
-            }
-            image_pull_secrets {
-              name = "ghcr-credentials"
-            }
-            container {
-              name  = "fire-targets"
-              image = local.image
-              # --horizon 72: Viktor retires ~age 28 and plans to live to 100, so
-              # the portfolio must last 72 years (was the 60y default ≈ to age 88).
-              command = ["python", "-m", "fire_planner", "recompute-fire-targets",
-              "--countries", "all", "--horizon", "72"]
-
-              env_from {
-                secret_ref {
-                  name = "fire-planner-secrets"
-                }
-              }
-              env_from {
-                secret_ref {
-                  name = "fire-planner-db-creds"
-                }
-              }
-
-              resources {
-                requests = {
-                  cpu    = "500m"
-                  memory = "1Gi"
-                }
-                limits = {
-                  memory = "2Gi"
-                }
-              }
-            }
-          }
-        }
-      }
-    }
-  }
-
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1
-    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
-  }
-
-  depends_on = [
-    kubernetes_manifest.external_secret,
-    kubernetes_manifest.db_external_secret,
-  ]
-}
-
 # Weekly refresh of the COL cache: walks col_snapshot for rows
 # expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
 # the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
@ -662,53 +569,16 @@ module "ingress_api" {
  auth = "none"
 }

-# ExternalSecret in the monitoring namespace mirroring the rotating
-# fire_planner DB password. Grafana mounts this via envFromSecrets in
-# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below
-# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts
-# Grafana whenever ESO updates this secret (on the 7d static-role
-# rotation), so the provisioned datasource never goes stale — replaces
-# the old plan-time `data.kubernetes_secret` bake that broke weekly.
-# Mirrors the wealth-pg / payslips-pg pattern.
-resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" {
-  field_manager {
-    force_conflicts = true
-  }
-  manifest = {
-    apiVersion = "external-secrets.io/v1"
-    kind       = "ExternalSecret"
-    metadata = {
-      name      = "grafana-fire-planner-pg-creds"
-      namespace = "monitoring"
-    }
-    spec = {
-      refreshInterval = "15m"
-      secretStoreRef = {
-        name = "vault-database"
-        kind = "ClusterSecretStore"
-      }
-      target = {
-        name = "grafana-fire-planner-pg-creds"
-        template = {
-          metadata = {
-            annotations = {
-              "reloader.stakater.com/match" = "true"
-            }
-          }
-          data = {
-            FIRE_PLANNER_PG_PASSWORD = "{{ .password }}"
-          }
-        }
-      }
-      data = [{
-        secretKey = "password"
-        remoteRef = {
-          key      = "static-creds/pg-fire-planner"
-          property = "password"
-        }
-      }]
-    }
+# Plan-time read of the ESO-created K8s Secret for the Grafana datasource
+# password. First-apply gotcha: first run
+# `terragrunt apply -target=kubernetes_manifest.db_external_secret` so
+# the Secret exists before this data source plans.
+data "kubernetes_secret" "fire_planner_db_creds" {
+  metadata {
+    name      = "fire-planner-db-creds"
+    namespace = kubernetes_namespace.fire_planner.metadata[0].name
  }
+  depends_on = [kubernetes_manifest.db_external_secret]
 }

 # Grafana datasource for fire_planner PostgreSQL DB.
@ -745,15 +615,12 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
          timescaledb     = false
        }
        secureJsonData = {
-          # Live env from grafana-fire-planner-pg-creds (above), injected into
-          # Grafana via envFromSecrets; reloader refreshes it on rotation.
-          password = "$__env{FIRE_PLANNER_PG_PASSWORD}"
+          password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"]
        }
        editable = true
      }]
    })
  }
-  depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds]
 }

 # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
@ -794,9 +661,6 @@ variable "run_examples_bulk_ingest" {

 # Reddit OAuth creds pulled from Vault secret/viktor.
 resource "kubernetes_manifest" "external_secret_examples_reddit" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -837,9 +701,6 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
 # claude-agent-service bearer pulled separately so its rotation cadence
 # is decoupled from the Reddit creds.
 resource "kubernetes_manifest" "external_secret_examples_claude" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/forgejo/email-secret.tf
+++ b/stacks/forgejo/email-secret.tf
@ -6,9 +6,6 @@
 # (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
 # reloader annotation rolls the Forgejo pod if the password is ever rotated.
 resource "kubernetes_manifest" "forgejo_email_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/freedify/main.tf
+++ b/stacks/freedify/main.tf
@ -3,9 +3,6 @@ variable "tls_secret_name" {
  sensitive = true
 }
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/freshrss/main.tf
+++ b/stacks/freshrss/main.tf
@ -18,9 +18,6 @@ resource "kubernetes_namespace" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@ -57,19 +57,16 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
-# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
-# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
-# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
-# is also incompatible with this repo's global generate-providers/lockfile
-# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
-# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
-# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
-# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
-data "kubernetes_secret" "whisker_backend" {
+# Goldmane trusts the client and the client trusts Goldmane's server cert via
+# the published CA bundle.
+#
+# The Tigera CA private key lives in the `tigera-ca-private` Secret in
+# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
+# identity needs RBAC get on that secret — see the Role/RoleBinding below.
+data "kubernetes_secret" "tigera_ca" {
  metadata {
-    name      = "whisker-backend-key-pair"
-    namespace = "calico-system"
+    name      = "tigera-ca-private"
+    namespace = "tigera-operator"
  }
 }

@ -96,11 +93,46 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }

-# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
-# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
-# Sourced verbatim from the operator's whisker-backend client key-pair (read
-# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
-# is touched and no cross-namespace CA RBAC is needed.
+# Client private key.
+resource "tls_private_key" "goldmane_client" {
+  algorithm = "RSA"
+  rsa_bits  = 2048
+}
+
+# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
+# how Felix/whisker-backend present a client identity to Goldmane.
+resource "tls_cert_request" "goldmane_client" {
+  private_key_pem = tls_private_key.goldmane_client.private_key_pem
+  subject {
+    common_name  = "goldmane-edge-aggregator"
+    organization = "goldmane-edge-aggregator"
+  }
+  dns_names = [
+    "goldmane-edge-aggregator",
+    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
+  ]
+}
+
+# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
+# it well before expiry; a long horizon avoids surprise mTLS outages from an
+# unattended stack. The Tigera CA itself outlives this (operator-managed).
+resource "tls_locally_signed_cert" "goldmane_client" {
+  cert_request_pem   = tls_cert_request.goldmane_client.cert_request_pem
+  ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
+  ca_cert_pem        = data.kubernetes_secret.tigera_ca.data["tls.crt"]
+
+  validity_period_hours = 87600 # 10y
+  early_renewal_hours   = 720   # re-sign on apply when <30d remain
+
+  allowed_uses = [
+    "client_auth",
+    "digital_signature",
+    "key_encipherment",
+  ]
+}
+
+# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
+# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
@ -108,8 +140,47 @@ resource "kubernetes_secret" "goldmane_client_tls" {
  }
  type = "Opaque"
  data = {
-    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
-    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
+    "tls.key" = tls_private_key.goldmane_client.private_key_pem
+  }
+}
+
+# Narrow RBAC so this stack's apply identity can `get` the Tigera CA private
+# key in tigera-operator (ESO/Reloader are unaffected). The data source above
+# reads it at apply time; this Role/RoleBinding documents + grants that access
+# rather than relying on cluster-admin. The subject is the same SA the other
+# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
+# OIDC identity interactively) — both are cluster-admin today, so this is
+# belt-and-braces / least-privilege intent for when apply identities tighten.
+resource "kubernetes_role" "read_tigera_ca" {
+  metadata {
+    name      = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = ["tigera-ca-private"]
+    verbs          = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "read_tigera_ca" {
+  metadata {
+    name      = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.read_tigera_ca.metadata[0].name
+  }
+  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
+  # `terraform-state` Vault K8s role in the claude-agent namespace).
+  subject {
+    kind      = "ServiceAccount"
+    name      = "default"
+    namespace = "claude-agent"
  }
 }

@ -156,11 +227,6 @@ resource "kubernetes_job" "db_init" {
  timeouts {
    create = "2m"
  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
-    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
-  }
 }

 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@ -168,9 +234,6 @@ resource "kubernetes_job" "db_init" {
 # place in the CNPG connection allowlist are added in stacks/vault/main.tf
 # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -213,9 +276,6 @@ resource "kubernetes_manifest" "db_external_secret" {
 # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
 # webhook). The digest CronJob defaults to #security.
 resource "kubernetes_manifest" "slack_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -235,7 +295,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
-          key      = "viktor"
+          key      = "monitoring"
          property = "alertmanager_slack_api_url"
        }
      }]
@ -456,12 +516,7 @@ resource "kubernetes_cron_job_v1" "digest" {
              }
              env {
                name  = "SLACK_CHANNEL"
-                # Posts to #alerts. The dedicated #security channel was abandoned
-                # 2026-06-25 — the shared alertmanager_slack_api_url webhook's
-                # Slack app isn't a member of it (channel override 404s), so all
-                # Slack (incl. alertmanager's security-lane alerts) consolidated
-                # to #alerts. See docs/runbooks/goldmane-flow-trail.md.
-                value = "#alerts"
+                value = "#security"
              }

              resources {
--- a/stacks/grampsweb/main.tf
+++ b/stacks/grampsweb/main.tf
@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/hackmd/main.tf
+++ b/stacks/hackmd/main.tf
@ -208,9 +208,6 @@ module "ingress" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/health/main.tf
+++ b/stacks/health/main.tf
@ -250,9 +250,6 @@ module "ingress_test" {
 }

 resource "kubernetes_manifest" "external_secret_db" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -287,9 +284,6 @@ resource "kubernetes_manifest" "external_secret_db" {
 }

 resource "kubernetes_manifest" "external_secret_kv" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/hermes-agent/main.tf
+++ b/stacks/hermes-agent/main.tf
@ -37,9 +37,6 @@ module "tls_secret" {
 # --- Secrets (ESO from Vault) ---

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/immich/frame-emo.tf
+++ b/stacks/immich/frame-emo.tf
@ -1,155 +0,0 @@
-# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned
-# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia
-# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's
-# Portal Mini (Sofia) via the portal-immich-frame app.
-# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account).
-
-resource "kubernetes_config_map" "frame_config_emo" {
-  metadata {
-    name      = "config-emo"
-    namespace = "immich"
-
-    labels = {
-      app = "frame-config-emo"
-    }
-    annotations = {
-      "reloader.stakater.com/match" = "true"
-    }
-  }
-
-  data = {
-    "Settings.yml" = <<-EOF
-    General:
-        Layout: single
-        Interval: 45
-        ImageZoom: true
-        ShowAlbumName: false
-        ShowProgressBar: false
-        ClockFormat: "HH:mm"
-        PhotoDateFormat: "dd/MM/yyyy"
-        WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]}
-        UnitSystem: metric
-        WeatherLatLong: "42.6977,23.3219"
-        Language: en
-    Accounts:
-        - ImmichServerUrl: http://immich.viktorbarzin.me
-          ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
-          ImagesFromDays: 730
-    EOF
-  }
-}
-
-
-resource "kubernetes_deployment" "immich-frame-emo" {
-  metadata {
-    name      = "immich-frame-emo"
-    namespace = "immich"
-    annotations = {
-      "reloader.stakater.com/search" = "true"
-    }
-    labels = {
-      tier = local.tiers.gpu
-    }
-  }
-
-  spec {
-    replicas = 1
-    selector {
-      match_labels = {
-        app = "immich-frame-emo"
-      }
-    }
-    strategy {
-      type = "RollingUpdate"
-    }
-    template {
-      metadata {
-        labels = {
-          app = "immich-frame-emo"
-        }
-        annotations = {
-          "dependency.kyverno.io/wait-for" = "immich-server.immich:2283"
-        }
-      }
-      spec {
-        container {
-          image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
-          name  = "immich-frame-emo"
-          resources {
-            requests = {
-              cpu    = "10m"
-              memory = "64Mi"
-            }
-            limits = {
-              memory = "128Mi"
-            }
-          }
-          port {
-            container_port = 8080
-            protocol       = "TCP"
-            name           = "http"
-          }
-          volume_mount {
-            name       = "config"
-            mount_path = "/app/Config"
-            read_only  = true
-          }
-        }
-        volume {
-          name = "config"
-          config_map {
-            name = "config-emo"
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
-      metadata[0].annotations["keel.sh/policy"],
-      metadata[0].annotations["keel.sh/trigger"],
-      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
-      metadata[0].annotations["keel.sh/match-tag"],
-      metadata[0].annotations["kubernetes.io/change-cause"],
-      metadata[0].annotations["deployment.kubernetes.io/revision"],
-      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
-      spec[0].template[0].spec[0].container[0].image,                     # KEEL_IGNORE_IMAGE
-    ]
-  }
-}
-
-
-resource "kubernetes_service" "immich-frame-emo" {
-  metadata {
-    name      = "immich-frame-emo"
-    namespace = "immich"
-    labels = {
-      "app" = "immich-frame-emo"
-    }
-  }
-
-  spec {
-    selector = {
-      app = "immich-frame-emo"
-    }
-    port {
-      port        = 80
-      target_port = 8080
-    }
-  }
-}
-
-module "ingress_emo" {
-  source = "../../modules/kubernetes/ingress_factory"
-  # Photo-frame kiosk display on Emo's Portal — headless browser pulling images
-  # via an Immich API key (no user login). Forward-auth would 302 the device to
-  # Authentik with no way to complete login.
-  # auth = "none": photo-frame kiosk; headless browser with API key; no user login.
-  auth            = "none"
-  dns_type        = "proxied"
-  namespace       = "immich"
-  name            = "highlights-immich-emo"
-  tls_secret_name = var.tls_secret_name
-  service_name    = "immich-frame-emo"
-}
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@ -162,9 +162,6 @@ resource "kubernetes_resource_quota" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/insta2spotify/main.tf
+++ b/stacks/insta2spotify/main.tf
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "insta2spotify" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@ -35,14 +35,6 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
-  # The external-secrets controller takes server-side-apply ownership of
-  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
-  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
-  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
-  # the ESO v1 migration (the scale-to-0 push).
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -147,11 +139,6 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
-  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
-  # lets the TF apply win instead of erroring on the field-manager conflict.
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -240,11 +227,7 @@ resource "kubernetes_deployment" "instagram_poster" {
  }

  spec {
-    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
-    # ExternalSecret is dead (missing ig_graph_long_lived_token /
-    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
-    # after minting a Meta long-lived token and populating those keys.
-    replicas = 0
+    replicas = 1
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/job-hunter/main.tf
+++ b/stacks/job-hunter/main.tf
@ -41,9 +41,6 @@ resource "kubernetes_namespace" "job_hunter" {
 #     digest_to_address     — where the weekly digest goes
 #     digest_from_address   — From: header for the digest
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -108,9 +105,6 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (7-day rotation).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -331,9 +325,6 @@ resource "kubernetes_service" "job_hunter" {
 # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/k8s-dashboard/oauth2_proxy.tf
+++ b/stacks/k8s-dashboard/oauth2_proxy.tf
@ -5,9 +5,6 @@
 # -----------------------------------------------------------------------------

 resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
@ -5,11 +5,9 @@
 <main>
 	<h1>Kubernetes Access Portal</h1>

-	<div class="callout info">
-		<strong>Fastest way in:</strong> open the <a href="https://t3.viktorbarzin.me">web terminal</a> or the
-		<a href="https://k8s.viktorbarzin.me">dashboard</a> and sign in — no install, no VPN needed. Prefer your
-		own machine? The <a href="/onboarding#path-laptop">local-setup guide</a> covers VPN + kubectl, and the
-		<a href="/onboarding">Getting Started page</a> compares all three access paths.
+	<div class="callout warning">
+		<strong>VPN Required</strong> — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
+		<a href="/onboarding">See the Getting Started guide</a> for VPN setup instructions.
 	</div>

 	<section>
@ -28,7 +26,6 @@
 			<p><strong>Assigned namespaces:</strong> {data.namespaces.join(', ')}</p>

 			<h3>Quick Commands</h3>
-			<p>Run these as-is in the <a href="https://t3.viktorbarzin.me">web terminal</a> — it's already signed in as you.</p>
 			<pre>
 # Check your pods
 kubectl get pods -n {data.namespaces[0]}
@ -50,23 +47,16 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \

 	<section>
 		<h2>Get Started</h2>
-		<h3>No setup — start now</h3>
-		<ol>
-			<li><a href="https://t3.viktorbarzin.me">Open the web terminal</a> — a ready shell with kubectl, Vault and your repos already set up</li>
-			<li><a href="https://k8s.viktorbarzin.me">Open the dashboard</a> — point-and-click view of your workloads</li>
-		</ol>
-		<h3>On your own machine</h3>
 		<ol>
 			{#if data.role === 'namespace-owner'}
-				<li><a href="/onboarding?role=namespace-owner#path-laptop">Follow the namespace-owner setup</a> (VPN, kubectl, Vault, encrypted state)</li>
+				<li><a href="/onboarding?role=namespace-owner">Complete the namespace-owner onboarding guide</a></li>
 			{:else}
-				<li><a href="/onboarding#path-laptop">Follow the local setup</a> (VPN, kubectl, git)</li>
+				<li><a href="/onboarding">Complete the onboarding guide</a> (VPN, kubectl, git)</li>
 			{/if}
 			<li><a href="/setup">Install kubectl and kubelogin</a></li>
 			<li><a href="/download">Download your kubeconfig</a></li>
 			<li>Run <code>kubectl get namespaces</code> to verify access</li>
 		</ol>
-		<p><a href="/onboarding">Compare all three access paths →</a></p>
 	</section>

 	<section>
@ -101,12 +91,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
 		border-radius: 6px;
 		margin: 1rem 0;
 	}
-	.callout.info {
-		background: #e8f4fd;
-		border-left: 4px solid #2196f3;
+	.callout.warning {
+		background: #fff3cd;
+		border-left: 4px solid #ffc107;
 	}
 	.callout a {
-		color: #0d47a1;
+		color: #856404;
 		font-weight: 600;
 	}
 </style>
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
@ -5,123 +5,22 @@

 <main class="content">
 	<h1>Getting Started</h1>
-	<p>
-		Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
-		the first two need <strong>zero setup</strong> and open right in your browser.
-	</p>
-
-	<section>
-		<h2>Three ways in</h2>
-		<table>
-			<thead><tr><th>Path</th><th>Best for</th><th>Setup</th></tr></thead>
-			<tbody>
-				<tr>
-					<td><a href="#path-terminal"><strong>A — Web terminal</strong></a></td>
-					<td>Just want to start working now</td>
-					<td>None — opens in your browser</td>
-				</tr>
-				<tr>
-					<td><a href="#path-dashboard"><strong>B — Web dashboard</strong></a></td>
-					<td>Click around, watch your app, read logs</td>
-					<td>None — opens in your browser</td>
-				</tr>
-				<tr>
-					<td><a href="#path-laptop"><strong>C — Your own machine</strong></a></td>
-					<td>kubectl / Terraform locally, full control</td>
-					<td>VPN + one-line installer</td>
-				</tr>
-			</tbody>
-		</table>
-		<div class="callout info">
-			<strong>Not sure?</strong> Start with the <a href="#path-terminal">web terminal (Path A)</a>.
-			Everything is already installed and your repos are already cloned — you can run your first
-			<code>kubectl</code> command within a minute, from any device.
-		</div>
-	</section>
-
-	<section id="path-terminal" class="path">
-		<h2>Path A — Web terminal <span class="badge rec">Recommended</span> <span class="badge none">No setup</span></h2>
-		<p>
-			A full terminal that runs in your browser — nothing to install, works from any device
-			(even a tablet). It drops you into your own account on the shared workstation, with every
-			tool already set up.
-		</p>
-		<ol>
-			<li>Open <a href="https://t3.viktorbarzin.me" target="_blank">t3.viktorbarzin.me</a></li>
-			<li>Sign in with your Authentik account (the same SSO login as this portal)</li>
-			<li>You land in a ready-to-use shell. Try it:
-				<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
-			</li>
-		</ol>
-		<div class="callout info">
-			<strong>Already done for you</strong> on the workstation:
-			<ul>
-				<li><code>kubectl</code> + your kubeconfig, scoped to your namespaces (no login dance)</li>
-				<li><code>vault</code>, <code>terragrunt</code>, <code>terraform</code>, <code>sops</code>, <code>kubeseal</code></li>
-				<li>Your repos cloned under <code>~/code</code> — the <code>infra</code> repo plus your own project repos</li>
-				<li>Claude Code, ready to pair with you on changes</li>
-			</ul>
-		</div>
-		<div class="callout warning">
-			<strong>No access yet?</strong> The workstation is provisioned per person. If
-			<code>t3.viktorbarzin.me</code> says you're not authorized, ask Viktor to add you
-			(<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a> or Slack).
-		</div>
-	</section>
-
-	<section id="path-dashboard" class="path">
-		<h2>Path B — Web dashboard <span class="badge none">No setup</span></h2>
-		<p>
-			A point-and-click view of the cluster — browse your pods, read logs, restart a deployment,
-			check events. Nothing to install.
-		</p>
-		<ol>
-			<li>Open <a href="https://k8s.viktorbarzin.me" target="_blank">k8s.viktorbarzin.me</a></li>
-			<li>Sign in with your Authentik account</li>
-			<li>
-				You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
-				<strong>no token to paste</strong>. The portal injects your personal access token for you.
-			</li>
-		</ol>
-		<div class="callout info">
-			Scoped to your namespace(s): you can see and manage your own workloads, but not other
-			tenants'. This path uses a per-user token that does <em>not</em> depend on CLI login, so it
-			keeps working even if <code>kubectl</code> OIDC login is having a bad day — making it the
-			reliable fallback for Path C.
-		</div>
-	</section>
-
-	<section id="path-laptop" class="path c">
-		<h2>Path C — From your own machine</h2>
-		<p>
-			For running <code>kubectl</code>, <code>vault</code> and Terraform locally. This is the most
-			powerful path and the one to use for infrastructure changes — it just needs a bit more setup
-			because the cluster API lives on a private network.
-		</p>
+	<p>Welcome! Follow these steps to get access to the home Kubernetes cluster.</p>

 	<div class="role-tabs">
-			<a href="/onboarding?role=general#path-laptop" class:active={!showNamespaceOwner}>General User</a>
-			<a href="/onboarding?role=namespace-owner#path-laptop" class:active={showNamespaceOwner}>Namespace Owner</a>
+		<a href="/onboarding" class:active={!showNamespaceOwner}>General User</a>
+		<a href="/onboarding?role=namespace-owner" class:active={showNamespaceOwner}>Namespace Owner</a>
 	</div>
-		<p class="prereq">
-			{#if showNamespaceOwner}
-				Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy
-				your own app stacks.
-			{:else}
-				General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the
-				<strong>Namespace Owner</strong> tab above.)
-			{/if}
-		</p>

 	<section>
-			<h3>Step 1 — Join the VPN</h3>
-			<p>The cluster API is on a private network (<code>10.0.20.0/24</code>), so you need VPN access first.</p>
+		<h2>Step 0 — Join the VPN</h2>
+		<p>The cluster is on a private network (<code>10.0.20.0/24</code>). You need VPN access first.</p>
 		<ol>
 			<li>Install <a href="https://tailscale.com/download" target="_blank">Tailscale</a> for your OS</li>
 			<li>Run this in your terminal:
 				<pre>tailscale login --login-server https://headscale.viktorbarzin.me</pre>
 			</li>
-				<li>A browser window opens with a registration URL</li>
+			<li>A browser window will open with a registration URL</li>
 			<li>Send that URL to Viktor via email (<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a>) or Slack</li>
 			<li>Wait for approval (usually within a few hours)</li>
 			<li>Once approved, test: <pre>ping 10.0.20.100</pre></li>
@ -129,49 +28,62 @@
 	</section>

 	<section>
-			<h3>Step 2 — Install the tools</h3>
-			<p>Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to <code>~/.kube/config-home</code>:</p>
-			<h4>macOS</h4>
-			<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
-			<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
-			<h4>Linux</h4>
-			<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
-			<h4>Windows</h4>
-			<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
+		<h2>Step 1 — Log in to the portal</h2>
+		<p>Visit <a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a> and sign in with your Authentik account.</p>
+		<p>If you don't have an account yet, ask Viktor to create one.</p>
 	</section>

 	<section>
-			<h3>Step 3 — Verify access</h3>
-			<p>Run this. The first time, it opens your browser for SSO login:</p>
-			<pre>kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}</pre>
-			<p>You should see your resources (or an empty list if you haven't deployed anything yet).</p>
-			<div class="callout warning">
-				<strong>Browser login loops, or kubectl says "Unauthorized"?</strong> Command-line SSO
-				(OIDC) can occasionally be unavailable. When that happens, use the
-				<a href="#path-dashboard">web dashboard (Path B)</a> or the
-				<a href="#path-terminal">web terminal (Path A)</a> — both authenticate a different way and
-				keep working — and let Viktor know.
-			</div>
-			<p class="prereq">Connection error instead? Make sure the VPN is up: <code>tailscale status</code>.</p>
+		<h2>Step 2 — Set up kubectl</h2>
+		<p>Run one of these commands in your terminal to install everything automatically:</p>
+		<h3>macOS</h3>
+		<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
+		<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
+		<h3>Linux</h3>
+		<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
+		<h3>Windows</h3>
+		<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
 	</section>

 	{#if showNamespaceOwner}
 		<section>
-				<h3>Step 4 — Log into Vault</h3>
+			<h2>Step 3 — Log into Vault</h2>
 			<p>Vault manages your secrets and issues dynamic Kubernetes credentials.</p>
 			<pre>vault login -method=oidc</pre>
 			<p>This opens your browser for Authentik SSO. After login, your token is saved to <code>~/.vault-token</code>.</p>
 		</section>

 		<section>
-				<h3>Step 5 — Clone the infra repo</h3>
+			<h2>Step 4 — Verify kubectl access</h2>
+			<p>Run this command. It will open your browser for OIDC login the first time:</p>
+			<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
+			<p>You should see an empty list (no resources) or your running pods.</p>
+		</section>
+
+		<section>
+			<h2>Step 5 — Clone the infra repo</h2>
 			<pre>git clone https://github.com/ViktorBarzin/infra.git
 cd infra</pre>
 			<p>This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.</p>
 		</section>

 		<section>
-				<h3>Step 6 — Decrypt your state</h3>
+			<h2>Step 6 — Install tools</h2>
+			<p>You need <code>sops</code> and <code>terragrunt</code> to work with infrastructure state:</p>
+			<h3>macOS</h3>
+			<pre>brew install sops terragrunt</pre>
+			<h3>Linux</h3>
+			<pre># sops
+curl -LO https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64
+sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
+
+# terragrunt
+curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
+sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt</pre>
+		</section>
+
+		<section>
+			<h2>Step 7 — Decrypt your state</h2>
 			<p>Terraform state is encrypted with SOPS. Your Vault login gives you access to <strong>only your stacks</strong>.</p>
 			<pre># Make sure you're logged into Vault
 vault login -method=oidc
@ -220,7 +132,7 @@ cd stacks/YOUR_NAMESPACE
 		</section>

 		<section>
-				<h3>Step 7 — Create your first app stack</h3>
+			<h2>Step 8 — Create your first app stack</h2>
 			<ol>
 				<li>Copy the template: <pre>cp -r stacks/_template stacks/myapp
 mv stacks/myapp/main.tf.example stacks/myapp/main.tf</pre></li>
@ -241,7 +153,7 @@ git push</pre>
 		</section>

 		<section>
-				<h3>Architecture Overview</h3>
+			<h2>Architecture Overview</h2>
 			<p>Here's how your changes flow through the system:</p>

 			<div class="diagram">
@ -292,18 +204,31 @@ git push</pre>
 		</section>
 	{:else}
 		<section>
-				<h3>Step 4 — Clone the repo</h3>
+			<h2>Step 3 — Verify access</h2>
+			<p>Run this command. It will open your browser for login the first time:</p>
+			<pre>kubectl get namespaces</pre>
+			<p>You should see output like:</p>
+			<pre class="output">NAME              STATUS   AGE
+default           Active   200d
+kube-system       Active   200d
+monitoring        Active   200d
+...</pre>
+			<p>If you get a connection error, make sure your VPN is connected (<code>tailscale status</code>).</p>
+		</section>
+
+		<section>
+			<h2>Step 4 — Clone the repo</h2>
 			<pre>git clone https://github.com/ViktorBarzin/infra.git
 cd infra</pre>
 			<p>This is where all the infrastructure configuration lives.</p>
 		</section>

 		<section>
-				<h3>Step 5 — Your first change</h3>
+			<h2>Step 5 — Your first change</h2>
 			<ol>
 				<li>Create a branch: <pre>git checkout -b my-first-change</pre></li>
 				<li>Edit a service file (e.g., change an image tag in <code>stacks/echo/main.tf</code>)</li>
-					<li>Commit and push: <pre>git add . &amp;&amp; git commit -m "my first change" &amp;&amp; git push -u origin my-first-change</pre></li>
+				<li>Commit and push: <pre>git add . && git commit -m "my first change" && git push -u origin my-first-change</pre></li>
 				<li>Open a Pull Request on GitHub</li>
 				<li>Viktor reviews and merges</li>
 				<li>Woodpecker CI automatically applies the change to the cluster</li>
@ -311,29 +236,19 @@ cd infra</pre>
 			</ol>
 		</section>
 	{/if}
-	</section>
 </main>

 <style>
 	.content { max-width: 768px; margin: 2rem auto; padding: 0 1rem; font-family: system-ui, -apple-system, sans-serif; line-height: 1.6; }
 	.content h1 { border-bottom: 1px solid #e0e0e0; padding-bottom: 0.5rem; }
 	.content h2 { margin-top: 2rem; color: #333; }
-	.content h3 { color: #444; margin: 1.25rem 0 0.25rem; }
-	.content h4 { color: #666; margin: 0.75rem 0 0.25rem; }
+	.content h3 { color: #666; margin: 1rem 0 0.25rem; }
 	.content pre { background: #1e1e1e; color: #d4d4d4; padding: 1rem; border-radius: 6px; overflow-x: auto; }
+	.content pre.output { background: #f5f5f5; color: #333; }
 	.content code { background: #f0f0f0; padding: 2px 6px; border-radius: 3px; }
 	.content .prereq { font-size: 0.9rem; color: #666; font-style: italic; }
 	section { margin: 2rem 0; }
-	section section { margin: 1.25rem 0; }
-
-	.path { border-left: 4px solid #4fc3f7; padding-left: 1.25rem; scroll-margin-top: 4rem; }
-	.path.c { border-left-color: #bbb; }
-
-	.badge { display: inline-block; font-size: 0.65rem; font-weight: 700; text-transform: uppercase; letter-spacing: 0.5px; padding: 0.15rem 0.5rem; border-radius: 4px; vertical-align: middle; margin-left: 0.4rem; }
-	.badge.rec { background: #d4f8d4; color: #1b5e20; }
-	.badge.none { background: #e3f2fd; color: #0d47a1; }
-
-	.role-tabs { display: flex; gap: 0; margin: 1.5rem 0 0.5rem; border-bottom: 2px solid #e0e0e0; }
+	.role-tabs { display: flex; gap: 0; margin: 1.5rem 0; border-bottom: 2px solid #e0e0e0; }
 	.role-tabs a { padding: 0.5rem 1.5rem; text-decoration: none; color: #666; border-bottom: 2px solid transparent; margin-bottom: -2px; }
 	.role-tabs a.active { color: #333; border-bottom-color: #333; font-weight: 600; }
 	table { border-collapse: collapse; width: 100%; margin: 0.5rem 0; }
@ -343,7 +258,6 @@ cd infra</pre>
 	.callout { padding: 1rem; border-radius: 6px; margin: 1rem 0; }
 	.callout.info { background: #e8f4fd; border-left: 4px solid #2196f3; }
 	.callout.warning { background: #fff3cd; border-left: 4px solid #ffc107; }
-	.callout ul { margin: 0.5rem 0 0; padding-left: 1.25rem; }

 	.diagram { background: #fafafa; border: 1px solid #e0e0e0; border-radius: 8px; padding: 1.5rem; margin: 1.5rem 0; }
 	.diagram h3 { margin: 0 0 1rem 0; color: #333; font-size: 0.95rem; text-transform: uppercase; letter-spacing: 0.5px; }
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/services/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/services/+page.svelte
@ -2,19 +2,6 @@
 	<h1>Service Catalog</h1>
 	<p>70+ services running on the cluster. Here are the most commonly used:</p>

-	<section>
-		<h2>Cluster Access</h2>
-		<table>
-			<thead><tr><th>Service</th><th>URL</th><th>Description</th></tr></thead>
-			<tbody>
-			<tr><td>Web Terminal</td><td><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a></td><td>Browser shell on the shared workstation — kubectl, Vault &amp; your repos preinstalled (zero setup)</td></tr>
-			<tr><td>Kubernetes Dashboard</td><td><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a></td><td>Point-and-click view of your workloads, auto-authenticated (zero setup)</td></tr>
-			<tr><td>Access Portal</td><td><a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a></td><td>This portal — onboarding, kubeconfig download, setup script</td></tr>
-			<tr><td>Vault</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Secrets &amp; dynamic credentials — <code>vault login -method=oidc</code></td></tr>
-			</tbody>
-		</table>
-	</section>
-
 	<section>
 		<h2>Core Services</h2>
 		<table>
@ -35,7 +22,7 @@
 			<tbody>
 			<tr><td>Nextcloud</td><td><a href="https://nextcloud.viktorbarzin.me">nextcloud.viktorbarzin.me</a></td><td>File storage, calendar, contacts</td></tr>
 			<tr><td>Immich</td><td><a href="https://immich.viktorbarzin.me">immich.viktorbarzin.me</a></td><td>Photo library (Google Photos alternative)</td></tr>
-			<tr><td>Vaultwarden</td><td><a href="https://vaultwarden.viktorbarzin.me">vaultwarden.viktorbarzin.me</a></td><td>Password manager</td></tr>
+			<tr><td>Vaultwarden</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Password manager</td></tr>
 			<tr><td>Paperless-ngx</td><td><a href="https://pdf.viktorbarzin.me">pdf.viktorbarzin.me</a></td><td>Document management</td></tr>
 			<tr><td>Navidrome</td><td><a href="https://music.viktorbarzin.me">music.viktorbarzin.me</a></td><td>Music streaming</td></tr>
 			<tr><td>Tandoor</td><td><a href="https://recipes.viktorbarzin.me">recipes.viktorbarzin.me</a></td><td>Recipe manager</td></tr>
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/troubleshooting/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/troubleshooting/+page.svelte
@ -11,26 +11,6 @@
 		</ol>
 	</section>

-	<section>
-		<h2>Browser login loops, or kubectl says "Unauthorized"</h2>
-		<p>Command-line SSO (OIDC) login can occasionally be unavailable. You don't have to wait for it — these authenticate a different way and keep working:</p>
-		<ul>
-			<li><a href="https://k8s.viktorbarzin.me">Web dashboard</a> — auto-authenticated, no token to paste</li>
-			<li><a href="https://t3.viktorbarzin.me">Web terminal</a> — its kubectl is already wired up</li>
-		</ul>
-		<p>Let Viktor know so the CLI login path gets fixed.</p>
-	</section>
-
-	<section>
-		<h2>Don't want to set up a local machine at all?</h2>
-		<p>Skip the VPN and CLI install entirely:</p>
-		<ul>
-			<li><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a> — a browser shell with everything preinstalled</li>
-			<li><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a> — a point-and-click dashboard</li>
-		</ul>
-		<p>Both just need your Authentik login. See the <a href="/onboarding">Getting Started</a> guide.</p>
-	</section>
-
 	<section>
 		<h2>"Forbidden" or "Permission denied"</h2>
 		<p>You may not have access to that namespace. Your access is scoped to specific namespaces.</p>
--- a/stacks/k8s-version-upgrade/main.tf
+++ b/stacks/k8s-version-upgrade/main.tf
@ -483,49 +483,31 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  exit 0
                fi

-                echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"
+                slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"

                if [ "$DRY_RUN" = "true" ]; then
-                  slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
+                  slack "DRY_RUN — not spawning preflight Job"
                  exit 0
                fi

                # 7. Spawn Job 0 (preflight) via envsubst on the job-template
                #    Idempotency: deterministic name reconciles via `apply`.
                JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
-                MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
-                ANNOUNCE=yes   # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.

-                # Idempotency + nightly re-evaluation:
-                #   - FAILED preflight (transient gate abort, e.g. a spurious
-                #     critical alert / unhealthy node) -> delete + re-spawn, announced.
-                #   - COMPLETE preflight but NO master Job spawned -> the compat
-                #     gate REFUSED the target (blocked/held now Complete cleanly
-                #     rather than Failing). Re-spawn SILENTLY so the gate re-checks
-                #     nightly (the refusal may have cleared: addon upgraded / matrix
-                #     updated / upstream shipped) WITHOUT nightly Slack noise for a
-                #     standing refusal — the morning report (+ K8sUpgradeBlocked for
-                #     actionable) is the signal.
-                #   - Otherwise (Active, or Complete with the chain advanced) -> skip.
-                # The old "Failed-only re-spawn" left a refused-but-Complete preflight
-                # skipped until its 7d TTL — too slow now that refusals Complete
-                # instead of Failing (2026-06-28). Deterministic names; `apply`
-                # reconciles. (Stuck-pipeline history: a transient critical alert
-                # wedged 1.34.9 for 5 days, 2026-06-17 — hence Failed always re-spawns.)
+                # Retry-on-failure idempotency: skip only if an existing preflight
+                # Job is Active/Complete. A *Failed* preflight (aborted on a
+                # transient gate, e.g. a spurious critical alert) is deleted and
+                # re-spawned — otherwise its deterministic name + 7d TTL wedges
+                # the entire pipeline until it ages out. (Stuck-pipeline fix
+                # 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
                if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
                  JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
                    -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
-                  JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
-                    -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
                  if [ "$JOB_FAILED" = "True" ]; then
                    slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
                    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
-                  elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
-                    echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
-                    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
-                    ANNOUNCE=no
                  else
-                    echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
+                    slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
                    exit 0
                  fi
                fi
@ -539,9 +521,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
                  < /template/job-template.yaml \
                  | /usr/local/bin/kubectl apply -f -

-                if [ "$ANNOUNCE" = "yes" ]; then
                slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
-                fi
              EOT
              ]
              env {
--- a/stacks/k8s-version-upgrade/scripts/addon-compat.json
+++ b/stacks/k8s-version-upgrade/scripts/addon-compat.json
@ -1,5 +1,5 @@
 {
-  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports. An addon entry may also set \"pinned\": true (+ \"pin_reason\") to mark it deliberately held: the gate classifies its block as PINNED/held (quiet — no alert, nightly report only) even if a supporting version exists, for upgrades coupled to other work we're not ready for (e.g. gpu-operator's NVIDIA-driver/Ubuntu coupling). A block with NO supporting version in the matrix is WAITING (also quiet); a block a newer matrix version would clear is ACTIONABLE (alerts).",
+  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
  "addons": [
    {
      "name": "calico",
@ -48,9 +48,7 @@
      "max_k8s": {
        "25.10": "1.35",
        "26.3": "1.36"
-      },
-      "pinned": true,
-      "pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu/kernel; held until the driver/OS path is ready. Unpin = delete pinned + pin_reason."
+      }
    }
  ],
  "containerd_min": {
--- a/stacks/k8s-version-upgrade/scripts/compat-gate.py
+++ b/stacks/k8s-version-upgrade/scripts/compat-gate.py
@ -14,20 +14,9 @@ classes of blocker:
  3. containerd    — every node's containerd >= the target's floor, if the matrix
                     declares one (e.g. the 1.7.x -> k8s 1.37 cliff)

-Each reason line is tagged with its class so the caller can act differently:
-  [ACTIONABLE]  a newer addon version (present in the matrix) supports the
-                target — upgrading it clears the block. Also covers removed-API
-                / containerd blocks and the unreadable-version fail-safe.
-  [WAITING]     no released addon version supports the target yet — only an
-                upstream release can clear it (e.g. kyverno/ESO behind a new k8s).
-  [PINNED]      a supporting version exists but the addon is deliberately held
-                (matrix `pinned: true`, e.g. gpu-operator's driver/OS coupling).
-
 Exit 0  = safe, proceed.
-Exit 2  = BLOCKED, actionable — >=1 blocker, none held. Caller pushes
-          k8s_upgrade_blocked=1 (-> K8sUpgradeBlocked alert) and halts.
-Exit 4  = HELD — >=1 waiting-upstream/pinned blocker (held wins over actionable).
-          Caller pushes k8s_upgrade_held=1 (no alert; nightly report only) and halts.
+Exit 2  = BLOCKED — prints one human reason per line (caller pushes
+          k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
 Exit 3  = the gate itself errored — caller treats as a block (fail safe).

 Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
@ -73,20 +62,6 @@ def running_minor():
    return min(minors) if minors else None


-def _addon_resolution(a, tgt, running_ver):
-    """For a BLOCKING addon, decide whether a newer matrix version would clear
-    the block. Returns ("actionable", hint) when some version key has
-    max_k8s >= target AND is newer than the running version (upgrading it clears
-    the block); otherwise ("waiting", hint) — nothing released supports the
-    target yet, so only an upstream release can clear it."""
-    sufficient = [floor for floor, mk in a["max_k8s"].items()
-                  if minor(mk) and minor(mk) >= tgt and minor(floor) > minor(running_ver)]
-    if sufficient:
-        best = min(sufficient, key=minor)  # smallest sufficient upgrade
-        return "actionable", f"upgrade {a['name']} to >= {best}"
-    return "waiting", f"no released {a['name']} version supports k8s {tgt[0]}.{tgt[1]} yet"
-
-
 def check_addons(matrix, tgt, running):
    # A target at or below the RUNNING minor (a patch, or a same/lower minor)
    # crosses into no new k8s minor, so every installed addon is already
@ -102,36 +77,25 @@ def check_addons(matrix, tgt, running):
                    "-o", "jsonpath={.spec.template.spec.containers[*].image}"])
        m = re.search(a["image_re"], img or "")
        if not m:
-            # Fail safe: can't read the running version → block; a human must
-            # look (ACTIONABLE), never upgrade blind.
-            reasons.append(f"[ACTIONABLE] addon {a['name']}: could not read running "
-                           f"version (img='{img or 'not found'}') — refusing to upgrade blind")
+            # Fail safe: if we can't read the running version, don't upgrade blind.
+            reasons.append(f"addon {a['name']}: could not read running version "
+                           f"(img='{img or 'not found'}') — refusing to upgrade blind")
            continue
-        running_ver = m.group(1)  # e.g. "3.26"
+        running = m.group(1)  # e.g. "3.26"
        # max_k8s maps an addon-version floor -> highest supported k8s minor.
        # Pick the highest floor that is <= the running version.
        max_k8s = None
        for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
-            if minor(running_ver) >= minor(floor):
+            if minor(running) >= minor(floor):
                max_k8s = mk
                break
        if max_k8s is None:
-            reasons.append(f"[ACTIONABLE] addon {a['name']} v{running_ver}: below the lowest "
-                           f"version in the compat matrix — unknown k8s support")
+            reasons.append(f"addon {a['name']} v{running}: below the lowest version "
+                           f"in the compat matrix — unknown k8s support")
            continue
        if tgt > minor(max_k8s):
-            base = (f"addon {a['name']} v{running_ver} supports k8s <= {max_k8s}; "
-                    f"target {tgt[0]}.{tgt[1]} exceeds it")
-            # A deliberately-pinned addon is HELD even if a newer version exists
-            # (e.g. gpu-operator 26.3 supports 1.36 but its driver/OS coupling
-            # means we don't take it yet) — the pin overrides actionable.
-            if a.get("pinned"):
-                why = a.get("pin_reason", "deliberately pinned")
-                reasons.append(f"[PINNED] {base} — pinned ({why}); holding")
-            else:
-                kind, hint = _addon_resolution(a, tgt, running_ver)
-                tag = "ACTIONABLE" if kind == "actionable" else "WAITING"
-                reasons.append(f"[{tag}] {base} — {hint}")
+            reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
+                           f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
    return reasons


@ -145,11 +109,11 @@ def check_removed_apis(tgt):
            rr = lbl.get("removed_release", "")
            if rr and minor(rr) and tgt >= minor(rr):
                g = lbl.get("group") or "core"
-                reasons.append(f"[ACTIONABLE] deprecated API {g}/{lbl.get('version')} "
+                reasons.append(f"deprecated API {g}/{lbl.get('version')} "
                               f"{lbl.get('resource')} is in use and is removed in "
                               f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
    except Exception as e:
-        reasons.append(f"[ACTIONABLE] removed-API check could not query Prometheus ({e}) — "
+        reasons.append(f"removed-API check could not query Prometheus ({e}) — "
                       f"refusing to upgrade blind")
    return reasons

@ -168,28 +132,11 @@ def check_containerd(matrix, tgt):
        name, _, ver = line.partition(" ")
        cv = ver.replace("containerd://", "")
        if minor(cv) and minor(cv) < minor(floor):
-            reasons.append(f"[ACTIONABLE] node {name} containerd {cv} < required {floor} "
+            reasons.append(f"node {name} containerd {cv} < required {floor} "
                           f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
    return reasons


-def held_reason(r):
-    """True for a blocker the cluster cannot act on now: no released version
-    supports the target (WAITING) or the addon is deliberately pinned (PINNED).
-    These are quiet (no alert) — only an upstream release / a manual unpin clears
-    them, so a nightly 'needs attention' alert would be crying wolf."""
-    return r.startswith("[WAITING]") or r.startswith("[PINNED]")
-
-
-def exit_code(reasons):
-    """Map reasons to the gate verdict: 0 safe · 2 actionable block · 4 held.
-    Held WINS over actionable on a mix — if anything is waiting/pinned the target
-    can't proceed yet, so acting on the actionable blockers would be premature."""
-    if not reasons:
-        return 0
-    return 4 if any(held_reason(r) for r in reasons) else 2
-
-
 def main():
    if len(sys.argv) < 2:
        print("usage: compat-gate.py <target-k8s-version>  (matrix JSON on stdin)")
@ -211,9 +158,9 @@ def main():
    if reasons:
        for r in reasons:
            print(r)
-    else:
+        sys.exit(2)
    print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
-    sys.exit(exit_code(reasons))
+    sys.exit(0)


 if __name__ == "__main__":
--- a/stacks/k8s-version-upgrade/scripts/nightly-report.py
+++ b/stacks/k8s-version-upgrade/scripts/nightly-report.py
@ -69,29 +69,6 @@ def fmt_age(seconds):
    return f"{seconds / 86400:.1f}d ago"


-def _render_reasons(blocker_reasons):
-    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
-    tag into labelled sections, stripping the tag from each bullet. Untagged
-    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
-    Returns a list of message lines."""
-    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
-    out, shown = [], set()
-    for title, tag in (("Action needed", "[ACTIONABLE]"),
-                       ("Waiting on upstream", "[WAITING]"),
-                       ("Pinned (held by us)", "[PINNED]")):
-        sub = [l for l in lines if l.startswith(tag)]
-        if sub:
-            out.append(f"{title}:")
-            for l in sub:
-                shown.add(l)
-                out.append(f"  • {l[len(tag):].strip()}")
-    rest = [l for l in lines if l not in shown]
-    if rest:
-        out.append("Blockers:")
-        out.extend(f"  • {l}" for l in rest)
-    return out
-
-
 def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
    """Build the Slack message text from gathered facts. PURE.

@ -121,7 +98,6 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
-    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))

    if avail:
        lbl = avail[0][0]
@ -129,12 +105,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
        kind = lbl.get("kind", "?")
        tgt_line = f"Detected target: *{target}* ({kind})"
        if blocked:
-            # actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
-            headline = f"🔴 BLOCKED (action needed) — {target}"
-        elif held:
-            # waiting on upstream and/or a pinned addon — nothing to do but wait;
-            # intentionally NO alert, this nightly line is the only signal
-            headline = f"⏸️ HELD — {target} not yet upgradable"
+            headline = f"🔴 BLOCKED — compat gate refused {target}"
        elif len(versions) == 1 and target == versions[0]:
            headline = f"🟢 UPGRADED — all nodes now on {target}"
        else:
@ -149,8 +120,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):

    msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]

-    if (blocked or held) and blocker_reasons:
-        msg.extend(_render_reasons(blocker_reasons))
+    if blocked and blocker_reasons:
+        msg.append("Blockers (live):")
+        for r in blocker_reasons.splitlines():
+            r = r.strip()
+            if r:
+                msg.append(f"  • {r}")

    if jobs:
        msg.append("Chain jobs (recent):")
@ -238,8 +213,7 @@ def main():

    avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
    blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
-    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
-    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
+    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None

    msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
    post_slack(msg)
--- a/stacks/k8s-version-upgrade/scripts/test_compat_gate.py
+++ b/stacks/k8s-version-upgrade/scripts/test_compat_gate.py
@ -95,121 +95,3 @@ def test_running_minor_from_kubectl(monkeypatch):
    # oldest kubelet wins (mirrors the detector): node2 on 1.33 is the floor.
    monkeypatch.setattr(cg, "kget", lambda args: "v1.34.9\nv1.33.5\nv1.34.9")
    assert cg.running_minor() == (1, 33)
-
-
-# --- block classification: actionable / waiting-upstream / pinned ----------
-# A block is ACTIONABLE if a newer addon version in the matrix supports the
-# target (we can upgrade to clear it), WAITING if no released version supports
-# the target yet (only upstream can clear it), or PINNED if a version exists but
-# we deliberately hold the addon. Held (waiting|pinned) is quiet; actionable
-# alerts.
-KYVERNO_MATRIX = {
-    "addons": [{
-        "name": "kyverno",
-        "namespace": "kyverno",
-        "kind": "deployment",
-        "resource": "kyverno-admission-controller",
-        "image_re": r"kyverno:v(\d+\.\d+)",
-        "max_k8s": {"1.16": "1.34", "1.18": "1.35"},
-    }]
-}
-GPU_MATRIX = {
-    "addons": [{
-        "name": "gpu-operator",
-        "namespace": "nvidia",
-        "kind": "deployment",
-        "resource": "gpu-operator",
-        "image_re": r"gpu-operator:v(\d+\.\d+)",
-        "max_k8s": {"25.10": "1.35", "26.3": "1.36"},
-        "pinned": True,
-        "pin_reason": "needs newer NVIDIA driver + Ubuntu release",
-    }]
-}
-
-
-def test_actionable_when_higher_version_supports_target(monkeypatch):
-    # calico 3.30 (ceiling 1.35), target 1.36, matrix has 3.32 -> 1.36:
-    # upgrading calico WOULD clear it -> ACTIONABLE, with a remediation hint.
-    _img(monkeypatch, "quay.io/calico/node:v3.30.7")
-    reasons = cg.check_addons(CALICO_MATRIX, (1, 36), (1, 35))
-    assert len(reasons) == 1, reasons
-    assert reasons[0].startswith("[ACTIONABLE]"), reasons
-    assert "3.32" in reasons[0] and "calico" in reasons[0]
-
-
-def test_waiting_when_no_version_supports_target(monkeypatch):
-    # kyverno 1.18 is the matrix ceiling (k8s 1.35); target 1.36 has NO
-    # supporting version -> WAITING on upstream (nothing to upgrade to).
-    _img(monkeypatch, "kyverno/kyverno:v1.18.1")
-    reasons = cg.check_addons(KYVERNO_MATRIX, (1, 36), (1, 35))
-    assert len(reasons) == 1, reasons
-    assert reasons[0].startswith("[WAITING]"), reasons
-    assert "kyverno" in reasons[0]
-
-
-def test_pinned_addon_is_held_not_actionable(monkeypatch):
-    # gpu-operator 25.10, target 1.36; 26.3 supports 1.36 BUT the entry is
-    # pinned -> classified PINNED (held), never ACTIONABLE.
-    _img(monkeypatch, "nvcr.io/nvidia/gpu-operator:v25.10.0")
-    reasons = cg.check_addons(GPU_MATRIX, (1, 36), (1, 35))
-    assert len(reasons) == 1, reasons
-    assert reasons[0].startswith("[PINNED]"), reasons
-    assert "gpu-operator" in reasons[0]
-
-
-def test_unreadable_addon_tagged_actionable(monkeypatch):
-    # fail-safe block on an unreadable image is ACTIONABLE (a human must look).
-    _img(monkeypatch, "")
-    reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
-    assert reasons and reasons[0].startswith("[ACTIONABLE]"), reasons
-
-
-def test_existing_reasons_are_tagged(monkeypatch):
-    # the legacy "ceiling below target, newer version exists" case is ACTIONABLE.
-    _img(monkeypatch, "external-secrets/external-secrets:v0.12.1")
-    reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
-    assert reasons[0].startswith("[ACTIONABLE]"), reasons
-
-
-def test_held_reason_classifier():
-    assert cg.held_reason("[WAITING] x")
-    assert cg.held_reason("[PINNED] x")
-    assert not cg.held_reason("[ACTIONABLE] x")
-    assert not cg.held_reason("untagged")
-
-
-def test_exit_code_mapping():
-    assert cg.exit_code([]) == 0
-    assert cg.exit_code(["[ACTIONABLE] x"]) == 2
-    assert cg.exit_code(["[WAITING] x"]) == 4
-    assert cg.exit_code(["[PINNED] x"]) == 4
-    # held wins on a mix: an upstream/pinned wait can't be cleared by acting now
-    assert cg.exit_code(["[ACTIONABLE] x", "[WAITING] y"]) == 4
-
-
-def test_real_matrix_136_is_held(monkeypatch):
-    """Regression guard on the SHIPPED addon-compat.json: at today's running
-    versions a 1.36 jump must be HELD (exit 4) — calico ACTIONABLE (3.32 in the
-    matrix), ESO+kyverno WAITING (no 1.36 release), gpu-operator PINNED. Catches
-    a matrix edit that silently turns the quiet held state into a nightly alert."""
-    import json as _json
-    matrix = _json.loads((HERE / "addon-compat.json").read_text())
-    running_imgs = {
-        "calico-system": "quay.io/calico/node:v3.30.7",
-        "external-secrets": "ghcr.io/external-secrets/external-secrets:v2.6.0",
-        "kyverno": "ghcr.io/kyverno/kyverno:v1.18.1",
-        "nvidia": "nvcr.io/nvidia/gpu-operator:v25.10.0",
-    }
-
-    def fake_kget(args):
-        ns = args[args.index("-n") + 1] if "-n" in args else ""
-        return running_imgs.get(ns, "")
-
-    monkeypatch.setattr(cg, "kget", fake_kget)
-    reasons = cg.check_addons(matrix, (1, 36), (1, 35))
-    pick = lambda name: next(r for r in reasons if name in r)
-    assert pick("calico").startswith("[ACTIONABLE]"), reasons
-    assert pick("external-secrets").startswith("[WAITING]"), reasons
-    assert pick("kyverno").startswith("[WAITING]"), reasons
-    assert pick("gpu-operator").startswith("[PINNED]"), reasons
-    assert cg.exit_code(reasons) == 4  # held wins
--- a/stacks/k8s-version-upgrade/scripts/test_nightly_report.py
+++ b/stacks/k8s-version-upgrade/scripts/test_nightly_report.py
@ -79,41 +79,3 @@ def test_compose_includes_recent_jobs():
    jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
    assert "k8s-upgrade-preflight-1-35-6: Failed" in out
-
-
-# --- held (waiting-upstream / pinned) vs actionable-blocked rendering -------
-METRICS_HELD = f"""# TYPE k8s_upgrade_available gauge
-k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.35.6",target="1.36.2"}} 1
-k8s_upgrade_held{{instance="",job="k8s-version-upgrade"}} 1
-k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 0
-k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
-"""
-NODES_135 = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
-
-
-def test_compose_held_headline_and_grouped_reasons():
-    m = nr.parse_metrics(METRICS_HELD)
-    reasons = (
-        "[WAITING] addon kyverno v1.18 supports k8s <= 1.35; target 1.36 exceeds it — no released kyverno version supports k8s 1.36 yet\n"
-        "[PINNED] addon gpu-operator v25.10 supports k8s <= 1.35; target 1.36 exceeds it — pinned (driver/OS); holding\n"
-        "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
-    )
-    out = nr.compose_report(LAST_RUN + 30000, NODES_135, m, reasons, [])
-    # held headline, NOT a red actionable block
-    assert "⏸️ HELD" in out and "1.36.2" in out
-    assert "🔴 BLOCKED" not in out
-    # grouped by class
-    assert "Waiting on upstream" in out and "kyverno" in out
-    assert "Pinned" in out and "gpu-operator" in out
-    # the lone actionable piece is still listed so eventual scope is visible
-    assert "calico" in out
-    # tags are stripped from the rendered bullets (no raw "[WAITING]")
-    assert "[WAITING]" not in out
-
-
-def test_compose_blocked_groups_actionable():
-    m = nr.parse_metrics(METRICS_BLOCKED)  # blocked=1
-    reasons = "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
-    out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
-    assert "🔴 BLOCKED" in out
-    assert "Action needed" in out and "calico" in out
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@ -37,12 +37,6 @@ KUBECTL=kubectl
 JOB_TEMPLATE=/template/job-template.yaml
 UPDATE_K8S_SH=/scripts/update_k8s.sh

-# Set to 1 by record_blocked/record_held when the compat-gate refuses the
-# target. spawn_next() then declines to advance the chain — but the Job still
-# exits 0, because a gate refusal is a DECISION, not a failure (no Failed Job,
-# no K8sUpgradeChainJobFailed). Signalling is via the gauges those recorders push.
-HALT_CHAIN=0
-
 # SSH targets are node InternalIPs, resolved live from `kubectl get nodes` (see
 # ssh_target() below) — the pipeline has NO dependency on node DNS records
 # (`k8s-node<N>.viktorbarzin.lan`). This is what lets a freshly-joined node be
@ -94,31 +88,17 @@ push() {
    | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
 }

-# Compat-gate verdict recorders. A gate refusal is a DECISION, not a crash: the
-# Job Completes cleanly and the chain simply doesn't advance (spawn_next checks
-# HALT_CHAIN). The two outcomes differ only in how they're signalled:
-#   - record_blocked: ACTIONABLE — a newer addon version would clear it.
-#       k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (fires once via
-#       alert-on-change). "upgrade when we can, alert when we can't."
-#   - record_held:    WAITING-ON-UPSTREAM or PINNED — nothing to do but wait.
-#       k8s_upgrade_held=1 -> NO alert; the nightly report's ⏸️ line is the
-#       only signal. This is what stops the nightly cry-wolf for unactionable
-#       blocks (kyverno/ESO behind upstream, gpu-operator pinned).
-# Neither Slacks per-run: the reasons are in the nightly report (it re-runs
-# compat-gate), and per-run Slack was itself a nightly-noise source.
-record_blocked() {
+# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
+# the cluster simply isn't ready for this target yet (an addon / in-use API /
+# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
+# alert), Slack the reasons, and halt so a human clears the blocker (or a later
+# run proceeds once it's cleared). This is the "upgrade when we can, alert when
+# we can't" contract.
+block() {
  push k8s_upgrade_blocked 1
-  push k8s_upgrade_held 0
-  HALT_CHAIN=1
-  echo "BLOCKED (action needed) preflight v$TARGET_VERSION:" >&2
-  printf '%s\n' "$1" >&2
-}
-record_held() {
-  push k8s_upgrade_held 1
-  push k8s_upgrade_blocked 0
-  HALT_CHAIN=1
-  echo "HELD (not yet upgradable — waiting upstream / pinned) preflight v$TARGET_VERSION:" >&2
-  printf '%s\n' "$1" >&2
+  slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
+  echo "BLOCKED: $1" >&2
+  exit 1
 }

 halt_on_alert_query() {
@ -276,10 +256,6 @@ case "$PHASE" in
 esac

 spawn_next() {
-  if [ "${HALT_CHAIN:-0}" = "1" ]; then
-    echo "Chain halted by compat-gate (blocked/held) — not spawning next phase."
-    return 0
-  fi
  [ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }

  local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
@ -339,37 +315,15 @@ phase_preflight() {
  # 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
  #    addon, an in-use deprecated API, or a node's containerd is too old for the
  #    target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
-  #    refusal is cheap. The gate CLASSIFIES the refusal (exit code):
-  #      0 safe      -> proceed
-  #      2 actionable -> record_blocked (a newer addon version would clear it)
-  #      4 held       -> record_held (waiting on upstream / a pinned addon)
-  #      3/other err  -> fail-safe: treat as actionable block
-  #    blocked/held push the gauge DEFINITIVELY (one value per run — no pre-reset
-  #    flap that would re-notify the alert nightly) and set HALT_CHAIN so the Job
-  #    Completes cleanly without advancing the chain. This is what makes
-  #    unattended minor upgrades safe AND quiet: proceed when supported, alert
-  #    only when there's something to do, hold silently when there isn't.
+  #    block is cheap. Reset the blocked gauge for this run; block() sets it to 1
+  #    only on a refusal. This is what makes unattended minor upgrades safe: the
+  #    chain proceeds when the cluster supports the target and halts+alerts when
+  #    it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
+  push k8s_upgrade_blocked 0
  local gate_out gate_rc=0
  gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
-  case "$gate_rc" in
-    0)
-      push k8s_upgrade_blocked 0
-      push k8s_upgrade_held 0
+  if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
  echo "compat-gate passed for v$TARGET_VERSION"
-      ;;
-    4)
-      record_held "$gate_out"
-      return 0
-      ;;
-    2)
-      record_blocked "$gate_out"
-      return 0
-      ;;
-    *)
-      record_blocked "gate ERROR (rc=$gate_rc) — failing safe as an actionable block:"$'\n'"$gate_out"
-      return 0
-      ;;
-  esac

  # 1. All nodes Ready + no pressure
  local bad_nodes
@ -462,39 +416,6 @@ phase_preflight() {
    fi
  fi

-  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
-  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
-  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
-  # --oidc-* args instead of --authentication-config, the regenerated apiserver
-  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
-  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
-  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
-  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
-  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
-  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
-  # Skip on an at-target master (resume — no apiserver regen).
-  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
-    local apiserver_diff
-    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
-    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
-      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
-    fi
-  fi
-
-  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
-  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
-  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
-  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
-  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
-  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
-  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
-  # never aborts the chain.
-  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
-    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
-      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
-      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
-  fi
-
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
@@ -823,8 +744,6 @@ phase_postflight() {
  push k8s_upgrade_in_flight 0
  push k8s_upgrade_snapshot_taken 0
  push k8s_upgrade_started_timestamp 0
-  push k8s_upgrade_blocked 0
-  push k8s_upgrade_held 0

  slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
 }
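
The removed (`-`) lines in the preflight hunk above classify the compat gate's exit code into proceed / held / blocked, while the added (`+`) lines collapse that to a single non-zero check. As a minimal sketch of the removed dispatch, assuming only the exit-code meanings stated in the removed comment: `classify_gate_rc`, `fake_gate` and `run_gate` are hypothetical names for illustration, and `fake_gate` merely stands in for `compat-gate.py`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the script): map a gate exit code to the
# action described by the removed case statement.
classify_gate_rc() {
  case "$1" in
    0) echo "proceed" ;;   # safe: continue the upgrade chain
    4) echo "held" ;;      # waiting on upstream or a pinned addon
    2) echo "blocked" ;;   # actionable: a newer addon version would clear it
    *) echo "blocked" ;;   # any other code: fail safe as an actionable block
  esac
}

# Hypothetical stand-in for compat-gate.py: prints a line, then exits
# with the code it was given.
fake_gate() {
  echo "gate output for rc=$1"
  return "$1"
}

# Capture output and exit code without tripping `set -e`. The `|| rc=$?`
# branch runs only on a non-zero exit, so rc keeps its initial 0 on success.
# `local` is declared on its own line so `$?` reflects the command
# substitution rather than the `local` builtin.
run_gate() {
  local out rc=0
  out=$(fake_gate "$1") || rc=$?
  classify_gate_rc "$rc"
}

# Walk through the four cases named in the removed comment.
for code in 0 2 3 4; do
  echo "rc=$code -> $(run_gate "$code")"
done
```

With the binary check on the `+` side, the `held` case no longer exists: every non-zero code takes the same `block` path.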