goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts

Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver; flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
state(dbaas): update encrypted state
2026-06-25 17:49:25 +00:00 · 2026-06-25 17:31:03 +00:00 · 2026-06-25 15:23:15 +00:00 · 2026-06-25 14:16:04 +00:00 · 2026-06-24 22:03:15 +00:00 · 2026-06-24 20:59:39 +00:00
215 changed files with 16634 additions and 7049 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -63,7 +63,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
 - **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
 - **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
-- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
+- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`. (2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.)
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
 - **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: <secret>`) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
@@ -243,7 +243,8 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).

 ## Storage & Backup Architecture
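The allowlist query quoted in the CLAUDE.md hunk above (`SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`) can be read as a simple filter-and-deduplicate over edge rows. The following is a minimal in-memory sketch of that logic, not the repository's code: the `Edge` struct and `internalAllowlist` function are hypothetical names, and the sample rows are made up for illustration. External egress would not appear here, since the source states that empty-namespace flows are dropped from the table.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge mirrors the three columns the documented query touches. The struct is
// hypothetical; first_seen, last_seen and flow_count are omitted for brevity.
type Edge struct {
	SrcNS, DstNS, Action string
}

// internalAllowlist returns the distinct destination namespaces that srcNS was
// observed reaching with action "allow", sorted for stable output.
func internalAllowlist(edges []Edge, srcNS string) []string {
	seen := map[string]bool{}
	out := []string{}
	for _, e := range edges {
		if e.SrcNS != srcNS || e.Action != "allow" || seen[e.DstNS] {
			continue
		}
		seen[e.DstNS] = true
		out = append(out, e.DstNS)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative rows only; not taken from the real goldmane_edges table.
	edges := []Edge{
		{"recruiter-responder", "dbaas", "allow"},
		{"recruiter-responder", "monitoring", "allow"},
		{"recruiter-responder", "vault", "deny"},
		{"mailserver", "dbaas", "allow"},
	}
	fmt.Println(internalAllowlist(edges, "recruiter-responder"))
}
```

Denied edges and edges from other source namespaces are filtered out, so only the observed-and-allowed internal destinations feed the per-namespace allowlist.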
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,6 +13,8 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
+| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
+| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |

 ## Storage & Security (Tier: cluster)
@@ -37,6 +39,7 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
+| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (intended `#security`, until the shared webhook's Slack app is invited there). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
+| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---

 # Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).

 ### SSH Access
 ```bash
--- a/.github/workflows/build-chrome-service-browser.yml
+++ b/.github/workflows/build-chrome-service-browser.yml
@@ -0,0 +1,39 @@
+name: Build chrome-service-browser
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
+# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
+# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
+# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
+# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
+# the pod pulls it without credentials.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/chrome/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/chrome
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-browser:latest
+            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
 _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.

 **Calico**:
-The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
+The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
 _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.

+**Service identity**:
+How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
+_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
+
+**Goldmane / Whisker**:
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily digest (currently posted to `#alerts`; `#security` once the shared webhook's Slack app is invited there). As-built: `docs/runbooks/goldmane-flow-trail.md`.
+_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
+
 ### Storage

 **proxmox-lvm-encrypted**:
--- a/cli/README.md
+++ b/cli/README.md
@@ -171,6 +171,37 @@ prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
 not tied to whoever first wrote the workflow (the user's key must be enrolled on
 the HA host).

+### v0.8 verbs — browser (headful anti-bot automation)
+
+Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
+from the devvm over CDP, for sites that detect and block headless automation. The
+headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
+the gated action (submit/login) silently fails — the motivating case was the
+Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
+`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
+injects the same `stealth.js` the in-cluster callers use, and submits first try.
+
+The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
+agent supplies the Playwright script — judgment stays out of the CLI.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
+| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
+| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
+
+Default context is a **fresh incognito** one (closed on exit) — safe for the
+shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
+reuses the warmed persistent profile when a pre-logged-in session is needed.
+`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
+that gates in-cluster callers — no namespace label needed. The node CDP client is
+pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
+(Chromium 130; protocol changes between minors) and is installed once, lazily,
+into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
+runs on the devvm, `setInputFiles` streams local files to the remote browser over
+CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
+and `docs/adr/0013`.
+
 ## Build / install

 Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -190,4 +221,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.

 ## Design

-See `infra/docs/adr/0004`–`0012` for the architecture decisions.
+See `infra/docs/adr/0004`–`0013` for the architecture decisions.
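The error-code cheat-sheet that the cli/README.md hunk above attributes to `browser --help` can be expressed as a small classifier. This is a hedged sketch rather than code from the CLI: `classifyNetError` and its return strings are hypothetical, and matching the `_TIMED_OUT` suffix is an assumption made so that both `ERR_TIMED_OUT` and `ERR_CONNECTION_TIMED_OUT` land in the egress bucket.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyNetError maps a Chrome net:: error message to the buckets named in
// the README cheat-sheet. Function name and labels are illustrative only.
func classifyNetError(msg string) string {
	switch {
	case strings.Contains(msg, "ERR_FILE_NOT_FOUND"):
		// Per the README: an automation-layer intercept, not an egress problem.
		return "automation-layer intercept"
	case strings.Contains(msg, "ERR_CONNECTION_REFUSED"),
		strings.Contains(msg, "_TIMED_OUT"),
		strings.Contains(msg, "ERR_NAME_NOT_RESOLVED"):
		// Per the README: these indicate a real egress failure.
		return "real egress failure"
	default:
		return "unclassified"
	}
}

func main() {
	for _, m := range []string{
		"net::ERR_FILE_NOT_FOUND",
		"net::ERR_CONNECTION_REFUSED",
		"net::ERR_NAME_NOT_RESOLVED",
		"net::ERR_CERT_DATE_INVALID",
	} {
		fmt.Printf("%s => %s\n", m, classifyNetError(m))
	}
}
```

The third heuristic in the cheat-sheet (one endpoint returning 500 while sibling endpoints return 200) depends on comparing several responses, so it is left out of this single-message classifier.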
--- a/cli/VERSION
+++ b/cli/VERSION
@@ -1 +1 @@
-v0.7.1
+v0.8.1
--- a/cli/browser.go
+++ b/cli/browser.go
@@ -0,0 +1,388 @@
+package main
+
+import (
+	_ "embed"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"os"
+	"os/exec"
+	"os/signal"
+	"path/filepath"
+	"strconv"
+	"strings"
+	"sync"
+	"syscall"
+	"time"
+)
+
+// playwrightVersion pins the node CDP client to the chrome-service image minor
+// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
+// speaks the browser's CDP, so the client minor must track the server minor;
+// see docs/architecture/chrome-service.md "Image pin".
+const playwrightVersion = "1.48.2"
+
+// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
+// endpoint to become ready before giving up.
+const defaultBrowserTimeout = 60
+
+const (
+	chromeServiceNamespace = "chrome-service"
+	chromeServiceName      = "chrome-service"
+	chromeServiceCDPPort   = 9222
+)
+
+// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
+// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
+// guards against drift.
+//
+//go:embed browser_stealth.js
+var stealthJS string
+
+// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
+// installs the stealth init script, and runs the user's Playwright script.
+//
+//go:embed browser_runner.js
+var runnerJS string
+
+// browserOpts is the parsed form of `homelab browser run|open` arguments.
+type browserOpts struct {
+	mode      string // "run" | "open"
+	script    string // path to the user Playwright script (run mode)
+	url       string // initial URL (run: optional; open: required positional)
+	sharedCtx bool   // use the warmed persistent profile instead of a fresh context
+	keepOpen  bool   // leave the created context/pages open on exit
+	port      int    // explicit local port for the forward (0 = auto)
+	timeout   int    // CDP readiness timeout, seconds
+	help      bool
+}
+
+// parseBrowserArgs parses the args after `browser run` / `browser open`.
+func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
+	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
+	var positionals []string
+	atoi := func(s, flag string) (int, error) {
+		n, err := strconv.Atoi(s)
+		if err != nil {
+			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
+		}
+		return n, nil
+	}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "-h" || a == "--help":
+			o.help = true
+		case a == "--shared-context":
+			o.sharedCtx = true
+		case a == "--keep-open":
+			o.keepOpen = true
+		case a == "--url":
+			if i+1 < len(args) {
+				o.url = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--url="):
+			o.url = strings.TrimPrefix(a, "--url=")
+		case a == "--port":
+			if i+1 < len(args) {
+				n, err := atoi(args[i+1], "--port")
+				if err != nil {
+					return o, err
+				}
+				o.port = n
+				i++
+			}
+		case strings.HasPrefix(a, "--port="):
+			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
+			if err != nil {
+				return o, err
+			}
+			o.port = n
+		case a == "--timeout":
+			if i+1 < len(args) {
+				n, err := atoi(args[i+1], "--timeout")
+				if err != nil {
+					return o, err
+				}
+				o.timeout = n
+				i++
+			}
+		case strings.HasPrefix(a, "--timeout="):
+			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
+			if err != nil {
+				return o, err
+			}
+			o.timeout = n
+		case strings.HasPrefix(a, "-"):
+			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
+		default:
+			positionals = append(positionals, a)
+		}
+	}
+	if o.help {
+		return o, nil
+	}
+	switch mode {
+	case "run":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
+		}
+		o.script = positionals[0]
+	case "open":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
+		}
+		o.url = positionals[0]
+	}
+	return o, nil
+}
+
+// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
+// a real (non-headless) Chrome — the entire reason chrome-service exists.
+func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
+	var v struct {
+		Browser   string `json:"Browser"`
+		UserAgent string `json:"User-Agent"`
+	}
+	if e := json.Unmarshal(jsonBody, &v); e != nil {
+		return "", false, fmt.Errorf("parse /json/version: %w", e)
+	}
+	if v.Browser == "" {
+		return "", false, fmt.Errorf("/json/version had no Browser field")
+	}
+	healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
+		!strings.Contains(v.Browser, "Headless") &&
+		!strings.Contains(v.UserAgent, "Headless")
+	return v.Browser, healthy, nil
+}
+
+// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
+// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
+// NetworkPolicy that gates in-cluster callers.
+func buildPortForwardArgs(localPort int) []string {
+	return []string{"-n", chromeServiceNamespace, "port-forward",
+		"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
+}
+
+// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
+// client kept under the user cache dir.
+func browserClientPackageJSON() string {
+	return fmt.Sprintf(`{
+  "name": "homelab-browser-client",
+  "private": true,
+  "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
+  "dependencies": {
+    "playwright-core": "%s"
+  }
+}
+`, playwrightVersion)
+}
+
+// freePort asks the kernel for an unused ephemeral TCP port.
+func freePort() (int, error) {
+	l, err := net.Listen("tcp", "127.0.0.1:0")
+	if err != nil {
+		return 0, err
+	}
+	defer l.Close()
+	return l.Addr().(*net.TCPAddr).Port, nil
+}
+
+// browserClientDir is where the pinned node client + managed runner files live.
+func browserClientDir() (string, error) {
+	cache, err := os.UserCacheDir()
+	if err != nil || cache == "" {
+		home, herr := os.UserHomeDir()
+		if herr != nil {
+			return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
+		}
+		cache = filepath.Join(home, ".cache")
+	}
+	return filepath.Join(cache, "homelab", "browser-client"), nil
+}
+
+// installedPlaywrightVersion reads the version of the playwright-core already
+// installed in dir, or "" if absent/unreadable.
+func installedPlaywrightVersion(dir string) string {
+	b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
+	if err != nil {
+		return ""
+	}
+	var v struct {
+		Version string `json:"version"`
+	}
+	if json.Unmarshal(b, &v) != nil {
+		return ""
+	}
+	return v.Version
+}
+
+// ensureBrowserClient writes the managed runner/stealth/package files into dir
+// and lazily installs the pinned playwright-core (only when missing/mismatched),
+// so no per-user setup is needed and the client tracks the binary version.
+func ensureBrowserClient(dir string) error {
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		return err
+	}
+	files := map[string]string{
+		"package.json":      browserClientPackageJSON(),
+		"browser_runner.js": runnerJS,
+		"stealth.js":        stealthJS,
+	}
+	for name, content := range files {
+		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
+			return err
+		}
+	}
+	if installedPlaywrightVersion(dir) == playwrightVersion {
+		return nil
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
+	cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
+	cmd.Dir = dir
+	cmd.Stdout = os.Stderr
+	cmd.Stderr = os.Stderr
+	if err := cmd.Run(); err != nil {
+		return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
+	}
+	if got := installedPlaywrightVersion(dir); got != playwrightVersion {
+		return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
+	}
+	return nil
+}
+
+// waitForCDP polls the local CDP endpoint until it answers as a healthy
+// (non-headless) Chrome, or the timeout elapses.
+func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
+	deadline := time.Now().Add(timeout)
+	client := &http.Client{Timeout: 3 * time.Second}
+	var lastErr error
+	for time.Now().Before(deadline) {
+		resp, err := client.Get(cdpURL + "/json/version")
+		if err != nil {
+			lastErr = err
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		body, _ := io.ReadAll(resp.Body)
+		resp.Body.Close()
+		browser, healthy, herr := cdpHealthy(body)
+		if herr != nil {
+			lastErr = herr
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		if !healthy {
+			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
+		}
+		return browser, nil
+	}
+	if lastErr == nil {
+		lastErr = fmt.Errorf("timed out after %s", timeout)
+	}
+	return "", lastErr
+}
+
+// runBrowser is the orchestration: pick a port, ensure the pinned client, start
+// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
+func runBrowser(o browserOpts) error {
+	port := o.port
+	if port == 0 {
+		p, err := freePort()
+		if err != nil {
+			return fmt.Errorf("pick local port: %w", err)
+		}
+		port = p
+	}
+
+	dir, err := browserClientDir()
+	if err != nil {
+		return err
+	}
+	if err := ensureBrowserClient(dir); err != nil {
+		return err
+	}
+
+	// Start the forward in its own process group so the whole tree dies on cleanup.
+	pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
+	pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
+	var pfLog strings.Builder
+	pf.Stdout = &pfLog
+	pf.Stderr = &pfLog
+	if err := pf.Start(); err != nil {
+		return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
+	}
+
+	var once sync.Once
+	teardown := func() {
+		once.Do(func() {
+			if pf.Process != nil {
+				_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
+			}
+			_ = pf.Wait()
+		})
+	}
+	defer teardown()
+
+	// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
+	sigCh := make(chan os.Signal, 1)
+	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
+	defer signal.Stop(sigCh)
+	go func() {
+		if _, ok := <-sigCh; ok {
+			teardown()
+			os.Exit(130)
+		}
+	}()
+
+	cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
+	browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
+	if err != nil {
+		return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
+
+	return runBrowserNode(dir, cdpURL, o)
+}
+
+// runBrowserNode invokes the managed node runner with inputs passed via env.
+func runBrowserNode(dir, cdpURL string, o browserOpts) error {
+	env := append(os.Environ(),
+		"HOMELAB_CDP_URL="+cdpURL,
+		"HOMELAB_BROWSER_MODE="+o.mode,
+		"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
+		"NODE_PATH="+filepath.Join(dir, "node_modules"),
+	)
+	if o.url != "" {
+		env = append(env, "HOMELAB_BROWSER_URL="+o.url)
+	}
+	if o.script != "" {
+		abs, err := filepath.Abs(o.script)
+		if err != nil {
+			return err
+		}
+		if _, err := os.Stat(abs); err != nil {
+			return fmt.Errorf("script %s: %w", o.script, err)
+		}
+		env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
+	}
+	if o.sharedCtx {
+		env = append(env, "HOMELAB_BROWSER_SHARED=1")
+	}
+	if o.keepOpen {
+		env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
+	}
+	if o.mode == "open" {
+		shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
+		env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
+	}
+	cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
+	cmd.Env = env
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
--- a/cli/browser_runner.js
+++ b/cli/browser_runner.js
@ -0,0 +1,106 @@
+// homelab browser — node CDP runner (auto-managed; regenerated each run from the
+// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
+// chrome-service CDP endpoint, installs the stealth init script, then runs the
+// user's Playwright script (run mode) or opens a URL (open mode). All inputs
+// arrive via HOMELAB_* env vars set by the Go CLI.
+'use strict';
+const fs = require('fs');
+const { chromium } = require('playwright-core');
+
+async function main() {
+  const cdpURL = process.env.HOMELAB_CDP_URL;
+  if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
+  const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
+  const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
+  const initURL = process.env.HOMELAB_BROWSER_URL || '';
+  const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
+  const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
+  const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
+  const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
+
+  const browser = await chromium.connectOverCDP(cdpURL);
+
+  // Fresh isolated context by default (safe for the shared browser + concurrent
+  // callers); --shared-context reuses the warmed persistent profile.
+  let context;
+  let createdContext = false;
+  if (shared) {
+    const existing = browser.contexts();
+    if (existing.length) {
+      context = existing[0];
+    } else {
+      context = await browser.newContext();
+      createdContext = true;
+    }
+  } else {
+    context = await browser.newContext();
+    createdContext = true;
+  }
+
+  if (stealthPath) {
+    const stealth = fs.readFileSync(stealthPath, 'utf8');
+    if (stealth.trim()) await context.addInitScript(stealth);
+  }
+
+  const page = await context.newPage();
+  const log = (...a) => console.error('[browser]', ...a);
+
+  let exitCode = 0;
+  try {
+    if (initURL) {
+      await page.goto(initURL, { waitUntil: 'domcontentloaded' });
+    }
+    if (mode === 'open') {
+      console.log('url:    ' + page.url());
+      console.log('title:  ' + (await page.title()));
+      const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
+      console.log('--- visible text (truncated to 4000 chars) ---');
+      console.log(text.slice(0, 4000));
+      if (screenshotPath) {
+        await page.screenshot({ path: screenshotPath, fullPage: true });
+        console.log('screenshot: ' + screenshotPath);
+      }
+    } else {
+      if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
+      const src = fs.readFileSync(scriptPath, 'utf8');
+      // Run the user's source with page/context/browser/log in lexical scope.
+      // AsyncFunction body permits top-level await.
+      const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
+      const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
+      const result = await fn(page, context, browser, log);
+      if (result !== undefined) {
+        let out;
+        try {
+          out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
+        } catch (_) {
+          out = String(result);
+        }
+        console.log(out);
+      }
+    }
+  } catch (e) {
+    console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
+    exitCode = 1;
+  } finally {
+    if (!keepOpen) {
+      try {
+        // Close only what we created; never tear down the shared persistent context.
+        if (createdContext) {
+          await context.close();
+        } else {
+          await page.close();
+        }
+      } catch (_) { /* ignore */ }
+    }
+    // Disconnect from the CDP endpoint; this does NOT kill the remote browser.
+    try {
+      await browser.close();
+    } catch (_) { /* ignore */ }
+  }
+  process.exit(exitCode);
+}
+
+main().catch((e) => {
+  console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
+  process.exit(1);
+});
--- a/cli/browser_stealth.js
+++ b/cli/browser_stealth.js
@ -0,0 +1,54 @@
+// Minimal stealth init script for Playwright-driven Chromium.
+// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
+//   webdriver, chrome.runtime, navigator.plugins, navigator.languages,
+//   Permissions.query, WebGL getParameter (vendor + renderer spoof).
+// Run via context.add_init_script() so it executes before any page script.
+(() => {
+  // navigator.webdriver — most common detection, removed entirely.
+  Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
+
+  // window.chrome.runtime — many sites check that real Chrome exposes this.
+  if (!window.chrome) window.chrome = {};
+  window.chrome.runtime = window.chrome.runtime || {};
+
+  // navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
+  Object.defineProperty(navigator, 'plugins', {
+    get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
+  });
+
+  // navigator.languages — headless returns empty array.
+  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
+
+  // Permissions.query — headless returns 'denied' for notifications instead of 'default'.
+  const origQuery = window.navigator.permissions && window.navigator.permissions.query;
+  if (origQuery) {
+    window.navigator.permissions.query = (parameters) =>
+      parameters && parameters.name === 'notifications'
+        ? Promise.resolve({ state: Notification.permission })
+        : origQuery(parameters);
+  }
+
+  // WebGL getParameter — spoof vendor + renderer strings to a real GPU.
+  const spoofGl = (proto) => {
+    if (!proto) return;
+    const orig = proto.getParameter;
+    proto.getParameter = function (parameter) {
+      if (parameter === 37445) return 'Intel Inc.';                   // UNMASKED_VENDOR_WEBGL
+      if (parameter === 37446) return 'Intel Iris OpenGL Engine';     // UNMASKED_RENDERER_WEBGL
+      return orig.apply(this, arguments);
+    };
+  };
+  spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
+  spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
+
+  // disable-devtool.js (theajack/disable-devtool) auto-inits via a script
+  // tag with `disable-devtool-auto`. Its Performance detector trips under
+  // Playwright (CDP adds console.log latency vs console.table) and the
+  // redirect URL is hard-coded — for hmembeds that's google.com.
+  // Hide the auto-init marker so the library's IIFE exits early.
+  const origQS = Document.prototype.querySelector;
+  Document.prototype.querySelector = function (sel) {
+    if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
+    return origQS.apply(this, arguments);
+  };
+})();
--- a/cli/cmd_browser.go
+++ b/cli/cmd_browser.go
@ -0,0 +1,117 @@
+package main
+
+import "fmt"
+
+// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
+// from outside the cluster, for sites that detect/block headless automation.
+// The headless @playwright/mcp browser can load such sites but their gated
+// actions (submit/login) silently fail; this path submits first try. Mechanics
+// only — the agent supplies the Playwright script. See docs/adr/0013.
+
+func browserCommands() []Command {
+	return []Command{
+		{Path: []string{"browser"}, Tier: TierRead,
+			Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
+		{Path: []string{"browser", "run"}, Tier: TierWrite,
+			Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
+		{Path: []string{"browser", "open"}, Tier: TierWrite,
+			Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
+	}
+}
+
+func browserTopHelp([]string) error {
+	fmt.Print(browserHelp())
+	return nil
+}
+
+func browserRun(args []string) error {
+	o, err := parseBrowserArgs("run", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+func browserOpen(args []string) error {
+	o, err := parseBrowserArgs("open", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+// browserHelp carries the discoverability payload: WHEN to reach for this, and
+// the diagnostic cheat-sheet that lets the agent self-correct instead of
+// retrying a deterministic form blind (the failure mode that motivated this).
+func browserHelp() string {
+	return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP
+
+The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
+Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
+injects the same stealth.js the in-cluster callers use, and runs your script.
+
+USAGE
+  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
+  homelab browser open <url> [--shared-context] [--timeout S]
+
+WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser
+  Default to the Playwright MCP / headless browser for ALL routine browsing and
+  automation — it's interactive (snapshot per step), fast to start, isolated.
+  Reach for THIS command ONLY when headless is demonstrably blocked: a site
+  LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins
+  forever, or ONE request errors while its siblings 200. That is the signature
+  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
+  disable-devtool traps). This command presents as a real Chrome and usually
+  succeeds first try, but it's the shared cluster browser (slower startup, one
+  batch run, no per-step feedback): the escalation path, never the default.
+
+ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
+  ERR_FILE_NOT_FOUND (-6)   request intercepted/resolved locally by the
+                            automation layer — NOT a network/egress problem.
+                            (This is what silently broke the headless submit.)
+  ERR_CONNECTION_REFUSED /  real egress failure (DNS/route/firewall). These also
+  ERR_TIMED_OUT /           break the initial page load — if the page loaded,
+  ERR_NAME_NOT_RESOLVED     egress is fine and the cause is elsewhere.
+  one endpoint 500s while   server-side bot rejection of the automation, not
+  its siblings 200          your payload.
+
+HABITS
+  - Inspect the network panel BEFORE retrying a deterministic form; a blind
+    retry just repeats the same silent failure.
+  - Don't park a half-filled multi-step form across a user pause — the session
+    can expire; re-run the whole flow from this command in one shot.
+  - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging
+    of $HOME needed; just point setInputFiles at a local path.
+
+CONTEXT
+  Default: a FRESH incognito context, closed on exit — safe for the shared
+  browser and concurrent callers (e.g. tripit). Your script does its own login.
+  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
+  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
+
+SCRIPT CONTRACT (run mode)
+  Your file's body runs with page, context, browser and log() already in scope
+  (top-level await allowed). Return a value to print it. Example flow.js:
+
+    await page.goto('https://portal.example.com/login');
+    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
+    await page.click('button[type=submit]');
+    await page.waitForURL('**/dashboard');
+    return 'logged in: ' + page.url();
+
+  Run it:  homelab browser run flow.js
+
+NOTES
+  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
+    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
+  - The port-forward is always torn down, on success and on error.
+`
+}
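The error-code cheat-sheet in the help text above maps naturally onto a small classifier. The sketch below is illustrative only and not part of this change: `classifyNetError` and the `cause` type are hypothetical, and the sample messages assume that the failure string contains the Chromium `ERR_*` name.

```go
package main

import (
	"fmt"
	"strings"
)

// cause is a coarse diagnosis bucket taken from the help text's cheat-sheet.
type cause string

const (
	causeIntercepted cause = "intercepted locally by the automation layer"
	causeEgress      cause = "real egress failure (DNS/route/firewall)"
	causeUnknown     cause = "not covered by the cheat-sheet"
)

// classifyNetError (hypothetical helper) buckets a failure message by the
// ERR_* name it contains, following the cheat-sheet's three egress codes and
// the one local-interception code.
func classifyNetError(msg string) cause {
	switch {
	case strings.Contains(msg, "ERR_FILE_NOT_FOUND"):
		return causeIntercepted
	case strings.Contains(msg, "ERR_CONNECTION_REFUSED"),
		strings.Contains(msg, "ERR_TIMED_OUT"),
		strings.Contains(msg, "ERR_NAME_NOT_RESOLVED"):
		return causeEgress
	default:
		return causeUnknown
	}
}

func main() {
	// Sample messages are made up for illustration.
	for _, m := range []string{
		"net::ERR_FILE_NOT_FOUND at https://portal.example.com/submit",
		"net::ERR_NAME_NOT_RESOLVED at https://portal.example.com/",
		"net::ERR_ABORTED",
	} {
		fmt.Printf("%s -> %s\n", m, classifyNetError(m))
	}
}
```

The "one endpoint 500s while its siblings 200" row is left out on purpose: it depends on comparing several responses, not on a single error string.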
--- a/cli/cmd_browser_test.go
+++ b/cli/cmd_browser_test.go
@ -0,0 +1,172 @@
+package main
+
+import (
+	"os"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestParseBrowserArgsRun(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{
+		"flow.js", "--url", "https://example.com", "--shared-context",
+		"--port", "19999", "--timeout", "45", "--keep-open",
+	})
+	if err != nil {
+		t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
+	}
+	want := browserOpts{
+		mode: "run", script: "flow.js", url: "https://example.com",
+		sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
+	}
+}
+
+func TestParseBrowserArgsRunDefaults(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
+		t.Fatalf("defaults wrong: %+v", got)
+	}
+	if got.timeout != defaultBrowserTimeout {
+		t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
+	}
+}
+
+func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
+	if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
+		t.Fatalf("run without a script path should error")
+	}
+}
+
+func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
+	got, err := parseBrowserArgs("open", []string{"https://example.com"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://example.com" || got.mode != "open" {
+		t.Fatalf("open parse wrong: %+v", got)
+	}
+	if _, err := parseBrowserArgs("open", []string{}); err == nil {
+		t.Fatalf("open without a URL should error")
+	}
+}
+
+func TestParseBrowserArgsHelp(t *testing.T) {
+	for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
+		got, err := parseBrowserArgs("run", a)
+		if err != nil {
+			t.Fatalf("help parse %v: %v", a, err)
+		}
+		if !got.help {
+			t.Fatalf("args %v should set help", a)
+		}
+	}
+}
+
+func TestParseBrowserArgsEqualsForm(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
+		t.Fatalf("--flag=value form not parsed: %+v", got)
+	}
+}
+
+func TestCDPHealthy(t *testing.T) {
+	real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
+	browser, ok, err := cdpHealthy(real)
+	if err != nil || !ok {
+		t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
+	}
+	if !strings.HasPrefix(browser, "Chrome/") {
+		t.Fatalf("browser = %q, want Chrome/ prefix", browser)
+	}
+
+	headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
+	if _, ok, _ := cdpHealthy(headless); ok {
+		t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
+	}
+
+	if _, _, err := cdpHealthy([]byte("not json")); err == nil {
+		t.Fatalf("malformed /json/version body should error")
+	}
+}
+
+func TestBuildPortForwardArgs(t *testing.T) {
+	got := buildPortForwardArgs(18080)
+	want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
+	}
+}
+
+func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
+	pj := browserClientPackageJSON()
+	if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
+		t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
+	}
+}
+
+func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
+	// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
+	// client minor MUST match (protocol changes between minors).
+	if !strings.HasPrefix(playwrightVersion, "1.48.") {
+		t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
+	}
+}
+
+func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
+	h := browserHelp()
+	for _, want := range []string{
+		"homelab browser run",
+		"ERR_FILE_NOT_FOUND",
+		"ERR_CONNECTION_REFUSED",
+		"network panel",
+		"headless",
+		"--shared-context",
+	} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
+		}
+	}
+}
+
+func TestBrowserHelpIsTiered(t *testing.T) {
+	// --help must frame this as the ESCALATION path (default to headless first),
+	// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
+	// instructions. Guard against a regression to "co-equal choice" wording.
+	h := browserHelp()
+	for _, want := range []string{"Default to the", "escalation"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
+		}
+	}
+}
+
+func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
+	// The embedded copy must never drift from the source of truth that the
+	// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
+	canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
+	if err != nil {
+		t.Fatalf("read canonical stealth.js: %v", err)
+	}
+	if stealthJS != string(canonical) {
+		t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
+	}
+}
+
+func TestFreePortReturnsUsablePort(t *testing.T) {
+	p, err := freePort()
+	if err != nil {
+		t.Fatalf("freePort: %v", err)
+	}
+	if p <= 1024 || p > 65535 {
+		t.Fatalf("freePort returned %d, want an ephemeral port", p)
+	}
+}
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -0,0 +1,663 @@
+package main
+
+import (
+	"bufio"
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+	"syscall"
+)
+
+// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
+// Identity is the kernel UID; per-user creds live in that user's isolated Vault
+// path (secret/workstation/claude-users/<user>) read via their scoped token, and
+// decryption is done by the official `bw` CLI. See
+// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
+func vaultCommands() []Command {
+	return []Command{
+		{Path: []string{"vault", "setup"}, Tier: TierWrite,
+			Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
+		{Path: []string{"vault", "status"}, Tier: TierRead,
+			Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
+		{Path: []string{"vault", "list"}, Tier: TierRead,
+			Summary: "list your item names: vault list [--search Q]", Run: vaultList},
+		{Path: []string{"vault", "get"}, Tier: TierRead,
+			Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
+		{Path: []string{"vault", "search"}, Tier: TierRead,
+			Summary: "search your item names: vault search <query>", Run: vaultSearch},
+		{Path: []string{"vault", "code"}, Tier: TierRead,
+			Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
+		{Path: []string{"vault", "lock"}, Tier: TierWrite,
+			Summary: "lock/log out the local bw session", Run: vaultLock},
+		{Path: []string{"vault"}, Tier: TierRead,
+			Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
+			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
+	}
+}
+
+// vaultHelp is shown for bare `homelab vault`.
+func vaultHelp() string {
+	return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
+
+  homelab vault setup             one-time: store your master password + API key in your Vault path
+  homelab vault status            configured / unlocked / reachable (no secrets)
+  homelab vault list [--search Q] list your item names (no secrets)
+  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
+                                  TTY → clipboard (auto-clears); piped → stdout
+  homelab vault code <name>       current TOTP code
+  homelab vault lock              lock / log out the local bw session
+
+Creds live only in your own Vault path; the admin never sees them. Identity is
+your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
+(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
+`
+}
+
+const vwUserPathPrefix = "secret/workstation/claude-users/"
+
+// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
+type vwCreds struct {
+	Email          string
+	MasterPassword string
+	ClientID       string
+	ClientSecret   string
+}
+
+// cmdRunner shells out to an external command with an explicit environment and
+// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
+// a fake; realRunner is the production implementation.
+type cmdRunner func(name string, argv, envv []string) (string, error)
+
+func realRunner(name string, argv, envv []string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	out, err := cmd.Output()
+	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
+	// fetched secret with significant leading/trailing spaces is preserved.
+	return strings.TrimRight(string(out), "\r\n"), err
+}
+
+// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
+// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
+// processes). Used by setup to write the master password / client_secret.
+func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	cmd.Stdin = strings.NewReader(stdin)
+	out, err := cmd.Output()
+	return strings.TrimRight(string(out), "\r\n"), err
+}
+
+func vwCredsPath(user string) string { return vwUserPathPrefix + user }
+
+func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
+
+// readVaultField returns one field from a KV-v2 path, "" if absent/error.
+func readVaultField(run cmdRunner, field, path string) string {
+	out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
+	if err != nil {
+		return ""
+	}
+	return out
+}
+
+// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
+// A missing master password means the user hasn't onboarded.
+func loadCreds(run cmdRunner, user string) (vwCreds, error) {
+	p := vwCredsPath(user)
+	c := vwCreds{
+		Email:          readVaultField(run, "vaultwarden_email", p),
+		MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
+		ClientID:       readVaultField(run, "vaultwarden_client_id", p),
+		ClientSecret:   readVaultField(run, "vaultwarden_client_secret", p),
+	}
+	if c.MasterPassword == "" {
+		return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
+	}
+	return c, nil
+}
+
+// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
+var vaultCurrentUser = func() string { return os.Getenv("USER") }
+var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
+
+// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
+// do NOT inherit the full parent env (keeps stray secrets out of the child).
+func bwBaseEnv(appdata string) []string {
+	path := os.Getenv("PATH")
+	if path == "" {
+		path = "/usr/local/bin:/usr/bin:/bin"
+	}
+	return []string{
+		"PATH=" + path,
+		"HOME=" + os.Getenv("HOME"),
+		"BITWARDENCLI_APPDATA_DIR=" + appdata,
+		"BW_NOINTERACTION=true",
+	}
+}
+
+// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
+func bwSecretEnv(appdata string, c vwCreds, session string) []string {
+	env := bwBaseEnv(appdata)
+	env = append(env,
+		"BW_CLIENTID="+c.ClientID,
+		"BW_CLIENTSECRET="+c.ClientSecret,
+		"BW_PASSWORD="+c.MasterPassword,
+	)
+	if session != "" {
+		env = append(env, "BW_SESSION="+session)
+	}
+	return env
+}
+
+func bwLoginArgs() []string  { return []string{"login", "--apikey"} }
+func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
+func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
+func bwStatusArgs() []string { return []string{"status"} }
+
+// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
+// required. Unparseable/empty output → true (safer to attempt login).
+func bwNeedsLogin(statusJSON string) bool {
+	var s struct {
+		Status string `json:"status"`
+	}
+	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
+		return true
+	}
+	return s.Status == "unauthenticated" || s.Status == ""
+}
+
+func bwListArgs(search string) []string {
+	a := []string{"list", "items"}
+	if search != "" {
+		a = append(a, "--search", search)
+	}
+	return a
+}
+
+// bwUnlock runs `bw unlock` and returns the raw session key.
+func bwUnlock(run cmdRunner, env []string) (string, error) {
+	out, err := run("bw", bwUnlockArgs(), env)
+	if err != nil {
+		return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
+	}
+	return out, nil
+}
+
+// bwGet fetches one field of one item; session must be present in env.
+func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
+	return run("bw", bwGetArgs(field, name), env)
+}
+
+func returnMode(isTTY bool) string {
+	if isTTY {
+		return "clipboard"
+	}
+	return "stdout"
+}
+
+// stdoutIsTTY reports whether stdout is a character device (a terminal).
+func stdoutIsTTY() bool {
+	fi, err := os.Stdout.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
+// to stderr, so the clipboard path is only viable when stderr is a terminal).
+func stderrIsTTY() bool {
+	fi, err := os.Stderr.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
+// the system clipboard (works over SSH; no X11). osc52clear copies empty.
+func osc52(payload string) string {
+	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
+}
+func osc52clear() string { return "\x1b]52;c;\a" }
+
+// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
+// else we'd dump the secret's base64 into scrollback on unsupported terminals.
+func terminalAllowed(term, termProgram string) bool {
+	t := strings.ToLower(term)
+	p := strings.ToLower(termProgram)
+	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
+		if strings.Contains(t, ok) || strings.Contains(p, ok) {
+			return true
+		}
+	}
+	// A bare xterm-* TERM is not enough on its own: it passes only when
+	// TERM_PROGRAM names a known-good emulator (matched above).
+	return false
+}
+
+// opRecord is one CLI operation. ItemName is accepted for the caller's
+// convenience but is INTENTIONALLY never rendered into the log line — auditing
+// which of your own logins you opened is itself sensitive, and per-item reads
+// are invisible server-side anyway (spec §9a).
+type opRecord struct {
+	User       string
+	Verb       string
+	PID        int
+	PPID       int
+	ParentComm string
+	ItemName   string // never logged
+}
+
+func opLogLine(r opRecord) string {
+	return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
+		r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
+}
+
+// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
+func parentComm(ppid int) string {
+	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
+	if err != nil {
+		return ""
+	}
+	return strings.TrimSpace(string(b))
+}
+
+// writeOpLog emits one privacy-aware op-log line via logger(1), so it lands in
+// syslog and ships to Loki. Best-effort: errors are ignored and never fail the
+// command.
+func writeOpLog(r opRecord) {
+	_ = exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
+}
+
+func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
+
+// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
+// password to a core file. Best-effort.
+func hardenProcess() {
+	_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
+}
+
+// withUserLock serializes bw mutations for this user (concurrent Claude sessions
+// as the same user otherwise race bw's appdata). Returns an unlock func.
+func withUserLock(uid string) (func(), error) {
+	f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
+	if err != nil {
+		return nil, err
+	}
+	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
+		f.Close()
+		return nil, err
+	}
+	return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
+}
+
+// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
+type session struct {
+	env []string
+}
+
+// openSession resolves creds, ensures login, unlocks, and returns a ready env.
+// Caller must hold the user lock. appdata is created on tmpfs (0700).
+func openSession(run cmdRunner, user, uid string) (session, error) {
+	creds, err := loadCreds(run, user)
+	if err != nil {
+		return session{}, err
+	}
+	appdata := bwAppDataDir(uid)
+	if err := os.MkdirAll(appdata, 0700); err != nil {
+		return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
+	}
+	loginEnv := bwSecretEnv(appdata, creds, "")
+	// Ensure server is set and we're logged in (idempotent; ignore "already").
+	_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
+	st, _ := run("bw", bwStatusArgs(), loginEnv)
+	if bwNeedsLogin(st) {
+		if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
+			return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
+		}
+	}
+	sess, err := bwUnlock(run, loginEnv)
+	if err != nil {
+		return session{}, err
+	}
+	return session{env: bwSecretEnv(appdata, creds, sess)}, nil
+}
+
+type getOpts struct {
+	name  string
+	field string
+	json  bool
+}
+
+var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
+
+func parseGetArgs(args []string) (getOpts, error) {
+	o := getOpts{field: "password"}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--json":
+			o.json = true
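+		case a == "--field":
+			// A trailing --field with no value used to be silently ignored.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--field requires a value (password|username|uri|notes|totp)")
+			}
+			o.field = args[i+1]
+			i++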
+		case strings.HasPrefix(a, "--field="):
+			o.field = strings.TrimPrefix(a, "--field=")
+		case !strings.HasPrefix(a, "-") && o.name == "":
+			o.name = a
+		}
+	}
+	if o.name == "" {
+		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
+	}
+	if !validGetFields[o.field] {
+		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
+	}
+	return o, nil
+}
+
+// getValue opens a session and fetches one field. Its only side effects are the
+// runner calls and openSession creating the appdata dir, so it is unit-tested
+// with a fake runner (and a real UID, so the MkdirAll succeeds).
+func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return bwGet(run, s.env, o.field, o.name)
+}
+
+// clipboardDecision picks how to return a secret value. "stdout" prints it (a
+// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
+// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
+// base64 into scrollback, or silently fail because the OSC52 escape goes to a
+// non-terminal stderr).
+func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
+	if !stdoutTTY {
+		return "stdout"
+	}
+	if terminalAllowed(term, termProgram) && stderrTTY {
+		return "clipboard"
+	}
+	return "refuse"
+}
+
+// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
+// when stdout is NOT a terminal (i.e. piped to a machine consumer).
+func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
+
+// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
+// secret to a terminal's stdout/scrollback.
+func emitSecret(value string) {
+	switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
+	case "stdout":
+		fmt.Println(value)
+	case "clipboard":
+		fmt.Fprint(os.Stderr, osc52(value))
+		fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
+		clearClipboardAfter(30)
+	default: // refuse
+		fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
+	}
+}
+
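+// clearClipboardAfter spawns a background clear so the secret doesn't linger in
+// the clipboard. Best-effort. The child's stdout is wired to our stderr (where
+// the copy escape went); left nil, exec would send it to /dev/null and the
+// clear would never reach the terminal.
+func clearClipboardAfter(seconds int) {
+	cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear()))
+	cmd.Stdout = os.Stderr
+	_ = cmd.Start()
+}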
+
+// listNames extracts "name (id)" from `bw list items` JSON; never values.
+func listNames(jsonOut string) []string {
+	var items []struct {
+		ID   string `json:"id"`
+		Name string `json:"name"`
+	}
+	if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
+		return nil
+	}
+	out := make([]string, 0, len(items))
+	for _, it := range items {
+		out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
+	}
+	return out
+}
+
+func runList(run cmdRunner, user, uid, search string) ([]string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return nil, err
+	}
+	out, err := run("bw", bwListArgs(search), s.env)
+	if err != nil {
+		return nil, err
+	}
+	return listNames(out), nil
+}
+
+func vaultList(args []string) error {
+	hardenProcess()
+	search := ""
+	for i := 0; i < len(args); i++ {
+		if args[i] == "--search" && i+1 < len(args) {
+			search = args[i+1]
+			i++
+		} else if strings.HasPrefix(args[i], "--search=") {
+			search = strings.TrimPrefix(args[i], "--search=")
+		}
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	names, err := runList(realRunner, vaultCurrentUser(), uid, search)
+	if err != nil {
+		return err
+	}
+	for _, n := range names {
+		fmt.Println(n)
+	}
+	return nil
+}
+
+func vaultSearch(args []string) error {
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault search <query>")
+	}
+	return vaultList([]string{"--search", strings.Join(args, " ")})
+}
+
+func vaultCode(args []string) error {
+	hardenProcess()
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault code <name>")
+	}
+	name := args[0]
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
+	if err != nil {
+		return err
+	}
+	// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
+	writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
+	exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
+	emitSecret(val)
+	return nil
+}
+
+// statusSummary reports config/reachability without revealing secrets.
+func statusSummary(run cmdRunner, user, uid string) string {
+	if _, err := loadCreds(run, user); err != nil {
+		return "vault: not configured — run `homelab vault setup`"
+	}
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
+	}
+	if _, err := run("bw", []string{"sync"}, s.env); err != nil {
+		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
+	}
+	return "vault: configured, unlocked, reachable ✓"
+}
+
+func vaultStatus(args []string) error {
+	hardenProcess()
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
+	return nil
+}
+
+func vaultLock(args []string) error {
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	appdata := bwAppDataDir(uid)
+	_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
+	_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
+	if logoutErr == nil {
+		fmt.Println("locked")
+	}
+	return nil // lock/logout best-effort; never error the caller
+}
+
+// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
+// email nor the API client_id is a usable credential on its own.
+func vaultPatchPublicArgs(user, email, clientID string) []string {
+	return []string{"kv", "patch", vwCredsPath(user),
+		"vaultwarden_email=" + email,
+		"vaultwarden_client_id=" + clientID,
+	}
+}
+
+// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
+// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
+// on stdin by realRunnerStdin.
+func vaultPatchSecretArgs(user, key string) []string {
+	return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
+}
+
+// writeCreds stores all four fields in the user's Vault path. The two real
+// secrets (master password, API client_secret) go via stdin — never argv.
+func writeCreds(user string, c vwCreds) error {
+	if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
+		return err
+	}
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+		return err
+	}
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+		return err
+	}
+	return nil
+}
+
+// promptNoEcho reads one line without terminal echo (for the master password).
+func promptNoEcho(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	// stty acts on the terminal behind its stdin; exec defaults a nil Stdin to
+	// /dev/null, so the terminal must be passed through explicitly.
+	stty := func(arg string) {
+		c := exec.Command("stty", arg)
+		c.Stdin = os.Stdin
+		_ = c.Run() // best-effort
+	}
+	stty("-echo")
+	defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()
+	r := bufio.NewReader(os.Stdin)
+	line, err := r.ReadString('\n')
+	// Trim only the line terminator: a master password / API secret may
+	// legitimately contain leading/trailing spaces.
+	return strings.TrimRight(line, "\r\n"), err
+}
+
+func promptLine(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
+	return strings.TrimSpace(line), err
+}
+
+func vaultSetup(args []string) error {
+	hardenProcess()
+	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
+	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
+	email, err := promptLine("Vaultwarden email: ")
+	if err != nil {
+		return err
+	}
+	clientID, err := promptLine("API key client_id (user.xxxx): ")
+	if err != nil {
+		return err
+	}
+	clientSecret, err := promptNoEcho("API key client_secret: ")
+	if err != nil {
+		return err
+	}
+	master, err := promptNoEcho("Master password: ")
+	if err != nil {
+		return err
+	}
+	if master == "" || clientID == "" || clientSecret == "" {
+		return fmt.Errorf("all fields are required")
+	}
+	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
+	if err := writeCreds(vaultCurrentUser(), c); err != nil {
+		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
+		return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
+	return nil
+}
+
+func vaultGet(args []string) error {
+	hardenProcess()
+	o, err := parseGetArgs(args)
+	if err != nil {
+		return err
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	val, err := getValue(realRunner, user, uid, o)
+	if err != nil {
+		return err
+	}
+	writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
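+	if o.json {
+		if !jsonToStdoutOK(stdoutIsTTY()) {
+			return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
+		}
+		// Marshal rather than %q: Go-quoted strings are not always valid JSON
+		// (e.g. the \a and \x1b escapes Go emits for control characters).
+		b, err := json.Marshal(map[string]string{o.field: val})
+		if err != nil {
+			return err
+		}
+		fmt.Println(string(b))
+		return nil
+	}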
+	emitSecret(val)
+	return nil
+}
+
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
@ -0,0 +1,368 @@
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+	"os"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestVaultCommandsRegistered(t *testing.T) {
+	want := map[string]Tier{
+		"vault setup":  TierWrite,
+		"vault status": TierRead,
+		"vault list":   TierRead,
+		"vault get":    TierRead,
+		"vault search": TierRead,
+		"vault code":   TierRead,
+		"vault lock":   TierWrite,
+	}
+	got := map[string]Tier{}
+	for _, c := range vaultCommands() {
+		got[c.name()] = c.Tier
+	}
+	for name, tier := range want {
+		if got[name] != tier {
+			t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
+		}
+	}
+}
+
+func TestVaultGroupInRegistry(t *testing.T) {
+	if !isCommandGroup(buildRegistry(), "vault") {
+		t.Fatal("`vault` group not wired into buildRegistry()")
+	}
+}
+
+func TestVaultCredsPath(t *testing.T) {
+	if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
+		t.Fatalf("vwCredsPath = %q", got)
+	}
+}
+
+func TestBwAppDataDir(t *testing.T) {
+	if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
+		t.Fatalf("bwAppDataDir = %q", got)
+	}
+}
+
+// fakeRunner records calls and returns canned stdout/err. Keys are prefix-matched
+// against the full command line (name + " " + argv joined by spaces).
+type fakeRunner struct {
+	calls   [][]string
+	out     map[string]string // key: command-line prefix; keep keys non-overlapping (map order is random)
+	err     map[string]error
+	lastEnv []string
+}
+
+func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
+	f.calls = append(f.calls, append([]string{name}, argv...))
+	f.lastEnv = envv
+	key := name + " " + strings.Join(argv, " ")
+	for k, v := range f.out {
+		if strings.HasPrefix(key, k) {
+			return v, f.err[k]
+		}
+	}
+	return "", f.err[key]
+}
+
+func TestLoadCredsReadsFourFields(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo":          "emo@x.me",
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.abc",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "sek",
+	}}
+	c, err := loadCreds(f.run, "emo")
+	if err != nil {
+		t.Fatalf("loadCreds: %v", err)
+	}
+	want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
+	if !reflect.DeepEqual(c, want) {
+		t.Fatalf("loadCreds = %+v want %+v", c, want)
+	}
+}
+
+func TestLoadCredsUnconfigured(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{}} // every field empty
+	if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
+		t.Fatalf("want 'not configured' error, got %v", err)
+	}
+}
+
+func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
+	c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
+	env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
+	joined := strings.Join(env, "\n")
+	for _, want := range []string{
+		"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
+		"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
+	} {
+		if !strings.Contains(joined, want) {
+			t.Errorf("bwSecretEnv missing %q", want)
+		}
+	}
+	if !strings.Contains(joined, "PATH=") {
+		t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
+	}
+}
+
+func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
+	argv := bwGetArgs("password", "github")
+	for _, a := range argv {
+		if strings.Contains(a, "SESSION") || a == "--session" {
+			t.Fatalf("session must travel via env, not argv: %v", argv)
+		}
+	}
+	if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
+		t.Fatalf("bwGetArgs = %v", argv)
+	}
+}
+
+func TestBwListArgs(t *testing.T) {
+	if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
+		t.Fatalf("bwListArgs('') = %v", got)
+	}
+	if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
+		t.Fatalf("bwListArgs('git') = %v", got)
+	}
+}
+
+func TestBwUnlockReturnsSession(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
+	env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
+	sess, err := bwUnlock(f.run, env)
+	if err != nil || sess != "THE-SESSION-KEY" {
+		t.Fatalf("bwUnlock = %q, %v", sess, err)
+	}
+	// argv must use --passwordenv + --raw, never the password literal
+	last := f.calls[len(f.calls)-1]
+	if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
+		t.Fatalf("unlock argv = %v", last)
+	}
+}
+
+func TestReturnMode(t *testing.T) {
+	if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
+		t.Fatal("returnMode wrong")
+	}
+}
+
+func TestOSC52Encode(t *testing.T) {
+	got := osc52("secret")
+	want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
+	if got != want {
+		t.Fatalf("osc52 = %q want %q", got, want)
+	}
+	if osc52clear() != "\x1b]52;c;\a" {
+		t.Fatalf("osc52clear wrong: %q", osc52clear())
+	}
+}
+
+func TestTerminalAllowed(t *testing.T) {
+	allow := []struct{ term, prog string }{
+		{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
+		{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
+	}
+	for _, c := range allow {
+		if !terminalAllowed(c.term, c.prog) {
+			t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
+		}
+	}
+	deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
+	for _, c := range deny {
+		if terminalAllowed(c.term, c.prog) {
+			t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
+		}
+	}
+}
+
+func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
+	line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
+	for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
+		if !strings.Contains(line, must) {
+			t.Errorf("op-log missing %q: %s", must, line)
+		}
+	}
+	for _, mustNot := range []string{"Chase", "password", "secret"} {
+		if strings.Contains(line, mustNot) {
+			t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
+		}
+	}
+}
+
+func TestLockPath(t *testing.T) {
+	if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
+		t.Fatalf("vaultLockPath = %q", got)
+	}
+}
+
+func TestParseGetArgs(t *testing.T) {
+	o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
+	if err != nil || o.name != "github" || o.field != "username" || !o.json {
+		t.Fatalf("parseGetArgs = %+v err=%v", o, err)
+	}
+	d, _ := parseGetArgs([]string{"github"})
+	if d.field != "password" || d.json {
+		t.Fatalf("defaults wrong: %+v", d)
+	}
+	if _, err := parseGetArgs([]string{}); err == nil {
+		t.Fatal("get with no name must error")
+	}
+	if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
+		t.Fatal("invalid --field must error")
+	}
+}
+
+func TestListNamesParsing(t *testing.T) {
+	// bw list items returns JSON; listNames extracts name + id only.
+	js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
+	names := listNames(js)
+	if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
+		t.Fatalf("listNames = %v", names)
+	}
+}
+
+func TestStatusSummaryUnconfigured(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{}} // no creds
+	s := statusSummary(f.run, "emo", "1001")
+	if !strings.Contains(s, "not configured") {
+		t.Fatalf("status = %q", s)
+	}
+}
+
+func TestVaultPatchPublicArgs(t *testing.T) {
+	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
+	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
+		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("vaultPatchPublicArgs = %v", got)
+	}
+	for _, a := range got {
+		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
+			t.Fatalf("secret key leaked into public argv: %v", got)
+		}
+	}
+}
+
+func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
+	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
+		got := vaultPatchSecretArgs("emo", key)
+		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
+		if !reflect.DeepEqual(got, want) {
+			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
+		}
+		if got[len(got)-1] != key+"=-" {
+			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
+		}
+	}
+}
+
+// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
+// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
+// value may appear in any command's argv — secrets travel via env/stdin only.
+func TestNoSecretInArgvAcrossFlow(t *testing.T) {
+	uid := fmt.Sprintf("%d", os.Getuid())
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":        "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":    "CLIENTSEKRET",
+		"bw status":              `{"status":"locked"}`,
+		"bw unlock":              "SESSIONXYZ",
+		"bw get password github": "p@ss",
+	}}
+	if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
+		t.Fatalf("getValue: %v", err)
+	}
+	for _, call := range f.calls {
+		for _, arg := range call {
+			for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
+				if strings.Contains(arg, s) {
+					t.Errorf("secret %q leaked into argv: %v", s, call)
+				}
+			}
+		}
+	}
+	if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
+		t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
+	}
+}
+
+func TestClipboardDecision(t *testing.T) {
+	cases := []struct {
+		stdoutTTY, stderrTTY bool
+		term, prog, want     string
+	}{
+		{false, true, "xterm-kitty", "", "stdout"},
+		{true, true, "xterm-kitty", "", "clipboard"},
+		{true, true, "dumb", "", "refuse"},
+		{true, false, "xterm-kitty", "", "refuse"},
+	}
+	for _, c := range cases {
+		if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
+			t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want)
+		}
+	}
+}
+
+func TestJSONToStdoutOK(t *testing.T) {
+	if jsonToStdoutOK(true) {
+		t.Error("must refuse JSON secret on a terminal")
+	}
+	if !jsonToStdoutOK(false) {
+		t.Error("must allow JSON when piped")
+	}
+}
+
+func TestBwNeedsLogin(t *testing.T) {
+	if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
+		t.Error("unauthenticated → needs login")
+	}
+	if bwNeedsLogin(`{"status":"locked"}`) {
+		t.Error("locked → no login (just unlock)")
+	}
+	if bwNeedsLogin(`{"status":"unlocked"}`) {
+		t.Error("unlocked → no login")
+	}
+	if !bwNeedsLogin(`not json`) {
+		t.Error("unparseable → attempt login")
+	}
+}
+
+func TestVaultHelpMentionsSecurity(t *testing.T) {
+	h := vaultHelp()
+	for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("vault help missing %q", want)
+		}
+	}
+}
+
+func TestVaultBareGroupRegistered(t *testing.T) {
+	for _, c := range vaultCommands() {
+		if len(c.Path) == 1 && c.Path[0] == "vault" {
+			return
+		}
+	}
+	t.Fatal("bare `vault` help command not registered")
+}
+
+// getValue is the testable core: given a runner + opts, returns the secret value.
+func TestGetValueFlow(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":        "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":    "cs",
+		"bw status":              `{"status":"locked"}`,
+		"bw unlock":              "SESS",
+		"bw get password github": "p@ss",
+	}}
+	// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
+	uid := fmt.Sprintf("%d", os.Getuid())
+	val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
+	if err != nil || val != "p@ss" {
+		t.Fatalf("getValue = %q, %v", val, err)
+	}
+}
--- a/cli/homelab.go
+++ b/cli/homelab.go
@ -22,6 +22,8 @@ func buildRegistry() []Command {
 	reg = append(reg, obsCommands()...)
 	reg = append(reg, usageCommands()...)
 	reg = append(reg, haCommands()...)
+	reg = append(reg, browserCommands()...)
+	reg = append(reg, vaultCommands()...)
 	return reg
 }

--- a/docs/adr/0013-homelab-browser-verbs.md
+++ b/docs/adr/0013-homelab-browser-verbs.md
@ -0,0 +1,75 @@
+# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
+
+v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
+capability that already existed but was undiscoverable: driving the cluster's
+**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
+`svc/chrome-service:9222`) from the devvm, for sites that detect and block
+headless automation.
+
+## Motivating incident (2026-06-22)
+
+Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
+portal: the headless `@playwright/mcp` browser loaded the site and filled the
+entire multi-step form, but the **final submit silently failed** — Fixflo's
+pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
+spinner hung, and no issue was created. Root cause: headless-Chrome detection. The
+fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
+submitted first try (Fixflo ref IS22657587). That capability was documented
+(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
+it took ~40 min, three redundant full form re-runs, and a user hint. The agent
+also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
+of inspecting the network panel.
+
+## Decisions
+
+- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
+  rejected: the CLI is run every session (so the verb is *discoverable*), is
+  versioned, multi-user, and test-covered. A private, untested skill is none of
+  those. The command owns only the deterministic *mechanics* (port-forward,
+  stealth injection, lifecycle) — the agent supplies the Playwright script, so
+  *judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
+- **The failure was judgment, not setup friction**, so the CLI is paired with a
+  one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
+  payload in `browser --help`: the *when-to-use* signature (a site loads but a
+  gated action fails/hangs, or one request 500s/aborts while siblings 200 →
+  suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
+  = request resolved/intercepted by the automation layer, **not** egress;
+  egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
+  and would break the page load too). A command the agent doesn't think to run is
+  useless; the cheat-sheet is the actual fix for the misdiagnosis.
+- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
+  localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
+  NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
+  label. Readiness is asserted against `/json/version`: the endpoint must report
+  a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
+  **always** torn down (process-group kill + signal handler), on success and on
+  error — an acceptance requirement.
+- **Default to a fresh incognito context; `--shared-context` opts into the warmed
+  profile.** chrome-service is a single shared browser with a persistent profile.
+  A fresh, always-closed context is safe for concurrent callers (tripit's fare
+  scrape connects per-quote) and is what production already does. The warmed
+  persistent profile (cookies from a manual noVNC login) is opt-in for flows that
+  need a pre-logged-in session.
+- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
+  chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
+  Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
+  changes between Playwright minors; the devvm's ambient Python Playwright was
+  1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
+  regardless of local drift. `playwright-core` (not `playwright`) because no
+  browser binary is needed — we connect to the remote one.
+- **Self-provision the client lazily, no per-user setup.** The pinned client is
+  installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
+  guarded) on first use, alongside the embedded runner + stealth files. node is
+  already fleet-wide; this avoids coupling the feature to a provisioner change
+  and keeps it self-contained and self-healing. The client runs on the devvm, so
+  `setInputFiles` streams local files to the remote browser over CDP — no
+  `chmod`/staging-dir workaround on the CDP path.
+- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
+  copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
+  in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
+  `go:embed` can't reach outside the package dir, hence the vendored copy rather
+  than a path reference.
+- **Scope held at two action verbs + help.** `run` (arbitrary script — the
+  workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
+  the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
+  via `usage top` (ADR-0011) before adding more.
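+
+The error-code cheat-sheet described above can be sketched as a small lookup.
+This is a minimal illustration, not the CLI's actual implementation: the name
+`classify_net_error` is hypothetical, and expanding the abbreviated
+`_TIMED_OUT` entry to both `ERR_CONNECTION_TIMED_OUT` and `ERR_TIMED_OUT` is an
+assumption.
+
+```python
+# Illustrative only: the real cheat-sheet lives in `browser --help`.
+AUTOMATION_LAYER = {"ERR_FILE_NOT_FOUND"}
+
+# Assumption: the abbreviated list in the ADR is expanded to these names;
+# both common spellings of the timeout code are accepted.
+EGRESS = {
+    "ERR_CONNECTION_REFUSED",
+    "ERR_CONNECTION_TIMED_OUT",
+    "ERR_TIMED_OUT",
+    "ERR_NAME_NOT_RESOLVED",
+}
+
+
+def classify_net_error(message: str) -> str:
+    """Map a Chromium net:: error string to a likely cause."""
+    parts = message.split()
+    # Browser errors usually look like "net::ERR_X at <url>"; keep the code only.
+    code = parts[0].removeprefix("net::") if parts else ""
+    if code in AUTOMATION_LAYER:
+        return "automation-layer"  # resolved/intercepted locally, not egress
+    if code in EGRESS:
+        return "egress"  # would normally break the page load too
+    return "unknown"
+```
+
+An `automation-layer` result on a single request, while sibling requests
+succeed, points at headless detection rather than a network problem.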
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@ -0,0 +1,35 @@
+---
+status: accepted
+date: 2026-06-24
+---
+
+# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
+
+As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
+
+## Considered options
+
+- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
+- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
+- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
+- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
+- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
+- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
+- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
+
+## Consequences
+
+- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
+- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
+- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
+- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
+- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
+- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
+- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
+- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
+
+## As-built (2026-06-25)
+
+Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
+
+Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@ -112,17 +112,32 @@ External caller (dev box):
  @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
 ```

+## Browser binary — real Google Chrome (for proprietary codecs)
+
+The chrome-service container runs **real Google Chrome**, not the bundled
+Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
+(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
+`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
+The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
+
+**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
+so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
+`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
+decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
+worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
+the lib stripped) and Chrome-for-Testing is also codec-less — only
+`google-chrome-stable` carries them.
+
 ## Image pin

-Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
-`stacks/chrome-service/main.tf`) and the Python client
-(`playwright==1.48.0` in callers' `requirements.txt`) **must match
-minor-versions**. Bump in lockstep — Playwright protocol changes between
-minors and the client cannot connect to a mismatched server.
-
-The harvester + snapshot-server sidecar use
-`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
-minor, with Python-side bindings pre-installed.
+The Playwright base + the Python client (`playwright==1.48.0` in callers'
+`requirements.txt`) and the snapshot sidecars
+(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
+minor-versions. The chrome-service browser is now real Google Chrome (a newer
+milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
+fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
+version-tolerant — verified working against this Chrome. If a future Chrome
+milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.

 ## Storage

@ -167,7 +182,29 @@ minor, with Python-side bindings pre-installed.
  `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
  `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
  exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
-  Authentik-gated.
+  Authentik-gated. The bare host serves `vnc.html` (image symlinks
+  `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
+  to skip the Connect button. The view is **black when no browser window is
+  open** (idle) — that is normal, not a failed connection. Chrome is launched
+  with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
+  (no window manager runs, so without it Chrome opens at its profile-persisted
+  size and the rest of the framebuffer shows as a black cut-off).
+
+### noVNC fd-sweep gotcha (stuck "Connecting")
+
+If the noVNC client hangs on **"Connecting" until it times out**, the cause
+is almost always x11vnc's fd-table sweep: containerd grants pods
+`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
+every client connection, so the RFB handshake never completes (websockify
+accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
+the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
+x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
+(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` —
+healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
+— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
+wrapper in `main.tf` (so it applies deterministically even though the image is
+`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
+as the android-emulator stack.
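+
+The handshake timing above can also be scripted. The sketch below is a
+slightly longer form of the one-liner, adding a timeout so a broken x11vnc
+does not hang the check; the function name `time_rfb_banner` and the 5-second
+default are assumptions, not part of the stack.
+
+```python
+import socket
+import time
+
+
+def time_rfb_banner(host="127.0.0.1", port=5900, timeout=5.0):
+    """Return (banner, seconds); banner is None if none arrives in time."""
+    start = time.monotonic()
+    try:
+        with socket.create_connection((host, port), timeout=timeout) as sock:
+            sock.settimeout(timeout)
+            # A healthy server sends a 12-byte banner such as b"RFB 003.008\n".
+            banner = sock.recv(12) or None
+    except OSError:
+        # Covers refused connections and timeouts (socket.timeout is an OSError).
+        banner = None
+    return banner, time.monotonic() - start
+```
+
+Run it from a sibling container: a healthy pod returns the banner in well
+under a second, while the fd-sweep bug shows up as `None` after the full
+timeout.
+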
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -180,6 +217,45 @@ minor, with Python-side bindings pre-installed.
 See `stacks/chrome-service/README.md` for the recipe (label namespace,
 inject `CHROME_CDP_URL`, vendor `stealth.js`).

+## Driving from OUTSIDE the cluster (`homelab browser`)
+
+Agents on the devvm reach this browser through the **`homelab browser`** CLI
+(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
+`connect_over_cdp` recipe. It is the **escalation path, not the default**:
+agents default to the Playwright MCP / headless browser for all routine
+automation, and reach for `homelab browser` ONLY when headless is blocked — a
+site loads but a gated action (submit/login) silently fails or hangs, the
+signature of headless / anti-bot detection. (Same tiered rule lives in
+`~/code/CLAUDE.md` and `homelab browser --help`.)
+
+```text
+devvm:  homelab browser run flow.js
+          │  kubectl port-forward svc/chrome-service :9222  (random local port)
+          ▼
+   http://127.0.0.1:<port>  ──►  chrome-service pod :9222 (CDP)
+          │  assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
+          │  node + playwright-core@1.48.2 → connectOverCDP
+          │  context.addInitScript(stealth.js)   ← same vendored file as in-cluster
+          │  run the user's Playwright script with page/context/browser in scope
+          └─ port-forward always torn down (success or error)
+```
+
+Key facts:
+
+- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
+  API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
+  label — unlike in-cluster callers.
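+- **Client pinned to the image minor.** The node client is
+  `playwright-core@1.48.2` (matches the `v1.48.0-noble` Playwright base and its
+  bundled Chromium 130), installed lazily into
+  `~/.cache/homelab/browser-client/`. Bump it when the Playwright base bumps;
+  the running browser is now real Google Chrome, and raw CDP attach has proven
+  version-tolerant against it (see "Image pin" above), so strict lockstep is a
+  precaution rather than a hard requirement.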
+- **Default context is a fresh incognito one** (closed on exit), safe for the
+  shared browser; `--shared-context` reuses the warmed persistent profile.
+- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
+  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
+  CLI's stealth never diverges from the in-cluster callers'.
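+
+The readiness assertion in the diagram above can be sketched as follows. This
+is a hedged illustration in Python rather than the CLI's own code: the helper
+names `is_real_chrome` and `assert_ready` are hypothetical, and the only fact
+relied on is that the CDP `/json/version` endpoint returns a JSON object with a
+`Browser` field.
+
+```python
+import json
+import urllib.request
+
+
+def is_real_chrome(version_info: dict) -> bool:
+    """True only for a headed build, e.g. "Chrome/130.0.0.0"."""
+    # "HeadlessChrome/..." does not start with "Chrome/", so it is rejected.
+    return version_info.get("Browser", "").startswith("Chrome/")
+
+
+def assert_ready(port: int, timeout: float = 5.0) -> str:
+    """Query the forwarded CDP port and fail loudly on a headless browser."""
+    url = f"http://127.0.0.1:{port}/json/version"
+    with urllib.request.urlopen(url, timeout=timeout) as resp:
+        info = json.load(resp)
+    if not is_real_chrome(info):
+        raise RuntimeError(f"unexpected browser: {info.get('Browser')!r}")
+    return info["Browser"]
+```
+
+`assert_ready` would be called with the random local port chosen by the
+port-forward, before any Playwright script is attached.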
+
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
 - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

+#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
+
+Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route through the default `slack-warning` receiver → **`#alerts`**.
+
+| Alert | Expr (abridged) | For | Severity |
+|---|---|---|---|
+| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
+| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
+
+The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@ -543,6 +543,10 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

+**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
+
+**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose`→`diagnosing-bugs`, `write-a-skill`→`writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
+
 **Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).

 **Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -364,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).

+#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
+
+The durable **east-west flow trail** (below) is now the preferred data source for
+the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
+faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
+(ADR-0014: "Enforcement gains a better data source"). The unique observed
+namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
+namespaces a source is observed talking to (the `allow` set that seeds its
+NetworkPolicy):
+
+```sql
+SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
+```
+
+The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
+observation caveat) is in
+[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
+**External / public-internet egress is NOT in this table** (empty-namespace flows
+are dropped) — for those destinations keep using the Calico flow-log observation
+(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
+existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
+out of scope** of the trail — it is observe-and-derive only.
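+
+To try the derivation query without touching the cluster, the sketch below
+runs it against an in-memory SQLite stand-in for the `edge` table. The real
+table lives in Postgres (CNPG); the primary key, the sample namespace `app-a`,
+and the helper names are assumptions made for illustration only.
+
+```python
+import sqlite3
+
+
+def demo_db() -> sqlite3.Connection:
+    """Build a tiny stand-in for the edge table (schema is assumed)."""
+    conn = sqlite3.connect(":memory:")
+    conn.execute(
+        "CREATE TABLE edge (src_ns TEXT, dst_ns TEXT, action TEXT, "
+        "first_seen TEXT, last_seen TEXT, flow_count INTEGER, "
+        "PRIMARY KEY (src_ns, dst_ns, action))"
+    )
+    rows = [  # hypothetical observations
+        ("app-a", "monitoring", "allow", "2026-06-24", "2026-06-25", 12),
+        ("app-a", "dbaas", "allow", "2026-06-24", "2026-06-25", 40),
+        ("app-a", "kube-system", "deny", "2026-06-25", "2026-06-25", 1),
+        ("app-b", "dbaas", "allow", "2026-06-25", "2026-06-25", 3),
+    ]
+    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)", rows)
+    return conn
+
+
+def observed_allow_set(conn: sqlite3.Connection, src_ns: str) -> list:
+    """Namespaces that src_ns was observed talking to with action='allow'."""
+    cur = conn.execute(
+        "SELECT DISTINCT dst_ns FROM edge "
+        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
+        (src_ns,),
+    )
+    return [row[0] for row in cur]
+```
+
+The result only seeds a NetworkPolicy draft; the observation-window caveat in
+the runbook still applies before anything is enforced.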
+
+### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
+
+The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
+carried no identity). **Service identity = the workload's namespace** (primary),
+refined by a `service-identity` label in the few multi-Service namespaces
+(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
+
+1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
+   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
+   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
+   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
+   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
+   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
+   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
+   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
+   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
+2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
+   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
+   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
+   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
+   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
+   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
+   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
+   private key into TF state — **re-apply the stack if the operator rotates that
+   Secret**.
+3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
+   **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
+   that webhook's Slack app isn't a member of `#security`; see runbook).
+
+The trail is **attribution-grade, not cryptographic** (reconstructs events in a
+trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
+limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
+the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
+(see monitoring.md). Full as-built, query recipes, and troubleshooting:
+[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
+`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
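+
+The edge upsert in layer 2 can be pictured roughly as below. This is a sketch
+under stated assumptions, not the aggregator's code: it uses SQLite in place of
+Postgres, assumes `(src_ns, dst_ns, action)` is the unique key, assumes
+`flow_count` increments once per flow, and the name `record_flow` is
+hypothetical.
+
+```python
+import sqlite3
+
+SCHEMA = """
+CREATE TABLE IF NOT EXISTS edge (
+    src_ns TEXT, dst_ns TEXT, action TEXT,
+    first_seen TEXT, last_seen TEXT, flow_count INTEGER,
+    PRIMARY KEY (src_ns, dst_ns, action)
+)
+"""
+
+# first_seen is set only on insert; later flows bump last_seen and the count.
+UPSERT = """
+INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
+VALUES (?, ?, ?, ?, ?, 1)
+ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
+    last_seen = excluded.last_seen,
+    flow_count = edge.flow_count + 1
+"""
+
+
+def record_flow(conn, src_ns, dst_ns, action, seen_at) -> bool:
+    """Upsert one flow; return False when the flow is dropped."""
+    # Drop empty-namespace (public-internet) flows and self-edges.
+    if not src_ns or not dst_ns or src_ns == dst_ns:
+        return False
+    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
+    return True
+```
+
+Keying on the namespace pair keeps the table low-cardinality: repeated flows
+between the same two namespaces update one row instead of adding new ones.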
+
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md
+++ b/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md
@ -0,0 +1,243 @@
+# External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc
+
+> **Status:** ✅ **COMPLETE (2026-06-22).** ESO at chart/app **2.6.0**; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0) — the last k8s-1.35 blocker is cleared. Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332) — reconciled via `init -upgrade` + committed.
+> **Scope:** Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`.
+> **Owner:** Viktor Barzin. **Author:** Claude (research + design only — no changes applied).
+>
+> **EXECUTION CORRECTION + STATUS (2026-06-21 — "let's do the ESO migration"):** The cluster is already on **k8s 1.34.9** (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative *tested* ranges, not hard limits). **The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34 — there is NO k8s interleave; IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3.** Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears). Exact hop sequence (latest patch per minor): **0.13.0 → 0.14.4 → 0.15.1 → 0.16.2** [rewrite all 104 CRs to `v1` here] → **0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0**. Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed).
+>
+> **EXECUTION LOG:**
+> - **✅ Phase 1 DONE (2026-06-21):** ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → **0.16.2**, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead — `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 **both `v1beta1` and `v1` are served** (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`.
+> - **⏳ Phase 2 PENDING — findings confirmed (decisive for execution):** (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 **forces a REPLACE** (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. (b) Our ExternalSecrets use **`creationPolicy=Owner`** (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can **cascade-GC the Secret** before ESO recreates it. → **Phase 2 must be done carefully, NOT a blind bulk apply:** (1) snapshot ALL target Secrets first (backstop); (2) **empirically validate on the FIRST live stack** — migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied).
+> - **Phase 3 PENDING:** hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing **0.17 is the point of no return**.
+
+---
+
+## 1. Goal & why
+
+ESO is the **last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade** (Kyverno was cleared to 1.18.1 earlier today). The installed ESO `0.12.x` supports only Kubernetes **1.19 → 1.31** ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The `2.x` series supports **k8s 1.34–1.35**, which clears the gate.
+
+The hard part is not the chart bump itself — it is that **ESO removed the `external-secrets.io/v1beta1` API**, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared `v1beta1`. If we upgrade past the removal version without first rewriting the manifests to `v1`, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break).
+
+**Downtime tolerance:** brief, recoverable downtime of the ESO *controller* is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes `Secret` objects that apps mount (DB creds, API keys). Those must survive continuously.
+
+---
+
+## 2. Current state
+
+### 2.1 Versions
+| Component | Current | Target |
+|---|---|---|
+| Helm chart `external-secrets` | **0.12.1** | **2.6.0** |
+| App / controller image | **v0.12.1** | **v2.6.0** |
+| API version of all CRs | **`external-secrets.io/v1beta1`** | **`external-secrets.io/v1`** |
+| Repo: `https://charts.external-secrets.io` | (unchanged) | (unchanged) |
+
+ESO stack: `stacks/external-secrets/main.tf`. `helm_release.external_secrets` pins `version = "0.12.1"`, namespace `external-secrets` (separate `kubernetes_namespace` resource, not `create_namespace`), and the **only** chart value set is `installCRDs = true` (via `yamlencode({ installCRDs = true })`). No webhook/replica/resource overrides.
+
+### 2.2 Inventory (live, from `stacks/`)
+| Kind | Count | apiVersion | Where |
+|---|---|---|---|
+| **ExternalSecret** (`kubernetes_manifest`) | **104** | all `v1beta1` (0 mismatches) | 73 `.tf` files |
+| **ClusterSecretStore** (definitions) | **2** | both `v1beta1` | `stacks/external-secrets/main.tf` |
+| SecretStore | 0 | — | — |
+| PushSecret | 0 | — | — |
+| ClusterExternalSecret | 0 | — | — |
+
+- **Only ONE apiVersion string exists in the whole tree:** `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite.
+- **`secretStoreRef` split:** 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106 — only **2** are real store definitions.
+- **21 files carry >1 ExternalSecret** (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2; 1 + 7 + 13 = 21). The 104-vs-73 gap is these multi-secret files: 4 + 14 + 13 = 31 extra ExternalSecrets.
+- **Nested-module ExternalSecrets** (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`.
+- **Docs are STALE:** `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is **104 ExternalSecrets / 73 files / 26 db-refs**. Fix in the migration PR.
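+
+A minimal sketch of how the counts above could be re-derived before scripting the bump. It assumes a `grep` that supports `-r` and `-o` (GNU and BSD both do); the function names are illustrative and not part of the repo.
+
+```shell
+# Total number of v1beta1 apiVersion strings under a tree.
+# grep -o prints one line per match, so wc -l is the total across all files.
+count_v1beta1() {
+  grep -roF 'external-secrets.io/v1beta1' "$1" 2>/dev/null | wc -l | tr -d ' '
+}
+
+# Number of files that contain at least one v1beta1 apiVersion string.
+count_v1beta1_files() {
+  grep -rlF 'external-secrets.io/v1beta1' "$1" 2>/dev/null | wc -l | tr -d ' '
+}
+```
+
+Run as `count_v1beta1 stacks/`; if the inventory above still holds, the total should be 106 (104 ExternalSecrets + 2 ClusterSecretStores).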
+
+### 2.3 The two ClusterSecretStores (`stacks/external-secrets/main.tf`)
+Both `kubernetes_manifest`, both `external-secrets.io/v1beta1`, both `depends_on = [helm_release.external_secrets]`:
+- **`vault-kv`** → Vault KV **v2** at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`.
+- **`vault-database`** → identical except `path = "database"`, **`version = "v1"`** (Vault DB engine, KV-v1-style).
+
+ESO's Vault auth role `eso` (`stacks/vault/main.tf:486-511`): policy `eso-reader` (`secret/data/*` read+list, deny `secret/data/vault`, `database/static-creds/*` read), `token_ttl = token_period = 864000` (10d, periodic/auto-renew).
+
+### 2.4 Tier-0 / state
+ESO is **Tier-0 (bootstrap)** (`.claude/CLAUDE.md` "Terraform State — Two-Tier Backend"; root `terragrunt.hcl` `tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]`). Tier-0 ⇒ **local SOPS-encrypted state in git** (`state/stacks/external-secrets/terraform.tfstate`), NOT the PG backend. Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`; SOPS decrypt via Vault Transit (primary) → age fallback. **Tier-0 must apply before PG is reachable**, so the ESO upgrade cannot depend on PG.
+
+### 2.5 Provider versions (`stacks/external-secrets/providers.tf`)
+- `required_providers` declares **only** `vault = hashicorp/vault, ~> 4.0`.
+- `provider "kubernetes"` and `provider "helm"` are declared **without version constraints** (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the **v3-style nested `kubernetes = {…}` argument** (not the legacy `kubernetes {}` block) ⇒ helm provider is **v3.x or v4.x** in the lockfile. **No `kubectl` provider** in this stack. No `required_version` pinned here.
+- ⚠️ **Verify the resolved helm provider version** in `.terraform.lock.hcl` before starting — the prompt referenced `~> 4.0` for helm; the *stack* only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5).
+
+### 2.6 Plan-time coupling (the cross-cutting risk)
+**15 stacks read ESO-created Secrets at plan time** via `data "kubernetes_secret"` (avoids a Vault dependency at plan): `actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium)`.
+
+The documented **first-apply gotcha** (`.claude/CLAUDE.md`, `docs/architecture/secrets.md:360`, `stacks/fire-planner/main.tf:574`): the Secret must exist before the `data "kubernetes_secret"` plans, so on first creation you must `terragrunt apply -target=kubernetes_manifest.<external_secret>` first, then full apply. **Why this matters for the migration:** the `kubernetes_manifest` provider treats `apiVersion` as part of resource identity, so bumping `v1beta1`→`v1` **forces a replace** of all 104 ExternalSecrets. During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's `data "kubernetes_secret"` plans → the two-phase `-target` apply is needed **fleet-wide for the v1 rewrite step, not just fire-planner.**
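+
+The two-phase pattern described above can be sketched as a small wrapper. This is an illustration under assumptions: `TG` must point at the repo's `scripts/tg` wrapper by absolute path (how the wrapper is located from inside a stack directory is not stated in this doc), and `two_phase_apply` is a hypothetical helper name.
+
+```shell
+# two_phase_apply <stack-dir> <resource-address>...
+# Phase A applies only the named ExternalSecret manifests so their Secrets exist;
+# Phase B runs the full apply so plan-time data "kubernetes_secret" reads resolve.
+two_phase_apply() {
+  : "${TG:?set TG to the absolute path of the scripts/tg wrapper}"
+  stack_dir=$1
+  shift
+  [ "$#" -gt 0 ] || { echo "usage: two_phase_apply <stack-dir> <address>..." >&2; return 2; }
+
+  # Turn each address into a -target=<address> argument without word splitting:
+  # append the new arguments, then drop the original n addresses.
+  n=$#
+  for addr in "$@"; do
+    set -- "$@" "-target=$addr"
+  done
+  shift "$n"
+
+  # Run in a subshell so the caller's working directory is untouched.
+  # The full apply only runs if the targeted apply succeeded.
+  ( cd "$stack_dir" && "$TG" apply "$@" && "$TG" apply )
+}
+```
+
+For example, `two_phase_apply stacks/fire-planner kubernetes_manifest.<external_secret>` with the real resource address substituted. State commits and `git push` are left to the caller.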
+
+### 2.7 Vault DB rotation (rotation interplay)
+`stacks/vault/main.tf`: **25 `vault_database_secret_backend_static_role`, every one `rotation_period = 604800` (7 days)** (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via `vault-database` → `remoteRef.key = "static-creds/<role>"`. Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. **Implication:** any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly.
+
+### 2.8 git-crypt landmine (adjacent, not in ESO stack)
+`.claude/CLAUDE.md:146` + `docs/architecture/ci-cd.md:108` + `stacks/kyverno/modules/kyverno/tls-secret-sync.tf`: on a **git-crypt-locked clone**, `kubernetes_secret.tls_secret` reads `secrets/fullchain.pem`/`privkey.pem` via `file()` which returns **ciphertext**, corrupting the wildcard TLS secret Kyverno clones cluster-wide. **The ESO stack itself has NO `file()` reads of git-crypt secrets** — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an **unlocked** checkout.
+
+---
+
+## 3. Target
+
+- Helm chart **`external-secrets` 2.6.0** (app **v2.6.0**), repo `https://charts.external-secrets.io`.
+- All ExternalSecret + ClusterSecretStore CRs on **`external-secrets.io/v1`**.
+- Cluster ESO compatible with **k8s 1.34–1.35** ⇒ unblocks the autonomous 1.35 upgrade.
+
+---
+
+## 4. Key findings (the decisive facts)
+
+> Sourced from ESO official docs + GitHub release notes; verbatim quotes below.
+
+### 4.1 Chart version == app version (premise check)
+The chart version and app version are released **in lockstep and are the same number**. `Chart.yaml`: `version: 0.12.1 / appVersion: v0.12.1`; `version: 2.6.0 / appVersion: v2.6.0`. The app series ran `…0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0`. **Crucially, the `v1.0.0` and `v2.0.0` APP releases are NOT the `external-secrets.io/v1` API** — `v1.0.0` is just "continuation after 0.20.4" (release diff `v0.20.4...v1.0.0`, no API change), and `v2.0.0`'s only breaking change is removing the unmaintained **Alibaba + Device42** providers (we use neither — only Vault). The API migration happened back at **0.16/0.17**. Source: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) · [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0).
+
+### 4.2 Version path: **NO skipping minors — step one minor at a time**
+Official policy, verbatim ([stability-support](https://external-secrets.io/latest/introduction/stability-support/)):
+> "**Upgrade version by version** — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions."
+
+Maintainer (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @gusfcarvalho): *"We are pre release… Every minor bump should be treated as a major bump until we go 1.0."* ⇒ **You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly.** You must step each minor: `0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x`.
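+
+The one-minor-at-a-time rule can be encoded so that a script never skips a hop. This sketch hard-codes the hop list recorded in the execution correction at the top of this doc; `HOPS` and `next_hop` are illustrative names, not existing tooling.
+
+```shell
+# Hop sequence from this doc's execution correction (latest patch per minor).
+HOPS="0.12.1 0.13.0 0.14.4 0.15.1 0.16.2 0.17.0 0.18.2 0.19.2 0.20.4 1.0.0 1.1.1 1.2.1 1.3.2 2.0.1 2.1.0 2.2.0 2.3.0 2.4.1 2.5.0 2.6.0"
+
+# next_hop <current-version>
+# Prints the version that follows <current-version> in HOPS.
+# Returns non-zero if the version is unknown or is already the last hop.
+next_hop() {
+  found=0
+  for v in $HOPS; do
+    if [ "$found" -eq 1 ]; then
+      echo "$v"
+      return 0
+    fi
+    [ "$v" = "$1" ] && found=1
+  done
+  return 1
+}
+```
+
+A driver would read the currently pinned chart version, call `next_hop`, and refuse to proceed when it returns non-zero, which also stops it from inventing a hop for a version that is not in the list.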
+
+### 4.3 k8s ↔ ESO must advance roughly in lockstep
+Each ESO release targets a **narrow** k8s band ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)):
+
+| ESO | k8s band |
+|---|---|
+| 0.12.x | 1.19 → 1.31 |
+| 0.16.x | 1.32 |
+| 0.17.x | 1.33 |
+| 2.0 – 2.5 | 1.34 – 1.35 |
+| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34–1.35 — see Open Questions) |
+
+**This is the single most important sequencing constraint.** A band is not just a ceiling ("supports anything ≤ its max k8s"): an older ESO may not run cleanly on a *much newer* k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be **interleaved**, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a **coordinated ESO+k8s climb**, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. (The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling. See Open Questions for who drives the interleave.)
+
+### 4.4 API migration: **must rewrite manifests to `v1` FIRST — there is NO v1beta1→v1 conversion webhook**
+- **`external-secrets.io/v1` promoted to STORAGE version: v0.16.0.** v0.16.0 release notes "BREAKING CHANGES": *"Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts"* and *"Removal of Conversion Webhooks and …/v1alpha1…"*. From 0.16, **etcd stores `v1`**. Source: [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0).
+- **`external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0.** Verbatim ([v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0)):
+  > "v0.17.0 Stops serving `v1beta1` apis. You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth."
+- **No v1beta1→v1 conversion webhook.** The only conversion webhook that ever existed was v1alpha1→v1beta1, **removed in 0.16**. Maintainer (issue [#5478](https://github.com/external-secrets/external-secrets/issues/5478), @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1 — *"This isn't really a conversion issue."* ⇒ **old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.**
+  - **Verdict: MUST-REWRITE-FIRST.** Rewrite all CRs to `v1` while on **0.16.x** (which serves *both* v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @Dutchy-): *"I was able to change v1beta1 to v1 on 0.16 without issues. After that I was able to upgrade to 0.17."*
+  - There is a deprecated escape hatch in chart 2.6.0 — `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers — but its own values comment says *"This flag will be removed on 2026.05.01"* (i.e. **already past**, do not rely on it).
+- **Schema change is a PURE apiVersion string bump — ZERO field changes.** CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have **byte-identical** spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): *"Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do."* PushSecret only ever had `v1alpha1` (no v1beta1) — **unaffected** (we have 0 anyway).
+
+### 4.5 Helm chart values + CRD handling (0.12 → 2.6)
+- **No top-level values removed or renamed.** `values.yaml` diff 0.12.1↔2.6.0 is **additive only** (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives.
+- **`installCRDs` still works** in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are **templated into the single `external-secrets` chart** and **upgraded by `helm upgrade`** automatically — there is **no separate CRDs subchart**, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). We don't set it, so the new default is fine.
+- **CRD storedVersions bookkeeping (the one real pre-flight check):** v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion.
+- **Helm provider:** `Chart.yaml apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; **no minimum Helm version declared** (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. **The 2.x chart does NOT require a newer helm provider than 0.12 did** — the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.)
+
+### 4.6 Data migration: **downstream Secrets survive**
+The synced Kubernetes `Secret` objects are **not deleted or force-resynced** by these upgrades. The change is an apiVersion bump on the *custom resources*, whose `spec` is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal **reconcile (re-assert, not delete)**. Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. **Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step** (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-**generators** change — we use no generators, so N/A.
+
+---
+
+## 5. Migration strategy (ordered, do-this-then-that)
+
+> **Pre-reqs every step:** run from an **unlocked** infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is **Tier-0** so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and **`git push`** after each apply (SOPS state). Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop.
+
+### Phase 0 — Pre-flight (no changes)
+1. Confirm cluster k8s version and the version-check chain's current target; **coordinate with the k8s climb** (see §4.3 / Open Questions). Decide who drives the interleave.
+2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'` — confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this *before* reaching 0.16).
+3. **Snapshot all ESO-managed Secrets** (rollback safety net):
+   `kubectl get externalsecrets -A` (record the 104), then for each target `<ns>/<name>`: `kubectl get secret -n <ns> <name> -o yaml > backup/<ns>-<name>.yaml`. Keep outside git-crypt or encrypt.
+4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` — record resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first.
+5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step.
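+
+The Secret snapshot in step 3 can be sketched as below. Assumptions: `kubectl` is pointed at the right cluster, and an ExternalSecret without `spec.target.name` syncs to a Secret named after the ExternalSecret itself (ESO's documented default). `snapshot_eso_secrets` is a hypothetical helper name.
+
+```shell
+# snapshot_eso_secrets <backup-dir>
+# Writes one YAML file per ESO target Secret into <backup-dir>.
+snapshot_eso_secrets() {
+  out_dir=$1
+  mkdir -p "$out_dir" || return 1
+
+  # One line per ExternalSecret: "<namespace> <name> <target-name-or-empty>".
+  kubectl get externalsecrets -A \
+    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.target.name}{"\n"}{end}' |
+  while read -r ns name target; do
+    [ -n "$ns" ] || continue
+    # Fall back to the ExternalSecret name when no explicit target name is set.
+    secret=${target:-$name}
+    file="$out_dir/$ns-$secret.yaml"
+    if ! kubectl get secret -n "$ns" "$secret" -o yaml > "$file"; then
+      # Do not leave an empty file behind that could be mistaken for a backup.
+      rm -f "$file"
+      echo "WARN: could not snapshot $ns/$secret" >&2
+    fi
+  done
+}
+```
+
+The output files contain live credentials, so the backup directory needs the same handling the step above asks for.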
+
+### Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet)
+ESO `0.16.x` is the **transition version** that serves *both* v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as `v1beta1`:
+6. For `v` in `0.13.x, 0.14.x, 0.15.x, 0.16.2` (use latest patch of each minor): set `helm_release.external_secrets.version = "<v>"`, `scripts/tg plan` (expect: chart upgrade + CRD upgrade in place; **no `kubernetes_manifest` replacements**, since apiVersion is unchanged), `scripts/tg apply`, `git push`, wait for rollout, verify `kubectl get externalsecrets -A` all `SecretSynced=True`.
+   - **Interleave k8s as required:** before/at 0.16 the cluster should be on **k8s 1.32** (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point.
+   - Watch the **0.14.0** notes (generators) — N/A for us, but eyeball the plan diff anyway.
+7. **Land on 0.16.2 and STOP.** Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work.
+
+### Phase 2 — Rewrite all 104 CRs + 2 stores to `v1` (while on 0.16.2)
+This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served.
+8. **Mechanical rewrite** across `stacks/`: replace the apiVersion string `external-secrets.io/v1beta1` → `external-secrets.io/v1` in every ExternalSecret and ClusterSecretStore `kubernetes_manifest` (104 + 2 = 106 occurrences across 73 files, **including the 6 nested-module files** in §2.2). **No other field changes** (schema identical). Do this in a worktree, committed file-by-file.
+   - Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion — unaffected).
+9. **Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:**
+   a. **Stores first:** `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`.
+   b. **ExternalSecrets, per stack:** for each stack that declares ExternalSecrets (the 73 `.tf` files of §2.2), `scripts/tg apply -target=kubernetes_manifest.<external_secret_name>` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The **15 plan-time-coupled stacks** (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. `git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0).
+   - Because the spec is identical, the *replace* re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout.
+10. **Verify the rewrite fully landed:** `grep -rl 'external-secrets.io/v1beta1' stacks/` lists **no files**; `kubectl get externalsecrets.v1.external-secrets.io -A` lists every ExternalSecret (confirms all are served as v1); all `SecretSynced=True`; spot-check that a rotated DB cred (e.g. `nextcloud-db-creds`) is still valid.
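+
+The mechanical rewrite in step 8 and the repo half of the gate in step 10 can be sketched as follows. This assumes the apiVersion string appears only in `.tf` files and nowhere that should stay `v1beta1` (consistent with §2.2, but worth re-checking); `rewrite_to_v1` and `assert_no_v1beta1` are illustrative names.
+
+```shell
+OLD='external-secrets.io/v1beta1'
+NEW='external-secrets.io/v1'
+
+# rewrite_to_v1 <tree>
+# Replaces the v1beta1 apiVersion string in every .tf file that contains it,
+# including nested-module files, and touches nothing else.
+rewrite_to_v1() {
+  grep -rlF --include='*.tf' "$OLD" "$1" | while IFS= read -r f; do
+    # -i.bak is accepted by both GNU and BSD sed; the backup is removed on success.
+    sed -i.bak "s|external-secrets\.io/v1beta1|$NEW|g" "$f" && rm -f "$f.bak"
+  done
+}
+
+# assert_no_v1beta1 <tree>
+# Fails (non-zero) if any file under <tree> still references v1beta1.
+assert_no_v1beta1() {
+  left=$(grep -rlF "$OLD" "$1" | wc -l | tr -d ' ')
+  if [ "$left" -ne 0 ]; then
+    echo "GATE FAILED: $left file(s) still reference $OLD" >&2
+    return 1
+  fi
+}
+```
+
+Running `rewrite_to_v1 stacks/` in a worktree and reviewing `git diff` before committing keeps the file-by-file commit discipline from step 8; `assert_no_v1beta1 stacks/` only covers the repo, not what is live in the cluster.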
+
+### Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0
+Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd):
+11. Bump chart `0.16.2 → 0.17.x`. `scripts/tg plan` (expect chart/CRD upgrade; **no manifest replacements** — already v1), apply, push, rollout, verify all synced. **k8s should be 1.33** (0.17 band) around here.
+12. Continue one minor at a time: `0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0`. At each: bump `version`, plan, apply, push, rollout, verify synced. **k8s reaches 1.34 then 1.35** across the 2.x steps.
+    - **At 2.0.0:** confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault — should be a no-op).
+13. **Land on 2.6.0.** Verify: controller image `v2.6.0`, all 104 ExternalSecrets `SecretSynced=True`, both ClusterSecretStores `Valid=True`.
+
+### Phase 4 — Close the gate + docs
+14. Advance k8s to **1.35** via the version-check chain if not already; confirm the **compat-gate now lists ESO as compatible** and 1.35 is unblocked.
+15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (**104 ExternalSecrets / 73 files / 26 db-refs**), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail).
+
+---
+
+## 6. Risks & mitigations
+
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| **Secret-sync outage → app DB/API auth failures** during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces **per stack** (small blast radius); the 15 plan-time stacks use `-target` first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. |
+| **Crossing 0.17 with any CR still v1beta1** → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -rl v1beta1 stacks/` **must list no files** AND `kubectl get …v1beta1…` returns nothing live before Phase 3. Do not skip 0.16. |
+| **CRD removal/replace by helm dropping data** | Low | Chart manages CRDs in-place via `installCRDs=true` (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD *upgrade*. Snapshot anyway. Never `helm uninstall` (that can GC CRDs). |
+| **No conversion webhook safety net** (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated `unsafeServeV1Beta1` is already past its 2026-05-01 removal — do NOT rely on it. |
+| **`kubernetes_manifest` forces replace on apiVersion bump** → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. |
+| **Vault 7-day DB rotation lands mid-migration** → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. |
+| **git-crypt tls-secret-sync landmine** | Low (not in ESO stack) | ESO stack has no `file()` git-crypt reads; run from an **unlocked** checkout; do **not** piggyback kyverno applies during this work. |
+| **helm/k8s provider in lockfile too old for 2.x chart** | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). |
+| **k8s/ESO band mismatch** (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. |
+| **Many small applies = long, error-prone session** | Med | Script the per-stack `-target`-then-full loop; checkpoint with `kubectl get externalsecrets -A` after each; the rewrite itself is a single `sed`-class change so low semantic risk. |
+
+---
+
+## 7. Rollback plan (per hop)
+
+- **During Phase 1 (chart climb, still v1beta1):** revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean.
+- **During Phase 2 (v1 rewrite, on 0.16.2):** 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply — the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the **last point of easy rollback**.
+- **After Phase 3 (≥0.17, v1beta1 no longer served):** **rollback is HARD** — once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back ([general guidance + maintainer position](https://github.com/external-secrets/external-secrets/issues/5478)). Treat **crossing 0.17 as the point of no return.** If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback — hence the Phase-2 gate must be airtight.
+- **Always available:** the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO.
+
+---
+
+## 8. Verification
+
+**Per hop:**
+- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target.
+- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`.
+- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`.
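+
+The per-hop sync check can be scripted rather than eyeballed. A minimal sketch, assuming each ExternalSecret reports a `Ready` condition in `.status.conditions`; `check_all_synced` is a hypothetical helper name.
+
+```shell
+# check_all_synced
+# Prints "<namespace>/<name>" for every ExternalSecret whose Ready condition
+# is not "True" and returns non-zero if any were printed.
+check_all_synced() {
+  kubectl get externalsecrets -A \
+    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
+  awk 'NF > 0 && $2 != "True" { print $1; bad = 1 } END { exit bad }'
+}
+```
+
+ExternalSecrets that were already failing before the migration (the execution log names two) will also be reported, so compare the output against the pre-flight baseline rather than expecting it to be empty.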
+
+**After Phase 2 (v1 rewrite):**
+- `grep -rl 'external-secrets.io/v1beta1' stacks/` → **no files listed**.
+- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check.
+- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates.
+
+**Final (2.6.0):**
+- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid.
+- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof).
+- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer) — pods running, no auth errors in logs.
+- **Compat-gate:** run the upgrade-state / k8s-version-check audit — ESO no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds.
+
+---
+
+## 9. Open questions
+
+1. **k8s/ESO interleave ownership.** §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. **Who drives the interleave** — does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.)
+2. **2.6.0 ↔ k8s 1.35 explicit support.** The support matrix table currently ends at **2.5** (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a *strong inference* not a quoted row. Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. ([matrix](https://external-secrets.io/latest/introduction/stability-support/))
+3. **Resolved helm provider version.** The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. (Chart needs only Helm SDK v3 — likely a no-op, but verify.)
+4. **Intermediate-minor patch selection.** Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.**2** specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch.
+5. **Per-stack apply automation.** A (target + full) apply for every stack behind the 73 ExternalSecret-bearing `.tf` files is a large job. Is it acceptable to script a loop, or is manual per-stack with checkpoints preferred? Some stacks have other in-flight drift that a full apply would also push, so each needs a clean-plan check first.
+6. **Stateful generators / advanced features.** Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A — but re-confirm no generator usage crept in before Phase 3.
+
+---
+
+## 10. Sources (decisive facts)
+
+- Skip-version policy + k8s support matrix: <https://external-secrets.io/latest/introduction/stability-support/>
+- `v1` promoted to storage version (0.16.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0>
+- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0>
+- No conversion webhook / "not a conversion issue" (#5478): <https://github.com/external-secrets/external-secrets/issues/5478>
+- v1beta1↔v1 schema identical / "nothing fancy" (#4785): <https://github.com/external-secrets/external-secrets/issues/4785>
+- App v1.0.0 ≠ API v1: <https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0>
+- v2.0.0 only removes Alibaba/Device42: <https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0>
+- Chart 2.6.0 on ArtifactHub: <https://artifacthub.io/packages/helm/external-secrets-operator/external-secrets>
--- a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
+++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
@ -0,0 +1,131 @@
+# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop
+
+## Impact
+
+- devvm (VM 102, the shared multi-user Claude Code workstation) became
+  unresponsive under combined memory + IO pressure and had to be **hard-killed +
+  rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for
+  wizard/emo/anca were lost and in-flight agents were killed.
+- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM
+  22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible
+  IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES /
+  64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP
+  instances across three users on top.
+
+## This is the "crawl" class, not the QEMU-stall class
+
+The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a
+*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI
+controller. That fix shipped (verified 2026-06-22: the guest now boots on
+`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem
+explicitly deferred **this** class:
+
+> The recurring *crawl* class (agent storms → swap-thrash; journald
+> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux
+> sessions remain memory-uncontained by **explicit decision (swap-only,
+> 2026-06-10)**.
+
+That explicit decision is the root cause closed here.
+
+## Root cause
+
+Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only
+one was capped:
+
+| Tree | cgroup | Cap before today |
+|---|---|---|
+| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ |
+| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ |
+
+The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`;
+stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and
+swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the
+overload chain:
+
+```
+uncapped tmux growth → disk-swap thrash on a throttled spindle
+   → IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
+```
+
+i.e. **memory pressure becomes the IO storm**. There was also **no global OOM
+backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the
+kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely
+(3 users × 16G = 48G, roughly double the 23.5G of RAM shown in htop); nothing reasoned about the *whole box*.
+
+## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
+
+Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
+per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
+single-user utilisation while making a box-wide wedge impossible. (The backstop
+was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
+proved inert with `swap=0` — see Verification + Lessons.)
+
+| Layer | What |
+|---|---|
+| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
+| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
+| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
+| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
+| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
+
+Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
+`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
+heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
+
+## Verification (live, 2026-06-22)
+
+- **Caps live on running cgroups**: all three `user-<uid>.slice` report
+  `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
+  daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
+  under `docker.slice`.
+- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
+  killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
+  **swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
+  slice (16G) and `docker.slice` (8G).
+- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
+  memory.pressure, throttled to a crawl, making no progress and harming nothing —
+  a runaway is throttled, not just killed.
+- **systemd-oomd disproven, then dropped**: a self-policed balloon held
+  `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never
+  killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
+  reclaim, which a `swap=0` anon workload never does. oomd was purged.
+- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
+  `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
+  `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
+  earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
+
+## Out of scope / follow-ups
+
+- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
+  detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
+  early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
+  `-N /script` can push a metric). devvm node-exporter is already scraped
+  (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
+  monitoring-stack Terraform change).
+- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
+  compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
+- **Per-user docker isolation**: containers share one `docker.slice` budget, not
+  per-user. Fine for current usage (krr + short-lived tools).
+- **Host-side IO**: the 60/60 MB/s cap + the shared `sdc` HDD IO domain are
+  host-level (bead `code-oflt`); unchanged here.
+
+## Lessons
+
+- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
+  Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
+  local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
+  the failure back into a contained, local kill.
+- **Cap the box, not one surface.** t3 sessions were capped for months while the
+  same user's tmux was unbounded — and the caps that existed didn't sum to < RAM.
+  Containment has to reason about every tree and the aggregate.
+- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
+  sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
+  reachable to recover; only the agent/browser hogs are eligible victims.
+- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.**
+  oomd's memory-pressure killer only fires on cgroups doing active reclaim
+  (`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
+  reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
+  acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
+  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
+  correct pairing. A famous tool that "does OOM" still has to be proven to fire
+  under *your* configuration.
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@ -0,0 +1,97 @@
+# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
+
+> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
+> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
+> drift was a real *separate* latent bug fixed in the same change.
+
+**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
+the master control-plane phase for the first time — preflight passed, etcd
+snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
+kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
+static-pod-hash window across all internal retries, then auto-rolled-back to
+v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
+the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
+No data loss; no user-facing outage (the master carries control-plane taints, so
+no workloads were displaced).
+
+**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
+first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
+static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
+
+## Root cause — etcd IO starvation on the shared HDD
+
+The new kube-apiserver could not establish/keep a working connection to etcd
+during the upgrade because **etcd was IO-starved**. etcd's surviving container log
+from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
+
+- **1,180** `apply request took too long` warnings in 16 minutes;
+- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
+  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
+  to bring the new apiserver up.
+
+A reproduced 1.35.6 apiserver with no etcd dies with
+`F instance.go:233 Error creating leases: error creating storage factory: context
+deadline exceeded`, the same failure mode that an etcd with multi-second apply latency produces. etcd
+lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
+shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
+that spindle:
+
+1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
+2. kubeadm dumping a full **~400MB etcd DB backup** to
+   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
+   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
+   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
+   image-GC threshold, so image GC churned during the drain too;
+3. master-drain pod evictions.
+
+### Correction — it was NOT the OIDC flag swap
+
+`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
+`--authentication-config` (structured multi-issuer OIDC) back to legacy
+single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
+was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
+those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
+(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
+etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
+the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
+were also ruled out.
+
+## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
+
+apiserver auth is configured in three places that must agree:
+(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+and `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
+(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
+which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
+the manifest from (3), so it would have reverted structured auth → **dashboard +
+kubectl SSO break after a successful upgrade** (recoverable: the chain's
+post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
+
+## Resolution
+
+1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
+2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
+3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
+
+## Prevention (landed in this change)
+
+| Gap | Fix |
+|-----|-----|
+| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
+| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
+| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
+
+## Lessons
+
+- **Capture the failing component's own logs before concluding.** The `kubeadm
+  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
+  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
+  "what config changes," not "why it crashed."
+- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
+  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
+  backup copy + drain) onto that spindle. code-oflt is the real fix.
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's
+  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
+  GC'd; 28GB had silently accumulated.
+- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
+  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -0,0 +1,301 @@
+# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
+
+> As-built runbook for the Calico Goldmane + Whisker flow plane and the
+> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
+> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
+> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
+> (monitoring), #62 (egress allowlist queries), #63 (these docs).
+
+## What the trail is
+
+Three layers turn raw east-west traffic into a queryable, durable record of
+which Service talks to which. **Service identity = the workload's namespace**
+(primary), refined by a `service-identity` label in the few multi-Service
+namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
+
+| Layer | Component | Lifetime | Where it lives |
+|---|---|---|---|
+| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
+| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
+| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
+
+**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
+labels + allow-deny + policy-trace) streamed from Felix (the existing
+`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
+**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
+drove the whole design). **Whisker** is its live web UI. Because the ring
+buffer is *not* a trail (a Goldmane restart loses the window), the
+`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
+mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
+CronJob posts first-seen edges to Slack.
+
+The edge set is deliberately **low-cardinality** — one row per
+`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
+small no matter how much traffic flows.
+
+## Where the data lives
+
+### Whisker UI — live, ~60 min
+- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
+  login; `auth = "required"`). Shows the live flow stream + a service graph for
+  roughly the last hour. Use it for "what is talking right now"; it is **not**
+  history.
+- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
+  (HTTP), both in `calico-system`.
+
+### CNPG `goldmane_edges` — durable
+- Postgres DB `goldmane_edges` on the CNPG cluster
+  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
+
+  ```
+  edge(src_ns text, dst_ns text, action text,
+       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
+       PRIMARY KEY (src_ns, dst_ns, action))
+  ```
+
+  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
+    action).
+  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
+    / public-internet) are **dropped** — the trail is about in-cluster service
+    relationships only. (Egress to the public internet is therefore NOT in this
+    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
+  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
+  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
+    is created idempotently by the aggregator at startup (canonical DDL also in
+    the repo at `migrations/0001_edge.sql`).
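+
+The row semantics above can be sketched as one upsert. This is a minimal
+illustration, not the aggregator's actual statement (that lives in its code
+repo): the `<src_ns>`/`<dst_ns>` placeholders, the batch count of 12, and the
+assumption that `flow_count` is a running sum are all made up for the example.
+
+```sql
+-- Hypothetical batch: 12 flows observed for one (src_ns, dst_ns, action) edge.
+INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
+VALUES ('<src_ns>', '<dst_ns>', 'allow', now(), now(), 12)
+ON CONFLICT (src_ns, dst_ns, action) DO UPDATE
+SET last_seen  = EXCLUDED.last_seen,
+    flow_count = edge.flow_count + EXCLUDED.flow_count;
+-- first_seen is left out of the UPDATE, so it keeps the first sighting time.
+```
+
+Under this sketch a repeat sighting only moves `last_seen` and `flow_count`,
+which is what would keep the "new edge" definition (a `first_seen` inside the
+digest window) stable across restarts of the aggregator.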
+
+### Slack `#alerts` — daily digest
+
+> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
+
+- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
+  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
+  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
+  — no new webhook was created.
+
+## How to enable / disable
+
+### Goldmane + Whisker (the flow plane)
+Operator CRs in **`stacks/calico/main.tf`**, NOT the Helm `goldmane`/`whisker`
+flags. Those flags stay `false`; like the operator's own `installation`/`apiServer`,
+both components are operator-managed, here via the `goldmanes`/`whiskers.operator.tigera.io` CRDs:
+
+- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
+  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
+  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
+  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
+  goldmane:7443`.
+- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
+  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
+
+**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
+toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
+ADR-0014).
+
+### Whisker public ingress (infra #57)
+Also in `stacks/calico/main.tf`:
+- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
+  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
+- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
+  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
+  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
+  This additive NP ORs in an allow for `namespaceSelector
+  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
+
+### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
+A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
+apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
+the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
+ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
+the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
+without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
+0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
+
+Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
+`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
+allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
+`local.ghcr_private_namespaces`), otherwise image pulls fail with 401. Code repo:
+`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
+
+## mTLS cert — the REUSE decision (cert-reuse gotcha)
+
+The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
+client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
+identity** — any Tigera-CA-signed cert is accepted.
+
+Rather than copy the Tigera CA **private key** into Terraform state to mint our
+own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
+with this repo's global generate-providers/lockfile pattern), the stack
+**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
+Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
+`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
+verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
+`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
+cross-namespace-mounted).
+
+> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
+> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
+> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
+> and no `last_seen` updates land in the `edge` table. Hardening follow-up
+> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
+> removed (which would delete the reused source Secret).
+
+The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
+and the default cert/CA paths; the default ServerName (host sans port) is a SAN
+on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
+`GOLDMANE_TLS_INSECURE` override is needed.
+
+## How to query who-talks-to-whom
+
+`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
+exec a CNPG pod). All queries are against the single `edge` table.
+
+```sql
+-- Everything talking to a namespace (inbound), most-active first
+SELECT src_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
+
+-- Everything a namespace talks TO (outbound)
+SELECT dst_ns, action, flow_count, first_seen, last_seen
+FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
+
+-- New edges in the last 24h (what the digest reports)
+SELECT src_ns, dst_ns, action, flow_count, first_seen
+FROM edge WHERE first_seen > now() - interval '24 hours'
+ORDER BY first_seen DESC;
+
+-- Any DENIED edges (policy is dropping this pair)
+SELECT src_ns, dst_ns, flow_count, last_seen
+FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
+
+-- Full edge set as a graph adjacency list
+SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
+```
+
+For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
+the `edge` table intentionally aggregates that away.
+
+## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
+
+The durable edge set is a faster, identity-stamped data source for the existing
+**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
+`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
+iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
+a better data source"). It replaces the *internal* (namespace-to-namespace) leg
+of the allowlist; **external/public-internet egress is NOT in this table** (empty
+dst namespace, dropped) — for those destinations keep using the Calico flow-log
+path described in security.md.
+
+**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
+given source is *observed* talking to with `action='allow'`:
+
+```sql
+-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
+SELECT DISTINCT dst_ns
+FROM edge
+WHERE src_ns = '<ns>' AND action = 'allow'
+ORDER BY dst_ns;
+```
+
+```sql
+-- Full internal egress matrix for all namespaces at once
+SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
+FROM edge
+WHERE action = 'allow'
+GROUP BY src_ns
+ORDER BY src_ns;
+```
+
+```sql
+-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
+-- before tightening further)
+SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
+```
+
+**How this feeds enforcement (scope):** the derived `dst_ns` set is the
+*internal* half of a namespace's egress allowlist — it tells you which
+in-cluster namespaces to permit before flipping that namespace to default-deny.
+The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
+the external destinations still come from the Wave-1 observation snapshot.
+**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
+the phased per-namespace default-deny rollout (starting `recruiter-responder`)
+is tracked under `code-8ywc`. Cross-links:
+[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
+[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
+[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
+
+> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
+> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
+> collect ≥7 days of edges before treating a namespace's `allow` set as
+> complete. The `first_seen` column tells you how long an edge has been known;
+> the digest surfaces brand-new ones daily.
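+
+The caveat above can be checked per namespace before trusting its `allow` set.
+A minimal sketch using only the `edge` columns already described; the 7-day
+interval mirrors the caveat and `<ns>` is a placeholder:
+
+```sql
+-- How long has this source namespace been observed at all?
+SELECT min(first_seen) AS observing_since,
+       now() - min(first_seen) AS observed_for
+FROM edge
+WHERE src_ns = '<ns>';
+
+-- Allow edges that are still younger than 7 days (the set is still growing)
+SELECT dst_ns, first_seen, now() - first_seen AS known_for
+FROM edge
+WHERE src_ns = '<ns>' AND action = 'allow'
+  AND first_seen > now() - interval '7 days'
+ORDER BY first_seen DESC;
+```
+
+If `observed_for` is under 7 days, or the second query still returns rows, the
+derived allowlist for that namespace is probably not complete yet.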
+
+## Monitoring & health (infra #61)
+
+The aggregator pod has **no `/metrics` endpoint** — health is inferred from
+kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
+see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
+
+| Signal | What | Where |
+|---|---|---|
+| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
+| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
+| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
+
+The two alert layers are deliberately complementary: `AggregatorDown` →
+**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
+is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
+is the agreed floor.
+
+## Troubleshooting
+
+**Whisker UI 502 / unreachable.** The additive
+`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
+operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
+brand-new ingress host is also invisible to LAN split-horizon until the hourly
+`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
+`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
+(expect a 302 to Authentik — the gate working).
+
+**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
+pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
+Common causes, in order:
+1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
+   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
+   handshake / `Flows.Stream` errors.
+2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
+   the pod kept the old one. The Deployment carries
+   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
+   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
+3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
+   reconnects automatically and resumes upserting. No data loss in the DB
+   (only the sub-hour live window in Whisker is gone).
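+
+Before working through the causes above, it can help to confirm from the DB
+side that updates really stopped. A minimal check, assuming `last_seen` is
+bumped on every upsert as the table description implies; what counts as stale
+is a judgment call that this runbook does not define:
+
+```sql
+-- Newest update across the whole edge table and how long ago it happened
+SELECT max(last_seen) AS newest_update,
+       now() - max(last_seen) AS staleness
+FROM edge;
+```
+
+An empty table returns NULL in both columns, which points at the "No edges at
+all in the table" case below rather than a stalled stream.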
+
+**Digest never posts / `DigestFailing` firing.** Inspect the most recent
+`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
+`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
+`ttl_seconds_after_finished=86400` deletes finished Jobs and their pods after a day, so check soon after a failed run.
+With `SLACK_WEBHOOK_URL` empty the binary forces a dry-run (no post), so verify that the
+`goldmane-edges-slack` ExternalSecret resolved. To smoke-test without posting, run the image
+with `args: ["digest"]` + `DRY_RUN=1`, which prints the message instead of POSTing.
+> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
+> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
+> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
+> `aggregate` Deployment; only the `#security` notification is affected.
+> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
+
+**No edges at all in the table.** Confirm Goldmane is enabled
+(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
+`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
+completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
+(ghcr allowlist).
+
+## Related
+- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
+- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
+- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
+- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
+- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
+- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
+  `stacks/goldmane-edge-aggregator`, `stacks/calico`
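Reviewer note on the runbook's ordered cause list for `AggregatorDown`: the triage order can be sketched as a small log classifier. This is a minimal illustration only. The regular expressions are assumptions (the aggregator's real log wording is not shown in this change), and `triage` is a hypothetical helper, not part of the repo.

```python
import re

# Ordered to mirror the runbook's "common causes, in order" list.
# The patterns are illustrative assumptions, not the aggregator's real log text.
CAUSES = [
    ("stale-mtls-cert", re.compile(r"tls|handshake|Flows\.Stream", re.I),
     "re-apply stacks/goldmane-edge-aggregator"),
    ("stale-db-password", re.compile(r"password authentication failed", re.I),
     "check the Reloader annotation and the ExternalSecret"),
    ("goldmane-restarted", re.compile(r"reconnect|connection refused", re.I),
     "no action: the stream reconnects on its own"),
]


def triage(log_text: str):
    """Return (cause, next_step) for the first matching cause, else None."""
    for cause, pattern, next_step in CAUSES:
        if pattern.search(log_text):
            return cause, next_step
    return None
```

Feeding it the output of `kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator` would give a first guess at which of the three causes to chase; anything unmatched still needs a human read.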
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -41,6 +41,8 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
+  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
+  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names

 ## Common Operations

-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)

 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
-401). This used to require a manual re-apply after **every** control-plane bump.
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
+still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this — verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
+etcd IO starvation**, not this drift; post-mortem:
+`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.

-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-crashloop the operator). It's idempotent, health-gates `/livez` with
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
-apply (the version upgrade itself already succeeded). So a chain-driven
-control-plane bump no longer breaks SSO. The master phase self-skips when master
-is already at target, so this only runs when master was actually upgraded.
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
+image change. Zero live impact (kubeadm reads the ConfigMap only during an upgrade).
+
+**Backstops:**
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+  NOT block — the drift only breaks SSO, which is recoverable) if
+  `--authentication-config` would still be dropped.
+- The `rbac` stack still publishes its restore script to the
+  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
+  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
+  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
+  re-reconciles kubeadm-config. Self-skips when master is already at target.

 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48

 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
    esac
 }

+# --- 48. Goldmane edge-aggregator availability ---
+#
+# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
+# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
+# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
+# this check reads the Deployment's Available condition directly so the trail
+# silently dying surfaces in the health board (mirrors the AggregatorDown
+# Prometheus alert). Missing Deployment / not-Available -> FAIL.
+check_goldmane_aggregator() {
+    section 48 "Goldmane Edge-Aggregator"
+    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
+    local avail desired ready
+
+    # Existence probe first; an absent Deployment is a hard fail (the trail isn't deployed).
+    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
+        json_add "goldmane_aggregator" "FAIL" "deployment missing"
+        return 0
+    fi
+
+    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
+        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
+    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
+    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
+    ready=${ready:-0}
+    desired=${desired:-0}
+
+    if [[ "$avail" == "True" ]]; then
+        pass "Edge-aggregator Available ($ready/$desired ready)"
+        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
+    else
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
+        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
+    fi
+}
+
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift
+        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
    )

    # Auto-fix mutates cluster state inside individual checks — keep that
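Reviewer note on check #48: the function issues one existence probe plus three separate `jsonpath` reads. A single `kubectl get deploy ... -o json` payload carries all three fields, so the decision logic could be sketched as below. This is a hypothetical Python rendering for illustration (`evaluate_deployment` does not exist in the repo); the PASS/FAIL detail strings copy the ones the bash check passes to `json_add`.

```python
import json


def evaluate_deployment(raw_json: str):
    """Return (status, detail) from one `kubectl get deploy -o json` payload.

    An empty payload stands in for "Deployment not found".
    """
    if not raw_json.strip():
        return "FAIL", "deployment missing"
    obj = json.loads(raw_json)
    desired = obj.get("spec", {}).get("replicas") or 0
    status = obj.get("status", {})
    ready = status.get("readyReplicas") or 0
    avail = "unknown"
    for cond in status.get("conditions") or []:
        if cond.get("type") == "Available":
            avail = cond.get("status", "unknown")
    if avail == "True":
        return "PASS", f"{ready}/{desired} ready"
    return "FAIL", f"{ready}/{desired} ready; Available={avail}"
```

Reading everything from one snapshot also avoids the small window in which `readyReplicas` and the `Available` condition come from different moments.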
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@ -29,6 +29,9 @@ REPO_REMOTE_BASE="${REPO_REMOTE_BASE:-https://forgejo.viktorbarzin.me/viktor}"
 # Per-user OIDC kubeconfig (kubelogin/PKCE; cluster server+CA copied from the admin kubeconfig).
 OIDC_ISSUER="${OIDC_ISSUER:-https://authentik.viktorbarzin.me/application/o/kubernetes/}"
 ADMIN_KUBECONFIG="${ADMIN_KUBECONFIG:-/home/wizard/.kube/config}"
+# OS users (space-separated) that receive the vendored agent skills (scripts/workstation/claude-skills).
+# Allowlist: install_skills no-ops for anyone not listed. Extend here to roll out to more users.
+SKILL_USERS="${SKILL_USERS:-emo}"

 log() { echo "[t3-provision] $*"; }
 run() { if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] $*"; else "$@"; fi; }
@ -381,9 +384,133 @@ install_playwright() {
  run systemctl enable --now "playwright-snapshot-refresh@$user.timer" >/dev/null 2>&1 || true
 }

+# Per-user homelab-memory setup — migrate off the claude-memory MCP/plugin to the
+# homelab CLI hooks (auto-recall + auto-learn + compaction backup/recovery).
+# Idempotent, if-absent, ADDITIVE: never clobbers `env` (the per-user
+# MEMORY_API_KEY) or other MCP servers; removes ONLY the `claude_memory` MCP.
+# Reuses the user's existing key — does NOT mint one (per-user isolation stays
+# deferred, design 2026-06-08). The homelab CLI (/usr/local/bin/homelab) hits the
+# same remote HTTP API the MCP used. Hook scripts: $WORKSTATION_DIR/claude-hooks.
+install_memory() {
+  local user="$1" home
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -n "$home" && -d "$home" ]] || return 0
+  local src="$WORKSTATION_DIR/claude-hooks" hooks_dst="$home/.claude/hooks" settings="$home/.claude/settings.json"
+  [[ -d "$src" ]] || { log "WARN: $src missing -> skip memory setup for $user"; return 0; }
+
+  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] memory: hooks + settings wire + claude_memory MCP removal -> $user"; return 0; fi
+
+  # (1) (re)install the 4 hook scripts, owned by the user (refreshed each reconcile so fixes land)
+  install -d -o "$user" -g "$user" -m 0755 "$hooks_dst"
+  local h
+  for h in homelab-memory-recall.py auto-learn.py pre-compact-backup.sh post-compact-recovery.sh; do
+    install -o "$user" -g "$user" -m 0755 "$src/$h" "$hooks_dst/$h"
+  done
+
+  # (2) wire the hooks in settings.json, if-absent + additive. Run the helper as ROOT:
+  #     it must read $src under the admin's hardened home (mode 700), which a
+  #     runuser-as-$user CANNOT traverse — so chown the result back to the user and
+  #     enforce 0600 (it holds the per-user MEMORY_API_KEY).
+  if python3 "$src/wire-memory-hooks.py" "$home" >/dev/null 2>&1; then
+    [[ -f "$settings" ]] && chown "$user:$user" "$settings" 2>/dev/null || true
+    log "memory hooks wired -> $user"
+  else
+    log "WARN: memory hook wiring failed for $user (retries next reconcile)"
+  fi
+  [[ -f "$settings" ]] && chmod 600 "$settings" || true
+
+  # (2b) reuse the user's existing key; warn (do NOT mint — needs an admin vault write) if absent.
+  if [[ -f "$settings" ]] && ! grep -q 'MEMORY_API_KEY' "$settings"; then
+    log "WARN: $user has no MEMORY_API_KEY in settings.json — homelab memory no-ops until an admin mints one"
+  fi
+
+  # (3) remove the now-superseded claude_memory MCP (AS the user, if-present) + the plugin dir.
+  if runuser -u "$user" -- bash -lc 'command -v claude >/dev/null 2>&1 && claude mcp get claude_memory >/dev/null 2>&1'; then
+    runuser -u "$user" -- bash -lc 'claude mcp remove claude_memory >/dev/null 2>&1' && log "removed claude_memory MCP -> $user" || true
+  fi
+  if [[ -d "$home/.claude/plugins/claude-memory" ]]; then
+    rm -rf "$home/.claude/plugins/claude-memory" && log "removed claude-memory plugin dir -> $user"
+  fi
+  return 0  # best-effort tail must never return non-zero, else set -euo pipefail aborts the whole reconcile
+}
+
+# Per-user agent skills, vendored from the in-repo snapshot ($WORKSTATION_DIR/claude-skills) — the
+# `npx skills` upstream drifted off this exact set, so we reproduce it offline + deterministically.
+# if-absent + ADDITIVE: copies a skill dir into ~/.agents/skills/<name> (owned by the user) and
+# symlinks ~/.claude/skills/<name> -> ../../.agents/skills/<name> (the layout `skills add -g`
+# produces; Claude Code reads ~/.claude/skills/). Scoped to SKILL_USERS. if-absent keys on the
+# user's OWN copy, so it heals a stale/cross-user ~/.claude/skills symlink but never clobbers a real
+# skill dir. Best-effort tail: must return 0 or set -euo pipefail aborts the whole reconcile.
+install_skills() {
+  local user="$1" home
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -n "$home" && -d "$home" ]] || return 0
+  case " $SKILL_USERS " in *" $user "*) ;; *) return 0 ;; esac
+  local src_root="$WORKSTATION_DIR/claude-skills"
+  [[ -d "$src_root" ]] || { log "WARN: $src_root missing -> skip skills for $user"; return 0; }
+
+  if [[ "$DRY_RUN" == 1 ]]; then
+    local d names=""
+    for d in "$src_root"/*/; do [[ -d "$d" ]] && names+="$(basename "$d") "; done
+    echo "[dry-run] vendor skills if-absent -> $user: ${names}"
+    return 0
+  fi
+
+  local agents_dir="$home/.agents/skills" claude_dir="$home/.claude/skills"
+  # own the parent ~/.agents too (install -d leaves created intermediates root-owned)
+  install -d -o "$user" -g "$user" -m 0755 "$home/.agents" "$agents_dir" "$claude_dir"
+  chown "$user:$user" "$home/.agents" || true
+
+  local skill name dst link n=0
+  for skill in "$src_root"/*/; do
+    [[ -d "$skill" ]] || continue
+    name="$(basename "$skill")"
+    dst="$agents_dir/$name"
+    link="$claude_dir/$name"
+    # if-absent keys on the user's OWN copy (a real dir under ~/.agents/skills), NOT on any
+    # pre-existing ~/.claude/skills entry — so a stale or cross-user symlink gets healed.
+    if [[ ! -d "$dst" ]]; then
+      cp -a "$src_root/$name" "$dst" || { log "WARN: copy skill $name -> $user failed"; continue; }
+      chown -R "$user:$user" "$dst" || true
+      n=$((n+1))
+    fi
+    # point ~/.claude/skills/<name> at the user's own copy (replacing a stale/cross-user symlink);
+    # never clobber a real dir/file squatting that name.
+    if [[ -e "$link" && ! -L "$link" ]]; then
+      log "WARN: $claude_dir/$name is a real dir/file (left as-is) for $user"
+    elif [[ "$(readlink "$link" 2>/dev/null)" != "../../.agents/skills/$name" ]]; then
+      ln -sfn "../../.agents/skills/$name" "$link" && chown -h "$user:$user" "$link" || log "WARN: link skill $name -> $user failed"
+    fi
+  done
+  if [[ "$n" -gt 0 ]]; then log "vendored/healed $n skill(s) -> $user"; fi
+  return 0  # best-effort tail must never return non-zero, else set -euo pipefail aborts the reconcile
+}
+
 [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
 for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
 [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
+
+# 0) self-deploy: the repo is the authoring surface (like sync_managed_config /
+#    deploy_user_launcher below). Nothing else redeploys /usr/local/bin (only the
+#    manual setup-devvm.sh did) — so a committed edit silently never reached the
+#    hourly run until now (the homelab-memory rollout sat undeployed for a day).
+#    If the repo copy differs, install it and re-exec the fresh binary. Guarded:
+#    re-exec flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no
+#    mutation), cmp (no churn when unchanged).
+SELF_SRC="$WORKSTATION_DIR/../t3-provision-users.sh"
+SELF_DST=/usr/local/bin/t3-provision-users
+if [[ -z "${T3_PROVISION_SELF_DEPLOYED:-}" && -r "$SELF_SRC" ]] && ! cmp -s "$SELF_SRC" "$SELF_DST"; then
+  if [[ "$DRY_RUN" == 1 ]]; then
+    echo "[dry-run] self-deploy $SELF_DST from repo (changed)"
+  elif bash -n "$SELF_SRC" 2>/dev/null; then
+    install -m 0755 "$SELF_SRC" "$SELF_DST"
+    log "self-deployed $SELF_DST from repo (changed) — re-exec"
+    exec env T3_PROVISION_SELF_DEPLOYED=1 "$SELF_DST" "$@"
+  else
+    log "WARN: repo t3-provision-users.sh fails 'bash -n' — keeping deployed copy"
+  fi
+fi
+
 install -d -m 0755 "$ENVDIR"

 # 1) current sticky ports from existing .env files -> {os_user: port}
@ -494,6 +621,21 @@ while IFS=$'\t' read -r os_user pw_port; do
  install_playwright "$os_user"
 done < <(jq -r '.playwright_ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")

+# 5d) per-user homelab-memory (ALL users): replace the claude-memory MCP/plugin with the
+#     homelab CLI memory hooks. Idempotent + additive + if-absent; never touches the
+#     per-user MEMORY_API_KEY or other MCP servers (removes ONLY claude_memory).
+while IFS=$'\t' read -r os_user; do
+  id "$os_user" >/dev/null 2>&1 || continue
+  install_memory "$os_user"
+done < <(jq -r '.accounts[].os_user' "$desired_file")
+
+# 5e) per-user agent skills (SKILL_USERS allowlist only): vendored snapshot -> ~/.agents/skills
+#     + ~/.claude/skills symlinks. if-absent + additive; best-effort (never aborts the reconcile).
+while IFS=$'\t' read -r os_user; do
+  id "$os_user" >/dev/null 2>&1 || continue
+  install_skills "$os_user"
+done < <(jq -r '.accounts[].os_user' "$desired_file")
+
 # 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it
 #     follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md).
 #     NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing
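Reviewer note on `install_skills`: the comments state three outcomes for `~/.claude/skills/<name>` (leave a real dir/file alone, keep a correct symlink, replace a stale or cross-user one). A minimal sketch of that decision table, assuming the relative target `../../.agents/skills/<name>` used in the script; `link_action` is a hypothetical helper written only to make the intent testable.

```python
import os
import tempfile


def link_action(link: str, name: str) -> str:
    """Classify what the reconcile should do with ~/.claude/skills/<name>."""
    want = f"../../.agents/skills/{name}"
    if os.path.islink(link):
        # A symlink is ours to manage: keep it only if it already points home.
        return "ok" if os.readlink(link) == want else "relink"
    if os.path.lexists(link):
        return "leave"  # real dir or file squatting the name: never clobbered
    return "relink"     # absent: create the symlink
```

Note that the symlink test comes first, so a dangling cross-user link is classified as `relink` rather than as absent.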
--- a/scripts/workstation/claude-hooks/auto-learn.py
+++ b/scripts/workstation/claude-hooks/auto-learn.py
@ -0,0 +1,184 @@
+#!/usr/bin/env python3
+"""
+Stop hook (async): automatic learning extraction via haiku-as-judge.
+
+After each Claude response, sends the user message + assistant response to
+haiku to detect corrections, preferences, decisions, or facts worth storing.
+If learning events are detected, stores them via the `homelab memory` CLI — the
+only sanctioned memory path on the devvm (no direct HTTP, no local SQLite).
+
+Runs with async: true — does NOT block the user.
+"""
+
+import io
+import json
+import logging
+import os
+import shutil
+import subprocess
+import sys
+
+logger = logging.getLogger(__name__)
+
+JUDGE_PROMPT = """You are a memory extraction judge. Analyze this exchange between a user and an AI assistant.
+
+USER MESSAGE:
+{user_message}
+
+ASSISTANT RESPONSE:
+{assistant_response}
+
+Your job: determine if any of these learning events occurred:
+1. USER CORRECTION — user corrected the assistant's mistake or misunderstanding
+2. PREFERENCE — user stated a preference, habit, or "I like/prefer/want" statement
+3. DECISION — a decision was reached about how to do something
+4. FACT — user shared a durable fact about themselves, their team, tools, or environment
+
+If ANY learning event occurred, return JSON:
+{{"events": [{{"type": "correction|preference|decision|fact", "content": "concise fact to remember (one sentence)", "importance": 0.7, "expanded_keywords": "space-separated semantically related search terms for recall (minimum 5 words)", "supersedes": null}}]}}
+
+If NO learning event occurred, return:
+{{"events": []}}
+
+Rules:
+- Only extract DURABLE facts, not transient task details
+- Corrections are highest value (0.8-0.9)
+- Be conservative — false negatives are better than false positives
+- "expanded_keywords" should include synonyms, related concepts, and adjacent topics that would help find this memory later
+- "supersedes" should be a search query to find the old outdated memory, or null
+- Return ONLY valid JSON, no other text"""
+
+
+def _store_via_homelab_cli(content, category, tags, importance, expanded_keywords):
+    """Store one memory via the homelab CLI — the only sanctioned memory path on
+    the devvm (no direct HTTP, no local SQLite). The CLI defaults the API URL and
+    reads CLAUDE_MEMORY_API_KEY / MEMORY_API_KEY from the environment; if neither
+    is set (e.g. a user without a minted key) it no-ops silently."""
+    homelab = shutil.which("homelab") or "/usr/local/bin/homelab"
+    if not os.path.exists(homelab):
+        return
+    if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")):
+        return
+    cmd = [
+        homelab, "memory", "store", content,
+        "--category", category,
+        "--tags", tags,
+        "--importance", str(importance),
+    ]
+    if expanded_keywords:
+        # CLI wants comma-separated keywords; the judge emits space-separated terms.
+        keywords = ",".join(expanded_keywords.replace(",", " ").split())
+        if keywords:
+            cmd += ["--keywords", keywords]
+    subprocess.run(cmd, capture_output=True, text=True, timeout=15, env=os.environ)
+
+
+def main() -> None:
+    # Graceful exit if claude CLI is not available
+    if not shutil.which("claude"):
+        return
+
+    try:
+        hook_input = json.load(sys.stdin)
+    except (json.JSONDecodeError, EOFError):
+        return
+
+    if isinstance(hook_input, dict) and hook_input.get("stop_hook_active", False):
+        return
+
+    transcript_path = ""
+    if isinstance(hook_input, dict):
+        transcript_path = hook_input.get("transcript_path", "")
+
+    if not transcript_path or not os.path.exists(transcript_path):
+        return
+
+    user_message = ""
+    assistant_response = ""
+    try:
+        MAX_TAIL_BYTES = 50_000
+        with open(transcript_path, "rb") as f:
+            f.seek(0, io.SEEK_END)
+            size = f.tell()
+            f.seek(max(0, size - MAX_TAIL_BYTES))
+            tail = f.read().decode("utf-8", errors="replace")
+        lines = tail.split("\n")
+
+        for line in reversed(lines):
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                entry = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            role = entry.get("role", "")
+            content = entry.get("content", "")
+            if isinstance(content, list):
+                content = " ".join(
+                    b.get("text", "") for b in content
+                    if isinstance(b, dict) and b.get("type") == "text"
+                )
+            content = str(content)[:2000]
+            if role == "assistant" and not assistant_response:
+                assistant_response = content
+            elif role == "user" and not user_message:
+                user_message = content
+            if user_message and assistant_response:
+                break
+    except Exception:
+        return
+
+    if not user_message or len(user_message.strip()) < 10:
+        return
+
+    prompt = JUDGE_PROMPT.format(
+        user_message=user_message,
+        assistant_response=assistant_response[:1000],
+    )
+
+    try:
+        result = subprocess.run(
+            ["claude", "-p", prompt, "--model", "haiku"],
+            capture_output=True, text=True, timeout=30,
+            env={**os.environ, "CLAUDECODE": ""},
+        )
+        if result.returncode != 0:
+            return
+        response_text = result.stdout.strip()
+        if response_text.startswith("```"):
+            lines = response_text.split("\n")
+            lines = [l for l in lines if not l.strip().startswith("```")]
+            response_text = "\n".join(lines).strip()
+        judge_result = json.loads(response_text)
+        events = judge_result.get("events", [])
+        if not events:
+            return
+    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError):
+        return
+
+    category_map = {
+        "correction": "preferences",
+        "preference": "preferences",
+        "decision": "decisions",
+        "fact": "facts",
+    }
+
+    for event in events:
+        # Judge output is untrusted: a non-dict event or non-numeric importance must not crash the hook.
+        try:
+            content = event.get("content", "")
+            if not content:
+                continue
+            event_type = event.get("type", "fact")
+            importance = max(0.0, min(1.0, float(event.get("importance", 0.7))))
+            category = category_map.get(event_type, "facts")
+            tags = f"auto-learned,{event_type}"
+            expanded_keywords = event.get("expanded_keywords", "")
+            _store_via_homelab_cli(content, category, tags, importance, expanded_keywords)
+        except Exception:
+            pass  # Never crash the async hook
+
+
+if __name__ == "__main__":
+    main()
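Reviewer note on `auto-learn.py`: the judge reply is model output, so its shape is not guaranteed. A defensive parse could look roughly like the sketch below. Both helpers are hypothetical (not in the PR), and unlike the hook, which strips fences only when the reply starts with one, this version drops every fence line unconditionally.

```python
import json


def parse_judge_output(text: str) -> list:
    """Strip Markdown fences, parse JSON, and keep only well-formed events."""
    body = "\n".join(
        line for line in text.strip().split("\n")
        if not line.strip().startswith("```")
    ).strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, dict):
        return []
    events = parsed.get("events")
    if not isinstance(events, list):
        return []
    # Keep only dict events that actually carry content.
    return [e for e in events if isinstance(e, dict) and e.get("content")]


def normalise_keywords(expanded: str) -> str:
    """Space- or comma-separated terms -> the comma-separated form the CLI takes."""
    return ",".join(expanded.replace(",", " ").split())
```

Filtering up front means the store loop only ever sees dicts with content, which keeps the per-event error handling small.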
--- a/scripts/workstation/claude-hooks/homelab-memory-recall.py
+++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py
@ -0,0 +1,70 @@
+#!/usr/bin/env python3
+"""UserPromptSubmit hook: inject relevant memories via `homelab memory recall`.
+
+Replaces the claude-memory MCP recall path. Instead of instructing the model to
+call the memory_recall MCP tool, this hook runs the homelab CLI (a direct client
+to the same claude-memory HTTP API) and injects the ACTUAL results as context —
+so recall is automatic, needs no model tool-call, and works with the MCP
+uninstalled. Best-effort: any failure exits 0 silently (recall just doesn't
+happen that turn, exactly like the MCP being unavailable).
+
+Began as a wizard-only trial of the MCP deprecation (2026-06-20), since rolled out to all users. Reversible: restore the
+plugin command in ~/.claude/settings.json (backup: settings.json.bak-pre-homelab-memory).
+"""
+
+import json
+import os
+import shutil
+import subprocess
+import sys
+
+
+def main() -> None:
+    try:
+        hook_input = json.load(sys.stdin)
+    except (json.JSONDecodeError, EOFError):
+        return
+
+    prompt = ""
+    if isinstance(hook_input, dict):
+        prompt = hook_input.get("prompt") or hook_input.get("user_prompt") or ""
+        if not prompt and isinstance(hook_input.get("content"), str):
+            prompt = hook_input["content"]
+    prompt = (prompt or "").strip()
+
+    # Same gates as the original recall hook: skip short prompts, code/JSON/XML blobs.
+    if len(prompt) < 10 or prompt[0] in "`{<":
+        return
+
+    homelab = shutil.which("homelab") or "/usr/local/bin/homelab"
+    if not os.path.exists(homelab):
+        return
+    if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")):
+        return
+
+    try:
+        res = subprocess.run(
+            [homelab, "memory", "recall", prompt, "--limit", "5"],
+            capture_output=True, text=True, timeout=4, env=os.environ,
+        )
+    except (subprocess.TimeoutExpired, OSError):
+        return
+
+    out = (res.stdout or "").strip()
+    if res.returncode != 0 or not out:
+        return
+
+    context = (
+        "Relevant stored memories (via `homelab memory recall`) — incorporate "
+        "naturally if useful; do NOT mention this lookup to the user:\n\n" + out
+    )
+    print(json.dumps({
+        "hookSpecificOutput": {
+            "hookEventName": "UserPromptSubmit",
+            "additionalContext": context,
+        }
+    }))
+
+
+if __name__ == "__main__":
+    main()
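Reviewer note on `homelab-memory-recall.py`: the gate and the output envelope are the two pieces worth unit-testing in isolation. A minimal sketch, with `should_recall` and `hook_output` as hypothetical names that factor out logic the hook currently keeps inline in `main()`:

```python
import json


def should_recall(prompt) -> bool:
    """Mirror the hook's gate: skip short prompts and code/JSON/XML-looking ones."""
    prompt = (prompt or "").strip()
    return len(prompt) >= 10 and prompt[0] not in "`{<"


def hook_output(recalled: str) -> str:
    """Build the UserPromptSubmit JSON envelope the hook prints to stdout."""
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": recalled,
        }
    })
```

With the gate as a pure function, the short-prompt and blob cases can be checked without a `homelab` binary or an API key on the test machine.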
--- a/scripts/workstation/claude-hooks/post-compact-recovery.sh
+++ b/scripts/workstation/claude-hooks/post-compact-recovery.sh
@ -0,0 +1,64 @@
+#!/bin/bash
+# UserPromptSubmit hook: Inject recovery context after compaction
+# This hook runs on each user prompt, but only injects context once after compaction.
+
+# Read hook input from stdin
+INPUT=$(cat)
+
+# Extract session ID
+SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"')
+
+# Define marker path
+MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}"
+MARKER_DIR="${MEMORY_HOME}/state/compaction-markers"
+MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json"
+
+# Fast path: no marker means no recent compaction, exit immediately
+if [ ! -f "$MARKER_FILE" ]; then
+    exit 0
+fi
+
+# Read marker contents
+MARKER=$(cat "$MARKER_FILE")
+
+# Validate JSON before processing
+if ! echo "$MARKER" | jq -e . >/dev/null 2>&1; then
+    rm -f "$MARKER_FILE"
+    exit 0
+fi
+
+# Extract data from marker
+COMPACTED_AT=$(echo "$MARKER" | jq -r '.compactedAt // "unknown"')
+PERSONALITY=$(echo "$MARKER" | jq -r '.personalityReminder // ""')
+
+# Build remembered facts summary (limit to ~500 chars)
+FACTS_SUMMARY=$(echo "$MARKER" | jq -r '
+    .rememberedFacts[:10] |
+    map("- [\(.category // "fact")] \(.content)") |
+    join("\n")
+' 2>/dev/null || echo "")
+
+# Build recovery context (kept under 1000 tokens)
+RECOVERY_CONTEXT="[Claude Memory Recovery - Context compacted at ${COMPACTED_AT}]
+
+${PERSONALITY}
+
+Key memories from before compaction:
+${FACTS_SUMMARY}
+
+Run 'homelab memory recall <query>' if you need more context about past conversations."
+
+# Output JSON with additional context for injection
+cat << EOF
+{
+  "hookSpecificOutput": {
+    "hookEventName": "UserPromptSubmit",
+    "additionalContext": $(echo "$RECOVERY_CONTEXT" | jq -Rs .)
+  }
+}
+EOF
+
+# Delete marker file (one-time injection)
+rm -f "$MARKER_FILE"
+
+exit 0
--- a/scripts/workstation/claude-hooks/pre-compact-backup.sh
+++ b/scripts/workstation/claude-hooks/pre-compact-backup.sh
@ -0,0 +1,43 @@
+#!/bin/bash
+# PreCompact hook: Save key memories before compaction
+set -e
+
+INPUT=$(cat)
+SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"')
+
+MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}"
+MARKER_DIR="${MEMORY_HOME}/state/compaction-markers"
+MEMORY_DB="${MEMORY_HOME}/memory/memory.db"
+MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json"
+
+mkdir -p "$MARKER_DIR"
+
+TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
+
+# Use the API when a key is set; otherwise fall back to the local SQLite DB
+REMEMBERED_FACTS="[]"
+if [ -n "${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" ]; then
+    API_KEY="${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}"
+    API_URL="${MEMORY_API_URL:-${CLAUDE_MEMORY_API_URL:-}}"
+    if [ -n "$API_URL" ]; then
+        REMEMBERED_FACTS=$(curl -sf -H "Authorization: Bearer ${API_KEY}" \
+            "${API_URL}/api/memories?limit=20" 2>/dev/null | \
+            jq '[.memories[] | {content, category, importance}]' 2>/dev/null || echo "[]")
+    fi
+elif [ -f "$MEMORY_DB" ]; then
+    REMEMBERED_FACTS=$(sqlite3 -json "$MEMORY_DB" \
+        "SELECT content, category, importance FROM memories ORDER BY importance DESC, created_at DESC LIMIT 20" 2>/dev/null || echo "[]")
+fi
+
+if ! echo "$REMEMBERED_FACTS" | jq empty 2>/dev/null; then
+    REMEMBERED_FACTS="[]"
+fi
+
+jq -n \
+  --arg sid "$SESSION_ID" \
+  --arg ts "$TIMESTAMP" \
+  --argjson facts "$REMEMBERED_FACTS" \
+  '{sessionId: $sid, compactedAt: $ts, rememberedFacts: $facts}' \
+  > "$MARKER_FILE"
+
+exit 0
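Reviewer note on the compaction pair: `pre-compact-backup.sh` writes one marker per session and `post-compact-recovery.sh` consumes it exactly once. The contract between the two can be sketched as a round trip. This is a hypothetical Python rendering for illustration; the field names (`sessionId`, `compactedAt`, `rememberedFacts`) are taken from the scripts, while the function names and the shortened recovery text are assumptions.

```python
import json
import os
import tempfile


def write_marker(marker_dir: str, session_id: str, compacted_at: str, facts: list) -> str:
    """PreCompact side: persist one marker file per session."""
    os.makedirs(marker_dir, exist_ok=True)
    path = os.path.join(marker_dir, f"{session_id}.json")
    with open(path, "w") as fh:
        json.dump({"sessionId": session_id, "compactedAt": compacted_at,
                   "rememberedFacts": facts}, fh)
    return path


def consume_marker(marker_dir: str, session_id: str):
    """UserPromptSubmit side: return recovery text once, then delete the marker."""
    path = os.path.join(marker_dir, f"{session_id}.json")
    if not os.path.isfile(path):
        return None  # fast path: no recent compaction
    try:
        with open(path) as fh:
            marker = json.load(fh)
    except json.JSONDecodeError:
        marker = None
    os.remove(path)  # one-time injection, and corrupt markers are dropped too
    if not isinstance(marker, dict):
        return None
    lines = [f"- [{fact.get('category') or 'fact'}] {fact.get('content')}"
             for fact in (marker.get("rememberedFacts") or [])[:10]]
    return f"Context compacted at {marker.get('compactedAt', 'unknown')}\n" + "\n".join(lines)
```

The second call for the same session returns `None`, which is the property that keeps the recovery text from being injected on every later prompt.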
--- a/scripts/workstation/claude-hooks/wire-memory-hooks.py
+++ b/scripts/workstation/claude-hooks/wire-memory-hooks.py
@ -0,0 +1,90 @@
+#!/usr/bin/env python3
+"""Wire the homelab-memory hooks into a user's ~/.claude/settings.json.
+
+Part of the claude-memory MCP -> homelab CLI migration (all-users rollout).
+Two passes, idempotent, never touching `env` (the per-user MEMORY_API_KEY) or any
+other setting:
+  (0) PRUNE any hook command still pointing at the retired claude-memory plugin
+      (`plugins/claude-memory/hooks/`). install_memory() rm -rf's that dir, so
+      those entries are dangling — and a missing UserPromptSubmit hook exits 2,
+      a BLOCKING error that erases the prompt and freezes the session (devvm emo
+      incident 2026-06-22). Must run BEFORE the additive pass: the plugin shares
+      basenames with the homelab hooks, so without pruning, the "already present"
+      check below matches the dead plugin path and skips the real install.
+  (1) ADD each homelab hook group when no existing command references its script.
+
+Usage: wire-memory-hooks.py <home_dir>
+Exit 0 on success (changed or already-present); 1 only on an unreadable settings file.
+"""
+import json
+import os
+import sys
+
+home = sys.argv[1]
+settings = os.path.join(home, ".claude", "settings.json")
+hooks_dir = os.path.join(home, ".claude", "hooks")
+
+# (event, script-basename used for the if-absent check, full command, extra fields)
+WANT = [
+    ("PreCompact", "pre-compact-backup.sh", f"{hooks_dir}/pre-compact-backup.sh", {"timeout": 30}),
+    ("UserPromptSubmit", "post-compact-recovery.sh", f"{hooks_dir}/post-compact-recovery.sh", {"timeout": 10}),
+    ("UserPromptSubmit", "homelab-memory-recall.py", f"python3 {hooks_dir}/homelab-memory-recall.py", {"timeout": 8}),
+    ("Stop", "auto-learn.py", f"python3 {hooks_dir}/auto-learn.py", {"async": True}),
+]
+
+try:
+    if os.path.exists(settings) and os.path.getsize(settings) > 0:
+        with open(settings) as fh:
+            data = json.load(fh)
+    else:
+        data = {}
+except (json.JSONDecodeError, OSError) as e:
+    print(f"ERROR: cannot read {settings}: {e}", file=sys.stderr)
+    sys.exit(1)
+
+hooks = data.setdefault("hooks", {})
+changed = False
+
+# (0) Prune dead claude-memory plugin hooks (see module docstring). Must precede
+# the additive pass so shared basenames don't mask a needed install.
+DEAD_REF = "plugins/claude-memory/hooks/"
+for event in list(hooks.keys()):
+    new_groups = []
+    removed_any = False
+    for g in (hooks.get(event) or []):
+        original = g.get("hooks") or []
+        kept = [h for h in original if DEAD_REF not in (h.get("command", "") or "")]
+        if len(kept) != len(original):
+            removed_any = True
+        if kept:
+            new_groups.append({**g, "hooks": kept})
+    if removed_any:
+        changed = True
+        if new_groups:
+            hooks[event] = new_groups
+        else:
+            del hooks[event]
+
+# (1) Additively wire each homelab hook, if no command already references it.
+for event, basename, command, extra in WANT:
+    groups = hooks.setdefault(event, [])
+    already = any(
+        basename in (h.get("command", "") or "")
+        for g in groups
+        for h in (g.get("hooks", []) or [])
+    )
+    if already:
+        continue
+    entry = {"type": "command", "command": command}
+    entry.update(extra)
+    groups.append({"hooks": [entry]})
+    changed = True
+
+if changed:
+    tmp = settings + ".tmp"
+    with open(tmp, "w") as fh:
+        json.dump(data, fh, indent=2)
+    os.replace(tmp, settings)
+    print(f"wired memory hooks -> {settings}")
+else:
+    print(f"memory hooks already present -> {settings} (no change)")
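Illustrative note on the script above, not part of the commit: the reason the prune pass must run first is easiest to see on an in-memory settings dict. The sketch below restates both passes as one hypothetical function, `wire`, and runs it twice against a made-up settings fragment whose only hook still points at the retired plugin path. The `/home/u` paths are assumptions for the example, and the `extra` fields (`timeout`, `async`) are left out for brevity.

```python
DEAD_REF = "plugins/claude-memory/hooks/"  # path fragment used by the script above


def wire(hooks, want):
    """Prune dead plugin commands, then add missing ones. Returns True if hooks changed."""
    changed = False
    # Pass 0: drop any command that still references the retired plugin directory.
    for event in list(hooks):
        kept_groups = []
        removed = False
        for group in hooks.get(event) or []:
            original = group.get("hooks") or []
            kept = [h for h in original if DEAD_REF not in (h.get("command") or "")]
            removed = removed or len(kept) != len(original)
            if kept:
                kept_groups.append({**group, "hooks": kept})
        if removed:
            changed = True
            if kept_groups:
                hooks[event] = kept_groups
            else:
                del hooks[event]
    # Pass 1: add each wanted hook unless some command already mentions its basename.
    for event, basename, command in want:
        groups = hooks.setdefault(event, [])
        present = any(
            basename in (h.get("command") or "")
            for g in groups
            for h in (g.get("hooks") or [])
        )
        if present:
            continue
        groups.append({"hooks": [{"type": "command", "command": command}]})
        changed = True
    return changed


# Hypothetical leftover from the plugin: same basename as the new hook, dead path.
hooks = {"UserPromptSubmit": [{"hooks": [{
    "type": "command",
    "command": "/home/u/.claude/plugins/claude-memory/hooks/post-compact-recovery.sh",
}]}]}
want = [("UserPromptSubmit", "post-compact-recovery.sh",
         "/home/u/.claude/hooks/post-compact-recovery.sh")]

first = wire(hooks, want)   # prunes the dead entry, then installs the real one
second = wire(hooks, want)  # idempotent: nothing left to change
```

If the additive pass ran alone, the basename match on the dead path would count as "already present" and the working hook would never be installed.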
--- a/scripts/workstation/claude-skills/README.md
+++ b/scripts/workstation/claude-skills/README.md
@ -0,0 +1,31 @@
+# claude-skills — vendored agent-skill snapshot
+
+Point-in-time snapshot of the admin's (`wizard`) Claude Code agent skills, deployed
+per-user by `install_skills()` in `../../t3-provision-users.sh` (scoped to the
+`SKILL_USERS` allowlist). Each subdirectory is one skill (`SKILL.md` + any bundled
+references). The provisioner copies a skill into `~/.agents/skills/<name>/` (owned by
+the user) and symlinks `~/.claude/skills/<name> -> ../../.agents/skills/<name>` — the
+layout the `skills` CLI's `-g` install produces; Claude Code reads `~/.claude/skills/`.
+
+## Why vendored (not `npx skills add` at provision time)
+
+Upstream has drifted from this set: on `mattpocock/skills` master, `diagnose` was
+renamed to `diagnosing-bugs` and `write-a-skill` to `writing-great-skills`, and
+`caveman` + `zoom-out` are no longer published, so `npx skills` cannot reproduce this
+exact set. Vendoring is also offline/deterministic and keeps GitHub-clone +
+unpinned-CLI dependencies out of the hourly **root** reconcile.
+
+## Sources
+
+- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
+- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
+
+## Refreshing
+
+Re-snapshot from a current install and commit the diff:
+
+```sh
+cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
+```
+
+Snapshot taken 2026-06-23.
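Illustrative note on the README above, not part of the commit: the copy-plus-relative-symlink layout it describes can be sketched as follows. The real provisioner is the shell function `install_skills()`, which is not shown in this diff, so `install_skill` below is a hypothetical stand-in that only mirrors the described directory layout. It leaves out ownership changes and the `SKILL_USERS` allowlist, and it runs against a temporary directory rather than a real home.

```python
import os
import shutil
import tempfile


def install_skill(snapshot_dir, home, name):
    """Copy one vendored skill into ~/.agents/skills and link it from ~/.claude/skills."""
    target = os.path.join(home, ".agents", "skills", name)
    link = os.path.join(home, ".claude", "skills", name)
    shutil.copytree(os.path.join(snapshot_dir, name), target, dirs_exist_ok=True)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if os.path.islink(link):
        os.remove(link)  # replace a stale link so re-runs stay idempotent
    # Relative target, resolved from ~/.claude/skills/: ../../.agents/skills/<name>
    os.symlink(os.path.join("..", "..", ".agents", "skills", name), link)
    return link


with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "claude-skills")
    os.makedirs(os.path.join(snapshot, "caveman"))
    with open(os.path.join(snapshot, "caveman", "SKILL.md"), "w") as fh:
        fh.write("---\nname: caveman\n---\n")
    home = os.path.join(tmp, "home")
    link = install_skill(snapshot, home, "caveman")
    install_skill(snapshot, home, "caveman")  # second run must not fail
    link_target = os.readlink(link)
    skill_visible = os.path.isfile(os.path.join(link, "SKILL.md"))
```

A relative link target keeps the layout valid if the home directory is moved or mounted at a different path.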
--- a/scripts/workstation/claude-skills/caveman/SKILL.md
+++ b/scripts/workstation/claude-skills/caveman/SKILL.md
@ -0,0 +1,49 @@
+---
+name: caveman
+description: >
+  Ultra-compressed communication mode. Cuts token usage ~75% by dropping
+  filler, articles, and pleasantries while keeping full technical accuracy.
+  Use when user says "caveman mode", "talk like caveman", "use caveman",
+  "less tokens", "be brief", or invokes /caveman.
+---
+
+Respond terse like smart caveman. All technical substance stay. Only fluff die.
+
+## Persistence
+
+ACTIVE EVERY RESPONSE once triggered. No revert after many turns. No filler drift. Still active if unsure. Off only when user says "stop caveman" or "normal mode".
+
+## Rules
+
+Drop: articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries (sure/certainly/of course/happy to), hedging. Fragments OK. Short synonyms (big not extensive, fix not "implement a solution for"). Abbreviate common terms (DB/auth/config/req/res/fn/impl). Strip conjunctions. Use arrows for causality (X -> Y). One word when one word enough.
+
+Technical terms stay exact. Code blocks unchanged. Errors quoted exact.
+
+Pattern: `[thing] [action] [reason]. [next step].`
+
+Not: "Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by..."
+Yes: "Bug in auth middleware. Token expiry check use `<` not `<=`. Fix:"
+
+### Examples
+
+**"Why React component re-render?"**
+
+> Inline obj prop -> new ref -> re-render. `useMemo`.
+
+**"Explain database connection pooling."**
+
+> Pool = reuse DB conn. Skip handshake -> fast under load.
+
+## Auto-Clarity Exception
+
+Drop caveman temporarily for: security warnings, irreversible action confirmations, multi-step sequences where fragment order risks misread, user asks to clarify or repeats question. Resume caveman after clear part done.
+
+Example -- destructive op:
+
+> **Warning:** This will permanently delete all rows in the `users` table and cannot be undone.
+>
+> ```sql
+> DROP TABLE users;
+> ```
+>
+> Caveman resume. Verify backup exist first.
--- a/scripts/workstation/claude-skills/diagnose/SKILL.md
+++ b/scripts/workstation/claude-skills/diagnose/SKILL.md
@ -0,0 +1,117 @@
+---
+name: diagnose
+description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
+---
+
+# Diagnose
+
+A discipline for hard bugs. Skip phases only when explicitly justified.
+
+When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
+
+## Phase 1 — Build a feedback loop
+
+**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
+
+Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
+
+### Ways to construct one — try them in roughly this order
+
+1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
+2. **Curl / HTTP script** against a running dev server.
+3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
+4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
+5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
+6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
+7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
+8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
+9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
+10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
+
+Build the right feedback loop, and the bug is 90% fixed.
+
+### Iterate on the loop itself
+
+Treat the loop as a product. Once you have _a_ loop, ask:
+
+- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
+- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
+- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
+
+A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
+
+### Non-deterministic bugs
+
+The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
+
+### When you genuinely cannot build a loop
+
+Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
+
+Do not proceed to Phase 2 until you have a loop you believe in.
+
+## Phase 2 — Reproduce
+
+Run the loop. Watch the bug appear.
+
+Confirm:
+
+- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
+- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
+- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
+
+Do not proceed until you reproduce the bug.
+
+## Phase 3 — Hypothesise
+
+Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
+
+Each hypothesis must be **falsifiable**: state the prediction it makes.
+
+> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
+
+If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
+
+**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
+
+## Phase 4 — Instrument
+
+Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
+
+Tool preference:
+
+1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
+2. **Targeted logs** at the boundaries that distinguish hypotheses.
+3. Never "log everything and grep".
+
+**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
+
+**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
+
+## Phase 5 — Fix + regression test
+
+Write the regression test **before the fix** — but only if there is a **correct seam** for it.
+
+A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
+
+**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
+
+If a correct seam exists:
+
+1. Turn the minimised repro into a failing test at that seam.
+2. Watch it fail.
+3. Apply the fix.
+4. Watch it pass.
+5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
+
+## Phase 6 — Cleanup + post-mortem
+
+Required before declaring done:
+
+- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
+- [ ] Regression test passes (or absence of seam is documented)
+- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
+- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
+- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
+
+**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
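Illustrative note on the skill above, not part of the commit: the "loop the trigger 100×" advice for non-deterministic bugs amounts to measuring a reproduction rate and then trying to raise it. A minimal sketch, assuming the bug can be wrapped in a zero-argument callable that returns `True` when it reproduces. `reproduction_rate` and `flaky_trigger` are hypothetical names, and the seeded random draw only stands in for a real flaky operation.

```python
import random


def reproduction_rate(trigger, runs=100):
    """Call trigger `runs` times and return the fraction of runs that reproduced the bug."""
    failures = sum(1 for _ in range(runs) if trigger())
    return failures / runs


# Stand-in for a flaky code path; seeding keeps the harness itself deterministic.
rng = random.Random(1234)


def flaky_trigger(threshold=0.3):
    """Pretend the bug fires whenever the draw falls below the threshold."""
    return rng.random() < threshold


rate = reproduction_rate(flaky_trigger, runs=100)
```

Re-running the same harness after each change (added stress, narrowed timing window, injected sleep) gives a number to compare, which is what turns a rare flake into a usable feedback loop.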
--- a/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh
+++ b/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh
@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+# Human-in-the-loop reproduction loop.
+# Copy this file, edit the steps below, and run it.
+# The agent runs the script; the user follows prompts in their terminal.
+#
+# Usage:
+#   bash hitl-loop.template.sh
+#
+# Two helpers:
+#   step "<instruction>"          → show instruction, wait for Enter
+#   capture VAR "<question>"      → show question, read response into VAR
+#
+# At the end, captured values are printed as KEY=VALUE for the agent to parse.
+
+set -euo pipefail
+
+step() {
+  printf '\n>>> %s\n' "$1"
+  read -r -p "    [Enter when done] " _
+}
+
+capture() {
+  local var="$1" question="$2" answer
+  printf '\n>>> %s\n' "$question"
+  read -r -p "    > " answer
+  printf -v "$var" '%s' "$answer"
+}
+
+# --- edit below ---------------------------------------------------------
+
+step "Open the app at http://localhost:3000 and sign in."
+
+capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
+
+capture ERROR_MSG "Paste the error message (or 'none'):"
+
+# --- edit above ---------------------------------------------------------
+
+printf '\n--- Captured ---\n'
+printf 'ERRORED=%s\n' "$ERRORED"
+printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
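Illustrative note on the template above, not part of the commit: the script prints its captured values as `KEY=VALUE` lines after a `--- Captured ---` marker, so the agent-side parsing can be sketched as below. `parse_captured` is a hypothetical helper and the sample transcript is made up. Splitting on the first `=` only is deliberate, since a pasted error message may itself contain `=`.

```python
def parse_captured(output):
    """Return the KEY=VALUE pairs printed after the '--- Captured ---' marker."""
    _, marker, tail = output.partition("--- Captured ---")
    if not marker:
        return {}  # the script did not reach its summary section
    captured = {}
    for line in tail.splitlines():
        key, sep, value = line.partition("=")  # split on the first '=' only
        if sep and key.isidentifier():
            captured[key] = value
    return captured


sample = (
    "\n>>> Open the app at http://localhost:3000 and sign in.\n"
    "\n--- Captured ---\n"
    "ERRORED=y\n"
    "ERROR_MSG=TypeError: x=1 is not valid\n"
)
result = parse_captured(sample)
```

Because `read -r` takes a single line, each value is one line long; a multi-line stack trace would need a different capture step in the template.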
--- a/scripts/workstation/claude-skills/find-skills/SKILL.md
+++ b/scripts/workstation/claude-skills/find-skills/SKILL.md
@ -0,0 +1,142 @@
+---
+name: find-skills
+description: Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
+---
+
+# Find Skills
+
+This skill helps you discover and install skills from the open agent skills ecosystem.
+
+## When to Use This Skill
+
+Use this skill when the user:
+
+- Asks "how do I do X" where X might be a common task with an existing skill
+- Says "find a skill for X" or "is there a skill for X"
+- Asks "can you do X" where X is a specialized capability
+- Expresses interest in extending agent capabilities
+- Wants to search for tools, templates, or workflows
+- Mentions they wish they had help with a specific domain (design, testing, deployment, etc.)
+
+## What is the Skills CLI?
+
+The Skills CLI (`npx skills`) is the package manager for the open agent skills ecosystem. Skills are modular packages that extend agent capabilities with specialized knowledge, workflows, and tools.
+
+**Key commands:**
+
+- `npx skills find [query]` - Search for skills interactively or by keyword
+- `npx skills add <package>` - Install a skill from GitHub or other sources
+- `npx skills check` - Check for skill updates
+- `npx skills update` - Update all installed skills
+
+**Browse skills at:** https://skills.sh/
+
+## How to Help Users Find Skills
+
+### Step 1: Understand What They Need
+
+When a user asks for help with something, identify:
+
+1. The domain (e.g., React, testing, design, deployment)
+2. The specific task (e.g., writing tests, creating animations, reviewing PRs)
+3. Whether this is a common enough task that a skill likely exists
+
+### Step 2: Check the Leaderboard First
+
+Before running a CLI search, check the [skills.sh leaderboard](https://skills.sh/) to see if a well-known skill already exists for the domain. The leaderboard ranks skills by total installs, surfacing the most popular and battle-tested options.
+
+For example, top skills for web development include:
+- `vercel-labs/agent-skills` — React, Next.js, web design (100K+ installs each)
+- `anthropics/skills` — Frontend design, document processing (100K+ installs)
+
+### Step 3: Search for Skills
+
+If the leaderboard doesn't cover the user's need, run the find command:
+
+```bash
+npx skills find [query]
+```
+
+For example:
+
+- User asks "how do I make my React app faster?" → `npx skills find react performance`
+- User asks "can you help me with PR reviews?" → `npx skills find pr review`
+- User asks "I need to create a changelog" → `npx skills find changelog`
+
+### Step 4: Verify Quality Before Recommending
+
+**Do not recommend a skill based solely on search results.** Always verify:
+
+1. **Install count** — Prefer skills with 1K+ installs. Be cautious with anything under 100.
+2. **Source reputation** — Official sources (`vercel-labs`, `anthropics`, `microsoft`) are more trustworthy than unknown authors.
+3. **GitHub stars** — Check the source repository. A skill from a repo with <100 stars should be treated with skepticism.
+
+### Step 5: Present Options to the User
+
+When you find relevant skills, present them to the user with:
+
+1. The skill name and what it does
+2. The install count and source
+3. The install command they can run
+4. A link to learn more at skills.sh
+
+Example response:
+
+```
+I found a skill that might help! The "react-best-practices" skill provides
+React and Next.js performance optimization guidelines from Vercel Engineering.
+(185K installs)
+
+To install it:
+npx skills add vercel-labs/agent-skills@react-best-practices
+
+Learn more: https://skills.sh/vercel-labs/agent-skills/react-best-practices
+```
+
+### Step 6: Offer to Install
+
+If the user wants to proceed, you can install the skill for them:
+
+```bash
+npx skills add <owner/repo@skill> -g -y
+```
+
+The `-g` flag installs globally (user-level) and `-y` skips confirmation prompts.
+
+## Common Skill Categories
+
+When searching, consider these common categories:
+
+| Category        | Example Queries                          |
+| --------------- | ---------------------------------------- |
+| Web Development | react, nextjs, typescript, css, tailwind |
+| Testing         | testing, jest, playwright, e2e           |
+| DevOps          | deploy, docker, kubernetes, ci-cd        |
+| Documentation   | docs, readme, changelog, api-docs        |
+| Code Quality    | review, lint, refactor, best-practices   |
+| Design          | ui, ux, design-system, accessibility     |
+| Productivity    | workflow, automation, git                |
+
+## Tips for Effective Searches
+
+1. **Use specific keywords**: "react testing" is better than just "testing"
+2. **Try alternative terms**: If "deploy" doesn't work, try "deployment" or "ci-cd"
+3. **Check popular sources**: Many skills come from `vercel-labs/agent-skills` or `ComposioHQ/awesome-claude-skills`
+
+## When No Skills Are Found
+
+If no relevant skills exist:
+
+1. Acknowledge that no existing skill was found
+2. Offer to help with the task directly using your general capabilities
+3. Suggest the user could create their own skill with `npx skills init`
+
+Example:
+
+```
+I searched for skills related to "xyz" but didn't find any matches.
+I can still help you with this task directly! Would you like me to proceed?
+
+If this is something you do often, you could create your own skill:
+npx skills init my-xyz-skill
+```
--- a/scripts/workstation/claude-skills/grill-me/SKILL.md
+++ b/scripts/workstation/claude-skills/grill-me/SKILL.md
@ -0,0 +1,10 @@
+---
+name: grill-me
+description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
+---
+
+Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
+
+Ask the questions one at a time.
+
+If a question can be answered by exploring the codebase, explore the codebase instead.
--- a/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md
+++ b/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md
@ -0,0 +1,47 @@
+# ADR Format
+
+ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
+
+Create the `docs/adr/` directory lazily — only when the first ADR is needed.
+
+## Template
+
+```md
+# {Short title of the decision}
+
+{1-3 sentences: what's the context, what did we decide, and why.}
+```
+
+That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
+
+## Optional sections
+
+Only include these when they add genuine value. Most ADRs won't need them.
+
+- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
+- **Considered Options** — only when the rejected alternatives are worth remembering
+- **Consequences** — only when non-obvious downstream effects need to be called out
+
+## Numbering
+
+Scan `docs/adr/` for the highest existing number and increment by one.
+
+## When to offer an ADR
+
+All three of these must be true:
+
+1. **Hard to reverse** — the cost of changing your mind later is meaningful
+2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
+3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
+
+If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
+
+### What qualifies
+
+- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
+- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
+- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
+- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit no-s are as valuable as the yes-s.
+- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
+- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
+- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.
--- a/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md
+++ b/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md
@ -0,0 +1,60 @@
+# CONTEXT.md Format
+
+## Structure
+
+```md
+# {Context Name}
+
+{One or two sentence description of what this context is and why it exists.}
+
+## Language
+
+**Order**:
+{A one or two sentence description of the term}
+_Avoid_: Purchase, transaction
+
+**Invoice**:
+A request for payment sent to a customer after delivery.
+_Avoid_: Bill, payment request
+
+**Customer**:
+A person or organization that places orders.
+_Avoid_: Client, buyer, account
+```
+
+## Rules
+
+- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others under `_Avoid_`.
+- **Keep definitions tight.** One or two sentences max. Define what it IS, not what it does.
+- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
+- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
+
+## Single vs multi-context repos
+
+**Single context (most repos):** One `CONTEXT.md` at the repo root.
+
+**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
+
+```md
+# Context Map
+
+## Contexts
+
+- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
+- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
+- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
+
+## Relationships
+
+- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
+- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
+- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
+```
+
+The skill infers which structure applies:
+
+- If `CONTEXT-MAP.md` exists, read it to find contexts
+- If only a root `CONTEXT.md` exists, single context
+- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
+
+When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
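Illustrative note on the format above, not part of the commit: the three inference rules reduce to two file checks in a fixed order. `detect_context_layout` is a hypothetical helper, and the return labels `multi`, `single`, and `none` are invented for this sketch.

```python
import os
import tempfile


def detect_context_layout(repo_root):
    """Return 'multi', 'single', or 'none' following the three rules above."""
    if os.path.isfile(os.path.join(repo_root, "CONTEXT-MAP.md")):
        return "multi"   # the map lists where each context lives
    if os.path.isfile(os.path.join(repo_root, "CONTEXT.md")):
        return "single"
    return "none"        # caller creates a root CONTEXT.md when the first term is resolved


with tempfile.TemporaryDirectory() as root:
    empty_layout = detect_context_layout(root)
    open(os.path.join(root, "CONTEXT.md"), "w").close()
    single_layout = detect_context_layout(root)
    open(os.path.join(root, "CONTEXT-MAP.md"), "w").close()
    multi_layout = detect_context_layout(root)
```

Checking for the map first matters: a multi-context repo may still carry a root `CONTEXT.md`, and the map should take precedence when both are present.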
--- a/scripts/workstation/claude-skills/grill-with-docs/SKILL.md
+++ b/scripts/workstation/claude-skills/grill-with-docs/SKILL.md
@ -0,0 +1,88 @@
+---
+name: grill-with-docs
+description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
+---
+
+<what-to-do>
+
+Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
+
+Ask the questions one at a time, waiting for feedback on each question before continuing.
+
+If a question can be answered by exploring the codebase, explore the codebase instead.
+
+</what-to-do>
+
+<supporting-info>
+
+## Domain awareness
+
+During codebase exploration, also look for existing documentation:
+
+### File structure
+
+Most repos have a single context:
+
+```
+/
+├── CONTEXT.md
+├── docs/
+│   └── adr/
+│       ├── 0001-event-sourced-orders.md
+│       └── 0002-postgres-for-write-model.md
+└── src/
+```
+
+If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
+
+```
+/
+├── CONTEXT-MAP.md
+├── docs/
+│   └── adr/                          ← system-wide decisions
+├── src/
+│   ├── ordering/
+│   │   ├── CONTEXT.md
+│   │   └── docs/adr/                 ← context-specific decisions
+│   └── billing/
+│       ├── CONTEXT.md
+│       └── docs/adr/
+```
+
+Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
+
+## During the session
+
+### Challenge against the glossary
+
+When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
+
+### Sharpen fuzzy language
+
+When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
+
+### Discuss concrete scenarios
+
+When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
+
+### Cross-reference with code
+
+When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
+
+### Update CONTEXT.md inline
+
+When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
+
+`CONTEXT.md` should be totally devoid of implementation details. Do not treat `CONTEXT.md` as a spec, a scratch pad, or a repository for implementation decisions. It is a glossary and nothing else.
+
+### Offer ADRs sparingly
+
+Only offer to create an ADR when all three are true:
+
+1. **Hard to reverse** — the cost of changing your mind later is meaningful
+2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
+3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
+
+If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
+
+</supporting-info>
--- a/scripts/workstation/claude-skills/handoff/SKILL.md
+++ b/scripts/workstation/claude-skills/handoff/SKILL.md
@ -0,0 +1,13 @@
+---
+name: handoff
+description: Compact the current conversation into a handoff document for another agent to pick up.
+argument-hint: "What will the next session be used for?"
+---
+
+Write a handoff document summarising the current conversation so a fresh agent can continue the work. Save it to a path produced by `mktemp -t handoff-XXXXXX.md` (read the file before you write to it).
+
+Suggest the skills to be used, if any, by the next session.
+
+Do not duplicate content already captured in other artifacts (PRDs, plans, ADRs, issues, commits, diffs). Reference them by path or URL instead.
+
+If the user passed arguments, treat them as a description of what the next session will focus on and tailor the doc accordingly.
--- a/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md
@ -0,0 +1,37 @@
+# Deepening
+
+How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**.
+
+## Dependency categories
+
+When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam.
+
+### 1. In-process
+
+Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed.
+
+### 2. Local-substitutable
+
+Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface.
+
+### 3. Remote but owned (Ports & Adapters)
+
+Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter.
+
+Recommendation shape: *"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."*
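+
+A minimal TypeScript sketch of this shape, under stated assumptions: `PricingPort`, `InMemoryPricing`, and `orderTotal` are hypothetical names invented for illustration, not part of any real codebase. Only the in-memory adapter is shown; a production adapter would satisfy the same port over HTTP, gRPC, or a queue.
+
+```typescript
+// Hypothetical port at the seam: the only thing the deep module knows about pricing.
+interface PricingPort {
+  priceOf(sku: string): Promise<number>;
+}
+
+// In-memory adapter for tests. A production sibling would satisfy the same
+// port over the network; it is omitted to keep the sketch self-contained.
+class InMemoryPricing implements PricingPort {
+  private readonly prices: Record<string, number>;
+
+  constructor(prices: Record<string, number>) {
+    this.prices = prices;
+  }
+
+  async priceOf(sku: string): Promise<number> {
+    const price = this.prices[sku];
+    if (price === undefined) throw new Error(`unknown sku: ${sku}`);
+    return price;
+  }
+}
+
+interface OrderLine {
+  sku: string;
+  quantity: number;
+}
+
+// The deep module: all order-total logic sits here; the transport is injected.
+async function orderTotal(lines: OrderLine[], pricing: PricingPort): Promise<number> {
+  let total = 0;
+  for (const line of lines) {
+    total += (await pricing.priceOf(line.sku)) * line.quantity;
+  }
+  return total;
+}
+```
+
+A test would construct `new InMemoryPricing({ ... })` and assert on the value `orderTotal` resolves to, so the test crosses the same seam production callers do.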
+
+### 4. True external (Mock)
+
+Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter.
+
+## Seam discipline
+
+- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection.
+- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them.
+
+## Testing strategy: replace, don't layer
+
+- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them.
+- Write new tests at the deepened module's interface. The **interface is the test surface**.
+- Tests assert on observable outcomes through the interface, not internal state.
+- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface.
--- a/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md
@ -0,0 +1,123 @@
+# HTML Report Format
+
+The architectural review is rendered as a single self-contained HTML file in the OS temp directory. Tailwind and Mermaid both come from CDNs. Mermaid handles graph-shaped diagrams reliably; hand-built divs and inline SVG handle the more editorial visuals (mass diagrams, cross-sections). Mix the two — don't lean on Mermaid for everything, it'll start to look generic.
+
+## Scaffold
+
+```html
+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <title>Architecture review — {{repo name}}</title>
+    <script src="https://cdn.tailwindcss.com"></script>
+    <script type="module">
+      import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs";
+      mermaid.initialize({ startOnLoad: true, theme: "neutral", securityLevel: "loose" });
+    </script>
+    <style>
+      /* small custom layer for things Tailwind doesn't cover cleanly:
+         dashed seam lines, hand-drawn-feeling arrow heads, etc. */
+      .seam { stroke-dasharray: 4 4; }
+      .leak { stroke: #dc2626; }
+      .deep { background: linear-gradient(135deg, #0f172a, #1e293b); }
+    </style>
+  </head>
+  <body class="bg-stone-50 text-slate-900 font-sans">
+    <main class="max-w-5xl mx-auto px-6 py-12 space-y-12">
+      <header>...</header>
+      <section id="candidates" class="space-y-10">...</section>
+      <section id="top-recommendation">...</section>
+    </main>
+  </body>
+</html>
+```
+
+## Header
+
+Repo name, date, and a compact legend: solid box = module, dashed line = seam, red arrow = leakage, thick dark box = deep module. No introduction paragraph — straight into the candidates.
+
+## Candidate card
+
+The diagrams carry the weight. Prose is sparse, plain, and uses the glossary terms ([LANGUAGE.md](LANGUAGE.md)) without ceremony.
+
+Each candidate is one `<article>`:
+
+- **Title** — short, names the deepening (e.g. "Collapse the Order intake pipeline").
+- **Badge row** — recommendation strength (`Strong` = emerald, `Worth exploring` = amber, `Speculative` = slate), plus a tag for the dependency category (`in-process`, `local-substitutable`, `ports & adapters`, `mock`).
+- **Files** — monospaced list, `font-mono text-sm`.
+- **Before / After diagram** — the centrepiece. Two columns, side by side. See patterns below.
+- **Problem** — one sentence. What hurts.
+- **Solution** — one sentence. What changes.
+- **Wins** — bullets, ≤6 words each. e.g. "Tests hit one interface", "Pricing logic stops leaking", "Delete 4 shallow wrappers".
+- **ADR callout** (if applicable) — one line in an amber-tinted box.
+
+No paragraphs of explanation. If the diagram needs a paragraph to be understood, redraw the diagram.
+
+## Diagram patterns
+
+Pick the pattern that fits the candidate. Mix them. Don't make every diagram look the same — variety is part of the point.
+
+### Mermaid graph (the workhorse for dependencies / call flow)
+
+Use a Mermaid `flowchart` or `graph` when the point is "X calls Y calls Z, and look at the mess." Wrap it in a Tailwind-styled card so it doesn't feel parachuted in. Style with `classDef` to colour the modules involved in a leak red and the deep module dark; `classDef` only styles nodes, so use `linkStyle` (indexed by the order links are declared) to colour the leakage edges themselves. Sequence diagrams work well for "before: 6 round-trips; after: 1."
+
+```html
+<div class="rounded-lg border border-slate-200 bg-white p-4">
+  <pre class="mermaid">
+    flowchart LR
+      A[OrderHandler] --> B[OrderValidator]
+      B --> C[OrderRepo]
+      C -.leak.-> D[PricingClient]
+      classDef leak stroke:#dc2626,stroke-width:2px;
+      class C,D leak
+      linkStyle 2 stroke:#dc2626,stroke-width:2px;
+  </pre>
+</div>
+```
+
+### Hand-built boxes-and-arrows (when Mermaid's layout fights you)
+
+Modules as `<div>`s with borders and labels. Arrows as inline SVG `<line>` or `<path>` elements positioned absolutely over a relative container. Reach for this when you want the "after" diagram to feel like one thick-bordered deep module with greyed-out internals — Mermaid won't render that with the right weight.
+
+### Cross-section (good for layered shallowness)
+
+Stack horizontal bands (`h-12 border-l-4`) to show layers a call passes through. Before: 6 thin layers each doing nothing. After: 1 thick band labelled with the consolidated responsibility.
+
+### Mass diagram (good for "interface as wide as implementation")
+
+Two rectangles per module — one for interface surface area, one for implementation. Before: interface rectangle is nearly as tall as the implementation rectangle (shallow). After: interface rectangle is short, implementation rectangle is tall (deep).
+
+### Call-graph collapse
+
+Before: a tree of function calls rendered as nested boxes. After: the same tree collapsed into one box, with the now-internal calls shown faded inside it.
+
+## Style guidance
+
+- Lean editorial, not corporate-dashboard. Generous whitespace. Serif optional for headings (`font-serif` works well with stone/slate).
+- Colour sparingly: one accent (emerald or indigo) plus red for leakage and amber for warnings.
+- Keep diagrams ~320px tall so before/after sits comfortably side by side without scrolling.
+- Use `text-xs uppercase tracking-wider` for module labels inside diagrams — they should read as schematic, not as UI.
+- The only scripts are the Tailwind CDN and the Mermaid ESM import. The report is otherwise static — no app code, no interactivity beyond Mermaid's own rendering.
+
+## Top recommendation section
+
+One larger card. Candidate name, one sentence on why, anchor link to its card. That's it.
+
+## Tone
+
+Plain English, concise — but the architectural nouns and verbs come straight from [LANGUAGE.md](LANGUAGE.md). Concision is not an excuse to drift.
+
+**Use exactly:** module, interface, implementation, depth, deep, shallow, seam, adapter, leverage, locality.
+
+**Never substitute:** component, service, unit (for module) · API, signature (for interface) · boundary (for seam) · layer, wrapper (for module, when you mean module).
+
+**Phrasings that fit the style:**
+
+- "Order intake module is shallow — interface nearly matches the implementation."
+- "Pricing leaks across the seam."
+- "Deepen: one interface, one place to test."
+- "Two adapters justify the seam: HTTP in prod, in-memory in tests."
+
+**Wins bullets** name the gain in glossary terms: *"locality: bugs concentrate in one module"*, *"leverage: one interface, N call sites"*, *"interface shrinks; implementation absorbs the wrappers"*. Don't write *"easier to maintain"* or *"cleaner code"* — those terms aren't in the glossary and don't earn their place.
+
+No hedging, no throat-clearing, no "it's worth noting that…". If a sentence could be a bullet, make it a bullet. If a bullet could be cut, cut it. If a term isn't in [LANGUAGE.md](LANGUAGE.md), reach for one that is before inventing a new one.
--- a/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md
@ -0,0 +1,44 @@
+# Interface Design
+
+When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best.
+
+Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**.
+
+## Process
+
+### 1. Frame the problem space
+
+Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate:
+
+- The constraints any new interface would need to satisfy
+- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md))
+- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete
+
+Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel.
+
+### 2. Spawn sub-agents
+
+Spawn 3+ sub-agents in parallel using the Agent tool. Each must produce a **radically different** interface for the deepened module.
+
+Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint:
+
+- Agent 1: "Minimise the interface — aim for 1–3 entry points max. Maximise leverage per entry point."
+- Agent 2: "Maximise flexibility — support many use cases and extension."
+- Agent 3: "Optimise for the most common caller — make the default case trivial."
+- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies."
+
+Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and CONTEXT.md vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language.
+
+Each sub-agent outputs:
+
+1. Interface (types, methods, params — plus invariants, ordering, error modes)
+2. Usage example showing how callers use it
+3. What the implementation hides behind the seam
+4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md))
+5. Trade-offs — where leverage is high, where it's thin
+
+### 3. Present and compare
+
+Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**.
+
+After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu.
--- a/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md
@ -0,0 +1,53 @@
+# Language
+
+Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point.
+
+## Terms
+
+**Module**
+Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice.
+_Avoid_: unit, component, service.
+
+**Interface**
+Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics.
+_Avoid_: API, signature (too narrow — those refer only to the type-level surface).
+
+**Implementation**
+What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise.
+
+**Depth**
+Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation.
+
+**Seam** _(from Michael Feathers)_
+A place where you can alter behaviour without editing in that place. The *location* at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it.
+_Avoid_: boundary (overloaded with DDD's bounded context).
+
+**Adapter**
+A concrete thing that satisfies an interface at a seam. Describes *role* (what slot it fills), not substance (what's inside).
+
+**Leverage**
+What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests.
+
+**Locality**
+What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere.
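+
+As a rough illustration of depth and leverage (all function names here are hypothetical), compare three shallow modules, whose required call order is part of what callers must learn, with one deeper module that hides that order:
+
+```typescript
+// Shallow: callers must know all three functions and the order to call them in,
+// so the interface is nearly as complex as the implementation.
+function parseLine(line: string): string[] {
+  return line.split(",");
+}
+function trimCells(cells: string[]): string[] {
+  return cells.map((cell) => cell.trim());
+}
+function dropEmpty(cells: string[]): string[] {
+  return cells.filter((cell) => cell.length > 0);
+}
+
+// Deeper: one small interface; the ordering and cleanup rules live behind it.
+function readCells(line: string): string[] {
+  return dropEmpty(trimCells(parseLine(line)));
+}
+```
+
+Callers of `readCells` learn one function and get the cleanup rules for free (leverage); a fix to those rules lands in one place (locality).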
+
+## Principles
+
+- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface.
+- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep.
+- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
+- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it.
+
+## Relationships
+
+- A **Module** has exactly one **Interface** (the surface it presents to callers and tests).
+- **Depth** is a property of a **Module**, measured against its **Interface**.
+- A **Seam** is where a **Module**'s **Interface** lives.
+- An **Adapter** sits at a **Seam** and satisfies the **Interface**.
+- **Depth** produces **Leverage** for callers and **Locality** for maintainers.
+
+## Rejected framings
+
+- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead.
+- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know.
+- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**.
--- a/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md
@ -0,0 +1,81 @@
+---
+name: improve-codebase-architecture
+description: Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.
+---
+
+# Improve Codebase Architecture
+
+Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability.
+
+## Glossary
+
+Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md).
+
+- **Module** — anything with an interface and an implementation (function, class, package, slice).
+- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature.
+- **Implementation** — the code inside.
+- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation.
+- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.")
+- **Adapter** — a concrete thing satisfying an interface at a seam.
+- **Leverage** — what callers get from depth.
+- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place.
+
+Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list):
+
+- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep.
+- **The interface is the test surface.**
+- **One adapter = hypothetical seam. Two adapters = real seam.**
+
+This skill is _informed_ by the project's domain model. The domain language gives names to good seams; ADRs record decisions the skill should not re-litigate.
+
+## Process
+
+### 1. Explore
+
+Read the project's domain glossary and any ADRs in the area you're touching first.
+
+Then use the Agent tool with `subagent_type=Explore` to walk the codebase. Don't follow rigid heuristics — explore organically and note where you experience friction:
+
+- Where does understanding one concept require bouncing between many small modules?
+- Where are modules **shallow** — interface nearly as complex as the implementation?
+- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)?
+- Where do tightly-coupled modules leak across their seams?
+- Which parts of the codebase are untested, or hard to test through their current interface?
+
+Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want.
+
+### 2. Present candidates as an HTML report
+
+Write a self-contained HTML file to the OS temp directory so nothing lands in the repo. Resolve the temp dir from `$TMPDIR`, falling back to `/tmp` (or `%TEMP%` on Windows), and write to `<tmpdir>/architecture-review-<timestamp>.html` so each run gets a fresh file. Open it for the user — `xdg-open <path>` on Linux, `open <path>` on macOS, `start <path>` on Windows — and tell them the absolute path.
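+
+One way the path resolution could look, sketched in TypeScript. `reportPath` is a hypothetical helper; the timestamp format and the `/tmp` fallback when `%TEMP%` is unset on Windows are assumptions, not requirements of this skill. The environment and platform are passed in so the function stays pure.
+
+```typescript
+// Hypothetical helper: resolves <tmpdir>/architecture-review-<timestamp>.html.
+function reportPath(
+  env: Record<string, string | undefined>,
+  platform: string,
+  now: Date,
+): string {
+  const isWindows = platform === "win32";
+  // $TMPDIR first, then %TEMP% on Windows, then /tmp (assumed last resort).
+  const tmpdir = env["TMPDIR"] || (isWindows ? env["TEMP"] : undefined) || "/tmp";
+  const separator = isWindows ? "\\" : "/";
+  // Assumed timestamp format: ISO 8601 with ":" and "." made filename-safe.
+  const stamp = now.toISOString().replace(/[:.]/g, "-");
+  // Strip trailing separators (macOS sets TMPDIR with a trailing slash).
+  const base = tmpdir.replace(/[\\/]+$/, "");
+  return `${base}${separator}architecture-review-${stamp}.html`;
+}
+```
+
+In Node this might be called as `reportPath(process.env, process.platform, new Date())`.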
+
+The report uses **Tailwind via CDN** for layout and styling, and **Mermaid via CDN** for diagrams where a graph/flow/sequence reliably communicates the structure. Mix Mermaid with hand-crafted CSS/SVG visuals — use Mermaid when relationships are graph-shaped (call graphs, dependencies, sequences), and hand-built divs/SVG when you want something more editorial (mass diagrams, cross-sections, collapse animations). Each candidate gets a **before/after visualisation**. Be visual.
+
+Render each candidate as a card with the following fields:
+
+- **Files** — which files/modules are involved
+- **Problem** — why the current architecture is causing friction
+- **Solution** — plain English description of what would change
+- **Benefits** — explained in terms of locality and leverage, and how tests would improve
+- **Before / After diagram** — side-by-side, custom-drawn, illustrating the shallowness and the deepening
+- **Recommendation strength** — one of `Strong`, `Worth exploring`, `Speculative`, rendered as a badge
+
+End the report with a **Top recommendation** section: which candidate you'd tackle first and why.
+
+**Use CONTEXT.md vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If `CONTEXT.md` defines "Order," talk about "the Order intake module" — not "the FooBarHandler," and not "the Order service."
+
+**ADR conflicts**: if a candidate contradicts an existing ADR, only surface it when the friction is real enough to warrant revisiting the ADR. Mark it clearly in the card (e.g. a warning callout: _"contradicts ADR-0007 — but worth reopening because…"_). Don't list every theoretical refactor an ADR forbids.
+
+See [HTML-REPORT.md](HTML-REPORT.md) for the full HTML scaffold, diagram patterns, and styling guidance.
+
+Do NOT propose interfaces yet. After the file is written, ask the user: "Which of these would you like to explore?"
+
+### 3. Grilling loop
+
+Once the user picks a candidate, drop into a grilling conversation. Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive.
+
+Side effects happen inline as decisions crystallize:
+
+- **Naming a deepened module after a concept not in `CONTEXT.md`?** Add the term to `CONTEXT.md` — same discipline as `/grill-with-docs` (see [CONTEXT-FORMAT.md](../grill-with-docs/CONTEXT-FORMAT.md)). Create the file lazily if it doesn't exist.
+- **Sharpening a fuzzy term during the conversation?** Update `CONTEXT.md` right there.
+- **User rejects the candidate with a load-bearing reason?** Offer an ADR, framed as: _"Want me to record this as an ADR so future architecture reviews don't re-suggest it?"_ Only offer when the reason would actually be needed by a future explorer to avoid re-suggesting the same thing — skip ephemeral reasons ("not worth it right now") and self-evident ones. See [ADR-FORMAT.md](../grill-with-docs/ADR-FORMAT.md).
+- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md).
--- a/scripts/workstation/claude-skills/prototype/LOGIC.md
+++ b/scripts/workstation/claude-skills/prototype/LOGIC.md
@ -0,0 +1,79 @@
+# Logic Prototype
+
+A tiny interactive terminal app that lets the user drive a state model by hand. Use this when the question is about **business logic, state transitions, or data shape** — the kind of thing that looks reasonable on paper but only feels wrong once you push it through real cases.
+
+## When this is the right shape
+
+- "I'm not sure if this state machine handles the edge case where X then Y."
+- "Does this data model actually let me represent the case where..."
+- "I want to feel out what the API should look like before writing it."
+- Anything where the user wants to **press buttons and watch state change**.
+
+If the question is "what should this look like" — wrong branch. Use [UI.md](UI.md).
+
+## Process
+
+### 1. State the question
+
+Before writing code, write down what state model and what question you're prototyping. One paragraph, in the prototype's README or a comment at the top of the file. A logic prototype that answers the wrong question is pure waste — make the question explicit so it can be checked later, whether the user is watching now or returning to it AFK.
+
+### 2. Pick the language
+
+Use whatever the host project uses. If the project has no obvious runtime (e.g. a docs repo), ask.
+
+Match the project's existing conventions for tooling — don't add a new package manager or runtime just for the prototype.
+
+### 3. Isolate the logic in a portable module
+
+Put the actual logic — the bit that's answering the question — behind a small, pure interface that could be lifted out and dropped into the real codebase later. The TUI around it is throwaway; the logic module shouldn't be.
+
+The right shape depends on the question:
+
+- **A pure reducer** — `(state, action) => state`. Good when actions are discrete events and state is a single value.
+- **A state machine** — explicit states and transitions. Good when "which actions are even legal right now" is part of the question.
+- **A small set of pure functions** over a plain data type. Good when there's no implicit current state — just transformations.
+- **A class or module with a clear method surface** when the logic genuinely owns ongoing internal state.
+
+Pick whichever shape best fits the question being asked, *not* whichever is easiest to wire to a TUI. Keep it pure: no I/O, no terminal code, no `console.log` for control flow. The TUI imports it and calls into it; nothing flows the other direction.
+
+This is what makes the prototype useful past its own lifetime. When the question's been answered, the validated reducer / machine / function set can be lifted into the real module — the TUI shell gets deleted.
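+
+A minimal sketch of the pure-reducer shape in TypeScript. The `State` and `Action` types are hypothetical, chosen only to illustrate the pattern; a real prototype would model whatever the question is about.
+
+```typescript
+// Hypothetical state model, used only for illustration.
+interface State {
+  users: string[];
+  clock: number;
+}
+
+type Action =
+  | { type: "addUser"; name: string }
+  | { type: "deleteUser"; name: string }
+  | { type: "tick" };
+
+const initialState: State = { users: [], clock: 0 };
+
+// Pure reducer: no I/O, no terminal code, and it never mutates its input.
+function reduce(state: State, action: Action): State {
+  switch (action.type) {
+    case "addUser":
+      // Adding an existing user is treated as a no-op (an assumption).
+      return state.users.includes(action.name)
+        ? state
+        : { ...state, users: [...state.users, action.name] };
+    case "deleteUser":
+      return { ...state, users: state.users.filter((user) => user !== action.name) };
+    case "tick":
+      return { ...state, clock: state.clock + 1 };
+  }
+}
+```
+
+Because `reduce` returns a new value rather than editing its input, the TUI shell can keep the current state in one variable and reassign it after each keystroke.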
+
+### 4. Build the smallest TUI that exposes the state
+
+Build it as a **lightweight TUI** — on every tick, clear the screen (`console.clear()` / `print("\033[2J\033[H")` / equivalent) and re-render the whole frame. The user should always see one stable view, not an ever-growing scrollback.
+
+Each frame has two parts, in this order:
+
+1. **Current state**, pretty-printed and diff-friendly (one field per line, or formatted JSON). Use **bold** for field names or section headers and **dim** for less important context (timestamps, IDs, derived values). Native ANSI escape codes are fine — `\x1b[1m` bold, `\x1b[2m` dim, `\x1b[0m` reset. No need to pull in a styling library unless one is already in the project.
+2. **Keyboard shortcuts**, listed at the bottom: `[a] add user  [d] delete user  [t] tick clock  [q] quit`. Bold the key, dim the description, or vice-versa — whatever reads cleanly.
+
+Behaviour:
+
+1. **Initialise state** — a single in-memory object/struct. Render the first frame on start.
+2. **Read one keystroke (or one line)** at a time, dispatch to a handler that updates state.
+3. **Re-render** the full frame after every action — don't append, replace.
+4. **Loop until quit.**
+
+The whole frame should fit on one screen.
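+
+The frame layout above might be sketched as a function that builds the whole frame as one string. `renderFrame` and its parameters are hypothetical; the escape codes are the standard ANSI sequences mentioned above.
+
+```typescript
+const BOLD = "\x1b[1m";
+const DIM = "\x1b[2m";
+const RESET = "\x1b[0m";
+const CLEAR = "\x1b[2J\x1b[H"; // clear the screen, then move the cursor home
+
+// Builds one full frame: current state first, keyboard shortcuts at the bottom.
+function renderFrame(
+  state: Record<string, unknown>,
+  shortcuts: Record<string, string>,
+): string {
+  // One field per line, field name in bold, value as JSON.
+  const stateLines = Object.keys(state).map(
+    (field) => `${BOLD}${field}${RESET}: ${JSON.stringify(state[field])}`,
+  );
+  // Shortcuts on a single line: bold key, dim description.
+  const shortcutLine = Object.keys(shortcuts)
+    .map((key) => `${BOLD}[${key}]${RESET} ${DIM}${shortcuts[key]}${RESET}`)
+    .join("  ");
+  return CLEAR + [...stateLines, "", shortcutLine].join("\n");
+}
+```
+
+In Node, the loop could call `process.stdout.write(renderFrame(state, shortcuts))` after every action, which replaces the previous frame instead of appending to it.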
+
+### 5. Make it runnable in one command
+
+Add a script to the project's existing task runner (`package.json` scripts, `Makefile`, `justfile`, `pyproject.toml`). The user should be able to run `pnpm run <prototype-name>` or equivalent and never need to remember a path.
+
+If the host project has no task runner, just put the command at the top of the prototype's README.
+
+### 6. Hand it over
+
+Give the user the run command. They'll drive it themselves; the interesting moments are when they say "wait, that shouldn't be possible" or "huh, I assumed X would be different" — those are the bugs in the _idea_, which is the whole point. If they want new actions added, add them. Prototypes evolve.
+
+### 7. Capture the answer
+
+When the prototype has done its job, the answer to the question is the only thing worth keeping. If the user is around, ask what it taught them. If not, leave a `NOTES.md` next to the prototype so the user can fill in the answer (or fill it in yourself, if you've watched the session) before the prototype gets deleted.
+
+## Anti-patterns
+
+- **Don't add tests.** A prototype that needs tests is no longer a prototype.
+- **Don't wire it to the real database.** Use an in-memory store unless the question is specifically about persistence.
+- **Don't generalise.** No "what if we wanted to support X later." The prototype answers one question.
+- **Don't blur the logic and the TUI together.** If the reducer / state machine references `console.log`, prompts, or terminal escape codes, it's no longer portable. Keep the TUI as a thin shell over a pure module.
+- **Don't ship the TUI shell into production.** The shell is optimised for being driven by hand from a terminal. The logic module behind it is the bit worth keeping.
--- a/scripts/workstation/claude-skills/prototype/SKILL.md
+++ b/scripts/workstation/claude-skills/prototype/SKILL.md
@ -0,0 +1,30 @@
+---
+name: prototype
+description: Build a throwaway prototype to flesh out a design before committing to it. Routes between two branches — a runnable terminal app for state/business-logic questions, or several radically different UI variations toggleable from one route. Use when the user wants to prototype, sanity-check a data model or state machine, mock up a UI, explore design options, or says "prototype this", "let me play with it", "try a few designs".
+---
+
+# Prototype
+
+A prototype is **throwaway code that answers a question**. The question decides the shape.
+
+## Pick a branch
+
+Identify which question is being answered — from the user's prompt, the surrounding code, or by asking if the user is around:
+
+- **"Does this logic / state model feel right?"** → [LOGIC.md](LOGIC.md). Build a tiny interactive terminal app that pushes the state machine through cases that are hard to reason about on paper.
+- **"What should this look like?"** → [UI.md](UI.md). Generate several radically different UI variations on a single route, switchable via a URL search param and a floating bottom bar.
+
+The two branches produce very different artifacts — getting this wrong wastes the whole prototype. If the question is genuinely ambiguous and the user isn't reachable, default to whichever branch better matches the surrounding code (a backend module → logic; a page or component → UI) and state the assumption at the top of the prototype.
+
+## Rules that apply to both
+
+1. **Throwaway from day one, and clearly marked as such.** Locate the prototype code close to where it will actually be used (next to the module or page it's prototyping for) so context is obvious — but name it so a casual reader can see it's a prototype, not production. For throwaway UI routes, obey whatever routing convention the project already uses; don't invent a new top-level structure.
+2. **One command to run.** Whatever the project's existing task runner supports — `pnpm <name>`, `python <path>`, `bun <path>`, etc. The user must be able to start it without thinking.
+3. **No persistence by default.** State lives in memory. Persistence is the thing the prototype is _checking_, not something it should depend on. If the question explicitly involves a database, hit a scratch DB or a local file with a clear "PROTOTYPE — wipe me" name.
+4. **Skip the polish.** No tests, no error handling beyond what makes the prototype _runnable_, no abstractions. The point is to learn something fast and then delete it.
+5. **Surface the state.** After every action (logic) or on every variant switch (UI), print or render the full relevant state so the user can see what changed.
+6. **Delete or absorb when done.** When the prototype has answered its question, either delete it or fold the validated decision into the real code — don't leave it rotting in the repo.
+
+## When done
+
+The _answer_ is the only thing worth keeping from a prototype. Capture it somewhere durable (commit message, ADR, issue, or a `NOTES.md` next to the prototype) along with the question it was answering. If the user is around, that capture is a quick conversation; if not, leave the placeholder so they (or you, on the next pass) can fill in the verdict before deleting the prototype.
--- a/scripts/workstation/claude-skills/prototype/UI.md
+++ b/scripts/workstation/claude-skills/prototype/UI.md
@ -0,0 +1,112 @@
+# UI Prototype
+
+Generate **several radically different UI variations** on a single route, switchable from a floating bottom bar. The user flips between variants in the browser, picks one (or steals bits from each), then throws the rest away.
+
+If the question is about logic/state rather than what something looks like — wrong branch. Use [LOGIC.md](LOGIC.md).
+
+## When this is the right shape
+
+- "What should this page look like?"
+- "I want to see a few options for this dashboard before committing."
+- "Try a different layout for the settings screen."
+- Any time the user would otherwise spend a day picking between three vague mockups in their head.
+
+## Two sub-shapes — strongly prefer sub-shape A
+
+A UI prototype is much easier to judge when it's **butting up against the rest of the app** — real header, real sidebar, real data, real density. A throwaway route on its own is a vacuum: every variant looks fine in isolation. Default to sub-shape A whenever there's a plausible existing page to host the variants. Only reach for sub-shape B if the prototype genuinely has no nearby home.
+
+### Sub-shape A — adjustment to an existing page (preferred)
+
+The route already exists. Variants are rendered **on the same route**, gated by a `?variant=` URL search param. The existing data fetching, params, and auth all stay — only the rendering swaps. This is the default; pick it unless there's a specific reason not to.
+
+If the prototype is for something that doesn't yet have a page but *would naturally live inside one* (a new section of the dashboard, a new card on the settings screen, a new step in an existing flow) — that's still sub-shape A. Mount the variants inside the host page.
+
+### Sub-shape B — a new page (last resort)
+
+Only use this when the thing being prototyped genuinely has no existing page to live inside — e.g. an entirely new top-level surface, or a flow that can't be embedded anywhere sensible.
+
+Create a **throwaway route** following whatever routing convention the project already uses — don't invent a new top-level structure. Name it so it's obviously a prototype (e.g. include the word `prototype` in the path or filename). Same `?variant=` pattern.
+
+Before committing to sub-shape B, sanity-check: is there really no existing page this could be embedded in? An empty route hides design problems that a populated one would expose.
+
+In both sub-shapes the floating bottom bar is identical.
+
+## Process
+
+### 1. State the question and pick N
+
+Default to **3 variants**. More than 5 stops being radically different and starts being noise — cap there.
+
+Write down the plan in one line, in the prototype's location or a top-of-file comment:
+
+> "Three variants of the settings page, switchable via `?variant=`, on the existing `/settings` route."
+
+This works whether the user is here to push back or not.
+
+### 2. Generate radically different variants
+
+Draft each variant. Hold each one to:
+
+- The page's purpose and the data it has access to.
+- The project's component library / styling system (TailwindCSS, shadcn, MUI, plain CSS, whatever).
+- A clear exported component name, e.g. `VariantA`, `VariantB`, `VariantC`.
+
+Variants must be **structurally different** — different layout, different information hierarchy, different primary affordance, not just different colours. Three slightly-tweaked card grids isn't a UI prototype, it's wallpaper. If two drafts come out too similar, redo one with explicit "do not use a card grid" guidance.
+
+### 3. Wire them together
+
+Create a single switcher component on the route:
+
+```tsx
+// pseudo-code — adapt to the project's framework
+const variant = searchParams.get('variant') ?? 'A';
+return (
+  <>
+    {variant === 'A' && <VariantA {...data} />}
+    {variant === 'B' && <VariantB {...data} />}
+    {variant === 'C' && <VariantC {...data} />}
+    <PrototypeSwitcher variants={['A','B','C']} current={variant} />
+  </>
+);
+```
+
+For sub-shape A (existing page): keep all the existing data fetching above the switcher; only the rendered subtree changes per variant.
+
+For sub-shape B (new page): the throwaway route (for example `/prototype/<name>`, or wherever the project's routing convention puts it) mounts the same switcher.
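+
+The `?variant=` lookup above renders nothing when the URL carries an unknown or stale key. A minimal sketch of a fallback, assuming a hypothetical `resolveVariant` helper and a `searchParams` object that exposes `get` (the shape `URLSearchParams` has); neither name comes from an existing project:
+
+```typescript
+// Hypothetical helper: names and shapes are illustrative, not from a real codebase.
+type ParamReader = { get(name: string): string | null };
+
+function resolveVariant<K extends string>(
+  searchParams: ParamReader,
+  variants: readonly K[],
+): K {
+  const [first] = variants;
+  if (first === undefined) throw new Error("at least one variant is required");
+  const requested = searchParams.get("variant");
+  // A missing or unknown key falls back to the first variant.
+  const match = variants.find((key) => key === requested);
+  return match ?? first;
+}
+```
+
+With this in place, `resolveVariant(searchParams, ['A', 'B', 'C'])` could replace the `?? 'A'` default, so a mistyped URL still shows something to judge.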
+
+### 4. Build the floating switcher
+
+A small fixed-position bar at the bottom-centre of the screen with three pieces:
+
+- **Left arrow** — cycles to the previous variant (wraps around).
+- **Variant label** — shows the current variant key and, if the variant exports a name, that name too. e.g. `B — Sidebar layout`.
+- **Right arrow** — cycles forward (wraps around).
+
+Behaviour:
+
+- Clicking an arrow updates the URL search param through the framework's router (`router.replace` on Next.js, `navigate` on React Router, etc.) so the variant is shareable and reload-stable.
+- Keyboard: `←` and `→` arrow keys also cycle. Don't intercept arrow keys when an `<input>`, `<textarea>`, or `[contenteditable]` is focused.
+- Visually distinct from the page (e.g. high-contrast pill, subtle shadow) so it's obviously not part of the design being evaluated.
+- Hidden in production builds — gate on `process.env.NODE_ENV !== 'production'` or an equivalent check, so a stray prototype merge can't ship the bar to users.
+
+Put the switcher in a single shared component so both sub-shapes can reuse it. Locate it wherever shared UI lives in the project.
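+
+The wrap-around cycling and the focus guard described above can be sketched as two framework-free functions. This is an illustration under assumptions: `cycleVariant`, `shouldIgnoreArrowKey`, and the `FocusTarget` shape are hypothetical names, and the real switcher would still need to call the project's router with the returned key.
+
+```typescript
+// Illustrative sketch: all names here are hypothetical.
+type Direction = "prev" | "next";
+
+// Move one step through the variant list, wrapping at both ends.
+function cycleVariant(
+  variants: readonly string[],
+  current: string,
+  direction: Direction,
+): string {
+  if (variants.length === 0) throw new Error("at least one variant is required");
+  const index = Math.max(variants.indexOf(current), 0); // unknown key: treat as first
+  const step = direction === "next" ? 1 : -1;
+  const nextIndex = (index + step + variants.length) % variants.length;
+  return variants[nextIndex] as string;
+}
+
+// Minimal description of the focused element, so the check runs without a DOM.
+type FocusTarget = { tagName: string; isContentEditable: boolean };
+
+// Arrow keys should not switch variants while the user is typing.
+function shouldIgnoreArrowKey(target: FocusTarget | null): boolean {
+  if (target === null) return false;
+  const tag = target.tagName.toUpperCase();
+  return tag === "INPUT" || tag === "TEXTAREA" || target.isContentEditable;
+}
+```
+
+Keeping both functions free of router and DOM calls means the same logic can back the arrow buttons and the keyboard handler.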
+
+### 5. Hand it over
+
+Surface the URL (and the `?variant=` keys). The user will flip through whenever they get to it. The interesting feedback is usually **"I want the header from B with the sidebar from C"** — that's the actual design they want.
+
+### 6. Capture the answer and clean up
+
+Once a variant has won, write down which one and why (commit message, ADR, issue, or a `NOTES.md` next to the prototype if running AFK and the user hasn't responded yet). Then:
+
+- **Sub-shape A** — delete the losing variants and the switcher; fold the winner into the existing page.
+- **Sub-shape B** — promote the winning variant to a real route, delete the throwaway route and the switcher.
+
+Don't leave variant components or the switcher lying around. They rot fast and confuse the next reader.
+
+## Anti-patterns
+
+- **Variants that differ only in colour or copy.** That's a tweak, not a prototype. Real variants disagree about structure.
+- **Sharing too much code between variants.** A shared `<Header>` is fine; a shared `<Layout>` defeats the point. Each variant should be free to throw out the layout.
+- **Wiring variants to real mutations.** Read-only prototypes are fine. If a variant needs to mutate, point it at a stub — the question is "what should this look like", not "does the backend work".
+- **Promoting the prototype directly to production.** The variant code was written under prototype constraints (no tests, minimal error handling). Rewrite it properly when you fold it in.
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/SKILL.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/SKILL.md
@ -0,0 +1,121 @@
+---
+name: setup-matt-pocock-skills
+description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
+disable-model-invocation: true
+---
+
+# Setup Matt Pocock's Skills
+
+Scaffold the per-repo configuration that the engineering skills assume:
+
+- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
+- **Triage labels** — the strings used for the five canonical triage roles
+- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
+
+This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
+
+## Process
+
+### 1. Explore
+
+Look at the current repo to understand its starting state. Read whatever exists; don't assume:
+
+- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
+- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
+- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
+- `docs/adr/` and any `src/*/docs/adr/` directories
+- `docs/agents/` — does this skill's prior output already exist?
+- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
+
+### 2. Present findings and ask
+
+Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
+
+Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
+
+**Section A — Issue tracker.**
+
+> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
+
+Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
+
+- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
+- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
+- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
+- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
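+
+The default posture above can be sketched as a small heuristic over the remote URLs. This is only an illustration: `proposeTracker` and `extractHost` are hypothetical names, and matching on the substring `gitlab` is an assumption that will miss self-hosted GitLab instances on other hostnames, which is one reason the result is a proposal to confirm rather than a decision.
+
+```typescript
+// Hypothetical helpers; the hostname checks are a heuristic, not an exhaustive rule.
+type TrackerProposal = "github" | "gitlab" | "ask";
+
+// Pull the host out of SSH-style (git@host:owner/repo) and URL-style remotes.
+function extractHost(remote: string): string {
+  const sshMatch = /^[^@\s]+@([^:\s]+):/.exec(remote);
+  if (sshMatch) return (sshMatch[1] ?? "").toLowerCase();
+  const urlMatch = /^[a-z+]+:\/\/(?:[^@\/\s]+@)?([^:\/\s]+)/i.exec(remote);
+  return urlMatch ? (urlMatch[1] ?? "").toLowerCase() : "";
+}
+
+function proposeTracker(remoteUrls: readonly string[]): TrackerProposal {
+  const hosts = remoteUrls.map(extractHost);
+  if (hosts.some((host) => host === "github.com")) return "github";
+  // Self-hosted GitLab can use any hostname; this only catches hosts containing "gitlab".
+  if (hosts.some((host) => host.includes("gitlab"))) return "gitlab";
+  return "ask"; // no remote, or an unrecognised host: offer the full list
+}
+```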
+
+**Section B — Triage label vocabulary.**
+
+> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
+
+The five canonical roles:
+
+- `needs-triage` — maintainer needs to evaluate
+- `needs-info` — waiting on reporter
+- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
+- `ready-for-human` — needs human implementation
+- `wontfix` — will not be actioned
+
+Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
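+
+The default-plus-override rule can be sketched as a lookup table. The names `TRIAGE_ROLES` and `buildLabelMap` are hypothetical; only the five role strings and the `bug:triage` example come from the text above.
+
+```typescript
+// Hypothetical sketch of the role-to-label mapping.
+const TRIAGE_ROLES = [
+  "needs-triage",
+  "needs-info",
+  "ready-for-agent",
+  "ready-for-human",
+  "wontfix",
+] as const;
+type TriageRole = (typeof TRIAGE_ROLES)[number];
+
+function buildLabelMap(
+  overrides: Partial<Record<TriageRole, string>> = {},
+): Record<TriageRole, string> {
+  const map = {} as Record<TriageRole, string>;
+  for (const role of TRIAGE_ROLES) {
+    map[role] = overrides[role] ?? role; // default: the label string equals the role name
+  }
+  return map;
+}
+```
+
+For a repo that already uses `bug:triage`, `buildLabelMap({ "needs-triage": "bug:triage" })` would remap that one role and leave the other four at their defaults.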
+
+**Section C — Domain docs.**
+
+> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
+
+Confirm the layout:
+
+- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
+- **Multi-context** — `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
+
+### 3. Confirm and edit
+
+Show the user a draft of:
+
+- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
+- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
+
+Let them edit before writing.
+
+### 4. Write
+
+**Pick the file to edit:**
+
+- If `CLAUDE.md` exists, edit it.
+- Else if `AGENTS.md` exists, edit it.
+- If neither exists, ask the user which one to create — don't pick for them.
+
+Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
+
+If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
+
+The block:
+
+```markdown
+## Agent skills
+
+### Issue tracker
+
+[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
+
+### Triage labels
+
+[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
+
+### Domain docs
+
+[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
+```
+
+Then write the three docs files using the seed templates in this skill folder as a starting point:
+
+- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
+- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
+- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
+- [triage-labels.md](./triage-labels.md) — label mapping
+- [domain.md](./domain.md) — domain doc consumer rules + layout
+
+For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
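+
+The file-selection and in-place update rules at the start of this step can be sketched as follows. Both functions are hypothetical illustrations, and the section detection is a simplification: a `## ` line inside a fenced code block would be misread as a heading.
+
+```typescript
+// Hypothetical helpers sketching the selection and in-place update rules.
+type TargetFile = "CLAUDE.md" | "AGENTS.md" | "ask-user";
+
+function pickTargetFile(hasClaude: boolean, hasAgents: boolean): TargetFile {
+  if (hasClaude) return "CLAUDE.md";
+  if (hasAgents) return "AGENTS.md";
+  return "ask-user"; // neither exists: the user chooses which one to create
+}
+
+// Replace an existing `## Agent skills` section, or append one if none exists.
+function upsertAgentSkills(doc: string, block: string): string {
+  const lines = doc.split("\n");
+  const blockLines = block.trimEnd().split("\n");
+  const start = lines.findIndex((line) => line.trim() === "## Agent skills");
+  if (start === -1) {
+    const base = doc.trimEnd();
+    return (base === "" ? "" : base + "\n\n") + blockLines.join("\n") + "\n";
+  }
+  // The section runs until the next level-2 heading or the end of the file.
+  let end = lines.findIndex((line, i) => i > start && line.startsWith("## "));
+  if (end === -1) end = lines.length;
+  const tail = lines.slice(end);
+  const middle = tail.length > 0 ? [...blockLines, ""] : blockLines;
+  return [...lines.slice(0, start), ...middle, ...tail].join("\n").trimEnd() + "\n";
+}
+```
+
+Running `upsertAgentSkills` twice with the same block leaves the file unchanged after the first run, which is the property the no-duplicates rule asks for.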
+
+### 5. Done
+
+Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/domain.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/domain.md
@ -0,0 +1,51 @@
+# Domain Docs
+
+How the engineering skills should consume this repo's domain documentation when exploring the codebase.
+
+## Before exploring, read these
+
+- **`CONTEXT.md`** at the repo root, or
+- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
+- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
+
+If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
+
+## File structure
+
+Single-context repo (most repos):
+
+```
+/
+├── CONTEXT.md
+├── docs/adr/
+│   ├── 0001-event-sourced-orders.md
+│   └── 0002-postgres-for-write-model.md
+└── src/
+```
+
+Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
+
+```
+/
+├── CONTEXT-MAP.md
+├── docs/adr/                          ← system-wide decisions
+└── src/
+    ├── ordering/
+    │   ├── CONTEXT.md
+    │   └── docs/adr/                  ← context-specific decisions
+    └── billing/
+        ├── CONTEXT.md
+        └── docs/adr/
+```
+
+## Use the glossary's vocabulary
+
+When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
+
+If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
+
+## Flag ADR conflicts
+
+If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
+
+> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-github.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-github.md
@ -0,0 +1,22 @@
+# Issue tracker: GitHub
+
+Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
+
+## Conventions
+
+- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
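+- **Read an issue**: `gh issue view <number> --comments`. When you need to filter comments or read labels, request structured output with `gh issue view <number> --json title,body,labels,comments` and a `--jq` expression.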
+- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
+- **Comment on an issue**: `gh issue comment <number> --body "..."`
+- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
+- **Close**: `gh issue close <number> --comment "..."`
+
+Infer the repo from `git remote -v` — `gh` does this automatically when run inside a clone.
+
+## When a skill says "publish to the issue tracker"
+
+Create a GitHub issue.
+
+## When a skill says "fetch the relevant ticket"
+
+Run `gh issue view <number> --comments`.
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-gitlab.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-gitlab.md
@ -0,0 +1,23 @@
+# Issue tracker: GitLab
+
+Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
+
+## Conventions
+
+- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
+- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
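+- **List issues**: `glab issue list --output json` with appropriate `--label` filters. On `glab issue list`, `-F` is the short form of `--output-format` (`details`, `ids`, `urls`), not of `--output`.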
+- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
+- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. To pass multiple labels, separate them with commas or repeat the flag.
+- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
+- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
+
+Infer the repo from `git remote -v` — `glab` does this automatically when run inside a clone.
+
+## When a skill says "publish to the issue tracker"
+
+Create a GitLab issue.
+
+## When a skill says "fetch the relevant ticket"
+
+Run `glab issue view <number> --comments`.
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-local.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/issue-tracker-local.md
@ -0,0 +1,19 @@
+# Issue tracker: Local Markdown
+
+Issues and PRDs for this repo live as markdown files in `.scratch/`.
+
+## Conventions
+
+- One feature per directory: `.scratch/<feature-slug>/`
+- The PRD is `.scratch/<feature-slug>/PRD.md`
+- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
+- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
+- Comments and conversation history append to the bottom of the file under a `## Comments` heading
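+
+The path conventions above can be sketched as two small builders. `slugify`, `issuePath`, and `prdPath` are hypothetical names, and the slug rule (lowercase, hyphen-separated) is an assumption; only the directory layout and the two-digit numbering come from the list.
+
+```typescript
+// Hypothetical helpers; only the path layout comes from the conventions above.
+function slugify(title: string): string {
+  return title
+    .toLowerCase()
+    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one hyphen
+    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
+}
+
+function prdPath(featureSlug: string): string {
+  return `.scratch/${featureSlug}/PRD.md`;
+}
+
+function issuePath(featureSlug: string, issueNumber: number, title: string): string {
+  const nn = String(issueNumber).padStart(2, "0"); // issues are numbered from 01
+  return `.scratch/${featureSlug}/issues/${nn}-${slugify(title)}.md`;
+}
+```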
+
+## When a skill says "publish to the issue tracker"
+
+Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
+
+## When a skill says "fetch the relevant ticket"
+
+Read the file at the referenced path. The user will normally pass the path or the issue number directly.
--- a/scripts/workstation/claude-skills/setup-matt-pocock-skills/triage-labels.md
+++ b/scripts/workstation/claude-skills/setup-matt-pocock-skills/triage-labels.md
@ -0,0 +1,15 @@
+# Triage Labels
+
+The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
+
+| Label in mattpocock/skills | Label in our tracker | Meaning                                  |
+| -------------------------- | -------------------- | ---------------------------------------- |
+| `needs-triage`             | `needs-triage`       | Maintainer needs to evaluate this issue  |
+| `needs-info`               | `needs-info`         | Waiting on reporter for more information |
+| `ready-for-agent`          | `ready-for-agent`    | Fully specified, ready for an AFK agent  |
+| `ready-for-human`          | `ready-for-human`    | Requires human implementation            |
+| `wontfix`                  | `wontfix`            | Will not be actioned                     |
+
+When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
+
+Edit the middle column (`Label in our tracker`) to match whatever vocabulary you actually use.
--- a/scripts/workstation/claude-skills/tdd/SKILL.md
+++ b/scripts/workstation/claude-skills/tdd/SKILL.md
@ -0,0 +1,109 @@
+---
+name: tdd
+description: Test-driven development with red-green-refactor loop. Use when user wants to build features or fix bugs using TDD, mentions "red-green-refactor", wants integration tests, or asks for test-first development.
+---
+
+# Test-Driven Development
+
+## Philosophy
+
+**Core principle**: Tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't.
+
+**Good tests** are integration-style: they exercise real code paths through public APIs. They describe _what_ the system does, not _how_ it does it. A good test reads like a specification - "user can checkout with valid cart" tells you exactly what capability exists. These tests survive refactors because they don't care about internal structure.
+
+**Bad tests** are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means (like querying a database directly instead of using the interface). The warning sign: your test breaks when you refactor, but behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.
+
+See [tests.md](tests.md) for examples and [mocking.md](mocking.md) for mocking guidelines.
+
+## Anti-Pattern: Horizontal Slices
+
+**DO NOT write all tests first, then all implementation.** This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."
+
+This produces **crap tests**:
+
+- Tests written in bulk test _imagined_ behavior, not _actual_ behavior
+- You end up testing the _shape_ of things (data structures, function signatures) rather than user-facing behavior
+- Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
+- You outrun your headlights, committing to test structure before understanding the implementation
+
+**Correct approach**: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
+
+```
+WRONG (horizontal):
+  RED:   test1, test2, test3, test4, test5
+  GREEN: impl1, impl2, impl3, impl4, impl5
+
+RIGHT (vertical):
+  RED→GREEN: test1→impl1
+  RED→GREEN: test2→impl2
+  RED→GREEN: test3→impl3
+  ...
+```
+
+## Workflow
+
+### 1. Planning
+
+When exploring the codebase, use the project's domain glossary so that test names and interface vocabulary match the project's language, and respect ADRs in the area you're touching.
+
+Before writing any code:
+
+- [ ] Confirm with user what interface changes are needed
+- [ ] Confirm with user which behaviors to test (prioritize)
+- [ ] Identify opportunities for [deep modules](deep-modules.md) (small interface, deep implementation)
+- [ ] Design interfaces for [testability](interface-design.md)
+- [ ] List the behaviors to test (not implementation steps)
+- [ ] Get user approval on the plan
+
+Ask: "What should the public interface look like? Which behaviors are most important to test?"
+
+**You can't test everything.** Confirm with the user exactly which behaviors matter most. Focus testing effort on critical paths and complex logic, not every possible edge case.
+
+### 2. Tracer Bullet
+
+Write ONE test that confirms ONE thing about the system:
+
+```
+RED:   Write test for first behavior → test fails
+GREEN: Write minimal code to pass → test passes
+```
+
+This is your tracer bullet: it proves the path works end-to-end.
+
+### 3. Incremental Loop
+
+For each remaining behavior:
+
+```
+RED:   Write next test → fails
+GREEN: Minimal code to pass → passes
+```
+
+Rules:
+
+- One test at a time
+- Only enough code to pass current test
+- Don't anticipate future tests
+- Keep tests focused on observable behavior
+
+### 4. Refactor
+
+After all tests pass, look for [refactor candidates](refactoring.md):
+
+- [ ] Extract duplication
+- [ ] Deepen modules (move complexity behind simple interfaces)
+- [ ] Apply SOLID principles where natural
+- [ ] Consider what new code reveals about existing code
+- [ ] Run tests after each refactor step
+
+**Never refactor while RED.** Get to GREEN first.
+
+## Checklist Per Cycle
+
+```
+[ ] Test describes behavior, not implementation
+[ ] Test uses public interface only
+[ ] Test would survive internal refactor
+[ ] Code is minimal for this test
+[ ] No speculative features added
+```
--- a/scripts/workstation/claude-skills/tdd/deep-modules.md
+++ b/scripts/workstation/claude-skills/tdd/deep-modules.md
@ -0,0 +1,33 @@
+# Deep Modules
+
+From John Ousterhout's "A Philosophy of Software Design":
+
+**Deep module** = small interface + lots of implementation
+
+```
+┌─────────────────────┐
+│   Small Interface   │  ← Few methods, simple params
+├─────────────────────┤
+│                     │
+│                     │
+│  Deep Implementation│  ← Complex logic hidden
+│                     │
+│                     │
+└─────────────────────┘
+```
+
+**Shallow module** = large interface + little implementation (avoid)
+
+```
+┌─────────────────────────────────┐
+│       Large Interface           │  ← Many methods, complex params
+├─────────────────────────────────┤
+│  Thin Implementation            │  ← Just passes through
+└─────────────────────────────────┘
+```
+
+When designing interfaces, ask:
+
+- Can I reduce the number of methods?
+- Can I simplify the parameters?
+- Can I hide more complexity inside?
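+
+As a rough illustration of the deep shape, the sketch below exposes one function with one parameter and keeps the unit table, validation, and summing inside. `parseDuration` and its accepted format are invented for this example; they are not taken from the book or from any library.
+
+```typescript
+// Hypothetical example of a deep module: a tiny interface over hidden parsing detail.
+function parseDuration(input: string): number {
+  const unitMs: Record<string, number> = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 };
+  const compact = input.replace(/\s+/g, "");
+  // Validate the whole string first so partial matches cannot slip through.
+  if (!/^(?:\d+(?:\.\d+)?(?:ms|s|m|h))+$/.test(compact)) {
+    throw new Error(`invalid duration: "${input}"`);
+  }
+  const part = /(\d+(?:\.\d+)?)(ms|s|m|h)/g;
+  let total = 0;
+  let match: RegExpExecArray | null;
+  while ((match = part.exec(compact)) !== null) {
+    total += Number(match[1]) * (unitMs[match[2] ?? ""] ?? 0);
+  }
+  return total; // milliseconds
+}
+```
+
+Callers only learn "string in, milliseconds out"; the supported units could change without touching the interface.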
--- a/scripts/workstation/claude-skills/tdd/interface-design.md
+++ b/scripts/workstation/claude-skills/tdd/interface-design.md
@ -0,0 +1,31 @@
+# Interface Design for Testability
+
+Good interfaces make testing natural:
+
+1. **Accept dependencies, don't create them**
+
+   ```typescript
+   // Testable
+   function processOrder(order, paymentGateway) {}
+
+   // Hard to test
+   function processOrder(order) {
+     const gateway = new StripeGateway();
+   }
+   ```
+
+2. **Return results, don't produce side effects**
+
+   ```typescript
+   // Testable
+   function calculateDiscount(cart): Discount {}
+
+   // Hard to test
+   function applyDiscount(cart): void {
+     cart.total -= discount;
+   }
+   ```
+
+3. **Small surface area**
+   - Fewer methods = fewer tests needed
+   - Fewer params = simpler test setup
--- a/scripts/workstation/claude-skills/tdd/mocking.md
+++ b/scripts/workstation/claude-skills/tdd/mocking.md
@ -0,0 +1,59 @@
+# When to Mock
+
+Mock at **system boundaries** only:
+
+- External APIs (payment, email, etc.)
+- Databases (sometimes - prefer test DB)
+- Time/randomness
+- File system (sometimes)
+
+Don't mock:
+
+- Your own classes/modules
+- Internal collaborators
+- Anything you control
+
+## Designing for Mockability
+
+At system boundaries, design interfaces that are easy to mock:
+
+**1. Use dependency injection**
+
+Pass external dependencies in rather than creating them internally:
+
+```typescript
+// Easy to mock
+function processPayment(order, paymentClient) {
+  return paymentClient.charge(order.total);
+}
+
+// Hard to mock
+function processPayment(order) {
+  const client = new StripeClient(process.env.STRIPE_KEY);
+  return client.charge(order.total);
+}
+```
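+
+The same injection pattern applies to time, which the list at the top of this file names as a system boundary. A minimal sketch, assuming hypothetical `Clock`, `Session`, and `isExpired` names:
+
+```typescript
+// Illustrative sketch: `Clock`, `Session`, and `isExpired` are hypothetical names.
+type Clock = () => Date;
+type Session = { expiresAt: Date };
+
+// Time is passed in rather than read via `new Date()` inside the function.
+function isExpired(session: Session, now: Clock = () => new Date()): boolean {
+  return now().getTime() >= session.expiresAt.getTime();
+}
+
+// In a test, only the boundary is replaced: a fixed clock, no internal mocks.
+const fixedClock: Clock = () => new Date("2024-01-01T00:00:00Z");
+const expired = isExpired({ expiresAt: new Date("2023-12-31T23:59:59Z") }, fixedClock);
+const active = isExpired({ expiresAt: new Date("2024-01-01T00:00:01Z") }, fixedClock);
+```
+
+The default parameter keeps production call sites unchanged while letting a test pin the current time.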
+
+**2. Prefer SDK-style interfaces over generic fetchers**
+
+Create specific functions for each external operation instead of one generic function with conditional logic:
+
+```typescript
+// GOOD: Each function is independently mockable
+const api = {
+  getUser: (id) => fetch(`/users/${id}`),
+  getOrders: (userId) => fetch(`/users/${userId}/orders`),
+  createOrder: (data) => fetch('/orders', { method: 'POST', body: JSON.stringify(data) }),
+};
+
+// BAD: Mocking requires conditional logic inside the mock
+const api = {
+  fetch: (endpoint, options) => fetch(endpoint, options),
+};
+```
+
+The SDK approach means:
+- Each mock returns one specific shape
+- No conditional logic in test setup
+- Easier to see which endpoints a test exercises
+- Type safety per endpoint
--- a/scripts/workstation/claude-skills/tdd/refactoring.md
+++ b/scripts/workstation/claude-skills/tdd/refactoring.md
@ -0,0 +1,10 @@
+# Refactor Candidates
+
+After a TDD cycle, look for:
+
+- **Duplication** → Extract function/class
+- **Long methods** → Break into private helpers (keep tests on public interface)
+- **Shallow modules** → Combine or deepen
+- **Feature envy** → Move logic to where data lives
+- **Primitive obsession** → Introduce value objects
+- **Existing code** the new code reveals as problematic
--- a/scripts/workstation/claude-skills/tdd/tests.md
+++ b/scripts/workstation/claude-skills/tdd/tests.md
@ -0,0 +1,61 @@
+# Good and Bad Tests
+
+## Good Tests
+
+**Integration-style**: Test through real interfaces, not mocks of internal parts.
+
+```typescript
+// GOOD: Tests observable behavior
+test("user can checkout with valid cart", async () => {
+  const cart = createCart();
+  cart.add(product);
+  const result = await checkout(cart, paymentMethod);
+  expect(result.status).toBe("confirmed");
+});
+```
+
+Characteristics:
+
+- Tests behavior users/callers care about
+- Uses public API only
+- Survives internal refactors
+- Describes WHAT, not HOW
+- One logical assertion per test
+
+## Bad Tests
+
+**Implementation-detail tests**: Coupled to internal structure.
+
+```typescript
+// BAD: Tests implementation details
+test("checkout calls paymentService.process", async () => {
+  const processSpy = jest.spyOn(paymentService, "process");
+  await checkout(cart, payment);
+  expect(processSpy).toHaveBeenCalledWith(cart.total);
+});
+```
+
+Red flags:
+
+- Mocking internal collaborators
+- Testing private methods
+- Asserting on call counts/order
+- Test breaks when refactoring without behavior change
+- Test name describes HOW not WHAT
+- Verifying through external means instead of interface
+
+```typescript
+// BAD: Bypasses interface to verify
+test("createUser saves to database", async () => {
+  await createUser({ name: "Alice" });
+  const row = await db.query("SELECT * FROM users WHERE name = ?", ["Alice"]);
+  expect(row).toBeDefined();
+});
+
+// GOOD: Verifies through interface
+test("createUser makes user retrievable", async () => {
+  const user = await createUser({ name: "Alice" });
+  const retrieved = await getUser(user.id);
+  expect(retrieved.name).toBe("Alice");
+});
+```
--- a/scripts/workstation/claude-skills/teach/GLOSSARY-FORMAT.md
+++ b/scripts/workstation/claude-skills/teach/GLOSSARY-FORMAT.md
@ -0,0 +1,35 @@
+# GLOSSARY.md Format
+
+`GLOSSARY.md` is the canonical language for this teaching workspace. All explainers, exercises, and learning records should adhere to its terminology. Building it is itself part of learning: compressing a concept into a tight definition is evidence the user understands it.
+
+## Structure
+
+```md
+# {Topic} Glossary
+
+{One or two sentence description of the topic this glossary covers.}
+
+## Terms
+
+**Hypertrophy**:
+Muscle growth driven by mechanical tension and metabolic stress over repeated training sessions.
+_Avoid_: Bulking, getting big
+
+**Progressive overload**:
+Systematically increasing the demand on a muscle over time — via load, volume, or intensity.
+_Avoid_: Pushing harder, levelling up
+
+**RPE (Rate of Perceived Exertion)**:
+A 1–10 self-rating of how hard a set felt, where 10 is failure and 8 means two reps left in the tank.
+_Avoid_: Effort score, intensity rating
+```
+
+## Rules
+
+- **Add a term only when the user understands it.** The glossary is a record of compressed knowledge, not a dictionary the user reads to learn. If the user has just been introduced to a concept, wait until they can use it correctly before promoting it here.
+- **Be opinionated.** When several words exist for the same concept, pick the best one and list the rest as aliases to avoid. This is how language compresses.
+- **Keep definitions tight.** One or two sentences. Define what the term IS, not what it does or how to do it.
+- **Use the glossary's own terms inside definitions.** Once a term is in the glossary, prefer it everywhere — including inside other definitions. This is what makes complex terms easier to grasp later.
+- **Group under subheadings** when natural clusters emerge (e.g. `## Anatomy`, `## Programming`). A flat list is fine when terms cohere.
+- **Flag ambiguities explicitly.** If a term is used loosely in the wider field, note the resolution: "In this workspace, 'set' always means a working set — warm-ups are tracked separately."
+- **Revise as understanding deepens.** A definition the user wrote in week one may be wrong by week six. Update in place; do not leave stale entries.
--- a/scripts/workstation/claude-skills/teach/LEARNING-RECORD-FORMAT.md
+++ b/scripts/workstation/claude-skills/teach/LEARNING-RECORD-FORMAT.md
@ -0,0 +1,46 @@
+# Learning Record Format
+
+Learning records live in `./learning-records/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc. Create the directory lazily — only when the first record is written.
+
+They are the teaching equivalent of ADRs: they capture non-obvious lessons, key insights, and stated prior knowledge that will steer future sessions. They are used to calculate the zone of proximal development.
+
+## Template
+
+```md
+# {Short title of what was learned or established}
+
+{1-3 sentences: what was learned (or what prior knowledge was established), and why it matters for future sessions.}
+```
+
+That is the whole format. A learning record can be a single paragraph. The value is recording _that_ this is now known and _why_ it changes what to teach next — not in filling out sections.
+
+## Optional sections
+
+Only include these when they add genuine value. Most records won't need them.
+
+- **Status** frontmatter (`active | superseded by LR-NNNN`) — useful when an earlier understanding turns out to be wrong and is replaced.
+- **Evidence** — how the user demonstrated the understanding (a question answered, an exercise completed, prior experience cited). Useful when the claim might be revisited.
+- **Implications** — what this unlocks or rules out for future sessions. Worth recording when non-obvious.
+
+## Numbering
+
+Scan `./learning-records/` for the highest existing number and increment by one.
+
+## When to write a learning record
+
+Write one when any of these is true:
+
+1. **The user demonstrated genuine understanding of something non-trivial** — not just exposure, but evidence they can use the concept correctly. This sets a new floor for what to teach next.
+2. **The user disclosed prior knowledge** — "I already know X." Record it so future sessions don't re-teach it. Also record the _depth_ claimed.
+3. **A misconception was corrected** — the user previously believed something wrong and now sees why. These are high-value: they predict future stumbling blocks for related topics.
+4. **The mission shifted in response to learning** — the user discovered they cared about something different than they thought. Cross-link to [[MISSION.md]] and update it.
+
+### What does _not_ qualify
+
+- Material that was merely covered. Coverage is not learning. Wait for evidence.
+- Anything already captured tersely in [[GLOSSARY.md]] as a term definition. Don't duplicate.
+- Session-by-session activity logs. Learning records are not a journal — they are decision-grade insights.
+
+## Supersession
+
+When a later record contradicts an earlier one (the user's understanding deepened or corrected), mark the old record `Status: superseded by LR-NNNN` rather than deleting it. The history of how understanding evolved is itself useful signal.
--- a/scripts/workstation/claude-skills/teach/MISSION-FORMAT.md
+++ b/scripts/workstation/claude-skills/teach/MISSION-FORMAT.md
@ -0,0 +1,31 @@
+# MISSION.md Format
+
+`MISSION.md` lives at the workspace root. It captures the _reason_ the user is learning this topic. Every teaching decision — what to teach next, which resources to surface, which exercises to design — should trace back to this document.
+
+## Template
+
+```md
+# Mission: {Topic}
+
+## Why
+{1-3 sentences. The concrete real-world goal the user is chasing. What changes in their life or work when they have this skill? Avoid abstract framings like "to understand X" — push for the underlying outcome.}
+
+## Success looks like
+- {A specific, observable thing the user will be able to do}
+- {Another specific thing}
+- {…}
+
+## Constraints
+- {Time, budget, prior commitments, learning preferences, anything that bounds the approach}
+
+## Out of scope
+- {Adjacent topics the user explicitly does not want to chase right now — protects the zone of proximal development}
+```
+
+## Rules
+
+- **One mission per workspace.** If the user wants to learn two unrelated things, that is two workspaces.
+- **Concrete over abstract.** "Run a half marathon by October" beats "get fitter." "Ship a Rust CLI to my team" beats "learn Rust."
+- **Push back on vagueness.** If the user cannot articulate why, interview them before writing anything. A bad mission is worse than no mission.
+- **Revise when reality shifts.** Missions change. When the user's goal moves, update this file — don't leave a stale mission steering future sessions.
+- **Keep it short.** If `MISSION.md` runs past a screen, it has stopped being a compass and started being a plan.
--- a/scripts/workstation/claude-skills/teach/RESOURCES-FORMAT.md
+++ b/scripts/workstation/claude-skills/teach/RESOURCES-FORMAT.md
@ -0,0 +1,32 @@
+# RESOURCES.md Format
+
+`RESOURCES.md` is the curated set of trusted sources for this topic. Knowledge for explainers should be drawn from here, not from parametric guesses. Wisdom comes from the communities listed here.
+
+## Structure
+
+```md
+# {Topic} Resources
+
+## Knowledge
+
+- [Book: _The Science and Practice of Strength Training_ — Zatsiorsky & Kraemer](https://example.com)
+  Foundational text on programming and adaptation. Use for: anything to do with periodisation, recovery, intensity zones.
+- [Article: "How Much Should I Train?" — Greg Nuckols (Stronger By Science)](https://example.com)
+  Evidence-based review of volume landmarks. Use for: weekly set targets per muscle group.
+
+## Wisdom (Communities)
+
+- [r/weightroom](https://reddit.com/r/weightroom)
+  High-signal subreddit, moderated against bro-science. Use for: programme critique, plateau troubleshooting.
+- Local: Tuesday strength class at {gym name}
+  Use for: real-time coaching feedback on lifts.
+```
+
+## Rules
+
+- **High-trust only.** Prefer primary sources, recognised experts, peer-reviewed work, and communities with strong moderation. If a resource is marketing dressed as education, leave it out.
+- **Annotate every entry.** A bare link is useless in three months. Add one line: what it covers and when to reach for it.
+- **Group by Knowledge / Wisdom.** Mirrors the philosophy in [SKILL.md](./SKILL.md). It is fine for a resource to appear in only one group.
+- **Surface gaps explicitly.** If no good resource exists for an area the mission needs, write a `## Gaps` section listing what is missing. This drives future search.
+- **Prune ruthlessly.** A resource that turned out to be wrong, shallow, or off-mission should be removed, not buried. Better five sharp sources than thirty mediocre ones.
+- **Record community preferences.** If the user has opted out of joining communities, note it here so future sessions don't keep proposing them.
--- a/scripts/workstation/claude-skills/teach/SKILL.md
+++ b/scripts/workstation/claude-skills/teach/SKILL.md
@ -0,0 +1,131 @@
+---
+name: teach
+description: Teach the user a new skill or concept, within this workspace.
+disable-model-invocation: true
+argument-hint: "What would you like to learn about?"
+---
+
+The user has asked you to teach them something. This is a stateful request - they intend to learn the topic over multiple sessions.
+
+## Teaching Workspace
+
+Treat the current directory as a teaching workspace. The state of their learning is captured in this directory in several files:
+
+- `MISSION.md`: A document capturing the _reason_ the user is interested in the topic. This should be used to ground all teaching. Use the format in [MISSION-FORMAT.md](./MISSION-FORMAT.md).
+- `./reference/*.html`: A directory of reference materials. These are the compressed learnings from the lessons - cheat sheets, reference algorithms, syntax, yoga poses, glossaries. They are the raw units of learning. They should be beautiful documents which print out well, and are designed for quick reference.
+- `RESOURCES.md`: A list of resources which can be explored to ground your teaching in contextual knowledge, or to acquire knowledge and wisdom. Use the format in [RESOURCES-FORMAT.md](./RESOURCES-FORMAT.md).
+- `./learning-records/*.md`: A directory of learning records, which capture what the user has learned. These are loosely equivalent to architectural decision records in software development - they capture non-obvious lessons and key insights that may need to be revised later, or drive future sessions. These should be used to calculate the zone of proximal development. They are titled `0001-<dash-case-name>.md`, where the number increments each time. Use the format in [LEARNING-RECORD-FORMAT.md](./LEARNING-RECORD-FORMAT.md).
+- `./lessons/*.html`: A directory of lessons. A **lesson** is a single, self-contained HTML output that teaches one tightly-scoped thing tied to the mission. This is the primary unit of teaching in this workspace.
+- `NOTES.md`: A scratchpad for you to jot down user preferences, or working notes.
+
+## Philosophy
+
+To learn at a deep level, the user needs three things:
+
+- **Knowledge**, captured from high-quality, high-trust resources
+- **Skills**, acquired through highly-relevant interactive lessons devised by you, based on the knowledge
+- **Wisdom**, which comes from interacting with other learners and practitioners
+
+Before the `RESOURCES.md` is well-populated, your focus should be to find high-quality resources which will help the user acquire knowledge. Never trust your parametric knowledge.
+
+Some topics may require more skills than knowledge. Learning more about theoretical physics might be more knowledge-based. For yoga, more skills-based.
+
+### Fluency vs Storage Strength
+
+You should be careful to split between two types of learning:
+
+- **Fluency strength**: in-the-moment retrieval of knowledge
+- **Storage strength**: long-term retention of knowledge
+
+Fluency can give the user an illusory sense of mastery, but storage strength is the real goal. Design lessons that build long-term retention through desirable difficulties:
+
+- Using retrieval practice (recall from memory)
+- Spacing (distributing practice over time)
+- Interleaving (mixing up different but related topics in practice - for skills practice only)
+
+## Lessons
+
+A lesson is the main thing you produce — the unit in which knowledge and skills reach the user. Each lesson is one self-contained HTML file, saved to `./lessons/` and titled `0001-<dash-case-name>.html` where the number increments each time.
+
+A lesson should be **beautiful** — clean, readable typography and layout — since the user will return to these later to review. Think Tufte.
+
+The lesson should be short, and completable very quickly. Learners' working memory is very small, and we need to stay within it. But each lesson should give the user a single tangible win that they can build on. It should be directly tied to the mission, and should be in the user's zone of proximal development.
+
+If possible, open the lesson file for the user by running a CLI command.
+
+Each lesson should link via HTML anchors to other lessons and reference documents.
+
+Each lesson should recommend a primary source for the user to read or watch. This should be the most high-quality, high-trust resource you found on the topic.
+
+Each lesson should contain a reminder to ask followup questions to the agent. The agent is their teacher, and can assist with anything that's unclear.
+
+## The Mission
+
+Every lesson should be tied into the mission - the reason that the user is interested in learning about the topic.
+
+If the user is unclear about the mission, or the `MISSION.md` is not populated, your first job should be to question the user on why they want to learn this.
+
+Failing to understand the mission will mean knowledge acquisition is not grounded in real-world goals. Lessons will feel too abstract. You will have no way of judging what the user should do next.
+
+Missions may change as the user develops more skills and knowledge. This is normal - make sure to update the `MISSION.md` and add a learning record to capture the change. Confirm with the user before changing the mission.
+
+## Zone Of Proximal Development
+
+In each lesson, the user should feel challenged 'just enough'.
+
+The user may specify an exact thing they want to learn. If they don't, figure out their zone of proximal development by:
+
+- Reading their `learning-records`
+- Figuring out the right thing to teach them based on their mission
+- Teaching the most relevant thing that fits in their zone of proximal development
+
+## Knowledge
+
+Lessons should be designed around a skill the user is going to learn. The knowledge in the lesson should be only what's required to acquire that skill. You teach the knowledge first, then get the user to practice the skills via an interactive feedback loop.
+
+Knowledge should first be gathered from trusted resources. Use `RESOURCES.md` to keep track of them. Lessons should be littered with citations - links to external resources to back up any claim made. This increases the trustworthiness of the lesson.
+
+For acquiring knowledge, difficulty is the enemy. It eats working memory you need for understanding.
+
+## Skills
+
+If knowledge is all about acquisition, skills are about durability and flexibility. Make the knowledge stick.
+
+For skill acquisition, difficulty is the tool. Effortful retrieval is what builds storage strength. Skills should be taught through interactive lessons. There are several tools at your disposal:
+
+- Interactive lessons, using quizzes and light in-browser tasks
+- Lessons which guide the user through a list of real-world steps to take (for instance, yoga poses)
+
+Each of these should be based on a **feedback loop**, where the user receives feedback on their performance. This feedback loop should be as tight as possible, giving feedback immediately - and ideally automatically.
+
+For quizzes, every answer option should have exactly the same number of words (and characters, if possible). Don't give the user any clues about the answer through formatting.
+
+## Acquiring Wisdom
+
+Wisdom comes from true real-world interaction - testing your skills outside the learning environment.
+
+When the user asks a question that appears to require wisdom, your default posture should be to attempt to answer - but to ultimately delegate to a **community**.
+
+A community is a place (online or offline) where the user can test their skills in the real world. This might be a forum, a subreddit, a real-world class (budget permitting) or a local interest group.
+
+You should attempt to find high-reputation communities the user can join. If the user expresses a preference that they don't want to join a community, respect it.
+
+## Reference Documents
+
+While creating lessons, you should also create reference documents. Lessons can reference these documents - they are useful for tracking raw units of knowledge useful across lessons.
+
+Lessons will rarely be revisited later - reference documents will be. They should be the compressed essence of the lesson, in a format designed for quick reference.
+
+Some learning topics lend themselves to reference:
+
+- Syntax and code snippets for programming
+- Algorithms and flowcharts for processes
+- Yoga poses and sequences for yoga
+- Exercises and routines for fitness
+- Glossaries for any topic with its own nomenclature
+
+Glossaries, in particular, are an essential reference. Once one is created, it should be adhered to in every lesson.
+
+## `NOTES.md`
+
+The user will sometimes express preferences of how they want to be taught, or things you should keep in mind. This is the place to record those preferences, so you can refer back to them when designing lessons or working with the user.
--- a/scripts/workstation/claude-skills/to-issues/SKILL.md
+++ b/scripts/workstation/claude-skills/to-issues/SKILL.md
@ -0,0 +1,83 @@
+---
+name: to-issues
+description: Break a plan, spec, or PRD into independently-grabbable issues on the project issue tracker using tracer-bullet vertical slices. Use when user wants to convert a plan into issues, create implementation tickets, or break down work into issues.
+---
+
+# To Issues
+
+Break a plan into independently-grabbable issues using vertical slices (tracer bullets).
+
+The issue tracker and triage label vocabulary should have been provided to you — run `/setup-matt-pocock-skills` if not.
+
+## Process
+
+### 1. Gather context
+
+Work from whatever is already in the conversation context. If the user passes an issue reference (issue number, URL, or path) as an argument, fetch it from the issue tracker and read its full body and comments.
+
+### 2. Explore the codebase (optional)
+
+If you have not already explored the codebase, do so to understand the current state of the code. Issue titles and descriptions should use the project's domain glossary vocabulary, and respect ADRs in the area you're touching.
+
+### 3. Draft vertical slices
+
+Break the plan into **tracer bullet** issues. Each issue is a thin vertical slice that cuts through ALL integration layers end-to-end, NOT a horizontal slice of one layer.
+
+Slices may be 'HITL' or 'AFK'. HITL slices require human interaction, such as an architectural decision or a design review. AFK slices can be implemented and merged without human interaction. Prefer AFK over HITL where possible.
+
+<vertical-slice-rules>
+- Each slice delivers a narrow but COMPLETE path through every layer (schema, API, UI, tests)
+- A completed slice is demoable or verifiable on its own
+- Prefer many thin slices over few thick ones
+</vertical-slice-rules>
+
+### 4. Quiz the user
+
+Present the proposed breakdown as a numbered list. For each slice, show:
+
+- **Title**: short descriptive name
+- **Type**: HITL / AFK
+- **Blocked by**: which other slices (if any) must complete first
+- **User stories covered**: which user stories this addresses (if the source material has them)
+
+Ask the user:
+
+- Does the granularity feel right? (too coarse / too fine)
+- Are the dependency relationships correct?
+- Should any slices be merged or split further?
+- Are the correct slices marked as HITL and AFK?
+
+Iterate until the user approves the breakdown.
+
+### 5. Publish the issues to the issue tracker
+
+For each approved slice, publish a new issue to the issue tracker. Use the issue body template below. These issues are considered ready for AFK agents, so publish them with the correct triage label unless instructed otherwise.
+
+Publish issues in dependency order (blockers first) so you can reference real issue identifiers in the "Blocked by" field.
+
+<issue-template>
+## Parent
+
+A reference to the parent issue on the issue tracker (if the source was an existing issue, otherwise omit this section).
+
+## What to build
+
+A concise description of this vertical slice. Describe the end-to-end behavior, not layer-by-layer implementation.
+
+Avoid specific file paths or code snippets — they go stale fast. Exception: if a prototype produced a snippet that encodes a decision more precisely than prose can (state machine, reducer, schema, type shape), inline it here and note briefly that it came from a prototype. Trim to the decision-rich parts — not a working demo, just the important bits.
+
+## Acceptance criteria
+
+- [ ] Criterion 1
+- [ ] Criterion 2
+- [ ] Criterion 3
+
+## Blocked by
+
+- A reference to the blocking ticket (if any)
+
+Or "None - can start immediately" if no blockers.
+
+</issue-template>
+
+Do NOT close or modify any parent issue.
--- a/scripts/workstation/claude-skills/to-prd/SKILL.md
+++ b/scripts/workstation/claude-skills/to-prd/SKILL.md
@ -0,0 +1,74 @@
+---
+name: to-prd
+description: Turn the current conversation context into a PRD and publish it to the project issue tracker. Use when user wants to create a PRD from the current context.
+---
+
+This skill takes the current conversation context and codebase understanding and produces a PRD. Do NOT interview the user — just synthesize what you already know.
+
+The issue tracker and triage label vocabulary should have been provided to you — run `/setup-matt-pocock-skills` if not.
+
+## Process
+
+1. Explore the repo to understand the current state of the codebase, if you haven't already. Use the project's domain glossary vocabulary throughout the PRD, and respect any ADRs in the area you're touching.
+
+2. Sketch out the seams at which you're going to test the feature. Existing seams should be preferred to new ones. Use the highest seam possible. If new seams are needed, propose them at the highest point you can.
+
+Check with the user that these seams match their expectations.
+
+3. Write the PRD using the template below, then publish it to the project issue tracker. Apply the `ready-for-agent` triage label - no need for additional triage.
+
+<prd-template>
+
+## Problem Statement
+
+The problem that the user is facing, from the user's perspective.
+
+## Solution
+
+The solution to the problem, from the user's perspective.
+
+## User Stories
+
+A LONG, numbered list of user stories. Each user story should be in the format of:
+
+1. As an <actor>, I want a <feature>, so that <benefit>
+
+<user-story-example>
+1. As a mobile bank customer, I want to see the balances on my accounts, so that I can make better-informed decisions about my spending
+</user-story-example>
+
+This list of user stories should be extremely extensive and cover all aspects of the feature.
+
+## Implementation Decisions
+
+A list of implementation decisions that were made. This can include:
+
+- The modules that will be built/modified
+- The interfaces of those modules that will be modified
+- Technical clarifications from the developer
+- Architectural decisions
+- Schema changes
+- API contracts
+- Specific interactions
+
+Do NOT include specific file paths or code snippets. They may end up being outdated very quickly.
+
+Exception: if a prototype produced a snippet that encodes a decision more precisely than prose can (state machine, reducer, schema, type shape), inline it within the relevant decision and note briefly that it came from a prototype. Trim to the decision-rich parts — not a working demo, just the important bits.
+
+## Testing Decisions
+
+A list of testing decisions that were made. Include:
+
+- A description of what makes a good test (only test external behavior, not implementation details)
+- Which modules will be tested
+- Prior art for the tests (i.e. similar types of tests in the codebase)
+
+## Out of Scope
+
+A description of the things that are out of scope for this PRD.
+
+## Further Notes
+
+Any further notes about the feature.
+
+</prd-template>
--- a/scripts/workstation/claude-skills/triage/AGENT-BRIEF.md
+++ b/scripts/workstation/claude-skills/triage/AGENT-BRIEF.md
@ -0,0 +1,168 @@
+# Writing Agent Briefs
+
+An agent brief is a structured comment posted on a GitHub issue when it moves to `ready-for-agent`. It is the authoritative specification that an AFK agent will work from. The original issue body and discussion are context — the agent brief is the contract.
+
+## Principles
+
+### Durability over precision
+
+The issue may sit in `ready-for-agent` for days or weeks. The codebase will change in the meantime. Write the brief so it stays useful even as files are renamed, moved, or refactored.
+
+- **Do** describe interfaces, types, and behavioral contracts
+- **Do** name specific types, function signatures, or config shapes that the agent should look for or modify
+- **Don't** reference file paths — they go stale
+- **Don't** reference line numbers
+- **Don't** assume the current implementation structure will remain the same
+
+### Behavioral, not procedural
+
+Describe **what** the system should do, not **how** to implement it. The agent will explore the codebase fresh and make its own implementation decisions.
+
+- **Good:** "The `SkillConfig` type should accept an optional `schedule` field of type `CronExpression`"
+- **Bad:** "Open src/types/skill.ts and add a schedule field on line 42"
+- **Good:** "When a user runs `/triage` with no arguments, they should see a summary of issues needing attention"
+- **Bad:** "Add a switch statement in the main handler function"
+
+### Complete acceptance criteria
+
+The agent needs to know when it's done. Every agent brief must have concrete, testable acceptance criteria. Each criterion should be independently verifiable.
+
+- **Good:** "Running `gh issue list --label needs-triage` returns issues that have been through initial classification"
+- **Bad:** "Triage should work correctly"
+
+### Explicit scope boundaries
+
+State what is out of scope. This prevents the agent from gold-plating or making assumptions about adjacent features.
+
+## Template
+
+```markdown
+## Agent Brief
+
+**Category:** bug / enhancement
+**Summary:** one-line description of what needs to happen
+
+**Current behavior:**
+Describe what happens now. For bugs, this is the broken behavior.
+For enhancements, this is the status quo the feature builds on.
+
+**Desired behavior:**
+Describe what should happen after the agent's work is complete.
+Be specific about edge cases and error conditions.
+
+**Key interfaces:**
+- `TypeName` — what needs to change and why
+- `functionName()` return type — what it currently returns vs what it should return
+- Config shape — any new configuration options needed
+
+**Acceptance criteria:**
+- [ ] Specific, testable criterion 1
+- [ ] Specific, testable criterion 2
+- [ ] Specific, testable criterion 3
+
+**Out of scope:**
+- Thing that should NOT be changed or addressed in this issue
+- Adjacent feature that might seem related but is separate
+```
+
+## Examples
+
+### Good agent brief (bug)
+
+```markdown
+## Agent Brief
+
+**Category:** bug
+**Summary:** Skill description truncation drops mid-word, producing broken output
+
+**Current behavior:**
+When a skill description exceeds 1024 characters, it is truncated at exactly
+1024 characters regardless of word boundaries. This produces descriptions
+that end mid-word (e.g. "Use when the user wants to confi").
+
+**Desired behavior:**
+Truncation should break at the last word boundary before 1024 characters
+and append "..." to indicate truncation.
+
+**Key interfaces:**
+- The `SkillMetadata` type's `description` field — no type change needed,
+  but the validation/processing logic that populates it needs to respect
+  word boundaries
+- Any function that reads SKILL.md frontmatter and extracts the description
+
+**Acceptance criteria:**
+- [ ] Descriptions under 1024 chars are unchanged
+- [ ] Descriptions over 1024 chars are truncated at the last word boundary
+      before 1024 chars
+- [ ] Truncated descriptions end with "..."
+- [ ] The total length including "..." does not exceed 1024 chars
+
+**Out of scope:**
+- Changing the 1024 char limit itself
+- Multi-line description support
+```
+
+### Good agent brief (enhancement)
+
+```markdown
+## Agent Brief
+
+**Category:** enhancement
+**Summary:** Add `.out-of-scope/` directory support for tracking rejected feature requests
+
+**Current behavior:**
+When a feature request is rejected, the issue is closed with a `wontfix` label
+and a comment. There is no persistent record of the decision or reasoning.
+Future similar requests require the maintainer to recall or search for the
+prior discussion.
+
+**Desired behavior:**
+Rejected feature requests should be documented in `.out-of-scope/<concept>.md`
+files that capture the decision, reasoning, and links to all issues that
+requested the feature. When triaging new issues, these files should be
+checked for matches.
+
+**Key interfaces:**
+- Markdown file format in `.out-of-scope/` — each file should have a
+  `# Concept Name` heading, a `**Decision:**` line, a `**Reason:**` line,
+  and a `**Prior requests:**` list with issue links
+- The triage workflow should read all `.out-of-scope/*.md` files early
+  and match incoming issues against them by concept similarity
+
+**Acceptance criteria:**
+- [ ] Closing a feature as wontfix creates/updates a file in `.out-of-scope/`
+- [ ] The file includes the decision, reasoning, and link to the closed issue
+- [ ] If a matching `.out-of-scope/` file already exists, the new issue is
+      appended to its "Prior requests" list rather than creating a duplicate
+- [ ] During triage, existing `.out-of-scope/` files are checked and surfaced
+      when a new issue matches a prior rejection
+
+**Out of scope:**
+- Automated matching (human confirms the match)
+- Reopening previously rejected features
+- Bug reports (only enhancement rejections go to `.out-of-scope/`)
+```
+
+### Bad agent brief
+
+```markdown
+## Agent Brief
+
+**Summary:** Fix the triage bug
+
+**What to do:**
+The triage thing is broken. Look at the main file and fix it.
+The function around line 150 has the issue.
+
+**Files to change:**
+- src/triage/handler.ts (line 150)
+- src/types.ts (line 42)
+```
+
+This is bad because:
+- No category
+- Vague description ("the triage thing is broken")
+- References file paths and line numbers that will go stale
+- No acceptance criteria
+- No scope boundaries
+- No description of current vs desired behavior
--- a/scripts/workstation/claude-skills/triage/OUT-OF-SCOPE.md
+++ b/scripts/workstation/claude-skills/triage/OUT-OF-SCOPE.md
@ -0,0 +1,101 @@
+# Out-of-Scope Knowledge Base
+
+The `.out-of-scope/` directory in a repo stores persistent records of rejected feature requests. It serves two purposes:
+
+1. **Institutional memory** — why a feature was rejected, so the reasoning isn't lost when the issue is closed
+2. **Deduplication** — when a new issue comes in that matches a prior rejection, the skill can surface the previous decision instead of re-litigating it
+
+## Directory structure
+
+```
+.out-of-scope/
+├── dark-mode.md
+├── plugin-system.md
+└── graphql-api.md
+```
+
+One file per **concept**, not per issue. Multiple issues requesting the same thing are grouped under one file.
+
+## File format
+
+The file should be written in a relaxed, readable style — more like a short design document than a database entry. Use paragraphs, code samples, and examples to make the reasoning clear and useful to someone encountering it for the first time.
+
+```markdown
+# Dark Mode
+
+This project does not support dark mode or user-facing theming.
+
+## Why this is out of scope
+
+The rendering pipeline assumes a single color palette defined in
+`ThemeConfig`. Supporting multiple themes would require:
+
+- A theme context provider wrapping the entire component tree
+- Per-component theme-aware style resolution
+- A persistence layer for user theme preferences
+
+This is a significant architectural change that doesn't align with the
+project's focus on content authoring. Theming is a concern for downstream
+consumers who embed or redistribute the output.
+
+```ts
+// The current ThemeConfig interface is not designed for runtime switching:
+interface ThemeConfig {
+  colors: ColorPalette; // single palette, resolved at build time
+  fonts: FontStack;
+}
+```
+
+## Prior requests
+
+- #42 — "Add dark mode support"
+- #87 — "Night theme for accessibility"
+- #134 — "Dark theme option"
+```
+
+### Naming the file
+
+Use a short, descriptive kebab-case name for the concept: `dark-mode.md`, `plugin-system.md`, `graphql-api.md`. The name should be recognizable enough that someone browsing the directory understands what was rejected without opening the file.
+
+### Writing the reason
+
+The reason should be substantive — not "we don't want this" but why. Good reasons reference:
+
+- Project scope or philosophy ("This project focuses on X; theming is a downstream concern")
+- Technical constraints ("Supporting this would require Y, which conflicts with our Z architecture")
+- Strategic decisions ("We chose to use A instead of B because...")
+
+The reason should be durable. Avoid referencing temporary circumstances ("we're too busy right now") — those aren't real rejections, they're deferrals.
+
+## When to check `.out-of-scope/`
+
+During triage (Step 1: Gather context), read all files in `.out-of-scope/`. When evaluating a new issue:
+
+- Check if the request matches an existing out-of-scope concept
+- Matching is by concept similarity, not keyword — "night theme" matches `dark-mode.md`
+- If there's a match, surface it to the maintainer: "This is similar to `.out-of-scope/dark-mode.md` — we rejected this before because [reason]. Do you still feel the same way?"
+
+The maintainer may:
+
+- **Confirm** — the new issue gets added to the existing file's "Prior requests" list, then closed
+- **Reconsider** — the out-of-scope file gets deleted or updated, and the issue proceeds through normal triage
+- **Disagree** — the issues are related but distinct, proceed with normal triage
+
+## When to write to `.out-of-scope/`
+
+Only when an **enhancement** (not a bug) is rejected as `wontfix`. The flow:
+
+1. Maintainer decides a feature request is out of scope
+2. Check if a matching `.out-of-scope/` file already exists
+3. If yes: append the new issue to the "Prior requests" list
+4. If no: create a new file with the concept name, decision, reason, and first prior request
+5. Post a comment on the issue explaining the decision and mentioning the `.out-of-scope/` file
+6. Close the issue with the `wontfix` label
+
+## Updating or removing out-of-scope files
+
+If the maintainer changes their mind about a previously rejected concept:
+
+- Delete the `.out-of-scope/` file
+- The skill does not need to reopen old issues — they're historical records
+- The new issue that triggered the reconsideration proceeds through normal triage
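Sketch, not part of this change: the write flow in OUT-OF-SCOPE.md (check for an existing concept file, append to "Prior requests", otherwise create the file) could be modelled roughly as below. `record_rejection` and its parameters are hypothetical names, and the sketch assumes "Prior requests" is the last section of the file, as in the example format.

```python
import re
from pathlib import Path

PRIOR_HEADING = "## Prior requests"
DASH = "\u2014"  # separator used in the example "Prior requests" entries


def record_rejection(root: Path, concept: str, title: str, reason: str,
                     issue: int, issue_title: str) -> Path:
    """Create or update .out-of-scope/<concept>.md (hypothetical helper)."""
    # File names are short kebab-case concept names, e.g. dark-mode.
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", concept):
        raise ValueError(f"concept must be kebab-case: {concept!r}")
    directory = root / ".out-of-scope"
    directory.mkdir(exist_ok=True)
    path = directory / f"{concept}.md"
    entry = f'- #{issue} {DASH} "{issue_title}"'

    if not path.exists():
        # New concept: decision, reason, and the first prior request.
        path.write_text(
            f"# {title}\n\n## Why this is out of scope\n\n{reason}\n\n"
            f"{PRIOR_HEADING}\n\n{entry}\n"
        )
        return path

    body = path.read_text().rstrip("\n")
    if f"- #{issue} " in body:
        return path  # this issue is already listed; nothing to do
    if PRIOR_HEADING not in body:
        body += f"\n\n{PRIOR_HEADING}\n"
    # Assumption: the prior-requests list is the final section of the file.
    path.write_text(f"{body}\n{entry}\n")
    return path
```

Matching a new issue to an existing concept stays a human decision in the documented flow; this helper only covers the bookkeeping once the maintainer has confirmed the match.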
--- a/scripts/workstation/claude-skills/triage/SKILL.md
+++ b/scripts/workstation/claude-skills/triage/SKILL.md
@ -0,0 +1,103 @@
+---
+name: triage
+description: Triage issues through a state machine driven by triage roles. Use when user wants to create an issue, triage issues, review incoming bugs or feature requests, prepare issues for an AFK agent, or manage issue workflow.
+---
+
+# Triage
+
+Move issues on the project issue tracker through a small state machine of triage roles.
+
+Every comment or issue posted to the issue tracker during triage **must** start with this disclaimer:
+
+```
+> *This was generated by AI during triage.*
+```
+
+## Reference docs
+
+- [AGENT-BRIEF.md](AGENT-BRIEF.md) — how to write durable agent briefs
+- [OUT-OF-SCOPE.md](OUT-OF-SCOPE.md) — how the `.out-of-scope/` knowledge base works
+
+## Roles
+
+Two **category** roles:
+
+- `bug` — something is broken
+- `enhancement` — new feature or improvement
+
+Five **state** roles:
+
+- `needs-triage` — maintainer needs to evaluate
+- `needs-info` — waiting on reporter for more information
+- `ready-for-agent` — fully specified, ready for an AFK agent
+- `ready-for-human` — needs human implementation
+- `wontfix` — will not be actioned
+
+Every triaged issue should carry exactly one category role and one state role. If state roles conflict, flag it and ask the maintainer before doing anything else.
+
+These are canonical role names; the actual label strings used in the issue tracker may differ. The mapping should have been provided to you; if it was not, run `/setup-matt-pocock-skills`.
+
+State transitions: an unlabeled issue normally goes to `needs-triage` first; from there it moves to `needs-info`, `ready-for-agent`, `ready-for-human`, or `wontfix`. `needs-info` returns to `needs-triage` once the reporter replies. The maintainer can override at any time — flag transitions that look unusual and ask before proceeding.
+
+## Invocation
+
+The maintainer invokes `/triage` and describes what they want in natural language. Interpret the request and act. Examples:
+
+- "Show me anything that needs my attention"
+- "Let's look at #42"
+- "Move #42 to ready-for-agent"
+- "What's ready for agents to pick up?"
+
+## Show what needs attention
+
+Query the issue tracker and present three buckets, oldest first:
+
+1. **Unlabeled** — never triaged.
+2. **`needs-triage`** — evaluation in progress.
+3. **`needs-info` with reporter activity since the last triage notes** — needs re-evaluation.
+
+Show counts and a one-line summary per issue. Let the maintainer pick.
+
+## Triage a specific issue
+
+1. **Gather context.** Read the full issue (body, comments, labels, reporter, dates). Parse any prior triage notes so you don't re-ask resolved questions. Explore the codebase using the project's domain glossary, respecting ADRs in the area. Read `.out-of-scope/*.md` and surface any prior rejection that resembles this issue.
+
+2. **Recommend.** Tell the maintainer your category and state recommendation with reasoning, plus a brief codebase summary relevant to the issue. Wait for direction.
+
+3. **Reproduce (bugs only).** Before any grilling, attempt reproduction: read the reporter's steps, trace the relevant code, run tests or commands. Report what happened — successful repro with code path, failed repro, or insufficient detail (a strong `needs-info` signal). A confirmed repro makes a much stronger agent brief.
+
+4. **Grill (if needed).** If the issue needs fleshing out, run a `/grill-with-docs` session.
+
+5. **Apply the outcome:**
+   - `ready-for-agent` — post an agent brief comment ([AGENT-BRIEF.md](AGENT-BRIEF.md)).
+   - `ready-for-human` — same structure as an agent brief, but note why it can't be delegated (judgment calls, external access, design decisions, manual testing).
+   - `needs-info` — post triage notes (template below).
+   - `wontfix` (bug) — polite explanation, then close.
+   - `wontfix` (enhancement) — write to `.out-of-scope/`, link to it from a comment, then close ([OUT-OF-SCOPE.md](OUT-OF-SCOPE.md)).
+   - `needs-triage` — apply the role. Optional comment if there's partial progress.
+
+## Quick state override
+
+If the maintainer says "move #42 to ready-for-agent", trust them and apply the role directly. Confirm what you're about to do (role changes, comment, close), then act. Skip grilling. If moving to `ready-for-agent` without a grilling session, ask whether they want to write an agent brief.
+
+## Needs-info template
+
+```markdown
+## Triage Notes
+
+**What we've established so far:**
+
+- point 1
+- point 2
+
+**What we still need from you (@reporter):**
+
+- question 1
+- question 2
+```
+
+Capture everything resolved during grilling under "established so far" so the work isn't lost. Questions must be specific and actionable, not "please provide more info".
+
+## Resuming a previous session
+
+If prior triage notes exist on the issue, read them, check whether the reporter has answered any outstanding questions, and present an updated picture before continuing. Don't re-ask resolved questions.
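Sketch, not part of this change: the role rules in the triage SKILL.md (exactly one category role and one state role, plus the normal state transitions) can be written down as a small table. The function names are hypothetical, and the transition table only encodes what the skill text describes as normal; anything else is reported as unusual so the maintainer can be asked first.

```python
from typing import Iterable, Optional

CATEGORY_ROLES = {"bug", "enhancement"}
STATE_ROLES = {"needs-triage", "needs-info", "ready-for-agent",
               "ready-for-human", "wontfix"}

# Normal transitions as the skill text describes them; None means "unlabeled".
NORMAL_TRANSITIONS = {
    None: {"needs-triage"},
    "needs-triage": {"needs-info", "ready-for-agent", "ready-for-human", "wontfix"},
    "needs-info": {"needs-triage"},
}


def role_problems(roles: Iterable[str]) -> list[str]:
    """Return reasons why a set of canonical roles is not well-formed."""
    roles = set(roles)
    problems = []
    categories = sorted(roles & CATEGORY_ROLES)
    states = sorted(roles & STATE_ROLES)
    if len(categories) != 1:
        problems.append(f"expected exactly one category role, found {categories}")
    if len(states) != 1:
        problems.append(f"expected exactly one state role, found {states}")
    return problems


def is_unusual(current: Optional[str], target: str) -> bool:
    """True when a move is outside the normal flow and should be flagged."""
    return target not in NORMAL_TRANSITIONS.get(current, set())
```

A maintainer override is still allowed for an unusual transition; the check only decides whether to ask before proceeding.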
--- a/scripts/workstation/claude-skills/write-a-skill/SKILL.md
+++ b/scripts/workstation/claude-skills/write-a-skill/SKILL.md
@ -0,0 +1,117 @@
+---
+name: write-a-skill
+description: Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when user wants to create, write, or build a new skill.
+---
+
+# Writing Skills
+
+## Process
+
+1. **Gather requirements** - ask user about:
+   - What task/domain does the skill cover?
+   - What specific use cases should it handle?
+   - Does it need executable scripts or just instructions?
+   - Any reference materials to include?
+
+2. **Draft the skill** - create:
+   - SKILL.md with concise instructions
+   - Additional reference files if content exceeds 100 lines
+   - Utility scripts if deterministic operations are needed
+
+3. **Review with user** - present draft and ask:
+   - Does this cover your use cases?
+   - Anything missing or unclear?
+   - Should any section be more/less detailed?
+
+## Skill Structure
+
+```
+skill-name/
+├── SKILL.md           # Main instructions (required)
+├── REFERENCE.md       # Detailed docs (if needed)
+├── EXAMPLES.md        # Usage examples (if needed)
+└── scripts/           # Utility scripts (if needed)
+    └── helper.js
+```
+
+## SKILL.md Template
+
+```md
+---
+name: skill-name
+description: Brief description of capability. Use when [specific triggers].
+---
+
+# Skill Name
+
+## Quick start
+
+[Minimal working example]
+
+## Workflows
+
+[Step-by-step processes with checklists for complex tasks]
+
+## Advanced features
+
+[Link to separate files: See [REFERENCE.md](REFERENCE.md)]
+```
+
+## Description Requirements
+
+The description is **the only thing your agent sees** when deciding which skill to load. It's surfaced in the system prompt alongside all other installed skills. Your agent reads these descriptions and picks the relevant skill based on the user's request.
+
+**Goal**: Give your agent just enough info to know:
+
+1. What capability this skill provides
+2. When/why to trigger it (specific keywords, contexts, file types)
+
+**Format**:
+
+- Max 1024 chars
+- Write in third person
+- First sentence: what it does
+- Second sentence: "Use when [specific triggers]"
+
+**Good example**:
+
+```
+Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when user mentions PDFs, forms, or document extraction.
+```
+
+**Bad example**:
+
+```
+Helps with documents.
+```
+
+The bad example gives your agent no way to distinguish this from other document skills.
+
+## When to Add Scripts
+
+Add utility scripts when:
+
+- Operation is deterministic (validation, formatting)
+- Same code would be generated repeatedly
+- Errors need explicit handling
+
+Scripts save tokens and improve reliability vs generated code.
+
+## When to Split Files
+
+Split into separate files when:
+
+- SKILL.md exceeds 100 lines
+- Content has distinct domains (finance vs sales schemas)
+- Advanced features are rarely needed
+
+## Review Checklist
+
+After drafting, verify:
+
+- [ ] Description includes triggers ("Use when...")
+- [ ] SKILL.md under 100 lines
+- [ ] No time-sensitive info
+- [ ] Consistent terminology
+- [ ] Concrete examples included
+- [ ] References one level deep
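Sketch, not part of this change: the mechanical items of the review checklist (description present, at most 1024 characters, contains a "Use when" trigger, file under 100 lines) lend themselves to a small script, in line with the "When to Add Scripts" guidance. `review_skill` is a hypothetical name, and the frontmatter parsing is a deliberately naive `key: value` reader, not a YAML parser.

```python
MAX_DESCRIPTION_CHARS = 1024
MAX_SKILL_LINES = 100


def review_skill(text: str) -> list[str]:
    """Return checklist problems found in the text of a SKILL.md file."""
    lines = [line.strip() for line in text.splitlines()]
    if not lines or lines[0] != "---":
        return ["missing frontmatter"]
    try:
        end = lines.index("---", 1)  # closing frontmatter delimiter
    except ValueError:
        return ["unterminated frontmatter"]

    # Naive single-line "key: value" parsing (assumption: no multi-line values).
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    problems = []
    if not meta.get("name"):
        problems.append("frontmatter has no name")
    description = meta.get("description", "")
    if not description:
        problems.append("frontmatter has no description")
    else:
        if len(description) > MAX_DESCRIPTION_CHARS:
            problems.append(f"description exceeds {MAX_DESCRIPTION_CHARS} chars")
        if "use when" not in description.lower():
            problems.append('description lacks a "Use when ..." trigger sentence')
    if len(lines) >= MAX_SKILL_LINES:
        problems.append(f"SKILL.md has {len(lines)} lines; keep it under {MAX_SKILL_LINES}")
    return problems
```

Checklist items such as consistent terminology or time-sensitive information still need a human or agent read; only the countable rules are covered here.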
--- a/scripts/workstation/claude-skills/zoom-out/SKILL.md
+++ b/scripts/workstation/claude-skills/zoom-out/SKILL.md
@ -0,0 +1,7 @@
+---
+name: zoom-out
+description: Tell the agent to zoom out and give broader context or a higher-level perspective. Use when you're unfamiliar with a section of code or need to understand how it fits into the bigger picture.
+disable-model-invocation: true
+---
+
+I don't know this area of code well. Go up a layer of abstraction. Give me a map of all the relevant modules and callers, using the project's domain glossary vocabulary.
--- a/scripts/workstation/packages.txt
+++ b/scripts/workstation/packages.txt
@ -24,6 +24,12 @@ rsync
 wget
 tree
 shellcheck
+# resource containment — earlyoom backstop (setup-devvm.sh §10, 2026-06-22): a
+# free-RAM-threshold OOM killer used INSTEAD of systemd-oomd, which is inert with
+# swap=0 (its pressure-kill needs reclaim/pgscan that no-swap anon workloads never
+# produce; verified live — 99% mem.pressure, pgscan=0, no kill). earlyoom watches
+# MemAvailable% and is swap-independent.
+earlyoom

 # --- installed by setup-devvm.sh via NON-apt paths (not apt-installable) ---
 # nodejs + npm                -> NodeSource repo (claude-code needs node >= 18; distro nodejs is too old)
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@ -71,6 +71,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
  log "npm: installing t3@$T3_TRACK ($want_t3)"; npm install -g "t3@$want_t3" >/dev/null
 fi

+# 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
+#     npm-global so every user's PATH resolves it. Pinned major; best-effort (a
+#     failure only disables `homelab vault`, nothing else on the box).
+if ! command -v bw >/dev/null; then
+  log "npm: installing @bitwarden/cli (homelab vault backend)"
+  npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
+fi
+
 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
 #    PINNED (not 'latest/download') so two fresh boxes built weeks apart are byte-identical.
 KUBELOGIN_VER="${KUBELOGIN_VER:-v1.36.2}"
@ -103,8 +111,11 @@ for d in skills rules agents commands; do
 done
 log "skel: launcher + tmux + inheritance symlinks (base=$CONFIG_BASE)"

-# 6) deploy the roster-driven provisioner to /usr/local/bin (run hourly by
-#    t3-provision-users.timer). Re-deployed here so its logic is reproducible.
+# 6) BOOTSTRAP-deploy the roster-driven provisioner to /usr/local/bin (run hourly
+#    by t3-provision-users.timer). This seeds the binary on a fresh box; ongoing
+#    edits self-deploy from the repo on the next reconcile (the script's step 0),
+#    so a committed change no longer needs a manual setup-devvm.sh re-run to land
+#    (the gap that left the homelab-memory rollout undeployed for a day).
 install -m 0755 "$HERE/../t3-provision-users.sh" /usr/local/bin/t3-provision-users
 log "t3-provision-users -> /usr/local/bin/ (roster-driven)"

@ -223,6 +234,134 @@ systemctl enable --now t3-dispatch.service \
  log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
 log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-user)"

+# 10) RESOURCE CONTAINMENT (2026-06-22): bound per-user memory + an OOM backstop so
+#     ONE user's runaway can never IO/memory-overload the shared box. History: the
+#     2026-06-10 "swap-only, ssh/tmux memory-uncontained" decision let a single
+#     user's runaway (a 10G `ugrep`; agent storms) swap-thrash the 60/60-throttled
+#     virtual disk into an IO storm + multi-minute freeze (hard-killed 2026-06-22).
+#     t3-serve@ was already capped (its [Service] block); the HOLE was the uncapped
+#     user-<uid>.slice (all ssh/tmux work). Design — per user, on BOTH trees:
+#     MemoryHigh=12G soft (throttles a runaway to a crawl), MemoryMax=16G hard,
+#     MemorySwapMax=0 (work never touches disk swap → no thrash; it OOMs locally at
+#     the ceiling instead), plus fair-share CPU/IO weights.
+#     BACKSTOP = earlyoom, NOT systemd-oomd. We first shipped systemd-oomd but it is
+#     INERT with swap=0: its pressure-kill only acts on cgroups doing active reclaim
+#     (pgscan rising), and a no-swap anon workload never reclaims — verified live, a
+#     cgroup at 99% memory.pressure / pgscan=0 was never killed. earlyoom instead
+#     watches FREE RAM (MemAvailable%): SIGTERMs the biggest process at <5%, SIGKILLs at <3%,
+#     swap-independent and reliable. It --avoids sshd/systemd/dockerd (your way in
+#     stays alive) and --prefers the agent/browser hogs. earlyoom pkg = packages.txt
+#     (§1). Per-cgroup MemoryMax is the PRIMARY guard; earlyoom is the aggregate net.
+#     Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md
+
+# 10a) per-user caps + fair-share weights on EVERY user-<uid>.slice (ssh/tmux)
+install -d -m 0755 /etc/systemd/system/user-.slice.d
+cat > /etc/systemd/system/user-.slice.d/50-devvm-resource.conf <<'SLICE_EOF'
+# Per-user containment for the shared devvm (setup-devvm.sh §10, 2026-06-22).
+# Applies to EACH user-<uid>.slice = all of one user's ssh/tmux work. Mirrors the
+# t3-serve@.service caps so a user is bounded in whichever surface they work in.
+[Slice]
+MemoryAccounting=yes
+MemoryHigh=12G
+MemoryMax=16G
+MemorySwapMax=0
+CPUAccounting=yes
+CPUWeight=100
+IOAccounting=yes
+IOWeight=100
+SLICE_EOF
+
+# 10b) earlyoom backstop config — RAM-threshold, swap-INDEPENDENT (see header note
+#      on why systemd-oomd is inert with swap=0). The Debian unit reads /etc/default.
+cat > /etc/default/earlyoom <<'EARLYOOM_EOF'
+# devvm aggregate OOM backstop (setup-devvm.sh §10, 2026-06-22). Watches FREE RAM
+# (MemAvailable%) and kills the biggest task before the box exhausts. Unlike
+# systemd-oomd it needs NO swap/reclaim, so it works with our swap=0 work cgroups.
+#   -m 5,3     SIGTERM the victim at MemAvailable<5%, SIGKILL at <3%
+#   -s 100,100 ignore swap in the decision (RAM-only; work cgroups are swap=0)
+#   --avoid    never the box's nervous system / your way back in
+#   --prefer   target the agent/browser/build hogs that actually exhaust RAM
+#   -r 3600    hourly memory report (the 60s default is log spam)
+EARLYOOM_ARGS="-m 5,3 -s 100,100 -r 3600 --avoid ^(systemd|systemd-.*|sshd|dockerd|containerd|init|t3-dispatch|tmux.*)$ --prefer ^(python3|node|chrome|chromium|ugrep|rg|go|claude)$"
+EARLYOOM_EOF
+
+# 10c) capped docker.slice (top-level sibling of system/user slices); daemon.json
+#      cgroup-parent (10d) makes EVERY container land here under one bounded budget.
+cat > /etc/systemd/system/docker.slice <<'DOCKER_SLICE_EOF'
+# All docker containers live here (cgroup-parent in /etc/docker/daemon.json) so
+# they share one bounded budget and a runaway container is capped at MemoryMax
+# (cgroup-OOM'd locally) instead of escaping into the uncapped system.slice.
+# setup-devvm.sh §10, 2026-06-22.
+[Unit]
+Description=Docker containers slice (capped)
+[Slice]
+MemoryAccounting=yes
+MemoryHigh=6G
+MemoryMax=8G
+MemorySwapMax=0
+CPUAccounting=yes
+CPUWeight=100
+IOAccounting=yes
+IOWeight=100
+DOCKER_SLICE_EOF
+
+# 10d) point dockerd at docker.slice (idempotent JSON merge; flag a needed restart).
+#      python preserves the rest of daemon.json (buildkit, nvidia runtime, etc.).
+docker_restart=0
+# if-condition form so the deliberate non-zero exit (10=changed) does NOT trip the
+# script's `set -e`; $? in the else branch is the python exit code.
+if python3 - <<'PY'
+import json, os, sys
+p = "/etc/docker/daemon.json"
+try:
+    d = json.load(open(p)) if os.path.exists(p) else {}
+except Exception:
+    sys.exit(2)                       # malformed -> don't touch
+if d.get("cgroup-parent") == "docker.slice":
+    sys.exit(0)                       # already correct -> no restart
+d["cgroup-parent"] = "docker.slice"
+with open(p, "w") as f: json.dump(d, f, indent=4)  # file is closed before exiting
+sys.exit(10)                          # changed -> restart needed
+PY
+then rc=0; else rc=$?; fi
+case $rc in
+  0) : ;;
+  10) docker_restart=1 ;;
+  *) log "WARN: could not patch /etc/docker/daemon.json — docker.slice NOT wired" ;;
+esac
+
+# 10e) t3-serve@ instances need no extra drop-in: their per-instance MemoryMax /
+#      MemorySwapMax caps live in t3-serve@.service [Service]; earlyoom (10b) is the
+#      box-wide net. (The earlier oomd slice-policing drop-in was removed — inert.)
+
+# 10f) give system.slice a priority edge so sshd/services stay snappy under
+#      contention (weights are work-conserving — users still get idle CPU/IO).
+install -d -m 0755 /etc/systemd/system/system.slice.d
+cat > /etc/systemd/system/system.slice.d/50-devvm-priority.conf <<'SYS_EOF'
+# Keep the box's nervous system responsive under contention (setup-devvm.sh §10).
+[Slice]
+CPUAccounting=yes
+CPUWeight=200
+IOAccounting=yes
+IOWeight=200
+SYS_EOF
+
+# 10g) activate: reload, arm earlyoom, restart dockerd ONLY if daemon.json changed.
+systemctl daemon-reload
+# earlyoom reads /etc/default/earlyoom (10b); enable + restart so new args take effect
+# even on a re-run where it was already running.
+systemctl enable --now earlyoom.service >/dev/null 2>&1 \
+  || log "WARN: earlyoom failed to enable — is the package installed? (packages.txt §1)"
+systemctl restart earlyoom.service 2>/dev/null || true
+# systemd-oomd is inert with swap=0 (see header) — ensure it isn't also running from
+# an earlier iteration of this section. No-op if the package was never installed.
+systemctl disable --now systemd-oomd.service >/dev/null 2>&1 || true
+if [[ $docker_restart -eq 1 ]] && systemctl is-active --quiet docker; then
+  log "restarting dockerd to apply cgroup-parent=docker.slice (running containers bounce briefly)"
+  systemctl restart docker || log "WARN: docker restart failed"
+fi
+log "§10 resource containment: per-user 12G/16G swap=0, earlyoom RAM backstop, docker.slice"
+
 # Run one foreground reconcile while the admin Vault token borrowed in section 8
 # is still available. This is what mints new roster users' isolated periodic
 # Vault tokens; the hourly no-admin-token reconcile only maintains existing ones.
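Sketch, not part of this change: a simplified model of the `-m 5,3 -s 100,100` decision configured in section 10b, to show why setting the swap thresholds to 100 makes the backstop effectively RAM-only. It assumes earlyoom acts only when both the memory and the swap condition hold; the function names are hypothetical and the real daemon's logic may differ in details.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            values[name.strip()] = int(rest.split()[0])
    return values


def earlyoom_action(meminfo: dict[str, int],
                    mem_term: float = 5.0, mem_kill: float = 3.0,
                    swap_term: float = 100.0, swap_kill: float = 100.0) -> str:
    """Return "SIGKILL", "SIGTERM" or "none" for one polling cycle (model only)."""
    mem_avail_pct = 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]
    swap_total = meminfo.get("SwapTotal", 0)
    # With no swap configured, treat free swap as 0% (assumption of this sketch).
    swap_free_pct = 100.0 * meminfo.get("SwapFree", 0) / swap_total if swap_total else 0.0
    # A swap threshold of 100 is always satisfied, so only RAM decides.
    if mem_avail_pct <= mem_kill and swap_free_pct <= swap_kill:
        return "SIGKILL"
    if mem_avail_pct <= mem_term and swap_free_pct <= swap_term:
        return "SIGTERM"
    return "none"
```

With lower swap thresholds (for example 10 and 5), a box with plenty of free swap would not trigger at 4% available RAM in this model, which is the behaviour the `-s 100,100` setting is meant to rule out.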
--- a/stacks/actualbudget/.terraform.lock.hcl
+++ b/stacks/actualbudget/.terraform.lock.hcl
@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

+provider "registry.terraform.io/gavinbunney/kubectl" {
+  version     = "1.19.0"
+  constraints = "~> 1.14"
+  hashes = [
+    "h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
+  ]
+}
+
 provider "registry.terraform.io/goauthentik/authentik" {
  version     = "2024.12.1"
  constraints = "~> 2024.10"
@ -47,61 +55,36 @@ provider "registry.terraform.io/goauthentik/authentik" {
 }

 provider "registry.terraform.io/hashicorp/helm" {
-  version = "3.1.1"
+  version = "3.2.0"
  hashes = [
-    "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
-    "h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
-    "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
-    "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
-    "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
-    "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
-    "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
-    "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
-    "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
-    "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
-    "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
-    "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
-    "zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "h1:CTWVDQxyq1cQJR/9RJSkXii/gz5zMjxszSw9LmptDh4=",
  ]
 }

 provider "registry.terraform.io/hashicorp/kubernetes" {
-  version = "3.1.0"
+  version = "3.2.0"
  hashes = [
-    "h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
-    "zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
-    "zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
-    "zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
-    "zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
-    "zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
-    "zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
-    "zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
-    "zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
-    "zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
-    "zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
-    "zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
+    "h1:OjMar8kVp0LcDtwgRs877g/K/KPAEDVhFewpE3Tp7l8=",
  ]
 }

 provider "registry.terraform.io/hashicorp/random" {
-  version = "3.8.1"
+  version = "3.9.0"
  hashes = [
-    "h1:Eexl06+6J+s75uD46+WnZtpJZYRVUMB0AiuPBifK6Jc=",
-    "h1:u8AKlWVDTH5r9YLSeswoVEjiY72Rt4/ch7U+61ZDkiQ=",
-    "zh:08dd03b918c7b55713026037c5400c48af5b9f468f483463321bd18e17b907b4",
-    "zh:0eee654a5542dc1d41920bbf2419032d6f0d5625b03bd81339e5b33394a3e0ae",
-    "zh:229665ddf060aa0ed315597908483eee5b818a17d09b6417a0f52fd9405c4f57",
-    "zh:2469d2e48f28076254a2a3fc327f184914566d9e40c5780b8d96ebf7205f8bc0",
-    "zh:37d7eb334d9561f335e748280f5535a384a88675af9a9eac439d4cfd663bcb66",
-    "zh:741101426a2f2c52dee37122f0f4a2f2d6af6d852cb1db634480a86398fa3511",
+    "h1:UlBuNVuCGJ39tTv2c5gz2NRZnQbXfbIWbTzWcth5o74=",
+    "zh:161ad0bd9a75768c82f53fb6e7172a9d8be2d4889b012645a34795031aaf1bf1",
+    "zh:19dc9a5b17729725ccfc4f45b0500af0ee5bc6b6b160c7adb8f2bf617d2c80ea",
+    "zh:269eda8fe42daa7974d5a34d166c3ba9defe80cde86c01e4dadcfdf2e1f05e5f",
+    "zh:373f7c65566f8f2cc7f45d698654feb9d988996957e1266a69ca00c52d6d16d0",
+    "zh:5599d16804c41c83009ec621b6d6b6f74e102f5827678a4750f8809055546b61",
+    "zh:583be0440469a22bff70dcfa56593b01566860b29607437264adb51060cf46fc",
+    "zh:5f211d8ec3f2e1f414870d9584bfe26e6995560ef81c748f8447a48164767398",
    "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",
-    "zh:a902473f08ef8df62cfe6116bd6c157070a93f66622384300de235a533e9d4a9",
-    "zh:b85c511a23e57a2147355932b3b6dce2a11e856b941165793a0c3d7578d94d05",
-    "zh:c5172226d18eaac95b1daac80172287b69d4ce32750c82ad77fa0768be4ea4b8",
-    "zh:dab4434dba34aad569b0bc243c2d3f3ff86dd7740def373f2a49816bd2ff819b",
-    "zh:f49fd62aa8c5525a5c17abd51e27ca5e213881d58882fd42fec4a545b53c9699",
+    "zh:7b547fd16216761ef86efc3ed516ac5ac0c5c42b7c7eb24a08cef2d93f69ed5e",
+    "zh:7e7c0679daf2a382151d05068c8c3f0dae6b7b7dccf818827b73dd08638df2ef",
+    "zh:8089dec888a8038b9b4fb23b3df7e1057293dbc5b60b42cc47ff690d69d4b61b",
+    "zh:c51f15a031edfd6f23ce8ced3446ca7f8d8d647e2499890d7d5d10d5016d7257",
+    "zh:c94784f005708890dc6895afd53636ec00ec1e430b15d41e5aebfb1d4b39bd04",
  ]
 }

@ -125,3 +108,11 @@ provider "registry.terraform.io/hashicorp/vault" {
    "zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
  ]
 }
+
+provider "registry.terraform.io/telmate/proxmox" {
+  version     = "3.0.2-rc07"
+  constraints = "3.0.2-rc07"
+  hashes = [
+    "h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
+  ]
+}
--- a/stacks/actualbudget/main.tf
+++ b/stacks/actualbudget/main.tf
@ -6,7 +6,7 @@ variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "actualbudget-secrets"
--- a/stacks/affine/.terraform.lock.hcl
+++ b/stacks/affine/.terraform.lock.hcl
@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

+provider "registry.terraform.io/gavinbunney/kubectl" {
+  version     = "1.19.0"
+  constraints = "~> 1.14"
+  hashes = [
+    "h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
+  ]
+}
+
 provider "registry.terraform.io/goauthentik/authentik" {
  version     = "2024.12.1"
  constraints = "~> 2024.10"
@ -33,29 +41,16 @@ provider "registry.terraform.io/goauthentik/authentik" {
 }

 provider "registry.terraform.io/hashicorp/helm" {
-  version = "3.1.1"
+  version = "3.2.0"
  hashes = [
-    "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
-    "h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
-    "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
-    "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
-    "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
-    "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
-    "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
-    "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
-    "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
-    "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
-    "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
-    "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
-    "zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "h1:CTWVDQxyq1cQJR/9RJSkXii/gz5zMjxszSw9LmptDh4=",
  ]
 }

 provider "registry.terraform.io/hashicorp/kubernetes" {
-  version = "3.1.0"
+  version = "3.2.0"
  hashes = [
-    "h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
+    "h1:OjMar8kVp0LcDtwgRs877g/K/KPAEDVhFewpE3Tp7l8=",
  ]
 }

@ -79,3 +74,11 @@ provider "registry.terraform.io/hashicorp/vault" {
    "zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
  ]
 }
+
+provider "registry.terraform.io/telmate/proxmox" {
+  version     = "3.0.2-rc07"
+  constraints = "3.0.2-rc07"
+  hashes = [
+    "h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
+  ]
+}
--- a/stacks/affine/main.tf
+++ b/stacks/affine/main.tf
@ -6,7 +6,7 @@ variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "affine-secrets"
@ -43,7 +43,7 @@ data "kubernetes_secret" "eso_secrets" {
 # Provides DATABASE_URL that auto-updates when password rotates
 resource "kubernetes_manifest" "db_external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "affine-db-creds"
--- a/stacks/android-emulator/README.md
+++ b/stacks/android-emulator/README.md
@ -6,9 +6,12 @@ tenant: tripit). Decision record: `docs/adr/0001-android-emulator-in-cluster.md`

 ## On-demand lifecycle (since 2026-06-12)

-The emulator **scales to zero when idle** (no adb/VNC connections for ~1h,
-checked by the `android-emulator-idle-sleeper` CronJob) and **wakes on
-visit**: the wake gate owns `/` on both hostnames. Warm boot is ~90s.
+The emulator **scales to zero when idle** (no user interaction for 6h —
+taps/keys/app-launches/noVNC clicks, read from `dumpsys power` by the
+`android-emulator-idle-sleeper` CronJob) and **wakes on visit**: the wake
+gate owns `/` on both hostnames. Warm boot is ~90s. Idle is measured from
+real interaction, not connection count, so a forgotten `adb connect` (left
+ESTABLISHED) no longer keeps it awake — but `adb disconnect` anyway.

 - Humans: open https://android-emulator.viktorbarzin.me — it wakes the
  emulator if needed, shows a self-refreshing boot page, then hands over to
--- a/stacks/android-emulator/gate.py
+++ b/stacks/android-emulator/gate.py
@ -22,7 +22,6 @@ API = "https://%s:%s" % (
 )
 TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
 CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
-IDLE_ANNOTATION = "emulator.viktorbarzin.me/idle-checks"
 VNC_PATH = "/vnc.html?autoconnect=1&resize=scale"

 WAKING_PAGE = """<!doctype html><html><head><title>Android emulator</title>
@ -57,13 +56,14 @@ def deployment_state():


 def wake():
+    # Direct replicas patch on the named deployment — same path the idle
+    # sleeper uses to scale DOWN; needs only `deployments` patch, not
+    # `deployments/scale`. Idle is now measured from dumpsys power, so there
+    # is no idle-counter annotation to reset here.
    kube(
        "PATCH",
        f"/apis/apps/v1/namespaces/{NS}/deployments/{DEPLOY}",
-        {
-            "spec": {"replicas": 1},
-            "metadata": {"annotations": {IDLE_ANNOTATION: "0"}},
-        },
+        {"spec": {"replicas": 1}},
    )


--- a/stacks/android-emulator/gate.tf
+++ b/stacks/android-emulator/gate.tf
@ -152,11 +152,18 @@ resource "kubernetes_service" "gate" {
  }
 }

-# Sleep side: every 15 min, look at established TCP connections to the
-# emulator's adb (5555) and noVNC (6080) ports from OUTSIDE the pod
-# (remote != 127.0.0.1 — the in-container adb server holds a permanent
-# loopback connection to adbd that must not count as activity). Four
-# consecutive idle checks (~1h) scale the deployment to zero.
+# Sleep side: every 15 min, ask the emulator how long since it was actually
+# USED — dumpsys power's last user-activity time (taps/keys/app-launches,
+# including noVNC clicks) vs guest uptime. No activity for 6h → scale the
+# deployment to zero. This deliberately IGNORES open adb/noVNC connections:
+# a forgotten adb transport (connect with no disconnect) stays ESTABLISHED
+# forever, so the old connection-count check kept resetting and the emulator
+# never slept (up 6+ days while idle ~5). Reads activity via `kubectl exec`
+# (the SA has pods/exec) and scales down with a direct replicas patch on the
+# named deployment — the SAME path the wake gate scales UP — so it needs only
+# the existing `deployments` patch grant, NOT `deployments/scale` (which the
+# SA lacks; the old `kubectl scale` here failed Forbidden). Stateless: no
+# idle-counter annotation. Fail-safe: any read error → do NOT sleep.
 resource "kubernetes_cron_job_v1" "idle_sleeper" {
  metadata {
    name      = "android-emulator-idle-sleeper"
@ -182,33 +189,37 @@ resource "kubernetes_cron_job_v1" "idle_sleeper" {
              image   = "bitnami/kubectl:latest"
              command = ["/bin/bash", "-c"]
              args = [<<-EOT
-                set -euo pipefail
-                NS=android-emulator DEPLOY=android-emulator ANN=emulator.viktorbarzin.me/idle-checks
+                set -eu
+                NS=android-emulator
+                DEPLOY=android-emulator
+                IDLE_LIMIT_SECONDS=21600   # 6h with no user activity -> sleep
                spec=$(kubectl -n $NS get deploy $DEPLOY -o jsonpath='{.spec.replicas}')
                [ "$spec" = "0" ] && { echo "already asleep"; exit 0; }
-                pod=$(kubectl -n $NS get pods -l app=$DEPLOY --field-selector=status.phase=Running -o name | head -1)
-                [ -z "$pod" ] && { echo "no running pod (booting?) — not counting"; exit 0; }
-                # /proc/net/tcp: count ESTABLISHED (st=01) conns with local port
-                # 5555 (0x15B3) or 6080 (0x17C0) whose remote is not loopback.
-                est=$(kubectl -n $NS exec $${pod#pod/} -- cat /proc/net/tcp | awk '
-                  $4 == "01" {
-                    split($2, l, ":"); split($3, r, ":")
-                    if ((l[2] == "15B3" || l[2] == "17C0") && r[1] != "0100007F") n++
-                  } END { print n+0 }')
-                if [ "$est" -gt 0 ]; then
-                  echo "$est active connection(s) — resetting idle counter"
-                  kubectl -n $NS annotate deploy $DEPLOY $ANN=0 --overwrite
+                pod=$(kubectl -n $NS get pods -l app=$DEPLOY --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
+                [ -z "$pod" ] && { echo "no running pod (booting?) — not sleeping"; exit 0; }
+                # How long since the emulator was actually used? Compare the last
+                # user-activity time from dumpsys power (taps/keys/app-launches,
+                # incl. noVNC clicks) with current guest uptime, both in ms on
+                # the guest uptime clock. Capture first, then parse: NO pipefail
+                # and no early `exit` in awk, so a streaming `dumpsys` can't
+                # SIGPIPE the exec and trip set -e (that bug made every run die
+                # 141 with no output). Fail-safe: a still-booting emulator (adb
+                # not ready) yields empty values -> do NOT sleep.
+                uptime_raw=$(kubectl -n $NS exec $pod -- adb shell cat /proc/uptime 2>/dev/null || true)
+                dump=$(kubectl -n $NS exec $pod -- adb shell dumpsys power 2>/dev/null || true)
+                uptime_ms=$(printf '%s' "$uptime_raw" | awk '{printf "%d", $1*1000}')
+                last_ms=$(printf '%s' "$dump" | awk -F= '/mLastUserActivityTime\(excludingAttention\)/{v=$2} END{gsub(/[^0-9]/,"",v); print v}')
+                if [ -z "$uptime_ms" ] || [ -z "$last_ms" ]; then
+                  echo "could not read activity (emulator booting / adb not ready) — not sleeping"
                  exit 0
                fi
-                n=$(kubectl -n $NS get deploy $DEPLOY -o jsonpath="{.metadata.annotations['emulator\.viktorbarzin\.me/idle-checks']}")
-                n=$(( $${n:-0} + 1 ))
-                if [ "$n" -ge 4 ]; then
-                  echo "idle for $n checks (~1h) — scaling to zero"
-                  kubectl -n $NS scale deploy $DEPLOY --replicas=0
-                  kubectl -n $NS annotate deploy $DEPLOY $ANN=0 --overwrite
+                idle_s=$(( (uptime_ms - last_ms) / 1000 ))
+                echo "idle for $idle_s s (limit $IDLE_LIMIT_SECONDS s / 6h)"
+                if [ "$idle_s" -ge "$IDLE_LIMIT_SECONDS" ]; then
+                  echo "idle >= 6h with no user activity — scaling to zero"
+                  kubectl -n $NS patch deploy $DEPLOY --type=merge -p '{"spec":{"replicas":0}}'
                else
-                  echo "idle check $n/4"
-                  kubectl -n $NS annotate deploy $DEPLOY $ANN=$n --overwrite
+                  echo "used within 6h — staying up"
                fi
              EOT
              ]
--- a/stacks/authentik/admin-services-restriction.tf
+++ b/stacks/authentik/admin-services-restriction.tf
@ -49,6 +49,17 @@ resource "authentik_policy_expression" "admin_services_restriction" {

    host = request.context.get("host", "")

+    # chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
+    # logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
+    # "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a
+    # member. akadmin kept as break-glass. The homelab-browser CDP path is
+    # already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
+    # so this closes the only remaining, human, noVNC path. Match username OR
+    # email so neither attribute alone can lock Viktor out.
+    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
+    if host == "chrome.viktorbarzin.me":
+        return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED
+
    # t3 Workstation edge gate: only members of "T3 Users" may reach t3.
    # Placed BEFORE the ADMIN_ONLY_HOSTS early-return (t3 is intentionally not in
    # that set — it must not require Home-Server-Admins, just T3 Users membership).
--- a/stacks/authentik/email-secret.tf
+++ b/stacks/authentik/email-secret.tf
@ -7,7 +7,7 @@
 # authentik pods if the password ever changes.
 resource "kubernetes_manifest" "authentik_email_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "authentik-email"
--- a/stacks/beads-server/main.tf
+++ b/stacks/beads-server/main.tf
@ -602,7 +602,7 @@ resource "kubernetes_config_map" "beadboard_config" {
 # dispatch agent jobs via the in-cluster HTTP API.
 resource "kubernetes_manifest" "beadboard_agent_service_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "beadboard-agent-service"
--- a/stacks/broker-sync/main.tf
+++ b/stacks/broker-sync/main.tf
@ -29,7 +29,7 @@ resource "kubernetes_namespace" "broker_sync" {
 #   imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "broker-sync-secrets"
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -156,12 +156,121 @@ resource "helm_release" "tigera_operator" {
  values = [yamlencode({
    installation = { enabled = false }
    apiServer    = { enabled = false }
-    # Goldmane (flow aggregator) + Whisker (observability UI) are new in Calico
-    # 3.30 and default-on; disabled — we use Prometheus/Loki, and on a helm
-    # UPGRADE their CRs render before their crds/ (which helm skips on upgrade)
-    # are installed -> "ensure CRDs are installed first". Not needed here.
+    # Goldmane (flow aggregator) + Whisker (observability UI), new in Calico
+    # 3.30, are kept disabled IN HELM on purpose: on a helm UPGRADE their CRs
+    # render before their crds/ (which helm skips on upgrade) -> "ensure CRDs
+    # are installed first". We instead enable them via the operator CRs applied
+    # directly below (kubectl_manifest) now that the CRDs exist — see ADR-0014.
    goldmane  = { enabled = false }
    whisker   = { enabled = false }
-    resources = { limits = { memory = "256Mi" } }
+    # 512Mi (was 256Mi): the operator idles at ~38Mi but its STARTUP spike
+    # (re-listing resources to build informer caches) exceeded 256Mi and
+    # OOM-crashlooped on 2026-06-23 the first time the pod restarted (a latent
+    # landmine — any restart would have triggered it). 512Mi covers the spike;
+    # data plane (calico-node) is unaffected by an operator restart.
+    resources = { limits = { memory = "512Mi" } }
  })]
 }
+
+# ---------------------------------------------------------------------------
+# Goldmane + Whisker (Calico 3.30 OSS flow observability) — ADR-0014.
+#
+# Enabled via the operator CRs directly (NOT the Helm goldmane/whisker flags,
+# which stay false above): the goldmanes/whiskers.operator.tigera.io CRDs are
+# already installed (operator adopted them 2026-06-19), so we sidestep the
+# helm-upgrade "CRs render before crds/" ordering issue by applying the CRs
+# ourselves — the running operator reconciles them. Same kubectl_manifest
+# pattern as the wave1 GNP above (no plan-time CRD requirement).
+#
+# Creating the Goldmane CR makes the operator re-render calico-node with the
+# FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — do NOT patch
+# FelixConfiguration) => a supervised calico-node DaemonSet roll. Goldmane:
+# Deployment + Service goldmane:7443 (gRPC/mTLS) in calico-system. Whisker:
+# Deployment + Service whisker:8081 in calico-system; its backend dials
+# goldmane, so Goldmane must exist first (depends_on). notifications=Disabled
+# so the UI does not call the external Tigera notifications endpoint.
+#
+# NOTE: durable Loki persistence is NOT these CRs. The Goldmane emitter is
+# Calico Cloud/Enterprise-gated (no OSS knob to aim it at Loki), so the trail
+# is a separate consumer of goldmane's gRPC Flows API (ADR-0014 / issue #58).
+# Whisker alone is a ~60-min in-memory live view. Reversible: delete to disable.
+resource "kubectl_manifest" "goldmane" {
+  depends_on = [helm_release.tigera_operator]
+  yaml_body = yamlencode({
+    apiVersion = "operator.tigera.io/v1"
+    kind       = "Goldmane"
+    metadata   = { name = "default" }
+  })
+}
+
+resource "kubectl_manifest" "whisker" {
+  depends_on = [kubectl_manifest.goldmane]
+  yaml_body = yamlencode({
+    apiVersion = "operator.tigera.io/v1"
+    kind       = "Whisker"
+    metadata   = { name = "default" }
+    spec       = { notifications = "Disabled" }
+  })
+}
+
+# ---------------------------------------------------------------------------
+# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
+#
+# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
+# Whisker ships NO own login — it's an admin observability UI, so Authentik
+# forward-auth is the only gate between strangers and the flow view). The
+# operator replicated `tls-secret` into calico-system already.
+#
+# TWO coupled pieces are required because the operator's own `whisker`
+# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
+# with NO ingress rules => default-deny on ingress to the whisker pod. The
+# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
+# across policies selecting the same pod), so we never edit the operator NP.
+module "ingress_whisker" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  dns_type        = "proxied"
+  namespace       = "calico-system"
+  name            = "whisker"
+  service_name    = "whisker"
+  port            = 8081
+  auth            = "required"
+  tls_secret_name = "tls-secret"
+  extra_annotations = {
+    "gethomepage.dev/enabled"     = "true"
+    "gethomepage.dev/name"        = "Whisker"
+    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
+    "gethomepage.dev/icon"        = "calico.png"
+    "gethomepage.dev/group"       = "Infrastructure"
+  }
+}
+
+# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
+# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
+# can reach the UI without touching the operator-owned policy.
+resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
+  metadata {
+    name      = "whisker-allow-traefik"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Ingress"]
+    ingress {
+      from {
+        namespace_selector {
+          match_labels = {
+            "kubernetes.io/metadata.name" = "traefik"
+          }
+        }
+      }
+      ports {
+        port     = "8081"
+        protocol = "TCP"
+      }
+    }
+  }
+}
--- a/stacks/changedetection/.terraform.lock.hcl
+++ b/stacks/changedetection/.terraform.lock.hcl
@ -70,41 +70,16 @@ provider "registry.terraform.io/goauthentik/authentik" {
 }

 provider "registry.terraform.io/hashicorp/helm" {
-  version = "3.1.1"
+  version = "3.2.0"
  hashes = [
-    "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
-    "h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
-    "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
-    "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
-    "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
-    "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
-    "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
-    "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
-    "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
-    "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
-    "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
-    "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
-    "zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "h1:CTWVDQxyq1cQJR/9RJSkXii/gz5zMjxszSw9LmptDh4=",
  ]
 }

 provider "registry.terraform.io/hashicorp/kubernetes" {
-  version = "3.1.0"
+  version = "3.2.0"
  hashes = [
-    "h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
-    "zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
-    "zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
-    "zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
-    "zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
-    "zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
-    "zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
-    "zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
-    "zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
-    "zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
-    "zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
-    "zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
+    "h1:OjMar8kVp0LcDtwgRs877g/K/KPAEDVhFewpE3Tp7l8=",
  ]
 }

--- a/stacks/changedetection/main.tf
+++ b/stacks/changedetection/main.tf
@ -20,7 +20,7 @@ resource "kubernetes_namespace" "changedetection" {

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "changedetection-secrets"
--- a/stacks/chrome-service/files/chrome/Dockerfile
+++ b/stacks/chrome-service/files/chrome/Dockerfile
@ -0,0 +1,27 @@
+# chrome-service browser image (ADR-0002, infra-owned, built off-infra on GHA).
+#
+# The Playwright base provides Xvfb + every browser runtime dep + fonts. On top
+# we install REAL Google Chrome for its licensed proprietary codecs (H.264/AAC):
+# the bundled open-source Chromium ships with those codecs COMPILED OUT, so
+# H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
+# MEDIA_ERR_SRC_NOT_SUPPORTED. Swapping libffmpeg.so does NOT help (Playwright's
+# Chromium has the codecs compiled out, not just the lib stripped), and Chrome
+# for Testing is also codec-less — only google-chrome-stable carries them.
+#
+# main.tf launches /opt/google/chrome/chrome instead of the bundled
+# /ms-playwright/chromium-*/chrome. connect_over_cdp callers (tripit fare scrape,
+# homelab browser, snapshot-harvester) attach to whatever Chrome runs here.
+FROM mcr.microsoft.com/playwright:v1.48.0-noble
+
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends wget gnupg ca-certificates \
+ && wget -qO- https://dl.google.com/linux/linux_signing_key.pub \
+      | gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg \
+ && echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] https://dl.google.com/linux/chrome/deb/ stable main" \
+      > /etc/apt/sources.list.d/google-chrome.list \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends google-chrome-stable \
+ && rm -rf /var/lib/apt/lists/*
+
+# Fail the build if Chrome isn't runnable / the path moved.
+RUN /opt/google/chrome/chrome --version
--- a/stacks/chrome-service/files/novnc/entrypoint.sh
+++ b/stacks/chrome-service/files/novnc/entrypoint.sh
@ -3,6 +3,13 @@
 # and serve the noVNC HTML5 client + websockify bridge on :6080.
 set -e

+# Containerd grants pods an effectively unbounded RLIMIT_NOFILE (2^31). x11vnc
+# sweeps the WHOLE fd table with fcntl on every client connection, so each VNC
+# connect hangs for ~forever and the noVNC client sits on "Connecting" until it
+# times out. Cap it before launching x11vnc. (Same fix as the android-emulator
+# stack; see docs/architecture/chrome-service.md "noVNC fd-sweep".)
+ulimit -n 65536
+
 for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
  if echo > /dev/tcp/127.0.0.1/6099 2>/dev/null; then
    echo "Xvfb TCP up after attempt $i"
--- a/stacks/chrome-service/main.tf
+++ b/stacks/chrome-service/main.tf
@ -42,7 +42,7 @@ resource "kubernetes_namespace" "chrome_service" {

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "chrome-service-secrets"
@ -178,8 +178,12 @@ resource "kubernetes_deployment" "chrome_service" {
        }

        container {
-          name              = "chrome-service"
-          image             = local.image
+          name = "chrome-service"
+          # Real Google Chrome (Playwright base + google-chrome-stable) for
+          # proprietary H.264/AAC codecs — see files/chrome/Dockerfile. The
+          # snapshot sidecars still use local.python_image (playwright minor
+          # pin) and connect_over_cdp; verified compatible with this Chrome.
+          image             = "ghcr.io/viktorbarzin/chrome-service-browser:latest"
          image_pull_policy = "IfNotPresent"

          # Direct chromium launch (NOT `playwright launch-server`). Reason:
@ -203,16 +207,16 @@ resource "kubernetes_deployment" "chrome_service" {
          args = [
            <<-EOT
            set -e
-            # Locate chromium in the Microsoft image. The path is
-            # /ms-playwright/chromium-XXXX/chrome-linux/chrome where XXXX
-            # is the playwright-pinned build; resolve at runtime so a minor
-            # bump of the image doesn't break the launch line.
-            CHROMIUM=$(find /ms-playwright -maxdepth 4 -name 'chrome' -type f -executable -path '*/chrome-linux/*' 2>/dev/null | head -1)
-            if [ -z "$CHROMIUM" ]; then
-              echo "ERROR: chromium binary not found under /ms-playwright" >&2
+            # Real Google Chrome (proprietary H.264/AAC codecs) baked into the
+            # chrome-service-browser image at a fixed path — so H.264 video
+            # (Reels) plays in the noVNC view. The bundled Chromium under
+            # /ms-playwright lacks those codecs (MEDIA_ERR_SRC_NOT_SUPPORTED).
+            CHROMIUM=/opt/google/chrome/chrome
+            if [ ! -x "$CHROMIUM" ]; then
+              echo "ERROR: google-chrome not found at $CHROMIUM (wrong image?)" >&2
              exit 1
            fi
-            echo "[chrome-service] using chromium: $CHROMIUM"
+            echo "[chrome-service] using browser: $($CHROMIUM --version 2>/dev/null || echo "$CHROMIUM")"

            # -listen tcp enables localhost:6099 so the noVNC sidecar can
            # attach over the pod's shared network ns (Ubuntu 24.04
@ -252,6 +256,8 @@ resource "kubernetes_deployment" "chrome_service" {
              --disable-dev-shm-usage \
              --password-store=basic \
              --use-mock-keychain \
+              --window-position=0,0 \
+              --window-size=1280,720 \
              about:blank
            EOT
          ]
@ -326,6 +332,14 @@ resource "kubernetes_deployment" "chrome_service" {
          # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
          image             = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
          image_pull_policy = "IfNotPresent"
+          # Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
+          # nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
+          # so every VNC connection hangs on "Connecting" until it times out
+          # (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
+          # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
+          # isn't guaranteed to be pulled — this wrapper applies the cap
+          # deterministically on every rollout off the cached image.
+          command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
          port {
            name           = "http"
            container_port = 6080
--- a/stacks/ci-pipeline-health/main.tf
+++ b/stacks/ci-pipeline-health/main.tf
@ -50,7 +50,7 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
 # the alias could not do. Blast radius = this single-CronJob namespace.
 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "ci-pipeline-health-creds"
--- a/stacks/claude-agent-service/main.tf
+++ b/stacks/claude-agent-service/main.tf
@ -39,7 +39,7 @@ resource "kubernetes_namespace" "claude_agent" {

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "claude-agent-secrets"
--- a/stacks/claude-breakglass/main.tf
+++ b/stacks/claude-breakglass/main.tf
@ -58,7 +58,7 @@ resource "kubernetes_service_account" "breakglass" {
 # pod can never read it.
 resource "kubernetes_manifest" "external_secret_ssh" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "breakglass-ssh"
@ -83,7 +83,7 @@ resource "kubernetes_manifest" "external_secret_ssh" {
 # same account) and the app bearer token (in-cluster/CLI fallback caller auth).
 resource "kubernetes_manifest" "external_secret_env" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "breakglass-env"
--- a/stacks/claude-memory/main.tf
+++ b/stacks/claude-memory/main.tf
@ -30,7 +30,7 @@ resource "kubernetes_namespace" "claude-memory" {

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "claude-memory-secrets"
@ -58,7 +58,7 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (rotated every 24h)
 resource "kubernetes_manifest" "db_external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "claude-memory-db-creds"
--- a/stacks/coturn/.terraform.lock.hcl
+++ b/stacks/coturn/.terraform.lock.hcl
@ -24,43 +24,33 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

-provider "registry.terraform.io/hashicorp/helm" {
-  version = "3.1.1"
+provider "registry.terraform.io/gavinbunney/kubectl" {
+  version     = "1.19.0"
+  constraints = "~> 1.14"
  hashes = [
-    "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
-    "h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
-    "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
-    "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
-    "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
-    "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
-    "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
-    "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
-    "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
-    "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
-    "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
-    "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
-    "zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
+  ]
+}
+
+provider "registry.terraform.io/goauthentik/authentik" {
+  version     = "2024.12.1"
+  constraints = "~> 2024.10"
+  hashes = [
+    "h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
+  ]
+}
+
+provider "registry.terraform.io/hashicorp/helm" {
+  version = "3.2.0"
+  hashes = [
+    "h1:CTWVDQxyq1cQJR/9RJSkXii/gz5zMjxszSw9LmptDh4=",
  ]
 }

 provider "registry.terraform.io/hashicorp/kubernetes" {
-  version = "3.0.1"
+  version = "3.2.0"
  hashes = [
-    "h1:P0c8knzZnouTNFIRij8IS7+pqd0OKaFDYX0j4GRsiqo=",
-    "h1:vyHdH0p6bf9xp1NPePObAJkXTJb/I09FQQmmevTzZe0=",
-    "zh:02d55b0b2238fd17ffa12d5464593864e80f402b90b31f6e1bd02249b9727281",
-    "zh:20b93a51bfeed82682b3c12f09bac3031f5bdb4977c47c97a042e4df4fb2f9ba",
-    "zh:6e14486ecfaee38c09ccf33d4fdaf791409f90795c1b66e026c226fad8bc03c7",
-    "zh:8d0656ff422df94575668e32c310980193fccb1c28117e5c78dd2d4050a760a6",
-    "zh:9795119b30ec0c1baa99a79abace56ac850b6e6fbce60e7f6067792f6eb4b5f4",
-    "zh:b388c87acc40f6bd9620f4e23f01f3c7b41d9b88a68d5255dec0a72f0bdec249",
-    "zh:b59abd0a980649c2f97f172392f080eaeb18e486b603f83bf95f5d93aeccc090",
-    "zh:ba6e3060fddf4a022087d8f09e38aa0001c705f21170c2ded3d1c26c12f70d97",
-    "zh:c12626d044b1d5501cf95ca78cbe507c13ad1dd9f12d4736df66eb8e5f336eb8",
-    "zh:c55203240d50f4cdeb3df1e1760630d677679f5b1a6ffd9eba23662a4ad05119",
-    "zh:ea206a5a32d6e0d6e32f1849ad703da9a28355d9c516282a8458b5cf1502b2a1",
-    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "h1:OjMar8kVp0LcDtwgRs877g/K/KPAEDVhFewpE3Tp7l8=",
  ]
 }

@ -84,3 +74,11 @@ provider "registry.terraform.io/hashicorp/vault" {
    "zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
  ]
 }
+
+provider "registry.terraform.io/telmate/proxmox" {
+  version     = "3.0.2-rc07"
+  constraints = "3.0.2-rc07"
+  hashes = [
+    "h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
+  ]
+}
--- a/stacks/coturn/main.tf
+++ b/stacks/coturn/main.tf
@ -6,7 +6,7 @@ variable "public_ip" { type = string }

 resource "kubernetes_manifest" "external_secret" {
  manifest = {
-    apiVersion = "external-secrets.io/v1beta1"
+    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "coturn-secrets"