backup-mx: pivot to self-hosted Oracle relay; challenge-hardened design v3

Rollernet's free tier failed the validation gates before any DNS change: 200 msgs / 10 MB per rolling week, then 48h of SMTP 5xx bounces, which is worse than no backup MX, and free accounts are being discontinued. Viktor chose to stay free, so the backup MX becomes a Postfix store-and-forward relay on an Oracle Always-Free VM (mx2.viktorbarzin.me, MX pref 20), draining via port 2526 through the existing pfSense HAProxy frontend since Oracle blocks egress 25.

Two independent adversarial reviews then fixed the design: primary-side drain enablement moved to the layers that actually reject (unknown-client-hostname, spoof protection, anvil limits, rspamd reject tier -> external_relay + action cap, never backscatter), monitoring moved off the nonexistent cluster->tailnet path to allowlisted public-IP scrapes, bounce lifetime cut to 1d (the VM can never deliver DSNs), OCI OS-level iptables + reserved-IP + mandatory PAYG requirements added, and 4xx-only postscreen hygiene replaces the blanket no-filtering stance.

ADR-0019 and the design doc renamed accordingly (rollernet -> oracle).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
backup-mx design: credentials to Vaultwarden, not Vault KV
2026-07-04 13:38:39 +00:00 · 2026-07-04 12:55:43 +00:00 · 2026-07-04 10:14:44 +00:00 · 2026-07-04 09:59:16 +00:00 · 2026-07-04 09:31:32 +00:00 · 2026-07-04 08:44:04 +00:00
420 changed files with 42390 additions and 10698 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
--- a/.claude/home-assistant-sofia.py
+++ b/.claude/home-assistant-sofia.py
@@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
+import subprocess
 import sys
 from urllib.parse import urljoin

@@ -17,13 +18,29 @@ except ImportError:
    print("  pip install requests")
    sys.exit(1)

-# Configuration from environment variables (ha-sofia specific)
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")

-if not HA_URL or not HA_TOKEN:
-    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
-    print("These should be set when activating the Claude venv (~/.venvs/claude)")
+def _token_from_homelab():
+    """Resolve the token via the homelab CLI when the env var isn't set, so the
+    script works from any directory / unprovisioned session (see ADR-0012)."""
+    try:
+        out = subprocess.run(
+            ["homelab", "ha", "token", "--instance", "sofia"],
+            capture_output=True, text=True, timeout=30)
+        if out.returncode == 0 and out.stdout.strip():
+            return out.stdout.strip()
+    except Exception:
+        pass
+    return None
+
+
+# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
+# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
+
+if not HA_TOKEN:
+    print("ERROR: no ha-sofia API token available.")
+    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
    sys.exit(1)

 HEADERS = {
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:

 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
+| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |

@@ -177,6 +178,13 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
+
+## WebAuthn / Passkeys (2026-06-20)
+
+- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
+- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
+- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
+- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
 - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
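Review note on the passkey-wipe lesson in the hunk above: the cleanup rule ("exact-match the demo user, never `users[0]` of a fuzzy `?search=`") can be sketched as a small guard. This is a minimal illustration, not the original ad-hoc script (which was never committed). The helper name `pick_exact_user` and the shape of the search results (a list of dicts with `pk` and `username`) are assumptions made for the example.

```python
def pick_exact_user(search_results, username):
    """Return the single result whose username equals `username` exactly.

    `search_results` is assumed to be a list of dicts with a "username" key,
    as a fuzzy user search might return them (hypothetical shape).
    """
    matches = [u for u in search_results if u.get("username") == username]
    if len(matches) != 1:
        # Refuse to guess: zero or several exact matches both abort the cleanup.
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search response in which a real account sorts first.
results = [
    {"pk": 1, "username": "demo-owner"},
    {"pk": 7, "username": "demo"},
]

target = pick_exact_user(results, "demo")
print(target["pk"])  # prints 7, not the pk of results[0]
```

Raising instead of falling back to the first hit means a cleanup step stops before any `DELETE` is issued when the demo user is missing or ambiguous.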
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---

 # Home Assistant Control
@@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
 - Environment variables for each instance:
  - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
  - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
+  - If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
+
+## homelab CLI (preferred — works from any directory)
+- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
+- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
+- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.

 ## API Control

@@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -418,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).

 ### SSH Access
 ```bash
--- a/.claude/workflows/memory-overcommit-node-removal.workflow.js
+++ b/.claude/workflows/memory-overcommit-node-removal.workflow.js
@@ -0,0 +1,203 @@
+export const meta = {
+  name: 'memory-overcommit-node-removal',
+  description: 'Read-only: assess PVE host + k8s memory overcommit, right-size deployment REQUESTS (scheduling) and LIMITS (OOM) separately from 30d usage, then test whether one worker node can be removed while preserving N-1 by BOTH a physical-usage and a scheduling-request model. Emits a gated plan.',
+  phases: [
+    { title: 'Gather' },
+    { title: 'Model' },
+    { title: 'Verify' },
+  ],
+}
+
+// ---------- confirmed read-only access paths ----------
+const SSH = "ssh -o BatchMode=yes -o ConnectTimeout=8 root@192.168.1.127";
+const PROM = "https://prometheus-query.viktorbarzin.lan/api/v1/query";
+const G = (mib) => (mib == null ? "?" : (mib / 1024).toFixed(1) + "Gi");
+
+// ---------- schema helpers ----------
+const num = { type: "number" }, str = { type: "string" }, bool = { type: "boolean" };
+const arr = (items) => ({ type: "array", items });
+const obj = (props) => ({ type: "object", additionalProperties: false, required: Object.keys(props), properties: props });
+
+const HOST = obj({
+  host_total_mib: num, host_used_mib: num, host_free_mib: num, host_available_mib: num,
+  swap_total_mib: num, swap_used_mib: num, ksm_saved_mib: num,
+  vms: arr(obj({ vmid: num, name: str, configured_mib: num, balloon_mib: num, rss_mib: num, is_k8s_node: bool })),
+  sum_vm_configured_mib: num, sum_vm_rss_mib: num, notes: str,
+});
+
+const K8S = obj({
+  nodes: arr(obj({
+    name: str, role: str, is_gpu: bool, is_control_plane: bool, gpu_tainted: bool, schedulable: bool,
+    capacity_mib: num, allocatable_mib: num, requests_mib: num, ds_requests_mib: num, limits_mib: num, usage_now_mib: num, peak_30d_mib: num, pod_count: num,
+  })),
+  cluster_allocatable_mib: num, cluster_requests_mib: num, cluster_usage_now_mib: num, cluster_peak_30d_mib: num, notes: str,
+});
+
+// NOTE the v2 split: requests are sized for SCHEDULING (cover normal load, can shrink below current),
+// limits are sized for OOM SAFETY (cover peak). They are DIFFERENT knobs and must not be conflated.
+const USAGE = obj({
+  totals: obj({
+    sum_current_requests_mib: num, sum_recommended_requests_mib: num, net_request_reclaim_mib: num,
+    reschedulable_request_recommended_mib: num, ds_request_recommended_per_node_mib: num, gpu_request_recommended_mib: num,
+    largest_single_request_mib: num, count_request_shrink: num, count_limit_raise_oom: num,
+  }),
+  request_shrinks: arr(obj({ namespace: str, name: str, kind: str, replicas: num, current_request_mib: num, p95_30d_mib: num, recommended_request_mib: num, delta_mib: num, rationale: str })),
+  limit_raises_oom: arr(obj({ namespace: str, name: str, container: str, current_limit_mib: num, peak_max_30d_mib: num, recommended_limit_mib: num, risk: str })),
+  spiky_periodic: arr(obj({ namespace: str, name: str, note: str })),
+  method_notes: str,
+});
+
+const TOPO = obj({
+  nodes: arr(obj({ name: str, sticky_pods: arr(str), local_pv_count: num, volumeattachments: num, cnpg_primary: bool, gpu_workloads: bool, evac_difficulty: str, evac_notes: str })),
+  spofs: arr(obj({ namespace: str, name: str, replicas: num, has_pdb: bool, issue: str })),
+  antiaffinity_risks: arr(str),
+  csi_pinning_note: str,
+  priority_classes_note: str,
+  notes: str,
+});
+
+const VERDICT = obj({ refuted: bool, confidence: str, reasoning: str, corrections: arr(str) });
+
+// ---------- prompts ----------
+const HOST_PROMPT = `Read-only PVE host memory audit. SSH (key-based): ${SSH} '<cmd>'  (host 'pve', the Proxmox r730 at 192.168.1.127). Read-only ONLY; NEVER a state-changing qm/pvesh/ha-manager command.
+- 'free -m' -> host_total/used/free/available_mib + swap_total/swap_used_mib.
+- KSM: cat /sys/kernel/mm/ksm/pages_sharing ; ksm_saved_mib = pages_sharing*4096/1048576.
+- 'qm list'; for each running VM 'qm config <vmid>' -> memory (configured_mib), balloon (balloon_mib; if balloon==memory or balloon==0 ballooning is effectively OFF -> host RSS pins near configured = the headroom RATCHET).
+- Per-VM host RSS: read /var/run/qemu-server/<vmid>.pid then 'ps -o rss= -p <pid>' (KiB->MiB).
+- is_k8s_node = VMs named k8s-*.
+Return per-VM rows + sum_vm_configured_mib + sum_vm_rss_mib over ALL RUNNING VMs. notes: overcommit ratio, swap pressure, ballooning state.`;
+
+const K8S_PROMPT = `Read-only Kubernetes node-capacity audit. kubectl read access confirmed. For every node (k8s-master + k8s-node1..6):
+- capacity_mib & allocatable_mib from 'kubectl get node <n> -o json' (Ki->MiB).
+- is_control_plane (node-role.kubernetes.io/control-plane), is_gpu (k8s-node1; nvidia.com/gpu in capacity), gpu_tainted (a NoSchedule taint general pods would NOT tolerate), schedulable.
+- requests_mib, limits_mib, ds_requests_mib (DaemonSet-owned pods only), usage_now_mib, pod_count.
+  Prefer Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
+    sum by (node)(kube_pod_container_resource_requests{resource="memory"})    [these metrics HAVE a node label]
+    usage_now: cAdvisor container_memory_working_set_bytes has NO node label - join: sum by (node)(container_memory_working_set_bytes{container!="",container!="POD"} * on(namespace,pod) group_left(node) kube_pod_info)
+- peak_30d_mib per node: max_over_time of that joined per-node sum over [30d:5m] (best effort; if the join is flaky leave 0 and rely on cluster figure).
+ALSO return cluster-wide:
+- cluster_allocatable_mib, cluster_requests_mib, cluster_usage_now_mib.
+- cluster_peak_30d_mib = max_over_time(sum(container_memory_working_set_bytes{container!="",container!="POD"})[30d:5m]) /1024/1024  (this is the PHYSICAL reliability bedrock - the highest the whole cluster ever simultaneously used in 30d).
+notes: host-vs-k8s overcommit contrast (requests vs allocatable vs actual usage).`;
+
+const USAGE_PROMPT = `Read-only memory RIGHT-SIZING from 30-day usage. CRITICAL: requests and limits are DIFFERENT knobs - size them separately. Do NOT set requests to peak (that is what a flawed earlier run did; it manufactured a false capacity shortfall).
+- REQUEST (scheduling reservation, drives bin-packing & node-removal feasibility): size to cover NORMAL operation = recommended_request_mib = ceil(max(p95_30d * 1.15, 64)). This SHRINKS the many over-provisioned requests toward real usage. requests should sit BELOW limits (Burstable). Be moderately conservative for stateful/db/critical infra (mysql, postgres/CNPG, redis, vault, prometheus, mailserver): use p99 instead of p95.
+- LIMIT (OOM ceiling): recommended_limit_mib = ceil(peak_max_30d * 1.25). FLAG any container whose peak_max_30d >= 95% of current limit as an OOM risk (limit_raises_oom) - these are real reliability bugs to fix REGARDLESS of node removal.
+
+Sources: kubectl (current requests/limits/replicas for Deployments/StatefulSets/DaemonSets, all namespaces); Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
+  p95: quantile_over_time(0.95, container_memory_working_set_bytes{container!="",container!="POD"}[30d])
+  p99: quantile_over_time(0.99, ...[30d])
+  peak: max_over_time(...[30d])
+  Aggregate by (namespace,pod,container), map pod->workload (strip hash suffixes), take MAX across a workload's pods as per-replica value.
+
+Splits for the N-1 model (use the REQUEST recommendation; multiply per-replica by replicas):
+- reschedulable_request_recommended_mib = SUM recommended_request of Deployment+StatefulSet pods that are NON-GPU and schedulable on general workers (everything that must reschedule if a worker is removed).
+- ds_request_recommended_per_node_mib = SUM recommended_request of DaemonSet containers (one set per node).
+- gpu_request_recommended_mib = SUM recommended_request of workloads pinned to GPU node k8s-node1 (REAL value; do not inflate).
+- largest_single_request_mib = largest single recommended per-replica request among reschedulable.
+Return totals (sum_current_requests_mib, sum_recommended_requests_mib, net_request_reclaim_mib = sum of POSITIVE request deltas i.e. shrinks, the splits, count_request_shrink, count_limit_raise_oom), request_shrinks (top ~30 by delta), limit_raises_oom (every OOM-tight container), spiky_periodic (mailserver/immich-ml/backups/dumps/postiz). NEVER mutate.`;
+
+const TOPO_PROMPT = `Read-only reliability-topology audit: which worker is safest to remove? Candidates: k8s-node2..node6 (NOT master, NOT GPU node1). For each worker (k8s-node1..6): sticky_pods (StatefulSet members; pods with local/hostPath PVCs; single-replica critical), local_pv_count, volumeattachments, cnpg_primary (CNPG 'pg-cluster' PRIMARY here? check pod role labels), gpu_workloads, evac_difficulty (easy|medium|hard)+evac_notes.
+Cluster-wide: spofs (1 replica AND no PDB); antiaffinity_risks (hard podAntiAffinity / topologySpread DoNotSchedule that becomes UNSATISFIABLE at one fewer worker - check replica counts vs surviving distinct hosts); csi_pinning_note (do Proxmox-CSI PVs pin to a node, or share one host-level topology so they reattach anywhere? check volumeHandle / topology zone/region on the PVs - this decides whether removal STRANDS data); priority_classes_note. NEVER mutate.`;
+
+// ============================================================
+phase('Gather');
+log('Gather (read-only): PVE host memory, k8s capacity + cluster 30d peak, request/limit right-sizing, reliability topology');
+const [host, k8s, usage, topo] = await parallel([
+  () => agent(HOST_PROMPT, { label: 'gather:pve-host', phase: 'Gather', schema: HOST }),
+  () => agent(K8S_PROMPT, { label: 'gather:k8s-capacity', phase: 'Gather', schema: K8S }),
+  () => agent(USAGE_PROMPT, { label: 'gather:rightsize', phase: 'Gather', schema: USAGE }),
+  () => agent(TOPO_PROMPT, { label: 'gather:reliability', phase: 'Gather', schema: TOPO }),
+]);
+if (!k8s || !usage) return { error: 'Critical gather agent failed (k8s/usage).', host, k8s, usage, topo };
+
+// ============================================================
+phase('Model');
+const T = usage.totals;
+const workers = k8s.nodes.filter((n) => !n.is_control_plane);
+const generalPool = workers.filter((n) => !n.gpu_tainted);            // general pods can land here (incl. GPU node if not tainted)
+const candidates = workers.filter((n) => !n.is_gpu && !n.is_control_plane); // node2..node6
+const clusterPeak = k8s.cluster_peak_30d_mib || 0;
+
+const freeGeneral = (n) => n.allocatable_mib - (T.ds_request_recommended_per_node_mib || 0) - (n.is_gpu ? (T.gpu_request_recommended_mib || 0) : 0);
+
+function evalRemove(removeName) {
+  const pool = generalPool.filter((n) => n.name !== removeName);
+  // --- scheduling N-1 (realistic requests): fit reschedulable load even if the largest survivor then fails ---
+  const frees = pool.map(freeGeneral);
+  const schedCap = frees.reduce((a, b) => a + b, 0) - (frees.length ? Math.max(...frees) : 0);
+  const schedNeed = T.reschedulable_request_recommended_mib;
+  const schedMargin = schedCap - schedNeed;
+  // --- physical N-1 (actual peak usage): cluster 30d peak must fit on survivors after losing the largest too ---
+  const survAlloc = pool.map((n) => n.allocatable_mib);
+  const physCap = survAlloc.reduce((a, b) => a + b, 0) - (survAlloc.length ? Math.max(...survAlloc) : 0);
+  const physMargin = physCap - clusterPeak;
+  const t = topo && topo.nodes ? topo.nodes.find((n) => n.name === removeName) : null;
+  return {
+    removeName, pool: pool.map((n) => n.name),
+    sched_capacityN1_mib: Math.round(schedCap), sched_need_mib: Math.round(schedNeed), sched_margin_mib: Math.round(schedMargin), sched_pass: schedMargin >= 0,
+    phys_capacityN1_mib: Math.round(physCap), cluster_peak_mib: Math.round(clusterPeak), phys_margin_mib: Math.round(physMargin), phys_pass: physMargin >= 0,
+    pass: schedMargin >= 0 && physMargin >= 0,
+    host_freed_mib: hostFreedFor(removeName),
+    evac_difficulty: t ? t.evac_difficulty : 'unknown', cnpg_primary: t ? t.cnpg_primary : false, sticky_pods: t ? t.sticky_pods : [],
+  };
+}
+function hostFreedFor(nodeName) {
+  if (host && host.vms) {
+    const s = nodeName.replace('k8s-', '');
+    const vm = host.vms.find((v) => v.name === nodeName || (v.name && v.name.includes(s)));
+    if (vm) return vm.configured_mib;
+  }
+  const n = k8s.nodes.find((x) => x.name === nodeName);
+  return n ? n.capacity_mib : 0;
+}
+
+const evalCandidates = candidates.map((c) => evalRemove(c.name));
+const diffRank = { easy: 0, medium: 1, hard: 2, unknown: 3 };
+const passing = evalCandidates.filter((c) => c.pass && !c.cnpg_primary)
+  .sort((a, b) => (diffRank[a.evac_difficulty] - diffRank[b.evac_difficulty]) || (b.phys_margin_mib - a.phys_margin_mib));
+const best = passing[0] || null;
+
+const hostOvercommit = host ? { sum_vm_configured_mib: host.sum_vm_configured_mib, host_total_mib: host.host_total_mib, ratio: +(host.sum_vm_configured_mib / host.host_total_mib).toFixed(3), free_mib: host.host_free_mib, available_mib: host.host_available_mib, swap_used_mib: host.swap_used_mib, swap_total_mib: host.swap_total_mib, ksm_saved_mib: host.ksm_saved_mib } : null;
+const k8sOvercommit = { cluster_requests_mib: k8s.cluster_requests_mib, cluster_allocatable_mib: k8s.cluster_allocatable_mib, cluster_usage_now_mib: k8s.cluster_usage_now_mib, cluster_peak_30d_mib: clusterPeak, request_ratio: +(k8s.cluster_requests_mib / k8s.cluster_allocatable_mib).toFixed(3), usage_ratio: +(clusterPeak / k8s.cluster_allocatable_mib).toFixed(3) };
+
+log(`Host overcommit ${hostOvercommit ? hostOvercommit.ratio : '?'}x (${G(hostOvercommit && hostOvercommit.free_mib)} free, swap ${G(hostOvercommit && hostOvercommit.swap_used_mib)}/${G(hostOvercommit && hostOvercommit.swap_total_mib)})`);
+log(`K8s: requests ${G(k8s.cluster_requests_mib)} / 30d-peak-usage ${G(clusterPeak)} / allocatable ${G(k8s.cluster_allocatable_mib)} -> requests are ${(k8s.cluster_requests_mib / clusterPeak).toFixed(2)}x real peak`);
+log(`Request right-sizing: ${G(T.net_request_reclaim_mib)} of over-provisioned requests can be trimmed (${T.count_request_shrink} workloads); ${T.count_limit_raise_oom} workloads are OOM-tight on LIMITS (raise regardless).`);
+for (const c of evalCandidates) log(`  remove ${c.removeName}: phys-N1 ${c.phys_pass ? 'PASS' : 'FAIL'} (${G(c.phys_margin_mib)}) | sched-N1 ${c.sched_pass ? 'PASS' : 'FAIL'} (${G(c.sched_margin_mib)}) | frees ~${G(c.host_freed_mib)} host | evac ${c.evac_difficulty}${c.cnpg_primary ? ' CNPG-PRIMARY' : ''}`);
+log(best ? `Best candidate: ${best.removeName} (phys margin ${G(best.phys_margin_mib)}, frees ~${G(best.host_freed_mib)})` : 'No candidate passes both N-1 tests.');
+
+// ============================================================
+phase('Verify');
+const headline = best
+  ? `${best.removeName} can be removed while preserving N-1: cluster 30d peak usage ${G(clusterPeak)} fits on survivors-minus-one (${G(best.phys_capacityN1_mib)}); after trimming over-provisioned requests, scheduling also fits (${G(best.sched_margin_mib)} margin). Frees ~${G(best.host_freed_mib)} to the PVE host.`
+  : `No worker can be removed while preserving N-1 by BOTH physical-usage and scheduling-request models.`;
+const verifyData = JSON.stringify({ hostOvercommit, k8sOvercommit, k8s_nodes: k8s.nodes, usage_totals: T, evalCandidates, best, csi_pinning_note: topo ? topo.csi_pinning_note : null, generalPool: generalPool.map((n) => n.name) }, null, 2);
+const lenses = [
+  { key: 'math', ask: 'Recompute BOTH N-1 models independently. Physical: cluster 30d peak vs (sum survivor allocatable - largest survivor). Scheduling: reschedulable recommended REQUESTS (not limits, not peak) vs (sum survivor freeGeneral - largest). Verify GPU node reserve uses REAL gpu requests, allocatable not capacity, DaemonSets are per-node fixed load. Are pool selection and numbers right?' },
+  { key: 'temporal', ask: 'Challenge the 30-DAY peak window and the request shrinks. Could a monthly/quarterly peak exceed cluster_peak_30d (compare a 90d peak)? Are the shrunk REQUESTS safe given each workload keeps a limit above its peak (Burstable)? Name any shrink or any still-tight limit that is reckless.' },
+  { key: 'stateful', ask: 'Check the chosen candidate for STRANDED state and drain blockers: CSI PV pinning (do volumes reattach anywhere?), CNPG primary, VolumeAttachment caps, anti-affinity/topologySpread unsatisfiable at one fewer worker, PDBs that block drain (disruptionsAllowed=0). Is removal actually safe, and what drain ORDERING is required?' },
+];
+const verdicts = (await parallel(lenses.map((l) => () =>
+  agent(`Adversarial reviewer. Try to REFUTE:\n"${headline}"\n\nLens: ${l.ask}\n\nData (read-only). Verify LIVE: kubectl, Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=...'), ${SSH} '<cmd>'.\n\n${verifyData}\n\nDefault refuted=true if evidence does not clearly hold. Give concrete corrections.`,
+    { label: `verify:${l.key}`, phase: 'Verify', schema: VERDICT }))
+)).filter(Boolean);
+
+return {
+  headline,
+  hostOvercommit, k8sOvercommit,
+  rightsizing: T,
+  request_shrinks: usage.request_shrinks,
+  limit_raises_oom: usage.limit_raises_oom,
+  spiky_periodic: usage.spiky_periodic,
+  candidates: evalCandidates,
+  recommendation: best,
+  k8s_nodes: k8s.nodes,
+  host_vms: host ? host.vms : null,
+  topo_spofs: topo ? topo.spofs : [],
+  topo_nodes: topo ? topo.nodes : [],
+  csi_pinning_note: topo ? topo.csi_pinning_note : null,
+  antiaffinity_risks: topo ? topo.antiaffinity_risks : [],
+  verdicts,
+  verdict_summary: `${verdicts.filter((v) => v.refuted).length}/${verdicts.length} reviewers refuted the headline`,
+};
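The N-1 checks in the script above both reduce to the same arithmetic: capacity of the survivors minus the largest survivor, compared against a need (30d peak usage for the physical model, recommended requests for the scheduling model). A minimal sketch of that arithmetic follows; `n1_margin` is a hypothetical helper name, not a function from the script, and the MiB figures in the usage comment are invented for illustration.

```shell
# n1_margin NEED CAP [CAP...]
# Hypothetical helper: prints (sum of CAPs - largest CAP) - NEED, integers in MiB.
# A non-negative result means NEED still fits after one more survivor is lost.
n1_margin() {
  need=$1
  shift
  sum=0
  largest=0
  for cap in "$@"; do
    sum=$((sum + cap))
    if [ "$cap" -gt "$largest" ]; then
      largest=$cap
    fi
  done
  echo $((sum - largest - need))
}

# Usage (made-up numbers): three survivors, 30000 MiB of need.
# n1_margin 30000 16000 16000 24000
```

The script treats a candidate as passing only when both margins are non-negative, so a caller would run this once per model and require both results to be at least zero.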
--- a/.gitattributes
+++ b/.gitattributes
@ -4,3 +4,12 @@
 *.tfvars filter=git-crypt diff=git-crypt
 secrets/** filter=git-crypt diff=git-crypt
 stacks/**/secrets/** filter=git-crypt diff=git-crypt
+
+# Kubeconfigs / cluster credentials — encrypt at rest so a force-added or renamed
+# commit can't push plaintext to the public GitHub mirror. Belt-and-suspenders to
+# the .gitignore rules above; `.config` is explicit because that is exactly the
+# name an admin kubeconfig once leaked under (GitGuardian, 2026-07-02).
+.config filter=git-crypt diff=git-crypt
+kubeconfig filter=git-crypt diff=git-crypt
+*.kubeconfig filter=git-crypt diff=git-crypt
+admin.conf filter=git-crypt diff=git-crypt
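The four patterns added above contain no slash, so they match on the file's basename at any depth in the tree. A rough sketch of which names they cover, using a shell `case` as a stand-in for Git's own matcher (`credential_coverage` is a hypothetical name; the authoritative check on a real checkout is `git check-attr filter -- <path>`):

```shell
# credential_coverage PATH
# Hypothetical approximation of the new .gitattributes patterns:
# prints "covered" when the basename matches one of them, else "not-covered".
credential_coverage() {
  case "$(basename "$1")" in
    .config|kubeconfig|*.kubeconfig|admin.conf) echo covered ;;
    *) echo not-covered ;;
  esac
}
```

Note that the older bare `config` name is handled by `.gitignore` rather than by these attribute lines, so the sketch reports it as not covered.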
--- a/.github/workflows/build-authentik.yml
+++ b/.github/workflows/build-authentik.yml
@ -0,0 +1,39 @@
+name: Build Custom Authentik Image
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
+# Thin SLOW-1a overlay over the official authentik server (narrows the login
+# identification stage's select_subclasses() to the login-capable source subtypes;
+# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
+# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
+# in modules/authentik/values.yaml together.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/authentik/Dockerfile'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/authentik
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
+            ghcr.io/viktorbarzin/authentik-server:latest
--- a/.github/workflows/build-chrome-service-browser.yml
+++ b/.github/workflows/build-chrome-service-browser.yml
@ -0,0 +1,39 @@
+name: Build chrome-service-browser
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
+# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
+# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
+# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
+# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
+# the pod pulls it without credentials.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/chrome/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/chrome
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-browser:latest
+            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
--- a/.github/workflows/build-excalidraw.yml
+++ b/.github/workflows/build-excalidraw.yml
@ -0,0 +1,42 @@
+name: Build excalidraw-library
+
+# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
+# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
+# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
+# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/excalidraw/project/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21'
+      - run: go test ./...
+        working-directory: stacks/excalidraw/project
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/excalidraw/project
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/excalidraw-library:latest
+            ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}
--- a/.github/workflows/build-valia-sites-sync.yml
+++ b/.github/workflows/build-valia-sites-sync.yml
@ -0,0 +1,39 @@
+name: Build valia-sites-sync
+
+# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
+# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
+# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
+# Security note: no untrusted event inputs are interpolated anywhere (only
+# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
+# build-*.yml workflows in this repo).
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/valia-sites/sync-image/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/valia-sites/sync-image
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/valia-sites-sync:latest
+            ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}
--- a/.gitignore
+++ b/.gitignore
@ -71,8 +71,15 @@ stacks/*/cloudflare_provider.tf
 stacks/*/tiers.tf
 stacks/*/terragrunt_rendered.json

-# Kubernetes config (sensitive)
+# Kubernetes config / cluster credentials (sensitive) — never commit in plaintext.
+# `config` alone missed the dotfile form: an admin kubeconfig once leaked to the
+# public mirror as `.config` (GitGuardian, 2026-07-02). Cover the common names.
 config
+.config
+kubeconfig
+*.kubeconfig
+admin.conf
+.kube/

 # Node.js (not part of infra)
 node_modules/
@ -110,3 +117,9 @@ terraform.tfstate.backup
 # Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
 # secrets; created by terraform state ops. The patterns above miss the timestamped form.
 terraform.tfstate.*.backup
+
+# Python test artifacts (pytest bytecode cache) — e.g. from
+# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
+__pycache__/
+*.pyc
+.pytest_cache/
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@ -19,6 +19,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 2
      attempts: 5
      backoff: 10s
@ -64,6 +65,21 @@ steps:
      # don't need explicit token propagation.
      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
    commands:
+      # ── Forge guard: apply ONLY on the canonical Forgejo forge ──
+      # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
+      # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
+      # guard both run `terragrunt apply` on every push and race each other for
+      # the per-stack PG state lock — the dominant cause of the "Error acquiring
+      # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
+      # registration keeps running the CRONS (drift-detection, renew-tls, …) — only
+      # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
+      # env var set) still applies, preserving prior behaviour.
+      - |
+        if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
+          echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
+          exit 0
+        fi
+
      # ── Skip CI commits ──
      - |
        if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -212,23 +228,40 @@ steps:
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
+            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
+            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
+            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
+            # (so the app-stack detector still excludes it) but skipped here.
+            # (2026-06-27 — see docs/architecture/ci-cd.md)
+            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
            echo "[$stack] Starting apply..."
-            set +e
-            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
-            EXIT=$?
-            set -e
-            if [ $EXIT -ne 0 ]; then
-              if echo "$OUTPUT" | grep -q "is locked by"; then
-                echo "[$stack] SKIPPED (locked by another session)"
-              else
-                echo "$OUTPUT" | tail -50
-                echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
+            ATTEMPT=0
+            while :; do
+              ATTEMPT=$((ATTEMPT + 1))
+              set +e
+              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
+              EXIT=$?
+              set -e
+              if [ $EXIT -eq 0 ]; then
+                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
-            else
-              echo "$OUTPUT" | tail -3
-              echo "[$stack] OK"
-            fi
+              # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
+              # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
+              # ("Error acquiring the state lock" / "already locked"). The PG case
+              # was previously counted as a failure — the #1 source of false reds.
+              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
+                echo "[$stack] SKIPPED (locked by another session/run)"; break
+              fi
+              # Transient: provider-registry download timeout / Vault 5xx → bounded
+              # retry. Deliberately NOT helm atomic-timeouts or config errors
+              # (missing arg, invalid index) — those must fail fast, retry can't fix
+              # them and can worsen a stuck helm release.
+              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
+                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
+              fi
+              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
+              FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
+            done
          done < .platform_apply
        fi
        # Deferred until after app stacks so both lists get a chance to run.
@ -241,22 +274,27 @@ steps:
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
            echo "[$stack] Starting apply..."
-            set +e
-            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
-            EXIT=$?
-            set -e
-            if [ $EXIT -ne 0 ]; then
-              if echo "$OUTPUT" | grep -q "is locked by"; then
-                echo "[$stack] SKIPPED (locked by another session)"
-              else
-                echo "$OUTPUT" | tail -50
-                echo "[$stack] FAILED (exit $EXIT)"
-                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
+            ATTEMPT=0
+            while :; do
+              ATTEMPT=$((ATTEMPT + 1))
+              set +e
+              OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
+              EXIT=$?
+              set -e
+              if [ $EXIT -eq 0 ]; then
+                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
              fi
-            else
-              echo "$OUTPUT" | tail -3
-              echo "[$stack] OK"
-            fi
+              # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
+              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
+                echo "[$stack] SKIPPED (locked by another session/run)"; break
+              fi
+              # Transient provider-download / Vault 5xx → bounded retry (see platform loop).
+              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
+                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
+              fi
+              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
+              FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
+            done
          done < .app_apply
        fi
        # Fail the step loudly so the pipeline `default` workflow state
@ -286,13 +324,8 @@ steps:
        fi
        GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master

-      # ── Slack notification ──
-      - |
-        PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
-        APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
-          "$SLACK_WEBHOOK" || true
+      # (No Slack post on success — Viktor 2026-07-02: CI notifies on FAILED
+      # runs only; the notify-failure step below covers those.)

  # Slack on failure (runs even if apply step fails)
  - name: notify-failure
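The reworked apply loops in this file make a four-way decision per attempt: success, lock contention (skip), transient error (bounded retry), or hard failure. The sketch below isolates that decision as a pure function so it can be reasoned about separately from the loop; `classify_apply` is a hypothetical name, while the regexes and the three-attempt bound are copied from the diff.

```shell
# classify_apply EXIT_CODE ATTEMPT OUTPUT
# Hypothetical helper: prints one of ok | skip | retry | fail.
classify_apply() {
  exit_code=$1
  attempt=$2
  output=$3
  if [ "$exit_code" -eq 0 ]; then
    echo ok
    return
  fi
  # Lock contention (Tier-0 Vault lock or Tier-1 PG-backend lock) is a skip, not a failure.
  if printf '%s\n' "$output" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
    echo skip
    return
  fi
  # Transient provider-download / Vault 5xx errors are retried, at most 3 attempts in total.
  if [ "$attempt" -lt 3 ] && printf '%s\n' "$output" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
    echo retry
    return
  fi
  # Everything else (config errors, helm timeouts, exhausted retries) fails fast.
  echo fail
}
```

Checking the lock patterns before the transient patterns mirrors the order in the pipeline, so an output that mentions both is skipped rather than retried.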
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@ -9,6 +9,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

@ -84,6 +85,13 @@ steps:
          stack=$(basename "$stack_dir")
          [ -f "$stack_dir/terragrunt.hcl" ] || continue

+          # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
+          # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
+          # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
+          # run. Skip it — drift on Tier-0 vault is caught at human apply time.
+          # (2026-06-27)
+          [ "$stack" = "vault" ] && continue
+
          echo -n "[$stack] planning... "
          OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
          EXIT=$?
@ -139,13 +147,30 @@ steps:
        echo "Drift: ${DRIFTED:-none}"
        echo "Errors: ${ERRORS:-none}"

-        # ── Slack alert if drift found ──
+        # ── Slack only when something is WRONG (drift or errors) ──
+        # All-clean runs are silent (Viktor 2026-07-02: CI notifies on
+        # failed/actionable runs only; clean is the daily normal).
        if [ -n "$DRIFTED" ]; then
          curl -s -X POST -H 'Content-type: application/json' \
            --data "{\"channel\":\"general\",\"text\":\":warning: Drift detected in:${DRIFTED}\nClean: ${CLEAN} stacks. Errors:${ERRORS:-none}\"}" \
            "$SLACK_WEBHOOK" || true
-        else
+        elif [ -n "$ERRORS" ]; then
          curl -s -X POST -H 'Content-type: application/json' \
-            --data "{\"channel\":\"general\",\"text\":\":white_check_mark: Drift detection: all ${CLEAN} stacks clean${ERRORS:+. Errors: $ERRORS}\"}" \
+            --data "{\"channel\":\"general\",\"text\":\":red_circle: Drift detection had errors: ${ERRORS} (clean: ${CLEAN})\"}" \
            "$SLACK_WEBHOOK" || true
        fi
+
+  # Hard-failure catch: the in-script posts above never run if the step
+  # itself crashes early — this step is the only signal for that case.
+  - name: notify-failure
+    image: curlimages/curl
+    commands:
+      - |
+        curl -s -X POST -H 'Content-type: application/json' \
+          --data "{\"channel\":\"general\",\"text\":\":red_circle: Drift-detection pipeline FAILED (crashed before reporting)\"}" \
+          "$SLACK_WEBHOOK" || true
+    environment:
+      SLACK_WEBHOOK:
+        from_secret: slack_webhook
+    when:
+      status: [failure]
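The notification change in this file boils down to one rule: drift wins over errors, errors still alert, and an all-clean run posts nothing. A minimal sketch of that branch logic, with the Slack call left out; `drift_notice` is a hypothetical name and the message wording is simplified from the payloads in the diff.

```shell
# drift_notice DRIFTED ERRORS CLEAN
# Hypothetical helper: prints the alert text, or nothing at all for a clean run.
drift_notice() {
  drifted=$1
  errors=$2
  clean=$3
  if [ -n "$drifted" ]; then
    echo ":warning: Drift detected in: ${drifted} (clean: ${clean}, errors: ${errors:-none})"
  elif [ -n "$errors" ]; then
    echo ":red_circle: Drift detection had errors: ${errors} (clean: ${clean})"
  fi
}
```

A caller would post to the webhook only when the function prints something. A crash before this point produces no output either, which is the case the separate `notify-failure` step is there to catch.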
--- a/.woodpecker/issue-automation.yml
+++ b/.woodpecker/issue-automation.yml
@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 2

 steps:
--- a/.woodpecker/postmortem-todos.yml
+++ b/.woodpecker/postmortem-todos.yml
@ -11,6 +11,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 5

 steps:
@ -27,6 +28,7 @@ steps:
        from_secret: slack_webhook
    commands:
      - apk add --no-cache curl
-      - "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \"Post-mortem TODO pipeline completed\"}' || true"
+      - "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \":red_circle: Post-mortem TODO pipeline FAILED\"}' || true"
    when:
-      - status: [success, failure]
+      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
+      - status: [failure]
--- a/.woodpecker/provision-user.yml
+++ b/.woodpecker/provision-user.yml
@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      attempts: 5
      backoff: 10s

--- a/.woodpecker/pve-nfs-exports-sync.yml
+++ b/.woodpecker/pve-nfs-exports-sync.yml
@ -23,6 +23,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

@ -57,7 +58,8 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
+          --data "{\"channel\":\"general\",\"text\":\":red_circle: PVE /etc/exports sync FAILED\"}" \
          "$SLACK_WEBHOOK" || true
    when:
-      status: [success, failure]
+      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
+      status: [failure]
--- a/.woodpecker/registry-config-sync.yml
+++ b/.woodpecker/registry-config-sync.yml
@ -38,6 +38,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

@ -150,7 +151,8 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
+          --data "{\"channel\":\"general\",\"text\":\":red_circle: Registry config sync on 10.0.20.10 FAILED\"}" \
          "$SLACK_WEBHOOK" || true
    when:
-      status: [success, failure]
+      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
+      status: [failure]
--- a/.woodpecker/renew-tls.yml
+++ b/.woodpecker/renew-tls.yml
@ -6,6 +6,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      attempts: 5
      backoff: 10s

@ -70,10 +71,11 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: TLS certificate renewal ${CI_PIPELINE_STATUS}\"}" \
+          --data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: TLS certificate renewal FAILED\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
-      status: [success, failure]
+      # Failure-only (Viktor 2026-07-02): successful renewals are routine.
+      status: [failure]
--- a/AGENTS.md
+++ b/AGENTS.md
@ -9,7 +9,7 @@
 - **Ask before `git push`** — always confirm with the user first

 ## Execution
-- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
+- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
 - **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
 - **kubectl**: `kubectl --kubeconfig $(pwd)/config`
 - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@ -273,8 +273,11 @@ To land a finished change from such a clone:
   Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
 4. Leave the clone on clean `master` so auto-refresh keeps working.
 5. Tell the user in plain language what happened. Stack changes are
-   auto-applied by CI — verify the live result with the user's read-only
-   kubectl before saying "it's live".
+   auto-applied by CI on push — or, with apply access, applied locally yourself
+   (`scripts/tg apply`, from the main checkout, not a worktree); either path is
+   fine, but the change must always be committed here, never applied
+   uncommitted. Verify the live result with the user's read-only kubectl before
+   saying "it's live".

 If a push to `master` is rejected by branch protection (user not on the
 whitelist — e.g. new users before Viktor grants it), fall back to a
@ -289,6 +292,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```

 ## Common Operations
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -56,6 +56,28 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
 A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
 _Avoid_: bare "user", "tenant".

+### GPU sharing
+
+**GPU slice**:
+One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
+_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
+
+**GPU memory budget**:
+The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
+_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
+
+**GPU watchdog**:
+The `gpu-vram-watchdog` CronJob (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (`gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
+_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
+
+**GPU demand-gate**:
+The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
+_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
+
+**gpu-workload priority**:
+The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy; workloads on the policy's exclude list (`tts`) get `tier-2-gpu` (600,000) instead, so they are evicted first under node pressure. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
+_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
+
 ### Workstation (multi-user devvm)

 **devvm**:
@ -96,6 +118,14 @@ _Avoid_: "external", "outside".
 `viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
 _Avoid_: bare "lan", "private", "intranet".

+**Segment**:
+One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
+_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
+
+**CCTV segment**:
+The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
+_Avoid_: "camera VLAN", "CCTV LAN".
+
 **Ingress auth**:
 The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
 _Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -117,9 +147,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
 _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.

 **Calico**:
-The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
+The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
 _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.

+**Service identity**:
+How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
+_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
+
+**Goldmane / Whisker**:
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
+_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
+
 ### Storage

 **proxmox-lvm-encrypted**:
@ -199,6 +237,20 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
 **Anubis**:
 A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).

+### Externally-authored sites
+
+**Valia site**:
+A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
+_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
+
+**Content folder**:
+The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
+_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
+
+**Entry file**:
+The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
+_Avoid_: asking Valia to rename her files to fit hosting conventions.
+
 ## Relationships

 - A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -210,6 +262,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
 - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
 - An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
 - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
+- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.

 ## Example dialogue

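Read as an algorithm, the **GPU watchdog** entry in the glossary hunk above has two conditions: act only when free VRAM is below a floor, and then recycle only a tenant that is over its declared **GPU memory budget**. The following is a minimal Go sketch of that selection rule, not the `gpu-vram-watchdog` CronJob itself. It assumes "biggest" means the largest observed VRAM use among over-budget tenants; the `tenant` type, the `pickVictim` name, and every number are hypothetical.

```go
package main

import "fmt"

// tenant is a hypothetical view of one GPU pod: its declared
// viktorbarzin.me/gpumem budget and its observed VRAM use, both in MiB.
type tenant struct {
	name      string
	budgetMiB int
	usedMiB   int
}

// pickVictim returns the name of the tenant to recycle, or "" when the
// watchdog should do nothing.
func pickVictim(freeMiB, floorMiB int, tenants []tenant) string {
	if freeMiB >= floorMiB {
		return "" // no pressure: bursting into genuine slack is fine
	}
	victim, worst := "", 0
	for _, t := range tenants {
		// Only tenants over their declared budget are candidates;
		// among those, take the largest observed use (assumption).
		if t.usedMiB > t.budgetMiB && t.usedMiB > worst {
			victim, worst = t.name, t.usedMiB
		}
	}
	return victim
}

func main() {
	tenants := []tenant{
		{"tenant-a", 4000, 9000}, // far over budget
		{"tenant-b", 3000, 2500}, // within budget
		{"tenant-c", 2000, 2600}, // slightly over budget
	}
	fmt.Println(pickVictim(500, 1000, tenants))  // under pressure
	fmt.Println(pickVictim(4000, 1000, tenants)) // plenty of free VRAM
}
```

Keeping the rule a pure function of (free VRAM, floor, tenants) separates the decision from the metric scrape and the pod deletion, which makes the "acts only under pressure" contract easy to test.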
--- a/cli/README.md
+++ b/cli/README.md
@ -1,2 +1,287 @@
-# What is this?
-This is a CLI to manipulate files in the terraform repo and commit and push them
+# homelab
+
+`homelab` is the unified, agent-facing CLI for operating this homelab — one
+composable, JSON-capable surface for the operations agents run over and over,
+discovered progressively at runtime. It is grown **in place** from this
+directory (the former `infra-cli`), and the legacy webhook use-cases still work
+(see below).
+
+It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
+third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
+
+## Usage
+
+```
+homelab <command> [args]
+homelab manifest [--json]    # list every verb + its read/write tier (discovery entrypoint)
+homelab version
+```
+
+### v0.1 verbs — the infra inner-loop
+
+| Command | Tier | What it does |
+|---|---|---|
+| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
+| `release <kind>:<name>` | write | release a presence claim |
+| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
+| `tf validate <stack>` | read | `scripts/tg validate` |
+| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
+| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
+| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
+| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
+| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
+| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
+
+### v0.2 verbs — Kubernetes
+
+Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
+(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
+kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
+ambient kubeconfig.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
+| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
+| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
+| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
+| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
+| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
+| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
+| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
+| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
+| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
+| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
+
+Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
+**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
+
+`tf` resolves the stack dir by walking up from cwd to the infra root and
+delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
+the ingress auth-comment check). git-crypt filter flags are auto-injected on git
+operations in the encrypted infra repo.
+
+**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
+auto-detected suite) unless you pass `--no-verify` — landing to master unverified
+must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
+landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
+
+Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
+reads / prompt writes; v0.1 allows everything and relies on existing gates
+(permission mode, presence claims, plan approval).
+
+### v0.3 verbs — memory
+
+A thin HTTP client over the **claude-memory** service (the same backend the
+memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
+`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
+ingress). Because it hits the HTTP API directly, it **works even when the MCP
+frontend is down**.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
+| `memory list [--category --tag --limit]` | read | recent memories |
+| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
+| `memory secret <id>` | read | reveal a sensitive memory's content |
+| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
+| `memory update <id> [--content --tags --importance]` | write | edit a memory |
+| `memory delete <id>` | write | delete a memory |
+
+All read/write paths are validated against the live API (incl. a
+store→recall→delete round-trip). This gives full data-plane parity with the MCP;
+the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
+to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
+see `docs/adr/0008`.
+
+### v0.4 verbs — ci / deploy
+
+Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
+talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
+`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
+remote, with retries that ride out Woodpecker's intermittent empty responses.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
+| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
+| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
+
+`work land` now calls `ci watch` on the landed commit automatically (skip with
+`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
+step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
+the least reliable; `status`/`watch` use the list endpoint that works.
+
+### v0.5 verbs — net / dns / metrics / logs
+
+Reachability + observability probes. Their value is *endpoint resolution* — the
+non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
+otherwise re-derive every time — not the HTTP call itself. All reach internal
+ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
+
+| Command | Tier | What it does |
+|---|---|---|
+| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
+| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
+| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
+| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
+| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
+
+Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
+no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
+firing set is reachable via `ALERTS` instead.)
+
+### v0.6 — usage telemetry (`usage top`)
+
+Makes "which verbs are actually used, by everyone" a query instead of a guess —
+so adding the *next* verb is evidence-driven, not shaped by one person's habits.
+
+Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
+labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
+flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
+affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
+the shared Loki, aggregate usage is queryable **without reading anyone's home** —
+the privacy-preserving answer to "what does the team use."
+
+| Command | Tier | What it does |
+|---|---|---|
+| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
+
+### v0.7 verbs — Home Assistant
+
+Cover exactly the two things the `ha` **MCP server can't**: resolving the
+long-lived API token out of the cluster, and SSH to the HA host for host-level
+work (config files, docker, add-ons). Entity state and control (`turn_on`,
+`get_state`, services) stay with the MCP — *actions an MCP already encodes are
+out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
+the non-obvious *which secret, which host, which key, which flags* you'd
+otherwise re-derive every session — agents were hand-rolling a
+`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
+every run because the existing `home-assistant-sofia.py` needs an env var set
+and a cwd-relative path, neither of which holds in an arbitrary session.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
+| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
+
+`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
+prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
+`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
+not tied to whoever first wrote the workflow (the user's key must be enrolled on
+the HA host).
+
+### v0.8 verbs — browser (headful anti-bot automation)
+
+Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
+from the devvm over CDP, for sites that detect and block headless automation. The
+headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
+the gated action (submit/login) silently fails — the motivating case was the
+Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
+`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
+injects the same `stealth.js` the in-cluster callers use, and submits first try.
+
+The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
+agent supplies the Playwright script — judgment stays out of the CLI.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
+| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
+| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
+
+Default context is a **fresh incognito** one (closed on exit) — safe for the
+shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
+reuses the warmed persistent profile when a pre-logged-in session is needed.
+`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
+that gates in-cluster callers — no namespace label needed. The node CDP client is
+pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
+(Chromium 130; protocol changes between minors) and is installed once, lazily,
+into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
+runs on the devvm, `setInputFiles` streams local files to the remote browser over
+CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
+and `docs/adr/0013`.
+
+### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
+
+Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
+filters render to a single safe `SELECT` (namespace values validated to the k8s
+name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
+| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
+| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
+| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
+| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
+| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
+
+### v0.10 — `vault get --all` (browse every field)
+
+`vault get <name> --all` returns the **whole item** as a normalized JSON object,
+so an agent can discover and read fields the single-field `--field` allowlist
+can't reach — notably arbitrary **custom fields**.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
+
+Shape notes: present standard fields only (empty ones omitted); `fields` is a
+custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
+The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the
+only seed-derived path stays the specially-audited `vault code`. Like
+`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
+it (`homelab vault get <name> --all | jq`).
+
+### v0.10.1 — reads `bw sync` first (always fresh)
+
+Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
+sync` when opening its session, so it reflects the latest server-side values.
+`bw unlock` only decrypts the *local* cache, so without this a persisted
+(already-logged-in) session served stale data — a password changed in the web
+vault wouldn't show up until the next login. The sync is **best-effort**: a
+transient failure warns on stderr and falls back to the cached vault rather than
+failing the read.
+
+### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
+
+`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
+`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
+
+- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
+- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
+
+| Command | Tier | What it does |
+| --- | --- | --- |
+| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
+| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
+| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
+
+**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
+(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
+(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
+handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
+its own path). Access is whatever your policy grants. Writes are merge-only;
+a replacing `put` and `delete` are out of scope: use the raw `vault` CLI for those.
+
+## Build / install
+
+Built from source to `/usr/local/bin/homelab` during devvm provisioning
+(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
+stamped from `cli/VERSION` via ldflags. Manual build:
+
+```
+cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
+go test ./...
+```
+
+## Legacy webhook use-cases (preserved)
+
+This binary is also the in-cluster `infra-cli` image. Invocations starting with
+`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
+original flag-based path unchanged, so the webhook handler is unaffected.
+
+## Design
+
+See `infra/docs/adr/0004`–`0013` for the architecture decisions.
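The README's v0.9 section says the `edges` filters render to a single safe `SELECT`, with namespace values validated to the k8s name charset. The following Go sketch shows one way that validate-then-render step could look; it is not the CLI's actual code. The table and column names (`edges`, `src_ns`, `dst_ns`) and the `edgesQuery` helper are hypothetical; only the 200-row default comes from the README.

```go
package main

import (
	"fmt"
	"regexp"
)

// nsPattern is the RFC 1123 label shape Kubernetes uses for namespace names:
// lowercase alphanumerics and '-', alphanumeric at both ends, at most 63 chars.
var nsPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$`)

// edgesQuery renders the SELECT for a hypothetical `edges --ns <ns>` call.
// Because ns is restricted to the charset above, it cannot carry quotes,
// spaces, or semicolons into the statement.
func edgesQuery(ns string, limit int) (string, error) {
	if !nsPattern.MatchString(ns) {
		return "", fmt.Errorf("invalid namespace %q", ns)
	}
	if limit <= 0 {
		limit = 200 // default row cap from the README
	}
	return fmt.Sprintf(
		"SELECT * FROM edges WHERE src_ns = '%s' OR dst_ns = '%s' LIMIT %d",
		ns, ns, limit), nil
}

func main() {
	q, err := edgesQuery("monitoring", 0)
	fmt.Println(q, err)

	_, err = edgesQuery("x'; DROP TABLE edges; --", 10)
	fmt.Println(err)
}
```

Allowlist validation is used here instead of bound parameters because the README describes the query as running through an exec into the database pod, where the statement travels as one string.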
--- a/cli/VERSION
+++ b/cli/VERSION
@ -0,0 +1 @@
+v0.11.0
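The README's build step stamps this file into the binary with `-ldflags "-X main.version=$(cat VERSION)"`. The linker's `-X` flag can only overwrite a package-level string variable, so the receiving side has to look roughly like the sketch below. The `"dev"` fallback and the `versionLine` helper are assumptions for illustration; the CLI's real declaration is not part of this diff.

```go
package main

import "fmt"

// version is overwritten at link time by
//
//	go build -ldflags "-X main.version=$(cat VERSION)"
//
// The "dev" default (an assumption) is what an unstamped `go build` reports.
var version = "dev"

// versionLine is a hypothetical helper for a `homelab version`-style command.
func versionLine() string {
	return "homelab " + version
}

func main() {
	fmt.Println(versionLine())
}
```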
--- a/cli/browser.go
+++ b/cli/browser.go
@ -0,0 +1,388 @@
+package main
+
+import (
+	_ "embed"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"os"
+	"os/exec"
+	"os/signal"
+	"path/filepath"
+	"strconv"
+	"strings"
+	"sync"
+	"syscall"
+	"time"
+)
+
+// playwrightVersion pins the node CDP client to the chrome-service image minor
+// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
+// speaks the browser's CDP, so the client minor must track the server minor;
+// see docs/architecture/chrome-service.md "Image pin".
+const playwrightVersion = "1.48.2"
+
+// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
+// endpoint to become ready before giving up.
+const defaultBrowserTimeout = 60
+
+const (
+	chromeServiceNamespace = "chrome-service"
+	chromeServiceName      = "chrome-service"
+	chromeServiceCDPPort   = 9222
+)
+
+// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
+// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
+// guards against drift.
+//
+//go:embed browser_stealth.js
+var stealthJS string
+
+// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
+// installs the stealth init script, and runs the user's Playwright script.
+//
+//go:embed browser_runner.js
+var runnerJS string
+
+// browserOpts is the parsed form of `homelab browser run|open` arguments.
+type browserOpts struct {
+	mode      string // "run" | "open"
+	script    string // path to the user Playwright script (run mode)
+	url       string // initial URL (run: optional; open: required positional)
+	sharedCtx bool   // use the warmed persistent profile instead of a fresh context
+	keepOpen  bool   // leave the created context/pages open on exit
+	port      int    // explicit local port for the forward (0 = auto)
+	timeout   int    // CDP readiness timeout, seconds
+	help      bool
+}
+
+// parseBrowserArgs parses the args after `browser run` / `browser open`.
+func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
+	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
+	var positionals []string
+	atoi := func(s, flag string) (int, error) {
+		n, err := strconv.Atoi(s)
+		if err != nil {
+			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
+		}
+		return n, nil
+	}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "-h" || a == "--help":
+			o.help = true
+		case a == "--shared-context":
+			o.sharedCtx = true
+		case a == "--keep-open":
+			o.keepOpen = true
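+		case a == "--url":
+			// A trailing flag with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--url expects a value")
+			}
+			o.url = args[i+1]
+			i++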
+		case strings.HasPrefix(a, "--url="):
+			o.url = strings.TrimPrefix(a, "--url=")
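+		case a == "--port":
+			// A trailing flag with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--port expects an integer value")
+			}
+			n, err := atoi(args[i+1], "--port")
+			if err != nil {
+				return o, err
+			}
+			o.port = n
+			i++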
+		case strings.HasPrefix(a, "--port="):
+			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
+			if err != nil {
+				return o, err
+			}
+			o.port = n
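+		case a == "--timeout":
+			// A trailing flag with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--timeout expects an integer value")
+			}
+			n, err := atoi(args[i+1], "--timeout")
+			if err != nil {
+				return o, err
+			}
+			o.timeout = n
+			i++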
+		case strings.HasPrefix(a, "--timeout="):
+			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
+			if err != nil {
+				return o, err
+			}
+			o.timeout = n
+		case strings.HasPrefix(a, "-"):
+			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
+		default:
+			positionals = append(positionals, a)
+		}
+	}
+	if o.help {
+		return o, nil
+	}
+	switch mode {
+	case "run":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
+		}
+		o.script = positionals[0]
+	case "open":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
+		}
+		o.url = positionals[0]
+	}
+	return o, nil
+}
+
+// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
+// a real (non-headless) Chrome — the entire reason chrome-service exists.
+func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
+	var v struct {
+		Browser   string `json:"Browser"`
+		UserAgent string `json:"User-Agent"`
+	}
+	if e := json.Unmarshal(jsonBody, &v); e != nil {
+		return "", false, fmt.Errorf("parse /json/version: %w", e)
+	}
+	if v.Browser == "" {
+		return "", false, fmt.Errorf("/json/version had no Browser field")
+	}
+	healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
+		!strings.Contains(v.Browser, "Headless") &&
+		!strings.Contains(v.UserAgent, "Headless")
+	return v.Browser, healthy, nil
+}
+
+// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
+// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
+// NetworkPolicy that gates in-cluster callers.
+func buildPortForwardArgs(localPort int) []string {
+	return []string{"-n", chromeServiceNamespace, "port-forward",
+		"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
+}
+
+// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
+// client kept under the user cache dir.
+func browserClientPackageJSON() string {
+	return fmt.Sprintf(`{
+  "name": "homelab-browser-client",
+  "private": true,
+  "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
+  "dependencies": {
+    "playwright-core": "%s"
+  }
+}
+`, playwrightVersion)
+}
+
+// freePort asks the kernel for an unused ephemeral TCP port.
+func freePort() (int, error) {
+	l, err := net.Listen("tcp", "127.0.0.1:0")
+	if err != nil {
+		return 0, err
+	}
+	defer l.Close()
+	return l.Addr().(*net.TCPAddr).Port, nil
+}
+
+// browserClientDir is where the pinned node client + managed runner files live.
+func browserClientDir() (string, error) {
+	cache, err := os.UserCacheDir()
+	if err != nil || cache == "" {
+		home, herr := os.UserHomeDir()
+		if herr != nil {
+			return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
+		}
+		cache = filepath.Join(home, ".cache")
+	}
+	return filepath.Join(cache, "homelab", "browser-client"), nil
+}
+
+// installedPlaywrightVersion reads the version of the playwright-core already
+// installed in dir, or "" if absent/unreadable.
+func installedPlaywrightVersion(dir string) string {
+	b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
+	if err != nil {
+		return ""
+	}
+	var v struct {
+		Version string `json:"version"`
+	}
+	if json.Unmarshal(b, &v) != nil {
+		return ""
+	}
+	return v.Version
+}
+
+// ensureBrowserClient writes the managed runner/stealth/package files into dir
+// and lazily installs the pinned playwright-core (only when missing/mismatched),
+// so no per-user setup is needed and the client tracks the binary version.
+func ensureBrowserClient(dir string) error {
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		return err
+	}
+	files := map[string]string{
+		"package.json":      browserClientPackageJSON(),
+		"browser_runner.js": runnerJS,
+		"stealth.js":        stealthJS,
+	}
+	for name, content := range files {
+		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
+			return err
+		}
+	}
+	if installedPlaywrightVersion(dir) == playwrightVersion {
+		return nil
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
+	cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
+	cmd.Dir = dir
+	cmd.Stdout = os.Stderr
+	cmd.Stderr = os.Stderr
+	if err := cmd.Run(); err != nil {
+		return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
+	}
+	if got := installedPlaywrightVersion(dir); got != playwrightVersion {
+		return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
+	}
+	return nil
+}
+
+// waitForCDP polls the local CDP endpoint until it answers as a healthy
+// (non-headless) Chrome, or the timeout elapses.
+func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
+	deadline := time.Now().Add(timeout)
+	client := &http.Client{Timeout: 3 * time.Second}
+	var lastErr error
+	for time.Now().Before(deadline) {
+		resp, err := client.Get(cdpURL + "/json/version")
+		if err != nil {
+			lastErr = err
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		body, _ := io.ReadAll(resp.Body)
+		resp.Body.Close()
+		browser, healthy, herr := cdpHealthy(body)
+		if herr != nil {
+			lastErr = herr
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		if !healthy {
+			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
+		}
+		return browser, nil
+	}
+	if lastErr == nil {
+		lastErr = fmt.Errorf("timed out after %s", timeout)
+	}
+	return "", lastErr
+}
+
+// runBrowser is the orchestration: pick a port, ensure the pinned client, start
+// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
+func runBrowser(o browserOpts) error {
+	port := o.port
+	if port == 0 {
+		p, err := freePort()
+		if err != nil {
+			return fmt.Errorf("pick local port: %w", err)
+		}
+		port = p
+	}
+
+	dir, err := browserClientDir()
+	if err != nil {
+		return err
+	}
+	if err := ensureBrowserClient(dir); err != nil {
+		return err
+	}
+
+	// Start the forward in its own process group so the whole tree dies on cleanup.
+	pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
+	pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
+	var pfLog strings.Builder
+	pf.Stdout = &pfLog
+	pf.Stderr = &pfLog
+	if err := pf.Start(); err != nil {
+		return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
+	}
+
+	var once sync.Once
+	teardown := func() {
+		once.Do(func() {
+			if pf.Process != nil {
+				_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
+			}
+			_ = pf.Wait()
+		})
+	}
+	defer teardown()
+
+	// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
+	sigCh := make(chan os.Signal, 1)
+	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
+	defer signal.Stop(sigCh)
+	go func() {
+		if _, ok := <-sigCh; ok {
+			teardown()
+			os.Exit(130)
+		}
+	}()
+
+	cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
+	browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
+	if err != nil {
+		return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
+
+	return runBrowserNode(dir, cdpURL, o)
+}
+
+// runBrowserNode invokes the managed node runner with inputs passed via env.
+func runBrowserNode(dir, cdpURL string, o browserOpts) error {
+	env := append(os.Environ(),
+		"HOMELAB_CDP_URL="+cdpURL,
+		"HOMELAB_BROWSER_MODE="+o.mode,
+		"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
+		"NODE_PATH="+filepath.Join(dir, "node_modules"),
+	)
+	if o.url != "" {
+		env = append(env, "HOMELAB_BROWSER_URL="+o.url)
+	}
+	if o.script != "" {
+		abs, err := filepath.Abs(o.script)
+		if err != nil {
+			return err
+		}
+		if _, err := os.Stat(abs); err != nil {
+			return fmt.Errorf("script %s: %w", o.script, err)
+		}
+		env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
+	}
+	if o.sharedCtx {
+		env = append(env, "HOMELAB_BROWSER_SHARED=1")
+	}
+	if o.keepOpen {
+		env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
+	}
+	if o.mode == "open" {
+		shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
+		env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
+	}
+	cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
+	cmd.Env = env
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
--- a/cli/browser_runner.js
+++ b/cli/browser_runner.js
@ -0,0 +1,106 @@
+// homelab browser — node CDP runner (auto-managed; regenerated each run from the
+// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
+// chrome-service CDP endpoint, installs the stealth init script, then runs the
+// user's Playwright script (run mode) or opens a URL (open mode). All inputs
+// arrive via HOMELAB_* env vars set by the Go CLI.
+'use strict';
+const fs = require('fs');
+const { chromium } = require('playwright-core');
+
+async function main() {
+  const cdpURL = process.env.HOMELAB_CDP_URL;
+  if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
+  const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
+  const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
+  const initURL = process.env.HOMELAB_BROWSER_URL || '';
+  const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
+  const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
+  const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
+  const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
+
+  const browser = await chromium.connectOverCDP(cdpURL);
+
+  // Fresh isolated context by default (safe for the shared browser + concurrent
+  // callers); --shared-context reuses the warmed persistent profile.
+  let context;
+  let createdContext = false;
+  if (shared) {
+    const existing = browser.contexts();
+    if (existing.length) {
+      context = existing[0];
+    } else {
+      context = await browser.newContext();
+      createdContext = true;
+    }
+  } else {
+    context = await browser.newContext();
+    createdContext = true;
+  }
+
+  if (stealthPath) {
+    const stealth = fs.readFileSync(stealthPath, 'utf8');
+    if (stealth.trim()) await context.addInitScript(stealth);
+  }
+
+  const page = await context.newPage();
+  const log = (...a) => console.error('[browser]', ...a);
+
+  let exitCode = 0;
+  try {
+    if (initURL) {
+      await page.goto(initURL, { waitUntil: 'domcontentloaded' });
+    }
+    if (mode === 'open') {
+      console.log('url:    ' + page.url());
+      console.log('title:  ' + (await page.title()));
+      const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
+      console.log('--- visible text (truncated to 4000 chars) ---');
+      console.log(text.slice(0, 4000));
+      if (screenshotPath) {
+        await page.screenshot({ path: screenshotPath, fullPage: true });
+        console.log('screenshot: ' + screenshotPath);
+      }
+    } else {
+      if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
+      const src = fs.readFileSync(scriptPath, 'utf8');
+      // Run the user's source with page/context/browser/log in lexical scope.
+      // AsyncFunction body permits top-level await.
+      const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
+      const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
+      const result = await fn(page, context, browser, log);
+      if (result !== undefined) {
+        let out;
+        try {
+          out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
+        } catch (_) {
+          out = String(result);
+        }
+        console.log(out);
+      }
+    }
+  } catch (e) {
+    console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
+    exitCode = 1;
+  } finally {
+    if (!keepOpen) {
+      try {
+        // Close only what we created; never tear down the shared persistent context.
+        if (createdContext) {
+          await context.close();
+        } else {
+          await page.close();
+        }
+      } catch (_) { /* ignore */ }
+    }
+    // Disconnect from the CDP endpoint; this does NOT kill the remote browser.
+    try {
+      await browser.close();
+    } catch (_) { /* ignore */ }
+  }
+  process.exit(exitCode);
+}
+
+main().catch((e) => {
+  console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
+  process.exit(1);
+});
--- a/cli/browser_stealth.js
+++ b/cli/browser_stealth.js
@ -0,0 +1,54 @@
+// Minimal stealth init script for Playwright-driven Chromium.
+// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
+//   webdriver, chrome.runtime, navigator.plugins, navigator.languages,
+//   Permissions.query, WebGL getParameter (vendor + renderer spoof).
+// Run via context.add_init_script() so it executes before any page script.
+(() => {
+  // navigator.webdriver — most common detection, removed entirely.
+  Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
+
+  // window.chrome.runtime — many sites check that real Chrome exposes this.
+  if (!window.chrome) window.chrome = {};
+  window.chrome.runtime = window.chrome.runtime || {};
+
+  // navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
+  Object.defineProperty(navigator, 'plugins', {
+    get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
+  });
+
+  // navigator.languages — headless returns empty array.
+  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
+
+  // Permissions.query — headless returns 'denied' for notifications instead of 'default'.
+  const origQuery = window.navigator.permissions && window.navigator.permissions.query;
+  if (origQuery) {
+    window.navigator.permissions.query = (parameters) =>
+      parameters && parameters.name === 'notifications'
+        ? Promise.resolve({ state: Notification.permission })
+        : origQuery(parameters);
+  }
+
+  // WebGL getParameter — spoof vendor + renderer strings to a real GPU.
+  const spoofGl = (proto) => {
+    if (!proto) return;
+    const orig = proto.getParameter;
+    proto.getParameter = function (parameter) {
+      if (parameter === 37445) return 'Intel Inc.';                   // UNMASKED_VENDOR_WEBGL
+      if (parameter === 37446) return 'Intel Iris OpenGL Engine';     // UNMASKED_RENDERER_WEBGL
+      return orig.apply(this, arguments);
+    };
+  };
+  spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
+  spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
+
+  // disable-devtool.js (theajack/disable-devtool) auto-inits via a script
+  // tag with `disable-devtool-auto`. Its Performance detector trips under
+  // Playwright (CDP adds console.log latency vs console.table) and the
+  // redirect URL is hard-coded — for hmembeds that's google.com.
+  // Hide the auto-init marker so the library's IIFE exits early.
+  const origQS = Document.prototype.querySelector;
+  Document.prototype.querySelector = function (sel) {
+    if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
+    return origQS.apply(this, arguments);
+  };
+})();
--- a/cli/cmd_browser.go
+++ b/cli/cmd_browser.go
@ -0,0 +1,117 @@
+package main
+
+import "fmt"
+
+// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
+// from outside the cluster, for sites that detect/block headless automation.
+// The headless @playwright/mcp browser can load such sites but their gated
+// actions (submit/login) silently fail; this path submits first try. Mechanics
+// only — the agent supplies the Playwright script. See docs/adr/0013.
+
+func browserCommands() []Command {
+	return []Command{
+		{Path: []string{"browser"}, Tier: TierRead,
+			Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
+		{Path: []string{"browser", "run"}, Tier: TierWrite,
+			Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
+		{Path: []string{"browser", "open"}, Tier: TierWrite,
+			Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
+	}
+}
+
+func browserTopHelp([]string) error {
+	fmt.Print(browserHelp())
+	return nil
+}
+
+func browserRun(args []string) error {
+	o, err := parseBrowserArgs("run", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+func browserOpen(args []string) error {
+	o, err := parseBrowserArgs("open", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+// browserHelp carries the discoverability payload: WHEN to reach for this, and
+// the diagnostic cheat-sheet that lets the agent self-correct instead of
+// retrying a deterministic form blind (the failure mode that motivated this).
+func browserHelp() string {
+	return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP
+
+The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
+Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
+injects the same stealth.js the in-cluster callers use, and runs your script.
+
+USAGE
+  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
+  homelab browser open <url> [--shared-context] [--timeout S]
+
+WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser
+  Default to the Playwright MCP / headless browser for ALL routine browsing and
+  automation — it's interactive (snapshot per step), fast to start, isolated.
+  Reach for THIS command ONLY when headless is demonstrably blocked: a site
+  LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins
+  forever, or ONE request errors while its siblings 200. That is the signature
+  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
+  disable-devtool traps). This command presents as a real Chrome and usually
+  succeeds first try, but it's the shared cluster browser (slower startup, one
+  batch run, no per-step feedback), so it's the escalation path, never the default.
+
+ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
+  ERR_FILE_NOT_FOUND (-6)   request intercepted/resolved locally by the
+                            automation layer — NOT a network/egress problem.
+                            (This is what silently broke the headless submit.)
+  ERR_CONNECTION_REFUSED /  real egress failure (DNS/route/firewall). These also
+  ERR_TIMED_OUT /           break the initial page load — if the page loaded,
+  ERR_NAME_NOT_RESOLVED     egress is fine and the cause is elsewhere.
+  one endpoint 500s while   server-side bot rejection of the automation, not
+  its siblings 200          your payload.
+
+HABITS
+  - Inspect the network panel BEFORE retrying a deterministic form; a blind
+    retry just repeats the same silent failure.
+  - Don't park a half-filled multi-step form across a user pause — the session
+    can expire; re-run the whole flow from this command in one shot.
+  - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging
+    of $HOME needed; just point setInputFiles at a local path.
+
+CONTEXT
+  Default: a FRESH incognito context, closed on exit — safe for the shared
+  browser and concurrent callers (e.g. tripit). Your script does its own login.
+  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
+  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
+
+SCRIPT CONTRACT (run mode)
+  Your file's body runs with page, context, browser and log() already in scope
+  (top-level await allowed). Return a value to print it. Example flow.js:
+
+    await page.goto('https://portal.example.com/login');
+    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
+    await page.click('button[type=submit]');
+    await page.waitForURL('**/dashboard');
+    return 'logged in: ' + page.url();
+
+  Run it:  homelab browser run flow.js
+
+NOTES
+  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
+    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/browser-client/.
+  - The port-forward is always torn down, on success and on error.
+`
+}
--- a/cli/cmd_browser_test.go
+++ b/cli/cmd_browser_test.go
@ -0,0 +1,172 @@
+package main
+
+import (
+	"os"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestParseBrowserArgsRun(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{
+		"flow.js", "--url", "https://example.com", "--shared-context",
+		"--port", "19999", "--timeout", "45", "--keep-open",
+	})
+	if err != nil {
+		t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
+	}
+	want := browserOpts{
+		mode: "run", script: "flow.js", url: "https://example.com",
+		sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
+	}
+}
+
+func TestParseBrowserArgsRunDefaults(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
+		t.Fatalf("defaults wrong: %+v", got)
+	}
+	if got.timeout != defaultBrowserTimeout {
+		t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
+	}
+}
+
+func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
+	if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
+		t.Fatalf("run without a script path should error")
+	}
+}
+
+func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
+	got, err := parseBrowserArgs("open", []string{"https://example.com"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://example.com" || got.mode != "open" {
+		t.Fatalf("open parse wrong: %+v", got)
+	}
+	if _, err := parseBrowserArgs("open", []string{}); err == nil {
+		t.Fatalf("open without a URL should error")
+	}
+}
+
+func TestParseBrowserArgsHelp(t *testing.T) {
+	for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
+		got, err := parseBrowserArgs("run", a)
+		if err != nil {
+			t.Fatalf("help parse %v: %v", a, err)
+		}
+		if !got.help {
+			t.Fatalf("args %v should set help", a)
+		}
+	}
+}
+
+func TestParseBrowserArgsEqualsForm(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
+		t.Fatalf("--flag=value form not parsed: %+v", got)
+	}
+}
+
+func TestCDPHealthy(t *testing.T) {
+	real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
+	browser, ok, err := cdpHealthy(real)
+	if err != nil || !ok {
+		t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
+	}
+	if !strings.HasPrefix(browser, "Chrome/") {
+		t.Fatalf("browser = %q, want Chrome/ prefix", browser)
+	}
+
+	headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
+	if _, ok, _ := cdpHealthy(headless); ok {
+		t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
+	}
+
+	if _, _, err := cdpHealthy([]byte("not json")); err == nil {
+		t.Fatalf("malformed /json/version body should error")
+	}
+}
+
+func TestBuildPortForwardArgs(t *testing.T) {
+	got := buildPortForwardArgs(18080)
+	want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
+	}
+}
+
+func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
+	pj := browserClientPackageJSON()
+	if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
+		t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
+	}
+}
+
+func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
+	// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
+	// client minor MUST match (protocol changes between minors).
+	if !strings.HasPrefix(playwrightVersion, "1.48.") {
+		t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
+	}
+}
+
+func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
+	h := browserHelp()
+	for _, want := range []string{
+		"homelab browser run",
+		"ERR_FILE_NOT_FOUND",
+		"ERR_CONNECTION_REFUSED",
+		"network panel",
+		"headless",
+		"--shared-context",
+	} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
+		}
+	}
+}
+
+func TestBrowserHelpIsTiered(t *testing.T) {
+	// --help must frame this as the ESCALATION path (default to headless first),
+	// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
+	// instructions. Guard against a regression to "co-equal choice" wording.
+	h := browserHelp()
+	for _, want := range []string{"Default to the", "escalation"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
+		}
+	}
+}
+
+func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
+	// The embedded copy must never drift from the source of truth that the
+	// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
+	canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
+	if err != nil {
+		t.Fatalf("read canonical stealth.js: %v", err)
+	}
+	if stealthJS != string(canonical) {
+		t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
+	}
+}
+
+func TestFreePortReturnsUsablePort(t *testing.T) {
+	p, err := freePort()
+	if err != nil {
+		t.Fatalf("freePort: %v", err)
+	}
+	if p <= 1024 || p > 65535 {
+		t.Fatalf("freePort returned %d, want an ephemeral port", p)
+	}
+}
--- a/cli/cmd_ci.go
+++ b/cli/cmd_ci.go
@ -0,0 +1,99 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+func ciCommands() []Command {
+	return []Command{
+		{Path: []string{"ci", "status"}, Tier: TierRead,
+			Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
+		{Path: []string{"ci", "watch"}, Tier: TierRead,
+			Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
+	}
+}
+
+func short(s string) string {
+	if len(s) > 8 {
+		return s[:8]
+	}
+	return s
+}
+
+func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
+
+// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
+func currentHEAD() string {
+	cwd, _ := os.Getwd()
+	root, err := gitRepoRoot(cwd)
+	if err != nil {
+		return ""
+	}
+	sha, _ := gitOutput(root, "rev-parse", "HEAD")
+	return sha
+}
+
+func ciStatus(args []string) error {
+	commit, _ := firstPositional(args)
+	c, err := newWPClient()
+	if err != nil {
+		return err
+	}
+	id, err := c.repoID()
+	if err != nil {
+		return err
+	}
+	p, err := c.findPipeline(id, commit)
+	if err != nil {
+		return err
+	}
+	fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
+	return nil
+}
+
+func ciWatch(args []string) error {
+	commit, _ := firstPositional(args)
+	if commit == "" {
+		commit = currentHEAD()
+	}
+	if commit == "" {
+		return fmt.Errorf("no commit given and not in a git repo")
+	}
+	c, err := newWPClient()
+	if err != nil {
+		return err
+	}
+	id, err := c.repoID()
+	if err != nil {
+		return err
+	}
+	timeout := 20 * time.Minute
+	deadline := time.Now().Add(timeout)
+	last := ""
+	for time.Now().Before(deadline) {
+		p, err := c.findPipeline(id, commit)
+		if err != nil {
+			if last != "waiting" {
+				fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
+				last = "waiting"
+			}
+		} else {
+			if p.Status != last {
+				fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
+				last = p.Status
+			}
+			if isTerminalStatus(p.Status) {
+				fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
+				if isFailureStatus(p.Status) {
+					return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
+				}
+				return nil
+			}
+		}
+		time.Sleep(15 * time.Second)
+	}
+	return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
+}
--- a/cli/cmd_claim.go
+++ b/cli/cmd_claim.go
@ -0,0 +1,56 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+func claimCommands() []Command {
+	return []Command{
+		{Path: []string{"claim"}, Tier: TierWrite,
+			Summary: "claim a shared infra resource on the presence board",
+			Run:     runClaim},
+		{Path: []string{"release"}, Tier: TierWrite,
+			Summary: "release a presence claim",
+			Run:     runRelease},
+	}
+}
+
+// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
+// script takes the label first, so we can't rely on Go's flag package which
+// stops at the first positional).
+func runClaim(args []string) error {
+	var label, purpose string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--purpose" || a == "-purpose":
+			if i+1 < len(args) {
+				purpose = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--purpose="):
+			purpose = strings.TrimPrefix(a, "--purpose=")
+		case !strings.HasPrefix(a, "-") && label == "":
+			label = a
+		}
+	}
+	if label == "" {
+		return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
+	}
+	return presenceClaim(label, purpose)
+}
+
+func runRelease(args []string) error {
+	var label string
+	for _, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			label = a
+			break
+		}
+	}
+	if label == "" {
+		return fmt.Errorf("usage: homelab release <kind>:<name>")
+	}
+	return presenceRelease(label)
+}
--- a/cli/cmd_deploy.go
+++ b/cli/cmd_deploy.go
@ -0,0 +1,51 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+func deployCommands() []Command {
+	return []Command{
+		{Path: []string{"deploy", "wait"}, Tier: TierRead,
+			Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
+	}
+}
+
+// deployWait closes the "did the NEW code land" gap: rollout status alone returns
+// success on the OLD ReplicaSet, so we first wait for the deployment image to
+// reference the expected sha, THEN block on rollout status.
+func deployWait(args []string) error {
+	target, _ := firstPositional(args)
+	if target == "" || !strings.Contains(target, "/") {
+		return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
+	}
+	parts := strings.SplitN(target, "/", 2)
+	ns, deploy := parts[0], parts[1]
+
+	sha := flagValue(args, "--sha")
+	if sha == "" {
+		sha = short(currentHEAD())
+	}
+	deadline := time.Now().Add(10 * time.Minute)
+
+	if sha != "" {
+		fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
+		matched := false
+		for time.Now().Before(deadline) {
+			img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
+			if strings.Contains(img, sha) {
+				matched = true
+				break
+			}
+			time.Sleep(10 * time.Second)
+		}
+		if !matched {
+			return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
+		}
+	}
+	fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
+	return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
+}
--- a/cli/cmd_edges.go
+++ b/cli/cmd_edges.go
@ -0,0 +1,69 @@
+package main
+
+import "fmt"
+
+func edgesCommands() []Command {
+	return []Command{
+		{Path: []string{"edges"}, Tier: TierRead,
+			Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
+			Run:     edgesRun},
+	}
+}
+
+// edgesRun renders the filter flags to SQL and runs it read-only against the
+// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
+func edgesRun(args []string) error {
+	for _, a := range args {
+		if a == "-h" || a == "--help" {
+			fmt.Print(edgesUsage())
+			return nil
+		}
+	}
+	o, err := parseEdgesArgs(args)
+	if err != nil {
+		return fmt.Errorf("%w\n\n%s", err, edgesUsage())
+	}
+	sql, err := buildEdgesQuery(o)
+	if err != nil {
+		return err
+	}
+	// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
+	pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
+		"-o", "jsonpath={.items[0].metadata.name}")
+	if err != nil || pod == "" {
+		return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
+	}
+	exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
+	if o.asJSON {
+		exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
+	} else {
+		exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
+	}
+	return kubectlStream("dbaas", exec...)
+}
+
+func edgesUsage() string {
+	return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
+
+Usage: homelab edges [filters]
+
+Filters (AND-combined; namespace values are validated to the k8s name charset):
+  --ns NAME         edges touching NAME (either direction)
+  --src NAME        edges where source namespace = NAME
+  --dst NAME        edges where destination namespace = NAME
+  --peers-of NAME   distinct peer namespaces of NAME (both directions)
+  --new-since SPEC  first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
+  --denied          only denied (action='deny') edges — blocked / lateral-movement attempts
+  --json            output a JSON array (for agents/pipelines)
+  --limit N         cap rows (default 200)
+
+Examples:
+  homelab edges --ns immich                # everything immich talks to / is talked to by
+  homelab edges --peers-of authentik       # authentik's peer namespaces
+  homelab edges --src recruiter-responder  # that namespace's egress peers
+  homelab edges --new-since 24h            # edges first seen in the last day
+  homelab edges --denied --json            # blocked flows, machine-readable
+
+Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
+`
+}
--- a/cli/cmd_ha.go
+++ b/cli/cmd_ha.go
@ -0,0 +1,172 @@
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
+// the long-lived API token out of the cluster, and SSH to the HA host for
+// host-level work (config files, docker, add-ons). Entity state/control stays
+// with the MCP — see docs/adr/0012.
+//
+// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
+// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
+// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
+// `ha token` resolves it on demand via the ambient kubeconfig, so it never
+// depends on a pre-set env var (the gap that made agents re-derive the
+// kubectl|base64|jq pipeline every session).
+
+type haInstance struct {
+	name      string // sofia | london
+	sshUser   string // SSH login on the HA host
+	sshHost   string // host reachable from the devvm (Sofia LAN)
+	secretKey string // key inside the openclaw/ha-tokens Secret holding this token
+}
+
+const (
+	haDefaultInstance = "sofia"
+	haSecretNamespace = "openclaw"
+	haSecretName      = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
+)
+
+// haInstances maps instance name → connection/secret facts. sofia is the default
+// because the devvm is on the Sofia LAN; london is documented but its host
+// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
+// generally won't connect from here (token resolution still works).
+var haInstances = map[string]haInstance{
+	"sofia":  {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
+	"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
+}
+
+func haCommands() []Command {
+	return []Command{
+		{Path: []string{"ha", "token"}, Tier: TierRead,
+			Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
+		{Path: []string{"ha", "ssh"}, Tier: TierWrite,
+			Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
+	}
+}
+
+// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
+func resolveHAInstance(name string) (haInstance, error) {
+	if name == "" {
+		name = haDefaultInstance
+	}
+	inst, ok := haInstances[name]
+	if !ok {
+		return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
+	}
+	return inst, nil
+}
+
+// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
+// by kubectl jsonpath (trailing whitespace tolerated).
+func decodeSecretValue(b64 string) (string, error) {
+	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
+	if err != nil {
+		return "", fmt.Errorf("base64-decode secret value: %w", err)
+	}
+	return string(raw), nil
+}
+
+func haToken(args []string) error {
+	name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
+	for i := 0; i < len(args); i++ {
+		if args[i] == "--instance" && i+1 < len(args) {
+			name = args[i+1]
+		} else if strings.HasPrefix(args[i], "--instance=") {
+			name = strings.TrimPrefix(args[i], "--instance=")
+		}
+	}
+	inst, err := resolveHAInstance(name)
+	if err != nil {
+		return err
+	}
+	b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
+		"-o", "jsonpath={.data."+inst.secretKey+"}")
+	if err != nil {
+		return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
+	}
+	if b64 == "" {
+		return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
+	}
+	tok, err := decodeSecretValue(b64)
+	if err != nil {
+		return err
+	}
+	fmt.Println(tok)
+	return nil
+}
+
+// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
+// rather than tied to whoever first wrote the workflow.
+func defaultHAKeyPath() string {
+	if home, err := os.UserHomeDir(); err == nil && home != "" {
+		return filepath.Join(home, ".ssh", "id_ed25519")
+	}
+	return filepath.Join("~", ".ssh", "id_ed25519") // fallback: ssh itself expands a leading ~ in IdentityFile
+}
+
+// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
+// `--` are taken verbatim; bare tokens before it are also the remote command.
+func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
+	name := haDefaultInstance
+	keyPath = defaultHAKeyPath()
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--":
+			remote = append(remote, args[i+1:]...)
+			i = len(args) // end the loop; a bare break would only exit the switch
+		case a == "--instance":
+			if i+1 < len(args) {
+				name = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--instance="):
+			name = strings.TrimPrefix(a, "--instance=")
+		case a == "--key" || a == "-i":
+			if i+1 < len(args) {
+				keyPath = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--key="):
+			keyPath = strings.TrimPrefix(a, "--key=")
+		default:
+			remote = append(remote, a)
+		}
+	}
+	inst, err = resolveHAInstance(name)
+	return inst, keyPath, remote, err
+}
+
+// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
+// key, no user ssh config, and no known_hosts prompt/record — so it runs
+// unattended in an agent session without hanging on a host-key prompt.
+func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
+	args := []string{
+		"-F", "/dev/null",
+		"-o", "IdentityFile=" + keyPath,
+		"-o", "StrictHostKeyChecking=no",
+		"-o", "UserKnownHostsFile=/dev/null",
+		"-o", "ConnectTimeout=10",
+		"-o", "BatchMode=yes",
+		inst.sshUser + "@" + inst.sshHost,
+	}
+	return append(args, remote...)
+}
+
+func haSSH(args []string) error {
+	inst, keyPath, remote, err := parseHASSH(args)
+	if err != nil {
+		return err
+	}
+	if len(remote) == 0 {
+		return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
+	}
+	return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
+}
--- a/cli/cmd_ha_test.go
+++ b/cli/cmd_ha_test.go
@ -0,0 +1,92 @@
+package main
+
+import (
+	"encoding/base64"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestResolveHAInstance(t *testing.T) {
+	// empty defaults to sofia (the devvm sits on the Sofia LAN)
+	if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
+		t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
+	}
+	if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
+		t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
+	}
+	if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
+		t.Fatalf("london = %+v, %v", got, err)
+	}
+	if _, err := resolveHAInstance("paris"); err == nil {
+		t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
+	}
+}
+
+func TestDecodeSecretValue(t *testing.T) {
+	// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
+	// returns that base64, which decodeSecretValue turns back into the raw token.
+	enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
+	if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
+		t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
+	}
+	// trailing whitespace/newline from jsonpath output must be tolerated
+	if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
+		t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
+	}
+	if _, err := decodeSecretValue("not-base64!!"); err == nil {
+		t.Fatalf("decodeSecretValue should error on undecodable base64")
+	}
+}
+
+func TestBuildHASSHArgs(t *testing.T) {
+	inst, _ := resolveHAInstance("sofia")
+	got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
+	want := []string{
+		"-F", "/dev/null",
+		"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
+		"-o", "StrictHostKeyChecking=no",
+		"-o", "UserKnownHostsFile=/dev/null",
+		"-o", "ConnectTimeout=10",
+		"-o", "BatchMode=yes",
+		"vbarzin@192.168.1.8",
+		"cat", "/config/configuration.yaml",
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
+	}
+}
+
+func TestParseHASSH(t *testing.T) {
+	// instance flag + everything after `--` is the verbatim remote command
+	inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
+	if err != nil {
+		t.Fatalf("parseHASSH err: %v", err)
+	}
+	if inst.name != "sofia" {
+		t.Errorf("instance = %q, want sofia", inst.name)
+	}
+	if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
+		t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
+	}
+	if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
+		t.Errorf("remote = %v, want [docker ps -a]", remote)
+	}
+
+	// bare args (no `--`) are also taken as the remote command; -i overrides the key
+	_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
+	if err != nil {
+		t.Fatalf("parseHASSH err: %v", err)
+	}
+	if key2 != "/tmp/k" {
+		t.Errorf("key = %q, want /tmp/k", key2)
+	}
+	if !reflect.DeepEqual(remote2, []string{"uptime"}) {
+		t.Errorf("remote = %v, want [uptime]", remote2)
+	}
+
+	// unknown instance surfaces as an error
+	if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
+		t.Errorf("parseHASSH should error on unknown instance")
+	}
+}
--- a/cli/cmd_k8s.go
+++ b/cli/cmd_k8s.go
@ -0,0 +1,288 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+)
+
+func k8sCommands() []Command {
+	return []Command{
+		{Path: []string{"k8s", "status"}, Tier: TierRead,
+			Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
+		{Path: []string{"k8s", "get"}, Tier: TierRead,
+			Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
+		{Path: []string{"k8s", "logs"}, Tier: TierRead,
+			Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
+		{Path: []string{"k8s", "describe"}, Tier: TierRead,
+			Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
+		{Path: []string{"k8s", "debug"}, Tier: TierRead,
+			Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
+		{Path: []string{"k8s", "pf"}, Tier: TierRead,
+			Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
+		{Path: []string{"k8s", "db"}, Tier: TierWrite,
+			Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
+		{Path: []string{"k8s", "exec"}, Tier: TierWrite,
+			Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
+		{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
+			Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
+		{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
+			Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
+		{Path: []string{"k8s", "restart"}, Tier: TierWrite,
+			Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
+		{Path: []string{"k8s", "probe"}, Tier: TierRead,
+			Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
+	}
+}
+
+func k8sStatus(args []string) error {
+	t := parseK8sTarget(args)
+	ns := t.namespace() // "" when no app/ns given → cluster-wide
+	get := []string{"get", "pods", "-o", "wide"}
+	ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
+	if ns == "" {
+		get = append(get, "-A")
+		ev = append(ev, "-A")
+	}
+	if err := kubectlStream(ns, get...); err != nil {
+		return err
+	}
+	fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
+	_ = kubectlStream(ns, ev...) // best-effort
+	return nil
+}
+
+func k8sGet(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" || len(t.rest) == 0 {
+		return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
+	}
+	return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
+}
+
+func k8sLogs(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
+	}
+	a := []string{"logs"}
+	if t.selector != "" {
+		a = append(a, "-l", t.selector)
+	} else {
+		a = append(a, t.objectRef())
+	}
+	if t.container != "" {
+		a = append(a, "-c", t.container)
+	}
+	if !containsPrefix(t.rest, "--tail") {
+		a = append(a, "--tail=200")
+	}
+	a = append(a, t.rest...)
+	return kubectlStream(t.namespace(), a...)
+}
+
+func k8sDescribe(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
+	}
+	if len(t.rest) > 0 {
+		return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
+	}
+	return kubectlStream(t.namespace(), "describe", t.objectRef())
+}
+
+func k8sDebug(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s debug <app>")
+	}
+	ns := t.namespace()
+	sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
+	sec("pods")
+	_ = kubectlStream(ns, "get", "pods", "-o", "wide")
+	sec("workloads")
+	_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
+	sec("describe "+t.objectRef())
+	_ = kubectlStream(ns, "describe", t.objectRef())
+	sec("recent logs (--tail=50)")
+	_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
+	sec("events (type!=Normal)")
+	_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
+	return nil
+}
+
+func k8sPortForward(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" || len(t.rest) == 0 {
+		return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
+	}
+	ports := t.rest[0]
+	target := "svc/" + t.app
+	if len(t.rest) > 1 {
+		target = t.rest[1]
+	}
+	return kubectlStream(t.namespace(), "port-forward", target, ports)
+}
+
+func k8sDB(args []string) error {
+	var app, dbName, sql string
+	mysql := false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		if a == "--" {
+			sql = strings.Join(args[i+1:], " ")
+			break
+		}
+		switch {
+		case a == "--mysql":
+			mysql = true
+		case a == "--db":
+			if i+1 < len(args) {
+				dbName = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--db="):
+			dbName = strings.TrimPrefix(a, "--db=")
+		case !strings.HasPrefix(a, "-") && app == "":
+			app = a
+		}
+	}
+	if app == "" {
+		return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
+	}
+	p := planDBExec(app, dbName, sql, mysql)
+	pod := p.pod
+	if pod == "" && p.selector != "" {
+		resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
+		if err != nil || resolved == "" {
+			return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
+		}
+		pod = resolved
+	}
+	exec := []string{"exec"}
+	if sql == "" {
+		exec = append(exec, "-it") // interactive client when no SQL given
+	}
+	exec = append(exec, pod)
+	if p.container != "" {
+		exec = append(exec, "-c", p.container)
+	}
+	exec = append(exec, "--")
+	exec = append(exec, p.argv...)
+	return kubectlStream(p.ns, exec...)
+}
+
+func k8sExec(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
+	}
+	if len(t.rest) == 0 {
+		return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
+	}
+	a := []string{"exec"}
+	if t.tty {
+		a = append(a, "-it")
+	}
+	a = append(a, t.objectRef())
+	if t.container != "" {
+		a = append(a, "-c", t.container)
+	}
+	a = append(a, "--")
+	a = append(a, t.rest...)
+	return kubectlStream(t.namespace(), a...)
+}
+
+func k8sRmPod(args []string) error {
+	var pod, ns, grace string
+	force, job := false, false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "-n" || a == "--namespace":
+			if i+1 < len(args) {
+				ns = args[i+1]
+				i++
+			}
+		case a == "--force":
+			force = true
+		case a == "--job":
+			job = true
+		case a == "--grace":
+			if i+1 < len(args) {
+				grace = args[i+1]
+				i++
+			}
+		case !strings.HasPrefix(a, "-") && pod == "":
+			pod = a
+		}
+	}
+	if pod == "" || ns == "" {
+		return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
+	}
+	kind := "pod"
+	if job {
+		kind = "job"
+	}
+	a := []string{"delete", kind, pod}
+	if grace != "" {
+		a = append(a, "--grace-period="+grace)
+	}
+	if force {
+		a = append(a, "--force")
+	}
+	return kubectlStream(ns, a...)
+}
+
+func k8sRolloutStatus(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s rollout-status <app>")
+	}
+	return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
+}
+
+func k8sRestart(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s restart <app>")
+	}
+	ns := t.namespace()
+	if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
+		return err
+	}
+	return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
+}
+
+func k8sProbe(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
+	}
+	ns := t.namespace()
+	url := "http://" + t.app + "." + ns + ".svc.cluster.local"
+	if port := flagValue(args, "--port"); port != "" {
+		url += ":" + port
+	}
+	if len(t.rest) > 0 {
+		p := t.rest[0]
+		if !strings.HasPrefix(p, "/") {
+			p = "/" + p
+		}
+		url += p
+	}
+	return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
+		"--image=curlimages/curl:latest", "--",
+		"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
+}
+
+// containsPrefix reports whether any arg starts with prefix.
+func containsPrefix(args []string, prefix string) bool {
+	for _, a := range args {
+		if strings.HasPrefix(a, prefix) {
+			return true
+		}
+	}
+	return false
+}
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@ -0,0 +1,314 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"strings"
+)
+
+func memoryCommands() []Command {
+	return []Command{
+		{Path: []string{"memory", "recall"}, Tier: TierRead,
+			Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
+		{Path: []string{"memory", "list"}, Tier: TierRead,
+			Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
+		{Path: []string{"memory", "categories"}, Tier: TierRead,
+			Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
+		{Path: []string{"memory", "tags"}, Tier: TierRead,
+			Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
+		{Path: []string{"memory", "stats"}, Tier: TierRead,
+			Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
+		{Path: []string{"memory", "secret"}, Tier: TierRead,
+			Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
+		{Path: []string{"memory", "store"}, Tier: TierWrite,
+			Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
+		{Path: []string{"memory", "update"}, Tier: TierWrite,
+			Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
+		{Path: []string{"memory", "delete"}, Tier: TierWrite,
+			Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
+	}
+}
+
+// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
+func printMemories(raw []byte, jsonOut bool) error {
+	if jsonOut {
+		fmt.Println(string(raw))
+		return nil
+	}
+	var r struct {
+		Memories []struct {
+			ID         int     `json:"id"`
+			Content    string  `json:"content"`
+			Category   string  `json:"category"`
+			Tags       string  `json:"tags"`
+			Importance float64 `json:"importance"`
+		} `json:"memories"`
+	}
+	if err := json.Unmarshal(raw, &r); err != nil {
+		fmt.Println(string(raw))
+		return nil
+	}
+	if len(r.Memories) == 0 {
+		fmt.Println("(no memories)")
+		return nil
+	}
+	for _, m := range r.Memories {
+		c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
+		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
+		if m.Tags != "" {
+			fmt.Printf("       tags: %s\n", m.Tags)
+		}
+	}
+	return nil
+}
+
+// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
+// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
+// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
+// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
+// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
+// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
+// hook error" for Cyrillic-language users.
+func truncatePreview(s string, maxRunes int) string {
+	r := []rune(s)
+	if len(r) <= maxRunes {
+		return s
+	}
+	return string(r[:maxRunes]) + "…"
+}
+
+func memoryRecall(args []string) error {
+	req := memRecallReq{}
+	jsonOut := false
+	var pos []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--query":
+			if i+1 < len(args) {
+				req.ExpandedQuery = args[i+1]
+				i++
+			}
+		case a == "--category":
+			if i+1 < len(args) {
+				req.Category = args[i+1]
+				i++
+			}
+		case a == "--sort":
+			if i+1 < len(args) {
+				req.SortBy = args[i+1]
+				i++
+			}
+		case a == "--limit":
+			if i+1 < len(args) {
+				fmt.Sscanf(args[i+1], "%d", &req.Limit)
+				i++
+			}
+		case a == "--json":
+			jsonOut = true
+		case !strings.HasPrefix(a, "-"):
+			pos = append(pos, a)
+		}
+	}
+	req.Context = strings.Join(pos, " ")
+	if req.Context == "" {
+		return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories/recall", req)
+	if err != nil {
+		return err
+	}
+	return printMemories(raw, jsonOut)
+}
+
+func memoryList(args []string) error {
+	q := url.Values{}
+	jsonOut := false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--category":
+			if i+1 < len(args) {
+				q.Set("category", args[i+1])
+				i++
+			}
+		case a == "--tag":
+			if i+1 < len(args) {
+				q.Set("tag", args[i+1])
+				i++
+			}
+		case a == "--limit":
+			if i+1 < len(args) {
+				q.Set("limit", args[i+1])
+				i++
+			}
+		case a == "--json":
+			jsonOut = true
+		}
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	path := "/api/memories"
+	if len(q) > 0 {
+		path += "?" + q.Encode()
+	}
+	raw, err := c.do("GET", path, nil)
+	if err != nil {
+		return err
+	}
+	return printMemories(raw, jsonOut)
+}
+
+func memorySimpleGet(path string) func([]string) error {
+	return func(args []string) error {
+		c, err := newMemoryClient()
+		if err != nil {
+			return err
+		}
+		raw, err := c.do("GET", path, nil)
+		if err != nil {
+			return err
+		}
+		fmt.Println(string(raw))
+		return nil
+	}
+}
+
+func memorySecret(args []string) error {
+	id, _ := firstPositional(args)
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory secret <id>")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryStore(args []string) error {
+	req := memStoreReq{Category: "facts", Importance: 0.5}
+	var pos []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--category":
+			if i+1 < len(args) {
+				req.Category = args[i+1]
+				i++
+			}
+		case a == "--tags":
+			if i+1 < len(args) {
+				req.Tags = args[i+1]
+				i++
+			}
+		case a == "--keywords":
+			if i+1 < len(args) {
+				req.ExpandedKeywords = args[i+1]
+				i++
+			}
+		case a == "--importance":
+			if i+1 < len(args) {
+				fmt.Sscanf(args[i+1], "%f", &req.Importance)
+				i++
+			}
+		case a == "--sensitive":
+			req.ForceSensitive = true
+		case !strings.HasPrefix(a, "-"):
+			pos = append(pos, a)
+		}
+	}
+	req.Content = strings.Join(pos, " ")
+	if req.Content == "" {
+		return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories", req)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryUpdate(args []string) error {
+	var id string
+	req := memUpdateReq{}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--content":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.Content = &v
+				i++
+			}
+		case a == "--tags":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.Tags = &v
+				i++
+			}
+		case a == "--keywords":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.ExpandedKeywords = &v
+				i++
+			}
+		case a == "--importance" && i+1 < len(args):
+			var f float64
+			if _, err := fmt.Sscanf(args[i+1], "%f", &f); err != nil {
+				return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
+			}
+			req.Importance = &f
+			i++
+		case !strings.HasPrefix(a, "-") && id == "":
+			id = a
+		}
+	}
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryDelete(args []string) error {
+	id, _ := firstPositional(args)
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory delete <id>")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
--- a/cli/cmd_net.go
+++ b/cli/cmd_net.go
@ -0,0 +1,83 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+	"time"
+)
+
+func netCommands() []Command {
+	return []Command{
+		{Path: []string{"net", "check"}, Tier: TierRead,
+			Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
+		{Path: []string{"dns", "lookup"}, Tier: TierRead,
+			Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
+	}
+}
+
+func fmtProbe(code int, d time.Duration, err error) string {
+	if err != nil {
+		return "ERR " + err.Error()
+	}
+	return fmt.Sprintf("HTTP %d  %dms", code, d.Milliseconds())
+}
+
+func netCheck(args []string) error {
+	host, rest := firstPositional(args)
+	if host == "" {
+		return fmt.Errorf("usage: homelab net check <host> [path]")
+	}
+	path := "/"
+	if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
+		path = rest[0]
+		if !strings.HasPrefix(path, "/") {
+			path = "/" + path
+		}
+	}
+	u := "https://" + host + path
+	fmt.Printf("%s\n", u)
+
+	// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
+	pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
+	if pubIP := firstLine(pubOut); pubIP != "" {
+		c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
+		fmt.Printf("  external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
+	} else {
+		fmt.Println("  external (public)            no public A record")
+	}
+	// internal leg: dial the Traefik LB directly
+	c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
+	fmt.Printf("  internal (LB %-15s)     %s\n", internalLBIP, fmtProbe(c, d, e))
+	return nil
+}
+
+func dnsLookup(args []string) error {
+	name, rest := firstPositional(args)
+	if name == "" {
+		return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
+	}
+	rr := ""
+	if len(rest) > 0 {
+		rr = rest[0]
+	}
+	tech, _ := dig(name, "10.0.20.201", rr)
+	pub, _ := dig(name, "1.1.1.1", rr)
+	fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
+	fmt.Printf("public     (1.1.1.1)    : %s\n", oneLineList(pub))
+	if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
+		fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
+	}
+	return nil
+}
+
+func hostOnly(h string) string { // strip any path accidentally included
+	return strings.SplitN(h, "/", 2)[0]
+}
+
+func oneLineList(s string) string {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		return "(none)"
+	}
+	return strings.ReplaceAll(s, "\n", ", ")
+}
--- a/cli/cmd_obs.go
+++ b/cli/cmd_obs.go
@ -0,0 +1,197 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+	"strings"
+	"time"
+)
+
+const (
+	promHost = "prometheus-query.viktorbarzin.lan"
+	lokiHost = "loki.viktorbarzin.lan"
+)
+
+func obsCommands() []Command {
+	return []Command{
+		{Path: []string{"metrics", "query"}, Tier: TierRead,
+			Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
+		{Path: []string{"metrics", "alerts"}, Tier: TierRead,
+			Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
+		{Path: []string{"logs", "query"}, Tier: TierRead,
+			Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
+	}
+}
+
+// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
+// passed as a single quoted argument; this also tolerates unquoted multi-token).
+func queryArg(args []string, valueFlags map[string]bool) string {
+	var parts []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		if valueFlags[a] {
+			i++
+			continue
+		}
+		if strings.HasPrefix(a, "-") {
+			continue
+		}
+		parts = append(parts, a)
+	}
+	return strings.Join(parts, " ")
+}
+
+func labelStr(m map[string]string) string {
+	name := m["__name__"]
+	var kv []string
+	for k, v := range m {
+		if k != "__name__" {
+			kv = append(kv, k+"="+v)
+		}
+	}
+	sort.Strings(kv)
+	return name + "{" + strings.Join(kv, ",") + "}"
+}
+
+func metricsQuery(args []string) error {
+	q := queryArg(args, nil)
+	if q == "" {
+		return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
+	}
+	v := url.Values{}
+	v.Set("query", q)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no series)")
+		return nil
+	}
+	for _, s := range r.Data.Result {
+		val := ""
+		if len(s.Value) == 2 {
+			val = fmt.Sprint(s.Value[1])
+		}
+		fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
+	}
+	return nil
+}
+
+func metricsAlerts(args []string) error {
+	// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
+	// set is exposed as the synthetic ALERTS series, queryable the normal way.
+	v := url.Values{}
+	v.Set("query", `ALERTS{alertstate="firing"}`)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no firing alerts)")
+		return nil
+	}
+	for _, a := range r.Data.Result {
+		m := a.Metric
+		scope := ""
+		for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
+			if v := m[k]; v != "" {
+				scope = k + "=" + v
+				break
+			}
+		}
+		fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
+	}
+	return nil
+}
+
+func logsQuery(args []string) error {
+	q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
+	if q == "" {
+		return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
+	}
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "1h"
+	}
+	dur, err := time.ParseDuration(since)
+	if err != nil {
+		return fmt.Errorf("bad --since %q (want a Go duration such as 30m or 24h; day units like 7d are not accepted): %w", since, err)
+	}
+	limit := flagValue(args, "--limit")
+	if limit == "" {
+		limit = "100"
+	}
+	end := time.Now()
+	v := url.Values{}
+	v.Set("query", q)
+	v.Set("limit", limit)
+	v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
+	v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Values [][]string `json:"values"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	n := 0
+	for _, s := range r.Data.Result {
+		for _, val := range s.Values {
+			if len(val) == 2 {
+				fmt.Println(val[1])
+				n++
+			}
+		}
+	}
+	if n == 0 {
+		fmt.Println("(no log lines)")
+	}
+	return nil
+}
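
Review note on `--since`: `logsQuery` above parses the value with `time.ParseDuration`, which has no day unit, so `logs query --since 7d` is rejected, while `usage top` passes its `--since` straight into LogQL, where `30d` is valid. A minimal sketch of one way to accept both, assuming a hypothetical helper named `parseSince` (it is not part of this change):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseSince is a hypothetical helper, not part of the diff. It accepts
// everything time.ParseDuration accepts, plus a whole-number day suffix
// such as "7d", which it converts to 24-hour days.
func parseSince(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || n < 0 {
			return 0, fmt.Errorf("bad day count in %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

func main() {
	for _, s := range []string{"1h", "90m", "7d"} {
		d, err := parseSince(s)
		fmt.Println(s, d, err)
	}
}
```

Treating a day as exactly 24 hours ignores DST shifts, which seems acceptable for a log look-back window.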
--- a/cli/cmd_tf.go
+++ b/cli/cmd_tf.go
@ -0,0 +1,122 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/signal"
+	"path/filepath"
+	"strings"
+	"sync"
+	"syscall"
+)
+
+func tfCommands() []Command {
+	return []Command{
+		{Path: []string{"tf", "plan"}, Tier: TierRead,
+			Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
+		{Path: []string{"tf", "validate"}, Tier: TierRead,
+			Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
+		{Path: []string{"tf", "fmt"}, Tier: TierRead,
+			Summary: "terraform fmt a stack's files", Run: tfFmt},
+		{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
+			Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
+		{Path: []string{"tf", "apply"}, Tier: TierWrite,
+			Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
+	}
+}
+
+// firstPositional returns the first non-flag arg and the remaining args with it removed.
+func firstPositional(args []string) (string, []string) {
+	for i, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			rest := append(append([]string{}, args[:i]...), args[i+1:]...)
+			return a, rest
+		}
+	}
+	return "", args
+}
+
+// resolveTfStack finds the infra root (from cwd) and the stack directory named
+// by the first positional arg, returning the remaining args.
+func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
+	stackName, rest = firstPositional(args)
+	if stackName == "" {
+		err = fmt.Errorf("missing <stack> argument")
+		return
+	}
+	cwd, e := os.Getwd()
+	if e != nil {
+		err = e
+		return
+	}
+	infraRoot, err = findInfraRoot(cwd)
+	if err != nil {
+		return
+	}
+	stackDir, err = resolveStack(infraRoot, stackName)
+	return
+}
+
+func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
+
+// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
+func tfPassthrough(verb string) func([]string) error {
+	return func(args []string) error {
+		infraRoot, _, stackDir, rest, err := resolveTfStack(args)
+		if err != nil {
+			return err
+		}
+		return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
+	}
+}
+
+func tfFmt(args []string) error {
+	_, _, stackDir, _, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
+}
+
+func tfForceUnlock(args []string) error {
+	infraRoot, _, stackDir, rest, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	if len(rest) < 1 {
+		return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
+	}
+	return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
+}
+
+// tfApply applies a stack out-of-band: claim the stack on the presence board,
+// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
+// and warn that CI applies canonically on push.
+func tfApply(args []string) error {
+	infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	label := "stack:" + stackName
+	fmt.Fprintf(os.Stderr,
+		"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
+
+	if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
+		return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
+	}
+	// Release exactly once, whether we exit normally, on error, or on signal —
+	// sync.Once makes the defer and the signal goroutine safe to both call it.
+	var once sync.Once
+	release := func() { once.Do(func() { _ = presenceRelease(label) }) }
+	defer release()
+
+	sig := make(chan os.Signal, 1)
+	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
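+	go func() {
+		s := <-sig
+		release()
+		code := 130 // 128+SIGINT; SIGTERM yields 143
+		if n, ok := s.(syscall.Signal); ok {
+			code = 128 + int(n)
+		}
+		os.Exit(code)
+	}()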
+
+	return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
+}
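
Review note on the release path in `tfApply`: the claim is released from both a `defer` and a signal goroutine, and `sync.Once` is what keeps that safe. A minimal, self-contained sketch of the pattern, assuming a hypothetical wrapper `newReleaser` and a counter standing in for `presenceRelease`:

```go
package main

import (
	"fmt"
	"sync"
)

// newReleaser (hypothetical name) wraps a cleanup function so that any
// number of callers, e.g. a defer and a signal goroutine, run it at most once.
func newReleaser(cleanup func()) func() {
	var once sync.Once
	return func() { once.Do(cleanup) }
}

func main() {
	calls := 0 // stands in for the real release side effect
	release := newReleaser(func() { calls++ })

	// Simulate several concurrent exit paths racing to release.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			release()
		}()
	}
	wg.Wait()
	release() // the deferred call on the normal path

	fmt.Println("cleanup ran", calls, "time(s)")
}
```

Note that `os.Exit` skips deferred functions, which is why the signal goroutine has to call the release function itself before exiting.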
--- a/cli/cmd_tf_test.go
+++ b/cli/cmd_tf_test.go
@ -0,0 +1,27 @@
+package main
+
+import (
+	"reflect"
+	"testing"
+)
+
+func TestFirstPositional(t *testing.T) {
+	cases := []struct {
+		args     []string
+		wantName string
+		wantRest []string
+	}{
+		{[]string{"vault"}, "vault", []string{}},
+		{[]string{"--json", "vault"}, "vault", []string{"--json"}},
+		{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
+		{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
+		{[]string{"--only-flags"}, "", []string{"--only-flags"}},
+	}
+	for _, c := range cases {
+		gotName, gotRest := firstPositional(c.args)
+		if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
+			t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
+				c.args, gotName, gotRest, c.wantName, c.wantRest)
+		}
+	}
+}
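
Review note on `firstPositional`: it builds `rest` with `append(append([]string{}, args[:i]...), args[i+1:]...)` rather than the shorter `append(args[:i], args[i+1:]...)`. The sketch below, using two hypothetical helpers, illustrates the difference: the short form reuses the caller's backing array and overwrites its elements.

```go
package main

import "fmt"

// removeInPlace drops element i but writes into args' backing array,
// so the caller's slice is modified as a side effect.
func removeInPlace(args []string, i int) []string {
	return append(args[:i], args[i+1:]...)
}

// removeCopy drops element i into a fresh slice, leaving args untouched.
// This mirrors the expression used by firstPositional.
func removeCopy(args []string, i int) []string {
	return append(append([]string{}, args[:i]...), args[i+1:]...)
}

func main() {
	a := []string{"--json", "vault", "abc-123"}
	fmt.Println(removeInPlace(a, 1), a) // a is now [--json abc-123 abc-123]

	b := []string{"--json", "vault", "abc-123"}
	fmt.Println(removeCopy(b, 1), b) // b is still [--json vault abc-123]
}
```

Since callers such as `tfForceUnlock` keep using `args` and `rest` side by side, the copying form is the safer choice.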
--- a/cli/cmd_usage.go
+++ b/cli/cmd_usage.go
@ -0,0 +1,77 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+)
+
+func usageCommands() []Command {
+	return []Command{
+		{Path: []string{"usage", "top"}, Tier: TierRead,
+			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
+	}
+}
+
+// usageQuery builds the LogQL metric query that counts invocations per verb.
+func usageQuery(since, user string) string {
+	sel := `job="` + usageJob + `"`
+	if user != "" {
+		sel += `, user="` + user + `"`
+	}
+	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
+}
+
+func usageTop(args []string) error {
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "30d"
+	}
+	v := url.Values{}
+	v.Set("query", usageQuery(since, flagValue(args, "--user")))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	type row struct {
+		verb string
+		n    int
+	}
+	var rows []row
+	for _, s := range r.Data.Result {
+		n := 0
+		if len(s.Value) == 2 {
+			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
+				n = int(f)
+			}
+		}
+		rows = append(rows, row{s.Metric["verb"], n})
+	}
+	if len(rows) == 0 {
+		fmt.Println("(no usage recorded yet)")
+		return nil
+	}
+	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
+	for _, r := range rows {
+		fmt.Printf("%6d  %s\n", r.n, r.verb)
+	}
+	return nil
+}
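
Review note on `usageQuery`: the `--user` value is concatenated into the LogQL selector verbatim, so a value containing a double quote would produce a malformed selector. A hedged sketch of a quoting variant follows; `usageQuerySafe` is a hypothetical name, the `job` value `homelab-usage` is a placeholder because `usageJob` is defined outside this excerpt, and it assumes LogQL accepts Go-style escaped double-quoted strings in label matchers.

```go
package main

import (
	"fmt"
	"strconv"
)

// usageJobExample is a placeholder; the real usageJob constant is defined
// elsewhere in the repository and its value is not shown in this diff.
const usageJobExample = "homelab-usage"

// usageQuerySafe (hypothetical) builds the same metric query as usageQuery
// but escapes label values with strconv.Quote instead of raw concatenation.
func usageQuerySafe(since, user string) string {
	sel := "job=" + strconv.Quote(usageJobExample)
	if user != "" {
		sel += ", user=" + strconv.Quote(user)
	}
	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}

func main() {
	fmt.Println(usageQuerySafe("30d", ""))
	fmt.Println(usageQuerySafe("7d", "emo"))
	fmt.Println(usageQuerySafe("7d", `a"b`))
}
```

For ordinary usernames the output is identical to the concatenating version; only values with quotes or backslashes differ.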
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -0,0 +1,944 @@
+package main
+
+import (
+	"bufio"
+	"encoding/base64"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+	"syscall"
+)
+
+// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
+// Identity is the kernel UID; per-user creds live in that user's isolated Vault
+// path (secret/workstation/claude-users/<user>) read via their scoped token, and
+// decryption is done by the official `bw` CLI. See
+// docs/runbooks/homelab-vault-onboarding.md.
+func vaultCommands() []Command {
+	cmds := []Command{
+		// Vaultwarden — your personal password manager (logins/passwords/TOTP).
+		{Path: []string{"vault", "setup"}, Tier: TierWrite,
+			Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
+		{Path: []string{"vault", "status"}, Tier: TierRead,
+			Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
+		{Path: []string{"vault", "list"}, Tier: TierRead,
+			Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
+		{Path: []string{"vault", "get"}, Tier: TierRead,
+			Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
+		{Path: []string{"vault", "search"}, Tier: TierRead,
+			Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
+		{Path: []string{"vault", "code"}, Tier: TierRead,
+			Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
+		{Path: []string{"vault", "lock"}, Tier: TierWrite,
+			Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
+		{Path: []string{"vault"}, Tier: TierRead,
+			Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
+			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
+	}
+	// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
+	return append(cmds, vaultKVCommands()...)
+}
+
+// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
+// between the two unrelated "vaults" this command fronts, because the name
+// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
+// infra secrets store).
+func vaultHelp() string {
+	return `homelab vault — two different secret stores under one command:
+
+  • Vaultwarden               your personal PASSWORD MANAGER (logins / passwords / TOTP)
+  • HashiCorp Vault / OpenBao  homelab INFRA secrets (the secret/… KV store)  → 'vault kv …'
+
+── Vaultwarden  (reads YOUR OWN vault; no-HITL after one-time setup) ──
+  homelab vault setup             one-time: store your master password + API key in your Vault path
+  homelab vault status            configured / unlocked / reachable (no secrets)
+  homelab vault list [--search Q] list your item names (no secrets)
+  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
+                                  TTY → clipboard (auto-clears); piped → stdout
+  homelab vault get <name> --all  all fields (incl. custom) as JSON; piped only.
+                                  TOTP shown as presence flag — use 'vault code' for a code.
+  homelab vault code <name>       current TOTP code
+  homelab vault lock              lock / log out the local bw session
+
+── HashiCorp Vault / OpenBao  (infra secrets; uses your own OIDC vault token) ──
+  homelab vault kv get <path> [--field K]   read an infra KV secret
+  homelab vault kv list <path>              list sub-paths
+  homelab vault kv put <path> <key>         write one key (value via stdin)
+
+Vaultwarden creds live only in your own Vault path; the admin never sees them.
+Security model: docs/runbooks/homelab-vault-onboarding.md
+(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
+`
+}
+
+const vwUserPathPrefix = "secret/workstation/claude-users/"
+
+// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
+type vwCreds struct {
+	Email          string
+	MasterPassword string
+	ClientID       string
+	ClientSecret   string
+}
+
+// cmdRunner shells out to an external command with an explicit environment and
+// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
+// a fake; realRunner is the production implementation.
+type cmdRunner func(name string, argv, envv []string) (string, error)
+
+func realRunner(name string, argv, envv []string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	out, err := cmd.Output()
+	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
+	// fetched secret with significant leading/trailing spaces is preserved.
+	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
+}
+
+// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
+// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
+// write the actionable message there — "connection refused", "permission
+// denied" — which the caller would otherwise never see behind a bare
+// "exit status N".
+func exitStderr(err error) []byte {
+	var ee *exec.ExitError
+	if errors.As(err, &ee) {
+		return ee.Stderr
+	}
+	return nil
+}
+
+// augmentErr appends captured stderr to an error so failures are diagnosable
+// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
+// when there's no stderr; preserves the wrapped error for errors.Is/As.
+func augmentErr(err error, stderr []byte) error {
+	if err == nil {
+		return nil
+	}
+	if s := strings.TrimSpace(string(stderr)); s != "" {
+		return fmt.Errorf("%w: %s", err, s)
+	}
+	return err
+}
+
+// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
+// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
+// processes). Used by setup to write the master password / client_secret.
+func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	cmd.Stdin = strings.NewReader(stdin)
+	out, err := cmd.Output()
+	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
+}
+
+func vwCredsPath(user string) string { return vwUserPathPrefix + user }
+
+func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
+
+// readVaultField returns one field from a KV-v2 path, "" if absent/error.
+func readVaultField(run cmdRunner, field, path string) string {
+	out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
+	if err != nil {
+		return ""
+	}
+	return out
+}
+
+// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
+// A missing master password means the user hasn't onboarded.
+func loadCreds(run cmdRunner, user string) (vwCreds, error) {
+	p := vwCredsPath(user)
+	c := vwCreds{
+		Email:          readVaultField(run, "vaultwarden_email", p),
+		MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
+		ClientID:       readVaultField(run, "vaultwarden_client_id", p),
+		ClientSecret:   readVaultField(run, "vaultwarden_client_secret", p),
+	}
+	if c.MasterPassword == "" {
+		return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
+	}
+	return c, nil
+}
+
+// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
+var vaultCurrentUser = func() string { return os.Getenv("USER") }
+var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
+
+// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
+// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
+func scopedTokenPath(home string) string {
+	return home + "/.config/claude-auth-sync/vault-token"
+}
+
+// vaultTokenSource decides which Vault token the `vault` child processes should
+// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
+// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
+// (policy workstation-claude-<user>, which grants exactly the create/read/update
+// this tool needs on the user's own path), then a native ~/.vault-token.
+//
+// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
+// caller's own secret/workstation/claude-users/<user> path, and a power-user who
+// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
+// capability on that path is `deny` — letting it win shadows the scoped token
+// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
+// right credential when there is no scoped token (admins). Returns the token to
+// export — "" when the vault CLI should read the ambient/native credential —
+// plus a source tag for tests/logging.
+func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
+	switch {
+	case envToken != "":
+		return "", "env"
+	case strings.TrimSpace(scopedToken) != "":
+		return strings.TrimSpace(scopedToken), "scoped"
+	case haveVaultTokenFile:
+		return "", "file"
+	default:
+		return "", "none"
+	}
+}
+
+// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
+// is likewise hardcoded (openSession), so a sane default here is consistent.
+const vaultAddrDefault = "https://vault.viktorbarzin.me"
+
+// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
+// doesn't already set one, else "". homelab vault is invoked by AFK agent
+// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
+// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
+// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
+// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
+func vaultAddrToSet(envAddr string) string {
+	if strings.TrimSpace(envAddr) == "" {
+		return vaultAddrDefault
+	}
+	return ""
+}
+
+// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
+// child processes reach the cluster Vault regardless of the caller's shell. An
+// explicit VAULT_ADDR (admins, CI) is left untouched.
+func ensureVaultAddr() {
+	if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
+		os.Setenv("VAULT_ADDR", a)
+	}
+}
+
+// fileNonEmpty reports whether path exists and has content.
+func fileNonEmpty(path string) bool {
+	fi, err := os.Stat(path)
+	return err == nil && fi.Size() > 0
+}
+
+// ensureVaultToken wires vaultTokenSource to the real environment: unless an
+// explicit $VAULT_TOKEN is set, it exports the claude-auth-sync scoped token
+// (when one exists) so the `vault` child processes authenticate as
+// workstation-claude-<user>, even if a ~/.vault-token is also present. It is
+// idempotent and safe for admins: an explicit $VAULT_TOKEN is left untouched,
+// and with no scoped token the vault CLI falls back to ~/.vault-token itself.
+func ensureVaultToken() {
+	// Every vault verb funnels through here, so this is the one place that also
+	// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
+	// assumed from the caller's shell).
+	ensureVaultAddr()
+	home := os.Getenv("HOME")
+	scoped, _ := os.ReadFile(scopedTokenPath(home))
+	tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
+	if src == "scoped" {
+		os.Setenv("VAULT_TOKEN", tok)
+	}
+}
+
+// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
+// do NOT inherit the full parent env (keeps stray secrets out of the child).
+func bwBaseEnv(appdata string) []string {
+	path := os.Getenv("PATH")
+	if path == "" {
+		path = "/usr/local/bin:/usr/bin:/bin"
+	}
+	return []string{
+		"PATH=" + path,
+		"HOME=" + os.Getenv("HOME"),
+		"BITWARDENCLI_APPDATA_DIR=" + appdata,
+		"BW_NOINTERACTION=true",
+	}
+}
+
+// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
+func bwSecretEnv(appdata string, c vwCreds, session string) []string {
+	env := bwBaseEnv(appdata)
+	env = append(env,
+		"BW_CLIENTID="+c.ClientID,
+		"BW_CLIENTSECRET="+c.ClientSecret,
+		"BW_PASSWORD="+c.MasterPassword,
+	)
+	if session != "" {
+		env = append(env, "BW_SESSION="+session)
+	}
+	return env
+}
+
+func bwLoginArgs() []string                 { return []string{"login", "--apikey"} }
+func bwUnlockArgs() []string                { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
+func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
+func bwItemArgs(name string) []string       { return []string{"get", "item", name} }
+func bwStatusArgs() []string                { return []string{"status"} }
+func bwSyncArgs() []string                  { return []string{"sync"} }
+
+// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
+// required. Unparseable/empty output → true (safer to attempt login).
+func bwNeedsLogin(statusJSON string) bool {
+	var s struct {
+		Status string `json:"status"`
+	}
+	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
+		return true
+	}
+	return s.Status == "unauthenticated" || s.Status == ""
+}
+
+func bwListArgs(search string) []string {
+	a := []string{"list", "items"}
+	if search != "" {
+		a = append(a, "--search", search)
+	}
+	return a
+}
+
+// bwUnlock runs `bw unlock` and returns the raw session key.
+func bwUnlock(run cmdRunner, env []string) (string, error) {
+	out, err := run("bw", bwUnlockArgs(), env)
+	if err != nil {
+		return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
+	}
+	return out, nil
+}
+
+// bwGet fetches one field of one item; session must be present in env.
+func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
+	return run("bw", bwGetArgs(field, name), env)
+}
+
+func returnMode(isTTY bool) string {
+	if isTTY {
+		return "clipboard"
+	}
+	return "stdout"
+}
+
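+// stdoutIsTTY reports whether stdout is a character device, used here as a
+// cheap proxy for "is a terminal" (note that /dev/null is also a character
+// device, so a redirect to it is treated like a TTY).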
+func stdoutIsTTY() bool {
+	fi, err := os.Stdout.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
+// to stderr, so the clipboard path is only viable when stderr is a terminal).
+func stderrIsTTY() bool {
+	fi, err := os.Stderr.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
+// the system clipboard (works over SSH; no X11). osc52clear copies empty.
+func osc52(payload string) string {
+	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
+}
+func osc52clear() string { return "\x1b]52;c;\a" }
+
+// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
+// else we'd dump the secret's base64 into scrollback on unsupported terminals.
+func terminalAllowed(term, termProgram string) bool {
+	t := strings.ToLower(term)
+	p := strings.ToLower(termProgram)
+	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
+		if strings.Contains(t, ok) || strings.Contains(p, ok) {
+			return true
+		}
+	}
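+	// Anything else is refused, including a bare xterm-style TERM: many emulators
+	// claim xterm without honoring OSC 52, so xterm passes only when termProgram
+	// names one of the known-good emulators above.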
+	return false
+}
+
+// opRecord is one CLI operation. ItemName is accepted for the caller's
+// convenience but is INTENTIONALLY never rendered into the log line — auditing
+// which of your own logins you opened is itself sensitive, and per-item reads
+// are invisible server-side anyway (spec §9a).
+type opRecord struct {
+	User       string
+	Verb       string
+	PID        int
+	PPID       int
+	ParentComm string
+	ItemName   string // never logged
+}
+
+func opLogLine(r opRecord) string {
+	return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
+		r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
+}
+
+// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
+func parentComm(ppid int) string {
+	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
+	if err != nil {
+		return ""
+	}
+	return strings.TrimSpace(string(b))
+}
+
+// writeOpLog appends one privacy-aware line to the user's op-log. Best-effort:
+// it waits for `logger` to exit but ignores any error, so it never fails the
+// command. Goes to syslog so it ships to Loki.
+func writeOpLog(r opRecord) {
+	exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
+}
+
+func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
+
+// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
+// password to a core file. Best-effort.
+func hardenProcess() {
+	_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
+}
+
+// withUserLock serializes bw mutations for this user (concurrent Claude sessions
+// as the same user otherwise race bw's appdata). Returns an unlock func.
+func withUserLock(uid string) (func(), error) {
+	f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
+	if err != nil {
+		return nil, err
+	}
+	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
+		f.Close()
+		return nil, err
+	}
+	return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
+}
+
+// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
+type session struct {
+	env []string
+}
+
+// openSession resolves creds, ensures login, unlocks, and returns a ready env.
+// Caller must hold the user lock. appdata is created on tmpfs (0700).
+func openSession(run cmdRunner, user, uid string) (session, error) {
+	creds, err := loadCreds(run, user)
+	if err != nil {
+		return session{}, err
+	}
+	appdata := bwAppDataDir(uid)
+	if err := os.MkdirAll(appdata, 0700); err != nil {
+		return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
+	}
+	loginEnv := bwSecretEnv(appdata, creds, "")
+	// Ensure server is set and we're logged in (idempotent; ignore "already").
+	_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
+	st, _ := run("bw", bwStatusArgs(), loginEnv)
+	if bwNeedsLogin(st) {
+		if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
+			return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
+		}
+	}
+	sess, err := bwUnlock(run, loginEnv)
+	if err != nil {
+		return session{}, err
+	}
+	sessEnv := bwSecretEnv(appdata, creds, sess)
+	// Pull the latest server-side state so reads reflect current values. `bw
+	// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
+	// session would otherwise serve stale data until the next login. Best-effort:
+	// a transient sync failure must not break a read — fall back to the cached
+	// vault and warn (status reports reachability separately).
+	if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
+		fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
+	}
+	return session{env: sessEnv}, nil
+}
+
+type getOpts struct {
+	name  string
+	field string
+	json  bool
+	all   bool // dump every field (incl. custom) as normalized JSON
+}
+
+var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
+
+func parseGetArgs(args []string) (getOpts, error) {
+	o := getOpts{field: "password"}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--json":
+			o.json = true
+		case a == "--all":
+			o.all = true
+		case a == "--field" && i+1 < len(args):
+			o.field = args[i+1]
+			i++
+		case strings.HasPrefix(a, "--field="):
+			o.field = strings.TrimPrefix(a, "--field=")
+		case !strings.HasPrefix(a, "-") && o.name == "":
+			o.name = a
+		}
+	}
+	if o.name == "" {
+		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
+	}
+	// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
+	if !o.all && !validGetFields[o.field] {
+		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
+	}
+	return o, nil
+}
+
+// getValue opens a session and fetches one field. Apart from openSession
+// creating the appdata dir and possibly warning on stderr, every external
+// effect goes through the runner, so it is unit-tested with a fake runner.
+func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return bwGet(run, s.env, o.field, o.name)
+}
+
+// getItem opens a session and returns the whole item as raw `bw get item` JSON.
+// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
+func getItem(run cmdRunner, user, uid, name string) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return run("bw", bwItemArgs(name), s.env)
+}
+
+// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
+// standard login fields that are present, notes, and a flat map of custom field
+// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
+// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
+// stays the specially-audited `vault code` (see the design §10/§16).
+type normalizedItem struct {
+	Name     string            `json:"name"`
+	Username string            `json:"username,omitempty"`
+	Password string            `json:"password,omitempty"`
+	URIs     []string          `json:"uris,omitempty"`
+	TOTP     bool              `json:"totp,omitempty"` // presence only, never the seed
+	Notes    string            `json:"notes,omitempty"`
+	Fields   map[string]string `json:"fields,omitempty"` // custom field name→value
+}
+
+// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
+// references another field and carries a null value, so it is not real data.
+const bwFieldLinked = 3
+
+// normalizeItem parses a `bw get item` payload into the browse projection. It is
+// pure (no I/O), so it is the unit-tested heart of `get --all`.
+func normalizeItem(raw string) (normalizedItem, error) {
+	var it struct {
+		Name  string `json:"name"`
+		Notes string `json:"notes"`
+		Login *struct {
+			Username string `json:"username"`
+			Password string `json:"password"`
+			Totp     string `json:"totp"`
+			URIs     []struct {
+				URI string `json:"uri"`
+			} `json:"uris"`
+		} `json:"login"`
+		Fields []struct {
+			Name  string `json:"name"`
+			Value string `json:"value"`
+			Type  int    `json:"type"`
+		} `json:"fields"`
+	}
+	if err := json.Unmarshal([]byte(raw), &it); err != nil {
+		return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
+	}
+	n := normalizedItem{Name: it.Name, Notes: it.Notes}
+	if it.Login != nil {
+		n.Username = it.Login.Username
+		n.Password = it.Login.Password
+		n.TOTP = it.Login.Totp != ""
+		for _, u := range it.Login.URIs {
+			if u.URI != "" {
+				n.URIs = append(n.URIs, u.URI)
+			}
+		}
+	}
+	for _, f := range it.Fields {
+		if f.Type == bwFieldLinked {
+			continue // references another field, no value of its own
+		}
+		if n.Fields == nil {
+			n.Fields = map[string]string{}
+		}
+		n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
+	}
+	return n, nil
+}
+
+// clipboardDecision picks how to return a secret value. "stdout" prints it (a
+// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
+// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
+// base64 into scrollback, or silently fail because the OSC52 escape goes to a
+// non-terminal stderr).
+func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
+	if !stdoutTTY {
+		return "stdout"
+	}
+	if terminalAllowed(term, termProgram) && stderrTTY {
+		return "clipboard"
+	}
+	return "refuse"
+}
+
+// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
+// when stdout is NOT a terminal (i.e. piped to a machine consumer).
+func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
+
+// emitSecret delivers a value in a TTY-aware way (see clipboardDecision). It
+// never prints the secret to a terminal's stdout/scrollback.
+func emitSecret(value string) {
+	switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
+	case "stdout":
+		fmt.Println(value)
+	case "clipboard":
+		fmt.Fprint(os.Stderr, osc52(value))
+		fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
+		clearClipboardAfter(30)
+	default: // refuse
+		fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
+	}
+}
+
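+// clearClipboardAfter spawns a detached background clear so the secret doesn't
+// linger in the clipboard. Best-effort. The child inherits stderr (the same
+// stream emitSecret wrote the OSC52 copy to); with exec's default nil Stderr the
+// clear sequence would go to the null device and never reach the terminal.
+func clearClipboardAfter(seconds int) {
+	cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s' >&2", seconds, osc52clear()))
+	cmd.Stderr = os.Stderr
+	_ = cmd.Start()
+}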
+
+// listNames extracts "name (id)" from `bw list items` JSON; never values.
+func listNames(jsonOut string) []string {
+	var items []struct {
+		ID   string `json:"id"`
+		Name string `json:"name"`
+	}
+	if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
+		return nil
+	}
+	out := make([]string, 0, len(items))
+	for _, it := range items {
+		out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
+	}
+	return out
+}
+
+func runList(run cmdRunner, user, uid, search string) ([]string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return nil, err
+	}
+	out, err := run("bw", bwListArgs(search), s.env)
+	if err != nil {
+		return nil, err
+	}
+	return listNames(out), nil
+}
+
+func vaultList(args []string) error {
+	hardenProcess()
+	ensureVaultToken()
+	search := ""
+	for i := 0; i < len(args); i++ {
+		if args[i] == "--search" && i+1 < len(args) {
+			search = args[i+1]
+			i++
+		} else if strings.HasPrefix(args[i], "--search=") {
+			search = strings.TrimPrefix(args[i], "--search=")
+		}
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	names, err := runList(realRunner, vaultCurrentUser(), uid, search)
+	if err != nil {
+		return err
+	}
+	for _, n := range names {
+		fmt.Println(n)
+	}
+	return nil
+}
+
+func vaultSearch(args []string) error {
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault search <query>")
+	}
+	return vaultList([]string{"--search", strings.Join(args, " ")})
+}
+
+func vaultCode(args []string) error {
+	hardenProcess()
+	ensureVaultToken()
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault code <name>")
+	}
+	name := args[0]
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
+	if err != nil {
+		return err
+	}
+	// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
+	writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
+	exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
+	emitSecret(val)
+	return nil
+}
+
+// statusSummary reports config/reachability without revealing secrets.
+func statusSummary(run cmdRunner, user, uid string) string {
+	if _, err := loadCreds(run, user); err != nil {
+		return "vault: not configured — run `homelab vault setup`"
+	}
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
+	}
+	// openSession already did a best-effort sync; status re-runs it explicitly so
+	// a reachability failure surfaces in this report rather than only on stderr.
+	if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
+		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
+	}
+	return "vault: configured, unlocked, reachable ✓"
+}
+
+func vaultStatus(args []string) error {
+	hardenProcess()
+	ensureVaultToken()
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
+	return nil
+}
+
+func vaultLock(args []string) error {
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	appdata := bwAppDataDir(uid)
+	_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
+	_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
+	if logoutErr == nil {
+		fmt.Println("locked")
+	}
+	return nil // lock/logout best-effort; never error the caller
+}
+
+// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
+// (read-modify-write: needs only read+update, NOT the `patch` capability the
+// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
+// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
+// (creates the path on first use, before any sibling keys exist).
+func kvWriteVerb(merge bool) []string {
+	if merge {
+		return []string{"kv", "patch", "-method=rw"}
+	}
+	return []string{"kv", "put"}
+}
+
+// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
+// email nor the API client_id is a usable credential on its own.
+func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user),
+		"vaultwarden_email="+email,
+		"vaultwarden_client_id="+clientID,
+	)
+}
+
+// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
+// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
+// realRunnerStdin.
+func vaultWriteSecretArgs(merge bool, user, key string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
+}
+
+// credsPathExists reports whether the user's KV path already holds data. Used to
+// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
+// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
+// user could run `homelab vault setup` before that ever happens.
+func credsPathExists(run cmdRunner, user string) bool {
+	_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
+	return err == nil
+}
+
+// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
+type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
+
+// writeCreds stores all four fields in the user's Vault path using only the
+// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
+// first (public) write creates the path when absent; the two real secrets then
+// merge in via read-modify-write so the public keys — and any claude-auth-sync
+// keys already present — survive. Secret values travel on stdin, never argv.
+func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
+	merge := credsPathExists(run, user)
+	if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
+		return err
+	}
+	// The path now exists regardless of the branch above → merge the secrets in.
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+		return err
+	}
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+		return err
+	}
+	return nil
+}
+
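+// promptNoEcho reads one line without terminal echo (for the master password).
+func promptNoEcho(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	// stty acts on the terminal attached to its stdin; exec's default nil Stdin is
+	// the null device, so the child must inherit ours or echo is never disabled.
+	stty := func(mode string) {
+		c := exec.Command("stty", mode)
+		c.Stdin = os.Stdin
+		_ = c.Run()
+	}
+	stty("-echo")
+	defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()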
+	r := bufio.NewReader(os.Stdin)
+	line, err := r.ReadString('\n')
+	// Trim only the line terminator — a master password / API secret may
+	// legitimately contain leading/trailing spaces.
+	return strings.TrimRight(line, "\r\n"), err
+}
+
+func promptLine(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
+	return strings.TrimSpace(line), err
+}
+
+func vaultSetup(args []string) error {
+	hardenProcess()
+	ensureVaultToken()
+	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
+	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
+	email, err := promptLine("Vaultwarden email: ")
+	if err != nil {
+		return err
+	}
+	clientID, err := promptLine("API key client_id (user.xxxx): ")
+	if err != nil {
+		return err
+	}
+	clientSecret, err := promptNoEcho("API key client_secret: ")
+	if err != nil {
+		return err
+	}
+	master, err := promptNoEcho("Master password: ")
+	if err != nil {
+		return err
+	}
+	if email == "" || master == "" || clientID == "" || clientSecret == "" {
+		return fmt.Errorf("all fields are required")
+	}
+	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
+	if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
+		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
+		return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
+	return nil
+}
+
+func vaultGet(args []string) error {
+	hardenProcess()
+	ensureVaultToken()
+	o, err := parseGetArgs(args)
+	if err != nil {
+		return err
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	if o.all {
+		return getAllFields(user, uid, o.name)
+	}
+	val, err := getValue(realRunner, user, uid, o)
+	if err != nil {
+		return err
+	}
+	writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
+	if o.json {
+		if !jsonToStdoutOK(stdoutIsTTY()) {
+			return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
+		}
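+		// Marshal rather than format with %q: Go-quoted strings use escapes such
+		// as \x7f and \a that are not valid JSON.
+		out, err := json.Marshal(map[string]string{o.field: val})
+		if err != nil {
+			return err
+		}
+		fmt.Println(string(out))
+		return nil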
+	}
+	emitSecret(val)
+	return nil
+}
+
+// getAllFields prints every field of one item as normalized JSON. Like
+// `get --json`, the payload is all secret values, so it refuses a terminal
+// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
+// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
+// distinguishable from a single-field get (the item name is still never logged).
+func getAllFields(user, uid, name string) error {
+	if !jsonToStdoutOK(stdoutIsTTY()) {
+		return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
+	}
+	raw, err := getItem(realRunner, user, uid, name)
+	if err != nil {
+		return err
+	}
+	item, err := normalizeItem(raw)
+	if err != nil {
+		return err
+	}
+	out, err := json.Marshal(item)
+	if err != nil {
+		return err
+	}
+	writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
+	fmt.Println(string(out))
+	return nil
+}
--- a/cli/cmd_vault_kv.go
+++ b/cli/cmd_vault_kv.go
@ -0,0 +1,248 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"io"
+	"os"
+	"strings"
+)
+
+// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
+// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
+// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
+// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
+// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
+//
+// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
+// token (bound only to secret/workstation/claude-users/<user>). A general kv read
+// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
+// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
+// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
+// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
+// injects the scoped token). Access is then whatever the caller's policy grants.
+func vaultKVCommands() []Command {
+	return []Command{
+		{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
+		{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
+		{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
+			Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
+		{Path: []string{"vault", "kv"}, Tier: TierRead,
+			Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
+			Run:     func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
+	}
+}
+
+func vaultKVHelp() string {
+	return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store)
+
+  homelab vault kv get <path> [--field K]   read a secret
+                                  --field K  → one value (TTY → clipboard; piped → stdout)
+                                  no --field → all fields as JSON (piped only)
+  homelab vault kv list <path>    list sub-paths under <path> (no values)
+  homelab vault kv put <path> <key>   write one key; value read from stdin
+                                  (piped, or no-echo prompt); merges — never clobbers siblings
+
+Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is
+whatever your policy grants. This is NOT Vaultwarden — for your personal logins
+use 'homelab vault get' (see 'homelab vault').
+`
+}
+
+// --- arg builders (pure; values never travel via argv) --------------------
+
+func vaultKVGetFieldArgs(path, field string) []string {
+	return []string{"kv", "get", "-field=" + field, path}
+}
+func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
+func vaultKVListArgs(path string) []string    { return []string{"kv", "list", "-format=json", path} }
+
+// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
+// (read-modify-write: merges, needs only read+update — not the `patch` capability
+// — and preserves sibling keys); merge=false → `kv put` (creates the path on
+// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
+// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
+func vaultKVPutArgs(merge bool, path, key string) []string {
+	return append(kvWriteVerb(merge), path, key+"=-")
+}
+
+// --- pure parsers ----------------------------------------------------------
+
+// extractKVData returns the inner secret object from a `vault kv get -format=json`
+// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
+// wrapper so only the secret's own key→value data is emitted.
+func extractKVData(jsonOut string) (string, error) {
+	var env struct {
+		Data struct {
+			Data json.RawMessage `json:"data"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
+		return "", fmt.Errorf("parse vault kv json: %w", err)
+	}
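+	// A JSON null would otherwise pass the length check and be printed as "null".
+	if len(env.Data.Data) == 0 || string(env.Data.Data) == "null" {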
+		return "", fmt.Errorf("no secret data at that path")
+	}
+	return string(env.Data.Data), nil
+}
+
+// parseKVList parses the JSON array `vault kv list -format=json` prints.
+func parseKVList(jsonOut string) ([]string, error) {
+	var keys []string
+	if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
+		return nil, fmt.Errorf("parse vault kv list json: %w", err)
+	}
+	return keys, nil
+}
+
+// --- testable cores (injected cmdRunner) -----------------------------------
+
+func kvGetField(run cmdRunner, path, field string) (string, error) {
+	return run("vault", vaultKVGetFieldArgs(path, field), nil)
+}
+
+func kvGetJSON(run cmdRunner, path string) (string, error) {
+	out, err := run("vault", vaultKVGetJSONArgs(path), nil)
+	if err != nil {
+		return "", err
+	}
+	return extractKVData(out)
+}
+
+func kvList(run cmdRunner, path string) ([]string, error) {
+	out, err := run("vault", vaultKVListArgs(path), nil)
+	if err != nil {
+		return nil, err
+	}
+	return parseKVList(out)
+}
+
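+// kvPathExists reports whether the KV path already holds data, to pick create
+// (`kv put`) vs merge (`kv patch -method=rw`) so that a write to an existing path
+// keeps its sibling keys. Any read error (not only "not found") counts as
+// absent, so a failed read falls through to `kv put`.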
+func kvPathExists(run cmdRunner, path string) bool {
+	_, err := run("vault", vaultKVGetJSONArgs(path), nil)
+	return err == nil
+}
+
+// kvPut writes one key, creating the path when absent and merging when present.
+// The value travels on stdin only (never argv).
+func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
+	merge := kvPathExists(run, path)
+	_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
+	return err
+}
+
+// --- handlers --------------------------------------------------------------
+
+func vaultKVGet(args []string) error {
+	hardenProcess()
+	ensureVaultAddr() // own token, NOT the scoped one (see file header)
+	var path, field string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--field" && i+1 < len(args):
+			field = args[i+1]
+			i++
+		case strings.HasPrefix(a, "--field="):
+			field = strings.TrimPrefix(a, "--field=")
+		case !strings.HasPrefix(a, "-") && path == "":
+			path = a
+		}
+	}
+	if path == "" {
+		return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
+	}
+	if field != "" {
+		val, err := kvGetField(realRunner, path, field)
+		if err != nil {
+			return err
+		}
+		emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
+		return nil
+	}
+	// No --field → the whole secret. All values, so refuse a bare TTY (like
+	// `vault get --json`): pick a --field for the clipboard path, or pipe it.
+	if !jsonToStdoutOK(stdoutIsTTY()) {
+		return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
+	}
+	out, err := kvGetJSON(realRunner, path)
+	if err != nil {
+		return err
+	}
+	fmt.Println(out)
+	return nil
+}
+
+func vaultKVList(args []string) error {
+	ensureVaultAddr()
+	var path string
+	for _, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			path = a
+			break
+		}
+	}
+	if path == "" {
+		return fmt.Errorf("usage: homelab vault kv list <path>")
+	}
+	keys, err := kvList(realRunner, path)
+	if err != nil {
+		return err
+	}
+	for _, k := range keys {
+		fmt.Println(k)
+	}
+	return nil
+}
+
+func vaultKVPut(args []string) error {
+	hardenProcess()
+	ensureVaultAddr()
+	var path, key string
+	for _, a := range args {
+		if strings.HasPrefix(a, "-") {
+			continue
+		}
+		switch {
+		case path == "":
+			path = a
+		case key == "":
+			key = a
+		}
+	}
+	if path == "" || key == "" {
+		return fmt.Errorf("usage: homelab vault kv put <path> <key>   (value read from stdin)")
+	}
+	value, err := readSecretValue("Value for " + key + ": ")
+	if err != nil {
+		return err
+	}
+	if value == "" {
+		return fmt.Errorf("empty value; aborting (nothing written)")
+	}
+	if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
+		return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
+	}
+	fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
+	return nil
+}
+
+// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
+// is read verbatim (all trailing CR/LF trimmed, internal newlines preserved so
+// multi-line values like PEM keys survive); an interactive TTY is prompted
+// without echo.
+func readSecretValue(prompt string) (string, error) {
+	fi, err := os.Stdin.Stat()
+	if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
+		b, rerr := io.ReadAll(os.Stdin)
+		if rerr != nil {
+			return "", rerr
+		}
+		return strings.TrimRight(string(b), "\r\n"), nil
+	}
+	return promptNoEcho(prompt)
+}
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
--- a/cli/cmd_work.go
+++ b/cli/cmd_work.go
@ -0,0 +1,212 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+func workCommands() []Command {
+	return []Command{
+		{Path: []string{"work", "start"}, Tier: TierWrite,
+			Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
+		{Path: []string{"work", "land"}, Tier: TierWrite,
+			Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
+		{Path: []string{"work", "clean"}, Tier: TierWrite,
+			Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
+	}
+}
+
+// flagValue extracts `--name value` or `--name=value` from args.
+func flagValue(args []string, name string) string {
+	for i, a := range args {
+		if a == name && i+1 < len(args) {
+			return args[i+1]
+		}
+		if strings.HasPrefix(a, name+"=") {
+			return strings.TrimPrefix(a, name+"=")
+		}
+	}
+	return ""
+}
+
+func remotesOrEmpty(repoRoot string) []string {
+	r, _ := gitRemotes(repoRoot)
+	return r
+}
+
+// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
+func workStart(args []string) error {
+	topic, _ := firstPositional(args)
+	if topic == "" {
+		return fmt.Errorf("usage: homelab work start <topic>")
+	}
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	remote := preferRemote(remotesOrEmpty(repoRoot))
+	if remote == "" {
+		return fmt.Errorf("no git remote configured in %s", repoRoot)
+	}
+	flags := cryptFlagsFor(repoRoot)
+	branch := currentUser() + "/" + topic
+	wtRel := filepath.Join(".worktrees", topic)
+
+	ensureWorktreesIgnored(repoRoot)
+	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+		return fmt.Errorf("fetch %s failed: %w", remote, err)
+	}
+	if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
+		return fmt.Errorf("worktree add failed: %w", err)
+	}
+	wtPath := filepath.Join(repoRoot, wtRel)
+	fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
+	fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
+	return nil
+}
+
+// workLand integrates the current branch into master: fetch, merge master in,
+// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
+// fallback when the direct push is rejected (e.g. branch protection).
+func workLand(args []string) error {
+	verifyCmd := flagValue(args, "--verify-cmd")
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
+	if err != nil {
+		return err
+	}
+	if branch == "master" || branch == "main" {
+		return fmt.Errorf("refusing to land: already on %s", branch)
+	}
+	remote := preferRemote(remotesOrEmpty(repoRoot))
+	if remote == "" {
+		return fmt.Errorf("no git remote configured in %s", repoRoot)
+	}
+	flags := cryptFlagsFor(repoRoot)
+
+	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+		return fmt.Errorf("fetch failed: %w", err)
+	}
+	if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
+		return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
+	}
+	if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
+		return fmt.Errorf("not landing: %w", err)
+	}
+	if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
+		return landFallback(repoRoot, flags, remote, branch, err)
+	}
+	fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
+	if containsArg(args, "--no-ci-watch") {
+		fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
+		return nil
+	}
+	landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
+	fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
+	if err := ciWatch([]string{landed}); err != nil {
+		return fmt.Errorf("landed, but CI did not go green: %w", err)
+	}
+	return nil
+}
+
+// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
+// neither is available it REFUSES (returns an error) unless allowSkip is set —
+// landing to master unverified must be a deliberate choice (--no-verify).
+func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
+	if verifyCmd != "" {
+		fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
+		return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
+	}
+	if isFile(filepath.Join(repoRoot, "go.mod")) {
+		fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
+		return runStreamingIn(repoRoot, "go", "test", "./...")
+	}
+	if allowSkip {
+		fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
+		return nil
+	}
+	return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
+}
+
+// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
+// by fetching + merging master and retrying.
+func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
+	var lastErr error
+	for i := 0; i < attempts; i++ {
+		err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
+		if err == nil {
+			return nil
+		}
+		lastErr = err
+		if i < attempts-1 {
+			fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
+			if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+				return err
+			}
+			if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
+				return err
+			}
+		}
+	}
+	return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
+}
+
+// landFallback pushes the feature branch when the direct master push is rejected
+// (e.g. branch protection), so the work isn't lost and a PR can be opened.
+func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
+	fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
+	fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
+	if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
+		return fmt.Errorf("fallback branch push also failed: %w", err)
+	}
+	fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
+	return nil
+}
+
+// workClean removes a task's worktree and branch. Run from the main checkout.
+func workClean(args []string) error {
+	topic, _ := firstPositional(args)
+	if topic == "" {
+		return fmt.Errorf("usage: homelab work clean <topic>  (run from the main checkout)")
+	}
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	flags := cryptFlagsFor(repoRoot)
+	wtRel := filepath.Join(".worktrees", topic)
+	branch := currentUser() + "/" + topic
+
+	if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
+		return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
+	}
+	if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
+		fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
+		fmt.Printf("homelab: removed worktree %s (branch %s kept)\n", wtRel, branch)
+		return nil
+	}
+	fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
+	return nil
+}
+
+// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
+func ensureWorktreesIgnored(repoRoot string) {
+	if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
+		return
+	}
+	gi := filepath.Join(repoRoot, ".gitignore")
+	f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
+	if err != nil {
+		return
+	}
+	defer f.Close()
+	if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
+		fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
+	}
+}
--- a/cli/cmd_work_test.go
+++ b/cli/cmd_work_test.go
@ -0,0 +1,32 @@
+package main
+
+import "testing"
+
+func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
+	dir := t.TempDir() // no go.mod, no verify cmd
+	if err := runVerify(dir, "", false); err == nil {
+		t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
+	}
+	if err := runVerify(dir, "", true); err != nil {
+		t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
+	}
+}
+
+func TestFlagValue(t *testing.T) {
+	cases := []struct {
+		args []string
+		name string
+		want string
+	}{
+		{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
+		{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
+		{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
+		{[]string{"topic"}, "--verify-cmd", ""},
+		{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
+	}
+	for _, c := range cases {
+		if got := flagValue(c.args, c.name); got != c.want {
+			t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
+		}
+	}
+}
--- a/cli/command.go
+++ b/cli/command.go
@ -0,0 +1,104 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"sort"
+	"strings"
+)
+
+// Tier classifies whether a command observes (read) or mutates (write) state.
+// v0.1 allows everything; the tier is recorded so a classifier hook can gate
+// writes later without restructuring (see docs/adr/0005).
+type Tier string
+
+const (
+	TierRead  Tier = "read"
+	TierWrite Tier = "write"
+)
+
+// Command is one homelab verb. Path is the token sequence that selects it,
+// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
+type Command struct {
+	Path    []string
+	Tier    Tier
+	Summary string
+	Run     func(args []string) error
+}
+
+// dispatch routes args to the command whose Path is the longest matching prefix
+// of args, passing the remaining args to its Run.
+func dispatch(reg []Command, args []string) error {
+	best := -1
+	bestLen := 0
+	for i, c := range reg {
+		if len(c.Path) > len(args) {
+			continue
+		}
+		match := true
+		for j, p := range c.Path {
+			if args[j] != p {
+				match = false
+				break
+			}
+		}
+		if match && len(c.Path) >= bestLen {
+			best = i
+			bestLen = len(c.Path)
+		}
+	}
+	if best < 0 {
+		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
+	}
+	matched := reg[best]
+	runErr := matched.Run(args[bestLen:])
+	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
+	return runErr
+}
+
+// name is the space-joined verb path, e.g. "tf plan".
+func (c Command) name() string { return strings.Join(c.Path, " ") }
+
+// sortedByName returns a copy of reg ordered by verb path for stable output.
+func sortedByName(reg []Command) []Command {
+	out := make([]Command, len(reg))
+	copy(out, reg)
+	sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
+	return out
+}
+
+// manifestText renders one aligned line per command: "<path>  <tier>  <summary>".
+// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
+func manifestText(reg []Command) string {
+	cmds := sortedByName(reg)
+	width := 0
+	for _, c := range cmds {
+		if n := len(c.name()); n > width {
+			width = n
+		}
+	}
+	var b strings.Builder
+	for _, c := range cmds {
+		fmt.Fprintf(&b, "%-*s  %-5s  %s\n", width, c.name(), c.Tier, c.Summary)
+	}
+	return b.String()
+}
+
+// manifestJSON renders the registry as a JSON array of {command, tier, summary}
+// so agents can parse the full surface in one call.
+func manifestJSON(reg []Command) (string, error) {
+	type entry struct {
+		Command string `json:"command"`
+		Tier    string `json:"tier"`
+		Summary string `json:"summary"`
+	}
+	entries := make([]entry, 0, len(reg))
+	for _, c := range sortedByName(reg) {
+		entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
+	}
+	b, err := json.MarshalIndent(entries, "", "  ")
+	if err != nil {
+		return "", err
+	}
+	return string(b), nil
+}
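
Reviewer note: the longest-prefix rule in `dispatch` matters once a registry holds both a group verb and a more specific sub-verb. The sketch below is a minimal, self-contained illustration of that rule. It assumes a hypothetical `route` type in place of `Command`, and the `tf` / `tf plan` pair is illustrative only, not taken from the real registry.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a hypothetical, stripped-down stand-in for Command: only the path.
type route struct{ path []string }

// longestPrefix returns the index of the route whose path is the longest
// prefix of args, plus the leftover args. idx is -1 when nothing matches.
func longestPrefix(routes []route, args []string) (idx int, rest []string) {
	idx = -1
	bestLen := 0
	for i, r := range routes {
		if len(r.path) > len(args) {
			continue // path is longer than the input, cannot be a prefix
		}
		match := true
		for j, p := range r.path {
			if args[j] != p {
				match = false
				break
			}
		}
		if match && len(r.path) >= bestLen {
			idx, bestLen = i, len(r.path)
		}
	}
	if idx < 0 {
		return -1, nil
	}
	return idx, args[bestLen:]
}

func main() {
	routes := []route{{[]string{"tf"}}, {[]string{"tf", "plan"}}}
	idx, rest := longestPrefix(routes, []string{"tf", "plan", "vault", "--json"})
	// The two-token path wins over the one-token path; the rest is passed on.
	fmt.Println(strings.Join(routes[idx].path, " "), rest)
}
```

Because the comparison is `>=`, two entries with identical paths resolve to the one registered last, which may be worth keeping in mind when verb groups are appended in `buildRegistry`.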
--- a/cli/command_test.go
+++ b/cli/command_test.go
@ -0,0 +1,73 @@
+package main
+
+import (
+	"encoding/json"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
+// command whose Path is the longest matching prefix of the input tokens, and
+// hand the command the remaining args.
+func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
+	var gotArgs []string
+	ran := ""
+	reg := []Command{
+		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
+			Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
+		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
+			Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
+	}
+
+	if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
+		t.Fatalf("dispatch returned error: %v", err)
+	}
+	if ran != "tf plan" {
+		t.Fatalf("routed to %q, want %q", ran, "tf plan")
+	}
+	if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
+		t.Fatalf("command got args %v, want %v", gotArgs, want)
+	}
+}
+
+func TestDispatchUnknownCommandErrors(t *testing.T) {
+	reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
+	if err := dispatch(reg, []string{"bogus"}); err == nil {
+		t.Fatal("expected error for unknown command, got nil")
+	}
+}
+
+// The manifest is the progressive-discovery entrypoint: one line per command
+// showing the full verb path, its tier, and summary, sorted for stable output.
+func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
+	reg := []Command{
+		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
+		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
+	}
+	out := manifestText(reg)
+	for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
+		if !strings.Contains(out, want) {
+			t.Errorf("manifest text missing %q\n---\n%s", want, out)
+		}
+	}
+	// sorted: claim (c) must appear before tf plan (t)
+	if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
+		t.Errorf("manifest not sorted by path:\n%s", out)
+	}
+}
+
+func TestManifestJSONIsParsableAndTagged(t *testing.T) {
+	reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
+	out, err := manifestJSON(reg)
+	if err != nil {
+		t.Fatalf("manifestJSON error: %v", err)
+	}
+	var got []map[string]string
+	if err := json.Unmarshal([]byte(out), &got); err != nil {
+		t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
+	}
+	if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
+		t.Fatalf("unexpected manifest JSON: %v", got)
+	}
+}
--- a/cli/edges.go
+++ b/cli/edges.go
@ -0,0 +1,164 @@
+package main
+
+import (
+	"fmt"
+	"regexp"
+	"strconv"
+	"strings"
+)
+
+// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
+// investigation helper over the goldmane_edges trail; see ADR-0014).
+type edgesOpts struct {
+	ns       string // edges touching this namespace (either direction)
+	src      string // edges where src_ns = this
+	dst      string // edges where dst_ns = this
+	peersOf  string // distinct peers of this namespace (both directions)
+	newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
+	denied   bool   // action = 'deny' only
+	asJSON   bool   // wrap result as a JSON array
+	limit    int    // row cap (default 200)
+}
+
+// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
+// typo surfaces instead of silently dumping the whole table.
+func parseEdgesArgs(args []string) (edgesOpts, error) {
+	o := edgesOpts{limit: 200}
+	i := 0
+	for i < len(args) {
+		a := args[i]
+		key, inline, hasInline := a, "", false
+		if eq := strings.IndexByte(a, '='); eq >= 0 {
+			key, inline, hasInline = a[:eq], a[eq+1:], true
+		}
+		needVal := func() (string, error) {
+			if hasInline {
+				return inline, nil
+			}
+			if i+1 < len(args) {
+				i++
+				return args[i], nil
+			}
+			return "", fmt.Errorf("flag %s needs a value", key)
+		}
+		var err error
+		switch key {
+		case "--ns":
+			o.ns, err = needVal()
+		case "--src":
+			o.src, err = needVal()
+		case "--dst":
+			o.dst, err = needVal()
+		case "--peers-of":
+			o.peersOf, err = needVal()
+		case "--new-since":
+			o.newSince, err = needVal()
+		case "--denied":
+			o.denied = true
+		case "--json":
+			o.asJSON = true
+		case "--limit":
+			var v string
+			if v, err = needVal(); err == nil {
+				if o.limit, err = strconv.Atoi(v); err != nil {
+					err = fmt.Errorf("--limit must be an integer: %q", v)
+				}
+			}
+		default:
+			return o, fmt.Errorf("unknown flag: %s", a)
+		}
+		if err != nil {
+			return o, err
+		}
+		i++
+	}
+	return o, nil
+}
+
+// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
+// injection guard — anything else is rejected rather than quoted-and-hoped.
+var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
+
+func validateNS(s string) error {
+	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
+		return fmt.Errorf("invalid namespace name: %q", s)
+	}
+	return nil
+}
+
+// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
+func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
+
+var (
+	durRE  = regexp.MustCompile(`^(\d+)([smhd])$`)
+	dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
+)
+
+// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD,
+// optionally followed by a space or T and HH:MM[:SS]) into a first_seen predicate.
+func newSinceCond(v string) (string, error) {
+	if m := durRE.FindStringSubmatch(v); m != nil {
+		unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
+		return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
+	}
+	if dateRE.MatchString(v) {
+		return "first_seen >= " + sqlStr(v), nil
+	}
+	return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
+}
+
+// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
+func buildEdgesQuery(o edgesOpts) (string, error) {
+	limit := o.limit
+	if limit <= 0 {
+		limit = 200
+	}
+
+	// peers-of is a distinct-peer summary, a different shape from the row list.
+	if o.peersOf != "" {
+		if err := validateNS(o.peersOf); err != nil {
+			return "", err
+		}
+		p := sqlStr(o.peersOf)
+		return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
+			"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
+			"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
+			") t ORDER BY peer LIMIT %d", p, p, limit), nil
+	}
+
+	var conds []string
+	for _, f := range []struct{ val, tmpl string }{
+		{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
+		{o.src, "src_ns = %s"},
+		{o.dst, "dst_ns = %s"},
+	} {
+		if f.val == "" {
+			continue
+		}
+		if err := validateNS(f.val); err != nil {
+			return "", err
+		}
+		conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
+	}
+	if o.denied {
+		conds = append(conds, "action = 'deny'")
+	}
+	if o.newSince != "" {
+		c, err := newSinceCond(o.newSince)
+		if err != nil {
+			return "", err
+		}
+		conds = append(conds, c)
+	}
+
+	q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
+	if len(conds) > 0 {
+		q += " WHERE " + strings.Join(conds, " AND ")
+	}
+	q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
+
+	if o.asJSON {
+		q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
+	}
+	return q, nil
+}
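
Reviewer note: the injection guard in `edges.go` is "validate first, then quote anyway". The sketch below shows that pattern in isolation for the `--ns` filter. The names `nsPattern`, `quote` and `nsPredicate` are hypothetical stand-ins for `nsRE`, `sqlStr` and the `--ns` branch of `buildEdgesQuery`; the regex and predicate template are copied from the diff above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsPattern mirrors the allow-list used for namespace tokens in the diff.
var nsPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

// quote renders a SQL string literal by doubling single quotes.
func quote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// nsPredicate rejects anything outside the allow-list, then quotes the
// accepted value as a second layer of defence.
func nsPredicate(ns string) (string, error) {
	if ns == "" || len(ns) > 63 || !nsPattern.MatchString(ns) {
		return "", fmt.Errorf("invalid namespace name: %q", ns)
	}
	return fmt.Sprintf("(src_ns = %[1]s OR dst_ns = %[1]s)", quote(ns)), nil
}

func main() {
	for _, ns := range []string{"immich", "a'; DROP TABLE edge;--"} {
		pred, err := nsPredicate(ns)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(pred)
	}
}
```

The first input yields a predicate matching either direction; the second never reaches the quoting step because the allow-list rejects it.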
--- a/cli/edges_test.go
+++ b/cli/edges_test.go
@ -0,0 +1,163 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestParseEdgesArgs(t *testing.T) {
+	cases := []struct {
+		name string
+		args []string
+		want edgesOpts
+	}{
+		{"defaults", nil, edgesOpts{limit: 200}},
+		{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
+		{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
+		{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
+		{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
+		{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
+		{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
+		{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got, err := parseEdgesArgs(c.args)
+			if err != nil {
+				t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
+			}
+			if got != c.want {
+				t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
+			}
+		})
+	}
+}
+
+func TestParseEdgesArgsErrors(t *testing.T) {
+	for _, args := range [][]string{
+		{"--limit", "abc"},
+		{"--bogus"},
+	} {
+		if _, err := parseEdgesArgs(args); err == nil {
+			t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
+		}
+	}
+}
+
+func TestBuildEdgesQueryDefaults(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{limit: 200})
+	if err != nil {
+		t.Fatal(err)
+	}
+	for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
+		if !strings.Contains(q, want) {
+			t.Errorf("query %q missing %q", q, want)
+		}
+	}
+	if strings.Contains(q, "WHERE") {
+		t.Errorf("no-filter query should have no WHERE: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryFilters(t *testing.T) {
+	cases := []struct {
+		name string
+		o    edgesOpts
+		want string
+	}{
+		{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
+		{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
+		{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
+		{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			q, err := buildEdgesQuery(c.o)
+			if err != nil {
+				t.Fatal(err)
+			}
+			if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
+				t.Errorf("query %q missing WHERE/%q", q, c.want)
+			}
+		})
+	}
+}
+
+func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
+		t.Errorf("combined filters not AND'd: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryPeersOf(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
+	if err != nil {
+		t.Fatal(err)
+	}
+	for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
+		if !strings.Contains(q, want) {
+			t.Errorf("peers-of query %q missing %q", q, want)
+		}
+	}
+}
+
+func TestBuildEdgesQueryJSON(t *testing.T) {
+	q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
+		t.Errorf("json query missing json_agg wrapper: %q", q)
+	}
+}
+
+func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
+	for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
+		if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
+			t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
+		}
+	}
+}
+
+func TestNewSinceCond(t *testing.T) {
+	cases := []struct {
+		in   string
+		want string
+	}{
+		{"24h", "first_seen >= now() - interval '24 hours'"},
+		{"7d", "first_seen >= now() - interval '7 days'"},
+		{"30m", "first_seen >= now() - interval '30 minutes'"},
+		{"2026-06-28", "first_seen >= '2026-06-28'"},
+	}
+	for _, c := range cases {
+		got, err := newSinceCond(c.in)
+		if err != nil {
+			t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
+		}
+		if got != c.want {
+			t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
+		}
+	}
+	for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
+		if _, err := newSinceCond(bad); err == nil {
+			t.Errorf("newSinceCond(%q) expected error, got nil", bad)
+		}
+	}
+}
+
+func TestValidateNS(t *testing.T) {
+	for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
+		if err := validateNS(ok); err != nil {
+			t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
+		}
+	}
+	for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
+		if err := validateNS(bad); err == nil {
+			t.Errorf("validateNS(%q) expected error, got nil", bad)
+		}
+	}
+}
--- a/cli/homelab.go
+++ b/cli/homelab.go
@ -0,0 +1,99 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
+var version = "dev"
+
+// buildRegistry returns every homelab verb. New verb-groups append here.
+func buildRegistry() []Command {
+	var reg []Command
+	reg = append(reg, claimCommands()...)
+	reg = append(reg, tfCommands()...)
+	reg = append(reg, workCommands()...)
+	reg = append(reg, k8sCommands()...)
+	reg = append(reg, memoryCommands()...)
+	reg = append(reg, ciCommands()...)
+	reg = append(reg, deployCommands()...)
+	reg = append(reg, netCommands()...)
+	reg = append(reg, obsCommands()...)
+	reg = append(reg, edgesCommands()...)
+	reg = append(reg, usageCommands()...)
+	reg = append(reg, haCommands()...)
+	reg = append(reg, browserCommands()...)
+	reg = append(reg, vaultCommands()...)
+	return reg
+}
+
+// dispatchTop handles the homelab verb surface. handled=false means the args are
+// not a homelab verb, so main() falls back to the legacy -use-case path.
+func dispatchTop(args []string) (handled bool, err error) {
+	if len(args) == 0 {
+		fmt.Print(usage())
+		return true, nil
+	}
+	switch args[0] {
+	case "help", "-h", "--help":
+		fmt.Print(usage())
+		return true, nil
+	case "version", "--version":
+		fmt.Println("homelab " + version)
+		return true, nil
+	case "manifest":
+		reg := buildRegistry()
+		if containsArg(args[1:], "--json") {
+			out, err := manifestJSON(reg)
+			if err != nil {
+				return true, err
+			}
+			fmt.Println(out)
+			return true, nil
+		}
+		fmt.Print(manifestText(reg))
+		return true, nil
+	}
+	if strings.HasPrefix(args[0], "-") {
+		return false, nil
+	}
+	reg := buildRegistry()
+	if !isCommandGroup(reg, args[0]) {
+		return false, nil
+	}
+	return true, dispatch(reg, args)
+}
+
+func isCommandGroup(reg []Command, group string) bool {
+	for _, c := range reg {
+		if len(c.Path) > 0 && c.Path[0] == group {
+			return true
+		}
+	}
+	return false
+}
+
+func containsArg(args []string, want string) bool {
+	for _, a := range args {
+		if a == want {
+			return true
+		}
+	}
+	return false
+}
+
+func usage() string {
+	var b strings.Builder
+	fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
+	b.WriteString("Usage:\n  homelab <command> [args]\n\nCommands:\n")
+	for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
+		if line != "" {
+			b.WriteString("  " + line + "\n")
+		}
+	}
+	b.WriteString("\n  manifest [--json]   list all commands (machine-readable with --json)\n")
+	b.WriteString("  version             print version\n")
+	b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
+	return b.String()
+}
--- a/cli/k8s.go
+++ b/cli/k8s.go
@ -0,0 +1,138 @@
+package main
+
+import (
+	"fmt"
+	"os/exec"
+	"strings"
+)
+
+// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
+
+func kubectlBase(ns string, args ...string) []string {
+	var full []string
+	if ns != "" {
+		full = append(full, "-n", ns)
+	}
+	return append(full, args...)
+}
+
+func kubectlStream(ns string, args ...string) error {
+	return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
+}
+
+// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
+func kubectlCapture(ns string, args ...string) (string, error) {
+	out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
+	return strings.TrimSpace(string(out)), err
+}
+
+// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
+type k8sTarget struct {
+	app       string
+	ns        string
+	pod       string
+	container string
+	selector  string
+	tty       bool
+	rest      []string // passthrough flags and, after `--`, the exec command
+}
+
+// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
+// The first bare token is the app; unknown flags pass through in rest.
+func parseK8sTarget(args []string) k8sTarget {
+	t := k8sTarget{}
+	i := 0
+	take := func() string {
+		if i+1 < len(args) {
+			i++
+			return args[i]
+		}
+		return ""
+	}
+	for i = 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--":
+			t.rest = append(t.rest, args[i+1:]...)
+			return t
+		case a == "-n" || a == "--namespace":
+			t.ns = take()
+		case strings.HasPrefix(a, "--namespace="):
+			t.ns = strings.TrimPrefix(a, "--namespace=")
+		case a == "--pod":
+			t.pod = take()
+		case strings.HasPrefix(a, "--pod="):
+			t.pod = strings.TrimPrefix(a, "--pod=")
+		case a == "-c" || a == "--container":
+			t.container = take()
+		case strings.HasPrefix(a, "--container="):
+			t.container = strings.TrimPrefix(a, "--container=")
+		case a == "-l" || a == "--selector":
+			t.selector = take()
+		case strings.HasPrefix(a, "--selector="):
+			t.selector = strings.TrimPrefix(a, "--selector=")
+		case a == "--tty" || a == "-it" || a == "-ti":
+			t.tty = true
+		case !strings.HasPrefix(a, "-") && t.app == "":
+			t.app = a
+		default:
+			t.rest = append(t.rest, a)
+		}
+	}
+	return t
+}
+
+// namespace defaults to the app name (most namespaces hold exactly one app).
+func (t k8sTarget) namespace() string {
+	if t.ns != "" {
+		return t.ns
+	}
+	return t.app
+}
+
+// objectRef is the kubectl object for logs/exec: an explicit pod, else
+// deploy/<app> (kubectl resolves a pod from the Deployment).
+func (t k8sTarget) objectRef() string {
+	if t.pod != "" {
+		return "pod/" + t.pod
+	}
+	return "deploy/" + t.app
+}
+
+// --- database access (the dbaas exec pattern) ---
+
+type dbPlan struct {
+	ns        string
+	pod       string   // explicit pod (e.g. mysql-standalone-0)
+	selector  string   // resolve the pod by this label when pod == "" (CNPG primary)
+	container string   // "" = default container
+	argv      []string // command + args to run inside the pod
+}
+
+// planDBExec builds the in-pod command to run sql against app's database.
+// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
+// Service, not an exec target), psql -U postgres -d <db>.
+// MySQL: mysql-standalone-0; bash expands the password from $MYSQL_ROOT_PASSWORD inside the pod, so it never appears on the caller's kubectl command line.
+// dbName defaults to app. sql empty => interactive client.
+func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
+	if dbName == "" {
+		dbName = app
+	}
+	if mysql {
+		inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
+		if sql != "" {
+			inner += " -e " + shellQuote(sql)
+		}
+		return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
+	}
+	argv := []string{"psql", "-U", "postgres", "-d", dbName}
+	if sql != "" {
+		argv = append(argv, "-tAc", sql)
+	}
+	return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
+}
+
+// shellQuote single-quotes s for safe embedding in a bash -c string.
+func shellQuote(s string) string {
+	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
+}
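
Reviewer note: the MySQL branch of `planDBExec` builds a single `bash -c` string, so every caller-supplied piece has to be single-quoted while the password reference stays double-quoted for the in-pod shell to expand. A minimal sketch of that composition follows. `quoteForBash` and `mysqlInner` are hypothetical names that mirror `shellQuote` and the MySQL branch; this is not the shipped code.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteForBash single-quotes s, closing and reopening the quote around each
// embedded single quote (the usual '\'' idiom).
func quoteForBash(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// mysqlInner builds the string handed to `bash -c` inside the pod. The
// password stays a literal "$MYSQL_ROOT_PASSWORD" here and is only expanded
// by the pod's shell.
func mysqlInner(db, sql string) string {
	inner := `mysql -u root -p"$MYSQL_ROOT_PASSWORD" ` + quoteForBash(db)
	if sql != "" {
		inner += " -e " + quoteForBash(sql)
	}
	return inner
}

func main() {
	fmt.Println(mysqlInner("wrongmove", "SHOW TABLES"))
	fmt.Println(mysqlInner("wrongmove", "SELECT 'x'"))
}
```

An empty `sql` leaves off `-e`, which is what gives the interactive client described in the `planDBExec` comment.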
--- a/cli/k8s_test.go
+++ b/cli/k8s_test.go
@ -0,0 +1,65 @@
+package main
+
+import (
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestParseK8sTarget(t *testing.T) {
+	got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
+	want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
+	}
+}
+
+func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
+	if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
+		t.Errorf("namespace() = %q, want immich", ns)
+	}
+	if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
+		t.Errorf("namespace() = %q, want dbaas", ns)
+	}
+}
+
+func TestK8sTargetObjectRef(t *testing.T) {
+	if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
+		t.Errorf("objectRef() = %q, want deploy/tripit", r)
+	}
+	if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
+		t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
+	}
+}
+
+func TestPlanDBExecPostgresDefault(t *testing.T) {
+	p := planDBExec("fire-planner", "", "SELECT 1", false)
+	// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
+	// label rather than naming an (un-exec-able) Service.
+	if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
+		t.Fatalf("unexpected pg target: %+v", p)
+	}
+	// db name defaults to the app; SQL passed via -tAc
+	joined := strings.Join(p.argv, " ")
+	if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
+		t.Fatalf("pg argv missing db/sql: %v", p.argv)
+	}
+}
+
+func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
+	p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
+	if p.pod != "mysql-standalone-0" {
+		t.Fatalf("unexpected mysql pod: %+v", p)
+	}
+	inner := strings.Join(p.argv, " ")
+	// password must come from the env var, never inline
+	if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
+		t.Fatalf("mysql must use env password wrapper: %v", p.argv)
+	}
+}
+
+func TestShellQuoteEscapes(t *testing.T) {
+	if got := shellQuote("a'b"); got != `'a'\''b'` {
+		t.Fatalf("shellQuote = %q", got)
+	}
+}
--- a/cli/main.go
+++ b/cli/main.go
@ -26,8 +26,16 @@ var (
 )

 func main() {
-	err := run()
-	if err != nil {
+	// homelab verb surface (work/tf/claim/...) is tried first; if the args are
+	// not a homelab verb, fall through to the legacy webhook -use-case path.
+	if handled, err := dispatchTop(os.Args[1:]); handled {
+		if err != nil {
+			fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
+			os.Exit(1)
+		}
+		return
+	}
+	if err := run(); err != nil {
 		glog.Errorf("run failed: %s", err.Error())
 		os.Exit(255)
 	}
--- a/cli/memory.go
+++ b/cli/memory.go
@ -0,0 +1,103 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"os"
+	"strings"
+	"time"
+)
+
+// defaultMemoryURL is used when no env override is present (agents normally have
+// CLAUDE_MEMORY_API_URL set by the memory hooks).
+const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
+
+type memoryClient struct {
+	base string
+	key  string
+	http *http.Client
+}
+
+func firstEnv(keys ...string) string {
+	for _, k := range keys {
+		if v := os.Getenv(k); v != "" {
+			return v
+		}
+	}
+	return ""
+}
+
+func resolveMemoryBase() string {
+	if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
+		return strings.TrimRight(b, "/")
+	}
+	return defaultMemoryURL
+}
+
+// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
+// the MCP wraps), so it works even when the MCP frontend is down.
+func newMemoryClient() (*memoryClient, error) {
+	key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
+	if key == "" {
+		return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
+	}
+	return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
+}
+
+func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
+	var r io.Reader
+	if body != nil {
+		b, err := json.Marshal(body)
+		if err != nil {
+			return nil, err
+		}
+		r = bytes.NewReader(b)
+	}
+	req, err := http.NewRequest(method, c.base+path, r)
+	if err != nil {
+		return nil, err
+	}
+	req.Header.Set("Authorization", "Bearer "+c.key)
+	if body != nil {
+		req.Header.Set("Content-Type", "application/json")
+	}
+	resp, err := c.http.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	out, _ := io.ReadAll(resp.Body)
+	if resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
+	}
+	return out, nil
+}
+
+// Request bodies mirror src/claude_memory/api/models.py.
+
+type memRecallReq struct {
+	Context       string `json:"context"`
+	ExpandedQuery string `json:"expanded_query,omitempty"`
+	Category      string `json:"category,omitempty"`
+	SortBy        string `json:"sort_by,omitempty"`
+	Limit         int    `json:"limit,omitempty"`
+}
+
+type memStoreReq struct {
+	Content          string  `json:"content"`
+	Category         string  `json:"category,omitempty"`
+	Tags             string  `json:"tags,omitempty"`
+	ExpandedKeywords string  `json:"expanded_keywords,omitempty"`
+	Importance       float64 `json:"importance"`
+	ForceSensitive   bool    `json:"force_sensitive,omitempty"`
+}
+
+type memUpdateReq struct {
+	Content          *string  `json:"content,omitempty"`
+	Tags             *string  `json:"tags,omitempty"`
+	Importance       *float64 `json:"importance,omitempty"`
+	ExpandedKeywords *string  `json:"expanded_keywords,omitempty"`
+}
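
Reviewer note: `memUpdateReq` uses pointer fields with `omitempty`, while `memStoreReq.Importance` is a plain `float64` without it. The sketch below illustrates why: with `encoding/json`, `omitempty` drops a nil pointer but keeps a non-nil pointer to a zero value, so a partial update can distinguish "leave unchanged" from "set to 0". The `patch` type is a hypothetical two-field reduction of `memUpdateReq`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patch is a hypothetical, reduced version of memUpdateReq.
type patch struct {
	Content    *string  `json:"content,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encodePatch marshals p and returns the JSON text ("" on error).
func encodePatch(p patch) string {
	b, err := json.Marshal(p)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encodePatch(patch{}))                  // nothing set: {}
	fmt.Println(encodePatch(patch{Importance: &zero})) // explicit zero is sent
}
```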
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@ -0,0 +1,74 @@
+package main
+
+import (
+	"encoding/json"
+	"os"
+	"strings"
+	"testing"
+	"unicode/utf8"
+)
+
+func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
+	// Byte-slicing multibyte text (e.g. Cyrillic) at a fixed offset such as 240 can
+	// split a 2-byte rune and emit invalid UTF-8, the bug that crashed the recall
+	// hook. truncatePreview must cut on a rune boundary and always stay valid UTF-8.
+	long := strings.Repeat("я", 300) // 300 runes / 600 bytes
+	got := truncatePreview(long, 240)
+	if !utf8.ValidString(got) {
+		t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
+	}
+	if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
+		t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
+	}
+	// Short multibyte strings pass through untouched (no ellipsis).
+	if got := truncatePreview("кратко", 240); got != "кратко" {
+		t.Fatalf("short string altered: %q", got)
+	}
+	// ASCII boundary still works.
+	if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
+		t.Fatalf("ascii truncation wrong: %q", got)
+	}
+}
+
+func TestResolveMemoryBase(t *testing.T) {
+	old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
+	defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
+
+	os.Unsetenv("CLAUDE_MEMORY_API_URL")
+	os.Unsetenv("MEMORY_API_URL")
+	if got := resolveMemoryBase(); got != defaultMemoryURL {
+		t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
+	}
+	os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
+	if got := resolveMemoryBase(); got != "https://m.example" {
+		t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
+	}
+}
+
+func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
+	b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
+	s := string(b)
+	if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
+		t.Fatalf("memStoreReq JSON missing fields: %s", s)
+	}
+}
+
+func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
+	tags := "a,b"
+	b, _ := json.Marshal(memUpdateReq{Tags: &tags})
+	s := string(b)
+	if strings.Contains(s, "content") || strings.Contains(s, "importance") {
+		t.Fatalf("unset update fields must be omitted: %s", s)
+	}
+	if !strings.Contains(s, `"tags":"a,b"`) {
+		t.Fatalf("set field missing: %s", s)
+	}
+}
+
+func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
+	b, _ := json.Marshal(memRecallReq{Context: "hi"})
+	s := string(b)
+	if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
+		t.Fatalf("empty optionals must be omitted: %s", s)
+	}
+}
--- a/cli/presence.go
+++ b/cli/presence.go
@ -0,0 +1,58 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
+var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
+
+// presenceScript locates the presence CLI — homelab WRAPS it, it does not
+// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
+func presenceScript() string {
+	if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
+		return p
+	}
+	home, err := os.UserHomeDir()
+	if err != nil {
+		return "presence"
+	}
+	return filepath.Join(home, "code", "scripts", "presence")
+}
+
+// validateLabel checks a presence label is <kind>:<name> with a known kind.
+func validateLabel(label string) error {
+	parts := strings.SplitN(label, ":", 2)
+	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
+		return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
+	}
+	for _, k := range validPresenceKinds {
+		if parts[0] == k {
+			return nil
+		}
+	}
+	return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
+}
+
+// presenceClaim claims label on the board with a purpose note.
+func presenceClaim(label, purpose string) error {
+	if err := validateLabel(label); err != nil {
+		return err
+	}
+	args := []string{"claim", label}
+	if purpose != "" {
+		args = append(args, "--purpose", purpose)
+	}
+	return runStreaming(presenceScript(), args...)
+}
+
+// presenceRelease releases a prior claim on label.
+func presenceRelease(label string) error {
+	if err := validateLabel(label); err != nil {
+		return err
+	}
+	return runStreaming(presenceScript(), "release", label)
+}
--- a/cli/presence_test.go
+++ b/cli/presence_test.go
@ -0,0 +1,24 @@
+package main
+
+import "testing"
+
+func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
+	good := []string{
+		"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
+		"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
+	}
+	for _, l := range good {
+		if err := validateLabel(l); err != nil {
+			t.Errorf("validateLabel(%q) = %v, want nil", l, err)
+		}
+	}
+}
+
+func TestValidateLabelRejectsBadLabels(t *testing.T) {
+	bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
+	for _, l := range bad {
+		if err := validateLabel(l); err == nil {
+			t.Errorf("validateLabel(%q) = nil, want error", l)
+		}
+	}
+}
--- a/cli/probe.go
+++ b/cli/probe.go
@ -0,0 +1,76 @@
+package main
+
+import (
+	"context"
+	"crypto/tls"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"net/url"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
+const internalLBIP = "10.0.20.203"
+
+// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
+// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
+// host:443:ip`. TLS verification is skipped (these are reachability/observability
+// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
+func clientDialingIP(ip string, timeout time.Duration) *http.Client {
+	d := &net.Dialer{Timeout: 8 * time.Second}
+	tr := &http.Transport{
+		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
+			if _, port, err := net.SplitHostPort(addr); err == nil {
+				addr = net.JoinHostPort(ip, port) // also brackets an IPv6 ip correctly
+			}
+			return d.DialContext(ctx, network, addr)
+		},
+		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
+	}
+	return &http.Client{Timeout: timeout, Transport: tr}
+}
+
+// probeURL issues a GET and returns status code + elapsed time.
+func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
+	start := time.Now()
+	resp, err := c.Get(rawurl)
+	dur := time.Since(start)
+	if err != nil {
+		return 0, dur, err
+	}
+	resp.Body.Close()
+	return resp.StatusCode, dur, nil
+}
+
+// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
+func lbGetBody(host, path string, q url.Values) ([]byte, error) {
+	u := "https://" + host + path
+	if len(q) > 0 {
+		u += "?" + q.Encode()
+	}
+	resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	body, err := io.ReadAll(resp.Body)
+	if err != nil {
+		return nil, err // a truncated body must not be returned as success
+	}
+	if resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
+	}
+	return body, nil
+}
+
+// dig runs `dig +short` against a resolver, optionally for a record type.
+func dig(name, server, rrtype string) (string, error) {
+	args := []string{"+short", "+time=3", "+tries=1"}
+	if rrtype != "" {
+		args = append(args, rrtype)
+	}
+	args = append(args, name, "@"+server)
+	out, err := exec.Command("dig", args...).Output()
+	return strings.TrimSpace(string(out)), err
+}
--- a/cli/probe_test.go
+++ b/cli/probe_test.go
@ -0,0 +1,49 @@
+package main
+
+import "testing"
+
+func TestQueryArg(t *testing.T) {
+	if got := queryArg([]string{"up"}, nil); got != "up" {
+		t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
+	}
+	if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
+		t.Errorf(`--json should be dropped, got %q`, got)
+	}
+	// single quoted PromQL arrives as one token
+	if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
+		t.Errorf(`quoted query mangled: %q`, got)
+	}
+	// value-flags and their values are skipped, query survives
+	vf := map[string]bool{"--since": true, "--limit": true}
+	if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
+		t.Errorf(`value-flag skipping failed: %q`, got)
+	}
+}
+
+func TestLabelStr(t *testing.T) {
+	got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
+	if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
+		t.Errorf("labelStr = %q", got)
+	}
+	if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
+		t.Errorf("labelStr (no __name__) = %q", got)
+	}
+}
+
+func TestOneLineList(t *testing.T) {
+	if got := oneLineList("  "); got != "(none)" {
+		t.Errorf("empty = %q, want (none)", got)
+	}
+	if got := oneLineList("a\nb"); got != "a, b" {
+		t.Errorf("multi = %q, want 'a, b'", got)
+	}
+}
+
+func TestHostOnly(t *testing.T) {
+	if got := hostOnly("foo.me/path"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+	if got := hostOnly("foo.me"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+}
--- a/cli/repo.go
+++ b/cli/repo.go
@ -0,0 +1,101 @@
+package main
+
+import (
+	"os"
+	"os/exec"
+	"os/user"
+	"path/filepath"
+	"strings"
+)
+
+// preferRemote picks the canonical remote: forgejo if present, else origin,
+// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
+func preferRemote(remotes []string) string {
+	has := map[string]bool{}
+	for _, r := range remotes {
+		has[r] = true
+	}
+	switch {
+	case has["forgejo"]:
+		return "forgejo"
+	case has["origin"]:
+		return "origin"
+	case len(remotes) > 0:
+		return remotes[0]
+	default:
+		return ""
+	}
+}
+
+// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
+func hasGitCryptAttr(gitattributes string) bool {
+	return strings.Contains(gitattributes, "filter=git-crypt")
+}
+
+// gitCryptFlags are the per-command flags that disable smudge/clean so git
+// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
+func gitCryptFlags() []string {
+	return []string{
+		"-c", "filter.git-crypt.smudge=cat",
+		"-c", "filter.git-crypt.clean=cat",
+		"-c", "filter.git-crypt.required=false",
+	}
+}
+
+// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
+func gitOutput(dir string, args ...string) (string, error) {
+	cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
+	out, err := cmd.Output()
+	return strings.TrimSpace(string(out)), err
+}
+
+func gitRepoRoot(dir string) (string, error) {
+	return gitOutput(dir, "rev-parse", "--show-toplevel")
+}
+
+// gitRemotes lists configured remote names for the repo at dir.
+func gitRemotes(dir string) ([]string, error) {
+	out, err := gitOutput(dir, "remote")
+	if err != nil {
+		return nil, err
+	}
+	if out == "" {
+		return nil, nil
+	}
+	return strings.Split(out, "\n"), nil
+}
+
+// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
+func isGitCryptRepo(repoRoot string) bool {
+	b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
+	if err != nil {
+		return false
+	}
+	return hasGitCryptAttr(string(b))
+}
+
+// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
+// else nil. These are injected per-command and never persisted.
+func cryptFlagsFor(repoRoot string) []string {
+	if isGitCryptRepo(repoRoot) {
+		return gitCryptFlags()
+	}
+	return nil
+}
+
+// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
+func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
+	full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
+	return runStreamingIn("", "git", full...)
+}
+
+// currentUser returns the OS username for branch naming (<user>/<topic>).
+func currentUser() string {
+	if u := os.Getenv("USER"); u != "" {
+		return u
+	}
+	if u, err := user.Current(); err == nil && u.Username != "" {
+		return u.Username
+	}
+	return "user"
+}
--- a/cli/repo_test.go
+++ b/cli/repo_test.go
@ -0,0 +1,37 @@
+package main
+
+import "testing"
+
+func TestPreferRemote(t *testing.T) {
+	cases := []struct {
+		in   []string
+		want string
+	}{
+		{[]string{"origin", "forgejo"}, "forgejo"},
+		{[]string{"forgejo"}, "forgejo"},
+		{[]string{"origin"}, "origin"},
+		{[]string{"upstream"}, "upstream"},
+		{nil, ""},
+	}
+	for _, c := range cases {
+		if got := preferRemote(c.in); got != c.want {
+			t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
+		}
+	}
+}
+
+func TestHasGitCryptAttr(t *testing.T) {
+	if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
+		t.Error("expected git-crypt detected")
+	}
+	if hasGitCryptAttr("*.md text\n*.png binary") {
+		t.Error("expected no git-crypt")
+	}
+}
+
+func TestGitCryptFlagsShape(t *testing.T) {
+	f := gitCryptFlags()
+	if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
+		t.Fatalf("unexpected git-crypt flags: %v", f)
+	}
+}
--- a/cli/run.go
+++ b/cli/run.go
@ -0,0 +1,23 @@
+package main
+
+import (
+	"os"
+	"os/exec"
+)
+
+// runStreaming executes name with args, wiring std streams to this process so
+// the caller sees live output, and returns the command's error (non-nil on
+// non-zero exit — preserved so homelab's own exit code reflects the child's).
+func runStreaming(name string, args ...string) error {
+	return runStreamingIn("", name, args...)
+}
+
+// runStreamingIn is runStreaming with a working directory (empty = inherit).
+func runStreamingIn(dir, name string, args ...string) error {
+	cmd := exec.Command(name, args...)
+	cmd.Dir = dir
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
--- a/cli/stack.go
+++ b/cli/stack.go
@ -0,0 +1,54 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"sort"
+	"strings"
+)
+
+// findInfraRoot walks up from start to the infra repo root — the directory
+// holding both terragrunt.hcl and a stacks/ directory.
+func findInfraRoot(start string) (string, error) {
+	dir := start
+	for {
+		if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
+			return dir, nil
+		}
+		parent := filepath.Dir(dir)
+		if parent == dir {
+			return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
+		}
+		dir = parent
+	}
+}
+
+// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
+func resolveStack(infraRoot, name string) (string, error) {
+	dir := filepath.Join(infraRoot, "stacks", name)
+	if isDir(dir) {
+		return dir, nil
+	}
+	avail := listStacks(infraRoot)
+	return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
+}
+
+// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
+func listStacks(infraRoot string) []string {
+	entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
+	if err != nil {
+		return nil
+	}
+	var out []string
+	for _, e := range entries {
+		if e.IsDir() {
+			out = append(out, e.Name())
+		}
+	}
+	sort.Strings(out)
+	return out
+}
+
+func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
+func isDir(p string) bool  { fi, err := os.Stat(p); return err == nil && fi.IsDir() }
--- a/cli/stack_test.go
+++ b/cli/stack_test.go
@ -0,0 +1,52 @@
+package main
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+func newInfraTree(t *testing.T, stacks ...string) string {
+	t.Helper()
+	root := t.TempDir()
+	if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	for _, s := range stacks {
+		if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
+			t.Fatal(err)
+		}
+	}
+	return root
+}
+
+func TestFindInfraRootWalksUp(t *testing.T) {
+	root := newInfraTree(t, "vault")
+	got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
+	if err != nil {
+		t.Fatalf("findInfraRoot error: %v", err)
+	}
+	if got != root {
+		t.Fatalf("findInfraRoot = %q, want %q", got, root)
+	}
+}
+
+func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
+	if _, err := findInfraRoot(t.TempDir()); err == nil {
+		t.Fatal("expected error outside an infra checkout")
+	}
+}
+
+func TestResolveStack(t *testing.T) {
+	root := newInfraTree(t, "vault", "monitoring")
+	dir, err := resolveStack(root, "vault")
+	if err != nil {
+		t.Fatalf("resolveStack error: %v", err)
+	}
+	if want := filepath.Join(root, "stacks", "vault"); dir != want {
+		t.Fatalf("resolveStack = %q, want %q", dir, want)
+	}
+	if _, err := resolveStack(root, "nonesuch"); err == nil {
+		t.Fatal("expected error for unknown stack")
+	}
+}
--- a/cli/telemetry.go
+++ b/cli/telemetry.go
@ -0,0 +1,62 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+)
+
+// usageJob is the Loki stream job label for homelab usage telemetry.
+const usageJob = "homelab-usage"
+
+// emitUsage best-effort records one verb invocation to Loki for cross-user
+// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
+// only a 0/1 failure flag (exit=) + CLI version. NEVER args, paths, flags, or
+// secrets. It must never affect the command: all errors are swallowed and a
+// tight timeout bounds the cost. Opt out with HOMELAB_TELEMETRY=0 (off, false,
+// and no are also accepted).
+func emitUsage(verb string, runErr error) {
+	switch os.Getenv("HOMELAB_TELEMETRY") {
+	case "0", "off", "false", "no":
+		return
+	}
+	if verb == "" || strings.HasPrefix(verb, "usage") {
+		return // don't self-record the analytics reader
+	}
+	exit := 0
+	if runErr != nil {
+		exit = 1
+	}
+	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
+		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
+		Values: [][2]string{{
+			strconv.FormatInt(time.Now().UnixNano(), 10),
+			"exit=" + strconv.Itoa(exit) + " ver=" + version,
+		}},
+	}}})
+	if err != nil {
+		return
+	}
+	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
+	if err != nil {
+		return
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
+	if err != nil {
+		return
+	}
+	resp.Body.Close()
+}
+
+type lokiPush struct {
+	Streams []lokiStream `json:"streams"`
+}
+
+type lokiStream struct {
+	Stream map[string]string `json:"stream"`
+	Values [][2]string       `json:"values"`
+}
--- a/cli/update_viktorbarzin_me.go
+++ b/cli/update_viktorbarzin_me.go
@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
 	if err != nil {
 		return errors.Wrapf(err, "Error reading response")
 	}
-	glog.Infof("Response:", string(responseBody))
+	glog.Infof("Response: %s", string(responseBody))
 	return nil
 }
--- a/cli/usage_test.go
+++ b/cli/usage_test.go
@ -0,0 +1,18 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestUsageQuery(t *testing.T) {
+	got := usageQuery("30d", "")
+	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
+	if got != want {
+		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
+	}
+	withUser := usageQuery("7d", "emo")
+	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
+		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
+	}
+}
--- a/cli/woodpecker.go
+++ b/cli/woodpecker.go
@ -0,0 +1,191 @@
+package main
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"os"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
+// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
+// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
+const (
+	wpHost = "ci.viktorbarzin.me"
+	wpLBIP = internalLBIP // same dedicated Traefik LB that probe.go dials
+)
+
+type wpClient struct {
+	base  string
+	token string
+	http  *http.Client
+}
+
+// wpToken reads WOODPECKER_TOKEN (or WP_TOKEN), else the canonical Vault path.
+func wpToken() string {
+	if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
+		return t
+	}
+	out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
+	if err != nil {
+		return ""
+	}
+	return strings.TrimSpace(string(out))
+}
+
+func newWPClient() (*wpClient, error) {
+	tok := wpToken()
+	if tok == "" {
+		return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
+	}
+	ip := firstEnv("HOMELAB_WP_IP")
+	if ip == "" {
+		ip = wpLBIP
+	}
+	dialer := &net.Dialer{Timeout: 8 * time.Second}
+	tr := &http.Transport{
+		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
+			if strings.HasPrefix(addr, wpHost+":") {
+				addr = ip + addr[strings.LastIndex(addr, ":"):]
+			}
+			return dialer.DialContext(ctx, network, addr)
+		},
+	}
+	return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
+}
+
+// getJSON GETs path into v, retrying the transient empty/5xx responses the
+// Woodpecker API intermittently returns under load.
+func (c *wpClient) getJSON(path string, v interface{}) error {
+	var lastErr error
+	for attempt := 0; attempt < 5; attempt++ {
+		if attempt > 0 {
+			time.Sleep(2 * time.Second)
+		}
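+		req, err := http.NewRequest("GET", c.base+path, nil)
+		if err != nil {
+			return fmt.Errorf("woodpecker GET %s: %w", path, err) // malformed URL; retrying cannot help
+		}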
+		req.Header.Set("Authorization", "Bearer "+c.token)
+		resp, err := c.http.Do(req)
+		if err != nil {
+			lastErr = err
+			continue
+		}
+		body, _ := io.ReadAll(resp.Body)
+		resp.Body.Close()
+		if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
+			lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
+			continue
+		}
+		if resp.StatusCode >= 300 {
+			return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
+		}
+		return json.Unmarshal(body, v)
+	}
+	return lastErr
+}
+
+type wpPipeline struct {
+	Number  int    `json:"number"`
+	Status  string `json:"status"`
+	Event   string `json:"event"`
+	Commit  string `json:"commit"`
+	Message string `json:"message"`
+}
+
+func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
+	var ps []wpPipeline
+	err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
+	return ps, err
+}
+
+// findPipeline returns the pipeline for commit (prefix match), or the latest when
+// commit is empty.
+func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
+	ps, err := c.recentPipelines(repoID, 25)
+	if err != nil {
+		return wpPipeline{}, err
+	}
+	if len(ps) == 0 {
+		return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
+	}
+	if commit == "" {
+		return ps[0], nil
+	}
+	for _, p := range ps {
+		if strings.HasPrefix(p.Commit, commit) {
+			return p, nil
+		}
+	}
+	return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
+}
+
+func (c *wpClient) repoID() (int, error) {
+	owner, repo, err := repoOwnerName()
+	if err != nil {
+		return 0, err
+	}
+	var r struct {
+		ID int `json:"id"`
+	}
+	if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
+		return 0, err
+	}
+	if r.ID == 0 {
+		return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
+	}
+	return r.ID, nil
+}
+
+// repoOwnerName derives <owner>/<repo> from the cwd git remote.
+func repoOwnerName() (string, string, error) {
+	cwd, _ := os.Getwd()
+	root, err := gitRepoRoot(cwd)
+	if err != nil {
+		return "", "", fmt.Errorf("not in a git repository: %w", err)
+	}
+	remote := preferRemote(remotesOrEmpty(root))
+	url, err := gitOutput(root, "remote", "get-url", remote)
+	if err != nil {
+		return "", "", err
+	}
+	return parseOwnerRepo(url)
+}
+
+// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
+func parseOwnerRepo(url string) (string, string, error) {
+	u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
+	u = strings.TrimSuffix(u, "/")
+	if i := strings.Index(u, "://"); i >= 0 {
+		u = u[i+3:]
+	}
+	u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
+	parts := strings.Split(u, "/")
+	if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
+		return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
+	}
+	return parts[len(parts)-2], parts[len(parts)-1], nil
+}
+
+func isTerminalStatus(s string) bool {
+	switch s {
+	case "success", "failure", "error", "killed", "declined", "blocked":
+		return true
+	}
+	return false
+}
+
+func isFailureStatus(s string) bool {
+	return s == "failure" || s == "error" || s == "killed" || s == "declined"
+}
+
+func min(a, b int) int {
+	if a < b {
+		return a
+	}
+	return b
+}
--- a/cli/woodpecker_test.go
+++ b/cli/woodpecker_test.go
@ -0,0 +1,40 @@
+package main
+
+import "testing"
+
+func TestParseOwnerRepo(t *testing.T) {
+	cases := []struct{ in, owner, repo string }{
+		{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
+		{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
+		{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
+		{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
+	}
+	for _, c := range cases {
+		o, r, err := parseOwnerRepo(c.in)
+		if err != nil || o != c.owner || r != c.repo {
+			t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
+		}
+	}
+	if _, _, err := parseOwnerRepo("nonsense"); err == nil {
+		t.Error("expected error for unparseable remote")
+	}
+}
+
+func TestStatusClassification(t *testing.T) {
+	for _, s := range []string{"success", "failure", "error", "killed"} {
+		if !isTerminalStatus(s) {
+			t.Errorf("%q should be terminal", s)
+		}
+	}
+	for _, s := range []string{"running", "pending"} {
+		if isTerminalStatus(s) {
+			t.Errorf("%q should not be terminal", s)
+		}
+	}
+	if !isFailureStatus("failure") || !isFailureStatus("error") {
+		t.Error("failure/error should classify as failure")
+	}
+	if isFailureStatus("success") {
+		t.Error("success must not classify as failure")
+	}
+}
--- a/config.tfvars
+++ b/config.tfvars
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
 Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:

 - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
 - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
 - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

--- a/docs/adr/0004-homelab-unified-cli.md
+++ b/docs/adr/0004-homelab-unified-cli.md
@ -0,0 +1,30 @@
+# homelab: a unified infra-ops CLI grown in place from infra/cli
+
+Agents re-derive the same operational command boilerplate every session — mining
+51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
+(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
+the deterministic, repeated **actions** (not judgment) agents run — composable in
+bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
+grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
+alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
+file (the infra repo deploys continuously and does not cut semver tags).
+
+## Considered options
+
+- **Its own top-level repo** (the original plan) — rejected in favour of keeping
+  it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
+  Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
+  GitOps continuous-deploy.
+- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
+  webhook use-cases.
+- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
+  recurring action surface (methodology skills; third-party/owned MCP such as
+  phpIPAM, which homelab does NOT duplicate).
+
+## Consequences
+
+- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
+  in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
+  and falls through to the legacy `-use-case` path verbatim.
+- Distribution: built from source to `/usr/local/bin/homelab` during devvm
+  provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.
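The front-dispatch described under Consequences could look roughly like the sketch below. It is only an illustration under stated assumptions: the `verbs` table, `dispatch`, and `runLegacy` are hypothetical names, not taken from the repo. Only the `manifest` verb and the `-use-case` flag come from the ADR.

```go
package main

import (
	"fmt"
	"os"
)

// verbs is a hypothetical table of homelab verb-groups; the real set lives in cli/.
var verbs = map[string]func(args []string) error{
	"manifest": func(args []string) error {
		fmt.Println("manifest: (verb list would be printed here)")
		return nil
	},
}

// dispatch reports whether argv names a homelab verb and, if so, returns the
// verb's error. handled == false tells the caller to fall through to the
// legacy path without touching argv.
func dispatch(argv []string) (handled bool, err error) {
	if len(argv) == 0 {
		return false, nil
	}
	run, ok := verbs[argv[0]]
	if !ok {
		return false, nil
	}
	return true, run(argv[1:])
}

// runLegacy stands in for the preserved -use-case flag handling (assumed shape).
func runLegacy(argv []string) error {
	fmt.Println("legacy path:", argv)
	return nil
}

func main() {
	args := os.Args[1:]
	handled, err := dispatch(args)
	if !handled {
		err = runLegacy(args)
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Checking the verb table before any flag parsing is what lets the legacy webhook invocations keep working unchanged, since anything that is not a known verb reaches the old code with its arguments intact.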
--- a/docs/adr/0005-homelab-v01-scope.md
+++ b/docs/adr/0005-homelab-v01-scope.md
@ -0,0 +1,23 @@
+# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
+
+v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
+(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
+force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
+commands and where agents lose the most time and leak the most presence claims.
+
+v0.1 enforces **no** homelab-level permission gating: everything is allowed,
+relying on existing gates (harness permission mode, presence claims, plan
+approval). But every verb records a `read|write` tier (visible in `manifest`), so
+a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
+later with zero restructuring.
+
+## Considered options
+
+- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
+  value, but defers the toil that motivated the project.
+- **One domain deep (k8s)** — cleanest template, narrow day-one value.
+
+We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
+the extra complexity (worktree lifecycle, git-crypt flag injection, presence
+coupling, branch-protection PR fallback) for the biggest immediate toil
+reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.
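To illustrate how recording a tier per verb leaves room for a later classifier hook, here is a minimal sketch. The `verbInfo` type, its JSON field names, and `autoAllow` are assumptions made for illustration. The ADR only states that each verb records a `read|write` tier visible in `manifest`, and the tier assigned to each verb below is a guess.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tier is the read|write classification the ADR says every verb records.
type tier string

const (
	tierRead  tier = "read"
	tierWrite tier = "write"
)

// verbInfo is a hypothetical manifest entry; the field names are assumptions.
type verbInfo struct {
	Name string `json:"name"`
	Tier tier   `json:"tier"`
}

// manifestJSON renders the verb list as JSON for progressive discovery.
func manifestJSON(vs []verbInfo) (string, error) {
	b, err := json.Marshal(vs)
	return string(b), err
}

// autoAllow is the question a future PreToolUse hook could ask: only reads
// pass without a prompt. v0.1 itself gates nothing.
func autoAllow(v verbInfo) bool { return v.Tier == tierRead }

func main() {
	vs := []verbInfo{
		{Name: "tf validate", Tier: tierRead}, // tier assignment is illustrative
		{Name: "tf apply", Tier: tierWrite},
	}
	out, err := manifestJSON(vs)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
	for _, v := range vs {
		fmt.Printf("%s auto-allow=%v\n", v.Name, autoAllow(v))
	}
}
```

Because the tier travels with the verb metadata, a hook could be added later by reading the manifest rather than by restructuring the verbs.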
--- a/docs/adr/0006-homelab-work-and-tf.md
+++ b/docs/adr/0006-homelab-work-and-tf.md
@ -0,0 +1,29 @@
+# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
+
+Four behaviours of the infra-loop verbs are surprising enough to record:
+
+1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
+   native harness worktree tool.** A CLI is a child process and cannot change the
+   agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
+   creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
+   prints the path — the agent enters it with native `EnterWorktree({path})`.
+
+2. **`work land` is auto-land, but gated on verification.** It merges master in →
+   runs verification → pushes `HEAD:master` (fetch+merge+retry on
+   non-fast-forward) → falls back to pushing the feature branch for a PR when the
+   direct push is rejected (branch protection). It **refuses to push when it
+   cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
+   `--no-verify` is passed — added after an accidental smoke-test land pushed
+   unverified WIP to master (benign: the infra CI applied 0 stacks because the
+   diff was `cli/`-only, but an unverified land must be deliberate, not default).
+
+3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
+   Local applies are out-of-band (CI applies canonically on push) but happen
+   constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
+   delegates to `scripts/tg apply --non-interactive`, and **always releases on
+   exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
+   documented ~200-claim leak — and prints an out-of-band reminder.
+
+4. **Known v0.1 limitation:** `work land` does not yet wait for CI to go green; that
+   arrives with the ci/deploy watch verb-group. It prints a reminder to follow
+   the pipeline manually.
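The "always releases on exit" guarantee in item 3 can be sketched as follows. This is a minimal illustration of the `sync.Once` plus signal-handler idea, not the repo's implementation: `withClaim`, its callback parameters, `errApplyFailed`, and the exit status 130 are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// errApplyFailed is a stand-in error used by the demo below.
var errApplyFailed = errors.New("apply failed")

// withClaim runs fn between claim and release. release runs exactly once
// whether fn returns normally, returns an error, or the process receives
// SIGINT/SIGTERM while fn is running.
func withClaim(claim, release func() error, fn func() error) error {
	if err := claim(); err != nil {
		return err
	}
	var once sync.Once
	releaseOnce := func() { once.Do(func() { _ = release() }) }
	defer releaseOnce() // normal and error returns

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
	defer signal.Stop(sigs)

	done := make(chan struct{})
	defer close(done) // lets the watcher goroutine exit when fn finishes
	go func() {
		select {
		case <-sigs:
			releaseOnce() // signal path: release before exiting
			os.Exit(130)
		case <-done:
		}
	}()
	return fn()
}

func main() {
	err := withClaim(
		func() error { fmt.Println("claim stack:example"); return nil },
		func() error { fmt.Println("release stack:example"); return nil },
		func() error { return errApplyFailed },
	)
	fmt.Println("apply result:", err)
}
```

The `sync.Once` matters because the deferred release and the signal handler can race; whichever runs first performs the release and the other becomes a no-op.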
--- a/docs/adr/0007-homelab-k8s-verbs.md
+++ b/docs/adr/0007-homelab-k8s-verbs.md
@ -0,0 +1,30 @@
+# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
+
+v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
+(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
+than every other domain combined).
+
+It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
+one app, so `<app>` defaults to the namespace, and the target defaults to
+`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
+`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
+specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
+
+Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
+`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
+
+## Decisions worth recording
+
+- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
+  `scale`/`create`). They stay raw `kubectl`, by design, per the repo's
+  Terraform-only policy — the corpus confirms they're low-frequency, and a
+  friendly verb would normalise a policy violation.
+- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
+  config mutation and forbidden; the verb cannot target them.
+- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
+  sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
+  `psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
+  `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
+  the pod env and never appears on the command line.
+- Read verbs were smoke-tested against the live cluster; write verbs are
+  unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.
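
The `db` exec pattern might be planned along these lines. This is a sketch under assumptions: the function names are hypothetical, the `dbaas` namespace is inferred from the text, and the ADR names only `pg-cluster-rw` without a resource kind, so `target` is passed to kubectl verbatim.

```python
import shlex

DBAAS_NS = "dbaas"  # assumption: the dbaas pods live in this namespace


def pg_argv(app, sql, target="pg-cluster-rw"):
    # psql connects as postgres to the database named after the app.
    return ["kubectl", "exec", "-n", DBAAS_NS, target, "-c", "postgres", "--",
            "psql", "-U", "postgres", "-d", app, "-c", sql]


def mysql_argv(sql):
    # $MYSQL_ROOT_PASSWORD is expanded by bash inside the pod, so the
    # password never appears in the local argv; only the SQL is quoted.
    script = 'mysql -p"$MYSQL_ROOT_PASSWORD" -e ' + shlex.quote(sql)
    return ["kubectl", "exec", "-n", DBAAS_NS, "mysql-standalone-0", "--",
            "bash", "-c", script]
```

Building an argv list (rather than one shell string) keeps local quoting out of the picture; `shlex.quote` is needed only for the script that bash runs in the pod.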
--- a/docs/adr/0008-homelab-memory-verbs.md
+++ b/docs/adr/0008-homelab-memory-verbs.md
@ -0,0 +1,30 @@
+# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
+
+v0.3 adds the memory verb-group so agents can search and navigate memory from the
+CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
+ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
+frontend over it**. `homelab memory` is a thin HTTP client over the same API,
+using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
+`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
+API directly, it **works even when the MCP frontend is down** — the recurring
+MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
+offline for the entire session this was built in).
+
+Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
+`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
+the live API including a store→recall→delete round-trip — full data-plane parity
+with the MCP.
+
+## Deprecation path (deliberate follow-up — NOT done in v0.3)
+
+The MCP is more than tools: the **per-prompt auto-recall hook** and the
+**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
+a separate, sequenced change:
+
+1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
+   to `homelab memory store`.
+2. Update the CLAUDE.md memory policy to point at the CLI.
+3. Uninstall the MCP.
+
+The work is sequenced CLI-first (verbs proven before touching the every-prompt
+path) so a regression can't silently break auto-recall/auto-learn fleet-wide.
--- a/docs/adr/0009-homelab-ci-deploy-verbs.md
+++ b/docs/adr/0009-homelab-ci-deploy-verbs.md
@ -0,0 +1,29 @@
+# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
+
+v0.4 adds `ci`/`deploy` to remove the biggest *reasoning* sink in agent sessions:
+watching a build/deploy to completion. The session that built it proved the
+cost: hours went to hand-rolling Woodpecker API polling, DB-schema
+reverse-engineering, and retrigger logic for a single CI incident.
+
+## Decisions
+
+- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
+  not its Postgres schema (which drifts across upgrades — column renames bit us
+  mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
+  while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
+  equivalent of the house `curl --resolve` pattern). Token from
+  `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
+  git remote via `/api/repos/lookup/<owner>/<repo>`.
+- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
+  under load (it flapped through the whole build session); `getJSON` retries
+  empties with backoff so `ci watch` is reliable exactly when it's needed.
+- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
+  on the landed commit and fails if the pipeline does — closing the gap ADR-0005
+  deferred. `--no-ci-watch` opts out.
+- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
+  the deployment image to reference the expected sha, *then* blocks on rollout
+  status (kubectl-based; reuses the k8s helpers).
+- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
+  endpoints were the least reliable this session (often empty); `status`/`watch`
+  rely on the list endpoint that works. A DB-backed `ci logs` is a possible
+  follow-up if the API path stays flaky.
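
The "retries are mandatory" decision can be illustrated with a small retry loop. This is a hedged Python sketch, not the CLI's `getJSON`: `get_json_with_retry`, `EmptyResponse`, the attempt count, and the delays are all assumptions, and `fetch` is a caller-supplied function returning `(status, parsed_body_or_None)`.

```python
import time


class EmptyResponse(Exception):
    """Raised when every attempt returned an empty or failed response."""


def get_json_with_retry(fetch, attempts=5, base_delay=0.5, sleep=time.sleep):
    for attempt in range(attempts):
        status, body = fetch()
        # An empty body counts as a failure even on HTTP 200. Note this
        # also retries a legitimately empty list.
        if 200 <= status < 300 and body:
            return body
        if attempt < attempts - 1:
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    raise EmptyResponse(f"no usable response after {attempts} attempts")
```

Injecting `sleep` keeps the loop testable without real waiting.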
--- a/docs/adr/0010-homelab-net-obs-verbs.md
+++ b/docs/adr/0010-homelab-net-obs-verbs.md
@ -0,0 +1,37 @@
+# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
+
+v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
+test the user posed mid-build: *does the verb save reasoning, or only typing?* A
+wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
+keystrokes but not thought. These four save thought — the reasoning they encode
+is **which endpoint, reached how, with what auth/URL shape** — re-derived every
+time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
+get`, which are thin wrappers; see the session discussion.)
+
+## Decisions
+
+- **Internal ingresses, reached via the LB.** Everything routes through the
+  Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
+  Go form of the house `curl --resolve host:443:10.0.20.203` pattern
+  (`probe.go: clientDialingIP`). Verified live before building: Prometheus
+  (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
+  answer JSON over the LB with **no auth gate and no port-forward** — so these
+  stay clean HTTP clients, not kubectl wrappers.
+- **`net check` is two-legged on purpose.** It resolves the host via public DNS
+  (→ Cloudflare) AND dials the internal LB, reporting both — because the useful
+  question is *where* a break is (CF edge vs the app vs the LB path), which a
+  single curl can't answer. The external leg forces public resolution (the devvm
+  resolver is split-horizon and would otherwise hit the LB for both).
+- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
+  `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
+  Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
+  alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
+  queryable through the working endpoint — so no new dependency.
+- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
+  raw `*.svc` services) that would force port-forward/`kubectl run`. The
+  reasoning-savings there don't beat the added moving parts; kept out of scope.
+- **No `node`/`secret` group.** Same test: their high-volume parts are
+  command-wrappers (low savings); only compound node ops (serial console, VM
+  wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
+  unless a concrete pain surfaces — the high-value deterministic surface
+  (tf/work/ci/k8s/memory + these probes) is now covered.
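
The `metrics alerts` decision amounts to an instant query for the `ALERTS` series on the standard Prometheus `/api/v1/query` endpoint. A minimal sketch, assuming HTTPS on the ingress host and using hypothetical function names; the HTTP call itself (dialing the LB) is left out.

```python
import json
from urllib.parse import urlencode

PROM_HOST = "prometheus-query.viktorbarzin.lan"


def firing_alerts_url(host=PROM_HOST):
    # Firing alerts are read as a time series, not from /api/v1/alerts.
    query = 'ALERTS{alertstate="firing"}'
    return f"https://{host}/api/v1/query?" + urlencode({"query": query})


def alert_names(response_text):
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each result carries the alert's labels under "metric".
    return sorted({r["metric"].get("alertname", "")
                   for r in payload["data"]["result"]})
```

Because this is an ordinary query, it works through the query-only frontend with no extra dependency.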
--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@ -0,0 +1,42 @@
+# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
+
+v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
+exists to answer the question that drove the whole CLI — *which verbs are worth
+adding next* — with data instead of one maintainer's habits (the earlier mining
+covered a single user's ~51k commands, so the surface is shaped to that user).
+
+> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
+> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
+> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
+> owner in-session") no longer holds: the managed-settings policy now **defers
+> to OS/sudo authorization**. The `usage top` telemetry design itself is
+> unchanged and still current — only the "never read homes" framing in the
+> third decision below is overtaken.
+
+## Decisions
+
+- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
+  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
+  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
+  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
+  the analytics reader doesn't pollute its own data.
+- **Payload is deliberately minimal: verb path + exit code only.** Labels
+  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
+  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
+  emit sees only the matched verb name, not the arguments. This is what makes
+  cross-user aggregation safe.
+- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
+  CLI writes its own invocations (attributed to its OS user) to the shared Loki
+  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
+  back with a LogQL metric query. This is the privacy-preserving resolution to
+  "what does everyone (e.g. another user) use" — it never touches anyone's
+  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
+  managed-settings; reading another user's home is off-limits even for an owner
+  in-session — a fresh session under changed MDM policy is the only legitimate
+  path, and even then this telemetry is the better answer).
+- **Best-effort, never affects the command.** All errors swallowed; an 800ms
+  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
+  must never slow or break the tool it measures.
+- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
+  path (same host, same LB dial). Presence MySQL was the alternative (queryable
+  SQL) but would add a write dependency and creds; Loki needs neither.
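
The minimal payload described above maps onto the Loki push API's JSON body roughly as follows. This is a sketch with hypothetical function names; the label set and line format come from the ADR, while the exact verb-path string and the version value are illustrative.

```python
import time


def telemetry_enabled(env):
    # Opt-out: HOMELAB_TELEMETRY=0 disables the emit.
    return env.get("HOMELAB_TELEMETRY") != "0"


def usage_payload(user, verb, exit_code, version, now_ns=None):
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {"streams": [{
        # Low-cardinality labels only; no args, paths, flags or hostnames.
        "stream": {"job": "homelab-usage", "user": user, "verb": verb},
        "values": [[ts, f"exit={exit_code} ver={version}"]],
    }]}
```

The function takes only the matched verb path and the exit code as inputs, so arguments cannot leak into the payload by construction.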
--- a/docs/adr/0012-homelab-ha-verbs.md
+++ b/docs/adr/0012-homelab-ha-verbs.md
@ -0,0 +1,54 @@
+# homelab Home Assistant verbs: token resolution + host SSH, not entity control
+
+v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
+operator's sessions: across ~1,900 shell commands the single most-repeated line
+(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
+and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
+a shell function ~30× — both re-derived from scratch every session. The existing
+`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
+arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
+cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
+gap for every user in every directory.
+
+## Decisions
+
+- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
+  does entity state and control (`get_state`, `call_service`, history, logs).
+  Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
+  — we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
+  *resolution* and host *SSH*, neither of which an API-only MCP can provide. The
+  value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
+- **`ha token` resolves live from the cluster, not from an env var.** It reads
+  the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
+  `london`) via the ambient kubeconfig. This is robust to env drift — the precise
+  failure that made agents re-derive the pipeline. Read-tier, prints the bare
+  token to stdout so it composes in `$(…)`, mirroring `memory secret`.
+- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
+  It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob
+  also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
+  admins can read — so the verb hung/failed for the non-admin operator it was
+  built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
+  OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
+  the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
+  the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
+  the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
+  — this is purely additive.
+- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
+  use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
+  `UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
+  TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
+  is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
+  whoever first wrote the workflow; that user's key must be enrolled on the HA
+  host. Write-tier (runs an arbitrary remote command).
+- **sofia is the default; london is structural.** The devvm sits on the Sofia
+  LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
+  (`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
+  works (a pure secret read), but `ha ssh --instance london` generally won't
+  connect from here — london is remote. We model it correctly rather than
+  pretend it's reachable.
+- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
+  the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
+  `check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
+  already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
+  `usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
+  still hand-rolled often.
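
The fixed `ha ssh` invocation can be written out as an argv builder. A sketch only: `ha_ssh_argv` and `INSTANCES` are hypothetical names, while the flags, key path, and hosts are the ones listed in the decisions above.

```python
import os

INSTANCES = {
    "sofia": "vbarzin@192.168.1.8",     # default: on the devvm's LAN
    "london": "hassio@192.168.8.103",   # modelled, generally unreachable
}


def ha_ssh_argv(command, instance="sofia", home=None):
    home = home or os.path.expanduser("~")
    return [
        "ssh",
        "-F", "/dev/null",                     # ignore the user's ssh config
        "-o", "StrictHostKeyChecking=no",      # no host-key prompt
        "-o", "UserKnownHostsFile=/dev/null",  # and no host-key record
        "-o", "BatchMode=yes",                 # never ask for input
        "-o", "ConnectTimeout=10",             # fail fast
        "-i", os.path.join(home, ".ssh", "id_ed25519"),  # invoking user's key
        INSTANCES[instance],
        command,
    ]
```

Resolving the key from the invoking user's home is what keeps the verb per-user rather than tied to whoever wrote the workflow first.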
--- a/docs/adr/0013-homelab-browser-verbs.md
+++ b/docs/adr/0013-homelab-browser-verbs.md
@ -0,0 +1,75 @@
+# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
+
+v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
+capability that already existed but was undiscoverable: driving the cluster's
+**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
+`svc/chrome-service:9222`) from the devvm, for sites that detect and block
+headless automation.
+
+## Motivating incident (2026-06-22)
+
+Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
+portal: the headless `@playwright/mcp` browser loaded the site and filled the
+entire multi-step form, but the **final submit silently failed** — Fixflo's
+pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
+spinner hung, no issue was created. Root cause = headless-Chrome detection. The
+fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
+submitted first try (Fixflo ref IS22657587). That capability was documented
+(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
+it took ~40 min, three redundant full form re-runs, and a user hint. The agent
+also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
+of inspecting the network panel.
+
+## Decisions
+
+- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
+  rejected: the CLI is run every session (so the verb is *discoverable*), is
+  versioned, multi-user, and test-covered. A private, untested skill is none of
+  those. The command owns only the deterministic *mechanics* (port-forward,
+  stealth injection, lifecycle) — the agent supplies the Playwright script, so
+  *judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
+- **The failure was judgment, not setup friction**, so the CLI is paired with a
+  one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
+  payload in `browser --help`: the *when-to-use* signature (a site loads but a
+  gated action fails/hangs, or one request 500s/aborts while siblings 200 →
+  suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
+  = request resolved/intercepted by the automation layer, **not** egress;
+  egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
+  and would break the page load too). A command the agent doesn't think to run is
+  useless; the cheat-sheet is the actual fix for the misdiagnosis.
+- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
+  localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
+  NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
+  label. Readiness is asserted against `/json/version`: the endpoint must report
+  a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
+  **always** torn down (process-group kill + signal handler), on success and on
+  error — an acceptance requirement.
+- **Default to a fresh incognito context; `--shared-context` opts into the warmed
+  profile.** chrome-service is a single shared browser with a persistent profile.
+  A fresh, always-closed context is safe for concurrent callers (tripit's fare
+  scrape connects per-quote) and is what production already does. The warmed
+  persistent profile (cookies from a manual noVNC login) is opt-in for flows that
+  need a pre-logged-in session.
+- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
+  chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
+  Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
+  the client expects changes between Playwright minors; the devvm's ambient
+  Python Playwright was
+  1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
+  regardless of local drift. `playwright-core` (not `playwright`) because no
+  browser binary is needed — we connect to the remote one.
+- **Self-provision the client lazily, no per-user setup.** The pinned client is
+  installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
+  guarded) on first use, alongside the embedded runner + stealth files. node is
+  already fleet-wide; this avoids coupling the feature to a provisioner change
+  and keeps it self-contained and self-healing. The client runs on the devvm, so
+  `setInputFiles` streams local files to the remote browser over CDP — no
+  `chmod`/staging-dir workaround on the CDP path.
+- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
+  copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
+  in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
+  `go:embed` can't reach outside the package dir, hence the vendored copy rather
+  than a path reference.
+- **Scope held at two action verbs + help.** `run` (arbitrary script — the
+  workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
+  the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
+  via `usage top` (ADR-0011) before adding more.
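
The readiness assertion against `/json/version` could look like the sketch below. It assumes the usual DevTools response shape with a `Browser` field such as `Chrome/<version>`; `assert_headful` is a hypothetical name and the sample version strings in the usage are illustrative.

```python
import json


def assert_headful(version_json_text):
    """Return the Browser string, or raise if the endpoint is not real Chrome."""
    info = json.loads(version_json_text)
    browser = info.get("Browser", "")
    if browser.startswith("HeadlessChrome/"):
        raise RuntimeError(f"endpoint reports headless browser: {browser}")
    if not browser.startswith("Chrome/"):
        raise RuntimeError(f"unexpected browser string: {browser!r}")
    return browser
```

Failing hard here means a script never runs against a headless endpoint by accident, which is the whole reason the verb exists.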
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@ -0,0 +1,35 @@
+---
+status: accepted
+date: 2026-06-24
+---
+
+# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
+
+As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
+
+## Considered options
+
+- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
+- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
+- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
+- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
+- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
+- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
+- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
+
+## Consequences
+
+- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
+- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
+- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
+- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
+- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
+- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
+- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
+- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
+
+## As-built (2026-06-25)
+
+Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
+
+Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
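
The edge-set upsert described in the as-built note can be modelled in memory as follows. This is an illustrative sketch, not the aggregator's code: the real service upserts into the `goldmane_edges` database, and treating `(src_ns, dst_ns, action)` as the unique key is an assumption drawn from the column list.

```python
def upsert_edge(edges, src_ns, dst_ns, action, seen_at):
    """Fold one flow into the edge set held in the `edges` dict."""
    # Drop flows with an empty namespace and self-edges.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return
    key = (src_ns, dst_ns, action)
    row = edges.get(key)
    if row is None:
        edges[key] = {"first_seen": seen_at, "last_seen": seen_at,
                      "flow_count": 1}
    else:
        row["first_seen"] = min(row["first_seen"], seen_at)
        row["last_seen"] = max(row["last_seen"], seen_at)
        row["flow_count"] += 1
```

Storing one row per edge rather than one per flow keeps the durable trail small while still recording when each namespace pair was first and last seen talking.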
--- a/docs/adr/0015-os-is-the-authorization-boundary.md
+++ b/docs/adr/0015-os-is-the-authorization-boundary.md
@ -0,0 +1,57 @@
+# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
+
+Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
+carried and that ADR-0011 leaned on ("never read another user's home /
+`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
+subject — `usage top` telemetry and its emit design — is unchanged and still
+current; only the privacy prohibition it referenced is superseded here.
+
+## Context
+
+The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
+`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
+"you are not the admin, do not escalate privileges" and "never read another
+user's home directory, credentials, tokens, or `~/.claude`." The OS told a
+different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
+The kernel had already granted total read access; the policy was layering an
+artificial refusal on top of access the OS had already authorized, and the
+"not the admin" framing was factually wrong for a NOPASSWD-root user.
+
+Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
+or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
+for analytics/debugging across the shared box.
+
+## Decision
+
+- **Authorization follows the OS, not this policy.** Agents may access whatever
+  their OS user can access — directly or via `sudo` where they hold sudo rights
+  — and must not impose restrictions stricter than the OS. On this box that
+  includes other users' home directories and `~/.claude` for users who hold
+  broad sudo.
+- **No separate prompt or carve-out** for OS-authorized access. The Unix
+  permission model + sudoers is the single source of truth for who may read
+  what. Other users' homes are mode `0750`, so a cross-home read necessarily transits
+  `sudo` and is therefore captured in the sudo/auth audit log.
+- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
+  stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
+  file access, not a licence to exceed cluster RBAC.
+- **Scope is symmetric and multi-user.** The rule lives in the *shared*
+  managed-settings, so every user's agents defer to that user's own sudo grant.
+  Any user with broad sudo gets the same cross-home read capability over other
+  users' files. Accepted by the owner with that understanding; emo's and
+  ancamilea's `~/.claude` is now agent-readable by sudo-holders.
+- **Takes effect in a fresh session.** managed-settings loads at session start;
+  the session that made the change keeps running under the old policy.
+
+## Consequences
+
+- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
+  "cross-user analytics without reading homes" answer) remains useful but is no
+  longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
+- Larger blast radius: if an agent session running as a sudo-holder is
+  prompt-injected or otherwise compromised, it can now read every user's secrets
+  with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
+  is the remaining accountability control.
+- Reversible: restore the prior `claudeMd` bullets (backup kept at
+  `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
+  session.
--- a/docs/adr/0016-gpu-vram-extended-resource-budget.md
+++ b/docs/adr/0016-gpu-vram-extended-resource-budget.md
@ -0,0 +1,107 @@
+# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
+
+The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
+(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
+(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
+ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
+turn, not memory* — the scheduler is blind to VRAM, so the tenants can
+collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
+onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
+qwen3-8b, and silently broke recruiter-responder triage for ~5 h
+(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
+post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
+
+## Context
+
+- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
+  only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
+- **The card is busy but not steadily oversubscribed.** Measured steady residents
+  (2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
+  llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
+  ~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
+  free. **The failure mode is a single tenant's runtime runaway, not a
+  scheduling-time pile-on.**
+- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
+  is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
+  policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
+  free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
+  vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
+  was missing is anything that bounds a tenant's VRAM *during active use*.
+
+### Alternatives considered and rejected
+
+- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
+  caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
+  Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
+  large rounding waste on a card that has none to spare. Rejected.
+- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
+  the *correct* hard-cap primitive and T4-supported, but it **replaces the
+  operator's device plugin** (the operator owns/reconciles it), enforces via an
+  `LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
+  (open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
+  hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
+  false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
+  regularly), and its reservation-based scheduling would **supersede the working
+  demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
+  behavioral change for the single proven failure mode. Rejected for now; this
+  ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
+  trade-offs already on the table.
+
+## Decision
+
+Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
+pieces, **no device-plugin/driver change, time-slicing untouched**:
+
+1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
+   `viktorbarzin.me/gpumem`** on the GPU node (= ~14000 MiB: the ~15360 MiB usable
+   minus ~1360 MiB of driver/CUDA-context/exporter slack), via a reconcile Job +
+   CronJob that run `kubectl patch node --subresource=status` (dynamic over
+   `nvidia.com/gpu.present=true` nodes; re-asserts after node re-register).
+   Every GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
+   3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
+   ≤ advertised). Extended resources are **non-overcommittable** (request==limit,
+   integer), so the scheduler refuses to co-schedule past the card → overflow
+   `Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
+   free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
+2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, `nvidia` namespace)
+   reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
+   pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
+   recycles the biggest **over-budget** offender (used > declared). Contract
+   enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
+   priority can't distinguish them). Acting only under pressure lets a tenant burst
+   into genuine slack; the recycle clears its arena (exactly what the TTL=600
+   Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
+3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
+   `Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
+
+This is **soft enforcement**: the scheduler reserves on paper and the watchdog
+corrects at runtime with a detection lag (seconds to about a minute, since the
+CronJob runs every minute), so a brief physical overshoot is possible before a
+recycle. Accepted, given the failure mode is a slow arena drift, not an
+instantaneous spike, and the alternative (HAMi) carries disproportionate risk
+for this hardware.
+
+## Consequences
+
+- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
+  operator, or time-slicing. immich-ml can no longer silently grow into
+  llama-swap's VRAM: it either schedules within its budget or, on a true runaway
+  under pressure, gets recycled (its heavy library job is the intended loser).
+- **The card has a seating chart now.** Sum of declared budgets ≤ ~14000 MiB, so a new
+  always-on GPU tenant requires re-budgeting; a pod whose declared `gpumem` does
+  not fit the remaining budget sits `Pending`. This is the intended, legible
+  back-pressure.
+- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
+  NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
+  seat (tts via its existing free-VRAM demand-gate), and are covered by the
+  ~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
+  to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
+  the watchdog protects the budgeted five and counts everyone's usage toward free.
+- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
+  cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
+  existing cluster-admin tooling (woodpecker-agent).
+- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
+  consumer stacks declare it, or a pod requesting an unadvertised extended
+  resource is unschedulable. The reconcile runs as a Job (immediate) for this.
+- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
+  `kubectl patch node --subresource=status` to remove the capacity key. Nothing
+  structural; no driver/operator state to unwind.
+- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
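The budget and watchdog rules in ADR-0016 above can be sketched in a few lines of Python. This is an illustrative model only, not the repo's CronJob code: the function names `fits` and `pick_offender` are hypothetical, free VRAM is assumed to be usable VRAM minus the sum of all per-pod usage, and "biggest offender" is assumed to mean the largest excess over the declared budget. The capacity, floor, and budget numbers are taken from the ADR.

```python
"""Sketch of the ADR-0016 budget check and watchdog rule (illustrative only)."""

ADVERTISED_MIB = 14000  # viktorbarzin.me/gpumem advertised on the GPU node
USABLE_MIB = 15360      # usable VRAM on the T4
FLOOR_MIB = 1536        # free-VRAM floor below which the watchdog acts

# Declared budgets from the Decision section, in MiB.
BUDGETS = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}


def fits(budgets, extra_mib=0, advertised=ADVERTISED_MIB):
    """Schedule-time rule: declared budgets plus a candidate must not exceed capacity."""
    return sum(budgets.values()) + extra_mib <= advertised


def pick_offender(used_mib, budgets=BUDGETS, usable=USABLE_MIB, floor=FLOOR_MIB):
    """Runtime rule: return the tenant to recycle, or None.

    Assumptions (not confirmed by the ADR): free VRAM is usable minus the sum
    of every pod's usage, and "biggest" means largest excess over budget.
    """
    free = usable - sum(used_mib.values())
    if free >= floor:
        return None  # no pressure: bursting into genuine slack is allowed
    over = {
        name: used - budgets[name]
        for name, used in used_mib.items()
        # Unbudgeted tenants count toward free VRAM but are never candidates.
        if name in budgets and used > budgets[name]
    }
    if not over:
        return None
    return max(over, key=over.get)


if __name__ == "__main__":
    # Approximate steady-state usage in MiB (illustrative values).
    steady = {"immich-ml": 2150, "frigate": 1950, "llama-swap": 4450,
              "immich-server": 1230, "portal-stt": 1540, "android-emulator": 150}
    # A runaway shaped like 2026-06-02 (illustrative values).
    runaway = {"immich-ml": 10700, "frigate": 1900,
               "immich-server": 1200, "android-emulator": 150}
    print(fits(BUDGETS), pick_offender(steady), pick_offender(runaway))
```

With these inputs the steady state leaves the card above the floor, so nothing is recycled, while the runaway case drops free VRAM below the floor and selects immich-ml as the only tenant above its declared budget.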
--- a/docs/adr/0017-cctv-physical-cabling.svg
+++ b/docs/adr/0017-cctv-physical-cabling.svg
@ -0,0 +1,126 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
+  <!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
+       place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
+       grays + blue for copper runs (reference dataviz palette text tokens). -->
+  <defs>
+    <marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
+      <circle cx="4" cy="4" r="3" fill="#52514e"/>
+    </marker>
+  </defs>
+
+  <rect width="1600" height="820" fill="#fcfcfb"/>
+
+  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
+  <text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
+
+  <!-- ═════════ APARTMENT ═════════ -->
+  <rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
+
+  <text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
+  <path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
+
+  <rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
+  <text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
+
+  <rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
+  <text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
+  <path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
+
+  <text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
+  <path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+
+  <!-- in-wall run apartment -> garage -->
+  <path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
+
+  <!-- ═════════ GARAGE — RACK ═════════ -->
+  <rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
+
+  <!-- switch -->
+  <rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
+  <text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
+  <text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
+  <g font-size="11.5" text-anchor="middle">
+    <rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
+    <text x="664" y="242" fill="#52514e">← apartment</text>
+    <rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
+    <text x="770" y="242" fill="#52514e">← 4G router</text>
+    <rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
+    <text x="876" y="242" fill="#52514e">← UPS mgmt</text>
+    <rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
+    <text x="982" y="242" fill="#52514e">← camera</text>
+    <rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
+    <text x="1088" y="242" fill="#52514e">← R730 eno1</text>
+  </g>
+  <text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
+
+  <!-- 4G router -->
+  <rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
+  <text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
+  <path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
+  <path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+  <text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
+
+  <!-- UPS -->
+  <rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
+  <text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
+  <path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
+
+  <!-- R730 -->
+  <rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
+  <text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
+  <g font-size="11.5">
+    <rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
+    <text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
+    <text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
+    <rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
+    <text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
+    <rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
+    <text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
+    <text x="908" y="613" fill="#8a8984">free, uncabled</text>
+    <rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
+    <text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
+    <text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
+  </g>
+  <text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
+  <text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
+  <text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
+
+  <path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
+
+  <!-- ═════════ GARAGE ENTRANCE ═════════ -->
+  <rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
+  <rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
+  <text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
+  <text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
+  <text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
+
+  <path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
+  <text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
+
+  <!-- legend -->
+  <g transform="translate(40,780)" font-size="12.5">
+    <line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
+    <text x="52" y="0" fill="#0b0b0b">copper, in place</text>
+    <line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
+    <text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
+    <path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
+    <text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
+    <text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
+  </g>
+</svg>
--- a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
+++ b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
@ -0,0 +1,99 @@
+# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
+
+Status: accepted (2026-07-02, rev 3 — single-switch)
+
+![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
+
+![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
+
+The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
+IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
+physically exposed outside the apartment, so anything plugged into that cable
+must land in a segment that can reach nothing. The original design doc
+(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
+to pfSense" — but nothing in this network terminates dot1q on pfSense; the
+site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
+untagged pfSense interface per segment.
+
+**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
+garage TL-SG105E (Viktor prefers not running two switches; retired unit
+becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
+all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
+VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
+carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
+pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
+idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
+vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
+`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
+net3 back to vmbr2 restores pure physical isolation in one `qm set`).
+This narrows the earlier 802.1Q objection rather than contradicting it: the
+rejection assumed *unmanaged* switches, where any LAN device could inject
+tagged frames; with the managed PE as the only device on eno1, VLAN-30
+membership is {camera port, trunk port} only, so tag-30 ingress from every
+other port — and from the exposed camera cable — is dropped or contained.
+Cameras are untrusted: default-deny on dCCTV with a single
+NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
+may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
+route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
+10.0.20.0/22 trusted source-IP allowlist.
+
+## Traffic on the trunk — how one cable carries two networks
+
+The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
+(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
+pfSense:
+
+- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
+  between the trunk, the host's own IP (192.168.1.127) and pfSense `net0` —
+  where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
+  LAN's gateway is and remains the AX6000; home-LAN traffic never transits
+  pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
+  the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
+  the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
+  4G router survives the whole rack being down.
+- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
+  VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
+  segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
+  is impossible by construction, not merely by firewall rule.
+- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
+  out of its WAN toward the AX6000. Load-wise the trunk gained only the
+  camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
+
+![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
+
+*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
+
+## Considered options
+
+- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
+  read this way) — rejected: any LAN device could inject tagged frames into
+  vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
+  undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
+  polices VLAN-30 membership at the single entry point to eno1; no bridge
+  reconfiguration was needed (vmbr0 was already vlan-aware).
+- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
+  (rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
+  (6 connections vs 5 ports once the PE also replaced the old switch) or new
+  hardware. Strongest isolation of all options; kept dormant as the fallback.
+- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
+  router, no inter-VLAN firewall).
+
+## Consequences
+
+- The switch is now a single point of failure and load-bearing for everything
+  in the rack (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV), and
+  its VLAN table + mgmt password are part of the isolation boundary: the Easy
+  Smart mgmt UI answers on every port, so the password is the gate between a
+  compromised camera and the switch config. All 5 ports are consumed, so the
+  next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
+- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
+  leg); eno3/eno4 remain free.
+- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
+  (Kea reservation by MAC).
+- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
+  port-VLAN split (conflated the two devices); rev 2 split into two switches
+  after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
+  consolidated back to one switch — the PE replacing the SG105E — per
+  Viktor's preference, moving CCTV onto a managed tagged trunk.
+- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
+  NVDEC stream.
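The isolation claims in ADR-0017 above can be checked with a small Python model. This is a sketch under stated assumptions, not a description of the TL-SG105PE firmware: the `PORTS` table and the `forward` and `is_trusted` helpers are hypothetical names, port numbers follow the cabling diagram, and the switch is assumed to drop tagged frames whose VLAN ID the ingress port is not a member of and to flood within the VLAN otherwise.

```python
"""Sketch of the ADR-0017 VLAN membership and allowlist claims (illustrative only)."""
import ipaddress

# VLAN membership per port, plus the VLAN given to untagged ingress (PVID).
PORTS = {
    "P1": {"pvid": 1, "vlans": {1}},      # apartment uplink
    "P2": {"pvid": 1, "vlans": {1}},      # 4G router
    "P3": {"pvid": 1, "vlans": {1}},      # UPS mgmt
    "P4": {"pvid": 30, "vlans": {30}},    # camera, untagged VLAN 30
    "P5": {"pvid": 1, "vlans": {1, 30}},  # trunk to eno1: VLAN 1 untagged + VLAN 30 tagged
}

TRUSTED = ipaddress.ip_network("10.0.20.0/22")  # trusted source-IP allowlist
CCTV = ipaddress.ip_network("10.0.30.0/24")     # dCCTV segment


def forward(ingress_port, tag=None):
    """Return the set of ports a frame may leave on (empty set means dropped).

    Assumption: ingress filtering by VLAN membership, flooding within the VLAN.
    """
    port = PORTS[ingress_port]
    vlan = port["pvid"] if tag is None else tag
    if vlan not in port["vlans"]:
        return set()
    return {name for name, p in PORTS.items()
            if vlan in p["vlans"] and name != ingress_port}


def is_trusted(ip):
    """True if the address falls inside the trusted allowlist range."""
    return ipaddress.ip_address(ip) in TRUSTED


if __name__ == "__main__":
    print(forward("P4"))             # camera traffic can only reach the trunk
    print(forward("P1", tag=30))     # tag-30 injection from a VLAN 1 port is dropped
    print(CCTV.subnet_of(TRUSTED))   # the CCTV range is outside the allowlist
    print(is_trusted("10.0.30.70"))  # the camera address is not trusted
```

In this model the camera port can only ever reach the trunk port, and a tagged-30 frame arriving on any VLAN 1 port goes nowhere, which matches the ADR's claim that VLAN-30 membership is {camera port, trunk port} only.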
--- a/docs/adr/0017-cctv-segment-topology.svg
+++ b/docs/adr/0017-cctv-segment-topology.svg
@ -0,0 +1,178 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
+  <!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
+       Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
+       violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
+       yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
+  <defs>
+    <marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
+    </marker>
+    <marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
+    </marker>
+    <marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
+      <path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
+    </marker>
+  </defs>
+
+  <rect width="1600" height="880" fill="#fcfcfb"/>
+
+  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
+  <text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
+
+  <!-- camera -> everything else (denied) -->
+  <path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
+  <g transform="translate(560,111)">
+    <circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
+    <path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
+  </g>
+  <text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
+
+  <!-- GARAGE ENTRANCE -->
+  <rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
+  <text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
+  <rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
+  <text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
+  <text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
+  <text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
+  <text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
+  <text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
+
+  <path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
+  <text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
+
+  <!-- RACK zone: single switch -->
+  <rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
+
+  <rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
+  <text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
+  <g font-size="11.5" text-anchor="middle">
+    <rect x="80"  y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
+    <text x="124" y="470" fill="#52514e">apartment</text>
+    <text x="124" y="484" fill="#52514e">uplink</text>
+    <rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
+    <text x="222" y="470" fill="#52514e">4G router</text>
+    <text x="222" y="484" fill="#52514e">192.168.1.7</text>
+    <rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
+    <text x="320" y="470" fill="#52514e">UPS mgmt</text>
+    <rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
+    <text x="418" y="470" fill="#52514e">camera</text>
+    <text x="418" y="484" fill="#52514e">PoE ON</text>
+    <rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
+    <text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
+    <text x="516" y="470" fill="#52514e">V1 untagged</text>
+    <text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
+  </g>
+  <text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
+  <text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
+  <text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
+
+  <!-- trunk: two parallel lines to eno1 -->
+  <path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
+  <path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
+  <text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
+
+  <!-- R730 / PVE zone -->
+  <rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
+  <text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
+
+  <g font-size="12">
+    <rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
+    <text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
+
+    <rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
+    <text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
+    <text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
+
+    <rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
+    <text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
+    <text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
+  </g>
+
+  <!-- pfSense VM -->
+  <rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
+  <text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
+  <text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
+  <g font-size="12">
+    <rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
+    <rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
+    <text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
+    <rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
+    <text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
+    <rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
+  </g>
+  <path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
+  <path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
+  <path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
+  <path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
+
+  <!-- k8s VMs -->
+  <rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
+  <text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
+  <text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
+  <text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
+  <rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
+  <text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
+  <text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
+  <text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
+  <rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
+  <text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
+  <text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
+
+  <!-- HOME LAN zone -->
+  <rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
+  <text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
+  <rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
+  <text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
+  <rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
+  <text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
+  <rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
+  <text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
+  <text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
+  <rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
+  <text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
+  <text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
+
+  <path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
+  <text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
+
+  <!-- FLOWS -->
+  <path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+  <text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
+
+  <path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+  <text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
+  <text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
+
+  <path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
+  <text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
+
+  <!-- LEGEND -->
+  <g transform="translate(40,800)" font-size="12.5">
+    <rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
+    <text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
+    <rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
+    <text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
+    <rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
+    <text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
+    <rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
+    <text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
+    <line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
+    <text x="870" y="14" fill="#0b0b0b">allowed flow</text>
+    <line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
+    <text x="1030" y="14" fill="#0b0b0b">denied</text>
+    <line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
+    <text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
+    <text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
+  </g>
+</svg>
--- a/docs/adr/0017-cctv-vlan-tagging.excalidraw
+++ b/docs/adr/0017-cctv-vlan-tagging.excalidraw
--- a/docs/adr/0017-cctv-vlan-tagging.svg
+++ b/docs/adr/0017-cctv-vlan-tagging.svg
--- a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
+++ b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
@ -0,0 +1,47 @@
+# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
+
+Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
+shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
+and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
+Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
+CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
+(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
+existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
+migrates onto this and is retired.
+
+Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
+homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
+site down). With Pages, a homelab outage degrades to "content frozen until we're back",
+never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
+Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
+secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
+wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
+deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
+accident.
+
+## Considered options
+
+- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
+  Cloudflare Pages dependency — but her sites share the homelab's fate and each site
+  spends cluster resources to serve static files a free CDN serves better.
+- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
+- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
+  Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
+
+## Consequences
+
+- Registration is one entry in the `sites` map (name, Content folder, optional Entry
+  file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
+  together. Names are English, picked by Viktor (most → bridge set the precedent).
+- The internal split-horizon zone learns Valia sites from a ConfigMap the
+  `technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
+  (the previous static-CNAME approach was add-only; a retired site left a stale record).
+- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
+  the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
+  deployed.
+- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
+  per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
+  update" reports, consistent with the alert-noise-reduction posture. Revisit if a
+  silent stall actually bites.
+- If the homelab is down, content updates pause; the sites keep serving last-deployed
+  content. Accepted degradation.
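
Note on ADR-0018's deploy-on-change consequence: the guard can be sketched in a few lines of Python. This is an illustration only, not the CronJob's actual code (the ADR names rclone and wrangler but shows no implementation). `tree_digest`, `should_deploy`, `record_deploy` and the state-file layout are hypothetical names. The sketch hashes the mirrored Content folder, deploys only when the digest differs from the last deployed one, and refuses to deploy an empty folder. The last line reproduces the ADR's "~4,300/month" figure.

```python
import hashlib
from pathlib import Path
from typing import Optional


def tree_digest(root: Path) -> Optional[str]:
    """SHA-256 over relative paths and file bytes; None if the tree holds no files."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        return None  # empty mirror: the caller must not deploy over a live site
    h = hashlib.sha256()
    for p in files:
        data = p.read_bytes()
        # Path, length, and content all feed the hash so renames and edits both register.
        h.update(p.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0" + str(len(data)).encode("ascii") + b"\0")
        h.update(data)
    return h.hexdigest()


def should_deploy(root: Path, state_file: Path) -> bool:
    """True only when the folder is non-empty and differs from the last recorded deploy."""
    digest = tree_digest(root)
    if digest is None:
        return False
    previous = state_file.read_text().strip() if state_file.exists() else ""
    return digest != previous


def record_deploy(root: Path, state_file: Path) -> None:
    """Persist the digest of what was just deployed (hypothetical state file)."""
    digest = tree_digest(root)
    if digest is not None:
        state_file.write_text(digest)


# A 10-minute cadence over a 30-day month, if every run deployed:
RUNS_PER_MONTH = (60 // 10) * 24 * 30  # 4320
```

In such a design the state file would live outside the mirrored folder, so that recording a deploy does not itself change the digest.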
--- a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
+++ b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
@ -0,0 +1,97 @@
+# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
+
+`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
+inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
+outage protection — a documented "No Backup MX" decision made after ForwardEmail's
+forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
+Routing proved pass-through-only. Viktor now wants inbound mail to survive
+homelab outages **without loss** (2026-07-04): delayed delivery is fine,
+mid-outage reading is not required, and the budget is **$0** — a hard
+constraint that eliminated every managed option (see below).
+
+We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
+Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
+public IP, MX preference 20; primary untouched at 1). It accepts everything
+for the domain (catch-all — every RCPT is valid; reputation may only ever
+4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
+never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
+prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
+deliver a DSN, its only egress is the drain), and drains to the primary over
+**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
+frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
+tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
+mid-outage break-glass since headscale itself lives in the cluster); TLS via
+certbot HTTP-01 (port 80 permanently open — LE validation is
+multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
+`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
+also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
+On the primary, the drain stream (one /32) is enabled at the layers that
+actually bite — `check_client_access` permits past
+`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
+exception, and rspamd `external_relay` (score against the *original* sender
+IP) with the reject action capped to tag/fold so drained spam can never force
+the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
+reachability (recurring probe — Oracle publishes no commitment), drain
+end-to-end, and a live failover test that includes a high-spam-score message
+and a >10 MB message. Two independent adversarial reviews (2026-07-04) shaped
+this final form. Design:
+[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
+
+## Considered options
+
+- **Roller Network free Secondary MX** — v1 of this decision, killed at the
+  validation gates the same day: free tier caps at 200 relayed messages or
+  10 MB per rolling 7 days, and overage suspends the domain for 48 h
+  answering **SMTP 5xx** (permanent bounces) — since spammers target backup
+  MXes even while the primary is up, background spam alone can hold it
+  suspended, making it *worse than no backup MX*. Free accounts are also
+  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
+  the documented fallback if the OCI route sours.)
+- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
+  12–24 h, barely beating sender retry); filtering black-box; not free.
+- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
+  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
+- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
+  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
+  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
+  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
+  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
+  is the only standing free option.
+- **Harden-only** (5xx-misconfig guards + paging) — does not address
+  multi-day outages or short-retry senders; deferred as a complementary
+  track.
+
+## Consequences
+
+- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
+  Terraform + cloud-init, patched by unattended-upgrades, scraped by the
+  cluster's Prometheus (exporters on the reserved public IP, allowlisted to
+  the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
+  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
+  besides). Never a backup target itself.
+- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
+  free allowance in June 2026 and terminated over-limit instances, and
+  publishes no commitment that inbound 25 stays open. Mitigations:
+  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
+  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
+  the queue being empty outside outages (a surprise reclamation loses
+  coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
+  once.
+- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
+  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
+  the original IP via `external_relay`), and content scoring stay on — spam
+  arriving via the backup is tagged and folded to Junk, never bounced. The VM
+  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
+  not relay through us).
+- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
+  VM. Stated and accepted (6× better than the status quo).
+- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
+  off-premises; accepted (same class as Brevo holding outbound today).
+- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
+  host found dangling during design — inert today; must list `mx2` when
+  fixed) needs 1–2 more → schedule the next record purge proactively.
+- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
+  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
+  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
+  failure semantics change (a "failing" probe may now mean "delayed via mx2,
+  drains shortly" — noted in alert description).
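
Note on ADR-0019's recurring inbound-25 reachability probe: a minimal sketch of what such a check could look like, assuming plain Python sockets. The ADR does not say how the probe is implemented, so `smtp_banner_ok` and its parameters are hypothetical. The function opens a TCP connection and treats any reply starting with `220` as reachable. The timeout is an assumption and should sit comfortably above any greeting delay that postscreen introduces.

```python
import socket


def smtp_banner_ok(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if a TCP connect succeeds and the server greets with an SMTP 220 reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(512)
    except OSError:
        # Refused, unreachable, or timed out: all count as "not reachable" here.
        return False
    return banner.startswith(b"220")
```

For the check to mean anything, a call such as `smtp_banner_ok("mx2.viktorbarzin.me")` would have to run from a vantage point outside both Oracle and the homelab, since the question is whether arbitrary senders can reach port 25.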
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
 - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
-  15m policy cache, 60s persistent DB connections.
+  15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
+  hardening — decorrelates the 9 workers' recycles from PG blips). **No
+  `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
+  1:1 and saturate the session-mode pool (reverted 2026-06-10).
 - **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
+- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
+  `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
+  login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
+  burst 429'd the tail and a failed ES-module import left a blank login screen.
+- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
+  (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
+  DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
+  3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
+  blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+  + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
+  option), so request-serving is coupled to PG — this survives a short transient,
+  not a total CNPG outage.
+- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
+  (the repo's old `strategy:` key was silently inert → live ran the chart-default
+  25%/25% and dropped a server pod out of rotation on every roll). Now
+  `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
+- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
+  and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
+  the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
+  image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
+  authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
+  **and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
+  so those clients get the *real* authentik login (password + MFA + reputation —
+  no auth downgrade). The SFE can't render Identification-stage **sources**
+  (authentik limitation), so the patch also injects static social-login `<a>`
+  links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
+  required for password-less accounts (e.g. Google-only users). A Traefik
+  basic-auth fallback was rejected: it would have put a single spoofable-UA
+  password in front of `vbarzin→wizard` (passwordless root on the devvm). See
+  `stacks/authentik/patch-compat-sfe.py`.
+- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
+  MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
+  a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
+  **cannot render WebAuthn** (enrol *or* validate), so that user gets
+  `unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
+  downgrade**: (1) **social login** — sources run `default-source-authentication`
+  (UserLoginStage only, **no MFA stage**), so the SFE's `Continue with <provider>`
+  button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
+  ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
+  runtime data (not Terraform): enrol via `ak shell`
+  (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
+  user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
+  his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
 - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
 - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
  TCP setup on the forward-auth subrequest path.
@ -108,31 +154,6 @@ All new users must use an invitation link to register. The invitation-enrollment

 Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.

-### TripIt External self-signup (open enrollment, fenced)
-
-Unlike every other app, **TripIt allows open public self-signup** for people
-outside the homelab (ADR-0020 in the tripit repo; runbook
-`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
-flow (email + passkey, no password) creates the account and stamps it into the
-parentless **`TripIt External`** group. Containment is two-layered:
-
-- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
-  catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
-  denies every other `auth="required"` host.
-- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
-  External users are contained because every sensitive OIDC app already requires a
-  trusted group they do not hold — audited 2026-06-15:
-  Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
-  `Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
-  `Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
-  `default`-policy token) and is bound to **`Allow Login Users`** as part of this
-  change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
-
-**Invariants**: keep `TripIt External` parentless (never under `Allow Login
-Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
-trusted/internal user; the `tripit-enrollment` user_write "Create users group"
-setting is the keystone that tags every signup.
-
 ### OIDC Applications

 Authentik provides OIDC for 10 applications:
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@ -128,7 +128,7 @@ The agent handles all three version patterns in Terraform:

 - **Slack**: All upgrade events reported (start, success, failure, rollback)
 - **Git**: Detailed commit messages with changelog summaries, risk level, backup status
-- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
+- **DIUN Slack**: REMOVED 2026-07-02 (per-tag @channel pings in #image-updates; human cadence is the weekly upgrade report). The n8n webhook feed to the upgrade agent is unchanged.

 ## Bulk Upgrades

@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
  - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
  - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
-  - `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
+  - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
 - **Pushgateway metrics**:
  - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
  - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@ -112,17 +112,32 @@ External caller (dev box):
  @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
 ```

+## Browser binary — real Google Chrome (for proprietary codecs)
+
+The chrome-service container runs **real Google Chrome**, not the bundled
+Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
+(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
+`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
+The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
+
+**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
+so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
+`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
+decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
+worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
+the lib stripped) and Chrome-for-Testing is also codec-less — only
+`google-chrome-stable` carries them.
+
 ## Image pin

-Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
-`stacks/chrome-service/main.tf`) and the Python client
-(`playwright==1.48.0` in callers' `requirements.txt`) **must match
-minor-versions**. Bump in lockstep — Playwright protocol changes between
-minors and the client cannot connect to a mismatched server.
-
-The harvester + snapshot-server sidecar use
-`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
-minor, with Python-side bindings pre-installed.
+The Playwright base + the Python client (`playwright==1.48.0` in callers'
+`requirements.txt`) and the snapshot sidecars
+(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
+minor-versions. The chrome-service browser is now real Google Chrome (a newer
+milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
+fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
+version-tolerant — verified working against this Chrome. If a future Chrome
+milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.

 ## Storage

@ -167,7 +182,66 @@ minor, with Python-side bindings pre-installed.
  `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
  `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
  exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
-  Authentik-gated.
+  Authentik-gated. The bare host serves `vnc.html` (image symlinks
+  `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
+  to skip the Connect button. The view is **black when no browser window is
+  open** (idle) — that is normal, not a failed connection. Chrome is launched
+  with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
+  (no window manager runs, so without it Chrome opens at its profile-persisted
+  size and the rest of the framebuffer shows as a black cut-off).
+
+### noVNC fd-sweep gotcha (stuck "Connecting")
+
+If the noVNC client hangs on **"Connecting" forever then times out**, the cause
+is almost always x11vnc's fd-table sweep: containerd grants pods
+`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
+every client connection, so the RFB handshake never completes (websockify
+accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
+the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
+x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
+(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` —
+healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
+— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
+wrapper in `main.tf` (so it applies deterministically even though the image is
+`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
+as the android-emulator stack.
+
+### noVNC black after a browser-container restart (x11vnc supervision)
+
+A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
+but the view is **black**, and the novnc container logs spew
+`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
+refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
+in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
+container's Xvfb over `localhost:6099` (shared pod network). When the browser
+container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
+Xvfb vanishes and x11vnc loses its X connection and exits.
+
+`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
+background children and `wait -n`s on them, exiting non-zero if **either** dies, so
+the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
+relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
+(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
+websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
+`<defunct>` zombie — and the view black until a manual pod restart. Same
+supervision pattern as the android-emulator stack's entrypoint.)
+
+**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
+entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
+"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
+— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
+recovery** (no image change): restart just the novnc container with `kubectl exec
+-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
+and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
+
+> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
+> (`keel.sh/policy=never`, because the browser container's playwright image is
+> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
+> rebuilt `:latest` will **not** redeploy on its own. After the
+> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
+> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
+> and rollout (the novnc image is TF-managed — not in the deployment's
+> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -180,6 +254,87 @@ minor, with Python-side bindings pre-installed.
 See `stacks/chrome-service/README.md` for the recipe (label namespace,
 inject `CHROME_CDP_URL`, vendor `stealth.js`).

+## Driving from OUTSIDE the cluster (`homelab browser`)
+
+Agents on the devvm reach this browser through the **`homelab browser`** CLI
+(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
+`connect_over_cdp` recipe. It is the **escalation path, not the default**:
+agents default to the Playwright MCP / headless browser for all routine
+automation, and reach for `homelab browser` ONLY when headless is blocked — a
+site loads but a gated action (submit/login) silently fails or hangs, the
+signature of headless / anti-bot detection. (Same tiered rule lives in
+`~/code/CLAUDE.md` and `homelab browser --help`.)
+
+```text
+devvm:  homelab browser run flow.js
+          │  kubectl port-forward svc/chrome-service :9222  (random local port)
+          ▼
+   http://127.0.0.1:<port>  ──►  chrome-service pod :9222 (CDP)
+          │  assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
+          │  node + playwright-core@1.48.2 → connectOverCDP
+          │  context.addInitScript(stealth.js)   ← same vendored file as in-cluster
+          │  run the user's Playwright script with page/context/browser in scope
+          └─ port-forward always torn down (success or error)
+```
+
+Key facts:
+
+- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
+  API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
+  label — unlike in-cluster callers.
+- **Client pinned to the image minor.** The node client is
+  `playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
+  lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
+  server image bumps (same rule as the in-cluster Python clients — see "Image
+  pin" above).
+- **Default context is a fresh incognito one** (closed on exit), safe for the
+  shared browser; `--shared-context` reuses the warmed persistent profile.
+- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
+  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
+  CLI's stealth never diverges from the in-cluster callers'.
+
+## Multi-user access (sharing the browser)
+
+There is ONE chrome-service browser with ONE persistent profile, warmed with
+**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
+drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
+reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
+sessions. Access is gated accordingly, per user.
+
+**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
+Viktor's browser for form-filling + captcha solving, rather than getting an
+isolated instance. The session-exposure trade-off above was explicitly accepted.
+
+Two independent grants make up "browser access" for a user:
+
+1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
+   `admin-services-restriction` policy: the `CHROME_ALLOWED` set
+   (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
+   username OR email. Add the user there. No kubeconfig/RBAC needed.
+2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
+   in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
+   kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
+   session). Provided by a per-user **ServiceAccount** with a long-lived token
+   (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
+   this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
+   resolve the Service and doesn't regress the user's normal read). The devvm
+   provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
+   reads that token and installs it as the user's DEFAULT kubeconfig context
+   (`<user>-browser@homelab`), keeping their personal OIDC login as the
+   `oidc@homelab` named context. The SA's existence is the source of truth for who
+   gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
+
+**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
+`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
+the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
+a token by deleting its `<user>-browser-token` Secret).
+
+Because the SA is the user's DEFAULT kubectl credential, other per-namespace
+port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
+grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
+agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
+in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
+
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
 | Visibility | Packages | Pull mechanism |
 |------------|----------|----------------|
 | **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
-| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
+| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |

 Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
 kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -115,8 +115,66 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
 instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
 fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
 pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
-k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
-audiobook-search, council-complaints) now also land on ghcr.
+k8s-portal, apple-health-data, audiblez-web, insta2spotify,
+audiobook-search) now also land on ghcr.
+
+**plotting-book** is a special case (a GitHub-first repo owned by Anca,
+ADR-0003): the build runs in *her* GitHub repo
+(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
+`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
+not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
+PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
+`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
+read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
+2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
+unchanged. Flow:
+
+```text
+ DEVELOP ───────────────────────────────────────────────────────────────────────
+   Anca (Codex / t3 web agent)
+        │  git push → main
+        ▼
+ ┌──────────────────────────────────────────────────────────────┐
+ │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│  ← canonical
+ │   .github/workflows/build-and-deploy.yml     on: push → main  │
+ └───────────────────────────┬──────────────────────────────────┘
+                             │  GitHub Actions runner (off-infra build · ADR-0002)
+        ┌────────────────────┴─────────────────────────────────┐
+        ▼                                                        ▼
+ ┌─────────────────────────────────────────────┐      ╔═══════════════════════════════════════╗
+ │ build job                                   │ push ║  GHCR · PRIVATE package                ║
+ │  • svu next --always → tag vX.Y.Z (→ repo)  │═════▶║  ghcr.io/passionprojectsanca/         ║
+ │  • buildx linux/amd64, provenance:false     │ tags ║       book-plotter  :vX.Y.Z  :latest  ║
+ │  • login ghcr (GITHUB_TOKEN, packages:write)│      ╚═══════════════════╤═══════════════════╝
+ │  • delete-package-versions (keep newest 10) │                          │
+ └───────────────────────┬─────────────────────┘                          │ pull (private,
+                         ▼  deploy job  [gate: repo var DEPLOY_ENABLED ≠ "false"]  via secret)
+   POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME}         │
+                         ▼                                                         │
+ ┌─────────────────────────────────────────────────────────────┐                 │
+ │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual)  │                 │
+ │   kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │                 │
+ │   kubectl rollout status                                     │                 │
+ └───────────────────────────┬─────────────────────────────────┘                 │
+                             ▼                                                     │
+ ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════      │
+ ┌─────────────────────────────────────────────────────────────┐                 │
+ │ Deployment plotting-book  (Recreate · image = ignore_changes)│                 │
+ │   imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
+ │   Pod → Express :3001  +  SQLite on PVC (proxmox-lvm)        │
+ └─────────────────────────────────────────────────────────────┘
+   guards / supporting:
+     • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED   (admission)
+     • Keel policy=patch @1h → watches GHCR via ghcr-credentials          (backstop)
+     • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
+
+ ═══════════════ Serving path (unchanged) ══════════════════════════════════
+   Browser ─▶ plotting-book.viktorbarzin.me  (non-proxied DNS → Traefik .203)
+           ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
+```
+
+Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
+`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
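
The deploy-job step in the flow above (POST to Woodpecker repo 43 with `IMAGE_TAG` and `IMAGE_NAME`) can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the `https` scheme, the bearer-token header, the `main` branch default, and the `{"branch", "variables"}` payload shape are assumptions based on Woodpecker's manual-pipeline API, and `build_deploy_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Host taken from the flow diagram; the https scheme is an assumption.
WOODPECKER_URL = "https://ci.viktorbarzin.me"


def build_deploy_request(repo_id, image_name, image_tag, token, branch="main"):
    """Build (but do not send) the request that triggers a manual deploy pipeline."""
    # Assumed payload shape: a branch plus a flat map of pipeline variables.
    payload = {
        "branch": branch,
        "variables": {"IMAGE_NAME": image_name, "IMAGE_TAG": image_tag},
    }
    return urllib.request.Request(
        f"{WOODPECKER_URL}/api/repos/{repo_id}/pipelines",  # numeric repo id
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the builder is kept separate so the payload can be inspected without touching the CI server.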

 ### Infra-owned images (issues #29 / #30)

@ -130,6 +188,8 @@ reconciled — the workflows were added to the GitHub lineage via PR):
 | android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
 | infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
 | infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
+| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
+| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |

 **`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
 `drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
@ -163,9 +223,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
 | Pipeline | File | Purpose |
 |----------|------|---------|
 | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
-| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
 | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
 | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -176,6 +236,38 @@ Woodpecker is **deploy + cluster-touching steps only**:

 **No build/test pipeline exists on any repo.** Do not (re)introduce one.

+### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
+
+infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
+and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
+push**. Left unguarded, two `terragrunt apply` runs race each other for the
+per-stack PG state lock — historically the #1 source of `Error acquiring the
+state lock` failures and push-supersede "killed" runs.
+
+- **Forge guard** (first command in the `apply` step): the push-apply runs **only
+  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
+  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
+  skip. Fail-open (unknown forge still applies). The mirror keeps running the
+  **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
+  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
+  have killed them.)
+- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
+  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
+  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
+  locked`) — the PG case was previously miscounted as a hard failure.
+- **Transient retry** (bounded, 3 attempts): only provider-registry download
+  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
+  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
+  are NOT retried — they fail fast.
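
The three rules above can be sketched in Python to make the decision order explicit. This is an illustration under stated assumptions, not the pipeline's shell code: the function names are hypothetical, the quoted message fragments come from the bullets, and the Vault-5xx regex is a guess at what such an error line looks like.

```python
import re

# Message fragments quoted in the bullets above.
LOCK_PATTERNS = ("is locked by", "Error acquiring the state lock", "already locked")
TRANSIENT_PATTERNS = ("Failed to install provider", "Client.Timeout")
# Assumption: a Vault 5xx shows up as "vault ... 5NN" on one line.
VAULT_5XX = re.compile(r"vault.*\b5\d\d\b", re.IGNORECASE)


def should_apply(repo_url, forge_url=""):
    """Forge guard: skip only when the forge is clearly GitHub (fail-open)."""
    return "github.com" not in f"{repo_url} {forge_url}"


def classify(output):
    """Map failed-apply output to 'skip', 'retry', or 'fail'."""
    if any(p in output for p in LOCK_PATTERNS):
        return "skip"  # lock contention on either tier: skipped, not failed
    if any(p in output for p in TRANSIENT_PATTERNS) or VAULT_5XX.search(output):
        return "retry"  # transient download timeout or Vault 5xx
    return "fail"  # config errors, helm atomic timeouts: fail fast


def apply_with_retry(run_apply, max_attempts=3):
    """run_apply() returns (succeeded, output); retries only transient failures."""
    for _ in range(max_attempts):
        succeeded, output = run_apply()
        if succeeded:
            return "ok"
        verdict = classify(output)
        if verdict != "retry":
            return verdict
    return "fail"  # still transient after the bounded attempts
```

Checking the lock patterns before the transient ones means a locked stack is never retried, which matches the "SKIPPED, not failed" rule.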
+
+A pre-apply off-infra validate gate was evaluated and rejected: `terraform
+validate` runs without state but catches ~0 of the observed failures (they are
+provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
+lock contention — all invisible to static validate), and `plan` cannot run
+off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
+phase without mutating on config errors, so a separate in-pipeline plan-gate was
+also dropped as redundant.
+
 ### Woodpecker API

 Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
@ -203,7 +295,9 @@ The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
 forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
 1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
 (changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
-Slack audit step. Operational facts (2026-06-10):
+Slack audit step. **Slack policy (2026-07-02): every infra pipeline posts only
+on FAILURE** (plus the non-admin audit post and drift/error findings) — routine
+successful runs are silent. Operational facts (2026-06-10):

 - **Webhook URL is the IN-CLUSTER service**:
  `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
@ -285,7 +379,8 @@ steps:
  notify:
    image: plugins/slack
    when:
-      status: [success, failure]
+      # Failure-only (2026-07-02 policy): CI notifies about failed runs only.
+      status: [failure]
 ```

 ### CI/CD secrets sync
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons

 Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).

-**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
+**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
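
The declarative reconcile for Valia sites amounts to a small diff between the desired map and the CNAMEs currently in the internal zone. The sketch below is illustrative only: `plan_cname_changes` is a hypothetical name, the real sync script is not shown in this document, and the record targets in the usage example are made up.

```python
PAGES_SUFFIX = ".pages.dev"


def plan_cname_changes(desired, existing):
    """Return (upserts, deletes) for the internal zone.

    desired:  name -> target, as in the valia-sites-dns ConfigMap
    existing: name -> target, CNAMEs currently present internally
    """
    # Ensure/update: anything missing or pointing at the wrong target.
    upserts = {
        name: target
        for name, target in desired.items()
        if existing.get(name) != target
    }
    # Delete: only names that left the map AND target *.pages.dev,
    # so unrelated CNAMEs are never touched (suffix-scoped).
    deletes = sorted(
        name
        for name, target in existing.items()
        if name not in desired and target.rstrip(".").endswith(PAGES_SUFFIX)
    )
    return upserts, deletes
```

For example, with `desired = {"bridge": "bridge.pages.dev"}` and an existing zone holding a retired `old-site → old-site.pages.dev` plus an unrelated `mail → mail.example.net`, only `old-site` lands in the delete list.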

 ## NodeLocal DNSCache

@ -368,6 +368,7 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
 | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
 | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
 | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
+| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |

 ### Proxied vs Non-Proxied

@ -513,6 +514,7 @@ For external `.viktorbarzin.me` records:
 1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
 2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
 3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
+4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)

 ## Incident History

--- a/docs/architecture/mailserver.md
+++ b/docs/architecture/mailserver.md
@ -161,6 +161,17 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
  DB: MySQL (mysql.dbaas.svc.cluster.local)
 ```

+### Paperless ingest mailbox (docs@)
+
+`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
+`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
+paperless-ngx polls over IMAP; family members forward document emails to it
+and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
+(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
+mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
+discards mail from non-allowlisted senders at delivery. Full flow, sender map,
+and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
+
 ## DNS Records

 All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -300,6 +311,21 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External

 ## Troubleshooting

+### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
+
+Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
+`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
+`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
+Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
+came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
+`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
+tempfails every message (inbound AND submission); senders retry so nothing is
+lost, and the roundtrip probe alerts within the hour.
+Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
+process spins again (it did once), `kubectl -n mailserver delete pod` for a
+full re-init — that healed it. Root cause not pinned down (one-off bad init;
+postsrsd 1.10).
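
Because supervisor reports postsrsd as `RUNNING` even when it never bound its ports, a check has to look at the sockets rather than the process state. A minimal sketch, assuming it runs inside the mailserver container (so `localhost` reaches postsrsd); the function names are hypothetical:

```python
import socket


def is_listening(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def postsrsd_bound(host="localhost", ports=(10001, 10002)):
    """postsrsd is healthy only if both SRS lookup ports accept connections."""
    return all(is_listening(host, port) for port in ports)
```

If `postsrsd_bound()` returns `False` while supervisor says `RUNNING`, that matches the spin signature described above.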
+
 ### Inbound mail not arriving
 1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
 2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia

 **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.

-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.

 **Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.

@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
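
The probe's failure condition reduces to one regex test on the `Location` header. The sketch below mirrors that logic in Python for reasoning about new targets; it is not the blackbox-exporter implementation, and `walled_off` is a hypothetical name. The regex is the one quoted in the Mechanism bullet.

```python
import re

# Regex from the http_no_authentik_redirect module description above.
AUTHENTIK_REDIRECT = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def walled_off(headers):
    """True iff the response redirects to Authentik (the carve-out was clobbered).

    headers: mapping of response header name -> value, read without
    following redirects.
    """
    location = headers.get("Location", "")
    return bool(AUTHENTIK_REDIRECT.search(location))
```

A legitimate 302 to a non-Authentik URL, or a response with no `Location` header at all, returns `False`, which is why short-link redirects and 404 carve-outs stay green.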

+#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
+
+Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.
+
+| Alert | Expr (abridged) | For | Severity |
+|---|---|---|---|
+| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
+| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
+
+The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup