portal-tts: DRAFT stack — Piper TTS (CPU, always-on) for portal-assistant

Draft (NOT applied) of a new infra stack deploying Piper as an in-cluster text-to-speech service for the portal-assistant Gateway (portal-assistant issue #3, ADR-0003). Bulgarian (bg_BG-dimitar-medium) + English (en_US-lessac-medium), voice chosen per request.

Why this shape:
- CPU-only, always-on (replicas=1, no GPU): Piper runs in real time on CPU, so this keeps TTS off the OOM-prone shared T4 that the two GPU siblings (tts/chatterbox, portal-stt) already contend for. Bulgarian isn't on chatterbox anyway (its langs exclude bg).
- OpenAI-compatible image (openedai-speech-min, /v1/audio/speech) so the Gateway gets raw audio bytes per its tts.synthesize(text, lang) -> bytes contract and treats Piper + the future edge-tts fallback identically, the same shape chatterbox already uses.
- Voices on an NFS-SSD PVC, downloaded from rhasspy/piper-voices by an init container on first boot; a ConfigMap maps request voice bg/en -> .onnx model.
- ClusterIP only (audio stays on the LAN; the Gateway is the only externally exposed component, ADR-0001).

Mirrors the just-written portal-stt sibling stack's conventions. terraform fmt clean; terraform validate passes (only the codebase-wide kubernetes_namespace deprecation warnings).

HITL: operator reviews + applies via GitOps; do not apply from a worktree. Open items flagged in main.tf (image choice on a frozen upstream; resource sizing to confirm with krr).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:59:42 +00:00
425 changed files with 11535 additions and 43696 deletions
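
The commit message describes a Gateway contract of tts.synthesize(text, lang) -> bytes backed by an OpenAI-compatible /v1/audio/speech endpoint, with request voices bg/en. A minimal sketch of what such a client could look like follows. The Service URL, the "tts-1" model name, and the "mp3" response format are assumptions for illustration and are not taken from the stack itself; build_speech_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: placeholder in-cluster Service DNS name and port (not from the stack).
TTS_BASE_URL = "http://portal-tts.portal-tts.svc.cluster.local:8000"

# Request voice per language, following the bg/en mapping named in the commit message.
VOICE_BY_LANG = {"bg": "bg", "en": "en"}


def build_speech_request(text: str, lang: str,
                         base_url: str = TTS_BASE_URL) -> urllib.request.Request:
    """Build the POST for an OpenAI-compatible /v1/audio/speech endpoint."""
    if lang not in VOICE_BY_LANG:
        raise ValueError(f"unsupported language: {lang!r}")
    payload = {
        "model": "tts-1",          # assumption: model name the image expects
        "input": text,
        "voice": VOICE_BY_LANG[lang],
        "response_format": "mp3",  # assumption: any format the image supports
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, lang: str, timeout: float = 30.0) -> bytes:
    """Return raw audio bytes, mirroring tts.synthesize(text, lang) -> bytes."""
    request = build_speech_request(text, lang)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping the request construction separate from the network call means a fallback backend with the same endpoint shape would only need a different base_url.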
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
--- a/.claude/home-assistant-sofia.py
+++ b/.claude/home-assistant-sofia.py
@@ -7,7 +7,6 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
-import subprocess
 import sys
 from urllib.parse import urljoin

@@ -18,29 +17,13 @@ except ImportError:
    print("  pip install requests")
    sys.exit(1)

+# Configuration from environment variables (ha-sofia specific)
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")

-def _token_from_homelab():
-    """Resolve the token via the homelab CLI when the env var isn't set, so the
-    script works from any directory / unprovisioned session (see ADR-0012)."""
-    try:
-        out = subprocess.run(
-            ["homelab", "ha", "token", "--instance", "sofia"],
-            capture_output=True, text=True, timeout=30)
-        if out.returncode == 0 and out.stdout.strip():
-            return out.stdout.strip()
-    except Exception:
-        pass
-    return None
-
-
-# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
-# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
-
-if not HA_TOKEN:
-    print("ERROR: no ha-sofia API token available.")
-    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
+if not HA_URL or not HA_TOKEN:
+    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
+    print("These should be set when activating the Claude venv (~/.venvs/claude)")
    sys.exit(1)

 HEADERS = {
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -166,8 +166,7 @@ Pinned via Terraform in `stacks/authentik/`:

 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
-| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |

@@ -178,13 +177,6 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
-
-## WebAuthn / Passkeys (2026-06-20)
-
-- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
-- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
-- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
-- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
 - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.1.0
-date: 2026-06-24
+version: 2.0.0
+date: 2026-02-07
 ---

 # Home Assistant Control
@@ -44,12 +44,6 @@ There are **two** Home Assistant instances:
 - Environment variables for each instance:
  - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
  - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
-  - If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
-
-## homelab CLI (preferred — works from any directory)
-- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
-- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
-- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.

 ## API Control

@@ -395,27 +389,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
+- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS
-- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
-- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/`
+- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/` (requires `sudo` for file access)
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

-### Dashboards (redesigned 2026-06-24)
-**Glossary** (HA terms — keep distinct):
-- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
-- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
-- **Card** = a widget inside a view.
-
-- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
-  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
-  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
-- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
-- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
-
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -437,15 +418,10 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
-Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
-- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
-- `sensor.classic_performance_remaining_range`: Range km
-- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
-- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
-- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
-- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
-- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
+#### 3. Cowboy E-Bike
+- `sensor.bike_state_of_charge`: Battery %
+- `sensor.bike_total_distance`: Total km
+- `sensor.bike_total_co2_saved`: CO2 saved (grams)

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -464,17 +440,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components (HACS integrations)
-- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
-- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
-
-### HACS frontend cards (plugins)
-- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
+### Custom Components
+- **cowboy**: Cowboy e-bike integration (HACS)
+- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
-- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
-- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -489,8 +460,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Platform (HAOS — ignore any legacy `docker run` snippet)
-ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
+### Docker Setup
+```bash
+docker run -d --name homeassistant --privileged \
+  -e TZ=Europe/London \
+  -v /home/pi/docker/homeAssistant:/config \
+  -v /run/dbus:/run/dbus:ro \
+  --network=host --restart=unless-stopped \
+  homeassistant/home-assistant:2025.9
+```

 ### SSH Access
 ```bash
--- a/.claude/workflows/memory-overcommit-node-removal.workflow.js
+++ b/.claude/workflows/memory-overcommit-node-removal.workflow.js
@@ -1,203 +0,0 @@
-export const meta = {
-  name: 'memory-overcommit-node-removal',
-  description: 'Read-only: assess PVE host + k8s memory overcommit, right-size deployment REQUESTS (scheduling) and LIMITS (OOM) separately from 30d usage, then test whether one worker node can be removed while preserving N-1 by BOTH a physical-usage and a scheduling-request model. Emits a gated plan.',
-  phases: [
-    { title: 'Gather' },
-    { title: 'Model' },
-    { title: 'Verify' },
-  ],
-}
-
-// ---------- confirmed read-only access paths ----------
-const SSH = "ssh -o BatchMode=yes -o ConnectTimeout=8 root@192.168.1.127";
-const PROM = "https://prometheus-query.viktorbarzin.lan/api/v1/query";
-const G = (mib) => (mib == null ? "?" : (mib / 1024).toFixed(1) + "Gi");
-
-// ---------- schema helpers ----------
-const num = { type: "number" }, str = { type: "string" }, bool = { type: "boolean" };
-const arr = (items) => ({ type: "array", items });
-const obj = (props) => ({ type: "object", additionalProperties: false, required: Object.keys(props), properties: props });
-
-const HOST = obj({
-  host_total_mib: num, host_used_mib: num, host_free_mib: num, host_available_mib: num,
-  swap_total_mib: num, swap_used_mib: num, ksm_saved_mib: num,
-  vms: arr(obj({ vmid: num, name: str, configured_mib: num, balloon_mib: num, rss_mib: num, is_k8s_node: bool })),
-  sum_vm_configured_mib: num, sum_vm_rss_mib: num, notes: str,
-});
-
-const K8S = obj({
-  nodes: arr(obj({
-    name: str, role: str, is_gpu: bool, is_control_plane: bool, gpu_tainted: bool, schedulable: bool,
-    capacity_mib: num, allocatable_mib: num, requests_mib: num, ds_requests_mib: num, limits_mib: num, usage_now_mib: num, peak_30d_mib: num, pod_count: num,
-  })),
-  cluster_allocatable_mib: num, cluster_requests_mib: num, cluster_usage_now_mib: num, cluster_peak_30d_mib: num, notes: str,
-});
-
-// NOTE the v2 split: requests are sized for SCHEDULING (cover normal load, can shrink below current),
-// limits are sized for OOM SAFETY (cover peak). They are DIFFERENT knobs and must not be conflated.
-const USAGE = obj({
-  totals: obj({
-    sum_current_requests_mib: num, sum_recommended_requests_mib: num, net_request_reclaim_mib: num,
-    reschedulable_request_recommended_mib: num, ds_request_recommended_per_node_mib: num, gpu_request_recommended_mib: num,
-    largest_single_request_mib: num, count_request_shrink: num, count_limit_raise_oom: num,
-  }),
-  request_shrinks: arr(obj({ namespace: str, name: str, kind: str, replicas: num, current_request_mib: num, p95_30d_mib: num, recommended_request_mib: num, delta_mib: num, rationale: str })),
-  limit_raises_oom: arr(obj({ namespace: str, name: str, container: str, current_limit_mib: num, peak_max_30d_mib: num, recommended_limit_mib: num, risk: str })),
-  spiky_periodic: arr(obj({ namespace: str, name: str, note: str })),
-  method_notes: str,
-});
-
-const TOPO = obj({
-  nodes: arr(obj({ name: str, sticky_pods: arr(str), local_pv_count: num, volumeattachments: num, cnpg_primary: bool, gpu_workloads: bool, evac_difficulty: str, evac_notes: str })),
-  spofs: arr(obj({ namespace: str, name: str, replicas: num, has_pdb: bool, issue: str })),
-  antiaffinity_risks: arr(str),
-  csi_pinning_note: str,
-  priority_classes_note: str,
-  notes: str,
-});
-
-const VERDICT = obj({ refuted: bool, confidence: str, reasoning: str, corrections: arr(str) });
-
-// ---------- prompts ----------
-const HOST_PROMPT = `Read-only PVE host memory audit. SSH (key-based): ${SSH} '<cmd>'  (host 'pve', the Proxmox r730 at 192.168.1.127). Read-only ONLY; NEVER a state-changing qm/pvesh/ha-manager command.
- 'free -m' -> host_total/used/free/available_mib + swap_total/swap_used_mib.
- KSM: cat /sys/kernel/mm/ksm/pages_sharing ; ksm_saved_mib = pages_sharing*4096/1048576.
- 'qm list'; for each running VM 'qm config <vmid>' -> memory (configured_mib), balloon (balloon_mib; if balloon==memory or balloon==0 ballooning is effectively OFF -> host RSS pins near configured = the headroom RATCHET).
- Per-VM host RSS: read /var/run/qemu-server/<vmid>.pid then 'ps -o rss= -p <pid>' (KiB->MiB).
- is_k8s_node = VMs named k8s-*.
-Return per-VM rows + sum_vm_configured_mib + sum_vm_rss_mib over ALL RUNNING VMs. notes: overcommit ratio, swap pressure, ballooning state.`;
-
-const K8S_PROMPT = `Read-only Kubernetes node-capacity audit. kubectl read access confirmed. For every node (k8s-master + k8s-node1..6):
- capacity_mib & allocatable_mib from 'kubectl get node <n> -o json' (Ki->MiB).
- is_control_plane (node-role.kubernetes.io/control-plane), is_gpu (k8s-node1; nvidia.com/gpu in capacity), gpu_tainted (a NoSchedule taint general pods would NOT tolerate), schedulable.
- requests_mib, limits_mib, ds_requests_mib (DaemonSet-owned pods only), usage_now_mib, pod_count.
-  Prefer Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
-    sum by (node)(kube_pod_container_resource_requests{resource="memory"})    [these metrics HAVE a node label]
-    usage_now: cAdvisor container_memory_working_set_bytes has NO node label - join: sum by (node)(container_memory_working_set_bytes{container!="",container!="POD"} * on(namespace,pod) group_left(node) kube_pod_info)
- peak_30d_mib per node: max_over_time of that joined per-node sum over [30d:5m] (best effort; if the join is flaky leave 0 and rely on cluster figure).
-ALSO return cluster-wide:
- cluster_allocatable_mib, cluster_requests_mib, cluster_usage_now_mib.
- cluster_peak_30d_mib = max_over_time(sum(container_memory_working_set_bytes{container!="",container!="POD"})[30d:5m]) /1024/1024  (this is the PHYSICAL reliability bedrock - the highest the whole cluster ever simultaneously used in 30d).
-notes: host-vs-k8s overcommit contrast (requests vs allocatable vs actual usage).`;
-
-const USAGE_PROMPT = `Read-only memory RIGHT-SIZING from 30-day usage. CRITICAL: requests and limits are DIFFERENT knobs - size them separately. Do NOT set requests to peak (that is what a flawed earlier run did; it manufactured a false capacity shortfall).
- REQUEST (scheduling reservation, drives bin-packing & node-removal feasibility): size to cover NORMAL operation = recommended_request_mib = ceil(max(p95_30d * 1.15, 64)). This SHRINKS the many over-provisioned requests toward real usage. requests should sit BELOW limits (Burstable). Be moderately conservative for stateful/db/critical infra (mysql, postgres/CNPG, redis, vault, prometheus, mailserver): use p99 instead of p95.
- LIMIT (OOM ceiling): recommended_limit_mib = ceil(peak_max_30d * 1.25). FLAG any container whose peak_max_30d >= 95% of current limit as an OOM risk (limit_raises_oom) - these are real reliability bugs to fix REGARDLESS of node removal.
-
-Sources: kubectl (current requests/limits/replicas for Deployments/StatefulSets/DaemonSets, all namespaces); Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=<q>'):
-  p95: quantile_over_time(0.95, container_memory_working_set_bytes{container!="",container!="POD"}[30d])
-  p99: quantile_over_time(0.99, ...[30d])
-  peak: max_over_time(...[30d])
-  Aggregate by (namespace,pod,container), map pod->workload (strip hash suffixes), take MAX across a workload's pods as per-replica value.
-
-Splits for the N-1 model (use the REQUEST recommendation; multiply per-replica by replicas):
- reschedulable_request_recommended_mib = SUM recommended_request of Deployment+StatefulSet pods that are NON-GPU and schedulable on general workers (everything that must reschedule if a worker is removed).
- ds_request_recommended_per_node_mib = SUM recommended_request of DaemonSet containers (one set per node).
- gpu_request_recommended_mib = SUM recommended_request of workloads pinned to GPU node k8s-node1 (REAL value; do not inflate).
- largest_single_request_mib = largest single recommended per-replica request among reschedulable.
-Return totals (sum_current_requests_mib, sum_recommended_requests_mib, net_request_reclaim_mib = sum of POSITIVE request deltas i.e. shrinks, the splits, count_request_shrink, count_limit_raise_oom), request_shrinks (top ~30 by delta), limit_raises_oom (every OOM-tight container), spiky_periodic (mailserver/immich-ml/backups/dumps/postiz). NEVER mutate.`;
-
-const TOPO_PROMPT = `Read-only reliability-topology audit: which worker is safest to remove? Candidates: k8s-node2..node6 (NOT master, NOT GPU node1). For each worker (k8s-node1..6): sticky_pods (StatefulSet members; pods with local/hostPath PVCs; single-replica critical), local_pv_count, volumeattachments, cnpg_primary (CNPG 'pg-cluster' PRIMARY here? check pod role labels), gpu_workloads, evac_difficulty (easy|medium|hard)+evac_notes.
-Cluster-wide: spofs (1 replica AND no PDB); antiaffinity_risks (hard podAntiAffinity / topologySpread DoNotSchedule that becomes UNSATISFIABLE at one fewer worker - check replica counts vs surviving distinct hosts); csi_pinning_note (do Proxmox-CSI PVs pin to a node, or share one host-level topology so they reattach anywhere? check volumeHandle / topology zone/region on the PVs - this decides whether removal STRANDS data); priority_classes_note. NEVER mutate.`;
-
-// ============================================================
-phase('Gather');
-log('Gather (read-only): PVE host memory, k8s capacity + cluster 30d peak, request/limit right-sizing, reliability topology');
-const [host, k8s, usage, topo] = await parallel([
-  () => agent(HOST_PROMPT, { label: 'gather:pve-host', phase: 'Gather', schema: HOST }),
-  () => agent(K8S_PROMPT, { label: 'gather:k8s-capacity', phase: 'Gather', schema: K8S }),
-  () => agent(USAGE_PROMPT, { label: 'gather:rightsize', phase: 'Gather', schema: USAGE }),
-  () => agent(TOPO_PROMPT, { label: 'gather:reliability', phase: 'Gather', schema: TOPO }),
-]);
-if (!k8s || !usage) return { error: 'Critical gather agent failed (k8s/usage).', host, k8s, usage, topo };
-
-// ============================================================
-phase('Model');
-const T = usage.totals;
-const workers = k8s.nodes.filter((n) => !n.is_control_plane);
-const generalPool = workers.filter((n) => !n.gpu_tainted);            // general pods can land here (incl. GPU node if not tainted)
-const candidates = workers.filter((n) => !n.is_gpu && !n.is_control_plane); // node2..node6
-const clusterPeak = k8s.cluster_peak_30d_mib || 0;
-
-const freeGeneral = (n) => n.allocatable_mib - (T.ds_request_recommended_per_node_mib || 0) - (n.is_gpu ? (T.gpu_request_recommended_mib || 0) : 0);
-
-function evalRemove(removeName) {
-  const pool = generalPool.filter((n) => n.name !== removeName);
-  // --- scheduling N-1 (realistic requests): fit reschedulable load even if the largest survivor then fails ---
-  const frees = pool.map(freeGeneral);
-  const schedCap = frees.reduce((a, b) => a + b, 0) - (frees.length ? Math.max(...frees) : 0);
-  const schedNeed = T.reschedulable_request_recommended_mib;
-  const schedMargin = schedCap - schedNeed;
-  // --- physical N-1 (actual peak usage): cluster 30d peak must fit on survivors after losing the largest too ---
-  const survAlloc = pool.map((n) => n.allocatable_mib);
-  const physCap = survAlloc.reduce((a, b) => a + b, 0) - (survAlloc.length ? Math.max(...survAlloc) : 0);
-  const physMargin = physCap - clusterPeak;
-  const t = topo && topo.nodes ? topo.nodes.find((n) => n.name === removeName) : null;
-  return {
-    removeName, pool: pool.map((n) => n.name),
-    sched_capacityN1_mib: Math.round(schedCap), sched_need_mib: Math.round(schedNeed), sched_margin_mib: Math.round(schedMargin), sched_pass: schedMargin >= 0,
-    phys_capacityN1_mib: Math.round(physCap), cluster_peak_mib: Math.round(clusterPeak), phys_margin_mib: Math.round(physMargin), phys_pass: physMargin >= 0,
-    pass: schedMargin >= 0 && physMargin >= 0,
-    host_freed_mib: hostFreedFor(removeName),
-    evac_difficulty: t ? t.evac_difficulty : 'unknown', cnpg_primary: t ? t.cnpg_primary : false, sticky_pods: t ? t.sticky_pods : [],
-  };
-}
-function hostFreedFor(nodeName) {
-  if (host && host.vms) {
-    const s = nodeName.replace('k8s-', '');
-    const vm = host.vms.find((v) => v.name === nodeName || (v.name && v.name.includes(s)));
-    if (vm) return vm.configured_mib;
-  }
-  const n = k8s.nodes.find((x) => x.name === nodeName);
-  return n ? n.capacity_mib : 0;
-}
-
-const evalCandidates = candidates.map((c) => evalRemove(c.name));
-const diffRank = { easy: 0, medium: 1, hard: 2, unknown: 3 };
-const passing = evalCandidates.filter((c) => c.pass && !c.cnpg_primary)
-  .sort((a, b) => (diffRank[a.evac_difficulty] - diffRank[b.evac_difficulty]) || (b.phys_margin_mib - a.phys_margin_mib));
-const best = passing[0] || null;
-
-const hostOvercommit = host ? { sum_vm_configured_mib: host.sum_vm_configured_mib, host_total_mib: host.host_total_mib, ratio: +(host.sum_vm_configured_mib / host.host_total_mib).toFixed(3), free_mib: host.host_free_mib, available_mib: host.host_available_mib, swap_used_mib: host.swap_used_mib, swap_total_mib: host.swap_total_mib, ksm_saved_mib: host.ksm_saved_mib } : null;
-const k8sOvercommit = { cluster_requests_mib: k8s.cluster_requests_mib, cluster_allocatable_mib: k8s.cluster_allocatable_mib, cluster_usage_now_mib: k8s.cluster_usage_now_mib, cluster_peak_30d_mib: clusterPeak, request_ratio: +(k8s.cluster_requests_mib / k8s.cluster_allocatable_mib).toFixed(3), usage_ratio: +(clusterPeak / k8s.cluster_allocatable_mib).toFixed(3) };
-
-log(`Host overcommit ${hostOvercommit ? hostOvercommit.ratio : '?'}x (${G(hostOvercommit && hostOvercommit.free_mib)} free, swap ${G(hostOvercommit && hostOvercommit.swap_used_mib)}/${G(hostOvercommit && hostOvercommit.swap_total_mib)})`);
-log(`K8s: requests ${G(k8s.cluster_requests_mib)} / 30d-peak-usage ${G(clusterPeak)} / allocatable ${G(k8s.cluster_allocatable_mib)} -> requests are ${(k8s.cluster_requests_mib / clusterPeak).toFixed(2)}x real peak`);
-log(`Request right-sizing: ${G(T.net_request_reclaim_mib)} of over-provisioned requests can be trimmed (${T.count_request_shrink} workloads); ${T.count_limit_raise_oom} workloads are OOM-tight on LIMITS (raise regardless).`);
-for (const c of evalCandidates) log(`  remove ${c.removeName}: phys-N1 ${c.phys_pass ? 'PASS' : 'FAIL'} (${G(c.phys_margin_mib)}) | sched-N1 ${c.sched_pass ? 'PASS' : 'FAIL'} (${G(c.sched_margin_mib)}) | frees ~${G(c.host_freed_mib)} host | evac ${c.evac_difficulty}${c.cnpg_primary ? ' CNPG-PRIMARY' : ''}`);
-log(best ? `Best candidate: ${best.removeName} (phys margin ${G(best.phys_margin_mib)}, frees ~${G(best.host_freed_mib)})` : 'No candidate passes both N-1 tests.');
-
-// ============================================================
-phase('Verify');
-const headline = best
-  ? `${best.removeName} can be removed while preserving N-1: cluster 30d peak usage ${G(clusterPeak)} fits on survivors-minus-one (${G(best.phys_capacityN1_mib)}); after trimming over-provisioned requests, scheduling also fits (${G(best.sched_margin_mib)} margin). Frees ~${G(best.host_freed_mib)} to the PVE host.`
-  : `No worker can be removed while preserving N-1 by BOTH physical-usage and scheduling-request models.`;
-const verifyData = JSON.stringify({ hostOvercommit, k8sOvercommit, k8s_nodes: k8s.nodes, usage_totals: T, evalCandidates, best, csi_pinning_note: topo ? topo.csi_pinning_note : null, generalPool: generalPool.map((n) => n.name) }, null, 2);
-const lenses = [
-  { key: 'math', ask: 'Recompute BOTH N-1 models independently. Physical: cluster 30d peak vs (sum survivor allocatable - largest survivor). Scheduling: reschedulable recommended REQUESTS (not limits, not peak) vs (sum survivor freeGeneral - largest). Verify GPU node reserve uses REAL gpu requests, allocatable not capacity, DaemonSets are per-node fixed load. Are pool selection and numbers right?' },
-  { key: 'temporal', ask: 'Challenge the 30-DAY peak window and the request shrinks. Could a monthly/quarterly peak exceed cluster_peak_30d (compare a 90d peak)? Are the shrunk REQUESTS safe given each workload keeps a limit above its peak (Burstable)? Name any shrink or any still-tight limit that is reckless.' },
-  { key: 'stateful', ask: 'Check the chosen candidate for STRANDED state and drain blockers: CSI PV pinning (do volumes reattach anywhere?), CNPG primary, VolumeAttachment caps, anti-affinity/topologySpread unsatisfiable at one fewer worker, PDBs that block drain (disruptionsAllowed=0). Is removal actually safe, and what drain ORDERING is required?' },
-];
-const verdicts = (await parallel(lenses.map((l) => () =>
-  agent(`Adversarial reviewer. Try to REFUTE:\n"${headline}"\n\nLens: ${l.ask}\n\nData (read-only). Verify LIVE: kubectl, Prometheus (curl -sk -G '${PROM}' --data-urlencode 'query=...'), ${SSH} '<cmd>'.\n\n${verifyData}\n\nDefault refuted=true if evidence does not clearly hold. Give concrete corrections.`,
-    { label: `verify:${l.key}`, phase: 'Verify', schema: VERDICT }))
-)).filter(Boolean);
-
-return {
-  headline,
-  hostOvercommit, k8sOvercommit,
-  rightsizing: T,
-  request_shrinks: usage.request_shrinks,
-  limit_raises_oom: usage.limit_raises_oom,
-  spiky_periodic: usage.spiky_periodic,
-  candidates: evalCandidates,
-  recommendation: best,
-  k8s_nodes: k8s.nodes,
-  host_vms: host ? host.vms : null,
-  topo_spofs: topo ? topo.spofs : [],
-  topo_nodes: topo ? topo.nodes : [],
-  csi_pinning_note: topo ? topo.csi_pinning_note : null,
-  antiaffinity_risks: topo ? topo.antiaffinity_risks : [],
-  verdicts,
-  verdict_summary: `${verdicts.filter((v) => v.refuted).length}/${verdicts.length} reviewers refuted the headline`,
-};
--- a/.gitattributes
+++ b/.gitattributes
@ -4,12 +4,3 @@
 *.tfvars filter=git-crypt diff=git-crypt
 secrets/** filter=git-crypt diff=git-crypt
 stacks/**/secrets/** filter=git-crypt diff=git-crypt
-
-# Kubeconfigs / cluster credentials — encrypt at rest so a force-added or renamed
-# commit can't push plaintext to the public GitHub mirror. Belt-and-suspenders to
-# the .gitignore rules above; `.config` is explicit because that is exactly the
-# name an admin kubeconfig once leaked under (GitGuardian, 2026-07-02).
-.config filter=git-crypt diff=git-crypt
-kubeconfig filter=git-crypt diff=git-crypt
-*.kubeconfig filter=git-crypt diff=git-crypt
-admin.conf filter=git-crypt diff=git-crypt
--- a/.github/workflows/build-authentik.yml
+++ b/.github/workflows/build-authentik.yml
@ -1,39 +0,0 @@
-name: Build Custom Authentik Image
-
-# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
-# Thin SLOW-1a overlay over the official authentik server (narrows the login
-# identification stage's select_subclasses() to the login-capable source subtypes;
-# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
-# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
-# in modules/authentik/values.yaml together.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/authentik/Dockerfile'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/authentik
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
-            ghcr.io/viktorbarzin/authentik-server:latest
--- a/.github/workflows/build-chrome-service-browser.yml
+++ b/.github/workflows/build-chrome-service-browser.yml
@ -1,39 +0,0 @@
-name: Build chrome-service-browser
-
-# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
-# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
-# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
-# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
-# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
-# the pod pulls it without credentials.
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/chrome-service/files/chrome/**'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/chrome-service/files/chrome
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/chrome-service-browser:latest
-            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
--- a/.github/workflows/build-excalidraw.yml
+++ b/.github/workflows/build-excalidraw.yml
@ -1,42 +0,0 @@
-name: Build excalidraw-library
-
-# ADR-0002 / no-local-builds: excalidraw-library (infra-owned Go app behind
-# draw.viktorbarzin.me) builds off-infra on GHA → private ghcr; Keel polls
-# ghcr:latest and rolls the deployment. Replaces the manual DockerHub pushes
-# (viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image).
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/excalidraw/project/**'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-go@v5
-        with:
-          go-version: '1.21'
-      - run: go test ./...
-        working-directory: stacks/excalidraw/project
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/excalidraw/project
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/excalidraw-library:latest
-            ghcr.io/viktorbarzin/excalidraw-library:${{ github.sha }}
--- a/.github/workflows/build-valia-sites-sync.yml
+++ b/.github/workflows/build-valia-sites-sync.yml
@ -1,39 +0,0 @@
-name: Build valia-sites-sync
-
-# ADR-0002 + ADR-0018: infra-owned image built off-infra on GHA → ghcr (public).
-# Rclone + wrangler runner for the Valia-sites Content-folder mirror CronJob.
-# Rebuilds are rare (tool pins only change deliberately) → dispatch + path.
-# Security note: no untrusted event inputs are interpolated anywhere (only
-# github.actor / github.sha / GITHUB_TOKEN — same shape as the other
-# build-*.yml workflows in this repo).
-on:
-  push:
-    branches: [master]
-    paths:
-      - 'stacks/valia-sites/sync-image/**'
-  workflow_dispatch: {}
-
-permissions:
-  contents: read
-  packages: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: docker/setup-buildx-action@v3
-      - uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      - uses: docker/build-push-action@v6
-        with:
-          context: stacks/valia-sites/sync-image
-          platforms: linux/amd64
-          provenance: false
-          push: true
-          tags: |
-            ghcr.io/viktorbarzin/valia-sites-sync:latest
-            ghcr.io/viktorbarzin/valia-sites-sync:${{ github.sha }}
--- a/.gitignore
+++ b/.gitignore
@ -71,15 +71,8 @@ stacks/*/cloudflare_provider.tf
 stacks/*/tiers.tf
 stacks/*/terragrunt_rendered.json

-# Kubernetes config / cluster credentials (sensitive) — never commit in plaintext.
-# `config` alone missed the dotfile form: an admin kubeconfig once leaked to the
-# public mirror as `.config` (GitGuardian, 2026-07-02). Cover the common names.
+# Kubernetes config (sensitive)
 config
-.config
-kubeconfig
-*.kubeconfig
-admin.conf
-.kube/

 # Node.js (not part of infra)
 node_modules/
@ -117,9 +110,3 @@ terraform.tfstate.backup
 # Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
 # secrets; created by terraform state ops. The patterns above miss the timestamped form.
 terraform.tfstate.*.backup
-
-# Python test artifacts (pytest bytecode cache) — e.g. from
-# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
-__pycache__/
-*.pyc
-.pytest_cache/
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@ -19,7 +19,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 2
      attempts: 5
      backoff: 10s
@ -65,21 +64,6 @@ steps:
      # don't need explicit token propagation.
      VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
    commands:
-      # ── Forge guard: apply ONLY on the canonical Forgejo forge ──
-      # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
-      # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
-      # guard both run `terragrunt apply` on every push and race each other for
-      # the per-stack PG state lock — the dominant cause of the "Error acquiring
-      # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
-      # registration keeps running the CRONS (drift-detection, renew-tls, …) — only
-      # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
-      # env var set) still applies, preserving prior behaviour.
-      - |
-        if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
-          echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
-          exit 0
-        fi
-
      # ── Skip CI commits ──
      - |
        if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@ -228,40 +212,23 @@ steps:
        if [ -s .platform_apply ]; then
          echo "=== Applying platform stacks (serial, locked) ==="
          while read -r stack; do
-            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
-            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
-            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
-            # (so the app-stack detector still excludes it) but skipped here.
-            # (2026-06-27 — see docs/architecture/ci-cd.md)
-            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
            echo "[$stack] Starting apply..."
-            ATTEMPT=0
-            while :; do
-              ATTEMPT=$((ATTEMPT + 1))
            set +e
            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
            EXIT=$?
            set -e
-              if [ $EXIT -eq 0 ]; then
-                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
+            if [ $EXIT -ne 0 ]; then
+              if echo "$OUTPUT" | grep -q "is locked by"; then
+                echo "[$stack] SKIPPED (locked by another session)"
+              else
+                echo "$OUTPUT" | tail -50
+                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
              fi
-              # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
-              # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
-              # ("Error acquiring the state lock" / "already locked"). The PG case
-              # was previously counted as a failure — the #1 source of false reds.
-              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
-                echo "[$stack] SKIPPED (locked by another session/run)"; break
+            else
+              echo "$OUTPUT" | tail -3
+              echo "[$stack] OK"
            fi
-              # Transient: provider-registry download timeout / Vault 5xx → bounded
-              # retry. Deliberately NOT helm atomic-timeouts or config errors
-              # (missing arg, invalid index) — those must fail fast, retry can't fix
-              # them and can worsen a stuck helm release.
-              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
-                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
-              fi
-              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
-              FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
-            done
          done < .platform_apply
        fi
        # Deferred until after app stacks so both lists get a chance to run.
@ -274,27 +241,22 @@ steps:
          echo "=== Applying app stacks (serial, locked) ==="
          while read -r stack; do
            echo "[$stack] Starting apply..."
-            ATTEMPT=0
-            while :; do
-              ATTEMPT=$((ATTEMPT + 1))
            set +e
            OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
            EXIT=$?
            set -e
-              if [ $EXIT -eq 0 ]; then
-                echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
+            if [ $EXIT -ne 0 ]; then
+              if echo "$OUTPUT" | grep -q "is locked by"; then
+                echo "[$stack] SKIPPED (locked by another session)"
+              else
+                echo "$OUTPUT" | tail -50
+                echo "[$stack] FAILED (exit $EXIT)"
+                FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
              fi
-              # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
-              if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
-                echo "[$stack] SKIPPED (locked by another session/run)"; break
+            else
+              echo "$OUTPUT" | tail -3
+              echo "[$stack] OK"
            fi
-              # Transient provider-download / Vault 5xx → bounded retry (see platform loop).
-              if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
-                echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
-              fi
-              echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
-              FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
-            done
          done < .app_apply
        fi
        # Fail the step loudly so the pipeline `default` workflow state
@ -324,8 +286,13 @@ steps:
        fi
        GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master

-      # (No Slack post on success — Viktor 2026-07-02: CI notifies on FAILED
-      # runs only; the notify-failure step below covers those.)
+      # ── Slack notification ──
+      - |
+        PLATFORM_COUNT=$(wc -l < .platform_apply 2>/dev/null | tr -d ' ')
+        APP_COUNT=$(wc -l < .app_apply 2>/dev/null | tr -d ' ')
+        curl -s -X POST -H 'Content-type: application/json' \
+          --data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS} (platform:${PLATFORM_COUNT}, apps:${APP_COUNT})\"}" \
+          "$SLACK_WEBHOOK" || true

  # Slack on failure (runs even if apply step fails)
  - name: notify-failure
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@ -9,7 +9,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 1
      attempts: 3

@ -85,13 +84,6 @@ steps:
          stack=$(basename "$stack_dir")
          [ -f "$stack_dir/terragrunt.hcl" ] || continue

-          # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
-          # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
-          # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
-          # run. Skip it — drift on Tier-0 vault is caught at human apply time.
-          # (2026-06-27)
-          [ "$stack" = "vault" ] && continue
-
          echo -n "[$stack] planning... "
          OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
          EXIT=$?
@ -147,30 +139,13 @@ steps:
        echo "Drift: ${DRIFTED:-none}"
        echo "Errors: ${ERRORS:-none}"

-        # ── Slack only when something is WRONG (drift or errors) ──
-        # All-clean runs are silent (Viktor 2026-07-02: CI notifies on
-        # failed/actionable runs only; clean is the daily normal).
+        # ── Slack alert if drift found ──
        if [ -n "$DRIFTED" ]; then
          curl -s -X POST -H 'Content-type: application/json' \
            --data "{\"channel\":\"general\",\"text\":\":warning: Drift detected in:${DRIFTED}\nClean: ${CLEAN} stacks. Errors:${ERRORS:-none}\"}" \
            "$SLACK_WEBHOOK" || true
-        elif [ -n "$ERRORS" ]; then
+        else
          curl -s -X POST -H 'Content-type: application/json' \
-            --data "{\"channel\":\"general\",\"text\":\":red_circle: Drift detection had errors: ${ERRORS} (clean: ${CLEAN})\"}" \
+            --data "{\"channel\":\"general\",\"text\":\":white_check_mark: Drift detection: all ${CLEAN} stacks clean${ERRORS:+. Errors: $ERRORS}\"}" \
            "$SLACK_WEBHOOK" || true
        fi
-
-  # Hard-failure catch: the in-script posts above never run if the step
-  # itself crashes early — this step is the only signal for that case.
-  - name: notify-failure
-    image: curlimages/curl
-    commands:
-      - |
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\":red_circle: Drift-detection pipeline FAILED (crashed before reporting)\"}" \
-          "$SLACK_WEBHOOK" || true
-    environment:
-      SLACK_WEBHOOK:
-        from_secret: slack_webhook
-    when:
-      status: [failure]
--- a/.woodpecker/issue-automation.yml
+++ b/.woodpecker/issue-automation.yml
@ -5,7 +5,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 2

 steps:
--- a/.woodpecker/postmortem-todos.yml
+++ b/.woodpecker/postmortem-todos.yml
@ -11,7 +11,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 5

 steps:
@ -28,7 +27,6 @@ steps:
        from_secret: slack_webhook
    commands:
      - apk add --no-cache curl
-      - "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \":red_circle: Post-mortem TODO pipeline FAILED\"}' || true"
+      - "curl -sf -X POST https://hooks.slack.com/services/$SLACK_WEBHOOK -H 'Content-Type: application/json' -d '{\"text\": \"Post-mortem TODO pipeline completed\"}' || true"
    when:
-      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
-      - status: [failure]
+      - status: [success, failure]
--- a/.woodpecker/provision-user.yml
+++ b/.woodpecker/provision-user.yml
@ -5,7 +5,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      attempts: 5
      backoff: 10s

--- a/.woodpecker/pve-nfs-exports-sync.yml
+++ b/.woodpecker/pve-nfs-exports-sync.yml
@ -23,7 +23,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 1
      attempts: 3

@ -58,8 +57,7 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\":red_circle: PVE /etc/exports sync FAILED\"}" \
+          --data "{\"channel\":\"general\",\"text\":\"PVE /etc/exports sync: ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    when:
-      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
-      status: [failure]
+      status: [success, failure]
--- a/.woodpecker/registry-config-sync.yml
+++ b/.woodpecker/registry-config-sync.yml
@ -38,7 +38,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      depth: 1
      attempts: 3

@ -151,8 +150,7 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\":red_circle: Registry config sync on 10.0.20.10 FAILED\"}" \
+          --data "{\"channel\":\"general\",\"text\":\"Registry config sync on 10.0.20.10: ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    when:
-      # Failure-only (Viktor 2026-07-02): CI notifies on failed runs only.
-      status: [failure]
+      status: [success, failure]
--- a/.woodpecker/renew-tls.yml
+++ b/.woodpecker/renew-tls.yml
@ -6,7 +6,6 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
-      partial: false
      attempts: 5
      backoff: 10s

@ -71,11 +70,10 @@ steps:
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"channel\":\"general\",\"text\":\":red_circle: Woodpecker CI: TLS certificate renewal FAILED\"}" \
+          --data "{\"channel\":\"general\",\"text\":\"Woodpecker CI: TLS certificate renewal ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
-      # Failure-only (Viktor 2026-07-02): successful renewals are routine.
-      status: [failure]
+      status: [success, failure]
--- a/AGENTS.md
+++ b/AGENTS.md
@ -9,7 +9,7 @@
 - **Ask before `git push`** — always confirm with the user first

 ## Execution
-- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
+- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
 - **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
 - **kubectl**: `kubectl --kubeconfig $(pwd)/config`
 - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@ -95,7 +95,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 ## Key Paths
 - `stacks/<service>/main.tf` — service definition
 - `stacks/platform/modules/<service>/` — core infra modules
-- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"`, `"non-proxied"`, or `"internal"` — a public A record carrying the internal Traefik LB IP for household-only services; pair with the `home-lans-only` ipAllowList middleware, never with `"proxied"`)
+- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI, and auto Cloudflare DNS (`dns_type = "proxied"` or `"non-proxied"`)
 - `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount)
 - `config.tfvars` — non-secret configuration (plaintext)
 - `secrets.sops.json` — all secrets (SOPS-encrypted JSON)
@ -273,11 +273,8 @@ To land a finished change from such a clone:
   Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
 4. Leave the clone on clean `master` so auto-refresh keeps working.
 5. Tell the user in plain language what happened. Stack changes are
-   auto-applied by CI on push — or, with apply access, applied locally yourself
-   (`scripts/tg apply`, from the main checkout, not a worktree); either path is
-   fine, but the change must always be committed here, never applied
-   uncommitted. Verify the live result with the user's read-only kubectl before
-   saying "it's live".
+   auto-applied by CI — verify the live result with the user's read-only
+   kubectl before saying "it's live".

 If a push to `master` is rejected by branch protection (user not on the
 whitelist — e.g. new users before Viktor grants it), fall back to a
@ -292,7 +289,6 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```

 ## Common Operations
-- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -56,28 +56,6 @@ _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
 A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
 _Avoid_: bare "user", "tenant".

-### GPU sharing
-
-**GPU slice**:
-One unit of `nvidia.com/gpu` on the time-sliced Tesla T4 — a **scheduling turn, NOT a memory allocation**. The device plugin advertises the card ×100; a pod requesting `nvidia.com/gpu: 1` gets GPU *access*, with zero guarantee about how much of the 16 GB VRAM it may use. "Overallocate GPU memory" is a real failure precisely because a slice carries no memory accounting.
-_Avoid_: reading a GPU slice as a memory reservation or a fraction of the card; "vGPU" (we run no vGPU/MIG/MPS — see ADR-0016).
-
-**GPU memory budget**:
-The custom node-level extended resource **`viktorbarzin.me/gpumem`** (integer MiB) that makes the scheduler VRAM-aware (ADR-0016). The GPU node advertises a total (~14000 MiB = physical minus driver/context slack); each GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"`; being non-overcommittable, the scheduler refuses to co-schedule past the card (overflow → `Pending`). A *schedule-time* reservation, **not** a runtime cap — it stops pile-on, not a single tenant's runaway.
-_Avoid_: treating it as a hard CUDA cap (it isn't — that's what the **GPU watchdog** is for); confusing it with the `nvidia.com/gpu` slice (orthogonal axes: access vs memory accounting).
-
-**GPU watchdog**:
-The `gpu-vram-watchdog` CronJob (nvidia ns) that supplies the runtime teeth the **GPU memory budget** lacks: when *actual* free VRAM (`gpu_pod_memory_used_bytes`) drops below a floor, it recycles the biggest tenant that is **over its declared budget**. Enforces the budget as a contract, acts only under pressure (so bursting into genuine slack is fine), and is what bounds the 2026-06-02 immich-ml runaway class.
-_Avoid_: expecting it to act on priority (it enforces the *budget*, since co-tenants often share one PriorityClass); expecting instant prevention (it corrects with a detection lag — soft, by design).
-
-**GPU demand-gate**:
-The scale-0↔1 admission CronJobs (`stacks/tts`) that bring a best-effort *batch* GPU tenant (chatterbox-tts) up only when free VRAM ≥ a floor and idle it back down — letting on-demand tenants fill real slack without holding a reserved **GPU memory budget** seat.
-_Avoid_: using it for interactive tenants (cold-load lag — portal-stt is warm-resident instead); conflating it with the **GPU watchdog** (gate = admit on free VRAM; watchdog = recycle on over-budget pressure).
-
-**gpu-workload priority**:
-The `gpu-workload` PriorityClass (1,200,000) auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority` policy — the exclude list (`tts`) drops to `tier-2-gpu` (600,000) so it loses node-pressure eviction first. Governs *Kubernetes node* eviction order, **not** VRAM (VRAM is the budget + watchdog's job).
-_Avoid_: assuming it protects VRAM; it is a scheduling/eviction priority on node memory/CPU pressure.
-
 ### Workstation (multi-user devvm)

 **devvm**:
@ -118,14 +96,6 @@ _Avoid_: "external", "outside".
 `viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
 _Avoid_: bare "lan", "private", "intranet".

-**Segment**:
-One isolated L2/L3 network with pfSense as its gateway — realised as a Proxmox-bridge-level tag feeding one dedicated untagged pfSense interface (dManagementsVms 10.0.10.0/24 = vmbr1 tag 10, dKubernetes 10.0.20.0/24 = vmbr1 tag 20, dCCTV 10.0.30.0/24 = vmbr0 tag 30). pfSense itself never terminates 802.1Q.
-_Avoid_: "VLAN" as the primary name (the tags 10/20/30 are transport detail; the Segment is the concept).
-
-**CCTV segment**:
-The untrusted camera **Segment** (`dCCTV`) — devices in it may be pulled from (RTSP/ISAPI) but may initiate nothing except NTP to their gateway. Deliberately outside every trusted source-IP allowlist (ADR-0017).
-_Avoid_: "camera VLAN", "CCTV LAN".
-
 **Ingress auth**:
 The `auth = "..."` parameter on `ingress_factory` — a discrete *mode*, not a ranked tier — one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API). Default `required` (fail-closed).
 _Avoid_: "auth tier" / "auth mode" — refer to it by the canonical key, `auth` (e.g. `auth = "required"`). "tier" is reserved for State tier and Namespace tier.
@ -147,17 +117,9 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
 _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.

 **Calico**:
-The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
+The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
 _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.

-**Service identity**:
-How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
-_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
-
-**Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
-_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
-
 ### Storage

 **proxmox-lvm-encrypted**:
@ -237,20 +199,6 @@ _Avoid_: expecting Diun to deploy; conflating with **Keel**.
 **Anubis**:
 A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).

-### Externally-authored sites
-
-**Valia site**:
-A small public static site authored by Valia (Viktor's mother, external to the infra) and hosted for her under `<name>.viktorbarzin.me`. Its source of truth is a **Content folder** she owns; the live site is a mirror of that folder, fresh within ~10 minutes. Hosted **off-infra** (Cloudflare Pages) by decision: a homelab outage freezes content but never takes her sites down. Viktor picks the English subdomain name per site at registration (her folder names stay Bulgarian). Current instances: `stem95su`, `bridge`.
-_Avoid_: "school site" (the family may grow beyond school projects); treating the deployed copy as editable — edits land only in the **Content folder**.
-
-**Content folder**:
-The Google Drive folder (or subfolder) Valia shares with `vbarzin@gmail.com` holding one **Valia site**'s files. Strictly read-only from the infra side — nothing ever writes back to her Drive. Empty or half-uploaded folder states must never wipe a live site.
-_Avoid_: syncing a folder root when the servable content lives in a subfolder (stem95su serves `stem claude/files/`, not the folder root).
-
-**Entry file**:
-The HTML file a **Valia site** serves at `/`. Defaults to `index.html`; per-site override when she names it differently (stem95su: `stem_board.html`). The override is a registration-time setting, not a constraint on her authoring.
-_Avoid_: asking Valia to rename her files to fit hosting conventions.
-
 ## Relationships

 - A **Service** is defined by exactly one **Stack** — **flat** or wrapping a **Stack-local module** — which sources zero or more shared **Factory modules** and resolves to one or more K8s workloads.
@ -262,7 +210,6 @@ _Avoid_: asking Valia to rename her files to fit hosting conventions.
 - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
 - An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
 - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
-- A **Valia site** mirrors exactly one **Content folder** and serves exactly one **Entry file** at `/`; the folder is hers, the subdomain name is Viktor's, the hosting is off-infra.

 ## Example dialogue

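Review note on the removed **GPU watchdog** entry in this file: its selection rule (act only when free VRAM is below a floor, then recycle the biggest tenant that is over its declared budget) can be sketched in a few lines of Go. This is an illustration only, not the `gpu-vram-watchdog` CronJob's actual code. The `Tenant` type, `pickVictim`, and the sample numbers are hypothetical, and "biggest" is assumed to mean highest measured VRAM use.

```go
package main

import "fmt"

// Tenant is a hypothetical view of one GPU pod: its declared
// viktorbarzin.me/gpumem budget and its measured VRAM use, both in MiB.
type Tenant struct {
	Name      string
	BudgetMiB int
	UsedMiB   int
}

// pickVictim returns the name of the tenant to recycle, or "" when nothing
// should be recycled. It acts only under pressure (free VRAM below the floor)
// and only considers tenants that exceed their declared budget, choosing the
// one with the highest measured use (an assumed reading of "biggest").
func pickVictim(tenants []Tenant, freeMiB, floorMiB int) string {
	if freeMiB >= floorMiB {
		return "" // no pressure: bursting into genuine slack is allowed
	}
	victim, victimUsed := "", -1
	for _, t := range tenants {
		if t.UsedMiB > t.BudgetMiB && t.UsedMiB > victimUsed {
			victim, victimUsed = t.Name, t.UsedMiB
		}
	}
	return victim
}

func main() {
	// Sample figures are made up for illustration.
	tenants := []Tenant{
		{Name: "immich-ml", BudgetMiB: 4000, UsedMiB: 9000},
		{Name: "portal-stt", BudgetMiB: 6000, UsedMiB: 5500},
	}
	fmt.Println("under pressure:", pickVictim(tenants, 500, 1000))
	fmt.Println("with slack, no victim:", pickVictim(tenants, 3000, 1000) == "")
}
```

The early return is what separates the watchdog from a hard cap in this sketch: a tenant over budget is left alone for as long as the card still has free VRAM above the floor.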
--- a/cli/README.md
+++ b/cli/README.md
@ -1,287 +1,2 @@
-# homelab
-
-`homelab` is the unified, agent-facing CLI for operating this homelab — one
-composable, JSON-capable surface for the operations agents run over and over,
-discovered progressively at runtime. It is grown **in place** from this
-directory (the former `infra-cli`), and the legacy webhook use-cases still work
-(see below).
-
-It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
-third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
-
-## Usage
-
-```
-homelab <command> [args]
-homelab manifest [--json]    # list every verb + its read/write tier (discovery entrypoint)
-homelab version
-```
-
-### v0.1 verbs — the infra inner-loop
-
-| Command | Tier | What it does |
-|---|---|---|
-| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
-| `release <kind>:<name>` | write | release a presence claim |
-| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
-| `tf validate <stack>` | read | `scripts/tg validate` |
-| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
-| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
-| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
-| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
-| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
-| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
-
-### v0.2 verbs — Kubernetes
-
-Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
-(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
-kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
-ambient kubeconfig.
-
-| Command | Tier | What it does |
-|---|---|---|
-| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
-| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
-| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
-| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
-| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
-| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
-| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
-| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
-| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
-| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
-| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
-
-Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
-**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
-
-`tf` resolves the stack dir by walking up from cwd to the infra root and
-delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
-the ingress auth-comment check). git-crypt filter flags are auto-injected on git
-operations in the encrypted infra repo.
-
-**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
-auto-detected suite) unless you pass `--no-verify` — landing to master unverified
-must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
-landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
-
-Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
-reads / prompt writes; v0.1 allows everything and relies on existing gates
-(permission mode, presence claims, plan approval).
-
-### v0.3 verbs — memory
-
-A thin HTTP client over the **claude-memory** service (the same backend the
-memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
-`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
-ingress). Because it hits the HTTP API directly, it **works even when the MCP
-frontend is down**.
-
-| Command | Tier | What it does |
-|---|---|---|
-| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
-| `memory list [--category --tag --limit]` | read | recent memories |
-| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
-| `memory secret <id>` | read | reveal a sensitive memory's content |
-| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
-| `memory update <id> [--content --tags --importance]` | write | edit a memory |
-| `memory delete <id>` | write | delete a memory |
-
-All read/write paths are validated against the live API (incl. a
-store→recall→delete round-trip). This gives full data-plane parity with the MCP;
-the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
-to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
-see `docs/adr/0008`.
-
-### v0.4 verbs — ci / deploy
-
-Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
-talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
-`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
-remote, with retries that ride Woodpecker's intermittent empty responses.
-
-| Command | Tier | What it does |
-|---|---|---|
-| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
-| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
-| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
-
-`work land` now calls `ci watch` on the landed commit automatically (skip with
-`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
-step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
-the least reliable; `status`/`watch` use the list endpoint that works.
-
-### v0.5 verbs — net / dns / metrics / logs
-
-Reachability + observability probes. Their value is *endpoint resolution* — the
-non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
-otherwise re-derive every time — not the HTTP call itself. All reach internal
-ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
-
-| Command | Tier | What it does |
-|---|---|---|
-| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
-| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
-| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
-| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
-| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
-
-Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
-no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
-firing set is reachable via `ALERTS` instead.)
-
-### v0.6 — usage telemetry (`usage top`)
-
-Makes "which verbs are actually used, by everyone" a query instead of a guess —
-so adding the *next* verb is evidence-driven, not shaped by one person's habits.
-
-Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
-labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
-flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
-affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
-the shared Loki, aggregate usage is queryable **without reading anyone's home** —
-the privacy-preserving answer to "what does the team use."
-
-| Command | Tier | What it does |
-|---|---|---|
-| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
-
-### v0.7 verbs — Home Assistant
-
-Cover exactly the two things the `ha` **MCP server can't**: resolving the
-long-lived API token out of the cluster, and SSH to the HA host for host-level
-work (config files, docker, add-ons). Entity state and control (`turn_on`,
-`get_state`, services) stay with the MCP — *actions an MCP already encodes are
-out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
-the non-obvious *which secret, which host, which key, which flags* you'd
-otherwise re-derive every session — agents were hand-rolling a
-`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
-every run because the existing `home-assistant-sofia.py` needs an env var set
-and a cwd-relative path, neither of which holds in an arbitrary session.
-
-| Command | Tier | What it does |
-|---|---|---|
-| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
-| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
-
-`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
-prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
-`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
-not tied to whoever first wrote the workflow (the user's key must be enrolled on
-the HA host).
-
-### v0.8 verbs — browser (headful anti-bot automation)
-
-Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
-from the devvm over CDP, for sites that detect and block headless automation. The
-headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
-the gated action (submit/login) silently fails — the motivating case was the
-Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
-`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
-injects the same `stealth.js` the in-cluster callers use, and submits first try.
-
-The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
-agent supplies the Playwright script — judgment stays out of the CLI.
-
-| Command | Tier | What it does |
-|---|---|---|
-| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
-| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
-| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
-
-Default context is a **fresh incognito** one (closed on exit) — safe for the
-shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
-reuses the warmed persistent profile when a pre-logged-in session is needed.
-`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
-that gates in-cluster callers — no namespace label needed. The node CDP client is
-pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
-(Chromium 130; protocol changes between minors) and is installed once, lazily,
-into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
-runs on the devvm, `setInputFiles` streams local files to the remote browser over
-CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
-and `docs/adr/0013`.
-
-### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
-
-Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
-filters render to a single safe `SELECT` (namespace values validated to the k8s
-name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
-| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
-| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
-| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
-| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
-| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
-
-### v0.10 — `vault get --all` (browse every field)
-
-`vault get <name> --all` returns the **whole item** as a normalized JSON object,
-so an agent can discover and read fields the single-field `--field` allowlist
-can't reach — notably arbitrary **custom fields**.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
-
-Shape notes: present standard fields only (empty ones omitted); `fields` is a
-custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
-The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the
-only seed-derived path stays the specially-audited `vault code`. Like
-`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
-it (`homelab vault get <name> --all | jq`).
-
-### v0.10.1 — reads `bw sync` first (always fresh)
-
-Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
-sync` when opening its session, so it reflects the latest server-side values.
-`bw unlock` only decrypts the *local* cache, so without this a persisted
-(already-logged-in) session served stale data — a password changed in the web
-vault wouldn't show up until the next login. The sync is **best-effort**: a
-transient failure warns on stderr and falls back to the cached vault rather than
-failing the read.
-
-### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
-
-`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
-`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
-
-- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
-- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
-
-| Command | Tier | What it does |
-| --- | --- | --- |
-| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
-| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
-| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
-
-**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
-(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
-(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
-handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
-its own path). Access is whatever your policy grants. Writes are merge-only;
-`put` (replace) / `delete` are out of scope — use the raw `vault` CLI.
-
-## Build / install
-
-Built from source to `/usr/local/bin/homelab` during devvm provisioning
-(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
-stamped from `cli/VERSION` via ldflags. Manual build:
-
-```
-cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
-go test ./...
-```
-
-## Legacy webhook use-cases (preserved)
-
-This binary is also the in-cluster `infra-cli` image. Invocations starting with
-`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
-original flag-based path unchanged, so the webhook handler is unaffected.
-
-## Design
-
-See `infra/docs/adr/0004`–`0013` for the architecture decisions.
+# What is this?
+This is a CLI to manipulate files in the terraform repo and commit and push them
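Review note on the removed `edges` section of this README: it says filters render to a single safe `SELECT` with namespace values validated to the k8s name charset. A minimal sketch of that idea follows, assuming the usual Kubernetes rule that a namespace is an RFC 1123 DNS label (lowercase alphanumerics and `-`, alphanumeric at both ends, at most 63 characters). The helper names, the `edges` table, and the `src_ns`/`dst_ns` columns are hypothetical; the real schema is not shown in this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// nsPattern matches an RFC 1123 DNS label, the shape Kubernetes requires of
// namespace names. Quotes, spaces, and semicolons can never match it.
var nsPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace reports whether ns is safe to place in a query literal.
func validNamespace(ns string) bool {
	return len(ns) <= 63 && nsPattern.MatchString(ns)
}

// edgesQuery renders one SELECT for edges touching ns in either direction.
// The table and column names are placeholders, not the real schema.
func edgesQuery(ns string, limit int) (string, error) {
	if !validNamespace(ns) {
		return "", fmt.Errorf("invalid namespace %q", ns)
	}
	return fmt.Sprintf(
		"SELECT * FROM edges WHERE src_ns = '%s' OR dst_ns = '%s' LIMIT %d",
		ns, ns, limit), nil
}

func main() {
	q, err := edgesQuery("kube-system", 200)
	fmt.Println(q, err)

	_, err = edgesQuery("x'; DROP TABLE edges; --", 200)
	fmt.Println(err)
}
```

Validating against a closed charset before interpolation is what lets the query go through an exec path that takes a plain SQL string; where a driver with bind parameters is available, those would be the more conventional choice.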
--- a/cli/VERSION
+++ b/cli/VERSION
@ -1 +0,0 @@
-v0.12.0
--- a/cli/browser.go
+++ b/cli/browser.go
@ -1,388 +0,0 @@
-package main
-
-import (
-	_ "embed"
-	"encoding/json"
-	"fmt"
-	"io"
-	"net"
-	"net/http"
-	"os"
-	"os/exec"
-	"os/signal"
-	"path/filepath"
-	"strconv"
-	"strings"
-	"sync"
-	"syscall"
-	"time"
-)
-
-// playwrightVersion pins the node CDP client to the chrome-service image minor
-// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
-// speaks the browser's CDP, so the client minor must track the server minor;
-// see docs/architecture/chrome-service.md "Image pin".
-const playwrightVersion = "1.48.2"
-
-// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
-// endpoint to become ready before giving up.
-const defaultBrowserTimeout = 60
-
-const (
-	chromeServiceNamespace = "chrome-service"
-	chromeServiceName      = "chrome-service"
-	chromeServiceCDPPort   = 9222
-)
-
-// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
-// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
-// guards against drift.
-//
-//go:embed browser_stealth.js
-var stealthJS string
-
-// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
-// installs the stealth init script, and runs the user's Playwright script.
-//
-//go:embed browser_runner.js
-var runnerJS string
-
-// browserOpts is the parsed form of `homelab browser run|open` arguments.
-type browserOpts struct {
-	mode      string // "run" | "open"
-	script    string // path to the user Playwright script (run mode)
-	url       string // initial URL (run: optional; open: required positional)
-	sharedCtx bool   // use the warmed persistent profile instead of a fresh context
-	keepOpen  bool   // leave the created context/pages open on exit
-	port      int    // explicit local port for the forward (0 = auto)
-	timeout   int    // CDP readiness timeout, seconds
-	help      bool
-}
-
-// parseBrowserArgs parses the args after `browser run` / `browser open`.
-func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
-	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
-	var positionals []string
-	atoi := func(s, flag string) (int, error) {
-		n, err := strconv.Atoi(s)
-		if err != nil {
-			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
-		}
-		return n, nil
-	}
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "-h" || a == "--help":
-			o.help = true
-		case a == "--shared-context":
-			o.sharedCtx = true
-		case a == "--keep-open":
-			o.keepOpen = true
-		case a == "--url":
-			if i+1 < len(args) {
-				o.url = args[i+1]
-				i++
-			}
-		case strings.HasPrefix(a, "--url="):
-			o.url = strings.TrimPrefix(a, "--url=")
-		case a == "--port":
-			if i+1 < len(args) {
-				n, err := atoi(args[i+1], "--port")
-				if err != nil {
-					return o, err
-				}
-				o.port = n
-				i++
-			}
-		case strings.HasPrefix(a, "--port="):
-			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
-			if err != nil {
-				return o, err
-			}
-			o.port = n
-		case a == "--timeout":
-			if i+1 < len(args) {
-				n, err := atoi(args[i+1], "--timeout")
-				if err != nil {
-					return o, err
-				}
-				o.timeout = n
-				i++
-			}
-		case strings.HasPrefix(a, "--timeout="):
-			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
-			if err != nil {
-				return o, err
-			}
-			o.timeout = n
-		case strings.HasPrefix(a, "-"):
-			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
-		default:
-			positionals = append(positionals, a)
-		}
-	}
-	if o.help {
-		return o, nil
-	}
-	switch mode {
-	case "run":
-		if len(positionals) == 0 {
-			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
-		}
-		o.script = positionals[0]
-	case "open":
-		if len(positionals) == 0 {
-			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
-		}
-		o.url = positionals[0]
-	}
-	return o, nil
-}
-
-// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
-// a real (non-headless) Chrome — the entire reason chrome-service exists.
-func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
-	var v struct {
-		Browser   string `json:"Browser"`
-		UserAgent string `json:"User-Agent"`
-	}
-	if e := json.Unmarshal(jsonBody, &v); e != nil {
-		return "", false, fmt.Errorf("parse /json/version: %w", e)
-	}
-	if v.Browser == "" {
-		return "", false, fmt.Errorf("/json/version had no Browser field")
-	}
-	healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
-		!strings.Contains(v.Browser, "Headless") &&
-		!strings.Contains(v.UserAgent, "Headless")
-	return v.Browser, healthy, nil
-}
-
-// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
-// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
-// NetworkPolicy that gates in-cluster callers.
-func buildPortForwardArgs(localPort int) []string {
-	return []string{"-n", chromeServiceNamespace, "port-forward",
-		"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
-}
-
-// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
-// client kept under the user cache dir.
-func browserClientPackageJSON() string {
-	return fmt.Sprintf(`{
-  "name": "homelab-browser-client",
-  "private": true,
-  "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
-  "dependencies": {
-    "playwright-core": "%s"
-  }
-}
-`, playwrightVersion)
-}
-
-// freePort asks the kernel for an unused ephemeral TCP port.
-func freePort() (int, error) {
-	l, err := net.Listen("tcp", "127.0.0.1:0")
-	if err != nil {
-		return 0, err
-	}
-	defer l.Close()
-	return l.Addr().(*net.TCPAddr).Port, nil
-}
-
-// browserClientDir is where the pinned node client + managed runner files live.
-func browserClientDir() (string, error) {
-	cache, err := os.UserCacheDir()
-	if err != nil || cache == "" {
-		home, herr := os.UserHomeDir()
-		if herr != nil {
-			return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
-		}
-		cache = filepath.Join(home, ".cache")
-	}
-	return filepath.Join(cache, "homelab", "browser-client"), nil
-}
-
-// installedPlaywrightVersion reads the version of the playwright-core already
-// installed in dir, or "" if absent/unreadable.
-func installedPlaywrightVersion(dir string) string {
-	b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
-	if err != nil {
-		return ""
-	}
-	var v struct {
-		Version string `json:"version"`
-	}
-	if json.Unmarshal(b, &v) != nil {
-		return ""
-	}
-	return v.Version
-}
-
-// ensureBrowserClient writes the managed runner/stealth/package files into dir
-// and lazily installs the pinned playwright-core (only when missing/mismatched),
-// so no per-user setup is needed and the client tracks the binary version.
-func ensureBrowserClient(dir string) error {
-	if err := os.MkdirAll(dir, 0o755); err != nil {
-		return err
-	}
-	files := map[string]string{
-		"package.json":      browserClientPackageJSON(),
-		"browser_runner.js": runnerJS,
-		"stealth.js":        stealthJS,
-	}
-	for name, content := range files {
-		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
-			return err
-		}
-	}
-	if installedPlaywrightVersion(dir) == playwrightVersion {
-		return nil
-	}
-	fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
-	cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
-	cmd.Dir = dir
-	cmd.Stdout = os.Stderr
-	cmd.Stderr = os.Stderr
-	if err := cmd.Run(); err != nil {
-		return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
-	}
-	if got := installedPlaywrightVersion(dir); got != playwrightVersion {
-		return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
-	}
-	return nil
-}
-
-// waitForCDP polls the local CDP endpoint until it answers as a healthy
-// (non-headless) Chrome, or the timeout elapses.
-func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
-	deadline := time.Now().Add(timeout)
-	client := &http.Client{Timeout: 3 * time.Second}
-	var lastErr error
-	for time.Now().Before(deadline) {
-		resp, err := client.Get(cdpURL + "/json/version")
-		if err != nil {
-			lastErr = err
-			time.Sleep(300 * time.Millisecond)
-			continue
-		}
-		body, _ := io.ReadAll(resp.Body)
-		resp.Body.Close()
-		browser, healthy, herr := cdpHealthy(body)
-		if herr != nil {
-			lastErr = herr
-			time.Sleep(300 * time.Millisecond)
-			continue
-		}
-		if !healthy {
-			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
-		}
-		return browser, nil
-	}
-	if lastErr == nil {
-		lastErr = fmt.Errorf("timed out after %s", timeout)
-	}
-	return "", lastErr
-}
-
-// runBrowser is the orchestration: pick a port, ensure the pinned client, start
-// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
-func runBrowser(o browserOpts) error {
-	port := o.port
-	if port == 0 {
-		p, err := freePort()
-		if err != nil {
-			return fmt.Errorf("pick local port: %w", err)
-		}
-		port = p
-	}
-
-	dir, err := browserClientDir()
-	if err != nil {
-		return err
-	}
-	if err := ensureBrowserClient(dir); err != nil {
-		return err
-	}
-
-	// Start the forward in its own process group so the whole tree dies on cleanup.
-	pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
-	pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
-	var pfLog strings.Builder
-	pf.Stdout = &pfLog
-	pf.Stderr = &pfLog
-	if err := pf.Start(); err != nil {
-		return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
-	}
-
-	var once sync.Once
-	teardown := func() {
-		once.Do(func() {
-			if pf.Process != nil {
-				_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
-			}
-			_ = pf.Wait()
-		})
-	}
-	defer teardown()
-
-	// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
-	sigCh := make(chan os.Signal, 1)
-	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
-	defer signal.Stop(sigCh)
-	go func() {
-		if _, ok := <-sigCh; ok {
-			teardown()
-			os.Exit(130)
-		}
-	}()
-
-	cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
-	browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
-	if err != nil {
-		return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
-	}
-	fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
-
-	return runBrowserNode(dir, cdpURL, o)
-}
-
-// runBrowserNode invokes the managed node runner with inputs passed via env.
-func runBrowserNode(dir, cdpURL string, o browserOpts) error {
-	env := append(os.Environ(),
-		"HOMELAB_CDP_URL="+cdpURL,
-		"HOMELAB_BROWSER_MODE="+o.mode,
-		"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
-		"NODE_PATH="+filepath.Join(dir, "node_modules"),
-	)
-	if o.url != "" {
-		env = append(env, "HOMELAB_BROWSER_URL="+o.url)
-	}
-	if o.script != "" {
-		abs, err := filepath.Abs(o.script)
-		if err != nil {
-			return err
-		}
-		if _, err := os.Stat(abs); err != nil {
-			return fmt.Errorf("script %s: %w", o.script, err)
-		}
-		env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
-	}
-	if o.sharedCtx {
-		env = append(env, "HOMELAB_BROWSER_SHARED=1")
-	}
-	if o.keepOpen {
-		env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
-	}
-	if o.mode == "open" {
-		shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
-		env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
-	}
-	cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
-	cmd.Env = env
-	cmd.Stdout = os.Stdout
-	cmd.Stderr = os.Stderr
-	cmd.Stdin = os.Stdin
-	return cmd.Run()
-}
--- a/cli/browser_runner.js
+++ b/cli/browser_runner.js
@ -1,106 +0,0 @@
-// homelab browser — node CDP runner (auto-managed; regenerated each run from the
-// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
-// chrome-service CDP endpoint, installs the stealth init script, then runs the
-// user's Playwright script (run mode) or opens a URL (open mode). All inputs
-// arrive via HOMELAB_* env vars set by the Go CLI.
-'use strict';
-const fs = require('fs');
-const { chromium } = require('playwright-core');
-
-async function main() {
-  const cdpURL = process.env.HOMELAB_CDP_URL;
-  if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
-  const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
-  const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
-  const initURL = process.env.HOMELAB_BROWSER_URL || '';
-  const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
-  const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
-  const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
-  const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
-
-  const browser = await chromium.connectOverCDP(cdpURL);
-
-  // Fresh isolated context by default (safe for the shared browser + concurrent
-  // callers); --shared-context reuses the warmed persistent profile.
-  let context;
-  let createdContext = false;
-  if (shared) {
-    const existing = browser.contexts();
-    if (existing.length) {
-      context = existing[0];
-    } else {
-      context = await browser.newContext();
-      createdContext = true;
-    }
-  } else {
-    context = await browser.newContext();
-    createdContext = true;
-  }
-
-  if (stealthPath) {
-    const stealth = fs.readFileSync(stealthPath, 'utf8');
-    if (stealth.trim()) await context.addInitScript(stealth);
-  }
-
-  const page = await context.newPage();
-  const log = (...a) => console.error('[browser]', ...a);
-
-  let exitCode = 0;
-  try {
-    if (initURL) {
-      await page.goto(initURL, { waitUntil: 'domcontentloaded' });
-    }
-    if (mode === 'open') {
-      console.log('url:    ' + page.url());
-      console.log('title:  ' + (await page.title()));
-      const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
-      console.log('--- visible text (truncated to 4000 chars) ---');
-      console.log(text.slice(0, 4000));
-      if (screenshotPath) {
-        await page.screenshot({ path: screenshotPath, fullPage: true });
-        console.log('screenshot: ' + screenshotPath);
-      }
-    } else {
-      if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
-      const src = fs.readFileSync(scriptPath, 'utf8');
-      // Run the user's source with page/context/browser/log in lexical scope.
-      // AsyncFunction body permits top-level await.
-      const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
-      const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
-      const result = await fn(page, context, browser, log);
-      if (result !== undefined) {
-        let out;
-        try {
-          out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
-        } catch (_) {
-          out = String(result);
-        }
-        console.log(out);
-      }
-    }
-  } catch (e) {
-    console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
-    exitCode = 1;
-  } finally {
-    if (!keepOpen) {
-      try {
-        // Close only what we created; never tear down the shared persistent context.
-        if (createdContext) {
-          await context.close();
-        } else {
-          await page.close();
-        }
-      } catch (_) { /* ignore */ }
-    }
-    // Disconnect from the CDP endpoint; this does NOT kill the remote browser.
-    try {
-      await browser.close();
-    } catch (_) { /* ignore */ }
-  }
-  process.exit(exitCode);
-}
-
-main().catch((e) => {
-  console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
-  process.exit(1);
-});
--- a/cli/browser_stealth.js
+++ b/cli/browser_stealth.js
@ -1,54 +0,0 @@
-// Minimal stealth init script for Playwright-driven Chromium.
-// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
-//   webdriver, chrome.runtime, navigator.plugins, navigator.languages,
-//   Permissions.query, WebGL getParameter (vendor + renderer spoof).
-// Run via context.add_init_script() so it executes before any page script.
-(() => {
-  // navigator.webdriver — most common detection, removed entirely.
-  Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
-
-  // window.chrome.runtime — many sites check that real Chrome exposes this.
-  if (!window.chrome) window.chrome = {};
-  window.chrome.runtime = window.chrome.runtime || {};
-
-  // navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
-  Object.defineProperty(navigator, 'plugins', {
-    get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
-  });
-
-  // navigator.languages — headless returns empty array.
-  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
-
-  // Permissions.query — headless returns 'denied' for notifications instead of 'default'.
-  const origQuery = window.navigator.permissions && window.navigator.permissions.query;
-  if (origQuery) {
-    window.navigator.permissions.query = (parameters) =>
-      parameters && parameters.name === 'notifications'
-        ? Promise.resolve({ state: Notification.permission })
-        : origQuery(parameters);
-  }
-
-  // WebGL getParameter — spoof vendor + renderer strings to a real GPU.
-  const spoofGl = (proto) => {
-    if (!proto) return;
-    const orig = proto.getParameter;
-    proto.getParameter = function (parameter) {
-      if (parameter === 37445) return 'Intel Inc.';                   // UNMASKED_VENDOR_WEBGL
-      if (parameter === 37446) return 'Intel Iris OpenGL Engine';     // UNMASKED_RENDERER_WEBGL
-      return orig.apply(this, arguments);
-    };
-  };
-  spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
-  spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
-
-  // disable-devtool.js (theajack/disable-devtool) auto-inits via a script
-  // tag with `disable-devtool-auto`. Its Performance detector trips under
-  // Playwright (CDP adds console.log latency vs console.table) and the
-  // redirect URL is hard-coded — for hmembeds that's google.com.
-  // Hide the auto-init marker so the library's IIFE exits early.
-  const origQS = Document.prototype.querySelector;
-  Document.prototype.querySelector = function (sel) {
-    if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
-    return origQS.apply(this, arguments);
-  };
-})();
--- a/cli/cmd_browser.go
+++ b/cli/cmd_browser.go
@ -1,117 +0,0 @@
-package main
-
-import "fmt"
-
-// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
-// from outside the cluster, for sites that detect/block headless automation.
-// The headless @playwright/mcp browser can load such sites but their gated
-// actions (submit/login) silently fail; this path submits first try. Mechanics
-// only — the agent supplies the Playwright script. See docs/adr/0013.
-
-func browserCommands() []Command {
-	return []Command{
-		{Path: []string{"browser"}, Tier: TierRead,
-			Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
-		{Path: []string{"browser", "run"}, Tier: TierWrite,
-			Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
-		{Path: []string{"browser", "open"}, Tier: TierWrite,
-			Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
-	}
-}
-
-func browserTopHelp([]string) error {
-	fmt.Print(browserHelp())
-	return nil
-}
-
-func browserRun(args []string) error {
-	o, err := parseBrowserArgs("run", args)
-	if err != nil {
-		return err
-	}
-	if o.help {
-		fmt.Print(browserHelp())
-		return nil
-	}
-	return runBrowser(o)
-}
-
-func browserOpen(args []string) error {
-	o, err := parseBrowserArgs("open", args)
-	if err != nil {
-		return err
-	}
-	if o.help {
-		fmt.Print(browserHelp())
-		return nil
-	}
-	return runBrowser(o)
-}
-
-// browserHelp carries the discoverability payload: WHEN to reach for this, and
-// the diagnostic cheat-sheet that lets the agent self-correct instead of
-// retrying a deterministic form blind (the failure mode that motivated this).
-func browserHelp() string {
-	return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP
-
-The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
-Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
-injects the same stealth.js the in-cluster callers use, and runs your script.
-
-USAGE
-  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
-  homelab browser open <url> [--shared-context] [--timeout S]
-
-WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser
-  Default to the Playwright MCP / headless browser for ALL routine browsing and
-  automation — it's interactive (snapshot per step), fast to start, isolated.
-  Reach for THIS command ONLY when headless is demonstrably blocked: a site
-  LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins
-  forever, or ONE request errors while its siblings 200. That is the signature
-  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
-  disable-devtool traps). This command presents as a real Chrome and usually succeeds
-  first try — but it's the shared cluster browser (slower startup, one batch
-  run, no per-step feedback), so it's the escalation path, never the default.
-
-ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
-  ERR_FILE_NOT_FOUND (-6)   request intercepted/resolved locally by the
-                            automation layer — NOT a network/egress problem.
-                            (This is what silently broke the headless submit.)
-  ERR_CONNECTION_REFUSED /  real egress failure (DNS/route/firewall). These also
-  ERR_TIMED_OUT /           break the initial page load — if the page loaded,
-  ERR_NAME_NOT_RESOLVED     egress is fine and the cause is elsewhere.
-  one endpoint 500s while   server-side bot rejection of the automation, not
-  its siblings 200          your payload.
-
-HABITS
-  - Inspect the network panel BEFORE retrying a deterministic form; a blind
-    retry just repeats the same silent failure.
-  - Don't park a half-filled multi-step form across a user pause — the session
-    can expire; re-run the whole flow from this command in one shot.
-  - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging
-    of $HOME needed; just point setInputFiles at a local path.
-
-CONTEXT
-  Default: a FRESH incognito context, closed on exit — safe for the shared
-  browser and concurrent callers (e.g. tripit). Your script does its own login.
-  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
-  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
-
-SCRIPT CONTRACT (run mode)
-  Your file's body runs with page, context, browser and log() already in scope
-  (top-level await allowed). Return a value to print it. Example flow.js:
-
-    await page.goto('https://portal.example.com/login');
-    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
-    await page.click('button[type=submit]');
-    await page.waitForURL('**/dashboard');
-    return 'logged in: ' + page.url();
-
-  Run it:  homelab browser run flow.js
-
-NOTES
-  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
-    chrome-service image (Chrome 130); installed once under the user cache dir (e.g. ~/.cache/homelab/browser-client).
-  - The port-forward is always torn down, on success and on error.
-`
-}
--- a/cli/cmd_browser_test.go
+++ b/cli/cmd_browser_test.go
@ -1,172 +0,0 @@
-package main
-
-import (
-	"os"
-	"reflect"
-	"strings"
-	"testing"
-)
-
-func TestParseBrowserArgsRun(t *testing.T) {
-	got, err := parseBrowserArgs("run", []string{
-		"flow.js", "--url", "https://example.com", "--shared-context",
-		"--port", "19999", "--timeout", "45", "--keep-open",
-	})
-	if err != nil {
-		t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
-	}
-	want := browserOpts{
-		mode: "run", script: "flow.js", url: "https://example.com",
-		sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
-	}
-	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
-	}
-}
-
-func TestParseBrowserArgsRunDefaults(t *testing.T) {
-	got, err := parseBrowserArgs("run", []string{"flow.js"})
-	if err != nil {
-		t.Fatalf("unexpected err: %v", err)
-	}
-	if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
-		t.Fatalf("defaults wrong: %+v", got)
-	}
-	if got.timeout != defaultBrowserTimeout {
-		t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
-	}
-}
-
-func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
-	if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
-		t.Fatalf("run without a script path should error")
-	}
-}
-
-func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
-	got, err := parseBrowserArgs("open", []string{"https://example.com"})
-	if err != nil {
-		t.Fatalf("unexpected err: %v", err)
-	}
-	if got.url != "https://example.com" || got.mode != "open" {
-		t.Fatalf("open parse wrong: %+v", got)
-	}
-	if _, err := parseBrowserArgs("open", []string{}); err == nil {
-		t.Fatalf("open without a URL should error")
-	}
-}
-
-func TestParseBrowserArgsHelp(t *testing.T) {
-	for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
-		got, err := parseBrowserArgs("run", a)
-		if err != nil {
-			t.Fatalf("help parse %v: %v", a, err)
-		}
-		if !got.help {
-			t.Fatalf("args %v should set help", a)
-		}
-	}
-}
-
-func TestParseBrowserArgsEqualsForm(t *testing.T) {
-	got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
-	if err != nil {
-		t.Fatalf("unexpected err: %v", err)
-	}
-	if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
-		t.Fatalf("--flag=value form not parsed: %+v", got)
-	}
-}
-
-func TestCDPHealthy(t *testing.T) {
-	real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
-	browser, ok, err := cdpHealthy(real)
-	if err != nil || !ok {
-		t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
-	}
-	if !strings.HasPrefix(browser, "Chrome/") {
-		t.Fatalf("browser = %q, want Chrome/ prefix", browser)
-	}
-
-	headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
-	if _, ok, _ := cdpHealthy(headless); ok {
-		t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
-	}
-
-	if _, _, err := cdpHealthy([]byte("not json")); err == nil {
-		t.Fatalf("malformed /json/version body should error")
-	}
-}
-
-func TestBuildPortForwardArgs(t *testing.T) {
-	got := buildPortForwardArgs(18080)
-	want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
-	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
-	}
-}
-
-func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
-	pj := browserClientPackageJSON()
-	if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
-		t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
-	}
-}
-
-func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
-	// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
-	// client minor MUST match (protocol changes between minors).
-	if !strings.HasPrefix(playwrightVersion, "1.48.") {
-		t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
-	}
-}
-
-func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
-	h := browserHelp()
-	for _, want := range []string{
-		"homelab browser run",
-		"ERR_FILE_NOT_FOUND",
-		"ERR_CONNECTION_REFUSED",
-		"network panel",
-		"headless",
-		"--shared-context",
-	} {
-		if !strings.Contains(h, want) {
-			t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
-		}
-	}
-}
-
-func TestBrowserHelpIsTiered(t *testing.T) {
-	// --help must frame this as the ESCALATION path (default to headless first),
-	// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
-	// instructions. Guard against a regression to "co-equal choice" wording.
-	h := browserHelp()
-	for _, want := range []string{"Default to the", "escalation"} {
-		if !strings.Contains(h, want) {
-			t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
-		}
-	}
-}
-
-func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
-	// The embedded copy must never drift from the source of truth that the
-	// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
-	canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
-	if err != nil {
-		t.Fatalf("read canonical stealth.js: %v", err)
-	}
-	if stealthJS != string(canonical) {
-		t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
-	}
-}
-
-func TestFreePortReturnsUsablePort(t *testing.T) {
-	p, err := freePort()
-	if err != nil {
-		t.Fatalf("freePort: %v", err)
-	}
-	if p <= 1024 || p > 65535 {
-		t.Fatalf("freePort returned %d, want an ephemeral port", p)
-	}
-}
--- a/cli/cmd_ci.go
+++ b/cli/cmd_ci.go
@ -1,99 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"strings"
-	"time"
-)
-
-func ciCommands() []Command {
-	return []Command{
-		{Path: []string{"ci", "status"}, Tier: TierRead,
-			Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
-		{Path: []string{"ci", "watch"}, Tier: TierRead,
-			Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
-	}
-}
-
-func short(s string) string {
-	if len(s) > 8 {
-		return s[:8]
-	}
-	return s
-}
-
-func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
-
-// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
-func currentHEAD() string {
-	cwd, _ := os.Getwd()
-	root, err := gitRepoRoot(cwd)
-	if err != nil {
-		return ""
-	}
-	sha, _ := gitOutput(root, "rev-parse", "HEAD")
-	return sha
-}
-
-func ciStatus(args []string) error {
-	commit, _ := firstPositional(args)
-	c, err := newWPClient()
-	if err != nil {
-		return err
-	}
-	id, err := c.repoID()
-	if err != nil {
-		return err
-	}
-	p, err := c.findPipeline(id, commit)
-	if err != nil {
-		return err
-	}
-	fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
-	return nil
-}
-
-func ciWatch(args []string) error {
-	commit, _ := firstPositional(args)
-	if commit == "" {
-		commit = currentHEAD()
-	}
-	if commit == "" {
-		return fmt.Errorf("no commit given and not in a git repo")
-	}
-	c, err := newWPClient()
-	if err != nil {
-		return err
-	}
-	id, err := c.repoID()
-	if err != nil {
-		return err
-	}
-	timeout := 20 * time.Minute
-	deadline := time.Now().Add(timeout)
-	last := ""
-	for time.Now().Before(deadline) {
-		p, err := c.findPipeline(id, commit)
-		if err != nil {
-			if last != "waiting" {
-				fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
-				last = "waiting"
-			}
-		} else {
-			if p.Status != last {
-				fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
-				last = p.Status
-			}
-			if isTerminalStatus(p.Status) {
-				fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
-				if isFailureStatus(p.Status) {
-					return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
-				}
-				return nil
-			}
-		}
-		time.Sleep(15 * time.Second)
-	}
-	return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
-}
--- a/cli/cmd_claim.go
+++ b/cli/cmd_claim.go
@ -1,56 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"strings"
-)
-
-func claimCommands() []Command {
-	return []Command{
-		{Path: []string{"claim"}, Tier: TierWrite,
-			Summary: "claim a shared infra resource on the presence board",
-			Run:     runClaim},
-		{Path: []string{"release"}, Tier: TierWrite,
-			Summary: "release a presence claim",
-			Run:     runRelease},
-	}
-}
-
-// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
-// script takes the label first, so we can't rely on Go's flag package which
-// stops at the first positional).
-func runClaim(args []string) error {
-	var label, purpose string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--purpose" || a == "-purpose":
-			if i+1 < len(args) {
-				purpose = args[i+1]
-				i++
-			}
-		case strings.HasPrefix(a, "--purpose="):
-			purpose = strings.TrimPrefix(a, "--purpose=")
-		case !strings.HasPrefix(a, "-") && label == "":
-			label = a
-		}
-	}
-	if label == "" {
-		return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
-	}
-	return presenceClaim(label, purpose)
-}
-
-func runRelease(args []string) error {
-	var label string
-	for _, a := range args {
-		if !strings.HasPrefix(a, "-") {
-			label = a
-			break
-		}
-	}
-	if label == "" {
-		return fmt.Errorf("usage: homelab release <kind>:<name>")
-	}
-	return presenceRelease(label)
-}
--- a/cli/cmd_deploy.go
+++ b/cli/cmd_deploy.go
@ -1,51 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"strings"
-	"time"
-)
-
-func deployCommands() []Command {
-	return []Command{
-		{Path: []string{"deploy", "wait"}, Tier: TierRead,
-			Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
-	}
-}
-
-// deployWait closes the "did the NEW code land" gap: rollout status alone returns
-// success on the OLD ReplicaSet, so we first wait for the deployment image to
-// reference the expected sha, THEN block on rollout status.
-func deployWait(args []string) error {
-	target, _ := firstPositional(args)
-	if target == "" || !strings.Contains(target, "/") {
-		return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
-	}
-	parts := strings.SplitN(target, "/", 2)
-	ns, deploy := parts[0], parts[1]
-
-	sha := flagValue(args, "--sha")
-	if sha == "" {
-		sha = short(currentHEAD())
-	}
-	deadline := time.Now().Add(10 * time.Minute)
-
-	if sha != "" {
-		fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
-		matched := false
-		for time.Now().Before(deadline) {
-			img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
-			if strings.Contains(img, sha) {
-				matched = true
-				break
-			}
-			time.Sleep(10 * time.Second)
-		}
-		if !matched {
-			return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
-		}
-	}
-	fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
-	return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
-}
--- a/cli/cmd_edges.go
+++ b/cli/cmd_edges.go
@ -1,69 +0,0 @@
-package main
-
-import "fmt"
-
-func edgesCommands() []Command {
-	return []Command{
-		{Path: []string{"edges"}, Tier: TierRead,
-			Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
-			Run:     edgesRun},
-	}
-}
-
-// edgesRun renders the filter flags to SQL and runs it read-only against the
-// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
-func edgesRun(args []string) error {
-	for _, a := range args {
-		if a == "-h" || a == "--help" {
-			fmt.Print(edgesUsage())
-			return nil
-		}
-	}
-	o, err := parseEdgesArgs(args)
-	if err != nil {
-		return fmt.Errorf("%w\n\n%s", err, edgesUsage())
-	}
-	sql, err := buildEdgesQuery(o)
-	if err != nil {
-		return err
-	}
-	// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
-	pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
-		"-o", "jsonpath={.items[0].metadata.name}")
-	if err != nil || pod == "" {
-		return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
-	}
-	exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
-	if o.asJSON {
-		exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
-	} else {
-		exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
-	}
-	return kubectlStream("dbaas", exec...)
-}
-
-func edgesUsage() string {
-	return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
-
-Usage: homelab edges [filters]
-
-Filters (AND-combined; namespace values are validated to the k8s name charset):
-  --ns NAME         edges touching NAME (either direction)
-  --src NAME        edges where source namespace = NAME
-  --dst NAME        edges where destination namespace = NAME
-  --peers-of NAME   distinct peer namespaces of NAME (both directions)
-  --new-since SPEC  first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
-  --denied          only denied (action='deny') edges — blocked / lateral-movement attempts
-  --json            output a JSON array (for agents/pipelines)
-  --limit N         cap rows (default 200)
-
-Examples:
-  homelab edges --ns immich                # everything immich talks to / is talked to by
-  homelab edges --peers-of authentik       # authentik's peer namespaces
-  homelab edges --src recruiter-responder  # that namespace's egress peers
-  homelab edges --new-since 24h            # edges first seen in the last day
-  homelab edges --denied --json            # blocked flows, machine-readable
-
-Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
-`
-}
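Review note on the removed `edges` usage text: it promises that namespace values are validated against the k8s name charset and that `--new-since` accepts a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD). `parseEdgesArgs` and `buildEdgesQuery` are not part of this hunk, so the following is only a sketch of how such validation could look. `validNamespace`, `parseSince`, and `demoNow` are assumed names, and the day-suffix handling is an assumption needed because Go's `time.ParseDuration` has no `d` unit.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// nsRe matches an RFC 1123 DNS label, the form Kubernetes requires for
// namespace names (lowercase alphanumerics and '-', alphanumeric at both ends).
var nsRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace reports whether s is a plausible namespace name and therefore
// safe to use as a filter value. Hypothetical helper.
func validNamespace(s string) bool {
	return len(s) <= 63 && nsRe.MatchString(s)
}

// parseSince turns a --new-since SPEC into a cutoff time relative to now.
// Accepted forms: whole days (7d), Go durations (30m, 24h, 90s), or YYYY-MM-DD.
func parseSince(spec string, now time.Time) (time.Time, error) {
	if strings.HasSuffix(spec, "d") {
		n, err := strconv.Atoi(strings.TrimSuffix(spec, "d"))
		if err != nil || n < 0 {
			return time.Time{}, fmt.Errorf("bad day count in %q", spec)
		}
		return now.Add(-time.Duration(n) * 24 * time.Hour), nil
	}
	if d, err := time.ParseDuration(spec); err == nil && d >= 0 {
		return now.Add(-d), nil
	}
	if t, err := time.Parse("2006-01-02", spec); err == nil {
		return t, nil
	}
	return time.Time{}, fmt.Errorf("bad --new-since %q: want a duration or YYYY-MM-DD", spec)
}

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2026, 6, 17, 12, 0, 0, 0, time.UTC)

func main() {
	for _, spec := range []string{"24h", "7d", "2026-06-01", "bogus"} {
		cutoff, err := parseSince(spec, demoNow)
		fmt.Println(spec, cutoff.Format(time.RFC3339), err)
	}
	fmt.Println(validNamespace("immich"), validNamespace("x'; DROP TABLE"))
}
```

Rejecting anything outside the label charset before it reaches the SQL builder is what makes it reasonable to interpolate a namespace into a read-only query at all.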
--- a/cli/cmd_ha.go
+++ b/cli/cmd_ha.go
@ -1,172 +0,0 @@
-package main
-
-import (
-	"encoding/base64"
-	"fmt"
-	"os"
-	"path/filepath"
-	"strings"
-)
-
-// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
-// the long-lived API token out of the cluster, and SSH to the HA host for
-// host-level work (config files, docker, add-ons). Entity state/control stays
-// with the MCP — see docs/adr/0012.
-//
-// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
-// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
-// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
-// `ha token` resolves it on demand via the ambient kubeconfig, so it never
-// depends on a pre-set env var (the gap that made agents re-derive the
-// kubectl|base64|jq pipeline every session).
-
-type haInstance struct {
-	name      string // sofia | london
-	sshUser   string // SSH login on the HA host
-	sshHost   string // host reachable from the devvm (Sofia LAN)
-	secretKey string // key inside the openclaw/ha-tokens Secret holding this token
-}
-
-const (
-	haDefaultInstance = "sofia"
-	haSecretNamespace = "openclaw"
-	haSecretName      = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
-)
-
-// haInstances maps instance name → connection/secret facts. sofia is the default
-// because the devvm is on the Sofia LAN; london is documented but its host
-// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
-// generally won't connect from here (token resolution still works).
-var haInstances = map[string]haInstance{
-	"sofia":  {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
-	"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
-}
-
-func haCommands() []Command {
-	return []Command{
-		{Path: []string{"ha", "token"}, Tier: TierRead,
-			Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
-		{Path: []string{"ha", "ssh"}, Tier: TierWrite,
-			Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
-	}
-}
-
-// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
-func resolveHAInstance(name string) (haInstance, error) {
-	if name == "" {
-		name = haDefaultInstance
-	}
-	inst, ok := haInstances[name]
-	if !ok {
-		return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
-	}
-	return inst, nil
-}
-
-// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
-// by kubectl jsonpath (trailing whitespace tolerated).
-func decodeSecretValue(b64 string) (string, error) {
-	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
-	if err != nil {
-		return "", fmt.Errorf("base64-decode secret value: %w", err)
-	}
-	return string(raw), nil
-}
-
-func haToken(args []string) error {
-	name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
-	for i := 0; i < len(args); i++ {
-		if args[i] == "--instance" && i+1 < len(args) {
-			name = args[i+1]
-		} else if strings.HasPrefix(args[i], "--instance=") {
-			name = strings.TrimPrefix(args[i], "--instance=")
-		}
-	}
-	inst, err := resolveHAInstance(name)
-	if err != nil {
-		return err
-	}
-	b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
-		"-o", "jsonpath={.data."+inst.secretKey+"}")
-	if err != nil {
-		return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
-	}
-	if b64 == "" {
-		return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
-	}
-	tok, err := decodeSecretValue(b64)
-	if err != nil {
-		return err
-	}
-	fmt.Println(tok)
-	return nil
-}
-
-// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
-// rather than tied to whoever first wrote the workflow.
-func defaultHAKeyPath() string {
-	if home, err := os.UserHomeDir(); err == nil && home != "" {
-		return filepath.Join(home, ".ssh", "id_ed25519")
-	}
-	return filepath.Join("~", ".ssh", "id_ed25519")
-}
-
-// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
-// `--` are taken verbatim; bare tokens before it are also the remote command.
-func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
-	name := haDefaultInstance
-	keyPath = defaultHAKeyPath()
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--":
-			remote = append(remote, args[i+1:]...)
-			i = len(args)
-		case a == "--instance":
-			if i+1 < len(args) {
-				name = args[i+1]
-				i++
-			}
-		case strings.HasPrefix(a, "--instance="):
-			name = strings.TrimPrefix(a, "--instance=")
-		case a == "--key" || a == "-i":
-			if i+1 < len(args) {
-				keyPath = args[i+1]
-				i++
-			}
-		case strings.HasPrefix(a, "--key="):
-			keyPath = strings.TrimPrefix(a, "--key=")
-		default:
-			remote = append(remote, a)
-		}
-	}
-	inst, err = resolveHAInstance(name)
-	return inst, keyPath, remote, err
-}
-
-// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
-// key, no user ssh config, and no known_hosts prompt/record — so it runs
-// unattended in an agent session without hanging on a host-key prompt.
-func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
-	args := []string{
-		"-F", "/dev/null",
-		"-o", "IdentityFile=" + keyPath,
-		"-o", "StrictHostKeyChecking=no",
-		"-o", "UserKnownHostsFile=/dev/null",
-		"-o", "ConnectTimeout=10",
-		"-o", "BatchMode=yes",
-		inst.sshUser + "@" + inst.sshHost,
-	}
-	return append(args, remote...)
-}
-
-func haSSH(args []string) error {
-	inst, keyPath, remote, err := parseHASSH(args)
-	if err != nil {
-		return err
-	}
-	if len(remote) == 0 {
-		return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
-	}
-	return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
-}
--- a/cli/cmd_ha_test.go
+++ b/cli/cmd_ha_test.go
@ -1,92 +0,0 @@
-package main
-
-import (
-	"encoding/base64"
-	"reflect"
-	"strings"
-	"testing"
-)
-
-func TestResolveHAInstance(t *testing.T) {
-	// empty defaults to sofia (the devvm sits on the Sofia LAN)
-	if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
-		t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
-	}
-	if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
-		t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
-	}
-	if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
-		t.Fatalf("london = %+v, %v", got, err)
-	}
-	if _, err := resolveHAInstance("paris"); err == nil {
-		t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
-	}
-}
-
-func TestDecodeSecretValue(t *testing.T) {
-	// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
-	// returns that base64, which decodeSecretValue turns back into the raw token.
-	enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
-	if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
-		t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
-	}
-	// trailing whitespace/newline from jsonpath output must be tolerated
-	if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
-		t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
-	}
-	if _, err := decodeSecretValue("not-base64!!"); err == nil {
-		t.Fatalf("decodeSecretValue should error on undecodable base64")
-	}
-}
-
-func TestBuildHASSHArgs(t *testing.T) {
-	inst, _ := resolveHAInstance("sofia")
-	got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
-	want := []string{
-		"-F", "/dev/null",
-		"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
-		"-o", "StrictHostKeyChecking=no",
-		"-o", "UserKnownHostsFile=/dev/null",
-		"-o", "ConnectTimeout=10",
-		"-o", "BatchMode=yes",
-		"vbarzin@192.168.1.8",
-		"cat", "/config/configuration.yaml",
-	}
-	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
-	}
-}
-
-func TestParseHASSH(t *testing.T) {
-	// instance flag + everything after `--` is the verbatim remote command
-	inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
-	if err != nil {
-		t.Fatalf("parseHASSH err: %v", err)
-	}
-	if inst.name != "sofia" {
-		t.Errorf("instance = %q, want sofia", inst.name)
-	}
-	if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
-		t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
-	}
-	if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
-		t.Errorf("remote = %v, want [docker ps -a]", remote)
-	}
-
-	// bare args (no `--`) are also taken as the remote command; -i overrides the key
-	_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
-	if err != nil {
-		t.Fatalf("parseHASSH err: %v", err)
-	}
-	if key2 != "/tmp/k" {
-		t.Errorf("key = %q, want /tmp/k", key2)
-	}
-	if !reflect.DeepEqual(remote2, []string{"uptime"}) {
-		t.Errorf("remote = %v, want [uptime]", remote2)
-	}
-
-	// unknown instance surfaces as an error
-	if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
-		t.Errorf("parseHASSH should error on unknown instance")
-	}
-}
--- a/cli/cmd_k8s.go
+++ b/cli/cmd_k8s.go
@ -1,288 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"strings"
-)
-
-func k8sCommands() []Command {
-	return []Command{
-		{Path: []string{"k8s", "status"}, Tier: TierRead,
-			Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
-		{Path: []string{"k8s", "get"}, Tier: TierRead,
-			Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
-		{Path: []string{"k8s", "logs"}, Tier: TierRead,
-			Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
-		{Path: []string{"k8s", "describe"}, Tier: TierRead,
-			Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
-		{Path: []string{"k8s", "debug"}, Tier: TierRead,
-			Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
-		{Path: []string{"k8s", "pf"}, Tier: TierRead,
-			Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
-		{Path: []string{"k8s", "db"}, Tier: TierWrite,
-			Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
-		{Path: []string{"k8s", "exec"}, Tier: TierWrite,
-			Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
-		{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
-			Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
-		{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
-			Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
-		{Path: []string{"k8s", "restart"}, Tier: TierWrite,
-			Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
-		{Path: []string{"k8s", "probe"}, Tier: TierRead,
-			Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
-	}
-}
-
-func k8sStatus(args []string) error {
-	t := parseK8sTarget(args)
-	ns := t.namespace() // "" when no app/ns given → cluster-wide
-	get := []string{"get", "pods", "-o", "wide"}
-	ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
-	if ns == "" {
-		get = append(get, "-A")
-		ev = append(ev, "-A")
-	}
-	if err := kubectlStream(ns, get...); err != nil {
-		return err
-	}
-	fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
-	_ = kubectlStream(ns, ev...) // best-effort
-	return nil
-}
-
-func k8sGet(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" || len(t.rest) == 0 {
-		return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
-	}
-	return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
-}
-
-func k8sLogs(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
-	}
-	a := []string{"logs"}
-	if t.selector != "" {
-		a = append(a, "-l", t.selector)
-	} else {
-		a = append(a, t.objectRef())
-	}
-	if t.container != "" {
-		a = append(a, "-c", t.container)
-	}
-	if !containsPrefix(t.rest, "--tail") {
-		a = append(a, "--tail=200")
-	}
-	a = append(a, t.rest...)
-	return kubectlStream(t.namespace(), a...)
-}
-
-func k8sDescribe(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
-	}
-	if len(t.rest) > 0 {
-		return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
-	}
-	return kubectlStream(t.namespace(), "describe", t.objectRef())
-}
-
-func k8sDebug(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s debug <app>")
-	}
-	ns := t.namespace()
-	sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
-	sec("pods")
-	_ = kubectlStream(ns, "get", "pods", "-o", "wide")
-	sec("workloads")
-	_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
-	sec("describe "+t.objectRef())
-	_ = kubectlStream(ns, "describe", t.objectRef())
-	sec("recent logs (--tail=50)")
-	_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
-	sec("events (type!=Normal)")
-	_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
-	return nil
-}
-
-func k8sPortForward(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" || len(t.rest) == 0 {
-		return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
-	}
-	ports := t.rest[0]
-	target := "svc/" + t.app
-	if len(t.rest) > 1 {
-		target = t.rest[1]
-	}
-	return kubectlStream(t.namespace(), "port-forward", target, ports)
-}
-
-func k8sDB(args []string) error {
-	var app, dbName, sql string
-	mysql := false
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		if a == "--" {
-			sql = strings.Join(args[i+1:], " ")
-			break
-		}
-		switch {
-		case a == "--mysql":
-			mysql = true
-		case a == "--db":
-			if i+1 < len(args) {
-				dbName = args[i+1]
-				i++
-			}
-		case strings.HasPrefix(a, "--db="):
-			dbName = strings.TrimPrefix(a, "--db=")
-		case !strings.HasPrefix(a, "-") && app == "":
-			app = a
-		}
-	}
-	if app == "" {
-		return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
-	}
-	p := planDBExec(app, dbName, sql, mysql)
-	pod := p.pod
-	if pod == "" && p.selector != "" {
-		resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
-		if err != nil || resolved == "" {
-			return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
-		}
-		pod = resolved
-	}
-	exec := []string{"exec"}
-	if sql == "" {
-		exec = append(exec, "-it") // interactive client when no SQL given
-	}
-	exec = append(exec, pod)
-	if p.container != "" {
-		exec = append(exec, "-c", p.container)
-	}
-	exec = append(exec, "--")
-	exec = append(exec, p.argv...)
-	return kubectlStream(p.ns, exec...)
-}
-
-func k8sExec(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
-	}
-	if len(t.rest) == 0 {
-		return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
-	}
-	a := []string{"exec"}
-	if t.tty {
-		a = append(a, "-it")
-	}
-	a = append(a, t.objectRef())
-	if t.container != "" {
-		a = append(a, "-c", t.container)
-	}
-	a = append(a, "--")
-	a = append(a, t.rest...)
-	return kubectlStream(t.namespace(), a...)
-}
-
-func k8sRmPod(args []string) error {
-	var pod, ns, grace string
-	force, job := false, false
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "-n" || a == "--namespace":
-			if i+1 < len(args) {
-				ns = args[i+1]
-				i++
-			}
-		case a == "--force":
-			force = true
-		case a == "--job":
-			job = true
-		case a == "--grace":
-			if i+1 < len(args) {
-				grace = args[i+1]
-				i++
-			}
-		case !strings.HasPrefix(a, "-") && pod == "":
-			pod = a
-		}
-	}
-	if pod == "" || ns == "" {
-		return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
-	}
-	kind := "pod"
-	if job {
-		kind = "job"
-	}
-	a := []string{"delete", kind, pod}
-	if grace != "" {
-		a = append(a, "--grace-period="+grace)
-	}
-	if force {
-		a = append(a, "--force")
-	}
-	return kubectlStream(ns, a...)
-}
-
-func k8sRolloutStatus(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s rollout-status <app>")
-	}
-	return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
-}
-
-func k8sRestart(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s restart <app>")
-	}
-	ns := t.namespace()
-	if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
-		return err
-	}
-	return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
-}
-
-func k8sProbe(args []string) error {
-	t := parseK8sTarget(args)
-	if t.app == "" {
-		return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
-	}
-	ns := t.namespace()
-	url := "http://" + t.app + "." + ns + ".svc.cluster.local"
-	if port := flagValue(args, "--port"); port != "" {
-		url += ":" + port
-	}
-	if len(t.rest) > 0 {
-		p := t.rest[0]
-		if !strings.HasPrefix(p, "/") {
-			p = "/" + p
-		}
-		url += p
-	}
-	return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
-		"--image=curlimages/curl:latest", "--",
-		"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
-}
-
-// containsPrefix reports whether any arg starts with prefix.
-func containsPrefix(args []string, prefix string) bool {
-	for _, a := range args {
-		if strings.HasPrefix(a, prefix) {
-			return true
-		}
-	}
-	return false
-}
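Review note on the removed `k8s probe` verb: it curls `<app>.<ns>.svc.cluster.local` from an ephemeral pod, with an optional port and path. The URL assembly can be isolated as a pure function, which makes it easy to test without a cluster. `serviceURL` is a hypothetical name, and the port and path in the usage are made-up values for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceURL builds the in-cluster HTTP URL for a Service named app in
// namespace ns. port and path are optional; a path without a leading slash
// gets one. Hypothetical helper mirroring the removed k8sProbe logic.
func serviceURL(app, ns, port, path string) string {
	u := "http://" + app + "." + ns + ".svc.cluster.local"
	if port != "" {
		u += ":" + port
	}
	if path != "" {
		if !strings.HasPrefix(path, "/") {
			path = "/" + path
		}
		u += path
	}
	return u
}

func main() {
	fmt.Println(serviceURL("immich", "immich", "", ""))
	fmt.Println(serviceURL("immich", "immich", "8080", "healthz"))
}
```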
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@ -1,308 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"net/url"
-	"strings"
-)
-
-func memoryCommands() []Command {
-	return []Command{
-		{Path: []string{"memory", "recall"}, Tier: TierRead,
-			Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
-		{Path: []string{"memory", "list"}, Tier: TierRead,
-			Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
-		{Path: []string{"memory", "categories"}, Tier: TierRead,
-			Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
-		{Path: []string{"memory", "tags"}, Tier: TierRead,
-			Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
-		{Path: []string{"memory", "stats"}, Tier: TierRead,
-			Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
-		{Path: []string{"memory", "secret"}, Tier: TierRead,
-			Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
-		{Path: []string{"memory", "store"}, Tier: TierWrite,
-			Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
-		{Path: []string{"memory", "update"}, Tier: TierWrite,
-			Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
-		{Path: []string{"memory", "delete"}, Tier: TierWrite,
-			Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
-	}
-}
-
-// printMemories renders a {memories:[…]} response as one line per memory, or raw JSON.
-func printMemories(raw []byte, jsonOut bool) error {
-	fmt.Print(renderMemories(raw, jsonOut))
-	return nil
-}
-
-// renderMemories formats each memory as a single line with its FULL content
-// (newlines flattened to spaces). Content is deliberately never truncated: the
-// old 240-rune preview cut memories mid-sentence, misled agents into believing
-// no full-content read-back existed, and made blind `update --content` from
-// the preview silently destroy the stored tail. Full passthrough also can't
-// produce invalid UTF-8 (the old mid-rune cut crashed the recall hook).
-func renderMemories(raw []byte, jsonOut bool) string {
-	if jsonOut {
-		return string(raw) + "\n"
-	}
-	var r struct {
-		Memories []struct {
-			ID         int     `json:"id"`
-			Content    string  `json:"content"`
-			Category   string  `json:"category"`
-			Tags       string  `json:"tags"`
-			Importance float64 `json:"importance"`
-		} `json:"memories"`
-	}
-	if err := json.Unmarshal(raw, &r); err != nil {
-		return string(raw) + "\n"
-	}
-	if len(r.Memories) == 0 {
-		return "(no memories)\n"
-	}
-	var b strings.Builder
-	for _, m := range r.Memories {
-		c := strings.ReplaceAll(m.Content, "\n", " ")
-		fmt.Fprintf(&b, "#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
-		if m.Tags != "" {
-			fmt.Fprintf(&b, "       tags: %s\n", m.Tags)
-		}
-	}
-	return b.String()
-}
-
-func memoryRecall(args []string) error {
-	req := memRecallReq{}
-	jsonOut := false
-	var pos []string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--query":
-			if i+1 < len(args) {
-				req.ExpandedQuery = args[i+1]
-				i++
-			}
-		case a == "--category":
-			if i+1 < len(args) {
-				req.Category = args[i+1]
-				i++
-			}
-		case a == "--sort":
-			if i+1 < len(args) {
-				req.SortBy = args[i+1]
-				i++
-			}
-		case a == "--limit":
-			if i+1 < len(args) {
-				fmt.Sscanf(args[i+1], "%d", &req.Limit)
-				i++
-			}
-		case a == "--json":
-			jsonOut = true
-		case !strings.HasPrefix(a, "-"):
-			pos = append(pos, a)
-		}
-	}
-	req.Context = strings.Join(pos, " ")
-	if req.Context == "" {
-		return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	raw, err := c.do("POST", "/api/memories/recall", req)
-	if err != nil {
-		return err
-	}
-	return printMemories(raw, jsonOut)
-}
-
-func memoryList(args []string) error {
-	q := url.Values{}
-	jsonOut := false
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--category":
-			if i+1 < len(args) {
-				q.Set("category", args[i+1])
-				i++
-			}
-		case a == "--tag":
-			if i+1 < len(args) {
-				q.Set("tag", args[i+1])
-				i++
-			}
-		case a == "--limit":
-			if i+1 < len(args) {
-				q.Set("limit", args[i+1])
-				i++
-			}
-		case a == "--json":
-			jsonOut = true
-		}
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	path := "/api/memories"
-	if len(q) > 0 {
-		path += "?" + q.Encode()
-	}
-	raw, err := c.do("GET", path, nil)
-	if err != nil {
-		return err
-	}
-	return printMemories(raw, jsonOut)
-}
-
-func memorySimpleGet(path string) func([]string) error {
-	return func(args []string) error {
-		c, err := newMemoryClient()
-		if err != nil {
-			return err
-		}
-		raw, err := c.do("GET", path, nil)
-		if err != nil {
-			return err
-		}
-		fmt.Println(string(raw))
-		return nil
-	}
-}
-
-func memorySecret(args []string) error {
-	id, _ := firstPositional(args)
-	if id == "" {
-		return fmt.Errorf("usage: homelab memory secret <id>")
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	raw, err := c.do("POST", "/api/memories/"+id+"/secret", nil)
-	if err != nil {
-		return err
-	}
-	fmt.Println(string(raw))
-	return nil
-}
-
-func memoryStore(args []string) error {
-	req := memStoreReq{Category: "facts", Importance: 0.5}
-	var pos []string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--category":
-			if i+1 < len(args) {
-				req.Category = args[i+1]
-				i++
-			}
-		case a == "--tags":
-			if i+1 < len(args) {
-				req.Tags = args[i+1]
-				i++
-			}
-		case a == "--keywords":
-			if i+1 < len(args) {
-				req.ExpandedKeywords = args[i+1]
-				i++
-			}
-		case a == "--importance":
-			if i+1 < len(args) {
-				fmt.Sscanf(args[i+1], "%f", &req.Importance)
-				i++
-			}
-		case a == "--sensitive":
-			req.ForceSensitive = true
-		case !strings.HasPrefix(a, "-"):
-			pos = append(pos, a)
-		}
-	}
-	req.Content = strings.Join(pos, " ")
-	if req.Content == "" {
-		return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	raw, err := c.do("POST", "/api/memories", req)
-	if err != nil {
-		return err
-	}
-	fmt.Println(string(raw))
-	return nil
-}
-
-func memoryUpdate(args []string) error {
-	var id string
-	req := memUpdateReq{}
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--content":
-			if i+1 < len(args) {
-				v := args[i+1]
-				req.Content = &v
-				i++
-			}
-		case a == "--tags":
-			if i+1 < len(args) {
-				v := args[i+1]
-				req.Tags = &v
-				i++
-			}
-		case a == "--keywords":
-			if i+1 < len(args) {
-				v := args[i+1]
-				req.ExpandedKeywords = &v
-				i++
-			}
-		case a == "--importance":
-			if i+1 < len(args) {
-				var f float64
-				fmt.Sscanf(args[i+1], "%f", &f)
-				req.Importance = &f
-				i++
-			}
-		case !strings.HasPrefix(a, "-") && id == "":
-			id = a
-		}
-	}
-	if id == "" {
-		return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	raw, err := c.do("PUT", "/api/memories/"+id, req)
-	if err != nil {
-		return err
-	}
-	fmt.Println(string(raw))
-	return nil
-}
-
-func memoryDelete(args []string) error {
-	id, _ := firstPositional(args)
-	if id == "" {
-		return fmt.Errorf("usage: homelab memory delete <id>")
-	}
-	c, err := newMemoryClient()
-	if err != nil {
-		return err
-	}
-	raw, err := c.do("DELETE", "/api/memories/"+id, nil)
-	if err != nil {
-		return err
-	}
-	fmt.Println(string(raw))
-	return nil
-}
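Review note on the removed `renderMemories` comment: it says the old preview cut memories mid-rune and produced invalid UTF-8. A short sketch of why that happens, assuming the old cut was a byte slice (the removed file does not show the old code). `byteCut`, `runeCut`, `flatten`, and `isValid` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// byteCut truncates s to at most n bytes. It can split a multi-byte rune,
// leaving a string that is not valid UTF-8.
func byteCut(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n]
}

// runeCut truncates s to at most n runes and never splits a rune.
func runeCut(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n])
}

// flatten mirrors the single-line rendering in the removed code:
// newlines become spaces and nothing is truncated.
func flatten(s string) string {
	return strings.ReplaceAll(s, "\n", " ")
}

// isValid reports whether s is well-formed UTF-8.
func isValid(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	s := "здравей\nсвят" // Cyrillic letters are two bytes each in UTF-8
	fmt.Println("byte cut valid:", isValid(byteCut(s, 3)))
	fmt.Println("rune cut valid:", isValid(runeCut(s, 3)))
	fmt.Println("flattened:", flatten(s))
}
```

Passing the content through untruncated, as the removed function does, avoids the problem entirely; a rune-based cut is only needed if a preview is ever reintroduced.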
--- a/cli/cmd_net.go
+++ b/cli/cmd_net.go
@ -1,83 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"strings"
-	"time"
-)
-
-func netCommands() []Command {
-	return []Command{
-		{Path: []string{"net", "check"}, Tier: TierRead,
-			Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
-		{Path: []string{"dns", "lookup"}, Tier: TierRead,
-			Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
-	}
-}
-
-func fmtProbe(code int, d time.Duration, err error) string {
-	if err != nil {
-		return "ERR " + err.Error()
-	}
-	return fmt.Sprintf("HTTP %d  %dms", code, d.Milliseconds())
-}
-
-func netCheck(args []string) error {
-	host, rest := firstPositional(args)
-	if host == "" {
-		return fmt.Errorf("usage: homelab net check <host> [path]")
-	}
-	path := "/"
-	if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
-		path = rest[0]
-		if !strings.HasPrefix(path, "/") {
-			path = "/" + path
-		}
-	}
-	u := "https://" + host + path
-	fmt.Printf("%s\n", u)
-
-	// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
-	pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
-	if pubIP := firstLine(pubOut); pubIP != "" {
-		c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
-		fmt.Printf("  external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
-	} else {
-		fmt.Println("  external (public)            no public A record")
-	}
-	// internal leg: dial the Traefik LB directly
-	c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
-	fmt.Printf("  internal (LB %-15s)     %s\n", internalLBIP, fmtProbe(c, d, e))
-	return nil
-}
-
-func dnsLookup(args []string) error {
-	name, rest := firstPositional(args)
-	if name == "" {
-		return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
-	}
-	rr := ""
-	if len(rest) > 0 {
-		rr = rest[0]
-	}
-	tech, _ := dig(name, "10.0.20.201", rr)
-	pub, _ := dig(name, "1.1.1.1", rr)
-	fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
-	fmt.Printf("public     (1.1.1.1)    : %s\n", oneLineList(pub))
-	if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
-		fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
-	}
-	return nil
-}
-
-func hostOnly(h string) string { // strip any path accidentally included
-	return strings.SplitN(h, "/", 2)[0]
-}
-
-func oneLineList(s string) string {
-	s = strings.TrimSpace(s)
-	if s == "" {
-		return "(none)"
-	}
-	return strings.ReplaceAll(s, "\n", ", ")
-}
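Review note on the removed `net check` verb: both legs request the same `https://host/path` URL but dial a chosen IP (the public A record, then the Traefik LB). `clientDialingIP` is defined outside this hunk, so the following is only a sketch of one way to get that behavior with the standard library. `pinnedClient` and `rewriteAddr` are hypothetical names, and `192.0.2.10` is a documentation-range placeholder, not the real LB address.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// rewriteAddr replaces the host part of a "host:port" dial address with ip,
// keeping the port the transport asked for.
func rewriteAddr(addr, ip string) (string, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(ip, port), nil
}

// pinnedClient returns an HTTP client that always connects to ip, whatever
// hostname the URL carries. The transport still derives the Host header and
// the TLS server name from the URL, so virtual hosting and certificate checks
// behave as they would for a normally resolved request.
func pinnedClient(ip string, timeout time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: timeout}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			pinned, err := rewriteAddr(addr, ip)
			if err != nil {
				return nil, err
			}
			return dialer.DialContext(ctx, network, pinned)
		},
	}
	return &http.Client{Transport: tr, Timeout: timeout}
}

func main() {
	// No request is sent here; this only shows the address rewrite.
	addr, err := rewriteAddr("app.example.com:443", "192.0.2.10")
	fmt.Println(addr, err)
	_ = pinnedClient("192.0.2.10", 10*time.Second)
}
```

Pinning at the dial layer rather than rewriting the URL is the design point: the request stays byte-for-byte what a browser would send, and only the network path differs between the external and internal legs.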
--- a/cli/cmd_obs.go
+++ b/cli/cmd_obs.go
@ -1,197 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"net/url"
-	"sort"
-	"strconv"
-	"strings"
-	"time"
-)
-
-const (
-	promHost = "prometheus-query.viktorbarzin.lan"
-	lokiHost = "loki.viktorbarzin.lan"
-)
-
-func obsCommands() []Command {
-	return []Command{
-		{Path: []string{"metrics", "query"}, Tier: TierRead,
-			Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
-		{Path: []string{"metrics", "alerts"}, Tier: TierRead,
-			Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
-		{Path: []string{"logs", "query"}, Tier: TierRead,
-			Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
-	}
-}
-
-// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
-// passed as a single quoted argument; this also tolerates unquoted multi-token).
-func queryArg(args []string, valueFlags map[string]bool) string {
-	var parts []string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		if valueFlags[a] {
-			i++
-			continue
-		}
-		if strings.HasPrefix(a, "-") {
-			continue
-		}
-		parts = append(parts, a)
-	}
-	return strings.Join(parts, " ")
-}
-
-func labelStr(m map[string]string) string {
-	name := m["__name__"]
-	var kv []string
-	for k, v := range m {
-		if k != "__name__" {
-			kv = append(kv, k+"="+v)
-		}
-	}
-	sort.Strings(kv)
-	return name + "{" + strings.Join(kv, ",") + "}"
-}
-
-func metricsQuery(args []string) error {
-	q := queryArg(args, nil)
-	if q == "" {
-		return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
-	}
-	v := url.Values{}
-	v.Set("query", q)
-	body, err := lbGetBody(promHost, "/api/v1/query", v)
-	if err != nil {
-		return err
-	}
-	if containsArg(args, "--json") {
-		fmt.Println(string(body))
-		return nil
-	}
-	var r struct {
-		Data struct {
-			Result []struct {
-				Metric map[string]string `json:"metric"`
-				Value  []interface{}     `json:"value"`
-			} `json:"result"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal(body, &r); err != nil {
-		fmt.Println(string(body))
-		return nil
-	}
-	if len(r.Data.Result) == 0 {
-		fmt.Println("(no series)")
-		return nil
-	}
-	for _, s := range r.Data.Result {
-		val := ""
-		if len(s.Value) == 2 {
-			val = fmt.Sprint(s.Value[1])
-		}
-		fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
-	}
-	return nil
-}
-
-func metricsAlerts(args []string) error {
-	// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
-	// set is exposed as the synthetic ALERTS series, queryable the normal way.
-	v := url.Values{}
-	v.Set("query", `ALERTS{alertstate="firing"}`)
-	body, err := lbGetBody(promHost, "/api/v1/query", v)
-	if err != nil {
-		return err
-	}
-	if containsArg(args, "--json") {
-		fmt.Println(string(body))
-		return nil
-	}
-	var r struct {
-		Data struct {
-			Result []struct {
-				Metric map[string]string `json:"metric"`
-			} `json:"result"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal(body, &r); err != nil {
-		fmt.Println(string(body))
-		return nil
-	}
-	if len(r.Data.Result) == 0 {
-		fmt.Println("(no firing alerts)")
-		return nil
-	}
-	for _, a := range r.Data.Result {
-		m := a.Metric
-		scope := ""
-		for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
-			if v := m[k]; v != "" {
-				scope = k + "=" + v
-				break
-			}
-		}
-		fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
-	}
-	return nil
-}
-
-func logsQuery(args []string) error {
-	q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
-	if q == "" {
-		return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
-	}
-	since := flagValue(args, "--since")
-	if since == "" {
-		since = "1h"
-	}
-	dur, err := time.ParseDuration(since)
-	if err != nil {
-		return fmt.Errorf("bad --since %q: %w", since, err)
-	}
-	limit := flagValue(args, "--limit")
-	if limit == "" {
-		limit = "100"
-	}
-	end := time.Now()
-	v := url.Values{}
-	v.Set("query", q)
-	v.Set("limit", limit)
-	v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
-	v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
-	body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
-	if err != nil {
-		return err
-	}
-	if containsArg(args, "--json") {
-		fmt.Println(string(body))
-		return nil
-	}
-	var r struct {
-		Data struct {
-			Result []struct {
-				Values [][]string `json:"values"`
-			} `json:"result"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal(body, &r); err != nil {
-		fmt.Println(string(body))
-		return nil
-	}
-	n := 0
-	for _, s := range r.Data.Result {
-		for _, val := range s.Values {
-			if len(val) == 2 {
-				fmt.Println(val[1])
-				n++
-			}
-		}
-	}
-	if n == 0 {
-		fmt.Println("(no log lines)")
-	}
-	return nil
-}
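Review note on the removed cli/cmd_obs.go: the argument handling in queryArg is the least obvious part of this file, so here is a minimal standalone sketch of it. The function body follows the deleted helper; the main function and its sample arguments are illustrative assumptions, not part of the original CLI.

```go
package main

import (
	"fmt"
	"strings"
)

// queryArg mirrors the deleted helper: it joins the non-flag arguments into one
// query string, skipping each value flag together with the value after it.
func queryArg(args []string, valueFlags map[string]bool) string {
	var parts []string
	for i := 0; i < len(args); i++ {
		a := args[i]
		if valueFlags[a] { // reading a nil map is safe and yields false
			i++ // also skip the flag's value
			continue
		}
		if strings.HasPrefix(a, "-") {
			continue // boolean flag such as --json
		}
		parts = append(parts, a)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Sample inputs are made up for illustration.
	valueFlags := map[string]bool{"--since": true, "--limit": true}
	fmt.Println(queryArg([]string{`{job="x"}`, "--since", "2h", "--json"}, valueFlags))
	fmt.Println(queryArg([]string{"up", "==", "0"}, nil))
	// Caveat: any token that starts with "-" is treated as a flag and dropped.
	fmt.Println(queryArg([]string{"up", ">", "-1"}, nil))
}
```

The last call shows the one sharp edge: an unquoted query containing a token such as -1 loses that token, which is why the original comment recommends passing the query as a single quoted argument.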
--- a/cli/cmd_tf.go
+++ b/cli/cmd_tf.go
@ -1,122 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"os/signal"
-	"path/filepath"
-	"strings"
-	"sync"
-	"syscall"
-)
-
-func tfCommands() []Command {
-	return []Command{
-		{Path: []string{"tf", "plan"}, Tier: TierRead,
-			Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
-		{Path: []string{"tf", "validate"}, Tier: TierRead,
-			Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
-		{Path: []string{"tf", "fmt"}, Tier: TierRead,
-			Summary: "terraform fmt a stack's files", Run: tfFmt},
-		{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
-			Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
-		{Path: []string{"tf", "apply"}, Tier: TierWrite,
-			Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
-	}
-}
-
-// firstPositional returns the first non-flag arg and the remaining args with it removed.
-func firstPositional(args []string) (string, []string) {
-	for i, a := range args {
-		if !strings.HasPrefix(a, "-") {
-			rest := append(append([]string{}, args[:i]...), args[i+1:]...)
-			return a, rest
-		}
-	}
-	return "", args
-}
-
-// resolveTfStack finds the infra root (from cwd) and the stack directory named
-// by the first positional arg, returning the remaining args.
-func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
-	stackName, rest = firstPositional(args)
-	if stackName == "" {
-		err = fmt.Errorf("missing <stack> argument")
-		return
-	}
-	cwd, e := os.Getwd()
-	if e != nil {
-		err = e
-		return
-	}
-	infraRoot, err = findInfraRoot(cwd)
-	if err != nil {
-		return
-	}
-	stackDir, err = resolveStack(infraRoot, stackName)
-	return
-}
-
-func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
-
-// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
-func tfPassthrough(verb string) func([]string) error {
-	return func(args []string) error {
-		infraRoot, _, stackDir, rest, err := resolveTfStack(args)
-		if err != nil {
-			return err
-		}
-		return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
-	}
-}
-
-func tfFmt(args []string) error {
-	_, _, stackDir, _, err := resolveTfStack(args)
-	if err != nil {
-		return err
-	}
-	return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
-}
-
-func tfForceUnlock(args []string) error {
-	infraRoot, _, stackDir, rest, err := resolveTfStack(args)
-	if err != nil {
-		return err
-	}
-	if len(rest) < 1 {
-		return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
-	}
-	return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
-}
-
-// tfApply applies a stack out-of-band: claim the stack on the presence board,
-// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
-// and warn that CI applies canonically on push.
-func tfApply(args []string) error {
-	infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
-	if err != nil {
-		return err
-	}
-	label := "stack:" + stackName
-	fmt.Fprintf(os.Stderr,
-		"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
-
-	if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
-		return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
-	}
-	// Release exactly once, whether we exit normally, on error, or on signal —
-	// sync.Once makes the defer and the signal goroutine safe to both call it.
-	var once sync.Once
-	release := func() { once.Do(func() { _ = presenceRelease(label) }) }
-	defer release()
-
-	sig := make(chan os.Signal, 1)
-	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
-	go func() {
-		<-sig
-		release()
-		os.Exit(130)
-	}()
-
-	return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
-}
--- a/cli/cmd_tf_test.go
+++ b/cli/cmd_tf_test.go
@ -1,27 +0,0 @@
-package main
-
-import (
-	"reflect"
-	"testing"
-)
-
-func TestFirstPositional(t *testing.T) {
-	cases := []struct {
-		args     []string
-		wantName string
-		wantRest []string
-	}{
-		{[]string{"vault"}, "vault", []string{}},
-		{[]string{"--json", "vault"}, "vault", []string{"--json"}},
-		{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
-		{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
-		{[]string{"--only-flags"}, "", []string{"--only-flags"}},
-	}
-	for _, c := range cases {
-		gotName, gotRest := firstPositional(c.args)
-		if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
-			t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
-				c.args, gotName, gotRest, c.wantName, c.wantRest)
-		}
-	}
-}
--- a/cli/cmd_usage.go
+++ b/cli/cmd_usage.go
@ -1,77 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"net/url"
-	"sort"
-	"strconv"
-)
-
-func usageCommands() []Command {
-	return []Command{
-		{Path: []string{"usage", "top"}, Tier: TierRead,
-			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
-	}
-}
-
-// usageQuery builds the LogQL metric query that counts invocations per verb.
-func usageQuery(since, user string) string {
-	sel := `job="` + usageJob + `"`
-	if user != "" {
-		sel += `, user="` + user + `"`
-	}
-	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
-}
-
-func usageTop(args []string) error {
-	since := flagValue(args, "--since")
-	if since == "" {
-		since = "30d"
-	}
-	v := url.Values{}
-	v.Set("query", usageQuery(since, flagValue(args, "--user")))
-	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
-	if err != nil {
-		return err
-	}
-	if containsArg(args, "--json") {
-		fmt.Println(string(body))
-		return nil
-	}
-	var r struct {
-		Data struct {
-			Result []struct {
-				Metric map[string]string `json:"metric"`
-				Value  []interface{}     `json:"value"`
-			} `json:"result"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal(body, &r); err != nil {
-		fmt.Println(string(body))
-		return nil
-	}
-	type row struct {
-		verb string
-		n    int
-	}
-	var rows []row
-	for _, s := range r.Data.Result {
-		n := 0
-		if len(s.Value) == 2 {
-			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
-				n = int(f)
-			}
-		}
-		rows = append(rows, row{s.Metric["verb"], n})
-	}
-	if len(rows) == 0 {
-		fmt.Println("(no usage recorded yet)")
-		return nil
-	}
-	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
-	for _, r := range rows {
-		fmt.Printf("%6d  %s\n", r.n, r.verb)
-	}
-	return nil
-}
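Review note on the removed cli/cmd_usage.go: a short sketch of the LogQL string that usageQuery produces may help when reading the deleted code. The constant usageJob is defined outside this diff, so the value used here is a placeholder assumption, as are the sample arguments.

```go
package main

import "fmt"

// usageJob is defined elsewhere in the CLI and is not visible in this diff;
// the value below is a placeholder for illustration only.
const usageJob = "homelab-usage"

// usageQuery follows the deleted helper: it builds a LogQL metric query that
// counts log lines per verb over the given range, optionally for one user.
func usageQuery(since, user string) string {
	sel := `job="` + usageJob + `"`
	if user != "" {
		sel += `, user="` + user + `"`
	}
	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}

func main() {
	fmt.Println(usageQuery("30d", ""))
	fmt.Println(usageQuery("7d", "emo"))
}
```

Note that the user value is interpolated into the selector without escaping, so a value containing a double quote would yield a malformed query.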
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -1,944 +0,0 @@
-package main
-
-import (
-	"bufio"
-	"encoding/base64"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"os"
-	"os/exec"
-	"strings"
-	"syscall"
-)
-
-// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
-// Identity is the kernel UID; per-user creds live in that user's isolated Vault
-// path (secret/workstation/claude-users/<user>) read via their scoped token, and
-// decryption is done by the official `bw` CLI. See
-// docs/runbooks/homelab-vault-onboarding.md.
-func vaultCommands() []Command {
-	cmds := []Command{
-		// Vaultwarden — your personal password manager (logins/passwords/TOTP).
-		{Path: []string{"vault", "setup"}, Tier: TierWrite,
-			Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
-		{Path: []string{"vault", "status"}, Tier: TierRead,
-			Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
-		{Path: []string{"vault", "list"}, Tier: TierRead,
-			Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
-		{Path: []string{"vault", "get"}, Tier: TierRead,
-			Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
-		{Path: []string{"vault", "search"}, Tier: TierRead,
-			Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
-		{Path: []string{"vault", "code"}, Tier: TierRead,
-			Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
-		{Path: []string{"vault", "lock"}, Tier: TierWrite,
-			Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
-		{Path: []string{"vault"}, Tier: TierRead,
-			Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
-			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
-	}
-	// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
-	return append(cmds, vaultKVCommands()...)
-}
-
-// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
-// between the two unrelated "vaults" this command fronts, because the name
-// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
-// infra secrets store).
-func vaultHelp() string {
-	return `homelab vault — two different secret stores under one command:
-
-  • Vaultwarden               your personal PASSWORD MANAGER (logins / passwords / TOTP)
-  • HashiCorp Vault / OpenBao  homelab INFRA secrets (the secret/… KV store)  → 'vault kv …'
-
-── Vaultwarden  (reads YOUR OWN vault; no-HITL after one-time setup) ──
-  homelab vault setup             one-time: store your master password + API key in your Vault path
-  homelab vault status            configured / unlocked / reachable (no secrets)
-  homelab vault list [--search Q] list your item names (no secrets)
-  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
-                                  TTY → clipboard (auto-clears); piped → stdout
-  homelab vault get <name> --all  all fields (incl. custom) as JSON; piped only.
-                                  TOTP shown as presence flag — use 'vault code' for a code.
-  homelab vault code <name>       current TOTP code
-  homelab vault lock              lock / log out the local bw session
-
-── HashiCorp Vault / OpenBao  (infra secrets; uses your own OIDC vault token) ──
-  homelab vault kv get <path> [--field K]   read an infra KV secret
-  homelab vault kv list <path>              list sub-paths
-  homelab vault kv put <path> <key>         write one key (value via stdin)
-
-Vaultwarden creds live only in your own Vault path; the admin never sees them.
-Security model: docs/runbooks/homelab-vault-onboarding.md
-(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
-`
-}
-
-const vwUserPathPrefix = "secret/workstation/claude-users/"
-
-// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
-type vwCreds struct {
-	Email          string
-	MasterPassword string
-	ClientID       string
-	ClientSecret   string
-}
-
-// cmdRunner shells out to an external command with an explicit environment and
-// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
-// a fake; realRunner is the production implementation.
-type cmdRunner func(name string, argv, envv []string) (string, error)
-
-func realRunner(name string, argv, envv []string) (string, error) {
-	cmd := exec.Command(name, argv...)
-	if envv != nil {
-		cmd.Env = envv
-	}
-	out, err := cmd.Output()
-	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
-	// fetched secret with significant leading/trailing spaces is preserved.
-	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
-}
-
-// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
-// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
-// write the actionable message there — "connection refused", "permission
-// denied" — which the caller would otherwise never see behind a bare
-// "exit status N".
-func exitStderr(err error) []byte {
-	var ee *exec.ExitError
-	if errors.As(err, &ee) {
-		return ee.Stderr
-	}
-	return nil
-}
-
-// augmentErr appends captured stderr to an error so failures are diagnosable
-// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
-// when there's no stderr; preserves the wrapped error for errors.Is/As.
-func augmentErr(err error, stderr []byte) error {
-	if err == nil {
-		return nil
-	}
-	if s := strings.TrimSpace(string(stderr)); s != "" {
-		return fmt.Errorf("%w: %s", err, s)
-	}
-	return err
-}
-
-// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
-// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
-// processes). Used by setup to write the master password / client_secret.
-func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
-	cmd := exec.Command(name, argv...)
-	if envv != nil {
-		cmd.Env = envv
-	}
-	cmd.Stdin = strings.NewReader(stdin)
-	out, err := cmd.Output()
-	return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
-}
-
-func vwCredsPath(user string) string { return vwUserPathPrefix + user }
-
-func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
-
-// readVaultField returns one field from a KV-v2 path, "" if absent/error.
-func readVaultField(run cmdRunner, field, path string) string {
-	out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
-	if err != nil {
-		return ""
-	}
-	return out
-}
-
-// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
-// A missing master password means the user hasn't onboarded.
-func loadCreds(run cmdRunner, user string) (vwCreds, error) {
-	p := vwCredsPath(user)
-	c := vwCreds{
-		Email:          readVaultField(run, "vaultwarden_email", p),
-		MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
-		ClientID:       readVaultField(run, "vaultwarden_client_id", p),
-		ClientSecret:   readVaultField(run, "vaultwarden_client_secret", p),
-	}
-	if c.MasterPassword == "" {
-		return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
-	}
-	return c, nil
-}
-
-// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
-var vaultCurrentUser = func() string { return os.Getenv("USER") }
-var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
-
-// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
-// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
-func scopedTokenPath(home string) string {
-	return home + "/.config/claude-auth-sync/vault-token"
-}
-
-// vaultTokenSource decides which Vault token the `vault` child processes should
-// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
-// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
-// (policy workstation-claude-<user>, which grants exactly the create/read/update
-// this tool needs on the user's own path), then a native ~/.vault-token.
-//
-// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
-// caller's own secret/workstation/claude-users/<user> path, and a power-user who
-// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
-// capability on that path is `deny` — letting it win shadows the scoped token
-// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
-// right credential when there is no scoped token (admins). Returns the token to
-// export — "" when the vault CLI should read the ambient/native credential —
-// plus a source tag for tests/logging.
-func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
-	switch {
-	case envToken != "":
-		return "", "env"
-	case strings.TrimSpace(scopedToken) != "":
-		return strings.TrimSpace(scopedToken), "scoped"
-	case haveVaultTokenFile:
-		return "", "file"
-	default:
-		return "", "none"
-	}
-}
-
-// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
-// is likewise hardcoded (openSession), so a sane default here is consistent.
-const vaultAddrDefault = "https://vault.viktorbarzin.me"
-
-// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
-// doesn't already set one, else "". homelab vault is invoked by AFK agent
-// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
-// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
-// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
-// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
-func vaultAddrToSet(envAddr string) string {
-	if strings.TrimSpace(envAddr) == "" {
-		return vaultAddrDefault
-	}
-	return ""
-}
-
-// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
-// child processes reach the cluster Vault regardless of the caller's shell. An
-// explicit VAULT_ADDR (admins, CI) is left untouched.
-func ensureVaultAddr() {
-	if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
-		os.Setenv("VAULT_ADDR", a)
-	}
-}
-
-// fileNonEmpty reports whether path exists and has content.
-func fileNonEmpty(path string) bool {
-	fi, err := os.Stat(path)
-	return err == nil && fi.Size() > 0
-}
-
-// ensureVaultToken wires vaultTokenSource to the real environment: when the user
-// has no ambient Vault credential, it exports the claude-auth-sync scoped token
-// so the `vault` child processes authenticate as workstation-claude-<user>. It
-// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
-// take precedence and are left untouched.
-func ensureVaultToken() {
-	// Every vault verb funnels through here, so this is the one place that also
-	// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
-	// assumed from the caller's shell).
-	ensureVaultAddr()
-	home := os.Getenv("HOME")
-	scoped, _ := os.ReadFile(scopedTokenPath(home))
-	tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
-	if src == "scoped" {
-		os.Setenv("VAULT_TOKEN", tok)
-	}
-}
-
-// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
-// do NOT inherit the full parent env (keeps stray secrets out of the child).
-func bwBaseEnv(appdata string) []string {
-	path := os.Getenv("PATH")
-	if path == "" {
-		path = "/usr/local/bin:/usr/bin:/bin"
-	}
-	return []string{
-		"PATH=" + path,
-		"HOME=" + os.Getenv("HOME"),
-		"BITWARDENCLI_APPDATA_DIR=" + appdata,
-		"BW_NOINTERACTION=true",
-	}
-}
-
-// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
-func bwSecretEnv(appdata string, c vwCreds, session string) []string {
-	env := bwBaseEnv(appdata)
-	env = append(env,
-		"BW_CLIENTID="+c.ClientID,
-		"BW_CLIENTSECRET="+c.ClientSecret,
-		"BW_PASSWORD="+c.MasterPassword,
-	)
-	if session != "" {
-		env = append(env, "BW_SESSION="+session)
-	}
-	return env
-}
-
-func bwLoginArgs() []string                 { return []string{"login", "--apikey"} }
-func bwUnlockArgs() []string                { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
-func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
-func bwItemArgs(name string) []string       { return []string{"get", "item", name} }
-func bwStatusArgs() []string                { return []string{"status"} }
-func bwSyncArgs() []string                  { return []string{"sync"} }
-
-// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
-// required. Unparseable/empty output → true (safer to attempt login).
-func bwNeedsLogin(statusJSON string) bool {
-	var s struct {
-		Status string `json:"status"`
-	}
-	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
-		return true
-	}
-	return s.Status == "unauthenticated" || s.Status == ""
-}
-
-func bwListArgs(search string) []string {
-	a := []string{"list", "items"}
-	if search != "" {
-		a = append(a, "--search", search)
-	}
-	return a
-}
-
-// bwUnlock runs `bw unlock` and returns the raw session key.
-func bwUnlock(run cmdRunner, env []string) (string, error) {
-	out, err := run("bw", bwUnlockArgs(), env)
-	if err != nil {
-		return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
-	}
-	return out, nil
-}
-
-// bwGet fetches one field of one item; session must be present in env.
-func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
-	return run("bw", bwGetArgs(field, name), env)
-}
-
-func returnMode(isTTY bool) string {
-	if isTTY {
-		return "clipboard"
-	}
-	return "stdout"
-}
-
-// stdoutIsTTY reports whether stdout is a character device (a terminal).
-func stdoutIsTTY() bool {
-	fi, err := os.Stdout.Stat()
-	if err != nil {
-		return false
-	}
-	return fi.Mode()&os.ModeCharDevice != 0
-}
-
-// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
-// to stderr, so the clipboard path is only viable when stderr is a terminal).
-func stderrIsTTY() bool {
-	fi, err := os.Stderr.Stat()
-	if err != nil {
-		return false
-	}
-	return fi.Mode()&os.ModeCharDevice != 0
-}
-
-// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
-// the system clipboard (works over SSH; no X11). osc52clear copies empty.
-func osc52(payload string) string {
-	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
-}
-func osc52clear() string { return "\x1b]52;c;\a" }
-
-// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
-// else we'd dump the secret's base64 into scrollback on unsupported terminals.
-func terminalAllowed(term, termProgram string) bool {
-	t := strings.ToLower(term)
-	p := strings.ToLower(termProgram)
-	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
-		if strings.Contains(t, ok) || strings.Contains(p, ok) {
-			return true
-		}
-	}
-	// xterm proper supports it only when the program is a known-good emulator.
-	return false
-}
-
-// opRecord is one CLI operation. ItemName is accepted for the caller's
-// convenience but is INTENTIONALLY never rendered into the log line — auditing
-// which of your own logins you opened is itself sensitive, and per-item reads
-// are invisible server-side anyway (spec §9a).
-type opRecord struct {
-	User       string
-	Verb       string
-	PID        int
-	PPID       int
-	ParentComm string
-	ItemName   string // never logged
-}
-
-func opLogLine(r opRecord) string {
-	return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
-		r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
-}
-
-// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
-func parentComm(ppid int) string {
-	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
-	if err != nil {
-		return ""
-	}
-	return strings.TrimSpace(string(b))
-}
-
-// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
-// never blocks or fails the command). Goes to syslog so it ships to Loki.
-func writeOpLog(r opRecord) {
-	exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
-}
-
-func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
-
-// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
-// password to a core file. Best-effort.
-func hardenProcess() {
-	_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
-}
-
-// withUserLock serializes bw mutations for this user (concurrent Claude sessions
-// as the same user otherwise race bw's appdata). Returns an unlock func.
-func withUserLock(uid string) (func(), error) {
-	f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
-	if err != nil {
-		return nil, err
-	}
-	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
-		f.Close()
-		return nil, err
-	}
-	return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
-}
-
-// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
-type session struct {
-	env []string
-}
-
-// openSession resolves creds, ensures login, unlocks, and returns a ready env.
-// Caller must hold the user lock. appdata is created on tmpfs (0700).
-func openSession(run cmdRunner, user, uid string) (session, error) {
-	creds, err := loadCreds(run, user)
-	if err != nil {
-		return session{}, err
-	}
-	appdata := bwAppDataDir(uid)
-	if err := os.MkdirAll(appdata, 0700); err != nil {
-		return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
-	}
-	loginEnv := bwSecretEnv(appdata, creds, "")
-	// Ensure server is set and we're logged in (idempotent; ignore "already").
-	_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
-	st, _ := run("bw", bwStatusArgs(), loginEnv)
-	if bwNeedsLogin(st) {
-		if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
-			return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
-		}
-	}
-	sess, err := bwUnlock(run, loginEnv)
-	if err != nil {
-		return session{}, err
-	}
-	sessEnv := bwSecretEnv(appdata, creds, sess)
-	// Pull the latest server-side state so reads reflect current values. `bw
-	// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
-	// session would otherwise serve stale data until the next login. Best-effort:
-	// a transient sync failure must not break a read — fall back to the cached
-	// vault and warn (status reports reachability separately).
-	if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
-		fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
-	}
-	return session{env: sessEnv}, nil
-}
-
-type getOpts struct {
-	name  string
-	field string
-	json  bool
-	all   bool // dump every field (incl. custom) as normalized JSON
-}
-
-var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
-
-func parseGetArgs(args []string) (getOpts, error) {
-	o := getOpts{field: "password"}
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--json":
-			o.json = true
-		case a == "--all":
-			o.all = true
-		case a == "--field" && i+1 < len(args):
-			o.field = args[i+1]
-			i++
-		case strings.HasPrefix(a, "--field="):
-			o.field = strings.TrimPrefix(a, "--field=")
-		case !strings.HasPrefix(a, "-") && o.name == "":
-			o.name = a
-		}
-	}
-	if o.name == "" {
-		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
-	}
-	// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
-	if !o.all && !validGetFields[o.field] {
-		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
-	}
-	return o, nil
-}
-
-// getValue opens a session and fetches one field. Pure of I/O side effects
-// besides the runner, so it is unit-tested with a fake runner.
-func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
-	s, err := openSession(run, user, uid)
-	if err != nil {
-		return "", err
-	}
-	return bwGet(run, s.env, o.field, o.name)
-}
-
-// getItem opens a session and returns the whole item as raw `bw get item` JSON.
-// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
-func getItem(run cmdRunner, user, uid, name string) (string, error) {
-	s, err := openSession(run, user, uid)
-	if err != nil {
-		return "", err
-	}
-	return run("bw", bwItemArgs(name), s.env)
-}
-
-// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
-// standard login fields that are present, notes, and a flat map of custom field
-// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
-// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
-// stays the specially-audited `vault code` (see the design §10/§16).
-type normalizedItem struct {
-	Name     string            `json:"name"`
-	Username string            `json:"username,omitempty"`
-	Password string            `json:"password,omitempty"`
-	URIs     []string          `json:"uris,omitempty"`
-	TOTP     bool              `json:"totp,omitempty"` // presence only, never the seed
-	Notes    string            `json:"notes,omitempty"`
-	Fields   map[string]string `json:"fields,omitempty"` // custom field name→value
-}
-
-// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
-// references another field and carries a null value, so it is not real data.
-const bwFieldLinked = 3
-
-// normalizeItem parses a `bw get item` payload into the browse projection. It is
-// pure (no I/O), so it is the unit-tested heart of `get --all`.
-func normalizeItem(raw string) (normalizedItem, error) {
-	var it struct {
-		Name  string `json:"name"`
-		Notes string `json:"notes"`
-		Login *struct {
-			Username string `json:"username"`
-			Password string `json:"password"`
-			Totp     string `json:"totp"`
-			URIs     []struct {
-				URI string `json:"uri"`
-			} `json:"uris"`
-		} `json:"login"`
-		Fields []struct {
-			Name  string `json:"name"`
-			Value string `json:"value"`
-			Type  int    `json:"type"`
-		} `json:"fields"`
-	}
-	if err := json.Unmarshal([]byte(raw), &it); err != nil {
-		return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
-	}
-	n := normalizedItem{Name: it.Name, Notes: it.Notes}
-	if it.Login != nil {
-		n.Username = it.Login.Username
-		n.Password = it.Login.Password
-		n.TOTP = it.Login.Totp != ""
-		for _, u := range it.Login.URIs {
-			if u.URI != "" {
-				n.URIs = append(n.URIs, u.URI)
-			}
-		}
-	}
-	for _, f := range it.Fields {
-		if f.Type == bwFieldLinked {
-			continue // references another field, no value of its own
-		}
-		if n.Fields == nil {
-			n.Fields = map[string]string{}
-		}
-		n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
-	}
-	return n, nil
-}
-
-// clipboardDecision picks how to return a secret value. "stdout" prints it (a
-// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
-// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
-// base64 into scrollback, or silently fail because the OSC52 escape goes to a
-// non-terminal stderr).
-func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
-	if !stdoutTTY {
-		return "stdout"
-	}
-	if terminalAllowed(term, termProgram) && stderrTTY {
-		return "clipboard"
-	}
-	return "refuse"
-}
-
-// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
-// when stdout is NOT a terminal (i.e. piped to a machine consumer).
-func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
-
-// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
-// secret to a terminal's stdout/scrollback.
-func emitSecret(value string) {
-	switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
-	case "stdout":
-		fmt.Println(value)
-	case "clipboard":
-		fmt.Fprint(os.Stderr, osc52(value))
-		fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
-		clearClipboardAfter(30)
-	default: // refuse
-		fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
-	}
-}
-
-// clearClipboardAfter spawns a detached background clear so the secret doesn't
-// linger in the clipboard. Best-effort.
-func clearClipboardAfter(seconds int) {
-	exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear())).Start()
-}
-
-// listNames extracts "name (id)" from `bw list items` JSON; never values.
-func listNames(jsonOut string) []string {
-	var items []struct {
-		ID   string `json:"id"`
-		Name string `json:"name"`
-	}
-	if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
-		return nil
-	}
-	out := make([]string, 0, len(items))
-	for _, it := range items {
-		out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
-	}
-	return out
-}
-
-func runList(run cmdRunner, user, uid, search string) ([]string, error) {
-	s, err := openSession(run, user, uid)
-	if err != nil {
-		return nil, err
-	}
-	out, err := run("bw", bwListArgs(search), s.env)
-	if err != nil {
-		return nil, err
-	}
-	return listNames(out), nil
-}
-
-func vaultList(args []string) error {
-	hardenProcess()
-	ensureVaultToken()
-	search := ""
-	for i := 0; i < len(args); i++ {
-		if args[i] == "--search" && i+1 < len(args) {
-			search = args[i+1]
-			i++
-		} else if strings.HasPrefix(args[i], "--search=") {
-			search = strings.TrimPrefix(args[i], "--search=")
-		}
-	}
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid)
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	names, err := runList(realRunner, vaultCurrentUser(), uid, search)
-	if err != nil {
-		return err
-	}
-	for _, n := range names {
-		fmt.Println(n)
-	}
-	return nil
-}
-
-func vaultSearch(args []string) error {
-	if len(args) == 0 {
-		return fmt.Errorf("usage: homelab vault search <query>")
-	}
-	return vaultList([]string{"--search", strings.Join(args, " ")})
-}
-
-func vaultCode(args []string) error {
-	hardenProcess()
-	ensureVaultToken()
-	if len(args) == 0 {
-		return fmt.Errorf("usage: homelab vault code <name>")
-	}
-	name := args[0]
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid)
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	user := vaultCurrentUser()
-	val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
-	if err != nil {
-		return err
-	}
-	// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
-	writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
-	exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
-	emitSecret(val)
-	return nil
-}
-
-// statusSummary reports config/reachability without revealing secrets.
-func statusSummary(run cmdRunner, user, uid string) string {
-	if _, err := loadCreds(run, user); err != nil {
-		return "vault: not configured — run `homelab vault setup`"
-	}
-	s, err := openSession(run, user, uid)
-	if err != nil {
-		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
-	}
-	// openSession already did a best-effort sync; status re-runs it explicitly so
-	// a reachability failure surfaces in this report rather than only on stderr.
-	if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
-		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
-	}
-	return "vault: configured, unlocked, reachable ✓"
-}
-
-func vaultStatus(args []string) error {
-	hardenProcess()
-	ensureVaultToken()
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid)
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
-	return nil
-}
-
-func vaultLock(args []string) error {
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	appdata := bwAppDataDir(uid)
-	_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
-	_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
-	if logoutErr == nil {
-		fmt.Println("locked")
-	}
-	return nil // lock/logout best-effort; never error the caller
-}
-
-// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
-// (read-modify-write: needs only read+update, NOT the `patch` capability the
-// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
-// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
-// (creates the path on first use, before any sibling keys exist).
-func kvWriteVerb(merge bool) []string {
-	if merge {
-		return []string{"kv", "patch", "-method=rw"}
-	}
-	return []string{"kv", "put"}
-}
-
-// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
-// email nor the API client_id is a usable credential on its own.
-func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
-	return append(kvWriteVerb(merge), vwCredsPath(user),
-		"vaultwarden_email="+email,
-		"vaultwarden_client_id="+clientID,
-	)
-}
-
-// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
-// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
-// realRunnerStdin.
-func vaultWriteSecretArgs(merge bool, user, key string) []string {
-	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
-}
-
-// credsPathExists reports whether the user's KV path already holds data. Used to
-// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
-// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
-// user could run `homelab vault setup` before that ever happens.
-func credsPathExists(run cmdRunner, user string) bool {
-	_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
-	return err == nil
-}
-
-// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
-type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
-
-// writeCreds stores all four fields in the user's Vault path using only the
-// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
-// first (public) write creates the path when absent; the two real secrets then
-// merge in via read-modify-write so the public keys — and any claude-auth-sync
-// keys already present — survive. Secret values travel on stdin, never argv.
-func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
-	merge := credsPathExists(run, user)
-	if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
-		return err
-	}
-	// The path now exists regardless of the branch above → merge the secrets in.
-	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
-		return err
-	}
-	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
-		return err
-	}
-	return nil
-}
-
-// promptNoEcho reads one line without terminal echo (for the master password).
-func promptNoEcho(prompt string) (string, error) {
-	fmt.Fprint(os.Stderr, prompt)
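-	// stty acts on its own stdin, and exec.Command defaults Stdin to the null
-	// device, so pass the real terminal through or the echo toggle is a no-op.
-	stty := func(arg string) {
-		c := exec.Command("stty", arg)
-		c.Stdin = os.Stdin
-		_ = c.Run()
-	}
-	stty("-echo")
-	defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()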
-	r := bufio.NewReader(os.Stdin)
-	line, err := r.ReadString('\n')
-	// Trim only the line terminator — a master password / API secret may
-	// legitimately contain leading/trailing spaces.
-	return strings.TrimRight(line, "\r\n"), err
-}
-
-func promptLine(prompt string) (string, error) {
-	fmt.Fprint(os.Stderr, prompt)
-	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
-	return strings.TrimSpace(line), err
-}
-
-func vaultSetup(args []string) error {
-	hardenProcess()
-	ensureVaultToken()
-	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
-	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
-	email, err := promptLine("Vaultwarden email: ")
-	if err != nil {
-		return err
-	}
-	clientID, err := promptLine("API key client_id (user.xxxx): ")
-	if err != nil {
-		return err
-	}
-	clientSecret, err := promptNoEcho("API key client_secret: ")
-	if err != nil {
-		return err
-	}
-	master, err := promptNoEcho("Master password: ")
-	if err != nil {
-		return err
-	}
-	if email == "" || master == "" || clientID == "" || clientSecret == "" {
-		return fmt.Errorf("all fields are required")
-	}
-	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
-	if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
-		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
-	}
-	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid)
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
-		return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
-	}
-	fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
-	return nil
-}
-
-func vaultGet(args []string) error {
-	hardenProcess()
-	ensureVaultToken()
-	o, err := parseGetArgs(args)
-	if err != nil {
-		return err
-	}
-	uid := vaultCurrentUID()
-	unlock, err := withUserLock(uid)
-	if err != nil {
-		return err
-	}
-	defer unlock()
-	user := vaultCurrentUser()
-	if o.all {
-		return getAllFields(user, uid, o.name)
-	}
-	val, err := getValue(realRunner, user, uid, o)
-	if err != nil {
-		return err
-	}
-	writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
-	if o.json {
-		if !jsonToStdoutOK(stdoutIsTTY()) {
-			return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
-		}
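-		// %q emits Go string syntax (e.g. \x00, \a), which is not always valid
-		// JSON, so marshal the single-field object instead.
-		out, err := json.Marshal(map[string]string{o.field: val})
-		if err != nil {
-			return err
-		}
-		fmt.Println(string(out))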
-		return nil
-	}
-	emitSecret(val)
-	return nil
-}
-
-// getAllFields prints every field of one item as normalized JSON. Like
-// `get --json`, the payload is all secret values, so it refuses a terminal
-// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
-// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
-// distinguishable from a single-field get (the item name is still never logged).
-func getAllFields(user, uid, name string) error {
-	if !jsonToStdoutOK(stdoutIsTTY()) {
-		return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
-	}
-	raw, err := getItem(realRunner, user, uid, name)
-	if err != nil {
-		return err
-	}
-	item, err := normalizeItem(raw)
-	if err != nil {
-		return err
-	}
-	out, err := json.Marshal(item)
-	if err != nil {
-		return err
-	}
-	writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
-	fmt.Println(string(out))
-	return nil
-}
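The stdout/clipboard/refuse rule used by emitSecret above can be sketched on its own. This is a minimal illustration, not the removed implementation: `terminalAllowedSketch` and its two allowlisted values are hypothetical stand-ins, because the real `terminalAllowed` allowlist is not part of this diff.

```go
package main

import "fmt"

// terminalAllowedSketch is a HYPOTHETICAL stand-in for terminalAllowed; the
// real allowlist is not shown in this diff, so these two values are assumed.
func terminalAllowedSketch(term, termProgram string) bool {
	return termProgram == "iTerm.app" || term == "xterm-kitty"
}

// decide mirrors the three-way rule: a non-TTY stdout gets the value printed,
// an allowlisted terminal with a TTY stderr gets an OSC52 copy, and anything
// else is refused so the secret never lands in scrollback.
func decide(stdoutTTY, stderrTTY bool, term, termProgram string) string {
	if !stdoutTTY {
		return "stdout"
	}
	if terminalAllowedSketch(term, termProgram) && stderrTTY {
		return "clipboard"
	}
	return "refuse"
}

func main() {
	fmt.Println(decide(false, true, "dumb", ""))        // piped to a consumer
	fmt.Println(decide(true, true, "xterm-kitty", ""))  // allowlisted terminal
	fmt.Println(decide(true, false, "xterm-kitty", "")) // stderr redirected
	fmt.Println(decide(true, true, "dumb", ""))         // unknown terminal
}
```

Keeping the decision a pure function of four inputs is what lets it be table-tested without a real terminal.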
--- a/cli/cmd_vault_kv.go
+++ b/cli/cmd_vault_kv.go
@ -1,248 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"io"
-	"os"
-	"strings"
-)
-
-// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
-// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
-// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
-// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
-// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
-//
-// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
-// token (bound only to secret/workstation/claude-users/<user>). A general kv read
-// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
-// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
-// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
-// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
-// injects the scoped token). Access is then whatever the caller's policy grants.
-func vaultKVCommands() []Command {
-	return []Command{
-		{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
-		{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
-		{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
-			Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
-		{Path: []string{"vault", "kv"}, Tier: TierRead,
-			Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
-			Run:     func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
-	}
-}
-
-func vaultKVHelp() string {
-	return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store)
-
-  homelab vault kv get <path> [--field K]   read a secret
-                                  --field K  → one value (TTY → clipboard; piped → stdout)
-                                  no --field → all fields as JSON (piped only)
-  homelab vault kv list <path>    list sub-paths under <path> (no values)
-  homelab vault kv put <path> <key>   write one key; value read from stdin
-                                  (piped, or no-echo prompt); merges — never clobbers siblings
-
-Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is
-whatever your policy grants. This is NOT Vaultwarden — for your personal logins
-use 'homelab vault get' (see 'homelab vault').
-`
-}
-
-// --- arg builders (pure; values never travel via argv) --------------------
-
-func vaultKVGetFieldArgs(path, field string) []string {
-	return []string{"kv", "get", "-field=" + field, path}
-}
-func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
-func vaultKVListArgs(path string) []string    { return []string{"kv", "list", "-format=json", path} }
-
-// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
-// (read-modify-write: merges, needs only read+update — not the `patch` capability
-// — and preserves sibling keys); merge=false → `kv put` (creates the path on
-// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
-// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
-func vaultKVPutArgs(merge bool, path, key string) []string {
-	return append(kvWriteVerb(merge), path, key+"=-")
-}
-
-// --- pure parsers ----------------------------------------------------------
-
-// extractKVData returns the inner secret object from a `vault kv get -format=json`
-// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
-// wrapper so only the secret's own key→value data is emitted.
-func extractKVData(jsonOut string) (string, error) {
-	var env struct {
-		Data struct {
-			Data json.RawMessage `json:"data"`
-		} `json:"data"`
-	}
-	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
-		return "", fmt.Errorf("parse vault kv json: %w", err)
-	}
-	if len(env.Data.Data) == 0 {
-		return "", fmt.Errorf("no secret data at that path")
-	}
-	return string(env.Data.Data), nil
-}
-
-// parseKVList parses the JSON array `vault kv list -format=json` prints.
-func parseKVList(jsonOut string) ([]string, error) {
-	var keys []string
-	if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
-		return nil, fmt.Errorf("parse vault kv list json: %w", err)
-	}
-	return keys, nil
-}
-
-// --- testable cores (injected cmdRunner) -----------------------------------
-
-func kvGetField(run cmdRunner, path, field string) (string, error) {
-	return run("vault", vaultKVGetFieldArgs(path, field), nil)
-}
-
-func kvGetJSON(run cmdRunner, path string) (string, error) {
-	out, err := run("vault", vaultKVGetJSONArgs(path), nil)
-	if err != nil {
-		return "", err
-	}
-	return extractKVData(out)
-}
-
-func kvList(run cmdRunner, path string) ([]string, error) {
-	out, err := run("vault", vaultKVListArgs(path), nil)
-	if err != nil {
-		return nil, err
-	}
-	return parseKVList(out)
-}
-
-// kvPathExists reports whether the KV path already holds data, to pick create
-// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
-// sibling keys on an existing path.
-func kvPathExists(run cmdRunner, path string) bool {
-	_, err := run("vault", vaultKVGetJSONArgs(path), nil)
-	return err == nil
-}
-
-// kvPut writes one key, creating the path when absent and merging when present.
-// The value travels on stdin only (never argv).
-func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
-	merge := kvPathExists(run, path)
-	_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
-	return err
-}
-
-// --- handlers --------------------------------------------------------------
-
-func vaultKVGet(args []string) error {
-	hardenProcess()
-	ensureVaultAddr() // own token, NOT the scoped one (see file header)
-	var path, field string
-	for i := 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--field" && i+1 < len(args):
-			field = args[i+1]
-			i++
-		case strings.HasPrefix(a, "--field="):
-			field = strings.TrimPrefix(a, "--field=")
-		case !strings.HasPrefix(a, "-") && path == "":
-			path = a
-		}
-	}
-	if path == "" {
-		return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
-	}
-	if field != "" {
-		val, err := kvGetField(realRunner, path, field)
-		if err != nil {
-			return err
-		}
-		emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
-		return nil
-	}
-	// No --field → the whole secret. All values, so refuse a bare TTY (like
-	// `vault get --json`): pick a --field for the clipboard path, or pipe it.
-	if !jsonToStdoutOK(stdoutIsTTY()) {
-		return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
-	}
-	out, err := kvGetJSON(realRunner, path)
-	if err != nil {
-		return err
-	}
-	fmt.Println(out)
-	return nil
-}
-
-func vaultKVList(args []string) error {
-	ensureVaultAddr()
-	var path string
-	for _, a := range args {
-		if !strings.HasPrefix(a, "-") {
-			path = a
-			break
-		}
-	}
-	if path == "" {
-		return fmt.Errorf("usage: homelab vault kv list <path>")
-	}
-	keys, err := kvList(realRunner, path)
-	if err != nil {
-		return err
-	}
-	for _, k := range keys {
-		fmt.Println(k)
-	}
-	return nil
-}
-
-func vaultKVPut(args []string) error {
-	hardenProcess()
-	ensureVaultAddr()
-	var path, key string
-	for _, a := range args {
-		if strings.HasPrefix(a, "-") {
-			continue
-		}
-		switch {
-		case path == "":
-			path = a
-		case key == "":
-			key = a
-		}
-	}
-	if path == "" || key == "" {
-		return fmt.Errorf("usage: homelab vault kv put <path> <key>   (value read from stdin)")
-	}
-	value, err := readSecretValue("Value for " + key + ": ")
-	if err != nil {
-		return err
-	}
-	if value == "" {
-		return fmt.Errorf("empty value; aborting (nothing written)")
-	}
-	if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
-		return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
-	}
-	fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
-	return nil
-}
-
-// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
-// is read verbatim (trailing newline trimmed, internal newlines preserved so
-// multi-line values like PEM keys survive); an interactive TTY is prompted
-// without echo.
-func readSecretValue(prompt string) (string, error) {
-	fi, err := os.Stdin.Stat()
-	if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
-		b, rerr := io.ReadAll(os.Stdin)
-		if rerr != nil {
-			return "", rerr
-		}
-		return strings.TrimRight(string(b), "\r\n"), nil
-	}
-	return promptNoEcho(prompt)
-}
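As an illustration of the envelope handling described for extractKVData above, the sketch below unwraps a sample payload. The sample JSON, its keys and values, and the name `innerData` are assumptions made up for this example; only the `{"data":{"data":{…},"metadata":{…}}}` shape comes from the source comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// innerData returns the raw bytes of data.data from a KV-v2 style envelope,
// dropping the metadata and request wrapper around it.
func innerData(jsonOut string) (string, error) {
	var env struct {
		Data struct {
			Data json.RawMessage `json:"data"`
		} `json:"data"`
	}
	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
		return "", fmt.Errorf("parse envelope: %w", err)
	}
	if len(env.Data.Data) == 0 {
		return "", fmt.Errorf("no secret data at that path")
	}
	return string(env.Data.Data), nil
}

func main() {
	// Made-up sample envelope; real output carries more wrapper fields.
	sample := `{"request_id":"x","data":{"data":{"user":"demo","pass":"s3cret"},"metadata":{"version":2}}}`
	out, err := innerData(sample)
	fmt.Println(out, err)
}
```

Using json.RawMessage keeps the inner object byte-for-byte, so key order and value formatting reach the consumer unchanged.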
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
--- a/cli/cmd_work.go
+++ b/cli/cmd_work.go
@ -1,212 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"path/filepath"
-	"strings"
-)
-
-func workCommands() []Command {
-	return []Command{
-		{Path: []string{"work", "start"}, Tier: TierWrite,
-			Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
-		{Path: []string{"work", "land"}, Tier: TierWrite,
-			Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
-		{Path: []string{"work", "clean"}, Tier: TierWrite,
-			Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
-	}
-}
-
-// flagValue extracts `--name value` or `--name=value` from args.
-func flagValue(args []string, name string) string {
-	for i, a := range args {
-		if a == name && i+1 < len(args) {
-			return args[i+1]
-		}
-		if strings.HasPrefix(a, name+"=") {
-			return strings.TrimPrefix(a, name+"=")
-		}
-	}
-	return ""
-}
-
-func remotesOrEmpty(repoRoot string) []string {
-	r, _ := gitRemotes(repoRoot)
-	return r
-}
-
-// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
-func workStart(args []string) error {
-	topic, _ := firstPositional(args)
-	if topic == "" {
-		return fmt.Errorf("usage: homelab work start <topic>")
-	}
-	cwd, _ := os.Getwd()
-	repoRoot, err := gitRepoRoot(cwd)
-	if err != nil {
-		return fmt.Errorf("not in a git repository: %w", err)
-	}
-	remote := preferRemote(remotesOrEmpty(repoRoot))
-	if remote == "" {
-		return fmt.Errorf("no git remote configured in %s", repoRoot)
-	}
-	flags := cryptFlagsFor(repoRoot)
-	branch := currentUser() + "/" + topic
-	wtRel := filepath.Join(".worktrees", topic)
-
-	ensureWorktreesIgnored(repoRoot)
-	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
-		return fmt.Errorf("fetch %s failed: %w", remote, err)
-	}
-	if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
-		return fmt.Errorf("worktree add failed: %w", err)
-	}
-	wtPath := filepath.Join(repoRoot, wtRel)
-	fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
-	fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
-	return nil
-}
-
-// workLand integrates the current branch into master: fetch, merge master in,
-// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
-// fallback when the direct push is rejected (e.g. branch protection).
-func workLand(args []string) error {
-	verifyCmd := flagValue(args, "--verify-cmd")
-	cwd, _ := os.Getwd()
-	repoRoot, err := gitRepoRoot(cwd)
-	if err != nil {
-		return fmt.Errorf("not in a git repository: %w", err)
-	}
-	branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
-	if err != nil {
-		return err
-	}
-	if branch == "master" || branch == "main" {
-		return fmt.Errorf("refusing to land: already on %s", branch)
-	}
-	remote := preferRemote(remotesOrEmpty(repoRoot))
-	if remote == "" {
-		return fmt.Errorf("no git remote configured in %s", repoRoot)
-	}
-	flags := cryptFlagsFor(repoRoot)
-
-	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
-		return fmt.Errorf("fetch failed: %w", err)
-	}
-	if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
-		return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
-	}
-	if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
-		return fmt.Errorf("not landing: %w", err)
-	}
-	if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
-		return landFallback(repoRoot, flags, remote, branch, err)
-	}
-	fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
-	if containsArg(args, "--no-ci-watch") {
-		fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
-		return nil
-	}
-	landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
-	fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
-	if err := ciWatch([]string{landed}); err != nil {
-		return fmt.Errorf("landed, but CI did not go green: %w", err)
-	}
-	return nil
-}
-
-// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
-// neither is available it REFUSES (returns an error) unless allowSkip is set —
-// landing to master unverified must be a deliberate choice (--no-verify).
-func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
-	if verifyCmd != "" {
-		fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
-		return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
-	}
-	if isFile(filepath.Join(repoRoot, "go.mod")) {
-		fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
-		return runStreamingIn(repoRoot, "go", "test", "./...")
-	}
-	if allowSkip {
-		fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
-		return nil
-	}
-	return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
-}
-
-// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
-// by fetching + merging master and retrying.
-func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
-	var lastErr error
-	for i := 0; i < attempts; i++ {
-		err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
-		if err == nil {
-			return nil
-		}
-		lastErr = err
-		if i < attempts-1 {
-			fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
-			if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
-				return err
-			}
-			if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
-				return err
-			}
-		}
-	}
-	return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
-}
-
-// landFallback pushes the feature branch when the direct master push is rejected
-// (e.g. branch protection), so the work isn't lost and a PR can be opened.
-func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
-	fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
-	fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
-	if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
-		return fmt.Errorf("fallback branch push also failed: %w", err)
-	}
-	fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
-	return nil
-}
-
-// workClean removes a task's worktree and branch. Run from the main checkout.
-func workClean(args []string) error {
-	topic, _ := firstPositional(args)
-	if topic == "" {
-		return fmt.Errorf("usage: homelab work clean <topic>  (run from the main checkout)")
-	}
-	cwd, _ := os.Getwd()
-	repoRoot, err := gitRepoRoot(cwd)
-	if err != nil {
-		return fmt.Errorf("not in a git repository: %w", err)
-	}
-	flags := cryptFlagsFor(repoRoot)
-	wtRel := filepath.Join(".worktrees", topic)
-	branch := currentUser() + "/" + topic
-
-	if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
-		return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
-	}
-	if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
-		fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
-		fmt.Printf("homelab: removed worktree %s (branch %s kept)\n", wtRel, branch)
-		return nil
-	}
-	fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
-	return nil
-}
-
-// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
-func ensureWorktreesIgnored(repoRoot string) {
-	if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
-		return
-	}
-	gi := filepath.Join(repoRoot, ".gitignore")
-	f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
-	if err != nil {
-		return
-	}
-	defer f.Close()
-	if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
-		fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
-	}
-}
--- a/cli/cmd_work_test.go
+++ b/cli/cmd_work_test.go
@ -1,32 +0,0 @@
-package main
-
-import "testing"
-
-func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
-	dir := t.TempDir() // no go.mod, no verify cmd
-	if err := runVerify(dir, "", false); err == nil {
-		t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
-	}
-	if err := runVerify(dir, "", true); err != nil {
-		t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
-	}
-}
-
-func TestFlagValue(t *testing.T) {
-	cases := []struct {
-		args []string
-		name string
-		want string
-	}{
-		{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
-		{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
-		{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
-		{[]string{"topic"}, "--verify-cmd", ""},
-		{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
-	}
-	for _, c := range cases {
-		if got := flagValue(c.args, c.name); got != c.want {
-			t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
-		}
-	}
-}
--- a/cli/command.go
+++ b/cli/command.go
@ -1,104 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"sort"
-	"strings"
-)
-
-// Tier classifies whether a command observes (read) or mutates (write) state.
-// v0.1 allows everything; the tier is recorded so a classifier hook can gate
-// writes later without restructuring (see docs/adr/0005).
-type Tier string
-
-const (
-	TierRead  Tier = "read"
-	TierWrite Tier = "write"
-)
-
-// Command is one homelab verb. Path is the token sequence that selects it,
-// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
-type Command struct {
-	Path    []string
-	Tier    Tier
-	Summary string
-	Run     func(args []string) error
-}
-
-// dispatch routes args to the command whose Path is the longest matching prefix
-// of args, passing the remaining args to its Run.
-func dispatch(reg []Command, args []string) error {
-	best := -1
-	bestLen := 0
-	for i, c := range reg {
-		if len(c.Path) > len(args) {
-			continue
-		}
-		match := true
-		for j, p := range c.Path {
-			if args[j] != p {
-				match = false
-				break
-			}
-		}
-		if match && len(c.Path) >= bestLen {
-			best = i
-			bestLen = len(c.Path)
-		}
-	}
-	if best < 0 {
-		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
-	}
-	matched := reg[best]
-	runErr := matched.Run(args[bestLen:])
-	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
-	return runErr
-}
-
-// name is the space-joined verb path, e.g. "tf plan".
-func (c Command) name() string { return strings.Join(c.Path, " ") }
-
-// sortedByName returns a copy of reg ordered by verb path for stable output.
-func sortedByName(reg []Command) []Command {
-	out := make([]Command, len(reg))
-	copy(out, reg)
-	sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
-	return out
-}
-
-// manifestText renders one aligned line per command: "<path>  <tier>  <summary>".
-// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
-func manifestText(reg []Command) string {
-	cmds := sortedByName(reg)
-	width := 0
-	for _, c := range cmds {
-		if n := len(c.name()); n > width {
-			width = n
-		}
-	}
-	var b strings.Builder
-	for _, c := range cmds {
-		fmt.Fprintf(&b, "%-*s  %-5s  %s\n", width, c.name(), c.Tier, c.Summary)
-	}
-	return b.String()
-}
-
-// manifestJSON renders the registry as a JSON array of {command, tier, summary}
-// so agents can parse the full surface in one call.
-func manifestJSON(reg []Command) (string, error) {
-	type entry struct {
-		Command string `json:"command"`
-		Tier    string `json:"tier"`
-		Summary string `json:"summary"`
-	}
-	entries := make([]entry, 0, len(reg))
-	for _, c := range sortedByName(reg) {
-		entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
-	}
-	b, err := json.MarshalIndent(entries, "", "  ")
-	if err != nil {
-		return "", err
-	}
-	return string(b), nil
-}
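Reviewer note: the longest-prefix routing in the removed `dispatch` can be reduced to path selection alone. The sketch below is illustrative only. The helper name `longestMatch` and the sample paths are assumptions, not part of the removed file. It keeps the original's `>=` comparison, so among equal-length matches the later registration wins.

```go
package main

import "fmt"

// longestMatch returns the index of the path that is the longest prefix of
// args, or -1 when no path matches. Hypothetical helper for illustration.
func longestMatch(paths [][]string, args []string) int {
	best, bestLen := -1, 0
	for i, p := range paths {
		if len(p) > len(args) {
			continue // path is longer than the input, cannot be a prefix
		}
		ok := true
		for j := range p {
			if args[j] != p[j] {
				ok = false
				break
			}
		}
		// >= mirrors the removed code: ties go to the later entry.
		if ok && len(p) >= bestLen {
			best, bestLen = i, len(p)
		}
	}
	return best
}

func main() {
	paths := [][]string{{"tf"}, {"tf", "plan"}, {"claim"}}
	args := []string{"tf", "plan", "vault", "--json"}
	i := longestMatch(paths, args)
	// The matched path is stripped; the remainder is what Run would receive.
	fmt.Println(paths[i], args[len(paths[i]):])
}
```

With both `tf` and `tf plan` registered, `tf plan vault --json` selects the two-token path and passes `vault --json` on, which is the behaviour the removed `TestDispatchRoutesToLongestPrefixMatch` asserted.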
--- a/cli/command_test.go
+++ b/cli/command_test.go
@ -1,73 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"reflect"
-	"strings"
-	"testing"
-)
-
-// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
-// command whose Path is the longest matching prefix of the input tokens, and
-// hand the command the remaining args.
-func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
-	var gotArgs []string
-	ran := ""
-	reg := []Command{
-		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
-			Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
-		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
-			Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
-	}
-
-	if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
-		t.Fatalf("dispatch returned error: %v", err)
-	}
-	if ran != "tf plan" {
-		t.Fatalf("routed to %q, want %q", ran, "tf plan")
-	}
-	if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
-		t.Fatalf("command got args %v, want %v", gotArgs, want)
-	}
-}
-
-func TestDispatchUnknownCommandErrors(t *testing.T) {
-	reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
-	if err := dispatch(reg, []string{"bogus"}); err == nil {
-		t.Fatal("expected error for unknown command, got nil")
-	}
-}
-
-// The manifest is the progressive-discovery entrypoint: one line per command
-// showing the full verb path, its tier, and summary, sorted for stable output.
-func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
-	reg := []Command{
-		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
-		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
-	}
-	out := manifestText(reg)
-	for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
-		if !strings.Contains(out, want) {
-			t.Errorf("manifest text missing %q\n---\n%s", want, out)
-		}
-	}
-	// sorted: claim (c) must appear before tf plan (t)
-	if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
-		t.Errorf("manifest not sorted by path:\n%s", out)
-	}
-}
-
-func TestManifestJSONIsParsableAndTagged(t *testing.T) {
-	reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
-	out, err := manifestJSON(reg)
-	if err != nil {
-		t.Fatalf("manifestJSON error: %v", err)
-	}
-	var got []map[string]string
-	if err := json.Unmarshal([]byte(out), &got); err != nil {
-		t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
-	}
-	if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
-		t.Fatalf("unexpected manifest JSON: %v", got)
-	}
-}
--- a/cli/edges.go
+++ b/cli/edges.go
@ -1,164 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"regexp"
-	"strconv"
-	"strings"
-)
-
-// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
-// investigation helper over the goldmane_edges trail; see ADR-0014).
-type edgesOpts struct {
-	ns       string // edges touching this namespace (either direction)
-	src      string // edges where src_ns = this
-	dst      string // edges where dst_ns = this
-	peersOf  string // distinct peers of this namespace (both directions)
-	newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
-	denied   bool   // action = 'deny' only
-	asJSON   bool   // wrap result as a JSON array
-	limit    int    // row cap (default 200)
-}
-
-// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
-// typo surfaces instead of silently dumping the whole table.
-func parseEdgesArgs(args []string) (edgesOpts, error) {
-	o := edgesOpts{limit: 200}
-	i := 0
-	for i < len(args) {
-		a := args[i]
-		key, inline, hasInline := a, "", false
-		if eq := strings.IndexByte(a, '='); eq >= 0 {
-			key, inline, hasInline = a[:eq], a[eq+1:], true
-		}
-		needVal := func() (string, error) {
-			if hasInline {
-				return inline, nil
-			}
-			if i+1 < len(args) {
-				i++
-				return args[i], nil
-			}
-			return "", fmt.Errorf("flag %s needs a value", key)
-		}
-		var err error
-		switch key {
-		case "--ns":
-			o.ns, err = needVal()
-		case "--src":
-			o.src, err = needVal()
-		case "--dst":
-			o.dst, err = needVal()
-		case "--peers-of":
-			o.peersOf, err = needVal()
-		case "--new-since":
-			o.newSince, err = needVal()
-		case "--denied":
-			o.denied = true
-		case "--json":
-			o.asJSON = true
-		case "--limit":
-			var v string
-			if v, err = needVal(); err == nil {
-				if o.limit, err = strconv.Atoi(v); err != nil {
-					err = fmt.Errorf("--limit must be an integer: %q", v)
-				}
-			}
-		default:
-			return o, fmt.Errorf("unknown flag: %s", a)
-		}
-		if err != nil {
-			return o, err
-		}
-		i++
-	}
-	return o, nil
-}
-
-// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
-// injection guard — anything else is rejected rather than quoted-and-hoped.
-var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
-
-func validateNS(s string) error {
-	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
-		return fmt.Errorf("invalid namespace name: %q", s)
-	}
-	return nil
-}
-
-// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
-func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
-
-var (
-	durRE  = regexp.MustCompile(`^(\d+)([smhd])$`)
-	dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
-)
-
-// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
-// into a first_seen predicate.
-func newSinceCond(v string) (string, error) {
-	if m := durRE.FindStringSubmatch(v); m != nil {
-		unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
-		return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
-	}
-	if dateRE.MatchString(v) {
-		return "first_seen >= " + sqlStr(v), nil
-	}
-	return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
-}
-
-// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
-func buildEdgesQuery(o edgesOpts) (string, error) {
-	limit := o.limit
-	if limit <= 0 {
-		limit = 200
-	}
-
-	// peers-of is a distinct-peer summary, a different shape from the row list.
-	if o.peersOf != "" {
-		if err := validateNS(o.peersOf); err != nil {
-			return "", err
-		}
-		p := sqlStr(o.peersOf)
-		return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
-			"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
-			"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
-			") t ORDER BY peer LIMIT %d", p, p, limit), nil
-	}
-
-	var conds []string
-	for _, f := range []struct{ val, tmpl string }{
-		{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
-		{o.src, "src_ns = %s"},
-		{o.dst, "dst_ns = %s"},
-	} {
-		if f.val == "" {
-			continue
-		}
-		if err := validateNS(f.val); err != nil {
-			return "", err
-		}
-		conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
-	}
-	if o.denied {
-		conds = append(conds, "action = 'deny'")
-	}
-	if o.newSince != "" {
-		c, err := newSinceCond(o.newSince)
-		if err != nil {
-			return "", err
-		}
-		conds = append(conds, c)
-	}
-
-	q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
-	if len(conds) > 0 {
-		q += " WHERE " + strings.Join(conds, " AND ")
-	}
-	q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
-
-	if o.asJSON {
-		q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
-	}
-	return q, nil
-}
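Reviewer note: the removed query builder guards user input twice, first with an allowlist regex and then by escaping the SQL literal. A minimal sketch of that pattern for the `--ns` filter follows. The function names `quoteLiteral` and `nsCondition` are assumptions chosen for illustration; only the regex, the 63-character cap, and the predicate shape are taken from the removed code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nameRE is the same charset the removed nsRE accepted.
var nameRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

// quoteLiteral renders a SQL string literal by doubling single quotes.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// nsCondition validates ns first and only then builds the predicate, so the
// quoting is a second line of defence rather than the only one.
func nsCondition(ns string) (string, error) {
	if ns == "" || len(ns) > 63 || !nameRE.MatchString(ns) {
		return "", fmt.Errorf("invalid namespace name: %q", ns)
	}
	return fmt.Sprintf("(src_ns = %[1]s OR dst_ns = %[1]s)", quoteLiteral(ns)), nil
}

func main() {
	for _, ns := range []string{"immich", "a'; DROP TABLE edge;--"} {
		cond, err := nsCondition(ns)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(cond)
	}
}
```

Rejecting outright instead of relying on escaping keeps the set of accepted inputs small enough to reason about, which is the stated intent of the `nsRE` comment in the removed file.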
--- a/cli/edges_test.go
+++ b/cli/edges_test.go
@ -1,163 +0,0 @@
-package main
-
-import (
-	"strings"
-	"testing"
-)
-
-func TestParseEdgesArgs(t *testing.T) {
-	cases := []struct {
-		name string
-		args []string
-		want edgesOpts
-	}{
-		{"defaults", nil, edgesOpts{limit: 200}},
-		{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
-		{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
-		{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
-		{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
-		{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
-		{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
-		{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
-	}
-	for _, c := range cases {
-		t.Run(c.name, func(t *testing.T) {
-			got, err := parseEdgesArgs(c.args)
-			if err != nil {
-				t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
-			}
-			if got != c.want {
-				t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
-			}
-		})
-	}
-}
-
-func TestParseEdgesArgsErrors(t *testing.T) {
-	for _, args := range [][]string{
-		{"--limit", "abc"},
-		{"--bogus"},
-	} {
-		if _, err := parseEdgesArgs(args); err == nil {
-			t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
-		}
-	}
-}
-
-func TestBuildEdgesQueryDefaults(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{limit: 200})
-	if err != nil {
-		t.Fatal(err)
-	}
-	for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
-		if !strings.Contains(q, want) {
-			t.Errorf("query %q missing %q", q, want)
-		}
-	}
-	if strings.Contains(q, "WHERE") {
-		t.Errorf("no-filter query should have no WHERE: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryFilters(t *testing.T) {
-	cases := []struct {
-		name string
-		o    edgesOpts
-		want string
-	}{
-		{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
-		{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
-		{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
-		{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
-	}
-	for _, c := range cases {
-		t.Run(c.name, func(t *testing.T) {
-			q, err := buildEdgesQuery(c.o)
-			if err != nil {
-				t.Fatal(err)
-			}
-			if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
-				t.Errorf("query %q missing WHERE/%q", q, c.want)
-			}
-		})
-	}
-}
-
-func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
-	if err != nil {
-		t.Fatal(err)
-	}
-	if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
-		t.Errorf("combined filters not AND'd: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryPeersOf(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
-	if err != nil {
-		t.Fatal(err)
-	}
-	for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
-		if !strings.Contains(q, want) {
-			t.Errorf("peers-of query %q missing %q", q, want)
-		}
-	}
-}
-
-func TestBuildEdgesQueryJSON(t *testing.T) {
-	q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
-	if err != nil {
-		t.Fatal(err)
-	}
-	if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
-		t.Errorf("json query missing json_agg wrapper: %q", q)
-	}
-}
-
-func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
-	for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
-		if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
-			t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
-		}
-	}
-}
-
-func TestNewSinceCond(t *testing.T) {
-	cases := []struct {
-		in   string
-		want string
-	}{
-		{"24h", "first_seen >= now() - interval '24 hours'"},
-		{"7d", "first_seen >= now() - interval '7 days'"},
-		{"30m", "first_seen >= now() - interval '30 minutes'"},
-		{"2026-06-28", "first_seen >= '2026-06-28'"},
-	}
-	for _, c := range cases {
-		got, err := newSinceCond(c.in)
-		if err != nil {
-			t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
-		}
-		if got != c.want {
-			t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
-		}
-	}
-	for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
-		if _, err := newSinceCond(bad); err == nil {
-			t.Errorf("newSinceCond(%q) expected error, got nil", bad)
-		}
-	}
-}
-
-func TestValidateNS(t *testing.T) {
-	for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
-		if err := validateNS(ok); err != nil {
-			t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
-		}
-	}
-	for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
-		if err := validateNS(bad); err == nil {
-			t.Errorf("validateNS(%q) expected error, got nil", bad)
-		}
-	}
-}
--- a/cli/homelab.go
+++ b/cli/homelab.go
@ -1,99 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"strings"
-)
-
-// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
-var version = "dev"
-
-// buildRegistry returns every homelab verb. New verb-groups append here.
-func buildRegistry() []Command {
-	var reg []Command
-	reg = append(reg, claimCommands()...)
-	reg = append(reg, tfCommands()...)
-	reg = append(reg, workCommands()...)
-	reg = append(reg, k8sCommands()...)
-	reg = append(reg, memoryCommands()...)
-	reg = append(reg, ciCommands()...)
-	reg = append(reg, deployCommands()...)
-	reg = append(reg, netCommands()...)
-	reg = append(reg, obsCommands()...)
-	reg = append(reg, edgesCommands()...)
-	reg = append(reg, usageCommands()...)
-	reg = append(reg, haCommands()...)
-	reg = append(reg, browserCommands()...)
-	reg = append(reg, vaultCommands()...)
-	return reg
-}
-
-// dispatchTop handles the homelab verb surface. handled=false means the args are
-// not a homelab verb, so main() falls back to the legacy -use-case path.
-func dispatchTop(args []string) (handled bool, err error) {
-	if len(args) == 0 {
-		fmt.Print(usage())
-		return true, nil
-	}
-	switch args[0] {
-	case "help", "-h", "--help":
-		fmt.Print(usage())
-		return true, nil
-	case "version", "--version":
-		fmt.Println("homelab " + version)
-		return true, nil
-	case "manifest":
-		reg := buildRegistry()
-		if containsArg(args[1:], "--json") {
-			out, err := manifestJSON(reg)
-			if err != nil {
-				return true, err
-			}
-			fmt.Println(out)
-			return true, nil
-		}
-		fmt.Print(manifestText(reg))
-		return true, nil
-	}
-	if strings.HasPrefix(args[0], "-") {
-		return false, nil
-	}
-	reg := buildRegistry()
-	if !isCommandGroup(reg, args[0]) {
-		return false, nil
-	}
-	return true, dispatch(reg, args)
-}
-
-func isCommandGroup(reg []Command, group string) bool {
-	for _, c := range reg {
-		if len(c.Path) > 0 && c.Path[0] == group {
-			return true
-		}
-	}
-	return false
-}
-
-func containsArg(args []string, want string) bool {
-	for _, a := range args {
-		if a == want {
-			return true
-		}
-	}
-	return false
-}
-
-func usage() string {
-	var b strings.Builder
-	fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
-	b.WriteString("Usage:\n  homelab <command> [args]\n\nCommands:\n")
-	for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
-		if line != "" {
-			b.WriteString("  " + line + "\n")
-		}
-	}
-	b.WriteString("\n  manifest [--json]   list all commands (machine-readable with --json)\n")
-	b.WriteString("  version             print version\n")
-	b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
-	return b.String()
-}
--- a/cli/k8s.go
+++ b/cli/k8s.go
@ -1,138 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os/exec"
-	"strings"
-)
-
-// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
-
-func kubectlBase(ns string, args ...string) []string {
-	var full []string
-	if ns != "" {
-		full = append(full, "-n", ns)
-	}
-	return append(full, args...)
-}
-
-func kubectlStream(ns string, args ...string) error {
-	return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
-}
-
-// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
-func kubectlCapture(ns string, args ...string) (string, error) {
-	out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
-	return strings.TrimSpace(string(out)), err
-}
-
-// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
-type k8sTarget struct {
-	app       string
-	ns        string
-	pod       string
-	container string
-	selector  string
-	tty       bool
-	rest      []string // passthrough flags and, after `--`, the exec command
-}
-
-// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
-// The first bare token is the app; unknown flags pass through in rest.
-func parseK8sTarget(args []string) k8sTarget {
-	t := k8sTarget{}
-	i := 0
-	take := func() string {
-		if i+1 < len(args) {
-			i++
-			return args[i]
-		}
-		return ""
-	}
-	for i = 0; i < len(args); i++ {
-		a := args[i]
-		switch {
-		case a == "--":
-			t.rest = append(t.rest, args[i+1:]...)
-			return t
-		case a == "-n" || a == "--namespace":
-			t.ns = take()
-		case strings.HasPrefix(a, "--namespace="):
-			t.ns = strings.TrimPrefix(a, "--namespace=")
-		case a == "--pod":
-			t.pod = take()
-		case strings.HasPrefix(a, "--pod="):
-			t.pod = strings.TrimPrefix(a, "--pod=")
-		case a == "-c" || a == "--container":
-			t.container = take()
-		case strings.HasPrefix(a, "--container="):
-			t.container = strings.TrimPrefix(a, "--container=")
-		case a == "-l" || a == "--selector":
-			t.selector = take()
-		case strings.HasPrefix(a, "--selector="):
-			t.selector = strings.TrimPrefix(a, "--selector=")
-		case a == "--tty" || a == "-it" || a == "-ti":
-			t.tty = true
-		case !strings.HasPrefix(a, "-") && t.app == "":
-			t.app = a
-		default:
-			t.rest = append(t.rest, a)
-		}
-	}
-	return t
-}
-
-// namespace defaults to the app name (most namespaces hold exactly one app).
-func (t k8sTarget) namespace() string {
-	if t.ns != "" {
-		return t.ns
-	}
-	return t.app
-}
-
-// objectRef is the kubectl object for logs/exec: an explicit pod, else
-// deploy/<app> (kubectl resolves a pod from the Deployment).
-func (t k8sTarget) objectRef() string {
-	if t.pod != "" {
-		return "pod/" + t.pod
-	}
-	return "deploy/" + t.app
-}
-
-// --- database access (the dbaas exec pattern) ---
-
-type dbPlan struct {
-	ns        string
-	pod       string   // explicit pod (e.g. mysql-standalone-0)
-	selector  string   // resolve the pod by this label when pod == "" (CNPG primary)
-	container string   // "" = default container
-	argv      []string // command + args to run inside the pod
-}
-
-// planDBExec builds the in-pod command to run sql against app's database.
-// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
-// Service, not an exec target), psql -U postgres -d <db>.
-// MySQL: mysql-standalone-0, password from env (never on the command line).
-// dbName defaults to app. sql empty => interactive client.
-func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
-	if dbName == "" {
-		dbName = app
-	}
-	if mysql {
-		inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
-		if sql != "" {
-			inner += " -e " + shellQuote(sql)
-		}
-		return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
-	}
-	argv := []string{"psql", "-U", "postgres", "-d", dbName}
-	if sql != "" {
-		argv = append(argv, "-tAc", sql)
-	}
-	return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
-}
-
-// shellQuote single-quotes s for safe embedding in a bash -c string.
-func shellQuote(s string) string {
-	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
-}
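Reviewer note: `shellQuote` is what let the removed MySQL path embed arbitrary SQL in a `bash -c` string. A small sketch of how the quoting composes is below. `shellQuote` is copied from the removed file. `mysqlInner` is a hypothetical wrapper that rebuilds the inner command the same way `planDBExec` did.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote single-quotes s for bash: each ' becomes '\'' (close the quote,
// emit an escaped quote, reopen the quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// mysqlInner builds the string handed to bash -c. The password stays a shell
// variable reference, so it is expanded inside the pod and never appears in
// the argument list built on the caller's side. Hypothetical wrapper.
func mysqlInner(db, sql string) string {
	inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(db))
	if sql != "" {
		inner += " -e " + shellQuote(sql)
	}
	return inner
}

func main() {
	fmt.Println(shellQuote("a'b"))
	fmt.Println(mysqlInner("wrongmove", "SELECT 'x'"))
}
```

Single quotes suppress all expansion in bash, so a SQL string containing `$` or backticks reaches `mysql` unchanged. Only the embedded single quote needs the close, escape, reopen sequence.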
--- a/cli/k8s_test.go
+++ b/cli/k8s_test.go
@ -1,65 +0,0 @@
-package main
-
-import (
-	"reflect"
-	"strings"
-	"testing"
-)
-
-func TestParseK8sTarget(t *testing.T) {
-	got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
-	want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
-	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
-	}
-}
-
-func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
-	if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
-		t.Errorf("namespace() = %q, want immich", ns)
-	}
-	if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
-		t.Errorf("namespace() = %q, want dbaas", ns)
-	}
-}
-
-func TestK8sTargetObjectRef(t *testing.T) {
-	if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
-		t.Errorf("objectRef() = %q, want deploy/tripit", r)
-	}
-	if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
-		t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
-	}
-}
-
-func TestPlanDBExecPostgresDefault(t *testing.T) {
-	p := planDBExec("fire-planner", "", "SELECT 1", false)
-	// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
-	// label rather than naming an (un-exec-able) Service.
-	if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
-		t.Fatalf("unexpected pg target: %+v", p)
-	}
-	// db name defaults to the app; SQL passed via -tAc
-	joined := strings.Join(p.argv, " ")
-	if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
-		t.Fatalf("pg argv missing db/sql: %v", p.argv)
-	}
-}
-
-func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
-	p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
-	if p.pod != "mysql-standalone-0" {
-		t.Fatalf("unexpected mysql pod: %+v", p)
-	}
-	inner := strings.Join(p.argv, " ")
-	// password must come from the env var, never inline
-	if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
-		t.Fatalf("mysql must use env password wrapper: %v", p.argv)
-	}
-}
-
-func TestShellQuoteEscapes(t *testing.T) {
-	if got := shellQuote("a'b"); got != `'a'\''b'` {
-		t.Fatalf("shellQuote = %q", got)
-	}
-}
--- a/cli/main.go
+++ b/cli/main.go
@ -26,16 +26,8 @@ var (
 )

 func main() {
-	// homelab verb surface (work/tf/claim/...) is tried first; if the args are
-	// not a homelab verb, fall through to the legacy webhook -use-case path.
-	if handled, err := dispatchTop(os.Args[1:]); handled {
+	err := run()
 	if err != nil {
-			fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
-			os.Exit(1)
-		}
-		return
-	}
-	if err := run(); err != nil {
 		glog.Errorf("run failed: %s", err.Error())
 		os.Exit(255)
 	}
--- a/cli/memory.go
+++ b/cli/memory.go
@ -1,103 +0,0 @@
-package main
-
-import (
-	"bytes"
-	"encoding/json"
-	"fmt"
-	"io"
-	"net/http"
-	"os"
-	"strings"
-	"time"
-)
-
-// defaultMemoryURL is used when no env override is present (agents normally have
-// CLAUDE_MEMORY_API_URL set by the memory hooks).
-const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
-
-type memoryClient struct {
-	base string
-	key  string
-	http *http.Client
-}
-
-func firstEnv(keys ...string) string {
-	for _, k := range keys {
-		if v := os.Getenv(k); v != "" {
-			return v
-		}
-	}
-	return ""
-}
-
-func resolveMemoryBase() string {
-	if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
-		return strings.TrimRight(b, "/")
-	}
-	return defaultMemoryURL
-}
-
-// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
-// the MCP wraps), so it works even when the MCP frontend is down.
-func newMemoryClient() (*memoryClient, error) {
-	key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
-	if key == "" {
-		return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
-	}
-	return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
-}
-
-func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
-	var r io.Reader
-	if body != nil {
-		b, err := json.Marshal(body)
-		if err != nil {
-			return nil, err
-		}
-		r = bytes.NewReader(b)
-	}
-	req, err := http.NewRequest(method, c.base+path, r)
-	if err != nil {
-		return nil, err
-	}
-	req.Header.Set("Authorization", "Bearer "+c.key)
-	if body != nil {
-		req.Header.Set("Content-Type", "application/json")
-	}
-	resp, err := c.http.Do(req)
-	if err != nil {
-		return nil, err
-	}
-	defer resp.Body.Close()
-	out, _ := io.ReadAll(resp.Body)
-	if resp.StatusCode >= 300 {
-		return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
-	}
-	return out, nil
-}
-
-// Request bodies mirror src/claude_memory/api/models.py.
-
-type memRecallReq struct {
-	Context       string `json:"context"`
-	ExpandedQuery string `json:"expanded_query,omitempty"`
-	Category      string `json:"category,omitempty"`
-	SortBy        string `json:"sort_by,omitempty"`
-	Limit         int    `json:"limit,omitempty"`
-}
-
-type memStoreReq struct {
-	Content          string  `json:"content"`
-	Category         string  `json:"category,omitempty"`
-	Tags             string  `json:"tags,omitempty"`
-	ExpandedKeywords string  `json:"expanded_keywords,omitempty"`
-	Importance       float64 `json:"importance"`
-	ForceSensitive   bool    `json:"force_sensitive,omitempty"`
-}
-
-type memUpdateReq struct {
-	Content          *string  `json:"content,omitempty"`
-	Tags             *string  `json:"tags,omitempty"`
-	Importance       *float64 `json:"importance,omitempty"`
-	ExpandedKeywords *string  `json:"expanded_keywords,omitempty"`
-}
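Reviewer note: the removed request types use two different conventions. `memStoreReq.Importance` is a plain `float64` without `omitempty`, so it is always sent. `memUpdateReq` uses pointers with `omitempty`, so "not set" and "set to the zero value" stay distinguishable. The sketch below shows that difference with trimmed-down copies of the structs. The type names `storeReq` and `updateReq` and the helper `encode` are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storeReq: importance has no omitempty, so 0 is still serialized.
type storeReq struct {
	Content    string  `json:"content"`
	Importance float64 `json:"importance"`
}

// updateReq: nil pointers are omitted; non-nil pointers are sent even when
// they point at a zero value.
type updateReq struct {
	Content    *string  `json:"content,omitempty"`
	Tags       *string  `json:"tags,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals v and panics on error, which cannot happen for these types.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	empty, zero := "", 0.0
	fmt.Println(encode(storeReq{Content: "x"}))          // importance present
	fmt.Println(encode(updateReq{}))                     // nothing to update
	fmt.Println(encode(updateReq{Tags: &empty}))         // clear the tags
	fmt.Println(encode(updateReq{Importance: &zero}))    // set importance to 0
}
```

A non-pointer `string` with `omitempty` could not express "clear the tags", because the empty string would be dropped from the payload. The pointer form is what makes a partial update unambiguous.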
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@ -1,102 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"os"
-	"strings"
-	"testing"
-	"unicode/utf8"
-)
-
-func TestRenderMemoriesFullContent(t *testing.T) {
-	// The pretty view must NOT truncate content: the old 240-rune preview cut
-	// memories mid-sentence, misled agents into thinking no full-content
-	// read-back existed, and made blind `update --content` from the preview
-	// destroy the stored tail. Full passthrough also removes the mid-rune-cut
-	// invalid-UTF-8 class by construction — nothing is ever sliced.
-	long := strings.Repeat("я", 300) + strings.Repeat("a", 300)
-	raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
-		{"id": 7, "content": long, "category": "facts", "tags": "t1,t2", "importance": 0.7},
-	}})
-	got := renderMemories(raw, false)
-	if !strings.Contains(got, long) {
-		t.Fatalf("content was truncated: %q", got)
-	}
-	if strings.Contains(got, "…") {
-		t.Fatalf("ellipsis in output — truncation still active: %q", got)
-	}
-	if !utf8.ValidString(got) {
-		t.Fatalf("invalid UTF-8 in output: %q", got)
-	}
-	if !strings.Contains(got, "#7 [facts] (0.70) ") || !strings.Contains(got, "tags: t1,t2") {
-		t.Fatalf("line format broken: %q", got)
-	}
-}
-
-func TestRenderMemoriesFlattensNewlinesToOneLine(t *testing.T) {
-	// Consumers (the recall hook, terminal skims) rely on one memory per line;
-	// multi-line content is flattened, never split across lines.
-	raw, _ := json.Marshal(map[string]interface{}{"memories": []map[string]interface{}{
-		{"id": 1, "content": "line one\nline two\nline three", "category": "facts", "importance": 0.5},
-	}})
-	got := renderMemories(raw, false)
-	if !strings.Contains(got, "line one line two line three") {
-		t.Fatalf("newlines not flattened: %q", got)
-	}
-}
-
-func TestRenderMemoriesEdgeCases(t *testing.T) {
-	if got := renderMemories([]byte(`{"memories":[]}`), false); got != "(no memories)\n" {
-		t.Fatalf("empty list: %q", got)
-	}
-	// --json and unparseable responses pass through raw.
-	if got := renderMemories([]byte(`{"x":1}`), true); got != "{\"x\":1}\n" {
-		t.Fatalf("json passthrough: %q", got)
-	}
-	if got := renderMemories([]byte(`not json`), false); got != "not json\n" {
-		t.Fatalf("unparseable passthrough: %q", got)
-	}
-}
-
-func TestResolveMemoryBase(t *testing.T) {
-	old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
-	defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
-
-	os.Unsetenv("CLAUDE_MEMORY_API_URL")
-	os.Unsetenv("MEMORY_API_URL")
-	if got := resolveMemoryBase(); got != defaultMemoryURL {
-		t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
-	}
-	os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
-	if got := resolveMemoryBase(); got != "https://m.example" {
-		t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
-	}
-}
-
-func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
-	b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
-	s := string(b)
-	if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
-		t.Fatalf("memStoreReq JSON missing fields: %s", s)
-	}
-}
-
-func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
-	tags := "a,b"
-	b, _ := json.Marshal(memUpdateReq{Tags: &tags})
-	s := string(b)
-	if strings.Contains(s, "content") || strings.Contains(s, "importance") {
-		t.Fatalf("unset update fields must be omitted: %s", s)
-	}
-	if !strings.Contains(s, `"tags":"a,b"`) {
-		t.Fatalf("set field missing: %s", s)
-	}
-}
-
-func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
-	b, _ := json.Marshal(memRecallReq{Context: "hi"})
-	s := string(b)
-	if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
-		t.Fatalf("empty optionals must be omitted: %s", s)
-	}
-}
--- a/cli/presence.go
+++ b/cli/presence.go
@ -1,58 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"path/filepath"
-	"strings"
-)
-
-// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
-var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
-
-// presenceScript locates the presence CLI — homelab WRAPS it, it does not
-// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
-func presenceScript() string {
-	if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
-		return p
-	}
-	home, err := os.UserHomeDir()
-	if err != nil {
-		return "presence"
-	}
-	return filepath.Join(home, "code", "scripts", "presence")
-}
-
-// validateLabel checks a presence label is <kind>:<name> with a known kind.
-func validateLabel(label string) error {
-	parts := strings.SplitN(label, ":", 2)
-	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
-		return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
-	}
-	for _, k := range validPresenceKinds {
-		if parts[0] == k {
-			return nil
-		}
-	}
-	return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
-}
-
-// presenceClaim claims label on the board with a purpose note.
-func presenceClaim(label, purpose string) error {
-	if err := validateLabel(label); err != nil {
-		return err
-	}
-	args := []string{"claim", label}
-	if purpose != "" {
-		args = append(args, "--purpose", purpose)
-	}
-	return runStreaming(presenceScript(), args...)
-}
-
-// presenceRelease releases a prior claim on label.
-func presenceRelease(label string) error {
-	if err := validateLabel(label); err != nil {
-		return err
-	}
-	return runStreaming(presenceScript(), "release", label)
-}
--- a/cli/presence_test.go
+++ b/cli/presence_test.go
@ -1,24 +0,0 @@
-package main
-
-import "testing"
-
-func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
-	good := []string{
-		"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
-		"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
-	}
-	for _, l := range good {
-		if err := validateLabel(l); err != nil {
-			t.Errorf("validateLabel(%q) = %v, want nil", l, err)
-		}
-	}
-}
-
-func TestValidateLabelRejectsBadLabels(t *testing.T) {
-	bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
-	for _, l := range bad {
-		if err := validateLabel(l); err == nil {
-			t.Errorf("validateLabel(%q) = nil, want error", l)
-		}
-	}
-}
--- a/cli/probe.go
+++ b/cli/probe.go
@ -1,76 +0,0 @@
-package main
-
-import (
-	"context"
-	"crypto/tls"
-	"fmt"
-	"io"
-	"net"
-	"net/http"
-	"net/url"
-	"os/exec"
-	"strings"
-	"time"
-)
-
-// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
-const internalLBIP = "10.0.20.203"
-
-// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
-// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
-// host:443:ip`. TLS verification is skipped (these are reachability/observability
-// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
-func clientDialingIP(ip string, timeout time.Duration) *http.Client {
-	d := &net.Dialer{Timeout: 8 * time.Second}
-	tr := &http.Transport{
-		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
-			if i := strings.LastIndex(addr, ":"); i >= 0 {
-				addr = ip + addr[i:]
-			}
-			return d.DialContext(ctx, network, addr)
-		},
-		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
-	}
-	return &http.Client{Timeout: timeout, Transport: tr}
-}
-
-// probeURL issues a GET and returns status code + elapsed time.
-func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
-	start := time.Now()
-	resp, err := c.Get(rawurl)
-	dur := time.Since(start)
-	if err != nil {
-		return 0, dur, err
-	}
-	resp.Body.Close()
-	return resp.StatusCode, dur, nil
-}
-
-// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
-func lbGetBody(host, path string, q url.Values) ([]byte, error) {
-	u := "https://" + host + path
-	if len(q) > 0 {
-		u += "?" + q.Encode()
-	}
-	resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
-	if err != nil {
-		return nil, err
-	}
-	defer resp.Body.Close()
-	body, _ := io.ReadAll(resp.Body)
-	if resp.StatusCode >= 300 {
-		return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
-	}
-	return body, nil
-}
-
-// dig runs `dig +short` against a resolver, optionally for a record type.
-func dig(name, server, rrtype string) (string, error) {
-	args := []string{"+short", "+time=3", "+tries=1"}
-	if rrtype != "" {
-		args = append(args, rrtype)
-	}
-	args = append(args, name, "@"+server)
-	out, err := exec.Command("dig", args...).Output()
-	return strings.TrimSpace(string(out)), err
-}
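The address rewrite inside `clientDialingIP` can be pulled out as a pure function, which makes it checkable without touching the network. The following is a minimal sketch under assumptions: `rewriteDialAddr` is a hypothetical helper name that does not appear in `probe.go`, and it swaps `strings.LastIndex` for `net.SplitHostPort`/`net.JoinHostPort` so that an IPv6 target would be bracketed correctly.

```go
package main

import (
	"fmt"
	"net"
)

// rewriteDialAddr is a hypothetical helper (not in probe.go). It replaces the
// host part of a dial address with ip and keeps the port. An address that does
// not parse as host:port is returned unchanged.
func rewriteDialAddr(addr, ip string) string {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return addr
	}
	return net.JoinHostPort(ip, port)
}

func main() {
	// The URL host stays in the request (and therefore in SNI); only the
	// TCP dial target changes.
	fmt.Println(rewriteDialAddr("ci.viktorbarzin.me:443", "10.0.20.203"))
	fmt.Println(rewriteDialAddr("[::1]:8443", "fd00::203"))
}
```

Inside a `DialContext` hook the result would be passed straight to the dialer, much as the deleted code did with its string-sliced address.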
--- a/cli/probe_test.go
+++ b/cli/probe_test.go
@ -1,49 +0,0 @@
-package main
-
-import "testing"
-
-func TestQueryArg(t *testing.T) {
-	if got := queryArg([]string{"up"}, nil); got != "up" {
-		t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
-	}
-	if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
-		t.Errorf(`--json should be dropped, got %q`, got)
-	}
-	// single quoted PromQL arrives as one token
-	if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
-		t.Errorf(`quoted query mangled: %q`, got)
-	}
-	// value-flags and their values are skipped, query survives
-	vf := map[string]bool{"--since": true, "--limit": true}
-	if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
-		t.Errorf(`value-flag skipping failed: %q`, got)
-	}
-}
-
-func TestLabelStr(t *testing.T) {
-	got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
-	if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
-		t.Errorf("labelStr = %q", got)
-	}
-	if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
-		t.Errorf("labelStr (no __name__) = %q", got)
-	}
-}
-
-func TestOneLineList(t *testing.T) {
-	if got := oneLineList("  "); got != "(none)" {
-		t.Errorf("empty = %q, want (none)", got)
-	}
-	if got := oneLineList("a\nb"); got != "a, b" {
-		t.Errorf("multi = %q, want 'a, b'", got)
-	}
-}
-
-func TestHostOnly(t *testing.T) {
-	if got := hostOnly("foo.me/path"); got != "foo.me" {
-		t.Errorf("hostOnly = %q", got)
-	}
-	if got := hostOnly("foo.me"); got != "foo.me" {
-		t.Errorf("hostOnly = %q", got)
-	}
-}
--- a/cli/repo.go
+++ b/cli/repo.go
@ -1,101 +0,0 @@
-package main
-
-import (
-	"os"
-	"os/exec"
-	"os/user"
-	"path/filepath"
-	"strings"
-)
-
-// preferRemote picks the canonical remote: forgejo if present, else origin,
-// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
-func preferRemote(remotes []string) string {
-	has := map[string]bool{}
-	for _, r := range remotes {
-		has[r] = true
-	}
-	switch {
-	case has["forgejo"]:
-		return "forgejo"
-	case has["origin"]:
-		return "origin"
-	case len(remotes) > 0:
-		return remotes[0]
-	default:
-		return ""
-	}
-}
-
-// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
-func hasGitCryptAttr(gitattributes string) bool {
-	return strings.Contains(gitattributes, "filter=git-crypt")
-}
-
-// gitCryptFlags are the per-command flags that disable smudge/clean so git
-// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
-func gitCryptFlags() []string {
-	return []string{
-		"-c", "filter.git-crypt.smudge=cat",
-		"-c", "filter.git-crypt.clean=cat",
-		"-c", "filter.git-crypt.required=false",
-	}
-}
-
-// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
-func gitOutput(dir string, args ...string) (string, error) {
-	cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
-	out, err := cmd.Output()
-	return strings.TrimSpace(string(out)), err
-}
-
-func gitRepoRoot(dir string) (string, error) {
-	return gitOutput(dir, "rev-parse", "--show-toplevel")
-}
-
-// gitRemotes lists configured remote names for the repo at dir.
-func gitRemotes(dir string) ([]string, error) {
-	out, err := gitOutput(dir, "remote")
-	if err != nil {
-		return nil, err
-	}
-	if out == "" {
-		return nil, nil
-	}
-	return strings.Split(out, "\n"), nil
-}
-
-// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
-func isGitCryptRepo(repoRoot string) bool {
-	b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
-	if err != nil {
-		return false
-	}
-	return hasGitCryptAttr(string(b))
-}
-
-// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
-// else nil. These are injected per-command and never persisted.
-func cryptFlagsFor(repoRoot string) []string {
-	if isGitCryptRepo(repoRoot) {
-		return gitCryptFlags()
-	}
-	return nil
-}
-
-// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
-func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
-	full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
-	return runStreamingIn("", "git", full...)
-}
-
-// currentUser returns the OS username for branch naming (<user>/<topic>).
-func currentUser() string {
-	if u := os.Getenv("USER"); u != "" {
-		return u
-	}
-	if u, err := user.Current(); err == nil && u.Username != "" {
-		return u.Username
-	}
-	return "user"
-}
--- a/cli/repo_test.go
+++ b/cli/repo_test.go
@ -1,37 +0,0 @@
-package main
-
-import "testing"
-
-func TestPreferRemote(t *testing.T) {
-	cases := []struct {
-		in   []string
-		want string
-	}{
-		{[]string{"origin", "forgejo"}, "forgejo"},
-		{[]string{"forgejo"}, "forgejo"},
-		{[]string{"origin"}, "origin"},
-		{[]string{"upstream"}, "upstream"},
-		{nil, ""},
-	}
-	for _, c := range cases {
-		if got := preferRemote(c.in); got != c.want {
-			t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
-		}
-	}
-}
-
-func TestHasGitCryptAttr(t *testing.T) {
-	if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
-		t.Error("expected git-crypt detected")
-	}
-	if hasGitCryptAttr("*.md text\n*.png binary") {
-		t.Error("expected no git-crypt")
-	}
-}
-
-func TestGitCryptFlagsShape(t *testing.T) {
-	f := gitCryptFlags()
-	if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
-		t.Fatalf("unexpected git-crypt flags: %v", f)
-	}
-}
--- a/cli/run.go
+++ b/cli/run.go
@ -1,23 +0,0 @@
-package main
-
-import (
-	"os"
-	"os/exec"
-)
-
-// runStreaming executes name with args, wiring std streams to this process so
-// the caller sees live output, and returns the command's error (non-nil on
-// non-zero exit — preserved so homelab's own exit code reflects the child's).
-func runStreaming(name string, args ...string) error {
-	return runStreamingIn("", name, args...)
-}
-
-// runStreamingIn is runStreaming with a working directory (empty = inherit).
-func runStreamingIn(dir, name string, args ...string) error {
-	cmd := exec.Command(name, args...)
-	cmd.Dir = dir
-	cmd.Stdout = os.Stdout
-	cmd.Stderr = os.Stderr
-	cmd.Stdin = os.Stdin
-	return cmd.Run()
-}
--- a/cli/stack.go
+++ b/cli/stack.go
@ -1,54 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"os"
-	"path/filepath"
-	"sort"
-	"strings"
-)
-
-// findInfraRoot walks up from start to the infra repo root — the directory
-// holding both terragrunt.hcl and a stacks/ directory.
-func findInfraRoot(start string) (string, error) {
-	dir := start
-	for {
-		if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
-			return dir, nil
-		}
-		parent := filepath.Dir(dir)
-		if parent == dir {
-			return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
-		}
-		dir = parent
-	}
-}
-
-// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
-func resolveStack(infraRoot, name string) (string, error) {
-	dir := filepath.Join(infraRoot, "stacks", name)
-	if isDir(dir) {
-		return dir, nil
-	}
-	avail := listStacks(infraRoot)
-	return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
-}
-
-// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
-func listStacks(infraRoot string) []string {
-	entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
-	if err != nil {
-		return nil
-	}
-	var out []string
-	for _, e := range entries {
-		if e.IsDir() {
-			out = append(out, e.Name())
-		}
-	}
-	sort.Strings(out)
-	return out
-}
-
-func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
-func isDir(p string) bool  { fi, err := os.Stat(p); return err == nil && fi.IsDir() }
--- a/cli/stack_test.go
+++ b/cli/stack_test.go
@ -1,52 +0,0 @@
-package main
-
-import (
-	"os"
-	"path/filepath"
-	"testing"
-)
-
-func newInfraTree(t *testing.T, stacks ...string) string {
-	t.Helper()
-	root := t.TempDir()
-	if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
-		t.Fatal(err)
-	}
-	for _, s := range stacks {
-		if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
-			t.Fatal(err)
-		}
-	}
-	return root
-}
-
-func TestFindInfraRootWalksUp(t *testing.T) {
-	root := newInfraTree(t, "vault")
-	got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
-	if err != nil {
-		t.Fatalf("findInfraRoot error: %v", err)
-	}
-	if got != root {
-		t.Fatalf("findInfraRoot = %q, want %q", got, root)
-	}
-}
-
-func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
-	if _, err := findInfraRoot(t.TempDir()); err == nil {
-		t.Fatal("expected error outside an infra checkout")
-	}
-}
-
-func TestResolveStack(t *testing.T) {
-	root := newInfraTree(t, "vault", "monitoring")
-	dir, err := resolveStack(root, "vault")
-	if err != nil {
-		t.Fatalf("resolveStack error: %v", err)
-	}
-	if want := filepath.Join(root, "stacks", "vault"); dir != want {
-		t.Fatalf("resolveStack = %q, want %q", dir, want)
-	}
-	if _, err := resolveStack(root, "nonesuch"); err == nil {
-		t.Fatal("expected error for unknown stack")
-	}
-}
--- a/cli/telemetry.go
+++ b/cli/telemetry.go
@ -1,62 +0,0 @@
-package main
-
-import (
-	"bytes"
-	"encoding/json"
-	"net/http"
-	"os"
-	"strconv"
-	"strings"
-	"time"
-)
-
-// usageJob is the Loki stream job label for homelab usage telemetry.
-const usageJob = "homelab-usage"
-
-// emitUsage best-effort records one verb invocation to Loki for cross-user
-// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
-// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
-// never affect the command: all errors are swallowed and a tight timeout bounds
-// the cost. Opt out with HOMELAB_TELEMETRY=0.
-func emitUsage(verb string, runErr error) {
-	switch os.Getenv("HOMELAB_TELEMETRY") {
-	case "0", "off", "false", "no":
-		return
-	}
-	if verb == "" || strings.HasPrefix(verb, "usage") {
-		return // don't self-record the analytics reader
-	}
-	exit := 0
-	if runErr != nil {
-		exit = 1
-	}
-	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
-		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
-		Values: [][2]string{{
-			strconv.FormatInt(time.Now().UnixNano(), 10),
-			"exit=" + strconv.Itoa(exit) + " ver=" + version,
-		}},
-	}}})
-	if err != nil {
-		return
-	}
-	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
-	if err != nil {
-		return
-	}
-	req.Header.Set("Content-Type", "application/json")
-	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
-	if err != nil {
-		return
-	}
-	resp.Body.Close()
-}
-
-type lokiPush struct {
-	Streams []lokiStream `json:"streams"`
-}
-
-type lokiStream struct {
-	Stream map[string]string `json:"stream"`
-	Values [][2]string       `json:"values"`
-}
--- a/cli/update_viktorbarzin_me.go
+++ b/cli/update_viktorbarzin_me.go
@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
 	if err != nil {
 		return errors.Wrapf(err, "Error reading response")
 	}
-	glog.Infof("Response: %s", string(responseBody))
+	glog.Infof("Response: %s", responseBody)
 	return nil
 }
--- a/cli/usage_test.go
+++ b/cli/usage_test.go
@ -1,18 +0,0 @@
-package main
-
-import (
-	"strings"
-	"testing"
-)
-
-func TestUsageQuery(t *testing.T) {
-	got := usageQuery("30d", "")
-	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
-	if got != want {
-		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
-	}
-	withUser := usageQuery("7d", "emo")
-	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
-		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
-	}
-}
--- a/cli/woodpecker.go
+++ b/cli/woodpecker.go
@ -1,191 +0,0 @@
-package main
-
-import (
-	"context"
-	"encoding/json"
-	"fmt"
-	"io"
-	"net"
-	"net/http"
-	"os"
-	"os/exec"
-	"strings"
-	"time"
-)
-
-// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
-// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
-// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
-const (
-	wpHost = "ci.viktorbarzin.me"
-	wpLBIP = "10.0.20.203"
-)
-
-type wpClient struct {
-	base  string
-	token string
-	http  *http.Client
-}
-
-// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
-func wpToken() string {
-	if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
-		return t
-	}
-	out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
-	if err != nil {
-		return ""
-	}
-	return strings.TrimSpace(string(out))
-}
-
-func newWPClient() (*wpClient, error) {
-	tok := wpToken()
-	if tok == "" {
-		return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
-	}
-	ip := firstEnv("HOMELAB_WP_IP")
-	if ip == "" {
-		ip = wpLBIP
-	}
-	dialer := &net.Dialer{Timeout: 8 * time.Second}
-	tr := &http.Transport{
-		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
-			if strings.HasPrefix(addr, wpHost+":") {
-				addr = ip + addr[strings.LastIndex(addr, ":"):]
-			}
-			return dialer.DialContext(ctx, network, addr)
-		},
-	}
-	return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
-}
-
-// getJSON GETs path into v, retrying the transient empty/5xx responses the
-// Woodpecker API intermittently returns under load.
-func (c *wpClient) getJSON(path string, v interface{}) error {
-	var lastErr error
-	for attempt := 0; attempt < 5; attempt++ {
-		if attempt > 0 {
-			time.Sleep(2 * time.Second)
-		}
-		req, _ := http.NewRequest("GET", c.base+path, nil)
-		req.Header.Set("Authorization", "Bearer "+c.token)
-		resp, err := c.http.Do(req)
-		if err != nil {
-			lastErr = err
-			continue
-		}
-		body, _ := io.ReadAll(resp.Body)
-		resp.Body.Close()
-		if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
-			lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
-			continue
-		}
-		if resp.StatusCode >= 300 {
-			return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
-		}
-		return json.Unmarshal(body, v)
-	}
-	return lastErr
-}
-
-type wpPipeline struct {
-	Number  int    `json:"number"`
-	Status  string `json:"status"`
-	Event   string `json:"event"`
-	Commit  string `json:"commit"`
-	Message string `json:"message"`
-}
-
-func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
-	var ps []wpPipeline
-	err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
-	return ps, err
-}
-
-// findPipeline returns the pipeline for commit (prefix match), or the latest when
-// commit is empty.
-func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
-	ps, err := c.recentPipelines(repoID, 25)
-	if err != nil {
-		return wpPipeline{}, err
-	}
-	if len(ps) == 0 {
-		return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
-	}
-	if commit == "" {
-		return ps[0], nil
-	}
-	for _, p := range ps {
-		if strings.HasPrefix(p.Commit, commit) {
-			return p, nil
-		}
-	}
-	return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
-}
-
-func (c *wpClient) repoID() (int, error) {
-	owner, repo, err := repoOwnerName()
-	if err != nil {
-		return 0, err
-	}
-	var r struct {
-		ID int `json:"id"`
-	}
-	if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
-		return 0, err
-	}
-	if r.ID == 0 {
-		return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
-	}
-	return r.ID, nil
-}
-
-// repoOwnerName derives <owner>/<repo> from the cwd git remote.
-func repoOwnerName() (string, string, error) {
-	cwd, _ := os.Getwd()
-	root, err := gitRepoRoot(cwd)
-	if err != nil {
-		return "", "", fmt.Errorf("not in a git repository: %w", err)
-	}
-	remote := preferRemote(remotesOrEmpty(root))
-	url, err := gitOutput(root, "remote", "get-url", remote)
-	if err != nil {
-		return "", "", err
-	}
-	return parseOwnerRepo(url)
-}
-
-// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
-func parseOwnerRepo(url string) (string, string, error) {
-	u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
-	u = strings.TrimSuffix(u, "/")
-	if i := strings.Index(u, "://"); i >= 0 {
-		u = u[i+3:]
-	}
-	u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
-	parts := strings.Split(u, "/")
-	if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
-		return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
-	}
-	return parts[len(parts)-2], parts[len(parts)-1], nil
-}
-
-func isTerminalStatus(s string) bool {
-	switch s {
-	case "success", "failure", "error", "killed", "declined", "blocked":
-		return true
-	}
-	return false
-}
-
-func isFailureStatus(s string) bool {
-	return s == "failure" || s == "error" || s == "killed" || s == "declined"
-}
-
-func min(a, b int) int {
-	if a < b {
-		return a
-	}
-	return b
-}
--- a/cli/woodpecker_test.go
+++ b/cli/woodpecker_test.go
@ -1,40 +0,0 @@
-package main
-
-import "testing"
-
-func TestParseOwnerRepo(t *testing.T) {
-	cases := []struct{ in, owner, repo string }{
-		{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
-		{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
-		{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
-		{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
-	}
-	for _, c := range cases {
-		o, r, err := parseOwnerRepo(c.in)
-		if err != nil || o != c.owner || r != c.repo {
-			t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
-		}
-	}
-	if _, _, err := parseOwnerRepo("nonsense"); err == nil {
-		t.Error("expected error for unparseable remote")
-	}
-}
-
-func TestStatusClassification(t *testing.T) {
-	for _, s := range []string{"success", "failure", "error", "killed"} {
-		if !isTerminalStatus(s) {
-			t.Errorf("%q should be terminal", s)
-		}
-	}
-	for _, s := range []string{"running", "pending"} {
-		if isTerminalStatus(s) {
-			t.Errorf("%q should not be terminal", s)
-		}
-	}
-	if !isFailureStatus("failure") || !isFailureStatus("error") {
-		t.Error("failure/error should classify as failure")
-	}
-	if isFailureStatus("success") {
-		t.Error("success must not classify as failure")
-	}
-}
--- a/config.tfvars
+++ b/config.tfvars
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
 Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:

 - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
-- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
 - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
 - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

--- a/docs/adr/0004-homelab-unified-cli.md
+++ b/docs/adr/0004-homelab-unified-cli.md
@ -1,30 +0,0 @@
-# homelab: a unified infra-ops CLI grown in place from infra/cli
-
-Agents re-derive the same operational command boilerplate every session — mining
-51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
-(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
-the deterministic, repeated **actions** (not judgment) agents run — composable in
-bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
-grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
-alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
-file (the infra repo deploys continuously and does not cut semver tags).
-
-## Considered options
-
-- **Its own top-level repo** (the original plan) — rejected in favour of keeping
-  it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
-  Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
-  GitOps continuous-deploy.
-- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
-  webhook use-cases.
-- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
-  recurring action surface (methodology skills; third-party/owned MCP such as
-  phpIPAM, which homelab does NOT duplicate).
-
-## Consequences
-
-- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
-  in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
-  and falls through to the legacy `-use-case` path verbatim.
-- Distribution: built from source to `/usr/local/bin/homelab` during devvm
-  provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.
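The "front-dispatch, then fall through" behaviour described in the removed ADR-0004 can be illustrated roughly as follows. This is a sketch under assumptions, not the deleted `main()`: the `verbs` registry, `dispatch`, and `legacyMain` are hypothetical names, and only `manifest` and the `-use-case` flag come from the ADR text.

```go
package main

import (
	"fmt"
	"os"
)

// verbs is a hypothetical registry; only "manifest" is named in the ADR.
var verbs = map[string]func(args []string) error{
	"manifest": func(args []string) error {
		fmt.Println("manifest: (verb listing would go here)")
		return nil
	},
}

// dispatch reports whether args named a known verb. When handled is false the
// caller is expected to continue into the legacy flag-based path untouched.
func dispatch(args []string) (handled bool, err error) {
	if len(args) == 0 {
		return false, nil
	}
	fn, ok := verbs[args[0]]
	if !ok {
		return false, nil
	}
	return true, fn(args[1:])
}

// legacyMain stands in for the preserved -use-case webhook entry point.
func legacyMain(args []string) {
	fmt.Println("legacy path:", args)
}

func main() {
	handled, err := dispatch(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !handled {
		legacyMain(os.Args[1:])
	}
}
```

Because unknown first arguments simply fall through, an invocation such as `-use-case vpn` would reach the legacy path without the verb layer parsing it.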
--- a/docs/adr/0005-homelab-v01-scope.md
+++ b/docs/adr/0005-homelab-v01-scope.md
@ -1,23 +0,0 @@
-# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
-
-v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
-(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
-force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
-commands and where agents lose the most time and leak the most presence claims.
-
-v0.1 enforces **no** homelab-level permission gating: everything is allowed,
-relying on existing gates (harness permission mode, presence claims, plan
-approval). But every verb records a `read|write` tier (visible in `manifest`), so
-a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
-later with zero restructuring.
-
-## Considered options
-
-- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
-  value, but defers the toil that motivated the project.
-- **One domain deep (k8s)** — cleanest template, narrow day-one value.
-
-We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
-the extra complexity (worktree lifecycle, git-crypt flag injection, presence
-coupling, branch-protection PR fallback) for the biggest immediate toil
-reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.
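The removed ADR-0005 records a `read|write` tier per verb so that a classifier hook could later auto-allow reads and prompt on writes. A minimal sketch of that idea, with assumptions stated plainly: the `tier` type, the `verbTiers` table, and `decide` are hypothetical, and the tier assigned to each verb below is a guess for illustration rather than what the CLI actually recorded.

```go
package main

import "fmt"

type tier string

const (
	tierRead  tier = "read"
	tierWrite tier = "write"
)

// verbTiers is illustrative only. The verb names appear in the ADR; the tier
// given to each one here is an assumption.
var verbTiers = map[string]tier{
	"tf validate": tierRead,
	"tf apply":    tierWrite,
	"claim":       tierWrite,
	"release":     tierWrite,
}

// decide returns "allow" for read-tier verbs and "prompt" for everything else,
// including verbs the table does not know about.
func decide(verb string) string {
	if verbTiers[verb] == tierRead {
		return "allow"
	}
	return "prompt"
}

func main() {
	for _, v := range []string{"tf validate", "tf apply", "something-new"} {
		fmt.Printf("%s -> %s\n", v, decide(v))
	}
}
```

Defaulting unknown verbs to "prompt" keeps the hypothetical hook fail-closed: a verb added without a tier would be treated as a write.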
--- a/docs/adr/0006-homelab-work-and-tf.md
+++ b/docs/adr/0006-homelab-work-and-tf.md
@ -1,29 +0,0 @@
-# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
-
-Four behaviours of the infra-loop verbs are surprising enough to record:
-
-1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
-   native harness worktree tool.** A CLI is a child process and cannot change the
-   agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
-   creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
-   prints the path — the agent enters it with native `EnterWorktree({path})`.
-
-2. **`work land` is auto-land, but gated on verification.** It merges master in →
-   runs verification → pushes `HEAD:master` (fetch+merge+retry on
-   non-fast-forward) → falls back to pushing the feature branch for a PR when the
-   direct push is rejected (branch protection). It **refuses to push when it
-   cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
-   `--no-verify` is passed — added after an accidental smoke-test land pushed
-   unverified WIP to master (benign: the infra CI applied 0 stacks because the
-   diff was `cli/`-only, but an unverified land must be deliberate, not default).
-
-3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
-   Local applies are out-of-band (CI applies canonically on push) but happen
-   constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
-   delegates to `scripts/tg apply --non-interactive`, and **always releases on
-   exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
-   documented ~200-claim leak — and prints an out-of-band reminder.
-
-4. **Known v0.1 limitation:** `work land` does not yet wait for CI to go green; that
-   arrives with the ci/deploy watch verb-group. It prints a reminder to follow
-   the pipeline manually.
--- a/docs/adr/0007-homelab-k8s-verbs.md
+++ b/docs/adr/0007-homelab-k8s-verbs.md
@ -1,30 +0,0 @@
-# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
-
-v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
-(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
-than every other domain combined).
-
-It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
-one app, so `<app>` defaults to the namespace, and the target defaults to
-`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
-`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
-specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
-
-Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
-`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
-
-## Decisions worth recording
-
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
-  `scale`/`create`). They stay raw `kubectl`, by design, per the repo's
-  Terraform-only policy — the corpus confirms they're low-frequency, and a
-  friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
-  config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
-  sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
-  `psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
-  `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
-  the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
-  unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.
--- a/docs/adr/0008-homelab-memory-verbs.md
+++ b/docs/adr/0008-homelab-memory-verbs.md
@ -1,30 +0,0 @@
-# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
-
-v0.3 adds the memory verb-group so agents can search and navigate memory from the
-CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
-ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
-frontend over it**. `homelab memory` is a thin HTTP client over the same API,
-using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
-`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
-API directly, it **works even when the MCP frontend is down** — the recurring
-MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
-offline for the entire session this was built in).
-
-Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
-`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
-the live API including a store→recall→delete round-trip — full data-plane parity
-with the MCP.
-
-## Deprecation path (deliberate follow-up — NOT done in v0.3)
-
-The MCP is more than tools: the **per-prompt auto-recall hook** and the
-**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
-a separate, sequenced change:
-
-1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
-   to `homelab memory store`.
-2. Update the CLAUDE.md memory policy to point at the CLI.
-3. Uninstall the MCP.
-
-Done CLI-first (verbs proven before touching the every-prompt path) so a
-regression can't silently break auto-recall/auto-learn fleet-wide.
--- a/docs/adr/0009-homelab-ci-deploy-verbs.md
+++ b/docs/adr/0009-homelab-ci-deploy-verbs.md
@ -1,29 +0,0 @@
-# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
-
-v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
-a build/deploy to completion), proven during the session that built it (hours
-spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
-retrigger logic for a single CI incident).
-
-## Decisions
-
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
-  not its Postgres schema (which drifts across upgrades — column renames bit us
-  mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
-  while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
-  equivalent of the house `curl --resolve` pattern). Token from
-  `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
-  git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
-  under load (it flapped through the whole build session); `getJSON` retries
-  empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
-  on the landed commit and fails if the pipeline does — closing the gap ADR-0005
-  deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
-  the deployment image to reference the expected sha, *then* blocks on rollout
-  status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
-  endpoints were the least reliable this session (often empty); `status`/`watch`
-  rely on the list endpoint that works. A DB-backed `ci logs` is a possible
-  follow-up if the API path stays flaky.
--- a/docs/adr/0010-homelab-net-obs-verbs.md
+++ b/docs/adr/0010-homelab-net-obs-verbs.md
@ -1,37 +0,0 @@
-# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
-
-v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
-test the user posed mid-build: *does the verb save reasoning, or only typing?* A
-wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
-keystrokes but not thought. These four save thought — the reasoning they encode
-is **which endpoint, reached how, with what auth/URL shape** — re-derived every
-time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
-get`, which are thin wrappers; see the session discussion.)
-
-## Decisions
-
- **Internal ingresses, reached via the LB.** Everything routes through the
-  Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
-  Go form of the house `curl --resolve host:443:10.0.20.203` pattern
-  (`probe.go: clientDialingIP`). Verified live before building: Prometheus
-  (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
-  answer JSON over the LB with **no auth gate and no port-forward** — so these
-  stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
-  (→ Cloudflare) AND dials the internal LB, reporting both — because the useful
-  question is *where* a break is (CF edge vs the app vs the LB path), which a
-  single curl can't answer. The external leg forces public resolution (the devvm
-  resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
-  `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
-  Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
-  alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
-  queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
-  raw `*.svc` services) that would force port-forward/`kubectl run`. The
-  reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
-  command-wrappers (low savings); only compound node ops (serial console, VM
-  wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
-  unless a concrete pain surfaces — the high-value deterministic surface
-  (tf/work/ci/k8s/memory + these probes) is now covered.
--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@ -1,42 +0,0 @@
-# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
-
-v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
-exists to answer the question that drove the whole CLI — *which verbs are worth
-adding next* — with data instead of one maintainer's habits (the earlier mining
-covered a single user's ~51k commands, so the surface is shaped to that user).
-
-> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
-> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
-> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
-> owner in-session") no longer holds: the managed-settings policy now **defers
-> to OS/sudo authorization**. The `usage top` telemetry design itself is
-> unchanged and still current — only the "never read homes" framing in the
-> third decision below is overtaken.
-
-## Decisions
-
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
-  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
-  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
-  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
-  the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
-  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
-  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
-  emit sees only the matched verb name, not the arguments. This is what makes
-  cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
-  CLI writes its own invocations (attributed to its OS user) to the shared Loki
-  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
-  back with a LogQL metric query. This is the privacy-preserving resolution to
-  "what does everyone (e.g. another user) use" — it never touches anyone's
-  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
-  managed-settings; reading another user's home is off-limits even for an owner
-  in-session — a fresh session under changed MDM policy is the only legitimate
-  path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
-  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
-  must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
-  path (same host, same LB dial). Presence MySQL was the alternative (queryable
-  SQL) but would add a write dependency and creds; Loki needs neither.
--- a/docs/adr/0012-homelab-ha-verbs.md
+++ b/docs/adr/0012-homelab-ha-verbs.md
@ -1,54 +0,0 @@
-# homelab Home Assistant verbs: token resolution + host SSH, not entity control
-
-v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
-operator's sessions: across ~1,900 shell commands the single most-repeated line
-(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
-and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
-a shell function ~30× — both re-derived from scratch every session. The existing
-`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
-arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
-cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
-gap for every user in every directory.
-
-## Decisions
-
- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
-  does entity state and control (`get_state`, `call_service`, history, logs).
-  Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
-  — we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
-  *resolution* and host *SSH*, neither of which an API-only MCP can provide. The
-  value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
-  the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
-  `london`) via the ambient kubeconfig. This is robust to env drift — the precise
-  failure that made agents re-derive the pipeline. Read-tier, prints the bare
-  token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
-  It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob
-  also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
-  admins can read — so the verb hung/failed for the non-admin operator it was
-  built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
-  OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
-  the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
-  the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
-  the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
-  — this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
-  use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
-  `UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
-  TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
-  is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
-  whoever first wrote the workflow; that user's key must be enrolled on the HA
-  host. Write-tier (runs an arbitrary remote command).
- **sofia is the default; london is structural.** The devvm sits on the Sofia
-  LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
-  (`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
-  works (a pure secret read), but `ha ssh --instance london` generally won't
-  connect from here — london is remote. We model it correctly rather than
-  pretend it's reachable.
- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
-  the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
-  `check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
-  already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
-  `usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
-  still hand-rolled often.
--- a/docs/adr/0013-homelab-browser-verbs.md
+++ b/docs/adr/0013-homelab-browser-verbs.md
@ -1,75 +0,0 @@
-# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
-
-v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
-capability that already existed but was undiscoverable: driving the cluster's
-**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
-`svc/chrome-service:9222`) from the devvm, for sites that detect and block
-headless automation.
-
-## Motivating incident (2026-06-22)
-
-Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
-portal: the headless `@playwright/mcp` browser loaded the site and filled the
-entire multi-step form, but the **final submit silently failed** — Fixflo's
-pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
-spinner hung, no issue was created. Root cause = headless-Chrome detection. The
-fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
-submitted first try (Fixflo ref IS22657587). That capability was documented
-(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
-it took ~40 min, three redundant full form re-runs, and a user hint. The agent
-also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
-of inspecting the network panel.
-
-## Decisions
-
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
-  rejected: the CLI is run every session (so the verb is *discoverable*), is
-  versioned, multi-user, and test-covered. A private, untested skill is none of
-  those. The command owns only the deterministic *mechanics* (port-forward,
-  stealth injection, lifecycle) — the agent supplies the Playwright script, so
-  *judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
-  one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
-  payload in `browser --help`: the *when-to-use* signature (a site loads but a
-  gated action fails/hangs, or one request 500s/aborts while siblings 200 →
-  suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
-  = request resolved/intercepted by the automation layer, **not** egress;
-  egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
-  and would break the page load too). A command the agent doesn't think to run is
-  useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
-  localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
-  NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
-  label. Readiness is asserted against `/json/version`: the endpoint must report
-  a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
-  **always** torn down (process-group kill + signal handler), on success and on
-  error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
-  profile.** chrome-service is a single shared browser with a persistent profile.
-  A fresh, always-closed context is safe for concurrent callers (tripit's fare
-  scrape connects per-quote) and is what production already does. The warmed
-  persistent profile (cookies from a manual noVNC login) is opt-in for flows that
-  need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
-  chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
-  Chromium 130). `connect_over_cdp` speaks the browser's CDP, and that protocol
-  changes between Playwright minors — the devvm's ambient Python Playwright was
-  1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
-  regardless of local drift. `playwright-core` (not `playwright`) because no
-  browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
-  installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
-  guarded) on first use, alongside the embedded runner + stealth files. node is
-  already fleet-wide; this avoids coupling the feature to a provisioner change
-  and keeps it self-contained and self-healing. The client runs on the devvm, so
-  `setInputFiles` streams local files to the remote browser over CDP — no
-  `chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
-  copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
-  in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
-  `go:embed` can't reach outside the package dir, hence the vendored copy rather
-  than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
-  workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
-  the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
-  via `usage top` (ADR-0011) before adding more.
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@ -1,35 +0,0 @@
---
-status: accepted
-date: 2026-06-24
---
-
-# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
-
-As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
-
-## Considered options
-
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
-
-## Consequences
-
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
-
-## As-built (2026-06-25)
-
-Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
-
-Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/adr/0015-os-is-the-authorization-boundary.md
+++ b/docs/adr/0015-os-is-the-authorization-boundary.md
@ -1,57 +0,0 @@
-# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
-
-Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
-carried and that ADR-0011 leaned on ("never read another user's home /
-`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
-subject — `usage top` telemetry and its emit design — is unchanged and still
-current; only the privacy prohibition it referenced is superseded here.
-
-## Context
-
-The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
-`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
-"you are not the admin, do not escalate privileges" and "never read another
-user's home directory, credentials, tokens, or `~/.claude`." The OS told a
-different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
-The kernel had already granted total read access; the policy was layering an
-artificial refusal on top of an authorization the OS already grants, and the
-"not the admin" framing was factually wrong for a NOPASSWD-root user.
-
-Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
-or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
-for analytics/debugging across the shared box.
-
-## Decision
-
- **Authorization follows the OS, not this policy.** Agents may access whatever
-  their OS user can access — directly or via `sudo` where they hold sudo rights
-  — and must not impose restrictions stricter than the OS. On this box that
-  includes other users' home directories and `~/.claude` for users who hold
-  broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
-  permission model + sudoers is the single source of truth for who may read
-  what. Other homes are `0750`-owned, so a cross-home read necessarily transits
-  `sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
-  stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
-  file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
-  managed-settings, so every user's agents defer to that user's own sudo grant.
-  Any user with broad sudo gets the same cross-home read capability over other
-  users' files. Accepted by the owner with that understanding; emo's and
-  ancamilea's `~/.claude` directories are now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
-  the session that made the change keeps running under the old policy.
-
-## Consequences
-
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
-  "cross-user analytics without reading homes" answer) remains useful but is no
-  longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
-  prompt-injected or otherwise compromised, it can now read every user's secrets
-  with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
-  is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
-  `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
-  session.
--- a/docs/adr/0016-gpu-vram-extended-resource-budget.md
+++ b/docs/adr/0016-gpu-vram-extended-resource-budget.md
@ -1,107 +0,0 @@
-# GPU VRAM protection via a scheduler extended-resource budget + a runtime watchdog (HAMi/MPS rejected)
-
-The single Tesla T4 (16 GB, ~15360 MiB usable) on `k8s-node1` is **time-sliced**
-(`nvidia.com/gpu` advertised ×100, `migStrategy: none`) and shared by ~9 tenants
-(immich-ml, immich-server, frigate, llama-swap, portal-stt, tts,
-ebook2audiobook, ytdlp, android-emulator). Time-slicing grants a *scheduling
-turn, not memory* — the scheduler is blind to VRAM, so the tenants can
-collectively overallocate the card. On 2026-06-02 immich-ml's unbounded
-onnxruntime OCR arena grew from ~2 GB to **10.7 GB**, starved llama-swap's
-qwen3-8b, and silently broke recruiter-responder triage for ~5 h
-(`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). The
-post-mortem's #1 follow-up — alert/guard on GPU VRAM — was never built.
-
-## Context
-
- **MIG is impossible.** The T4 is Turing; hardware memory partitioning (MIG)
-  only exists on Ampere+. So per-tenant *hardware* isolation is off the table.
- **The card is busy but not steadily oversubscribed.** Measured steady residents
-  (2026-06-17, `gpu_pod_memory_used_bytes`): immich-ml ~2.1 GiB, frigate ~1.9 GiB,
-  llama-swap ~4.35 GiB peak (one model at a time — it already swaps), immich-server
-  ~1.2 GiB, portal-stt ~1.5 GiB, android-emulator ~0.15 GiB → ~11 GiB used, ~4 GiB
-  free. **The failure mode is a single tenant's runtime runaway, not a
-  scheduling-time pile-on.**
- **Prior art already exists (soft):** a `gpu-workload` PriorityClass (1,200,000)
-  is auto-stamped on every GPU pod by the Kyverno `inject-gpu-workload-priority`
-  policy (tts excluded → `tier-2-gpu`, evicted first); tts runs behind a
-  free-VRAM demand-gate (`stacks/tts`, scales 0↔1 on `sum(gpu_pod_memory_used_bytes)`
-  vs a floor); immich-ml is soft-bounded by `MACHINE_LEARNING_MODEL_TTL=600`. What
-  was missing is anything that bounds a tenant's VRAM *during active use*.
-
-### Alternatives considered and rejected
-
- **NVIDIA MPS** (device-plugin `sharing.mps`, hard `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`):
-  caps are **uniform** — slice = `total ÷ replicas`, tenants get integer multiples.
-  Nine heterogeneous tenants spanning 0.15→6 GB do not fit uniform slices without
-  large rounding waste on a card that has none to spare. Rejected.
- **HAMi vGPU** (per-container `nvidia.com/gpumem` MiB caps, libvgpu CUDA hook):
-  the *correct* hard-cap primitive and T4-supported, but it **replaces the
-  operator's device plugin** (the operator owns/reconciles it), enforces via an
-  `LD_PRELOAD` CUDA hook that is **unproven for our NVENC transcode path**
-  (open codec bug), **cannot cap the android-emulator** (QEMU bypasses the CUDA
-  hook — KubeVirt/Kata explicitly unsupported), carries a **restart-triggered
-  false-OOM bug** (#1181) directly in our blast radius (kured reboots node1
-  regularly), and its reservation-based scheduling would **supersede the working
-  demand-gate** and **strand the ~4 GB of steady headroom**. Too much risk and
-  behavioral change for the single proven failure mode. Rejected for now; this
-  ADR is the record of *why*, so a future "let's just use HAMi" re-opens with the
-  trade-offs already on the table.
-
-## Decision
-
-Make the scheduler VRAM-aware and add runtime teeth — entirely with repo-native
-pieces, **no device-plugin/driver change, time-slicing untouched**:
-
-1. **Budget (schedule-time).** Advertise a custom node-level **extended resource
-   `viktorbarzin.me/gpumem`** on the GPU node (= ~14000 MiB; ~15360 MiB usable
-   minus ~1360 MiB driver/CUDA-context/exporter slack), via a reconcile Job +
-   CronJob that run `kubectl patch node --subresource=status` (dynamic over
-   `nvidia.com/gpu.present=true` nodes; re-asserts after node re-register).
-   Every GPU tenant declares `resources.limits."viktorbarzin.me/gpumem"` (immich-ml
-   3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500 — sum
-   ≤ advertised). Extended resources are **non-overcommittable** (request==limit,
-   integer), so the scheduler refuses to co-schedule past the card → overflow
-   `Pending`. On-demand batch tenants (tts/ebook2audiobook/ytdlp) keep the
-   free-VRAM demand-gate and fill the real slack rather than holding a reserved seat.
-2. **Watchdog (runtime).** A `gpu-vram-watchdog` CronJob (every minute, nvidia ns)
-   reads per-pod `gpu_pod_memory_used_bytes` (the host-PID exporter) and each GPU
-   pod's *declared* `gpumem`, and **only when actual free VRAM < floor (~1536 MiB)**
-   recycles the biggest **over-budget** offender (used > declared). Contract
-   enforcement, not priority (immich-ml and llama-swap share `gpu-workload`, so
-   priority can't distinguish them). Acting only under pressure lets a tenant burst
-   into genuine slack; the recycle clears its arena (exactly what the TTL=600
-   Recreate does for immich-ml when idle). This is what would have caught 2026-06-02.
-3. **Alerting** (the never-built follow-up): GPU free-VRAM below floor, GPU pod
-   `Pending` on `gpumem`, and pod-over-budget → the `#alerts` digest.
-
-This is **soft enforcement**: the scheduler reserves on paper and the watchdog
-corrects at runtime with a detection lag (seconds to a minute), so a brief physical
-overshoot is possible before a recycle. Accepted, given the failure mode is a
-slow arena drift, not an instantaneous spike, and the alternative (HAMi) carries
-disproportionate risk for this hardware.
-
-## Consequences
-
- **The 2026-06-02 class is bounded** without touching the pinned driver, the GPU
-  operator, or time-slicing. immich-ml can no longer silently grow into
-  llama-swap's VRAM: it either schedules within its budget or, on a true runaway
-  under pressure, gets recycled (its heavy library job is the intended loser).
- **The card has a seating chart now.** Sum of declared budgets ≤ ~14 GB, so a new
-  always-on GPU tenant requires re-budgeting; an over-budget on-demand tenant sits
-  `Pending`. This is the intended, legible back-pressure.
- **Small/on-demand tenants (android-emulator, ytdlp, tts, ebook2audiobook) are
-  NOT budgeted in v1** — they fill *actual* slack rather than holding a scheduler
-  seat (tts via its existing free-VRAM demand-gate), and are covered by the
-  ~1.4 GiB physical reserve plus budget headroom (the five residents' budgets sum
-  to 13300 ≤ 14000 advertised). Give them budgets later if they grow; until then
-  the watchdog protects the budgeted five and counts everyone's usage toward free.
- **New RBAC:** the reconcile SA patches `nodes/status`; the watchdog SA lists pods
-  cluster-wide and deletes pods in GPU tenant namespaces. Far less privileged than
-  existing cluster-admin tooling (woodpecker-agent).
- **Apply order matters:** advertise `gpumem` (nvidia stack) **before** the
-  consumer stacks declare it, or a pod requesting an unadvertised extended
-  resource is unschedulable. The reconcile runs as a Job (immediate) for this.
- **Fully reversible:** delete the CronJobs/Job + the `gpumem` stanzas, and
-  run `kubectl patch node --subresource=status` to remove the capacity key. Nothing
-  structural; no driver/operator state to unwind.
- The `gpumem` numbers are first estimates; tune from `gpu_pod_memory_used_bytes`.
--- a/docs/adr/0017-cctv-physical-cabling.svg
+++ b/docs/adr/0017-cctv-physical-cabling.svg
@ -1,126 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="820" viewBox="0 0 1600 820" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
-  <!-- ADR-0017: PHYSICAL cabling only — no VLANs, no flows. Solid = cable in
-       place today · dashed = camera-day work · ~~~ = radio. Palette: neutral
-       grays + blue for copper runs (reference dataviz palette text tokens). -->
-  <defs>
-    <marker id="dot" viewBox="0 0 8 8" refX="4" refY="4" markerWidth="5" markerHeight="5">
-      <circle cx="4" cy="4" r="3" fill="#52514e"/>
-    </marker>
-  </defs>
-
-  <rect width="1600" height="820" fill="#fcfcfb"/>
-
-  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — physical cabling (single-switch, rev 3)</text>
-  <text x="40" y="66" font-size="15" fill="#52514e">wires only — no VLANs, no traffic · solid = in place · dashed = camera-day · ~ = radio</text>
-
-  <!-- ═════════ APARTMENT ═════════ -->
-  <rect x="40" y="100" width="330" height="330" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
-  <text x="56" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">APARTMENT</text>
-
-  <text x="70" y="158" font-size="13" fill="#52514e">☁ ISP (internet)</text>
-  <path d="M120,166 L120,196" fill="none" stroke="#52514e" stroke-width="2"/>
-
-  <rect x="64" y="198" width="220" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="80" y="222" font-size="14.5" font-weight="700" fill="#0b0b0b">AX6000 router</text>
-  <text x="80" y="242" font-size="12" fill="#52514e">192.168.1.1 · WAN←ISP · 8×LAN</text>
-
-  <rect x="64" y="290" width="220" height="52" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="80" y="312" font-size="14" font-weight="700" fill="#0b0b0b">Synology NAS · .13</text>
-  <text x="80" y="330" font-size="12" fill="#52514e">on an AX6000 LAN port</text>
-  <path d="M174,262 L174,290" fill="none" stroke="#2a78d6" stroke-width="2"/>
-
-  <text x="70" y="376" font-size="12.5" fill="#52514e">📶 wifi clients (phones, laptops)</text>
-  <path d="M110,262 C104,272 106,278 100,286 C106,294 104,300 100,308 C106,316 104,322 100,330 C106,338 104,344 100,352 C104,358 102,362 98,366" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
-
-  <!-- in-wall run apartment -> garage -->
-  <path d="M284,230 C450,230 540,228 616,228" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
-  <text x="330" y="218" font-size="12.5" font-weight="700" fill="#2a78d6">in-wall run → garage</text>
-
-  <!-- ═════════ GARAGE — RACK ═════════ -->
-  <rect x="560" y="100" width="640" height="680" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
-  <text x="576" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE — RACK</text>
-
-  <!-- switch -->
-  <rect x="600" y="150" width="560" height="150" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
-  <text x="616" y="176" font-size="14.5" font-weight="700" fill="#0b0b0b">TL-SG105PE · 5-port gigabit PoE switch</text>
-  <text x="616" y="194" font-size="12" fill="#52514e">mgmt 192.168.1.6 · replaces the old TL-SG105E (→ shelf, cold spare)</text>
-  <g font-size="11.5" text-anchor="middle">
-    <rect x="616" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
-    <text x="664" y="227" font-weight="700" fill="#0b0b0b">P1</text>
-    <text x="664" y="242" fill="#52514e">← apartment</text>
-    <rect x="722" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
-    <text x="770" y="227" font-weight="700" fill="#0b0b0b">P2</text>
-    <text x="770" y="242" fill="#52514e">← 4G router</text>
-    <rect x="828" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
-    <text x="876" y="227" font-weight="700" fill="#0b0b0b">P3</text>
-    <text x="876" y="242" fill="#52514e">← UPS mgmt</text>
-    <rect x="934" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984" stroke-dasharray="4,3"/>
-    <text x="982" y="227" font-weight="700" fill="#0b0b0b">P4 ⚡PoE</text>
-    <text x="982" y="242" fill="#52514e">← camera</text>
-    <rect x="1040" y="210" width="96" height="40" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
-    <text x="1088" y="227" font-weight="700" fill="#0b0b0b">P5</text>
-    <text x="1088" y="242" fill="#52514e">← R730 eno1</text>
-  </g>
-  <text x="616" y="284" font-size="12" fill="#52514e">every cable below re-plugs old-switch → PE on camera day (≈3 min)</text>
-
-  <!-- 4G router -->
-  <rect x="600" y="360" width="250" height="64" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="616" y="384" font-size="14" font-weight="700" fill="#0b0b0b">4G router · 192.168.1.7</text>
-  <text x="616" y="403" font-size="12" fill="#52514e">~cellular uplink (out-of-band)</text>
-  <path d="M770,300 L770,360" fill="none" stroke="#2a78d6" stroke-width="2"/>
-  <path d="M856,392 C866,386 864,380 874,376 C866,370 868,364 876,360" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
-  <text x="884" y="380" font-size="12" fill="#52514e">📡 cellular</text>
-
-  <!-- UPS -->
-  <rect x="600" y="452" width="250" height="56" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="616" y="476" font-size="14" font-weight="700" fill="#0b0b0b">UPS (Huawei)</text>
-  <text x="616" y="494" font-size="12" fill="#52514e">network mgmt card</text>
-  <path d="M876,300 C876,340 800,410 720,452" fill="none" stroke="#2a78d6" stroke-width="2"/>
-
-  <!-- R730 -->
-  <rect x="600" y="540" width="560" height="220" rx="8" fill="#ffffff" stroke="#0b0b0b" stroke-opacity="0.5" stroke-width="1.6"/>
-  <text x="616" y="566" font-size="14.5" font-weight="700" fill="#0b0b0b">Dell R730 · PVE host · 192.168.1.127</text>
-  <g font-size="11.5">
-    <rect x="616" y="582" width="128" height="38" rx="5" fill="#2a78d6" fill-opacity="0.08" stroke="#8a8984"/>
-    <text x="628" y="598" font-weight="700" fill="#0b0b0b">eno1 · LAN1</text>
-    <text x="628" y="613" fill="#52514e">← switch P5 · 1GbE</text>
-    <rect x="756" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
-    <text x="768" y="598" font-weight="700" fill="#52514e">eno2 · LAN2</text>
-    <text x="768" y="613" fill="#8a8984">dark · fallback leg</text>
-    <rect x="896" y="582" width="128" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
-    <text x="908" y="598" fill="#8a8984">eno3 / eno4</text>
-    <text x="908" y="613" fill="#8a8984">free, uncabled</text>
-    <rect x="1036" y="582" width="108" height="38" rx="5" fill="#ffffff" stroke="#d8d7d2"/>
-    <text x="1048" y="598" fill="#8a8984">iDRAC · .4</text>
-    <text x="1048" y="613" fill="#8a8984">shared-LOM/eno1</text>
-  </g>
-  <text x="616" y="648" font-size="12" fill="#52514e">no other network cables — everything else on this host is VIRTUAL:</text>
-  <text x="616" y="668" font-size="12" fill="#52514e">pfSense · ha-sofia (HA) · devvm · k8s-master + node1-6 · registry VM …</text>
-  <text x="616" y="696" font-size="12" fill="#8a8984">(power: host + switch fed from the UPS — power wiring not drawn)</text>
-
-  <path d="M1088,300 C1088,420 720,500 680,582" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
-  <text x="1100" y="330" font-size="12.5" font-weight="700" fill="#2a78d6">LAN1 cable</text>
-
-  <!-- ═════════ GARAGE ENTRANCE ═════════ -->
-  <rect x="1280" y="100" width="280" height="200" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
-  <text x="1296" y="126" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
-  <rect x="1304" y="150" width="232" height="110" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="1320" y="176" font-size="14" font-weight="700" fill="#0b0b0b">vermont-garage camera</text>
-  <text x="1320" y="196" font-size="12" fill="#52514e">HiLook IPC-T241H-C · 10.0.30.70</text>
-  <text x="1320" y="214" font-size="12" fill="#52514e">powered over the data cable (PoE)</text>
-  <text x="1320" y="232" font-size="12" fill="#52514e">outdoor · armored conduit</text>
-
-  <path d="M982,210 C982,150 1140,140 1304,180" fill="none" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
-  <text x="1080" y="136" font-size="12.5" font-weight="700" fill="#52514e">single cat6 in conduit · data + PoE power (camera day)</text>
-
-  <!-- legend -->
-  <g transform="translate(40,780)" font-size="12.5">
-    <line x1="0" y1="-4" x2="44" y2="-4" stroke="#2a78d6" stroke-width="2.5"/>
-    <text x="52" y="0" fill="#0b0b0b">copper, in place</text>
-    <line x1="190" y1="-4" x2="234" y2="-4" stroke="#52514e" stroke-width="2.5" stroke-dasharray="7,5"/>
-    <text x="242" y="0" fill="#0b0b0b">camera-day cable / dark port</text>
-    <path d="M450,-4 C456,-10 454,-14 460,-18" fill="none" stroke="#8a8984" stroke-width="1.6" stroke-dasharray="2,3"/>
-    <text x="470" y="0" fill="#0b0b0b">radio (wifi / cellular)</text>
-    <text x="650" y="0" fill="#52514e">total wired links at the rack: 5 (all on the one switch) · ADR-0017 rev 3</text>
-  </g>
-</svg>
--- a/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
+++ b/docs/adr/0017-cctv-segment-dedicated-pfsense-leg.md
@ -1,99 +0,0 @@
-# CCTV segment: dedicated pfSense interface, VLAN-30 trunk on the LAN1 cable
-
-Status: accepted (2026-07-02, rev 3 — single-switch)
-
-![Network topology — dCCTV segment, flows, and camera-day steps](./0017-cctv-segment-topology.svg)
-
-![Physical cabling — wires only, no VLANs](./0017-cctv-physical-cabling.svg)
-
-The first owned camera at the Sofia/Vermont site (`vermont-garage`, HiLook
-IPC-T241H-C at the garage entrance) needs to be network-isolated: its cable is
-physically exposed outside the apartment, so anything plugged into that cable
-must land in a segment that can reach nothing. The original design doc
-(NAS: `Emo shared/Claude shared/garage-camera/`) called for an "802.1Q trunk
-to pfSense" — but nothing in this network terminates dot1q on pfSense; the
-site idiom is one vlan-aware Proxmox bridge → one tagged VM NIC → one clean
-untagged pfSense interface per segment.
-
-**Decision (rev 3):** ONE switch — the new TL-SG105PE **replaces** the old
-garage TL-SG105E (Viktor prefers not running two switches; retired unit
-becomes a cold spare, its 192.168.1.6 mgmt IP passes to the PE). Five ports,
-all used: apartment uplink, 4G router 192.168.1.7, UPS mgmt (all untagged
-VLAN 1), the camera (untagged VLAN 30, PoE), and the **trunk to R730 `eno1`
-carrying home LAN untagged + CCTV tagged 30** over the existing LAN1 cable.
-pfSense `net3` (vtnet3) sits on `vmbr0` with `tag=30` — exactly the site
-idiom used for dManagementsVms/dKubernetes (bridge-level tag → clean untagged
-vNIC; pfSense still terminates no dot1q itself). The earlier dedicated
-`eno2`/`vmbr2` leg is kept **dormant as a fallback** (rev 2 wired it; moving
-net3 back to vmbr2 restores pure physical isolation in one `qm set`).
-This narrows the earlier 802.1Q objection rather than contradicting it: the
-rejection assumed *unmanaged* switches, where any LAN device could inject
-tagged frames; with the managed PE as the only device on eno1, VLAN-30
-membership is {camera port, trunk port} only, so tag-30 ingress from every
-other port — and from the exposed camera cable — is dropped or contained.
-Cameras are untrusted: default-deny on dCCTV with a single
-NTP-to-gateway exception; Frigate (k8s) pulls RTSP in; ha-sofia (192.168.1.8)
-may reach ISAPI/RTSP directly; home-LAN clients route in via an AX6000 static
-route (10.0.30.0/24 via 192.168.1.2). 10.0.30.0/24 is deliberately NOT in the
-10.0.20.0/22 trusted source-IP allowlist.
-
-## Traffic on the trunk — how one cable carries two networks
-
-The LAN1 cable is shared, but the two networks on it diverge at `vmbr0`
-(the vlan-aware bridge on the PVE host), and only ONE of them ever touches
-pfSense:
-
- **Untagged (VLAN 1, home LAN)** is plain L2 bridging: vmbr0 switches it
-  between the trunk, the host's own IP (192.168.1.127) and pfSense `net0` —
-  where pfSense sits as an ordinary LAN *client* (WAN 192.168.1.2). The home
-  LAN's gateway is and remains the AX6000; home-LAN traffic never transits
-  pfSense. Consequently a pfSense (or R730 VM-level) outage does not affect
-  the home LAN, and the apartment ↔ 4G-router ↔ UPS paths don't even leave
-  the switch (P1/P2/P3 bridge internally), so out-of-band recovery via the
-  4G router survives the whole rack being down.
- **Tagged 30 (CCTV)** has exactly one possible landing: vmbr0 delivers
-  VID 30 only to pfSense `net3` (dCCTV, 10.0.30.1), which is the camera
-  segment's gateway, firewall and sole exit. "Camera → AX6000 → internet"
-  is impossible by construction, not merely by firewall rule.
- pfSense forwards *upstream* only its own segments (10.0.10/20/30), NATed
-  out of its WAN toward the AX6000. Load-wise the trunk gained only the
-  camera's ~8 Mbps — it already carried all rack-bound home-LAN traffic.
-
-![VLAN tagging — where traffic can flow](./0017-cctv-vlan-tagging.svg)
-
-*(editable source: [`0017-cctv-vlan-tagging.excalidraw`](./0017-cctv-vlan-tagging.excalidraw) — open it in excalidraw to tweak)*
-
-## Considered options
-
- **802.1Q over the LAN path behind an UNMANAGED switch** (the original plan
-  read this way) — rejected: any LAN device could inject tagged frames into
-  vmbr0 (`bridge-vids 2-4094`) and tag-passing through a dumb switch is
-  undefined. Rev 3 adopts the tagged path ONLY because the managed PE now
-  polices VLAN-30 membership at the single entry point to eno1; no bridge
-  reconfiguration was needed (vmbr0 was already vlan-aware).
- **Dedicated physical leg (eno2 → vmbr2 → net3), one switch per role**
-  (rev 1/2 as-built) — superseded by rev 3: it forced either a second switch
-  (6 connections vs 5 ports once the PE also replaced the old switch) or new
-  hardware. Strongest isolation of all options; kept dormant as the fallback.
- **AX6000 as the camera gateway** — rejected earlier in the design (consumer
-  router, no inter-VLAN firewall).
-
-## Consequences
-
- The switch is now single-point and load-bearing for everything in the rack
-  (apartment uplink, pfSense backup-WAN via 4G, UPS mgmt, CCTV) AND its VLAN
-  table + mgmt password are part of the isolation boundary — the Easy Smart
-  mgmt UI answers on every port, so the password is the gate between a
-  compromised camera and the switch config. All 5 ports are consumed: the
-  next camera forces an 8-port PoE upgrade (the wiring plan already fits it).
- `eno2`/`vmbr2` stay cabled-ready but dormant (fallback to rev 2's physical
-  leg); eno3/eno4 remain free.
- The old TL-SG105E is retired to cold spare; the PE inherits 192.168.1.6
-  (Kea reservation by MAC).
- Revision history (all 2026-07-02): rev 1 assumed one shared PE with a
-  port-VLAN split (conflated the two devices); rev 2 split into two switches
-  after inspecting 192.168.1.6 (old non-PoE SG105E, 4/5 ports used); rev 3
-  consolidated back to one switch — the PE replacing the SG105E — per
-  Viktor's preference, moving CCTV onto a managed tagged trunk.
- Frigate's ADR-0016 VRAM budget was bumped 2000 → 2300 MiB for the extra
-  NVDEC stream.
--- a/docs/adr/0017-cctv-segment-topology.svg
+++ b/docs/adr/0017-cctv-segment-topology.svg
@ -1,178 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="880" viewBox="0 0 1600 880" font-family="system-ui, -apple-system, 'Segoe UI', Roboto, sans-serif">
-  <!-- ADR-0017 rev 3 dCCTV topology (single switch, VLAN-30 trunk on LAN1).
-       Colors: reference dataviz palette (light mode). blue #2a78d6 = home LAN ·
-       violet #4a3aa7 = dCCTV · aqua #1baf7a = dKubernetes ·
-       yellow #eda100 = dManagementsVms · green #008300 allow · red #e34948 deny -->
-  <defs>
-    <marker id="arrGreen" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
-      <path d="M0,0 L10,5 L0,10 z" fill="#008300"/>
-    </marker>
-    <marker id="arrRed" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
-      <path d="M0,0 L10,5 L0,10 z" fill="#e34948"/>
-    </marker>
-    <marker id="arrGray" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
-      <path d="M0,0 L10,5 L0,10 z" fill="#52514e"/>
-    </marker>
-  </defs>
-
-  <rect width="1600" height="880" fill="#fcfcfb"/>
-
-  <text x="40" y="42" font-size="26" font-weight="700" fill="#0b0b0b">ADR-0017 — CCTV segment behind pfSense, VLAN-30 trunk on the LAN1 cable</text>
-  <text x="40" y="66" font-size="15" fill="#52514e">Sofia/Vermont · rev 3 (single switch) 2026-07-02 · dashed = camera-day · the ONLY 802.1Q is the trunk between the switch and eno1</text>
-
-  <!-- camera -> everything else (denied) -->
-  <path d="M240,168 C520,104 900,104 1148,140" fill="none" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
-  <g transform="translate(560,111)">
-    <circle r="11" fill="#fcfcfb" stroke="#e34948" stroke-width="2.5"/>
-    <path d="M-5,-5 L5,5 M5,-5 L-5,5" stroke="#e34948" stroke-width="2.5"/>
-  </g>
-  <text x="588" y="100" font-size="13.5" font-weight="700" fill="#e34948">DENY · camera → LAN / other segments / internet (default deny on dCCTV)</text>
-
-  <!-- GARAGE ENTRANCE -->
-  <rect x="40" y="128" width="240" height="180" rx="10" fill="#4a3aa7" fill-opacity="0.06" stroke="#4a3aa7" stroke-opacity="0.35"/>
-  <text x="56" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">GARAGE ENTRANCE</text>
-  <rect x="64" y="170" width="192" height="112" rx="8" fill="#ffffff" stroke="#4a3aa7" stroke-width="2"/>
-  <text x="80" y="196" font-size="15" font-weight="700" fill="#0b0b0b">vermont-garage</text>
-  <text x="80" y="216" font-size="12.5" fill="#52514e">HiLook IPC-T241H-C · pure IR</text>
-  <text x="80" y="234" font-size="12.5" fill="#52514e">10.0.30.70 (Kea reservation)</text>
-  <text x="80" y="252" font-size="12.5" fill="#52514e">DNS: garage-cam.viktorbarzin.lan</text>
-  <text x="80" y="270" font-size="12.5" fill="#52514e">PoE from switch · cloud/P2P off</text>
-
-  <path d="M256,284 C330,330 412,368 417,430" fill="none" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5" marker-end="url(#arrGray)"/>
-  <text x="330" y="322" font-size="12" fill="#52514e">cat6 in conduit · PoE → P4</text>
-
-  <!-- RACK zone: single switch -->
-  <rect x="40" y="360" width="560" height="265" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
-  <text x="56" y="384" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">RACK — GARAGE · ONE SWITCH</text>
-
-  <rect x="64" y="396" width="512" height="176" rx="8" fill="#4a3aa7" fill-opacity="0.04" stroke="#4a3aa7" stroke-width="2"/>
-  <text x="80" y="420" font-size="15" font-weight="700" fill="#0b0b0b">TL-SG105PE <tspan font-size="12.5" font-weight="400" fill="#52514e">replaces the SG105E · mgmt 192.168.1.6 (Kea) · all 5 ports used</tspan></text>
-  <g font-size="11.5" text-anchor="middle">
-    <rect x="80"  y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
-    <text x="124" y="454" font-weight="700" fill="#0b0b0b">P1 · V1</text>
-    <text x="124" y="470" fill="#52514e">apartment</text>
-    <text x="124" y="484" fill="#52514e">uplink</text>
-    <rect x="178" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
-    <text x="222" y="454" font-weight="700" fill="#0b0b0b">P2 · V1</text>
-    <text x="222" y="470" fill="#52514e">4G router</text>
-    <text x="222" y="484" fill="#52514e">192.168.1.7</text>
-    <rect x="276" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
-    <text x="320" y="454" font-weight="700" fill="#0b0b0b">P3 · V1</text>
-    <text x="320" y="470" fill="#52514e">UPS mgmt</text>
-    <rect x="374" y="436" width="88" height="56" rx="6" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
-    <text x="418" y="454" font-weight="700" fill="#0b0b0b">P4 · V30</text>
-    <text x="418" y="470" fill="#52514e">camera</text>
-    <text x="418" y="484" fill="#52514e">PoE ON</text>
-    <rect x="472" y="436" width="88" height="56" rx="6" fill="#2a78d6" fill-opacity="0.10" stroke="#4a3aa7" stroke-width="2" stroke-dasharray="0"/>
-    <text x="516" y="454" font-weight="700" fill="#0b0b0b">P5 · trunk</text>
-    <text x="516" y="470" fill="#52514e">V1 untagged</text>
-    <text x="516" y="484" fill="#4a3aa7">+ V30 tagged</text>
-  </g>
-  <text x="80" y="516" font-size="12" fill="#52514e">802.1Q: VLAN 1 untagged {P1,P2,P3,P5} · VLAN 30 {P4 untagged/PVID 30, P5 tagged}</text>
-  <text x="80" y="534" font-size="12" fill="#52514e">tag-30 ingress on P1/P2/P3 is dropped (not members) — the trunk is the only tagged path</text>
-  <text x="80" y="558" font-size="12" fill="#8a8984">old TL-SG105E → retired, cold spare · backup-WAN (4G) + UPS keep their ports</text>
-
-  <!-- trunk: two parallel lines to eno1 -->
-  <path d="M560,458 C630,458 640,428 692,420" fill="none" stroke="#2a78d6" stroke-width="2.5"/>
-  <path d="M560,466 C632,466 644,436 692,428" fill="none" stroke="#4a3aa7" stroke-width="2.5"/>
-  <text x="588" y="404" font-size="12" font-weight="700" fill="#0b0b0b">LAN1 cable</text>
-
-  <!-- R730 / PVE zone -->
-  <rect x="680" y="330" width="880" height="440" rx="10" fill="#0b0b0b" fill-opacity="0.03" stroke="#b9b8b2"/>
-  <text x="696" y="356" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">DELL R730 — PVE HOST 192.168.1.127 (IN THE RACK)</text>
-
-  <g font-size="12">
-    <rect x="700" y="400" width="150" height="46" rx="6" fill="#2a78d6" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
-    <text x="712" y="419" font-weight="700" fill="#0b0b0b">eno1 → vmbr0</text>
-    <text x="712" y="436" fill="#52514e">untag V1 + tag 30</text>
-
-    <rect x="700" y="471" width="150" height="46" rx="6" fill="#ffffff" stroke="#8a8984" stroke-dasharray="4,3"/>
-    <text x="712" y="490" font-weight="700" fill="#52514e">eno2 → vmbr2</text>
-    <text x="712" y="507" fill="#8a8984">dormant fallback leg</text>
-
-    <rect x="700" y="542" width="150" height="46" rx="6" fill="#0b0b0b" fill-opacity="0.04" stroke="#8a8984"/>
-    <text x="712" y="561" font-weight="700" fill="#0b0b0b">vmbr1</text>
-    <text x="712" y="578" fill="#52514e">internal · tags 10/20</text>
-  </g>
-
-  <!-- pfSense VM -->
-  <rect x="890" y="388" width="300" height="230" rx="8" fill="#ffffff" stroke="#8a8984"/>
-  <text x="906" y="414" font-size="15" font-weight="700" fill="#0b0b0b">pfSense (VM 101)</text>
-  <text x="906" y="432" font-size="12" fill="#52514e">gateway + firewall for every segment</text>
-  <g font-size="12">
-    <rect x="906" y="444" width="268" height="34" rx="5" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
-    <text x="916" y="465" fill="#0b0b0b">net0 · WAN <tspan fill="#52514e">192.168.1.2 · vmbr0 untagged</tspan></text>
-    <rect x="906" y="484" width="268" height="34" rx="5" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
-    <text x="916" y="505" fill="#0b0b0b">net1 · dManagementsVms <tspan fill="#52514e">10.0.10.1</tspan></text>
-    <rect x="906" y="524" width="268" height="34" rx="5" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
-    <text x="916" y="545" fill="#0b0b0b">net2 · dKubernetes <tspan fill="#52514e">10.0.20.1</tspan></text>
-    <rect x="906" y="564" width="268" height="34" rx="5" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
-    <text x="916" y="585" fill="#0b0b0b">net3 · dCCTV <tspan fill="#52514e">10.0.30.1/24 · vmbr0 tag 30</tspan></text>
-  </g>
-  <path d="M850,415 L890,458" fill="none" stroke="#2a78d6" stroke-width="1.6" opacity="0.6"/>
-  <path d="M850,430 L890,581" fill="none" stroke="#4a3aa7" stroke-width="2"/>
-  <path d="M850,565 L890,501" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
-  <path d="M850,565 L890,541" fill="none" stroke="#8a8984" stroke-width="1.6" opacity="0.6"/>
-
-  <!-- k8s VMs -->
-  <rect x="1240" y="388" width="290" height="230" rx="8" fill="#1baf7a" fill-opacity="0.07" stroke="#1baf7a"/>
-  <text x="1256" y="414" font-size="15" font-weight="700" fill="#0b0b0b">k8s VMs · 10.0.20.0/24</text>
-  <text x="1256" y="434" font-size="12.5" fill="#52514e">vmbr1 tag 20 · pod egress SNATs</text>
-  <text x="1256" y="450" font-size="12.5" fill="#52514e">to node IPs</text>
-  <rect x="1256" y="464" width="258" height="66" rx="6" fill="#ffffff" stroke="#1baf7a"/>
-  <text x="1268" y="486" font-size="13.5" font-weight="700" fill="#0b0b0b">Frigate · k8s-node1 (T4)</text>
-  <text x="1268" y="504" font-size="12" fill="#52514e">detect sub / record main</text>
-  <text x="1268" y="520" font-size="12" fill="#52514e">gpumem budget 2300 MiB</text>
-  <rect x="1256" y="540" width="258" height="52" rx="6" fill="#ffffff" stroke="#1baf7a"/>
-  <text x="1268" y="562" font-size="13.5" font-weight="700" fill="#0b0b0b">go2rtc LB 10.0.20.204</text>
-  <text x="1268" y="580" font-size="12" fill="#52514e">restream → HA live view (MSE/HLS)</text>
-
-  <!-- HOME LAN zone -->
-  <rect x="1148" y="128" width="412" height="180" rx="10" fill="#2a78d6" fill-opacity="0.06" stroke="#2a78d6" stroke-opacity="0.4"/>
-  <text x="1164" y="154" font-size="13" font-weight="700" fill="#52514e" letter-spacing="1">HOME LAN 192.168.1.0/24</text>
-  <rect x="1164" y="168" width="180" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
-  <text x="1176" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">AX6000 · .1</text>
-  <text x="1176" y="208" font-size="11.5" fill="#52514e">+ route 10.0.30.0/24 → .2</text>
-  <rect x="1164" y="236" width="180" height="52" rx="6" fill="#ffffff" stroke="#2a78d6"/>
-  <text x="1176" y="258" font-size="13.5" font-weight="700" fill="#0b0b0b">ha-sofia · .8</text>
-  <text x="1176" y="275" font-size="11.5" fill="#52514e">Frigate card + hikvision_next</text>
-  <rect x="1360" y="168" width="184" height="56" rx="6" fill="#ffffff" stroke="#2a78d6"/>
-  <text x="1372" y="190" font-size="13.5" font-weight="700" fill="#0b0b0b">apartment clients</text>
-  <text x="1372" y="208" font-size="11.5" fill="#52514e">laptops, phones</text>
-  <rect x="1360" y="236" width="184" height="52" rx="6" fill="#ffffff" stroke="#52514e" stroke-dasharray="5,4"/>
-  <text x="1372" y="256" font-size="11.5" font-weight="700" fill="#52514e">CAMERA DAY: static route</text>
-  <text x="1372" y="272" font-size="11.5" fill="#52514e">10.0.30.0/24 via 192.168.1.2</text>
-
-  <path d="M1254,308 C1150,352 950,372 790,400" fill="none" stroke="#2a78d6" stroke-width="2" opacity="0.6"/>
-  <text x="1010" y="374" font-size="12" fill="#2a78d6">apartment uplink · switch P1 · trunk · eno1</text>
-
-  <!-- FLOWS -->
-  <path d="M1256,497 C1010,690 330,730 120,650 C40,618 40,380 96,286" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
-  <text x="620" y="700" font-size="13.5" font-weight="700" fill="#008300">ALLOW · Frigate → camera RTSP :554 (routed k8s → dCCTV; opt1 allow-all)</text>
-
-  <path d="M1164,262 C820,282 470,268 302,176 C286,167 278,166 270,172" fill="none" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
-  <text x="484" y="216" font-size="13.5" font-weight="700" fill="#008300">ALLOW · ha-sofia → camera :80 ISAPI + :554</text>
-  <text x="484" y="234" font-size="12" fill="#52514e">enters pfSense WAN · reply-to off · needs the AX6000 route</text>
-
-  <path d="M280,232 C660,200 860,320 936,386" fill="none" stroke="#008300" stroke-width="2" opacity="0.85" marker-end="url(#arrGreen)"/>
-  <text x="740" y="322" font-size="12.5" font-weight="700" fill="#008300">ALLOW · camera → 10.0.30.1:123 (NTP)</text>
-
-  <!-- LEGEND -->
-  <g transform="translate(40,800)" font-size="12.5">
-    <rect x="0" y="0" width="18" height="18" rx="4" fill="#2a78d6" fill-opacity="0.12" stroke="#2a78d6"/>
-    <text x="26" y="14" fill="#0b0b0b">home LAN / VLAN 1</text>
-    <rect x="200" y="0" width="18" height="18" rx="4" fill="#4a3aa7" fill-opacity="0.12" stroke="#4a3aa7" stroke-width="2"/>
-    <text x="226" y="14" fill="#0b0b0b">CCTV / VLAN 30 / dCCTV 10.0.30.0/24</text>
-    <rect x="500" y="0" width="18" height="18" rx="4" fill="#1baf7a" fill-opacity="0.12" stroke="#1baf7a"/>
-    <text x="526" y="14" fill="#0b0b0b">dKubernetes</text>
-    <rect x="640" y="0" width="18" height="18" rx="4" fill="#eda100" fill-opacity="0.14" stroke="#eda100"/>
-    <text x="666" y="14" fill="#0b0b0b">dManagementsVms</text>
-    <line x1="820" y1="9" x2="860" y2="9" stroke="#008300" stroke-width="3" marker-end="url(#arrGreen)"/>
-    <text x="870" y="14" fill="#0b0b0b">allowed flow</text>
-    <line x1="980" y1="9" x2="1020" y2="9" stroke="#e34948" stroke-width="3" marker-end="url(#arrRed)"/>
-    <text x="1030" y="14" fill="#0b0b0b">denied</text>
-    <line x1="1100" y1="9" x2="1140" y2="9" stroke="#52514e" stroke-width="2" stroke-dasharray="6,5"/>
-    <text x="1150" y="14" fill="#0b0b0b">camera-day step</text>
-    <text x="1320" y="14" fill="#52514e">ADR-0017 · rev 3</text>
-  </g>
-</svg>
--- a/docs/adr/0017-cctv-vlan-tagging.excalidraw
+++ b/docs/adr/0017-cctv-vlan-tagging.excalidraw
--- a/docs/adr/0017-cctv-vlan-tagging.svg
+++ b/docs/adr/0017-cctv-vlan-tagging.svg
--- a/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
+++ b/docs/adr/0018-valia-sites-off-infra-pages-in-cluster-sync.md
@@ -1,47 +0,0 @@
-# Valia sites are served off-infra (Cloudflare Pages), synced in-cluster
-
-Valia (Viktor's mother) authors small one-page static sites in Google Drive folders she
-shares, and keeps asking for them to be hosted — two exist already (`stem95su`, `bridge`)
-and more are expected. We decided all **Valia sites** are served **off-infra on Cloudflare
-Pages** under `<english-name>.viktorbarzin.me`, kept fresh by **one shared in-cluster
-CronJob** (`stacks/valia-sites/`) that mirrors each **Content folder** every 10 minutes
-(rclone, drive.readonly) and re-deploys only on change (wrangler direct upload). The
-existing in-cluster `stem95su` serving stack (nginx + NFS + ingress + per-site sync)
-migrates onto this and is retired.
-
-Why off-infra serving: these are her sites, shown to teachers/parents — they must survive
-homelab outages (cf. the 2026-06-27 egress incident that took every proxied in-cluster
-site down). With Pages, a homelab outage degrades to "content frozen until we're back",
-never "site down". Serving costs no cluster resources and no per-site nginx/PVC/ingress/
-Anubis. Why the syncer stays in-cluster anyway: secrets stay in Vault (no per-site GHA
-secret sprawl), and the stem95su guard patterns (hard-fail on Drive auth errors, never
-wipe a live site on an empty/partial folder, capped deletes) carry over wholesale. The
-deliberate asymmetry — off-infra serving, on-infra syncing — is the point, not an
-accident.
-
-## Considered options
-
-- **In-cluster everywhere** (generalise stem95su into a factory module): one roof, no
-  Cloudflare Pages dependency — but her sites share the homelab's fate and each site
-  spends cluster resources to serve static files a free CDN serves better.
-- **Pages for new sites only**: less work now, two patterns and two runbooks forever.
-- **GHA-scheduled sync** (fully off-infra pipeline): no cluster dependency at all, but
-  Drive + Cloudflare credentials would live as GitHub secrets per repo, outside Vault.
-
-## Consequences
-
-- Registration is one entry in the `sites` map (name, Content folder, optional Entry
-  file); CI applies Pages project, custom domain, public CNAME, and internal-DNS config
-  together. Names are English, picked by Viktor (most → bridge set the precedent).
-- The internal split-horizon zone learns Valia sites from a ConfigMap the
-  `technitium-ingress-dns-sync` script consumes — declaratively, including **removal**
-  (the previous static-CNAME approach was add-only; a retired site left a stale record).
-- Deploy-on-change is mandatory, not an optimisation: Pages caps monthly deployments on
-  the free tier, and a 10-minute cadence would burn ~4,300/month if unchanged runs
-  deployed.
-- Failure visibility is **failed-Job-only** by explicit choice (no stale-sync alert, no
-  per-site uptime monitors, no notifications to Valia) — Viktor fields "it didn't
-  update" reports, consistent with the alert-noise-reduction posture. Revisit if a
-  silent stall actually bites.
-- If the homelab is down, content updates pause; the sites keep serving last-deployed
-  content. Accepted degradation.
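Review note on the deploy-on-change consequence in ADR-0018: the arithmetic behind the "~4,300/month" figure, plus one way a change check could look, as a minimal Python sketch. The 30-day month, the SHA-256 digest, and the helper names `content_digest` and `should_deploy` are assumptions for illustration only; the ADR says the CronJob re-deploys only on change but does not say how the change is detected.

```python
from __future__ import annotations

import hashlib

MINUTES_PER_DAY = 24 * 60


def runs_per_month(cadence_minutes: int, days: int = 30) -> int:
    """Sync runs in one month at a fixed cadence (30-day month assumed)."""
    return (MINUTES_PER_DAY // cadence_minutes) * days


def content_digest(files: dict[str, bytes]) -> str:
    """Hypothetical change detector: one SHA-256 over sorted paths and contents."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot run together
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def should_deploy(previous_digest: str | None, files: dict[str, bytes]) -> bool:
    """Deploy only when the mirrored folder differs from the last deployed state."""
    return content_digest(files) != previous_digest
```

At a 10-minute cadence `runs_per_month(10)` gives 4320 runs, which is the number of deployments an always-deploy syncer would spend per site each month.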
--- a/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
+++ b/docs/adr/0019-backup-mx-self-hosted-oracle-relay.md
@@ -1,97 +0,0 @@
-# Inbound mail gets a self-hosted store-and-forward backup MX on Oracle Always-Free
-
-`viktorbarzin.me` has run a single direct MX to the home IP since the 2026-04-12
-inbound overhaul, with sender-MTA retry (1–5 days, sender-dependent) as the only
-outage protection — a documented "No Backup MX" decision made after ForwardEmail's
-forced anti-spoofing rejected legitimate forwarded mail and Cloudflare Email
-Routing proved pass-through-only. Viktor now wants inbound mail to survive
-homelab outages **without loss** (2026-07-04): delayed delivery is fine,
-mid-outage reading is not required, and the budget is **$0** — a hard
-constraint that eliminated every managed option (see below).
-
-We run a minimal **Postfix store-and-forward relay on an Oracle Cloud
-Always-Free `VM.Standard.E2.1.Micro`** (`mx2.viktorbarzin.me`, **reserved**
-public IP, MX preference 20; primary untouched at 1). It accepts everything
-for the domain (catch-all — every RCPT is valid; reputation may only ever
-4xx-defer, via postscreen pregreet + conservative DNSBL-defer on the VM —
-never 5xx: a backup MX that hard-rejects manufactures the loss it exists to
-prevent), queues up to **30 days** (bounce lifetime 1 day — the VM can never
-deliver a DSN, its only egress is the drain), and drains to the primary over
-**port 2526** — one scripted pfSense WAN NAT rule onto the existing HAProxy
-frontend — because Oracle blocks egress TCP 25 tenancy-wide. Management is
-tailnet-only (headscale preauth key, `tag:backup-mx`; OCI console as
-mid-outage break-glass since headscale itself lives in the cluster); TLS via
-certbot HTTP-01 (port 80 permanently open — LE validation is
-multi-perspective and unscopeable); the VM is a cattle-rebuild from a new
-`stacks/backup-mx/` Terraform stack (OCI provider + cloud-init, which must
-also punch 25/80 through the OCI Ubuntu image's OS-level iptables REJECT).
-On the primary, the drain stream (one /32) is enabled at the layers that
-actually bite — `check_client_access` permits past
-`reject_unknown_client_hostname` and spoof-protection, an anvil rate-limit
-exception, and rspamd `external_relay` (score against the *original* sender
-IP) with the reject action capped to tag/fold so drained spam can never force
-the VM to emit backscatter. Go-live is gated on empirical checks: inbound-25
-reachability (recurring probe — Oracle publishes no commitment), drain
-end-to-end, and a live failover test that includes a high-spam-score and a
->10 MB message. Two independent adversarial reviews (2026-07-04) shaped this
-final form. Design:
-[`plans/2026-07-04-backup-mx-design.md`](../plans/2026-07-04-backup-mx-design.md).
-
-## Considered options
-
-- **Roller Network free Secondary MX** — v1 of this decision, killed at the
-  validation gates the same day: free tier caps at 200 relayed messages or
-  10 MB per rolling 7 days, and overage suspends the domain for 48 h
-  answering **SMTP 5xx** (permanent bounces) — since spammers target backup
-  MXes even while the primary is up, background spam alone can hold it
-  suspended, making it *worse than no backup MX*. Free accounts are also
-  being discontinued. (Their TLS checked out; their paid Basic at $30/yr is
-  the documented fallback if the OCI route sours.)
-- **Dynu Email Backup ($9.99/yr)** — queue lifetime undocumented (FAQ hints
-  12–24 h, barely beating sender retry); filtering black-box; not free.
-- **Cloudflare Email Routing / mailflare** — no store-and-forward / terminal
-  inbox on Cloudflare; rejected earlier (2026-04-12; 2026-07-04 memory #7148).
-- **Other free tiers** (challenged and re-verified 2026-07-04): GCP e2-micro
-  blocks egress 25 too and its free regions are US-only; AWS's 2025+ "free"
-  plan is a 6-month credit; Azure has no always-free VM and blocks 25;
-  Hetzner has no free tier; Fly.io ended free allowances; Vultr/Linode are
-  trial credits; DNSExit/KisoLabs/DuoCircle backup-MX are paid or dead. OCI
-  is the only standing free option.
-- **Harden-only** (5xx-misconfig guards + paging) — does not address
-  multi-day outages or short-retry senders; deferred as a complementary
-  track.
-
-## Consequences
-
-- **A pet outside the cluster** — deliberately cattle: rebuilt entirely from
-  Terraform + cloud-init, patched by unattended-upgrades, scraped by the
-  cluster's Prometheus (exporters on the reserved public IP, allowlisted to
-  the homelab WAN /32 — there is **no cluster→tailnet route**, so tailnet
-  scraping was rejected as fictional; blackbox TCP:25 + MX-set drift alerts
-  besides). Never a backup target itself.
-- **Oracle free-tier caprice is the top risk**: Oracle silently halved the A1
-  free allowance in June 2026 and terminated over-limit instances, and
-  publishes no commitment that inbound 25 stays open. Mitigations:
-  **Pay-As-You-Go conversion is a required prerequisite** (exempts idle
-  reclamation, stays $0), a recurring inbound-25 probe, `BackupMxDown`, and
-  the queue being empty outside outages (a surprise reclamation loses
-  coverage, never mail). Home region is fixed at signup — Frankfurt, chosen
-  once.
-- The drain stream bypasses `reject_unknown_client_hostname`, anvil limits,
-  and rspamd's reject tier for one /32; DKIM verification, SPF/DMARC (against
-  the original IP via `external_relay`), and content scoring stay on — spam
-  arriving via the backup is tagged and folded to Junk, never bounced. The VM
-  is deliberately NOT in the primary's `mynetworks` (a compromised VM must
-  not relay through us).
-- **Outages > 30 days lose queued mail silently** — no DSN can ever leave the
-  VM. Stated and accepted (6× better than the status quo).
-- Outage mail sits in plaintext on Oracle disk ≤ 30 days — single-tenant but
-  off-premises; accepted (same class as Brevo holding outbound today).
-- Cloudflare zone lands at 197/200 records; the MTA-STS follow-up (policy
-  host found dangling during design — inert today; must list `mx2` when
-  fixed) needs 1–2 more → schedule the next record purge proactively.
-- `architecture/mailserver.md` §"No Backup MX" superseded at implementation;
-  new runbook `docs/runbooks/backup-mx.md` (incl. OCI console break-glass);
-  `vpn.md`'s stale headscale claims fixed in passing; the roundtrip probe's
-  failure semantics change (a "failing" probe may now mean "delayed via mx2,
-  drains shortly" — noted in alert description).
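Review note on ADR-0019: how the two MX preferences interact, sketched in Python. Senders try the lowest preference first, so `mx2` (preference 20) only receives mail when the primary (preference 1) is unreachable. The primary's hostname is not given in the ADR, so `primary.example.invalid` is a placeholder, and reading the "6×" claim as 30 queue days against the 5-day upper end of sender retry is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

QUEUE_LIFETIME_DAYS = 30   # backup MX queue lifetime stated in the ADR
SENDER_RETRY_MAX_DAYS = 5  # upper end of the 1-5 day sender retry window


@dataclass(frozen=True)
class MxRecord:
    preference: int
    host: str


def delivery_order(records: list[MxRecord]) -> list[str]:
    """Hosts in the order a sender tries them: lowest preference first."""
    return [r.host for r in sorted(records, key=lambda r: r.preference)]


def first_reachable(records: list[MxRecord], reachable: set[str]) -> str | None:
    """First host in delivery order that currently accepts connections."""
    for host in delivery_order(records):
        if host in reachable:
            return host
    return None


RECORDS = [
    MxRecord(20, "mx2.viktorbarzin.me"),
    MxRecord(1, "primary.example.invalid"),  # placeholder: primary hostname not in the ADR
]
```

With the primary down, `first_reachable(RECORDS, {"mx2.viktorbarzin.me"})` falls through to the backup; with both down it returns `None`, and the sender's own retry schedule is all that is left.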
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@@ -86,56 +86,10 @@ Signin latency is dominated by screen count and round trips, not server time
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
 - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
-  15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
-  hardening — decorrelates the 9 workers' recycles from PG blips). **No
-  `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
-  1:1 and saturate the session-mode pool (reverted 2026-06-10).
+  15m policy cache, 60s persistent DB connections.
 - **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
-- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
-  `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
-  login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
-  burst 429'd the tail and a failed ES-module import left a blank login screen.
-- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
-  (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
-  DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
-  3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
-  blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
-  + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
-  option), so request-serving is coupled to PG — this survives a short transient,
-  not a total CNPG outage.
-- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
-  (the repo's old `strategy:` key was silently inert → live ran the chart-default
-  25%/25% and dropped a server pod out of rotation on every roll). Now
-  `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
-- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
-  and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
-  the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
-  image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
-  authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
-  **and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
-  so those clients get the *real* authentik login (password + MFA + reputation —
-  no auth downgrade). The SFE can't render Identification-stage **sources**
-  (authentik limitation), so the patch also injects static social-login `<a>`
-  links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
-  required for password-less accounts (e.g. Google-only users). A Traefik
-  basic-auth fallback was rejected: it would have put a single spoofable-UA
-  password in front of `vbarzin→wizard` (passwordless root on the devvm). See
-  `stacks/authentik/patch-compat-sfe.py`.
-- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
-  MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
-  a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
-  **cannot render WebAuthn** (enrol *or* validate), so that user gets
-  `unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
-  downgrade**: (1) **social login** — sources run `default-source-authentication`
-  (UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
-  button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
-  ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
-  runtime data (not Terraform): enrol via `ak shell`
-  (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
-  user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
-  his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
 - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
 - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
  TCP setup on the forward-auth subrequest path.
@@ -154,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment

 Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.

+### TripIt External self-signup (open enrollment, fenced)
+
+Unlike every other app, **TripIt allows open public self-signup** for people
+outside the homelab (ADR-0020 in the tripit repo; runbook
+`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
+flow (email + passkey, no password) creates the account and stamps it into the
+parentless **`TripIt External`** group. Containment is two-layered:
+
+- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
+  catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
+  denies every other `auth="required"` host.
+- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
+  External users are contained because every sensitive OIDC app already requires a
+  trusted group they do not hold — audited 2026-06-15:
+  Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
+  `Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
+  `Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
+  `default`-policy token) and is bound to **`Allow Login Users`** as part of this
+  change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
+
+**Invariants**: keep `TripIt External` parentless (never under `Allow Login
+Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
+trusted/internal user; the `tripit-enrollment` user_write "Create users group"
+setting is the keystone that tags every signup.
+
 ### OIDC Applications

 Authentik provides OIDC for 10 applications:
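Review note on the TripIt External section added in this file: the forward-auth branch and the co-assignment invariant, sketched as plain Python. This is not the Authentik policy itself (the real `admin-services-restriction` expression is not shown in this diff); the function names and the tri-state return value are assumptions, and the trusted-group set is limited to the groups the section names.

```python
from __future__ import annotations

TRIPIT_EXTERNAL = "TripIt External"
TRIPIT_HOST = "tripit.viktorbarzin.me"

# Trusted groups named in the section's OIDC audit.
TRUSTED_GROUPS = {
    "Home Server Admins",
    "Task Submitters",
    "Forgejo Users",
    "Headscale Users",
    "Wrongmove Users",
    "Allow Login Users",
}


def external_branch(groups: set[str], host: str) -> bool | None:
    """First branch of the catch-all policy, as described in the section.

    Returns True (admit) or False (deny) for TripIt External users, and
    None when the user is not external so the remaining branches decide.
    """
    if TRIPIT_EXTERNAL not in groups:
        return None
    return host == TRIPIT_HOST


def violates_invariant(groups: set[str]) -> bool:
    """True if an external user also holds a trusted group (never co-assign)."""
    return TRIPIT_EXTERNAL in groups and bool(groups & TRUSTED_GROUPS)
```

Returning `None` for non-external users reflects why the section insists the branch stays first: it must decide for external users before any later rule can admit them.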
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@@ -128,7 +128,7 @@ The agent handles all three version patterns in Terraform:

 - **Slack**: All upgrade events reported (start, success, failure, rollback)
 - **Git**: Detailed commit messages with changelog summaries, risk level, backup status
-- **DIUN Slack**: REMOVED 2026-07-02 (per-tag @channel pings in #image-updates; human cadence is the weekly upgrade report). The n8n webhook feed to the upgrade agent is unchanged.
+- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)

 ## Bulk Upgrades

@@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
  - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
  - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
-  - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
+  - `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
 - **Pushgateway metrics**:
  - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
  - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
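Review note on the alert expressions in this hunk: the two `in_flight`-based conditions restated as Python predicates, which makes the blind spot that `K8sUpgradeChainJobFailed` covers easier to see. This is a sketch of the instantaneous condition only; the `for 10m` / `for 5m` hold durations are not modelled, and the function names are assumptions.

```python
STALL_THRESHOLD_SECONDS = 5400  # 90 minutes, from the K8sUpgradeStalled expression


def snapshot_missing(in_flight: int, snapshot_taken: int) -> bool:
    """EtcdPreUpgradeSnapshotMissing: upgrade running but no etcd snapshot recorded."""
    return in_flight == 1 and snapshot_taken == 0


def upgrade_stalled(in_flight: int, started_timestamp: float, now: float) -> bool:
    """K8sUpgradeStalled: upgrade running for longer than the threshold."""
    return in_flight == 1 and (now - started_timestamp) > STALL_THRESHOLD_SECONDS
```

Both predicates are false whenever `in_flight` is 0, so a preflight Job that fails before setting the metric can never trip them, whatever else is wrong.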
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@@ -112,32 +112,17 @@ External caller (dev box):
  @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
 ```

-## Browser binary — real Google Chrome (for proprietary codecs)
-
-The chrome-service container runs **real Google Chrome**, not the bundled
-Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
-(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
-`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
-The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
-
-**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
-so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
-`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
-decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
-worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
-the lib stripped) and Chrome-for-Testing is also codec-less — only
-`google-chrome-stable` carries them.
-
 ## Image pin

-The Playwright base + the Python client (`playwright==1.48.0` in callers'
-`requirements.txt`) and the snapshot sidecars
-(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
-minor-versions. The chrome-service browser is now real Google Chrome (a newer
-milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
-fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
-version-tolerant — verified working against this Chrome. If a future Chrome
-milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
+Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
+`stacks/chrome-service/main.tf`) and the Python client
+(`playwright==1.48.0` in callers' `requirements.txt`) **must match
+minor-versions**. Bump in lockstep — Playwright protocol changes between
+minors and the client cannot connect to a mismatched server.
+
+The harvester + snapshot-server sidecar use
+`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
+minor, with Python-side bindings pre-installed.

 ## Storage

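Review note on the "Image pin" hunk above: the lockstep rule (server image and client pin share a Playwright minor) is easy to check mechanically. A minimal Python sketch follows; the helper names are assumptions, and the regex simply takes the first `x.y.z` triple it finds in each string.

```python
from __future__ import annotations

import re


def minor_version(text: str) -> tuple[int, int]:
    """Extract (major, minor) from e.g. 'v1.48.0-noble' or 'playwright==1.48.0'."""
    match = re.search(r"(\d+)\.(\d+)\.\d+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def pins_match(server_image: str, client_requirement: str) -> bool:
    """True when server image tag and client pin agree on major.minor."""
    return minor_version(server_image) == minor_version(client_requirement)
```

Patch-level differences are deliberately ignored, which is why `playwright-core@1.48.2` still counts as matching a `v1.48.0-noble` image.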
@@ -182,66 +167,7 @@ milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
  `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
  `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
  exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
-  Authentik-gated. The bare host serves `vnc.html` (image symlinks
-  `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
-  to skip the Connect button. The view is **black when no browser window is
-  open** (idle) — that is normal, not a failed connection. Chrome is launched
-  with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
-  (no window manager runs, so without it Chrome opens at its profile-persisted
-  size and the rest of the framebuffer shows as a black cut-off).
-
-### noVNC fd-sweep gotcha (stuck "Connecting")
-
-If the noVNC client hangs on **"Connecting" forever then times out**, the cause
-is almost always x11vnc's fd-table sweep: containerd grants pods
-`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
-every client connection, so the RFB handshake never completes (websockify
-accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
-the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
-x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
-(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` —
-healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
-— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
-wrapper in `main.tf` (so it applies deterministically even though the image is
-`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
-as the android-emulator stack.
-
-### noVNC black after a browser-container restart (x11vnc supervision)
-
-A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
-but the view is **black**, and the novnc container logs spew
-`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
-refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
-in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
-container's Xvfb over `localhost:6099` (shared pod network). When the browser
-container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
-Xvfb vanishes and x11vnc loses its X connection and exits.
-
-`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
-background children and `wait -n`s on them, exiting non-zero if **either** dies, so
-the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
-relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
-(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
-websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
-`<defunct>` zombie — and the view black until a manual pod restart. Same
-supervision pattern as the android-emulator stack's entrypoint.)
-
-**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
-entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
-"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
-— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
-recovery** (no image change): restart just the novnc container with `kubectl exec
--n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
-and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
-
-> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
-> (`keel.sh/policy=never`, because the browser container's playwright image is
-> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
-> rebuilt `:latest` will **not** redeploy on its own. After the
-> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
-> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
-> and rollout (the novnc image is TF-managed — not in the deployment's
-> `lifecycle.ignore_changes`).
+  Authentik-gated.
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@@ -254,87 +180,6 @@ and relaunches x11vnc **without** touching the browser session/in-flight CDP job
 See `stacks/chrome-service/README.md` for the recipe (label namespace,
 inject `CHROME_CDP_URL`, vendor `stealth.js`).

-## Driving from OUTSIDE the cluster (`homelab browser`)
-
-Agents on the devvm reach this browser through the **`homelab browser`** CLI
-(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
-`connect_over_cdp` recipe. It is the **escalation path, not the default**:
-agents default to the Playwright MCP / headless browser for all routine
-automation, and reach for `homelab browser` ONLY when headless is blocked — a
-site loads but a gated action (submit/login) silently fails or hangs, the
-signature of headless / anti-bot detection. (Same tiered rule lives in
-`~/code/CLAUDE.md` and `homelab browser --help`.)
-
-```text
-devvm:  homelab browser run flow.js
-          │  kubectl port-forward svc/chrome-service :9222  (random local port)
-          ▼
-   http://127.0.0.1:<port>  ──►  chrome-service pod :9222 (CDP)
-          │  assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
-          │  node + playwright-core@1.48.2 → connectOverCDP
-          │  context.addInitScript(stealth.js)   ← same vendored file as in-cluster
-          │  run the user's Playwright script with page/context/browser in scope
-          └─ port-forward always torn down (success or error)
-```
-
-Key facts:
-
-- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
-  API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
-  label — unlike in-cluster callers.
-- **Client pinned to the image minor.** The node client is
-  `playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
-  lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
-  server image bumps (same rule as the in-cluster Python clients — see "Image
-  pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
-  shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
-  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
-  CLI's stealth never diverges from the in-cluster callers'.
-
-## Multi-user access (sharing the browser)
-
-There is ONE chrome-service browser with ONE persistent profile, warmed with
-**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
-drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
-reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
-sessions. Access is gated accordingly, per user.
-
-**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
-Viktor's browser for form-filling + captcha solving, rather than getting an
-isolated instance. The session-exposure trade-off above was explicitly accepted.
-
-Two independent grants make up "browser access" for a user:
-
-1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
-   `admin-services-restriction` policy: the `CHROME_ALLOWED` set
-   (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
-   username OR email. Add the user there. No kubeconfig/RBAC needed.
-2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
-   in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
-   kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
-   session). Provided by a per-user **ServiceAccount** with a long-lived token
-   (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
-   this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
-   resolve the Service and doesn't regress the user's normal read). The devvm
-   provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
-   reads that token and installs it as the user's DEFAULT kubeconfig context
-   (`<user>-browser@homelab`), keeping their personal OIDC login as the
-   `oidc@homelab` named context. The SA's existence is the source of truth for who
-   gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
-
-**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
-`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
-the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
-a token by deleting its `<user>-browser-token` Secret).
-
-Because the SA is the user's DEFAULT kubectl credential, other per-namespace
-port-forward grants hang off the same identity: `stacks/excalidraw/rbac.tf`
-grants `emo-browser` `pods/portforward` in `excalidraw` (2026-07-02) so emo's
-agent can upload drawings via the port-forward + `X-Authentik-Username` recipe
-in his `~/.claude/CLAUDE.md`. Revoking the SA revokes those too.
-
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@ -94,7 +94,7 @@ can't reach Forgejo's public hairpin.
 | Visibility | Packages | Pull mechanism |
 |------------|----------|----------------|
 | **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
-| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal, excalidraw-library | `ghcr-credentials` dockerconfigjson |
+| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |

 Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
 kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
@ -115,66 +115,8 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
 instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
 fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
 pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
-k8s-portal, apple-health-data, audiblez-web, insta2spotify,
-audiobook-search) now also land on ghcr.
-
-**plotting-book** is a special case (a GitHub-first repo owned by Anca,
-ADR-0003): the build runs in *her* GitHub repo
-(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
-`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
-not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
-PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
-`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
-read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
-2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
-unchanged. Flow:
-
-```text
- DEVELOP ───────────────────────────────────────────────────────────────────────
-   Anca (Codex / t3 web agent)
-        │  git push → main
-        ▼
- ┌──────────────────────────────────────────────────────────────┐
- │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│  ← canonical
- │   .github/workflows/build-and-deploy.yml     on: push → main  │
- └───────────────────────────┬──────────────────────────────────┘
-                             │  GitHub Actions runner (off-infra build · ADR-0002)
-        ┌────────────────────┴─────────────────────────────────┐
-        ▼                                                        ▼
- ┌─────────────────────────────────────────────┐      ╔═══════════════════════════════════════╗
- │ build job                                   │ push ║  GHCR · PRIVATE package                ║
- │  • svu next --always → tag vX.Y.Z (→ repo)  │═════▶║  ghcr.io/passionprojectsanca/         ║
- │  • buildx linux/amd64, provenance:false     │ tags ║       book-plotter  :vX.Y.Z  :latest  ║
- │  • login ghcr (GITHUB_TOKEN, packages:write)│      ╚═══════════════════╤═══════════════════╝
- │  • delete-package-versions (keep newest 10) │                          │
- └───────────────────────┬─────────────────────┘                          │ pull (private,
-                         ▼  deploy job  [gate: repo var DEPLOY_ENABLED ≠ "false"]  via secret)
-   POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME}         │
-                         ▼                                                         │
- ┌─────────────────────────────────────────────────────────────┐                 │
- │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual)  │                 │
- │   kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │                 │
- │   kubectl rollout status                                     │                 │
- └───────────────────────────┬─────────────────────────────────┘                 │
-                             ▼                                                     │
- ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════      │
- ┌─────────────────────────────────────────────────────────────┐                 │
- │ Deployment plotting-book  (Recreate · image = ignore_changes)│                 │
- │   imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
- │   Pod → Express :3001  +  SQLite on PVC (proxmox-lvm)        │
- └─────────────────────────────────────────────────────────────┘
-   guards / supporting:
-     • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED   (admission)
-     • Keel policy=patch @1h → watches GHCR via ghcr-credentials          (backstop)
-     • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
-
- ═══════════════ Serving path (unchanged) ══════════════════════════════════
-   Browser ─▶ plotting-book.viktorbarzin.me  (non-proxied DNS → Traefik .203)
-           ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
-```
-
-Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
-`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
+k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+audiobook-search, council-complaints) now also land on ghcr.

 ### Infra-owned images (issues #29 / #30)

@ -188,8 +130,6 @@ reconciled — the workflows were added to the GitHub lineage via PR):
 | android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
 | infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
 | infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
-| k8s-portal | `build-k8s-portal.yml` | private `ghcr.io/viktorbarzin/k8s-portal` (Keel rolls `:latest` digests) |
-| excalidraw-library | `build-excalidraw.yml` | private `ghcr.io/viktorbarzin/excalidraw-library` (Keel rolls `:latest` digests; DockerHub `:v4` frozen as rollback) |

 **`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
 `drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
@ -223,9 +163,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
 | Pipeline | File | Purpose |
 |----------|------|---------|
 | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
-| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
 | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
 | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -236,38 +176,6 @@ Woodpecker is **deploy + cluster-touching steps only**:

 **No build/test pipeline exists on any repo.** Do not (re)introduce one.

-### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
-
-infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
-and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
-push**. Left unguarded, two `terragrunt apply` runs race each other for the
-per-stack PG state lock — historically the #1 source of `Error acquiring the
-state lock` failures and push-supersede "killed" runs.
-
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
-  on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
-  and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
-  skip. Fail-open (unknown forge still applies). The mirror keeps running the
-  **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
-  duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
-  have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
-  not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
-  the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
-  locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
-  timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
-  retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
-  are NOT retried — they fail fast.
-
-A pre-apply off-infra validate gate was evaluated and rejected: `terraform
-validate` runs without state but catches ~0 of the observed failures (they are
-provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
-lock contention — all invisible to static validate), and `plan` cannot run
-off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
-phase without mutating on config errors, so a separate in-pipeline plan-gate was
-also dropped as redundant.
-
 ### Woodpecker API

 Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
@ -295,9 +203,7 @@ The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
 forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
 1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
 (changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
-Slack audit step. **Slack policy (2026-07-02): every infra pipeline posts only
-on FAILURE** (plus the non-admin audit post and drift/error findings) — routine
-successful runs are silent. Operational facts (2026-06-10):
+Slack audit step. Operational facts (2026-06-10):

 - **Webhook URL is the IN-CLUSTER service**:
  `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
@ -379,8 +285,7 @@ steps:
  notify:
    image: plugins/slack
    when:
-      # Failure-only (2026-07-02 policy): CI notifies about failed runs only.
-      status: [failure]
+      status: [success, failure]
 ```

 ### CI/CD secrets sync
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@ -277,7 +277,7 @@ Technitium's **Split Horizon AddressTranslation** app post-processes DNS respons

 Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).

-**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. **Off-infra Valia sites** (Cloudflare Pages, ADR-0018) are the other class of public-only names with no Traefik ingress — without internal records they NXDOMAIN for every internal client while working fine externally. Since 2026-07-03 they are reconciled **declaratively**: `stacks/valia-sites` writes the ConfigMap `valia-sites-dns` (technitium ns, `<name> → <project>.pages.dev`), and the sync script ensures/updates a CNAME per entry and **deletes** stale internal CNAMEs targeting `*.pages.dev` that left the map (retire/rename cleans itself up; deletion is suffix-scoped so nothing else can be touched).
+**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.

 ## NodeLocal DNSCache

@ -368,7 +368,6 @@ The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`)
 | TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
 | TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
 | A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
-| CNAME (CF Pages) | 2 | `<project>.pages.dev` (Cloudflare Pages) | bridge, stem95su — Valia sites (ADR-0018), managed by `stacks/valia-sites` |

 ### Proxied vs Non-Proxied

@ -514,7 +513,6 @@ For external `.viktorbarzin.me` records:
 1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
 2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
 3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
-4. For a Valia site (off-infra Cloudflare Pages), add the entry to `local.sites` in `stacks/valia-sites/main.tf` — public CNAME + internal record both follow (`docs/runbooks/valia-sites.md`)

 ## Incident History

--- a/docs/architecture/mailserver.md
+++ b/docs/architecture/mailserver.md
@ -161,17 +161,6 @@ https://mail.viktorbarzin.me → Traefik → Roundcubemail
  DB: MySQL (mysql.dbaas.svc.cluster.local)
 ```

-### Paperless ingest mailbox (docs@)
-
-`docs@viktorbarzin.me` is a dedicated real mailbox (explicit self-alias in
-`extra/aliases.txt` so the `@domain → spam@` catch-all doesn't shadow it) that
-paperless-ngx polls over IMAP; family members forward document emails to it
-and the sender maps 1:1 to a paperless account. A per-user Dovecot sieve
-(`docs-at-viktorbarzin.me.dovecot.sieve` in the `mailserver.config` ConfigMap,
-mounted as `/tmp/docker-mailserver/docs@viktorbarzin.me.dovecot.sieve`)
-discards mail from non-allowlisted senders at delivery. Full flow, sender map,
-and add-a-sender procedure: [`runbooks/paperless-mail-ingest.md`](../runbooks/paperless-mail-ingest.md).
-
 ## DNS Records

 All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
@ -311,21 +300,6 @@ Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from External

 ## Troubleshooting

-### All mail tempfailing with `451 4.3.0 queue file write error` (postsrsd spin)
-
-Seen 2026-07-03 right after a pod restart. Signature in `/var/log/mail/mail.log`:
-`postfix/cleanup: warning: tcp:localhost:10001 lookup error` +
-`sender_canonical_maps map lookup problem ... message not accepted, try again later`.
-Cause: **postsrsd** (SRS daemon, `sender_canonical_maps = tcp:localhost:10001`)
-came up spinning at 100% CPU without binding 10001/10002 — supervisor shows it
-`RUNNING` but `ss -ltn | grep 1000` is empty and its log is empty. Postfix then
-tempfails every message (inbound AND submission); senders retry so nothing is
-lost, and the roundtrip probe alerts within the hour.
-Fix: `supervisorctl restart postsrsd` inside the container; if the fresh
-process spins again (it did once), `kubectl -n mailserver delete pod` for a
-full re-init — that healed it. Root cause not pinned down (one-off bad init;
-postsrsd 1.10).
-
 ### Inbound mail not arriving
 1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
 2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia

 **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.

-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.

 **Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.

@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@ -318,20 +318,9 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

-#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
-
-Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
-
-| Alert | Expr (abridged) | For | Severity |
-|---|---|---|---|
-| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
-| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
-
-The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
-
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup