k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
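
A minimal sketch of the age-based prune described in this message, assuming the backups sit directly under /etc/kubernetes/tmp and that directory mtime is an acceptable age signal. It is Python for illustration only: the chain's actual preflight step is not shown here, and the names `stale_backups` and `prune` are hypothetical.

```python
import shutil
import time
from pathlib import Path

BACKUP_ROOT = Path("/etc/kubernetes/tmp")   # path taken from the message above
BACKUP_GLOB = "kubeadm-backup-etcd-*"       # only etcd backups, nothing else in tmp
MAX_AGE_DAYS = 3                            # the ">3 days" threshold


def stale_backups(root, max_age_days=MAX_AGE_DAYS, now=None):
    """Return backup directories under root whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(root).glob(BACKUP_GLOB)
        if p.is_dir() and not p.is_symlink() and p.stat().st_mtime < cutoff
    )


def prune(root=BACKUP_ROOT, max_age_days=MAX_AGE_DAYS, dry_run=True):
    """Delete stale backup directories; with dry_run=True only report them."""
    removed = []
    for path in stale_backups(root, max_age_days):
        if not dry_run:
            shutil.rmtree(path)
        removed.append(path)
    return removed
```

Calling `prune()` with the defaults only lists candidates; `prune(dry_run=False)` deletes them. Matching on the `kubeadm-backup-etcd-` prefix keeps the sketch from touching anything else kubeadm leaves in the same directory.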
k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
2026-06-25 15:23:15 +00:00 · 2026-06-25 14:16:04 +00:00 · 2026-06-24 22:03:15 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:49:53 +00:00
475 changed files with 40244 additions and 9211 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
--- a/.claude/home-assistant-sofia.py
+++ b/.claude/home-assistant-sofia.py
@@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
+import subprocess
 import sys
 from urllib.parse import urljoin

@@ -17,13 +18,29 @@ except ImportError:
    print("  pip install requests")
    sys.exit(1)

-# Configuration from environment variables (ha-sofia specific)
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")

-if not HA_URL or not HA_TOKEN:
-    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
-    print("These should be set when activating the Claude venv (~/.venvs/claude)")
+def _token_from_homelab():
+    """Resolve the token via the homelab CLI when the env var isn't set, so the
+    script works from any directory / unprovisioned session (see ADR-0012)."""
+    try:
+        out = subprocess.run(
+            ["homelab", "ha", "token", "--instance", "sofia"],
+            capture_output=True, text=True, timeout=30)
+        if out.returncode == 0 and out.stdout.strip():
+            return out.stdout.strip()
+    except Exception:
+        pass
+    return None
+
+
+# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
+# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
+
+if not HA_TOKEN:
+    print("ERROR: no ha-sofia API token available.")
+    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
    sys.exit(1)

 HEADERS = {
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -5,17 +5,26 @@
 ## Applications (11)
 | Application | Provider Type | Auth Flow |
 |-------------|--------------|-----------|
-| Cloudflare Access | OAuth2/OIDC | explicit consent |
+| Cloudflare Access | OAuth2/OIDC | implicit consent |
 | Domain wide catch all | Proxy (forward auth) | implicit consent |
-| Forgejo | OAuth2/OIDC | explicit consent |
+| Forgejo | OAuth2/OIDC | implicit consent |
 | Grafana | OAuth2/OIDC | implicit consent |
-| Headscale | OAuth2/OIDC | explicit consent |
-| Immich | OAuth2/OIDC | explicit consent |
+| Headscale | OAuth2/OIDC | implicit consent |
+| Immich | OAuth2/OIDC | implicit consent |
 | Kubernetes | OAuth2/OIDC (public) | implicit consent |
 | Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
-| linkwarden | OAuth2/OIDC | explicit consent |
+| linkwarden | OAuth2/OIDC | implicit consent |
+| Vault | OAuth2/OIDC | implicit consent |
 | wrongmove | OAuth2/OIDC | implicit consent |

+> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
+> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
+> and Vault (53) were switched from
+> `default-provider-authorization-explicit-consent` via the API (these
+> providers are UI-managed, not in TF). All are first-party apps; the
+> expiring consent screen (re-shown every 4 weeks per app) only slowed
+> first-time signin.
+
 > **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
 > confidential client `k8s-dashboard`, built for seamless dashboard SSO via
 > oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
@@ -60,8 +69,27 @@
 - All sources use `invitation-enrollment` as enrollment flow (new users require invitation)

 ## Authorization Flows
-- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
-- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
+- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
+- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
+
+## Authentication Flow (single-screen login, 2026-06-10)
+
+`default-authentication-flow` bindings: identification (order 10) →
+mfa-validation (order 30) → user-login (order 100). The identification
+stage (`default-authentication-identification`, pk
+`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
+`default-authentication-password`, so username + password render on ONE
+screen (one round trip instead of two). The previously separate
+password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
+was DELETED via the API — authentik requires removing it when the
+identification stage embeds the password field. `password_stage` is pinned in
+Terraform (`authentik_stage_identification.default_identification` in
+`stacks/authentik/authentik_provider.tf`); all other stage fields stay
+UI-managed via `ignore_changes`. Social-login buttons remain on the same
+screen and bypass the password field, so Google/GitHub/Facebook users are
+unaffected. If a future authentik upgrade/blueprint re-adds the order-20
+binding, users would briefly see a second password prompt — delete the
+binding again.

 ## Invitation Enrollment Flow
 Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@@ -138,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:

 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
+| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |

@@ -149,7 +178,19 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
-- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
+
+## WebAuthn / Passkeys (2026-06-20)
+
+- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
+- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
+- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
+- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
+- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
+- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
+- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
+- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
+- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
+- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
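
The passkey-cleanup lesson in the WebAuthn / Passkeys notes above (exact-match the demo user, never take `users[0]` of a fuzzy `?search=`) can be sketched as below. This is an illustration only: it assumes the search response has already been parsed into a list of user dicts with a `username` key, and the helper name `pick_exact_user` and the sample records are hypothetical.

```python
def pick_exact_user(results, username):
    """Return the single user whose username equals `username` exactly.

    `results` is the list of user dicts returned by a fuzzy search (assumed
    shape). Raises LookupError instead of guessing when there is no match or
    more than one, so cleanup code cannot fall through to the wrong account.
    """
    matches = [u for u in results if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search results: a real account matches the search term
# through its display name and sorts ahead of the intended demo user.
results = [
    {"pk": 7, "username": "alice", "name": "Alice (demo owner)"},
    {"pk": 42, "username": "demo", "name": "Demo user"},
]

user = pick_exact_user(results, "demo")
print(user["pk"])  # 42, not results[0]["pk"]
```

Failing loudly on zero or multiple matches is the point of the design: a test cleanup that cannot identify its demo user should stop before issuing any `DELETE`.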

 ## Upgrade Validation Checklist

@@ -161,8 +202,9 @@ Run after **any** of these:
 The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.

 ```bash
-# 1. Service routes to the outpost pod (NOT the server pods).
-#    Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
+# 1. Service routes to the outpost pods (NOT the server pods).
+#    Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
+#    (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
 kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost

 # 2. Service selector still excludes the server pods. Expected: includes
--- a/.claude/reference/proxmox-inventory.md
+++ b/.claude/reference/proxmox-inventory.md
@@ -92,19 +92,21 @@ Channel 3:  A4 [32G] ──── A8 [32G]  ──── A12[ 8G ]     = 72 GB
 | VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
 |------|------|--------|------|-----|---------|------|-------|
 | 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
-| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
+| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
 | 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
 | 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
-| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
-| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
-| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
+| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
+| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
+| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
 | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
 | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
 | ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |

-**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
+**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).
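
The ~288 GB figure can be re-derived from the table rows above. A short worked check, assuming only VMs whose Status is `running` are counted (so `pbs` and the decommissioned `truenas` are excluded); the variable and function names are illustrative.

```python
# RAM in GB per running VM, copied from the updated table rows.
RUNNING_VM_RAM_GB = {
    "pfsense": 4,
    "devvm": 24,
    "home-assistant": 8,
    "k8s-master": 32,
    "k8s-node1": 48,
    "k8s-node2": 32,
    "k8s-node3": 32,
    "k8s-node4": 32,
    "k8s-node5": 32,
    "k8s-node6": 32,
    "docker-registry": 4,
    "Windows10": 8,
}
PHYSICAL_RAM_GB = 272


def overcommit(vm_ram_gb, physical_gb):
    """Return (total allocated GB, allocated / physical ratio)."""
    total = sum(vm_ram_gb.values())
    return total, total / physical_gb


total, ratio = overcommit(RUNNING_VM_RAM_GB, PHYSICAL_RAM_GB)
print(f"{total} GB allocated, {ratio:.1%} of {PHYSICAL_RAM_GB} GB physical")
# prints "288 GB allocated, 105.9% of 272 GB physical"
```

A ratio above 100% is what the OVERCOMMITTED note refers to: the nominal allocation only fits because of ballooning and host swap.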

 ## VM Templates
 | VMID | Name | Purpose |
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---

 # Home Assistant Control
@@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
 - Environment variables for each instance:
  - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
  - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
+  - If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
+
+## homelab CLI (preferred — works from any directory)
+- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
+- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
+- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.

 ## API Control

@@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -418,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).

 ### SSH Access
 ```bash
--- a/.claude/skills/upgrade-state/SKILL.md
+++ b/.claude/skills/upgrade-state/SKILL.md
@@ -51,7 +51,7 @@ Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
 |---|---|---|---|
 | **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
 | **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
-| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
+| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |

 The K8s pipeline pushes a small set of gauges to the Prometheus
 Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
@@ -61,8 +61,11 @@ Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
 - `k8s_upgrade_in_flight` — 0/1
 - `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)

-`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
-been running >90 minutes. The script raises `✗` in the same window.
+`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running
+>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally
+failed — including a **preflight that aborted before `in_flight` was set**
+(the gates exit pre-metric). The script raises `✗` for either, and reads the
+Jobs directly, so it also catches a Failed preflight that left no metric.

 ## Status-icon legend

@@ -72,7 +75,7 @@ been running >90 minutes. The script raises `✗` in the same window.
 | `→` | Update available, not yet applied (K8s patch/minor) |
 | `…` | In flight — chain currently running |
 | `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
-| `✗` | Broken: pod down, alert firing, chain stalled |
+| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed |

 ## Drill-down — when a row trips, what to do

@@ -177,6 +180,31 @@ kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -
       --header='Content-Type: text/plain'"
 ```

+### K8s `✗ chain failed` — a phase Job terminally failed
+
+`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted
+on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) —
+these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and
+the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge).
+
+```bash
+kubectl -n k8s-upgrade get jobs
+kubectl -n k8s-upgrade describe job <failed-job>    # check the Failed reason
+# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
+# them. Replay the gate instead — which critical alerts were firing at the
+# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)
+```
+
+Recovery is now mostly automatic: the detection CronJob and `spawn_next`
+re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a
+transient gate clears within ~24h. To expedite, delete the Failed Job and
+trigger detection:
+
+```bash
+kubectl -n k8s-upgrade delete job <failed-job>
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
+```
+
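
Reading the Jobs directly, as the section above describes, amounts to looking for a `Failed` condition with status `"True"` in each Job's `status.conditions`. A minimal sketch over the parsed output of `kubectl -n k8s-upgrade get jobs -o json`, assuming the standard Job list shape; this is not the status script itself, and the function name and sample Job names are hypothetical.

```python
def failed_jobs(job_list):
    """Return (name, reason) for every Job carrying a Failed=True condition.

    `job_list` is the dict parsed from `kubectl get jobs -o json`. Jobs that
    are still running or that completed are skipped.
    """
    failed = []
    for job in job_list.get("items", []):
        conditions = job.get("status", {}).get("conditions") or []
        for cond in conditions:
            if cond.get("type") == "Failed" and cond.get("status") == "True":
                failed.append((job["metadata"]["name"], cond.get("reason", "")))
                break
    return failed


# Hypothetical sample: one failed preflight, one completed Job, one running.
sample = {
    "items": [
        {
            "metadata": {"name": "preflight-example"},
            "status": {"conditions": [
                {"type": "Failed", "status": "True", "reason": "BackoffLimitExceeded"},
            ]},
        },
        {
            "metadata": {"name": "detect-example"},
            "status": {"conditions": [{"type": "Complete", "status": "True"}]},
        },
        {"metadata": {"name": "running-example"}, "status": {}},
    ]
}

for name, reason in failed_jobs(sample):
    print(f"{name}: {reason}")
```

Because this inspects Job objects rather than Pushgateway gauges, it reports a preflight that aborted before `in_flight` was ever set.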
 ### K8s `✗ detection stale` — last detection >9 days

 ```bash
--- a/.github/workflows/build-android-emulator.yml
+++ b/.github/workflows/build-android-emulator.yml
@@ -0,0 +1,36 @@
+name: Build android-emulator
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
+# Large image (Android SDK + emulator); on-demand workload (scaled 0). Rebuilds
+# rare → dispatch + path trigger.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/android-emulator/docker/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/android-emulator/docker
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/android-emulator:latest
+            ghcr.io/viktorbarzin/android-emulator:${{ github.sha }}
--- a/.github/workflows/build-chrome-service-browser.yml
+++ b/.github/workflows/build-chrome-service-browser.yml
@@ -0,0 +1,39 @@
+name: Build chrome-service-browser
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
+# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
+# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
+# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
+# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
+# the pod pulls it without credentials.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/chrome/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/chrome
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-browser:latest
+            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
--- a/.github/workflows/build-chrome-service-novnc.yml
+++ b/.github/workflows/build-chrome-service-novnc.yml
@@ -0,0 +1,36 @@
+name: Build chrome-service-novnc
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
+# Source Dockerfile identical on both git remotes, so the github checkout builds
+# the current image. Rebuilds are rare (stable noVNC proxy) → dispatch + path.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/novnc/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/novnc
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-novnc:latest
+            ghcr.io/viktorbarzin/chrome-service-novnc:${{ github.sha }}
--- a/.github/workflows/build-cli.yml
+++ b/.github/workflows/build-cli.yml
@@ -0,0 +1,41 @@
+name: Build infra CLI
+
+# ADR-0002: infra CLI built off-infra on GHA. Replaces the Woodpecker
+# build-cli.yml. Pushes to DockerHub (public distribution, kept) + ghcr.
+# Not a cluster workload — a distributed tool image.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'cli/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: cli
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            viktorbarzin/infra:latest
+            ghcr.io/viktorbarzin/infra-cli:latest
+            ghcr.io/viktorbarzin/infra-cli:${{ github.sha }}
--- a/.github/workflows/build-infra-ci.yml
+++ b/.github/workflows/build-infra-ci.yml
@@ -0,0 +1,37 @@
+name: Build infra-ci
+
+# ADR-0002: the infra CI toolbox image (terraform/terragrunt/sops/kubectl/vault)
+# built off-infra on GHA → ghcr (public). BOOTSTRAP-CRITICAL: .woodpecker/default.yml's
+# apply step runs in this image. The Woodpecker build-ci-image.yml is kept until a
+# ghcr-based apply is proven, then removed.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'ci/Dockerfile'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: ci
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/infra-ci:latest
+            ghcr.io/viktorbarzin/infra-ci:${{ github.sha }}
--- a/.github/workflows/build-k8s-portal.yml
+++ b/.github/workflows/build-k8s-portal.yml
@@ -0,0 +1,36 @@
+name: Build k8s-portal
+
+# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
+# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
+# the in-cluster .woodpecker/k8s-portal.yml build.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/k8s-portal/modules/k8s-portal/files/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/k8s-portal/modules/k8s-portal/files
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/k8s-portal:latest
+            ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}
--- a/.gitignore
+++ b/.gitignore
@@ -103,3 +103,16 @@ stacks/terminal/clipboard-upload/clipboard-upload
 # Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only)
 terraform.tfstate
 terraform.tfstate.backup
+
+# Per-feature git worktrees (worktree-first workflow — execution.md)
+.worktrees/
+
+# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
+# secrets; created by terraform state ops. The patterns above miss the timestamped form.
+terraform.tfstate.*.backup
+
+# Python test artifacts (pytest bytecode cache) — e.g. from
+# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
+__pycache__/
+*.pyc
+.pytest_cache/
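Review note: the new `terraform.tfstate.*.backup` pattern can be checked without creating any state files, since `git check-ignore` also accepts paths that do not exist. The sketch below is only illustrative; the helper name `is_ignored` and the sample timestamp are assumptions, not part of this change.

```shell
# Hypothetical helper (not in the repo): succeeds when git would ignore the path.
# `git check-ignore -q` exits 0 for an ignored path and 1 otherwise.
is_ignored() {
  git check-ignore -q -- "$1"
}
```

Run from the repo root, `is_ignored terraform.tfstate.1718000000.backup && echo ignored` should print `ignored` once this hunk is applied; the timestamp shape is made up for the example.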
--- a/.mcp.json
+++ b/.mcp.json
@@ -3,10 +3,6 @@
    "ha": {
      "type": "http",
      "url": "${HA_MCP_URL}"
-    },
-    "paperless": {
-      "type": "http",
-      "url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp"
    }
  }
 }
--- a/.woodpecker/breakglass-infra-ci.yml
+++ b/.woodpecker/breakglass-infra-ci.yml
@@ -0,0 +1,31 @@
+# Break-glass: save the ghcr infra-ci image to a tarball on the registry VM
+# (10.0.20.10) so it can be `docker load`-ed onto a node if ghcr is ever
+# unreachable during a recovery. infra-ci now builds on GHA → ghcr (ADR-0002),
+# which is external + node-cached, so this is a belt-and-braces DR artifact —
+# run MANUALLY after an infra-ci rebuild (or periodically). Pulls from ghcr
+# (public, no login). Recovery: docs/runbooks/forgejo-registry-breakglass.md.
+when:
+  - event: manual
+
+steps:
+  - name: breakglass-tarball
+    image: alpine:3.20
+    failure: ignore
+    environment:
+      REGISTRY_SSH_KEY:
+        from_secret: registry_ssh_key
+    commands:
+      - apk add --no-cache openssh-client
+      - mkdir -p ~/.ssh && chmod 700 ~/.ssh
+      - printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
+      - chmod 600 ~/.ssh/id_ed25519
+      - ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
+      - |
+        ssh -n -o BatchMode=yes root@10.0.20.10 "
+          set -e
+          mkdir -p /opt/registry/data/private/_breakglass
+          IMAGE=ghcr.io/viktorbarzin/infra-ci:latest
+          docker pull \$IMAGE
+          docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
+          ls -lh /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
+        "
--- a/.woodpecker/build-ci-image.yml
+++ b/.woodpecker/build-ci-image.yml
@@ -1,88 +0,0 @@
-# Build the CI tools Docker image used by all infra pipelines.
-# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
-# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.
-
-when:
-  - event: push
-    branch: master
-    path:
-      include:
-        - 'ci/Dockerfile'
-  - event: manual
-
-steps:
-  - name: build-and-push
-    image: woodpeckerci/plugin-docker-buildx
-    settings:
-      # Phase 4 of forgejo-registry-consolidation 2026-05-07 —
-      # registry.viktorbarzin.me dropped, Forgejo is the only target.
-      repo:
-        - forgejo.viktorbarzin.me/viktor/infra-ci
-      dockerfile: ci/Dockerfile
-      context: ci/
-      tags:
-        - latest
-        - "${CI_COMMIT_SHA:0:8}"
-      platforms: linux/amd64
-      logins:
-        - registry: forgejo.viktorbarzin.me
-          username:
-            from_secret: forgejo_user
-          password:
-            from_secret: forgejo_push_token
-
-  # Post-push integrity check is now redundant with the every-15min
-  # forgejo-integrity-probe in stacks/monitoring/, which walks
-  # /v2/_catalog + HEADs every blob across the entire Forgejo registry.
-  # If a corruption pattern emerges that the periodic probe misses,
-  # restore a verify step similar to the pre-Phase-4 version (see
-  # commit 49f4956f) but pointed at forgejo.viktorbarzin.me.
-
-  # Break-glass tarball: save the just-pushed infra-ci image to disk on the
-  # registry VM (10.0.20.10) so we can `docker load` it back into a node
-  # when Forgejo is unreachable. Pulls from Forgejo (the only registry now).
-  # Best-effort — failure here doesn't fail the pipeline.
-  # Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md.
-  - name: breakglass-tarball
-    image: alpine:3.20
-    failure: ignore
-    environment:
-      REGISTRY_SSH_KEY:
-        from_secret: registry_ssh_key
-      FORGEJO_USER:
-        from_secret: forgejo_user
-      FORGEJO_PASS:
-        from_secret: forgejo_push_token
-    commands:
-      - apk add --no-cache openssh-client
-      - mkdir -p ~/.ssh && chmod 700 ~/.ssh
-      - printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
-      - chmod 600 ~/.ssh/id_ed25519
-      - ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
-      - SHA=${CI_COMMIT_SHA:0:8}
-      - |
-        ssh -n -o BatchMode=yes root@10.0.20.10 "
-          set -e
-          mkdir -p /opt/registry/data/private/_breakglass
-          IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA
-          echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin
-          docker pull \$IMAGE
-          docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz
-          ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
-          ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \
-            | grep -v 'latest' | tail -n +6 | xargs -r rm -v
-          ls -lh /opt/registry/data/private/_breakglass/
-        "
-
-  - name: slack
-    image: curlimages/curl
-    commands:
-      - |
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \
-          "$SLACK_WEBHOOK" || true
-    environment:
-      SLACK_WEBHOOK:
-        from_secret: slack_webhook
-    when:
-      status: [success]
--- a/.woodpecker/build-cli.yml
+++ b/.woodpecker/build-cli.yml
@@ -1,42 +0,0 @@
-when:
-  event: push
-
-clone:
-  git:
-    image: woodpeckerci/plugin-git
-    settings:
-      attempts: 5
-      backoff: 10s
-
-steps:
-  - name: build-image
-    image: woodpeckerci/plugin-docker-buildx
-    settings:
-      username: "viktorbarzin"
-      password:
-        from_secret: dockerhub-pat
-      # Phase 4 of forgejo-registry-consolidation 2026-05-07 —
-      # registry.viktorbarzin.me:5050 decommissioned. Push to DockerHub
-      # (the public-facing infra image) AND Forgejo (the cluster pull
-      # source). Same image, two locations.
-      repo:
-        - viktorbarzin/infra
-        - forgejo.viktorbarzin.me/viktor/infra
-      logins:
-        - registry: https://index.docker.io/v1/
-          username: viktorbarzin
-          password:
-            from_secret: dockerhub-pat
-        - registry: forgejo.viktorbarzin.me
-          username:
-            from_secret: forgejo_user
-          password:
-            from_secret: forgejo_push_token
-      dockerfile: cli/Dockerfile
-      context: cli
-      auto_tag: true
-      # cache_from/cache_to removed: registry cache corruption causes
-      # "short read: expected 32 bytes" BuildKit errors. Inline cache
-      # will be re-populated once a clean image is pushed.
-      # cache_from: "registry.viktorbarzin.me:5050/infra:latest"
-      # cache_to: "type=inline"
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@@ -19,13 +19,34 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 2
      attempts: 5
      backoff: 10s

 steps:
+  # Audit feed for the allow-then-audit contribution model: any master push by
+  # a NON-admin author is surfaced in Slack (Viktor's own pushes are not).
+  # Runs before apply and never blocks it. Note: [ci skip] commits never reach
+  # this step (Woodpecker skips the whole pipeline) — hence the rule that
+  # non-admins must not use [ci skip].
+  - name: notify-nonadmin-push
+    image: curlimages/curl
+    environment:
+      SLACK_WEBHOOK:
+        from_secret: slack_webhook
+    commands:
+      - |
+        case "$CI_COMMIT_AUTHOR" in
+          viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;;
+        esac
+        SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\')
+        curl -s -X POST -H 'Content-type: application/json' \
+          --data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \
+          "$SLACK_WEBHOOK" || true
+
  - name: apply
-    image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
+    image: ghcr.io/viktorbarzin/infra-ci:latest
    pull: true
    backend_options:
      kubernetes:
@@ -115,6 +136,25 @@ steps:
          git fetch --deepen=1 origin master 2>/dev/null || true
        fi

+        # Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA).
+        # HEAD~1 is WRONG for merge commits — it is the first parent (the
+        # feature-branch side), so the diff shows the OTHER lineage's files
+        # and silently skips the stacks this push actually changed
+        # (bit ci-pipeline-health on 2026-06-12, pipeline 128).
+        DIFF_BASE="HEAD~1"
+        if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then
+          git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true
+          # Restarted pipelines after master moved produce REVERSE diffs
+          # (CI_PREV ahead of the checked-out HEAD re-applied stale trees and
+          # reverted a sibling apply on 2026-06-12, pipeline 148). Only use
+          # CI_PREV when it is an ancestor of HEAD.
+          if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \
+             && git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then
+            DIFF_BASE="$CI_PREV_COMMIT_SHA"
+          fi
+        fi
+        echo "Diff base: $DIFF_BASE"
+
        # If still no parent, apply all platform stacks as a safe fallback
        if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
          echo "Cannot determine changed files — applying ALL platform stacks"
@@ -122,14 +162,14 @@ steps:
          > .app_apply
        else
          # Check if global files changed (triggers full platform apply)
-          GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
+          GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)

          if [ -n "$GLOBAL_CHANGED" ]; then
            echo "Global files changed — applying ALL platform stacks"
            echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
          else
            # Detect platform stacks that changed
-            git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
+            git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
            > .platform_apply
            while read -r stack; do
              if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
@@ -140,7 +180,7 @@ steps:

          # Detect app stacks that changed
          > .app_apply
-          git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
+          git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
            if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
              continue  # Skip platform stacks
            fi
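Review note: the diff-base rule added in this file can be exercised outside the pipeline. The sketch below restates it as a standalone function, assuming it runs inside a git checkout; `pick_diff_base` and its two positional arguments are hypothetical names standing in for `CI_PREV_COMMIT_SHA` and the checked-out commit, and the fetch fallback from the hunk is deliberately left out.

```shell
# Hypothetical restatement of the DIFF_BASE selection (names are illustrative).
# $1: the push's reported previous commit (may be empty)
# $2: the checked-out commit, e.g. HEAD
pick_diff_base() {
  prev="$1"
  head="$2"
  base="${head}~1"   # default: first parent of the checked-out commit
  # Use the previous commit only when it differs from the head, exists
  # locally, and is an ancestor of the head. A non-ancestor would yield a
  # reverse diff, which is the failure mode the hunk comments describe.
  if [ -n "$prev" ] && [ "$prev" != "$head" ] \
     && git cat-file -e "${prev}^{commit}" 2>/dev/null \
     && git merge-base --is-ancestor "$prev" "$head" 2>/dev/null; then
    base="$prev"
  fi
  printf '%s\n' "$base"
}
```

A call such as `git diff --name-only "$(pick_diff_base "$CI_PREV_COMMIT_SHA" HEAD)" HEAD` would then mirror the three `git diff` call sites changed above.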
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@@ -9,12 +9,13 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

 steps:
  - name: detect-drift
-    image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
+    image: ghcr.io/viktorbarzin/infra-ci:latest
    pull: true
    backend_options:
      kubernetes:
--- a/.woodpecker/issue-automation.yml
+++ b/.woodpecker/issue-automation.yml
@@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 2

 steps:
--- a/.woodpecker/k8s-portal.yml
+++ b/.woodpecker/k8s-portal.yml
@@ -1,49 +0,0 @@
-when:
-  event: push
-  branch: master
-  path:
-    include:
-      - "stacks/platform/modules/k8s-portal/files/**"
-
-clone:
-  git:
-    image: woodpeckerci/plugin-git
-    settings:
-      attempts: 5
-      backoff: 10s
-
-steps:
-  - name: build-and-push
-    image: woodpeckerci/plugin-docker-buildx
-    settings:
-      username: "viktorbarzin"
-      password:
-        from_secret: dockerhub-pat
-      repo: viktorbarzin/k8s-portal
-      dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
-      context: stacks/platform/modules/k8s-portal/files
-      platforms:
-        - linux/amd64
-      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
-      cache_from: "viktorbarzin/k8s-portal:latest"
-      cache_to: "type=inline"
-
-  - name: deploy
-    image: bitnami/kubectl:latest
-    commands:
-      - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
-      - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
-      - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
-
-  - name: slack
-    image: curlimages/curl
-    commands:
-      - |
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
-          "$SLACK_WEBHOOK" || true
-    environment:
-      SLACK_WEBHOOK:
-        from_secret: slack_webhook
-    when:
-      status: [success, failure]
--- a/.woodpecker/postmortem-todos.yml
+++ b/.woodpecker/postmortem-todos.yml
@@ -11,6 +11,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 5

 steps:
--- a/.woodpecker/provision-user.yml
+++ b/.woodpecker/provision-user.yml
@@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      attempts: 5
      backoff: 10s

--- a/.woodpecker/pve-nfs-exports-sync.yml
+++ b/.woodpecker/pve-nfs-exports-sync.yml
@@ -23,6 +23,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

--- a/.woodpecker/registry-config-sync.yml
+++ b/.woodpecker/registry-config-sync.yml
@@ -38,6 +38,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      depth: 1
      attempts: 3

--- a/.woodpecker/renew-tls.yml
+++ b/.woodpecker/renew-tls.yml
@@ -6,6 +6,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
+      partial: false
      attempts: 5
      backoff: 10s

--- a/AGENTS.md
+++ b/AGENTS.md
@@ -9,7 +9,7 @@
 - **Ask before `git push`** — always confirm with the user first

 ## Execution
-- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
+- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
 - **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
 - **kubectl**: `kubectl --kubeconfig $(pwd)/config`
 - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@@ -90,6 +90,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 - **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
 - **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
 - **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
+- **CI compute is external (ADR-0002, 2026-06-12)**: builds, tests, lint, and release jobs run on GitHub Actions hosted runners via each repo's GitHub mirror — never on cluster nodes. In-cluster pipelines exist only for steps that need cluster access (Woodpecker `kubectl set image` deploys, terragrunt applies, certbot). Never add an in-cluster build or test pipeline to any repo; the fallback-build pattern was deliberately removed. After pushing anything that fires a build chain, watch it end-to-end (GHA run → Woodpecker deploy → rollout) before calling the change done — verify live state, not the checkmark.

 ## Key Paths
 - `stacks/<service>/main.tf` — service definition
@@ -109,7 +110,8 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 - **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
 - **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
 - **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
-- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
+- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config, **VM images via `vzdump-vms`**). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
+- **vzdump-vms** (Daily 01:00): live `vzdump --mode snapshot` of hand-managed VMs (NOT in TF) → `/mnt/backup/vzdump/`, keep 3/VMID. `VZDUMP_VMIDS` default `102` (devvm) — the only VM imaged today; before this (2026-06-09) no VM was ever imaged. NOT in the incremental offsite manifest; monthly full pass mirrors it. See `docs/architecture/backup-dr.md`.
 - **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
 - **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
 - **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
@@ -225,7 +227,69 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
 4. Viktor reviews → CI applies → Slack notification
 5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide

+### Non-admin workstation users — the AGENT does the git work
+
+Non-admin devvm users (power-user / namespace-owner tiers) may not know git at
+all. Their agent handles every version-control step silently — never ask them
+to commit, push, pull, or open a PR, and never surface git jargon at them.
+Their infra clone arrives preconfigured: git identity, a `forgejo` remote
+authenticated via `~/.git-credentials`, and `master` tracking `forgejo/master`
+(auto-freshened hourly and at session launch, fast-forward only).
+
+Two per-user layouts exist (`code_layout` in
+`scripts/workstation/roster.yaml`): `single` (the default) — `~/code` IS the
+locked infra clone — and `workspace` — `~/code` is a plain directory of
+per-project clones: the infra clone at `~/code/infra`, plus each roster
+`repos` entry (e.g. `~/code/tripit`) cloned from Forgejo `viktor/<name>` with
+the user's own PAT. The reconcile auto-migrates a single-layout `~/code` when
+a user is flipped to `workspace`, and keeps every clone fresh either way.
+
+The model is **allow-then-audit** (Viktor, 2026-06-10): whitelisted users (emo)
+push straight to `master` — no PR gate — and the record of *what changed and
+why* is what matters. Force-push is disabled for everyone, so master history
+is append-only.
+
+**Feature-sized work is worktree-first** (org rule, 2026-06-10): develop in an
+isolated worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>` off
+`forgejo/master`) so concurrent agent sessions never collide in the clone, then
+land by merging latest master into the branch and pushing it
+(`git push forgejo HEAD:master`, or the PR fallback below if not whitelisted) —
+the audit-trail rules below apply to the branch's commit messages all the same.
+Locked (git-crypt) clones can use plain `git worktree add`. Trivial
+single-commit fixes may be committed directly on a clean `master`. Full
+lifecycle: `~/.claude/rules/execution.md` §3.
+
+To land a finished change from such a clone:
+
+1. Commit on `master`. **The commit message is the audit trail** — this matters
+   more than the change itself:
+   - subject: what changed, specific ("ha-sofia: lower fan curve bias to -5")
+   - body: WHY, in plain words — paraphrase the user's actual request and any
+     reasoning ("Emil asked for quieter fans in the evening; curve was
+     overshooting after the 2026-06-08 redesign")
+2. `git push forgejo master`. If rejected non-fast-forward: `git pull --rebase
+   forgejo master` and push again.
+3. **Never use `[ci skip]`** as a non-admin — it hides the change from the
+   Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
+4. Leave the clone on clean `master` so auto-refresh keeps working.
+5. Tell the user in plain language what happened. Stack changes are
+   auto-applied by CI — verify the live result with the user's read-only
+   kubectl before saying "it's live".
+
+If a push to `master` is rejected by branch protection (user not on the
+whitelist — e.g. new users before Viktor grants it), fall back to a
+`<os-user>/<short-topic>` branch + PR with the user's own PAT
+(`write:repository` suffices — verified 2026-06-10):
+
+```bash
+TOK=$(sed -E 's#https://[^:]+:([^@]+)@.*#\1#' ~/.git-credentials)
+curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' \
+  https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/pulls \
+  -d '{"title":"<title>","head":"<os-user>/<short-topic>","base":"master","body":"<what + why>"}'
+```
+
 ## Common Operations
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
 - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
 - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
 - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
 _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.

 **Calico**:
-The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
+The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
 _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.

+**Service identity**:
+How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
+_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
+
+**Goldmane / Whisker**:
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
+_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
+
 ### Storage

 **proxmox-lvm-encrypted**:
@ -149,7 +157,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed

 **CNPG** / **pg-cluster**:
 **CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
-_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
+_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.

 ### Secrets

@ -169,8 +177,24 @@ A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinc
 ### CI/CD

 **GHA build + Woodpecker deploy**:
-The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
-_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
+The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images.
+_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
+
+**Canonical repo**:
+The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
+_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
+
+**GitHub mirror**:
+The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
+_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
+
+**GitHub-first repo**:
+The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
+_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
+
+**Forgejo registry**:
+Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
+_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.

 **Keel**:
 The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**).
@ -192,6 +216,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
 - A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
 - **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
 - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
+- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
 - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.

 ## Example dialogue
@ -211,3 +236,4 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
 - **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
 - **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
 - **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine.
+- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one.
--- a/cli/README.md
+++ b/cli/README.md
@ -1,2 +1,224 @@
-# What is this?
-This is a CLI to manipulate files in the terraform repo and commit and push them
+# homelab
+
+`homelab` is the unified, agent-facing CLI for operating this homelab — one
+composable, JSON-capable surface for the operations agents run over and over,
+discovered progressively at runtime. It is grown **in place** from this
+directory (the former `infra-cli`), and the legacy webhook use-cases still work
+(see below).
+
+It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
+third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
+
+## Usage
+
+```
+homelab <command> [args]
+homelab manifest [--json]    # list every verb + its read/write tier (discovery entrypoint)
+homelab version
+```
+
+### v0.1 verbs — the infra inner-loop
+
+| Command | Tier | What it does |
+|---|---|---|
+| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
+| `release <kind>:<name>` | write | release a presence claim |
+| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
+| `tf validate <stack>` | read | `scripts/tg validate` |
+| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
+| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
+| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
+| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
+| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
+| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
+
+### v0.2 verbs — Kubernetes
+
+Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
+(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
+kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
+ambient kubeconfig.
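The resolver defaults described above can be sketched roughly as follows. This is an illustrative sketch only, not the CLI's actual code: `resolveTarget`, its parameters, and the `grafana`/`monitoring` names are hypothetical, and only the two defaults stated in the text (namespace = app, target = `deploy/<app>`) are modelled.

```go
package main

import "fmt"

// resolveTarget is a hypothetical helper mirroring the documented defaults:
// the namespace defaults to the app name and the kubectl target defaults to
// deploy/<app>. Non-empty overrides (think -n / --pod) win over the defaults.
func resolveTarget(app, nsOverride, podOverride string) (ns, target string) {
	ns = app
	if nsOverride != "" {
		ns = nsOverride
	}
	target = "deploy/" + app
	if podOverride != "" {
		target = "pod/" + podOverride
	}
	return ns, target
}

func main() {
	// No overrides: namespace and deployment both follow the app name.
	ns, target := resolveTarget("grafana", "", "")
	fmt.Println(ns, target)

	// Namespace override for an app living in a multi-app namespace.
	ns, target = resolveTarget("grafana", "monitoring", "")
	fmt.Println(ns, target)
}
```

Keeping the defaults in one small pure function makes the "one namespace, one app" assumption explicit and easy to override per call.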
+
+| Command | Tier | What it does |
+|---|---|---|
+| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
+| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
+| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
+| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
+| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
+| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
+| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
+| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
+| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
+| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
+| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
+
+Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
+**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
+
+`tf` resolves the stack dir by walking up from cwd to the infra root and
+delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
+the ingress auth-comment check). git-crypt filter flags are auto-injected on git
+operations in the encrypted infra repo.
+
+**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
+auto-detected suite) unless you pass `--no-verify` — landing to master unverified
+must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
+landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
+
+Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
+reads / prompt writes; v0.1 allows everything and relies on existing gates
+(permission mode, presence claims, plan approval).
+
+### v0.3 verbs — memory
+
+A thin HTTP client over the **claude-memory** service (the same backend the
+memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
+`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
+ingress). Because it hits the HTTP API directly, it **works even when the MCP
+frontend is down**.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
+| `memory list [--category --tag --limit]` | read | recent memories |
+| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
+| `memory secret <id>` | read | reveal a sensitive memory's content |
+| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
+| `memory update <id> [--content --tags --importance]` | write | edit a memory |
+| `memory delete <id>` | write | delete a memory |
+
+All read/write paths are validated against the live API (incl. a
+store→recall→delete round-trip). This gives full data-plane parity with the MCP;
+the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
+to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
+see `docs/adr/0008`.
+
+### v0.4 verbs — ci / deploy
+
+Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
+talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
+`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
+remote, with retries that ride out Woodpecker's intermittent empty responses.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
+| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
+| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
+
+`work land` now calls `ci watch` on the landed commit automatically (skip with
+`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
+step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
+the least reliable; `status`/`watch` use the list endpoint that works.
+
+### v0.5 verbs — net / dns / metrics / logs
+
+Reachability + observability probes. Their value is *endpoint resolution* — the
+non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
+otherwise re-derive every time — not the HTTP call itself. All reach internal
+ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
+
+| Command | Tier | What it does |
+|---|---|---|
+| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
+| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
+| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
+| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
+| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
+
+Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
+no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
+firing set is reachable via `ALERTS` instead.)
+
+### v0.6 — usage telemetry (`usage top`)
+
+Makes "which verbs are actually used, by everyone" a query instead of a guess —
+so adding the *next* verb is evidence-driven, not shaped by one person's habits.
+
+Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
+labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
+flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
+affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
+the shared Loki, aggregate usage is queryable **without reading anyone's home** —
+the privacy-preserving answer to "what does the team use."
+
+| Command | Tier | What it does |
+|---|---|---|
+| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
+
+### v0.7 verbs — Home Assistant
+
+Cover exactly the two things the `ha` **MCP server can't**: resolving the
+long-lived API token out of the cluster, and SSH to the HA host for host-level
+work (config files, docker, add-ons). Entity state and control (`turn_on`,
+`get_state`, services) stay with the MCP — *actions an MCP already encodes are
+out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
+the non-obvious *which secret, which host, which key, which flags* you'd
+otherwise re-derive every session — agents were hand-rolling a
+`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
+every run because the existing `home-assistant-sofia.py` needs an env var set
+and a cwd-relative path, neither of which holds in an arbitrary session.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
+| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
+
+`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
+prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
+`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
+not tied to whoever first wrote the workflow (the user's key must be enrolled on
+the HA host).
+
+### v0.8 verbs — browser (headful anti-bot automation)
+
+Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
+from the devvm over CDP, for sites that detect and block headless automation. The
+headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
+the gated action (submit/login) silently fails — the motivating case was the
+Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
+`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
+injects the same `stealth.js` the in-cluster callers use, and submits first try.
+
+The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
+agent supplies the Playwright script — judgment stays out of the CLI.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
+| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
+| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
+
+Default context is a **fresh incognito** one (closed on exit) — safe for the
+shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
+reuses the warmed persistent profile when a pre-logged-in session is needed.
+`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
+that gates in-cluster callers — no namespace label needed. The node CDP client is
+pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
+(Chromium 130; protocol changes between minors) and is installed once, lazily,
+into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
+runs on the devvm, `setInputFiles` streams local files to the remote browser over
+CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
+and `docs/adr/0013`.
+
+## Build / install
+
+Built from source to `/usr/local/bin/homelab` during devvm provisioning
+(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
+stamped from `cli/VERSION` via ldflags. Manual build:
+
+```
+cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
+go test ./...
+```
+
+## Legacy webhook use-cases (preserved)
+
+This binary is also the in-cluster `infra-cli` image. Invocations starting with
+`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
+original flag-based path unchanged, so the webhook handler is unaffected.
+
+## Design
+
+See `infra/docs/adr/0004`–`0013` for the architecture decisions.
--- a/cli/VERSION
+++ b/cli/VERSION
@ -0,0 +1 @@
+v0.8.1
--- a/cli/browser.go
+++ b/cli/browser.go
@ -0,0 +1,388 @@
+package main
+
+import (
+	_ "embed"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"os"
+	"os/exec"
+	"os/signal"
+	"path/filepath"
+	"strconv"
+	"strings"
+	"sync"
+	"syscall"
+	"time"
+)
+
+// playwrightVersion pins the node CDP client to the chrome-service image minor
+// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
+// speaks the browser's CDP, so the client minor must track the server minor;
+// see docs/architecture/chrome-service.md "Image pin".
+const playwrightVersion = "1.48.2"
+
+// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
+// endpoint to become ready before giving up.
+const defaultBrowserTimeout = 60
+
+const (
+	chromeServiceNamespace = "chrome-service"
+	chromeServiceName      = "chrome-service"
+	chromeServiceCDPPort   = 9222
+)
+
+// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
+// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
+// guards against drift.
+//
+//go:embed browser_stealth.js
+var stealthJS string
+
+// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
+// installs the stealth init script, and runs the user's Playwright script.
+//
+//go:embed browser_runner.js
+var runnerJS string
+
+// browserOpts is the parsed form of `homelab browser run|open` arguments.
+type browserOpts struct {
+	mode      string // "run" | "open"
+	script    string // path to the user Playwright script (run mode)
+	url       string // initial URL (run: optional; open: required positional)
+	sharedCtx bool   // use the warmed persistent profile instead of a fresh context
+	keepOpen  bool   // leave the created context/pages open on exit
+	port      int    // explicit local port for the forward (0 = auto)
+	timeout   int    // CDP readiness timeout, seconds
+	help      bool
+}
+
+// parseBrowserArgs parses the args after `browser run` / `browser open`.
+func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
+	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
+	var positionals []string
+	atoi := func(s, flag string) (int, error) {
+		n, err := strconv.Atoi(s)
+		if err != nil {
+			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
+		}
+		return n, nil
+	}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "-h" || a == "--help":
+			o.help = true
+		case a == "--shared-context":
+			o.sharedCtx = true
+		case a == "--keep-open":
+			o.keepOpen = true
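+		case a == "--url":
+			// A trailing --url with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--url expects a value")
+			}
+			o.url = args[i+1]
+			i++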
+		case strings.HasPrefix(a, "--url="):
+			o.url = strings.TrimPrefix(a, "--url=")
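+		case a == "--port":
+			// A trailing --port with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--port expects a value")
+			}
+			n, err := atoi(args[i+1], "--port")
+			if err != nil {
+				return o, err
+			}
+			o.port = n
+			i++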
+		case strings.HasPrefix(a, "--port="):
+			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
+			if err != nil {
+				return o, err
+			}
+			o.port = n
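+		case a == "--timeout":
+			// A trailing --timeout with no value is an error, not a silent no-op.
+			if i+1 >= len(args) {
+				return o, fmt.Errorf("--timeout expects a value")
+			}
+			n, err := atoi(args[i+1], "--timeout")
+			if err != nil {
+				return o, err
+			}
+			o.timeout = n
+			i++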
+		case strings.HasPrefix(a, "--timeout="):
+			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
+			if err != nil {
+				return o, err
+			}
+			o.timeout = n
+		case strings.HasPrefix(a, "-"):
+			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
+		default:
+			positionals = append(positionals, a)
+		}
+	}
+	if o.help {
+		return o, nil
+	}
+	switch mode {
+	case "run":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
+		}
+		o.script = positionals[0]
+	case "open":
+		if len(positionals) == 0 {
+			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
+		}
+		o.url = positionals[0]
+	}
+	return o, nil
+}
+
+// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
+// a real (non-headless) Chrome — the entire reason chrome-service exists.
+func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
+	var v struct {
+		Browser   string `json:"Browser"`
+		UserAgent string `json:"User-Agent"`
+	}
+	if e := json.Unmarshal(jsonBody, &v); e != nil {
+		return "", false, fmt.Errorf("parse /json/version: %w", e)
+	}
+	if v.Browser == "" {
+		return "", false, fmt.Errorf("/json/version had no Browser field")
+	}
+	healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
+		!strings.Contains(v.Browser, "Headless") &&
+		!strings.Contains(v.UserAgent, "Headless")
+	return v.Browser, healthy, nil
+}
+
+// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
+// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
+// NetworkPolicy that gates in-cluster callers.
+func buildPortForwardArgs(localPort int) []string {
+	return []string{"-n", chromeServiceNamespace, "port-forward",
+		"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
+}
+
+// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
+// client kept under the user cache dir.
+func browserClientPackageJSON() string {
+	return fmt.Sprintf(`{
+  "name": "homelab-browser-client",
+  "private": true,
+  "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
+  "dependencies": {
+    "playwright-core": "%s"
+  }
+}
+`, playwrightVersion)
+}
+
+// freePort asks the kernel for an unused ephemeral TCP port.
+func freePort() (int, error) {
+	l, err := net.Listen("tcp", "127.0.0.1:0")
+	if err != nil {
+		return 0, err
+	}
+	defer l.Close()
+	return l.Addr().(*net.TCPAddr).Port, nil
+}
+
+// browserClientDir is where the pinned node client + managed runner files live.
+func browserClientDir() (string, error) {
+	cache, err := os.UserCacheDir()
+	if err != nil || cache == "" {
+		home, herr := os.UserHomeDir()
+		if herr != nil {
+			return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
+		}
+		cache = filepath.Join(home, ".cache")
+	}
+	return filepath.Join(cache, "homelab", "browser-client"), nil
+}
+
+// installedPlaywrightVersion reads the version of the playwright-core already
+// installed in dir, or "" if absent/unreadable.
+func installedPlaywrightVersion(dir string) string {
+	b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
+	if err != nil {
+		return ""
+	}
+	var v struct {
+		Version string `json:"version"`
+	}
+	if json.Unmarshal(b, &v) != nil {
+		return ""
+	}
+	return v.Version
+}
+
+// ensureBrowserClient writes the managed runner/stealth/package files into dir
+// and lazily installs the pinned playwright-core (only when missing/mismatched),
+// so no per-user setup is needed and the client tracks the binary version.
+func ensureBrowserClient(dir string) error {
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		return err
+	}
+	files := map[string]string{
+		"package.json":      browserClientPackageJSON(),
+		"browser_runner.js": runnerJS,
+		"stealth.js":        stealthJS,
+	}
+	for name, content := range files {
+		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
+			return err
+		}
+	}
+	if installedPlaywrightVersion(dir) == playwrightVersion {
+		return nil
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
+	cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
+	cmd.Dir = dir
+	cmd.Stdout = os.Stderr
+	cmd.Stderr = os.Stderr
+	if err := cmd.Run(); err != nil {
+		return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
+	}
+	if got := installedPlaywrightVersion(dir); got != playwrightVersion {
+		return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
+	}
+	return nil
+}
+
+// waitForCDP polls the local CDP endpoint until it answers as a healthy
+// (non-headless) Chrome, or the timeout elapses.
+func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
+	deadline := time.Now().Add(timeout)
+	client := &http.Client{Timeout: 3 * time.Second}
+	var lastErr error
+	for time.Now().Before(deadline) {
+		resp, err := client.Get(cdpURL + "/json/version")
+		if err != nil {
+			lastErr = err
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		body, _ := io.ReadAll(resp.Body)
+		resp.Body.Close()
+		browser, healthy, herr := cdpHealthy(body)
+		if herr != nil {
+			lastErr = herr
+			time.Sleep(300 * time.Millisecond)
+			continue
+		}
+		if !healthy {
+			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
+		}
+		return browser, nil
+	}
+	if lastErr == nil {
+		lastErr = fmt.Errorf("timed out after %s", timeout)
+	}
+	return "", lastErr
+}
+
+// runBrowser is the orchestration: pick a port, ensure the pinned client, start
+// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
+func runBrowser(o browserOpts) error {
+	port := o.port
+	if port == 0 {
+		p, err := freePort()
+		if err != nil {
+			return fmt.Errorf("pick local port: %w", err)
+		}
+		port = p
+	}
+
+	dir, err := browserClientDir()
+	if err != nil {
+		return err
+	}
+	if err := ensureBrowserClient(dir); err != nil {
+		return err
+	}
+
+	// Start the forward in its own process group so the whole tree dies on cleanup.
+	pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
+	pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
+	var pfLog strings.Builder
+	pf.Stdout = &pfLog
+	pf.Stderr = &pfLog
+	if err := pf.Start(); err != nil {
+		return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
+	}
+
+	var once sync.Once
+	teardown := func() {
+		once.Do(func() {
+			if pf.Process != nil {
+				_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
+			}
+			_ = pf.Wait()
+		})
+	}
+	defer teardown()
+
+	// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
+	sigCh := make(chan os.Signal, 1)
+	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
+	defer signal.Stop(sigCh)
+	go func() {
+		if _, ok := <-sigCh; ok {
+			teardown()
+			os.Exit(130)
+		}
+	}()
+
+	cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
+	browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
+	if err != nil {
+		return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
+	}
+	fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
+
+	return runBrowserNode(dir, cdpURL, o)
+}
+
+// runBrowserNode invokes the managed node runner with inputs passed via env.
+func runBrowserNode(dir, cdpURL string, o browserOpts) error {
+	env := append(os.Environ(),
+		"HOMELAB_CDP_URL="+cdpURL,
+		"HOMELAB_BROWSER_MODE="+o.mode,
+		"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
+		"NODE_PATH="+filepath.Join(dir, "node_modules"),
+	)
+	if o.url != "" {
+		env = append(env, "HOMELAB_BROWSER_URL="+o.url)
+	}
+	if o.script != "" {
+		abs, err := filepath.Abs(o.script)
+		if err != nil {
+			return err
+		}
+		if _, err := os.Stat(abs); err != nil {
+			return fmt.Errorf("script %s: %w", o.script, err)
+		}
+		env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
+	}
+	if o.sharedCtx {
+		env = append(env, "HOMELAB_BROWSER_SHARED=1")
+	}
+	if o.keepOpen {
+		env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
+	}
+	if o.mode == "open" {
+		shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
+		env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
+	}
+	cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
+	cmd.Env = env
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
--- a/cli/browser_runner.js
+++ b/cli/browser_runner.js
@ -0,0 +1,106 @@
+// homelab browser — node CDP runner (auto-managed; regenerated each run from the
+// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
+// chrome-service CDP endpoint, installs the stealth init script, then runs the
+// user's Playwright script (run mode) or opens a URL (open mode). All inputs
+// arrive via HOMELAB_* env vars set by the Go CLI.
+'use strict';
+const fs = require('fs');
+const { chromium } = require('playwright-core');
+
+async function main() {
+  const cdpURL = process.env.HOMELAB_CDP_URL;
+  if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
+  const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
+  const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
+  const initURL = process.env.HOMELAB_BROWSER_URL || '';
+  const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
+  const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
+  const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
+  const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
+
+  const browser = await chromium.connectOverCDP(cdpURL);
+
+  // Fresh isolated context by default (safe for the shared browser + concurrent
+  // callers); --shared-context reuses the warmed persistent profile.
+  let context;
+  let createdContext = false;
+  if (shared) {
+    const existing = browser.contexts();
+    if (existing.length) {
+      context = existing[0];
+    } else {
+      context = await browser.newContext();
+      createdContext = true;
+    }
+  } else {
+    context = await browser.newContext();
+    createdContext = true;
+  }
+
+  if (stealthPath) {
+    const stealth = fs.readFileSync(stealthPath, 'utf8');
+    if (stealth.trim()) await context.addInitScript(stealth);
+  }
+
+  const page = await context.newPage();
+  const log = (...a) => console.error('[browser]', ...a);
+
+  let exitCode = 0;
+  try {
+    if (initURL) {
+      await page.goto(initURL, { waitUntil: 'domcontentloaded' });
+    }
+    if (mode === 'open') {
+      console.log('url:    ' + page.url());
+      console.log('title:  ' + (await page.title()));
+      const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
+      console.log('--- visible text (truncated to 4000 chars) ---');
+      console.log(text.slice(0, 4000));
+      if (screenshotPath) {
+        await page.screenshot({ path: screenshotPath, fullPage: true });
+        console.log('screenshot: ' + screenshotPath);
+      }
+    } else {
+      if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
+      const src = fs.readFileSync(scriptPath, 'utf8');
+      // Run the user's source with page/context/browser/log in lexical scope.
+      // AsyncFunction body permits top-level await.
+      const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
+      const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
+      const result = await fn(page, context, browser, log);
+      if (result !== undefined) {
+        let out;
+        try {
+          out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
+        } catch (_) {
+          out = String(result);
+        }
+        console.log(out);
+      }
+    }
+  } catch (e) {
+    console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
+    exitCode = 1;
+  } finally {
+    if (!keepOpen) {
+      try {
+        // Close only what we created; never tear down the shared persistent context.
+        if (createdContext) {
+          await context.close();
+        } else {
+          await page.close();
+        }
+      } catch (_) { /* ignore */ }
+    }
+    // Disconnect from the CDP endpoint; this does NOT kill the remote browser.
+    try {
+      await browser.close();
+    } catch (_) { /* ignore */ }
+  }
+  process.exit(exitCode);
+}
+
+main().catch((e) => {
+  console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
+  process.exit(1);
+});
--- a/cli/browser_stealth.js
+++ b/cli/browser_stealth.js
@ -0,0 +1,54 @@
+// Minimal stealth init script for Playwright-driven Chromium.
+// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
+//   webdriver, chrome.runtime, navigator.plugins, navigator.languages,
+//   Permissions.query, WebGL getParameter (vendor + renderer spoof).
+// Run via context.add_init_script() so it executes before any page script.
+(() => {
+  // navigator.webdriver — most common detection, removed entirely.
+  Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
+
+  // window.chrome.runtime — many sites check that real Chrome exposes this.
+  if (!window.chrome) window.chrome = {};
+  window.chrome.runtime = window.chrome.runtime || {};
+
+  // navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
+  Object.defineProperty(navigator, 'plugins', {
+    get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
+  });
+
+  // navigator.languages — headless returns empty array.
+  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
+
+  // Permissions.query — headless returns 'denied' for notifications instead of 'default'.
+  const origQuery = window.navigator.permissions && window.navigator.permissions.query;
+  if (origQuery) {
+    window.navigator.permissions.query = (parameters) =>
+      parameters && parameters.name === 'notifications'
+        ? Promise.resolve({ state: Notification.permission })
+        : origQuery(parameters);
+  }
+
+  // WebGL getParameter — spoof vendor + renderer strings to a real GPU.
+  const spoofGl = (proto) => {
+    if (!proto) return;
+    const orig = proto.getParameter;
+    proto.getParameter = function (parameter) {
+      if (parameter === 37445) return 'Intel Inc.';                   // UNMASKED_VENDOR_WEBGL
+      if (parameter === 37446) return 'Intel Iris OpenGL Engine';     // UNMASKED_RENDERER_WEBGL
+      return orig.apply(this, arguments);
+    };
+  };
+  spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
+  spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
+
+  // disable-devtool.js (theajack/disable-devtool) auto-inits via a script
+  // tag with `disable-devtool-auto`. Its Performance detector trips under
+  // Playwright (CDP adds console.log latency vs console.table) and the
+  // redirect URL is hard-coded — for hmembeds that's google.com.
+  // Hide the auto-init marker so the library's IIFE exits early.
+  const origQS = Document.prototype.querySelector;
+  Document.prototype.querySelector = function (sel) {
+    if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
+    return origQS.apply(this, arguments);
+  };
+})();
--- a/cli/cmd_browser.go
+++ b/cli/cmd_browser.go
@ -0,0 +1,117 @@
+package main
+
+import "fmt"
+
+// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
+// from outside the cluster, for sites that detect/block headless automation.
+// The headless @playwright/mcp browser can load such sites but their gated
+// actions (submit/login) silently fail; this path submits first try. Mechanics
+// only — the agent supplies the Playwright script. See docs/adr/0013.
+
+func browserCommands() []Command {
+	return []Command{
+		{Path: []string{"browser"}, Tier: TierRead,
+			Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
+		{Path: []string{"browser", "run"}, Tier: TierWrite,
+			Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
+		{Path: []string{"browser", "open"}, Tier: TierWrite,
+			Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
+	}
+}
+
+func browserTopHelp([]string) error {
+	fmt.Print(browserHelp())
+	return nil
+}
+
+func browserRun(args []string) error {
+	o, err := parseBrowserArgs("run", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+func browserOpen(args []string) error {
+	o, err := parseBrowserArgs("open", args)
+	if err != nil {
+		return err
+	}
+	if o.help {
+		fmt.Print(browserHelp())
+		return nil
+	}
+	return runBrowser(o)
+}
+
+// browserHelp carries the discoverability payload: WHEN to reach for this, and
+// the diagnostic cheat-sheet that lets the agent self-correct instead of
+// retrying a deterministic form blind (the failure mode that motivated this).
+func browserHelp() string {
+	return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP
+
+The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
+Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
+injects the same stealth.js the in-cluster callers use, and runs your script.
+
+USAGE
+  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
+  homelab browser open <url> [--shared-context] [--timeout S]
+
+WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser
+  Default to the Playwright MCP / headless browser for ALL routine browsing and
+  automation — it's interactive (snapshot per step), fast to start, isolated.
+  Reach for THIS command ONLY when headless is demonstrably blocked: a site
+  LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins
+  forever, or ONE request errors while its siblings 200. That is the signature
+  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
+  disable-devtool traps). It presents as a real Chrome and usually succeeds
+  first try — but it's the shared cluster browser (slower startup, one batch
+  run, no per-step feedback), so it's the escalation path, never the default.
+
+ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
+  ERR_FILE_NOT_FOUND (-6)   request intercepted/resolved locally by the
+                            automation layer — NOT a network/egress problem.
+                            (This is what silently broke the headless submit.)
+  ERR_CONNECTION_REFUSED /  real egress failure (DNS/route/firewall). These also
+  ERR_TIMED_OUT /           break the initial page load — if the page loaded,
+  ERR_NAME_NOT_RESOLVED     egress is fine and the cause is elsewhere.
+  one endpoint 500s while   server-side bot rejection of the automation, not
+  its siblings 200          your payload.
+
+HABITS
+  - Inspect the network panel BEFORE retrying a deterministic form; a blind
+    retry just repeats the same silent failure.
+  - Don't park a half-filled multi-step form across a user pause — the session
+    can expire; re-run the whole flow from this command in one shot.
+  - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging
+    of $HOME needed; just point setInputFiles at a local path.
+
+CONTEXT
+  Default: a FRESH incognito context, closed on exit — safe for the shared
+  browser and concurrent callers (e.g. tripit). Your script does its own login.
+  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
+  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
+
+SCRIPT CONTRACT (run mode)
+  Your file's body runs with page, context, browser and log() already in scope
+  (top-level await allowed). Return a value to print it. Example flow.js:
+
+    await page.goto('https://portal.example.com/login');
+    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
+    await page.click('button[type=submit]');
+    await page.waitForURL('**/dashboard');
+    return 'logged in: ' + page.url();
+
+  Run it:  homelab browser run flow.js
+
+NOTES
+  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
+    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
+  - The port-forward is always torn down, on success and on error.
+`
+}
--- a/cli/cmd_browser_test.go
+++ b/cli/cmd_browser_test.go
@ -0,0 +1,172 @@
+package main
+
+import (
+	"os"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestParseBrowserArgsRun(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{
+		"flow.js", "--url", "https://example.com", "--shared-context",
+		"--port", "19999", "--timeout", "45", "--keep-open",
+	})
+	if err != nil {
+		t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
+	}
+	want := browserOpts{
+		mode: "run", script: "flow.js", url: "https://example.com",
+		sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
+	}
+}
+
+func TestParseBrowserArgsRunDefaults(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
+		t.Fatalf("defaults wrong: %+v", got)
+	}
+	if got.timeout != defaultBrowserTimeout {
+		t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
+	}
+}
+
+func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
+	if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
+		t.Fatalf("run without a script path should error")
+	}
+}
+
+func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
+	got, err := parseBrowserArgs("open", []string{"https://example.com"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://example.com" || got.mode != "open" {
+		t.Fatalf("open parse wrong: %+v", got)
+	}
+	if _, err := parseBrowserArgs("open", []string{}); err == nil {
+		t.Fatalf("open without a URL should error")
+	}
+}
+
+func TestParseBrowserArgsHelp(t *testing.T) {
+	for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
+		got, err := parseBrowserArgs("run", a)
+		if err != nil {
+			t.Fatalf("help parse %v: %v", a, err)
+		}
+		if !got.help {
+			t.Fatalf("args %v should set help", a)
+		}
+	}
+}
+
+func TestParseBrowserArgsEqualsForm(t *testing.T) {
+	got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
+		t.Fatalf("--flag=value form not parsed: %+v", got)
+	}
+}
+
+func TestCDPHealthy(t *testing.T) {
+	real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
+	browser, ok, err := cdpHealthy(real)
+	if err != nil || !ok {
+		t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
+	}
+	if !strings.HasPrefix(browser, "Chrome/") {
+		t.Fatalf("browser = %q, want Chrome/ prefix", browser)
+	}
+
+	headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
+	if _, ok, _ := cdpHealthy(headless); ok {
+		t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
+	}
+
+	if _, _, err := cdpHealthy([]byte("not json")); err == nil {
+		t.Fatalf("malformed /json/version body should error")
+	}
+}
+
+func TestBuildPortForwardArgs(t *testing.T) {
+	got := buildPortForwardArgs(18080)
+	want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
+	}
+}
+
+func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
+	pj := browserClientPackageJSON()
+	if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
+		t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
+	}
+}
+
+func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
+	// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
+	// client minor MUST match (protocol changes between minors).
+	if !strings.HasPrefix(playwrightVersion, "1.48.") {
+		t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
+	}
+}
+
+func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
+	h := browserHelp()
+	for _, want := range []string{
+		"homelab browser run",
+		"ERR_FILE_NOT_FOUND",
+		"ERR_CONNECTION_REFUSED",
+		"network panel",
+		"headless",
+		"--shared-context",
+	} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
+		}
+	}
+}
+
+func TestBrowserHelpIsTiered(t *testing.T) {
+	// --help must frame this as the ESCALATION path (default to headless first),
+	// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
+	// instructions. Guard against a regression to "co-equal choice" wording.
+	h := browserHelp()
+	for _, want := range []string{"Default to the", "escalation"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
+		}
+	}
+}
+
+func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
+	// The embedded copy must never drift from the source of truth that the
+	// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
+	canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
+	if err != nil {
+		t.Fatalf("read canonical stealth.js: %v", err)
+	}
+	if stealthJS != string(canonical) {
+		t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
+	}
+}
+
+func TestFreePortReturnsUsablePort(t *testing.T) {
+	p, err := freePort()
+	if err != nil {
+		t.Fatalf("freePort: %v", err)
+	}
+	if p <= 1024 || p > 65535 {
+		t.Fatalf("freePort returned %d, want an ephemeral port", p)
+	}
+}
--- a/cli/cmd_ci.go
+++ b/cli/cmd_ci.go
@ -0,0 +1,99 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+func ciCommands() []Command {
+	return []Command{
+		{Path: []string{"ci", "status"}, Tier: TierRead,
+			Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
+		{Path: []string{"ci", "watch"}, Tier: TierRead,
+			Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
+	}
+}
+
+func short(s string) string {
+	if len(s) > 8 {
+		return s[:8]
+	}
+	return s
+}
+
+func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
+
+// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
+func currentHEAD() string {
+	cwd, _ := os.Getwd()
+	root, err := gitRepoRoot(cwd)
+	if err != nil {
+		return ""
+	}
+	sha, _ := gitOutput(root, "rev-parse", "HEAD")
+	return sha
+}
+
+func ciStatus(args []string) error {
+	commit, _ := firstPositional(args)
+	c, err := newWPClient()
+	if err != nil {
+		return err
+	}
+	id, err := c.repoID()
+	if err != nil {
+		return err
+	}
+	p, err := c.findPipeline(id, commit)
+	if err != nil {
+		return err
+	}
+	fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
+	return nil
+}
+
+func ciWatch(args []string) error {
+	commit, _ := firstPositional(args)
+	if commit == "" {
+		commit = currentHEAD()
+	}
+	if commit == "" {
+		return fmt.Errorf("no commit given and not in a git repo")
+	}
+	c, err := newWPClient()
+	if err != nil {
+		return err
+	}
+	id, err := c.repoID()
+	if err != nil {
+		return err
+	}
+	timeout := 20 * time.Minute
+	deadline := time.Now().Add(timeout)
+	last := ""
+	for time.Now().Before(deadline) {
+		p, err := c.findPipeline(id, commit)
+		if err != nil {
+			if last != "waiting" {
+				fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
+				last = "waiting"
+			}
+		} else {
+			if p.Status != last {
+				fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
+				last = p.Status
+			}
+			if isTerminalStatus(p.Status) {
+				fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
+				if isFailureStatus(p.Status) {
+					return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
+				}
+				return nil
+			}
+		}
+		time.Sleep(15 * time.Second)
+	}
+	return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
+}
--- a/cli/cmd_claim.go
+++ b/cli/cmd_claim.go
@ -0,0 +1,56 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+func claimCommands() []Command {
+	return []Command{
+		{Path: []string{"claim"}, Tier: TierWrite,
+			Summary: "claim a shared infra resource on the presence board",
+			Run:     runClaim},
+		{Path: []string{"release"}, Tier: TierWrite,
+			Summary: "release a presence claim",
+			Run:     runRelease},
+	}
+}
+
+// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
+// script takes the label first, so we can't rely on Go's flag package which
+// stops at the first positional).
+func runClaim(args []string) error {
+	var label, purpose string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--purpose" || a == "-purpose":
+			if i+1 < len(args) {
+				purpose = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--purpose="):
+			purpose = strings.TrimPrefix(a, "--purpose=")
+		case !strings.HasPrefix(a, "-") && label == "":
+			label = a
+		}
+	}
+	if label == "" {
+		return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
+	}
+	return presenceClaim(label, purpose)
+}
+
+func runRelease(args []string) error {
+	var label string
+	for _, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			label = a
+			break
+		}
+	}
+	if label == "" {
+		return fmt.Errorf("usage: homelab release <kind>:<name>")
+	}
+	return presenceRelease(label)
+}
--- a/cli/cmd_deploy.go
+++ b/cli/cmd_deploy.go
@ -0,0 +1,51 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+func deployCommands() []Command {
+	return []Command{
+		{Path: []string{"deploy", "wait"}, Tier: TierRead,
+			Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
+	}
+}
+
+// deployWait closes the "did the NEW code land" gap: rollout status alone returns
+// success on the OLD ReplicaSet, so we first wait for the deployment image to
+// reference the expected sha, THEN block on rollout status.
+func deployWait(args []string) error {
+	target, _ := firstPositional(args)
+	if target == "" || !strings.Contains(target, "/") {
+		return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
+	}
+	parts := strings.SplitN(target, "/", 2)
+	ns, deploy := parts[0], parts[1]
+
+	sha := flagValue(args, "--sha")
+	if sha == "" {
+		sha = short(currentHEAD())
+	}
+	deadline := time.Now().Add(10 * time.Minute)
+
+	if sha != "" {
+		fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
+		matched := false
+		for time.Now().Before(deadline) {
+			img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
+			if strings.Contains(img, sha) {
+				matched = true
+				break
+			}
+			time.Sleep(10 * time.Second)
+		}
+		if !matched {
+			return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
+		}
+	}
+	fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
+	return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
+}
--- a/cli/cmd_ha.go
+++ b/cli/cmd_ha.go
@ -0,0 +1,172 @@
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
+// the long-lived API token out of the cluster, and SSH to the HA host for
+// host-level work (config files, docker, add-ons). Entity state/control stays
+// with the MCP — see docs/adr/0012.
+//
+// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
+// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
+// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
+// `ha token` resolves it on demand via the ambient kubeconfig, so it never
+// depends on a pre-set env var (the gap that made agents re-derive the
+// kubectl|base64|jq pipeline every session).
+
+type haInstance struct {
+	name      string // sofia | london
+	sshUser   string // SSH login on the HA host
+	sshHost   string // host reachable from the devvm (Sofia LAN)
+	secretKey string // key inside the openclaw/ha-tokens Secret holding this token
+}
+
+const (
+	haDefaultInstance = "sofia"
+	haSecretNamespace = "openclaw"
+	haSecretName      = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
+)
+
+// haInstances maps instance name → connection/secret facts. sofia is the default
+// because the devvm is on the Sofia LAN; london is documented but its host
+// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
+// generally won't connect from here (token resolution still works).
+var haInstances = map[string]haInstance{
+	"sofia":  {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
+	"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
+}
+
+func haCommands() []Command {
+	return []Command{
+		{Path: []string{"ha", "token"}, Tier: TierRead,
+			Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
+		{Path: []string{"ha", "ssh"}, Tier: TierWrite,
+			Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
+	}
+}
+
+// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
+func resolveHAInstance(name string) (haInstance, error) {
+	if name == "" {
+		name = haDefaultInstance
+	}
+	inst, ok := haInstances[name]
+	if !ok {
+		return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
+	}
+	return inst, nil
+}
+
+// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
+// by kubectl jsonpath (trailing whitespace tolerated).
+func decodeSecretValue(b64 string) (string, error) {
+	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
+	if err != nil {
+		return "", fmt.Errorf("base64-decode secret value: %w", err)
+	}
+	return string(raw), nil
+}
+
+func haToken(args []string) error {
+	name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
+	for i := 0; i < len(args); i++ {
+		if args[i] == "--instance" && i+1 < len(args) {
+			name = args[i+1]
+		} else if strings.HasPrefix(args[i], "--instance=") {
+			name = strings.TrimPrefix(args[i], "--instance=")
+		}
+	}
+	inst, err := resolveHAInstance(name)
+	if err != nil {
+		return err
+	}
+	b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
+		"-o", "jsonpath={.data."+inst.secretKey+"}")
+	if err != nil {
+		return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
+	}
+	if b64 == "" {
+		return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
+	}
+	tok, err := decodeSecretValue(b64)
+	if err != nil {
+		return err
+	}
+	fmt.Println(tok)
+	return nil
+}
+
+// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
+// rather than tied to whoever first wrote the workflow.
+func defaultHAKeyPath() string {
+	if home, err := os.UserHomeDir(); err == nil && home != "" {
+		return filepath.Join(home, ".ssh", "id_ed25519")
+	}
+	return filepath.Join("~", ".ssh", "id_ed25519")
+}
+
+// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
+// `--` are taken verbatim; bare tokens before it are also the remote command.
+func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
+	name := haDefaultInstance
+	keyPath = defaultHAKeyPath()
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--":
+			remote = append(remote, args[i+1:]...)
+			i = len(args)
+		case a == "--instance":
+			if i+1 < len(args) {
+				name = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--instance="):
+			name = strings.TrimPrefix(a, "--instance=")
+		case a == "--key" || a == "-i":
+			if i+1 < len(args) {
+				keyPath = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--key="):
+			keyPath = strings.TrimPrefix(a, "--key=")
+		default:
+			remote = append(remote, a)
+		}
+	}
+	inst, err = resolveHAInstance(name)
+	return inst, keyPath, remote, err
+}
+
+// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
+// key, no user ssh config, and no known_hosts prompt/record — so it runs
+// unattended in an agent session without hanging on a host-key prompt.
+func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
+	args := []string{
+		"-F", "/dev/null",
+		"-o", "IdentityFile=" + keyPath,
+		"-o", "StrictHostKeyChecking=no",
+		"-o", "UserKnownHostsFile=/dev/null",
+		"-o", "ConnectTimeout=10",
+		"-o", "BatchMode=yes",
+		inst.sshUser + "@" + inst.sshHost,
+	}
+	return append(args, remote...)
+}
+
+func haSSH(args []string) error {
+	inst, keyPath, remote, err := parseHASSH(args)
+	if err != nil {
+		return err
+	}
+	if len(remote) == 0 {
+		return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
+	}
+	return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
+}
--- a/cli/cmd_ha_test.go
+++ b/cli/cmd_ha_test.go
@ -0,0 +1,92 @@
+package main
+
+import (
+	"encoding/base64"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestResolveHAInstance(t *testing.T) {
+	// empty defaults to sofia (the devvm sits on the Sofia LAN)
+	if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
+		t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
+	}
+	if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
+		t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
+	}
+	if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
+		t.Fatalf("london = %+v, %v", got, err)
+	}
+	if _, err := resolveHAInstance("paris"); err == nil {
+		t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
+	}
+}
+
+func TestDecodeSecretValue(t *testing.T) {
+	// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
+	// returns that base64, which decodeSecretValue turns back into the raw token.
+	enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
+	if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
+		t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
+	}
+	// trailing whitespace/newline from jsonpath output must be tolerated
+	if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
+		t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
+	}
+	if _, err := decodeSecretValue("not-base64!!"); err == nil {
+		t.Fatalf("decodeSecretValue should error on undecodable base64")
+	}
+}
+
+func TestBuildHASSHArgs(t *testing.T) {
+	inst, _ := resolveHAInstance("sofia")
+	got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
+	want := []string{
+		"-F", "/dev/null",
+		"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
+		"-o", "StrictHostKeyChecking=no",
+		"-o", "UserKnownHostsFile=/dev/null",
+		"-o", "ConnectTimeout=10",
+		"-o", "BatchMode=yes",
+		"vbarzin@192.168.1.8",
+		"cat", "/config/configuration.yaml",
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
+	}
+}
+
+func TestParseHASSH(t *testing.T) {
+	// instance flag + everything after `--` is the verbatim remote command
+	inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
+	if err != nil {
+		t.Fatalf("parseHASSH err: %v", err)
+	}
+	if inst.name != "sofia" {
+		t.Errorf("instance = %q, want sofia", inst.name)
+	}
+	if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
+		t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
+	}
+	if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
+		t.Errorf("remote = %v, want [docker ps -a]", remote)
+	}
+
+	// bare args (no `--`) are also taken as the remote command; -i overrides the key
+	_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
+	if err != nil {
+		t.Fatalf("parseHASSH err: %v", err)
+	}
+	if key2 != "/tmp/k" {
+		t.Errorf("key = %q, want /tmp/k", key2)
+	}
+	if !reflect.DeepEqual(remote2, []string{"uptime"}) {
+		t.Errorf("remote = %v, want [uptime]", remote2)
+	}
+
+	// unknown instance surfaces as an error
+	if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
+		t.Errorf("parseHASSH should error on unknown instance")
+	}
+}
--- a/cli/cmd_k8s.go
+++ b/cli/cmd_k8s.go
@ -0,0 +1,288 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"strings"
+)
+
+func k8sCommands() []Command {
+	return []Command{
+		{Path: []string{"k8s", "status"}, Tier: TierRead,
+			Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
+		{Path: []string{"k8s", "get"}, Tier: TierRead,
+			Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
+		{Path: []string{"k8s", "logs"}, Tier: TierRead,
+			Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
+		{Path: []string{"k8s", "describe"}, Tier: TierRead,
+			Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
+		{Path: []string{"k8s", "debug"}, Tier: TierRead,
+			Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
+		{Path: []string{"k8s", "pf"}, Tier: TierRead,
+			Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
+		{Path: []string{"k8s", "db"}, Tier: TierWrite,
+			Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
+		{Path: []string{"k8s", "exec"}, Tier: TierWrite,
+			Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
+		{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
+			Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
+		{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
+			Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
+		{Path: []string{"k8s", "restart"}, Tier: TierWrite,
+			Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
+		{Path: []string{"k8s", "probe"}, Tier: TierRead,
+			Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
+	}
+}
+
+func k8sStatus(args []string) error {
+	t := parseK8sTarget(args)
+	ns := t.namespace() // "" when no app/ns given → cluster-wide
+	get := []string{"get", "pods", "-o", "wide"}
+	ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
+	if ns == "" {
+		get = append(get, "-A")
+		ev = append(ev, "-A")
+	}
+	if err := kubectlStream(ns, get...); err != nil {
+		return err
+	}
+	fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
+	_ = kubectlStream(ns, ev...) // best-effort
+	return nil
+}
+
+func k8sGet(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" || len(t.rest) == 0 {
+		return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
+	}
+	return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
+}
+
+func k8sLogs(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
+	}
+	a := []string{"logs"}
+	if t.selector != "" {
+		a = append(a, "-l", t.selector)
+	} else {
+		a = append(a, t.objectRef())
+	}
+	if t.container != "" {
+		a = append(a, "-c", t.container)
+	}
+	if !containsPrefix(t.rest, "--tail") {
+		a = append(a, "--tail=200")
+	}
+	a = append(a, t.rest...)
+	return kubectlStream(t.namespace(), a...)
+}
+
+func k8sDescribe(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
+	}
+	if len(t.rest) > 0 {
+		return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
+	}
+	return kubectlStream(t.namespace(), "describe", t.objectRef())
+}
+
+func k8sDebug(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s debug <app>")
+	}
+	ns := t.namespace()
+	sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
+	sec("pods")
+	_ = kubectlStream(ns, "get", "pods", "-o", "wide")
+	sec("workloads")
+	_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
+	sec("describe "+t.objectRef())
+	_ = kubectlStream(ns, "describe", t.objectRef())
+	sec("recent logs (--tail=50)")
+	_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
+	sec("events (type!=Normal)")
+	_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
+	return nil
+}
+
+func k8sPortForward(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" || len(t.rest) == 0 {
+		return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
+	}
+	ports := t.rest[0]
+	target := "svc/" + t.app
+	if len(t.rest) > 1 {
+		target = t.rest[1]
+	}
+	return kubectlStream(t.namespace(), "port-forward", target, ports)
+}
+
+func k8sDB(args []string) error {
+	var app, dbName, sql string
+	mysql := false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		if a == "--" {
+			sql = strings.Join(args[i+1:], " ")
+			break
+		}
+		switch {
+		case a == "--mysql":
+			mysql = true
+		case a == "--db":
+			if i+1 < len(args) {
+				dbName = args[i+1]
+				i++
+			}
+		case strings.HasPrefix(a, "--db="):
+			dbName = strings.TrimPrefix(a, "--db=")
+		case !strings.HasPrefix(a, "-") && app == "":
+			app = a
+		}
+	}
+	if app == "" {
+		return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
+	}
+	p := planDBExec(app, dbName, sql, mysql)
+	pod := p.pod
+	if pod == "" && p.selector != "" {
+		resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
+		if err != nil || resolved == "" {
+			return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
+		}
+		pod = resolved
+	}
+	exec := []string{"exec"}
+	if sql == "" {
+		exec = append(exec, "-it") // interactive client when no SQL given
+	}
+	exec = append(exec, pod)
+	if p.container != "" {
+		exec = append(exec, "-c", p.container)
+	}
+	exec = append(exec, "--")
+	exec = append(exec, p.argv...)
+	return kubectlStream(p.ns, exec...)
+}
+
+func k8sExec(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
+	}
+	if len(t.rest) == 0 {
+		return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
+	}
+	a := []string{"exec"}
+	if t.tty {
+		a = append(a, "-it")
+	}
+	a = append(a, t.objectRef())
+	if t.container != "" {
+		a = append(a, "-c", t.container)
+	}
+	a = append(a, "--")
+	a = append(a, t.rest...)
+	return kubectlStream(t.namespace(), a...)
+}
+
+func k8sRmPod(args []string) error {
+	var pod, ns, grace string
+	force, job := false, false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "-n" || a == "--namespace":
+			if i+1 < len(args) {
+				ns = args[i+1]
+				i++
+			}
+		case a == "--force":
+			force = true
+		case a == "--job":
+			job = true
+		case a == "--grace":
+			if i+1 < len(args) {
+				grace = args[i+1]
+				i++
+			}
+		case !strings.HasPrefix(a, "-") && pod == "":
+			pod = a
+		}
+	}
+	if pod == "" || ns == "" {
+		return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
+	}
+	kind := "pod"
+	if job {
+		kind = "job"
+	}
+	a := []string{"delete", kind, pod}
+	if grace != "" {
+		a = append(a, "--grace-period="+grace)
+	}
+	if force {
+		a = append(a, "--force")
+	}
+	return kubectlStream(ns, a...)
+}
+
+func k8sRolloutStatus(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s rollout-status <app>")
+	}
+	return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
+}
+
+func k8sRestart(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s restart <app>")
+	}
+	ns := t.namespace()
+	if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
+		return err
+	}
+	return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
+}
+
+func k8sProbe(args []string) error {
+	t := parseK8sTarget(args)
+	if t.app == "" {
+		return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
+	}
+	ns := t.namespace()
+	url := "http://" + t.app + "." + ns + ".svc.cluster.local"
+	if port := flagValue(args, "--port"); port != "" {
+		url += ":" + port
+	}
+	if len(t.rest) > 0 {
+		p := t.rest[0]
+		if !strings.HasPrefix(p, "/") {
+			p = "/" + p
+		}
+		url += p
+	}
+	return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
+		"--image=curlimages/curl:latest", "--",
+		"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
+}
+
+// containsPrefix reports whether any arg starts with prefix.
+func containsPrefix(args []string, prefix string) bool {
+	for _, a := range args {
+		if strings.HasPrefix(a, prefix) {
+			return true
+		}
+	}
+	return false
+}
--- a/cli/cmd_memory.go
+++ b/cli/cmd_memory.go
@ -0,0 +1,302 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"strings"
+)
+
+func memoryCommands() []Command {
+	return []Command{
+		{Path: []string{"memory", "recall"}, Tier: TierRead,
+			Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
+		{Path: []string{"memory", "list"}, Tier: TierRead,
+			Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
+		{Path: []string{"memory", "categories"}, Tier: TierRead,
+			Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
+		{Path: []string{"memory", "tags"}, Tier: TierRead,
+			Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
+		{Path: []string{"memory", "stats"}, Tier: TierRead,
+			Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
+		{Path: []string{"memory", "secret"}, Tier: TierRead,
+			Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
+		{Path: []string{"memory", "store"}, Tier: TierWrite,
+			Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
+		{Path: []string{"memory", "update"}, Tier: TierWrite,
+			Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
+		{Path: []string{"memory", "delete"}, Tier: TierWrite,
+			Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
+	}
+}
+
+// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
+func printMemories(raw []byte, jsonOut bool) error {
+	if jsonOut {
+		fmt.Println(string(raw))
+		return nil
+	}
+	var r struct {
+		Memories []struct {
+			ID         int     `json:"id"`
+			Content    string  `json:"content"`
+			Category   string  `json:"category"`
+			Tags       string  `json:"tags"`
+			Importance float64 `json:"importance"`
+		} `json:"memories"`
+	}
+	if err := json.Unmarshal(raw, &r); err != nil {
+		fmt.Println(string(raw))
+		return nil
+	}
+	if len(r.Memories) == 0 {
+		fmt.Println("(no memories)")
+		return nil
+	}
+	for _, m := range r.Memories {
+		c := strings.ReplaceAll(m.Content, "\n", " ")
+		if rs := []rune(c); len(rs) > 240 { // cut on rune boundaries so multi-byte UTF-8 is never split
+			c = string(rs[:240]) + "…"
+		}
+		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
+		if m.Tags != "" {
+			fmt.Printf("       tags: %s\n", m.Tags)
+		}
+	}
+	return nil
+}
+
+func memoryRecall(args []string) error {
+	req := memRecallReq{}
+	jsonOut := false
+	var pos []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--query":
+			if i+1 < len(args) {
+				req.ExpandedQuery = args[i+1]
+				i++
+			}
+		case a == "--category":
+			if i+1 < len(args) {
+				req.Category = args[i+1]
+				i++
+			}
+		case a == "--sort":
+			if i+1 < len(args) {
+				req.SortBy = args[i+1]
+				i++
+			}
+		case a == "--limit":
+			if i+1 < len(args) {
+				fmt.Sscanf(args[i+1], "%d", &req.Limit)
+				i++
+			}
+		case a == "--json":
+			jsonOut = true
+		case !strings.HasPrefix(a, "-"):
+			pos = append(pos, a)
+		}
+	}
+	req.Context = strings.Join(pos, " ")
+	if req.Context == "" {
+		return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories/recall", req)
+	if err != nil {
+		return err
+	}
+	return printMemories(raw, jsonOut)
+}
+
+func memoryList(args []string) error {
+	q := url.Values{}
+	jsonOut := false
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--category":
+			if i+1 < len(args) {
+				q.Set("category", args[i+1])
+				i++
+			}
+		case a == "--tag":
+			if i+1 < len(args) {
+				q.Set("tag", args[i+1])
+				i++
+			}
+		case a == "--limit":
+			if i+1 < len(args) {
+				q.Set("limit", args[i+1])
+				i++
+			}
+		case a == "--json":
+			jsonOut = true
+		}
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	path := "/api/memories"
+	if len(q) > 0 {
+		path += "?" + q.Encode()
+	}
+	raw, err := c.do("GET", path, nil)
+	if err != nil {
+		return err
+	}
+	return printMemories(raw, jsonOut)
+}
+
+func memorySimpleGet(path string) func([]string) error {
+	return func(args []string) error {
+		c, err := newMemoryClient()
+		if err != nil {
+			return err
+		}
+		raw, err := c.do("GET", path, nil)
+		if err != nil {
+			return err
+		}
+		fmt.Println(string(raw))
+		return nil
+	}
+}
+
+func memorySecret(args []string) error {
+	id, _ := firstPositional(args)
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory secret <id>")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryStore(args []string) error {
+	req := memStoreReq{Category: "facts", Importance: 0.5}
+	var pos []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--category":
+			if i+1 < len(args) {
+				req.Category = args[i+1]
+				i++
+			}
+		case a == "--tags":
+			if i+1 < len(args) {
+				req.Tags = args[i+1]
+				i++
+			}
+		case a == "--keywords":
+			if i+1 < len(args) {
+				req.ExpandedKeywords = args[i+1]
+				i++
+			}
+		case a == "--importance":
+			if i+1 < len(args) {
+				fmt.Sscanf(args[i+1], "%f", &req.Importance)
+				i++
+			}
+		case a == "--sensitive":
+			req.ForceSensitive = true
+		case !strings.HasPrefix(a, "-"):
+			pos = append(pos, a)
+		}
+	}
+	req.Content = strings.Join(pos, " ")
+	if req.Content == "" {
+		return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("POST", "/api/memories", req)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryUpdate(args []string) error {
+	var id string
+	req := memUpdateReq{}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--content":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.Content = &v
+				i++
+			}
+		case a == "--tags":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.Tags = &v
+				i++
+			}
+		case a == "--keywords":
+			if i+1 < len(args) {
+				v := args[i+1]
+				req.ExpandedKeywords = &v
+				i++
+			}
+		case a == "--importance":
+			if i+1 < len(args) {
+				var f float64
+				fmt.Sscanf(args[i+1], "%f", &f)
+				req.Importance = &f
+				i++
+			}
+		case !strings.HasPrefix(a, "-") && id == "":
+			id = a
+		}
+	}
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
+
+func memoryDelete(args []string) error {
+	id, _ := firstPositional(args)
+	if id == "" {
+		return fmt.Errorf("usage: homelab memory delete <id>")
+	}
+	c, err := newMemoryClient()
+	if err != nil {
+		return err
+	}
+	raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
+	if err != nil {
+		return err
+	}
+	fmt.Println(string(raw))
+	return nil
+}
--- a/cli/cmd_net.go
+++ b/cli/cmd_net.go
@ -0,0 +1,83 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+	"time"
+)
+
+func netCommands() []Command {
+	return []Command{
+		{Path: []string{"net", "check"}, Tier: TierRead,
+			Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
+		{Path: []string{"dns", "lookup"}, Tier: TierRead,
+			Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
+	}
+}
+
+func fmtProbe(code int, d time.Duration, err error) string {
+	if err != nil {
+		return "ERR " + err.Error()
+	}
+	return fmt.Sprintf("HTTP %d  %dms", code, d.Milliseconds())
+}
+
+func netCheck(args []string) error {
+	host, rest := firstPositional(args)
+	if host == "" {
+		return fmt.Errorf("usage: homelab net check <host> [path]")
+	}
+	path := "/"
+	if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
+		path = rest[0]
+		if !strings.HasPrefix(path, "/") {
+			path = "/" + path
+		}
+	}
+	u := "https://" + host + path
+	fmt.Printf("%s\n", u)
+
+	// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
+	pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
+	if pubIP := firstLine(pubOut); pubIP != "" {
+		c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
+		fmt.Printf("  external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
+	} else {
+		fmt.Println("  external (public)            no public A record")
+	}
+	// internal leg: dial the Traefik LB directly
+	c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
+	fmt.Printf("  internal (LB %-15s)     %s\n", internalLBIP, fmtProbe(c, d, e))
+	return nil
+}
+
+func dnsLookup(args []string) error {
+	name, rest := firstPositional(args)
+	if name == "" {
+		return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
+	}
+	rr := ""
+	if len(rest) > 0 {
+		rr = rest[0]
+	}
+	tech, _ := dig(name, "10.0.20.201", rr)
+	pub, _ := dig(name, "1.1.1.1", rr)
+	fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
+	fmt.Printf("public     (1.1.1.1)    : %s\n", oneLineList(pub))
+	if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
+		fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
+	}
+	return nil
+}
+
+func hostOnly(h string) string { // strip any path accidentally included
+	return strings.SplitN(h, "/", 2)[0]
+}
+
+func oneLineList(s string) string {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		return "(none)"
+	}
+	return strings.ReplaceAll(s, "\n", ", ")
+}
--- a/cli/cmd_obs.go
+++ b/cli/cmd_obs.go
@ -0,0 +1,197 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+	"strings"
+	"time"
+)
+
+const (
+	promHost = "prometheus-query.viktorbarzin.lan"
+	lokiHost = "loki.viktorbarzin.lan"
+)
+
+func obsCommands() []Command {
+	return []Command{
+		{Path: []string{"metrics", "query"}, Tier: TierRead,
+			Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
+		{Path: []string{"metrics", "alerts"}, Tier: TierRead,
+			Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
+		{Path: []string{"logs", "query"}, Tier: TierRead,
+			Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
+	}
+}
+
+// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
+// passed as a single quoted argument; unquoted multi-token input is tolerated too).
+func queryArg(args []string, valueFlags map[string]bool) string {
+	var parts []string
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		if valueFlags[a] {
+			i++
+			continue
+		}
+		if strings.HasPrefix(a, "-") {
+			continue
+		}
+		parts = append(parts, a)
+	}
+	return strings.Join(parts, " ")
+}
+
+func labelStr(m map[string]string) string {
+	name := m["__name__"]
+	var kv []string
+	for k, v := range m {
+		if k != "__name__" {
+			kv = append(kv, k+"="+v)
+		}
+	}
+	sort.Strings(kv)
+	return name + "{" + strings.Join(kv, ",") + "}"
+}
+
+func metricsQuery(args []string) error {
+	q := queryArg(args, nil)
+	if q == "" {
+		return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
+	}
+	v := url.Values{}
+	v.Set("query", q)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no series)")
+		return nil
+	}
+	for _, s := range r.Data.Result {
+		val := ""
+		if len(s.Value) == 2 {
+			val = fmt.Sprint(s.Value[1])
+		}
+		fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
+	}
+	return nil
+}
+
+func metricsAlerts(args []string) error {
+	// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
+	// set is exposed as the synthetic ALERTS series, queryable the normal way.
+	v := url.Values{}
+	v.Set("query", `ALERTS{alertstate="firing"}`)
+	body, err := lbGetBody(promHost, "/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	if len(r.Data.Result) == 0 {
+		fmt.Println("(no firing alerts)")
+		return nil
+	}
+	for _, a := range r.Data.Result {
+		m := a.Metric
+		scope := ""
+		for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
+			if v := m[k]; v != "" {
+				scope = k + "=" + v
+				break
+			}
+		}
+		fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
+	}
+	return nil
+}
+
+func logsQuery(args []string) error {
+	q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
+	if q == "" {
+		return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
+	}
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "1h"
+	}
+	dur, err := time.ParseDuration(since)
+	if err != nil {
+		return fmt.Errorf("bad --since %q: %w", since, err)
+	}
+	limit := flagValue(args, "--limit")
+	if limit == "" {
+		limit = "100"
+	}
+	end := time.Now()
+	v := url.Values{}
+	v.Set("query", q)
+	v.Set("limit", limit)
+	v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
+	v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Values [][]string `json:"values"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	n := 0
+	for _, s := range r.Data.Result {
+		for _, val := range s.Values {
+			if len(val) == 2 {
+				fmt.Println(val[1])
+				n++
+			}
+		}
+	}
+	if n == 0 {
+		fmt.Println("(no log lines)")
+	}
+	return nil
+}
--- a/cli/cmd_tf.go
+++ b/cli/cmd_tf.go
@ -0,0 +1,122 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/signal"
+	"path/filepath"
+	"strings"
+	"sync"
+	"syscall"
+)
+
+func tfCommands() []Command {
+	return []Command{
+		{Path: []string{"tf", "plan"}, Tier: TierRead,
+			Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
+		{Path: []string{"tf", "validate"}, Tier: TierRead,
+			Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
+		{Path: []string{"tf", "fmt"}, Tier: TierRead,
+			Summary: "terraform fmt a stack's files", Run: tfFmt},
+		{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
+			Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
+		{Path: []string{"tf", "apply"}, Tier: TierWrite,
+			Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
+	}
+}
+
+// firstPositional returns the first non-flag arg and the remaining args with it removed.
+func firstPositional(args []string) (string, []string) {
+	for i, a := range args {
+		if !strings.HasPrefix(a, "-") {
+			rest := append(append([]string{}, args[:i]...), args[i+1:]...)
+			return a, rest
+		}
+	}
+	return "", args
+}
+
+// resolveTfStack finds the infra root (from cwd) and the stack directory named
+// by the first positional arg, returning the remaining args.
+func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
+	stackName, rest = firstPositional(args)
+	if stackName == "" {
+		err = fmt.Errorf("missing <stack> argument")
+		return
+	}
+	cwd, e := os.Getwd()
+	if e != nil {
+		err = e
+		return
+	}
+	infraRoot, err = findInfraRoot(cwd)
+	if err != nil {
+		return
+	}
+	stackDir, err = resolveStack(infraRoot, stackName)
+	return
+}
+
+func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
+
+// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
+func tfPassthrough(verb string) func([]string) error {
+	return func(args []string) error {
+		infraRoot, _, stackDir, rest, err := resolveTfStack(args)
+		if err != nil {
+			return err
+		}
+		return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
+	}
+}
+
+func tfFmt(args []string) error {
+	_, _, stackDir, _, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
+}
+
+func tfForceUnlock(args []string) error {
+	infraRoot, _, stackDir, rest, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	if len(rest) < 1 {
+		return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
+	}
+	return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
+}
+
+// tfApply applies a stack out-of-band: claim the stack on the presence board,
+// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
+// and warn that CI applies canonically on push.
+func tfApply(args []string) error {
+	infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
+	if err != nil {
+		return err
+	}
+	label := "stack:" + stackName
+	fmt.Fprintf(os.Stderr,
+		"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
+
+	if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
+		return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
+	}
+	// Release exactly once, whether we exit normally, on error, or on signal;
+	// sync.Once lets both the defer and the signal goroutine call it safely.
+	var once sync.Once
+	release := func() { once.Do(func() { _ = presenceRelease(label) }) }
+	defer release()
+
+	sig := make(chan os.Signal, 1)
+	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
+	go func() {
+		<-sig
+		release()
+		os.Exit(130)
+	}()
+
+	return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
+}
--- a/cli/cmd_tf_test.go
+++ b/cli/cmd_tf_test.go
@ -0,0 +1,27 @@
+package main
+
+import (
+	"reflect"
+	"testing"
+)
+
+func TestFirstPositional(t *testing.T) {
+	cases := []struct {
+		args     []string
+		wantName string
+		wantRest []string
+	}{
+		{[]string{"vault"}, "vault", []string{}},
+		{[]string{"--json", "vault"}, "vault", []string{"--json"}},
+		{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
+		{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
+		{[]string{"--only-flags"}, "", []string{"--only-flags"}},
+	}
+	for _, c := range cases {
+		gotName, gotRest := firstPositional(c.args)
+		if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
+			t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
+				c.args, gotName, gotRest, c.wantName, c.wantRest)
+		}
+	}
+}
--- a/cli/cmd_usage.go
+++ b/cli/cmd_usage.go
@ -0,0 +1,77 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/url"
+	"sort"
+	"strconv"
+)
+
+func usageCommands() []Command {
+	return []Command{
+		{Path: []string{"usage", "top"}, Tier: TierRead,
+			Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
+	}
+}
+
+// usageQuery builds the LogQL metric query that counts invocations per verb.
+// The optional user value is quoted so a `"` or `\` in --user cannot break out
+// of the label matcher.
+func usageQuery(since, user string) string {
+	sel := `job="` + usageJob + `"`
+	if user != "" {
+		sel += `, user=` + strconv.Quote(user)
+	}
+	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
+}
+
+func usageTop(args []string) error {
+	since := flagValue(args, "--since")
+	if since == "" {
+		since = "30d"
+	}
+	v := url.Values{}
+	v.Set("query", usageQuery(since, flagValue(args, "--user")))
+	body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
+	if err != nil {
+		return err
+	}
+	if containsArg(args, "--json") {
+		fmt.Println(string(body))
+		return nil
+	}
+	var r struct {
+		Data struct {
+			Result []struct {
+				Metric map[string]string `json:"metric"`
+				Value  []interface{}     `json:"value"`
+			} `json:"result"`
+		} `json:"data"`
+	}
+	if err := json.Unmarshal(body, &r); err != nil {
+		fmt.Println(string(body))
+		return nil
+	}
+	type row struct {
+		verb string
+		n    int
+	}
+	var rows []row
+	for _, s := range r.Data.Result {
+		n := 0
+		if len(s.Value) == 2 {
+			if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
+				n = int(f)
+			}
+		}
+		rows = append(rows, row{s.Metric["verb"], n})
+	}
+	if len(rows) == 0 {
+		fmt.Println("(no usage recorded yet)")
+		return nil
+	}
+	sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
+	for _, r := range rows {
+		fmt.Printf("%6d  %s\n", r.n, r.verb)
+	}
+	return nil
+}
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@ -0,0 +1,663 @@
+package main
+
+import (
+	"bufio"
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+	"syscall"
+)
+
+// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
+// Identity is the kernel UID; per-user creds live in that user's isolated Vault
+// path (secret/workstation/claude-users/<user>) read via their scoped token, and
+// decryption is done by the official `bw` CLI. See
+// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
+func vaultCommands() []Command {
+	return []Command{
+		{Path: []string{"vault", "setup"}, Tier: TierWrite,
+			Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
+		{Path: []string{"vault", "status"}, Tier: TierRead,
+			Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
+		{Path: []string{"vault", "list"}, Tier: TierRead,
+			Summary: "list your item names: vault list [--search Q]", Run: vaultList},
+		{Path: []string{"vault", "get"}, Tier: TierRead,
+			Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
+		{Path: []string{"vault", "search"}, Tier: TierRead,
+			Summary: "search your item names: vault search <query>", Run: vaultSearch},
+		{Path: []string{"vault", "code"}, Tier: TierRead,
+			Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
+		{Path: []string{"vault", "lock"}, Tier: TierWrite,
+			Summary: "lock/log out the local bw session", Run: vaultLock},
+		{Path: []string{"vault"}, Tier: TierRead,
+			Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
+			Run:     func([]string) error { fmt.Print(vaultHelp()); return nil }},
+	}
+}
+
+// vaultHelp is shown for bare `homelab vault`.
+func vaultHelp() string {
+	return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
+
+  homelab vault setup             one-time: store your master password + API key in your Vault path
+  homelab vault status            configured / unlocked / reachable (no secrets)
+  homelab vault list [--search Q] list your item names (no secrets)
+  homelab vault search <query>    search your item names (no secrets)
+  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
+                                  TTY → clipboard (auto-clears); piped → stdout
+  homelab vault code <name>       current TOTP code
+  homelab vault lock              lock / log out the local bw session
+
+Creds live only in your own Vault path; the admin never sees them. Identity is
+your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
+(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
+`
+}
+
+const vwUserPathPrefix = "secret/workstation/claude-users/"
+
+// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
+type vwCreds struct {
+	Email          string
+	MasterPassword string
+	ClientID       string
+	ClientSecret   string
+}
+
+// cmdRunner shells out to an external command with an explicit environment and
+// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
+// a fake; realRunner is the production implementation.
+type cmdRunner func(name string, argv, envv []string) (string, error)
+
+func realRunner(name string, argv, envv []string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	out, err := cmd.Output()
+	// Trim only the trailing newline the tool appends — NOT all whitespace, so a
+	// fetched secret with significant leading/trailing spaces is preserved.
+	return strings.TrimRight(string(out), "\r\n"), err
+}
+
+// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
+// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
+// processes). Used by setup to write the master password / client_secret.
+func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
+	cmd := exec.Command(name, argv...)
+	if envv != nil {
+		cmd.Env = envv
+	}
+	cmd.Stdin = strings.NewReader(stdin)
+	out, err := cmd.Output()
+	return strings.TrimRight(string(out), "\r\n"), err
+}
+
+func vwCredsPath(user string) string { return vwUserPathPrefix + user }
+
+func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
+
+// readVaultField returns one field from a KV-v2 path, "" if absent/error.
+func readVaultField(run cmdRunner, field, path string) string {
+	out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
+	if err != nil {
+		return ""
+	}
+	return out
+}
+
+// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
+// A missing master password means the user hasn't onboarded.
+func loadCreds(run cmdRunner, user string) (vwCreds, error) {
+	p := vwCredsPath(user)
+	c := vwCreds{
+		Email:          readVaultField(run, "vaultwarden_email", p),
+		MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
+		ClientID:       readVaultField(run, "vaultwarden_client_id", p),
+		ClientSecret:   readVaultField(run, "vaultwarden_client_secret", p),
+	}
+	if c.MasterPassword == "" {
+		return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
+	}
+	return c, nil
+}
+
+// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
+var vaultCurrentUser = func() string { return os.Getenv("USER") }
+var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
+
+// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
+// do NOT inherit the full parent env (keeps stray secrets out of the child).
+func bwBaseEnv(appdata string) []string {
+	path := os.Getenv("PATH")
+	if path == "" {
+		path = "/usr/local/bin:/usr/bin:/bin"
+	}
+	return []string{
+		"PATH=" + path,
+		"HOME=" + os.Getenv("HOME"),
+		"BITWARDENCLI_APPDATA_DIR=" + appdata,
+		"BW_NOINTERACTION=true",
+	}
+}
+
+// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
+func bwSecretEnv(appdata string, c vwCreds, session string) []string {
+	env := bwBaseEnv(appdata)
+	env = append(env,
+		"BW_CLIENTID="+c.ClientID,
+		"BW_CLIENTSECRET="+c.ClientSecret,
+		"BW_PASSWORD="+c.MasterPassword,
+	)
+	if session != "" {
+		env = append(env, "BW_SESSION="+session)
+	}
+	return env
+}
+
+func bwLoginArgs() []string  { return []string{"login", "--apikey"} }
+func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
+func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
+func bwStatusArgs() []string { return []string{"status"} }
+
+// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
+// required. Unparseable/empty output → true (safer to attempt login).
+func bwNeedsLogin(statusJSON string) bool {
+	var s struct {
+		Status string `json:"status"`
+	}
+	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
+		return true
+	}
+	return s.Status == "unauthenticated" || s.Status == ""
+}
+
+func bwListArgs(search string) []string {
+	a := []string{"list", "items"}
+	if search != "" {
+		a = append(a, "--search", search)
+	}
+	return a
+}
+
+// bwUnlock runs `bw unlock` and returns the raw session key.
+func bwUnlock(run cmdRunner, env []string) (string, error) {
+	out, err := run("bw", bwUnlockArgs(), env)
+	if err != nil {
+		return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
+	}
+	return out, nil
+}
+
+// bwGet fetches one field of one item; session must be present in env.
+func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
+	return run("bw", bwGetArgs(field, name), env)
+}
+
+func returnMode(isTTY bool) string {
+	if isTTY {
+		return "clipboard"
+	}
+	return "stdout"
+}
+
+// stdoutIsTTY reports whether stdout is a character device (a terminal).
+func stdoutIsTTY() bool {
+	fi, err := os.Stdout.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
+// to stderr, so the clipboard path is only viable when stderr is a terminal).
+func stderrIsTTY() bool {
+	fi, err := os.Stderr.Stat()
+	if err != nil {
+		return false
+	}
+	return fi.Mode()&os.ModeCharDevice != 0
+}
+
+// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
+// the system clipboard (works over SSH; no X11). osc52clear copies empty.
+func osc52(payload string) string {
+	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
+}
+func osc52clear() string { return "\x1b]52;c;\a" }
+
+// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
+// else we'd dump the secret's base64 into scrollback on unsupported terminals.
+func terminalAllowed(term, termProgram string) bool {
+	t := strings.ToLower(term)
+	p := strings.ToLower(termProgram)
+	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
+		if strings.Contains(t, ok) || strings.Contains(p, ok) {
+			return true
+		}
+	}
+	// A bare xterm-* TERM is not trusted on its own; it passes only when TERM_PROGRAM names a known-good emulator above.
+	return false
+}
+
+// opRecord is one CLI operation. ItemName is accepted for the caller's
+// convenience but is INTENTIONALLY never rendered into the log line — auditing
+// which of your own logins you opened is itself sensitive, and per-item reads
+// are invisible server-side anyway (spec §9a).
+type opRecord struct {
+	User       string
+	Verb       string
+	PID        int
+	PPID       int
+	ParentComm string
+	ItemName   string // never logged
+}
+
+func opLogLine(r opRecord) string {
+	return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
+		r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
+}
+
+// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
+func parentComm(ppid int) string {
+	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
+	if err != nil {
+		return ""
+	}
+	return strings.TrimSpace(string(b))
+}
+
+// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
+// never blocks or fails the command). Goes to syslog so it ships to Loki.
+func writeOpLog(r opRecord) {
+	exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
+}
+
+func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
+
+// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
+// password to a core file. Best-effort.
+func hardenProcess() {
+	_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
+}
+
+// withUserLock serializes bw mutations for this user (concurrent Claude sessions
+// as the same user otherwise race bw's appdata). Returns an unlock func.
+func withUserLock(uid string) (func(), error) {
+	f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
+	if err != nil {
+		return nil, err
+	}
+	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
+		f.Close()
+		return nil, err
+	}
+	return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
+}
+
+// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
+type session struct {
+	env []string
+}
+
+// openSession resolves creds, ensures login, unlocks, and returns a ready env.
+// Caller must hold the user lock. appdata is created on tmpfs (0700).
+func openSession(run cmdRunner, user, uid string) (session, error) {
+	creds, err := loadCreds(run, user)
+	if err != nil {
+		return session{}, err
+	}
+	appdata := bwAppDataDir(uid)
+	if err := os.MkdirAll(appdata, 0700); err != nil {
+		return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
+	}
+	loginEnv := bwSecretEnv(appdata, creds, "")
+	// Ensure server is set and we're logged in (idempotent; ignore "already").
+	_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
+	st, _ := run("bw", bwStatusArgs(), loginEnv)
+	if bwNeedsLogin(st) {
+		if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
+			return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
+		}
+	}
+	sess, err := bwUnlock(run, loginEnv)
+	if err != nil {
+		return session{}, err
+	}
+	return session{env: bwSecretEnv(appdata, creds, sess)}, nil
+}
+
+type getOpts struct {
+	name  string
+	field string
+	json  bool
+}
+
+var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
+
+func parseGetArgs(args []string) (getOpts, error) {
+	o := getOpts{field: "password"}
+	for i := 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--json":
+			o.json = true
+		case a == "--field" && i+1 < len(args):
+			o.field = args[i+1]
+			i++
+		case strings.HasPrefix(a, "--field="):
+			o.field = strings.TrimPrefix(a, "--field=")
+		case !strings.HasPrefix(a, "-") && o.name == "":
+			o.name = a
+		}
+	}
+	if o.name == "" {
+		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
+	}
+	if !validGetFields[o.field] {
+		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
+	}
+	return o, nil
+}
+
+// getValue opens a session and fetches one field. Its only side effects are the
+// runner calls and creating the bw appdata dir, so it is unit-tested with a fake runner.
+func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return bwGet(run, s.env, o.field, o.name)
+}
+
+// clipboardDecision picks how to return a secret value. "stdout" prints it (a
+// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
+// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
+// base64 into scrollback, or silently fail because the OSC52 escape goes to a
+// non-terminal stderr).
+func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
+	if !stdoutTTY {
+		return "stdout"
+	}
+	if terminalAllowed(term, termProgram) && stderrTTY {
+		return "clipboard"
+	}
+	return "refuse"
+}
+
+// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
+// when stdout is NOT a terminal (i.e. piped to a machine consumer).
+func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
+
+// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
+// secret to a terminal's stdout/scrollback.
+func emitSecret(value string) {
+	switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
+	case "stdout":
+		fmt.Println(value)
+	case "clipboard":
+		fmt.Fprint(os.Stderr, osc52(value))
+		fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
+		clearClipboardAfter(30)
+	default: // refuse
+		fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
+	}
+}
+
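+// clearClipboardAfter spawns a detached background clear so the secret doesn't
+// linger in the clipboard. Best-effort. The child writes to our stderr (the
+// terminal that received the copy escape); left unset, exec would attach the
+// null device and the clear would never reach the terminal.
+func clearClipboardAfter(seconds int) {
+	c := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s' >&2", seconds, osc52clear()))
+	c.Stderr = os.Stderr
+	c.Start()
+}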
+
+// listNames extracts "name (id)" from `bw list items` JSON; never values.
+func listNames(jsonOut string) []string {
+	var items []struct {
+		ID   string `json:"id"`
+		Name string `json:"name"`
+	}
+	if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
+		return nil
+	}
+	out := make([]string, 0, len(items))
+	for _, it := range items {
+		out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
+	}
+	return out
+}
+
+func runList(run cmdRunner, user, uid, search string) ([]string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return nil, err
+	}
+	out, err := run("bw", bwListArgs(search), s.env)
+	if err != nil {
+		return nil, err
+	}
+	return listNames(out), nil
+}
+
+func vaultList(args []string) error {
+	hardenProcess()
+	search := ""
+	for i := 0; i < len(args); i++ {
+		if args[i] == "--search" && i+1 < len(args) {
+			search = args[i+1]
+			i++
+		} else if strings.HasPrefix(args[i], "--search=") {
+			search = strings.TrimPrefix(args[i], "--search=")
+		}
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	names, err := runList(realRunner, vaultCurrentUser(), uid, search)
+	if err != nil {
+		return err
+	}
+	for _, n := range names {
+		fmt.Println(n)
+	}
+	return nil
+}
+
+func vaultSearch(args []string) error {
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault search <query>")
+	}
+	return vaultList([]string{"--search", strings.Join(args, " ")})
+}
+
+func vaultCode(args []string) error {
+	hardenProcess()
+	if len(args) == 0 {
+		return fmt.Errorf("usage: homelab vault code <name>")
+	}
+	name := args[0]
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
+	if err != nil {
+		return err
+	}
+	// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
+	writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
+	exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
+	emitSecret(val)
+	return nil
+}
+
+// statusSummary reports config/reachability without revealing secrets.
+func statusSummary(run cmdRunner, user, uid string) string {
+	if _, err := loadCreds(run, user); err != nil {
+		return "vault: not configured — run `homelab vault setup`"
+	}
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
+	}
+	if _, err := run("bw", []string{"sync"}, s.env); err != nil {
+		return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
+	}
+	return "vault: configured, unlocked, reachable ✓"
+}
+
+func vaultStatus(args []string) error {
+	hardenProcess()
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
+	return nil
+}
+
+func vaultLock(args []string) error {
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	appdata := bwAppDataDir(uid)
+	_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
+	_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
+	if logoutErr == nil {
+		fmt.Println("locked")
+	}
+	return nil // lock/logout best-effort; never error the caller
+}
+
+// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
+// email nor the API client_id is a usable credential on its own.
+func vaultPatchPublicArgs(user, email, clientID string) []string {
+	return []string{"kv", "patch", vwCredsPath(user),
+		"vaultwarden_email=" + email,
+		"vaultwarden_client_id=" + clientID,
+	}
+}
+
+// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
+// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
+// on stdin by realRunnerStdin.
+func vaultPatchSecretArgs(user, key string) []string {
+	return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
+}
+
+// writeCreds stores all four fields in the user's Vault path. The two real
+// secrets (master password, API client_secret) go via stdin — never argv.
+func writeCreds(user string, c vwCreds) error {
+	if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
+		return err
+	}
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+		return err
+	}
+	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+		return err
+	}
+	return nil
+}
+
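+// promptNoEcho reads one line without terminal echo (for the master password).
+func promptNoEcho(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	// stty acts on its stdin, and exec defaults a nil Stdin to the null device,
+	// so hand it our terminal explicitly or echo is never actually disabled.
+	stty := func(mode string) { c := exec.Command("stty", mode); c.Stdin = os.Stdin; c.Run() }
+	stty("-echo")
+	defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()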
+	r := bufio.NewReader(os.Stdin)
+	line, err := r.ReadString('\n')
+	// Trim only the line terminator — a master password / API secret may
+	// legitimately contain leading/trailing spaces.
+	return strings.TrimRight(line, "\r\n"), err
+}
+
+func promptLine(prompt string) (string, error) {
+	fmt.Fprint(os.Stderr, prompt)
+	line, err := bufio.NewReader(os.Stdin).ReadString('\n')
+	return strings.TrimSpace(line), err
+}
+
+func vaultSetup(args []string) error {
+	hardenProcess()
+	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
+	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
+	email, err := promptLine("Vaultwarden email: ")
+	if err != nil {
+		return err
+	}
+	clientID, err := promptLine("API key client_id (user.xxxx): ")
+	if err != nil {
+		return err
+	}
+	clientSecret, err := promptNoEcho("API key client_secret: ")
+	if err != nil {
+		return err
+	}
+	master, err := promptNoEcho("Master password: ")
+	if err != nil {
+		return err
+	}
+	if master == "" || clientID == "" || clientSecret == "" {
+		return fmt.Errorf("all fields are required")
+	}
+	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
+	if err := writeCreds(vaultCurrentUser(), c); err != nil {
+		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
+		return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
+	}
+	fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
+	return nil
+}
+
+func vaultGet(args []string) error {
+	hardenProcess()
+	o, err := parseGetArgs(args)
+	if err != nil {
+		return err
+	}
+	uid := vaultCurrentUID()
+	unlock, err := withUserLock(uid)
+	if err != nil {
+		return err
+	}
+	defer unlock()
+	user := vaultCurrentUser()
+	val, err := getValue(realRunner, user, uid, o)
+	if err != nil {
+		return err
+	}
+	writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
+	if o.json {
+		if !jsonToStdoutOK(stdoutIsTTY()) {
+			return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
+		}
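+		// Marshal via encoding/json: %q emits Go string syntax (e.g. \x00, \a), which is not valid JSON.
+		_ = json.NewEncoder(os.Stdout).Encode(map[string]string{o.field: val})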
+		return nil
+	}
+	emitSecret(val)
+	return nil
+}
+
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
@ -0,0 +1,368 @@
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+	"os"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestVaultCommandsRegistered(t *testing.T) {
+	want := map[string]Tier{
+		"vault setup":  TierWrite,
+		"vault status": TierRead,
+		"vault list":   TierRead,
+		"vault get":    TierRead,
+		"vault search": TierRead,
+		"vault code":   TierRead,
+		"vault lock":   TierWrite,
+	}
+	got := map[string]Tier{}
+	for _, c := range vaultCommands() {
+		got[c.name()] = c.Tier
+	}
+	for name, tier := range want {
+		if got[name] != tier {
+			t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
+		}
+	}
+}
+
+func TestVaultGroupInRegistry(t *testing.T) {
+	if !isCommandGroup(buildRegistry(), "vault") {
+		t.Fatal("`vault` group not wired into buildRegistry()")
+	}
+}
+
+func TestVaultCredsPath(t *testing.T) {
+	if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
+		t.Fatalf("vwCredsPath = %q", got)
+	}
+}
+
+func TestBwAppDataDir(t *testing.T) {
+	if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
+		t.Fatalf("bwAppDataDir = %q", got)
+	}
+}
+
+// fakeRunner records calls and returns canned stdout/err; keys are prefix-matched against name+" "+argv.
+type fakeRunner struct {
+	calls   [][]string
+	out     map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched
+	err     map[string]error
+	lastEnv []string
+}
+
+func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
+	f.calls = append(f.calls, append([]string{name}, argv...))
+	f.lastEnv = envv
+	key := name + " " + strings.Join(argv, " ")
+	for k, v := range f.out {
+		if strings.HasPrefix(key, k) {
+			return v, f.err[k]
+		}
+	}
+	return "", f.err[key]
+}
+
+func TestLoadCredsReadsFourFields(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo":          "emo@x.me",
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":       "user.abc",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":   "sek",
+	}}
+	c, err := loadCreds(f.run, "emo")
+	if err != nil {
+		t.Fatalf("loadCreds: %v", err)
+	}
+	want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
+	if !reflect.DeepEqual(c, want) {
+		t.Fatalf("loadCreds = %+v want %+v", c, want)
+	}
+}
+
+func TestLoadCredsUnconfigured(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{}} // every field empty
+	if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
+		t.Fatalf("want 'not configured' error, got %v", err)
+	}
+}
+
+func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
+	c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
+	env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
+	joined := strings.Join(env, "\n")
+	for _, want := range []string{
+		"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
+		"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
+	} {
+		if !strings.Contains(joined, want) {
+			t.Errorf("bwSecretEnv missing %q", want)
+		}
+	}
+	if !strings.Contains(joined, "PATH=") {
+		t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
+	}
+}
+
+func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
+	argv := bwGetArgs("password", "github")
+	for _, a := range argv {
+		if strings.Contains(a, "SESSION") || a == "--session" {
+			t.Fatalf("session must travel via env, not argv: %v", argv)
+		}
+	}
+	if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
+		t.Fatalf("bwGetArgs = %v", argv)
+	}
+}
+
+func TestBwListArgs(t *testing.T) {
+	if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
+		t.Fatalf("bwListArgs('') = %v", got)
+	}
+	if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
+		t.Fatalf("bwListArgs('git') = %v", got)
+	}
+}
+
+func TestBwUnlockReturnsSession(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
+	env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
+	sess, err := bwUnlock(f.run, env)
+	if err != nil || sess != "THE-SESSION-KEY" {
+		t.Fatalf("bwUnlock = %q, %v", sess, err)
+	}
+	// argv must use --passwordenv + --raw, never the password literal
+	last := f.calls[len(f.calls)-1]
+	if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
+		t.Fatalf("unlock argv = %v", last)
+	}
+}
+
+func TestReturnMode(t *testing.T) {
+	if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
+		t.Fatal("returnMode wrong")
+	}
+}
+
+func TestOSC52Encode(t *testing.T) {
+	got := osc52("secret")
+	want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
+	if got != want {
+		t.Fatalf("osc52 = %q want %q", got, want)
+	}
+	if osc52clear() != "\x1b]52;c;\a" {
+		t.Fatalf("osc52clear wrong: %q", osc52clear())
+	}
+}
+
+func TestTerminalAllowed(t *testing.T) {
+	allow := []struct{ term, prog string }{
+		{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
+		{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
+	}
+	for _, c := range allow {
+		if !terminalAllowed(c.term, c.prog) {
+			t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
+		}
+	}
+	deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
+	for _, c := range deny {
+		if terminalAllowed(c.term, c.prog) {
+			t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
+		}
+	}
+}
+
+func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
+	line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
+	for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
+		if !strings.Contains(line, must) {
+			t.Errorf("op-log missing %q: %s", must, line)
+		}
+	}
+	for _, mustNot := range []string{"Chase", "password", "secret"} {
+		if strings.Contains(line, mustNot) {
+			t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
+		}
+	}
+}
+
+func TestLockPath(t *testing.T) {
+	if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
+		t.Fatalf("vaultLockPath = %q", got)
+	}
+}
+
+func TestParseGetArgs(t *testing.T) {
+	o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
+	if err != nil || o.name != "github" || o.field != "username" || !o.json {
+		t.Fatalf("parseGetArgs = %+v err=%v", o, err)
+	}
+	d, _ := parseGetArgs([]string{"github"})
+	if d.field != "password" || d.json {
+		t.Fatalf("defaults wrong: %+v", d)
+	}
+	if _, err := parseGetArgs([]string{}); err == nil {
+		t.Fatal("get with no name must error")
+	}
+	if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
+		t.Fatal("invalid --field must error")
+	}
+}
+
+func TestListNamesParsing(t *testing.T) {
+	// bw list items returns JSON; listNames extracts name + id only.
+	js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
+	names := listNames(js)
+	if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
+		t.Fatalf("listNames = %v", names)
+	}
+}
+
+func TestStatusSummaryUnconfigured(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{}} // no creds
+	s := statusSummary(f.run, "emo", "1001")
+	if !strings.Contains(s, "not configured") {
+		t.Fatalf("status = %q", s)
+	}
+}
+
+func TestVaultPatchPublicArgs(t *testing.T) {
+	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
+	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
+		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("vaultPatchPublicArgs = %v", got)
+	}
+	for _, a := range got {
+		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
+			t.Fatalf("secret key leaked into public argv: %v", got)
+		}
+	}
+}
+
+func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
+	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
+		got := vaultPatchSecretArgs("emo", key)
+		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
+		if !reflect.DeepEqual(got, want) {
+			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
+		}
+		if got[len(got)-1] != key+"=-" {
+			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
+		}
+	}
+}
+
+// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
+// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
+// value may appear in any command's argv — secrets travel via env/stdin only.
+func TestNoSecretInArgvAcrossFlow(t *testing.T) {
+	uid := fmt.Sprintf("%d", os.Getuid())
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":        "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":    "CLIENTSEKRET",
+		"bw status":              `{"status":"locked"}`,
+		"bw unlock":              "SESSIONXYZ",
+		"bw get password github": "p@ss",
+	}}
+	if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
+		t.Fatalf("getValue: %v", err)
+	}
+	for _, call := range f.calls {
+		for _, arg := range call {
+			for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
+				if strings.Contains(arg, s) {
+					t.Errorf("secret %q leaked into argv: %v", s, call)
+				}
+			}
+		}
+	}
+	if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
+		t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
+	}
+}
+
+func TestClipboardDecision(t *testing.T) {
+	cases := []struct {
+		stdoutTTY, stderrTTY bool
+		term, prog, want     string
+	}{
+		{false, true, "xterm-kitty", "", "stdout"},
+		{true, true, "xterm-kitty", "", "clipboard"},
+		{true, true, "dumb", "", "refuse"},
+		{true, false, "xterm-kitty", "", "refuse"},
+	}
+	for _, c := range cases {
+		if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
+			t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want)
+		}
+	}
+}
+
+func TestJSONToStdoutOK(t *testing.T) {
+	if jsonToStdoutOK(true) {
+		t.Error("must refuse JSON secret on a terminal")
+	}
+	if !jsonToStdoutOK(false) {
+		t.Error("must allow JSON when piped")
+	}
+}
+
+func TestBwNeedsLogin(t *testing.T) {
+	if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
+		t.Error("unauthenticated → needs login")
+	}
+	if bwNeedsLogin(`{"status":"locked"}`) {
+		t.Error("locked → no login (just unlock)")
+	}
+	if bwNeedsLogin(`{"status":"unlocked"}`) {
+		t.Error("unlocked → no login")
+	}
+	if !bwNeedsLogin(`not json`) {
+		t.Error("unparseable → attempt login")
+	}
+}
+
+func TestVaultHelpMentionsSecurity(t *testing.T) {
+	h := vaultHelp()
+	for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
+		if !strings.Contains(h, want) {
+			t.Errorf("vault help missing %q", want)
+		}
+	}
+}
+
+func TestVaultBareGroupRegistered(t *testing.T) {
+	for _, c := range vaultCommands() {
+		if len(c.Path) == 1 && c.Path[0] == "vault" {
+			return
+		}
+	}
+	t.Fatal("bare `vault` help command not registered")
+}
+
+// TestGetValueFlow exercises getValue, the testable core: given a runner + opts, it returns the secret value.
+func TestGetValueFlow(t *testing.T) {
+	f := &fakeRunner{out: map[string]string{
+		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo":        "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo":    "cs",
+		"bw status":              `{"status":"locked"}`,
+		"bw unlock":              "SESS",
+		"bw get password github": "p@ss",
+	}}
+	// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
+	uid := fmt.Sprintf("%d", os.Getuid())
+	val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
+	if err != nil || val != "p@ss" {
+		t.Fatalf("getValue = %q, %v", val, err)
+	}
+}
--- a/cli/cmd_work.go
+++ b/cli/cmd_work.go
@@ -0,0 +1,212 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+func workCommands() []Command {
+	return []Command{
+		{Path: []string{"work", "start"}, Tier: TierWrite,
+			Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
+		{Path: []string{"work", "land"}, Tier: TierWrite,
+			Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
+		{Path: []string{"work", "clean"}, Tier: TierWrite,
+			Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
+	}
+}
+
+// flagValue extracts `--name value` or `--name=value` from args.
+func flagValue(args []string, name string) string {
+	for i, a := range args {
+		if a == name && i+1 < len(args) {
+			return args[i+1]
+		}
+		if strings.HasPrefix(a, name+"=") {
+			return strings.TrimPrefix(a, name+"=")
+		}
+	}
+	return ""
+}
+
+func remotesOrEmpty(repoRoot string) []string {
+	r, _ := gitRemotes(repoRoot)
+	return r
+}
+
+// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
+func workStart(args []string) error {
+	topic, _ := firstPositional(args)
+	if topic == "" {
+		return fmt.Errorf("usage: homelab work start <topic>")
+	}
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	remote := preferRemote(remotesOrEmpty(repoRoot))
+	if remote == "" {
+		return fmt.Errorf("no git remote configured in %s", repoRoot)
+	}
+	flags := cryptFlagsFor(repoRoot)
+	branch := currentUser() + "/" + topic
+	wtRel := filepath.Join(".worktrees", topic)
+
+	ensureWorktreesIgnored(repoRoot)
+	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+		return fmt.Errorf("fetch %s failed: %w", remote, err)
+	}
+	if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
+		return fmt.Errorf("worktree add failed: %w", err)
+	}
+	wtPath := filepath.Join(repoRoot, wtRel)
+	fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
+	fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
+	return nil
+}
+
+// workLand integrates the current branch into master: fetch, merge master in,
+// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
+// fallback when the direct push is rejected (e.g. branch protection).
+func workLand(args []string) error {
+	verifyCmd := flagValue(args, "--verify-cmd")
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
+	if err != nil {
+		return err
+	}
+	if branch == "master" || branch == "main" {
+		return fmt.Errorf("refusing to land: already on %s", branch)
+	}
+	remote := preferRemote(remotesOrEmpty(repoRoot))
+	if remote == "" {
+		return fmt.Errorf("no git remote configured in %s", repoRoot)
+	}
+	flags := cryptFlagsFor(repoRoot)
+
+	if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+		return fmt.Errorf("fetch failed: %w", err)
+	}
+	if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
+		return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
+	}
+	if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
+		return fmt.Errorf("not landing: %w", err)
+	}
+	if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
+		return landFallback(repoRoot, flags, remote, branch, err)
+	}
+	fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
+	if containsArg(args, "--no-ci-watch") {
+		fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
+		return nil
+	}
+	landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
+	fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
+	if err := ciWatch([]string{landed}); err != nil {
+		return fmt.Errorf("landed, but CI did not go green: %w", err)
+	}
+	return nil
+}
+
+// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
+// neither is available it REFUSES (returns an error) unless allowSkip is set —
+// landing to master unverified must be a deliberate choice (--no-verify).
+func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
+	if verifyCmd != "" {
+		fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
+		return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
+	}
+	if isFile(filepath.Join(repoRoot, "go.mod")) {
+		fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
+		return runStreamingIn(repoRoot, "go", "test", "./...")
+	}
+	if allowSkip {
+		fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
+		return nil
+	}
+	return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
+}
+
+// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
+// by fetching + merging master and retrying.
+func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
+	var lastErr error
+	for i := 0; i < attempts; i++ {
+		err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
+		if err == nil {
+			return nil
+		}
+		lastErr = err
+		if i < attempts-1 {
+			fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
+			if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
+				return err
+			}
+			if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
+				return err
+			}
+		}
+	}
+	return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
+}
+
+// landFallback pushes the feature branch when the direct master push is rejected
+// (e.g. branch protection), so the work isn't lost and a PR can be opened.
+func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
+	fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
+	fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
+	if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
+		return fmt.Errorf("fallback branch push also failed: %w", err)
+	}
+	fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
+	return nil
+}
+
+// workClean removes a task's worktree and branch. Run from the main checkout.
+func workClean(args []string) error {
+	topic, _ := firstPositional(args)
+	if topic == "" {
+		return fmt.Errorf("usage: homelab work clean <topic>  (run from the main checkout)")
+	}
+	cwd, _ := os.Getwd()
+	repoRoot, err := gitRepoRoot(cwd)
+	if err != nil {
+		return fmt.Errorf("not in a git repository: %w", err)
+	}
+	flags := cryptFlagsFor(repoRoot)
+	wtRel := filepath.Join(".worktrees", topic)
+	branch := currentUser() + "/" + topic
+
+	if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
+		return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
+	}
+	if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
+		fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
+	}
+	fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
+	return nil
+}
+
+// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
+func ensureWorktreesIgnored(repoRoot string) {
+	if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
+		return
+	}
+	gi := filepath.Join(repoRoot, ".gitignore")
+	f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
+	if err != nil {
+		return
+	}
+	defer f.Close()
+	if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
+		fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
+	}
+}
--- a/cli/cmd_work_test.go
+++ b/cli/cmd_work_test.go
@@ -0,0 +1,32 @@
+package main
+
+import "testing"
+
+func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
+	dir := t.TempDir() // no go.mod, no verify cmd
+	if err := runVerify(dir, "", false); err == nil {
+		t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
+	}
+	if err := runVerify(dir, "", true); err != nil {
+		t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
+	}
+}
+
+func TestFlagValue(t *testing.T) {
+	cases := []struct {
+		args []string
+		name string
+		want string
+	}{
+		{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
+		{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
+		{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
+		{[]string{"topic"}, "--verify-cmd", ""},
+		{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
+	}
+	for _, c := range cases {
+		if got := flagValue(c.args, c.name); got != c.want {
+			t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
+		}
+	}
+}
--- a/cli/command.go
+++ b/cli/command.go
@@ -0,0 +1,104 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"sort"
+	"strings"
+)
+
+// Tier classifies whether a command observes (read) or mutates (write) state.
+// v0.1 allows everything; the tier is recorded so a classifier hook can gate
+// writes later without restructuring (see docs/adr/0005).
+type Tier string
+
+const (
+	TierRead  Tier = "read"
+	TierWrite Tier = "write"
+)
+
+// Command is one homelab verb. Path is the token sequence that selects it,
+// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
+type Command struct {
+	Path    []string
+	Tier    Tier
+	Summary string
+	Run     func(args []string) error
+}
+
+// dispatch routes args to the command whose Path is the longest matching prefix
+// of args, passing the remaining args to its Run.
+func dispatch(reg []Command, args []string) error {
+	best := -1
+	bestLen := 0
+	for i, c := range reg {
+		if len(c.Path) > len(args) {
+			continue
+		}
+		match := true
+		for j, p := range c.Path {
+			if args[j] != p {
+				match = false
+				break
+			}
+		}
+		if match && len(c.Path) >= bestLen {
+			best = i
+			bestLen = len(c.Path)
+		}
+	}
+	if best < 0 {
+		return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
+	}
+	matched := reg[best]
+	runErr := matched.Run(args[bestLen:])
+	emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
+	return runErr
+}
+
+// name is the space-joined verb path, e.g. "tf plan".
+func (c Command) name() string { return strings.Join(c.Path, " ") }
+
+// sortedByName returns a copy of reg ordered by verb path for stable output.
+func sortedByName(reg []Command) []Command {
+	out := make([]Command, len(reg))
+	copy(out, reg)
+	sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
+	return out
+}
+
+// manifestText renders one aligned line per command: "<path>  <tier>  <summary>".
+// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
+func manifestText(reg []Command) string {
+	cmds := sortedByName(reg)
+	width := 0
+	for _, c := range cmds {
+		if n := len(c.name()); n > width {
+			width = n
+		}
+	}
+	var b strings.Builder
+	for _, c := range cmds {
+		fmt.Fprintf(&b, "%-*s  %-5s  %s\n", width, c.name(), c.Tier, c.Summary)
+	}
+	return b.String()
+}
+
+// manifestJSON renders the registry as a JSON array of {command, tier, summary}
+// so agents can parse the full surface in one call.
+func manifestJSON(reg []Command) (string, error) {
+	type entry struct {
+		Command string `json:"command"`
+		Tier    string `json:"tier"`
+		Summary string `json:"summary"`
+	}
+	entries := make([]entry, 0, len(reg))
+	for _, c := range sortedByName(reg) {
+		entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
+	}
+	b, err := json.MarshalIndent(entries, "", "  ")
+	if err != nil {
+		return "", err
+	}
+	return string(b), nil
+}
--- a/cli/command_test.go
+++ b/cli/command_test.go
@@ -0,0 +1,73 @@
+package main
+
+import (
+	"encoding/json"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
+// command whose Path is the longest matching prefix of the input tokens, and
+// hand the command the remaining args.
+func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
+	var gotArgs []string
+	ran := ""
+	reg := []Command{
+		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
+			Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
+		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
+			Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
+	}
+
+	if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
+		t.Fatalf("dispatch returned error: %v", err)
+	}
+	if ran != "tf plan" {
+		t.Fatalf("routed to %q, want %q", ran, "tf plan")
+	}
+	if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
+		t.Fatalf("command got args %v, want %v", gotArgs, want)
+	}
+}
+
+func TestDispatchUnknownCommandErrors(t *testing.T) {
+	reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
+	if err := dispatch(reg, []string{"bogus"}); err == nil {
+		t.Fatal("expected error for unknown command, got nil")
+	}
+}
+
+// The manifest is the progressive-discovery entrypoint: one line per command
+// showing the full verb path, its tier, and summary, sorted for stable output.
+func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
+	reg := []Command{
+		{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
+		{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
+	}
+	out := manifestText(reg)
+	for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
+		if !strings.Contains(out, want) {
+			t.Errorf("manifest text missing %q\n---\n%s", want, out)
+		}
+	}
+	// sorted: claim (c) must appear before tf plan (t)
+	if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
+		t.Errorf("manifest not sorted by path:\n%s", out)
+	}
+}
+
+func TestManifestJSONIsParsableAndTagged(t *testing.T) {
+	reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
+	out, err := manifestJSON(reg)
+	if err != nil {
+		t.Fatalf("manifestJSON error: %v", err)
+	}
+	var got []map[string]string
+	if err := json.Unmarshal([]byte(out), &got); err != nil {
+		t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
+	}
+	if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
+		t.Fatalf("unexpected manifest JSON: %v", got)
+	}
+}
--- a/cli/homelab.go
+++ b/cli/homelab.go
@@ -0,0 +1,98 @@
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
+var version = "dev"
+
+// buildRegistry returns every homelab verb. New verb-groups append here.
+func buildRegistry() []Command {
+	var reg []Command
+	reg = append(reg, claimCommands()...)
+	reg = append(reg, tfCommands()...)
+	reg = append(reg, workCommands()...)
+	reg = append(reg, k8sCommands()...)
+	reg = append(reg, memoryCommands()...)
+	reg = append(reg, ciCommands()...)
+	reg = append(reg, deployCommands()...)
+	reg = append(reg, netCommands()...)
+	reg = append(reg, obsCommands()...)
+	reg = append(reg, usageCommands()...)
+	reg = append(reg, haCommands()...)
+	reg = append(reg, browserCommands()...)
+	reg = append(reg, vaultCommands()...)
+	return reg
+}
+
+// dispatchTop handles the homelab verb surface. handled=false means the args are
+// not a homelab verb, so main() falls back to the legacy -use-case path.
+func dispatchTop(args []string) (handled bool, err error) {
+	if len(args) == 0 {
+		fmt.Print(usage())
+		return true, nil
+	}
+	switch args[0] {
+	case "help", "-h", "--help":
+		fmt.Print(usage())
+		return true, nil
+	case "version", "--version":
+		fmt.Println("homelab " + version)
+		return true, nil
+	case "manifest":
+		reg := buildRegistry()
+		if containsArg(args[1:], "--json") {
+			out, err := manifestJSON(reg)
+			if err != nil {
+				return true, err
+			}
+			fmt.Println(out)
+			return true, nil
+		}
+		fmt.Print(manifestText(reg))
+		return true, nil
+	}
+	if strings.HasPrefix(args[0], "-") {
+		return false, nil
+	}
+	reg := buildRegistry()
+	if !isCommandGroup(reg, args[0]) {
+		return false, nil
+	}
+	return true, dispatch(reg, args)
+}
+
+func isCommandGroup(reg []Command, group string) bool {
+	for _, c := range reg {
+		if len(c.Path) > 0 && c.Path[0] == group {
+			return true
+		}
+	}
+	return false
+}
+
+func containsArg(args []string, want string) bool {
+	for _, a := range args {
+		if a == want {
+			return true
+		}
+	}
+	return false
+}
+
+func usage() string {
+	var b strings.Builder
+	fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
+	b.WriteString("Usage:\n  homelab <command> [args]\n\nCommands:\n")
+	for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
+		if line != "" {
+			b.WriteString("  " + line + "\n")
+		}
+	}
+	b.WriteString("\n  manifest [--json]   list all commands (machine-readable with --json)\n")
+	b.WriteString("  version             print version\n")
+	b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
+	return b.String()
+}
--- a/cli/k8s.go
+++ b/cli/k8s.go
@@ -0,0 +1,138 @@
+package main
+
+import (
+	"fmt"
+	"os/exec"
+	"strings"
+)
+
+// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
+
+func kubectlBase(ns string, args ...string) []string {
+	var full []string
+	if ns != "" {
+		full = append(full, "-n", ns)
+	}
+	return append(full, args...)
+}
+
+func kubectlStream(ns string, args ...string) error {
+	return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
+}
+
+// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
+func kubectlCapture(ns string, args ...string) (string, error) {
+	out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
+	return strings.TrimSpace(string(out)), err
+}
+
+// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
+type k8sTarget struct {
+	app       string
+	ns        string
+	pod       string
+	container string
+	selector  string
+	tty       bool
+	rest      []string // passthrough flags and, after `--`, the exec command
+}
+
+// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
+// The first bare token is the app; unknown flags pass through in rest.
+func parseK8sTarget(args []string) k8sTarget {
+	t := k8sTarget{}
+	i := 0
+	take := func() string {
+		if i+1 < len(args) {
+			i++
+			return args[i]
+		}
+		return ""
+	}
+	for i = 0; i < len(args); i++ {
+		a := args[i]
+		switch {
+		case a == "--":
+			t.rest = append(t.rest, args[i+1:]...)
+			return t
+		case a == "-n" || a == "--namespace":
+			t.ns = take()
+		case strings.HasPrefix(a, "--namespace="):
+			t.ns = strings.TrimPrefix(a, "--namespace=")
+		case a == "--pod":
+			t.pod = take()
+		case strings.HasPrefix(a, "--pod="):
+			t.pod = strings.TrimPrefix(a, "--pod=")
+		case a == "-c" || a == "--container":
+			t.container = take()
+		case strings.HasPrefix(a, "--container="):
+			t.container = strings.TrimPrefix(a, "--container=")
+		case a == "-l" || a == "--selector":
+			t.selector = take()
+		case strings.HasPrefix(a, "--selector="):
+			t.selector = strings.TrimPrefix(a, "--selector=")
+		case a == "--tty" || a == "-it" || a == "-ti":
+			t.tty = true
+		case !strings.HasPrefix(a, "-") && t.app == "":
+			t.app = a
+		default:
+			t.rest = append(t.rest, a)
+		}
+	}
+	return t
+}
+
+// namespace defaults to the app name (most namespaces hold exactly one app).
+func (t k8sTarget) namespace() string {
+	if t.ns != "" {
+		return t.ns
+	}
+	return t.app
+}
+
+// objectRef is the kubectl object for logs/exec: an explicit pod, else
+// deploy/<app> (kubectl resolves a pod from the Deployment).
+func (t k8sTarget) objectRef() string {
+	if t.pod != "" {
+		return "pod/" + t.pod
+	}
+	return "deploy/" + t.app
+}
+
+// --- database access (the dbaas exec pattern) ---
+
+type dbPlan struct {
+	ns        string
+	pod       string   // explicit pod (e.g. mysql-standalone-0)
+	selector  string   // resolve the pod by this label when pod == "" (CNPG primary)
+	container string   // "" = default container
+	argv      []string // command + args to run inside the pod
+}
+
+// planDBExec builds the in-pod command to run sql against app's database.
+// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
+// Service, not an exec target), psql -U postgres -d <db>.
+// MySQL: mysql-standalone-0, password expanded from the pod's env by the in-pod shell (never in the kubectl argv).
+// dbName defaults to app. sql empty => interactive client.
+func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
+	if dbName == "" {
+		dbName = app
+	}
+	if mysql {
+		inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
+		if sql != "" {
+			inner += " -e " + shellQuote(sql)
+		}
+		return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
+	}
+	argv := []string{"psql", "-U", "postgres", "-d", dbName}
+	if sql != "" {
+		argv = append(argv, "-tAc", sql)
+	}
+	return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
+}
+
+// shellQuote single-quotes s for safe embedding in a bash -c string.
+func shellQuote(s string) string {
+	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
+}
--- a/cli/k8s_test.go
+++ b/cli/k8s_test.go
@@ -0,0 +1,65 @@
+package main
+
+import (
+	"reflect"
+	"strings"
+	"testing"
+)
+
+func TestParseK8sTarget(t *testing.T) {
+	got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
+	want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
+	}
+}
+
+func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
+	if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
+		t.Errorf("namespace() = %q, want immich", ns)
+	}
+	if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
+		t.Errorf("namespace() = %q, want dbaas", ns)
+	}
+}
+
+func TestK8sTargetObjectRef(t *testing.T) {
+	if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
+		t.Errorf("objectRef() = %q, want deploy/tripit", r)
+	}
+	if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
+		t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
+	}
+}
+
+func TestPlanDBExecPostgresDefault(t *testing.T) {
+	p := planDBExec("fire-planner", "", "SELECT 1", false)
+	// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
+	// label rather than naming an (un-exec-able) Service.
+	if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
+		t.Fatalf("unexpected pg target: %+v", p)
+	}
+	// db name defaults to the app; SQL passed via -tAc
+	joined := strings.Join(p.argv, " ")
+	if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
+		t.Fatalf("pg argv missing db/sql: %v", p.argv)
+	}
+}
+
+func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
+	p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
+	if p.pod != "mysql-standalone-0" {
+		t.Fatalf("unexpected mysql pod: %+v", p)
+	}
+	inner := strings.Join(p.argv, " ")
+	// password must come from the env var, never inline
+	if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
+		t.Fatalf("mysql must use env password wrapper: %v", p.argv)
+	}
+}
+
+func TestShellQuoteEscapes(t *testing.T) {
+	if got := shellQuote("a'b"); got != `'a'\''b'` {
+		t.Fatalf("shellQuote = %q", got)
+	}
+}
--- a/cli/main.go
+++ b/cli/main.go
@ -26,8 +26,16 @@ var (
 )

 func main() {
-	err := run()
-	if err != nil {
+	// homelab verb surface (work/tf/claim/...) is tried first; if the args are
+	// not a homelab verb, fall through to the legacy webhook -use-case path.
+	if handled, err := dispatchTop(os.Args[1:]); handled {
+		if err != nil {
+			fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
+			os.Exit(1)
+		}
+		return
+	}
+	if err := run(); err != nil {
 		glog.Errorf("run failed: %s", err.Error())
 		os.Exit(255)
 	}
--- a/cli/memory.go
+++ b/cli/memory.go
@ -0,0 +1,103 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"os"
+	"strings"
+	"time"
+)
+
+// defaultMemoryURL is used when no env override is present (agents normally have
+// CLAUDE_MEMORY_API_URL set by the memory hooks).
+const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
+
+type memoryClient struct {
+	base string
+	key  string
+	http *http.Client
+}
+
+func firstEnv(keys ...string) string {
+	for _, k := range keys {
+		if v := os.Getenv(k); v != "" {
+			return v
+		}
+	}
+	return ""
+}
+
+func resolveMemoryBase() string {
+	if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
+		return strings.TrimRight(b, "/")
+	}
+	return defaultMemoryURL
+}
+
+// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
+// the MCP wraps), so it works even when the MCP frontend is down.
+func newMemoryClient() (*memoryClient, error) {
+	key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
+	if key == "" {
+		return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
+	}
+	return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
+}
+
+func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
+	var r io.Reader
+	if body != nil {
+		b, err := json.Marshal(body)
+		if err != nil {
+			return nil, err
+		}
+		r = bytes.NewReader(b)
+	}
+	req, err := http.NewRequest(method, c.base+path, r)
+	if err != nil {
+		return nil, err
+	}
+	req.Header.Set("Authorization", "Bearer "+c.key)
+	if body != nil {
+		req.Header.Set("Content-Type", "application/json")
+	}
+	resp, err := c.http.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	out, _ := io.ReadAll(resp.Body)
+	if resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
+	}
+	return out, nil
+}
+
+// Request bodies mirror src/claude_memory/api/models.py.
+
+type memRecallReq struct {
+	Context       string `json:"context"`
+	ExpandedQuery string `json:"expanded_query,omitempty"`
+	Category      string `json:"category,omitempty"`
+	SortBy        string `json:"sort_by,omitempty"`
+	Limit         int    `json:"limit,omitempty"`
+}
+
+type memStoreReq struct {
+	Content          string  `json:"content"`
+	Category         string  `json:"category,omitempty"`
+	Tags             string  `json:"tags,omitempty"`
+	ExpandedKeywords string  `json:"expanded_keywords,omitempty"`
+	Importance       float64 `json:"importance"`
+	ForceSensitive   bool    `json:"force_sensitive,omitempty"`
+}
+
+type memUpdateReq struct {
+	Content          *string  `json:"content,omitempty"`
+	Tags             *string  `json:"tags,omitempty"`
+	Importance       *float64 `json:"importance,omitempty"`
+	ExpandedKeywords *string  `json:"expanded_keywords,omitempty"`
+}
--- a/cli/memory_test.go
+++ b/cli/memory_test.go
@ -0,0 +1,51 @@
+package main
+
+import (
+	"encoding/json"
+	"os"
+	"strings"
+	"testing"
+)
+
+func TestResolveMemoryBase(t *testing.T) {
+	old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
+	defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
+
+	os.Unsetenv("CLAUDE_MEMORY_API_URL")
+	os.Unsetenv("MEMORY_API_URL")
+	if got := resolveMemoryBase(); got != defaultMemoryURL {
+		t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
+	}
+	os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
+	if got := resolveMemoryBase(); got != "https://m.example" {
+		t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
+	}
+}
+
+func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
+	b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts"}) // Importance left at 0
+	s := string(b)
+	if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0`) {
+		t.Fatalf("memStoreReq JSON missing fields: %s", s)
+	}
+}
+
+func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
+	tags := "a,b"
+	b, _ := json.Marshal(memUpdateReq{Tags: &tags})
+	s := string(b)
+	if strings.Contains(s, "content") || strings.Contains(s, "importance") {
+		t.Fatalf("unset update fields must be omitted: %s", s)
+	}
+	if !strings.Contains(s, `"tags":"a,b"`) {
+		t.Fatalf("set field missing: %s", s)
+	}
+}
+
+func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
+	b, _ := json.Marshal(memRecallReq{Context: "hi"})
+	s := string(b)
+	if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
+		t.Fatalf("empty optionals must be omitted: %s", s)
+	}
+}
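Note on the request structs exercised by these tests: the following is a minimal standalone sketch of the two `encoding/json` behaviours they appear to rely on. A plain field without `omitempty` is always sent, even at its zero value, while a pointer field with `omitempty` is dropped only when nil. The names `storeReq`, `updateReq`, and `encode` are hypothetical stand-ins and not part of this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storeReq is a hypothetical stand-in for a "create" body: Importance has no
// omitempty tag, so a zero value is still serialized.
type storeReq struct {
	Content    string  `json:"content"`
	Importance float64 `json:"importance"`
}

// updateReq is a hypothetical stand-in for a "partial update" body: pointer
// fields with omitempty are dropped when nil and kept when set, even to zero.
type updateReq struct {
	Content    *string  `json:"content,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals v and returns the JSON text (or the error text on failure).
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encode(storeReq{Content: "x"}))       // {"content":"x","importance":0}
	fmt.Println(encode(updateReq{}))                  // {}
	fmt.Println(encode(updateReq{Importance: &zero})) // {"importance":0}
}
```

Pointers let an update distinguish "leave this field alone" from "set it to zero", which a plain `float64` with `omitempty` cannot express.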
--- a/cli/presence.go
+++ b/cli/presence.go
@ -0,0 +1,58 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
+var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
+
+// presenceScript locates the presence CLI — homelab WRAPS it, it does not
+// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
+func presenceScript() string {
+	if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
+		return p
+	}
+	home, err := os.UserHomeDir()
+	if err != nil {
+		return "presence"
+	}
+	return filepath.Join(home, "code", "scripts", "presence")
+}
+
+// validateLabel checks a presence label is <kind>:<name> with a known kind.
+func validateLabel(label string) error {
+	parts := strings.SplitN(label, ":", 2)
+	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
+		return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
+	}
+	for _, k := range validPresenceKinds {
+		if parts[0] == k {
+			return nil
+		}
+	}
+	return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
+}
+
+// presenceClaim claims label on the board, with an optional purpose note.
+func presenceClaim(label, purpose string) error {
+	if err := validateLabel(label); err != nil {
+		return err
+	}
+	args := []string{"claim", label}
+	if purpose != "" {
+		args = append(args, "--purpose", purpose)
+	}
+	return runStreaming(presenceScript(), args...)
+}
+
+// presenceRelease releases a prior claim on label.
+func presenceRelease(label string) error {
+	if err := validateLabel(label); err != nil {
+		return err
+	}
+	return runStreaming(presenceScript(), "release", label)
+}
--- a/cli/presence_test.go
+++ b/cli/presence_test.go
@ -0,0 +1,24 @@
+package main
+
+import "testing"
+
+func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
+	good := []string{
+		"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
+		"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
+	}
+	for _, l := range good {
+		if err := validateLabel(l); err != nil {
+			t.Errorf("validateLabel(%q) = %v, want nil", l, err)
+		}
+	}
+}
+
+func TestValidateLabelRejectsBadLabels(t *testing.T) {
+	bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
+	for _, l := range bad {
+		if err := validateLabel(l); err == nil {
+			t.Errorf("validateLabel(%q) = nil, want error", l)
+		}
+	}
+}
--- a/cli/probe.go
+++ b/cli/probe.go
@ -0,0 +1,76 @@
+package main
+
+import (
+	"context"
+	"crypto/tls"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"net/url"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
+const internalLBIP = "10.0.20.203"
+
+// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
+// the URL host as SNI and Host header (so the LB picks the right vhost), the Go
+// form of `curl --resolve host:443:ip`. TLS verification is skipped: these are
+// reachability probes, not security checks, and .lan vhosts may serve a non-matching cert.
+func clientDialingIP(ip string, timeout time.Duration) *http.Client {
+	d := &net.Dialer{Timeout: 8 * time.Second}
+	tr := &http.Transport{
+		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
+			if i := strings.LastIndex(addr, ":"); i >= 0 {
+				addr = ip + addr[i:]
+			}
+			return d.DialContext(ctx, network, addr)
+		},
+		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
+	}
+	return &http.Client{Timeout: timeout, Transport: tr}
+}
+
+// probeURL issues a GET and returns status code + elapsed time.
+func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
+	start := time.Now()
+	resp, err := c.Get(rawurl)
+	dur := time.Since(start)
+	if err != nil {
+		return 0, dur, err
+	}
+	resp.Body.Close()
+	return resp.StatusCode, dur, nil
+}
+
+// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
+func lbGetBody(host, path string, q url.Values) ([]byte, error) {
+	u := "https://" + host + path
+	if len(q) > 0 {
+		u += "?" + q.Encode()
+	}
+	resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	body, _ := io.ReadAll(resp.Body)
+	if resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
+	}
+	return body, nil
+}
+
+// dig runs `dig +short` against a resolver, optionally for a record type.
+func dig(name, server, rrtype string) (string, error) {
+	args := []string{"+short", "+time=3", "+tries=1"}
+	if rrtype != "" {
+		args = append(args, rrtype)
+	}
+	args = append(args, name, "@"+server)
+	out, err := exec.Command("dig", args...).Output()
+	return strings.TrimSpace(string(out)), err
+}
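Note on the dial override in `clientDialingIP`: the sketch below is a standalone, plain-HTTP illustration of the same idea, run against a loopback `httptest` server instead of the Traefik LB. It is an assumption-laden demo, not the code in this change: `pinnedClient` and `hostSeenByServer` are hypothetical names, and it swaps the `strings.LastIndex` port split for `net.SplitHostPort`/`net.JoinHostPort`.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinnedClient (hypothetical name) dials ip for every request, keeps the
// requested port, and leaves the URL host untouched, so the Host header still
// names the virtual host.
func pinnedClient(ip string) *http.Client {
	d := &net.Dialer{Timeout: 2 * time.Second}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			_, port, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			return d.DialContext(ctx, network, net.JoinHostPort(ip, port))
		},
	}
	return &http.Client{Timeout: 5 * time.Second, Transport: tr}
}

// hostSeenByServer starts a loopback test server, requests vhost through a
// pinned client, and returns the hostname the server saw in the Host header.
func hostSeenByServer(vhost string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host) // echo the Host header back to the caller
	}))
	defer srv.Close()

	ip, port, err := net.SplitHostPort(srv.Listener.Addr().String())
	if err != nil {
		return "", err
	}
	resp, err := pinnedClient(ip).Get("http://" + net.JoinHostPort(vhost, port) + "/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	host, _, err := net.SplitHostPort(string(body))
	return host, err
}

func main() {
	host, err := hostSeenByServer("grafana.example.invalid")
	fmt.Println(host, err)
}
```

The hostname is never resolved: the connection goes to the pinned IP, while the request still carries the original host for vhost routing.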
--- a/cli/probe_test.go
+++ b/cli/probe_test.go
@ -0,0 +1,49 @@
+package main
+
+import "testing"
+
+func TestQueryArg(t *testing.T) {
+	if got := queryArg([]string{"up"}, nil); got != "up" {
+		t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
+	}
+	if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
+		t.Errorf(`--json should be dropped, got %q`, got)
+	}
+	// single quoted PromQL arrives as one token
+	if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
+		t.Errorf(`quoted query mangled: %q`, got)
+	}
+	// value-flags and their values are skipped, query survives
+	vf := map[string]bool{"--since": true, "--limit": true}
+	if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
+		t.Errorf(`value-flag skipping failed: %q`, got)
+	}
+}
+
+func TestLabelStr(t *testing.T) {
+	got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
+	if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
+		t.Errorf("labelStr = %q", got)
+	}
+	if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
+		t.Errorf("labelStr (no __name__) = %q", got)
+	}
+}
+
+func TestOneLineList(t *testing.T) {
+	if got := oneLineList("  "); got != "(none)" {
+		t.Errorf("empty = %q, want (none)", got)
+	}
+	if got := oneLineList("a\nb"); got != "a, b" {
+		t.Errorf("multi = %q, want 'a, b'", got)
+	}
+}
+
+func TestHostOnly(t *testing.T) {
+	if got := hostOnly("foo.me/path"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+	if got := hostOnly("foo.me"); got != "foo.me" {
+		t.Errorf("hostOnly = %q", got)
+	}
+}
--- a/cli/repo.go
+++ b/cli/repo.go
@ -0,0 +1,101 @@
+package main
+
+import (
+	"os"
+	"os/exec"
+	"os/user"
+	"path/filepath"
+	"strings"
+)
+
+// preferRemote picks the canonical remote: forgejo if present, else origin,
+// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
+func preferRemote(remotes []string) string {
+	has := map[string]bool{}
+	for _, r := range remotes {
+		has[r] = true
+	}
+	switch {
+	case has["forgejo"]:
+		return "forgejo"
+	case has["origin"]:
+		return "origin"
+	case len(remotes) > 0:
+		return remotes[0]
+	default:
+		return ""
+	}
+}
+
+// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
+func hasGitCryptAttr(gitattributes string) bool {
+	return strings.Contains(gitattributes, "filter=git-crypt")
+}
+
+// gitCryptFlags are the per-command flags that disable smudge/clean so git
+// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
+func gitCryptFlags() []string {
+	return []string{
+		"-c", "filter.git-crypt.smudge=cat",
+		"-c", "filter.git-crypt.clean=cat",
+		"-c", "filter.git-crypt.required=false",
+	}
+}
+
+// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
+func gitOutput(dir string, args ...string) (string, error) {
+	cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
+	out, err := cmd.Output()
+	return strings.TrimSpace(string(out)), err
+}
+
+func gitRepoRoot(dir string) (string, error) {
+	return gitOutput(dir, "rev-parse", "--show-toplevel")
+}
+
+// gitRemotes lists configured remote names for the repo at dir.
+func gitRemotes(dir string) ([]string, error) {
+	out, err := gitOutput(dir, "remote")
+	if err != nil {
+		return nil, err
+	}
+	if out == "" {
+		return nil, nil
+	}
+	return strings.Split(out, "\n"), nil
+}
+
+// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
+func isGitCryptRepo(repoRoot string) bool {
+	b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
+	if err != nil {
+		return false
+	}
+	return hasGitCryptAttr(string(b))
+}
+
+// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
+// else nil. These are injected per-command and never persisted.
+func cryptFlagsFor(repoRoot string) []string {
+	if isGitCryptRepo(repoRoot) {
+		return gitCryptFlags()
+	}
+	return nil
+}
+
+// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
+func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
+	full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
+	return runStreamingIn("", "git", full...)
+}
+
+// currentUser returns the OS username for branch naming (<user>/<topic>).
+func currentUser() string {
+	if u := os.Getenv("USER"); u != "" {
+		return u
+	}
+	if u, err := user.Current(); err == nil && u.Username != "" {
+		return u.Username
+	}
+	return "user"
+}
--- a/cli/repo_test.go
+++ b/cli/repo_test.go
@ -0,0 +1,37 @@
+package main
+
+import "testing"
+
+func TestPreferRemote(t *testing.T) {
+	cases := []struct {
+		in   []string
+		want string
+	}{
+		{[]string{"origin", "forgejo"}, "forgejo"},
+		{[]string{"forgejo"}, "forgejo"},
+		{[]string{"origin"}, "origin"},
+		{[]string{"upstream"}, "upstream"},
+		{nil, ""},
+	}
+	for _, c := range cases {
+		if got := preferRemote(c.in); got != c.want {
+			t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
+		}
+	}
+}
+
+func TestHasGitCryptAttr(t *testing.T) {
+	if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
+		t.Error("expected git-crypt detected")
+	}
+	if hasGitCryptAttr("*.md text\n*.png binary") {
+		t.Error("expected no git-crypt")
+	}
+}
+
+func TestGitCryptFlagsShape(t *testing.T) {
+	f := gitCryptFlags()
+	if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
+		t.Fatalf("unexpected git-crypt flags: %v", f)
+	}
+}
--- a/cli/run.go
+++ b/cli/run.go
@ -0,0 +1,23 @@
+package main
+
+import (
+	"os"
+	"os/exec"
+)
+
+// runStreaming executes name with args, wiring std streams to this process so
+// the caller sees live output, and returns the command's error (non-nil on
+// non-zero exit — preserved so homelab's own exit code reflects the child's).
+func runStreaming(name string, args ...string) error {
+	return runStreamingIn("", name, args...)
+}
+
+// runStreamingIn is runStreaming with a working directory (empty = inherit).
+func runStreamingIn(dir, name string, args ...string) error {
+	cmd := exec.Command(name, args...)
+	cmd.Dir = dir
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	cmd.Stdin = os.Stdin
+	return cmd.Run()
+}
--- a/cli/stack.go
+++ b/cli/stack.go
@ -0,0 +1,54 @@
+package main
+
+import (
+	"fmt"
+	"os"
+	"path/filepath"
+	"sort"
+	"strings"
+)
+
+// findInfraRoot walks up from start to the infra repo root — the directory
+// holding both terragrunt.hcl and a stacks/ directory.
+func findInfraRoot(start string) (string, error) {
+	dir := start
+	for {
+		if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
+			return dir, nil
+		}
+		parent := filepath.Dir(dir)
+		if parent == dir {
+			return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
+		}
+		dir = parent
+	}
+}
+
+// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
+func resolveStack(infraRoot, name string) (string, error) {
+	dir := filepath.Join(infraRoot, "stacks", name)
+	if isDir(dir) {
+		return dir, nil
+	}
+	avail := listStacks(infraRoot)
+	return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
+}
+
+// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
+func listStacks(infraRoot string) []string {
+	entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
+	if err != nil {
+		return nil
+	}
+	var out []string
+	for _, e := range entries {
+		if e.IsDir() {
+			out = append(out, e.Name())
+		}
+	}
+	sort.Strings(out)
+	return out
+}
+
+func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
+func isDir(p string) bool  { fi, err := os.Stat(p); return err == nil && fi.IsDir() }
--- a/cli/stack_test.go
+++ b/cli/stack_test.go
@ -0,0 +1,52 @@
+package main
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+func newInfraTree(t *testing.T, stacks ...string) string {
+	t.Helper()
+	root := t.TempDir()
+	if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	for _, s := range stacks {
+		if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
+			t.Fatal(err)
+		}
+	}
+	return root
+}
+
+func TestFindInfraRootWalksUp(t *testing.T) {
+	root := newInfraTree(t, "vault")
+	got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
+	if err != nil {
+		t.Fatalf("findInfraRoot error: %v", err)
+	}
+	if got != root {
+		t.Fatalf("findInfraRoot = %q, want %q", got, root)
+	}
+}
+
+func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
+	if _, err := findInfraRoot(t.TempDir()); err == nil {
+		t.Fatal("expected error outside an infra checkout")
+	}
+}
+
+func TestResolveStack(t *testing.T) {
+	root := newInfraTree(t, "vault", "monitoring")
+	dir, err := resolveStack(root, "vault")
+	if err != nil {
+		t.Fatalf("resolveStack error: %v", err)
+	}
+	if want := filepath.Join(root, "stacks", "vault"); dir != want {
+		t.Fatalf("resolveStack = %q, want %q", dir, want)
+	}
+	if _, err := resolveStack(root, "nonesuch"); err == nil {
+		t.Fatal("expected error for unknown stack")
+	}
+}
--- a/cli/telemetry.go
+++ b/cli/telemetry.go
@ -0,0 +1,62 @@
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"os"
+	"strconv"
+	"strings"
+	"time"
+)
+
+// usageJob is the Loki stream job label for homelab usage telemetry.
+const usageJob = "homelab-usage"
+
+// emitUsage best-effort records one verb invocation to Loki for cross-user
+// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
+// only a 0/1 exit flag + CLI version. NEVER args, paths, flags, or secrets. It
+// must never affect the command: all errors are swallowed and a tight timeout
+// bounds the cost. Opt out with HOMELAB_TELEMETRY=0 (or off/false/no).
+func emitUsage(verb string, runErr error) {
+	switch os.Getenv("HOMELAB_TELEMETRY") {
+	case "0", "off", "false", "no":
+		return
+	}
+	if verb == "" || strings.HasPrefix(verb, "usage") {
+		return // don't self-record the analytics reader
+	}
+	exit := 0
+	if runErr != nil {
+		exit = 1
+	}
+	body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
+		Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
+		Values: [][2]string{{
+			strconv.FormatInt(time.Now().UnixNano(), 10),
+			"exit=" + strconv.Itoa(exit) + " ver=" + version,
+		}},
+	}}})
+	if err != nil {
+		return
+	}
+	req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
+	if err != nil {
+		return
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
+	if err != nil {
+		return
+	}
+	resp.Body.Close()
+}
+
+type lokiPush struct {
+	Streams []lokiStream `json:"streams"`
+}
+
+type lokiStream struct {
+	Stream map[string]string `json:"stream"`
+	Values [][2]string       `json:"values"`
+}
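Note on the push payload built by `emitUsage`: the sketch below shows, under stated assumptions, the JSON that the `lokiPush`/`lokiStream` shapes would produce for one invocation. It uses a fixed timestamp so the output is reproducible. The `push`, `stream`, and `usagePayload` names and the sample user, verb, and version values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// push and stream are hypothetical mirrors of the Loki push body: a list of
// streams, each with a label set and [timestamp, line] value pairs.
type push struct {
	Streams []stream `json:"streams"`
}

type stream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

// usagePayload renders one usage record. tsNano is passed in (rather than
// read from the clock) so the result is deterministic.
func usagePayload(user, verb string, exit int, ver string, tsNano int64) (string, error) {
	b, err := json.Marshal(push{Streams: []stream{{
		Stream: map[string]string{"job": "homelab-usage", "user": user, "verb": verb},
		Values: [][2]string{{
			strconv.FormatInt(tsNano, 10),
			"exit=" + strconv.Itoa(exit) + " ver=" + ver,
		}},
	}}})
	return string(b), err
}

func main() {
	s, err := usagePayload("emo", "tf", 0, "dev", 1750000000000000000)
	fmt.Println(s, err)
}
```

Only the three labels and the short `exit=… ver=…` line leave the machine, which is what keeps the stream low-cardinality and free of arguments or paths.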
--- a/cli/update_viktorbarzin_me.go
+++ b/cli/update_viktorbarzin_me.go
@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
 	if err != nil {
 		return errors.Wrapf(err, "Error reading response")
 	}
-	glog.Infof("Response:", string(responseBody))
+	glog.Infof("Response: %s", string(responseBody))
 	return nil
 }
--- a/cli/usage_test.go
+++ b/cli/usage_test.go
@ -0,0 +1,18 @@
+package main
+
+import (
+	"strings"
+	"testing"
+)
+
+func TestUsageQuery(t *testing.T) {
+	got := usageQuery("30d", "")
+	want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
+	if got != want {
+		t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
+	}
+	withUser := usageQuery("7d", "emo")
+	if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
+		t.Errorf("usageQuery with user missing filter/range: %q", withUser)
+	}
+}
--- a/cli/woodpecker.go
+++ b/cli/woodpecker.go
@ -0,0 +1,191 @@
+package main
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net"
+	"net/http"
+	"os"
+	"os/exec"
+	"strings"
+	"time"
+)
+
+// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
+// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
+// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
+const (
+	wpHost = "ci.viktorbarzin.me"
+	wpLBIP = "10.0.20.203"
+)
+
+type wpClient struct {
+	base  string
+	token string
+	http  *http.Client
+}
+
+// wpToken reads WOODPECKER_TOKEN (or WP_TOKEN), else the canonical Vault path.
+func wpToken() string {
+	if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
+		return t
+	}
+	out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
+	if err != nil {
+		return ""
+	}
+	return strings.TrimSpace(string(out))
+}
+
+func newWPClient() (*wpClient, error) {
+	tok := wpToken()
+	if tok == "" {
+		return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
+	}
+	ip := firstEnv("HOMELAB_WP_IP")
+	if ip == "" {
+		ip = wpLBIP
+	}
+	dialer := &net.Dialer{Timeout: 8 * time.Second}
+	tr := &http.Transport{
+		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
+			if strings.HasPrefix(addr, wpHost+":") {
+				addr = ip + addr[strings.LastIndex(addr, ":"):]
+			}
+			return dialer.DialContext(ctx, network, addr)
+		},
+	}
+	return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
+}
+
+// getJSON GETs path into v, retrying the transient empty/5xx responses the
+// Woodpecker API intermittently returns under load.
+func (c *wpClient) getJSON(path string, v interface{}) error {
+	var lastErr error
+	for attempt := 0; attempt < 5; attempt++ {
+		if attempt > 0 {
+			time.Sleep(2 * time.Second)
+		}
+		req, _ := http.NewRequest("GET", c.base+path, nil)
+		req.Header.Set("Authorization", "Bearer "+c.token)
+		resp, err := c.http.Do(req)
+		if err != nil {
+			lastErr = err
+			continue
+		}
+		body, _ := io.ReadAll(resp.Body)
+		resp.Body.Close()
+		if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
+			lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
+			continue
+		}
+		if resp.StatusCode >= 300 {
+			return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
+		}
+		return json.Unmarshal(body, v)
+	}
+	return lastErr
+}
+
+type wpPipeline struct {
+	Number  int    `json:"number"`
+	Status  string `json:"status"`
+	Event   string `json:"event"`
+	Commit  string `json:"commit"`
+	Message string `json:"message"`
+}
+
+func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
+	var ps []wpPipeline
+	err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
+	return ps, err
+}
+
+// findPipeline returns the pipeline for commit (prefix match), or the latest when
+// commit is empty.
+func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
+	ps, err := c.recentPipelines(repoID, 25)
+	if err != nil {
+		return wpPipeline{}, err
+	}
+	if len(ps) == 0 {
+		return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
+	}
+	if commit == "" {
+		return ps[0], nil
+	}
+	for _, p := range ps {
+		if strings.HasPrefix(p.Commit, commit) {
+			return p, nil
+		}
+	}
+	return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
+}
+
+func (c *wpClient) repoID() (int, error) {
+	owner, repo, err := repoOwnerName()
+	if err != nil {
+		return 0, err
+	}
+	var r struct {
+		ID int `json:"id"`
+	}
+	if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
+		return 0, err
+	}
+	if r.ID == 0 {
+		return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
+	}
+	return r.ID, nil
+}
+
+// repoOwnerName derives <owner>/<repo> from the cwd git remote.
+func repoOwnerName() (string, string, error) {
+	cwd, _ := os.Getwd()
+	root, err := gitRepoRoot(cwd)
+	if err != nil {
+		return "", "", fmt.Errorf("not in a git repository: %w", err)
+	}
+	remote := preferRemote(remotesOrEmpty(root))
+	url, err := gitOutput(root, "remote", "get-url", remote)
+	if err != nil {
+		return "", "", err
+	}
+	return parseOwnerRepo(url)
+}
+
+// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
+func parseOwnerRepo(url string) (string, string, error) {
+	u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
+	u = strings.TrimSuffix(u, "/")
+	if i := strings.Index(u, "://"); i >= 0 {
+		u = u[i+3:]
+	}
+	u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
+	parts := strings.Split(u, "/")
+	if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
+		return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
+	}
+	return parts[len(parts)-2], parts[len(parts)-1], nil
+}
+
+func isTerminalStatus(s string) bool {
+	switch s {
+	case "success", "failure", "error", "killed", "declined", "blocked":
+		return true
+	}
+	return false
+}
+
+func isFailureStatus(s string) bool {
+	return s == "failure" || s == "error" || s == "killed" || s == "declined"
+}
+
+func min(a, b int) int {
+	if a < b {
+		return a
+	}
+	return b
+}
--- a/cli/woodpecker_test.go
+++ b/cli/woodpecker_test.go
@ -0,0 +1,40 @@
+package main
+
+import "testing"
+
+func TestParseOwnerRepo(t *testing.T) {
+	cases := []struct{ in, owner, repo string }{
+		{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
+		{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
+		{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
+		{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
+	}
+	for _, c := range cases {
+		o, r, err := parseOwnerRepo(c.in)
+		if err != nil || o != c.owner || r != c.repo {
+			t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
+		}
+	}
+	if _, _, err := parseOwnerRepo("nonsense"); err == nil {
+		t.Error("expected error for unparseable remote")
+	}
+}
+
+func TestStatusClassification(t *testing.T) {
+	for _, s := range []string{"success", "failure", "error", "killed", "declined", "blocked"} {
+		if !isTerminalStatus(s) {
+			t.Errorf("%q should be terminal", s)
+		}
+	}
+	for _, s := range []string{"running", "pending"} {
+		if isTerminalStatus(s) {
+			t.Errorf("%q should not be terminal", s)
+		}
+	}
+	if !isFailureStatus("failure") || !isFailureStatus("error") {
+		t.Error("failure/error should classify as failure")
+	}
+	if isFailureStatus("success") {
+		t.Error("success must not classify as failure")
+	}
+}
--- a/docs/adr/0001-android-emulator-in-cluster.md
+++ b/docs/adr/0001-android-emulator-in-cluster.md
@ -0,0 +1,42 @@
+---
+status: accepted
+---
+
+# The Android testing environment is a privileged KVM emulator pod in-cluster
+
+Viktor's apps are growing Android clients (first: tripit's Capacitor shell —
+see tripit ADR-0013/0014), and agents need a native Android instance to test
+changes against before shipping. All K8s nodes already run with CPU type
+`host`, so `/dev/kvm` works inside the cluster.
+
+Decision (2026-06-11): one shared **Android 16 (API 36) Google-emulator
+instance** runs as a privileged pod in namespace `android-emulator`
+(stack `stacks/android-emulator`), with `/dev/kvm` via hostPath, adb exposed
+LAN-only on the shared MetalLB IP (10.0.20.200:5555), and a noVNC screen view
+at android-emulator.viktorbarzin.lan. The SDK/system-image/AVD live on a PVC;
+the image is a slim manually-built shell.
+
+## Considered options
+
+- **devvm-local docker emulator** — rejected as the durable home: shared
+  24GB workstation, ~13GB free disk, per-machine, not shared across agents.
+- **Dedicated Proxmox VM** — rejected: burns scarce PVE host headroom 24/7
+  and adds a whole VM lifecycle for one emulator.
+- **redroid (container-native Android)** — rejected: requires binder kernel
+  modules on every node (documented binderfs incompatibilities), max
+  Android 15; most invasive for the least version coverage.
+- **budtmo/docker-android** — rejected: turnkey but capped at Android 14;
+  the native features driving the Android work (Live Updates, background
+  GPS) are Android 16 behaviors, matching the real target device.
+- **/dev/kvm device plugin instead of privileged** — deferred: a new
+  cluster component to avoid one namespace-scoped exclude-list entry; the
+  exclude pattern (kured/woodpecker/frigate/changedetection) already exists.
+
+## Consequences
+
+- `android-emulator` joins the Kyverno `security_policy_exclude_namespaces`
+  list (privileged allowed; registry policy also bypassed in-namespace).
+- adb is unauthenticated by design — the LB IP must remain LAN-only.
+- Single shared instance: concurrent agent sessions share Android state;
+  long destructive work should presence-claim `service:android-emulator`.
+- Rendering is swiftshader (CPU) — the contended T4 stays out of the path.
--- a/docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md
+++ b/docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md
@ -0,0 +1,24 @@
+---
+status: accepted
+date: 2026-06-12
+---
+
+# All owned images build off-infra on GitHub Actions and live on ghcr.io
+
+In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it.
+
+## Considered options
+
+- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected.
+- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes.
+- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected.
+
+## Consequences
+
+- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster").
+- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally.
+- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source).
+- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro).
+- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it.
+- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern.
+- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup.
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@@ -0,0 +1,30 @@
+# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
+
+Status: accepted (extends ADR-0002)
+
+## Context
+
+Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
+
+The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
+
+## Decision
+
+Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
+
+- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
+- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
+- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
+
+## Considered options
+
+- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
+- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
+
+## Consequences
+
+- Divergence becomes structurally impossible — one push target per repo.
+- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
+- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
+- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.
--- a/docs/adr/0004-homelab-unified-cli.md
+++ b/docs/adr/0004-homelab-unified-cli.md
@@ -0,0 +1,30 @@
+# homelab: a unified infra-ops CLI grown in place from infra/cli
+
+Agents re-derive the same operational command boilerplate every session — mining
+51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
+(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
+the deterministic, repeated **actions** (not judgment) agents run — composable in
+bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
+grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
+alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
+file (the infra repo deploys continuously and does not cut semver tags).
+
+## Considered options
+
+- **Its own top-level repo** (the original plan) — rejected in favour of keeping
+  it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
+  Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
+  GitOps continuous-deploy.
+- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
+  webhook use-cases.
+- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
+  recurring action surface (methodology skills; third-party/owned MCP such as
+  phpIPAM, which homelab does NOT duplicate).
+
+## Consequences
+
+- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
+  in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
+  and falls through to the legacy `-use-case` path verbatim.
+- Distribution: built from source to `/usr/local/bin/homelab` during devvm
+  provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.
--- a/docs/adr/0005-homelab-v01-scope.md
+++ b/docs/adr/0005-homelab-v01-scope.md
@@ -0,0 +1,23 @@
+# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
+
+v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
+(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
+force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
+commands and where agents lose the most time and leak the most presence claims.
+
+v0.1 enforces **no** homelab-level permission gating: everything is allowed,
+relying on existing gates (harness permission mode, presence claims, plan
+approval). But every verb records a `read|write` tier (visible in `manifest`), so
+a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
+later with zero restructuring.
+
+## Considered options
+
+- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
+  value, but defers the toil that motivated the project.
+- **One domain deep (k8s)** — cleanest template, narrow day-one value.
+
+We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
+the extra complexity (worktree lifecycle, git-crypt flag injection, presence
+coupling, branch-protection PR fallback) for the biggest immediate toil
+reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.
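The recorded `read|write` tier is what makes the later classifier hook cheap to add. A minimal sketch of how such a hook could consume it, in Python for brevity (the CLI itself is Go). The manifest shape, the `classify` helper, the example tiers, and the decision strings are all hypothetical; only the tiers and the auto-allow-reads / prompt-writes split come from this ADR.

```python
# Hypothetical manifest excerpt: verb path -> recorded tier.
# The real `homelab manifest` output format is not shown in this ADR,
# and the tiers below are assumed for illustration only.
MANIFEST = {
    "tf validate": "read",
    "tf apply": "write",
    "work land": "write",
}


def classify(command: str, manifest: dict) -> str:
    """Return 'allow' for read-tier homelab verbs, 'prompt' otherwise."""
    words = command.split()
    if not words or words[0] != "homelab":
        return "prompt"  # not a homelab call: leave it to the existing gates
    # Longest-prefix match, so "tf apply" is found before any shorter entry.
    for end in range(len(words), 1, -1):
        verb = " ".join(words[1:end])
        if verb in manifest:
            return "allow" if manifest[verb] == "read" else "prompt"
    return "prompt"  # unknown verb: stay conservative
```

Because the tier travels with the verb, a classifier like this needs no knowledge of what each verb does.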
--- a/docs/adr/0006-homelab-work-and-tf.md
+++ b/docs/adr/0006-homelab-work-and-tf.md
@@ -0,0 +1,29 @@
+# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
+
+Four behaviours of the infra-loop verbs are surprising enough to record:
+
+1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
+   native harness worktree tool.** A CLI is a child process and cannot change the
+   agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
+   creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
+   prints the path — the agent enters it with native `EnterWorktree({path})`.
+
+2. **`work land` is auto-land, but gated on verification.** It merges master in →
+   runs verification → pushes `HEAD:master` (fetch+merge+retry on
+   non-fast-forward) → falls back to pushing the feature branch for a PR when the
+   direct push is rejected (branch protection). It **refuses to push when it
+   cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
+   `--no-verify` is passed — added after an accidental smoke-test land pushed
+   unverified WIP to master (benign: the infra CI applied 0 stacks because the
+   diff was `cli/`-only, but an unverified land must be deliberate, not default).
+
+3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
+   Local applies are out-of-band (CI applies canonically on push) but happen
+   constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
+   delegates to `scripts/tg apply --non-interactive`, and **always releases on
+   exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
+   documented ~200-claim leak — and prints an out-of-band reminder.
+
+4. **Known v0.1 limitation:** `work land` does not yet block on CI to green; that
+   arrives with the ci/deploy watch verb-group. It prints a reminder to follow
+   the pipeline manually.
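Item 3's "always releases on exit" guarantee can be sketched as below. This is a Python stand-in, not the CLI's Go code: a lock-guarded flag plays the role of `sync.Once`, and `claim_fn`/`release_fn` are hypothetical hooks for the presence system.

```python
import signal
import threading


class Claim:
    """Presence claim that is released exactly once, however the run ends."""

    def __init__(self, name, claim_fn, release_fn):
        self.name = name
        self._claim_fn = claim_fn      # hypothetical presence hooks
        self._release_fn = release_fn
        self._lock = threading.Lock()
        self._released = False
        self._old = None

    def release(self):
        # Idempotent: the stand-in for Go's sync.Once.
        with self._lock:
            if self._released:
                return
            self._released = True
        self._release_fn(self.name)

    def _on_signal(self, signum, frame):
        self.release()
        raise SystemExit(128 + signum)

    def __enter__(self):
        self._claim_fn(self.name)
        try:
            # SIGINT already unwinds via KeyboardInterrupt; SIGTERM needs help.
            self._old = signal.signal(signal.SIGTERM, self._on_signal)
        except ValueError:
            self._old = None  # not the main thread: skip signal handling
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._old is not None:
            signal.signal(signal.SIGTERM, self._old)
        self.release()
        return False  # never swallow the apply's own error
```

Wrapping the apply in `with Claim("stack:<name>", ...)` covers the normal, error, and signal paths with a single release call.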
--- a/docs/adr/0007-homelab-k8s-verbs.md
+++ b/docs/adr/0007-homelab-k8s-verbs.md
@@ -0,0 +1,30 @@
+# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
+
+v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
+(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
+than every other domain combined).
+
+It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
+one app, so `<app>` defaults to the namespace, and the target defaults to
+`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
+`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
+specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
+
+Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
+`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
+
+## Decisions worth recording
+
+- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
+  `scale`/`create`). They stay raw `kubectl`, by design, per the repo's
+  Terraform-only policy — the corpus confirms they're low-frequency, and a
+  friendly verb would normalise a policy violation.
+- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
+  config mutation and forbidden; the verb cannot target them.
+- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
+  sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
+  `psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
+  `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
+  the pod env and never appears on the command line.
+- Read verbs were smoke-tested against the live cluster; write verbs are
+  unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.
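The app→namespace→pod defaults described at the top of this ADR can be sketched as below, in Python rather than the CLI's Go. `Target`, `resolve`, and `logs_argv` are hypothetical names, and `-l`/`--tty` handling is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    namespace: str
    resource: str                    # handed to kubectl, e.g. "deploy/tripit"
    container: Optional[str] = None


def resolve(app, namespace=None, pod=None, container=None) -> Target:
    """Apply the defaults: namespace = app, target = deploy/<app>."""
    ns = namespace or app            # most namespaces hold exactly one app
    resource = f"pod/{pod}" if pod else f"deploy/{app}"
    return Target(ns, resource, container)


def logs_argv(t: Target) -> list:
    """Build a kubectl logs invocation; auth comes from the ambient kubeconfig."""
    argv = ["kubectl", "-n", t.namespace, "logs", t.resource]
    if t.container:
        argv += ["-c", t.container]
    return argv
```

A bare `resolve("tripit")` needs no flags, while a multi-pod namespace such as `dbaas` has to name the pod explicitly.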
--- a/docs/adr/0008-homelab-memory-verbs.md
+++ b/docs/adr/0008-homelab-memory-verbs.md
@@ -0,0 +1,30 @@
+# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
+
+v0.3 adds the memory verb-group so agents can search and navigate memory from the
+CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
+ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
+frontend over it**. `homelab memory` is a thin HTTP client over the same API,
+using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
+`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
+API directly, it **works even when the MCP frontend is down** — the recurring
+MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
+offline for the entire session this was built in).
+
+Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
+`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
+the live API including a store→recall→delete round-trip — full data-plane parity
+with the MCP.
+
+## Deprecation path (deliberate follow-up — NOT done in v0.3)
+
+The MCP is more than tools: the **per-prompt auto-recall hook** and the
+**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
+a separate, sequenced change:
+
+1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
+   to `homelab memory store`.
+2. Update the CLAUDE.md memory policy to point at the CLI.
+3. Uninstall the MCP.
+
+Done CLI-first (verbs proven before touching the every-prompt path) so a
+regression can't silently break auto-recall/auto-learn fleet-wide.
--- a/docs/adr/0009-homelab-ci-deploy-verbs.md
+++ b/docs/adr/0009-homelab-ci-deploy-verbs.md
@@ -0,0 +1,29 @@
+# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
+
+v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
+a build/deploy to completion), proven during the session that built it (hours
+spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
+retrigger logic for a single CI incident).
+
+## Decisions
+
+- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
+  not its Postgres schema (which drifts across upgrades — column renames bit us
+  mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
+  while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
+  equivalent of the house `curl --resolve` pattern). Token from
+  `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
+  git remote via `/api/repos/lookup/<owner>/<repo>`.
+- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
+  under load (it flapped through the whole build session); `getJSON` retries
+  empties with backoff so `ci watch` is reliable exactly when it's needed.
+- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
+  on the landed commit and fails if the pipeline does, closing the gap that
+  ADR-0006 recorded as a known v0.1 limitation. `--no-ci-watch` opts out.
+- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
+  the deployment image to reference the expected sha, *then* blocks on rollout
+  status (kubectl-based; reuses the k8s helpers).
+- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
+  endpoints were the least reliable this session (often empty); `status`/`watch`
+  rely on the list endpoint that works. A DB-backed `ci logs` is a possible
+  follow-up if the API path stays flaky.
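The "retries are mandatory" decision amounts to a bounded retry loop with exponential backoff. A minimal sketch in Python (the real `getJSON` is Go and its signature is not shown here). `fetch` is a hypothetical callable returning `(status_code, parsed_body_or_None)`, and the attempt count and delays are assumed values.

```python
import time


class EmptyResponse(Exception):
    """Raised when every attempt came back empty or 5xx."""


def get_json_with_retry(fetch, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it yields a non-empty 2xx body, backing off between tries."""
    last = None
    for i in range(attempts):
        status, body = fetch()
        if 200 <= status < 300 and body:
            return body
        if 400 <= status < 500:
            # Client errors (bad token, unknown repo) will not heal on retry.
            raise RuntimeError(f"client error {status}")
        last = (status, body)
        if i < attempts - 1:
            sleep(base_delay * 2 ** i)  # 0.5s, 1s, 2s, ...
    raise EmptyResponse(f"no usable response after {attempts} attempts: {last}")
```

Injecting `sleep` keeps the loop unit-testable without real waiting. Note that this sketch treats any empty body as retryable, including a legitimately empty list.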
--- a/docs/adr/0010-homelab-net-obs-verbs.md
+++ b/docs/adr/0010-homelab-net-obs-verbs.md
@@ -0,0 +1,37 @@
+# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
+
+v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
+test the user posed mid-build: *does the verb save reasoning, or only typing?* A
+wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
+keystrokes but not thought. These four save thought — the reasoning they encode
+is **which endpoint, reached how, with what auth/URL shape** — re-derived every
+time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
+get`, which are thin wrappers; see the session discussion.)
+
+## Decisions
+
+- **Internal ingresses, reached via the LB.** Everything routes through the
+  Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
+  Go form of the house `curl --resolve host:443:10.0.20.203` pattern
+  (`probe.go: clientDialingIP`). Verified live before building: Prometheus
+  (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
+  answer JSON over the LB with **no auth gate and no port-forward** — so these
+  stay clean HTTP clients, not kubectl wrappers.
+- **`net check` is two-legged on purpose.** It resolves the host via public DNS
+  (→ Cloudflare) AND dials the internal LB, reporting both — because the useful
+  question is *where* a break is (CF edge vs the app vs the LB path), which a
+  single curl can't answer. The external leg forces public resolution (the devvm
+  resolver is split-horizon and would otherwise hit the LB for both).
+- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
+  `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
+  Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
+  alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
+  queryable through the working endpoint — so no new dependency.
+- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
+  raw `*.svc` services) that would force port-forward/`kubectl run`. The
+  reasoning-savings there don't beat the added moving parts; kept out of scope.
+- **No `node`/`secret` group.** Same test: their high-volume parts are
+  command-wrappers (low savings); only compound node ops (serial console, VM
+  wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
+  unless a concrete pain surfaces — the high-value deterministic surface
+  (tf/work/ci/k8s/memory + these probes) is now covered.
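The `metrics alerts` decision relies on the standard Prometheus instant-query endpoint (`/api/v1/query`) rather than `/api/v1/alerts`. A minimal Python sketch of the query URL and response parsing; the helper names are hypothetical, and the LB dial (`10.0.20.203` with the host preserved as SNI) is left out.

```python
import json
from urllib.parse import urlencode

PROM_HOST = "prometheus-query.viktorbarzin.lan"  # hostname from the ADR text


def firing_alerts_url(host: str = PROM_HOST) -> str:
    """Instant query for the synthetic ALERTS series, firing state only."""
    query = 'ALERTS{alertstate="firing"}'
    return f"https://{host}/api/v1/query?" + urlencode({"query": query})


def parse_firing(body: str) -> list:
    """Extract distinct alert names from a Prometheus instant-query response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    names = {s["metric"].get("alertname", "") for s in doc["data"]["result"]}
    return sorted(names)
```

Each series in the result carries the alert's labels in `metric`, so the same response could also be grouped by other labels if needed.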
--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@@ -0,0 +1,34 @@
+# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
+
+v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
+exists to answer the question that drove the whole CLI — *which verbs are worth
+adding next* — with data instead of one maintainer's habits (the earlier mining
+covered a single user's ~51k commands, so the surface is shaped to that user).
+
+## Decisions
+
+- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
+  the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
+  don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
+  `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
+  the analytics reader doesn't pollute its own data.
+- **Payload is deliberately minimal: verb path + exit code only.** Labels
+  `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
+  **No args, paths, flags, hostnames, or secrets** ever leave the process — the
+  emit sees only the matched verb name, not the arguments. This is what makes
+  cross-user aggregation safe.
+- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
+  CLI writes its own invocations (attributed to its OS user) to the shared Loki
+  push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
+  back with a LogQL metric query. This is the privacy-preserving resolution to
+  "what does everyone (e.g. another user) use" — it never touches anyone's
+  `~/.claude`, which the org per-user policy bars (see the per-user red-line in
+  managed-settings; reading another user's home is off-limits even for an owner
+  in-session — a fresh session under changed MDM policy is the only legitimate
+  path, and even then this telemetry is the better answer).
+- **Best-effort, never affects the command.** All errors swallowed; an 800ms
+  client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
+  must never slow or break the tool it measures.
+- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
+  path (same host, same LB dial). Presence MySQL was the alternative (queryable
+  SQL) but would add a write dependency and creds; Loki needs neither.
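The minimal payload and the best-effort emit can be sketched as below, in Python (the CLI is Go). The body follows Loki's JSON push format (`streams` → `stream` labels + `values` pairs of nanosecond timestamp and line). The function names and the `push_url` parameter are hypothetical, and the LB dial is omitted.

```python
import json
import os
import time
import urllib.request


def telemetry_enabled(env=os.environ) -> bool:
    """Opt-out switch: HOMELAB_TELEMETRY=0 disables the emit."""
    return env.get("HOMELAB_TELEMETRY") != "0"


def build_push(verb, exit_code, user, version, now_ns=None) -> dict:
    """Loki push body carrying only the verb path and exit code."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [{
            "stream": {"job": "homelab-usage", "user": user, "verb": verb},
            "values": [[ts, f"exit={exit_code} ver={version}"]],
        }]
    }


def emit(push_url, payload, timeout=0.8) -> None:
    """Fire-and-forget POST; every failure is swallowed by design."""
    try:
        req = urllib.request.Request(
            push_url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=timeout).close()
    except Exception:
        pass  # telemetry must never break the command it measures
```

Nothing in `build_push` accepts arguments, paths, or flags, so they cannot leak into the payload by accident.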
--- a/docs/adr/0012-homelab-ha-verbs.md
+++ b/docs/adr/0012-homelab-ha-verbs.md
@@ -0,0 +1,54 @@
+# homelab Home Assistant verbs: token resolution + host SSH, not entity control
+
+v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
+operator's sessions: across ~1,900 shell commands the single most-repeated line
+(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
+and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
+a shell function ~30× — both re-derived from scratch every session. The existing
+`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
+arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
+cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
+gap for every user in every directory.
+
+## Decisions
+
+- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
+  does entity state and control (`get_state`, `call_service`, history, logs).
+  Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
+  — we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
+  *resolution* and host *SSH*, neither of which an API-only MCP can provide. The
+  value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
+- **`ha token` resolves live from the cluster, not from an env var.** It reads
+  the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
+  `london`) via the ambient kubeconfig. This is robust to env drift — the precise
+  failure that made agents re-derive the pipeline. Read-tier, prints the bare
+  token to stdout so it composes in `$(…)`, mirroring `memory secret`.
+- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
+  It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob
+  also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
+  admins can read — so the verb hung/failed for the non-admin operator it was
+  built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
+  OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
+  the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
+  the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
+  the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
+  — this is purely additive.
+- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
+  use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
+  `UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
+  TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
+  is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
+  whoever first wrote the workflow; that user's key must be enrolled on the HA
+  host. Write-tier (runs an arbitrary remote command).
+- **sofia is the default; london is structural.** The devvm sits on the Sofia
+  LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
+  (`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
+  works (a pure secret read), but `ha ssh --instance london` generally won't
+  connect from here — london is remote. We model it correctly rather than
+  pretend it's reachable.
+- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
+  the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
+  `check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
+  already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
+  `usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
+  still hand-rolled often.
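The fixed `ha ssh` flag set reduces to building one deterministic argument vector. A minimal Python sketch (the CLI is Go); `ha_ssh_argv` and the `home` parameter are hypothetical, while the flags and the instance map are taken from the decisions above.

```python
import os

# Instance map as given in the ADR text.
INSTANCES = {
    "sofia": "vbarzin@192.168.1.8",
    "london": "hassio@192.168.8.103",
}


def ha_ssh_argv(remote_cmd: str, instance: str = "sofia", home: str = None) -> list:
    """Build the unattended ssh invocation for one HA instance."""
    home = home or os.path.expanduser("~")
    return [
        "ssh",
        "-F", "/dev/null",                      # ignore the user's ssh-config
        "-i", f"{home}/.ssh/id_ed25519",        # the invoking user's key
        "-o", "StrictHostKeyChecking=no",       # no host-key prompt
        "-o", "UserKnownHostsFile=/dev/null",   # and no host-key record
        "-o", "BatchMode=yes",                  # never ask for a password
        "-o", "ConnectTimeout=10",              # fail fast, never hang
        INSTANCES[instance],
        remote_cmd,
    ]
```

Passing the result to `subprocess.run` as a list avoids a local shell, though the remote side still interprets `remote_cmd` with its own shell.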
--- a/docs/adr/0013-homelab-browser-verbs.md
+++ b/docs/adr/0013-homelab-browser-verbs.md
@@ -0,0 +1,75 @@
+# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
+
+v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
+capability that already existed but was undiscoverable: driving the cluster's
+**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
+`svc/chrome-service:9222`) from the devvm, for sites that detect and block
+headless automation.
+
+## Motivating incident (2026-06-22)
+
+Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
+portal: the headless `@playwright/mcp` browser loaded the site and filled the
+entire multi-step form, but the **final submit silently failed** — Fixflo's
+pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
+spinner hung, no issue was created. Root cause = headless-Chrome detection. The
+fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
+submitted first try (Fixflo ref IS22657587). That capability was documented
+(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
+it took ~40 min, three redundant full form re-runs, and a user hint. The agent
+also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
+of inspecting the network panel.
+
+## Decisions
+
+- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
+  rejected: the CLI is run every session (so the verb is *discoverable*), is
+  versioned, multi-user, and test-covered. A private, untested skill is none of
+  those. The command owns only the deterministic *mechanics* (port-forward,
+  stealth injection, lifecycle) — the agent supplies the Playwright script, so
+  *judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
+- **The failure was judgment, not setup friction**, so the CLI is paired with a
+  one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
+  payload in `browser --help`: the *when-to-use* signature (a site loads but a
+  gated action fails/hangs, or one request 500s/aborts while siblings 200 →
+  suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
+  = request resolved/intercepted by the automation layer, **not** egress;
+  egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
+  and would break the page load too). A command the agent doesn't think to run is
+  useless; the cheat-sheet is the actual fix for the misdiagnosis.
+- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
+  localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
+  NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
+  label. Readiness is asserted against `/json/version`: the endpoint must report
+  a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
+  **always** torn down (process-group kill + signal handler), on success and on
+  error — an acceptance requirement.
+- **Default to a fresh incognito context; `--shared-context` opts into the warmed
+  profile.** chrome-service is a single shared browser with a persistent profile.
+  A fresh, always-closed context is safe for concurrent callers (tripit's fare
+  scrape connects per-quote) and is what production already does. The warmed
+  persistent profile (cookies from a manual noVNC login) is opt-in for flows that
+  need a pre-logged-in session.
+- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
+  chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
+  Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
+  changes between Playwright minors; the devvm's ambient Python Playwright was
+  1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
+  regardless of local drift. `playwright-core` (not `playwright`) because no
+  browser binary is needed — we connect to the remote one.
+- **Self-provision the client lazily, no per-user setup.** The pinned client is
+  installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
+  guarded) on first use, alongside the embedded runner + stealth files. node is
+  already fleet-wide; this avoids coupling the feature to a provisioner change
+  and keeps it self-contained and self-healing. The client runs on the devvm, so
+  `setInputFiles` streams local files to the remote browser over CDP — no
+  `chmod`/staging-dir workaround on the CDP path.
+- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
+  copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
+  in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
+  `go:embed` can't reach outside the package dir, hence the vendored copy rather
+  than a path reference.
+- **Scope held at two action verbs + help.** `run` (arbitrary script — the
+  workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
+  the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
+  via `usage top` (ADR-0011) before adding more.
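The readiness assertion against `/json/version` can be sketched as a check on the `Browser` field of that endpoint's JSON. This is a Python illustration with a hypothetical helper name; fetching the body through the port-forward is left out.

```python
import json


def assert_headful(version_body: str) -> str:
    """Return the Browser string, or raise if the endpoint is not headful Chrome."""
    browser = json.loads(version_body).get("Browser", "")
    if browser.startswith("HeadlessChrome"):
        # A headless browser here would defeat the purpose of the verb.
        raise RuntimeError(f"headless browser reported: {browser}")
    if not browser.startswith("Chrome/"):
        raise RuntimeError(f"unexpected Browser field: {browser!r}")
    return browser
```

Failing hard at this point is cheaper than discovering headless detection only after a multi-step form has been filled.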
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@@ -0,0 +1,29 @@
+---
+status: accepted
+date: 2026-06-24
+---
+
+# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
+
+As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
+
+## Considered options
+
+- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
+- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
+- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
+- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
+- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
+- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
+- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
+
+## Consequences
+
+- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
+- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
+- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
+- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
+- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
+- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
+- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
+- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
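The identity rule decided in this ADR can be sketched as a small shell function. This is a minimal illustration, not code from the repo: the function name `service_identity` and the `namespace/label` output format are assumptions; only the three multi-Service namespaces and the `service-identity` label come from the decision above.

```shell
# Hypothetical helper (not part of the repo): resolve a workload's service identity.
# Usage: service_identity NAMESPACE [SERVICE_IDENTITY_LABEL_VALUE]
service_identity() {
  local ns="$1" label="${2:-}"
  case "$ns" in
    # Multi-Service namespaces named in the ADR: refine by the explicit label.
    monitoring|kube-system|dbaas)
      if [ -n "$label" ]; then
        printf '%s/%s\n' "$ns" "$label"   # the "ns/label" format is an assumption
      else
        printf '%s\n' "$ns"               # unlabeled: fall back to the namespace
      fi
      ;;
    # Everywhere else one Service maps to one namespace, so the namespace is the identity.
    *)
      printf '%s\n' "$ns"
      ;;
  esac
}
```

For example, `service_identity monitoring loki` would yield `monitoring/loki`, while any label passed for a single-Service namespace is ignored. No per-pod state is needed, which is the etcd-inert property the ADR relies on.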
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@ -40,10 +40,10 @@ graph TB

 | Component | Version | Location | Purpose |
 |-----------|---------|----------|---------|
-| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
+| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
 | Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
 | PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
-| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
+| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
 | Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
 | Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
 | Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
 When `auth = "required"`, an unauthenticated request flows:

 1. Request hits Traefik ingress
-2. ForwardAuth middleware calls Authentik embedded outpost
-3. Authentik checks for valid session cookie
+2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
+3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
 4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
-5. User authenticates via social provider (Google/GitHub/Facebook)
+5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
 6. Authentik creates session, sets cookie, redirects back to original URL
 7. Subsequent requests include session cookie, pass auth check, reach backend

 Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.

+### First-time signin performance (2026-06-10)
+
+Signin latency is dominated by screen count and round trips, not server time
+(DB avg 1.6ms). Standing decisions:
+
+- **Single-screen login**: the identification stage carries `password_stage`,
+  so username+password is one round trip. The separate password-stage binding
+  was removed from `default-authentication-flow` (authentik requires this when
+  the password stage is embedded). Pinned in TF: `authentik_stage_identification.default_identification`.
+- **Implicit consent everywhere**: all OIDC providers are first-party, so none
+  use the explicit-consent flow (it re-prompted every 4 weeks per app).
+- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
+  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
+  15m policy cache, 60s persistent DB connections.
+- **Static assets cached immutable**: `/static` ingress carve-out adds
+  `Cache-Control: public, max-age=31536000, immutable` (assets are
+  version-fingerprinted; authentik itself sends no max-age).
+- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
+- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
+  TCP setup on the forward-auth subrequest path.
+
 **Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.

 ### Social Login & Invitation Flow
--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@ -4,7 +4,7 @@ This doc covers three independent automation paths:

 1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
 2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
-3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
+3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — daily detection CronJob → chain of phase Jobs (preflight → master → one worker Job per worker, enumerated live → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.

 ## Overview

@ -252,7 +252,7 @@ kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
 ### Architecture

 ```
-k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns)
+k8s-version-check CronJob   (23:00 UTC nightly, k8s-upgrade ns)
  │ probe apt-cache madison kubeadm (master) → latest available patch
  │ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
  │ push k8s_upgrade_available metric to Pushgateway
@ -262,20 +262,26 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
  │ spawns Job 0 = k8s-upgrade-preflight-<target_version>
  ▼

-Job 0 — preflight       (pinned: k8s-node1)
-Job 1 — master upgrade  (pinned: k8s-node1)        drains k8s-master
-Job 2 — worker          (pinned: k8s-node1)        drains k8s-node4
-Job 3 — worker          (pinned: k8s-node1)        drains k8s-node3
-Job 4 — worker          (pinned: k8s-node1)        drains k8s-node2
-Job 5 — worker          (pinned: k8s-master)       drains k8s-node1  ← control-plane toleration
-Job 6 — postflight      (no pinning)
+Job 0 — preflight       (pinned: first worker)
+Job 1 — master upgrade  (pinned: first worker)     drains k8s-master
+Job 2..N — worker       (pinned: k8s-master)       drains each worker still off-target
+                                                   ← control-plane toleration; one Job
+                                                     per worker, enumerated live from
+                                                     `kubectl get nodes` (covers node5/6
+                                                     + any future node automatically)
+Job N+1 — postflight    (no pinning)
 ```

 Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
 by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
 apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
-so `apply` reconciles to a single Job per run — re-running a failed Job
-won't duplicate downstream Jobs.
+so `apply` reconciles to a single Job per run — re-running won't duplicate
+downstream Jobs. The detection CronJob and `spawn_next` additionally delete and
+re-spawn a terminally-**Failed** Job of the same name (rather than skipping it
+because it already exists), so a transient preflight gate failure self-heals on
+the next cycle instead of wedging the pipeline until the dead Job's 7d TTL
+expires (retry-on-failure, added 2026-06-17 after a spurious critical alert
+stalled 1.34.9 for 5 days).

 ### Self-preemption history (the reason for the Job-chain rewrite)

@ -304,11 +310,16 @@ each Job's pod and its drain target are always different nodes.
  ConfigMap, and a `template` ConfigMap into each Job pod.
 - **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
  `--role master|worker --release X.Y.Z`. Piped via SSH into each node by
-  upgrade-step.sh.
-- **Three Upgrade Gates alerts**:
+  upgrade-step.sh. The master path runs `kubeadm upgrade apply` with
+  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
+  --skip-phases=addon/coredns` so kubeadm never touches CoreDNS (custom Corefile
+  + separately-tracked image; CoreDNS is pinned off Keel via `keel.sh/policy=never`).
+  See the runbook's "CoreDNS is NOT upgraded by kubeadm here".
+- **Four Upgrade Gates alerts**:
  - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
  - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
+  - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
 - **Pushgateway metrics**:
  - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
  - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
@ -334,7 +345,7 @@ The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply`

 - **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
 - **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
-- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
+- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target: the master-drain Job runs on the first worker; every worker-drain Job runs on k8s-master (already upgraded, control-plane toleration). The worker set is enumerated live from `kubectl get nodes`, so new nodes are covered with no script change; SSH targets are node InternalIPs (no DNS dependency).
 - **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
 - **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
 - **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
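The deterministic naming and retry-on-failure behaviour described in the Architecture section can be sketched as follows. This is an illustration under assumptions, not the contents of `scripts/upgrade-step.sh`: the function names `job_name`, `job_state`, and `spawn_action` are hypothetical, and only the name pattern, the `k8s-upgrade` namespace, and the delete-and-re-spawn rule for Failed Jobs come from the text.

```shell
# Hypothetical helpers, not taken from scripts/upgrade-step.sh.

# Build the deterministic Job name: k8s-upgrade-<phase>-<target_version>[-<node>]
job_name() {
  local phase="$1" version="$2" node="${3:-}"
  local name="k8s-upgrade-${phase}-${version}"
  if [ -n "$node" ]; then
    name="${name}-${node}"
  fi
  printf '%s\n' "$name"
}

# Classify an existing Job as absent | failed | present (needs cluster access).
# Any error from the first lookup is treated as "absent" in this sketch.
job_state() {
  local name="$1" failed
  if ! kubectl -n k8s-upgrade get job "$name" >/dev/null 2>&1; then
    echo absent
    return
  fi
  failed=$(kubectl -n k8s-upgrade get job "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
  if [ "$failed" = "True" ]; then echo failed; else echo present; fi
}

# Decide what the spawner does for a given state.
spawn_action() {
  case "$1" in
    absent) echo create ;;    # first run: apply the rendered template
    failed) echo respawn ;;   # terminally Failed: delete, then apply again
    *)      echo skip ;;      # running or succeeded: leave the Job alone
  esac
}
```

A spawner built this way would call `spawn_action "$(job_state "$(job_name worker 1.34.9 k8s-node4)")"` and act on the result. Because the name is a pure function of phase, version, and node, two cycles targeting the same version always converge on the same Job object.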
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@ -77,6 +77,8 @@ The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5
  - `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26)
  - `Synology/Backup/Viki/nfs-ssd/` — **immich-ML only (2026-06-01)**; ollama/llamacpp dropped (re-pullable models, live-only on the SSD)

+**VM image backups (added 2026-06-09)**: the hand-managed Linux VMs (those NOT in Terraform — see `compute.md`) were historically **not imaged at all** — only their *contents* reached backup if they happened to host a PVC/NFS path. `vzdump-vms` now takes a daily live `vzdump --mode snapshot` of each configured VMID → `/mnt/backup/vzdump/` (Copy 2), carried offsite by the monthly offsite-sync full pass (Copy 3). **Currently enabled for VMID 102 (devvm)** — the shared workstation, whose per-user home dirs + local-only git repos are otherwise irreplaceable. Extend via `VZDUMP_VMIDS` in the unit. See "VM Image Backups (vzdump)" under How It Works.
+
 ## Architecture Diagram

 ### Data Routing — where each path goes (post-2026-05-26)
@ -208,13 +210,14 @@ graph LR
        T0000["00:00 LVM thin snapshots<br/>(lvm-pvc-snapshot)<br/>sdc PVCs CoW"]
        T0015["00:15 PostgreSQL per-DB dumps<br/>(CronJob)"]
        T0045["00:45 MySQL per-DB dumps<br/>(CronJob)"]
+        T0100["01:00 vzdump-vms<br/>live image of hand-managed VMs<br/>(devvm) → sda /mnt/backup/vzdump/"]
        T0200["02:00 nfs-mirror (daily)<br/>sdc /srv/nfs/* → sda /mnt/backup/<svc>/<br/>~10-20 min steady state"]
        T0500["05:00 daily-backup<br/>mount LVM snapshots ro<br/>rsync PVC files → /mnt/backup/pvc-data/<br/>+ sqlite + pfsense + pve-config"]
        T0600["06:00 offsite-sync-backup<br/>Step 1: sda → Synology /Viki/pve-backup/<br/>Step 2: sdc/immich + nfs-ssd → /Viki/nfs[-ssd]/"]
        T1200["12:00 LVM thin snapshots (midday)<br/>second daily snapshot"]
    end

-    T0000 --> T0015 --> T0045 --> T0200 --> T0500 --> T0600 --> T1200
+    T0000 --> T0015 --> T0045 --> T0100 --> T0200 --> T0500 --> T0600 --> T1200
    INO -.->|change events feed Step 2| T0600

    style Nightly fill:#ffe0b2
@ -322,6 +325,7 @@ graph LR
 | NFS Change Tracker | Continuous (inotifywait) | PVE host: `nfs-change-tracker.service` | Logs changed NFS file paths to `/mnt/backup/.nfs-changes.log` |
 | pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar |
 | Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: `offsite-sync-backup` | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify |
+| VM Image Backup (vzdump) | Daily 01:00, keep 3 | PVE host: `vzdump-vms` | Live `vzdump` of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` |
 | PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
 | PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in `dbaas` namespace | pg_dump -Fc per database → `/backup/per-db/<db>/` |
 | MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump --all-databases |
@ -352,6 +356,20 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62

 **Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.

+### VM Image Backups (vzdump)
+
+The hand-managed Linux VMs are **intentionally not in Terraform** (telmate/bpg provider bugs; see `compute.md`) and were historically **not imaged at all**: nothing took a whole-disk backup of the VM itself. For most that is acceptable: k8s nodes are reprovisioned from cloud-init and their data lives in PVCs covered above. But **devvm** (the shared multi-user Claude Code workstation, VMID 102) holds irreplaceable state that lives nowhere else: per-user home dirs (`~/.claude`, `~/.t3`, shell history), manually-installed tooling, and **local-only git repos** (the monorepo root at `/home/wizard/code` has no git remote). Losing the devvm disk would make all of this unrecoverable.
+
+**Script**: `/usr/local/bin/vzdump-vms` on PVE host (source: `infra/scripts/vzdump-vms.sh`). Deploy: `scp infra/scripts/vzdump-vms.sh root@192.168.1.127:/usr/local/bin/vzdump-vms` + `scp infra/scripts/vzdump-vms.{service,timer} root@192.168.1.127:/etc/systemd/system/`, then `systemctl daemon-reload && systemctl enable --now vzdump-vms.timer`.
+**Schedule**: Daily 01:00 via systemd timer, ahead of `nfs-mirror` (02:00), `daily-backup` (05:00), and `offsite-sync-backup` (06:00), so the fresh image is on sda before offsite-sync runs.
+**Mode**: `vzdump --mode snapshot` — live, no downtime. devvm has the qemu guest agent enabled (`agent: 1`), so the snapshot is **filesystem-consistent** (fs-freeze) rather than merely crash-consistent. Runs `Nice=10` + `IOSchedulingClass=idle` + `--ionice 7` so it never starves etcd on the contended sdc IO domain.
+**Scope**: VMIDs in `VZDUMP_VMIDS` (default `102` = devvm). Add VMIDs there to image other hand-managed VMs.
+**Retention**: `KEEP=3` newest dumps per VMID on sda (`/mnt/backup/vzdump/`); each devvm image is ~35-50 GB zstd.
+**Critical dependency**: `nfs-mirror` MUST keep `--exclude='/vzdump/'`. Its nightly `rsync -rlt --delete /srv/nfs/ → /mnt/backup/` treats any `/mnt/backup` dir with no `/srv/nfs` counterpart as an orphan and deletes it — this silently reaped the first two vzdump images at 02:00 on 2026-06-10 before the exclude was added (same reason `pvc-data`/`pfsense`/`pve-config`/`sqlite-backup` are excluded).
+**Offsite**: deliberately **NOT** appended to the incremental offsite manifest. The incremental pass never deletes on the remote, so daily multi-GB images would accumulate unbounded on Synology. Instead the **monthly offsite-sync full pass (days 1-7)** mirrors all of `/mnt/backup` (including `vzdump/`) to Synology with `--delete`, which bounds the offsite copy to local retention. So Copy 2 (sda) refreshes **daily**; Copy 3 (Synology) refreshes **monthly**.
+**Monitoring**: pushes `vzdump_last_run_timestamp` / `vzdump_last_status` / `vzdump_last_success_timestamp` to Pushgateway job `vzdump-backup`. Alerts `VzdumpBackupStale` (>~50h since last success), `VzdumpBackupNeverRun`, `VzdumpBackupFailing` (status≠0) are defined in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (the 3-2-1 group) — **effective on the next `monitoring` stack apply** (metrics already flow, so the alerts arm immediately once applied).
+**Restore**: on the PVE host, `qmrestore /mnt/backup/vzdump/vzdump-qemu-<vmid>-<ts>.vma.zst <vmid>` — restore to a spare VMID first if the original still exists, then swap disks; or use the PVE UI (add `/mnt/backup` as a dir storage with content=backup → Restore).
+
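The `KEEP=3` retention rule above could be implemented along these lines. This is a sketch, not the contents of `infra/scripts/vzdump-vms.sh`: the function name `prune_dumps` is hypothetical, and it assumes dump filenames embed a timestamp that sorts lexically, so reverse name order is newest-first.

```shell
# Hypothetical helper: keep only the newest KEEP dumps for one VMID.
# Usage: prune_dumps DIR VMID [KEEP]
prune_dumps() {
  local dir="$1" vmid="$2" keep="${3:-3}"
  # List this VMID's dumps newest-first (assumes sortable timestamps in names),
  # skip the first KEEP entries, and delete the rest.
  find "$dir" -maxdepth 1 -type f -name "vzdump-qemu-${vmid}-*.vma.zst" \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r old; do
        rm -f -- "$old"
      done
}
```

Scoping the glob to a single VMID keeps retention independent per VM, so adding a second VMID to `VZDUMP_VMIDS` would not cause one VM's images to evict another's.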
 ### Layer 2: Weekly File-Level Backup (sda Backup Disk)

 **Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
@ -527,12 +545,16 @@ The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by
 | `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
 | `/usr/local/bin/daily-backup` | PVE host: PVC file copy + auto SQLite backup + pfSense |
 | `/usr/local/bin/offsite-sync-backup` | PVE host: two-step rsync to Synology (sda + NFS via inotify) |
+| `/usr/local/bin/vzdump-vms` | PVE host: daily live `vzdump` image of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` |
 | `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
+| `/mnt/backup/vzdump/` | PVE host: vzdump VM images (keep 3 per VMID), mirrored offsite monthly |
 | `/mnt/backup/.nfs-changes.log` | NFS change log from inotifywait, consumed by offsite-sync |
 | `/etc/systemd/system/nfs-change-tracker.service` | inotifywait watcher for `/srv/nfs` + `/srv/nfs-ssd` |
 | `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
 | `/etc/systemd/system/daily-backup.timer` | Daily 05:00 (file backup) |
 | `/etc/systemd/system/offsite-sync-backup.timer` | Daily 06:00 (offsite sync) |
+| `/etc/systemd/system/vzdump-vms.timer` | Daily 01:00 (VM image backup) |
+| `/etc/systemd/system/vzdump-vms.service` | oneshot: `vzdump-vms` (source `infra/scripts/vzdump-vms.{sh,service,timer}`) |
 | `/usr/local/bin/nfs-mirror` | PVE host: daily 02:00 mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
 | `/etc/systemd/system/nfs-mirror.timer` | Daily 02:00 (NFS local mirror to sda) |
 | `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
@ -911,6 +933,9 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
 | Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
 | **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted |
 | **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS |
+| **Hand-managed VMs (not in Terraform)** |
+| devvm (workstation, VMID 102) | — | — | ✓ daily vzdump image | ✓ monthly | local-lvm (sdc) |
+| Other hand-managed VMs (HA 103, registry 220, k8s nodes) | — | — | — gap² | — | local-lvm — see note² |
 | **Media (NFS)** |
 | Immich (~800GB) | — | — | — | ✓ | NFS |
 | Audiobookshelf | — | — | — | ✓ | NFS |
@ -924,6 +949,8 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at

 **Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.

+² **Hand-managed VMs** — only **devvm (102)** is imaged today (`vzdump-vms`, `VZDUMP_VMIDS=102`). The k8s nodes are deliberately uncovered (reprovisioned from cloud-init; their data lives in the PVCs already backed up above). **home-assistant (103) and docker-registry (220) are a documented gap** — add their VMIDs to `VZDUMP_VMIDS` to image them (registry content is also re-pullable from upstreams; HA has its own add-on backups). pfSense (101) is covered separately by `daily-backup` (config.xml + weekly tar).
+
 ¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count.

 **Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2:
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@ -10,9 +10,14 @@ serves two distinct populations:
   `chromium.connect_over_cdp("http://chrome-service.chrome-service.svc:9222")`
   to drive a real browser when upstream anti-bot trips a headless one
   (`disable-devtool.js` redirect-to-google trap, `navigator.webdriver`
-   checks, console-clear timing tricks). The only currently-active
-   in-cluster caller is the `chrome-service-snapshot-harvester` CronJob;
-   the `stacks/f1-stream/files/backend/playback_verifier.py` +
+   checks, console-clear timing tricks). Currently-active in-cluster
+   callers: the `chrome-service-snapshot-harvester` CronJob, and
+   **tripit's `PlaywrightFareProvider`** (since 2026-06-11, tripit issue
+   #18 / ADR-0007) — the flight-fare scrape connects per quote, opens a
+   fresh incognito context, scrapes Google Flights, and closes the
+   context; rate-limited to one attempt per 30s with a 6h fare cache, so
+   browser load is negligible. The
+   `stacks/f1-stream/files/backend/playback_verifier.py` +
   `chrome_browser.py` tree is a vestigial design — the deployed
   f1-stream image (built from `github.com/ViktorBarzin/f1-stream`)
   does not use this code path.
@ -107,17 +112,32 @@ External caller (dev box):
  @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
 ```

+## Browser binary — real Google Chrome (for proprietary codecs)
+
+The chrome-service container runs **real Google Chrome**, not the bundled
+Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
+(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
+`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
+The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
+
+**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
+so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
+`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
+decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
+worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
+the lib stripped) and Chrome-for-Testing is also codec-less — only
+`google-chrome-stable` carries them.
+
 ## Image pin

-Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
-`stacks/chrome-service/main.tf`) and the Python client
-(`playwright==1.48.0` in callers' `requirements.txt`) **must match
-minor-versions**. Bump in lockstep — Playwright protocol changes between
-minors and the client cannot connect to a mismatched server.
-
-The harvester + snapshot-server sidecar use
-`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
-minor, with Python-side bindings pre-installed.
+The Playwright base image, the Python client (`playwright==1.48.0` in callers'
+`requirements.txt`), and the snapshot sidecars
+(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
+on minor version. The chrome-service browser is now real Google Chrome (a newer
+milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
+fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
+version-tolerant and has been verified working against this Chrome. If a future
+Chrome milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.

 ## Storage

@ -162,7 +182,29 @@ minor, with Python-side bindings pre-installed.
  `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
  `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
  exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
-  Authentik-gated.
+  Authentik-gated. The bare host serves `vnc.html` (image symlinks
+  `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
+  to skip the Connect button. The view is **black when no browser window is
+  open** (idle) — that is normal, not a failed connection. Chrome is launched
+  with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
+  (no window manager runs, so without it Chrome opens at its profile-persisted
+  size and the rest of the framebuffer shows as a black cut-off).
+
+### noVNC fd-sweep gotcha (stuck "Connecting")
+
+If the noVNC client hangs on **"Connecting" forever then times out**, the cause
+is almost always x11vnc's fd-table sweep: containerd grants pods
+`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
+every client connection, so the RFB handshake never completes (websockify
+accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
+the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
+x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
+(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` —
+healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
+— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
+wrapper in `main.tf` (so it applies deterministically even though the image is
+`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
+as the android-emulator stack.
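The fix described above amounts to a guard like the following, run before x11vnc starts. This is a minimal sketch, not the actual `files/novnc/entrypoint.sh`: the function name `cap_nofile` is hypothetical; only the 65536 cap comes from the text.

```shell
# Hypothetical helper: lower the open-files limit if it exceeds a cap.
# Usage: cap_nofile [MAX]   (MAX defaults to 65536)
cap_nofile() {
  local max="${1:-65536}" cur
  cur=$(ulimit -n)
  # "unlimited" or any value above the cap makes a full fd-table sweep slow.
  if [ "$cur" = "unlimited" ] || [ "$cur" -gt "$max" ]; then
    ulimit -n "$max"
  fi
}
```

An entrypoint would call `cap_nofile 65536` and then `exec` x11vnc, so the lowered limit is inherited by the server process. The guard only ever lowers the limit, which an unprivileged process is always allowed to do.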
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
  serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
  bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -175,6 +217,45 @@ minor, with Python-side bindings pre-installed.
 See `stacks/chrome-service/README.md` for the recipe (label namespace,
 inject `CHROME_CDP_URL`, vendor `stealth.js`).

+## Driving from OUTSIDE the cluster (`homelab browser`)
+
+Agents on the devvm reach this browser through the **`homelab browser`** CLI
+(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
+`connect_over_cdp` recipe. It is the **escalation path, not the default**:
+agents default to the Playwright MCP / headless browser for all routine
+automation, and reach for `homelab browser` ONLY when headless is blocked — a
+site loads but a gated action (submit/login) silently fails or hangs, the
+signature of headless / anti-bot detection. (Same tiered rule lives in
+`~/code/CLAUDE.md` and `homelab browser --help`.)
+
+```text
+devvm:  homelab browser run flow.js
+          │  kubectl port-forward svc/chrome-service :9222  (random local port)
+          ▼
+   http://127.0.0.1:<port>  ──►  chrome-service pod :9222 (CDP)
+          │  assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
+          │  node + playwright-core@1.48.2 → connectOverCDP
+          │  context.addInitScript(stealth.js)   ← same vendored file as in-cluster
+          │  run the user's Playwright script with page/context/browser in scope
+          └─ port-forward always torn down (success or error)
+```
+
+Key facts:
+
+- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
+  API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
+  label — unlike in-cluster callers.
+- **Client pinned to the image minor.** The node client is
+  `playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
+  lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
+  server image bumps (same rule as the in-cluster Python clients — see "Image
+  pin" above).
+- **Default context is a fresh incognito one** (closed on exit), safe for the
+  shared browser; `--shared-context` reuses the warmed persistent profile.
+- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
+  byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
+  CLI's stealth never diverges from the in-cluster callers'.
+
 ## Limits + risks

 - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@ -2,306 +2,378 @@

 ## Overview

-The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
+**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
+owned image is built, tested, and linted on **GitHub Actions** (free on public
+repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**.
+Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
+image tag and Woodpecker runs `kubectl set image` from inside the cluster.
+There are **no in-cluster image builds or CI test runs anywhere** — the
+in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
+clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
+and emptied** — break-glass only.
+
+This breaks the old circular dependency (images needed to repair the cluster
+used to be built and stored *inside* it) and keeps build IO + registry pushes
+off the homelab spindle.

 ## Architecture Diagram

 ```mermaid
 graph LR
-    A[Git Push] --> B[GitHub Actions]
-    B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
-    C --> D[Push to DockerHub]
-    D --> E[POST Woodpecker API]
-    E --> F[Woodpecker Pipeline]
-    F --> G[Vault K8s Auth<br/>SA JWT]
-    G --> H[kubectl set image]
-    H --> I[K8s Deployment]
-    I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
+    A[git push Forgejo<br/>viktor/&lt;repo&gt; canonical] --> B[push-mirror sync_on_commit]
+    B --> C[GitHub mirror<br/>ViktorBarzin/&lt;repo&gt;]
+    C --> D[GitHub Actions<br/>.github/workflows/build.yml]
+    D --> E[lint / test]
+    E --> F[buildx linux/amd64<br/>provenance:false]
+    F --> G[push ghcr.io/viktorbarzin/&lt;name&gt;<br/>:sha8 + :latest]
+    G --> H[svu tag -> Forgejo canonical]
+    G --> I[POST Woodpecker deploy repo]
+    I --> J[.woodpecker/deploy.yml<br/>event: manual]
+    J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
+    K --> L[K8s Deployment<br/>pulls from ghcr]

-    K[Pull-Through Cache<br/>10.0.20.10] -.-> J
-    L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
-
-    style B fill:#2088ff
-    style F fill:#4c9e47
-    style K fill:#f39c12
+    style D fill:#2088ff
+    style J fill:#4c9e47
+    style G fill:#f39c12
 ```

 ## Components

-| Component | Version | Location | Purpose |
-|-----------|---------|----------|---------|
-| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
-| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
-| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
-| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
-| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
-| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
-| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
+| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
+| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons |
+| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
+| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
+| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
+| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |

 ## How It Works

-### Build Flow (GitHub Actions)
+### The fleet pattern (every owned app)

-1. **Trigger**: Git push to main/master branch
-2. **Build**: GHA builds Docker image for `linux/amd64` platform only
-3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
-   - `:latest` tags are **never used** to prevent stale pull-through cache issues
-4. **Push**: Image pushed to DockerHub public registry
-5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
+1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror**
+   (`sync_on_commit`) pushes every commit to the GitHub mirror
+   `ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on
+   Forgejo and mirrors over.
+2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
+   branches mirror but build/deploy nothing, the safety valve):
+   - lint + test
+   - `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
+     tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
+     write:repository PAT); `VERSION` is baked into the image
+   - `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
+     avoids the orphaned-index-children failure class), push
+     `ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest`
+   - `delete-package-versions` keeps the newest ~10 ghcr versions
+3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines`
+   (the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
+   secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
+4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
+   Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
+   image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent`
+   SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
+   kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
+   (`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
+   fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
+   instead of a deploy step.

-### Deploy Flow (Woodpecker CI)
+**Keel stays enrolled** as a redundant net (finds the deployed SHA already
+running → no-op).

-1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
-2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
-3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
-4. **Notify**: Slack notification on success/failure
+**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
+scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
+old-pipeline removal, default-branch flip). Mirror + workflow commits go via
+the Forgejo API over the internal Traefik LB
+(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm
+can't reach Forgejo's public hairpin.

-### Project Migration Status
+### ghcr package visibility

-**Migrated to GHA (8 projects)**:
- Website
- k8s-portal
- claude-memory-mcp
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- book-search (audiobook-search)
+| Visibility | Packages | Pull mechanism |
+|------------|----------|----------------|
+| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
+| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |

-**Woodpecker-native owned-app builds** (build + push to the Forgejo private
-registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
-stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
-`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
-2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
-GHA-era Woodpecker repo (id 10) is deactivated.
+Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
+kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
+**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source
+`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). The credential is Vault
+`secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to
+`read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` alias).
+GitHub has no token-mint API, so rotation is manual: re-mint the classic
+`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` →
+targeted apply of `module.kyverno.kubernetes_secret.ghcr_credentials` (reads
+Vault; avoids the git-crypt `tls-secret-sync` landmine on a locked clone).
+Kyverno then re-syncs the Secret to the allowlisted namespaces.

-**Woodpecker-only (infra + large apps)**:
- `travel_blog`: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
+### Migrated apps (issues #13–#27)

-### Woodpecker Pipeline Files
+f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos,
+claude-agent-service, claude-memory-mcp, kms-website, Freedify,
+instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
+fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
+pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
+k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+audiobook-search) now also land on ghcr.

-Each project contains:
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
+### Infra-owned images (issues #29 / #30)

-### Woodpecker Repository IDs
+Images owned by the infra repo build on GHA workflows **in the infra repo's own
+`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT
+reconciled — the workflows were added to the GitHub lineage via PR):

-Woodpecker API uses numeric IDs (not owner/name):
+| Image | Workflow | Destination |
+|-------|----------|-------------|
+| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` |
+| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
+| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
+| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |

-| Repo | ID |
-|------|------|
-| infra | 1 |
-| Website | 2 |
-| finance | 3 |
-| health | 4 |
-| travel_blog | 5 |
-| webhook-handler | 6 |
-| audiblez-web | 9 |
-| plotting-book | 43 |
-| claude-memory-mcp | 78 |
-| infra-onboarding | 79 |
+**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
+`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
+already built by tripit's GHA → ghcr.

-### Image Registry Flow
+The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were
+**REMOVED**. Break-glass for infra-ci is now a manual
+`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM).

-1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
-2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
-3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
-4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
-5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
+### Forgejo container registry — FROZEN

-### Infra Pipelines (Woodpecker-only)
+Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data`
+58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The
+`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through
+caches on the registry VM (`10.0.20.10`) are unchanged. See
+`docs/runbooks/forgejo-registry-breakglass.md`.
+
+### Image registry / pull path
+
+1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the
+   pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io).
+2. **Pull-through cache** serves cached images from the LAN, fetches upstream on
+   a miss.
+3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist)
+   and `registry-credentials` to namespaces.
+
+## Woodpecker — what it still runs
+
+Woodpecker is **deploy + cluster-touching steps only**:

 | Pipeline | File | Purpose |
 |----------|------|---------|
-| default | `.woodpecker/default.yml` | Terragrunt apply on push |
-| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
-| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
-| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
-| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
-| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
-| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
-| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
-| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
+| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
+| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
+| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
+| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
+| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
+| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems |
+| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal |
+| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM |
+
+**No build/test pipeline exists on any repo.** Do not (re)introduce one.
+
+### Woodpecker API
+
+Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
+(those return HTML). The deploy registration for each app is the **GitHub
+mirror** repo (registered github-forge). IDs are stable across renames and must
+be looked up from the Woodpecker UI/DB.
+
+### Woodpecker YAML gotchas
+
+- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers
+  YAML map parsing when the vars are empty.
+- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility).
+- Global secrets must include `manual` in their events list for API-triggered
+  pipelines.
+
+### GitHub repo secrets
+
+Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN`
+(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's
+built-in `GITHUB_TOKEN` (`packages: write`).
+
+## Infra repo CI topology
+
+The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
+forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
+1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
+(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
+Slack audit step. Operational facts (2026-06-10):
+
+- **Webhook URL is the IN-CLUSTER service**:
+  `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
+  via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`)
+  resolves to the non-proxied public A record from pods → NAT hairpin →
+  intermittent `context deadline exceeded`, silently dropping push events. If
+  Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me`
+  — re-apply the in-cluster URL.
+- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
+  repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …).
+  When registering a new forge repo for infra, clone the secret set too.
+- **Empty commits defeat path filters**: a commit with no changed files makes
+  Woodpecker include ALL workflow files (path conditions can't exclude), so every
+  repo secret must resolve. Normal commits with real files only compile the
+  matching workflows.
+
+The Forgejo trigger is not fully dependable: land infra changes by pushing to
+Forgejo `master` (as viktor), use `[ci skip]` for docs/no-op commits, and verify
+deploys via `scripts/tg` + live cluster state rather than trusting the CI
+checkmark. The two remotes have **diverged** (parallel histories under
+different SHAs); expect GitHub pushes to be rejected as non-fast-forward, and
+leave them rejected. Never force-push.

 ## Configuration

-### GitHub Actions
-
-**File**: `.github/workflows/build-and-deploy.yml`
+### GitHub Actions (per-app `.github/workflows/build.yml`)

 ```yaml
-name: Build and Deploy
+name: build
 on:
  push:
-    branches: [main, master]
+    branches: [master]
 jobs:
  build:
    runs-on: ubuntu-latest
+    permissions:
+      contents: write   # svu tag push
+      packages: write   # ghcr push
    steps:
-      - name: Build Docker image
-        run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
-      - name: Push to DockerHub
-        run: docker push viktorbarzin/app:${SHORT_SHA}
-      - name: Trigger Woodpecker Deploy
+      - uses: actions/checkout@v4
+      - name: lint + test
+        run: make lint test
+      - name: svu tag -> Forgejo
        run: |
-          curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
-            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
+          VERSION=$(svu next)
+          # ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/build-push-action@v6
+        with:
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/<name>:<sha8>
+            ghcr.io/viktorbarzin/<name>:latest
+  deploy:
+    needs: build
+    runs-on: ubuntu-latest
+    steps:
+      - name: Trigger Woodpecker deploy
+        run: |
+          curl -fsS -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \
+            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" -H "Content-Type: application/json" \
+            -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}'
 ```

-**Required GitHub Secrets**:
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_TOKEN`
- `WOODPECKER_TOKEN`
-
-### Woodpecker Deploy Pipeline
-
-**File**: `.woodpecker/deploy.yml`
+### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`)

 ```yaml
 when:
-  event: [deployment]
+  event: manual

 steps:
  deploy:
-    image: bitnami/kubectl:latest
+    image: bitnami/kubectl:latest   # uses the in-cluster woodpecker-agent SA (cluster-admin)
    commands:
-      - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
-    secrets: [k8s_token]
-
+      - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>"
+      - "kubectl rollout status deployment/app -n <ns> --timeout=300s"
  notify:
    image: plugins/slack
-    settings:
-      webhook: ${SLACK_WEBHOOK}
    when:
      status: [success, failure]
 ```

-**YAML Gotchas**:
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions)
- Global secrets must be manually added to `secrets:` list in pipeline
+### CI/CD secrets sync

-### Vault Configuration
-
-**K8s Auth for Woodpecker**:
- Woodpecker pipelines authenticate using ServiceAccount JWT
- Vault K8s auth mount validates JWT and issues token
- Policies grant access to secrets and dynamic credentials
-
-### CI/CD Secrets Sync
-
-**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
- Keeps Woodpecker global secrets in sync with Vault
- Runs in `woodpecker` namespace
+A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault →
+the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy
+pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA
+(cluster-admin); Vault K8s auth backs any secret reads.

 ## Decisions & Rationale

-### Why GitHub Actions + Woodpecker?
+### Why all builds off-infra (ADR-0002)?

-**Alternatives considered**:
-1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
-2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
-3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
+- **Breaks the circular dependency** — the images needed to repair the cluster
+  no longer live inside it (they're on ghcr, an external registry).
+- **Removes build IO + registry push load** from the contended homelab spindle.
+- GHA is free on public repos and generous on private; buildx provenance:false
+  sidesteps the orphaned-index-children failure class that plagued the
+  in-cluster registry.
+- **Clean cut** — no in-cluster fallback builds anywhere; one pattern,
+  fleet-wide.

-**Benefits**:
- Free compute for builds on public repos
- Cluster access stays internal (Woodpecker has direct K8s access)
- Separation of concerns: build vs deploy
+### Why ghcr (not push back to Forgejo)?

-### Why 8-Character SHA Tags (Not :latest)?
+Self-hosted registries repeatedly orphaned OCI index children (2026-04-13/19 and
+2026-05-04 on the pre-Forgejo `registry:2.8.3`, then 2026-06-10), and Forgejo's
+retention is not container-aware. ghcr is external (DR-safe), free at this scale,
+and handles multi-arch natively. The Forgejo registry was frozen + emptied (issue #32).

- Pull-through cache serves stale `:latest` tags indefinitely
- SHA tags ensure every deployment pulls the correct image
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
+### Why Woodpecker stays for deploy?

-### Why Numeric Repo IDs for Woodpecker API?
+`kubectl set image` needs in-cluster privileged access; doing it from GHA would
+mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's
+`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step
+needs no credentials.

- Woodpecker API requires numeric IDs (not owner/name slugs)
- IDs are stable across repo renames
- Must be manually looked up from Woodpecker UI or database
+### Why `event: manual` on deploy.yml?

-### Why linux/amd64 Only?
+The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror.
+If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no
+image tag. `manual` means only the GHA `deploy` job's explicit API POST (with
+`IMAGE_TAG`) deploys.

- Cluster runs on x86_64 nodes only
- ARM builds would waste time and storage
- Multi-arch images add complexity without benefit
+### Why linux/amd64 only?
+
+The cluster runs on x86_64 nodes only; ARM builds waste time and storage.

 ## Troubleshooting

-### GHA Build Fails: "denied: requested access to the resource is denied"
+### GHA build fails: ghcr push "denied"

-**Cause**: DockerHub credentials expired or incorrect
+The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package
+must allow the repo to push. Check the workflow `permissions:` block and the
+package's "Manage Actions access" settings.
+
+### Image pull fails: "ErrImagePull" / "ImagePullBackOff"

-**Fix**:
 ```bash
-# Regenerate DockerHub token
-# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
+# Public image — check the pull-through cache is up
+curl http://10.0.20.10:5010/v2/_catalog
+
+# Private image — verify the ghcr-credentials Secret exists in the namespace
+kubectl get secret ghcr-credentials -n <namespace>
+# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the
+# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf
 ```

-### Woodpecker Deploy Fails: "Unauthorized"
+If the cause is the internal-DNS hairpin (fresh pulls timing out on the public
+Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in
+`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`.

-**Cause**: Vault K8s auth token expired or invalid
+### Deploy didn't happen after a push

-**Fix**:
-```bash
-# Restart Woodpecker pipeline (token auto-renewed)
-# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
-```
+Confirm the push was to **master** (feature branches build/deploy nothing).
+Check the GHA run completed the `deploy` job, then check Woodpecker received the
+manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify
+live with `kubectl rollout status` — not the CI checkmark.

-### Image Pull Fails: "ErrImagePull"
+### Woodpecker deploy fails: "YAML: did not find expected key"

-**Cause**: Pull-through cache or registry credentials issue
-
-**Fix**:
-```bash
-# Check pull-through cache is running
-curl http://10.0.20.10:5000/v2/_catalog
-
-# Verify registry-credentials Secret exists in namespace
-kubectl get secret registry-credentials -n <namespace>
-
-# Manually sync credentials if missing
-kubectl get secret registry-credentials -n default -o yaml | \
-  sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
-```
-
-### Woodpecker Pipeline: "YAML: did not find expected key"
-
-**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
-
-**Fix**: Quote the command:
-```yaml
-commands:
-  - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
-```
-
-### travel_blog Build Times Out on GHA
-
-**Cause**: 5.7GB content directory exceeds GHA disk/time limits
-
-**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
-
-### CI/CD Secrets Out of Sync
-
-**Cause**: CronJob failed to sync Vault → Woodpecker
-
-**Fix**:
-```bash
-# Check CronJob status
-kubectl get cronjob -n woodpecker
-
-# Manually trigger sync
-kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
-```
+Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the
+command (see the deploy.yml example above).

 ## Related

- [Databases Architecture](./databases.md) — Database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
- Vault documentation: K8s auth configuration
- Woodpecker documentation: API reference
+- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision
+- [Databases Architecture](./databases.md) — database credentials via Vault
+- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access
+- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry
+- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging
+- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/`
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@@ -22,9 +22,11 @@ graph TB
        NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
        NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
        NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
+        NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
+        NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
    end

-    subgraph K8s["Kubernetes Cluster v1.34.2"]
+    subgraph K8s["Kubernetes Cluster v1.34.8"]
        direction TB

        subgraph VPA["VPA (Goldilocks - Initial Mode)"]
@@ -62,7 +64,7 @@ graph TB
 | Model | Dell PowerEdge R730 |
 | CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
 | Total Cores/Threads | 22 cores / 44 threads |
-| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
+| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
 | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
 | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
 | Hypervisor | Proxmox VE |
@@ -76,8 +78,10 @@ graph TB
 | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
+| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
+| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |

-**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
+**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)

 > **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
 > (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
@@ -97,7 +101,12 @@ graph TB
 > PVE host (sources in `infra/scripts/`, install pattern per
 > `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
 > `OnCalendar=hourly`, so any drift (config restore, manual `qm
-> set`, fresh clone) self-heals within the hour. Current caps:
+> set`, fresh clone) self-heals within the hour. The script compares
+> *normalized option sets*, so an unchanged config is a true no-op —
+> until 2026-06-11 a raw string compare (defeated by `qm config`'s
+> canonical key order) re-issued `qm set` hourly against running VMs,
+> live-rewriting QEMU throttle state via QMP (implicated in the devvm
+> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
 > 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
 > 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
 > 204 k8s-node4 150/120, 220 docker-registry 40/40.
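
The normalized-option-set comparison described in the note above can be sketched in Python. This is a minimal illustration, not the script in `infra/scripts/`: the function names and the `mbps_rd`/`mbps_wr` option keys are assumptions chosen for the example.

```python
def normalize(options: str) -> frozenset:
    """Split a comma-separated option string into an order-independent set."""
    pairs = set()
    for token in options.split(","):
        token = token.strip()
        if not token:
            continue
        key, sep, value = token.partition("=")
        # A bare token (no "=") is kept as a key with an empty value.
        pairs.add((key.strip(), value.strip() if sep else ""))
    return frozenset(pairs)


def needs_update(current: str, desired: str) -> bool:
    """Return True only when the two option strings differ semantically."""
    return normalize(current) != normalize(desired)


# Same options, different key order (hypothetical keys).
current = "mbps_wr=60,mbps_rd=60"
desired = "mbps_rd=60,mbps_wr=60"
print(current != desired)              # raw string compare sees a difference
print(needs_update(current, desired))  # normalized compare sees none
```

With a raw string compare, the reordered but identical configuration would trigger a write every run; comparing sets makes the unchanged case a no-op.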
--- a/docs/architecture/dns.md
+++ b/docs/architecture/dns.md
@@ -258,19 +258,27 @@ The TP-Link AP (dumb AP on 192.168.1.x) does not support hairpin NAT. LAN client
 Technitium's **Split Horizon AddressTranslation** app post-processes DNS responses for 192.168.1.0/24 clients, translating the public IP to the internal Traefik LB IP:

 ```
-176.12.22.76 → 10.0.20.200
+176.12.22.76 → 10.0.20.203
 ```

+(Was `10.0.20.200` until Traefik's 2026-05-30 move to its dedicated `.203` LB IP.)
+
 **DNS Rebinding Protection** has `viktorbarzin.me` in `privateDomains` to allow the translated private IP without being stripped as a rebinding attack.
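
As a rough model of the post-processing described above, the following Python sketch rewrites an answer only for clients inside `192.168.1.0/24`. It illustrates the idea and is not the Technitium app's implementation; `translate_answer` and the table layout are assumptions made for the example.

```python
import ipaddress

# Clients whose answers are post-processed (from the section above).
LAN = ipaddress.ip_network("192.168.1.0/24")

# Public IP -> internal Traefik LB IP (from the mapping above).
TRANSLATIONS = {
    ipaddress.ip_address("176.12.22.76"): ipaddress.ip_address("10.0.20.203"),
}


def translate_answer(client_ip: str, answer_ip: str) -> str:
    """Return the address a client should receive for a resolved A record."""
    client = ipaddress.ip_address(client_ip)
    answer = ipaddress.ip_address(answer_ip)
    if client in LAN:
        # Only LAN clients get the translated (internal) address.
        return str(TRANSLATIONS.get(answer, answer))
    return str(answer)


print(translate_answer("192.168.1.50", "176.12.22.76"))  # LAN client
print(translate_answer("198.51.100.7", "176.12.22.76"))  # client outside the LAN
```

The first call yields the internal LB address and the second leaves the public address untouched, which is why the rebinding protection needs `viktorbarzin.me` in `privateDomains`.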

 ### Scope

 - **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients
 - **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed)
-- **Not affected**: 10.0.x.x and K8s clients (reach public IP via pfSense outbound NAT normally)
+- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. `mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25; since 2026-06-10 its :443 also works internally — pfSense carries an SNI-routed HAProxy frontend on 443 that sends hostname traffic to Traefik and bare-IP/no-SNI traffic to the webGUI, which moved to :8443; see `docs/runbooks/mailserver-pfsense-haproxy.md`). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL).
+  - **devvm**: also covered by a `~viktorbarzin.me → 10.0.20.201` resolved routing domain (predates the pfSense override, provisioned by `setup-devvm.sh`) — redundant-but-harmless belt-and-suspenders.
+  - **in-cluster PODS are ordinary internal clients too** (since 2026-06-10 evening): CoreDNS's dedicated `viktorbarzin.me:53` block (in `stacks/technitium`, TF-managed) forwards to the Technitium ClusterIP (`10.96.0.53`, same as the `.lan` block), so pods get the same split-horizon answers as everyone else. This works because on k8s 1.34 **pods CAN reach the ETP=Local Traefik LB IP** — kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path (verified from pods on three non-Traefik nodes; re-verify after major k8s upgrades — the canary is the uptime-kuma `[External]` fleet going red). forgejo stays pinned to Traefik's **ClusterIP** in the same block so CI pushes survive a Technitium outage. History: the block briefly forwarded to `8.8.8.8/1.1.1.1` (morning of 2026-06-10), which kept pods on public IPs and the broken TP-Link NAT loopback — 27 non-proxied `[External]` uptime-kuma monitors dark (beads code-yh33). Note: in-cluster `[External]` monitors now test DNS+Traefik+service via the internal path for ALL names, including Cloudflare-proxied ones — genuine edge-path fidelity is the job of a true external vantage (ha-london), not in-cluster probes.
+  - **Trade-off**: `viktorbarzin.me` resolution via pfSense now depends on in-cluster Technitium (3 replicas). During a full cluster outage the zone SERVFAILs LAN-wide — acceptable, the services behind it are down anyway; node bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds.
+  - **Residual nondeterminism**: nodes keep `94.140.14.14` as a secondary resolver (netplan/qm `--nameserver`). If systemd-resolved fails over to it during a pfSense DNS blip, `.me` answers are public again until it switches back — a rare, self-healing window, accepted.

 Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).

+**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
+
 ## NodeLocal DNSCache

 A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both:
@@ -456,13 +464,21 @@ The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus

 ### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)

-Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries
-directly instead of forwarding to Technitium, so the Technitium Split Horizon
-post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied
-services break hairpin on LAN clients again. Options:
+**Since 2026-06-10 this is largely solved at the resolver**: pfSense Unbound
+carries a domain override forwarding the entire `viktorbarzin.me` zone to
+Technitium, so ANY client that queries pfSense (all VLANs + 192.168.1.x
+clients pointed at `192.168.1.2`) gets the internal Traefik answer. If
+hairpin still fails for a client, first check which resolver it actually
+uses — clients on the TP-Link's own DHCP DNS (router/ISP) bypass pfSense
+entirely. Options for those:
+
+(Historical context: 2026-04-19 Workstream D made Unbound answer LAN
+queries directly, which had removed the Technitium Split Horizon
+post-processing from the LAN path until the 2026-06-10 domain override
+restored internal answers at the zone level.)

 1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent.
-2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
+2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.203` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
 3. **Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section.

 K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` queries DO reach Technitium (e.g., from pods that query via CoreDNS → Technitium forwarding for `.viktorbarzin.me` via pfSense). Verify Technitium split-horizon app:
@@ -470,7 +486,7 @@ K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me`
 1. Verify Split Horizon app is installed on all instances
 2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
 3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
-4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.200 for 192.168.1.x source
+4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.203 for 192.168.1.x source

 ### Zone Not Replicating to Secondary/Tertiary

--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -119,12 +119,18 @@ no `level` stream label.
 cluster error/warn line counts (5-min window) → `sensor.cluster_log_errors_5m` /
 `sensor.cluster_log_warnings_5m`, for a compact trend card on the Барзини status
 view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP
-`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`)
-because `loki.viktorbarzin.lan` has **no Technitium record yet** (the
-`technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins
-`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in
-Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the
-Sofia-Pi promtail can resolve it by name instead of pinning the LB IP.
+`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`).
+**Update 2026-06-10:** `loki.viktorbarzin.lan` is now **registered in Technitium**
+as a CNAME → `ingress.viktorbarzin.lan` (the anchor whose A record auto-tracks the
+live Traefik LB IP), added via the Technitium API and AXFR-replicated to all 3
+instances — so it resolves by name LAN-wide. The **PVE host** promtail (see
+"External host: pve" below) uses the name directly, with **no `/etc/hosts` pin**.
+This HA sensor and the rpi-sofia promtail still pin the LB IP in their own configs
+and can drop to the name on next touch (`verify_ssl: false` / `insecure_skip_verify`
+stays — the internal `.lan` cert isn't publicly trusted). Per-host `.lan` CNAMEs
+are still added manually via the API; auto-managing them in
+`technitium-ingress-dns-sync` (today `.me`-only + the `ingress.viktorbarzin.lan`
+anchor) remains a follow-up.

 ### External host: rpi-sofia (Sofia Raspberry Pi)

@@ -140,12 +146,29 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia

 **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.

-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
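
To see why the edge-triggered `RpiSofiaUndervoltage` expression auto-resolves while a level check on the sticky bit would latch, here is a simplified Python simulation. It assumes one sample per minute and a stripped-down `increase` (sum of positive deltas, no extrapolation), so it only approximates what PromQL computes; all function names are hypothetical.

```python
def simple_increase(samples):
    """Sum of increases across consecutive samples, treating drops as resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop is treated as a counter reset, so the new value counts in full.
        total += (cur - prev) if cur >= prev else cur
    return total


def edge_alert(series, now, window=60):
    """Fire when the sticky bit rose within the last `window` samples."""
    return simple_increase(series[max(0, now - window): now + 1]) > 0


def level_alert(series, now):
    """Fire whenever the sticky bit is set (latches until reboot)."""
    return series[now] == 1


# Sticky bit: clear for 10 minutes, then set and never cleared.
series = [0] * 10 + [1] * 200

for minute in (5, 10, 69, 70, 150):
    print(minute, edge_alert(series, minute), level_alert(series, minute))
```

In this model the edge alert is active from the brown-out at minute 10 until the rise leaves the one-hour window, while the level alert stays active for as long as the bit remains set.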

 **Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.

 > The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.

+### External host: pve (Proxmox hypervisor, 192.168.1.127)
+
+`pve` is the Proxmox VE host — the hypervisor running **every** VM (pfSense, the k8s nodes, the devvm, HA, Windows). It is not in the cluster. Since 2026-06-10 its **full systemd journal ships to cluster Loki**, closing a gap (the most critical host previously had no central logging) and giving the Wave-1 **S1** security rule its data source (`docs/architecture/security.md`).
+
+**Why now:** emo's Claude agent was granted **root SSH** to the host (a dedicated shared-root key `emo-pve-agent@devvm`, fingerprint `SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ`, reachable as `ssh pve` from the devvm) so he can manage the host (e.g. the R730 fan daemon) via his agent. To keep an audit trail, **snoopy** (enabled via `/etc/ld.so.preload` → `libsnoopy.so`; config `scripts/pve-snoopy.ini`) logs every `execve()` to journald under identifier `snoopy`, and promtail ships it to Loki.
+
+**Logs** — `promtail` v3.5.1 (amd64) at `/usr/local/bin/promtail`, config `scripts/pve-promtail.yaml`, unit `scripts/pve-promtail.service`. Ships `/var/log/journal` to `https://loki.viktorbarzin.lan/loki/api/v1/push` (`insecure_skip_verify` — the internal `.lan` cert isn't publicly trusted; the name resolves via the Technitium CNAME above, no `/etc/hosts` pin). Relabels: `unit`, `level`, `identifier`; sshd lines (`identifier=~"sshd.*"`) are re-jobbed to `sshd-pve` so the S1 rule matches. Streams:
+- `{job="pve-journal", host="pve"}` — full host journal (kernel, pvestatd, fan-control, NFS, etc.).
+- `{job="pve-journal", identifier="snoopy"}` — **command audit** (every execve: `uid login tty sid cwd cmdline`).
+- `{job="sshd-pve"}` — sshd auth; an `Accepted publickey ... SHA256:<fp>` line ties a session to a key (e.g. emo's fp above). Feeds S1.
+
+**Attribution caveat:** all SSH is shared-root, so snoopy `uid`/`login` are always `root`; attribute a command to a person by correlating its `sid`/timestamp with the matching `{job="sshd-pve"}` Accepted-publickey line (key fingerprint). emo's agent arrives SNAT'd as `192.168.1.2`, which is in the S1 allowlist, so legitimate access does not alert.
+
+Query examples (Grafana → Loki): `{host="pve"}`, `{job="pve-journal", identifier="snoopy"}` (command audit), `{job="sshd-pve"} |= "Accepted publickey"`.
+
+> Hand-managed (not Terraform), like the rpi-sofia and fan-control pieces: the promtail binary/config/unit and the snoopy enable (`/etc/ld.so.preload`) live on the host (Loki resolves via the Technitium CNAME — no `/etc/hosts` pin). Source-of-truth files: `scripts/pve-promtail.{yaml,service}` + `scripts/pve-snoopy.ini`; deploy steps are in the `pve-promtail.yaml` header.
+
 ### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)

 The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@@ -541,11 +541,33 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.

-**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked.
+**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

-**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Changes are ungated (push ≠ apply); the real boundary is apply-time (`scripts/tg apply` needs an admin Vault token + cluster RBAC).
+**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)

-**Status (2026-06-08):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, **per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), and the Authentik `T3 Users` edge gate (applied + verified)**. **Remaining (held / future):** the emo cutover to his own locked clone (Phase 5), the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
+**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose`→`diagnosing-bugs`, `write-a-skill`→`writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
+
+**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
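
The merge-only re-assert can be sketched in Python as follows. The real launcher is a shell script using jq, so this is a swapped-in illustration rather than its code; `reassert_onboarding` is a hypothetical name, and both the `version` parameter and the choice to leave an existing `lastOnboardingVersion` untouched are assumptions.

```python
import json


def reassert_onboarding(raw: str, version: str) -> str:
    """Return updated JSON text, or the input unchanged if a merge is unsafe."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # empty or corrupt file: do nothing
    if not isinstance(data, dict):
        return raw  # unexpected top-level type: do nothing
    # Merge-only: touch just the onboarding keys, keep everything else.
    data["hasCompletedOnboarding"] = True
    data.setdefault("lastOnboardingVersion", version)  # assumption: never overwrite
    return json.dumps(data, indent=2)


before = '{"theme": "dark"}'
print(reassert_onboarding(before, "1.0.0"))
```

Running it twice on its own output changes nothing further, which is the idempotence the launcher relies on when it runs before every `claude` start.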
+
+**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
+
+**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`.
+
+**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
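
One way to picture the sticky per-user port allocation is the Python sketch below. How `roster_engine.py` actually allocates ports is not shown in this document, so the lowest-free-port strategy and the `allocate_ports` function are assumptions; only `PLAYWRIGHT_BASE_PORT=8931` comes from the text above.

```python
PLAYWRIGHT_BASE_PORT = 8931


def allocate_ports(users, existing):
    """Assign each user a port, never moving a user who already has one."""
    ports = dict(existing)          # sticky: keep prior assignments as-is
    used = set(ports.values())
    candidate = PLAYWRIGHT_BASE_PORT
    for user in users:
        if user in ports:
            continue
        while candidate in used:    # skip ports already taken
            candidate += 1
        ports[user] = candidate
        used.add(candidate)
    return ports


# A new user joins a roster where emo already holds the base port.
print(allocate_ports(["ancamilea", "emo"], {"emo": 8931}))
```

Keeping existing assignments fixed means a user's `~/.claude.json` entry stays valid across reconciles even when the roster order changes.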
+
+**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
+The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
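
The fast-forward conditions for `refresh_user_clone` can be written as a small decision function. This Python sketch only models the rules stated above; the real step is part of the shell provisioner, and `CloneState`, `refresh_action`, and the action labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CloneState:
    branch: str            # currently checked-out branch
    dirty: bool            # uncommitted changes in the working tree
    has_upstream: bool     # master tracks a remote branch
    commits_ahead: int     # local commits not on the upstream


def refresh_action(state: CloneState) -> str:
    """Decide what the hourly freshen may do to a managed clone."""
    if state.branch != "master":
        return "fetch-only"
    if state.dirty or state.commits_ahead > 0:
        # User work is never touched; the reconcile just warns.
        return "fetch-and-warn"
    if not state.has_upstream:
        return "fetch-only"
    return "fetch-and-fast-forward"


print(refresh_action(CloneState("master", False, True, 0)))
print(refresh_action(CloneState("master", True, True, 0)))
```

Fetching always happens, but the working tree only moves in the single case where nothing of the user's could be lost.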
+
+**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
+1. Add their Forgejo user as a **write** collaborator on `viktor/infra` (`PUT /api/v1/repos/viktor/infra/collaborators/<login>`).
+2. Mint a PAT — the admin REST endpoint 404s here, use the in-pod CLI: `kubectl -n forgejo exec deploy/forgejo -- su -s /bin/sh git -c "forgejo admin user generate-access-token --username <login> --token-name devvm-infra-git --scopes 'write:repository'"`.
+3. Install it in their `~/.git-credentials` (`https://<login>:<token>@forgejo.viktorbarzin.me`, mode 600) + `git config --global credential.helper store`, set `user.name`/`user.email`.
+4. The reconcile wires the clone side automatically (`wire_forgejo_remote`): `forgejo` remote + `master` tracking `forgejo/master` on every non-admin infra clone (origin stays the anonymous GitHub mirror). No manual step since 2026-06-10.
+5. (Optional — Viktor's call per user) Grant direct master push: add their login to the `master` branch-protection push + merge whitelists (`PATCH /api/v1/repos/viktor/infra/branch_protections/master`). Done for `ebarzin` 2026-06-10.
+6. Verify: branch push succeeds; a `master` push succeeds for whitelisted users and is rejected with `Not allowed to push to protected branch` otherwise.
+
+**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.
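
The per-session idempotency described above can be sketched as follows. This is a minimal illustration, not the actual `tmux-persist` script: the three-column TSV layout (name, cwd, conversation uuid), the function name `sessions_to_restore`, and the sample paths and uuids are all assumptions made for the example.

```python
import csv
import io


def sessions_to_restore(tsv_text, live_sessions):
    """Return (name, cwd, command) for each saved session that is not running.

    Assumed TSV columns: session name, working directory, conversation uuid.
    Sessions that already exist are skipped, so running this twice is a no-op
    and a partial loss only recreates the sessions that are missing.
    """
    live = set(live_sessions)
    plan = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 3:
            continue  # ignore blank or malformed snapshot rows
        name, cwd, conversation_uuid = row
        if name in live:
            continue  # already running: leave it untouched
        plan.append((name, cwd, ["claude", "--resume", conversation_uuid]))
    return plan


# Hypothetical snapshot: two saved sessions, one of which survived the OOM kill.
snapshot = "infra\t/home/emo/code\tuuid-a\nblog\t/home/emo/blog\tuuid-b\n"
for name, cwd, command in sessions_to_restore(snapshot, ["infra"]):
    print(name, cwd, " ".join(command))
```

Because the plan is computed from the difference between the snapshot and the live session list, the boot-time service and the on-demand button can share the same logic without double-creating sessions.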
+
+**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.

 ## Related

--- a/docs/architecture/networking.md
+++ b/docs/architecture/networking.md
@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS

 ## Overview

-The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
+The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.

 ## Architecture Diagram

@ -16,12 +16,14 @@ graph TB
    Traefik[Traefik Ingress<br/>3 replicas + PDB]

    subgraph "Middleware Chain"
-        CS[CrowdSec Bouncer<br/>fail-open]
+        AntiAI[Anti-AI bot-block<br/>fail-open]
        Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
        RL[Rate Limiter<br/>429 response]
        Retry[Retry<br/>2 attempts, 100ms]
    end

+    CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
+
    subgraph "Proxmox Host (eno1)"
        vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
        vmbr1[vmbr1 Internal<br/>VLAN-aware]
@ -53,8 +55,9 @@ graph TB
    Internet -->|DNS query| CF
    CF -->|CNAME to tunnel| CFD
    CFD --> Traefik
-    Traefik --> CS
-    CS --> Auth
+    CSdrop -.->|banned IPs dropped before Traefik| Traefik
+    Traefik --> AntiAI
+    AntiAI --> Auth
    Auth --> RL
    RL --> Retry
    Retry --> Service
@ -82,7 +85,7 @@ graph TB
 | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
 | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
 | Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
-| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer |
+| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open |
 | Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
 | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
 | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up

 ### Ingress Flow

+CrowdSec is **not** a step in this chain — banned IPs are dropped before the
+request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host
+nftables on direct hosts). The flow below is for a request that survives that
+out-of-band gate.
+
 ```mermaid
 sequenceDiagram
    participant Client
-    participant Cloudflare
+    participant CFedge as Cloudflare (edge WAF: crowdsec_ban block)
    participant Cloudflared
    participant Traefik
-    participant CrowdSec
+    participant AntiAI
    participant Authentik
    participant RateLimit
    participant Retry
    participant Service
    participant Pod

-    Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me
-    Cloudflare->>Cloudflared: Forward via tunnel (QUIC)
+    Client->>CFedge: HTTPS request to blog.viktorbarzin.me
+    Note over CFedge: banned IP → blocked here (proxied hosts)
+    CFedge->>Cloudflared: Forward via tunnel (QUIC)
    Cloudflared->>Traefik: HTTP to LoadBalancer IP
-    Traefik->>CrowdSec: Apply bouncer middleware
-    CrowdSec->>Authentik: If allowed, check auth (protected=true)
+    Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook)
+    Traefik->>AntiAI: anti-AI bot-block (fail-open)
+    AntiAI->>Authentik: If allowed, check auth (protected=true)
    Authentik->>RateLimit: If authenticated, check rate limit
    RateLimit->>Retry: If within limit, continue
    Retry->>Service: Forward to Service
@ -234,24 +244,27 @@ sequenceDiagram
    Service-->>Retry: Response
    Retry-->>RateLimit: Response
    RateLimit-->>Authentik: Response (strip auth headers)
-    Authentik-->>CrowdSec: Response
-    CrowdSec-->>Traefik: Response
+    Authentik-->>AntiAI: Response
+    AntiAI-->>Traefik: Response
    Traefik-->>Cloudflared: Response
-    Cloudflared-->>Cloudflare: Response via tunnel
-    Cloudflare-->>Client: HTTPS response
+    Cloudflared-->>CFedge: Response via tunnel
+    CFedge-->>Client: HTTPS response
 ```

 ### Middleware Chain

-Every ingress created by the `ingress_factory` module follows this chain:
+CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
+(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
+proxied hosts), so banned IPs never reach the chain and there is no per-request
+CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
+Traefik chain:

-1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages.
+1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
-3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default limits are generous; services like Immich and Nextcloud have higher custom limits.
+3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).

 Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
 - **HTTP/3 (QUIC)**: Enabled globally on Traefik.
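
The rate-limit numbers in step 3 can be checked with a small token-bucket model. Traefik documents its rate limiter as a token bucket in which `average` is the refill rate and `burst` is the bucket size; the class below is a simplified sketch of that idea, not Traefik's code. It assumes the bucket starts full and that the ~70 boot requests arrive at the same instant; `TokenBucket` and `rejected` are names invented for the example.

```python
class TokenBucket:
    """Simplified token bucket: refills at `average` tokens/s up to `burst`."""

    def __init__(self, average, burst):
        self.rate = float(average)
        self.capacity = float(burst)
        self.tokens = float(burst)  # assumption: the bucket starts full
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the previous request, capped at burst.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer 429 here


def rejected(bucket, arrival_times):
    """Count requests the bucket refuses for the given arrival timestamps."""
    return sum(1 for t in arrival_times if not bucket.allow(t))


boot_burst = [0.0] * 70  # ~70 parallel requests at page load

print(rejected(TokenBucket(10, 50), boot_burst))   # shared default: 20 refused
print(rejected(TokenBucket(50, 300), boot_burst))  # actualbudget-rate-limit: 0 refused
```

Under these assumptions the shared default refuses the last 20 requests of the boot burst, which matches the "429'd the tail" symptom, while the dedicated 50/300 middleware admits all of them.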

 ### Entrypoint Transport Timeouts
@ -348,10 +361,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
 | pfSense | `stacks/pfsense/` | VM + cloud-init config |
 | Technitium | `stacks/technitium/` | Deployment, Service, PVC |
 | Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
-| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer |
+| CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) |
 | Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
 | MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
-| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config |
+| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
 | ingress_factory | `modules/ingress_factory/` | IngressRoute + middleware chain |

 ### Key Configuration Files
@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac

 **Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.

-### Why Fail-Open on CrowdSec Bouncer?
+### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)

-**Alternatives considered**:
-1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic.
-2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages.
+CrowdSec used to enforce inline as a Traefik middleware (the
+`crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
+never invoked, so it enforced nothing; the plugin was removed and enforcement
+moved off the request path entirely (full history in
+`docs/architecture/security.md`). It now runs on two surfaces:

-**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on.
+- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
+  nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
+  the load-bearing one: with Traefik on a dedicated LB IP at
+  `externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
+  and transit the node's `forward` chain (not `input`) — which is exactly why the
+  ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
+  for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
+  client IP the firewall-bouncer (and the CF edge rule) would have nothing to
+  match on.
+- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
+  by the `crowdsec-cf-sync` CronJob.
+
+Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
+receiving new decisions (existing drops persist) and the CF sync skips a run —
+neither ever blocks legitimate traffic. Availability > strict bot blocking, and
+out-of-band enforcement adds **zero per-request latency** (no Traefik hop).

 ### Why HTTP/3 (QUIC)?

@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac

 **Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.

-**Diagnosis**: Middleware chain is blocking traffic. Check:
-1. Authentik status: `kubectl get pod -n authentik`
-2. CrowdSec LAPI status: `kubectl get pod -n crowdsec`
+**Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the
+chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check:
+1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable)
+2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down)
 3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`

 **Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.
@ -515,11 +546,11 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac

 ### Rate Limiter Blocks Legitimate Traffic

-**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads).
+**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads, ActualBudget's "Server returned an error while checking its status" boot screen).

 **Diagnosis**: Check Traefik middleware config for the affected IngressRoute.

-**Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min.
+**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.

 ### Large Downloads or Uploads Truncate / Fail Partway

--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -2,40 +2,50 @@

 ## Overview

-The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
+The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation.

 ## Architecture Diagram

+CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The
+Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry;
+CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that
+chain entirely.
+
 ```mermaid
-graph LR
+graph TB
    Internet[Internet]
-    CF[Cloudflare WAF]
+
+    subgraph "Proxied hosts (orange-cloud)"
+        CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block]
+    end
+    subgraph "Direct hosts (grey-cloud / internal)"
+        NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward]
+    end
+
    Tunnel[Cloudflared Tunnel]
-    CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
-    AntiAI[Anti-AI Check<br/>poison-fountain]
-    ForwardAuth[Authentik ForwardAuth]
-    RateLimit[Rate Limit Middleware]
-    Retry[Retry Middleware<br/>2 attempts, 100ms]
+    Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry]
    Backend[Backend Service]

    LAPI[CrowdSec LAPI<br/>3 replicas]
-    Agent[CrowdSec Agent]
+    Agent[CrowdSec Agent<br/>parses Traefik logs]
+    FWB[cs-firewall-bouncer<br/>DaemonSet, every node]
+    CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min]

-    Internet -->|1| CF
-    CF -->|2| Tunnel
-    Tunnel -->|3| CrowdSec
-    CrowdSec -.->|Query| LAPI
-    Agent -.->|Report| LAPI
-    CrowdSec -->|4. Pass/Block| AntiAI
-    AntiAI -->|5. Human/Bot| ForwardAuth
-    ForwardAuth -->|6. Authenticated| RateLimit
-    RateLimit -->|7. Under Limit| Retry
-    Retry -->|8. Success/Retry| Backend
+    Internet -->|proxied| CFedge
+    Internet -->|direct| NFT
+    CFedge -->|allowed| Tunnel
+    Tunnel --> Traefik
+    NFT -->|allowed| Traefik
+    Traefik --> Backend

-    style CrowdSec fill:#f9f,stroke:#333
-    style AntiAI fill:#ff9,stroke:#333
-    style ForwardAuth fill:#9f9,stroke:#333
-    style RateLimit fill:#99f,stroke:#333
+    Agent -.->|report| LAPI
+    LAPI -.->|all decisions incl. CAPI| FWB
+    FWB -.->|program drop rules| NFT
+    LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync
+    CFsync -.->|push IP list| CFedge
+
+    style CFedge fill:#f9f,stroke:#333
+    style NFT fill:#f9f,stroke:#333
 ```

 ## Components
@ -44,7 +54,8 @@ graph LR
 |-----------|---------|----------|---------|
 | CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
 | CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
-| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
+| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` |
+| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` |
 | Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
 | poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
 | cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
@ -54,11 +65,15 @@ graph LR

 ### Request Security Layers

-Every incoming request passes through 6 security layers:
+A request passes through the layers below in order. Layers 1-3 act **before**
+the Traefik chain: CrowdSec IP-reputation enforcement drops banned IPs in-kernel
+on direct hosts, or blocks them at the Cloudflare edge on proxied hosts (see
+CrowdSec Threat Intelligence below). A request that survives that out-of-band
+gate then passes through the Traefik middleware chain, from layer 4 onward:

-1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
-2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
-3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
+1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only)
+2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts)
+3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency)
 4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
 5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
 6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
@ -80,11 +95,71 @@ CrowdSec operates in a hub-and-agent model:
 - Reports malicious IPs to LAPI
 - Shares threat intel with CrowdSec community (anonymized)

-**Traefik Bouncer Plugin**:
- Integrated as Traefik middleware
- Queries LAPI for IP reputation on each request
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
- Blocks IPs on ban list, allows others
+Enforcement is split across **two out-of-band surfaces**, neither of which adds
+any per-request latency. (See "Why the Traefik bouncer plugin was removed" below
+for the supersession history — there is no longer an inline Traefik bouncer.)
+
+**Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop**
+(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`):
+- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip
+  crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND
+  the `forward` hooks**. The `forward` hook is required because Traefik is a
+  LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the
+  Traefik **pod** and transits the node's `forward` hook (not `input`) with the
+  real client IP preserved. Chains use `policy accept` (only set members drop —
+  it can never blackhole normal traffic).
+- Pulls **all** decisions from LAPI, **including the CAPI community blocklist
+  (~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching
+  Traefik** → zero per-request hops, no Traefik involvement at all.
+- **Packaging**: cs-firewall-bouncer publishes no container image, so the
+  **v0.0.34** static binary is fetched at runtime by an initContainer onto a
+  `debian:bookworm-slim` runtime container. Needs `hostNetwork` +
+  `NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key:
+  **`firewall`**.
+- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions
+  (existing drop rules persist); it never blocks legitimate traffic.
+
+**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block**
+(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
+- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop
+  would never see them. Enforcement is instead a single Cloudflare Rules List
+  **`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)`
+  → **block** action, which covers every proxied host in the zone.
+- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min,
+  pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped**
+  decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI
+  community blocklist** — that set is far too large for a CF Rules List (the CF
+  account hard-limits to **one** list), and CAPI is already covered in-kernel on
+  direct hosts and by Cloudflare's own managed protections on proxied hosts.
+  Registered bouncer key: **`kvsync`**.
+- **Block-only**: the single-list limit precludes a separate
+  captcha/managed-challenge list, so both ban and captcha decisions are enforced
+  as a plain block at the edge.
+- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` +
+  `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit
+  must never wall a user out of the login / WebAuthn flow they authenticate
+  through; auth keeps `traefik-rate-limit` for brute-force protection.
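
The selection rule for the edge list (local ban/captcha, ip-scoped, CAPI excluded) can be sketched as a pure filter. This is an illustration only, not the contents of `lapi_kv_sync.py`: it assumes decision objects carry `origin`, `scope`, `type` and `value` keys in the style of the LAPI decisions payload, and `edge_block_list` plus the sample addresses (documentation ranges) are hypothetical.

```python
def edge_block_list(decisions):
    """Pick the IPs that would be pushed to the Cloudflare list.

    Assumed decision shape: {"origin": ..., "scope": ..., "type": ..., "value": ...}.
    """
    ips = set()
    for decision in decisions:
        if decision.get("origin", "").upper() == "CAPI":
            continue  # community blocklist: too large for the edge list
        if decision.get("scope", "").lower() != "ip":
            continue  # only ip-scoped decisions, no ranges or countries
        if decision.get("type") not in ("ban", "captcha"):
            continue  # both remediations end up as a plain block at the edge
        ips.add(decision["value"])
    return sorted(ips)


sample = [
    {"origin": "crowdsec", "scope": "Ip", "type": "ban", "value": "203.0.113.9"},
    {"origin": "cscli", "scope": "Ip", "type": "captcha", "value": "198.51.100.7"},
    {"origin": "CAPI", "scope": "Ip", "type": "ban", "value": "192.0.2.1"},
    {"origin": "crowdsec", "scope": "Range", "type": "ban", "value": "192.0.2.0/24"},
]
print(edge_block_list(sample))
```

Returning a sorted, de-duplicated list keeps successive sync runs comparable, so a run can cheaply detect that nothing changed before touching the Cloudflare API.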
+
+**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers
+RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so
+internal users are never enforced. Internal access uses split-horizon DNS
+straight to Traefik, and direct internal clients are RFC1918 — both whitelisted.
+
+#### Why the Traefik bouncer plugin was removed
+
+Enforcement used to run as an inline Traefik middleware — the
+`crowdsec-bouncer-traefik-plugin` (Go, interpreted by Yaegi), which queried LAPI on every
+request and could serve a Cloudflare Turnstile captcha for soft remediations.
+On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was
+registered but enforced **nothing** despite appearing healthy. Rather than chase
+the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin
+static config + initContainer download, the `crowdsec` Middleware CRD, the
+`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare
+Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was
+replaced by the two out-of-band surfaces above, which add zero per-request
+latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination /
+IP-List-capacity issues are also moot now that CAPI is excluded from the edge
+list and dropped in-kernel instead.)

 **Metabase** (disabled by default):
 - Dashboard for CrowdSec analytics
@ -189,7 +264,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
 | W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
 | W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
 | W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
-| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
+| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7, S1). **S1 activated 2026-06-10** — promtail on the PVE host now ships the journal to Loki (`scripts/pve-promtail.yaml`); sshd auth lands as `job=sshd-pve` (the S1 data source). The same shipper carries snoopy `execve()` command audit as `{job="pve-journal", identifier="snoopy"}` (forensic, not alerting). Deployed because emo's agent was given root SSH to the host (shared key) — see `docs/architecture/monitoring.md` → "External host: pve". |
 | W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
 | W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
 | W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
@ -205,7 +280,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
 |---|---|---|---|
 | K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
 | Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
-| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
+| PVE sshd auth log | journald (`_SYSTEMD_UNIT=ssh.service`, `SYSLOG_IDENTIFIER=sshd-session`); promtail relabels `identifier=~"sshd.*"` → `job=sshd-pve` | promtail systemd unit on Proxmox host (192.168.1.127), `scripts/pve-promtail.yaml` — **LIVE 2026-06-10** | `job=sshd-pve` |
 | Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |

 #### Alert rules (16 total)
@ -255,6 +330,10 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same

 **Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.

+**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
+
+**Two privileged footholds for the warm break-glass UI (2026-06-12):** the in-cluster `claude-breakglass` service (`breakglass.viktorbarzin.me`, warm case = devvm wedged, cluster healthy) holds one ed25519 key (Vault `secret/claude-breakglass/ssh_key`) authorising: (1) a `breakglass` user on the **devvm** with NOPASSWD sudo (`from="10.0.20.0/24"` — the Calico-SNAT node subnet); (2) a **PVE** `authorized_keys` entry pinned to `command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2"` (pfSense's inter-VLAN SNAT IP) that only runs the verbs `status|forensics|reset|stop|start|cycle` against VM 102. The key is reachable ONLY by the breakglass pod (own namespace, no Vault role, ESO-synced); the shared `claude-agent` pod's `terraform-state` Vault policy is explicitly DENIED `secret/claude-breakglass/*`. Reset is autonomous (the agent may fire it), forensics-first. Reachable via Authentik or the basic-auth fallback — LAN-routed, not WAN-exposed. Runbook: `docs/runbooks/breakglass-ui.md`; ADR: `claude-agent-service/docs/adr/0001-breakglass-security-architecture.md`.
+
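
Reviewer sketch (not part of the diff): the PVE foothold relies on a forced command that accepts only a fixed set of verbs. The real `/usr/local/bin/breakglass-pve` is not shown in this change, so the snippet below is only a minimal illustration of how such an allow-list check could look. The function name `parse_breakglass_command` is hypothetical, and the assumption is that the wrapper reads the client's request from the `SSH_ORIGINAL_COMMAND` value that sshd passes to a forced command.

```python
import shlex

# Verbs copied from the documented forced-command allow-list.
ALLOWED_VERBS = {"status", "forensics", "reset", "stop", "start", "cycle"}


def parse_breakglass_command(original_command: str) -> str:
    """Return the validated verb, or raise ValueError.

    `original_command` stands in for SSH_ORIGINAL_COMMAND (assumption:
    the real wrapper reads it from the environment).
    """
    # shlex.split raises ValueError itself on unbalanced quotes.
    parts = shlex.split(original_command)
    if len(parts) != 1:
        # Rejects empty input and anything carrying extra arguments,
        # e.g. "reset; rm -rf /" splits into four tokens.
        raise ValueError("expected exactly one verb")
    verb = parts[0]
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {verb!r}")
    return verb
```

Rejecting everything except a single known token keeps the key useless for arbitrary command execution even if it leaks, which matches the intent of pairing `command=` with `restrict` in the `authorized_keys` entry.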
 #### Why no canary tokens

 Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
@ -326,10 +405,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**

 | Path | Purpose |
 |------|---------|
-| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
+| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` |
+| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) |
+| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) |
 | `stacks/kyverno/` | Kyverno deployment + policies |
 | `stacks/poison-fountain/` | Anti-AI service + CronJob |
-| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
+| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) |
 | `stacks/platform/modules/ingress_factory/` | Per-service security toggles |

 ### Vault Paths
@ -439,7 +520,11 @@ spec:
 **Fix**:
 1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
 2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
-3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
+   — the in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct
+   hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the
+   `crowdsec_ban` CF list within ~2 min.
+3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet
+   + internal CIDRs are already whitelisted, so internal clients are never banned).
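
Reviewer sketch (not part of the diff): before deleting a decision it can help to confirm whether the address should have been covered by the whitelist in the first place. The contents of `stacks/crowdsec/whitelist.yaml` are not shown here, so the ranges below are assumptions: the three RFC 1918 blocks plus `100.64.0.0/10`, the CGNAT range that Tailscale/Headscale use by default. The helper name `is_whitelisted` is hypothetical.

```python
import ipaddress

# Assumed ranges only; the authoritative list is whitelist.yaml.
WHITELISTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "100.64.0.0/10",   # tailnet (assumed default CGNAT range)
    )
]


def is_whitelisted(ip: str) -> bool:
    """Return True if `ip` falls inside any assumed whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELISTED_NETWORKS)
```

If a banned address returns `True` here, the ban points to a whitelist gap or a parser seeing the wrong source IP rather than to a genuinely hostile client.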

 ### Kyverno Policy Blocking Deployment

--- a/docs/architecture/storage.md
+++ b/docs/architecture/storage.md
@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
 - **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
 - **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)

-Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
+`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)

 **Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).

@ -47,7 +47,7 @@ graph TB
    end

    subgraph K8s["Kubernetes Cluster"]
-        CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
+        CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
        CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]

        NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -85,8 +85,7 @@ graph TB
 | Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
 | Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
 | nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
-| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
-| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
+| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
 | TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
 | ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
 | ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
@ -113,7 +112,7 @@ graph TB

 **Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.

-**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
+**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.

 ### Block Storage Flow (Proxmox CSI) — NEW
