infra/stacks/chrome-service/README.md
Viktor Barzin deede6dd11 chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.

Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
  fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
  header to bypass Chrome's hardcoded DNS-rebinding protection (no
  `--remote-allow-hosts` flag exists in stock Chrome 130; verified by
  binary string grep). Bridge also forces Connection: close on HTTP
  responses so Node ws opens a fresh TCP for the WS upgrade rather than
  trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
  actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
  GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
  bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
  CDP, dumps storage_state() (cookies + localStorage), writes atomically
  to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.

Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
  env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
  longer used by code; ExternalSecret kept for symmetry with the snapshot
  endpoint).

Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
  so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
  snapshot via the bearer-gated HTTPS endpoint.

Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
  failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.

Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
2026-06-05 09:19:10 +00:00


# chrome-service
In-cluster headed Chromium exposed over the Chrome DevTools Protocol
(CDP) on TCP :9222. Sibling services drive it instead of running their
own in-process browser — useful when the upstream tries to detect
headless mode (e.g. hmembeds' `disable-devtool.js` redirect-to-google
trap). Also publishes an hourly snapshot of cookies + localStorage so
external dev-box Claude Code sessions can warm their isolated
playwright contexts from the same logged-in profile.
## Connect (in-cluster callers)
```python
import asyncio
from pathlib import Path

from playwright.async_api import async_playwright

CDP_URL = "http://chrome-service.chrome-service.svc.cluster.local:9222"
# Vendored stealth script (see "Add a new in-cluster caller"); the path
# used here is an assumption.
STEALTH_JS = Path("stealth.js").read_text(encoding="utf-8")

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(CDP_URL, timeout=15_000)
        # browser.contexts[0] is the persistent default context (the one
        # the user logs into via noVNC). For bot work that should NOT share
        # cookies, create a fresh incognito context:
        context = await browser.new_context()
        await context.add_init_script(STEALTH_JS)
        page = await context.new_page()
        ...
        await browser.close()

asyncio.run(main())
```
NetworkPolicy is the only gate on the CDP endpoint: it admits namespaces
that carry the client label, plus an explicit fallback (`f1-stream`).
No bearer token is required for the connection itself.
## Snapshot endpoint (external callers)
```bash
# Bearer token comes from Vault secret/chrome-service.api_bearer_token.
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL \
  -H "Authorization: Bearer $TOKEN" \
  https://chrome.viktorbarzin.me/api/snapshot \
  > storage-state.json

# Use the snapshot with @playwright/mcp:
npx @playwright/mcp@latest --port 8931 --host localhost \
  --headless --browser chrome \
  --isolated --storage-state ./storage-state.json
```
The snapshot is refreshed hourly by the `chrome-service-snapshot-harvester`
CronJob (schedule `23 * * * *`) which calls `context.storageState()` via
the CDP endpoint and writes to `/profile/snapshots/storage-state.json`
(atomic rename). The `snapshot-server` sidecar serves that file.
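
A dev-box refresh job could mirror the `curl` call and the atomic-rename write in Python. The following is a minimal sketch, not the harvester's or the refresh timer's actual code: the function names and the destination path passed to `refresh_snapshot` are hypothetical, and only the URL and the bearer-token header come from this README.

```python
import json
import os
import tempfile
import urllib.request

SNAPSHOT_URL = "https://chrome.viktorbarzin.me/api/snapshot"


def build_snapshot_request(token: str, url: str = SNAPSHOT_URL) -> urllib.request.Request:
    """Build the bearer-authenticated GET request for the snapshot endpoint."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def write_atomically(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    Readers see either the old file or the new one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".snapshot-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def refresh_snapshot(token: str, dest: str) -> None:
    """Fetch the snapshot and replace `dest` only if the body parses as JSON."""
    with urllib.request.urlopen(build_snapshot_request(token), timeout=30) as resp:
        body = resp.read()
    json.loads(body)  # reject a truncated or non-JSON body before touching dest
    write_atomically(dest, body)
```

Creating the temp file next to the target keeps the rename on a single filesystem, which is what makes `os.replace` atomic.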
## Add a new in-cluster caller
1. **Label the caller's namespace** so the chrome-service NetworkPolicy
   admits it:
   ```hcl
   resource "kubernetes_namespace" "<ns>" {
     metadata {
       labels = {
         "chrome-service.viktorbarzin.me/client" = "true"
       }
     }
   }
   ```
2. **Inject `CHROME_CDP_URL`** into the caller's pod env:
   ```hcl
   env {
     name  = "CHROME_CDP_URL"
     value = "http://chrome-service.chrome-service.svc.cluster.local:9222"
   }
   ```
3. **Vendor `stealth.js`** into the caller (or just paste; it's ~40
   lines) and apply via `await context.add_init_script(STEALTH_JS)` after
   every `new_context()`. Without it, hmembeds-class anti-bot still trips.
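
On the caller side, the injected variable can be read and sanity-checked before it is handed to `connect_over_cdp`. This is a sketch under assumptions: the helper name `cdp_url_from_env` and the fallback to the in-cluster default are illustrative choices, not something existing callers are known to do.

```python
import os
from urllib.parse import urlsplit

# Same value the Terraform snippet above injects; used here as a fallback.
DEFAULT_CDP_URL = "http://chrome-service.chrome-service.svc.cluster.local:9222"


def cdp_url_from_env(environ=None) -> str:
    """Return the CDP endpoint from CHROME_CDP_URL, or the in-cluster default."""
    env = os.environ if environ is None else environ
    url = env.get("CHROME_CDP_URL", "").strip() or DEFAULT_CDP_URL
    parts = urlsplit(url)
    # connect_over_cdp accepts an HTTP URL or a WebSocket endpoint.
    if parts.scheme not in ("http", "https", "ws", "wss") or not parts.hostname:
        raise ValueError(f"CHROME_CDP_URL is not a usable URL: {url!r}")
    return url.rstrip("/")
```

Failing fast on a malformed value surfaces a bad pod env at startup rather than as a connection timeout later.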
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`main.tf`) and the client (`playwright==1.48.0` in callers' requirements)
must stay on the same Playwright minor version. Bump them in lockstep,
because the Playwright protocol changes between minors.
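
The lockstep rule can be checked mechanically, for example in CI. The sketch below is hypothetical: nothing in this stack is known to run it, and the helper names are made up. It assumes the first `X.Y` in each string is the Playwright version, which holds for the two pin formats shown above.

```python
import re

_VERSION_RE = re.compile(r"(\d+)\.(\d+)")


def major_minor(spec: str) -> tuple[int, int]:
    """Extract (major, minor) from an image tag or a pip requirement string."""
    match = _VERSION_RE.search(spec)
    if match is None:
        raise ValueError(f"no version found in {spec!r}")
    return int(match.group(1)), int(match.group(2))


def pins_match(server_image: str, client_requirement: str) -> bool:
    """True when the server image and the client pin share major.minor."""
    return major_minor(server_image) == major_minor(client_requirement)
```

For example, `pins_match("mcr.microsoft.com/playwright:v1.48.0-noble", "playwright==1.48.0")` evaluates to `True`, while a `v1.49.x` image against the same client pin evaluates to `False`.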
## Operations
- **Storage**: encrypted PVC at `/profile`. Chromium's user-data-dir lives
  at `/profile/chromium-data`; cookies, localStorage, and IndexedDB
  persist there. Snapshots live at `/profile/snapshots/storage-state.json`.
  Backed up as tar+gzip every 6h to `/srv/nfs/chrome-service-backup/`,
  30-day retention.
- **Probes**: TCP/9222. Chrome's CDP serves `/json/version` once it's
  bound; TCP-open is enough for readiness.
- **Health page**: visit `https://chrome.viktorbarzin.me`
  (Authentik-gated) to confirm the pod is up and to log into sites. The
  CDP port stays internal-only.
- **Token rotation**: `vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
  Reloader cascades to the snapshot-server sidecar. Update the cached
  token on any dev box that pulls the snapshot:
  `vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token`.
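
For manual debugging from an admitted namespace, the two checks mentioned under **Probes** can be reproduced with the standard library. This is a rough sketch with hypothetical helper names; the real probes are configured in the stack itself, not in Python.

```python
import json
import socket
import urllib.request


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Mirror the readiness probe: succeed as soon as the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cdp_version(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch /json/version, which CDP serves once the debugging port is bound."""
    url = f"{base_url.rstrip('/')}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A typical session would call `tcp_open("chrome-service.chrome-service.svc.cluster.local", 9222)` first and only then `cdp_version(...)` against the same host and port.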
## Why headed (Xvfb) instead of headless?
`disable-devtool.js` and similar libraries detect `navigator.webdriver`,
console-clear timing, and the `HeadlessChrome/...` user-agent token.
Running headed inside `Xvfb :99` makes the browser report as a normal
Chromium, and the stealth init script handles the JS-visible giveaways.
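
A caller can self-check two of these signals after connecting. The sketch below is illustrative only: `PROBE_EXPRESSION` and `headless_giveaways` are hypothetical names, and the check does not cover console-clear timing.

```python
# JavaScript expression a caller could pass to Playwright's page.evaluate();
# it returns the two values checked below as a dict.
PROBE_EXPRESSION = "({ userAgent: navigator.userAgent, webdriver: navigator.webdriver })"


def headless_giveaways(probe: dict) -> list[str]:
    """List the detectable signals present in a probe result."""
    reasons = []
    if "Headless" in probe.get("userAgent", ""):
        reasons.append("user agent advertises headless mode")
    if probe.get("webdriver") is True:
        reasons.append("navigator.webdriver is true")
    return reasons
```

With a page from the connect recipe, `headless_giveaways(await page.evaluate(PROBE_EXPRESSION))` should come back empty when the headed browser and the stealth script are doing their job.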
## Why direct chromium (CDP) instead of `playwright launch-server`?
`playwright launch-server` creates ephemeral browser contexts per
`connect()` call — cookies and localStorage never persist to the PVC.
The `/profile` mount only ever held npm cache + fontconfig cache
despite the original docs claiming it held "cookies, localStorage,
IndexedDB". Switched 2026-06-04 to direct chromium launch with
`--user-data-dir=/profile/chromium-data --remote-debugging-port=9222`
so the persistent profile actually persists, and callers migrate
`chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`.