chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline

The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.

Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
  fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
  header to bypass Chrome's hardcoded DNS-rebinding protection (no
  `--remote-allow-hosts` flag exists in stock Chrome 130; verified by
  binary string grep). Bridge also forces Connection: close on HTTP
  responses so Node ws opens a fresh TCP for the WS upgrade rather than
  trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
  actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
  GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
  bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
  CDP, dumps storage_state() (cookies + localStorage), writes atomically
  to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.

Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
  env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
  longer used by code; ExternalSecret kept for symmetry with the snapshot
  endpoint).

Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
  so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
  snapshot via the bearer-gated HTTPS endpoint.

Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
  failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.

Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
Viktor Barzin 2026-06-04 05:15:49 +00:00
parent b64d8d6168
commit deede6dd11
10 changed files with 1152 additions and 177 deletions


@ -1,16 +1,23 @@
# chrome-service — In-cluster headed Chromium pool
# chrome-service — In-cluster headed Chromium with persistent profile
## Overview
`chrome-service` is a single-replica, persistent-profile, bearer-token-gated
Playwright **launch-server** that exposes a headed Chromium browser over a
WebSocket. Sibling services connect to it instead of running their own
in-process Chromium when the upstream's anti-bot tooling
(`disable-devtool.js` redirect-to-google trap, console-clear timing tricks,
`navigator.webdriver` checks) defeats a headless browser.
`chrome-service` is a single-replica, persistent-profile, headed
Chromium browser exposed over the Chrome DevTools Protocol (CDP). It
serves two distinct populations:
Initial caller: `f1-stream`'s `playback_verifier`. Future callers attach
via the WS+token contract documented in `stacks/chrome-service/README.md`.
1. **In-cluster automation callers** (e.g. `f1-stream`'s
`playback_verifier`, `chrome_browser` extractor) — connect via
`chromium.connect_over_cdp("http://chrome-service.chrome-service.svc:9222")`
to drive a real browser when upstream anti-bot trips a headless one
(`disable-devtool.js` redirect-to-google trap, `navigator.webdriver`
checks, console-clear timing tricks).
2. **External dev-box Claude Code sessions** — pull an hourly snapshot
of cookies + localStorage from `chrome.viktorbarzin.me/api/snapshot`
(bearer-gated) and seed local `@playwright/mcp` instances in
`--isolated --storage-state=…` mode. This is how concurrent Claude
Code sessions get their own isolated browser contexts without losing
shared cookies for logged-in sites.
## Why a separate stack
@ -25,8 +32,8 @@ In-process Chromium inside `f1-stream`:
`chrome-service` solves this by:
1. Running **headed** under `Xvfb :99` (via `playwright launch-server` with
a JSON config that pins `headless: false`).
1. Running **headed** under `Xvfb :99` (chromium with `DISPLAY=:99`,
not `--headless`).
2. Living in a long-lived pod so JIT browser launch latency disappears.
3. Allowing a per-context init script
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
@ -35,25 +42,67 @@ In-process Chromium inside `f1-stream`:
to hide the `disable-devtool-auto` script-tag attribute so the lib's
IIFE exits early.
## Wire protocol
## Wire protocol — CDP (current, since 2026-06-04)
```text
ws://chrome-service.chrome-service.svc.cluster.local:3000/<TOKEN>
http://chrome-service.chrome-service.svc.cluster.local:9222
┌───────────────────────────────┼───────────────────────────────┐
│ caller pod │ chrome-service pod
│ (e.g. f1-stream) │ (single replica)
│ │
│ CHROME_WS_URL ──────────────┘
│ CHROME_WS_TOKEN ─── from `secret/chrome-service.api_bearer_token` (ESO)
│ CHROME_CDP_URL ──────────────┘
│ await chromium.connect(f"{ws}/{token}")
│ await ctx.add_init_script(STEALTH_JS)
│ await chromium.connect_over_cdp(cdp_url)
│ context = await browser.new_context() ← incognito (no cookies)
│ OR: context = browser.contexts[0] ← persistent (shared cookies)
│ await context.add_init_script(STEALTH_JS)
│ page.goto("https://upstream.com/embed/...")
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
```
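A caller following the diagram above might look roughly like the sketch below. It assumes the Playwright Python async API and a `CHROME_CDP_URL` env var as described in this document; the helper names `pick_context` and `open_page` are hypothetical, not taken from `f1-stream`.

```python
import os


async def pick_context(browser, persistent: bool):
    """Return the shared persistent context, or a fresh incognito one."""
    if persistent:
        if not browser.contexts:
            raise RuntimeError("no browser contexts found")
        return browser.contexts[0]  # shares cookies with the noVNC session
    return await browser.new_context()  # isolated, starts with no cookies


async def open_page(url: str, stealth_js: str, persistent: bool = False) -> str:
    """Load `url` through chrome-service and return the page title."""
    # Imported lazily so pick_context() stays usable without Playwright.
    from playwright.async_api import async_playwright

    cdp_url = os.environ["CHROME_CDP_URL"]
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(cdp_url)
        context = await pick_context(browser, persistent)
        await context.add_init_script(stealth_js)
        page = await context.new_page()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()
            if not persistent:
                await context.close()  # never close the shared context
```

Typical use would be `asyncio.run(open_page("https://upstream.com/embed/...", STEALTH_JS))`. The sketch deliberately leaves the shared context open, since closing it would affect every other caller.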
### Wire protocol — WS (legacy, removed 2026-06-04)
The previous design used `playwright launch-server --browser chromium`
with a path-token (`ws://...:3000/<TOKEN>`). Callers used
`chromium.connect(ws_url)`. **Problem**: `launch-server` creates
ephemeral browser contexts per `connect()` call, so cookies never
persisted to the PVC despite the `/profile` mount. We migrated to
direct chromium launch with `--user-data-dir` + CDP exposed on :9222
so cookies actually live across pod restarts.
## Cookie warming + snapshot pipeline
```text
┌─────────── chrome-service pod ──────────────────────────────────────────┐
│ │
│ chrome-service container (chromium --user-data-dir=/profile/chromium-data
│ --remote-debugging-port=9222) │
│ ▲ │
│ │ user logs in via noVNC ← chrome.viktorbarzin.me (Authentik) │
│ │ │
│ Cookies + localStorage land in /profile/chromium-data/Default/ │
│ │
│ snapshot-server sidecar (python stdlib HTTP server, :8088) │
│ ↑ serves /profile/snapshots/storage-state.json (bearer-gated) │
└──────────────────────────────────────────────────────────────────────────┘
│ hourly (cron 23 * * * *)
┌──────┴── chrome-service-snapshot-harvester CronJob ─────────────────────┐
│ podAffinity → same node as chrome-service (RWO PVC) │
│ python: connect_over_cdp + ctx.storage_state(path=...) │
│ writes /profile/snapshots/storage-state.json (atomic rename) │
└──────────────────────────────────────────────────────────────────────────┘
External caller (dev box):
systemd timer (hourly) → curl -H "Authorization: Bearer $TOKEN"
https://chrome.viktorbarzin.me/api/snapshot
-o ~/.cache/playwright-shared-storage-state.json
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
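The harvester step in the diagram (dump `storage_state()`, then atomic rename) could be sketched as follows. This is an illustration under assumptions, not the actual harvester script: `atomic_write_json` and `harvest` are hypothetical names, and the temp-file-plus-`os.replace` approach is one common way to get an atomic rename.

```python
import json
import os
import tempfile


def atomic_write_json(state: dict, dest: str) -> int:
    """Write `state` to `dest` via temp file + rename; return bytes written."""
    directory = os.path.dirname(dest) or "."
    os.makedirs(directory, exist_ok=True)
    payload = json.dumps(state).encode("utf-8")
    # The temp file lives in the target directory so the rename stays on
    # one filesystem, which is what makes os.replace() atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".storage-state.", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return len(payload)


async def harvest(cdp_url: str, dest: str) -> int:
    """Dump cookies + localStorage from the persistent context to `dest`."""
    from playwright.async_api import async_playwright  # lazy import

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(cdp_url)
        if not browser.contexts:
            raise SystemExit("no browser contexts found")
        state = await browser.contexts[0].storage_state()
    return atomic_write_json(state, dest)
```

Readers such as the snapshot-server then see either the old file or the complete new one, never a partially written snapshot.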
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
@ -62,17 +111,17 @@ Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The Microsoft image ships only the browser binaries, not the `playwright`
npm SDK; the start command runs `npx -y playwright@1.48.0 launch-server`
which downloads the SDK on first start (cached under `$HOME/.npm` via the
PVC) and reuses it on subsequent restarts.
The harvester + snapshot-server sidecar use
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
minor, with Python-side bindings pre-installed.
## Storage
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
`proxmox-lvm-encrypted`) — Chromium user-data dir + npm cache.
`proxmox-lvm-encrypted`) — Chromium user-data dir at
`/profile/chromium-data` + snapshot at `/profile/snapshots/storage-state.json`.
Encrypted because cookies/localStorage may include third-party auth tokens
for sites callers drive. `HOME=/profile` so npx caches there.
for sites callers drive.
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
retention 30 days.
@ -82,41 +131,45 @@ PVC) and reuses it on subsequent restarts.
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
random, rotated by hand:
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
- ESO syncs into namespace-local Secret `chrome-service-secrets`
(server pod) and `chrome-service-client-secrets` (each caller pod).
- ESO syncs into namespace-local Secret `chrome-service-secrets`. The
`snapshot-server` sidecar reads it via `secret_key_ref`.
- f1-stream still imports the secret (via `chrome-service-client-secrets`)
for parity, but the CDP endpoint no longer requires it for connection —
NetworkPolicy is the gate.
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
to both server and any annotated caller — no manual rollout.
to the snapshot-server sidecar.
- **Dev-box cache**: each dev box keeps a local copy at
`~/.config/playwright/token` (chmod 600). Re-fetch from Vault after
rotation: `vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token`.
## Network controls
- **`kubernetes_network_policy_v1.ws_ingress`** — two separate ingress
rules on the same policy:
- **TCP/3000** (Playwright WS): only namespaces labelled
- **`kubernetes_network_policy_v1.ws_ingress`** — three ingress rules:
- **TCP/9222** (Chromium CDP): only namespaces labelled
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
fallback for `f1-stream` by `kubernetes.io/metadata.name`).
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace, since
the public-facing path is `chrome.viktorbarzin.me` ingress →
Traefik → sidecar. Authentik forward-auth still gates external
access at the Traefik layer.
- **WS port 3000** is internal-only (no ingress, no Cloudflare DNS).
fallback for `f1-stream` by `kubernetes.io/metadata.name`, plus
`chrome-service`'s own namespace for the harvester CronJob).
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace.
- **TCP/8088** (snapshot-server): only the `traefik` namespace
(bearer-token check happens in `snapshot_server.py`).
- **CDP port 9222** is internal-only (no ingress, no Cloudflare DNS).
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
exposes a live HTML5 view of the headed Chromium session via
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. Both static page and WebSocket upgrade share the
same path — Cloudflare proxy, Cloudflared tunnel, Traefik, and
Authentik forward-auth all preserve `Upgrade: websocket`.
Authentik-gated.
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
and is exposed at `chrome.viktorbarzin.me/api/snapshot` via a second
`ingress_factory` call with `auth = "none"` (the bearer check is in
the sidecar, not at the ingress layer).
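For orientation, a bearer-gated stdlib server of the kind described for the snapshot-server sidecar might look like the sketch below. It is not the contents of `snapshot_server.py`; the handler and helper names are hypothetical, and the constant-time comparison via `hmac.compare_digest` is an assumption about a sensible implementation rather than a confirmed detail.

```python
import hmac
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional

SNAPSHOT_PATH = "/profile/snapshots/storage-state.json"


def is_authorized(header: Optional[str], token: str) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header."""
    if not header or not token:
        return False
    scheme, _, presented = header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(presented.strip().encode(), token.encode())


class SnapshotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/api/snapshot":
            return self._send(404, b"not found\n")
        token = os.environ.get("PW_TOKEN", "")
        if not is_authorized(self.headers.get("Authorization"), token):
            return self._send(401, b"unauthorized\n")
        try:
            with open(SNAPSHOT_PATH, "rb") as fh:
                body = fh.read()
        except FileNotFoundError:
            return self._send(404, b"snapshot not yet available\n")
        self._send(200, body, "application/json")

    def _send(self, status: int, body: bytes, ctype: str = "text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8088) -> None:
    """Block forever serving the snapshot endpoint."""
    ThreadingHTTPServer(("0.0.0.0", port), SnapshotHandler).serve_forever()
```

An empty `PW_TOKEN` rejects every request in this sketch, so a missing secret fails closed rather than exposing the snapshot.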
## Adding a new caller
## Adding a new in-cluster caller
See `stacks/chrome-service/README.md` for the four-step recipe:
1. Label the caller's namespace.
2. Add an `ExternalSecret` pulling `secret/chrome-service`.
3. Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN` env vars.
4. Vendor `stealth.js` and apply via `await context.add_init_script(...)`
after every `new_context()`.
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Limits + risks
@ -134,3 +187,9 @@ See `stacks/chrome-service/README.md` for the four-step recipe:
- **No `/metrics` endpoint** — the cluster's generic
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
exporter is day-2 work.
- **Snapshot covers cookies + localStorage only** — Playwright's
`storage_state()` API doesn't capture IndexedDB or sessionStorage.
Sites that rely on those for auth won't warm via the snapshot.
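- **Snapshot can be up to 1h stale per stage** — the harvester and the
dev-box refresh timer each run hourly, so a dev box may lag the live
profile by up to ~2h end to end. If a site rotates session cookies
more often than that, an on-demand refresh CLI is needed (deferred to
follow-on).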


@ -0,0 +1,211 @@
# Runbook — chrome-service snapshot pipeline
Operational playbook for the hourly cookie-snapshot pipeline that warms
external Claude Code sessions on the dev box. Architecture in
`architecture/chrome-service.md`.
## At a glance
| Component | Where | When | What |
|---|---|---|---|
| chrome-service Deployment | `chrome-service` ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data |
| snapshot-server sidecar | same pod | always-on | serves `/api/snapshot`, bearer-gated, port 8088 |
| snapshot-harvester CronJob | `chrome-service` ns | `23 * * * *` | dumps `storage_state()` via CDP → `/profile/snapshots/storage-state.json` |
| dev-box refresh timer | each dev box | hourly | curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` |
| dev-box `playwright-mcp.service` | each dev box | always-on | `@playwright/mcp --isolated --storage-state=…` per-MCP-connection contexts |
## Day-to-day
### Log into a new site (warm the profile)
1. Open `https://chrome.viktorbarzin.me/` (Authentik will gate).
2. The noVNC view of the in-cluster headed chromium loads. Click on the
browser window, navigate, log in.
3. Cookies land in `/profile/chromium-data/Default/Cookies` on the PVC.
4. Within 60 min, the snapshot-harvester CronJob picks them up and
writes the snapshot. Within a further 60 min, dev boxes pull the
new file. New Claude Code sessions then see the new cookies.
5. To skip the wait: trigger the harvester now (next section).
### Trigger snapshot harvester manually
```bash
# Name the job once so the log command targets exactly this run
# (`get jobs -o name | tail -1` sorts by name, not by creation time).
JOB="snapshot-harvest-$(date +%s)"
kubectl -n chrome-service create job \
--from=cronjob/chrome-service-snapshot-harvester \
"$JOB"
# Watch logs
kubectl -n chrome-service logs -f -l job-name="$JOB"
```
Expected: `wrote snapshot (… bytes) to /profile/snapshots/storage-state.json`.
### Trigger dev-box refresh manually
```bash
# On the dev box, as the user whose Claude Code sessions need the new state:
systemctl --user start playwright-snapshot-refresh.service
# Or directly:
/usr/local/bin/playwright-snapshot-refresh
# Verify
ls -la ~/.cache/playwright-shared-storage-state.json
```
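The refresh script itself is not shown in this runbook. As a rough sketch of what such a refresh does (fetch with the bearer token, validate, install atomically), assuming the paths used elsewhere in this document and with `build_request` and `refresh` as hypothetical names:

```python
import json
import os
import urllib.error
import urllib.request

HINTS = {
    401: "token is stale; re-fetch it from Vault",
    404: "snapshot not yet available; trigger the harvester",
}


def build_request(url: str, token: str) -> urllib.request.Request:
    """Attach the bearer token, tolerating a trailing newline in the file."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token.strip()}"}
    )


def refresh(url: str, token_path: str, dest: str) -> int:
    """Download the snapshot to `dest`; return the number of bytes written."""
    with open(os.path.expanduser(token_path), encoding="utf-8") as fh:
        token = fh.read()
    try:
        with urllib.request.urlopen(build_request(url, token), timeout=30) as resp:
            body = resp.read()
    except urllib.error.HTTPError as exc:
        hint = HINTS.get(exc.code, "unexpected status")
        raise SystemExit(f"HTTP {exc.code}: {hint}")
    json.loads(body)  # refuse to install a body that is not valid JSON
    dest = os.path.expanduser(dest)
    tmp = dest + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(body)
    os.chmod(tmp, 0o600)  # the snapshot contains live session cookies
    os.replace(tmp, dest)  # atomic swap: MCP never reads a partial file
    return len(body)
```

A call such as `refresh("https://chrome.viktorbarzin.me/api/snapshot", "~/.config/playwright/token", "~/.cache/playwright-shared-storage-state.json")` mirrors the curl command in the architecture doc.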
### Inspect the current snapshot
```bash
# In-cluster (kubectl exec into the chrome-service pod; jq runs locally):
kubectl -n chrome-service exec deploy/chrome-service -c snapshot-server -- \
cat /profile/snapshots/storage-state.json | jq '.cookies | length'
# Externally (via the bearer-gated endpoint):
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL -H "Authorization: Bearer $TOKEN" \
https://chrome.viktorbarzin.me/api/snapshot | jq '.cookies | length'
```
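When a bare cookie count is not enough, a small summary by domain can help confirm that a specific login made it into the snapshot. The sketch below assumes the standard Playwright storage-state layout (`cookies` with a `domain` field, `origins` with `localStorage` entries); `summarize` and `summarize_file` are hypothetical helper names. It reports counts only, so no cookie values end up in terminal scrollback.

```python
import json
from collections import Counter


def summarize(state: dict) -> dict:
    """Count cookies per domain and localStorage keys per origin."""
    cookies = state.get("cookies", [])
    # Leading dots mark domain-wide cookies; fold them into the bare domain.
    per_domain = Counter(c["domain"].lstrip(".") for c in cookies)
    local_storage = {
        o["origin"]: len(o.get("localStorage", []))
        for o in state.get("origins", [])
    }
    return {
        "cookies": len(cookies),
        "per_domain": dict(per_domain),
        "local_storage_keys": local_storage,
    }


def summarize_file(path: str) -> dict:
    """Load a storage-state JSON file and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```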
## Failure modes
### "no browser contexts found"
The harvester reports `no browser contexts found — chrome-service may
not have launched a persistent context yet` and exits non-zero.
**Cause**: chromium just started and hasn't created its default context
yet, or it crashed.
**Fix**: check chrome-service pod logs (`kubectl -n chrome-service logs
deploy/chrome-service -c chrome-service`). The next hourly run will
retry. If chromium is wedged: `kubectl -n chrome-service rollout restart
deploy/chrome-service` (strategy = Recreate, brief downtime).
### "connect_over_cdp failed"
Harvester or any in-cluster caller can't reach the CDP endpoint.
**Cause**: chrome-service pod not Ready, NetworkPolicy doesn't admit
the caller's namespace, or chromium isn't listening on :9222.
**Diagnose**:
```bash
kubectl -n chrome-service get pods
kubectl -n chrome-service describe networkpolicy chrome-service-ws-ingress
# From inside the cluster (e.g. a debug pod in chrome-service ns):
nc -zv chrome-service.chrome-service.svc.cluster.local 9222
curl -fsSL http://chrome-service.chrome-service.svc.cluster.local:9222/json/version
```
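The same `/json/version` check can be scripted from a debug pod that has Python but no curl. This is a minimal sketch: `parse_version` and `probe` are hypothetical names, and the two keys checked (`Browser`, `webSocketDebuggerUrl`) are the fields Chromium normally returns from that endpoint.

```python
import json
import urllib.request


def parse_version(payload: bytes) -> dict:
    """Validate a CDP /json/version body and pull out the useful fields."""
    info = json.loads(payload)
    missing = [k for k in ("Browser", "webSocketDebuggerUrl") if k not in info]
    if missing:
        raise ValueError(f"not a CDP /json/version response, missing {missing}")
    return {"browser": info["Browser"], "ws_url": info["webSocketDebuggerUrl"]}


def probe(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/json/version; raises on network or protocol errors."""
    url = f"{base_url.rstrip('/')}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_version(resp.read())
```

A timeout or connection refusal from `probe` points at the pod or NetworkPolicy, while a `ValueError` suggests something other than Chromium is answering on the port.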
**Fix**: depends on the diagnosis. NetworkPolicy needs the caller's
namespace label or an explicit name-fallback. If chromium isn't
binding, check the container logs.
### Dev-box `playwright-snapshot-refresh` returns 401
The bearer token in `~/.config/playwright/token` doesn't match the
server's. Almost always means the Vault secret was rotated and the
local cache is stale.
**Fix**:
```bash
vault login -method=oidc # if needed
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
systemctl --user start playwright-snapshot-refresh.service
```
### Dev-box `playwright-snapshot-refresh` returns 404 with "snapshot not yet available"
The harvester hasn't run successfully yet (fresh cluster, or all
recent runs failed). Trigger it manually (see "Trigger snapshot
harvester manually").
### Claude Code sessions still see old cookies
The MCP server reads the snapshot file at process start and seeds each
new context with it. **Existing MCP sessions don't hot-reload** — they
keep the cookies they were seeded with at session start. New sessions
get the fresh snapshot.
**Fix**: restart the MCP server on the dev box to pick up the new file:
```bash
systemctl --user restart playwright-mcp.service
```
### Snapshot file is suspiciously small or empty cookies array
The persistent chromium context isn't holding any cookies. Probably
means the user hasn't logged into anything via noVNC, or chromium was
relaunched without preserving `/profile/chromium-data`.
**Diagnose**:
```bash
kubectl -n chrome-service exec deploy/chrome-service -c chrome-service -- \
ls -la /profile/chromium-data/Default/Cookies
```
A populated `Cookies` SQLite file should be several hundred KB once
real logins exist. If it's missing or empty, log in via noVNC.
## Token rotation
```bash
# Rotate Vault secret (32-byte URL-safe random).
vault kv put secret/chrome-service \
api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')
# Reloader auto-restarts chrome-service pod (snapshot-server picks up new token).
# On EVERY dev box that pulls the snapshot:
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
# Verify the next refresh succeeds:
systemctl --user start playwright-snapshot-refresh.service
journalctl --user -u playwright-snapshot-refresh.service -n 20
```
## Restore from a backup tarball
The 6-hourly backup CronJob writes `tar -czf /backup/YYYY_MM_DD_HH.tar.gz
-C /profile .` to NFS at `/srv/nfs/chrome-service-backup/`. To restore
the entire profile:
```bash
# 1. Scale chrome-service down so its lock is released.
kubectl -n chrome-service scale deploy/chrome-service --replicas=0
# 2. Mount the PVC in a helper pod and restore.
kubectl -n chrome-service apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: {name: restore-helper, namespace: chrome-service}
spec:
  containers:
    - name: helper
      image: alpine:3.20
      command: [sleep, infinity]
      volumeMounts:
        - {name: profile, mountPath: /profile}
        - {name: backup, mountPath: /backup, readOnly: true}
  volumes:
    - name: profile
      persistentVolumeClaim: {claimName: chrome-service-profile-encrypted}
    - name: backup
      persistentVolumeClaim: {claimName: chrome-service-backup-host}
  restartPolicy: Never
EOF
kubectl -n chrome-service wait --for=condition=ready pod/restore-helper
kubectl -n chrome-service exec restore-helper -- sh -c '
rm -rf /profile/chromium-data /profile/snapshots &&
tar -xzf /backup/2026_06_04_18.tar.gz -C /profile
'
# 3. Cleanup helper, scale chrome-service back up.
kubectl -n chrome-service delete pod restore-helper
kubectl -n chrome-service scale deploy/chrome-service --replicas=1
```