The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.
Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
header to bypass Chrome's hardcoded DNS-rebinding protection (no
`--remote-allow-hosts` flag exists in stock Chrome 130; verified by
binary string grep). Bridge also forces Connection: close on HTTP
responses so Node ws opens a fresh TCP for the WS upgrade rather than
trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
CDP, dumps storage_state() (cookies + localStorage), writes atomically
to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.
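
The header handling described for the bridge can be sketched as follows. This is a minimal illustration, not the bridge's actual code: `set_header` is a hypothetical helper, and `localhost:9223` is assumed to be a Host value chromium accepts (its check admits IP addresses and localhost).

```python
def set_header(head: bytes, name: str, value: str) -> bytes:
    """Replace (or add) one header in a raw HTTP head (start line + headers)."""
    lines = head.split(b"\r\n")
    prefix = name.lower().encode("ascii") + b":"
    # Keep the start line, drop any existing copies of the header.
    kept = [lines[0]] + [l for l in lines[1:] if not l.lower().startswith(prefix)]
    # Re-insert the header directly after the start line.
    kept.insert(1, name.encode("ascii") + b": " + value.encode("ascii"))
    return b"\r\n".join(kept)


def rewrite_request(head: bytes) -> bytes:
    # Client -> chromium: present a Host value the DNS-rebinding check accepts.
    return set_header(head, "Host", "localhost:9223")


def rewrite_http_response(head: bytes) -> bytes:
    # chromium -> client, plain HTTP only (not the WS upgrade response):
    # force the client to open a fresh TCP connection for its next request.
    return set_header(head, "Connection", "close")
```

The WebSocket upgrade response has to pass through untouched, since it needs `Connection: Upgrade`.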
Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
longer used by code; ExternalSecret kept for symmetry with the snapshot
endpoint).
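
For callers making the same move, the change looks roughly like this with Playwright's Python sync API. It is a sketch under assumptions: `cdp_url_from_env` and `fetch_title` are hypothetical names, and f1-stream's real code may use the async API instead.

```python
import os


def cdp_url_from_env(env=os.environ) -> str:
    """Read the CDP endpoint from CHROME_CDP_URL (replaces CHROME_WS_URL)."""
    url = env.get("CHROME_CDP_URL")
    if not url:
        raise RuntimeError("CHROME_CDP_URL is not set")
    return url.rstrip("/")


def fetch_title(page_url: str) -> str:
    # Imported lazily so the module loads even where playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url_from_env())
        # Over CDP the existing default context is the persistent profile.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        try:
            page.goto(page_url)
            return page.title()
        finally:
            page.close()  # close only our page, never the shared browser
```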
Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
snapshot via the bearer-gated HTTPS endpoint.
Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.
Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
Runbook — chrome-service snapshot pipeline
Operational playbook for the hourly cookie-snapshot pipeline that warms
external Claude Code sessions on the dev box. The architecture is
described in architecture/chrome-service.md.
At a glance
| Component | Where | When | What |
|---|---|---|---|
| chrome-service Deployment | chrome-service ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data |
| snapshot-server sidecar | same pod | always-on | serves /api/snapshot, bearer-gated, port 8088 |
| snapshot-harvester CronJob | chrome-service ns | 23 * * * * | dumps storage_state() via CDP → /profile/snapshots/storage-state.json |
| dev-box refresh timer | each dev box | hourly | curls chrome.viktorbarzin.me/api/snapshot → ~/.cache/playwright-shared-storage-state.json |
| dev-box playwright-mcp.service | each dev box | always-on | @playwright/mcp --isolated --storage-state=… per-MCP-connection contexts |
Day-to-day
Log into a new site (warm the profile)
- Open https://chrome.viktorbarzin.me/ (Authentik will gate).
- The noVNC view of the in-cluster headed chromium loads. Click on the browser window, navigate, log in.
- Cookies land in /profile/chromium-data/Default/Cookies on the PVC.
- Within 60 min, the snapshot-harvester CronJob picks them up and writes the snapshot. Within a further 60 min, dev boxes pull the new file. New Claude Code sessions see the new cookies.
- To skip the wait: trigger the harvester now (next section).
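
To estimate how long the first leg of that wait will be, the harvester's `23 * * * *` schedule can be evaluated directly. A small sketch follows; `next_harvest` is a hypothetical helper, and the dev-box timer's exact firing minute is not stated here, so only the harvester leg is computed.

```python
from datetime import datetime, timedelta


def next_harvest(now: datetime, minute: int = 23) -> datetime:
    """Next firing time of an hourly cron schedule 'MINUTE * * * *'."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed; use the next hour's.
        candidate += timedelta(hours=1)
    return candidate
```

For example, a login finished at 18:30 is harvested at 19:23, while one finished at 18:10 is harvested at 18:23.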
Trigger snapshot harvester manually
kubectl -n chrome-service create job \
--from=cronjob/chrome-service-snapshot-harvester \
snapshot-harvest-$(date +%s)
# Watch logs
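# Sort by creation time so tail -1 is the job just created, not the alphabetically last one.
kubectl -n chrome-service logs -f -l job-name=$(kubectl -n chrome-service get jobs --sort-by=.metadata.creationTimestamp -o name | tail -1 | cut -d/ -f2)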
Expected: wrote snapshot (… bytes) to /profile/snapshots/storage-state.json.
Trigger dev-box refresh manually
# On the dev box, as the user whose Claude Code sessions need the new state:
systemctl --user start playwright-snapshot-refresh.service
# Or directly:
/usr/local/bin/playwright-snapshot-refresh
# Verify
ls -la ~/.cache/playwright-shared-storage-state.json
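
The contents of /usr/local/bin/playwright-snapshot-refresh are not shown in this runbook. As a rough sketch of what such a refresh step has to do (fetch with the bearer token, reject anything that is not a storage state, replace the cached file atomically), assuming hypothetical function names:

```python
import json
import os
import tempfile
import urllib.request
from pathlib import Path


def fetch_snapshot(url: str, token: str, timeout: float = 30.0) -> bytes:
    """GET the snapshot; urlopen raises HTTPError on 401/404."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def validate_snapshot(raw: bytes) -> dict:
    """Refuse to install anything that is not a Playwright storage state."""
    state = json.loads(raw)
    if not isinstance(state, dict) or not isinstance(state.get("cookies"), list):
        raise ValueError("not a storage state: missing 'cookies' list")
    return state


def write_atomically(dest: Path, raw: bytes) -> None:
    """Write to a temp file in the same directory, then rename over dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file with mode 0600, which suits a cookie jar.
    fd, tmp = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(raw)
        os.replace(tmp, dest)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

A caller would read the token from ~/.config/playwright/token, call `fetch_snapshot` against https://chrome.viktorbarzin.me/api/snapshot, then pass the validated bytes to `write_atomically` with ~/.cache/playwright-shared-storage-state.json as the destination.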
Inspect the current snapshot
# In-cluster (from any machine with kubectl access to the chrome-service namespace):
kubectl -n chrome-service exec deploy/chrome-service -c snapshot-server -- \
cat /profile/snapshots/storage-state.json | jq '.cookies | length'
# Externally (via the bearer-gated endpoint):
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL -H "Authorization: Bearer $TOKEN" \
https://chrome.viktorbarzin.me/api/snapshot | jq '.cookies | length'
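
When a bare cookie count is not enough, the snapshot can be summarised per domain. The sketch below relies on Playwright's storage-state layout (a `cookies` list plus an `origins` list carrying localStorage); `summarize` is a hypothetical helper name.

```python
import json
import time
from collections import Counter
from typing import Optional


def summarize(state: dict, now: Optional[float] = None) -> dict:
    """Count cookies per domain and flag ones that have already expired."""
    now = time.time() if now is None else now
    cookies = state.get("cookies", [])
    per_domain = Counter(c["domain"].lstrip(".") for c in cookies)
    # Playwright stores session cookies with expires == -1; skip those.
    expired = sum(1 for c in cookies if 0 < c.get("expires", -1) < now)
    return {
        "cookies": len(cookies),
        "per_domain": dict(per_domain),
        "expired": expired,
        "origins": [o["origin"] for o in state.get("origins", [])],
    }


def summarize_file(path: str) -> dict:
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

On a dev box, `summarize_file` can be pointed at the expanded path of ~/.cache/playwright-shared-storage-state.json.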
Failure modes
"no browser contexts found"
The harvester logs "no browser contexts found" (with a hint that chrome-service may not have launched a persistent context yet) and exits non-zero.
Cause: chromium just started and hasn't created its default context yet, or it crashed.
Fix: check chrome-service pod logs (kubectl -n chrome-service logs deploy/chrome-service -c chrome-service). The next hourly run will
retry. If chromium is wedged: kubectl -n chrome-service rollout restart deploy/chrome-service (strategy = Recreate, brief downtime).
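
For orientation, this is roughly where the error comes from. The sketch shows the harvest step using Playwright's Python sync API; it is not the CronJob's actual script, and `pick_context` and `harvest` are hypothetical names.

```python
import os
from pathlib import Path

SNAPSHOT = Path("/profile/snapshots/storage-state.json")


def pick_context(contexts):
    """Return the browser's default context, or exit non-zero if none exists."""
    if not contexts:
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit("no browser contexts found")
    return contexts[0]


def harvest(cdp_url: str, dest: Path = SNAPSHOT) -> int:
    # Imported lazily so the module loads even where playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        context = pick_context(browser.contexts)
        dest.parent.mkdir(parents=True, exist_ok=True)
        tmp = dest.with_name(dest.name + ".tmp")
        context.storage_state(path=str(tmp))  # cookies + localStorage
        os.replace(tmp, dest)  # readers never see a half-written file
    return dest.stat().st_size
```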
"connect_over_cdp failed"
Harvester or any in-cluster caller can't reach the CDP endpoint.
Cause: chrome-service pod not Ready, NetworkPolicy doesn't admit the caller's namespace, or chromium isn't listening on :9222.
Diagnose:
kubectl -n chrome-service get pods
kubectl -n chrome-service describe networkpolicy chrome-service-ws-ingress
# From inside the cluster (e.g. a debug pod in chrome-service ns):
nc -zv chrome-service.chrome-service.svc.cluster.local 9222
curl -fsSL http://chrome-service.chrome-service.svc.cluster.local:9222/json/version
Fix: depends on the diagnosis. NetworkPolicy needs the caller's namespace label or an explicit name-fallback. If chromium isn't binding, check the container logs.
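
Caller pods often ship Python but neither `nc` nor `curl`. The same /json/version probe can be run with the standard library alone; `probe` and `parse_version` below are hypothetical helper names.

```python
import json
import urllib.request


def parse_version(body: bytes) -> dict:
    """Pick the useful fields out of a /json/version response."""
    info = json.loads(body)
    return {
        "browser": info.get("Browser"),
        "ws": info.get("webSocketDebuggerUrl"),
    }


def probe(base_url: str, timeout: float = 5.0) -> dict:
    """Raises URLError if the endpoint is unreachable (pod down, policy drop)."""
    url = base_url.rstrip("/") + "/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_version(resp.read())
```

A timeout here usually points at the NetworkPolicy, while an immediate connection refusal suggests nothing is listening on :9222.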
Dev-box playwright-snapshot-refresh returns 401
The bearer token in ~/.config/playwright/token doesn't match the
server's. Almost always means the Vault secret was rotated and the
local cache is stale.
Fix:
vault login -method=oidc # if needed
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
systemctl --user start playwright-snapshot-refresh.service
Dev-box playwright-snapshot-refresh returns 404 with "snapshot not yet available"
The harvester hasn't run successfully yet (fresh cluster, or all recent runs failed). Trigger it manually (see "Trigger snapshot harvester manually").
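
The 401 and 404 cases follow from the order in which a bearer-gated endpoint checks things: token first, file second. A minimal sketch of that decision follows. It is not the sidecar's real handler; `respond` is a hypothetical function and the 401 body text is an assumption.

```python
import hmac
from pathlib import Path
from typing import Optional, Tuple


def respond(auth_header: Optional[str], token: str, snapshot: Path) -> Tuple[int, bytes]:
    """Return (status, body) for GET /api/snapshot."""
    expected = f"Bearer {token}".encode()
    presented = (auth_header or "").encode()
    # Constant-time comparison avoids leaking the token through timing.
    if not hmac.compare_digest(presented, expected):
        return 401, b"unauthorized"
    if not snapshot.is_file():
        return 404, b"snapshot not yet available"
    return 200, snapshot.read_bytes()
```

With this ordering, a 404 already proves the dev box's token is valid, so only the harvester needs attention.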
Claude Code sessions still see old cookies
The MCP server reads the snapshot file at process start and seeds each new context with it. Existing MCP sessions don't hot-reload — they keep the cookies they were seeded with at session start. New sessions get the fresh snapshot.
Fix: restart the MCP server on the dev box to pick up the new file:
systemctl --user restart playwright-mcp.service
Snapshot file is suspiciously small or has an empty cookies array
The persistent chromium context isn't holding any cookies. Probably
means the user hasn't logged into anything via noVNC, or chromium was
relaunched without preserving /profile/chromium-data.
Diagnose:
kubectl -n chrome-service exec deploy/chrome-service -c chrome-service -- \
ls -la /profile/chromium-data/Default/Cookies
A populated Cookies SQLite file should be several hundred KB once
real logins exist. If it's missing or empty, log in via noVNC.
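
File size is only a rough signal. To see which hosts actually have cookies, the SQLite file can be queried directly. The sketch assumes the Chromium schema (a `cookies` table with a `host_key` column), which may differ between versions. Run it against a copy pulled off the PVC, since the live file can be locked while chromium is running.

```python
import sqlite3


def cookies_per_host(db_path: str) -> dict:
    """Count rows per host_key in a copy of Chromium's Cookies database."""
    # mode=ro opens the file read-only so the copy is never modified.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT host_key, COUNT(*) FROM cookies "
            "GROUP BY host_key ORDER BY host_key"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```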
Token rotation
# Rotate Vault secret (32-byte URL-safe random).
vault kv put secret/chrome-service \
api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')
# Reloader auto-restarts chrome-service pod (snapshot-server picks up new token).
# On EVERY dev box that pulls the snapshot:
vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
chmod 600 ~/.config/playwright/token
# Verify the next refresh succeeds:
systemctl --user start playwright-snapshot-refresh.service
journalctl --user -u playwright-snapshot-refresh.service -n 20
Restore from a backup tarball
The 6-hourly backup CronJob writes tar -czf /backup/YYYY_MM_DD_HH.tar.gz -C /profile . to NFS at /srv/nfs/chrome-service-backup/. To restore
the entire profile:
# 1. Scale chrome-service down so its lock is released.
kubectl -n chrome-service scale deploy/chrome-service --replicas=0
# 2. Mount the PVC in a helper pod and restore.
kubectl -n chrome-service apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: {name: restore-helper, namespace: chrome-service}
spec:
containers:
- name: helper
image: alpine:3.20
command: [sleep, infinity]
volumeMounts:
- {name: profile, mountPath: /profile}
- {name: backup, mountPath: /backup, readOnly: true}
volumes:
- name: profile
persistentVolumeClaim: {claimName: chrome-service-profile-encrypted}
- name: backup
persistentVolumeClaim: {claimName: chrome-service-backup-host}
restartPolicy: Never
EOF
kubectl -n chrome-service wait --for=condition=ready pod/restore-helper
kubectl -n chrome-service exec restore-helper -- sh -c '
rm -rf /profile/chromium-data /profile/snapshots &&
tar -xzf /backup/2026_06_04_18.tar.gz -C /profile
'
# 3. Cleanup helper, scale chrome-service back up.
kubectl -n chrome-service delete pod restore-helper
kubectl -n chrome-service scale deploy/chrome-service --replicas=1
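
Step 2 hard-codes 2026_06_04_18.tar.gz as an example. To pick the most recent tarball instead, the YYYY_MM_DD_HH names can be parsed and compared. A small sketch, where `latest_backup` is a hypothetical helper fed with a listing of /backup:

```python
from datetime import datetime

SUFFIX = ".tar.gz"


def latest_backup(names):
    """Return the newest file named YYYY_MM_DD_HH.tar.gz, ignoring others."""
    stamped = []
    for name in names:
        if not name.endswith(SUFFIX):
            continue
        try:
            ts = datetime.strptime(name[: -len(SUFFIX)], "%Y_%m_%d_%H")
        except ValueError:
            continue  # not a backup tarball, skip it
        stamped.append((ts, name))
    if not stamped:
        raise FileNotFoundError("no tarballs matching YYYY_MM_DD_HH.tar.gz")
    return max(stamped)[1]
```

Pass it the output of `os.listdir` on the mounted backup directory, or the lines of an `ls /backup` captured from the helper pod.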