infra/docs/architecture/chrome-service.md
Viktor Barzin 3f3e5fc954 chrome-service: open NP for Traefik → noVNC sidecar (port 6080)
Existing NetworkPolicy only admitted port 3000 (Playwright WS) from
labelled client namespaces, blocking Traefik's traffic to the noVNC
sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang
forever — page never loads, eventually times out.

Adds a second ingress rule allowing TCP/6080 from the traefik
namespace only. Authentik forward-auth still gates external access
at the Traefik layer.

Also reconciles the noVNC image to the new Forgejo registry path
(:v4 unchanged) — already declared in TF, just live-state drift from
the Phase 3 registry consolidation.

Updates the architecture doc; the previous text still described the
old nginx static health stub that noVNC replaced.
2026-05-07 23:29:34 +00:00

136 lines
6.5 KiB
Markdown

# chrome-service — In-cluster headed Chromium pool
## Overview
`chrome-service` is a single-replica, persistent-profile, bearer-token-gated
Playwright **launch-server** that exposes a headed Chromium browser over a
WebSocket. Sibling services connect to it instead of running their own
in-process Chromium when the upstream's anti-bot tooling
(`disable-devtool.js` redirect-to-google trap, console-clear timing tricks,
`navigator.webdriver` checks) defeats a headless browser.
Initial caller: `f1-stream`'s `playback_verifier`. Future callers attach
via the WS+token contract documented in `stacks/chrome-service/README.md`.
## Why a separate stack
In-process Chromium inside `f1-stream`:
- Runs **headless** by default (no `Xvfb`/`DISPLAY`).
- Has the `HeadlessChromium/...` UA suffix and `navigator.webdriver === true`.
- Trips `disable-devtool.js`'s **Performance** detector — Playwright's CDP
adds latency to `console.log(largeArray)` vs `console.table(largeArray)`,
which the lib reads as "DevTools is open" and redirects to
`https://www.google.com/`.
`chrome-service` solves this by:
1. Running **headed** under `Xvfb :99` (via `playwright launch-server` with
a JSON config that pins `headless: false`).
2. Living in a long-lived pod so JIT browser launch latency disappears.
3. Allowing a per-context init script
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
`puppeteer-extra-plugin-stealth`) to spoof `webdriver`, `chrome.runtime`,
`plugins`, `languages`, `Permissions.query`, WebGL renderer strings, and
to hide the `disable-devtool-auto` script-tag attribute so the lib's
IIFE exits early.
## Wire protocol
```text
ws://chrome-service.chrome-service.svc.cluster.local:3000/<TOKEN>
┌───────────────────────────────┼───────────────────────────────┐
│ caller pod │ chrome-service pod
│ (e.g. f1-stream) │ (single replica)
│ │
│ CHROME_WS_URL ──────────────┘
│ CHROME_WS_TOKEN ─── from `secret/chrome-service.api_bearer_token` (ESO)
│ await chromium.connect(f"{ws}/{token}")
│ await ctx.add_init_script(STEALTH_JS)
│ page.goto("https://upstream.com/embed/...")
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
```
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The Microsoft image ships only the browser binaries, not the `playwright`
npm SDK; the start command runs `npx -y playwright@1.48.0 launch-server`
which downloads the SDK on first start (cached under `$HOME/.npm` via the
PVC) and reuses it on subsequent restarts.
## Storage
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
`proxmox-lvm-encrypted`) — Chromium user-data dir + npm cache.
Encrypted because cookies/localStorage may include third-party auth tokens
for sites callers drive. `HOME=/profile` so npx caches there.
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
retention 30 days.
## Auth + secrets
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
random, rotated by hand:
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
- ESO syncs into namespace-local Secret `chrome-service-secrets`
(server pod) and `chrome-service-client-secrets` (each caller pod).
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
to both server and any annotated caller — no manual rollout.
## Network controls
- **`kubernetes_network_policy_v1.ws_ingress`** — two separate ingress
rules on the same policy:
- **TCP/3000** (Playwright WS): only namespaces labelled
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
fallback for `f1-stream` by `kubernetes.io/metadata.name`).
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace, since
the public-facing path is `chrome.viktorbarzin.me` ingress →
Traefik → sidecar. Authentik forward-auth still gates external
access at the Traefik layer.
- **WS port 3000** is internal-only (no ingress, no Cloudflare DNS).
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
exposes a live HTML5 view of the headed Chromium session via
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. Both static page and WebSocket upgrade share the
same path — Cloudflare proxy, Cloudflared tunnel, Traefik, and
Authentik forward-auth all preserve `Upgrade: websocket`.
## Adding a new caller
See `stacks/chrome-service/README.md` for the four-step recipe:
1. Label the caller's namespace.
2. Add an `ExternalSecret` pulling `secret/chrome-service`.
3. Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN` env vars.
4. Vendor `stealth.js` and apply via `await context.add_init_script(...)`
after every `new_context()`.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
license check, device-fingerprint mismatch, hotlink protection that
whitelists specific parent domains), the verifier returns
`is_playable=False` and the extractor moves on. No user-visible
breakage, just empty stream lists for that source.
- **JWPlayer DRM error 102630** — observed with several hmembeds embeds
even from the headed chrome-service. The license check bails because
the request origin isn't on the embed's allowlist; this is upstream
policy, not an infra defect.
- **Single replica + RWO PVC** — the deployment uses `Recreate` strategy.
Brief outage on rollout, ~30s for browser warmup.
- **No `/metrics` endpoint** — the cluster's generic
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
exporter is day-2 work.