diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md
index 6f9c1ee4..37cb6edc 100644
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
   wrapper in `main.tf` (so it applies deterministically even though the image is
   `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix as the
   android-emulator stack.
+
+### noVNC black after a browser-container restart (x11vnc supervision)
+
+A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
+but the view is **black**, and the novnc container logs spew
+`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
+refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
+in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
+container's Xvfb over `localhost:6099` (shared pod network). When the browser
+container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
+Xvfb vanishes and x11vnc loses its X connection and exits.
+
+`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
+background children and `wait -n`s on them, exiting non-zero if **either** dies, so
+the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
+relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
+(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
+websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
+`<defunct>` zombie — and the view black until a manual pod restart. Same
+supervision pattern as the android-emulator stack's entrypoint.)
+
+**Diagnose:** `kubectl exec <pod> -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
+entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
+"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
+— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
+recovery** (no image change): restart just the novnc container with `kubectl exec
+-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
+and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
+
+> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
+> (`keel.sh/policy=never`, because the browser container's playwright image is
+> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
+> rebuilt `:latest` will **not** redeploy on its own. After the
+> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
+> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
+> and rollout (the novnc image is TF-managed — not in the deployment's
+> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`) serves
   `GET /api/snapshot` from `/profile/snapshots/storage-state.json`, bearer-gated by `PW_TOKEN`.
   Service `chrome-snapshot` maps :8088 → :8088
diff --git a/stacks/chrome-service/files/novnc/entrypoint.sh b/stacks/chrome-service/files/novnc/entrypoint.sh
index fae5c641..aeff9408 100644
--- a/stacks/chrome-service/files/novnc/entrypoint.sh
+++ b/stacks/chrome-service/files/novnc/entrypoint.sh
@@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
   sleep 2
 done
 
-# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
-# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
-# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
-# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
+# Both x11vnc and websockify run as supervised children of this entrypoint (PID
+# 1) so their logs land on container stdout and the `wait -n` at the end can catch
+# either one dying. `-noshm` skips MIT-SHM probes that fail across container
+# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
+# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
 echo "starting x11vnc -> :5900"
 x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
   -forever -shared -noshm -noxdamage -quiet 2>&1 &
-X11VNC_PID=$!
 
 for i in 1 2 3 4 5 6 7 8 9 10; do
   if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
 fi
 
 echo "starting websockify -> :6080"
-exec websockify --web=/usr/share/novnc 6080 localhost:5900
+# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
+# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
+# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
+# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
+# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
+# the noVNC view went black until a manual pod restart. Now if EITHER process
+# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
+# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
+# across browser-container restarts. (Same supervision pattern as the
+# android-emulator stack's entrypoint.)
+websockify --web=/usr/share/novnc 6080 localhost:5900 &
+
+wait -n || true
+echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
+exit 1
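
Review note, not part of this change: the one-line RFB-banner probe in the **Diagnose** paragraph could be expanded into a small helper that tells the two gotchas apart (refused = x11vnc down, as in this fix; no banner within the timeout = the fd-sweep hang). This is only a sketch. The function name `probe_rfb`, the status strings, and the `"unexpected"` branch are assumptions for illustration; only `127.0.0.1:5900`, the 2 s timeout, and the `b'RFB 003.008\n'` banner come from the doc.

```python
import socket


def probe_rfb(host="127.0.0.1", port=5900, timeout=2.0):
    """Classify a VNC endpoint by its RFB greeting (hypothetical helper).

    Returns a (status, banner) tuple where status is one of:
      "healthy"    - the server sent a banner starting with b"RFB "
      "refused"    - nothing is listening on the port (x11vnc is down)
      "timeout"    - no connection or no banner arrived within `timeout` seconds
      "unexpected" - something answered, but not with an RFB banner
    """
    try:
        # create_connection applies `timeout` to the connect and to later recv calls.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(12)  # the RFB ProtocolVersion message is 12 bytes
    except ConnectionRefusedError:
        return "refused", b""
    except socket.timeout:
        return "timeout", b""
    if banner.startswith(b"RFB "):
        return "healthy", banner
    return "unexpected", banner


if __name__ == "__main__":
    print(probe_rfb())
```

Run from a sibling container in the pod, this prints the status and the raw banner; any other `OSError` (for example an unreachable host) is deliberately left to propagate rather than being mapped to a status.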