chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals

The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
reaped or relaunched: it lingered as a defunct zombie, nothing listened on :5900,
websockify kept returning 'Connection refused', and the view stayed black until a
manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-27 08:03:29 +00:00
parent abb15cd49d
commit 19d0f0933a
2 changed files with 57 additions and 6 deletions


@@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
### noVNC black after a browser-container restart (x11vnc supervision)
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never reaped or relaunched, so it lingered as a
`<defunct>` zombie, nothing listened on `:5900`, and the view stayed black until a
manual pod restart. The new supervision pattern mirrors the android-emulator
stack's entrypoint.)
**Diagnose:** `kubectl exec -n chrome-service deploy/chrome-service -c novnc -- ps aux | grep x11vnc`
(a `<defunct>`/Z entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`:
healthy prints `b'RFB 003.008\n'`, broken raises `ConnectionRefusedError`). **Immediate
recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
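If `python3` is not available in the container you probe from, the same RFB-banner check can be sketched in bash using the `/dev/tcp` redirection that `entrypoint.sh` already relies on. This is a minimal sketch, not part of the repo: the function names `rfb_probe` and `classify_banner` are hypothetical, and it assumes a bash build with `/dev/tcp` support.

```shell
# Hypothetical helpers (not in the repo): a bash take on the RFB-banner probe.

# Print the first line of the server banner, or fail if nothing is listening.
rfb_probe() {
  local host="${1:-127.0.0.1}" port="${2:-5900}"
  (
    # Open a TCP connection on fd 3; fails right away on "Connection refused".
    exec 3<>"/dev/tcp/${host}/${port}" || exit 1
    # The RFB banner is 12 bytes ending in a newline; give up after 2 seconds.
    IFS= read -r -t 2 -n 12 banner <&3 || exit 1
    printf '%s\n' "$banner"
  ) 2>/dev/null
}

# Map a banner string to a verdict: anything starting with "RFB " is healthy.
classify_banner() {
  case "$1" in
    "RFB "*) echo healthy ;;
    *)       echo broken ;;
  esac
}
```

Usage, from a shell inside any container of the pod: `classify_banner "$(rfb_probe 127.0.0.1 5900)"`. An empty banner (connection refused or timeout) is reported as `broken`.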
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088


@@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
sleep 2
done
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
# Both x11vnc and websockify run as supervised children of this entrypoint (PID
# 1) so their logs land on container stdout and the `wait -n` at the end can catch
# either one dying. `-noshm` skips MIT-SHM probes that fail across container
# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
echo "starting x11vnc -> :5900"
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
-forever -shared -noshm -noxdamage -quiet 2>&1 &
X11VNC_PID=$!
for i in 1 2 3 4 5 6 7 8 9 10; do
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
fi
echo "starting websockify -> :6080"
exec websockify --web=/usr/share/novnc 6080 localhost:5900
# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
# the noVNC view went black until a manual pod restart. Now if EITHER process
# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
# across browser-container restarts. (Same supervision pattern as the
# android-emulator stack's entrypoint.)
websockify --web=/usr/share/novnc 6080 localhost:5900 &
wait -n || true
echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
exit 1
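The supervision pattern in the entrypoint above can be tried in isolation with a toy sketch. The two `sleep` commands below are stand-ins for x11vnc and websockify, and `supervise_demo` is a hypothetical name. It assumes bash 4.3 or newer for `wait -n`. Unlike the real entrypoint, which writes to stderr and lets the kubelet clean up, it prints to stdout and kills the surviving child itself so the demo terminates.

```shell
# Toy illustration of `wait -n` supervision (stand-in processes, not the real ones).
supervise_demo() {
  sleep 1 &            # short-lived child: plays the role of x11vnc dying
  sleep 30 &           # long-lived child: plays the role of websockify
  local long_pid=$!

  # `wait -n` returns as soon as ANY background child exits, whatever its status.
  wait -n || true

  # In the container the kubelet restart tears everything down; here we clean
  # up the surviving child by hand so the demo does not linger.
  kill "$long_pid" 2>/dev/null || true
  wait "$long_pid" 2>/dev/null || true

  echo "a supervised child exited; exiting non-zero"
  return 1
}
```

Calling `supervise_demo` returns after about one second with status 1, which corresponds to the entrypoint's `exit 1` that makes the kubelet restart the container.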