From 19d0f0933a8ec75be6cfa077db88e0f8c3760f40 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 27 Jun 2026 08:03:29 +0000
Subject: [PATCH] chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and
when that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost
its X connection and exited. Because the entrypoint ran x11vnc as an
unsupervised background child and then exec'd websockify as PID 1, the
dead x11vnc was never relaunched — :5900 stayed dead (a defunct zombie),
websockify kept returning 'Connection refused', and the view was black
until a manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised
background children and exits non-zero via 'wait -n' if either dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb and
relaunches x11vnc. The bridge now self-heals across browser-container
restarts. Mirrors the android-emulator stack's supervision pattern.

Architecture doc updated with the new failure mode, diagnosis,
immediate-recovery, and SHA-pin deploy note.

Co-Authored-By: Claude Opus 4.8
---
 docs/architecture/chrome-service.md           | 37 +++++++++++++++++++
 .../chrome-service/files/novnc/entrypoint.sh  | 26 ++++++++++---
 2 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md
index 6f9c1ee4..37cb6edc 100644
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
 wrapper in `main.tf` (so it applies deterministically even though the image is
 `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint).
 Same bug + fix as the android-emulator stack.
+
+### noVNC black after a browser-container restart (x11vnc supervision)
+
+A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
+but the view is **black**, and the novnc container logs spew
+`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
+refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
+in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
+container's Xvfb over `localhost:6099` (shared pod network). When the browser
+container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
+Xvfb vanishes and x11vnc loses its X connection and exits.
+
+`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
+background children and `wait -n`s on them, exiting non-zero if **either** dies, so
+the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
+relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
+(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
+websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
+`<defunct>` zombie — and the view black until a manual pod restart. Same
+supervision pattern as the android-emulator stack's entrypoint.)
+
+**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
+entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
+"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
+— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
+recovery** (no image change): restart just the novnc container with `kubectl exec
+-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
+and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
+
+> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
+> (`keel.sh/policy=never`, because the browser container's playwright image is
+> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
+> rebuilt `:latest` will **not** redeploy on its own. After the
+> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:`,
+> **SHA-pin** the novnc `image` in `main.tf` to the new `:` to force the pull
+> and rollout (the novnc image is TF-managed — not in the deployment's
+> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
   serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
   bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
diff --git a/stacks/chrome-service/files/novnc/entrypoint.sh b/stacks/chrome-service/files/novnc/entrypoint.sh
index fae5c641..aeff9408 100644
--- a/stacks/chrome-service/files/novnc/entrypoint.sh
+++ b/stacks/chrome-service/files/novnc/entrypoint.sh
@@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
     sleep 2
 done
 
-# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
-# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
-# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
-# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
+# Both x11vnc and websockify run as supervised children of this entrypoint (PID
+# 1) so their logs land on container stdout and the `wait -n` at the end can catch
+# either one dying. `-noshm` skips MIT-SHM probes that fail across container
+# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
+# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
 echo "starting x11vnc -> :5900"
 x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
     -forever -shared -noshm -noxdamage -quiet 2>&1 &
-X11VNC_PID=$!
 
 for i in 1 2 3 4 5 6 7 8 9 10; do
     if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
 fi
 
 echo "starting websockify -> :6080"
-exec websockify --web=/usr/share/novnc 6080 localhost:5900
+# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
+# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
+# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
+# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
+# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
+# the noVNC view went black until a manual pod restart. Now if EITHER process
+# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
+# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
+# across browser-container restarts. (Same supervision pattern as the
+# android-emulator stack's entrypoint.)
+websockify --web=/usr/share/novnc 6080 localhost:5900 &
+
+wait -n || true
+echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
+exit 1
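
Review note: the `wait -n` supervision that the last hunk adds can be tried outside the pod. The following is a minimal sketch, not the entrypoint itself. It assumes bash 4.3 or newer (the first release with `wait -n`), uses `sleep` as a stand-in for both x11vnc and websockify, and the function name `supervise_demo` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: `sleep` stands in for x11vnc and websockify, and
# supervise_demo is a hypothetical name. `wait -n` needs bash >= 4.3.

supervise_demo() {
    # Stand-in for x11vnc: exits on its own after one second, much as
    # x11vnc does when it loses its X connection.
    sleep 1 >/dev/null 2>&1 &

    # Stand-in for websockify: would otherwise keep running.
    sleep 30 >/dev/null 2>&1 &
    local survivor_pid=$!

    # Returns as soon as ANY background child exits. `|| true` keeps a
    # non-zero child status from aborting the script if `set -e` is on.
    wait -n || true

    # Demo-only cleanup so the sketch leaves no stray process behind.
    # In the container this is unnecessary: the entrypoint is PID 1, so
    # its exit ends the container and the kubelet restarts it.
    kill "$survivor_pid" 2>/dev/null || true
    wait "$survivor_pid" 2>/dev/null || true

    echo "a supervised process exited; exiting non-zero" >&2
    return 1
}
```

Calling `supervise_demo` returns status 1 about a second later, once the short-lived stand-in exits, rather than waiting for the long-lived one. The patched entrypoint relies on that same property: whichever of x11vnc or websockify dies first ends the container.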