chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled

The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-22 17:34:03 +00:00
parent 20ca5ee624
commit c7ead032ec
3 changed files with 35 additions and 1 deletions

View file

@ -326,6 +326,14 @@ resource "kubernetes_deployment" "chrome_service" {
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
image_pull_policy = "IfNotPresent"
# Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
# nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
# so every VNC connection hangs on "Connecting" until it times out
# (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
# this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
# isn't guaranteed to be pulled this wrapper applies the cap
# deterministically on every rollout off the cached image.
command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
port {
name = "http"
container_port = 6080