noVNC scaled correctly but the emulator's Qt window opened small (~411x914)
and floated inside the 1080x2280 Xvfb, so the user saw a tiny phone in a sea
of black. v8 bakes a background fitter (wmctrl+xdotool) that, after boot,
auto-OKs the one-shot nested-virtualization warning dialog, fills the phone
window to the display, and parks the control strip off the right edge —
re-running to catch window/dialog timing then maintaining every 30s. Applied
live to the running pod already; this makes it survive the next wake.
Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint
now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect,
reconnect with 2s delay, shared) so every load behaves right without URL
params, and viewers self-heal across pod restarts/wakes. Already applied
live to the running pod; this makes it survive the next wake.
Trues the runbook up to reality: guest GL stays software (llvmpipe)
under Xvfb by deliberate choice (NVIDIA headless GL would need a
different streaming architecture), the GPU slice costs ~100MiB VRAM only
while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with
scale-to-zero covering idle.
Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed
Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB
banner — strace showed it sweeping the container's fd table with one
fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so
each connection effectively never completed. The entrypoint now sets
ulimit -n 65536 for everything it launches (verified live: banner
answers instantly under the capped limit); x11vnc also gets -nolookup
so client reverse-DNS can never stall handshakes.
First GPU boot verified qemu attached to the T4, but the guest GL
translator reported llvmpipe: the GPU operator injects only
compute,utility by default, so the NVIDIA EGL/GL vendor libraries were
absent and gfxstream silently fell back to software GL. The graphics
capability completes the hardware rendering path.
First real wake attempt 500'd: kubernetes.default.svc does not resolve
from the gate's alpine pod (musl + injected dns_config ndots quirk), so
every kube call failed with 'Name does not resolve'. Use the injected
KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster
endpoint, no DNS dependency. ConfigMap checksum annotation rolls the
gate automatically.
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.
Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.
The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).