The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:
1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
ports. A forgotten `adb connect` (no disconnect) holds that transport
open forever, so every 15-min run saw "active" and reset the counter --
it never reached the sleep branch. (Right now: 4 such stale transports
from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
failed Forbidden -- the gate ServiceAccount can patch `deployments` but
not `deployments/scale`.
Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.
Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trues the runbook up to reality: guest GL stays software (llvmpipe)
under Xvfb by deliberate choice (NVIDIA headless GL would need a
different streaming architecture), the GPU slice costs ~100MiB VRAM only
while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with
scale-to-zero covering idle.
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).