infra/stacks/android-emulator/README.md
Viktor Barzin 839fdb33c2
android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.
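
For illustration, the direct replicas patch could look like the sketch below. The namespace and deployment names are placeholders (this message does not spell them out), and `replicas_patch` / `set_replicas` are hypothetical helper names, not the sleeper's real code. It relies only on `kubectl patch`, which needs the `patch` verb on `deployments` rather than access to `deployments/scale`.

```bash
# Placeholder names: adjust to the real namespace and deployment.
NAMESPACE="android-emulator"
DEPLOYMENT="android-emulator"

# Build the merge-patch body for a given replica count.
replicas_patch() {
  printf '{"spec":{"replicas":%d}}' "$1"
}

# Patch spec.replicas on the named deployment (0 = sleep, 1 = wake).
set_replicas() {
  kubectl -n "$NAMESPACE" patch deployment "$DEPLOYMENT" \
    --type merge -p "$(replicas_patch "$1")"
}
```

Usage would be `set_replicas 0` from the sleeper and `set_replicas 1` from the wake gate, so both directions share one code path and one RBAC grant.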

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00


# android-emulator — shared in-cluster Android testing instance
Android 16 (API 36, `google_apis/x86_64`) emulator running under KVM in the
cluster, so agents can natively test app/PWA changes before shipping (first
tenant: tripit). Decision record: `docs/adr/0001-android-emulator-in-cluster.md`.
## On-demand lifecycle (since 2026-06-12)
The emulator **scales to zero when idle** (no user interaction for 6h —
taps/keys/app-launches/noVNC clicks, read from `dumpsys power` by the
`android-emulator-idle-sleeper` CronJob) and **wakes on visit**: the wake
gate owns `/` on both hostnames. Warm boot is ~90s. Idle is measured from
real interaction, not connection count, so a forgotten `adb connect` (left
ESTABLISHED) no longer keeps it awake — but `adb disconnect` anyway.
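
The idle rule above can be sketched as a small shell function. This is an illustration only, not the CronJob's actual code: it assumes the last-activity timestamp and the guest uptime have already been extracted as milliseconds on the same clock (parsing `dumpsys power` is left out), and `should_sleep` is a hypothetical name. A missing or inconsistent reading is treated as "stay awake".

```bash
# 6h idle limit in milliseconds, taken from the lifecycle rule above.
IDLE_LIMIT_MS=$((6 * 60 * 60 * 1000))

# should_sleep LAST_ACTIVITY_MS UPTIME_MS [LIMIT_MS]
# Returns 0 (sleep) only when both readings are valid and the gap
# between them is at least the limit; returns 1 (stay awake) otherwise.
should_sleep() {
  local last_ms="$1" uptime_ms="$2" limit_ms="${3:-$IDLE_LIMIT_MS}"

  # Fail-safe: empty or non-numeric input (e.g. read mid-boot) never sleeps.
  case "$last_ms" in ''|*[!0-9]*) return 1 ;; esac
  case "$uptime_ms" in ''|*[!0-9]*) return 1 ;; esac

  # Activity "in the future" means the readings disagree; stay awake.
  [ "$uptime_ms" -ge "$last_ms" ] || return 1

  [ "$((uptime_ms - last_ms))" -ge "$limit_ms" ]
}
```

Because the decision depends only on the two timestamps, open adb or noVNC connections have no influence on it.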
- Humans: open https://android-emulator.viktorbarzin.me — it wakes the
emulator if needed, shows a self-refreshing boot page, then hands over to
the noVNC screen.
- Agents (before adb): wake + poll, then connect:

      curl -ks --resolve android-emulator.viktorbarzin.lan:443:10.0.20.203 https://android-emulator.viktorbarzin.lan/wake
      until curl -ks --resolve android-emulator.viktorbarzin.lan:443:10.0.20.203 https://android-emulator.viktorbarzin.lan/status | grep -q '"ready": 1'; do sleep 5; done
      adb connect 10.0.20.200:5555
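
The agent commands above poll forever if the emulator never becomes ready. A bounded variant might look like this sketch; the helper names (`emu_get`, `is_ready`, `wake_and_wait`) and the 300-second default are assumptions chosen for illustration, while the host, IP, paths and the `"ready": 1` marker are the ones used above.

```bash
EMU_HOST="android-emulator.viktorbarzin.lan"
EMU_IP="10.0.20.203"

# GET a path on the wake gate, pinning the hostname to the LAN IP.
emu_get() {
  curl -ks --resolve "$EMU_HOST:443:$EMU_IP" "https://$EMU_HOST$1"
}

# Reads a /status body on stdin; succeeds once it reports ready.
is_ready() {
  grep -q '"ready": 1'
}

# wake_and_wait [TIMEOUT_S] [INTERVAL_S]
# Triggers a wake, then polls /status until ready or until the timeout.
wake_and_wait() {
  local timeout_s="${1:-300}" interval_s="${2:-5}" waited=0
  emu_get /wake >/dev/null || return 1
  until emu_get /status | is_ready; do
    [ "$waited" -ge "$timeout_s" ] && return 1
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

A caller could then run `wake_and_wait && adb connect 10.0.20.200:5555`. The default comfortably covers a ~90s warm boot; a first boot that has to download the SDK would need a much larger timeout.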
## Endpoints
| What | Where |
|---|---|
| adb | `adb connect 10.0.20.200:5555` (LAN only; adb is unauthenticated — never expose publicly) |
| Screen (noVNC) | <https://android-emulator.viktorbarzin.lan/vnc.html> (LAN only) |
## Agent quickstart (from a devvm)
```bash
# one-time: user-local platform-tools
wget -qO /tmp/pt.zip https://dl.google.com/android/repository/platform-tools-latest-linux.zip
unzip -q /tmp/pt.zip -d ~/android-sdk # → ~/android-sdk/platform-tools/adb
adb="$HOME/android-sdk/platform-tools/adb"
$adb connect 10.0.20.200:5555
$adb -s 10.0.20.200:5555 install app-debug.apk # install an APK
$adb -s 10.0.20.200:5555 shell am start -a android.intent.action.VIEW -d https://tripit.viktorbarzin.me # open a URL
$adb -s 10.0.20.200:5555 shell input tap 540 1200 # drive the UI
$adb -s 10.0.20.200:5555 exec-out screencap -p > /tmp/screen.png # screenshot
```
The emulator is a single shared instance: check what is already installed with
`adb shell pm list packages`, uninstall your test app when done, and presence-claim
(`presence claim service:android-emulator`) for long destructive sessions
(wipes, system-image changes).
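
The clean-up etiquette can be bundled into one helper, sketched below. `finish_session` is a hypothetical name and `com.example.testapp` in the usage line is a placeholder package; the adb path and serial are the ones from the quickstart.

```bash
# Same adb binary and serial as in the quickstart above.
ADB="${ADB:-$HOME/android-sdk/platform-tools/adb}"
SERIAL="10.0.20.200:5555"

# finish_session PACKAGE
# Leaves the shared instance tidy: remove the test app, turn the screen
# off, and drop the adb transport.
finish_session() {
  local pkg="$1"
  "$ADB" -s "$SERIAL" uninstall "$pkg"
  "$ADB" -s "$SERIAL" shell input keyevent KEYCODE_SLEEP
  "$ADB" disconnect "$SERIAL"
}
```

For example, `trap 'finish_session com.example.testapp' EXIT` at the top of a test script would run the clean-up even when the script fails partway.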
## How it works
- The container image (built from `docker/`) holds only JDK 17, cmdline-tools,
emulator native libs, Xvfb/x11vnc/noVNC and socat — ~1GB.
- The SDK proper (platform-tools, emulator, system image, AVD, snapshots)
lives on the `android-emulator-sdk` PVC (`proxmox-lvm`); the entrypoint
installs it idempotently. **First boot downloads ~2.5GB (≈9GB unpacked on the PVC) and takes ~15 min**
(startup probe allows 30); subsequent restarts boot in ~1–2 min.
- The emulator runs on the GPU node (k8s-node1) with a T4 time-slice
(qemu holds ~100 MiB VRAM while awake; scale-to-zero keeps it transient).
Guest GL is deliberately SOFTWARE (llvmpipe): rendering into Xvfb pins GL
to the X stack, and true NVIDIA headless GL would need -no-window plus the
emulator's own streaming instead of x11vnc — not worth it at the measured
CPU numbers below.
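
An idempotent install step along the lines described above might look like the following sketch. It is not the real entrypoint: the `SDK_ROOT` default is an assumed mount point, the helper names are hypothetical, and checking for two paths is a simplification of whatever the entrypoint actually verifies. The package id follows the usual `sdkmanager` naming for the `google_apis/x86_64` image.

```bash
SDK_ROOT="${SDK_ROOT:-/opt/android-sdk}"   # assumed PVC mount point
API_LEVEL="${API_LEVEL:-36}"

# sdkmanager package id for the system image of a given API level.
system_image_pkg() {
  printf 'system-images;android-%s;google_apis;x86_64' "$1"
}

# Directory where sdkmanager unpacks that package under SDK_ROOT.
system_image_dir() {
  printf '%s/system-images/android-%s/google_apis/x86_64' "$SDK_ROOT" "$1"
}

# Download the SDK pieces only when they are not already on the PVC.
ensure_sdk() {
  if [ -d "$(system_image_dir "$API_LEVEL")" ] &&
     [ -x "$SDK_ROOT/platform-tools/adb" ]; then
    return 0   # already present: restarts skip the download
  fi
  # "yes" answers the license prompts non-interactively.
  yes | sdkmanager --sdk_root="$SDK_ROOT" \
    "platform-tools" "emulator" "$(system_image_pkg "$API_LEVEL")"
}
```

Keying the check on `API_LEVEL` means a changed level triggers one extra download onto the same PVC while leaving the existing image in place.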
## Rebuilding the image (rare — tool/library bumps only)
```bash
cd stacks/android-emulator/docker
docker build -t forgejo.viktorbarzin.me/viktor/android-emulator:<new-tag> .
docker push forgejo.viktorbarzin.me/viktor/android-emulator:<new-tag>
# then bump var.image_tag default in variables.tf and land via CI
```
Built manually from a devvm on purpose: it changes rarely, and a one-off push
doesn't warrant CI plumbing (the off-infra-CI rule targets *repeated* build IO).
## Troubleshooting
- Pod CrashLoops with `FATAL: /dev/kvm not present` → node lost the device or
the privileged/Kyverno exclude regressed (`android-emulator` must be in
`security_policy_exclude_namespaces`, stacks/kyverno).
- Wedged Android (won't boot, storage full) → delete the PVC + pod: next boot
re-downloads cleanly. Snapshots/AVD state are disposable by design.
- Different API level: set `API_LEVEL` env on the deployment (entrypoint
installs that system image on the same PVC) or recreate the AVD.
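
For the `/dev/kvm` case, a quick check that can be run on the node or inside the container is sketched below. `kvm_usable` is a hypothetical helper; it only tests that the device node exists and is readable and writable by the current user, which is a necessary condition for KVM to work, not a full diagnosis.

```bash
# kvm_usable [DEVICE]
# Succeeds when the device is a character device the current user can
# read and write; defaults to /dev/kvm.
kvm_usable() {
  local dev="${1:-/dev/kvm}"
  [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}
```

Running `kvm_usable || echo "KVM device missing or not accessible"` on the node helps separate "node lost the device" from "pod is no longer allowed to see it".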
## Resource profile (measured 2026-06-12, v6 on node1)
- **Asleep (scaled to zero)**: nothing — the gate (~10m CPU/13Mi) is the only
standing cost.
- **Awake**: settles to **~0.5–1.3 cores** with a static screen (on or off),
~4.8–5.2 Gi memory (limit 8 Gi, requests 3 Gi), ~100 MiB T4 VRAM. Boot
bursts 5–9 cores for the first few minutes (dex2oat etc.).
- **Disk**: ~7 G of the 30 Gi PVC.
- Etiquette still applies for long sessions with animated content:
`adb -s 10.0.20.200:5555 shell input keyevent KEYCODE_SLEEP` when done.
## Remote access
https://android-emulator.viktorbarzin.me (Cloudflare-proxied, Authentik-gated)
serves the same noVNC screen for off-LAN use. adb stays LAN-only by design.