From b598c61c6110d86ef0cb0f89496ce1b5d4682973 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 12 Jun 2026 07:31:46 +0000
Subject: [PATCH] =?UTF-8?q?android-emulator:=20scale=20to=200=20=E2=80=94?=
 =?UTF-8?q?=20its=20CPU=20burn=20was=20starving=20etcd?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted
in a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a
testing session.

Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load 64->51,
control-plane restarts frozen. Durable structural fix (etcd/critical VM
disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8
---
 stacks/android-emulator/main.tf | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/stacks/android-emulator/main.tf b/stacks/android-emulator/main.tf
index 10220999..e4d3cf61 100644
--- a/stacks/android-emulator/main.tf
+++ b/stacks/android-emulator/main.tf
@@ -68,7 +68,14 @@ resource "kubernetes_deployment" "android-emulator" {
     }
   }
   spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-12): the emulator's ~4.7-core swiftshader CPU burn
+    # on node3 saturated the single PVE host and starved etcd on the k8s-master
+    # VM → control-plane (scheduler/controller-manager/kyverno) leader-election
+    # crashloop. It is a shared TEST instance, not a 24/7 service: spin up
+    # on-demand (replicas = 1) for a testing session, return to 0 when done.
+    # Durable relief pending the structural etcd-protection fix (PVE CPU weight
+    # / etcd WAL on SSD).
+    replicas = 0
     strategy {
       type = "Recreate" # RWO PVC — old pod must release it first
     }
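
Reviewer note: the "spin up on-demand, return to 0 when done" workflow described above could be driven by an input variable instead of editing main.tf for every testing session. The following is a minimal sketch only, not part of this patch. The variable name `emulator_replicas` is hypothetical, and the 0-or-1 restriction is an assumption based on the RWO PVC and "Recreate" strategy visible in the diff context.

```hcl
# Hypothetical variable (not in the original stack): defaults to the
# scaled-down state so a plain apply never restarts the CPU-heavy emulator.
variable "emulator_replicas" {
  type        = number
  default     = 0
  description = "Set to 1 for a testing session; return to 0 when done."

  validation {
    # Assumption: the RWO PVC means at most one emulator pod can run.
    condition     = contains([0, 1], var.emulator_replicas)
    error_message = "emulator_replicas must be 0 or 1."
  }
}

resource "kubernetes_deployment" "android-emulator" {
  # ... metadata and the rest of the spec unchanged ...
  spec {
    replicas = var.emulator_replicas
    strategy {
      type = "Recreate" # RWO PVC: old pod must release it first
    }
    # ...
  }
}
```

With this in place, a session would start with `terraform apply -var 'emulator_replicas=1'` and end with a plain `terraform apply`, which falls back to the default of 0.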