android-emulator: scale to 0 — its CPU burn was starving etcd

The cluster-health check found the control plane flapping: kube-scheduler and kube-controller-manager were crashlooping (220+ restarts) on lost leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU) CPU burn on node3, together with frigate on node1, saturated the single Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM. etcd then timed out, and the leader-election controllers died and restarted in a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it to 0 is the right relief: spin it back up to replicas=1 on demand for a testing session.

Confirmed recovery after scaling down: node3 CPU 83% -> 28%, PVE load 64 -> 51, control-plane restart counts no longer increasing. The durable structural fix (etcd/critical VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 07:31:46 +00:00
commit b598c61c61
parent 39a22b352e
1 changed file with 8 additions and 1 deletion
--- a/stacks/android-emulator/main.tf
+++ b/stacks/android-emulator/main.tf
@@ -68,7 +68,14 @@ resource "kubernetes_deployment" "android-emulator" {
    }
  }
  spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-12): the emulator's ~4.7-core swiftshader CPU burn
+    # on node3 saturated the single PVE host and starved etcd on the k8s-master
+    # VM → control-plane (scheduler/controller-manager/kyverno) leader-election
+    # crashloop. It is a shared TEST instance, not a 24/7 service: spin up
+    # on-demand (replicas = 1) for a testing session, return to 0 when done.
+    # Durable relief pending the structural etcd-protection fix (PVE CPU weight
+    # / etcd WAL on SSD).
+    replicas = 0
    strategy {
      type = "Recreate" # RWO PVC — old pod must release it first
    }
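
The on-demand toggle described in the commit message (0 by default, 1 for a testing session) could be driven by an input variable rather than by editing main.tf each time. The following is only a sketch under assumptions: the variable name `emulator_replicas` is hypothetical and does not exist in this stack, and the 0-or-1 validation reflects only the two values this commit mentions.

```hcl
# Hypothetical variable (not part of the original stack).
# Defaults to 0 so the emulator stays off unless a session asks for it.
variable "emulator_replicas" {
  description = "Replicas for the shared test emulator: 0 = off, 1 = testing session"
  type        = number
  default     = 0

  validation {
    # Only the two values discussed in this commit are accepted (assumption).
    condition     = contains([0, 1], var.emulator_replicas)
    error_message = "The emulator_replicas value must be 0 or 1."
  }
}

# In the existing kubernetes_deployment "android-emulator" spec block,
# the hard-coded value would then become:
#
#   replicas = var.emulator_replicas
```

With this in place, a testing session could start with `terraform apply -var="emulator_replicas=1"` and end with a plain `terraform apply`, which returns the deployment to the default of 0. The trade-off is that the scaled-up state is never committed, so the repository always describes the safe default.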