android-emulator: scale to 0 — its CPU burn was starving etcd

The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: scale it back to replicas=1 on demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restart counts no longer climbing. The durable structural
fix (etcd/critical VM disks off the shared sdc HDD; PVE CPU weighting) is
tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-12 07:31:46 +00:00
parent 39a22b352e
commit b598c61c61


@@ -68,7 +68,14 @@ resource "kubernetes_deployment" "android-emulator" {
     }
   }
   spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-12): the emulator's ~4.7-core swiftshader CPU burn
+    # on node3 saturated the single PVE host and starved etcd on the k8s-master
+    # VM -> control-plane (scheduler/controller-manager/kyverno) leader-election
+    # crashloop. It is a shared TEST instance, not a 24/7 service: spin up
+    # on-demand (replicas = 1) for a testing session, return to 0 when done.
+    # Durable relief pending the structural etcd-protection fix (PVE CPU weight
+    # / etcd WAL on SSD).
+    replicas = 0
     strategy {
       type = "Recreate" # RWO PVC: old pod must release it first
     }
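
The on-demand workflow described in the commit message (0 by default, 1 for a testing session) could also be expressed as an input variable, so a session does not require editing the resource. This is only a sketch: the variable name `android_emulator_replicas` is hypothetical and does not exist in the original module, and the deployment would need `replicas = var.android_emulator_replicas` in place of the literal.

```hcl
# Hypothetical variable, not part of the original module.
# 0 keeps the emulator parked; 1 starts it for a testing session.
variable "android_emulator_replicas" {
  description = "Replica count for the shared test emulator (0 = parked, 1 = running)"
  type        = number
  default     = 0

  validation {
    # Only 0 or 1 make sense here: the deployment uses the Recreate strategy
    # because its PVC is RWO, so a second replica could not attach the volume.
    condition     = contains([0, 1], var.android_emulator_replicas)
    error_message = "android_emulator_replicas must be 0 or 1."
  }
}

# In the deployment's spec block (assumed change, shown as a comment):
#   replicas = var.android_emulator_replicas
```

With that in place, a session would start with `terraform apply -var 'android_emulator_replicas=1'`, and a plain `terraform apply` afterwards would return the deployment to 0.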