android-emulator: scale to 0 — its CPU burn was starving etcd
The cluster-health check found the control plane flapping: kube-scheduler and kube-controller-manager were crashlooping (220+ restarts) on lost leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU) CPU burn on node3, together with frigate on node1, saturated the single Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM. etcd then timed out, and the leader-election controllers died and restarted in a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it to 0 is the right relief: scale it back to replicas=1 on demand for a testing session.

Confirmed recovery after scaling down: node3 CPU 83% -> 28%, PVE load 64 -> 51, control-plane restart counts no longer increasing.

The durable structural fix (etcd/critical VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 39a22b352e
commit b598c61c61
1 changed file with 8 additions and 1 deletion
@@ -68,7 +68,14 @@ resource "kubernetes_deployment" "android-emulator" {
     }
   }
   spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-12): the emulator's ~4.7-core swiftshader CPU burn
+    # on node3 saturated the single PVE host and starved etcd on the k8s-master
+    # VM → control-plane (scheduler/controller-manager/kyverno) leader-election
+    # crashloop. It is a shared TEST instance, not a 24/7 service: spin up
+    # on-demand (replicas = 1) for a testing session, return to 0 when done.
+    # Durable relief pending the structural etcd-protection fix (PVE CPU weight
+    # / etcd WAL on SSD).
+    replicas = 0
     strategy {
       type = "Recreate" # RWO PVC — old pod must release it first
     }
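The comment asks for the emulator to be spun up on demand and returned to 0 afterwards. One way to do that without editing the file each time is to drive the count from a variable. The fragment below is only a sketch, not part of this commit: the variable name `android_emulator_replicas` is hypothetical, and it assumes the deployment lives in a module where the variable can be set directly (a child module would need the value passed through from its caller).

```hcl
# Hypothetical variable (not in this commit): replica count for the shared
# android-emulator test instance. 0 = parked, 1 = active testing session.
variable "android_emulator_replicas" {
  description = "Replicas for the android-emulator test deployment (0 or 1)."
  type        = number
  default     = 0

  # The deployment uses the Recreate strategy with an RWO PVC, so more than
  # one replica is not meaningful here; reject anything other than 0 or 1.
  validation {
    condition     = contains([0, 1], var.android_emulator_replicas)
    error_message = "android_emulator_replicas must be 0 or 1."
  }
}

# In the existing kubernetes_deployment "android-emulator" spec block, the
# hard-coded value would then become:
#
#   replicas = var.android_emulator_replicas
```

Under those assumptions, a testing session could start with `terraform apply -var 'android_emulator_replicas=1'` and end with a plain `terraform apply`, which falls back to the default of 0. Keeping the default at 0 means a forgotten session cannot outlive the next apply.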