Commit graph

1 commit

Author SHA1 Message Date
Viktor Barzin
3fa9e2409c runbook: K8s worker scaling for PVC capacity headroom
Documents the 6-worker cluster shape (post 2026-05-26 scale-up after
the proxmox-csi LUN-cap incident), the six binding constraints (plugin
LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration
on node1, PVE host memory, no Terraform management for K8s VMs), and
the playbooks for adding/removing workers.

Scale-up triggers:
  - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
  - cluster memory requests > 90%
  - LUN-cap incident
  - planned ≥3 net-new block PVCs when max VA already ≥ 22
Scale-down conditions:
  - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days

Playbooks lean on scripts/provision-k8s-worker (clones template 2000,
cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete
node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md,
beads code-oflt (IO contention long-term fix).
2026-06-01 19:50:41 +00:00