runbook: K8s worker scaling for PVC capacity headroom
Documents the 6-worker cluster shape (post 2026-05-26 scale-up after the proxmox-csi LUN-cap incident), the six binding constraints (plugin LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration on node1, PVE host memory, no Terraform management for K8s VMs), and the playbooks for adding/removing workers. Scale-up triggers: - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days - cluster memory requests > 90% - LUN-cap incident - planned ≥3 net-new block PVCs when max VA already ≥ 22 Scale-down conditions: - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days Playbooks lean on scripts/provision-k8s-worker (clones template 2000, cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete node → qm shutdown for removes. Cold-spare option documented. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md, beads code-oflt (IO contention long-term fix).
This commit is contained in:
parent
5c77482a8c
commit
3fa9e2409c
1 changed files with 196 additions and 0 deletions
196
docs/runbooks/scale-k8s-cluster.md
Normal file
196
docs/runbooks/scale-k8s-cluster.md
Normal file
|
|
@ -0,0 +1,196 @@
|
|||
# Runbook: Scale K8s worker count (PVC capacity headroom)
|
||||
|
||||
Use when block-PVC pressure, memory pressure, or planned workload growth requires adding or removing K8s worker VMs. The cluster currently runs **6 workers (k8s-node1..6) + 1 control plane (k8s-master)**, sized to absorb the 2026-05-26 proxmox-csi LUN-cap incident with sustained headroom.
|
||||
|
||||
## Current shape
|
||||
|
||||
| Node | VMID | Memory | Disk | Special |
|
||||
|------|------|--------|------|---------|
|
||||
| k8s-master | 200 | 32 GiB | 64G | Control plane, no worker workloads |
|
||||
| k8s-node1 | 201 | 48 GiB | 256G | GPU host (NVIDIA Tesla T4 passthrough), DNS primary |
|
||||
| k8s-node2 | 202 | 32 GiB | 256G | |
|
||||
| k8s-node3 | 203 | 32 GiB | 256G | |
|
||||
| k8s-node4 | 204 | 32 GiB | 256G | |
|
||||
| k8s-node5 | 205 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
|
||||
| k8s-node6 | 206 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
|
||||
|
||||
Capacity envelope (6 workers): **174 block-PVC slots**, ~192 GiB memory, ~96 vCPU, GPU on node1 only. Pod cap is kubelet-default 110/node.
|
||||
|
||||
## Binding constraints — read these first
|
||||
|
||||
The cluster has 6 capacity dimensions. The one that bites first depends on workload shape; check each before adding/removing nodes.
|
||||
|
||||
1. **Per-VM block-PVC ceiling = 29** — hardcoded by `sergelogvinov/proxmox-csi-plugin` at `pkg/csi/utils.go:394` (`for lun = 1; lun < 30; lun++`). Symptom: pods stuck `ContainerCreating` with `FailedAttachVolume … no free lun found`. `CSINode.allocatable.count` advertises `28`/node. Switching `scsihw` to `virtio-scsi-single` does NOT raise this — it's a plugin constraint, not a Proxmox/QEMU one. See `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap".
|
||||
|
||||
2. **Memory commitment** — node1 has historically run hot (was 117% of limits before the 2026-06 memory bump to 48 GiB). Treat memory as the next-binding constraint after PVC slots, especially since limits-vs-requests divergence isn't enforced by the scheduler.
|
||||
|
||||
3. **sdc IO contention** — every K8s VM disk + TrueNAS NFS LV live on the same Proxmox thin pool on sdc (10.7 TB RAID1 HDD). Three IO storms in 17 days (2026-05-09, 2026-05-16/17, 2026-05-25) — see `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. Adding workers redistributes block PVCs but does NOT relieve underlying disk contention; that's beads `code-oflt`.
|
||||
|
||||
4. **GPU concentration** — Tesla T4 is passthrough-only on node1. Frigate ML / Immich ML / Whisper / Piper / llama-cpp all schedule there via `nvidia.com/gpu.present` label. Cannot be spread without provisioning a second GPU node.
|
||||
|
||||
5. **PVE host memory** — total PVE RAM 320 GiB. K8s VMs claim 240 GiB; TrueNAS / pfsense / Windows VMs claim ~80 GiB more. Adding a 32-GiB worker requires verifying PVE has the headroom (`free -h`).
|
||||
|
||||
6. **Per-stack Terraform state** — adding/removing nodes does NOT live in any single Terragrunt stack today. VMs are created via `scripts/provision-k8s-worker` (which calls `qm clone`). They are *not* managed declaratively in TF. Consequence: removal is a manual `kubectl delete node` + `qm stop` + `qm destroy`, not `tg destroy`.
|
||||
|
||||
## When to scale UP (add a worker)
|
||||
|
||||
Add a worker when **any** of these is true for ≥7 days:
|
||||
|
||||
| Trigger | Threshold | How to observe |
|
||||
|---------|-----------|----------------|
|
||||
| PVC slots per node | `max(per-node VA count) ≥ 25` (~86% of 29 cap) | `kubectl get volumeattachment -o json \| jq -r '.items[].spec.nodeName' \| sort \| uniq -c` |
|
||||
| Cluster memory requests | `> 90%` | `kubectl describe nodes \| grep -A4 "Allocated resources"` or Goldilocks dashboard |
|
||||
| Planned PVC additions | ≥3 net-new block PVCs in next sprint AND current max VA ≥ 22 | Project-tracker / beads |
|
||||
| LUN-cap incident | Even one `no free lun found` event | Prometheus alert `ProxmoxCSILunsExhausted` (added 2026-05-31, commit `aded77d5`) |
|
||||
| Sustained pod-eviction churn | Eviction count > 20/day for ≥3 days | `kubectl get events -A --field-selector reason=Evicted` |
|
||||
|
||||
### Playbook — add a worker
|
||||
|
||||
```bash
|
||||
# 1. Choose VMID + IP (next free in 10.0.20.0/22 worker range, 10.0.20.105+ used)
|
||||
NEXT_VMID=207
|
||||
NEXT_IP=10.0.20.107
|
||||
NAME=k8s-node7
|
||||
|
||||
# 2. Verify PVE memory headroom (need ≥34 GiB free for a 32-GiB VM with overhead)
|
||||
ssh root@192.168.1.127 'free -h; pvesh get /nodes/pve/status --output-format=json | jq .memory'
|
||||
|
||||
# 3. Verify thin pool has space (need ≥256 GiB raw thin allocation, but thin so only growth matters)
|
||||
ssh root@192.168.1.127 'lvs pve/data'
|
||||
|
||||
# 4. Clone + cloud-init + auto-join (idempotent — aborts if VMID or IP exists)
|
||||
scp scripts/provision-k8s-worker root@192.168.1.127:/tmp/
|
||||
ssh root@192.168.1.127 'bash /tmp/provision-k8s-worker '"$NAME $NEXT_VMID $NEXT_IP"
|
||||
|
||||
# 5. Wait for node to appear Ready (3-5 min for cloud-init + kubeadm join)
|
||||
kubectl get nodes -w
|
||||
|
||||
# 6. Verify CSI registration (proxmox-csi + nfs-csi node pods)
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NAME | grep -E "csi|calico"
|
||||
|
||||
# 7. Confirm Goldilocks / Kyverno / Prometheus targets it (DaemonSets populate within ~2 min)
|
||||
kubectl get ds -A -o wide | awk '{print $7,$8}' | head -20
|
||||
|
||||
# 8. Update this runbook's "Current shape" table
|
||||
```
|
||||
|
||||
**Post-add validation:**
|
||||
- `kubectl top node $NAME` reports stats (kubelet metrics OK)
|
||||
- A test pod with a `proxmox-lvm` PVC schedules there and binds
|
||||
- No new alerts firing in monitoring
|
||||
|
||||
## When to scale DOWN (drain a worker)
|
||||
|
||||
Scale down when **all** of these hold for ≥30 days:
|
||||
|
||||
| Condition | Threshold |
|
||||
|-----------|-----------|
|
||||
| Max-node PVC count | `≤ 20` (≈70% of cap) |
|
||||
| Cluster memory requests | `< 70%` |
|
||||
| Cluster memory limits | `< 95%` (no over-committed node) |
|
||||
| No upcoming workload additions | Confirmed via beads / project tracker |
|
||||
|
||||
Scaling down is also reasonable as a deliberate trade-off (cost, IO reduction, consolidation) even if thresholds aren't met — but accept that the next scale-up cycle will incur the LUN-cap risk again.
|
||||
|
||||
### Playbook — drain + remove a worker
|
||||
|
||||
**Pick the lightest node first.** Survey before draining:
|
||||
|
||||
```bash
|
||||
NODE=k8s-node5
|
||||
|
||||
# 1. Inventory what's there
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE \
|
||||
| awk 'NR>1 {print $1}' | sort | uniq -c # pods per namespace
|
||||
|
||||
# 2. List drain blockers (local-path PVCs in use, GPU pods, single-replica services)
|
||||
kubectl get pvc -A -o json | jq -r --arg n "$NODE" '.items[]
|
||||
| select(.spec.storageClassName == "local-path")
|
||||
| select(.status.phase == "Bound")
|
||||
| "\(.metadata.namespace)/\(.metadata.name)"'
|
||||
|
||||
# 3. Check presence board — is anyone mutating workloads on this node right now?
|
||||
~/code/scripts/presence list
|
||||
# If a `service:*` claim covers any pod on $NODE, DEFER until released.
|
||||
|
||||
# 4. Cordon (mark unschedulable, existing pods stay)
|
||||
kubectl cordon $NODE
|
||||
|
||||
# 5. Watch memory pressure forecast on remaining nodes BEFORE evicting
|
||||
kubectl top nodes # baseline
|
||||
# Expected addition: ~ (sum of pod memory requests on $NODE) / (N - 1) per other node
|
||||
|
||||
# 6. Drain (respects PDBs; --delete-emptydir-data needed for tmp volumes)
|
||||
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=15m
|
||||
|
||||
# Expected blips during drain (~30s-2min each for PVC reattach):
|
||||
# any singleton on $NODE (Deployment replicas=1 or StatefulSet with no peers)
|
||||
# Multi-replica services with PDB just roll without downtime.
|
||||
|
||||
# 7. Verify everything rescheduled cleanly
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
|
||||
# Should show only DaemonSet pods + Completed jobs
|
||||
|
||||
# 8. Remove from cluster
|
||||
kubectl delete node $NODE
|
||||
|
||||
# 9. Shut down + (optional) destroy the VM
|
||||
VMID=205
|
||||
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300; qm status $VMID"
|
||||
# To fully destroy (frees thin-pool space):
|
||||
# ssh root@192.168.1.127 "qm destroy $VMID --purge"
|
||||
|
||||
# 10. Verify post-drain shape
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r '.items[] | select(.spec.attacher == "csi.proxmox.sinextra.dev") | .spec.nodeName' \
|
||||
| sort | uniq -c
|
||||
|
||||
# 11. Update this runbook's "Current shape" table
|
||||
```
|
||||
|
||||
**Cold-spare option:** instead of `qm destroy`, keep the VM stopped. The 256 GiB disk stays allocated on thin pool but the VM consumes no CPU/RAM. Re-add via `qm start <VMID>` + `kubeadm join` (the snippet still lives at `/var/lib/vz/snippets/k8s_cloud_init.yaml`).
|
||||
|
||||
## Special cases
|
||||
|
||||
### Critical singletons that blip during drain
|
||||
|
||||
These services are single-replica and incur ~30s-2min outages while their PVC reattaches to the new node:
|
||||
|
||||
- **Stateful databases**: `mysql-standalone-0`, `pg-cluster-*` members (CNPG handles failover gracefully)
|
||||
- **Mail**: `mailserver`, `roundcubemail` (Dovecot maildir locking — defer if mid-incident)
|
||||
- **Browser-trust services**: `nextcloud` (sessions reset), `vaultwarden` (active sessions blip)
|
||||
- **Observability**: `prometheus-server` (scrape data gap), `claude-memory`
|
||||
- **Self-hosted apps with SQLite**: hackmd, n8n, paperless-ngx, freshrss, navidrome, audiobookshelf
|
||||
|
||||
Coordinate the drain timing with users if any of these is on the node being drained. Single-pod Postgres/MySQL DBs are the most painful — schedule during low-traffic windows.
|
||||
|
||||
### GPU pods
|
||||
|
||||
GPU pods scheduled via `nvidia.com/gpu.present=true` node selector. They **cannot** drain off node1; if node1 itself needs maintenance, scale GPU workloads to 0 first or defer drain. See `docs/runbooks/k8s-node-auto-upgrades.md` for the kured-driven reboot path.
|
||||
|
||||
### Active sessions
|
||||
|
||||
Check `~/code/scripts/presence list` before any drain. If another session holds a claim on a service hosted on the target node, defer or coordinate.
|
||||
|
||||
### Force-clean stuck VolumeAttachments
|
||||
|
||||
If a drained node has lingering VolumeAttachment entries after `kubectl delete node`:
|
||||
|
||||
```bash
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
|
||||
| xargs -r kubectl patch volumeattachment -p '{"metadata":{"finalizers":null}}' --type=merge
|
||||
kubectl get volumeattachment -o json \
|
||||
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
|
||||
| xargs -r kubectl delete volumeattachment
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap" — root-cause explanation of the PVC ceiling
|
||||
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention on sdc
|
||||
- `docs/runbooks/k8s-node-auto-upgrades.md` — kured-driven rolling reboots (separate from scale)
|
||||
- `docs/runbooks/restore-full-cluster.md` — disaster scenarios
|
||||
- `scripts/provision-k8s-worker` — the actual cloning/join script
|
||||
- Beads `code-oflt` — IO isolation (long-term fix for sdc contention)
|
||||
- Remote memory id=2788 — `proxmox-csi-plugin hardcodes a per-VM SCSI-LUN ceiling`
|
||||
Loading…
Add table
Add a link
Reference in a new issue