runbook: K8s worker scaling for PVC capacity headroom

Documents the 6-worker cluster shape (post 2026-05-26 scale-up after the proxmox-csi LUN-cap incident), the six binding constraints (plugin LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration on node1, PVE host memory, no Terraform management for K8s VMs), and the playbooks for adding/removing workers.

Scale-up triggers:
- max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
- cluster memory requests > 90%
- LUN-cap incident
- planned ≥3 net-new block PVCs when max VA already ≥ 22

Scale-down conditions:
- max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days

Playbooks lean on scripts/provision-k8s-worker (clones template 2000, cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md, beads code-oflt (IO contention long-term fix).
2026-06-01 19:25:19 +00:00 · 3fa9e2409c
commit 3fa9e2409c
parent 5c77482a8c
1 changed file with 196 additions and 0 deletions
--- a/docs/runbooks/scale-k8s-cluster.md
+++ b/docs/runbooks/scale-k8s-cluster.md
@@ -0,0 +1,196 @@
+# Runbook: Scale K8s worker count (PVC capacity headroom)
+
+Use when block-PVC pressure, memory pressure, or planned workload growth requires adding or removing K8s worker VMs. The cluster currently runs **6 workers (k8s-node1..6) + 1 control plane (k8s-master)**, scaled up after the 2026-05-26 proxmox-csi LUN-cap incident to leave sustained headroom.
+
+## Current shape
+
+| Node | VMID | Memory | Disk | Special |
+|------|------|--------|------|---------|
+| k8s-master | 200 | 32 GiB | 64G | Control plane, no worker workloads |
+| k8s-node1 | 201 | 48 GiB | 256G | GPU host (NVIDIA Tesla T4 passthrough), DNS primary |
+| k8s-node2 | 202 | 32 GiB | 256G | |
+| k8s-node3 | 203 | 32 GiB | 256G | |
+| k8s-node4 | 204 | 32 GiB | 256G | |
+| k8s-node5 | 205 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
+| k8s-node6 | 206 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
+
+Capacity envelope (6 workers): **174 block-PVC slots** (6 × 29), ~208 GiB memory (48 + 5 × 32, per the table above), ~96 vCPU, GPU on node1 only. Pod cap is the kubelet default of 110/node.
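+
+The slot figure is workers × per-VM cap. A small sketch of that arithmetic, using the 29-LUN plugin ceiling and the 28 that `CSINode` advertises (both covered under the binding constraints); `pvc_slots` is a hypothetical helper name, not something in the repo:
+
+```bash
+# pvc_slots WORKERS PER_NODE_CAP
+# Hypothetical helper: prints total block-PVC slots for the cluster.
+pvc_slots() {
+  local workers=$1 cap=$2
+  echo $(( workers * cap ))
+}
+
+pvc_slots 6 29   # hard ceiling from the plugin's LUN loop
+pvc_slots 6 28   # slots the scheduler can use if CSINode advertises 28/node
+```
+
+The gap between the two results is the difference between what the plugin can attach and what the scheduler will place.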
+
+## Binding constraints — read these first
+
+The cluster has 6 capacity dimensions. The one that bites first depends on workload shape; check each before adding/removing nodes.
+
+1. **Per-VM block-PVC ceiling = 29** — hardcoded by `sergelogvinov/proxmox-csi-plugin` at `pkg/csi/utils.go:394` (`for lun = 1; lun < 30; lun++`). Symptom: pods stuck `ContainerCreating` with `FailedAttachVolume … no free lun found`. `CSINode.allocatable.count` advertises `28`/node. Switching `scsihw` to `virtio-scsi-single` does NOT raise this — it's a plugin constraint, not a Proxmox/QEMU one. See `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap".
+
+2. **Memory commitment** — node1 has historically run hot (117% of limits before the 2026-06 memory bump to 48 GiB). Treat memory as the next binding constraint after PVC slots: the scheduler places pods by requests only, so summed limits can exceed node capacity without anything blocking the placement.
+
+3. **sdc IO contention** — every K8s VM disk and the TrueNAS NFS LV live on the same Proxmox thin pool on sdc (10.7 TB RAID1 HDD). There were three IO storms in 17 days (2026-05-09, 2026-05-16/17, 2026-05-25); see `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. Adding workers redistributes block PVCs but does NOT relieve the underlying disk contention; that fix is tracked in beads `code-oflt`.
+
+4. **GPU concentration** — Tesla T4 is passthrough-only on node1. Frigate ML / Immich ML / Whisper / Piper / llama-cpp all schedule there via `nvidia.com/gpu.present` label. Cannot be spread without provisioning a second GPU node.
+
+5. **PVE host memory** — total PVE RAM 320 GiB. K8s VMs claim 240 GiB; TrueNAS / pfsense / Windows VMs claim ~80 GiB more. Adding a 32-GiB worker requires verifying PVE has the headroom (`free -h`).
+
+6. **Per-stack Terraform state** — node lifecycle does NOT live in any single Terragrunt stack today. VMs are created via `scripts/provision-k8s-worker` (which calls `qm clone`) and are *not* managed declaratively in TF. Consequence: removal is a manual `kubectl delete node` + `qm shutdown` + optional `qm destroy`, not `tg destroy`.
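+
+The PVE host memory check in constraint 5 can be reduced to one comparison. A minimal sketch, assuming the 2 GiB overhead implied by the add-worker playbook (≥34 GiB free for a 32-GiB VM); `has_pve_headroom` is a hypothetical helper name:
+
+```bash
+# has_pve_headroom AVAILABLE_GIB VM_GIB [OVERHEAD_GIB]
+# Hypothetical helper: succeeds when the host can fit the new VM plus overhead.
+has_pve_headroom() {
+  local avail=$1 vm=$2 overhead=${3:-2}
+  [ "$avail" -ge $(( vm + overhead )) ]
+}
+
+# Usage (column 7 of the "Mem:" row in `free -g` is "available"):
+#   avail=$(ssh root@192.168.1.127 "free -g | awk '/^Mem:/ {print \$7}'")
+#   has_pve_headroom "$avail" 32 || echo "not enough PVE memory for a 32-GiB worker"
+```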
+
+## When to scale UP (add a worker)
+
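+Add a worker when **any** of these triggers fires. The sustained triggers (PVC slots, memory requests) must hold for ≥7 days; the others apply as stated in their row: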
+
+| Trigger | Threshold | How to observe |
+|---------|-----------|----------------|
+| PVC slots per node | `max(per-node VA count) ≥ 25` (~86% of 29 cap) | `kubectl get volumeattachment -o json \| jq -r '.items[].spec.nodeName' \| sort \| uniq -c` |
+| Cluster memory requests | `> 90%` | `kubectl describe nodes \| grep -A5 "Allocated resources"` (the memory row is the fifth line after the match) or Goldilocks dashboard |
+| Planned PVC additions | ≥3 net-new block PVCs in next sprint AND current max VA ≥ 22 | Project-tracker / beads |
+| LUN-cap incident | Even one `no free lun found` event | Prometheus alert `ProxmoxCSILunsExhausted` (added 2026-05-31, commit `aded77d5`) |
+| Sustained pod-eviction churn | Eviction count > 20/day for ≥3 days | `kubectl get events -A --field-selector reason=Evicted` |
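+
+The PVC-slot trigger can be checked mechanically. A minimal sketch, assuming the `sort | uniq -c` output from the first row is piped in; `hot_nodes` is a hypothetical helper name, and it covers only the threshold, not the ≥7-day window:
+
+```bash
+# hot_nodes [THRESHOLD]
+# Hypothetical helper: reads "COUNT NODE" lines (uniq -c output) on stdin and
+# prints "NODE COUNT" for every node at or above the threshold (default 25).
+hot_nodes() {
+  local threshold=${1:-25}
+  awk -v t="$threshold" '$1 >= t {print $2, $1}'
+}
+
+# Usage:
+#   kubectl get volumeattachment -o json \
+#     | jq -r '.items[].spec.nodeName' | sort | uniq -c | hot_nodes
+```
+
+Empty output means no node has reached the threshold.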
+
+### Playbook — add a worker
+
+```bash
+# 1. Choose VMID + IP (next free in 10.0.20.0/22 worker range, 10.0.20.105+ used)
+NEXT_VMID=207
+NEXT_IP=10.0.20.107
+NAME=k8s-node7
+
+# 2. Verify PVE memory headroom (need ≥34 GiB free for a 32-GiB VM with overhead)
+ssh root@192.168.1.127 'free -h; pvesh get /nodes/pve/status --output-format=json | jq .memory'
+
+# 3. Verify thin pool has space (the 256 GiB disk is thin-provisioned, so only actual growth consumes pool space)
+ssh root@192.168.1.127 'lvs pve/data'
+
+# 4. Clone + cloud-init + auto-join (safe to re-run: aborts if VMID or IP exists)
+scp scripts/provision-k8s-worker root@192.168.1.127:/tmp/
+ssh root@192.168.1.127 'bash /tmp/provision-k8s-worker '"$NAME $NEXT_VMID $NEXT_IP"
+
+# 5. Wait for node to appear Ready (3-5 min for cloud-init + kubeadm join)
+kubectl get nodes -w
+
+# 6. Verify CSI registration (proxmox-csi + nfs-csi node pods)
+kubectl get pods -A -o wide --field-selector spec.nodeName=$NAME | grep -E "csi|calico"
+
+# 7. Confirm Goldilocks / Kyverno / Prometheus targets it (DaemonSets populate within ~2 min)
+kubectl get ds -A | awk '{print $1,$2,$3,$5}' | head -20   # namespace, name, desired, ready
+
+# 8. Update this runbook's "Current shape" table
+```
+
+**Post-add validation:**
+- `kubectl top node $NAME` reports stats (kubelet metrics OK)
+- A test pod with a `proxmox-lvm` PVC schedules there and binds
+- No new alerts firing in monitoring
+
+## When to scale DOWN (drain a worker)
+
+Scale down when **all** of these hold for ≥30 days:
+
+| Condition | Threshold |
+|-----------|-----------|
+| Max-node PVC count | `≤ 20` (≈70% of cap) |
+| Cluster memory requests | `< 70%` |
+| Cluster memory limits | `< 95%` (no over-committed node) |
+| No upcoming workload additions | Confirmed via beads / project tracker |
+
+Scaling down is also reasonable as a deliberate trade-off (cost, IO reduction, consolidation) even if thresholds aren't met — but accept that the next scale-up cycle will incur the LUN-cap risk again.
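+
+The three numeric conditions combine with a logical AND. A minimal sketch, assuming whole-number percentages; `can_scale_down` is a hypothetical helper name, and the ≥30-day window and the "no upcoming workloads" check remain manual:
+
+```bash
+# can_scale_down MAX_NODE_PVCS MEM_REQUEST_PCT MEM_LIMIT_PCT
+# Hypothetical helper: succeeds only if every numeric threshold is met.
+can_scale_down() {
+  local max_pvcs=$1 req_pct=$2 lim_pct=$3
+  [ "$max_pvcs" -le 20 ] && [ "$req_pct" -lt 70 ] && [ "$lim_pct" -lt 95 ]
+}
+
+# Usage:
+#   can_scale_down 18 62 88 && echo "numeric thresholds met"
+```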
+
+### Playbook — drain + remove a worker
+
+**Pick the lightest node first.** Survey before draining:
+
+```bash
+NODE=k8s-node5
+
+# 1. Inventory what's there
+kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE \
+  | awk 'NR>1 {print $1}' | sort | uniq -c   # pods per namespace
+
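+# 2. List drain blockers (local-path PVCs in use, GPU pods, single-replica services)
+# Filters on the selected-node annotation; this assumes the local-path
+# StorageClass uses WaitForFirstConsumer binding, which sets that annotation.
+kubectl get pvc -A -o json | jq -r --arg n "$NODE" '.items[]
+  | select(.spec.storageClassName == "local-path")
+  | select(.status.phase == "Bound")
+  | select(.metadata.annotations["volume.kubernetes.io/selected-node"] == $n)
+  | "\(.metadata.namespace)/\(.metadata.name)"'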
+
+# 3. Check presence board — is anyone mutating workloads on this node right now?
+~/code/scripts/presence list
+# If a `service:*` claim covers any pod on $NODE, DEFER until released.
+
+# 4. Cordon (mark unschedulable, existing pods stay)
+kubectl cordon $NODE
+
+# 5. Watch memory pressure forecast on remaining nodes BEFORE evicting
+kubectl top nodes  # baseline
+# Expected addition: ~ (sum of pod memory requests on $NODE) / (N - 1) per other node
+
+# 6. Drain (respects PDBs; --delete-emptydir-data needed for tmp volumes)
+kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=15m
+
+# Expected blips during drain (~30s-2min each for PVC reattach):
+#   any singleton on $NODE (Deployment replicas=1 or StatefulSet with no peers)
+# Multi-replica services with PDB just roll without downtime.
+
+# 7. Verify everything rescheduled cleanly
+kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
+# Should show only DaemonSet pods + Completed jobs
+
+# 8. Remove from cluster
+kubectl delete node $NODE
+
+# 9. Shut down + (optional) destroy the VM
+VMID=205
+ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300; qm status $VMID"
+# To fully destroy (frees thin-pool space):
+# ssh root@192.168.1.127 "qm destroy $VMID --purge"
+
+# 10. Verify post-drain shape
+kubectl get volumeattachment -o json \
+  | jq -r '.items[] | select(.spec.attacher == "csi.proxmox.sinextra.dev") | .spec.nodeName' \
+  | sort | uniq -c
+
+# 11. Update this runbook's "Current shape" table
+```
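+
+The forecast in step 5 is a single division. A minimal sketch, assuming memory requests are summed in MiB beforehand and that the load spreads evenly (the scheduler gives no such guarantee); `per_node_addition_mib` is a hypothetical helper name:
+
+```bash
+# per_node_addition_mib TOTAL_REQUESTS_MIB WORKER_COUNT
+# Hypothetical helper: prints the expected extra memory requests per remaining
+# worker once one worker is drained, assuming an even spread.
+per_node_addition_mib() {
+  local total_mib=$1 workers=$2
+  # With one worker or fewer there is no remaining node to absorb the load.
+  [ "$workers" -gt 1 ] || return 1
+  echo $(( total_mib / (workers - 1) ))
+}
+
+# Usage: 20480 MiB of requests on the drained node, 6 workers today
+#   per_node_addition_mib 20480 6
+```
+
+Compare the result against the free allocatable memory on the busiest remaining node, not the average.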
+
+**Cold-spare option:** instead of `qm destroy`, keep the VM stopped. The 256 GiB disk stays allocated on thin pool but the VM consumes no CPU/RAM. Re-add via `qm start <VMID>` + `kubeadm join` (the snippet still lives at `/var/lib/vz/snippets/k8s_cloud_init.yaml`).
+
+## Special cases
+
+### Critical singletons that blip during drain
+
+These services are single-replica and incur ~30s-2min outages while their PVC reattaches to the new node:
+
+- **Stateful databases**: `mysql-standalone-0`, `pg-cluster-*` members (CNPG handles failover gracefully)
+- **Mail**: `mailserver`, `roundcubemail` (Dovecot maildir locking — defer if mid-incident)
+- **Browser-trust services**: `nextcloud` (sessions reset), `vaultwarden` (active sessions blip)
+- **Observability**: `prometheus-server` (scrape data gap), `claude-memory`
+- **Self-hosted apps with SQLite**: hackmd, n8n, paperless-ngx, freshrss, navidrome, audiobookshelf
+
+Coordinate the drain timing with users if any of these is on the node being drained. Single-pod Postgres/MySQL DBs are the most painful — schedule during low-traffic windows.
+
+### GPU pods
+
+GPU pods are scheduled via the `nvidia.com/gpu.present=true` node selector. They **cannot** drain off node1; if node1 itself needs maintenance, scale GPU workloads to 0 first or defer the drain. See `docs/runbooks/k8s-node-auto-upgrades.md` for the kured-driven reboot path.
+
+### Active sessions
+
+Check `~/code/scripts/presence list` before any drain. If another session holds a claim on a service hosted on the target node, defer or coordinate.
+
+### Force-clean stuck VolumeAttachments
+
+If a drained node has lingering VolumeAttachment entries after `kubectl delete node`:
+
+```bash
+kubectl get volumeattachment -o json \
+  | jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
+  | xargs -r kubectl patch volumeattachment -p '{"metadata":{"finalizers":null}}' --type=merge
+kubectl get volumeattachment -o json \
+  | jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
+  | xargs -r kubectl delete volumeattachment
+```
+
+## Related
+
+- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap" — root-cause explanation of the PVC ceiling
+- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention on sdc
+- `docs/runbooks/k8s-node-auto-upgrades.md` — kured-driven rolling reboots (separate from scale)
+- `docs/runbooks/restore-full-cluster.md` — disaster scenarios
+- `scripts/provision-k8s-worker` — the actual cloning/join script
+- Beads `code-oflt` — IO isolation (long-term fix for sdc contention)
+- Remote memory id=2788 — `proxmox-csi-plugin hardcodes a per-VM SCSI-LUN ceiling`