Documents the 6-worker cluster shape (post 2026-05-26 scale-up after the proxmox-csi LUN-cap incident), the six binding constraints (plugin LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration on node1, PVE host memory, no Terraform management for K8s VMs), and the playbooks for adding/removing workers.

Scale-up triggers:
- max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
- cluster memory requests > 90%
- LUN-cap incident
- planned ≥3 net-new block PVCs when max VA already ≥ 22

Scale-down conditions:
- max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days

Playbooks lean on scripts/provision-k8s-worker (clones template 2000, cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md, beads code-oflt (IO contention long-term fix).
Runbook: Scale K8s worker count (PVC capacity headroom)
Use when block-PVC pressure, memory pressure, or planned workload growth requires adding or removing K8s worker VMs. The cluster currently runs 6 workers (k8s-node1..6) + 1 control plane (k8s-master), sized to absorb the 2026-05-26 proxmox-csi LUN-cap incident with sustained headroom.
Current shape
| Node | VMID | Memory | Disk | Special |
|---|---|---|---|---|
| k8s-master | 200 | 32 GiB | 64G | Control plane, no worker workloads |
| k8s-node1 | 201 | 48 GiB | 256G | GPU host (NVIDIA Tesla T4 passthrough), DNS primary |
| k8s-node2 | 202 | 32 GiB | 256G | |
| k8s-node3 | 203 | 32 GiB | 256G | |
| k8s-node4 | 204 | 32 GiB | 256G | |
| k8s-node5 | 205 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
| k8s-node6 | 206 | 32 GiB | 256G | Added 2026-05-26 (LUN-cap incident) |
Capacity envelope (6 workers): 174 block-PVC slots (6 × 29), ~208 GiB memory (48 GiB on node1 plus 5 × 32 GiB), ~96 vCPU, GPU on node1 only. Pod cap is the kubelet default of 110/node.
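The envelope figures can be re-derived from the "Current shape" table whenever a worker is added or removed. The following is a minimal sketch, not part of the existing tooling; the function names `pvc_slots` and `total_memory_gib` are made up for illustration, and the per-node memory values are copied from the table.

```shell
# Sketch: recompute the capacity envelope from the "Current shape" table.
# LUN_CAP=29 is the per-VM block-PVC ceiling stated in this runbook.
LUN_CAP=29

# Total block-PVC slots for a given worker count (hypothetical helper).
pvc_slots() {
  local workers=$1
  echo $(( workers * LUN_CAP ))
}

# Sum per-worker memory sizes in GiB, passed as arguments (hypothetical helper).
total_memory_gib() {
  local sum=0 m
  for m in "$@"; do
    sum=$(( sum + m ))
  done
  echo "$sum"
}

pvc_slots 6                          # prints 174
total_memory_gib 48 32 32 32 32 32   # node1..node6, prints 208
```

Adding a seventh 32 GiB worker would be `pvc_slots 7` and one more `32` argument.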
Binding constraints — read these first
The cluster has 6 capacity dimensions. The one that bites first depends on workload shape; check each before adding/removing nodes.
- Per-VM block-PVC ceiling = 29: hardcoded by `sergelogvinov/proxmox-csi-plugin` at `pkg/csi/utils.go:394` (`for lun = 1; lun < 30; lun++`). Symptom: pods stuck `ContainerCreating` with `FailedAttachVolume … no free lun found`. `CSINode.allocatable.count` advertises 28/node. Switching `scsihw` to `virtio-scsi-single` does NOT raise this; it's a plugin constraint, not a Proxmox/QEMU one. See `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap".
- Memory commitment: node1 has historically run hot (was 117% of limits before the 2026-06 memory bump to 48 GiB). Treat memory as the next-binding constraint after PVC slots, especially since limits-vs-requests divergence isn't enforced by the scheduler.
- sdc IO contention: every K8s VM disk + TrueNAS NFS LV live on the same Proxmox thin pool on sdc (10.7 TB RAID1 HDD). Three IO storms in 17 days (2026-05-09, 2026-05-16/17, 2026-05-25); see `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. Adding workers redistributes block PVCs but does NOT relieve underlying disk contention; that's beads `code-oflt`.
- GPU concentration: Tesla T4 is passthrough-only on node1. Frigate ML / Immich ML / Whisper / Piper / llama-cpp all schedule there via the `nvidia.com/gpu.present` label. Cannot be spread without provisioning a second GPU node.
- PVE host memory: total PVE RAM 320 GiB. K8s VMs claim 240 GiB; TrueNAS / pfsense / Windows VMs claim ~80 GiB more. Adding a 32-GiB worker requires verifying PVE has the headroom (`free -h`).
- Per-stack Terraform state: adding/removing nodes does NOT live in any single Terragrunt stack today. VMs are created via `scripts/provision-k8s-worker` (which calls `qm clone`). They are not managed declaratively in TF. Consequence: removal is a manual `kubectl delete node` + `qm stop` + `qm destroy`, not `tg destroy`.
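The PVE host memory check can be made less eyeball-driven. Below is a rough sketch, assuming the "≥34 GiB free for a 32-GiB VM" figure from the add-a-worker playbook (a 2 GiB overhead) and assuming `MemAvailable` from `/proc/meminfo` is an acceptable proxy for free host memory. Both helper names are hypothetical.

```shell
# Extract MemAvailable (GiB, rounded down) from /proc/meminfo-style input on stdin.
mem_available_gib() {
  awk '/^MemAvailable:/ { print int($2 / 1048576) }'
}

# Succeeds if available memory covers the new VM plus overhead.
# $1 = available GiB on the PVE host, $2 = VM size in GiB,
# $3 = overhead in GiB (default 2, inferred from "34 GiB for a 32-GiB VM").
pve_has_headroom() {
  local avail=$1 vm=$2 overhead=${3:-2}
  [ "$avail" -ge $(( vm + overhead )) ]
}

# Illustrative values only, not measurements from the host:
pve_has_headroom 40 32 && echo "fits"
pve_has_headroom 33 32 || echo "insufficient"
```

Against the real host this could be fed with something like `ssh root@192.168.1.127 cat /proc/meminfo | mem_available_gib`. Note that configured VM memory (the 240 GiB + ~80 GiB claims above) and memory actually in use on the host can differ, so the result is a hint rather than a guarantee.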
When to scale UP (add a worker)
Add a worker when any of these is true for ≥7 days (rows that state their own window or count, such as a single LUN-cap incident, use that instead):
| Trigger | Threshold | How to observe |
|---|---|---|
| PVC slots per node | max(per-node VA count) ≥ 25 (~86% of 29 cap) | `kubectl get volumeattachment -o json \| jq -r '.items[].spec.nodeName' \| sort \| uniq -c` |
| Cluster memory requests | > 90% | `kubectl describe nodes \| grep -A4 "Allocated resources"` or Goldilocks dashboard |
| Planned PVC additions | ≥3 net-new block PVCs in next sprint AND current max VA ≥ 22 | Project-tracker / beads |
| LUN-cap incident | Even one `no free lun found` event | Prometheus alert `ProxmoxCSILunsExhausted` (added 2026-05-31, commit aded77d5) |
| Sustained pod-eviction churn | Eviction count > 20/day for ≥3 days | kubectl get events -A --field-selector reason=Evicted |
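The PVC-slot trigger can be evaluated mechanically from the `uniq -c` output of the VolumeAttachment command in the table. This is a sketch under stated assumptions: `check_va_trigger` is a hypothetical helper, the thresholds are the ones in the table, and a single sample says nothing about whether the condition has persisted for 7 days.

```shell
LUN_CAP=29        # per-VM ceiling from this runbook
SCALE_UP_AT=25    # PVC-slot trigger threshold from the table

# Reads `uniq -c` style lines ("COUNT NODE") on stdin and reports the
# busiest node, its free slots, and whether the threshold is reached.
check_va_trigger() {
  awk -v cap="$LUN_CAP" -v thr="$SCALE_UP_AT" '
    NF >= 2 && $1 + 0 >= max { max = $1 + 0; node = $2 }
    END {
      if (node == "") { print "no attachments found"; exit }
      status = (max >= thr) ? "SCALE-UP" : "ok"
      printf "%s max=%d free=%d %s\n", node, max, cap - max, status
    }'
}

# Made-up counts for illustration:
printf '  26 k8s-node1\n  14 k8s-node2\n' | check_va_trigger
# prints: k8s-node1 max=26 free=3 SCALE-UP
```

In practice the input would come from the `kubectl get volumeattachment ... | sort | uniq -c` pipeline rather than `printf`.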
Playbook — add a worker
# 1. Choose VMID + IP (next free in 10.0.20.0/22 worker range, 10.0.20.105+ used)
NEXT_VMID=207
NEXT_IP=10.0.20.107
NAME=k8s-node7
# 2. Verify PVE memory headroom (need ≥34 GiB free for a 32-GiB VM with overhead)
ssh root@192.168.1.127 'free -h; pvesh get /nodes/pve/status --output-format=json | jq .memory'
# 3. Verify thin pool has space (the 256 GiB disk is thin-provisioned, so only actual growth consumes pool space)
ssh root@192.168.1.127 'lvs pve/data'
# 4. Clone + cloud-init + auto-join (idempotent — aborts if VMID or IP exists)
scp scripts/provision-k8s-worker root@192.168.1.127:/tmp/
ssh root@192.168.1.127 'bash /tmp/provision-k8s-worker '"$NAME $NEXT_VMID $NEXT_IP"
# 5. Wait for node to appear Ready (3-5 min for cloud-init + kubeadm join)
kubectl get nodes -w
# 6. Verify CSI registration (proxmox-csi + nfs-csi node pods)
kubectl get pods -A -o wide --field-selector spec.nodeName=$NAME | grep -E "csi|calico"
# 7. Confirm Goldilocks / Kyverno / Prometheus targets it (DaemonSets populate within ~2 min)
# Columns printed: namespace, name, desired, ready (ready should equal desired)
kubectl get ds -A -o wide | awk '{print $1,$2,$3,$5}' | head -20
# 8. Update this runbook's "Current shape" table
Post-add validation:
- `kubectl top node $NAME` reports stats (kubelet metrics OK)
- A test pod with a `proxmox-lvm` PVC schedules there and binds
- No new alerts firing in monitoring
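For the test-pod check, one possible approach is to generate a throwaway PVC and pod pinned to the new node. This is a sketch only: it assumes `proxmox-lvm` is the StorageClass name and that the node's `kubernetes.io/hostname` label matches its node name. The resource name `lun-smoke-test`, the `busybox:1.36` image, and the helper `smoke_test_manifest` are placeholders invented for illustration. A `nodeSelector` is used rather than `spec.nodeName` so the scheduler still takes part in volume binding.

```shell
# Prints a PVC + Pod manifest that targets one node via nodeSelector.
# $1 = node name. All resource names here are placeholders for this sketch.
smoke_test_manifest() {
  local node=$1
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lun-smoke-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: proxmox-lvm   # assumed StorageClass name
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: lun-smoke-test
spec:
  nodeSelector:
    kubernetes.io/hostname: ${node}
  containers:
    - name: sleep
      image: busybox:1.36
      command: ["sleep", "300"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: lun-smoke-test
EOF
}
```

Usage would be along the lines of `smoke_test_manifest "$NAME" | kubectl apply -f -`, followed by `kubectl get pod,pvc lun-smoke-test` and a matching `kubectl delete -f -` once the pod is Running and the PVC is Bound.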
When to scale DOWN (drain a worker)
Scale down when all of these hold for ≥30 days:
| Condition | Threshold |
|---|---|
| Max-node PVC count | ≤ 20 (≈70% of cap) |
| Cluster memory requests | < 70% |
| Cluster memory limits | < 95% (no over-committed node) |
| No upcoming workload additions | Confirmed via beads / project tracker |
Scaling down is also reasonable as a deliberate trade-off (cost, IO reduction, consolidation) even if thresholds aren't met — but accept that the next scale-up cycle will incur the LUN-cap risk again.
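The numeric scale-down conditions can be combined into one check. A minimal sketch, assuming integer percentages and a hypothetical helper name `scale_down_ok`; it covers only the three measurable thresholds, so the 30-day window and the "no upcoming workload additions" confirmation still need a human.

```shell
# Succeeds only if every numeric scale-down threshold holds.
# $1 = max per-node PVC count, $2 = cluster memory requests (%),
# $3 = cluster memory limits (%). Integer percentages are assumed.
scale_down_ok() {
  local max_pvc=$1 req_pct=$2 lim_pct=$3
  [ "$max_pvc" -le 20 ] && [ "$req_pct" -lt 70 ] && [ "$lim_pct" -lt 95 ]
}

# Illustrative inputs, not live cluster readings:
scale_down_ok 18 62 88 && echo "eligible"
scale_down_ok 23 62 88 || echo "hold"
```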
Playbook — drain + remove a worker
Pick the lightest node first. Survey before draining:
NODE=k8s-node5
# 1. Inventory what's there
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE \
| awk 'NR>1 {print $1}' | sort | uniq -c # pods per namespace
# 2. List drain blockers (local-path PVCs in use, GPU pods, single-replica services)
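# PVCs do not record a node, so filter bound local-path PVs by node affinity
# (assumes the local-path provisioner sets hostname affinity on each PV)
kubectl get pv -o json | jq -r --arg n "$NODE" '.items[]
  | select(.spec.storageClassName == "local-path")
  | select(.status.phase == "Bound")
  | select(any(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.values[]?; . == $n))
  | "\(.spec.claimRef.namespace)/\(.spec.claimRef.name)"'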
# 3. Check presence board — is anyone mutating workloads on this node right now?
~/code/scripts/presence list
# If a `service:*` claim covers any pod on $NODE, DEFER until released.
# 4. Cordon (mark unschedulable, existing pods stay)
kubectl cordon $NODE
# 5. Watch memory pressure forecast on remaining nodes BEFORE evicting
kubectl top nodes # baseline
# Expected addition: ~ (sum of pod memory requests on $NODE) / (N - 1) per other node
# 6. Drain (respects PDBs; --delete-emptydir-data needed for tmp volumes)
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Expected blips during drain (~30s-2min each for PVC reattach):
# any singleton on $NODE (Deployment replicas=1 or StatefulSet with no peers)
# Multi-replica services with PDB just roll without downtime.
# 7. Verify everything rescheduled cleanly
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE
# Should show only DaemonSet pods + Completed jobs
# 8. Remove from cluster
kubectl delete node $NODE
# 9. Shut down + (optional) destroy the VM
VMID=205
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300; qm status $VMID"
# To fully destroy (frees thin-pool space):
# ssh root@192.168.1.127 "qm destroy $VMID --purge"
# 10. Verify post-drain shape
kubectl get volumeattachment -o json \
| jq -r '.items[] | select(.spec.attacher == "csi.proxmox.sinextra.dev") | .spec.nodeName' \
| sort | uniq -c
# 11. Update this runbook's "Current shape" table
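The forecast in step 5, (sum of pod memory requests on $NODE) / (N - 1), can be sketched as a small helper. The name `per_node_addition_mib` is hypothetical, and the even spread is a simplifying assumption: the scheduler will not necessarily distribute evicted pods evenly, and GPU-bound pods cannot move at all.

```shell
# Estimate the extra memory requests each remaining worker absorbs.
# $1 = sum of pod memory requests on the drained node (MiB)
# $2 = current worker count N (including the node being drained)
per_node_addition_mib() {
  local total_mib=$1 workers=$2
  if [ "$workers" -le 1 ]; then
    echo "need at least 2 workers" >&2
    return 1
  fi
  # Integer ceiling division so the estimate errs on the cautious side.
  echo $(( (total_mib + workers - 2) / (workers - 1) ))
}

# Example: 20 GiB of requests on the drained node, 6 workers today.
per_node_addition_mib 20480 6   # prints 4096
```

Compare the result with the free headroom shown by `kubectl top nodes` on the tightest remaining node before draining.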
Cold-spare option: instead of qm destroy, keep the VM stopped. The 256 GiB disk stays allocated on thin pool but the VM consumes no CPU/RAM. Re-add via qm start <VMID> + kubeadm join (the snippet still lives at /var/lib/vz/snippets/k8s_cloud_init.yaml).
Special cases
Critical singletons that blip during drain
These services are single-replica and incur ~30s-2min outages while their PVC reattaches to the new node:
- Stateful databases: `mysql-standalone-0`, `pg-cluster-*` members (CNPG handles failover gracefully)
- Mail: `mailserver`, `roundcubemail` (Dovecot maildir locking; defer if mid-incident)
- Browser-trust services: `nextcloud` (sessions reset), `vaultwarden` (active sessions blip)
- Observability: `prometheus-server` (scrape data gap), `claude-memory`
- Self-hosted apps with SQLite: hackmd, n8n, paperless-ngx, freshrss, navidrome, audiobookshelf
Coordinate the drain timing with users if any of these is on the node being drained. Single-pod Postgres/MySQL DBs are the most painful — schedule during low-traffic windows.
GPU pods
GPU pods are scheduled via the `nvidia.com/gpu.present=true` node selector. They cannot drain off node1; if node1 itself needs maintenance, scale GPU workloads to 0 first or defer the drain. See `docs/runbooks/k8s-node-auto-upgrades.md` for the kured-driven reboot path.
Active sessions
Check ~/code/scripts/presence list before any drain. If another session holds a claim on a service hosted on the target node, defer or coordinate.
Force-clean stuck VolumeAttachments
If a drained node has lingering VolumeAttachment entries after kubectl delete node:
kubectl get volumeattachment -o json \
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
| xargs -r kubectl patch volumeattachment -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl get volumeattachment -o json \
| jq -r --arg n "$NODE" '.items[] | select(.spec.nodeName == $n) | .metadata.name' \
| xargs -r kubectl delete volumeattachment
Related
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap": root-cause explanation of the PVC ceiling
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`: IO contention on sdc
- `docs/runbooks/k8s-node-auto-upgrades.md`: kured-driven rolling reboots (separate from scale)
- `docs/runbooks/restore-full-cluster.md`: disaster scenarios
- `scripts/provision-k8s-worker`: the actual cloning/join script
- Beads `code-oflt`: IO isolation (long-term fix for sdc contention)
- Remote memory id=2788: proxmox-csi-plugin hardcodes a per-VM SCSI-LUN ceiling