docs: HA control plane design (3 masters)

Captures today's k8s-upgrade-pipeline session findings — root cause of repeated upgrade failures is the single-master apiserver outage window cascading into operator crashloops + storm I/O. HA control plane with 3 masters + apiserver LB removes the cascade entirely. Tracked in beads code-n0ow. Plan doc to follow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:41:20 +00:00 · 2026-05-21 09:41:20 +00:00 · 052404301b
commit 052404301b
parent eca1cc7e2e
1 changed files with 135 additions and 0 deletions
--- a/docs/plans/2026-05-21-ha-control-plane-design.md
+++ b/docs/plans/2026-05-21-ha-control-plane-design.md
@ -0,0 +1,135 @@
+# HA Control Plane (3 masters) — Design
+
+**Date**: 2026-05-21
+**Status**: Drafted, NOT scheduled
+**Beads**: code-n0ow
+**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
+
+## Problem statement
+
+The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
+correct end-to-end but **cannot push through the cluster's
+single-master architecture**. Each attempted upgrade today rolled
+back via the same cascade:
+
+1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
+   manifest (etcd → apiserver → controller-manager → scheduler).
+2. While a manifest swap is in flight, the affected control-plane
+   component is briefly down — for apiserver, that means ~10–60s of
+   "connection refused" to `10.96.0.1:443` from every kubelet and
+   operator pod in the cluster.
+3. **Several operators die during that window** instead of waiting:
+   - **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
+   - gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
+4. Kubelet restarts those pods → image pulls + initial reads → storm
+   of disk I/O on master (we observed 563 MB/s from tigera alone).
+5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
+   hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
+   annotation.
+6. kubeadm declares the upgrade "did not change after 5m0s",
+   **rolls back to the previous manifest**, exits non-zero.
+7. Chain Job retries (backoffLimit=1) → same storm → same failure.
+   Chain dead.
+
+The container runtime, the script logic, the RBAC permissions are all
+fine after today's fixes. The **single master is the bottleneck**.
+
+## Why HA control plane fixes this
+
+With 3 masters running etcd quorum + apiserver behind an LB:
+
+| Failure mode | Single master | 3-master HA |
+|---|---|---|
+| Master reboot / kubeadm upgrade | Apiserver completely down 10–60s | Other 2 masters serve clients; LB transparently fails over |
+| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
+| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
+| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast |
+| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
+
+The k8s upgrade chain doesn't need to be aware of *any* of this — the
+underlying availability of apiserver makes the chain's gates
+naturally pass on each iteration.
+
+## Decisions (proposed — to be confirmed)
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
+| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
+| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
+| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
+| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
+| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
+| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
+| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
+
+## Out of scope
+
+- HA pfSense itself (separate, much bigger initiative)
+- Multi-DC failover
+- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
+- Rebuilding cluster from scratch — we'll join into the existing one
+
+## Risk register
+
+| Risk | Mitigation |
+|---|---|
+| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
+| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
+| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
+| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
+| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
+| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
+
+## Verification
+
+After all 3 masters joined + LB up:
+
+```bash
+# All 3 masters listed
+kubectl get nodes -l node-role.kubernetes.io/control-plane=
+
+# etcd quorum healthy
+kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
+    --endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
+    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+    --cert=/etc/kubernetes/pki/etcd/server.crt \
+    --key=/etc/kubernetes/pki/etcd/server.key \
+    endpoint health --cluster
+
+# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
+kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
+ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
+
+# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
+# Expect: full chain succeeds end-to-end without manual intervention
+```
+
+## Cost estimate
+
+- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
+- ~+128GB disk usage (2× 64GB master disks)
+- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
+
+## What's already in place from today's work
+
+(All these are prerequisites that were fixed during today's
+investigation — they stay relevant when HA lands.)
+
+- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
+  `runc: unable to signal init: permission denied` on Ubuntu 26.04)
+- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
+  `get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
+- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
+- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
+  fully fix the storm cascade)
+- Quiet-baseline threshold 3600s → 600s
+
+## Reference
+
+Commits from today's session:
+- `10b261d2` — first `grep -vE` pipefail
+- `0c8b46df` — 2 more pipefail sites
+- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
+- `2dc7e001` — kubeadm apply 3-attempt retry