From 052404301bfd943bdf26419a9b4da690be1a678a Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Thu, 21 May 2026 09:41:20 +0000 Subject: [PATCH] docs: HA control plane design (3 masters) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures today's k8s-upgrade-pipeline session findings — root cause of repeated upgrade failures is the single-master apiserver outage window cascading into operator crashloops + storm I/O. HA control plane with 3 masters + apiserver LB removes the cascade entirely. Tracked in beads code-n0ow. Plan doc to follow. Co-Authored-By: Claude Opus 4.7 --- .../2026-05-21-ha-control-plane-design.md | 135 ++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 docs/plans/2026-05-21-ha-control-plane-design.md diff --git a/docs/plans/2026-05-21-ha-control-plane-design.md b/docs/plans/2026-05-21-ha-control-plane-design.md new file mode 100644 index 00000000..11535457 --- /dev/null +++ b/docs/plans/2026-05-21-ha-control-plane-design.md @@ -0,0 +1,135 @@ +# HA Control Plane (3 masters) — Design + +**Date**: 2026-05-21 +**Status**: Drafted, NOT scheduled +**Beads**: code-n0ow +**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages + +## Problem statement + +The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is +correct end-to-end but **cannot push through the cluster's +single-master architecture**. Each attempted upgrade today rolled +back via the same cascade: + +1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod + manifest (etcd → apiserver → controller-manager → scheduler). +2. While a manifest swap is in flight, the affected control-plane + component is briefly down — for apiserver, that means ~10–60s of + "connection refused" to `10.96.0.1:443` from every kubelet and + operator pod in the cluster. +3. **Several operators die during that window** instead of waiting: + - **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately + - gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures +4. Kubelet restarts those pods → image pulls + initial reads → storm + of disk I/O on master (we observed 563 MB/s from tigera alone). +5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's + hardcoded 5-min watch on the pod's `kubernetes.io/config.hash` + annotation. +6. kubeadm declares the upgrade "did not change after 5m0s", + **rolls back to the previous manifest**, exits non-zero. +7. Chain Job retries (backoffLimit=1) → same storm → same failure. + Chain dead. + +The container runtime, the script logic, the RBAC permissions are all +fine after today's fixes. The **single master is the bottleneck**. + +## Why HA control plane fixes this + +With 3 masters running etcd quorum + apiserver behind an LB: + +| Failure mode | Single master | 3-master HA | +|---|---|---| +| Master reboot / kubeadm upgrade | Apiserver completely down 10–60s | Other 2 masters serve clients; LB transparently fails over | +| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) | +| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through | +| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast | +| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous | + +The k8s upgrade chain doesn't need to be aware of *any* of this — the +underlying availability of apiserver makes the chain's gates +naturally pass on each iteration. + +## Decisions (proposed — to be confirmed) + +| # | Decision | Notes | +|---|----------|-------| +| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. | +| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. | +| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). | +| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. | +| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. | +| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. | +| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | | +| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). | + +## Out of scope + +- HA pfSense itself (separate, much bigger initiative) +- Multi-DC failover +- External etcd cluster (we're sticking with kubeadm-managed stacked etcd) +- Rebuilding cluster from scratch — we'll join into the existing one + +## Risk register + +| Risk | Mitigation | +|---|---| +| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. | +| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. | +| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. | +| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. | +| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. | +| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. | + +## Verification + +After all 3 masters joined + LB up: + +```bash +# All 3 masters listed +kubectl get nodes -l node-role.kubernetes.io/control-plane= + +# etcd quorum healthy +kubectl -n kube-system exec etcd-k8s-master -- etcdctl \ + --endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \ + --cacert=/etc/kubernetes/pki/etcd/ca.crt \ + --cert=/etc/kubernetes/pki/etcd/server.crt \ + --key=/etc/kubernetes/pki/etcd/server.key \ + endpoint health --cluster + +# Failover test: cordon master-1, reboot it, observe kubectl still works through LB +kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets +ssh wizard@k8s-master.viktorbarzin.lan sudo reboot + +# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is) +kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s) +# Expect: full chain succeeds end-to-end without manual intervention +``` + +## Cost estimate + +- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host +- ~+128GB disk usage (2× 64GB master disks) +- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke) + +## What's already in place from today's work + +(All these are prerequisites that were fixed during today's +investigation — they stay relevant when HA lands.) + +- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed + `runc: unable to signal init: permission denied` on Ubuntu 26.04) +- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing + `get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase +- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`) +- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't + fully fix the storm cascade) +- Quiet-baseline threshold 3600s → 600s + +## Reference + +Commits from today's session: +- `10b261d2` — first `grep -vE` pipefail +- `0c8b46df` — 2 more pipefail sites +- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold +- `2dc7e001` — kubeadm apply 3-attempt retry