docs: HA control plane design (3 masters)

Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.

Tracked in beads code-n0ow. Plan doc to follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-05-21 09:41:20 +00:00
parent eca1cc7e2e
commit 052404301b

View file

@ -0,0 +1,135 @@
# HA Control Plane (3 masters) — Design
**Date**: 2026-05-21
**Status**: Drafted, NOT scheduled
**Beads**: code-n0ow
**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
## Problem statement
The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
correct end-to-end but **cannot push through the cluster's
single-master architecture**. Each attempted upgrade today rolled
back via the same cascade:
1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
manifest (etcd → apiserver → controller-manager → scheduler).
2. While a manifest swap is in flight, the affected control-plane
component is briefly down — for apiserver, that means ~1060s of
"connection refused" to `10.96.0.1:443` from every kubelet and
operator pod in the cluster.
3. **Several operators die during that window** instead of waiting:
- **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
- gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
4. Kubelet restarts those pods → image pulls + initial reads → storm
of disk I/O on master (we observed 563 MB/s from tigera alone).
5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
annotation.
6. kubeadm declares the upgrade "did not change after 5m0s",
**rolls back to the previous manifest**, exits non-zero.
7. Chain Job retries (backoffLimit=1) → same storm → same failure.
Chain dead.
The container runtime, the script logic, the RBAC permissions are all
fine after today's fixes. The **single master is the bottleneck**.
## Why HA control plane fixes this
With 3 masters running etcd quorum + apiserver behind an LB:
| Failure mode | Single master | 3-master HA |
|---|---|---|
| Master reboot / kubeadm upgrade | Apiserver completely down 1060s | Other 2 masters serve clients; LB transparently fails over |
| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast |
| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
The k8s upgrade chain doesn't need to be aware of *any* of this — the
underlying availability of apiserver makes the chain's gates
naturally pass on each iteration.
## Decisions (proposed — to be confirmed)
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
## Out of scope
- HA pfSense itself (separate, much bigger initiative)
- Multi-DC failover
- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
- Rebuilding cluster from scratch — we'll join into the existing one
## Risk register
| Risk | Mitigation |
|---|---|
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
## Verification
After all 3 masters joined + LB up:
```bash
# All 3 masters listed
kubectl get nodes -l node-role.kubernetes.io/control-plane=
# etcd quorum healthy
kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
--endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster
# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
# Expect: full chain succeeds end-to-end without manual intervention
```
## Cost estimate
- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
- ~+128GB disk usage (2× 64GB master disks)
- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
## What's already in place from today's work
(All these are prerequisites that were fixed during today's
investigation — they stay relevant when HA lands.)
- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
`runc: unable to signal init: permission denied` on Ubuntu 26.04)
- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
`get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
fully fix the storm cascade)
- Quiet-baseline threshold 3600s → 600s
## Reference
Commits from today's session:
- `10b261d2` — first `grep -vE` pipefail
- `0c8b46df` — 2 more pipefail sites
- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
- `2dc7e001` — kubeadm apply 3-attempt retry