docs/plans: HA control plane — design + plan + deferral

Investigated, designed, and planned the 3-master HA control plane migration triggered by 2026-05-21's autonomous k8s upgrade cascade.

Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc.)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack multi-master refactor, HTTPS /readyz health check, expanded blast radius to include /home/wizard/code/infra/config root kubeconfig, config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector, k8s-version-upgrade chain extension as Phase 7)

Plan covers 11 phases end-to-end including panic-mode rollback.

DEFERRED before execution. PVE host is 98% RAM-committed (262 GB allocated / 267 GB physical, 1.5 GB swap active); the planned 3 x 32 GB masters would push allocation to 326 GB and OOM the host. k8s-master currently uses only 4.6 GB of its 32 GB allocation (5-6x oversized).

Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.

Standalone candidate worth lifting independently: Phase 1.5's rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning to loop over k8s_master_hosts list). It future-proofs the cluster without committing to the HA migration.

Refs: code-n0ow (open, deferred via bd note).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:32:15 +00:00 · 7f63d35d0a
commit 7f63d35d0a
parent 70a334e431
2 changed files with 387 additions and 19 deletions
--- a/docs/plans/2026-05-21-ha-control-plane-design.md
+++ b/docs/plans/2026-05-21-ha-control-plane-design.md
@@ -1,9 +1,32 @@
 # HA Control Plane (3 masters) — Design

-**Date**: 2026-05-21
-**Status**: Drafted, NOT scheduled
-**Beads**: code-n0ow
-**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
+**Date**: 2026-05-21 (decisions locked 2026-05-22; **deferred 2026-05-23**)
+**Status**: **DEFERRED** — design + plan complete, NOT scheduled. Awaiting either PVE host capacity expansion OR a separate right-sizing pass on the existing master before this becomes affordable. Paired plan: `2026-05-21-ha-control-plane-plan.md`.
+**Beads**: code-n0ow (open, deferred — see `bd show code-n0ow`)
+**Trigger**: 2026-05-21 k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
+
+## Why deferred (2026-05-23)
+
+Measured during the locking pass:
+
+- **k8s-master uses 4.6 GB of 32 GB allocated** (kube-apiserver 2.6 GB + etcd 660 MB + cm 360 MB + ~1 GB everything else). The 32 GB sizing is ~5-6× oversized vs working set.
+- **PVE host is already 98% RAM-committed** — 262 GB allocated to VMs against 267 GB physical, with 1.5 GB of active swap. The planned 3 × 32 GB control plane (+64 GB net) would push allocation to 326 GB → OOM on the host.
+- **Software-only HA on a single PVE host has bounded value** — a hypervisor crash still loses all 3 masters. The big resilience wins (kubeadm upgrades, cert rotation, planned reboots) are real but the disaster-recovery angle is limited until a second PVE host exists.
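The deferral arithmetic can be re-checked with a few lines of shell. This is a minimal sketch using the figures quoted above (262 GB allocated, 267 GB physical, two additional 32 GB masters); the function names are hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch only: project PVE RAM allocation after adding control-plane VMs.
# The numbers come from the measurements above; the helper names are illustrative.

# projected_allocation_gb <allocated_gb> <gb_per_master> <masters_added>
projected_allocation_gb() {
  echo $(( $1 + $2 * $3 ))
}

# overcommit_gb <projected_gb> <physical_gb>
# A positive result means the host is over-committed; negative means headroom remains.
overcommit_gb() {
  echo $(( $1 - $2 ))
}

projected=$(projected_allocation_gb 262 32 2)
echo "projected allocation: ${projected} GB"
echo "over physical RAM by: $(overcommit_gb "$projected" 267) GB"
```

Changing the second argument of `projected_allocation_gb` lets the same check be rerun for other master sizes when the sizing decision is revisited.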
+
+### Revisit triggers — any of:
+
+1. **Second PVE host added** to the lab. Hardware HA becomes possible; HA control plane becomes the natural follow-up. Spread the 3 masters across 2 hosts (2+1).
+2. **Cluster-wide right-sizing pass** that frees enough headroom for the original 3 × 32 GB plan, OR pre-agreed amendment to provision 16 GB masters (right-sized to actual usage; 3-4× current working-set headroom).
+3. **Storm cascade burns enough hours** that the operational cost outweighs the memory cost — track minutes spent manually nursing kubeadm upgrades; if cumulative > ~10h over a few months, revisit.
+
+### What's still good
+
+The design + plan in this directory remain authoritative. When we revisit:
+
+- All 14 locked decisions stand.
+- Challenger amendments (cloud-init template bump, rbac multi-master refactor, HTTPS `/readyz` health check, expanded blast radius, etcd-backup nodeSelector, chain extension as Phase 7) are baked in.
+- Only the sizing decision needs revisiting — likely 16 GB per master instead of 32 GB.
+- Adding `k8s_master_hosts` list-based refactor to the rbac stack (Phase 1.5) is a **standalone win** that could be done independently of HA — it would future-proof the cluster against the day HA lands. Consider lifting that as its own task.

 ## Problem statement

@@ -50,18 +73,24 @@ The k8s upgrade chain doesn't need to be aware of *any* of this — the
 underlying availability of apiserver makes the chain's gates
 naturally pass on each iteration.

-## Decisions (proposed — to be confirmed)
+## Decisions (locked 2026-05-22)

 | # | Decision | Notes |
 |---|----------|-------|
 | 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
-| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
-| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
-| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
-| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
-| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
-| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
-| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
+| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2` (VMID 205, 10.0.20.110), `k8s-master-3` (VMID 206, 10.0.20.111). |
+| 3 | **Apiserver LB**: **pfSense HAProxy** — new TCP frontend on `10.0.20.99:6443` mirroring the mailserver pattern. Idempotent via `scripts/pfsense-haproxy-bootstrap.php`. | Pros: no per-node moving parts, mirrors existing mailserver layout. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (gateway/DNS/ingress). |
+| 4 | **VIP**: `10.0.20.99` (one below current master `.100`, well clear of MetalLB pool `.200-.220`). Internal-only — external API access stays via Cloudflared. | All kubeconfigs + kubelet.conf entries flip from `10.0.20.100:6443` → `10.0.20.99:6443`. |
+| 5 | **etcd**: kubeadm-managed stacked; `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
+| 6 | **kured-sentinel-gate**: extend the bash loop in `stacks/kured/main.tf` with a "≥2 control-plane nodes Ready" check between the existing all-nodes-Ready and calico-Ready checks | Otherwise kured could reboot 2 masters at once and break quorum. |
+| 7 | **etcd backup**: `etcdctl snapshot save` from any member is a consistent point-in-time of the full quorum state — but the existing CronJob is pinned `node_name = "k8s-master"`. Phase 4.5 flips this to a control-plane label + toleration so backups don't silently skip when master-1 is drained. | Snapshot CORRECTNESS unchanged; SCHEDULING needs fixing. |
+| 8 | **Migration order**: Phase 0 (retrofit existing cluster) → Phase 1 (LB up, single backend, HTTPS health check) → Phase 1.5 (rbac stack refactor) → Phase 2 (cloud-init bump + master-2 join + add to LB) → Phase 3 (master-3 join + add to LB) → Phase 4 (flip clients + workers to VIP) → Phase 4.5 (etcd-backup CronJob fix) → Phase 5 (kured-sentinel-gate quorum check) → Phase 6 (E2E validation) → Phase 7 (k8s-version-upgrade chain extension) | Each kubeadm join is reversible (`kubeadm reset` + `etcdctl member remove`). |
+| 9 | **VM provisioning**: cloud-init via `create-template-vm` module, **but the template needs an apt-source bump first** (v1.32 → v1.34) and a control-plane gate on `k8s_join_command` so master VMs don't auto-join as workers. Existing master stays as the legacy manual VM (not rebuilt). | The repo has zero VMs using cloud-init for provisioning today — we're the first user. Update template first, then use it. |
+| 10 | **Cert SAN + controlPlaneEndpoint retrofit**: Phase 0, before any new master joins. Patch `kubeadm-config` via `kubeadm init phase upload-config kubeadm --config <file>` (kubeadm-owned write, future-proof against `kubeadm upgrade apply`), regen `apiserver.crt` via `kubeadm init phase certs apiserver`, restart the kube-apiserver pod (~30s outage on the existing master only). | Standard kubeadm retrofit path; `kubeadm join --control-plane` requires controlPlaneEndpoint to be set. |
+| 11 | **Multi-master config propagation (Phase 1.5)**: refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` to loop over a list of master hosts. Apply BEFORE master-2/3 join so they boot with OIDC, audit policy, and etcd tuning already in place. | Today these stacks SSH into a single master and sed into `kube-apiserver.yaml` — if not propagated, Authentik login flaps depending on which master the LB lands on. |
+| 12 | **k8s-version-upgrade chain extension (Phase 7)**: extend `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` to discover and iterate over all control-plane nodes (drain → upgrade → uncordon, gated by quorum check). | Without this, chain only upgrades master-1; masters 2/3 drift behind one version per upgrade. Original autonomous-upgrades goal unmet. |
+| 13 | **LB health check**: HTTPS `GET /readyz` (with `verify none` for self-signed apiserver cert), NOT plain TCP. | Plain TCP misses apiserver-NotReady states (etcd unreachable, controller-manager flapping). |
+| 14 | **VIP DNS name**: add `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` BEFORE Phase 4. Delete stale `kubernetes IN A 10.0.20.100`. Consumers reference the FQDN, not the bare IP — future renumbering is then a single record change. | |
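Decision 1 and decision 6 both rest on etcd's majority rule. As a hedged illustration (the helper names are hypothetical), quorum is `floor(n/2) + 1` and the number of members that can fail is whatever remains:

```shell
#!/usr/bin/env bash
# Sketch only: majority-quorum arithmetic for an etcd cluster of N members.

# quorum_size <members>: smallest majority of the member count
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# failures_tolerated <members>: members that can be lost while keeping quorum
failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(failures_tolerated "$n")"
done
```

With 3 members the quorum is 2, which is why the kured gate has to see at least 2 control-plane nodes Ready before allowing a reboot.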

 ## Out of scope

@@ -74,12 +103,15 @@ naturally pass on each iteration.

 | Risk | Mitigation |
 |---|---|
+| Phase 0 cert regen on existing master triggers a brief apiserver outage (~30s) | Already a known cluster behaviour during static-pod restart. Schedule during a low-activity window. Tigera/operators will crash-loop briefly but recover — same blast radius as today's k8s upgrade. **Once HA is up, future restarts won't have this surface at all.** |
 | etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
-| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
-| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
-| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
-| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
-| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
+| LB misconfiguration → all kubectl breaks | Smoke-test from each master directly (bypass LB) before flipping clients. Keep a kubeconfig pointing at `10.0.20.100:6443` as fallback. |
+| Existing kubeconfigs (Woodpecker pipelines, agents, dev VM, in-cluster RBAC default) need updating | Single Terraform apply touches `stacks/rbac/modules/rbac/apiserver-oidc.tf` (default), `.woodpecker/*.yml` (committed kubeconfigs). Worker `kubelet.conf` files patched in Phase 4 via ssh loop. |
+| New masters get scheduled workload pods unintentionally | Verify `node-role.kubernetes.io/control-plane:NoSchedule` taint is applied at join time (default with `--control-plane`). |
+| Cert rotation propagation | kubeadm join uses the `--certificate-key` from `kubeadm init phase upload-certs` to fetch existing CA materials. Single short-lived secret in `kube-system/kubeadm-certs` (**2h TTL** — Phases 2 + 3 must complete within the window, or re-upload between them). |
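+| 32GB per master × 3 = 96GB RAM used for control plane alone | Initial estimate: PVE host has 272GB total, 176GB allocated to cluster pre-HA; post-HA 240GB allocated, 32GB headroom. **Superseded by the 2026-05-23 measurement** (262 GB allocated against 267 GB physical, see "Why deferred"): the host cannot absorb +64 GB, so sizing must be revisited before execution. |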
+| Pre-existing kubeadm-config does NOT have `controlPlaneEndpoint` set | Phase 0 patches it. Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml \| grep controlPlaneEndpoint` (absent → `10.0.20.99:6443` post-Phase 0). |
+| Existing master cert SANs are `[k8s-master, 10.96.0.1, 10.0.20.100]` only — missing VIP | Phase 0 regens with `--apiserver-cert-extra-sans 10.0.20.99` after patching kubeadm-config. |

 ## Verification

@@ -91,12 +123,23 @@ kubectl get nodes -l node-role.kubernetes.io/control-plane=

 # etcd quorum healthy
 kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
-    --endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
+    --endpoints=https://10.0.20.100:2379,https://10.0.20.110:2379,https://10.0.20.111:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --cluster

+# Kubeconfig points at VIP
+kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'
+# Expect: https://10.0.20.99:6443
+
+# Worker kubelet.conf points at VIP
+for n in k8s-node{1,2,3,4}; do
+  ssh wizard@$n.viktorbarzin.lan "sudo grep -E '^\s+server:' /etc/kubernetes/kubelet.conf"
+done
+# Expect: server: https://10.0.20.99:6443 on every node
+
 # Failover test: cordon master-1, reboot it, observe kubectl still works through LB
 kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
 ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
@@ -110,7 +153,7 @@ kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation

 - 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
 - ~+128GB disk usage (2× 64GB master disks)
-- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
+- **~5-7 hours of operator time end-to-end** (cloud-init template bump + Phase 0 retrofit + LB + Phase 1.5 rbac refactor + 2× kubeadm join + Phase 4 cutover + Phase 4.5 etcd-backup fix + Phase 5 kured-gate + Phase 6 validation + Phase 7 chain extension). Phases 0–6 can land in one session; Phase 7 can be deferred a few days if needed.

 ## What's already in place from today's work

--- a/docs/plans/2026-05-21-ha-control-plane-plan.md
+++ b/docs/plans/2026-05-21-ha-control-plane-plan.md
@@ -0,0 +1,325 @@
+# HA Control Plane (3 masters) — Plan
+
+**Date**: 2026-05-21 (locked + revised 2026-05-22 after challenger pass)
+**Status**: Drafted, awaiting approval
+**Pairs with**: `2026-05-21-ha-control-plane-design.md`
+**Beads**: `code-n0ow`
+
+## Goal
+
+Migrate the single-master cluster to a 3-master HA control plane behind
+a pfSense HAProxy VIP (`10.0.20.99:6443`), enabling autonomous k8s
+upgrades without storm-cascade manual nursing.
+
+## Topology — before / after
+
+```
+Before                            After
+                                  ┌──────────────────────┐
+                                  │ pfSense HAProxy      │
+                                  │  10.0.20.99:6443     │
+                                  │  TCP, /readyz health │
+                                  └──┬───────┬───────┬───┘
+┌───────────────┐                    │       │       │
+│ k8s-master    │                    ▼       ▼       ▼
+│ 10.0.20.100   │     ┌──────────────┐ ┌────────────┐ ┌────────────┐
+│ apiserver+etcd│     │k8s-master    │ │k8s-master-2│ │k8s-master-3│
+│ + workers join│     │10.0.20.100   │ │10.0.20.110 │ │10.0.20.111 │
+│ directly      │     │(VMID 200)    │ │(VMID 205)  │ │(VMID 206)  │
+└───────────────┘     │apiserver+etcd│ │apiserver+e.│ │apiserver+e.│
+                      └──────────────┘ └────────────┘ └────────────┘
+                          ▲                ▲                ▲
+                          └────────────────┼────────────────┘
+                                           │
+                            etcd quorum (3 members, tolerates 1 down)
+```
+
+## Research decisions (locked — see design doc for full table)
+
+| Decision | Value |
+|---|---|
+| LB strategy | pfSense HAProxy, TCP mode, HTTPS `/readyz` health check |
+| VIP | `10.0.20.99` (FQDN `k8s-apiserver.viktorbarzin.lan`) |
+| New master IPs | `10.0.20.110`, `10.0.20.111` |
+| New master VMIDs | `205`, `206` |
+| Master sizing | 8 vCPU, 32 GB RAM, 64 GB disk (matches existing) |
+| VM provisioning | cloud-init via `create-template-vm` (template bumped v1.32 → v1.34 first; `k8s_join_command = ""` for masters) |
+| etcd | stacked (kubeadm-managed) |
+| Multi-master apiserver flags | rbac stack refactored to loop over master list (Phase 1.5) |
+| controlPlaneEndpoint + cert SAN retrofit | Phase 0, before any new master joins |
+| k8s-version-upgrade chain | extended to multi-master in Phase 7 |
+
+## Callers / blast radius
+
+| Surface | Path | Phase |
+|---|---|---|
+| Worker `/etc/kubernetes/kubelet.conf` × 4 | nodes 1-4 | 4.2 |
+| `/home/wizard/code/infra/config` (root kubeconfig used by every `tg apply`) | repo root | 4.1 |
+| `config.tfvars:115` (`kubernetes IN A 10.0.20.100` zone-file record) | repo root | 1.1 (delete) |
+| `config.tfvars:231` (`k8s_join_command` for cloud-init template) | repo root | 4.1 (flip to VIP) |
+| `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` | `var.k8s_master_host` defaults | 1.5 (refactor to list) |
+| `.woodpecker/{default,drift-detection,renew-tls,provision-user}.yml` (4 files × 2 refs each — kubeconfig `server:` AND `curl` lines) | repo root | 4.1 |
+| `stacks/k8s-portal/.../files/src/routes/{download,setup/script}/+server.ts` (`CLUSTER_SERVER` const used to generate user kubeconfigs) | k8s-portal module | 4.1 |
+| `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` (hard-coded `k8s-master` in phase_master) | stack | 7.1 |
+| `stacks/infra-maintenance/.../main.tf` lines 98 + 218 (`node_name = "k8s-master"` on etcd-backup + defrag-etcd CronJobs) | stack | 4.5 |
+| `kured-sentinel-gate` bash loop | `stacks/kured/main.tf` | 5.1 |
+| `docs/architecture/compute.md`, `.claude/skills/uptime-kuma/SKILL.md`, runbooks | docs | 6.3 |
+| **No-op surfaces** (confirmed clean): Vault (uses `kubernetes.default.svc`), Cloudflared (no apiserver tunnel), in-cluster `kubernetes.default.svc` / `10.96.0.1`, etcd-backup CORRECTNESS (snapshot is cluster-wide), kubeadm-managed etcd peer certs (auto-generated on join) | | — |
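A sweep for leftover references to the old endpoint helps confirm the table above is complete. The sketch below is an assumption-laden variant of the Phase 4.1 grep: the function name is hypothetical, and it uses `grep -F` so the dots in the address match literally instead of as regex wildcards.

```shell
#!/usr/bin/env bash
# Sketch only: list files under a directory that still mention the old apiserver endpoint.

# find_stale_endpoints <root_dir> <old_endpoint>
# Prints file:line:match for every hit; prints nothing when the tree is clean.
find_stale_endpoints() {
  grep -rnF \
    --include='*.tf' --include='*.yml' --include='*.yaml' \
    --include='*.ts' --include='*.php' --include='*.sh' \
    -- "$2" "$1" || true
}

# Demo against a throwaway directory (not the real repo).
demo_dir=$(mktemp -d)
printf 'server: https://10.0.20.100:6443\n' > "$demo_dir/old.yml"
printf 'server: https://10.0.20.99:6443\n' > "$demo_dir/new.yml"
find_stale_endpoints "$demo_dir" "10.0.20.100:6443"
rm -rf "$demo_dir"
```

Against the real tree the call would be `find_stale_endpoints /home/wizard/code/infra 10.0.20.100:6443`; any output is a surface that still needs flipping.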
+
+## Edge cases
+
+- **Phase 0 apiserver restart (~30s)** = same blast radius as today's k8s upgrade (tigera/cnpg/gpu-operator briefly crash). The LB doesn't help here because the new cert isn't yet trusted by clients. Accept the brief outage. Schedule during a low-activity window.
+- **`kubeadm-certs` secret TTL = 2h** (NOT 24h as initially stated). Phase 2 + 3 must complete within the window, or re-upload between them.
+- **pfSense haproxy bootstrap = reset-to-declared-state** on each run (lines 155-158 of the script). Adding master-2 means the apiserver pool is briefly torn down + rebuilt. TCP frontends bounce. Long-poll connections from kubelets break + reconnect. Expect ~2-5s of "kubectl: unable to connect" during pool rewrites.
+- **TCP health check is too lax** for apiserver (listener up ≠ ready). Phase 1 uses HTTPS `GET /readyz` with `verify none` — catches NotReady (etcd unreachable, controller-manager flapping).
+- **Worker kubelet.conf flip**: kubelet TLS bootstrap re-auths against new endpoint on restart. Expect 5-10s NotReady per node during the Phase 4.2 loop.
+- **VIP cannot be the existing master IP**: confirmed `.99` is free (no grep matches, no MetalLB pool conflict — pool is .200-.220).
+- **pfSense reboot windows**: pre-Phase-4 OK (clients still on direct IP), post-Phase-4 breaks everything. Don't migrate near a pfSense maintenance window.
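The 2h `kubeadm-certs` window is easy to lose track of between Phase 2 and Phase 3. A minimal sketch of a guard, assuming the operator records the epoch time of the upload step (the function and variable names are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch only: decide whether a previously uploaded kubeadm-certs secret is still usable.
# Assumes the upload time was recorded by hand, e.g. with `date +%s` right after the upload.

KUBEADM_CERTS_TTL_SECONDS=7200  # 2h TTL, as stated in the edge cases above

# certs_secret_usable <upload_epoch> <now_epoch>
# Returns 0 while the secret is inside its TTL, 1 once it has expired.
certs_secret_usable() {
  local age=$(( $2 - $1 ))
  [ "$age" -ge 0 ] && [ "$age" -lt "$KUBEADM_CERTS_TTL_SECONDS" ]
}

# Illustrative timestamps: upload at t=1000, check one hour later.
uploaded_at=1000
if certs_secret_usable "$uploaded_at" 4600; then
  echo "certificate key still valid: reuse it"
else
  echo "certificate key expired: re-run upload-certs"
fi
```

In practice the second argument would be `$(date +%s)` at the moment the next join is about to run.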
+
+## Phased plan
+
+Reversible up to Phase 4. Phase 4+ reverse via the rollback section.
+
+### Phase 0 — Retrofit existing cluster (~30 min, ~30s of apiserver outage)
+
+- [ ] **0.1 Pre-flight**
+  - [ ] Cluster healthy: `kubectl get nodes` (all Ready), `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` empty
+  - [ ] Recent etcd backup valid: `ls -lh /srv/nfs/etcd-backup/ | tail -5`
+  - [ ] Proxmox VM snapshot of `k8s-master`: `ssh root@192.168.1.127 qm snapshot 200 pre-ha-retrofit`
+  - [ ] IPs free: `for ip in 99 110 111; do ping -c1 -W1 10.0.20.$ip && echo "BUSY $ip" || echo "free $ip"; done`
+- [ ] **0.2 Patch `kubeadm-config` ConfigMap via kubeadm (NOT kubectl apply)**
+  - [ ] On master: `sudo kubeadm config print init-defaults --component-configs=KubeletConfiguration > /tmp/kubeadm-new.yaml`
+  - [ ] Hand-edit /tmp/kubeadm-new.yaml: take the existing CM as base, add `controlPlaneEndpoint: 10.0.20.99:6443` under ClusterConfiguration, add `apiServer.certSANs: [10.0.20.99, k8s-apiserver.viktorbarzin.lan]`
+  - [ ] Apply via kubeadm (kubeadm-owned, future `kubeadm upgrade apply` won't overwrite): `sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-new.yaml`
+  - [ ] Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml | grep -E 'controlPlaneEndpoint|certSANs'`
+- [ ] **0.3 Regen apiserver cert**
+  - [ ] On master: `sudo mkdir -p /tmp/apiserver-backup && sudo mv /etc/kubernetes/pki/apiserver.{crt,key} /tmp/apiserver-backup/`
+  - [ ] `sudo kubeadm init phase certs apiserver` (reads patched kubeadm-config)
+  - [ ] Verify: `sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A2 'Subject Alternative'` — expect `IP Address:10.0.20.99` PLUS existing SANs (kubeadm adds, doesn't replace)
+- [ ] **0.4 Restart kube-apiserver static pod**
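+  - [ ] On master: `sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/`, wait until `sudo crictl ps --name kube-apiserver` shows no running container, then `sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/`. (Deleting the mirror pod with `kubectl delete pod` only removes the API object; the kubelet keeps the static-pod container running, so the regenerated cert would not be loaded.)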
+  - [ ] Wait: `kubectl wait --for=condition=Ready pod/kube-apiserver-k8s-master -n kube-system --timeout=180s`
+  - [ ] Verify: `kubectl get nodes` works (apiserver alive on direct IP)
+- [ ] **0.5 Panic-mode rollback procedure (DOCUMENTED ONLY — only run if 0.4 fails)**
+  - [ ] `sudo cp /tmp/apiserver-backup/apiserver.{crt,key} /etc/kubernetes/pki/`
+  - [ ] `sudo systemctl restart kubelet` (forces static pod re-read)
+  - [ ] Wait for apiserver Ready; revert kubeadm-config edits via the file backup
+- [ ] **0.6 Verify operators recovered from brief outage**
+  - [ ] `kubectl get pods -n calico-system -l app=tigera-operator -o wide` — Running, restart count incremented by 1 max
+  - [ ] `kubectl get pods -n gpu-operator -o wide` — same
+  - [ ] `kubectl get pods -n cnpg-system -o wide` — same
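The SAN verification in 0.3 can be turned into a pass/fail check instead of an eyeball test. This is a sketch under assumptions: the function name is hypothetical, and the sample text only imitates the `openssl x509 -noout -text` SAN line using the SANs listed in the design doc.

```shell
#!/usr/bin/env bash
# Sketch only: check whether an IP appears among a certificate's Subject Alternative Names.

# san_includes_ip <ip>
# Reads `openssl x509 -noout -text` output on stdin, splits the SAN list on commas,
# trims whitespace, and requires an exact "IP Address:<ip>" entry.
san_includes_ip() {
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq "IP Address:$1"
}

# Illustrative SAN text, not captured from the real cluster.
sample='            X509v3 Subject Alternative Name:
                DNS:k8s-master, IP Address:10.96.0.1, IP Address:10.0.20.100, IP Address:10.0.20.99'

if printf '%s\n' "$sample" | san_includes_ip 10.0.20.99; then
  echo "VIP present in SANs"
else
  echo "VIP missing from SANs: do not proceed to Phase 1"
fi
```

On the master the input would come from `sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | san_includes_ip 10.0.20.99`.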
+
+### Phase 1 — pfSense HAProxy + DNS (~30 min)
+
+- [ ] **1.1 Reserve VIP `10.0.20.99` + DNS**
+  - [ ] Add Virtual IP on pfSense (Firewall → Virtual IPs → IP Alias on VLAN20, `10.0.20.99/24`)
+  - [ ] Add `k8s-apiserver-vip → 10.0.20.99` host alias (Firewall → Aliases → Hosts)
+  - [ ] phpIPAM: register `10.0.20.99` under section "K8s cluster"
+  - [ ] Add DNS A record `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` (and **delete** stale `kubernetes IN A 10.0.20.100` on line 115)
+  - [ ] `scripts/tg apply -target=module.technitium` — confirm zone reload
+- [ ] **1.2 Extend `infra/scripts/pfsense-haproxy-bootstrap.php` for apiserver pool with HTTPS health check**
+  - [ ] Add `build_pool_https()` helper variant (or add `$use_https_readyz` param to existing `build_pool()`) that emits `check_type='HTTP'`, `monitor_uri='/readyz'`, `httpchk_method='GET'`, `ssl='yes'`, `sslverify='no'`
+  - [ ] Add `'apiserver_nodes'` to `$POOL_NAMES`; `'apiserver_proxy_6443'` to `$FRONTEND_NAMES`
+  - [ ] `build_pool_https('apiserver_nodes', '6443', [['k8s-master', '10.0.20.100']])`
+  - [ ] `build_frontend('apiserver_proxy_6443', 'K8s apiserver VIP', '10.0.20.99', '6443', 'apiserver_nodes')`
+- [ ] **1.3 Deploy + validate**
+  - [ ] `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/ && ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'`
+  - [ ] `ssh admin@10.0.20.1 'sockstat -l | grep 10.0.20.99:6443'` — expect haproxy listening
+  - [ ] `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" | grep apiserver` — backend UP (op_state=2)
+- [ ] **1.4 Smoke via VIP**
+  - [ ] From devvm: `curl --cacert /etc/kubernetes/pki/ca.crt https://10.0.20.99:6443/readyz` — expect `ok`
+  - [ ] Build a transient kubeconfig pointing at VIP, run `kubectl get nodes` — succeeds
+  - [ ] **If TLS validation fails: STOP — Phase 0 cert regen didn't include VIP**, rollback Phase 1 and retry Phase 0
+
+### Phase 1.5 — Refactor rbac stack for multi-master (~45 min)
+
+- [ ] **1.5.1 Refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf`**
+  - [ ] Replace `var.k8s_master_host = "10.0.20.100"` with `var.k8s_master_hosts = list(string)` (default `["10.0.20.100"]`)
+  - [ ] Wrap each `null_resource` / `provisioner "remote-exec"` block in `for_each = toset(var.k8s_master_hosts)` so the same sed runs on every master
+  - [ ] In `stacks/rbac/main.tf` set `k8s_master_hosts = ["10.0.20.100"]` (still single-master in this phase — variable is forward-looking, no behaviour change yet)
+- [ ] **1.5.2 `scripts/tg apply` rbac stack** — confirm zero diff against today (no-op refactor)
+- [ ] **1.5.3 Verify** — sanity: `ssh wizard@k8s-master 'sudo grep oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml | wc -l'` — expect `1`. Cluster healthy.
+
+### Phase 2 — Cloud-init template bump + master-2 (~75 min)
+
+- [ ] **2.0 Bump cloud-init template (one-time)**
+  - [ ] Edit `infra/modules/create-template-vm/cloud_init.yaml`:
+    - line 49: apt source `pkgs.k8s.io/core:/stable:/v1.32/deb/` → `pkgs.k8s.io/core:/stable:/v1.34/deb/`
+    - line 135: wrap `${k8s_join_command}` in a conditional via cloud-init `if:` template logic, or simpler: add `${k8s_join_command_or_noop}` and let the module pass `""` for masters and the real worker join command for workers (default)
+  - [ ] Update `infra/modules/create-template-vm/main.tf` to add `variable "k8s_join_command" { default = "" }` and a conditional in the templatefile to skip the runcmd line when empty
+  - [ ] Rebuild the template: `scripts/tg apply -target=module.k8s_template` (or whatever the existing template-build target name is in `stacks/infra/main.tf`)
+  - [ ] Verify new template registered in Proxmox at the same template_id
+- [ ] **2.1 Add master-2 VM to Terraform**
+  - [ ] In `stacks/infra/main.tf`: add `module "k8s-master-2"` using `create-vm` from the (now-v1.34) k8s template, with master sizing (8 vCPU / 32GB / 64GB), VMID 205, IP `10.0.20.110`, unique MAC, `vmbr1/vlan 20`, `use_cloud_init = true`, and explicitly pass `k8s_join_command = ""` (so first-boot does NOT auto-join as worker)
+  - [ ] `scripts/tg apply -target=module.k8s-master-2`
+  - [ ] Verify VM booted: `ssh wizard@k8s-master-2.viktorbarzin.lan uname -a` (expect Ubuntu 26.04 LTS, kernel 7.0.x)
+- [ ] **2.2 Prep master-2 for kubeadm join**
+  - [ ] Confirm versions: `ssh wizard@k8s-master-2.viktorbarzin.lan 'kubeadm version; containerd --version'` — expect kubeadm v1.34.x, containerd 2.2.2+
+  - [ ] DNS resolves: `getent hosts k8s-master-2.viktorbarzin.lan`
+- [ ] **2.3 Upload certs on existing master**
+  - [ ] `sudo kubeadm init phase upload-certs --upload-certs` → records `--certificate-key <KEY>`
+  - [ ] **2h TTL** — Phase 2 + 3 must complete within window or re-upload
+- [ ] **2.4 Generate join command**
+  - [ ] `sudo kubeadm token create --print-join-command` → `kubeadm join 10.0.20.99:6443 --token <T> --discovery-token-ca-cert-hash sha256:<H>`
+  - [ ] Append `--control-plane --certificate-key <KEY>`
+- [ ] **2.5 Run join on master-2**
+  - [ ] `ssh wizard@k8s-master-2.viktorbarzin.lan` → run sudo join command from 2.4
+  - [ ] Wait for "This node has joined the cluster"
+- [ ] **2.6 Update rbac stack to include master-2 (propagates OIDC/audit/etcd tuning to it)**
+  - [ ] Edit `stacks/rbac/main.tf`: `k8s_master_hosts = ["10.0.20.100", "10.0.20.110"]`
+  - [ ] `scripts/tg apply` rbac stack
+  - [ ] Verify: `ssh wizard@k8s-master-2 'sudo grep -c oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml'` — expect `1`
+- [ ] **2.7 Smoke**
+  - [ ] `kubectl get nodes` — 6 nodes, master-2 Ready control-plane
+  - [ ] `kubectl -n kube-system get pods -o wide | grep k8s-master-2` — 4 static pods Running
+  - [ ] etcd member list shows 2 members
+  - [ ] `kubectl --server=https://10.0.20.110:6443 get nodes` — direct probe works
+- [ ] **2.8 Add master-2 to LB pool**
+  - [ ] Edit `pfsense-haproxy-bootstrap.php`: pool now `[['k8s-master', '10.0.20.100'], ['k8s-master-2', '10.0.20.110']]`
+  - [ ] Deploy + verify both backends UP
+
+### Phase 3 — master-3 (~45 min) — same pattern as Phase 2
+
+- [ ] **3.1 Add `module.k8s-master-3` to Terraform** (VMID 206, IP `10.0.20.111`, same template, `k8s_join_command = ""`)
+- [ ] **3.2 Prep verify**
+- [ ] **3.3 Re-upload certs if >2h since Phase 2.3, refresh `--certificate-key`**
+- [ ] **3.4 Generate fresh join command**
+- [ ] **3.5 Run join on master-3**
+- [ ] **3.6 Update rbac stack: `k8s_master_hosts = [".100", ".110", ".111"]`, apply, verify master-3 has OIDC flag**
+- [ ] **3.7 Smoke (7 nodes, 3 control-plane, etcd quorum 3/3)**
+- [ ] **3.8 Add master-3 to LB pool — all three backends UP**
+
+### Phase 4 — Cut over clients and workers to VIP (~45 min)
+
+- [ ] **4.1 Update in-repo kubeconfig consumers (single commit)**
+  - [ ] `/home/wizard/code/infra/config` — flip `server:` to `https://10.0.20.99:6443`
+  - [ ] `config.tfvars:231` — `k8s_join_command` to `kubeadm join 10.0.20.99:6443 ...`
+  - [ ] `stacks/rbac/modules/rbac/apiserver-oidc.tf` — variable `default = "10.0.20.99"` (or whatever the multi-master refactor needs)
+  - [ ] `.woodpecker/default.yml` — flip server: AND curl URL
+  - [ ] `.woodpecker/drift-detection.yml` — flip server: AND curl URL
+  - [ ] `.woodpecker/renew-tls.yml` — flip curl URL (line 18)
+  - [ ] `.woodpecker/provision-user.yml` — flip curl URL (line 41)
+  - [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/download/+server.ts` — `CLUSTER_SERVER` const
+  - [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/setup/script/+server.ts` — same
+  - [ ] Final sweep: `cd /home/wizard/code/infra && grep -rn '10.0.20.100:6443' --include='*.tf' --include='*.yml' --include='*.yaml' --include='*.ts' --include='*.php' --include='*.sh'` — handle anything remaining
+  - [ ] `scripts/tg apply` for rbac + k8s-portal (and any other stacks touched)
+  - [ ] Commit + push (single conventional commit referencing `code-n0ow`)
+- [ ] **4.2 Worker `kubelet.conf` flip (one at a time, with 5-10s expected NotReady)**
+  ```bash
+  for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+    echo "=== $n ==="
+    ssh wizard@$n.viktorbarzin.lan "sudo sed -i.bak 's|server: https://10.0.20.100:6443|server: https://10.0.20.99:6443|' /etc/kubernetes/kubelet.conf"
+    ssh wizard@$n.viktorbarzin.lan "sudo systemctl restart kubelet"
+    kubectl wait --for=condition=Ready node/$n --timeout=180s
+    echo "$n Ready"
+    sleep 15
+  done
+  ```
+- [ ] **4.3 Existing master's `kubelet.conf`** — same sed + restart on `k8s-master`
+- [ ] **4.4 Verify master-2 + master-3 kubelet.conf already at VIP** (cloud-init join used VIP via controlPlaneEndpoint)
+- [ ] **4.5 Verify everything**
+  - [ ] `kubectl get nodes` — all 7 Ready
+  - [ ] `kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'` → `https://10.0.20.99:6443`
+  - [ ] Node loop (all 7, masters included): `for n in k8s-{master,node1,node2,node3,node4,master-2,master-3}; do ssh wizard@$n.viktorbarzin.lan "sudo grep server: /etc/kubernetes/kubelet.conf"; done` — all show VIP
+  - [ ] Trigger a no-op Woodpecker pipeline (commit a typo fix in a runbook) — verify the kubeconfig path through the new VIP
+
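The `kubelet.conf` flip in steps 4.2 and 4.3 and the check in 4.5 could be wrapped in two small helpers, sketched below. `current_server` and `flip_server` are hypothetical names; the sed expression is the one from step 4.2 with the endpoints passed in as arguments.

```bash
#!/usr/bin/env bash
# Sketch: flip the apiserver endpoint in a kubeconfig-style file and confirm it.
# current_server and flip_server are hypothetical helper names.

# Usage: current_server <file>
# Prints the first "server:" URL found in the file.
current_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}

# Usage: flip_server <file> <old_host:port> <new_host:port>
# Keeps a .bak copy, like the sed in step 4.2, and returns non-zero if the
# file does not point at the new endpoint afterwards.
flip_server() {
  local file="$1" old="$2" new="$3"
  sed -i.bak "s|server: https://${old}|server: https://${new}|" "$file"
  [ "$(current_server "$file")" = "https://${new}" ]
}
```

Run on a node, `sudo bash -c '. ./helpers.sh; flip_server /etc/kubernetes/kubelet.conf 10.0.20.100:6443 10.0.20.99:6443'` would perform the flip (`helpers.sh` is an assumed file name). Swapping the last two arguments gives the rollback direction.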
+### Phase 4.5 — Fix etcd-backup CronJob node pinning (~15 min)
+
+- [ ] **4.5.1 Edit `stacks/infra-maintenance/modules/infra-maintenance/main.tf`**
+  - [ ] backup-etcd (line 98): replace `node_name = "k8s-master"` with `node_selector = { "node-role.kubernetes.io/control-plane" = "" }` plus the corresponding `toleration` block
+  - [ ] defrag-etcd (line 218): same change
+- [ ] **4.5.2 `scripts/tg apply` infra-maintenance**
+- [ ] **4.5.3 Verify backup runs** — trigger a manual job-from-cronjob, confirm it lands on one of the 3 masters and produces a valid snapshot
+
+### Phase 5 — kured-sentinel-gate quorum check (~15 min)
+
+- [ ] **5.1 Edit `infra/stacks/kured/main.tf`** (insert into the bash heredoc in the sentinel-gate ConfigMap, between all-nodes-Ready and calico-Ready checks)
+  ```bash
+  # Check 3b: control-plane quorum safety (HA invariant)
+  CP_READY=$(kubectl get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l | tr -d ' ')
+  if [ "$CP_READY" -lt 2 ]; then
+    echo "  BLOCKED: Only $CP_READY control-plane node(s) Ready (need ≥2 for HA)"
+    rm -f /host/var-run/gated-reboot-required
+    sleep 300
+    continue
+  fi
+  echo "  Control-plane quorum safe ($CP_READY Ready)"
+  ```
+- [ ] **5.2 `scripts/tg apply` kured**
+- [ ] **5.3 Verify**
+  - [ ] `kubectl -n kured logs ds/kured-sentinel-gate | tail -50` — expect "Control-plane quorum safe (3 Ready)" line
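+  - [ ] Negative test: cordon two masters (e.g. `k8s-master-2` and `k8s-master-3`) so that only 1 counts as Ready; cordoning a single master leaves 2, which still passes the `-lt 2` check. Wait for the gate to re-evaluate, confirm block message. Restore with `kubectl uncordon`.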
+
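The counting logic in check 3b can be isolated and fed canned `kubectl get nodes --no-headers` output, which makes it easy to see which node states count toward quorum. This is a minimal sketch; `count_ready` is a hypothetical name, and it matches the STATUS column exactly instead of using the `grep ' Ready '` filter from the heredoc.

```bash
#!/usr/bin/env bash
# Sketch: count nodes whose STATUS column is exactly "Ready".
# count_ready is a hypothetical helper, not part of the kured stack.

# Reads `kubectl get nodes --no-headers` lines on stdin and prints the count.
# "NotReady" and "Ready,SchedulingDisabled" (cordoned) are not counted.
count_ready() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

In the gate this would read `CP_READY=$(kubectl get nodes -l node-role.kubernetes.io/control-plane= --no-headers | count_ready)`.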
+### Phase 6 — E2E validation (~30 min)
+
+- [ ] **6.1 Failover test**
+  - [ ] `kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets`
+  - [ ] `ssh wizard@k8s-master.viktorbarzin.lan sudo reboot`
+  - [ ] During the 50-90s reboot: tight loop `while true; do kubectl get nodes -o name | wc -l; sleep 2; done` from devvm — line count never drops to 0 (LB transparent)
+  - [ ] After boot: `kubectl uncordon k8s-master`, verify apiserver static pod re-registers in LB pool (op_state=2)
+- [ ] **6.2 All-masters apiserver flag parity**
+  - [ ] `for h in k8s-master k8s-master-2 k8s-master-3; do echo "=== $h ==="; ssh wizard@$h.viktorbarzin.lan 'sudo grep -E "oidc-issuer-url|audit-policy|auto-compaction-retention|snapshot-count" /etc/kubernetes/manifests/{kube-apiserver,etcd}.yaml | sort'; done`
+  - [ ] Expect identical flag set across all 3 masters
+- [ ] **6.3 Update documentation**
+  - [ ] Add `docs/architecture/control-plane.md` — HA topology, etcd member list, LB config location
+  - [ ] Update `.claude/reference/proxmox-inventory.md` — add VMIDs 205, 206
+  - [ ] Add `docs/runbooks/control-plane-add-remove-master.md`
+  - [ ] Update `docs/runbooks/restore-etcd.md` to cover 3-member quorum restore (was single-master only)
+  - [ ] Cross-link `docs/runbooks/mailserver-pfsense-haproxy.md` with the new apiserver_proxy_6443 pool
+
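The tight loop in step 6.1 relies on watching the output by eye. A variant that counts failed probes could look like the sketch below; `probe_loop` is a hypothetical helper and the probe command is passed in by the caller.

```bash
#!/usr/bin/env bash
# Sketch: run a probe command repeatedly and count how often it fails.
# probe_loop is a hypothetical helper name.

# Usage: probe_loop <iterations> <sleep_seconds> <command> [args...]
# Prints the number of failed probes; returns non-zero if any probe failed.
probe_loop() {
  local iterations="$1" pause="$2" failures=0 i=0
  shift 2
  while [ "$i" -lt "$iterations" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    sleep "$pause"
    i=$((i + 1))
  done
  echo "$failures"
  [ "$failures" -eq 0 ]
}
```

Started just before the reboot, `probe_loop 60 2 kubectl --request-timeout=5s get nodes -o name` would cover roughly two minutes and should print `0` if the LB hides the outage.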
+### Phase 7 — Extend k8s-version-upgrade chain to multi-master (~60 min)
+
+- [ ] **7.1 Edit `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`**
+  - [ ] phase_master: discover masters dynamically — `MASTERS=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= -o name | sed 's|node/||')`
+  - [ ] Wrap drain → `update_k8s.sh` → uncordon → wait-ready in a `for m in $MASTERS; do ... done` loop
+  - [ ] Between masters: quorum check — `READY=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l); [ $READY -ge 2 ] || { slack "ABORT quorum lost"; exit 1; }`
+  - [ ] Update line 9 + 17 comment block to reflect multi-master phase
+  - [ ] Update the containerd-bump section (lines 326-340) to loop over masters
+- [ ] **7.2 Edit `phase_preflight` and the master phase pin**
+  - [ ] Lines 209-210 (scheduling_block): allow any control-plane node to be the target
+  - [ ] Line 285 (`kubeadm upgrade plan` check): run against the first master in the list, not specifically `k8s-master`
+- [ ] **7.3 `scripts/tg apply` k8s-version-upgrade**
+- [ ] **7.4 Dry-run test**
+  - [ ] `kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)` (no actual upgrade pending — chain should noop the upgrade phase but exercise the discovery loop)
+  - [ ] Verify logs show 3 masters discovered in correct order
+- [ ] **7.5 (Real test on next patch release)** — when the first release after 1.34.8 ships:
+  - [ ] Watch the chain execute drain → upgrade → uncordon across all 3 masters in turn
+  - [ ] Confirm no manual intervention needed
+
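The control flow described in 7.1 (one master at a time, quorum check before moving on) could be sketched as follows. All function names are hypothetical; the real per-master step and quorum check would live in `upgrade-step.sh`. Unlike the plan's "between masters" wording, this sketch also runs the quorum check after the last master.

```bash
#!/usr/bin/env bash
# Sketch of the 7.1 loop: process masters one at a time and stop as soon as
# a step or the quorum check fails. Names here are hypothetical.

# Usage: rolling_masters <step_fn> <quorum_fn> <master>...
# Calls "<step_fn> <master>" for each master, then "<quorum_fn>".
# Returns 1 on the first failure, leaving later masters untouched.
rolling_masters() {
  local step_fn="$1" quorum_fn="$2" m
  shift 2
  for m in "$@"; do
    echo "=== $m ==="
    "$step_fn" "$m" || { echo "ABORT: step failed on $m"; return 1; }
    "$quorum_fn" || { echo "ABORT: quorum lost after $m"; return 1; }
  done
}
```

With `upgrade_one_master` wrapping drain, `update_k8s.sh`, uncordon and wait-ready, and `check_quorum` wrapping the Ready count, the call would be `rolling_masters upgrade_one_master check_quorum $MASTERS` (both function names are assumptions).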
+### Phase 8 — Close out
+
+- [ ] **8.1 Update beads** — `bd close code-n0ow` once all 6 acceptance criteria are met (see below)
+
+## Rollback plan
+
+### Before Phase 4 (no clients flipped)
+
+- **Phase 0**: restore apiserver cert/key from `/tmp/apiserver-backup/`, edit kubeadm-config back, restart kubelet on master.
+- **Phase 1**: remove `apiserver_proxy_6443` + `apiserver_nodes` from `pfsense-haproxy-bootstrap.php`, re-run; revert DNS A record in config.tfvars.
+- **Phase 1.5**: revert rbac stack to single `k8s_master_host` var; apply.
+- **Phase 2/3**: on failed master `sudo kubeadm reset --force`; from a surviving master `etcdctl member remove <id>`; `tg destroy -target=module.k8s-master-N`.
+
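The Phase 2/3 rollback needs the etcd member ID of the failed master. A minimal sketch of that lookup, parsing the default comma-separated `etcdctl member list` output, is shown below. `member_id_for` is a hypothetical helper, and it assumes the etcd member name equals the node hostname.

```bash
#!/usr/bin/env bash
# Sketch: look up an etcd member ID by member name.
# member_id_for is a hypothetical helper name.

# Usage: etcdctl member list | member_id_for <member_name>
# Prints the ID (first column) of the member whose name (third column)
# matches; prints nothing and returns 1 if no member matches.
member_id_for() {
  awk -F', ' -v name="$1" '$3 == name { print $1; found = 1 } END { exit !found }'
}
```

From a surviving master this would be used as `etcdctl member remove "$(etcdctl member list | member_id_for k8s-master-3)"`, with the usual endpoint and certificate flags added to both `etcdctl` calls.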
+### After Phase 4 (clients flipped)
+
+- Revert all the Phase 4.1 file changes (single revert commit).
+- Reverse the kubelet.conf sed loop (VIP → direct IP) on all 7 nodes.
+- Phase 0 controlPlaneEndpoint can stay — harmless even on full rollback.
+
+### Worst case (etcd corruption / multi-master split-brain)
+
+- Restore from latest etcd snapshot via `etcdctl snapshot restore` to a single master.
+- Rebuild master VM from the Proxmox snapshot taken in Phase 0.1.
+- Cluster back to single-master.
+
+## Acceptance criteria (beads `code-n0ow`)
+
+- [x] 1. Design doc + plan doc written (this commit)
+- [ ] 2. Plan approved by user
+- [ ] 3. 3 masters online, etcd quorum healthy, apiserver LB working
+- [ ] 4. k8s upgrade chain runs end-to-end across **all 3 masters** without manual intervention (Phase 7)
+- [ ] 5. kured-sentinel-gate respects quorum (Phase 5)
+- [ ] 6. etcd backup runs from any control-plane node (Phase 4.5)
+
+## Open questions
+
+None — all locked via 2026-05-22 decision pass + challenger amendment pass.