Investigated, designed, and planned the 3-master HA control plane
migration triggered by 2026-05-21's autonomous k8s upgrade cascade.
Locked 14 design decisions across two passes:
- 10 initial decisions (LB strategy, IPs, sizing, etcd, kured gate, etc)
- 4 challenger-pass amendments (cloud-init template bump, rbac stack
multi-master refactor, HTTPS /readyz health check, expanded blast
radius to include /home/wizard/code/infra/config root kubeconfig,
config.tfvars, k8s-portal user kubeconfigs, etcd-backup nodeSelector,
k8s-version-upgrade chain extension as Phase 7)
Plan covers 11 phases end-to-end including panic-mode rollback.
DEFERRED before execution. PVE host is 98% RAM-committed
(262 GB allocated / 267 GB physical, 1.5 GB swap active); the
planned 3 x 32 GB masters would push allocation to 326 GB and OOM
the host. k8s-master currently uses only 4.6 GB of its 32 GB
allocation (5-6x oversized).
Revisit triggers documented in design doc:
1. Second PVE host added → hardware HA becomes possible.
2. Right-sizing pass OR planning masters at 16 GB each.
3. Cumulative manual upgrade nursing > ~10h.
Standalone candidate worth lifting independently: Phase 1.5's
rbac stack refactor (apiserver-oidc + audit-policy + etcd-tuning
to loop over k8s_master_hosts list) — future-proofs the cluster
without committing to the HA migration.
Refs: code-n0ow (open, deferred via bd note).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>