|
All checks were successful
ci/woodpecker/push/default Pipeline was successful
phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around
the master upgrade so it can't crashloop during the apiserver blip + I/O-storm
kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore
was a plain line at the end of the phase, so any abort AFTER quiescing left the
operator at 0 — and the idempotent retry then skipped the already-on-target
master phase and never restored it. Observed 2026-06-17: a post-upgrade gate
aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane
fine — calico-node keeps running — but no Calico reconciliation).
Fix:
- Drain first (drain doesn't blip the apiserver), THEN quiesce right before
`kubeadm upgrade apply`, and install an EXIT trap that restores the operator
no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success).
Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared
after the explicit happy-path restore.
- postflight also force-restores replicas=1 as a final guarantee (covers the
skip-on-retry path that never quiesces or restores).
Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| scripts | ||
| job-template.yaml | ||
| main.tf | ||
| terragrunt.hcl | ||