infra/stacks/k8s-version-upgrade
Viktor Barzin b931d9fb20
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap)
phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around
the master upgrade so it can't crashloop during the apiserver blip + I/O-storm
kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore
was a plain line at the end of the phase, so any abort AFTER quiescing left the
operator at 0 — and the idempotent retry then skipped the already-on-target
master phase and never restored it. Observed 2026-06-17: a post-upgrade gate
aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane
fine — calico-node keeps running — but no Calico reconciliation).

Fix:
  - Drain first (drain doesn't blip the apiserver), THEN quiesce right before
    `kubeadm upgrade apply`, and install an EXIT trap that restores the operator
    no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success).
    Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared
    after the explicit happy-path restore.
  - postflight also force-restores replicas=1 as a final guarantee (covers the
    skip-on-retry path that never quiesces or restores).

Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:25:54 +00:00
..
scripts k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap) 2026-06-17 18:25:54 +00:00
job-template.yaml fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
main.tf k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) 2026-06-17 18:16:32 +00:00
terragrunt.hcl fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00