infra

Viktor Barzin b931d9fb20		2026-06-17 18:25:54 +00:00

k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap)

phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around the master upgrade so it can't crashloop during the apiserver blip and I/O-storm kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore was a plain line at the end of the phase, so any abort AFTER quiescing left the operator at 0, and the idempotent retry then skipped the already-on-target master phase and never restored it. Observed 2026-06-17: a post-upgrade gate aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane fine, since calico-node keeps running, but no Calico reconciliation).

Fix:
- Drain first (drain doesn't blip the apiserver), THEN quiesce right before `kubeadm upgrade apply`, and install an EXIT trap that restores the operator no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success). Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared after the explicit happy-path restore.
- postflight also force-restores replicas=1 as a final guarantee (covers the skip-on-retry path that never quiesces or restores).

Long-term fix remains HA control plane (apiserver never goes down), bead code-n0ow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts	k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap)	2026-06-17 18:25:54 +00:00
job-template.yaml	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
main.tf	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)	2026-06-17 18:16:32 +00:00
terragrunt.hcl	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
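
The ordering described in the commit message above (drain, then quiesce, with an EXIT trap guarding the restore) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of the actual script: `scale_operator`, `run_upgrade`, `STATE_FILE` and `UPGRADE_RC` are hypothetical stand-ins, `drain_node` is stubbed out, and a temp file plays the role of the operator's replica count so the sketch runs without a cluster. It also uses an explicit `|| exit 1` where the real script reportedly relies on `set -e`.

```shell
#!/bin/sh
# Sketch only. scale_operator, drain_node and run_upgrade are hypothetical
# stand-ins; a real script would call kubectl, ssh and kubeadm instead.

STATE_FILE="$(mktemp)"   # stands in for the operator's replica count
echo 1 > "$STATE_FILE"

scale_operator() { echo "$1" > "$STATE_FILE"; }   # stand-in for scaling the deployment
drain_node()     { :; }                           # stub: the real one sets its own EXIT trap
run_upgrade()    { return "${UPGRADE_RC:-0}"; }   # stand-in for `kubeadm upgrade apply`

# The body runs in a subshell, so the EXIT trap fires when the phase ends,
# however it ends, and never touches the caller's traps.
phase_master() (
  drain_node                      # drain first, before any trap of ours exists
  trap 'scale_operator 1' EXIT    # assumption: installed just before quiescing
  scale_operator 0                # quiesce right before the upgrade
  run_upgrade || exit 1           # simulated abort (gate failure, ssh/kubeadm error)
  scale_operator 1                # explicit happy-path restore
  trap - EXIT                     # clear the trap once the restore has happened
)

# Simulate an aborted master attempt and check that the operator came back.
UPGRADE_RC=1
phase_master || echo "phase aborted"
echo "replicas after abort: $(cat "$STATE_FILE")"
```

With the simulated failure, the script prints `phase aborted` followed by `replicas after abort: 1`, which is the case the plain end-of-phase restore missed. Setting the trap before scaling down is a choice made for this sketch; restoring to 1 when nothing was scaled down is harmless, and it avoids a window in which the operator is at 0 with no trap installed.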