infra/stacks/k8s-version-upgrade
Viktor Barzin ad9f6c8f41 k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from a denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (the upgrade chain causes this itself)

Filtering on severity=critical is more robust than maintaining a denylist
of every noisy alert that crops up. The extra_ignore parameter is kept for
backwards compatibility but is rarely needed now, since critical alerts are
the only ones that should actually halt the chain.
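
A minimal sketch of the allowlist idea, for illustration only. It is not the
code in scripts/: the function name build_halt_query and the way extra_ignore
is folded into the selector are assumptions. The only thing it relies on is
Prometheus's synthetic ALERTS series, which carries alertname, alertstate and
the alert's own labels (such as severity).

```python
import re

# Hypothetical helper: the real halt_on_alert_query may be built differently.
_ALERT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_halt_query(extra_ignore=()):
    """Return a PromQL selector matching only firing, critical alerts.

    extra_ignore is an optional list of alert names to exclude on top of
    the severity allowlist (assumed semantics of the legacy parameter).
    """
    matchers = ['alertstate="firing"', 'severity="critical"']
    if extra_ignore:
        for name in extra_ignore:
            # Reject anything that is not a plain identifier, so no regex
            # or quote characters can leak into the query string.
            if not _ALERT_NAME.fullmatch(name):
                raise ValueError(f"unexpected alert name: {name!r}")
        matchers.append('alertname!~"' + "|".join(extra_ignore) + '"')
    return "ALERTS{" + ",".join(matchers) + "}"


if __name__ == "__main__":
    print(build_halt_query())
    print(build_halt_query(["SomeNoisyCriticalAlert"]))
```

With this shape, a new warning-level alert never needs a code change to stop
it from blocking the chain; only a critical alert that is a known false
positive would need an extra_ignore entry.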

Tested end-to-end this session: master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state repair. The apiserver
manifest had been pinned at v1.34.2 since a previous month's rollback and
required a one-time manual edit plus a kubelet reload to bring it back to
v1.34.7, after which the chain ran cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:14:39 +00:00
scripts k8s-version-upgrade: halt_on_alert allowlist (severity=critical only) 2026-05-23 09:14:39 +00:00
job-template.yaml k8s-version-upgrade: decompose into Job chain to fix self-preemption 2026-05-11 23:54:22 +00:00
main.tf k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore 2026-05-23 08:40:11 +00:00
terragrunt.hcl k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline 2026-05-10 19:07:42 +00:00