infra

Viktor Barzin fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window	2026-05-21 09:23:41 +00:00

Three changes from today's autonomous-pipeline validation session:

1. Kill-switch ConfigMap: chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at
   the start of version-check. Existence halts the chain (exit 0) with a
   Slack message.

   Single-command emergency stop:

     kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
       --from-literal=reason="storm response"

   Resume:

     kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch

   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. Ignore RecentNodeReboot in halt_on_alert_query everywhere: the chain
   itself causes reboots. The pre-drain master check, post-upgrade worker
   check, postflight check, and preflight halt-on-alert all now pass
   `RecentNodeReboot` as the extra-ignore. Previously only worker phase's
   post-upgrade gate did this. Master Failed silently this morning on the
   pre-drain check after my own master reboot.

3. Preflight quiet-baseline 3600s → 600s: the 1h cooldown after any Ready
   transition meant the chain refused to run for an hour after every kured
   reboot. 10 min is enough for kubelet/control-plane to settle; the
   24h-between-cluster-reboots invariant lives in kured-sentinel-gate, not
   here.

Validated by running the chain end-to-end: preflight passed in 5s, master
phase now in drain.

Today's storm post-mortem (snapshot CoW amplification + tigera-operator
crashloop feedback loop) drove the kill-switch design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts	k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window	2026-05-21 09:23:41 +00:00
job-template.yaml	k8s-version-upgrade: decompose into Job chain to fix self-preemption	2026-05-11 23:54:22 +00:00
main.tf	k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window	2026-05-21 09:23:41 +00:00
terragrunt.hcl	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline	2026-05-10 19:07:42 +00:00
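
The kill-switch gate described in the commit message could look roughly like the sketch below. This is not the repository's actual script: the function names (`notify`, `killswitch_active`, `killswitch_reason`, `check_killswitch`) are hypothetical, and `notify` only stands in for the Slack message. The namespace, the ConfigMap name, and the `reason` key are taken from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a kill-switch gate; function names are illustrative
# and do not come from the repository.
set -u

NAMESPACE="k8s-upgrade"              # namespace named in the commit message
KILLSWITCH="k8s-upgrade-killswitch"  # ConfigMap named in the commit message

# Placeholder for the Slack notification mentioned in the commit message;
# this sketch only writes to stderr.
notify() {
  echo "notify: $*" >&2
}

# Returns 0 if the kill-switch ConfigMap exists, non-zero otherwise.
# Assumption: any kubectl failure (including an unreachable API server)
# is treated as "no kill-switch", so this sketch fails open.
killswitch_active() {
  kubectl -n "$NAMESPACE" get configmap "$KILLSWITCH" >/dev/null 2>&1
}

# Prints the value of the "reason" key, or "unspecified" if it is empty.
killswitch_reason() {
  local reason
  reason="$(kubectl -n "$NAMESPACE" get configmap "$KILLSWITCH" \
    -o jsonpath='{.data.reason}' 2>/dev/null)"
  echo "${reason:-unspecified}"
}

# Intended to be called at the top of a phase. Halts with exit 0 when the
# kill-switch is present, so the Job is not marked as failed.
check_killswitch() {
  local phase="$1"
  if killswitch_active; then
    notify "kill-switch present, halting before ${phase} (reason: $(killswitch_reason))"
    exit 0
  fi
}
```

A phase script would source this and call `check_killswitch preflight` (or the relevant phase name) before doing any work. Whether a failed lookup should halt the chain instead of letting it proceed is a design choice the commit message does not settle; the sketch lets the chain proceed.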