infra

upgrade-step.sh	k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window	2026-05-21 09:23:41 +00:00

Viktor Barzin fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window

Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in the `k8s-upgrade` namespace at the top of every phase and at
   the start of version-check. If the ConfigMap exists, the chain halts
   (exit 0) and posts a Slack message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:
       kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Added a resourceName-scoped Role rule granting get/list/watch on
   `configmaps`.

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only the worker
   phase's post-upgrade gate did this. The master phase failed silently this
   morning on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.
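
Items 2 and 3 can be sketched as two small preflight helpers. This is a minimal illustration only, not the contents of `upgrade-step.sh`: the function names, `BASE_IGNORE`, and the use of the Prometheus `ALERTS` series are assumptions, and the real `halt_on_alert_query` may take different arguments or query a different source. The quiet check assumes GNU `date` for parsing timestamps.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the alert-ignore and quiet-baseline gates.

# Assumed base ignore list; the real list is not shown in the commit.
BASE_IGNORE="Watchdog|InfoInhibitor"
# Quiet window after any Ready transition (600s per the commit message).
QUIET_BASELINE_SECONDS=600

# Build a PromQL selector for firing alerts, excluding the base list
# plus an optional extra alert name such as RecentNodeReboot.
build_alert_query() {
  local extra="${1:-}" ignore="$BASE_IGNORE"
  if [ -n "$extra" ]; then
    ignore="${ignore}|${extra}"
  fi
  printf 'ALERTS{alertstate="firing",alertname!~"%s"}' "$ignore"
}

# Print each node's Ready-condition lastTransitionTime, one per line.
ready_transition_times() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'
}

# Read RFC 3339 timestamps on stdin. Return 0 only if every timestamp
# is at least QUIET_BASELINE_SECONDS older than $1 (epoch seconds).
# Return 1 if any transition is too recent, 2 if a timestamp is unparsable.
is_quiet() {
  local now="$1" ts epoch
  while IFS= read -r ts; do
    if [ -z "$ts" ]; then
      continue
    fi
    epoch="$(date -u -d "$ts" +%s)" || return 2
    if [ $(( now - epoch )) -lt "$QUIET_BASELINE_SECONDS" ]; then
      return 1
    fi
  done
  return 0
}
```

A preflight step could then run `ready_transition_times | is_quiet "$(date +%s)"` and pass `RecentNodeReboot` to `build_alert_query`. Taking the current time as an argument keeps the window check testable without a cluster.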

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.
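
The kill-switch gate from item 1 might look roughly like the sketch below. Only the ConfigMap name, the namespace, the `reason` key, and the exit-0 behaviour come from the commit message. `notify_slack`, `killswitch_active`, and `check_killswitch` are hypothetical names, and the chain's real Slack helper is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a kill-switch gate called at the top of each phase.

KILLSWITCH_NS="k8s-upgrade"
KILLSWITCH_CM="k8s-upgrade-killswitch"

# Placeholder for the chain's Slack notifier (assumed, not from the source).
notify_slack() {
  echo "slack: $*"
}

# Return 0 if the kill-switch ConfigMap exists, non-zero otherwise.
# Note: any kubectl failure (including an API error) counts as "absent".
killswitch_active() {
  kubectl -n "$KILLSWITCH_NS" get configmap "$KILLSWITCH_CM" >/dev/null 2>&1
}

# Halt the current phase with exit 0 when the kill-switch is present.
check_killswitch() {
  local phase="$1" reason
  if killswitch_active; then
    reason="$(kubectl -n "$KILLSWITCH_NS" get configmap "$KILLSWITCH_CM" \
      -o jsonpath='{.data.reason}' 2>/dev/null)"
    notify_slack "kill-switch present, halting before ${phase} (reason: ${reason:-unspecified})"
    exit 0
  fi
}
```

One design choice to be aware of in this sketch: treating every `kubectl get` failure as "no kill-switch" means an unreachable API server lets the phase continue. A stricter variant could distinguish NotFound from other errors and halt on the latter.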

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-21 09:23:41 +00:00
