k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain attempts every upgrade but refuses unless it can prove the target is safe. A refusal is a BLOCK (not a crash): it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's running version doesn't support the target k8s minor, (b) an in-use deprecated API (apiserver_requested_deprecated_apis) is removed at/before the target, or (c) a node's containerd is below the target's floor. Validated against the live cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO, kyverno, gpu-operator + containerd floor), sourced from each project's compat docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation); block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io resolves to 200; minors were never being detected). Gated behind the compat gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight + runbook (next commit) so the detector fix only goes live with the full net.
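The three blocking conditions listed for compat-gate.py can be sketched roughly as below. This is an illustration only, not the script from this commit: the matrix layout, the function names (parse, minor, gate), the addon name example-cni, and every version number are assumptions made up for the example. It also assumes that an addon version missing from the matrix should block, since the gate is described as refusing unless it can prove the target is safe, and that versions are plain dotted integers without pre-release suffixes.

```python
# Hypothetical sketch of the three gate checks; not the actual compat-gate.py.
# The matrix layout, addon names, and versions below are invented for illustration.


def parse(version):
    """Turn 'v1.35.2' or '1.35' into a tuple of ints, e.g. (1, 35, 2)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def minor(version):
    """Keep only (major, minor), which is what the example matrix is keyed on."""
    return parse(version)[:2]


def gate(target, matrix, addons, deprecated_apis, containerd):
    """Return a list of blocking reasons; an empty list means the target passes.

    addons:          {addon name: running minor version}
    deprecated_apis: {in-use deprecated API: k8s minor that removes it}
    containerd:      {node name: containerd version}
    """
    tgt = minor(target)
    reasons = []

    # (a) Every critical addon must declare support for the target minor.
    for name, running in sorted(addons.items()):
        max_k8s = matrix["addons"].get(name, {}).get(running)
        if max_k8s is None:
            # Assumption: an unknown addon/version cannot be proven safe, so block.
            reasons.append(f"{name} {running}: no compat entry")
        elif minor(max_k8s) < tgt:
            reasons.append(f"{name} {running}: supports k8s <= {max_k8s}")

    # (b) An in-use deprecated API removed at or before the target blocks.
    for api, removed_in in sorted(deprecated_apis.items()):
        if minor(removed_in) <= tgt:
            reasons.append(f"{api}: removed in k8s {removed_in}")

    # (c) Every node's containerd must meet the target's floor, if one is listed.
    floor = matrix["containerd_floor"].get(".".join(map(str, tgt)))
    if floor is not None:
        for node, version in sorted(containerd.items()):
            if parse(version) < parse(floor):
                reasons.append(f"{node}: containerd {version} < {floor}")

    return reasons


# Made-up matrix: addon -> running minor -> highest supported k8s minor.
EXAMPLE_MATRIX = {
    "addons": {"example-cni": {"3.26": "1.28", "3.30": "1.35"}},
    "containerd_floor": {"1.35": "1.7.0"},
}
```

A wrapper around gate() would print the reasons and exit non-zero when the list is non-empty, which is the condition phase_preflight treats as a block.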
parent 9189560ac3
commit cecd9fe247
4 changed files with 230 additions and 3 deletions
@@ -88,6 +88,19 @@ push() {
     | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
 }

+# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
+# the cluster simply isn't ready for this target yet (an addon / in-use API /
+# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
+# alert), Slack the reasons, and halt so a human clears the blocker (or a later
+# run proceeds once it's cleared). This is the "upgrade when we can, alert when
+# we can't" contract.
+block() {
+  push k8s_upgrade_blocked 1
+  slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
+  echo "BLOCKED: $1" >&2
+  exit 1
+}
+
 halt_on_alert_query() {
   local extra_ignore="${1:-}"
   # ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
@@ -299,6 +312,19 @@ spawn_next() {
 phase_preflight() {
   slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"

+  # 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
+  #    addon, an in-use deprecated API, or a node's containerd is too old for the
+  #    target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
+  #    block is cheap. Reset the blocked gauge for this run; block() sets it to 1
+  #    only on a refusal. This is what makes unattended minor upgrades safe: the
+  #    chain proceeds when the cluster supports the target and halts+alerts when
+  #    it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
+  push k8s_upgrade_blocked 0
+  local gate_out gate_rc=0
+  gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
+  if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
+  echo "compat-gate passed for v$TARGET_VERSION"
+
   # 1. All nodes Ready + no pressure
   local bad_nodes
   bad_nodes=$($KUBECTL get nodes -o json | jq -r '