k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not

Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
  running version doesn't support the target k8s minor, (b) an in-use deprecated
  API (apiserver_requested_deprecated_apis) is removed at/before the target, or
  (c) a node's containerd is below the target's floor. Validated against the live
  cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
  1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
  kyverno, gpu-operator + containerd floor), sourced from each project's compat
  docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
  block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sIo -> -sILo; -L follows the 302 from
  pkgs.k8s.io to the final 200, without it minors were never detected). Minor
  detection sits behind the compat gate above, so enabling it can't roll an
  unsafe minor.
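
The three block conditions (a)-(c) above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual
compat-gate.py: the function names, the input shapes, and every sample value
are hypothetical, and the real gate reads addon-compat.json plus live cluster
state rather than taking plain dicts.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse '1.35' or 'v1.7.27' into a 3-int tuple (missing parts become 0).

    Assumes plain numeric versions; suffixes like '-rc1' are not handled.
    """
    parts = [int(p) for p in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def evaluate_gate(
    target_minor: str,
    addon_max_k8s: Dict[str, str],      # addon -> highest k8s minor its running version supports
    deprecated_in_use: Dict[str, str],  # in-use deprecated API -> k8s minor that removes it
    node_containerd: Dict[str, str],    # node name -> containerd version on that node
    containerd_floor: str,              # minimum containerd for the target minor
) -> List[str]:
    """Return the block reasons; an empty list means the upgrade may proceed."""
    target = parse_version(target_minor)
    reasons: List[str] = []

    # (a) a critical addon's running version does not support the target minor
    for addon, max_k8s in sorted(addon_max_k8s.items()):
        if parse_version(max_k8s) < target:
            reasons.append(f"addon {addon} supports up to {max_k8s}, target is {target_minor}")

    # (b) an in-use deprecated API is removed at or before the target
    for api, removed_in in sorted(deprecated_in_use.items()):
        if parse_version(removed_in) <= target:
            reasons.append(f"in-use API {api} is removed in {removed_in}")

    # (c) a node's containerd is below the target's floor
    floor = parse_version(containerd_floor)
    for node, version in sorted(node_containerd.items()):
        if parse_version(version) < floor:
            reasons.append(f"node {node} runs containerd {version}, floor is {containerd_floor}")

    return reasons


if __name__ == "__main__":
    # All values below are made up for illustration only.
    for reason in evaluate_gate(
        target_minor="1.35",
        addon_max_k8s={"example-addon": "1.34"},
        deprecated_in_use={"example.io/v1beta1": "1.35"},
        node_containerd={"node-1": "1.7.0"},
        containerd_floor="2.0.0",
    ):
        print("BLOCK:", reason)
```

Returning a list of reasons rather than raising keeps a refusal distinct from
a crash: a caller such as block() can report every reason at once and halt.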

Not pushed yet: this deploys together with the K8sUpgradeBlocked alert, deeper
postflight, and runbook (next commit), so the detector fix only goes live with
the full safety net.
Viktor Barzin 2026-06-19 11:23:30 +00:00
parent 9189560ac3
commit cecd9fe247
4 changed files with 230 additions and 3 deletions

@@ -297,8 +297,10 @@ resource "kubernetes_config_map" "k8s_upgrade_scripts" {
     labels = local.labels
   }
   data = {
-    "upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh")
-    "update_k8s.sh"   = file("${path.module}/../../scripts/update_k8s.sh")
+    "upgrade-step.sh"   = file("${path.module}/scripts/upgrade-step.sh")
+    "update_k8s.sh"     = file("${path.module}/../../scripts/update_k8s.sh")
+    "compat-gate.py"    = file("${path.module}/scripts/compat-gate.py")
+    "addon-compat.json" = file("${path.module}/scripts/addon-compat.json")
   }
 }
@@ -418,7 +420,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
 NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 ))
 NEXT_MINOR="1.$NEXT_MINOR_NUM"
 NEXT_MINOR_AVAILABLE="no"
-if curl -sIo /dev/null -w '%%{http_code}' \
+if curl -sILo /dev/null -w '%%{http_code}' \
 "https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \
 | grep -q '^200$'; then
 NEXT_MINOR_AVAILABLE="yes"