k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain attempts every upgrade but refuses unless it can prove the target is safe. A refusal is a BLOCK (not a crash): it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's running version doesn't support the target k8s minor, (b) an in-use deprecated API (apiserver_requested_deprecated_apis) is removed at/before the target, or (c) a node's containerd is below the target's floor. Validated against the live cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO, kyverno, gpu-operator + containerd floor), sourced from each project's compat docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation); block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io resolves to 200; minors were never being detected). Gated behind the compat gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight + runbook (next commit) so the detector fix only goes live with the full net.
parent 9189560ac3
commit cecd9fe247
4 changed files with 230 additions and 3 deletions
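The floor lookup described in the commit message can be sketched as follows. This is a minimal illustration, not the gate itself: the table is copied from the Calico entry in addon-compat.json, while `max_supported` and `blocks` are hypothetical helper names introduced only for this example. Versions are compared as integer tuples so that 3.10 sorts above 3.9, which a plain string comparison would get wrong.

```python
import re

# Calico rows from addon-compat.json: addon-version floor -> highest supported k8s minor.
CALICO_MAX_K8S = {
    "3.26": "1.28", "3.27": "1.29", "3.28": "1.30", "3.29": "1.32",
    "3.30": "1.35", "3.31": "1.35", "3.32": "1.36",
}


def minor(v):
    """'v1.35.6' | '1.35' -> (1, 35); None if no major.minor is present."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def max_supported(running, table):
    """Hypothetical helper: k8s minor supported by the highest floor <= running."""
    for floor in sorted(table, key=minor, reverse=True):
        if minor(running) >= minor(floor):
            return minor(table[floor])
    return None  # running version is below every floor in the table


def blocks(running, target, table):
    """Hypothetical helper: True when the gate would refuse this target."""
    supported = max_supported(running, table)
    return supported is None or minor(target) > supported
```

With this table, a cluster on Calico 3.26 is refused for a 1.35 target, while Calico 3.30 is allowed; an addon version below every floor is refused because its support is unknown.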
57
stacks/k8s-version-upgrade/scripts/addon-compat.json
Normal file
@@ -0,0 +1,57 @@
{
  "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
  "addons": [
    {
      "name": "calico",
      "namespace": "calico-system",
      "kind": "daemonset",
      "resource": "calico-node",
      "image_re": "node:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "3.26": "1.28",
        "3.27": "1.29",
        "3.28": "1.30",
        "3.29": "1.32",
        "3.30": "1.35",
        "3.31": "1.35",
        "3.32": "1.36"
      }
    },
    {
      "name": "external-secrets",
      "namespace": "external-secrets",
      "kind": "deployment",
      "resource": "external-secrets",
      "image_re": "external-secrets:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "0.12": "1.31",
        "2.0": "1.35"
      }
    },
    {
      "name": "kyverno",
      "namespace": "kyverno",
      "kind": "deployment",
      "resource": "kyverno-admission-controller",
      "image_re": "kyverno:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "1.16": "1.34",
        "1.18": "1.35"
      }
    },
    {
      "name": "gpu-operator",
      "namespace": "nvidia",
      "kind": "deployment",
      "resource": "gpu-operator",
      "image_re": "gpu-operator:v?([0-9]+\\.[0-9]+)",
      "max_k8s": {
        "25.10": "1.35",
        "26.3": "1.36"
      }
    }
  ],
  "containerd_min": {
    "1.37": "2.0"
  }
}
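Because the gate is only as good as this matrix, a small structural check can catch typos before they reach the cluster. The sketch below is an assumption for illustration, not part of this commit: `validate_matrix` is a hypothetical helper that checks the fields the gate reads (`image_re` with one capture group, `max_k8s` and `containerd_min` entries shaped like major.minor).

```python
import re

# Keys compat-gate.py reads from every addon entry.
REQUIRED = ("name", "namespace", "kind", "resource", "image_re", "max_k8s")
VERSION = re.compile(r"^\d+\.\d+$")


def validate_matrix(matrix):
    """Hypothetical helper: return a list of problems; empty means well-formed."""
    problems = []
    for i, addon in enumerate(matrix.get("addons", [])):
        label = addon.get("name", f"addons[{i}]")
        missing = [k for k in REQUIRED if k not in addon]
        if missing:
            problems.append(f"{label}: missing keys {missing}")
            continue
        try:
            # The gate uses group(1) as the running version.
            if re.compile(addon["image_re"]).groups != 1:
                problems.append(f"{label}: image_re needs exactly one capture group")
        except re.error as e:
            problems.append(f"{label}: image_re does not compile ({e})")
        for floor, k8s in addon["max_k8s"].items():
            if not VERSION.match(floor) or not VERSION.match(k8s):
                problems.append(f"{label}: bad max_k8s entry {floor!r}: {k8s!r}")
    for k8s, floor in matrix.get("containerd_min", {}).items():
        if not VERSION.match(k8s) or not VERSION.match(floor):
            problems.append(f"containerd_min: bad entry {k8s!r}: {floor!r}")
    return problems
```

A check like this could run in CI against addon-compat.json, for example `validate_matrix(json.load(open("addon-compat.json")))`, failing the build when the returned list is non-empty.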
142
stacks/k8s-version-upgrade/scripts/compat-gate.py
Normal file
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
"""
Preflight compatibility gate for the k8s version-upgrade chain.

Decides whether it is SAFE to auto-upgrade Kubernetes to a target version, so the
chain can "upgrade whenever it can, and halt + alert when it can't" without a
human in the loop. Reads the addon-compat matrix (JSON on stdin) and checks three
classes of blocker:

  1. addon compat   — every critical addon's RUNNING version must support the
                      target k8s minor (Calico is the usual blocker)
  2. removed APIs   — no in-use API (Prometheus apiserver_requested_deprecated_apis)
                      is removed at/before the target minor
  3. containerd     — every node's containerd >= the target's floor, if the matrix
                      declares one (e.g. the 1.7.x -> k8s 1.37 cliff)

Exit 0 = safe, proceed.
Exit 2 = BLOCKED — prints one human reason per line (caller pushes
         k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
Exit 3 = the gate itself errored — caller treats as a block (fail safe).

Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
via $PROM for local testing (cluster DNS isn't resolvable off-cluster).
"""
import json
import os
import re
import subprocess
import sys
import urllib.request

PROM = os.environ.get("PROM", "http://prometheus-server.monitoring.svc.cluster.local:80")


def minor(v):
    """'v1.35.6' | '1.35.6' | '1.35' -> (1, 35); None if unparseable."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def kget(args):
    """Run a read-only kubectl command; return stripped stdout, or "" on any failure."""
    try:
        r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30)
        return r.stdout.strip()
    except Exception:
        return ""


def check_addons(matrix, tgt):
    reasons = []
    for a in matrix.get("addons", []):
        img = kget(["-n", a["namespace"], "get", a["kind"], a["resource"],
                    "-o", "jsonpath={.spec.template.spec.containers[*].image}"])
        m = re.search(a["image_re"], img or "")
        if not m:
            # Fail safe: if we can't read the running version, don't upgrade blind.
            reasons.append(f"addon {a['name']}: could not read running version "
                           f"(img='{img or 'not found'}') — refusing to upgrade blind")
            continue
        running = m.group(1)  # e.g. "3.26"
        # max_k8s maps an addon-version floor -> highest supported k8s minor.
        # Pick the highest floor that is <= the running version.
        max_k8s = None
        for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
            if minor(running) >= minor(floor):
                max_k8s = mk
                break
        if max_k8s is None:
            reasons.append(f"addon {a['name']} v{running}: below the lowest version "
                           f"in the compat matrix — unknown k8s support")
            continue
        if tgt > minor(max_k8s):
            reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
                           f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
    return reasons


def check_removed_apis(tgt):
    reasons = []
    try:
        url = PROM + "/api/v1/query?query=apiserver_requested_deprecated_apis"
        data = json.load(urllib.request.urlopen(url, timeout=20))
        for s in data.get("data", {}).get("result", []):
            lbl = s["metric"]
            rr = lbl.get("removed_release", "")
            if rr and minor(rr) and tgt >= minor(rr):
                g = lbl.get("group") or "core"
                reasons.append(f"deprecated API {g}/{lbl.get('version')} "
                               f"{lbl.get('resource')} is in use and is removed in "
                               f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
    except Exception as e:
        reasons.append(f"removed-API check could not query Prometheus ({e}) — "
                       f"refusing to upgrade blind")
    return reasons


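def check_containerd(matrix, tgt):
    reasons = []
    # Apply the floor declared for the highest k8s minor <= target, so a floor
    # declared for 1.37 still gates 1.38 and later.
    floors = {minor(k): v for k, v in matrix.get("containerd_min", {}).items() if minor(k)}
    applicable = [k for k in floors if k <= tgt]
    if not applicable:
        return reasons
    floor = floors[max(applicable)]
    out = kget(["get", "nodes", "-o",
                "jsonpath={range .items[*]}{.metadata.name}{\" \"}"
                "{.status.nodeInfo.containerRuntimeVersion}{\"\\n\"}{end}"])
    if not out:
        # Fail safe: kget returns "" when kubectl fails, which would skip every node.
        reasons.append("containerd check could not list nodes; refusing to upgrade blind")
        return reasons
    for line in out.splitlines():
        if not line.strip():
            continue
        name, _, ver = line.partition(" ")
        cv = ver.replace("containerd://", "")
        if not minor(cv):
            # Fail safe: an unreadable runtime version must not pass silently.
            reasons.append(f"node {name}: could not read containerd version "
                           f"('{ver}'); refusing to upgrade blind")
            continue
        if minor(cv) < minor(floor):
            reasons.append(f"node {name} containerd {cv} < required {floor} "
                           f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
    return reasons

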
def main():
    if len(sys.argv) < 2:
        print("usage: compat-gate.py <target-k8s-version> (matrix JSON on stdin)")
        sys.exit(3)
    tgt = minor(sys.argv[1])
    if not tgt:
        print(f"bad target version '{sys.argv[1]}'")
        sys.exit(3)
    try:
        matrix = json.load(sys.stdin)
    except Exception as e:
        print(f"could not parse compat matrix JSON: {e}")
        sys.exit(3)

    reasons = (check_addons(matrix, tgt)
               + check_removed_apis(tgt)
               + check_containerd(matrix, tgt))
    if reasons:
        for r in reasons:
            print(r)
        sys.exit(2)
    print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
    sys.exit(0)


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        # Unexpected failure inside the gate: exit 3 as documented, not a bare traceback.
        # sys.exit() raises SystemExit, which is not an Exception, so 0/2/3 pass through.
        print(f"compat-gate internal error: {e}")
        sys.exit(3)
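To see what the two version-extraction steps in compat-gate.py do with real-looking input, here is a small standalone sketch. The regexes are the ones from addon-compat.json and `minor()`; the image references and the helper names `addon_version` and `containerd_minor` are assumptions made up for the example.

```python
import re


def addon_version(image_re, images):
    """Hypothetical helper: first capture of image_re in the space-separated
    image list printed by the jsonpath query, or None when nothing matches."""
    m = re.search(image_re, images or "")
    return m.group(1) if m else None


def containerd_minor(runtime_version):
    """Hypothetical helper: 'containerd://1.7.27' -> (1, 7); None if unparseable."""
    m = re.search(r"(\d+)\.(\d+)", runtime_version.replace("containerd://", ""))
    return (int(m.group(1)), int(m.group(2))) if m else None


# Sample inputs (illustrative image references, not taken from the cluster).
calico = addon_version(r"node:v?([0-9]+\.[0-9]+)", "docker.io/calico/node:v3.26.4")
pinned = addon_version(r"kyverno:v?([0-9]+\.[0-9]+)", "ghcr.io/kyverno/kyverno@sha256:abc")
node = containerd_minor("containerd://1.7.27")
```

Note the second sample: an image pinned by digest with no tag yields no version, which the gate reports as "could not read running version" and treats as a block rather than guessing.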
@@ -88,6 +88,19 @@ push() {
    | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
}

# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
# the cluster simply isn't ready for this target yet (an addon / in-use API /
# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
# alert), Slack the reasons, and halt so a human clears the blocker (or a later
# run proceeds once it's cleared). This is the "upgrade when we can, alert when
# we can't" contract.
block() {
  push k8s_upgrade_blocked 1
  slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
  echo "BLOCKED: $1" >&2
  exit 1
}

halt_on_alert_query() {
  local extra_ignore="${1:-}"
  # ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
@@ -299,6 +312,19 @@ spawn_next() {
phase_preflight() {
  slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)"

  # 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
  #    addon, an in-use deprecated API, or a node's containerd is too old for the
  #    target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
  #    block is cheap. Reset the blocked gauge for this run; block() sets it to 1
  #    only on a refusal. This is what makes unattended minor upgrades safe: the
  #    chain proceeds when the cluster supports the target and halts+alerts when
  #    it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
  push k8s_upgrade_blocked 0
  local gate_out gate_rc=0
  gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
  if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
  echo "compat-gate passed for v$TARGET_VERSION"

  # 1. All nodes Ready + no pressure
  local bad_nodes
  bad_nodes=$($KUBECTL get nodes -o json | jq -r '
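The exit-code contract between compat-gate.py and phase_preflight can be summarized in a few lines. This is a sketch under stated assumptions: `interpret_gate` is a hypothetical name, and the shell above does not distinguish the two failure cases (any non-zero exit calls block()); the split is shown only to make the documented codes explicit.

```python
def interpret_gate(rc, output):
    """Hypothetical helper: map the gate's exit code and captured output to a
    decision. Only 'proceed' lets the chain continue; everything else halts."""
    reasons = [line for line in output.splitlines() if line.strip()]
    if rc == 0:
        return "proceed", []
    if rc == 2:
        # One human-readable reason per line, as printed by the gate.
        return "blocked", reasons
    # Exit 3, or anything unexpected: treated as a block (fail safe).
    return "gate-error", reasons or [f"compat-gate exited {rc} with no output"]
```

Treating every non-zero code as a halt keeps the fail-safe property: a crashed gate can never be mistaken for a pass.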