diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 7533da44..e4df2f29 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -36,6 +36,7 @@ envsubst on /template/job-template.yaml | kubectl apply -f - ▼ Job 0 — preflight (pinned: k8s-node1) + ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert) ├── All nodes Ready + no Mem/Disk pressure ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) @@ -87,6 +88,46 @@ Job 6 — postflight (no pinning) **adding a node needs no change** — the chain upgrades every worker still off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed). +### Auto-upgrade compat gate + +The chain now attempts **patch AND minor** upgrades autonomously — but before any +mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks) +the upgrade** if any of these hold for the detected target: + +- a **critical addon's running version doesn't support the target k8s minor** + (target minor > the highest k8s minor the running addon version supports, per the compat matrix), +- an **in-use deprecated API is removed at/before the target** — measured live + from `apiserver_requested_deprecated_apis` (something is still calling a + group/version that the target k8s drops), or +- a **node's containerd is below the target's floor** (the minimum containerd the + target k8s requires). + +This is the **"auto-upgrade when we can, halt + alert when we can't"** contract. + +**On a block**, the gate: +- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked` + Prometheus alert), +- Slacks the **specific reasons** (which addon/API/node, current vs required), and +- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet, + this is not a failure). Because the block happens **before any mutation, no + rollback is involved**; nothing was changed.
+ +**To clear a block**: upgrade the named addon (or migrate the API caller off the +deprecated group/version, or bump containerd on the named node) so the offending +condition no longer holds. The **next nightly run then proceeds automatically** — +no manual chain restart needed. + +The **compat matrix** lives in +`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest +supported k8s minor`, populated from each addon's own compatibility docs. **Keep +it current**; the gate reads it on every run. Gate logic: +`stacks/k8s-version-upgrade/scripts/compat-gate.py`. + +> The detector's minor-probe was **fixed** (the `HEAD pkgs.k8s.io/.../v` +> curl now follows the 302 from `pkgs.k8s.io` via `-L`), so **minor versions are +> finally detected** — and are gated behind the compat check above before the +> chain will act on them. + ## Components ### Shared resources (one-time, Terraform-managed) @@ -118,7 +159,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. - **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). 
-- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. +- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` for 10m (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon doesn't support the detected target, an in-use deprecated API is removed at/before it, or a node's containerd is below its floor. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. +- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. ### CoreDNS is NOT upgraded by kubeadm here @@ -391,6 +433,8 @@ kill %1 |------|-------| | Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` | | Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | +| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` | +| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` | | Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` | | Per-node upgrade script | `infra/scripts/update_k8s.sh` | | Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | diff --git a/stacks/k8s-version-upgrade/main.tf b/stacks/k8s-version-upgrade/main.tf index 738b5431..c2ac3b01 100644 --- a/stacks/k8s-version-upgrade/main.tf +++ b/stacks/k8s-version-upgrade/main.tf @@ -297,8 +297,10 @@ resource "kubernetes_config_map" "k8s_upgrade_scripts" { labels = local.labels } data = { - "upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh") - "update_k8s.sh" = 
file("${path.module}/../../scripts/update_k8s.sh") + "upgrade-step.sh" = file("${path.module}/scripts/upgrade-step.sh") + "update_k8s.sh" = file("${path.module}/../../scripts/update_k8s.sh") + "compat-gate.py" = file("${path.module}/scripts/compat-gate.py") + "addon-compat.json" = file("${path.module}/scripts/addon-compat.json") } } @@ -418,7 +420,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" { NEXT_MINOR_NUM=$(( $(echo "$RUNNING_MINOR" | cut -d. -f2) + 1 )) NEXT_MINOR="1.$NEXT_MINOR_NUM" NEXT_MINOR_AVAILABLE="no" - if curl -sIo /dev/null -w '%%{http_code}' \ + if curl -sILo /dev/null -w '%%{http_code}' \ "https://pkgs.k8s.io/core:/stable:/v$NEXT_MINOR/deb/Release" \ | grep -q '^200$'; then NEXT_MINOR_AVAILABLE="yes" diff --git a/stacks/k8s-version-upgrade/scripts/addon-compat.json b/stacks/k8s-version-upgrade/scripts/addon-compat.json new file mode 100644 index 00000000..5e1afdd8 --- /dev/null +++ b/stacks/k8s-version-upgrade/scripts/addon-compat.json @@ -0,0 +1,57 @@ +{ + "_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). 
max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.", + "addons": [ + { + "name": "calico", + "namespace": "calico-system", + "kind": "daemonset", + "resource": "calico-node", + "image_re": "node:v?([0-9]+\\.[0-9]+)", + "max_k8s": { + "3.26": "1.28", + "3.27": "1.29", + "3.28": "1.30", + "3.29": "1.32", + "3.30": "1.35", + "3.31": "1.35", + "3.32": "1.36" + } + }, + { + "name": "external-secrets", + "namespace": "external-secrets", + "kind": "deployment", + "resource": "external-secrets", + "image_re": "external-secrets:v?([0-9]+\\.[0-9]+)", + "max_k8s": { + "0.12": "1.31", + "2.0": "1.35" + } + }, + { + "name": "kyverno", + "namespace": "kyverno", + "kind": "deployment", + "resource": "kyverno-admission-controller", + "image_re": "kyverno:v?([0-9]+\\.[0-9]+)", + "max_k8s": { + "1.16": "1.34", + "1.18": "1.35" + } + }, + { + "name": "gpu-operator", + "namespace": "nvidia", + "kind": "deployment", + "resource": "gpu-operator", + "image_re": "gpu-operator:v?([0-9]+\\.[0-9]+)", + "max_k8s": { + "25.10": "1.35", + "26.3": "1.36" + } + } + ], + "containerd_min": { + "1.37": "2.0" + } +} diff --git a/stacks/k8s-version-upgrade/scripts/compat-gate.py b/stacks/k8s-version-upgrade/scripts/compat-gate.py new file mode 100644 index 00000000..1c8895b2 --- /dev/null +++ b/stacks/k8s-version-upgrade/scripts/compat-gate.py @@ -0,0 +1,142 @@ +#!/usr/bin/env python3 +""" +Preflight compatibility gate for the k8s version-upgrade chain. + +Decides whether it is SAFE to auto-upgrade Kubernetes to a target version, so the +chain can "upgrade whenever it can, and halt + alert when it can't" without a +human in the loop. Reads the addon-compat matrix (JSON on stdin) and checks three +classes of blocker: + + 1. addon compat — every critical addon's RUNNING version must support the + target k8s minor (Calico is the usual blocker) + 2. 
removed APIs — no in-use API (Prometheus apiserver_requested_deprecated_apis) + is removed at/before the target minor + 3. containerd — every node's containerd >= the target's floor, if the matrix + declares one (e.g. the 1.7.x -> k8s 1.37 cliff) + +Exit 0 = safe, proceed. +Exit 2 = BLOCKED — prints one human reason per line (caller pushes + k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain). +Exit 3 = the gate itself errored — caller treats as a block (fail safe). + +Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable +via $PROM for local testing (cluster DNS isn't resolvable off-cluster). +""" +import json +import os +import re +import subprocess +import sys +import urllib.request + +PROM = os.environ.get("PROM", "http://prometheus-server.monitoring.svc.cluster.local:80") + + +def minor(v): + """'v1.35.6' | '1.35.6' | '1.35' -> (1, 35); None if unparseable.""" + m = re.search(r"(\d+)\.(\d+)", v or "") + return (int(m.group(1)), int(m.group(2))) if m else None + + +def kget(args): + try: + r = subprocess.run(["kubectl", *args], capture_output=True, text=True, timeout=30) + return r.stdout.strip() + except Exception: + return "" + + +def check_addons(matrix, tgt): + reasons = [] + for a in matrix.get("addons", []): + img = kget(["-n", a["namespace"], "get", a["kind"], a["resource"], + "-o", "jsonpath={.spec.template.spec.containers[*].image}"]) + m = re.search(a["image_re"], img or "") + if not m: + # Fail safe: if we can't read the running version, don't upgrade blind. + reasons.append(f"addon {a['name']}: could not read running version " + f"(img='{img or 'not found'}') — refusing to upgrade blind") + continue + running = m.group(1) # e.g. "3.26" + # max_k8s maps an addon-version floor -> highest supported k8s minor. + # Pick the highest floor that is <= the running version. 
+ max_k8s = None + for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True): + if minor(running) >= minor(floor): + max_k8s = mk + break + if max_k8s is None: + reasons.append(f"addon {a['name']} v{running}: below the lowest version " + f"in the compat matrix — unknown k8s support") + continue + if tgt > minor(max_k8s): + reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; " + f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first") + return reasons + + +def check_removed_apis(tgt): + reasons = [] + try: + url = PROM + "/api/v1/query?query=apiserver_requested_deprecated_apis" + data = json.load(urllib.request.urlopen(url, timeout=20)) + for s in data.get("data", {}).get("result", []): + lbl = s["metric"] + rr = lbl.get("removed_release", "") + if rr and minor(rr) and tgt >= minor(rr): + g = lbl.get("group") or "core" + reasons.append(f"deprecated API {g}/{lbl.get('version')} " + f"{lbl.get('resource')} is in use and is removed in " + f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first") + except Exception as e: + reasons.append(f"removed-API check could not query Prometheus ({e}) — " + f"refusing to upgrade blind") + return reasons + + +def check_containerd(matrix, tgt): + reasons = [] + floor = matrix.get("containerd_min", {}).get(f"{tgt[0]}.{tgt[1]}") + if not floor: + return reasons + out = kget(["get", "nodes", "-o", + "jsonpath={range .items[*]}{.metadata.name}{\" \"}" + "{.status.nodeInfo.containerRuntimeVersion}{\"\\n\"}{end}"]) + for line in out.splitlines(): + if not line.strip(): + continue + name, _, ver = line.partition(" ") + cv = ver.replace("containerd://", "") + if minor(cv) and minor(cv) < minor(floor): + reasons.append(f"node {name} containerd {cv} < required {floor} " + f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first") + return reasons + + +def main(): + if len(sys.argv) < 2: + print("usage: compat-gate.py (matrix JSON on stdin)") + sys.exit(3) + tgt = 
minor(sys.argv[1]) + if not tgt: + print(f"bad target version '{sys.argv[1]}'") + sys.exit(3) + try: + matrix = json.load(sys.stdin) + except Exception as e: + print(f"could not parse compat matrix JSON: {e}") + sys.exit(3) + + reasons = (check_addons(matrix, tgt) + + check_removed_apis(tgt) + + check_containerd(matrix, tgt)) + if reasons: + for r in reasons: + print(r) + sys.exit(2) + print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}") + sys.exit(0) + + +if __name__ == "__main__": + main() diff --git a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh index 95bfb9c7..17f2d2d3 100644 --- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh +++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh @@ -88,6 +88,19 @@ push() { | curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed" } +# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash — +# the cluster simply isn't ready for this target yet (an addon / in-use API / +# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked +# alert), Slack the reasons, and halt so a human clears the blocker (or a later +# run proceeds once it's cleared). This is the "upgrade when we can, alert when +# we can't" contract. +block() { + push k8s_upgrade_blocked 1 + slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1" + echo "BLOCKED: $1" >&2 + exit 1 +} + halt_on_alert_query() { local extra_ignore="${1:-}" # ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on @@ -299,6 +312,19 @@ spawn_next() { phase_preflight() { slack "Starting preflight (target v$TARGET_VERSION, kind=$KIND)" + # 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical + # addon, an in-use deprecated API, or a node's containerd is too old for the + # target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a + # block is cheap. 
Reset the blocked gauge for this run; block() sets it to 1 + # only on a refusal. This is what makes unattended minor upgrades safe: the + # chain proceeds when the cluster supports the target and halts+alerts when + # it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use). + push k8s_upgrade_blocked 0 + local gate_out gate_rc=0 + gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$? + if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi + echo "compat-gate passed for v$TARGET_VERSION" + # 1. All nodes Ready + no pressure local bad_nodes bad_nodes=$($KUBECTL get nodes -o json | jq -r ' @@ -648,6 +674,60 @@ phase_postflight() { --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \ | jq -r '.data.result[0].value[1] // "0"') + # --------------------------------------------------------------------------- + # Deeper smoke tests — catch a cluster that's "all pods Running" but actually + # broken after the upgrade (dead apiserver health endpoints, broken + # CoreDNS/in-cluster DNS, or a control-plane component that's only superficially + # up). Uses ONLY the chain's existing permissions: read-only kubectl raw API + # reads + this pod's own resolver. No new pods/exec/images/RBAC. We do NOT + # rollback — kubeadm can't downgrade — we halt loudly for a human. + local smoke_failed=0 + + # 1. apiserver health endpoints. `kubectl get --raw` exits non-zero on a + # non-200, which under `set -e` would abort — capture rc explicitly. + local readyz_out readyz_rc=0 livez_out livez_rc=0 + readyz_out=$($KUBECTL get --raw='/readyz' 2>&1) || readyz_rc=$? + if [ "$readyz_rc" -ne 0 ] || [ "$readyz_out" != "ok" ]; then + smoke_failed=1 + slack "postflight smoke FAIL — apiserver /readyz not ok (rc=$readyz_rc, body='${readyz_out:0:200}')" + fi + livez_out=$($KUBECTL get --raw='/livez' 2>&1) || livez_rc=$? 
+ if [ "$livez_rc" -ne 0 ] || [ "$livez_out" != "ok" ]; then + smoke_failed=1 + slack "postflight smoke FAIL — apiserver /livez not ok (rc=$livez_rc, body='${livez_out:0:200}')" + fi + + # 2. In-cluster DNS resolution from THIS pod's resolver. If CoreDNS / kube-dns + # is broken after the upgrade, resolving the apiserver's cluster service + # name fails here even though pods may still look Running. + local dns_rc=0 + python3 -c 'import socket; socket.gethostbyname("kubernetes.default.svc.cluster.local")' >/dev/null 2>&1 || dns_rc=$? + if [ "$dns_rc" -ne 0 ]; then + smoke_failed=1 + slack "postflight smoke FAIL — in-cluster DNS broken (could not resolve kubernetes.default.svc.cluster.local; CoreDNS down?)" + fi + + # 3. Core kube-system pods Running: control-plane statics (apiserver, + # controller-manager, scheduler, etcd) AND CoreDNS. `grep -v Running` + # returns 1 when everything is Running (the happy path) → wrap in `|| true` + # so pipefail doesn't abort us at the moment of success. 
+ local comp not_running + for comp in kube-apiserver kube-controller-manager kube-scheduler etcd coredns; do + not_running=$($KUBECTL -n kube-system get pods --no-headers 2>/dev/null \ + | { grep -E "(^|[[:space:]])${comp}-" || true; } \ + | { grep -v Running || true; } | wc -l) + if [ "$not_running" -gt 0 ]; then + smoke_failed=1 + slack "postflight smoke FAIL — $not_running kube-system '$comp' pod(s) not Running after upgrade" + fi + done + + if [ "$smoke_failed" -ne 0 ]; then + slack "postflight smoke tests FAILED — upgrade left the cluster unhealthy, halting for a human (no rollback; kubeadm can't downgrade)" + exit 1 + fi + echo "postflight smoke tests passed (apiserver health + DNS + core kube-system pods)" + # Clear annotations + gauges $KUBECTL annotate ns "$NS" \ 'viktorbarzin.me/k8s-upgrade-in-flight-' \ diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index f98e429c..f7bbe256 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -2252,6 +2252,26 @@ serverFiles: subsystem: k8s-upgrade annotations: summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}" + # K8sUpgradeBlocked: the k8s-version-upgrade chain pushes + # `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the + # target version — the cluster isn't ready (a critical addon lags the + # target's support window, an in-use API is deprecated/removed at the + # target, or a node's containerd predates the target's minimum). This + # is the designed "halt + alert" outcome, NOT a crash: the chain stops + # cleanly and the specific blocking reasons are posted to Slack by the + # upgrade chain. 
Same bare-metric pushgateway selector as + # K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump + # the named addon / migrate the deprecated API usage / upgrade the + # node's containerd, then the next nightly run proceeds automatically. + - alert: K8sUpgradeBlocked + expr: k8s_upgrade_blocked == 1 + for: 10m + labels: + severity: warning + subsystem: k8s-upgrade + annotations: + summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain." + description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically." - name: "Traefik Ingress" rules: - alert: TraefikDown
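
A minimal off-cluster sketch of the floor lookup that `check_addons` performs in `compat-gate.py`, assuming the same `minor()` contract as the patch. The helper names `max_supported_k8s` and `is_blocked` are hypothetical and do not appear in the patch; the `CALICO` values are copied from the `calico` entry in `addon-compat.json`. It may be useful for checking a matrix edit without `kubectl` or Prometheus access.

```python
import re


def minor(v):
    """'v1.35.6' | '1.35.6' | '1.35' -> (1, 35); None if unparseable."""
    m = re.search(r"(\d+)\.(\d+)", v or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def max_supported_k8s(max_k8s, running):
    """Hypothetical helper: k8s minor for the highest floor <= running, else None."""
    run = minor(running)
    if run is None:
        return None
    # Compare versions as integer tuples so that 3.30 sorts above 3.29.
    floors = sorted(max_k8s.items(), key=lambda kv: minor(kv[0]), reverse=True)
    for floor, k8s in floors:
        if run >= minor(floor):
            return k8s
    return None


def is_blocked(max_k8s, running, target):
    """Hypothetical helper: True if target exceeds known support (unknown counts as blocked)."""
    supported = max_supported_k8s(max_k8s, running)
    return supported is None or minor(target) > minor(supported)


# Copied from the calico entry of addon-compat.json in this patch.
CALICO = {"3.26": "1.28", "3.27": "1.29", "3.28": "1.30", "3.29": "1.32",
          "3.30": "1.35", "3.31": "1.35", "3.32": "1.36"}

if __name__ == "__main__":
    for running in ("3.25", "3.29", "3.30", "3.32"):
        print(running, max_supported_k8s(CALICO, running),
              is_blocked(CALICO, running, "v1.35.6"))
```

With these floors, a running Calico 3.29 resolves to k8s 1.32, so a 1.35 target is blocked, while 3.30 clears it. A version below the lowest floor (3.25 here) is treated as blocked, matching the gate's fail-safe stance on unknown support.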