k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-28 10:08:20 +00:00
parent afcd463f39
commit eebb6c8594
9 changed files with 397 additions and 78 deletions

View file

@ -37,6 +37,12 @@ KUBECTL=kubectl
JOB_TEMPLATE=/template/job-template.yaml
UPDATE_K8S_SH=/scripts/update_k8s.sh
# Set to 1 by record_blocked/record_held when the compat-gate refuses the
# target. spawn_next() then declines to advance the chain — but the Job still
# exits 0, because a gate refusal is a DECISION, not a failure (no Failed Job,
# no K8sUpgradeChainJobFailed). Signalling is via the gauges those recorders push.
HALT_CHAIN=0
# SSH targets are node InternalIPs, resolved live from `kubectl get nodes` (see
# ssh_target() below) — the pipeline has NO dependency on node DNS records
# (`k8s-node<N>.viktorbarzin.lan`). This is what lets a freshly-joined node be
@ -88,17 +94,31 @@ push() {
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
}
# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
# the cluster simply isn't ready for this target yet (an addon / in-use API /
# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
# alert), Slack the reasons, and halt so a human clears the blocker (or a later
# run proceeds once it's cleared). This is the "upgrade when we can, alert when
# we can't" contract.
block() {
# Compat-gate verdict recorders. A gate refusal is a DECISION, not a crash: the
# Job Completes cleanly and the chain simply doesn't advance (spawn_next checks
# HALT_CHAIN). The two outcomes differ only in how they're signalled:
# - record_blocked: ACTIONABLE — a newer addon version would clear it.
# k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (fires once via
# alert-on-change). "upgrade when we can, alert when we can't."
# - record_held: WAITING-ON-UPSTREAM or PINNED — nothing to do but wait.
# k8s_upgrade_held=1 -> NO alert; the nightly report's ⏸️ line is the
# only signal. This is what stops the nightly cry-wolf for unactionable
# blocks (kyverno/ESO behind upstream, gpu-operator pinned).
# Neither Slacks per-run: the reasons are in the nightly report (it re-runs
# compat-gate), and per-run Slack was itself a nightly-noise source.
record_blocked() {
push k8s_upgrade_blocked 1
slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
echo "BLOCKED: $1" >&2
exit 1
push k8s_upgrade_held 0
HALT_CHAIN=1
echo "BLOCKED (action needed) preflight v$TARGET_VERSION:" >&2
printf '%s\n' "$1" >&2
}
record_held() {
push k8s_upgrade_held 1
push k8s_upgrade_blocked 0
HALT_CHAIN=1
echo "HELD (not yet upgradable — waiting upstream / pinned) preflight v$TARGET_VERSION:" >&2
printf '%s\n' "$1" >&2
}
halt_on_alert_query() {
@ -256,6 +276,10 @@ case "$PHASE" in
esac
spawn_next() {
if [ "${HALT_CHAIN:-0}" = "1" ]; then
echo "Chain halted by compat-gate (blocked/held) — not spawning next phase."
return 0
fi
[ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }
local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
@ -315,15 +339,37 @@ phase_preflight() {
# 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
# addon, an in-use deprecated API, or a node's containerd is too old for the
# target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
# block is cheap. Reset the blocked gauge for this run; block() sets it to 1
# only on a refusal. This is what makes unattended minor upgrades safe: the
# chain proceeds when the cluster supports the target and halts+alerts when
# it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
push k8s_upgrade_blocked 0
# refusal is cheap. The gate CLASSIFIES the refusal (exit code):
# 0 safe -> proceed
# 2 actionable -> record_blocked (a newer addon version would clear it)
# 4 held -> record_held (waiting on upstream / a pinned addon)
# 3/other err -> fail-safe: treat as actionable block
# blocked/held push the gauge DEFINITIVELY (one value per run — no pre-reset
# flap that would re-notify the alert nightly) and set HALT_CHAIN so the Job
# Completes cleanly without advancing the chain. This is what makes
# unattended minor upgrades safe AND quiet: proceed when supported, alert
# only when there's something to do, hold silently when there isn't.
local gate_out gate_rc=0
gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
echo "compat-gate passed for v$TARGET_VERSION"
case "$gate_rc" in
0)
push k8s_upgrade_blocked 0
push k8s_upgrade_held 0
echo "compat-gate passed for v$TARGET_VERSION"
;;
4)
record_held "$gate_out"
return 0
;;
2)
record_blocked "$gate_out"
return 0
;;
*)
record_blocked "gate ERROR (rc=$gate_rc) — failing safe as an actionable block:"$'\n'"$gate_out"
return 0
;;
esac
# 1. All nodes Ready + no pressure
local bad_nodes
@ -777,6 +823,8 @@ phase_postflight() {
push k8s_upgrade_in_flight 0
push k8s_upgrade_snapshot_taken 0
push k8s_upgrade_started_timestamp 0
push k8s_upgrade_blocked 0
push k8s_upgrade_held 0
slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
}