k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night, even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
- ACTIONABLE: a newer addon version in addon-compat.json supports the target
  -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
  nightly report).
- WAITING-on-upstream: no released version supports the target yet -> held.
- PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

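As an illustration of the classification rule above, here is a minimal Python
sketch. It is not the code in compat-gate.py: `parse`, `classify`, and `verdict`
are hypothetical names, the matrix rows are illustrative values borrowed from
this commit's tests and comments, and it assumes the running addon version is
at or above the lowest floor in the matrix.

```python
def parse(version):
    """'1.36' -> (1, 36), so versions compare numerically rather than as strings."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def classify(addon, running, target):
    """Return None when the running addon supports the target k8s minor,
    otherwise one of 'ACTIONABLE', 'WAITING', 'PINNED'."""
    floors = sorted(addon["max_k8s"], key=parse)
    # Version floors the running addon version has already reached.
    reached = [f for f in floors if parse(f) <= parse(running)]
    if reached and parse(addon["max_k8s"][reached[-1]]) >= parse(target):
        return None
    if addon.get("pinned"):
        return "PINNED"  # the pin overrides actionable
    newer_ok = [f for f in floors
                if parse(f) > parse(running)
                and parse(addon["max_k8s"][f]) >= parse(target)]
    return "ACTIONABLE" if newer_ok else "WAITING"


def verdict(classes):
    """0 = safe, 2 = actionable block, 4 = held (held wins on a mix)."""
    if not classes:
        return 0
    return 4 if any(c in ("WAITING", "PINNED") for c in classes) else 2


# Illustrative rows only (a hypothetical subset of a compat matrix).
calico = {"max_k8s": {"3.30": "1.35", "3.32": "1.36"}}
kyverno = {"max_k8s": {"1.16": "1.34", "1.18": "1.35"}}
gpu = {"max_k8s": {"25.10": "1.35", "26.3": "1.36"}, "pinned": True}

classes = [classify(calico, "3.30", "1.36"),
           classify(kyverno, "1.18", "1.36"),
           classify(gpu, "25.10", "1.36")]
print(classes, verdict(classes))
```

With these rows the three addons come out ACTIONABLE, WAITING and PINNED, and
the mix resolves to exit 4 because a held blocker is present.
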
Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.

gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

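The "one gauge value per run" behaviour can be pictured with a small Python
sketch. It is only a model of the preflight's case statement, not the shell
code itself; `gauges_for` and the `halt_chain` key are hypothetical names.

```python
def gauges_for(gate_rc):
    """Map a compat-gate exit code to the gauge values pushed for this run
    and whether the chain halts. Each gauge gets exactly one value per run."""
    if gate_rc == 0:
        return {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 0, "halt_chain": False}
    if gate_rc == 4:
        return {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 1, "halt_chain": True}
    # 2 (actionable), 3 (gate error) and anything unexpected fail safe as a block.
    return {"k8s_upgrade_blocked": 1, "k8s_upgrade_held": 0, "halt_chain": True}


# Three nights of a standing hold: the held gauge stays at 1, never dipping to 0.
nights = [gauges_for(4)["k8s_upgrade_held"] for _ in range(3)]
print(nights)
```

Because nothing is reset to 0 before the gate runs, a standing refusal reads as
a constant series rather than a nightly 1->0->1 edge.
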
Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece): no nightly Failed Job, no alert, just the
morning report's HELD line.

Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent afcd463f39
commit eebb6c8594
9 changed files with 397 additions and 78 deletions
@ -66,6 +66,17 @@ exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
  at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
- postflight also clears `held=0` alongside the existing gauge resets.

### detector (`main.tf`, the `k8s-version-check` CronJob)
- Consequence of the tidy change: refusals now **Complete** instead of Failing,
  so the old "re-spawn only a *Failed* preflight" idempotency would skip a
  refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
  preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
  gate refused, so the chain never advanced), and do so **silently** (no Slack),
  so a standing hold re-evaluates each night without noise.
- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
  Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
  flag), not for silent re-evaluations. This kills the last nightly-noise source.

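The detector's re-spawn rule can be read as a small decision table. The Python
below is a sketch of that table only; the real logic is shell inside the
CronJob, and `detector_action` with its boolean parameters is a hypothetical
stand-in for the `kubectl get job` checks.

```python
def detector_action(preflight_exists, failed, complete, master_exists):
    """Return (action, announce) for tonight's detector run.

    action   -- "spawn", "respawn" (delete then spawn) or "skip"
    announce -- whether the spawn is posted to Slack
    """
    if not preflight_exists:
        return ("spawn", True)       # genuinely new target
    if failed:
        return ("respawn", True)     # transient abort: re-spawn, announced
    if complete and not master_exists:
        return ("respawn", False)    # gate refused: re-evaluate silently
    return ("skip", False)           # active, or the chain already advanced


# A standing hold: preflight Completed, no master Job was ever spawned.
print(detector_action(True, False, True, False))
```

The only silent path is the refused-but-Complete case, which is what lets a
standing hold re-check every night without a Slack message.
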
### `addon-compat.json`
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
  `26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
@ -483,31 +483,49 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
exit 0
|
||||
fi
|
||||
|
||||
slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
|
||||
echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"
|
||||
|
||||
if [ "$DRY_RUN" = "true" ]; then
|
||||
slack "DRY_RUN — not spawning preflight Job"
|
||||
slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# 7. Spawn Job 0 (preflight) via envsubst on the job-template
|
||||
# Idempotency: deterministic name reconciles via `apply`.
|
||||
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
|
||||
MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
|
||||
ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.
|
||||
|
||||
# Retry-on-failure idempotency: skip only if an existing preflight
|
||||
# Job is Active/Complete. A *Failed* preflight (aborted on a
|
||||
# transient gate, e.g. a spurious critical alert) is deleted and
|
||||
# re-spawned — otherwise its deterministic name + 7d TTL wedges
|
||||
# the entire pipeline until it ages out. (Stuck-pipeline fix
|
||||
# 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
|
||||
# Idempotency + nightly re-evaluation:
|
||||
# - FAILED preflight (transient gate abort, e.g. a spurious
|
||||
# critical alert / unhealthy node) -> delete + re-spawn, announced.
|
||||
# - COMPLETE preflight but NO master Job spawned -> the compat
|
||||
# gate REFUSED the target (blocked/held now Complete cleanly
|
||||
# rather than Failing). Re-spawn SILENTLY so the gate re-checks
|
||||
# nightly (the refusal may have cleared: addon upgraded / matrix
|
||||
# updated / upstream shipped) WITHOUT nightly Slack noise for a
|
||||
# standing refusal — the morning report (+ K8sUpgradeBlocked for
|
||||
# actionable) is the signal.
|
||||
# - Otherwise (Active, or Complete with the chain advanced) -> skip.
|
||||
# The old "Failed-only re-spawn" left a refused-but-Complete preflight
|
||||
# skipped until its 7d TTL — too slow now that refusals Complete
|
||||
# instead of Failing (2026-06-28). Deterministic names; `apply`
|
||||
# reconciles. (Stuck-pipeline history: a transient critical alert
|
||||
# wedged 1.34.9 for 5 days, 2026-06-17 — hence Failed always re-spawns.)
|
||||
if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
|
||||
JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
|
||||
JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
|
||||
if [ "$JOB_FAILED" = "True" ]; then
|
||||
slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
|
||||
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
|
||||
elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
|
||||
echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
|
||||
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
|
||||
ANNOUNCE=no
|
||||
else
|
||||
slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
|
||||
echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
|
@ -521,7 +539,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
< /template/job-template.yaml \
|
||||
| /usr/local/bin/kubectl apply -f -
|
||||
|
||||
slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
|
||||
if [ "$ANNOUNCE" = "yes" ]; then
|
||||
slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
|
||||
fi
|
||||
EOT
|
||||
]
|
||||
env {
@ -1,5 +1,5 @@
|
|||
{
|
||||
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
|
||||
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports. An addon entry may also set \"pinned\": true (+ \"pin_reason\") to mark it deliberately held: the gate classifies its block as PINNED/held (quiet — no alert, nightly report only) even if a supporting version exists, for upgrades coupled to other work we're not ready for (e.g. gpu-operator's NVIDIA-driver/Ubuntu coupling). A block with NO supporting version in the matrix is WAITING (also quiet); a block a newer matrix version would clear is ACTIONABLE (alerts).",
|
||||
"addons": [
|
||||
{
|
||||
"name": "calico",
|
||||
|
|
@ -48,7 +48,9 @@
|
|||
"max_k8s": {
|
||||
"25.10": "1.35",
|
||||
"26.3": "1.36"
|
||||
}
|
||||
},
|
||||
"pinned": true,
|
||||
"pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu/kernel; held until the driver/OS path is ready. Unpin = delete pinned + pin_reason."
|
||||
}
|
||||
],
|
||||
"containerd_min": {
@ -14,9 +14,20 @@ classes of blocker:
|
|||
3. containerd — every node's containerd >= the target's floor, if the matrix
|
||||
declares one (e.g. the 1.7.x -> k8s 1.37 cliff)
|
||||
|
||||
Each reason line is tagged with its class so the caller can act differently:
|
||||
[ACTIONABLE] a newer addon version (present in the matrix) supports the
|
||||
target — upgrading it clears the block. Also covers removed-API
|
||||
/ containerd blocks and the unreadable-version fail-safe.
|
||||
[WAITING] no released addon version supports the target yet — only an
|
||||
upstream release can clear it (e.g. kyverno/ESO behind a new k8s).
|
||||
[PINNED] a supporting version exists but the addon is deliberately held
|
||||
(matrix `pinned: true`, e.g. gpu-operator's driver/OS coupling).
|
||||
|
||||
Exit 0 = safe, proceed.
|
||||
Exit 2 = BLOCKED — prints one human reason per line (caller pushes
|
||||
k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
|
||||
Exit 2 = BLOCKED, actionable — >=1 blocker, none held. Caller pushes
|
||||
k8s_upgrade_blocked=1 (-> K8sUpgradeBlocked alert) and halts.
|
||||
Exit 4 = HELD — >=1 waiting-upstream/pinned blocker (held wins over actionable).
|
||||
Caller pushes k8s_upgrade_held=1 (no alert; nightly report only) and halts.
|
||||
Exit 3 = the gate itself errored — caller treats as a block (fail safe).
|
||||
|
||||
Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
|
||||
|
|
@ -62,6 +73,20 @@ def running_minor():
|
|||
return min(minors) if minors else None
|
||||
|
||||
|
||||
def _addon_resolution(a, tgt, running_ver):
|
||||
"""For a BLOCKING addon, decide whether a newer matrix version would clear
|
||||
the block. Returns ("actionable", hint) when some version key has
|
||||
max_k8s >= target AND is newer than the running version (upgrading it clears
|
||||
the block); otherwise ("waiting", hint) — nothing released supports the
|
||||
target yet, so only an upstream release can clear it."""
|
||||
sufficient = [floor for floor, mk in a["max_k8s"].items()
|
||||
if minor(mk) and minor(mk) >= tgt and minor(floor) > minor(running_ver)]
|
||||
if sufficient:
|
||||
best = min(sufficient, key=minor) # smallest sufficient upgrade
|
||||
return "actionable", f"upgrade {a['name']} to >= {best}"
|
||||
return "waiting", f"no released {a['name']} version supports k8s {tgt[0]}.{tgt[1]} yet"
|
||||
|
||||
|
||||
def check_addons(matrix, tgt, running):
|
||||
# A target at or below the RUNNING minor (a patch, or a same/lower minor)
|
||||
# crosses into no new k8s minor, so every installed addon is already
|
||||
|
|
@ -77,25 +102,36 @@ def check_addons(matrix, tgt, running):
|
|||
"-o", "jsonpath={.spec.template.spec.containers[*].image}"])
|
||||
m = re.search(a["image_re"], img or "")
|
||||
if not m:
|
||||
# Fail safe: if we can't read the running version, don't upgrade blind.
|
||||
reasons.append(f"addon {a['name']}: could not read running version "
|
||||
f"(img='{img or 'not found'}') — refusing to upgrade blind")
|
||||
# Fail safe: can't read the running version → block; a human must
|
||||
# look (ACTIONABLE), never upgrade blind.
|
||||
reasons.append(f"[ACTIONABLE] addon {a['name']}: could not read running "
|
||||
f"version (img='{img or 'not found'}') — refusing to upgrade blind")
|
||||
continue
|
||||
running = m.group(1) # e.g. "3.26"
|
||||
running_ver = m.group(1) # e.g. "3.26"
|
||||
# max_k8s maps an addon-version floor -> highest supported k8s minor.
|
||||
# Pick the highest floor that is <= the running version.
|
||||
max_k8s = None
|
||||
for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
|
||||
if minor(running) >= minor(floor):
|
||||
if minor(running_ver) >= minor(floor):
|
||||
max_k8s = mk
|
||||
break
|
||||
if max_k8s is None:
|
||||
reasons.append(f"addon {a['name']} v{running}: below the lowest version "
|
||||
f"in the compat matrix — unknown k8s support")
|
||||
reasons.append(f"[ACTIONABLE] addon {a['name']} v{running_ver}: below the lowest "
|
||||
f"version in the compat matrix — unknown k8s support")
|
||||
continue
|
||||
if tgt > minor(max_k8s):
|
||||
reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
|
||||
f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
|
||||
base = (f"addon {a['name']} v{running_ver} supports k8s <= {max_k8s}; "
|
||||
f"target {tgt[0]}.{tgt[1]} exceeds it")
|
||||
# A deliberately-pinned addon is HELD even if a newer version exists
|
||||
# (e.g. gpu-operator 26.3 supports 1.36 but its driver/OS coupling
|
||||
# means we don't take it yet) — the pin overrides actionable.
|
||||
if a.get("pinned"):
|
||||
why = a.get("pin_reason", "deliberately pinned")
|
||||
reasons.append(f"[PINNED] {base} — pinned ({why}); holding")
|
||||
else:
|
||||
kind, hint = _addon_resolution(a, tgt, running_ver)
|
||||
tag = "ACTIONABLE" if kind == "actionable" else "WAITING"
|
||||
reasons.append(f"[{tag}] {base} — {hint}")
|
||||
return reasons
|
||||
|
||||
|
||||
|
|
@ -109,11 +145,11 @@ def check_removed_apis(tgt):
|
|||
rr = lbl.get("removed_release", "")
|
||||
if rr and minor(rr) and tgt >= minor(rr):
|
||||
g = lbl.get("group") or "core"
|
||||
reasons.append(f"deprecated API {g}/{lbl.get('version')} "
|
||||
reasons.append(f"[ACTIONABLE] deprecated API {g}/{lbl.get('version')} "
|
||||
f"{lbl.get('resource')} is in use and is removed in "
|
||||
f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
|
||||
except Exception as e:
|
||||
reasons.append(f"removed-API check could not query Prometheus ({e}) — "
|
||||
reasons.append(f"[ACTIONABLE] removed-API check could not query Prometheus ({e}) — "
|
||||
f"refusing to upgrade blind")
|
||||
return reasons
|
||||
|
||||
|
|
@ -132,11 +168,28 @@ def check_containerd(matrix, tgt):
|
|||
name, _, ver = line.partition(" ")
|
||||
cv = ver.replace("containerd://", "")
|
||||
if minor(cv) and minor(cv) < minor(floor):
|
||||
reasons.append(f"node {name} containerd {cv} < required {floor} "
|
||||
reasons.append(f"[ACTIONABLE] node {name} containerd {cv} < required {floor} "
|
||||
f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
|
||||
return reasons
|
||||
|
||||
|
||||
def held_reason(r):
|
||||
"""True for a blocker the cluster cannot act on now: no released version
|
||||
supports the target (WAITING) or the addon is deliberately pinned (PINNED).
|
||||
These are quiet (no alert) — only an upstream release / a manual unpin clears
|
||||
them, so a nightly 'needs attention' alert would be crying wolf."""
|
||||
return r.startswith("[WAITING]") or r.startswith("[PINNED]")
|
||||
|
||||
|
||||
def exit_code(reasons):
|
||||
"""Map reasons to the gate verdict: 0 safe · 2 actionable block · 4 held.
|
||||
Held WINS over actionable on a mix — if anything is waiting/pinned the target
|
||||
can't proceed yet, so acting on the actionable blockers would be premature."""
|
||||
if not reasons:
|
||||
return 0
|
||||
return 4 if any(held_reason(r) for r in reasons) else 2
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print("usage: compat-gate.py <target-k8s-version> (matrix JSON on stdin)")
|
||||
|
|
@ -158,9 +211,9 @@ def main():
|
|||
if reasons:
|
||||
for r in reasons:
|
||||
print(r)
|
||||
sys.exit(2)
|
||||
print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
|
||||
sys.exit(0)
|
||||
else:
|
||||
print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
|
||||
sys.exit(exit_code(reasons))
|
||||
|
||||
|
||||
if __name__ == "__main__":
@ -69,6 +69,29 @@ def fmt_age(seconds):
|
|||
return f"{seconds / 86400:.1f}d ago"
|
||||
|
||||
|
||||
def _render_reasons(blocker_reasons):
|
||||
"""Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
|
||||
tag into labelled sections, stripping the tag from each bullet. Untagged
|
||||
lines (older reason format) fall back to a generic 'Blockers' list. PURE.
|
||||
Returns a list of message lines."""
|
||||
lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
|
||||
out, shown = [], set()
|
||||
for title, tag in (("Action needed", "[ACTIONABLE]"),
|
||||
("Waiting on upstream", "[WAITING]"),
|
||||
("Pinned (held by us)", "[PINNED]")):
|
||||
sub = [l for l in lines if l.startswith(tag)]
|
||||
if sub:
|
||||
out.append(f"{title}:")
|
||||
for l in sub:
|
||||
shown.add(l)
|
||||
out.append(f" • {l[len(tag):].strip()}")
|
||||
rest = [l for l in lines if l not in shown]
|
||||
if rest:
|
||||
out.append("Blockers:")
|
||||
out.extend(f" • {l}" for l in rest)
|
||||
return out
|
||||
|
||||
|
||||
def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
||||
"""Build the Slack message text from gathered facts. PURE.
|
||||
|
||||
|
|
@ -98,6 +121,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
|
||||
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
|
||||
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
|
||||
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
|
||||
|
||||
if avail:
|
||||
lbl = avail[0][0]
|
||||
|
|
@ -105,7 +129,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
kind = lbl.get("kind", "?")
|
||||
tgt_line = f"Detected target: *{target}* ({kind})"
|
||||
if blocked:
|
||||
headline = f"🔴 BLOCKED — compat gate refused {target}"
|
||||
# actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
|
||||
headline = f"🔴 BLOCKED (action needed) — {target}"
|
||||
elif held:
|
||||
# waiting on upstream and/or a pinned addon — nothing to do but wait;
|
||||
# intentionally NO alert, this nightly line is the only signal
|
||||
headline = f"⏸️ HELD — {target} not yet upgradable"
|
||||
elif len(versions) == 1 and target == versions[0]:
|
||||
headline = f"🟢 UPGRADED — all nodes now on {target}"
|
||||
else:
|
||||
|
|
@ -120,12 +149,8 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
|
||||
msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
|
||||
|
||||
if blocked and blocker_reasons:
|
||||
msg.append("Blockers (live):")
|
||||
for r in blocker_reasons.splitlines():
|
||||
r = r.strip()
|
||||
if r:
|
||||
msg.append(f" • {r}")
|
||||
if (blocked or held) and blocker_reasons:
|
||||
msg.extend(_render_reasons(blocker_reasons))
|
||||
|
||||
if jobs:
|
||||
msg.append("Chain jobs (recent):")
|
||||
|
|
@ -213,7 +238,8 @@ def main():
|
|||
|
||||
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
|
||||
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
|
||||
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
|
||||
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
|
||||
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
|
||||
|
||||
msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
|
||||
post_slack(msg)
@ -95,3 +95,121 @@ def test_running_minor_from_kubectl(monkeypatch):
|
|||
# oldest kubelet wins (mirrors the detector): node2 on 1.33 is the floor.
|
||||
monkeypatch.setattr(cg, "kget", lambda args: "v1.34.9\nv1.33.5\nv1.34.9")
|
||||
assert cg.running_minor() == (1, 33)
|
||||
|
||||
|
||||
# --- block classification: actionable / waiting-upstream / pinned ----------
|
||||
# A block is ACTIONABLE if a newer addon version in the matrix supports the
|
||||
# target (we can upgrade to clear it), WAITING if no released version supports
|
||||
# the target yet (only upstream can clear it), or PINNED if a version exists but
|
||||
# we deliberately hold the addon. Held (waiting|pinned) is quiet; actionable
|
||||
# alerts.
|
||||
KYVERNO_MATRIX = {
|
||||
"addons": [{
|
||||
"name": "kyverno",
|
||||
"namespace": "kyverno",
|
||||
"kind": "deployment",
|
||||
"resource": "kyverno-admission-controller",
|
||||
"image_re": r"kyverno:v(\d+\.\d+)",
|
||||
"max_k8s": {"1.16": "1.34", "1.18": "1.35"},
|
||||
}]
|
||||
}
|
||||
GPU_MATRIX = {
|
||||
"addons": [{
|
||||
"name": "gpu-operator",
|
||||
"namespace": "nvidia",
|
||||
"kind": "deployment",
|
||||
"resource": "gpu-operator",
|
||||
"image_re": r"gpu-operator:v(\d+\.\d+)",
|
||||
"max_k8s": {"25.10": "1.35", "26.3": "1.36"},
|
||||
"pinned": True,
|
||||
"pin_reason": "needs newer NVIDIA driver + Ubuntu release",
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
def test_actionable_when_higher_version_supports_target(monkeypatch):
|
||||
# calico 3.30 (ceiling 1.35), target 1.36, matrix has 3.32 -> 1.36:
|
||||
# upgrading calico WOULD clear it -> ACTIONABLE, with a remediation hint.
|
||||
_img(monkeypatch, "quay.io/calico/node:v3.30.7")
|
||||
reasons = cg.check_addons(CALICO_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
assert "3.32" in reasons[0] and "calico" in reasons[0]
|
||||
|
||||
|
||||
def test_waiting_when_no_version_supports_target(monkeypatch):
|
||||
# kyverno 1.18 is the matrix ceiling (k8s 1.35); target 1.36 has NO
|
||||
# supporting version -> WAITING on upstream (nothing to upgrade to).
|
||||
_img(monkeypatch, "kyverno/kyverno:v1.18.1")
|
||||
reasons = cg.check_addons(KYVERNO_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[WAITING]"), reasons
|
||||
assert "kyverno" in reasons[0]
|
||||
|
||||
|
||||
def test_pinned_addon_is_held_not_actionable(monkeypatch):
|
||||
# gpu-operator 25.10, target 1.36; 26.3 supports 1.36 BUT the entry is
|
||||
# pinned -> classified PINNED (held), never ACTIONABLE.
|
||||
_img(monkeypatch, "nvcr.io/nvidia/gpu-operator:v25.10.0")
|
||||
reasons = cg.check_addons(GPU_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[PINNED]"), reasons
|
||||
assert "gpu-operator" in reasons[0]
|
||||
|
||||
|
||||
def test_unreadable_addon_tagged_actionable(monkeypatch):
|
||||
# fail-safe block on an unreadable image is ACTIONABLE (a human must look).
|
||||
_img(monkeypatch, "")
|
||||
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
|
||||
assert reasons and reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
|
||||
|
||||
def test_existing_reasons_are_tagged(monkeypatch):
|
||||
# the legacy "ceiling below target, newer version exists" case is ACTIONABLE.
|
||||
_img(monkeypatch, "external-secrets/external-secrets:v0.12.1")
|
||||
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
|
||||
assert reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
|
||||
|
||||
def test_held_reason_classifier():
|
||||
assert cg.held_reason("[WAITING] x")
|
||||
assert cg.held_reason("[PINNED] x")
|
||||
assert not cg.held_reason("[ACTIONABLE] x")
|
||||
assert not cg.held_reason("untagged")
|
||||
|
||||
|
||||
def test_exit_code_mapping():
|
||||
assert cg.exit_code([]) == 0
|
||||
assert cg.exit_code(["[ACTIONABLE] x"]) == 2
|
||||
assert cg.exit_code(["[WAITING] x"]) == 4
|
||||
assert cg.exit_code(["[PINNED] x"]) == 4
|
||||
# held wins on a mix: an upstream/pinned wait can't be cleared by acting now
|
||||
assert cg.exit_code(["[ACTIONABLE] x", "[WAITING] y"]) == 4
|
||||
|
||||
|
||||
def test_real_matrix_136_is_held(monkeypatch):
|
||||
"""Regression guard on the SHIPPED addon-compat.json: at today's running
|
||||
versions a 1.36 jump must be HELD (exit 4) — calico ACTIONABLE (3.32 in the
|
||||
matrix), ESO+kyverno WAITING (no 1.36 release), gpu-operator PINNED. Catches
|
||||
a matrix edit that silently turns the quiet held state into a nightly alert."""
|
||||
import json as _json
|
||||
matrix = _json.loads((HERE / "addon-compat.json").read_text())
|
||||
running_imgs = {
|
||||
"calico-system": "quay.io/calico/node:v3.30.7",
|
||||
"external-secrets": "ghcr.io/external-secrets/external-secrets:v2.6.0",
|
||||
"kyverno": "ghcr.io/kyverno/kyverno:v1.18.1",
|
||||
"nvidia": "nvcr.io/nvidia/gpu-operator:v25.10.0",
|
||||
}
|
||||
|
||||
def fake_kget(args):
|
||||
ns = args[args.index("-n") + 1] if "-n" in args else ""
|
||||
return running_imgs.get(ns, "")
|
||||
|
||||
monkeypatch.setattr(cg, "kget", fake_kget)
|
||||
reasons = cg.check_addons(matrix, (1, 36), (1, 35))
|
||||
pick = lambda name: next(r for r in reasons if name in r)
|
||||
assert pick("calico").startswith("[ACTIONABLE]"), reasons
|
||||
assert pick("external-secrets").startswith("[WAITING]"), reasons
|
||||
assert pick("kyverno").startswith("[WAITING]"), reasons
|
||||
assert pick("gpu-operator").startswith("[PINNED]"), reasons
|
||||
assert cg.exit_code(reasons) == 4 # held wins
@ -79,3 +79,41 @@ def test_compose_includes_recent_jobs():
|
|||
jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
|
||||
assert "k8s-upgrade-preflight-1-35-6: Failed" in out
|
||||
|
||||
|
||||
# --- held (waiting-upstream / pinned) vs actionable-blocked rendering -------
|
||||
METRICS_HELD = f"""# TYPE k8s_upgrade_available gauge
|
||||
k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.35.6",target="1.36.2"}} 1
|
||||
k8s_upgrade_held{{instance="",job="k8s-version-upgrade"}} 1
|
||||
k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 0
|
||||
k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
|
||||
"""
|
||||
NODES_135 = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
|
||||
|
||||
|
||||
def test_compose_held_headline_and_grouped_reasons():
|
||||
m = nr.parse_metrics(METRICS_HELD)
|
||||
reasons = (
|
||||
"[WAITING] addon kyverno v1.18 supports k8s <= 1.35; target 1.36 exceeds it — no released kyverno version supports k8s 1.36 yet\n"
|
||||
"[PINNED] addon gpu-operator v25.10 supports k8s <= 1.35; target 1.36 exceeds it — pinned (driver/OS); holding\n"
|
||||
"[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
|
||||
)
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_135, m, reasons, [])
|
||||
# held headline, NOT a red actionable block
|
||||
assert "⏸️ HELD" in out and "1.36.2" in out
|
||||
assert "🔴 BLOCKED" not in out
|
||||
# grouped by class
|
||||
assert "Waiting on upstream" in out and "kyverno" in out
|
||||
assert "Pinned" in out and "gpu-operator" in out
|
||||
# the lone actionable piece is still listed so eventual scope is visible
|
||||
assert "calico" in out
|
||||
# tags are stripped from the rendered bullets (no raw "[WAITING]")
|
||||
assert "[WAITING]" not in out
|
||||
|
||||
|
||||
def test_compose_blocked_groups_actionable():
|
||||
m = nr.parse_metrics(METRICS_BLOCKED) # blocked=1
|
||||
reasons = "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
|
||||
assert "🔴 BLOCKED" in out
|
||||
assert "Action needed" in out and "calico" in out
@ -37,6 +37,12 @@ KUBECTL=kubectl
|
|||
JOB_TEMPLATE=/template/job-template.yaml
|
||||
UPDATE_K8S_SH=/scripts/update_k8s.sh
|
||||
|
||||
# Set to 1 by record_blocked/record_held when the compat-gate refuses the
|
||||
# target. spawn_next() then declines to advance the chain — but the Job still
|
||||
# exits 0, because a gate refusal is a DECISION, not a failure (no Failed Job,
|
||||
# no K8sUpgradeChainJobFailed). Signalling is via the gauges those recorders push.
|
||||
HALT_CHAIN=0
|
||||
|
||||
# SSH targets are node InternalIPs, resolved live from `kubectl get nodes` (see
|
||||
# ssh_target() below) — the pipeline has NO dependency on node DNS records
|
||||
# (`k8s-node<N>.viktorbarzin.lan`). This is what lets a freshly-joined node be
|
||||
|
|
@ -88,17 +94,31 @@ push() {
|
|||
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
||||
}
|
||||
|
||||
# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
|
||||
# the cluster simply isn't ready for this target yet (an addon / in-use API /
|
||||
# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
|
||||
# alert), Slack the reasons, and halt so a human clears the blocker (or a later
|
||||
# run proceeds once it's cleared). This is the "upgrade when we can, alert when
|
||||
# we can't" contract.
|
||||
block() {
|
||||
# Compat-gate verdict recorders. A gate refusal is a DECISION, not a crash: the
|
||||
# Job Completes cleanly and the chain simply doesn't advance (spawn_next checks
|
||||
# HALT_CHAIN). The two outcomes differ only in how they're signalled:
|
||||
# - record_blocked: ACTIONABLE — a newer addon version would clear it.
|
||||
# k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (fires once via
|
||||
# alert-on-change). "upgrade when we can, alert when we can't."
|
||||
# - record_held: WAITING-ON-UPSTREAM or PINNED — nothing to do but wait.
|
||||
# k8s_upgrade_held=1 -> NO alert; the nightly report's ⏸️ line is the
|
||||
# only signal. This is what stops the nightly cry-wolf for unactionable
|
||||
# blocks (kyverno/ESO behind upstream, gpu-operator pinned).
|
||||
# Neither Slacks per-run: the reasons are in the nightly report (it re-runs
|
||||
# compat-gate), and per-run Slack was itself a nightly-noise source.
|
||||
record_blocked() {
|
||||
push k8s_upgrade_blocked 1
|
||||
slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
|
||||
echo "BLOCKED: $1" >&2
|
||||
exit 1
|
||||
push k8s_upgrade_held 0
|
||||
HALT_CHAIN=1
|
||||
echo "BLOCKED (action needed) preflight v$TARGET_VERSION:" >&2
|
||||
printf '%s\n' "$1" >&2
|
||||
}
|
||||
record_held() {
|
||||
push k8s_upgrade_held 1
|
||||
push k8s_upgrade_blocked 0
|
||||
HALT_CHAIN=1
|
||||
echo "HELD (not yet upgradable — waiting upstream / pinned) preflight v$TARGET_VERSION:" >&2
|
||||
printf '%s\n' "$1" >&2
|
||||
}
|
||||
|
||||
halt_on_alert_query() {
|
||||
|
|
@@ -256,6 +276,10 @@ case "$PHASE" in
 esac

 spawn_next() {
+  if [ "${HALT_CHAIN:-0}" = "1" ]; then
+    echo "Chain halted by compat-gate (blocked/held) — not spawning next phase."
+    return 0
+  fi
   [ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }

   local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"

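The HALT_CHAIN guard above can be tried in isolation. The following is a minimal sketch, not the chain script itself: the Job creation is replaced by an `echo`, and the `TARGET_VERSION` and `NEXT_PHASE` values are hypothetical. It shows that a halted chain still returns status 0 (so the Job can Complete) and how the Job name is derived from the version.

```shell
#!/usr/bin/env bash
# Sketch only: the real spawn_next goes on to create a Job; this stub just prints.
spawn_next() {
  if [ "${HALT_CHAIN:-0}" = "1" ]; then
    echo "Chain halted by compat-gate (blocked/held); not spawning next phase."
    return 0   # status 0: a gate decision must not fail the Job
  fi
  [ -z "${NEXT_PHASE:-}" ] && { echo "End of chain."; return 0; }
  # Dots in the version become dashes so the name is a valid Job name.
  local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
  echo "would create Job $job_name"
}

TARGET_VERSION=1.36.0   # hypothetical value
NEXT_PHASE=master       # hypothetical value

HALT_CHAIN=0; spawn_next
HALT_CHAIN=1; spawn_next; echo "exit status: $?"
```

The `${TARGET_VERSION//./-}` expansion is bash-specific, so the sketch assumes bash rather than a POSIX `sh`.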
@@ -315,15 +339,37 @@ phase_preflight() {
   # 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
   #    addon, an in-use deprecated API, or a node's containerd is too old for the
   #    target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
-  #    block is cheap. Reset the blocked gauge for this run; block() sets it to 1
-  #    only on a refusal. This is what makes unattended minor upgrades safe: the
-  #    chain proceeds when the cluster supports the target and halts+alerts when
-  #    it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
-  push k8s_upgrade_blocked 0
+  #    refusal is cheap. The gate CLASSIFIES the refusal (exit code):
+  #      0        safe       -> proceed
+  #      2        actionable -> record_blocked (a newer addon version would clear it)
+  #      4        held       -> record_held (waiting on upstream / a pinned addon)
+  #      3/other  err        -> fail-safe: treat as actionable block
+  #    blocked/held push the gauge DEFINITIVELY (one value per run — no pre-reset
+  #    flap that would re-notify the alert nightly) and set HALT_CHAIN so the Job
+  #    Completes cleanly without advancing the chain. This is what makes
+  #    unattended minor upgrades safe AND quiet: proceed when supported, alert
+  #    only when there's something to do, hold silently when there isn't.
   local gate_out gate_rc=0
   gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
-  if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
-  echo "compat-gate passed for v$TARGET_VERSION"
+  case "$gate_rc" in
+    0)
+      push k8s_upgrade_blocked 0
+      push k8s_upgrade_held 0
+      echo "compat-gate passed for v$TARGET_VERSION"
+      ;;
+    4)
+      record_held "$gate_out"
+      return 0
+      ;;
+    2)
+      record_blocked "$gate_out"
+      return 0
+      ;;
+    *)
+      record_blocked "gate ERROR (rc=$gate_rc) — failing safe as an actionable block:"$'\n'"$gate_out"
+      return 0
+      ;;
+  esac

   # 1. All nodes Ready + no pressure
   local bad_nodes

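The exit-code dispatch in this hunk can be exercised without a cluster. The sketch below is illustrative only and makes several assumptions: `push` is a stub that prints instead of talking to the Pushgateway, `fake_gate` is a hypothetical stand-in for compat-gate.py that simply returns the requested exit code, and `VERDICT` is an extra variable added here for inspection. The point is the `|| gate_rc=$?` capture and the fail-safe `*)` arm.

```shell
#!/usr/bin/env bash
# Sketch only: stubs replace the real helpers so the dispatch can run standalone.
HALT_CHAIN=0
VERDICT=""

push() { echo "gauge $1=$2"; }                          # stub: real one pushes a metric
record_blocked() { push k8s_upgrade_blocked 1; push k8s_upgrade_held 0; HALT_CHAIN=1; VERDICT=blocked; }
record_held()    { push k8s_upgrade_held 1; push k8s_upgrade_blocked 0; HALT_CHAIN=1; VERDICT=held; }
fake_gate() { echo "reason for rc=$1"; return "$1"; }   # hypothetical gate stand-in

classify() {
  local gate_out gate_rc=0
  HALT_CHAIN=0; VERDICT=""
  # Capture output and exit code together; "|| gate_rc=$?" keeps a non-zero
  # exit from aborting the function if the caller runs under "set -e".
  gate_out=$(fake_gate "$1" 2>&1) || gate_rc=$?
  case "$gate_rc" in
    0) push k8s_upgrade_blocked 0; push k8s_upgrade_held 0; VERDICT=safe ;;
    4) record_held "$gate_out" ;;
    2) record_blocked "$gate_out" ;;
    *) record_blocked "gate ERROR (rc=$gate_rc): $gate_out" ;;   # fail safe
  esac
}

for rc in 0 2 4 3; do
  classify "$rc" >/dev/null
  echo "rc=$rc verdict=$VERDICT halt=$HALT_CHAIN"
done
```

Each verdict writes both gauges, so one run leaves exactly one definitive value per gauge, which is the property the comment in the hunk relies on to avoid re-notifying the alert.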
@@ -777,6 +823,8 @@ phase_postflight() {
   push k8s_upgrade_in_flight 0
   push k8s_upgrade_snapshot_taken 0
   push k8s_upgrade_started_timestamp 0
+  push k8s_upgrade_blocked 0
+  push k8s_upgrade_held 0

   slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
 }

@@ -2308,35 +2308,38 @@ serverFiles:
     # postflight) — NOT a bare `k8s-upgrade-.*`, which would also match
     # helper jobs in the namespace like k8s-upgrade-nightly-report-* and
     # false-fire if one of those failed.
-    # `unless on() k8s_upgrade_blocked == 1` excludes the case where the
-    # preflight terminally failed because the compat gate deliberately
-    # REFUSED the target: block() exits 1 (so the Failed Job re-spawns
-    # nightly) but a refusal is not a wedge — that case is owned by
-    # K8sUpgradeBlocked below, and firing here too is a duplicate false
-    # alarm (observed 2026-06-21: a 1.35.6 block tripped BOTH). A genuine
-    # wedge / crash / halt-on-alert exits 1 WITHOUT pushing
-    # k8s_upgrade_blocked=1, so it still fires. The gauge stays 1 from the
-    # block until the next run's preflight resets it to 0, so the exclusion
-    # holds for the whole blocked period.
+    # (2026-06-28) The old `unless on() (k8s_upgrade_blocked == 1)` clause
+    # is GONE: compat-gate refusals no longer Fail the preflight Job. The
+    # preflight now records the verdict via a gauge (k8s_upgrade_blocked or
+    # k8s_upgrade_held) and exits 0 — the Job Completes, the chain just
+    # doesn't advance (HALT_CHAIN) — so a terminally-Failed chain Job again
+    # means a genuine wedge / crash / halt-on-alert, with nothing to
+    # exclude. (Previously block() exit 1'd, Failing the Job, and this alert
+    # had to exclude blocked==1 to avoid double-firing with K8sUpgradeBlocked.)
     - alert: K8sUpgradeChainJobFailed
-      expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
+      expr: kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0
       for: 15m
       labels:
         severity: warning
         subsystem: k8s-upgrade
       annotations:
         summary: "K8s upgrade chain Job {{ $labels.job_name }} terminally failed ({{ $labels.reason }}) — pipeline wedged. kubectl -n k8s-upgrade get jobs ; kubectl -n k8s-upgrade describe job {{ $labels.job_name }}"
-    # K8sUpgradeBlocked: the k8s-version-upgrade chain pushes
-    # `k8s_upgrade_blocked=1` when the preflight compat gate REFUSES the
-    # target version — the cluster isn't ready (a critical addon lags the
-    # target's support window, an in-use API is deprecated/removed at the
-    # target, or a node's containerd predates the target's minimum). This
-    # is the designed "halt + alert" outcome, NOT a crash: the chain stops
-    # cleanly and the specific blocking reasons are posted to Slack by the
-    # upgrade chain. Same bare-metric pushgateway selector as
-    # K8sUpgradeStalled (job label "k8s-version-upgrade"). To clear: bump
-    # the named addon / migrate the deprecated API usage / upgrade the
-    # node's containerd, then the next nightly run proceeds automatically.
+    # K8sUpgradeBlocked: the chain pushes `k8s_upgrade_blocked=1` when the
+    # preflight compat gate refuses the target with an ACTIONABLE blocker —
+    # a newer version of the lagging addon EXISTS in addon-compat.json and
+    # upgrading it would clear the block (or an in-use deprecated API must be
+    # migrated / a node's containerd bumped). Designed "halt + alert", NOT a
+    # crash: the preflight Job Completes cleanly (exit 0) and the chain just
+    # doesn't advance. Specific reasons + remediation are in the morning
+    # k8s-upgrade nightly report (it re-runs compat-gate). Same bare-metric
+    # pushgateway selector as K8sUpgradeStalled (job "k8s-version-upgrade").
+    # To clear: do the named upgrade/migration; the next nightly run proceeds.
+    #
+    # DELIBERATELY no companion alert for `k8s_upgrade_held=1` — the gate's
+    # WAITING-on-upstream / PINNED verdict. Those can't be actioned now (no
+    # released addon version supports the target yet, or the addon is pinned
+    # e.g. gpu-operator), so a nightly alert would cry wolf. The held state
+    # is surfaced only in the nightly report's ⏸️ line. (2026-06-28)
     - alert: K8sUpgradeBlocked
       expr: k8s_upgrade_blocked == 1
       for: 10m

@@ -2344,8 +2347,8 @@ serverFiles:
         severity: warning
         subsystem: k8s-upgrade
       annotations:
-        summary: "K8s auto-upgrade refused by the preflight compat gate — cluster not ready for the target version. Blocking reasons were posted to Slack by the upgrade chain."
-        description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compatibility gate because the cluster isn't ready for the target version — a critical addon lags the target's support window, an in-use deprecated API would be removed at the target, or a node's containerd is too old. The specific reasons were posted to Slack by the k8s-version-upgrade chain. This is the intended halt-and-alert. To clear it: bump the named addon / migrate the deprecated API usage / upgrade the node's containerd, then the next nightly run proceeds automatically."
+        summary: "K8s auto-upgrade blocked by an ACTIONABLE compat-gate refusal — a lagging addon/API/containerd can be upgraded to clear it. Reasons + remediation are in the morning k8s-upgrade nightly report."
+        description: "An automated Kubernetes upgrade was REFUSED (not crashed) by the preflight compat gate with an ACTIONABLE blocker — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated, or a node's containerd bumped). The preflight Job Completes cleanly; the chain just halts. Specific reasons + remediation are in the morning k8s-upgrade nightly report. To clear: do the named upgrade/migration, then the next nightly run proceeds automatically. NB the gate's WAITING-on-upstream / PINNED verdict (k8s_upgrade_held=1) deliberately does NOT alert — nothing to action until upstream ships support or the pin is lifted; see the nightly report's HELD line."
 - name: "Traefik Ingress"
   rules:
     - alert: TraefikDown