k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
- ACTIONABLE: a newer addon version in addon-compat.json supports the target
  -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
  nightly report).
- WAITING-on-upstream: no released version supports the target yet -> held.
- PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
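
The classification rules above can be sketched roughly as follows. This is a minimal illustration, not the real compat-gate.py: the matrix record layout (`current`, `versions`), the helper names, the choice to check the pin before looking for a newer version, and exit code 0 for "no blockers" are all assumptions. Only the class names, exit codes 2 and 4, and the two gauge names come from the commit message.

```python
# Sketch only: record layout and helper names are assumptions, not the
# real compat-gate.py or addon-compat.json schema.
ACTIONABLE, WAITING, PINNED = "ACTIONABLE", "WAITING", "PINNED"
EXIT_OK, EXIT_BLOCKED, EXIT_HELD = 0, 2, 4  # EXIT_OK = 0 is assumed


def classify(entry, target):
    """Return the blocker class for one addon, or None if it is compatible.

    `entry` is a hypothetical matrix record such as
      {"current": "1.2", "pinned": True, "versions": {"1.2": ["1.35"]}}
    where each addon version maps to the Kubernetes minors it supports.
    """
    versions = entry["versions"]
    if target in versions.get(entry["current"], []):
        return None  # installed version already supports the target
    if entry.get("pinned"):
        return PINNED  # assumed: a pin holds even if a newer version exists
    if any(target in supported for supported in versions.values()):
        return ACTIONABLE  # some released version supports the target
    return WAITING  # no released version supports the target yet


def gate_result(matrix, target):
    """Map per-addon classes to (exit_code, gauges, blockers)."""
    classes = {name: classify(e, target) for name, e in matrix.items()}
    blockers = {n: c for n, c in classes.items() if c is not None}
    if any(c in (WAITING, PINNED) for c in blockers.values()):
        code = EXIT_HELD  # held wins on a mix of actionable and held
    elif blockers:
        code = EXIT_BLOCKED
    else:
        code = EXIT_OK
    gauges = {
        "k8s_upgrade_blocked": int(code == EXIT_BLOCKED),
        "k8s_upgrade_held": int(code == EXIT_HELD),
    }
    return code, gauges, blockers
```

With one actionable addon plus one waiting or pinned addon, `gate_result` returns exit code 4 and sets only `k8s_upgrade_held`, which matches the "held wins on a mix" rule.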

Tidy the block path (Viktor's scope choice):
- Deliberate gate decisions now make the preflight Job Complete cleanly
  (HALT_CHAIN stops chain progression without a non-zero exit), so they no
  longer create Failed Jobs.
- Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from
  K8sUpgradeChainJobFailed.
- The gauge is pushed definitively once per run (no 1->0->1 flap that
  re-notifies).
- The detector silently re-spawns a refused-but-Complete preflight nightly so a
  standing hold still re-evaluates, and only announces genuine new/Failed
  spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).
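
As a rough illustration of the unpin step, the sketch below removes the two pin keys from a matrix entry. Only the `pinned` and `pin_reason` key names come from the commit message; the `current` key, its value, and the `unpin` helper are hypothetical, and the real addon-compat.json layout may differ.

```python
import copy

# Hypothetical addon-compat.json entry for a pinned addon; only `pinned`
# and `pin_reason` are named in the commit message.
pinned_entry = {
    "current": "example-version",
    "pinned": True,
    "pin_reason": "needs a newer NVIDIA driver image + Ubuntu/kernel",
}


def unpin(entry):
    """Return a copy of the entry with both pin keys deleted."""
    out = copy.deepcopy(entry)
    out.pop("pinned", None)
    out.pop("pin_reason", None)
    return out


unpinned_entry = unpin(pinned_entry)
```

Deleting both keys together keeps a stale `pin_reason` from lingering on an addon that is no longer pinned.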

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece): no nightly Failed Job, no alert, just the
morning report's HELD line.

Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

parent afcd463f39
commit eebb6c8594
9 changed files with 397 additions and 78 deletions

@@ -69,6 +69,29 @@ def fmt_age(seconds):
     return f"{seconds / 86400:.1f}d ago"
 
 
+def _render_reasons(blocker_reasons):
+    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
+    tag into labelled sections, stripping the tag from each bullet. Untagged
+    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
+    Returns a list of message lines."""
+    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
+    out, shown = [], set()
+    for title, tag in (("Action needed", "[ACTIONABLE]"),
+                       ("Waiting on upstream", "[WAITING]"),
+                       ("Pinned (held by us)", "[PINNED]")):
+        sub = [l for l in lines if l.startswith(tag)]
+        if sub:
+            out.append(f"{title}:")
+            for l in sub:
+                shown.add(l)
+                out.append(f" • {l[len(tag):].strip()}")
+    rest = [l for l in lines if l not in shown]
+    if rest:
+        out.append("Blockers:")
+        out.extend(f" • {l}" for l in rest)
+    return out
+
+
 def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
     """Build the Slack message text from gathered facts. PURE.
 
@@ -98,6 +121,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
 
     avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
     blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
+    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
 
     if avail:
         lbl = avail[0][0]
@@ -105,7 +129,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
         kind = lbl.get("kind", "?")
         tgt_line = f"Detected target: *{target}* ({kind})"
         if blocked:
-            headline = f"🔴 BLOCKED — compat gate refused {target}"
+            # actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
+            headline = f"🔴 BLOCKED (action needed) — {target}"
+        elif held:
+            # waiting on upstream and/or a pinned addon — nothing to do but wait;
+            # intentionally NO alert, this nightly line is the only signal
+            headline = f"⏸️ HELD — {target} not yet upgradable"
         elif len(versions) == 1 and target == versions[0]:
             headline = f"🟢 UPGRADED — all nodes now on {target}"
         else:
@@ -120,12 +149,8 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
 
     msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
 
-    if blocked and blocker_reasons:
-        msg.append("Blockers (live):")
-        for r in blocker_reasons.splitlines():
-            r = r.strip()
-            if r:
-                msg.append(f" • {r}")
+    if (blocked or held) and blocker_reasons:
+        msg.extend(_render_reasons(blocker_reasons))
 
     if jobs:
         msg.append("Chain jobs (recent):")
@@ -213,7 +238,8 @@ def main():
 
     avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
     blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
-    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
+    held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
+    reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
 
     msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
     post_slack(msg)