The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked every night for the 1.36 target, even though the block is unactionable: no kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned (NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. This documents the design: classify each blocker as actionable / waiting-upstream / pinned, keep the alert only for actionable, quiet the held case to the nightly report, and make deliberate gate decisions Complete cleanly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
Date: 2026-06-28
Status: design → implementation
Stack: stacks/k8s-version-upgrade (+ stacks/monitoring alert rules)
Problem
The cluster is on k8s 1.35.6. The nightly k8s-version-check chain detects the
next minor (1.36.2), runs the preflight compat-gate, and the gate refuses
it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
release we're not ready for). The result, every single night:
- a Failed preflight Job (block() exits 1), and
- k8s_upgrade_blocked=1 → the K8sUpgradeBlocked alert.
But this block is not actionable — there's nothing we can upgrade to clear it; we can only wait for upstream (kyverno/ESO) and, separately, do the gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention" signal that's indistinguishable from a block we could actually fix.
Goal
Make the gate classify each blocker and behave accordingly:
| Class | Definition | Behaviour |
|---|---|---|
| actionable | the compat matrix has a newer version of the addon whose max_k8s >= target, and the running version is older, so upgrading it would clear the block | alert (k8s_upgrade_blocked=1 → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
| waiting-upstream | no matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | quiet (k8s_upgrade_held=1, no alert) — nightly report only |
| pinned | a supporting version exists but the addon carries "pinned": true in the matrix (gpu-operator) | quiet (held) |
Removed-API and containerd blocks are always actionable. Held wins: if any blocker is waiting-or-pinned, the whole target is HELD (quiet) — acting on the actionable blockers wouldn't unblock it yet. The nightly report still lists everything so the full eventual scope is visible.
Also (scope decision: "tidy the block path"): deliberate gate decisions
(actionable-block and held) now make the preflight Job Complete cleanly
(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
1 → K8sUpgradeChainJobFailed.
Design
compat-gate.py
- New exit codes: 0 safe · 2 actionable-block · 3 gate-error (fail-safe) · 4 held.
- Each stdout reason line is tagged [ACTIONABLE] / [WAITING] / [PINNED].
- check_addons: when an addon blocks, decide its class:
  - pinned: true in its matrix entry → [PINNED].
  - else a higher matrix version with max_k8s >= target exists → [ACTIONABLE] (upgrade X to >= V).
  - else → [WAITING] (no released X version supports k8s T yet).
  - unreadable image / below-matrix → [ACTIONABLE] (fail-safe: a human must look).
- check_removed_apis, check_containerd: tag [ACTIONABLE].
- exit_code(reasons): 0 if none; 4 if any held_reason (WAITING/PINNED); else 2.
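The classification rules and exit-code mapping above can be sketched roughly as follows. This is an illustration, not the actual compat-gate.py: the matrix entry shape (a "versions" map from addon version to max_k8s), the helper names version_key and minor, and checking the pin before the unreadable/below-matrix case are all assumptions.

```python
HELD_TAGS = ("[WAITING]", "[PINNED]")


def version_key(version):
    """'v1.16.2' -> (1, 16, 2), so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def minor(version):
    """Only major.minor matters when comparing a max_k8s against the target."""
    return version_key(version)[:2]


def classify_addon(name, running, target, entry):
    """Return one tagged reason line for an addon that blocks `target`.

    `entry` uses an assumed schema:
    {"pinned": bool, "pin_reason": str, "versions": {addon_version: max_k8s}}
    """
    versions = entry.get("versions", {})
    if entry.get("pinned"):
        return f"[PINNED] {name}: {entry.get('pin_reason', 'pinned in matrix')}"
    # Unreadable image or a version older than every matrix row: fail safe.
    if running is None or all(version_key(running) < version_key(v) for v in versions):
        return f"[ACTIONABLE] {name}: unreadable or below-matrix version, a human must look"
    newer = sorted(
        (v for v, max_k8s in versions.items()
         if version_key(v) > version_key(running) and minor(max_k8s) >= minor(target)),
        key=version_key,
    )
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {newer[0]}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons):
    """0 safe, 4 held if any WAITING/PINNED reason is present, else 2 actionable."""
    if not reasons:
        return 0
    if any(reason.startswith(HELD_TAGS) for reason in reasons):
        return 4
    return 2
```

Exit code 3 (gate-error) is not produced by exit_code in this sketch; it would come from the gate's own error handling.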
upgrade-step.sh
- New global HALT_CHAIN=0; spawn_next() returns early (no next Job) when set.
- Replace block() with record_blocked() / record_held(): push the gauge, set HALT_CHAIN=1, do not exit.
- phase_preflight gate handling routes on the gate's exit code:
  - 0 → push blocked=0 + held=0, proceed.
  - 2/3 → record_blocked, return 0 (Job Completes, K8sUpgradeBlocked fires).
  - 4 → record_held, return 0 (Job Completes, no alert).
- Push the gauge definitively once per run (remove the pre-reset blocked=0 at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
- postflight also clears held=0 alongside the existing gauge resets.
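The routing above lives in bash; the following is only a small Python model of the same control flow, written to make the verdict-versus-exit-code split easy to check. RunState and route_gate_exit are hypothetical names. Zeroing the other gauge in each branch and returning 1 for an unexpected gate exit code are assumptions, not something the design states.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Stand-in for the script's globals: pushed gauges plus HALT_CHAIN."""
    gauges: dict = field(default_factory=dict)
    halt_chain: bool = False


def route_gate_exit(code, state):
    """Map the gate's exit code to gauges and HALT_CHAIN; return the Job's exit code."""
    if code == 0:  # safe: clear both gauges and let the chain proceed
        state.gauges.update(k8s_upgrade_blocked=0, k8s_upgrade_held=0)
        return 0
    if code in (2, 3):  # actionable block or gate error: alert, Job still Completes
        state.gauges.update(k8s_upgrade_blocked=1, k8s_upgrade_held=0)
        state.halt_chain = True
        return 0
    if code == 4:  # held: quiet, Job still Completes
        state.gauges.update(k8s_upgrade_blocked=0, k8s_upgrade_held=1)
        state.halt_chain = True
        return 0
    return 1  # assumption: anything else is a real failure


def spawn_next(state):
    """Return True if the next Job would be created, False when the chain is halted."""
    return not state.halt_chain
```

Each gauge is written exactly once per run in this model, which is what keeps a standing block from flapping 1→0→1.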
addon-compat.json
- Add "pinned": true + "pin_reason" to the gpu-operator entry (its 26.3 → 1.36 row stays; pinned overrides classification to held). Document the pinned flag in _comment. Unpinning later = delete two keys.
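As a rough illustration of what "delete two keys" means, the snippet below parses a hypothetical gpu-operator entry and unpins it. Only the "pinned" and "pin_reason" keys and the 26.3 → 1.36 row come from the design; the surrounding JSON layout, the pin_reason wording, and the unpin helper are assumptions.

```python
import json

# Hypothetical layout; only "pinned", "pin_reason" and the 26.3 -> 1.36 row
# are taken from the design text.
ENTRY_JSON = """
{
  "gpu-operator": {
    "pinned": true,
    "pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu release",
    "versions": {"26.3": "1.36"}
  }
}
"""


def unpin(entry):
    """Drop the two pin keys; the version rows are left untouched."""
    entry.pop("pinned", None)
    entry.pop("pin_reason", None)
    return entry


matrix = json.loads(ENTRY_JSON)
unpinned = unpin(dict(matrix["gpu-operator"]))
```

Because the supporting row stays in place, removing the pin is enough for the addon to be classified from its matrix versions again.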
stacks/monitoring alert rules (prometheus_chart_values.tpl)
- K8sUpgradeBlocked (k8s_upgrade_blocked == 1): unchanged trigger, now actionable-only; reword annotation (reasons are in the nightly report, not a per-run chain Slack).
- K8sUpgradeChainJobFailed: drop the unless on() (k8s_upgrade_blocked == 1) clause. Deliberate blocks no longer create Failed Jobs, so the alert again means a genuine wedge.
- No alert for k8s_upgrade_held (intentional: nothing to action; the nightly report surfaces it). Add a comment recording this.
nightly-report.py
- Read k8s_upgrade_held. New "⏸️ HELD — <target> not yet upgradable" headline.
- Group reasons by tag: Action needed / Waiting on upstream / Pinned (held by us) (fallback bullets for untagged lines, so older reason strings still render).
- Fetch reasons when avail AND (blocked OR held).
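A minimal sketch of the grouping and fetch condition described above, assuming reasons arrive as plain text lines with the tag at the start. The function names and the bullet rendering are illustrative, not the report's actual code.

```python
import re

# Section titles from the design, keyed by the gate's tags.
SECTIONS = {
    "ACTIONABLE": "Action needed",
    "WAITING": "Waiting on upstream",
    "PINNED": "Pinned (held by us)",
}
TAG_RE = re.compile(r"^\[(ACTIONABLE|WAITING|PINNED)\]\s*(.*)$")


def should_fetch_reasons(avail, blocked, held):
    """Fetch reasons when avail AND (blocked OR held)."""
    return bool(avail and (blocked or held))


def group_reasons(lines):
    """Split reason lines into titled groups; untagged lines are kept separately."""
    groups = {title: [] for title in SECTIONS.values()}
    untagged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            groups[SECTIONS[match.group(1)]].append(match.group(2))
        else:
            untagged.append(line)  # older reason strings without a tag
    return groups, untagged


def render(lines):
    """Render non-empty groups as titled bullets, then fallback bullets."""
    groups, untagged = group_reasons(lines)
    out = []
    for title, items in groups.items():
        if items:
            out.append(f"{title}:")
            out.extend(f"  - {item}" for item in items)
    out.extend(f"  - {item}" for item in untagged)
    return "\n".join(out)
```

Keeping untagged lines as plain bullets means a report run against reasons written by an older gate still shows them instead of dropping them.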
Net effect on 1.36 today
HELD, quiet — waiting on kyverno + ESO (upstream) + gpu-operator (pinned); Calico listed as the lone actionable piece. No nightly Failed Job, no alert — just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once kyverno/ESO ship support and gpu-operator is unpinned.
Tests (TDD)
- compat-gate: waiting / actionable / pinned-is-held / mixed-held-wins, removed-API & containerd are actionable, exit_code mapping, + existing patch/safe cases stay green.
- nightly-report: held headline + grouped reasons; existing tests stay green.
- upgrade-step.sh: shellcheck; manual review of the HALT_CHAIN + gauge flow (bash, not unit-tested).
Out of scope (separate follow-up)
Auto-refreshing the matrix when upstream ships 1.36 support (a periodic addon-readiness probe). This change only consumes the matrix.