k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
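
A minimal sketch of the classification rule above, for illustration only. The
matrix shape (`current`, `supports`, `pinned`) and the function names are
hypothetical and are not taken from compat-gate.py or the real
addon-compat.json schema. Exit codes 2 and 4 come from this message; exit 0 for
"no blockers" is an assumption.

```python
from typing import Optional

# 2 and 4 are the exit codes named in the commit message; 0 is assumed.
EXIT_OK, EXIT_BLOCKED, EXIT_HELD = 0, 2, 4


def classify(entry: dict, target: str) -> Optional[str]:
    """Return None if the installed addon supports target, else a blocker class."""
    # Hypothetical schema: {"current": "<ver>", "supports": {"<ver>": ["1.35", ...]}}
    supports = entry["supports"]
    if target in supports.get(entry["current"], []):
        return None  # not a blocker
    if entry.get("pinned"):
        return "PINNED"  # pinned overrides any row that supports the target
    # The current version does not support the target, so any supporting row
    # is some other version (assumed newer in this sketch).
    if any(target in minors for minors in supports.values()):
        return "ACTIONABLE"
    return "WAITING"


def gate_exit_code(matrix: dict, target: str) -> int:
    """Combine per-addon classes; held wins when blockers are mixed."""
    classes = [c for c in (classify(e, target) for e in matrix.values()) if c]
    if not classes:
        return EXIT_OK
    if any(c in ("PINNED", "WAITING") for c in classes):
        return EXIT_HELD
    return EXIT_BLOCKED
```

With one WAITING, one PINNED, and one ACTIONABLE addon, this sketch returns 4
(held), which mirrors the "Net effect on 1.36" described below.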

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 10:08:20 +00:00
parent afcd463f39
commit eebb6c8594
9 changed files with 397 additions and 78 deletions


@@ -66,6 +66,17 @@ exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
- postflight also clears `held=0` alongside the existing gauge resets.
### detector (`main.tf`, the `k8s-version-check` CronJob)
- Consequence of the tidy change: refusals now **Complete** instead of Failing,
so the old "re-spawn only a *Failed* preflight" idempotency would skip a
refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
gate refused — chain never advanced) — **silently** (no Slack), so a standing
hold re-evaluates each night without noise.
- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
flag), not for silent re-evaluations — killing the last nightly-noise source.
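
The re-spawn rules in the two bullets above can be read as a small decision
table. The following is a hedged sketch in Python, not the detector's actual
script (which lives in the CronJob in `main.tf`). The function name, the status
strings, and the handling of a still-running preflight are assumptions; only
the `ANNOUNCE` idea and the Complete-but-no-master-Job condition come from this
document.

```python
from typing import Optional, Tuple


def respawn_decision(preflight_status: Optional[str],
                     master_job_exists: bool) -> Tuple[bool, bool]:
    """Return (spawn, announce) for tonight's detector run.

    preflight_status is None when no preflight Job exists for the target,
    otherwise one of "Complete", "Failed", "Running" (hypothetical labels).
    """
    if preflight_status is None:
        return True, True    # genuinely new spawn: announce in Slack
    if preflight_status == "Failed":
        return True, True    # Failed-respawn: announce in Slack
    if preflight_status == "Complete" and not master_job_exists:
        return True, False   # gate refused, chain never advanced: re-evaluate silently
    # Complete with a master Job (chain advanced) or still running: nothing to do.
    return False, False
```

The second element of the tuple plays the role of the `ANNOUNCE` flag: only the
first two branches would send the spawn Slack message, so a standing hold is
re-evaluated every night without a notification.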
### `addon-compat.json`
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
`26.3 → 1.36` row stays; `pinned` overrides classification to held). Document