k8s-upgrade: design doc for actionable-vs-held compat-gate classification
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked every night for the 1.36 target, even though the block is unactionable: no kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned (NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. This documents the design: classify each blocker as actionable / waiting- upstream / pinned, keep the alert only for actionable, quiet the held case to the nightly report, and make deliberate gate decisions Complete cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
b3c419e108
commit
afcd463f39
1 changed files with 106 additions and 0 deletions
106
docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
Normal file
106
docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md
Normal file
|
|
@ -0,0 +1,106 @@
|
||||||
|
# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
|
||||||
|
|
||||||
|
**Date:** 2026-06-28
|
||||||
|
**Status:** design → implementation
|
||||||
|
**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
|
||||||
|
next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
|
||||||
|
it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
|
||||||
|
deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
|
||||||
|
release we're not ready for). The result, **every single night**:
|
||||||
|
|
||||||
|
- a **Failed** preflight Job (`block()` exits 1), and
|
||||||
|
- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
|
||||||
|
|
||||||
|
But this block is **not actionable** — there's nothing we can upgrade to clear
|
||||||
|
it; we can only wait for upstream (kyverno/ESO) and, separately, do the
|
||||||
|
gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
|
||||||
|
signal that's indistinguishable from a block we could actually fix.
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Make the gate **classify** each blocker and behave accordingly:
|
||||||
|
|
||||||
|
| Class | Definition | Behaviour |
|
||||||
|
|-------|-----------|-----------|
|
||||||
|
| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
|
||||||
|
| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
|
||||||
|
| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
|
||||||
|
|
||||||
|
Removed-API and containerd blocks are always **actionable**. **Held wins:** if
|
||||||
|
*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
|
||||||
|
acting on the actionable blockers wouldn't unblock it yet. The nightly report
|
||||||
|
still lists everything so the full eventual scope is visible.
|
||||||
|
|
||||||
|
Also (scope decision: "tidy the block path"): deliberate gate decisions
|
||||||
|
(actionable-block **and** held) now make the preflight Job **Complete cleanly**
|
||||||
|
(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
|
||||||
|
exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
|
||||||
|
1 → `K8sUpgradeChainJobFailed`.
|
||||||
|
|
||||||
|
## Design
|
||||||
|
|
||||||
|
### `compat-gate.py`
|
||||||
|
- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
|
||||||
|
- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
|
||||||
|
- `check_addons`: when an addon blocks, decide its class:
|
||||||
|
- `pinned: true` in its matrix entry → `[PINNED]`.
|
||||||
|
- else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
|
||||||
|
- else → `[WAITING]` (`no released X version supports k8s T yet`).
|
||||||
|
- unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
|
||||||
|
- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
|
||||||
|
- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
|
||||||
|
|
||||||
|
### `upgrade-step.sh`
|
||||||
|
- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
|
||||||
|
- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
|
||||||
|
set `HALT_CHAIN=1`, **do not exit**.
|
||||||
|
- `phase_preflight` gate handling routes on the gate's exit code:
|
||||||
|
- `0` → push `blocked=0`+`held=0`, proceed.
|
||||||
|
- `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
|
||||||
|
- `4` → `record_held`, `return 0` (Job Completes, **no alert**).
|
||||||
|
- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
|
||||||
|
at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
|
||||||
|
- postflight also clears `held=0` alongside the existing gauge resets.
|
||||||
|
|
||||||
|
### `addon-compat.json`
|
||||||
|
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
|
||||||
|
`26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
|
||||||
|
the `pinned` flag in `_comment`. Unpinning later = delete two keys.
|
||||||
|
|
||||||
|
### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
|
||||||
|
- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
|
||||||
|
actionable-only; reword annotation (reasons are in the nightly report, not a
|
||||||
|
per-run chain Slack).
|
||||||
|
- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
|
||||||
|
clause — deliberate blocks no longer create Failed Jobs, so the alert again
|
||||||
|
means a genuine wedge.
|
||||||
|
- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
|
||||||
|
nightly report surfaces it). Add a comment recording this.
|
||||||
|
|
||||||
|
### `nightly-report.py`
|
||||||
|
- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
|
||||||
|
- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
|
||||||
|
(fallback bullets for untagged lines, so older reason strings still render).
|
||||||
|
- Fetch reasons when avail AND (blocked OR held).
|
||||||
|
|
||||||
|
## Net effect on 1.36 today
|
||||||
|
**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
|
||||||
|
Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
|
||||||
|
just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
|
||||||
|
kyverno/ESO ship support **and** gpu-operator is unpinned.
|
||||||
|
|
||||||
|
## Tests (TDD)
|
||||||
|
- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
|
||||||
|
removed-API & containerd are actionable, exit_code mapping, + existing
|
||||||
|
patch/safe cases stay green.
|
||||||
|
- `nightly-report`: held headline + grouped reasons; existing tests stay green.
|
||||||
|
- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
|
||||||
|
(bash, not unit-tested).
|
||||||
|
|
||||||
|
## Out of scope (separate follow-up)
|
||||||
|
Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
|
||||||
|
addon-readiness probe). This change only *consumes* the matrix.
|
||||||
Loading…
Add table
Add a link
Reference in a new issue