k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case

The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
- ACTIONABLE: a newer addon version in addon-compat.json supports the target
  -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
  nightly report).
- WAITING-on-upstream: no released version supports the target yet -> held.
- PINNED: addon marked pinned in the matrix (gpu-operator) -> held.

Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

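
The classification rules above can be sketched roughly as follows. This is not the code in compat-gate.py: the matrix shape (`installed`, `supports`), the version labels, exit code 0 for "no blockers", and the choice to let a pin outrank an available bump are all assumptions made for illustration. Only the three classes, the `pinned`/`pin_reason` keys, and exit codes 2 and 4 come from this commit message.

```python
from enum import Enum
from typing import Optional

# Exit codes 2 and 4 are from the commit message; 0 for "no blockers" is assumed.
EXIT_OK, EXIT_BLOCKED, EXIT_HELD = 0, 2, 4


class Blocker(Enum):
    ACTIONABLE = "actionable"        # a listed addon version supports the target
    WAITING = "waiting-on-upstream"  # no listed version supports the target yet
    PINNED = "pinned"                # addon is pinned in the matrix


def classify(addon: dict, target: str) -> Optional[Blocker]:
    """Classify one addon entry; None means it does not block the target."""
    supports = addon["supports"]  # hypothetical shape: {addon_version: [k8s minors]}
    if target in supports.get(addon["installed"], []):
        return None
    if addon.get("pinned"):
        return Blocker.PINNED  # assumed: a pin outranks an available bump
    bump_exists = any(target in minors for minors in supports.values())
    return Blocker.ACTIONABLE if bump_exists else Blocker.WAITING


def gate_exit_code(matrix: dict, target: str) -> int:
    """Held wins on a mix: any WAITING or PINNED blocker yields exit 4."""
    found = [classify(addon, target) for addon in matrix.values()]
    blockers = [b for b in found if b is not None]
    if not blockers:
        return EXIT_OK
    if any(b in (Blocker.WAITING, Blocker.PINNED) for b in blockers):
        return EXIT_HELD
    return EXIT_BLOCKED


# Invented version labels, shaped like the 1.36 situation described above.
matrix = {
    "kyverno": {"installed": "a", "supports": {"a": ["1.35"]}},
    "gpu-operator": {
        "installed": "a",
        "pinned": True,
        "pin_reason": "needs newer driver image",
        "supports": {"a": ["1.35"], "b": ["1.35", "1.36"]},
    },
    "calico": {"installed": "a", "supports": {"a": ["1.35"], "b": ["1.35", "1.36"]}},
}
print(gate_exit_code(matrix, "1.36"))
```

With this sample matrix the mix of waiting, pinned, and actionable blockers resolves to the held exit code, while a matrix containing only the calico entry would resolve to the blocked one.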
Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.
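
The detector's re-spawn decision can be summarized as a small truth table. The sketch below mirrors the shell logic in the diff further down; the function and type names are hypothetical, and the real detector is a shell script that queries Job conditions with kubectl rather than taking them as arguments.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    action: str     # "spawn", "respawn", or "skip"
    announce: bool  # whether the spawn is posted to Slack


def decide(preflight: Optional[str], master_exists: bool) -> Decision:
    """preflight is None (no Job yet), "Active", "Failed", or "Complete"."""
    if preflight is None:
        return Decision("spawn", True)      # genuine new spawn, announced
    if preflight == "Failed":
        return Decision("respawn", True)    # transient gate abort: delete and retry
    if preflight == "Complete" and not master_exists:
        return Decision("respawn", False)   # gate refused: silent nightly re-evaluation
    return Decision("skip", False)          # Active, or Complete with the chain advanced


for state, master in [(None, False), ("Failed", False), ("Complete", False),
                      ("Complete", True), ("Active", False)]:
    print(state, master, decide(state, master))
```

Using the absence of the master Job as the "gate refused" signal avoids storing extra state: a Complete preflight that spawned nothing downstream can only mean the gate stopped the chain.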

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.

gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico is the lone actionable piece). No nightly Failed Job, no alert, just the
morning report's HELD line.

Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent afcd463f39
commit eebb6c8594
9 changed files with 397 additions and 78 deletions
@@ -483,31 +483,49 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
   exit 0
 fi
 
-slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
+echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"
 
 if [ "$DRY_RUN" = "true" ]; then
-  slack "DRY_RUN — not spawning preflight Job"
+  slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
   exit 0
 fi
 
 # 7. Spawn Job 0 (preflight) via envsubst on the job-template
 # Idempotency: deterministic name reconciles via `apply`.
 JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
+MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
+ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.
 
-# Retry-on-failure idempotency: skip only if an existing preflight
-# Job is Active/Complete. A *Failed* preflight (aborted on a
-# transient gate, e.g. a spurious critical alert) is deleted and
-# re-spawned — otherwise its deterministic name + 7d TTL wedges
-# the entire pipeline until it ages out. (Stuck-pipeline fix
-# 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
+# Idempotency + nightly re-evaluation:
+# - FAILED preflight (transient gate abort, e.g. a spurious
+#   critical alert / unhealthy node) -> delete + re-spawn, announced.
+# - COMPLETE preflight but NO master Job spawned -> the compat
+#   gate REFUSED the target (blocked/held now Complete cleanly
+#   rather than Failing). Re-spawn SILENTLY so the gate re-checks
+#   nightly (the refusal may have cleared: addon upgraded / matrix
+#   updated / upstream shipped) WITHOUT nightly Slack noise for a
+#   standing refusal — the morning report (+ K8sUpgradeBlocked for
+#   actionable) is the signal.
+# - Otherwise (Active, or Complete with the chain advanced) -> skip.
+# The old "Failed-only re-spawn" left a refused-but-Complete preflight
+# skipped until its 7d TTL — too slow now that refusals Complete
+# instead of Failing (2026-06-28). Deterministic names; `apply`
+# reconciles. (Stuck-pipeline history: a transient critical alert
+# wedged 1.34.9 for 5 days, 2026-06-17 — hence Failed always re-spawns.)
 if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
   JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
     -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
+  JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
+    -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
   if [ "$JOB_FAILED" = "True" ]; then
     slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
     /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
+  elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
+    echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
+    /usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
+    ANNOUNCE=no
   else
-    slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
+    echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
     exit 0
   fi
 fi
@@ -521,7 +539,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
   < /template/job-template.yaml \
   | /usr/local/bin/kubectl apply -f -
 
-slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
+if [ "$ANNOUNCE" = "yes" ]; then
+  slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
+fi
 EOT
 ]
 env {