monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block

Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
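
One way to sanity-check the gated expression is a promtool unit test. The file
below is a minimal sketch, not part of this commit: the job name
`k8s-upgrade-preflight` is a hypothetical value chosen to match the
`k8s-upgrade-.*` regex, and the series values are made up for illustration. It
evaluates the new expression directly (via `promql_expr_test`) for a deliberate
block, a failure with the gauge reset to 0, and a failure with no gauge at all.

```yaml
# Sketch of a promtool unit test (run with: promtool test rules <this-file>).
# The job_name value and all sample values are illustrative assumptions.
evaluation_interval: 1m
tests:
  # Deliberate compat-gate block: gauge is 1, so the expression returns nothing.
  - name: blocked by compat gate
    interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
        values: '1x10'
      - series: 'k8s_upgrade_blocked'
        values: '1x10'
    promql_expr_test:
      - expr: '(kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)'
        eval_time: 5m
        exp_samples: []

  # Genuine failure: gauge was reset to 0, so the failed-Job series passes through.
  - name: failed with gauge at 0
    interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
        values: '1x10'
      - series: 'k8s_upgrade_blocked'
        values: '0x10'
    promql_expr_test:
      - expr: '(kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)'
        eval_time: 5m
        exp_samples:
          - labels: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
            value: 1

  # Genuine failure with the gauge never pushed: nothing to exclude, still returned.
  - name: failed with gauge absent
    interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="DeadlineExceeded"}'
        values: '1x10'
    promql_expr_test:
      - expr: '(kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)'
        eval_time: 5m
        exp_samples:
          - labels: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="DeadlineExceeded"}'
            value: 1
```

The empty `on()` matches on no labels, so any sample of `k8s_upgrade_blocked`
equal to 1 removes every failed-Job series from the result, whatever labels the
gauge carries. With the gauge at 0 or missing, the right-hand side is empty and
the left-hand side is returned unchanged.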

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-21 16:35:35 +00:00
parent b0ccaf1c65
commit 7270e2be3b
3 changed files with 13 additions and 3 deletions


@@ -2244,8 +2244,18 @@ serverFiles:
 # idempotency guard the next detection cycle deletes + re-spawns the
 # Failed Job (clearing this within ~24h); a sustained firing means it
 # re-failed — investigate the root cause.
+# `unless on() k8s_upgrade_blocked == 1` excludes the case where the
+# preflight terminally failed because the compat gate deliberately
+# REFUSED the target: block() exits 1 (so the Failed Job re-spawns
+# nightly) but a refusal is not a wedge — that case is owned by
+# K8sUpgradeBlocked below, and firing here too is a duplicate false
+# alarm (observed 2026-06-21: a 1.35.6 block tripped BOTH). A genuine
+# wedge / crash / halt-on-alert exits 1 WITHOUT pushing
+# k8s_upgrade_blocked=1, so it still fires. The gauge stays 1 from the
+# block until the next run's preflight resets it to 0, so the exclusion
+# holds for the whole blocked period.
 - alert: K8sUpgradeChainJobFailed
-expr: kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0
+expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
 for: 15m
 labels:
 severity: warning