monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno 1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked fired as designed.

But the refusal also made the preflight Job exit 1 (block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped K8sUpgradeChainJobFailed too: a duplicate, misleading "pipeline wedged" alarm for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate block sets that gauge (and it stays 1 until the next preflight resets it), so the chain-job-failed alert is suppressed for the blocked period. A genuine wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires. This preserves the alert's original purpose: catching the pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge.

Runbook + automated-upgrades docs updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent b0ccaf1c65
commit 7270e2be3b
3 changed files with 13 additions and 3 deletions
@@ -2244,8 +2244,18 @@ serverFiles:
   # idempotency guard the next detection cycle deletes + re-spawns the
   # Failed Job (clearing this within ~24h); a sustained firing means it
   # re-failed — investigate the root cause.
+  # `unless on() k8s_upgrade_blocked == 1` excludes the case where the
+  # preflight terminally failed because the compat gate deliberately
+  # REFUSED the target: block() exits 1 (so the Failed Job re-spawns
+  # nightly) but a refusal is not a wedge — that case is owned by
+  # K8sUpgradeBlocked below, and firing here too is a duplicate false
+  # alarm (observed 2026-06-21: a 1.35.6 block tripped BOTH). A genuine
+  # wedge / crash / halt-on-alert exits 1 WITHOUT pushing
+  # k8s_upgrade_blocked=1, so it still fires. The gauge stays 1 from the
+  # block until the next run's preflight resets it to 0, so the exclusion
+  # holds for the whole blocked period.
   - alert: K8sUpgradeChainJobFailed
-    expr: kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0
+    expr: (kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)
     for: 15m
     labels:
       severity: warning
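The suppression behaviour of the new expression could be checked with a promtool rule unit test. The following is a minimal sketch and not part of this commit: the Job name `k8s-upgrade-preflight`, the `job="pushgateway"` label on the gauge, and the sample values are assumptions chosen for illustration. Because `on()` matches on an empty label set, any `k8s_upgrade_blocked` sample equal to 1 removes every series on the left-hand side, whatever labels the gauge carries.

```yaml
# Hypothetical test file; run with: promtool test rules <this-file>
# Series names, label values, and timings are illustrative assumptions.
evaluation_interval: 1m

tests:
  # Case 1: the compat gate refused the target, so the gauge is 1.
  # The expression should return nothing; K8sUpgradeBlocked owns this case.
  - interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
        values: '1x30'
      - series: 'k8s_upgrade_blocked{job="pushgateway"}'
        values: '1x30'
    promql_expr_test:
      - expr: '(kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)'
        eval_time: 20m
        exp_samples: []

  # Case 2: a genuine failure; the gauge was reset to 0 by the preflight.
  # The failed-Job series should pass through unchanged.
  - interval: 1m
    input_series:
      - series: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
        values: '1x30'
      - series: 'k8s_upgrade_blocked{job="pushgateway"}'
        values: '0x30'
    promql_expr_test:
      - expr: '(kube_job_status_failed{namespace="k8s-upgrade", job_name=~"k8s-upgrade-.*", reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)'
        eval_time: 20m
        exp_samples:
          - labels: 'kube_job_status_failed{namespace="k8s-upgrade", job_name="k8s-upgrade-preflight", reason="BackoffLimitExceeded"}'
            value: 1
```

The sketch uses `promql_expr_test` rather than `alert_rule_test` so that it does not depend on the rule's annotations, which are not shown in this diff.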