k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)

Refactored halt_on_alert_query from a denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:

- PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
- IngressTTFBHigh (Traefik latency, transient)
- NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
- RecentNodeReboot (chain causes this itself)

Filtering on severity=critical is more robust than maintaining a denylist
of every noisy alert that crops up. The extra_ignore parameter is kept for
backwards compatibility but is rarely needed now, since critical alerts are
the only ones that should actually halt the chain.
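A minimal sketch of the allowlist filter, run against a made-up payload
rather than a live Prometheus. The sample alerts (including the EtcdDown and
KubeAPIDown names and their severities) and the helper names sample_alerts
and critical_alert_names are assumptions for illustration; only the jq
selection mirrors the one in this commit. It needs jq on the PATH.

```shell
#!/usr/bin/env bash
# Sketch only: the payload is invented; it mimics the shape of a Prometheus
# /api/v1/alerts response (data.alerts[] with .labels and .state).
set -o pipefail

sample_alerts() {
  cat <<'JSON'
{"status":"success","data":{"alerts":[
  {"labels":{"alertname":"Watchdog","severity":"none"},"state":"firing"},
  {"labels":{"alertname":"IngressTTFBHigh","severity":"warning"},"state":"firing"},
  {"labels":{"alertname":"RecentNodeReboot","severity":"info"},"state":"firing"},
  {"labels":{"alertname":"EtcdDown","severity":"critical"},"state":"pending"},
  {"labels":{"alertname":"KubeAPIDown","severity":"critical"},"state":"firing"}
]}}
JSON
}

# Allowlist: keep only alerts that are firing AND labelled severity=critical.
critical_alert_names() {
  jq -r '.data.alerts[]
         | select(.state == "firing" and .labels.severity == "critical")
         | .labels.alertname' | sort -u
}

# Prints only KubeAPIDown: the warning/info alerts and the pending critical
# alert are all dropped without being listed anywhere.
sample_alerts | critical_alert_names
```

With the old denylist, each of the warning/info names above would have
needed its own regex entry to avoid halting the chain.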

Tested end-to-end this session: the master was successfully upgraded to
v1.34.8 via the autonomous chain after a one-time apiserver state repair.
The apiserver manifest had been pinned at v1.34.2 by a previous month's
rollback and needed a manual edit plus a kubelet reload to bring it back
to v1.34.7, after which the chain ran cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 0025511b6a
commit ad9f6c8f41
1 changed file with 34 additions and 25 deletions
@@ -94,32 +94,41 @@ push() {
 
 halt_on_alert_query() {
   local extra_ignore="${1:-}"
-  # Always-ignored alerts — present in steady-state OR are themselves caused
-  # by what the chain does, so they should never halt a chain phase:
-  #   Watchdog               — Prometheus meta-alert, always firing
-  #   RebootRequired         — long-running info, not actionable mid-chain
-  #   KuredNodeWasNotDrained — kured info-level, doesn't block upgrade
-  #   InfoInhibitor          — used to inhibit other alerts, always present
-  #   IngressTTFBHigh        — Traefik latency. Symptoms-not-causes; upgrades
-  #                            routinely spike latency briefly. Halting on
-  #                            this would prevent the chain from running in
-  #                            any moderately busy cluster. (2026-05-23)
-  #   NodeHighIOWait         — chicken-and-egg with our own upgrade I/O. The
-  #                            inline quiet-baseline check (Ready transition
-  #                            <10min) is the real cluster-churn gate; iowait
-  #                            is too noisy to be a hard gate. (2026-05-23)
-  local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|IngressTTFBHigh|NodeHighIOWait'
-  [ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
-  regex="$regex)$"
+  # ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
+  # alerts with severity=critical. Any warning/info-level alert is treated
+  # as informational and doesn't block the chain.
+  #
+  # Why this is the right model:
+  #   - The cluster has long-running warning-level alerts that are NOT
+  #     blockers for a k8s patch (e.g. GPU operator crashloop on the GPU
+  #     node, ingress latency spikes, IO-wait warnings).
+  #   - Maintaining a denylist of every "noisy" alert is a losing battle.
+  #   - Critical alerts are the only ones that should actually stop us
+  #     mid-chain (apiserver down, etcd down, node not ready, etc.).
+  #
+  # `extra_ignore` is now mostly historical — kept for backwards compat with
+  # `halt_on_alert_query RecentNodeReboot`-style calls. With severity-based
+  # filtering, RecentNodeReboot (severity=info) is filtered automatically.
+  # We still build the regex for any critical alert the caller wants to
+  # explicitly ignore (e.g. a known-broken thing we're aware of).
+  local ignore_regex=""
+  [ -n "$extra_ignore" ] && ignore_regex="^($extra_ignore)\$"
 
-  # `grep -vE` returns 1 when nothing matches, which under `set -o pipefail`
-  # bubbles up and (via the caller's `alerts=$(...)`) aborts the whole script.
-  # Trailing `|| true` keeps a no-alerts-firing cluster from looking like a
-  # script error. Discovered 2026-05-19 when the chain wouldn't fire on a
-  # genuinely-clean cluster (every alert was Watchdog/RebootRequired/etc.).
-  curl -sf "$PROM/api/v1/alerts" \
-    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
-    | { grep -vE "$regex" || true; } | sort -u
+  # `grep` returns 1 when nothing matches → under `set -o pipefail` that
+  # bubbles up and aborts the script via the caller's `alerts=$(...)`.
+  # Trailing `|| true` on each grep handles the no-matches case.
+  local critical_firing
+  critical_firing=$(curl -sf "$PROM/api/v1/alerts" \
+    | jq -r '.data.alerts[]
+             | select(.state == "firing" and .labels.severity == "critical")
+             | .labels.alertname' 2>/dev/null \
+    | sort -u || true)
+
+  if [ -n "$ignore_regex" ]; then
+    echo "$critical_firing" | { grep -vE "$ignore_regex" || true; }
+  else
+    echo "$critical_firing"
+  fi
 }
 
 wait_for_node_ready() {
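
The `|| true` guard and the anchored `extra_ignore` regex from the hunk above
can be tried in isolation. This is a sketch under stated assumptions: the
helper name drop_ignored and the alert names are hypothetical, and it feeds
names on stdin instead of querying Prometheus.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and alert names are illustrative, not from the chain.
set -o pipefail

# Drop alert names (one per line on stdin) that match the anchored
# alternation in $1, e.g. "EtcdDown|KubeAPIDown". An empty $1 filters nothing.
drop_ignored() {
  local extra_ignore="${1:-}"
  if [ -n "$extra_ignore" ]; then
    # grep -v exits 1 when it selects no lines; `|| true` masks that so the
    # enclosing pipeline does not fail under pipefail.
    { grep -vE "^($extra_ignore)\$" || true; }
  else
    cat
  fi
}

# Unguarded: every line is filtered out, grep exits 1, the pipeline fails.
unguarded=0
printf 'EtcdDown\n' | grep -vE '^(EtcdDown)$' | sort -u || unguarded=$?
echo "unguarded status: $unguarded"   # prints "unguarded status: 1"

# Guarded: same input, still empty output, but a zero exit status.
guarded=0
printf 'EtcdDown\n' | drop_ignored 'EtcdDown' | sort -u || guarded=$?
echo "guarded status: $guarded"       # prints "guarded status: 0"

# Anchoring with ^(...)$ drops whole names only; this prints EtcdDownSlow.
printf 'EtcdDown\nEtcdDownSlow\n' | drop_ignored 'EtcdDown'
```

Because the alternation is wrapped in `^( ... )$`, a caller can pass several
names as `NameA|NameB` without accidentally ignoring alerts that merely
contain one of them.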