k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)

Refactored halt_on_alert_query from a denylist ("ignore these noisy
alerts") to an allowlist ("only halt on severity=critical"). Today's
blocking alerts were all warning/info-level, not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).
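
A minimal sketch of the difference, not the chain's actual code. It uses
awk over plain "name severity" lines instead of jq over the Prometheus
JSON, so it runs without a Prometheus endpoint. The sample data is
hypothetical: the three warning severities are assumed, and EtcdDown is an
invented critical alert.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical firing alerts, one "<alertname> <severity>" pair per line.
# Assumed for illustration: the three "warning" severities and the invented
# critical alert EtcdDown. RecentNodeReboot=info comes from the commit itself.
sample_alerts() {
  printf '%s\n' \
    'PodCrashLooping warning' \
    'IngressTTFBHigh warning' \
    'NodeHighIOWait warning' \
    'RecentNodeReboot info' \
    'EtcdDown critical'
}

# Denylist: halt on everything except the names listed in the regex.
denylist_halt() {
  awk -v re='^(IngressTTFBHigh|NodeHighIOWait)$' '$1 !~ re { print $1 }' | sort -u
}

# Allowlist: halt only on severity=critical, whatever the alert is called.
allowlist_halt() {
  awk '$2 == "critical" { print $1 }' | sort -u
}

echo "denylist halts on:  $(sample_alerts | denylist_halt | tr '\n' ' ')"
echo "allowlist halts on: $(sample_alerts | allowlist_halt | tr '\n' ' ')"
```

With this sample the denylist still halts on PodCrashLooping and
RecentNodeReboot because nobody has listed them yet, while the allowlist
halts only on EtcdDown.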

Tested end-to-end this session: the autonomous chain successfully upgraded
master to v1.34.8 after a one-time apiserver state repair. The apiserver
manifest had been pinned at v1.34.2 since a previous month's rollback and
needed a manual edit plus a kubelet reload to bring it back to v1.34.7,
after which the chain ran cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-23 09:14:39 +00:00
parent 0025511b6a
commit ad9f6c8f41

@@ -94,32 +94,41 @@ push() {
halt_on_alert_query() {
local extra_ignore="${1:-}"
# Always-ignored alerts — present in steady-state OR are themselves caused
# by what the chain does, so they should never halt a chain phase:
# Watchdog — Prometheus meta-alert, always firing
# RebootRequired — long-running info, not actionable mid-chain
# KuredNodeWasNotDrained — kured info-level, doesn't block upgrade
# InfoInhibitor — used to inhibit other alerts, always present
# IngressTTFBHigh — Traefik latency. Symptoms-not-causes; upgrades
# routinely spike latency briefly. Halting on
# this would prevent the chain from running in
# any moderately busy cluster. (2026-05-23)
# NodeHighIOWait — chicken-and-egg with our own upgrade I/O. The
# inline quiet-baseline check (Ready transition
# <10min) is the real cluster-churn gate; iowait
# is too noisy to be a hard gate. (2026-05-23)
local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|IngressTTFBHigh|NodeHighIOWait'
[ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
regex="$regex)$"
# ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
# alerts with severity=critical. Any warning/info-level alert is treated
# as informational and doesn't block the chain.
#
# Why this is the right model:
# - The cluster has long-running warning-level alerts that are NOT
# blockers for a k8s patch (e.g. GPU operator crashloop on the GPU
# node, ingress latency spikes, IO-wait warnings).
# - Maintaining a denylist of every "noisy" alert is a losing battle.
# - Critical alerts are the only ones that should actually stop us
# mid-chain (apiserver down, etcd down, node not ready, etc.).
#
# `extra_ignore` is now mostly historical — kept for backwards compat with
# `halt_on_alert_query RecentNodeReboot`-style calls. With severity-based
# filtering, RecentNodeReboot (severity=info) is filtered automatically.
# We still build the regex for any critical alert the caller wants to
# explicitly ignore (e.g. a known-broken thing we're aware of).
local ignore_regex=""
[ -n "$extra_ignore" ] && ignore_regex="^($extra_ignore)\$"
# `grep -vE` returns 1 when nothing matches, which under `set -o pipefail`
# bubbles up and (via the caller's `alerts=$(...)`) aborts the whole script.
# Trailing `|| true` keeps a no-alerts-firing cluster from looking like a
# script error. Discovered 2026-05-19 when the chain wouldn't fire on a
# genuinely-clean cluster (every alert was Watchdog/RebootRequired/etc.).
curl -sf "$PROM/api/v1/alerts" \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| { grep -vE "$regex" || true; } | sort -u
# `grep` returns 1 when nothing matches → under `set -o pipefail` that
# bubbles up and aborts the script via the caller's `alerts=$(...)`.
# Trailing `|| true` on each grep handles the no-matches case.
local critical_firing
critical_firing=$(curl -sf "$PROM/api/v1/alerts" \
| jq -r '.data.alerts[]
| select(.state == "firing" and .labels.severity == "critical")
| .labels.alertname' 2>/dev/null \
| sort -u || true)
if [ -n "$ignore_regex" ]; then
echo "$critical_firing" | { grep -vE "$ignore_regex" || true; }
else
echo "$critical_firing"
fi
}
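
The exit-status behaviour that the `|| true` guards above work around can be reproduced in isolation. This is a sketch under assumptions, not part of the commit: the function names `unguarded` and `guarded` are hypothetical, and the input is passed as arguments rather than fetched from Prometheus. Strictly, `grep -v` exits 1 when it selects no lines, which is what happens when every firing alert is on the ignore list.

```shell
#!/usr/bin/env bash
set -o pipefail

# Unguarded: when every input line matches the regex, grep -v selects nothing
# and exits 1; pipefail then makes the whole pipeline report status 1.
unguarded() {
  printf '%s\n' "$@" | grep -vE '^(Watchdog)$' | sort -u
}

# Guarded: the brace group turns grep's "no lines selected" status into 0,
# so an empty result is not mistaken for a script error.
guarded() {
  printf '%s\n' "$@" | { grep -vE '^(Watchdog)$' || true; } | sort -u
}

unguarded Watchdog || echo "unguarded status: $?"   # prints "unguarded status: 1"
guarded Watchdog && echo "guarded status: $?"       # prints "guarded status: 0"
```

Under `set -e`, or in a caller that checks the status of `alerts=$(...)`, the unguarded form would abort on a clean cluster, which matches the failure described in the comments.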
wait_for_node_ready() {