k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)

Refactored halt_on_alert_query from a denylist ("ignore these noisy
alerts") to an allowlist ("only halt on severity=critical"). Today's
blocking alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

Filtering on severity=critical is more robust than maintaining a denylist
of every noisy alert that crops up. The extra_ignore parameter is kept for
backwards compatibility but is rarely needed now, since critical alerts
are the only ones that should actually halt the chain.
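
As a rough illustration of the allowlist idea, the sketch below keeps only
severity=critical alerts from a small sample. The sample data, the
"alertname severity" text layout, the severities assigned to each alert, the
EtcdDown name, and both helper names are hypothetical; the real function
applies the same test with jq to the JSON from Prometheus' /api/v1/alerts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample: one "alertname severity" pair per firing alert.
sample_alerts() {
  cat <<'EOF'
PodCrashLooping warning
IngressTTFBHigh warning
NodeHighIOWait warning
RecentNodeReboot info
EtcdDown critical
EOF
}

# Allowlist: print only the names whose severity is exactly "critical".
# New warning/info alerts are skipped without any list to maintain.
halting_alerts() {
  awk '$2 == "critical" { print $1 }' | sort -u
}

sample_alerts | halting_alerts
```

With this sample only EtcdDown is printed; a noisy warning added tomorrow
would be skipped without touching the filter.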

Tested end-to-end this session: master upgraded successfully to v1.34.8
via the autonomous chain after a one-time apiserver state repair. The
apiserver manifest had been pinned at v1.34.2 by a previous month's
rollback and needed a manual edit plus a kubelet reload to bring it back
to v1.34.7, after which the chain ran cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-23 09:14:39 +00:00
parent 0025511b6a
commit ad9f6c8f41

@@ -94,32 +94,41 @@ push() {
 halt_on_alert_query() {
   local extra_ignore="${1:-}"
-  # Always-ignored alerts — present in steady-state OR are themselves caused
-  # by what the chain does, so they should never halt a chain phase:
-  # Watchdog — Prometheus meta-alert, always firing
-  # RebootRequired — long-running info, not actionable mid-chain
-  # KuredNodeWasNotDrained — kured info-level, doesn't block upgrade
-  # InfoInhibitor — used to inhibit other alerts, always present
-  # IngressTTFBHigh — Traefik latency. Symptoms-not-causes; upgrades
-  # routinely spike latency briefly. Halting on
-  # this would prevent the chain from running in
-  # any moderately busy cluster. (2026-05-23)
-  # NodeHighIOWait — chicken-and-egg with our own upgrade I/O. The
-  # inline quiet-baseline check (Ready transition
-  # <10min) is the real cluster-churn gate; iowait
-  # is too noisy to be a hard gate. (2026-05-23)
-  local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|IngressTTFBHigh|NodeHighIOWait'
-  [ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
-  regex="$regex)$"
+  # ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
+  # alerts with severity=critical. Any warning/info-level alert is treated
+  # as informational and doesn't block the chain.
+  #
+  # Why this is the right model:
+  # - The cluster has long-running warning-level alerts that are NOT
+  # blockers for a k8s patch (e.g. GPU operator crashloop on the GPU
+  # node, ingress latency spikes, IO-wait warnings).
+  # - Maintaining a denylist of every "noisy" alert is a losing battle.
+  # - Critical alerts are the only ones that should actually stop us
+  # mid-chain (apiserver down, etcd down, node not ready, etc.).
+  #
+  # `extra_ignore` is now mostly historical — kept for backwards compat with
+  # `halt_on_alert_query RecentNodeReboot`-style calls. With severity-based
+  # filtering, RecentNodeReboot (severity=info) is filtered automatically.
+  # We still build the regex for any critical alert the caller wants to
+  # explicitly ignore (e.g. a known-broken thing we're aware of).
+  local ignore_regex=""
+  [ -n "$extra_ignore" ] && ignore_regex="^($extra_ignore)\$"
-  # `grep -vE` returns 1 when nothing matches, which under `set -o pipefail`
-  # bubbles up and (via the caller's `alerts=$(...)`) aborts the whole script.
-  # Trailing `|| true` keeps a no-alerts-firing cluster from looking like a
-  # script error. Discovered 2026-05-19 when the chain wouldn't fire on a
-  # genuinely-clean cluster (every alert was Watchdog/RebootRequired/etc.).
-  curl -sf "$PROM/api/v1/alerts" \
-    | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
-    | { grep -vE "$regex" || true; } | sort -u
+  # `grep` returns 1 when nothing matches → under `set -o pipefail` that
+  # bubbles up and aborts the script via the caller's `alerts=$(...)`.
+  # Trailing `|| true` on each grep handles the no-matches case.
+  local critical_firing
+  critical_firing=$(curl -sf "$PROM/api/v1/alerts" \
+    | jq -r '.data.alerts[]
+        | select(.state == "firing" and .labels.severity == "critical")
+        | .labels.alertname' 2>/dev/null \
+    | sort -u || true)
+  if [ -n "$ignore_regex" ]; then
+    echo "$critical_firing" | { grep -vE "$ignore_regex" || true; }
+  else
+    echo "$critical_firing"
+  fi
 }
 wait_for_node_ready() {
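
The `|| true` guard described in the comments above can be tried in
isolation. This is a minimal sketch, not the chain's code: the
filter_ignored helper and the KnownBrokenThing / EtcdDown alert names are
hypothetical, and only the anchored regex shape and the guarded
`grep -vE` mirror the diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read alert names on stdin and drop those matching an
# anchored ignore regex, built the same way as ignore_regex in the diff.
filter_ignored() {
  local extra_ignore="${1:-}"
  if [ -n "$extra_ignore" ]; then
    # grep -v exits 1 when every input line is filtered out. Without the
    # `|| true`, pipefail would turn "nothing left to report" into a
    # failed pipeline and abort a caller running under `set -e`.
    { grep -vE "^($extra_ignore)\$" || true; }
  else
    cat
  fi
}

# Every name is ignored: the assignment still succeeds, with empty output.
remaining=$(printf '%s\n' KnownBrokenThing | filter_ignored 'KnownBrokenThing')
echo "remaining after full ignore: '${remaining}'"

# One unignored critical name survives the filter.
printf '%s\n' EtcdDown KnownBrokenThing | filter_ignored 'KnownBrokenThing'
```

Removing the `|| true` makes the first assignment exit non-zero, which is
the failure mode the comment describes for a caller doing `alerts=$(...)`.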