k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
- PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
- IngressTTFBHigh (Traefik latency, transient)
- NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
- RecentNodeReboot (chain causes this itself)
severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).
Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
0025511b6a
commit
ad9f6c8f41
1 changed files with 34 additions and 25 deletions
|
|
@ -94,32 +94,41 @@ push() {
|
||||||
|
|
||||||
halt_on_alert_query() {
|
halt_on_alert_query() {
|
||||||
local extra_ignore="${1:-}"
|
local extra_ignore="${1:-}"
|
||||||
# Always-ignored alerts — present in steady-state OR are themselves caused
|
# ALLOWLIST design (refactored 2026-05-23 from a denylist): halt only on
|
||||||
# by what the chain does, so they should never halt a chain phase:
|
# alerts with severity=critical. Any warning/info-level alert is treated
|
||||||
# Watchdog — Prometheus meta-alert, always firing
|
# as informational and doesn't block the chain.
|
||||||
# RebootRequired — long-running info, not actionable mid-chain
|
#
|
||||||
# KuredNodeWasNotDrained — kured info-level, doesn't block upgrade
|
# Why this is the right model:
|
||||||
# InfoInhibitor — used to inhibit other alerts, always present
|
# - The cluster has long-running warning-level alerts that are NOT
|
||||||
# IngressTTFBHigh — Traefik latency. Symptoms-not-causes; upgrades
|
# blockers for a k8s patch (e.g. GPU operator crashloop on the GPU
|
||||||
# routinely spike latency briefly. Halting on
|
# node, ingress latency spikes, IO-wait warnings).
|
||||||
# this would prevent the chain from running in
|
# - Maintaining a denylist of every "noisy" alert is a losing battle.
|
||||||
# any moderately busy cluster. (2026-05-23)
|
# - Critical alerts are the only ones that should actually stop us
|
||||||
# NodeHighIOWait — chicken-and-egg with our own upgrade I/O. The
|
# mid-chain (apiserver down, etcd down, node not ready, etc.).
|
||||||
# inline quiet-baseline check (Ready transition
|
#
|
||||||
# <10min) is the real cluster-churn gate; iowait
|
# `extra_ignore` is now mostly historical — kept for backwards compat with
|
||||||
# is too noisy to be a hard gate. (2026-05-23)
|
# `halt_on_alert_query RecentNodeReboot`-style calls. With severity-based
|
||||||
local regex='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|IngressTTFBHigh|NodeHighIOWait'
|
# filtering, RecentNodeReboot (severity=info) is filtered automatically.
|
||||||
[ -n "$extra_ignore" ] && regex="$regex|$extra_ignore"
|
# We still build the regex for any critical alert the caller wants to
|
||||||
regex="$regex)$"
|
# explicitly ignore (e.g. a known-broken thing we're aware of).
|
||||||
|
local ignore_regex=""
|
||||||
|
[ -n "$extra_ignore" ] && ignore_regex="^($extra_ignore)\$"
|
||||||
|
|
||||||
# `grep -vE` returns 1 when nothing matches, which under `set -o pipefail`
|
# `grep` returns 1 when nothing matches → under `set -o pipefail` that
|
||||||
# bubbles up and (via the caller's `alerts=$(...)`) aborts the whole script.
|
# bubbles up and aborts the script via the caller's `alerts=$(...)`.
|
||||||
# Trailing `|| true` keeps a no-alerts-firing cluster from looking like a
|
# Trailing `|| true` on each grep handles the no-matches case.
|
||||||
# script error. Discovered 2026-05-19 when the chain wouldn't fire on a
|
local critical_firing
|
||||||
# genuinely-clean cluster (every alert was Watchdog/RebootRequired/etc.).
|
critical_firing=$(curl -sf "$PROM/api/v1/alerts" \
|
||||||
curl -sf "$PROM/api/v1/alerts" \
|
| jq -r '.data.alerts[]
|
||||||
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
|
| select(.state == "firing" and .labels.severity == "critical")
|
||||||
| { grep -vE "$regex" || true; } | sort -u
|
| .labels.alertname' 2>/dev/null \
|
||||||
|
| sort -u || true)
|
||||||
|
|
||||||
|
if [ -n "$ignore_regex" ]; then
|
||||||
|
echo "$critical_firing" | { grep -vE "$ignore_regex" || true; }
|
||||||
|
else
|
||||||
|
echo "$critical_firing"
|
||||||
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
wait_for_node_ready() {
|
wait_for_node_ready() {
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue