Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
- PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
- IngressTTFBHigh (Traefik latency, transient)
- NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
- RecentNodeReboot (chain causes this itself)
severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).
Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>