Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.
10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.
node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>