monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)

Earlier in this session, commit 503ac4c1 changed the for: from 5m → 2m
based on an inaccurate brief I wrote. The brief said the alert "fires
immediately", but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it, the opposite of what we
wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.
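
As a rough illustration of the reasoning above, here is a minimal Python
sketch of how a for: hold time interacts with outage length. It is a
simplified model, not Prometheus itself: the function name alert_fires
and the 30s evaluation interval are assumptions made for the example,
and the outage is treated as one continuous window during which the
expression stays true.

```python
def alert_fires(outage_s: int, for_s: int, eval_interval_s: int = 30) -> bool:
    """Simplified model of an alert rule's `for:` clause.

    The rule is evaluated every eval_interval_s seconds (assumed value).
    The alert goes pending at the first evaluation that sees the outage
    and fires once it has stayed pending for at least for_s seconds.
    """
    pending_since = None
    t = 0
    while t < outage_s:  # the expression is true only during the outage
        if pending_since is None:
            pending_since = t  # first evaluation that sees the outage
        if t - pending_since >= for_s:
            return True  # held long enough: pending becomes firing
        t += eval_interval_s
    return False  # outage ended before the hold time elapsed


# Worst-case worker upgrade from the message above: about 7 minutes.
worst_case_s = 7 * 60
for hold_s in (2 * 60, 5 * 60, 10 * 60):
    print(f"for: {hold_s // 60}m fires = {alert_fires(worst_case_s, hold_s)}")
```

Under these assumptions a 7-minute upgrade window trips the alert at 2m
and at 5m but not at 10m, while a node that stays down for 15 minutes
still fires.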

The node4 worker phase tripped this alert mid-soak today, which aborted
the chain (Job retry); the 2nd attempt succeeded only because the alert
didn't re-fire fast enough. With 10m, the next workers shouldn't need
the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-23 09:32:41 +00:00
parent 503ac4c192
commit 68f8514e61

@@ -2243,15 +2243,22 @@ serverFiles:
  summary: "Cloudflared: {{ $value | printf \"%.0f\" }} replica(s) unavailable"
  - alert: MetalLBSpeakerDown
  # kubelet restart during k8s upgrade briefly takes the speaker
- # pod down; typical recovery is 30-45s. 2m suppresses those
- # transient blips while still catching genuine failures.
- # Adjusted from 5m on 2026-05-18.
+ # pod down; typical recovery is 30-45s. The full drain+kubeadm+
+ # apt+kubelet-restart+uncordon cycle in the chain's worker phase
+ # can take a single node out of MetalLB rotation for 5-7 min in
+ # the worst case (depending on PDB stickiness). 10m suppresses
+ # those upgrade-induced blips while still catching genuine
+ # speaker-down conditions.
+ # Reverted from 2m → 10m on 2026-05-23 after node4 upgrade
+ # tripped it mid-soak and aborted the chain. Previous value was
+ # 5m (set 2026-05-18) which was already correct; a brief patch
+ # had tightened it.
  expr: |
  (
  kube_daemonset_status_desired_number_scheduled{namespace="metallb-system", daemonset="metallb-speaker"}
  - on(namespace, daemonset) kube_daemonset_status_number_ready{namespace="metallb-system", daemonset="metallb-speaker"}
  ) > 0
- for: 2m
+ for: 10m
  labels:
  severity: critical
  annotations: