monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)
Earlier in this session, commit 503ac4c1 lowered the for: from 5m to 2m
based on an inaccurate brief I wrote. The brief said the alert "fires
immediately", but the for: was already at 5m. The subagent followed
the explicit "2m" target and tightened it, the opposite of what we wanted.
10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.
The node4 worker phase tripped this alert mid-soak today and aborted
the chain (Job retry). The chain succeeded on the 2nd attempt only
because alerts didn't re-fire fast enough. With 10m, the next workers
shouldn't need the retry.
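The timing argument above can be sketched as a toy model of the for: clause. This is a simplification, not Prometheus itself: it assumes a 1m rule evaluation interval (not stated in this commit) and an outage that starts exactly on an evaluation tick. The alert_fires helper and the 15m "genuine failure" scenario are hypothetical, for illustration only.

```python
from datetime import timedelta


def alert_fires(outage: timedelta, hold: timedelta,
                eval_interval: timedelta = timedelta(minutes=1)) -> bool:
    """Return True if an outage of this length would fire the alert.

    Simplified model of a `for:` clause: the alert goes pending at the
    first evaluation that sees the condition true, and fires at the first
    evaluation where the condition is still true and has been true for at
    least `hold`. The 1m evaluation interval is an assumption.
    """
    elapsed = timedelta(0)
    while elapsed < outage:  # condition is still true at this tick
        if elapsed >= hold:
            return True
        elapsed += eval_interval
    return False


# Outage lengths taken from the commit message, plus one hypothetical
# long outage standing in for a genuine speaker failure.
scenarios = {
    "kubelet restart (45s)": timedelta(seconds=45),
    "worst-case worker upgrade (7m)": timedelta(minutes=7),
    "genuine failure (15m, hypothetical)": timedelta(minutes=15),
}

for hold_minutes in (2, 5, 10):
    hold = timedelta(minutes=hold_minutes)
    for name, outage in scenarios.items():
        state = "fires" if alert_fires(outage, hold) else "suppressed"
        print(f"for: {hold_minutes}m  {name}: {state}")
```

Under these assumptions, a 7-minute worst-case worker cycle trips the alert at 2m and at 5m but not at 10m, while a sustained outage longer than 10 minutes still fires.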
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 503ac4c192
commit 68f8514e61
1 changed file with 11 additions and 4 deletions
@@ -2243,15 +2243,22 @@ serverFiles:
       summary: "Cloudflared: {{ $value | printf \"%.0f\" }} replica(s) unavailable"
   - alert: MetalLBSpeakerDown
     # kubelet restart during k8s upgrade briefly takes the speaker
-    # pod down; typical recovery is 30-45s. 2m suppresses those
-    # transient blips while still catching genuine failures.
-    # Adjusted from 5m on 2026-05-18.
+    # pod down; typical recovery is 30-45s. The full drain+kubeadm+
+    # apt+kubelet-restart+uncordon cycle in the chain's worker phase
+    # can take a single node out of MetalLB rotation for 5-7 min in
+    # the worst case (depending on PDB stickiness). 10m suppresses
+    # those upgrade-induced blips while still catching genuine
+    # speaker-down conditions.
+    # Reverted from 2m → 10m on 2026-05-23 after node4 upgrade
+    # tripped it mid-soak and aborted the chain. Previous value was
+    # 5m (set 2026-05-18) which was already correct; a brief patch
+    # had tightened it.
     expr: |
       (
         kube_daemonset_status_desired_number_scheduled{namespace="metallb-system", daemonset="metallb-speaker"}
         - on(namespace, daemonset) kube_daemonset_status_number_ready{namespace="metallb-system", daemonset="metallb-speaker"}
       ) > 0
-    for: 2m
+    for: 10m
     labels:
       severity: critical
     annotations: