From 68f8514e61d29336432d3a749854863b255f357b Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 23 May 2026 09:32:41 +0000
Subject: [PATCH] =?UTF-8?q?monitoring:=20MetalLBSpeakerDown=20for:=202m=20?=
 =?UTF-8?q?=E2=86=92=2010m=20(was=20upgrade-chain=20regression)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it, the opposite of what we
wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.

node4 worker phase tripped this alert mid-soak today, aborted the chain
(Job retry), succeeded on the 2nd attempt only because alerts didn't
re-fire fast enough. With 10m the next workers shouldn't need the retry.

Co-Authored-By: Claude Opus 4.7
---
 .../monitoring/prometheus_chart_values.tpl | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 59a6d1a9..f8561b8c 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -2243,15 +2243,22 @@ serverFiles:
               summary: "Cloudflared: {{ $value | printf \"%.0f\" }} replica(s) unavailable"
           - alert: MetalLBSpeakerDown
             # kubelet restart during k8s upgrade briefly takes the speaker
-            # pod down; typical recovery is 30-45s. 2m suppresses those
-            # transient blips while still catching genuine failures.
-            # Adjusted from 5m on 2026-05-18.
+            # pod down; typical recovery is 30-45s. The full drain+kubeadm+
+            # apt+kubelet-restart+uncordon cycle in the chain's worker phase
+            # can take a single node out of MetalLB rotation for 5-7 min in
+            # the worst case (depending on PDB stickiness). 10m suppresses
+            # those upgrade-induced blips while still catching genuine
+            # speaker-down conditions.
+            # Reverted from 2m → 10m on 2026-05-23 after node4 upgrade
+            # tripped it mid-soak and aborted the chain. Previous value was
+            # 5m (set 2026-05-18) which was already correct; a brief patch
+            # had tightened it.
             expr: |
               (
                 kube_daemonset_status_desired_number_scheduled{namespace="metallb-system", daemonset="metallb-speaker"}
                 - on(namespace, daemonset) kube_daemonset_status_number_ready{namespace="metallb-system", daemonset="metallb-speaker"}
               ) > 0
-            for: 2m
+            for: 10m
             labels:
               severity: critical
             annotations: