From 68f8514e61d29336432d3a749854863b255f357b Mon Sep 17 00:00:00 2001
From: Viktor Barzin <vbarzin@gmail.com>
Date: Sat, 23 May 2026 09:32:41 +0000
Subject: [PATCH] =?UTF-8?q?monitoring:=20MetalLBSpeakerDown=20for:=202m=20?=
 =?UTF-8?q?=E2=86=92=2010m=20(was=20upgrade-chain=20regression)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Earlier in this session, commit 503ac4c1 changed the for: from 5m → 2m
based on an inaccurate brief I wrote. The brief said the alert "fires
immediately", but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it, which is the opposite of
what we wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.

The node4 worker phase tripped this alert mid-soak today and aborted
the chain (Job retry). The 2nd attempt succeeded only because the alert
didn't re-fire fast enough. With 10m, the next workers shouldn't need
the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
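The timing argument in the commit message can be sanity-checked with a
rough model of the for: clause. The sketch below is illustrative only:
fires() is a hypothetical helper, it assumes a 60s rule evaluation
interval (not confirmed by this patch) and an outage that starts exactly
on an evaluation tick, and it ignores scrape lag and staleness handling.

```python
def fires(outage_s: int, for_s: int, eval_interval_s: int = 60) -> bool:
    """Return True if an outage lasting outage_s seconds reaches 'firing'.

    Simplified model: the alert becomes pending at the first evaluation
    where the condition holds, fires once it has held for at least for_s,
    and resets as soon as one evaluation sees the condition false.
    """
    active_at = None
    # Evaluate on every tick, up to the first tick after the outage ends.
    for t in range(0, outage_s + eval_interval_s, eval_interval_s):
        speaker_down = t < outage_s  # condition holds during [0, outage_s)
        if not speaker_down:
            active_at = None
            continue
        if active_at is None:
            active_at = t
        if t - active_at >= for_s:
            return True
    return False


if __name__ == "__main__":
    # 7 min is the worst-case upgrade outage quoted in the commit message;
    # 15 min stands in for a genuine speaker failure (assumed value).
    for outage_min in (7, 15):
        for for_min in (2, 5, 10):
            result = fires(outage_min * 60, for_min * 60)
            print(f"outage={outage_min}m for={for_min}m fires={result}")
```

Under these assumptions a 7 min outage reaches firing with for: 2m and
for: 5m but not with for: 10m, while a 15 min outage still fires at 10m.
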
 .../monitoring/prometheus_chart_values.tpl        | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 59a6d1a9..f8561b8c 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -2243,15 +2243,22 @@ serverFiles:
               summary: "Cloudflared: {{ $value | printf \"%.0f\" }} replica(s) unavailable"
           - alert: MetalLBSpeakerDown
             # kubelet restart during k8s upgrade briefly takes the speaker
-            # pod down; typical recovery is 30-45s. 2m suppresses those
-            # transient blips while still catching genuine failures.
-            # Adjusted from 5m on 2026-05-18.
+            # pod down; typical recovery is 30-45s. The full drain+kubeadm+
+            # apt+kubelet-restart+uncordon cycle in the chain's worker phase
+            # can take a single node out of MetalLB rotation for 5-7 min in
+            # the worst case (depending on PDB stickiness). 10m suppresses
+            # those upgrade-induced blips while still catching genuine
+            # speaker-down conditions.
+            # Raised from 2m → 10m on 2026-05-23 after the node4 upgrade
+            # tripped it mid-soak and aborted the chain. The 2m value came
+            # from a mistaken tightening of the earlier 5m on 2026-05-18;
+            # this change undoes that and adds headroom for the worst case.
             expr: |
               (
                 kube_daemonset_status_desired_number_scheduled{namespace="metallb-system", daemonset="metallb-speaker"}
                 - on(namespace, daemonset) kube_daemonset_status_number_ready{namespace="metallb-system", daemonset="metallb-speaker"}
               ) > 0
-            for: 2m
+            for: 10m
             labels:
               severity: critical
             annotations: