infra

Viktor Barzin 68f8514e61 monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)	2026-05-23 09:32:41 +00:00

Earlier in this session, commit `503ac4c1` changed the for: from 5m → 2m based on a brief I wrote inaccurately. The brief said the alert "fires immediately", but it was actually already at 5m. The subagent followed the explicit "2m" target and tightened it, which is the opposite of what we wanted.

10m is the right value for our chain: a full drain + kubeadm + apt + kubelet restart + uncordon cycle can take a worker out of MetalLB rotation for 5-7 min in the worst case (PDB stickiness on some pods). 10m suppresses upgrade-induced blips while still catching real speaker-down conditions.

The node4 worker phase tripped this alert mid-soak today and aborted the chain (Job retry); it succeeded on the 2nd attempt only because alerts didn't re-fire fast enough. With 10m the next workers shouldn't need the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
modules/monitoring	monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)	2026-05-23 09:32:41 +00:00
main.tf	[forgejo] Tolerate missing Vault keys during Phase 0 bootstrap	2026-05-07 15:53:08 +00:00
secrets	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00
terragrunt.hcl	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00
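
The reasoning in the latest commit message (a worst-case 5-7 min upgrade outage against for: values of 2m, 5m and 10m) can be checked with a small sketch. This is a simplified model, not the actual rule in modules/monitoring: it assumes the alert fires once the condition has held continuously for at least the for: duration, and it ignores the rule evaluation interval. The function name `alert_fires` and the constants are hypothetical, chosen only for illustration.

```python
def alert_fires(outage_min: float, for_min: float) -> bool:
    """Simplified model: the alert fires if the speaker-down condition
    holds continuously for at least the for: duration (in minutes)."""
    return outage_min >= for_min


# Upper end of the 5-7 min worst-case range quoted in the commit message.
WORST_CASE_UPGRADE_OUTAGE_MIN = 7

# Hypothetical real outage, used only to show 10m still catches it.
REAL_OUTAGE_MIN = 15

for for_min in (2, 5, 10):
    blip = alert_fires(WORST_CASE_UPGRADE_OUTAGE_MIN, for_min)
    real = alert_fires(REAL_OUTAGE_MIN, for_min)
    print(f"for: {for_min}m  fires on 7m upgrade blip: {blip}  fires on 15m outage: {real}")
```

Under these assumptions, both 2m and 5m fire during a worst-case upgrade cycle, while 10m stays silent for the upgrade and still fires for a longer outage.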