infra/stacks/monitoring/modules/monitoring
Viktor Barzin 68f8514e61 monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression)
Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.

10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.

node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:32:41 +00:00
..
dashboards fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard 2026-05-22 14:15:38 +00:00
server-power-cycle Add broker-sync Terraform stack (#7) 2026-04-17 21:17:45 +01:00
alloy.yaml alloy: switch pod log shipping from apiserver to file-tail 2026-05-21 08:27:34 +00:00
Dockerfile extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
goflow2.tf [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
grafana.tf fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard 2026-05-22 14:15:38 +00:00
grafana_chart_values.yaml monitoring: protect grafana ingress with authentik + disable anonymous 2026-05-10 17:01:50 +00:00
idrac.tf infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
k8s-monitoring-values.yaml cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] 2026-03-25 23:56:07 +02:00
loki.tf security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE 2026-05-19 06:37:54 +00:00
loki.yaml [infra] TrueNAS decommission — remove active references from Terraform + configs 2026-04-19 16:57:05 +00:00
main.tf keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
prometheus.tf fix: HA Sofia REST sensors + PVC drift safety 2026-05-10 21:48:29 +00:00
prometheus_chart_values.tpl monitoring: MetalLBSpeakerDown for: 2m → 10m (was upgrade-chain regression) 2026-05-23 09:32:41 +00:00
prometheus_snmp_chart_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
pve_exporter.tf [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
snmp_exporter.tf infra: document auth = "app|none" tier on every legacy ingress 2026-05-11 19:25:48 +00:00
ups_snmp_values.yaml extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00