infra/stacks/monitoring
Viktor Barzin 503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.
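
As a rough illustration of why a longer `for:` hides short blips, here is a
minimal sketch of the hold-duration logic. It assumes a fixed 30s evaluation
interval and a hypothetical blip lasting six evaluations; the function name
and sample data are illustrative and not taken from the module.

```python
def fires(samples, for_seconds, interval_seconds=30):
    """Return True if an alert with the given hold duration would fire.

    samples: one boolean per rule evaluation (True = expression matched).
    interval_seconds is an assumed evaluation interval, not a value from the repo.
    """
    active_since = None  # evaluation time at which the condition first matched
    for i, matched in enumerate(samples):
        now = i * interval_seconds
        if not matched:
            active_since = None  # condition cleared, so the pending state resets
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_seconds:
            return True
    return False


# Hypothetical drain blip: expression matches from 0s through 150s, then clears.
blip = [True] * 6 + [False] * 4
print(fires(blip, for_seconds=120))  # True: a 2m hold is exceeded
print(fires(blip, for_seconds=180))  # False: a 3m hold outlasts the blip
```

A fault that persists past the hold duration still fires under the longer
setting; only the detection delay grows.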

- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
  drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
  speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
  with relative `< -0.5` (>50% drop vs. 10m ago); ordinary drains
  shed 10-30 containers, which tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
  cause brief 5xx spikes that clear in 1-2m).
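
The difference between the old absolute and new relative
KubeletRunningContainersDrop conditions can be sketched with plain
arithmetic. The container counts below are hypothetical, and the two helper
functions only mirror the `< -10` and `< -0.5` comparisons named above; they
are not the actual rule expressions.

```python
def absolute_rule(now, ten_min_ago):
    """Old-style check: fire when the raw container count fell by more than 10."""
    return (now - ten_min_ago) < -10


def relative_rule(now, ten_min_ago):
    """New-style check: fire when the count fell by more than 50% vs. 10m ago."""
    if ten_min_ago == 0:
        return False  # nothing was running before, so there is no drop to measure
    return (now - ten_min_ago) / ten_min_ago < -0.5


# Hypothetical drain: 200 containers 10m ago, 180 now (20 shed).
print(absolute_rule(180, 200))  # True: the old rule trips on a routine drain
print(relative_rule(180, 200))  # False: a 10% drop stays under the new threshold

# Hypothetical real fault: 200 containers 10m ago, 80 now.
print(relative_rule(80, 200))   # True: a 60% drop still fires
```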

Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:28:53 +00:00
modules/monitoring monitoring: tune 4 alerts for transient drain/upgrade blips 2026-05-23 09:28:53 +00:00
main.tf [forgejo] Tolerate missing Vault keys during Phase 0 bootstrap 2026-05-07 15:53:08 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00