infra

Viktor Barzin 503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips	2026-05-23 09:28:53 +00:00
..
monitoring	monitoring: tune 4 alerts for transient drain/upgrade blips	2026-05-23 09:28:53 +00:00

Viktor Barzin 503ac4c192 monitoring: tune 4 alerts for transient drain/upgrade blips

Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.
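
As a rough illustration of why a longer `for:` suppresses short blips, the sketch below models the pending-duration behaviour in plain Python. The alert fires only once the condition has held continuously for at least the `for:` window, and any healthy evaluation resets the clock. The function name, the 15s evaluation interval, and the sample series are hypothetical, and this is a simplification, not Prometheus's actual implementation.

```python
def fires(condition_samples, interval_s, for_s):
    """Return True if the condition holds continuously for at least for_s seconds.

    condition_samples: one boolean per rule evaluation (True = expression matched).
    interval_s: seconds between evaluations (hypothetical value chosen by the caller).
    for_s: the rule's `for:` duration in seconds.
    """
    held_s = None  # None means the condition is currently not matching
    for active in condition_samples:
        if not active:
            held_s = None  # a healthy evaluation resets the pending clock
            continue
        # First matching evaluation starts the clock at 0; later ones extend it.
        held_s = 0 if held_s is None else held_s + interval_s
        if held_s >= for_s:
            return True
    return False


# Hypothetical series, evaluated every 15s.
blip = [False] + [True] * 3 + [False] * 4   # condition matches for roughly 45s
outage = [False] + [True] * 20              # condition matches for about 5 minutes

print(fires(blip, 15, 120))    # 2m window: the short blip never fires
print(fires(outage, 15, 120))  # 2m window: the persistent fault fires
```

The same persistent series would not fire under a `for:` longer than the outage itself, which is the trade-off behind each duration chosen here.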

- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
  drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
  speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
  with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
  shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
  cause brief 5xx spikes that clear in 1-2m).
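
The KubeletRunningContainersDrop change can be sketched as plain arithmetic. This assumes the old rule compared the raw difference against the value 10 minutes earlier, and that the new rule divides that difference by the earlier value. The function names and container counts below are hypothetical and are not taken from the actual rule file.

```python
def relative_change(now, ago):
    """Fractional change versus the earlier value, e.g. -0.5 for a 50% drop."""
    if ago == 0:
        return 0.0  # assumption: treat an empty baseline as "no change"
    return (now - ago) / ago


def old_rule(now, ago):
    """Absolute threshold: matches when more than 10 containers disappeared."""
    return (now - ago) < -10


def new_rule(now, ago):
    """Relative threshold: matches when more than 50% of containers disappeared."""
    return relative_change(now, ago) < -0.5


# Hypothetical node that ran 80 containers 10 minutes ago.
print(old_rule(55, 80), new_rule(55, 80))  # routine drain sheds 25: old matches, new does not
print(old_rule(30, 80), new_rule(30, 80))  # 50 containers gone (62.5% drop): both match
```

Because the comparison is strict, a drop of exactly 50% does not match the relative rule in this sketch.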

Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-23 09:28:53 +00:00
