infra/stacks/monitoring/modules
Viktor Barzin 9b4970da61 monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits
- HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775)
  labels so the two same-named alerts are distinguishable in routing.
- HeadscaleDown (deployment-replicas flavor, line 1414): renamed to
  HeadscaleReplicasMismatch. The rule at line 2039 keeps the name
  HeadscaleDown as the real up-metric critical check. The NodeDown inhibit
  rule is updated to suppress the renamed alert too.
- EmailRoundtripStale (line 1816): for: 10m → 20m, so the alert survives
  one missed 20-min probe cycle before firing. This cuts flapping (12
  short-burst fires over the last 24h).
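
The effect of the longer for: window on EmailRoundtripStale can be sketched with a simplified model of the pending-to-firing transition. This is only an illustration under stated assumptions: the 1-minute evaluation interval, the 15-minute burst, and the helper name first_firing_index are hypothetical and do not come from the actual rule file.

```python
from datetime import timedelta


def first_firing_index(condition, eval_interval, for_duration):
    """Return the index of the first evaluation at which the alert fires.

    Simplified model: the alert becomes pending at the first evaluation
    where the condition is true, and fires once the condition has stayed
    true for at least `for_duration`. Any false evaluation resets it.
    Returns None if the alert never fires.
    """
    active_since = None
    for i, is_true in enumerate(condition):
        if not is_true:
            active_since = None  # condition cleared, pending state resets
            continue
        if active_since is None:
            active_since = i  # alert enters pending state
        if (i - active_since) * eval_interval >= for_duration:
            return i
    return None


# Hypothetical short burst: condition is true for 15 consecutive
# one-minute evaluations (indices 5..19), then clears again.
burst = [False] * 5 + [True] * 15 + [False] * 5
interval = timedelta(minutes=1)  # assumed evaluation interval

fires_at_10m = first_firing_index(burst, interval, timedelta(minutes=10))
fires_at_20m = first_firing_index(burst, interval, timedelta(minutes=20))
```

Under these assumptions the 10m window fires during the burst, while the 20m window lets the burst clear without ever firing.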

ATSOverload tuning skipped: the 24h fire-count is 0 because the alert is
firing continuously, not flapping. This is the already-known sustained
83% ATS load, so tuning would not change behavior.

8 backup *NeverSucceeded rules audited: the 7 that use
kube_cronjob_status_last_successful_time all target real K8s CronJobs
with active metrics (not Pushgateway-sourced). PrometheusBackupNeverRun
already uses absent() correctly. No fixes needed.
2026-04-21 22:29:15 +00:00
monitoring monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits 2026-04-21 22:29:15 +00:00