monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold)
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC. Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but the alert threshold was 32d (2764800s) -> it false-fired every year for the ~3 days between day-32 and the next run. Raised to 40d (3456000s): clears the max first-Sunday spacing with margin, still catches a genuinely missed monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified: live rule now > 3.456e6, alert state inactive.
This commit is contained in:
parent
c4bd64f88a
commit
63ee655c08
1 changed files with 8 additions and 2 deletions
|
|
@ -1554,12 +1554,18 @@ serverFiles:
|
|||
annotations:
|
||||
summary: "Redis backup CronJob has never completed successfully"
|
||||
- alert: PrometheusBackupStale
|
||||
expr: (time() - prometheus_backup_last_success_timestamp{job="prometheus-backup"}) > 2764800
|
||||
# The backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
|
||||
# Consecutive first-Sundays can be up to ~35-37 days apart (e.g.
|
||||
# May 3 → Jun 7 = 35d), so a 32d threshold false-fired every year
|
||||
# in the gap before the next run. 40d (3456000s) clears the max
|
||||
# first-Sunday spacing with margin while still catching a genuinely
|
||||
# missed monthly backup.
|
||||
expr: (time() - prometheus_backup_last_success_timestamp{job="prometheus-backup"}) > 3456000
|
||||
for: 30m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Prometheus backup is {{ $value | humanizeDuration }} old (threshold: 32d)"
|
||||
summary: "Prometheus backup is {{ $value | humanizeDuration }} old (threshold: 40d)"
|
||||
- alert: PrometheusBackupNeverRun
|
||||
expr: absent(prometheus_backup_last_success_timestamp{job="prometheus-backup"})
|
||||
for: 32d
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue