monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold)
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC. Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but the alert threshold was 32d (2764800s) -> it false-fired every year for the ~3 days between day-32 and the next run. Raised to 40d (3456000s): clears the max first-Sunday spacing with margin, still catches a genuinely missed monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified: live rule now > 3.456e6, alert state inactive.
This commit is contained in:
parent
c4bd64f88a
commit
63ee655c08
1 changed files with 8 additions and 2 deletions
|
|
@ -1554,12 +1554,18 @@ serverFiles:
|
||||||
annotations:
|
annotations:
|
||||||
summary: "Redis backup CronJob has never completed successfully"
|
summary: "Redis backup CronJob has never completed successfully"
|
||||||
- alert: PrometheusBackupStale
|
- alert: PrometheusBackupStale
|
||||||
expr: (time() - prometheus_backup_last_success_timestamp{job="prometheus-backup"}) > 2764800
|
# The backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
|
||||||
|
# Consecutive first-Sundays can be up to ~35-37 days apart (e.g.
|
||||||
|
# May 3 → Jun 7 = 35d), so a 32d threshold false-fired every year
|
||||||
|
# in the gap before the next run. 40d (3456000s) clears the max
|
||||||
|
# first-Sunday spacing with margin while still catching a genuinely
|
||||||
|
# missed monthly backup.
|
||||||
|
expr: (time() - prometheus_backup_last_success_timestamp{job="prometheus-backup"}) > 3456000
|
||||||
for: 30m
|
for: 30m
|
||||||
labels:
|
labels:
|
||||||
severity: critical
|
severity: critical
|
||||||
annotations:
|
annotations:
|
||||||
summary: "Prometheus backup is {{ $value | humanizeDuration }} old (threshold: 32d)"
|
summary: "Prometheus backup is {{ $value | humanizeDuration }} old (threshold: 40d)"
|
||||||
- alert: PrometheusBackupNeverRun
|
- alert: PrometheusBackupNeverRun
|
||||||
expr: absent(prometheus_backup_last_success_timestamp{job="prometheus-backup"})
|
expr: absent(prometheus_backup_last_success_timestamp{job="prometheus-backup"})
|
||||||
for: 32d
|
for: 32d
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue