monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup

vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them, so a stopped or
failing VM backup would be silent (exactly how the nfs-mirror reaping went
unnoticed until I re-verified). Add the trio to the 3-2-1 group in
prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts.
Stale fires when the last success is older than 50h (180000s) for 30m.
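
For reviewers, a rough Python sketch of the three expressions, assuming each
metric is either a single value or missing (None). The function and argument
names are hypothetical and only mirror the PromQL; `for:` hold times are
ignored.

```python
# Sketch only: mirrors the three PromQL exprs in plain Python.
# Names here are illustrative and are not part of the monitoring stack.
from typing import Optional, Set

STALE_THRESHOLD_S = 180_000  # 50h, the constant used in the VzdumpBackupStale expr


def firing_conditions(now: float,
                      last_run: Optional[float],
                      last_success: Optional[float],
                      last_status: Optional[int]) -> Set[str]:
    """Return the alert names whose expr would be true at `now`."""
    firing: Set[str] = set()
    # absent(vzdump_last_run_timestamp): the series does not exist at all
    if last_run is None:
        firing.add("VzdumpBackupNeverRun")
    # time() - vzdump_last_success_timestamp > 180000 (strictly greater)
    if last_success is not None and now - last_success > STALE_THRESHOLD_S:
        firing.add("VzdumpBackupStale")
    # vzdump_last_status != 0
    if last_status is not None and last_status != 0:
        firing.add("VzdumpBackupFailing")
    return firing
```

Note that a missing success timestamp yields no Stale result, since a PromQL
comparison over an absent series returns nothing; that case is only visible
through NeverRun or Failing.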

NOT [ci]-applied: this is a Terraform stack change, so the alerts arm on the
next `scripts/tg apply` of the monitoring stack (metrics already flow, so they
are live as soon as it is applied). Admin-gated apply per org policy.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-10 09:10:46 +00:00
parent 05f928931f
commit e49c91e60c
2 changed files with 23 additions and 1 deletions

@@ -1696,6 +1696,28 @@ serverFiles:
    severity: warning
  annotations:
    summary: "NFS local mirror last run failed (status={{ $value }})"
- alert: VzdumpBackupStale
  expr: (time() - vzdump_last_success_timestamp{job="vzdump-backup"}) > 180000
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "vzdump VM image backup is {{ $value | humanizeDuration }} old (threshold: ~50h / 2 daily cycles)"
    description: "vzdump-vms.timer on 192.168.1.127 hasn't produced a fresh devvm image. Check: ssh root@192.168.1.127 systemctl status vzdump-vms. Runbook: docs/architecture/backup-dr.md (VM Image Backups)."
- alert: VzdumpBackupNeverRun
  expr: absent(vzdump_last_run_timestamp{job="vzdump-backup"})
  for: 48h
  labels:
    severity: warning
  annotations:
    summary: "vzdump VM image backup job has never reported metrics to Pushgateway"
- alert: VzdumpBackupFailing
  expr: vzdump_last_status{job="vzdump-backup"} != 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "vzdump VM image backup last run failed (status={{ $value }})"
- alert: BackupDiskFull
  expr: (1 - node_filesystem_avail_bytes{job="proxmox-host", mountpoint="/mnt/backup"} / node_filesystem_size_bytes{job="proxmox-host", mountpoint="/mnt/backup"}) > 0.85
  for: 15m