monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection

Threshold was 48h (+ 30m for:) on a job that runs daily. We don't need to wait just over 2 days to detect a broken timer, so bring it down to 30h + 30m (just over a day of cadence + minor drift/retry grace). Also add a description pointing to the restore runbook so the alert text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.

Re-triggers the default.yml apply now that ci/Dockerfile is rebuilt with the vault CLI. This is the first commit touching a stack that will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00 · d39770b30d
commit d39770b30d
parent 3eb8b9a4ea
2 changed files with 4 additions and 3 deletions
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1249,12 +1249,13 @@ serverFiles:
            annotations:
              summary: "Backup job failed: {{ $labels.namespace }}/{{ $labels.job_name }}"
          - alert: LVMSnapshotStale
-            expr: (time() - lvm_snapshot_last_run_timestamp{job="lvm-pvc-snapshot"}) > 172800
+            expr: (time() - lvm_snapshot_last_run_timestamp{job="lvm-pvc-snapshot"}) > 108000
            for: 30m
            labels:
              severity: critical
            annotations:
              summary: "LVM PVC snapshots are {{ $value | humanizeDuration }} old (expected daily)"
+              description: "Timer lvm-pvc-snapshot.timer on 192.168.1.127 hasn't pushed fresh metrics. Runbook: docs/runbooks/restore-lvm-snapshot.md"
          - alert: LVMSnapshotNeverRun
            expr: absent(lvm_snapshot_last_run_timestamp{job="lvm-pvc-snapshot"})
            for: 48h
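
The detection delay of LVMSnapshotStale, before and after this change, can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `detection_delay` is hypothetical, the numbers are the ones in the diff, and it ignores scrape and rule-evaluation intervals, which add a small extra delay in practice.

```python
from datetime import timedelta

# Values copied from the diff; everything else here is illustrative.
OLD_THRESHOLD_S = 172800            # previous expr threshold, in seconds
NEW_THRESHOLD_S = 108000            # new expr threshold, in seconds
FOR_CLAUSE = timedelta(minutes=30)  # the rule's "for: 30m"


def detection_delay(threshold_s: int, for_clause: timedelta) -> timedelta:
    """Approximate time from the last pushed timestamp until the alert fires.

    The expression becomes true once the metric is older than threshold_s,
    and the alert fires after it has stayed true for the for_clause duration.
    """
    return timedelta(seconds=threshold_s) + for_clause


old = detection_delay(OLD_THRESHOLD_S, FOR_CLAUSE)
new = detection_delay(NEW_THRESHOLD_S, FOR_CLAUSE)

print(old)  # 2 days, 0:30:00
print(new)  # 1 day, 6:30:00
```

Under these assumptions a missed daily run is flagged roughly 6.5 hours after the next snapshot was due, rather than more than a full day after it.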