monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d (Viktor reported "getting a few of these lately"), and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught.

Docs updated in the same commit (monitoring.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
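The difference between the level-triggered and edge-triggered conditions can be sketched with a small simulation. This is an illustrative model only, not the Prometheus implementation: `simple_increase` is a hypothetical helper that mimics counter-reset handling but skips the extrapolation Prometheus applies in `increase()`, the `for: 5m` clause is ignored, and the timeline (30 min healthy, 600 min latched, reboot, 60 min healthy) is made up for the example.

```python
from typing import List


def simple_increase(window: List[float]) -> float:
    """Sum of increases across a window of samples.

    Simplified stand-in for PromQL increase(): a drop between two
    consecutive samples is treated as a counter reset (the series is
    assumed to have restarted from 0), and no extrapolation is applied.
    """
    if len(window) < 2:
        return 0.0
    total = 0.0
    prev = window[0]
    for cur in window[1:]:
        if cur < prev:
            # Reset: count only what accumulated since the restart.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


def firing_minutes_old(samples: List[float]) -> int:
    """Minutes for which the old condition (sticky bit == 1) holds."""
    return sum(1 for v in samples if v == 1)


def firing_minutes_new(samples: List[float], window_len: int = 60) -> int:
    """Minutes for which the simplified increase over the last hour is > 0."""
    count = 0
    for t in range(len(samples)):
        window = samples[max(0, t - window_len): t + 1]
        if simple_increase(window) > 0:
            count += 1
    return count


# Hypothetical 1-min samples of the sticky bit:
# 30 min healthy, brown-out latches for 600 min, then a reboot clears it.
samples = [0] * 30 + [1] * 600 + [0] * 60

if __name__ == "__main__":
    print("old condition true for", firing_minutes_old(samples), "min")
    print("new condition true for", firing_minutes_new(samples), "min")
```

Under these assumptions the old condition holds for the full 600 minutes the bit stays latched, while the edge-triggered condition holds for only the 60 minutes after the 0 to 1 transition. The 1 to 0 drop at reboot contributes nothing, which is the "clean reboot is a no-op" behaviour described above.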
parent fbf6f11038
commit fd77c0dc4f
2 changed files with 8 additions and 3 deletions
@@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
 
 **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
 
-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
 
 **Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
 
@@ -879,12 +879,17 @@ serverFiles:
   annotations:
     summary: "rpi-sofia rootfs is READ-ONLY — failing SD card (the silent-journal failure mode from this incident). Reflash/replace the card."
 - alert: RpiSofiaUndervoltage
-  expr: rpi_under_voltage_occurred{instance="rpi-sofia"} == 1
+  # Edge-trigger on the sticky firmware bit (rpi_under_voltage_now is too
+  # transient to catch at 1-min sampling — 0 hits in 14d). Fires on a NEW latch
+  # and auto-resolves ~1h later instead of latching until reboot; counter-reset
+  # handling makes a clean reboot a no-op. Since-boot state + history: Grafana "RPi Sofia".
+  expr: increase(rpi_under_voltage_occurred{instance="rpi-sofia"}[1h]) > 0
   for: 5m
   labels:
     severity: warning
   annotations:
-    summary: "rpi-sofia under-voltage detected since last boot — check PSU/USB power cable"
+    summary: "rpi-sofia under-voltage event in the last hour — check PSU/USB power cable + SD card"
+    description: "A new under-voltage brown-out latched on rpi-sofia within the last 1h (repeat/sustained events risk SD-card corruption). Sticky since-boot flag + history on Grafana 'RPi Sofia' (Hardware)."
 - alert: RpiSofiaHighTemp
   expr: rpi_soc_temp_celsius{instance="rpi-sofia"} > 75
   for: 10m