diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index a5dec0af..3c75a345 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
 
 **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
 
-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
 
 **Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
 
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 65e89278..f98e429c 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -879,12 +879,17 @@ serverFiles:
     annotations:
       summary: "rpi-sofia rootfs is READ-ONLY — failing SD card (the silent-journal failure mode from this incident). Reflash/replace the card."
   - alert: RpiSofiaUndervoltage
-    expr: rpi_under_voltage_occurred{instance="rpi-sofia"} == 1
+    # Edge-trigger on the sticky firmware bit (rpi_under_voltage_now is too
+    # transient to catch at 1-min sampling — 0 hits in 14d). Fires on a NEW latch
+    # and auto-resolves ~1h later instead of latching until reboot; counter-reset
+    # handling makes a clean reboot a no-op. Since-boot state + history: Grafana "RPi Sofia".
+    expr: increase(rpi_under_voltage_occurred{instance="rpi-sofia"}[1h]) > 0
     for: 5m
     labels:
       severity: warning
     annotations:
-      summary: "rpi-sofia under-voltage detected since last boot — check PSU/USB power cable"
+      summary: "rpi-sofia under-voltage event in the last hour — check PSU/USB power cable + SD card"
+      description: "A new under-voltage brown-out latched on rpi-sofia within the last 1h (repeat/sustained events risk SD-card corruption). Sticky since-boot flag + history on Grafana 'RPi Sofia' (Hardware)."
   - alert: RpiSofiaHighTemp
     expr: rpi_soc_temp_celsius{instance="rpi-sofia"} > 75
     for: 10m
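
Review note: the comment in the new `expr` relies on two properties of `increase()` applied to a 0/1 sticky bit: the expression is true only while the 0→1 edge is inside the 1h window, and a 1→0 drop at reboot is treated as a counter reset that adds nothing. The sketch below is a simplified model of that behaviour, not Prometheus itself. `simple_increase`, `edge_rule`, `WINDOW_SAMPLES`, and the sample timeline are hypothetical names and data made up for illustration. It skips the extrapolation to window boundaries that the real `increase()` performs, which should not change whether the result is above zero, and it does not model the `for: 5m` hold.

```python
from typing import List, Sequence

WINDOW_SAMPLES = 60  # assumption: 1h window at the 1-min scrape interval


def simple_increase(samples: Sequence[float]) -> float:
    """Sum of counter-style deltas across one window, without extrapolation."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop is treated as a counter reset: counting restarts from zero,
        # so only the post-reset value is added.
        total += cur if cur < prev else cur - prev
    return total


def edge_rule(series: Sequence[float], window: int = WINDOW_SAMPLES) -> List[bool]:
    """Evaluate a simplified `increase(series[window]) > 0` at every sample index."""
    return [
        simple_increase(series[max(0, i - window + 1): i + 1]) > 0
        for i in range(len(series))
    ]


# Hypothetical timeline, one sample per minute:
# 10 min clean, the sticky bit latches at index 10, a reboot clears it at index 100.
series = [0.0] * 10 + [1.0] * 90 + [0.0] * 50

old_rule = [v == 1 for v in series]  # previous expr: true until the reboot
new_rule = edge_rule(series)         # new expr: true only while the edge is in the window

print("old rule true for", sum(old_rule), "samples")
print("new rule true for", sum(new_rule), "samples")
print("new rule true at reboot:", new_rule[100])
```

In this model the old expression stays true for the whole latched period, while the new one drops back to false about an hour after the latch and stays false across the reboot. A second brown-out while the bit is already latched produces no new edge, so it would not re-fire until after a reboot clears the bit.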