monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant

The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-05 16:33:20 +00:00
parent 6442978f07
commit 6b1d23abbd
8 changed files with 1891 additions and 55 deletions


@@ -65,6 +65,8 @@ graph TB
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
| blackbox-exporter (Authentik walling-off guard) | `prom/blackbox-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` | Single-purpose blackbox-exporter. Its `http_no_authentik_redirect` module probes each must-stay-public carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff the response redirects to Authentik. Scraped by job `blackbox-authentik-walloff` (1m); feeds alert `AuthentikWallingOffPublicPath`. Target list = `local.authentik_walloff_targets` in the same file. |
| snmp-exporter | `prom/snmp-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/snmp_exporter.tf` + `ups_snmp_values.yaml` | SNMP→Prometheus bridge. Modules in `ups_snmp_values.yaml`: `huawei` (UPS), `if_mib`/`ip_mib`, and **`dell_idrac`** (R730 iDRAC, merged from `prometheus_snmp_chart_values.yaml` 2026-06-05 + hand-added fan-RPM `coolingDeviceReading` / amperage location lookup). Scrape jobs: `snmp-ups` (30s, module=huawei), **`snmp-idrac` (1m, module=dell_idrac, auth=public_v2)** — the FAST primary source for R730 health/thermal/power/fan/voltage since the 2026-06-05 Redfish→SNMP migration (~3.7s/scrape vs Redfish ~18.5s). Relabels all metrics to `r730_idrac_<mibName>`. |
| idrac-redfish-exporter | `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix` (mrlhansen/idrac_exporter, Keel-managed) | `stacks/monitoring/modules/monitoring/idrac.tf` | **Slow remnant** (10m scrape, job `redfish-idrac`) since the 2026-06-05 SNMP migration — was the sole iDRAC source at a 3m interval, demoted once SNMP took over the fast path. Trimmed to `system,sensors,power,storage,network,memory`. Serves only what SNMP can't (indicator LED, NIC link-speed Mbps, machine/BIOS info, per-drive storage table) **and keeps HA Sofia's `sensor.r730_fan_speed` REST sensor alive** — that sensor reads `idrac_sensors_fan_speed` from this exporter directly, so the `sensors` collector must stay enabled here. |
## How It Works
@@ -103,6 +105,19 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
> The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.
### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)
The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.
- **`snmp-idrac` (FAST, primary, 1m / 30s):** snmp-exporter `dell_idrac` module against `:161` (v2c, community `Public0` = `auth=public_v2`). ~3.7s/scrape. Serves all dynamic + health + alerting metrics: `r730_idrac_temperatureProbeReading` (tenths-°C, ÷10), `coolingDeviceReading` (fan RPM, label `coolingDeviceLocationName`), `amperageProbeReading{amperageProbeLocationName="System Board Pwr Consumption"}` (watts), `powerSupplyCurrentInputVoltage`, `globalSystemStatus`, `systemPowerState`, `powerSupplyStatus`, `physicalDiskComponentStatus`, `systemStateMemoryDeviceStatusCombined`, etc.
- **`redfish-idrac` (SLOW remnant, 10m / 45s):** the old mrlhansen exporter, trimmed, kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS info, per-drive storage table) and to feed **HA Sofia's `sensor.r730_fan_speed`** (reads `idrac_sensors_fan_speed` from the exporter HTTP endpoint directly — NOT via Prometheus, so its freshness is HA's REST poll, independent of the 10m Prometheus scrape).
**Gotchas:**
- **Enum values differ from the old Redfish metrics.** DellStatus: `3 = OK` (was Redfish `1`); `systemPowerState`: `4 = on` (was `2`). All iDRAC alert exprs were rewritten accordingly (`!= 3`, `!= 4`).
- The alert `iDRACSNMPMetricsMissing` was historically a misnomer (checked a Redfish metric); it now correctly probes `absent(r730_idrac_globalSystemStatus)`. `iDRACRedfishMetricsMissing` now probes `absent(r730_idrac_powerSupplyCurrentInputVoltage)`.
- **SSD life % + SEL are genuine SNMP gaps but were already inert** (Redfish reported `0`/empty), so the SSD-wear alerts (kept on `r730_idrac_idrac_storage_drive_life_left_percent`) and the SEL dashboard panel are unchanged.
- Why SNMP: the Redfish exporter (`metrics: all: true`) walked every subtree on each scrape — ~18.5s avg / 28s peak against the slow BMC — which forced the infrequent interval. SNMP is a single fast walk.
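A minimal Python sketch of how a consumer could interpret the raw SNMP samples described above. It assumes only what this section states (temperatures in tenths of a degree, DellStatus `3 = OK`, `systemPowerState` `4 = on`); the helper names are hypothetical and not part of the stack.

```python
# Hypothetical helpers for illustration; not part of the monitoring stack.
DELL_STATUS_OK = 3   # DellStatus: 3 = OK (the old Redfish metrics used 1)
POWER_STATE_ON = 4   # systemPowerState: 4 = on (the old Redfish metrics used 2)


def temperature_celsius(raw_tenths: float) -> float:
    """Convert a temperatureProbeReading sample (tenths of a degree C) to degrees C."""
    return raw_tenths / 10.0


def is_status_ok(value: int) -> bool:
    """True when a DellStatus-typed metric (e.g. globalSystemStatus) reports OK."""
    return value == DELL_STATUS_OK


def is_powered_on(value: int) -> bool:
    """True when systemPowerState reports the host as on."""
    return value == POWER_STATE_ON


if __name__ == "__main__":
    print(temperature_celsius(235))            # raw 235 is 23.5 degrees C
    print(is_status_ok(3), is_status_ok(1))    # 1 was healthy under Redfish, not here
    print(is_powered_on(4), is_powered_on(2))
```

The same comparisons are what the rewritten alert expressions (`!= 3`, `!= 4`) encode in PromQL.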
### Alert Cascade Inhibition
Alertmanager implements intelligent alert suppression to prevent alert storms during cascading failures:


@@ -0,0 +1,116 @@
# iDRAC monitoring: Redfish → SNMP migration (design)
**Date:** 2026-06-05
**Status:** approved (Viktor) — SNMP primary + thin Redfish remnant
**Stack:** `stacks/monitoring`
## Problem
The R730 iDRAC Redfish exporter (`idrac-redfish-exporter`, mrlhansen
`idrac_exporter`, image `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix`)
is configured `metrics: all: true`. It collects on-demand and walks every
Redfish subtree, making dozens of sequential ~1–2 s requests to a slow BMC.
Measured live (Prometheus `scrape_duration_seconds{job="redfish-idrac"}`, 24 h):
- **avg 18.5 s, peak 28.3 s**, occasional fast-fail 0.085 s.
- Pinned to a **3 m interval / 45 s timeout** because it cannot run at the 2 m
global cadence.
The cost is dominated by walks that feed **dashboard-only** panels (`memory`
10 DIMMs, `network`, `events`/SEL); the operationally important metrics (fan
speed, temps, power, voltage) come from cheap single-request collectors.
## Decision
Make **SNMP the fast primary source** and keep a **thin, slow Redfish remnant**
for the few things SNMP cannot serve. SNMP walks are fast (the `snmp-ups` job
runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.
Rejected alternatives: (1) pure collector-trim of Redfish — still BMC-bound and
slow; (2) pure SNMP / retire Redfish — would require re-pointing the **external
ha-sofia** `sensor.r730_fan_speed` REST sensor (collides with a live session
editing the fan dashboard) and would drop two cosmetic panels.
## Key findings (ground-truthed)
- **The `snmp-idrac` job was dead.** It specified **no `module`** param, so
`snmp_exporter` defaulted to `if_mib` and returned only the iDRAC NIC's
interface counters — zero health/power/thermal. Both iDRAC jobs relabel to
`r730_idrac_*`, which hid this. The alert `iDRACSNMPMetricsMissing` is
**misnamed** — its expr `absent(r730_idrac_idrac_system_health)` checks a
*Redfish* metric.
- **A generated `dell_idrac` module already exists**, unmounted, in
`prometheus_snmp_chart_values.yaml` (~lines 791–1628). The mounted config is
`ups_snmp_values.yaml` (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c,
community `Public0` (already the `public_v2` auth in `ups_snmp_values.yaml`).
- **Live snmpwalk (Public0, 192.168.1.4) confirms** these return real data:
fan RPM `coolingDeviceReading` (.4.700.12.1.6 = 7080 RPM), temps
`temperatureProbeReading` (.700.20.1.6, tenths-°C), system watts
`amperageProbeReading` (.600.30.1.6 = 252 W), PSU input voltage
`powerSupplyCurrentInputVoltage` (.600.12.1.16), PSU watts/health, global
health `globalSystemStatus` (.5.2.1), `systemState*` rollups (.200),
`physicalDisk*` status, `memoryDevice*` size/status/type/speed (.1100),
`networkDevice*` status/connection (.1100.90), BIOS `2.19.0` (.300.50.1.8),
model/service-tag (.5.1.3).
- **Genuine SNMP gaps — but inert or cosmetic today:**
- SSD life-left % (`physicalDiskRemainingRatedWriteEndurance` .49) → returns
`255` (N/A) for every drive incl. the Samsung SSD. **Redfish today reports
`0`** on the one drive that has it, and the SSD-wear alerts guard on `> 0`,
so they **already never fire** → no functional loss.
- SEL event log (`5.5.2`) → `NoSuchObject`. The `idrac_events_log_entry`
metric is **already empty in Prometheus** today → no loss.
- Indicator LED (`5.1.4`) → absent. Cosmetic ("Off") panel.
- NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
- Average watts → no native OID; reconstruct via PromQL `avg_over_time()`.
Conclusion: **every metric with real, used data today has an SNMP equivalent.**
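The average-watts reconstruction relies on PromQL `avg_over_time()`, which takes the arithmetic mean of the samples inside the range window. A minimal Python sketch of that calculation follows. The sample values are invented for illustration (they are not measurements from the R730), and the window boundary handling is simplified.

```python
from typing import Optional, Sequence, Tuple


def avg_over_time(
    samples: Sequence[Tuple[float, float]], now: float, window_s: float
) -> Optional[float]:
    """Mean of the sample values whose timestamp falls in (now - window_s, now].

    `samples` is a sequence of (unix_timestamp, value) pairs, e.g. successive
    amperageProbeReading scrapes. Returns None when the window is empty.
    """
    values = [value for ts, value in samples if now - window_s < ts <= now]
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Invented 1m-interval watt readings: (timestamp, watts).
    readings = [(0, 250.0), (60, 252.0), (120, 260.0), (180, 254.0)]
    print(avg_over_time(readings, now=180, window_s=120))
```

With a 1m scrape interval, a window of a few minutes already averages several samples, which is why no native average-watts OID is needed.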
## Naming / enum strategy
`snmp_exporter` names metrics after MIB objects (`temperatureProbeReading`,
`coolingDeviceReading`, `globalSystemStatus`, …) → after the `r730_idrac_`
relabel they are `r730_idrac_<mibName>`, different from today's
`r730_idrac_idrac_*` / `r730_idrac_redfish_*`. **Re-point consumers** (not
alias): aliasing via `metric_relabel_configs` only renames `__name__` and
cannot fix the label-set mismatch (Redfish `member_id`/`name` vs SNMP numeric
indexes) nor the **enum-value mismatch** (DellStatus `3=OK` vs Redfish `1`;
`systemPowerState 4=on` vs Redfish `2`). Alert exprs must change regardless, so
re-pointing is the honest path. The module adds `lookups:` so SNMP series carry
human labels (probe/fan location, disk display name) like today.
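To illustrate why renaming alone is not enough, here is a small Python sketch. It models a sample as a name, a label set, and a value, and shows that a name-only rename leaves the value untouched, so a rule written against the Redfish enum would fire on a healthy SNMP reading. The `Sample` type and the helper functions are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    name: str
    labels: Tuple[Tuple[str, str], ...]
    value: float


def rename(sample: Sample, new_name: str) -> Sample:
    """What a __name__-only relabel does: the name changes, nothing else."""
    return replace(sample, name=new_name)


def redfish_style_alert_fires(sample: Sample) -> bool:
    """Old-style rule: healthy is 1, anything else fires."""
    return sample.value != 1


def snmp_style_alert_fires(sample: Sample) -> bool:
    """Re-pointed rule: DellStatus healthy is 3."""
    return sample.value != 3


if __name__ == "__main__":
    healthy = Sample("r730_idrac_globalSystemStatus", (), 3)
    aliased = rename(healthy, "r730_idrac_idrac_system_health")
    print(redfish_style_alert_fires(aliased))  # the old rule misfires on a healthy host
    print(snmp_style_alert_fires(healthy))     # the re-pointed rule stays quiet
```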
## Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)
REGEN = OID returns data but must be added to the module walk.
| Consumed (today) | Source after migration |
|---|---|
| fan health | REGEN `coolingDeviceStatus` .700.12.1.5 |
| consumed watts | DIRECT `amperageProbeReading` (System Board Pwr Consumption) |
| system health rollup | DIRECT `globalSystemStatus` .5.2.1 |
| PSU health | DIRECT `powerSupplyStatus`/`powerSupplySensorState` |
| memory health | DIRECT `systemStateMemoryDeviceStatusCombined` .200.10.1.27 |
| storage drive health | DIRECT `physicalDiskComponentStatus` .5.5.1.20.130.4.1.24 |
| **SSD life %** | **remnant** (SNMP=255 N/A; already inert) |
| system power state | DIRECT `systemPowerState` .5.2.4 (enum 4=on) |
| PSU input voltage | DIRECT `powerSupplyCurrentInputVoltage` .600.12.1.16 |
| system health (absent-probe) | DIRECT `globalSystemStatus` |
| **fan speed RPM (HA)** | DIRECT via remnant (HA reads exporter directly); SNMP REGEN `coolingDeviceReading` for Grafana |
| temperature | DIRECT `temperatureProbeReading` .700.20.1.6 (÷10) |
| avg watts | PromQL `avg_over_time(amperageProbeReading)` |
| **SEL log** | **remnant** (already empty) |
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
| memory size / cpu count | DIRECT `memoryDeviceSize` (sum) / `processorDeviceStatus` (count) |
| **indicator LED** | **remnant** (cosmetic) |
| storage drive info/health/capacity | DIRECT `physicalDisk*` |
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15}) |
| network port health/link / **Mbps** | REGEN `networkDevice*` (.1100.90); **Mbps → remnant/drop** |
| PSU output/input/capacity watts | DIRECT `powerSupplyOutputWatts`/`RatedInputWattage` |
## Remnant role
The Redfish exporter stays alive (so the external ha-sofia
`sensor.r730_fan_speed` REST poll is **unchanged** — no ha-sofia edit, no
collision). It is trimmed to `sensors,system,network,storage,events` and its
Prometheus scrape slows to 10 m, keeping **only** the gap metrics (indicator
LED, NIC Mbps, SSD-life, SEL) via `metric_relabel_configs` to avoid duplicate
series with SNMP.


@@ -0,0 +1,53 @@
# iDRAC Redfish → SNMP migration (plan)
Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.
Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~791–1628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`
## Phase A — additive SNMP source (low risk)
- [ ] A1. Extract `dell_idrac` (lines 791–1628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
- `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
- `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
- `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
- `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
- DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
- `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
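The A4 check can be scripted. The Python sketch below builds the same probe URL as the `curl` command above and checks a scrape body for expected metric names. It is an illustration only: `probe_url`, `metric_names`, and `missing_metrics` are hypothetical helpers, the sample body (including the fan location label value) is invented, and the expected-name list should be taken from the real module output.

```python
from typing import Iterable, List, Set
from urllib.parse import urlencode


def probe_url(base: str, module: str, auth: str, target: str) -> str:
    """Build an snmp_exporter probe URL; urlencode percent-encodes the ':' in target."""
    query = urlencode({"module": module, "auth": auth, "target": target})
    return f"{base}/snmp?{query}"


def metric_names(exposition: str) -> Set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The name ends at the first '{' (labels) or the first space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition: str, expected: Iterable[str]) -> List[str]:
    """Expected metric names that do not appear in the scrape body."""
    return sorted(set(expected) - metric_names(exposition))


if __name__ == "__main__":
    print(probe_url("http://snmp-exporter.monitoring.svc:9116",
                    "dell_idrac", "public_v2", "192.168.1.4:161"))
    body = (
        "# TYPE coolingDeviceReading gauge\n"
        'coolingDeviceReading{coolingDeviceLocationName="System Board Fan1"} 7080\n'
        "globalSystemStatus 3\n"
    )
    print(missing_metrics(body, ["coolingDeviceReading", "globalSystemStatus",
                                 "temperatureProbeReading"]))
```

The exporter output carries bare MIB names; the `r730_idrac_` prefix is only added by the Prometheus relabel, so the expected list here should not include it.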
## Phase B — re-point consumers to verified SNMP names (riskier)
- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing` → `absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing` → `absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.
## Phase C — thin Redfish remnant
- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra — now SNMP). (HA reads `sensors` directly — unchanged.)
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter — should be untouched).
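The keep-only filter in C2 can be reasoned about with a short Python sketch. Prometheus relabel regexes are fully anchored, so a `keep` action on `__name__` behaves like `re.fullmatch`. The names `idrac_events_log_entry` and `idrac_storage_drive_life_left_percent` follow the metric names used in these docs; the LED and NIC names are placeholders, since the exact exporter names are not recorded here.

```python
import re
from typing import Iterable, List

# Alternation of the gap series to keep. The two "example_" names are
# placeholders; replace them with the names the exporter actually emits.
KEEP_PATTERN = re.compile(
    "idrac_events_log_entry"
    "|idrac_storage_drive_life_left_percent"
    "|example_indicator_led"
    "|example_nic_speed_mbps"
)


def keep(names: Iterable[str]) -> List[str]:
    """Emulate a keep action: retain names the regex matches in full."""
    return [name for name in names if KEEP_PATTERN.fullmatch(name)]


if __name__ == "__main__":
    scraped = [
        "idrac_events_log_entry",
        "idrac_sensors_fan_speed",        # dropped from Prometheus only
        "idrac_events_log_entry_total",   # hypothetical; full anchoring rejects it
    ]
    print(keep(scraped))
```

Dropping `idrac_sensors_fan_speed` at this stage affects only what Prometheus stores. HA Sofia reads the exporter endpoint directly, so the sensor keeps working as long as the `sensors` collector stays enabled.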
## Phase D — docs + ship
- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, and update any affected runbook.
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.
## Rollback
All Terraform-managed. Reverting the monitoring-stack commit and running
`scripts/tg apply` restores the Redfish-primary state. Phase A is additive (safe
to leave even if B/C are reverted).
## Presence
Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.