monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant

The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape (~18.5s avg / 28s peak against the slow iDRAC), forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s; now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working: it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6442978f07
commit 6b1d23abbd
8 changed files with 1891 additions and 55 deletions
116
docs/plans/2026-06-05-idrac-snmp-migration-design.md
Normal file
@@ -0,0 +1,116 @@
# iDRAC monitoring: Redfish → SNMP migration (design)

**Date:** 2026-06-05
**Status:** approved (Viktor); SNMP primary + thin Redfish remnant
**Stack:** `stacks/monitoring`

## Problem

The R730 iDRAC Redfish exporter (`idrac-redfish-exporter`, mrlhansen
`idrac_exporter`, image `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix`)
is configured `metrics: all: true`. It collects on-demand and walks every
Redfish subtree, making dozens of sequential ~1–2 s requests to a slow BMC.

Measured live (Prometheus `scrape_duration_seconds{job="redfish-idrac"}`, 24 h):
- **avg 18.5 s, peak 28.3 s**, occasional fast-fail 0.085 s.
- Pinned to a **3 m interval / 45 s timeout** because it cannot run at the 2 m
  global cadence.

The cost is dominated by walks that feed **dashboard-only** panels (`memory`
10 DIMMs, `network`, `events`/SEL); the operationally important metrics (fan
speed, temps, power, voltage) come from cheap single-request collectors.

## Decision

Make **SNMP the fast primary source** and keep a **thin, slow Redfish remnant**
for the few things SNMP cannot serve. SNMP walks are fast (the `snmp-ups` job
runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.

Rejected alternatives: (1) pure collector-trim of Redfish: still BMC-bound and
slow; (2) pure SNMP / retire Redfish: would require re-pointing the **external
ha-sofia** `sensor.r730_fan_speed` REST sensor (collides with a live session
editing the fan dashboard) and would drop two cosmetic panels.

## Key findings (ground-truthed)

- **The `snmp-idrac` job was dead.** It specified **no `module`** param, so
  `snmp_exporter` defaulted to `if_mib` and returned only the iDRAC NIC's
  interface counters: zero health/power/thermal. Both iDRAC jobs relabel to
  `r730_idrac_*`, which hid this. The alert `iDRACSNMPMetricsMissing` is
  **misnamed**: its expr `absent(r730_idrac_idrac_system_health)` checks a
  *Redfish* metric.
- **A generated `dell_idrac` module already exists**, unmounted, in
  `prometheus_snmp_chart_values.yaml` (~lines 79–1628). The mounted config is
  `ups_snmp_values.yaml` (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c,
  community `Public0` (already the `public_v2` auth in `ups_snmp_values.yaml`).
- **Live snmpwalk (Public0, 192.168.1.4) confirms** these return real data:
  fan RPM `coolingDeviceReading` (.4.700.12.1.6 = 7080 RPM), temps
  `temperatureProbeReading` (.700.20.1.6, tenths-°C), system watts
  `amperageProbeReading` (.600.30.1.6 = 252 W), PSU input voltage
  `powerSupplyCurrentInputVoltage` (.600.12.1.16), PSU watts/health, global
  health `globalSystemStatus` (.5.2.1), `systemState*` rollups (.200),
  `physicalDisk*` status, `memoryDevice*` size/status/type/speed (.1100),
  `networkDevice*` status/connection (.1100.90), BIOS `2.19.0` (.300.50.1.8),
  model/service-tag (.5.1.3).
- **Genuine SNMP gaps, but inert or cosmetic today:**
  - SSD life-left % (`physicalDiskRemainingRatedWriteEndurance` .49) → returns
    `255` (N/A) for every drive incl. the Samsung SSD. **Redfish today reports
    `0`** on the one drive that has it, and the SSD-wear alerts guard on `> 0`,
    so they **already never fire** → no functional loss.
  - SEL event log (`5.5.2`) → `NoSuchObject`. The `idrac_events_log_entry`
    metric is **already empty in Prometheus** today → no loss.
  - Indicator LED (`5.1.4`) → absent. Cosmetic ("Off") panel.
  - NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
  - Average watts → no native OID; reconstruct via PromQL `avg_over_time()`.

Conclusion: **every metric with real, used data today has an SNMP equivalent.**

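The two raw-value quirks noted above (temperatures reported in tenths of a degree, and `255` standing in for "not available" on the SSD endurance OID) can be sketched as follows. This is only an illustration of the decoding a dashboard or alert has to do; the function and constant names are hypothetical and do not come from the repository.

```python
from typing import Optional

# Sentinel observed in the live walk for physicalDiskRemainingRatedWriteEndurance.
ENDURANCE_NOT_AVAILABLE = 255


def tenths_to_celsius(raw: int) -> float:
    """temperatureProbeReading is in tenths of a degree Celsius."""
    return raw / 10.0


def endurance_percent(raw: int) -> Optional[int]:
    """Return the remaining-endurance percentage, or None for the N/A sentinel."""
    return None if raw == ENDURANCE_NOT_AVAILABLE else raw


if __name__ == "__main__":
    print(tenths_to_celsius(235))   # a raw reading of 235 is 23.5 degC
    print(endurance_percent(255))   # None: the drive does not report wear
```

In Grafana the same division would normally be done in the panel query rather than in code; the sketch only makes the unit and the sentinel explicit.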
## Naming / enum strategy

`snmp_exporter` names metrics after MIB objects (`temperatureProbeReading`,
`coolingDeviceReading`, `globalSystemStatus`, …) → after the `r730_idrac_`
relabel they are `r730_idrac_<mibName>`, different from today's
`r730_idrac_idrac_*` / `r730_idrac_redfish_*`. **Re-point consumers** (not
alias): aliasing via `metric_relabel_configs` only renames `__name__` and
cannot fix the label-set mismatch (Redfish `member_id`/`name` vs SNMP numeric
indexes) nor the **enum-value mismatch** (DellStatus `3=OK` vs Redfish `1`;
`systemPowerState 4=on` vs Redfish `2`). Alert exprs must change regardless, so
re-pointing is the honest path. The module adds `lookups:` so SNMP series carry
human labels (probe/fan location, disk display name) like today.

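To make the enum mismatch concrete, here is a small sketch of why a renamed series alone would not work: the same comparison that meant "healthy" for the Redfish exporter is false for a healthy SNMP reading. Only `3=OK`, `4=on`, and the Redfish values `1` and `2` come from this document; the remaining `DELL_STATUS` entries are my recollection of the IDRAC-MIB status enum and should be verified against the MIB. All Python names are hypothetical.

```python
# Status values as recalled from the Dell IDRAC-MIB (assumption; only 3 = OK
# is confirmed by this design doc).
DELL_STATUS = {
    1: "other",
    2: "unknown",
    3: "ok",
    4: "nonCritical",
    5: "critical",
    6: "nonRecoverable",
}

DELL_STATUS_OK = 3          # SNMP: DellStatus OK
SYSTEM_POWER_STATE_ON = 4   # SNMP: systemPowerState on
REDFISH_OK = 1              # value the Redfish exporter used for healthy
REDFISH_POWER_ON = 2        # value the Redfish exporter used for powered on


def is_healthy(dell_status: int) -> bool:
    """True when an SNMP DellStatus reading means OK."""
    return dell_status == DELL_STATUS_OK


def is_powered_on(system_power_state: int) -> bool:
    """True when an SNMP systemPowerState reading means on."""
    return system_power_state == SYSTEM_POWER_STATE_ON


def describe(dell_status: int) -> str:
    """Human-readable name for a status value, tolerant of unknown codes."""
    return DELL_STATUS.get(dell_status, f"unrecognised({dell_status})")


if __name__ == "__main__":
    # A Redfish-era threshold applied to an SNMP series gives the wrong answer.
    print(is_healthy(REDFISH_OK))        # False
    print(is_healthy(DELL_STATUS_OK))    # True
```

This is the reason alert expressions and Grafana value mappings have to be rewritten rather than aliased.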
## Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)

REGEN = OID returns data but must be added to the module walk.

| Consumed (today) | Source after migration |
|---|---|
| fan health | REGEN `coolingDeviceStatus` .700.12.1.5 |
| consumed watts | DIRECT `amperageProbeReading` (System Board Pwr Consumption) |
| system health rollup | DIRECT `globalSystemStatus` .5.2.1 |
| PSU health | DIRECT `powerSupplyStatus`/`powerSupplySensorState` |
| memory health | DIRECT `systemStateMemoryDeviceStatusCombined` .200.10.1.27 |
| storage drive health | DIRECT `physicalDiskComponentStatus` .5.5.1.20.130.4.1.24 |
| **SSD life %** | **remnant** (SNMP=255 N/A; already inert) |
| system power state | DIRECT `systemPowerState` .5.2.4 (enum 4=on) |
| PSU input voltage | DIRECT `powerSupplyCurrentInputVoltage` .600.12.1.16 |
| system health (absent-probe) | DIRECT `globalSystemStatus` |
| **fan speed RPM (HA)** | DIRECT via remnant (HA reads exporter directly); SNMP REGEN `coolingDeviceReading` for Grafana |
| temperature | DIRECT `temperatureProbeReading` .700.20.1.6 (÷10) |
| avg watts | PromQL `avg_over_time(amperageProbeReading)` |
| **SEL log** | **remnant** (already empty) |
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
| memory size / cpu count | DIRECT `memoryDeviceSize` (sum) / `processorDeviceStatus` (count) |
| **indicator LED** | **remnant** (cosmetic) |
| storage drive info/health/capacity | DIRECT `physicalDisk*` |
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15}) |
| network port health/link / **Mbps** | REGEN `networkDevice*` (.1100.90); **Mbps → remnant/drop** |
| PSU output/input/capacity watts | DIRECT `powerSupplyOutputWatts`/`RatedInputWattage` |

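The "avg watts" row relies on PromQL `avg_over_time()` because SNMP has no native average-power OID. As a rough mental model (not Prometheus's implementation), the function is the arithmetic mean of the samples that fall inside the lookback window. The sketch below assumes a left-open window and uses made-up sample values; the helper name mirrors the PromQL function but is otherwise hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, watts)


def avg_over_time(samples: List[Sample], now: float, window_s: float) -> Optional[float]:
    """Mean of the sample values with now - window_s < timestamp <= now."""
    in_window = [value for ts, value in samples if now - window_s < ts <= now]
    if not in_window:
        return None  # no samples in the window: the series would be absent
    return sum(in_window) / len(in_window)


if __name__ == "__main__":
    # Three 1-minute scrapes of the system-board power probe (illustrative values).
    watts = [(0.0, 240.0), (60.0, 252.0), (120.0, 264.0)]
    print(avg_over_time(watts, now=120.0, window_s=180.0))
```

With a 1 m scrape interval the window has to span at least a couple of scrapes, otherwise the average degenerates to the latest sample.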
## Remnant role

The Redfish exporter stays alive (so the external ha-sofia
`sensor.r730_fan_speed` REST poll is **unchanged**: no ha-sofia edit, no
collision). It is trimmed to `sensors,system,network,storage,events` and its
Prometheus scrape slows to 10 m, keeping **only** the gap metrics (indicator
LED, NIC Mbps, SSD-life, SEL) via `metric_relabel_configs` to avoid duplicate
series with SNMP.

53
docs/plans/2026-06-05-idrac-snmp-migration-plan.md
Normal file
@@ -0,0 +1,53 @@
# iDRAC Redfish → SNMP migration (plan)

Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.

Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~79–1628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`

## Phase A: additive SNMP source (low risk)

- [ ] A1. Extract `dell_idrac` (lines 79–1628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
    - `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
    - `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
    - `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
    - `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
    - DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
    - `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).

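The A4 check can be scripted instead of eyeballing `curl` output. The sketch below builds the same probe URL as the `curl` command in A4 and reports which expected metric names are missing from a text-format response body. It assumes the exporter's output carries the raw MIB names (the `r730_idrac_` prefix is applied later by the Prometheus relabel); the helper names and the sample response body are hypothetical.

```python
from typing import Iterable, List, Set
from urllib.parse import urlencode


def probe_url(base: str, module: str, auth: str, target: str) -> str:
    """Build an snmp_exporter probe URL with properly encoded parameters."""
    query = urlencode({"module": module, "auth": auth, "target": target})
    return f"{base}/snmp?{query}"


def metric_names(exposition: str) -> Set[str]:
    """Collect metric names from a Prometheus text-format response body."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The name ends at the first '{' (labels) or the first space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition: str, expected: Iterable[str]) -> List[str]:
    """Return the expected metric names that the response does not contain."""
    present = metric_names(exposition)
    return sorted(name for name in expected if name not in present)


if __name__ == "__main__":
    print(probe_url("http://snmp-exporter.monitoring.svc:9116",
                    "dell_idrac", "public_v2", "192.168.1.4:161"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded body to `missing_metrics` with the REGEN/DIRECT names from the design table gives a quick pass/fail before Phase B starts.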
## Phase B: re-point consumers to verified SNMP names (riskier)

- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing`→`absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing`→`absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.

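One cheap guard for B1/B2 is a lint pass that flags any expression still referencing the Redfish-era prefixes named in the design doc (`r730_idrac_idrac_*`, `r730_idrac_redfish_*`). The sketch below is a hypothetical helper, not something in the repository; it works on expression strings and leaves extracting them from the `.tpl` and dashboard JSON to the caller.

```python
import re
from typing import List

# Metric names that still carry a Redfish-exporter prefix after the relabel.
LEGACY_METRIC = re.compile(r"\br730_idrac_(?:idrac|redfish)_\w+")


def legacy_metrics(expr: str) -> List[str]:
    """Return every Redfish-era metric name referenced in a PromQL expression."""
    return LEGACY_METRIC.findall(expr)


if __name__ == "__main__":
    old = "absent(r730_idrac_idrac_system_health)"
    new = "absent(r730_idrac_globalSystemStatus)"
    print(legacy_metrics(old))  # one legacy name found
    print(legacy_metrics(new))  # empty list: already re-pointed
```

Expressions that intentionally stay on the remnant (SSD wear, SEL) will still be reported, so the output is a review list rather than a hard failure.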
## Phase C: thin Redfish remnant

- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra, now served by SNMP). (HA reads `sensors` directly; unchanged.)
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter, so it should be untouched).

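The keep-only rule in C2 can be reasoned about offline. Prometheus relabel regexes are fully anchored, so a `keep` action on `__name__` behaves like a full match against each metric name. The sketch below simulates that with `re.fullmatch`. Only `idrac_events_log_entry` and `idrac_sensors_fan_speed` are names taken from these docs; the other names in the pattern are placeholders for the real gap series and must be replaced with the names confirmed in A4.

```python
import re
from typing import Iterable, List

# Placeholder pattern: idrac_events_log_entry is from the design doc, the other
# alternatives are hypothetical stand-ins for the LED / NIC Mbps / SSD-life series.
KEEP_PATTERN = re.compile(
    r"idrac_events_log_entry|idrac_example_indicator_led|idrac_example_nic_mbps|idrac_example_ssd_life"
)


def keep_only(names: Iterable[str], pattern: "re.Pattern[str]" = KEEP_PATTERN) -> List[str]:
    """Mimic a `keep` relabel on __name__: retain names the regex fully matches."""
    return [name for name in names if pattern.fullmatch(name)]


if __name__ == "__main__":
    scraped = ["idrac_events_log_entry", "idrac_sensors_fan_speed",
               "idrac_events_log_entry_total"]
    print(keep_only(scraped))
```

Dropping `idrac_sensors_fan_speed` here only affects what Prometheus stores; per the design, HA polls the exporter directly, so its fan sensor is not touched by this rule.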
## Phase D: docs + ship

- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook.
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.

## Rollback

All Terraform-managed. Revert the monitoring-stack commit + `scripts/tg apply`
restores the Redfish-primary state. Phase A is additive (safe to leave even if
B/C are reverted).

## Presence

Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.