infra/docs/plans/2026-06-05-idrac-snmp-migration-design.md
Viktor Barzin 6b1d23abbd monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:33:20 +00:00


iDRAC monitoring: Redfish → SNMP migration (design)

Date: 2026-06-05
Status: approved (Viktor); SNMP primary + thin Redfish remnant
Stack: stacks/monitoring

Problem

The R730 iDRAC Redfish exporter (idrac-redfish-exporter, mrlhansen idrac_exporter, image viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix) is configured with metrics: all: true. It collects on demand and walks every Redfish subtree on each scrape, making dozens of sequential ~1-2 s requests to a slow BMC.

Measured live (Prometheus scrape_duration_seconds{job="redfish-idrac"}, 24 h):

  • avg 18.5 s, peak 28.3 s, occasional fast-fail 0.085 s.
  • Pinned to a 3 m interval / 45 s timeout because it cannot run at the 2 m global cadence.

The cost is dominated by walks that feed dashboard-only panels (memory for 10 DIMMs, network, events/SEL); the operationally important metrics (fan speed, temps, power, voltage) come from cheap single-request collectors.

Decision

Make SNMP the fast primary source and keep a thin, slow Redfish remnant for the few things SNMP cannot serve. SNMP walks are fast (the snmp-ups job runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.

Rejected alternatives: (1) pure collector-trim of Redfish — still BMC-bound and slow; (2) pure SNMP / retire Redfish — would require re-pointing the external ha-sofia sensor.r730_fan_speed REST sensor (collides with a live session editing the fan dashboard) and would drop two cosmetic panels.
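As an illustration of the SNMP primary path, a minimal sketch of the snmp-idrac scrape job follows. The module, auth, interval, timeout, target address and r730_idrac_ prefix come from this migration; the snmp_exporter service address is a hypothetical placeholder, and the relabeling is the usual snmp_exporter pattern rather than a copy of the deployed config.

```yaml
# Sketch only: the snmp_exporter address below is an assumption.
scrape_configs:
  - job_name: snmp-idrac
    scrape_interval: 1m
    scrape_timeout: 30s
    metrics_path: /snmp
    params:
      module: [dell_idrac]   # omitting this makes snmp_exporter fall back to if_mib
      auth: [public_v2]      # v2c auth already defined in ups_snmp_values.yaml
    static_configs:
      - targets: ["192.168.1.4"]   # iDRAC address, passed on as ?target=
    relabel_configs:
      # Move the device address into the ?target= parameter ...
      - source_labels: [__address__]
        target_label: __param_target
      # ... keep it as the instance label ...
      - source_labels: [__param_target]
        target_label: instance
      # ... and send the scrape itself to snmp_exporter.
      - target_label: __address__
        replacement: snmp-exporter.monitoring.svc:9116   # hypothetical address
    metric_relabel_configs:
      # Prefix every scraped series with r730_idrac_.
      - source_labels: [__name__]
        regex: (.+)
        target_label: __name__
        replacement: r730_idrac_${1}
```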

Key findings (ground-truthed)

  • The snmp-idrac job was dead. It specified no module param, so snmp_exporter defaulted to if_mib and returned only the iDRAC NIC's interface counters — zero health/power/thermal. Both iDRAC jobs relabel to r730_idrac_*, which hid this. The alert iDRACSNMPMetricsMissing is misnamed — its expr absent(r730_idrac_idrac_system_health) checks a Redfish metric.
  • A generated dell_idrac module already exists, unmounted, in prometheus_snmp_chart_values.yaml (~lines 79-1628). The mounted config is ups_snmp_values.yaml (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c, community Public0 (already the public_v2 auth in ups_snmp_values.yaml).
  • Live snmpwalk (Public0, 192.168.1.4) confirms these return real data:
    • fan RPM: coolingDeviceReading (.4.700.12.1.6 = 7080 RPM)
    • temps: temperatureProbeReading (.700.20.1.6, tenths-°C)
    • system watts: amperageProbeReading (.600.30.1.6 = 252 W)
    • PSU input voltage: powerSupplyCurrentInputVoltage (.600.12.1.16), plus PSU watts/health
    • global health: globalSystemStatus (.5.2.1)
    • systemState* rollups (.200)
    • physicalDisk* status
    • memoryDevice* size/status/type/speed (.1100)
    • networkDevice* status/connection (.1100.90)
    • BIOS 2.19.0 (.300.50.1.8)
    • model/service-tag (.5.1.3)
  • Genuine SNMP gaps — but inert or cosmetic today:
    • SSD life-left % (physicalDiskRemainingRatedWriteEndurance .49) → returns 255 (N/A) for every drive incl. the Samsung SSD. Redfish today reports 0 on the one drive that has it, and the SSD-wear alerts guard on > 0, so they already never fire → no functional loss.
    • SEL event log (5.5.2) → NoSuchObject. The idrac_events_log_entry metric is already empty in Prometheus today → no loss.
    • Indicator LED (5.1.4) → absent. Cosmetic ("Off") panel.
    • NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
    • Average watts → no native OID; reconstruct via PromQL avg_over_time().

Conclusion: every metric with real, used data today has an SNMP equivalent.
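The two derived values mentioned in the findings (average watts via avg_over_time() and temperatures reported in tenths of a degree) could be expressed as recording rules. This is a sketch under assumptions: the group name, the rule names and the 1h window are hypothetical, and the amperageProbeLocationName label is only present once the corresponding lookup is part of the module.

```yaml
groups:
  - name: idrac-snmp-derived            # hypothetical group name
    rules:
      # Average system power; the 1h window is an assumption, not from the design.
      - record: r730_idrac:system_watts:avg1h      # hypothetical rule name
        expr: |
          avg_over_time(
            r730_idrac_amperageProbeReading{amperageProbeLocationName="System Board Pwr Consumption"}[1h]
          )
      # temperatureProbeReading is in tenths of a degree C, so divide by 10.
      - record: r730_idrac:temperature_celsius     # hypothetical rule name
        expr: r730_idrac_temperatureProbeReading / 10
```

The same expressions can be used inline in Grafana panels instead; recording rules only help when several consumers need the derived series.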

Naming / enum strategy

snmp_exporter names metrics after MIB objects (temperatureProbeReading, coolingDeviceReading, globalSystemStatus, …) → after the r730_idrac_ relabel they are r730_idrac_<mibName>, different from today's r730_idrac_idrac_* / r730_idrac_redfish_*. Re-point consumers (not alias): aliasing via metric_relabel_configs only renames __name__ and cannot fix the label-set mismatch (Redfish member_id/name vs SNMP numeric indexes) nor the enum-value mismatch (DellStatus 3=OK vs Redfish 1; systemPowerState 4=on vs Redfish 2). Alert exprs must change regardless, so re-pointing is the honest path. The module adds lookups: so SNMP series carry human labels (probe/fan location, disk display name) like today.
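To show how lookups attach human-readable labels, here is a hedged sketch in snmp_exporter generator.yml form. The walked objects and the two location-name lookups come from this migration; the index object names (coolingDevicechassisIndex and so on) are assumptions about the Dell MIB and should be checked against the generated module, which expresses the same thing per metric in snmp.yml.

```yaml
# generator.yml-style sketch; index object names are assumptions.
modules:
  dell_idrac:
    walk:
      - coolingDeviceReading        # fan RPM (.700.12.1.6)
      - amperageProbeReading        # system watts (.600.30.1.6)
    lookups:
      # Replace the numeric fan index with the fan's location string.
      - source_indexes: [coolingDevicechassisIndex, coolingDeviceIndex]
        lookup: coolingDeviceLocationName
      # Same for amperage probes, so "System Board Pwr Consumption"
      # becomes selectable as a label value.
      - source_indexes: [amperageProbechassisIndex, amperageProbeIndex]
        lookup: amperageProbeLocationName
```

With a lookup in place, a series that would otherwise carry only numeric index labels also carries a label named after the lookup object, which is what lets alerts and panels select probes by name.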

Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)

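DIRECT = already covered by the generated dell_idrac module. REGEN = OID returns data but must be added to the module walk. remnant = stays on the Redfish exporter.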

| Consumed (today) | Source after migration |
|---|---|
| fan health | REGEN coolingDeviceStatus .700.12.1.5 |
| consumed watts | DIRECT amperageProbeReading (System Board Pwr Consumption) |
| system health rollup | DIRECT globalSystemStatus .5.2.1 |
| PSU health | DIRECT powerSupplyStatus/powerSupplySensorState |
| memory health | DIRECT systemStateMemoryDeviceStatusCombined .200.10.1.27 |
| storage drive health | DIRECT physicalDiskComponentStatus .5.5.1.20.130.4.1.24 |
| SSD life % | remnant (SNMP=255 N/A; already inert) |
| system power state | DIRECT systemPowerState .5.2.4 (enum 4=on) |
| PSU input voltage | DIRECT powerSupplyCurrentInputVoltage .600.12.1.16 |
| system health (absent-probe) | DIRECT globalSystemStatus |
| fan speed RPM (HA) | DIRECT via remnant (HA reads exporter directly); SNMP REGEN coolingDeviceReading for Grafana |
| temperature | DIRECT temperatureProbeReading .700.20.1.6 (÷10) |
| avg watts | PromQL avg_over_time(amperageProbeReading) |
| SEL log | remnant (already empty) |
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
| memory size / cpu count | DIRECT memoryDeviceSize (sum) / processorDeviceStatus (count) |
| indicator LED | remnant (cosmetic) |
| storage drive info/health/capacity | DIRECT physicalDisk* |
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15}) |
| network port health/link / Mbps | REGEN networkDevice* (.1100.90); Mbps → remnant/drop |
| PSU output/input/capacity watts | DIRECT powerSupplyOutputWatts/RatedInputWattage |
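As a sketch of what re-pointed alerts might look like against the mapping above, the rules below use the SNMP metric names and the DellStatus / systemPowerState enums (3 = OK, 4 = on). Only iDRACSNMPMetricsMissing is an alert name from this stack; the group name, the other alert names, the for: durations and the 80 °C threshold are hypothetical.

```yaml
groups:
  - name: idrac-snmp-alerts             # hypothetical group name
    rules:
      # Absent-probe now checks an SNMP metric, matching the alert's name.
      - alert: iDRACSNMPMetricsMissing
        expr: absent(r730_idrac_globalSystemStatus)
        for: 10m                        # assumed duration
      # DellStatus enum: 3 = OK, anything else is degraded or unknown.
      - alert: iDRACSystemUnhealthy     # hypothetical alert name
        expr: r730_idrac_globalSystemStatus != 3
        for: 5m                         # assumed duration
      # systemPowerState enum: 4 = on.
      - alert: iDRACSystemNotPoweredOn  # hypothetical alert name
        expr: r730_idrac_systemPowerState != 4
        for: 5m                         # assumed duration
      # Readings are tenths of a degree C; the 80 C threshold is an assumption.
      - alert: iDRACTemperatureHigh     # hypothetical alert name
        expr: r730_idrac_temperatureProbeReading / 10 > 80
        for: 5m                         # assumed duration
```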

Remnant role

The Redfish exporter stays alive (so the external ha-sofia sensor.r730_fan_speed REST poll is unchanged — no ha-sofia edit, no collision). It is trimmed to sensors,system,network,storage,events and its Prometheus scrape slows to 10 m, keeping only the gap metrics (indicator LED, NIC Mbps, SSD-life, SEL) via metric_relabel_configs to avoid duplicate series with SNMP.
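A minimal sketch of the slowed remnant scrape is shown here. The job name, 10 m interval and idrac_events_log_entry come from this document; the exporter address, the ?target= parameter, the 45 s timeout carried over from the old job, and the other three metric names are assumptions to be checked against the exporter's /metrics output.

```yaml
# Sketch only: exporter address and three of the four metric names are assumptions.
scrape_configs:
  - job_name: redfish-idrac
    scrape_interval: 10m
    scrape_timeout: 45s                 # assumed unchanged from the old job
    metrics_path: /metrics
    params:
      target: ["192.168.1.4"]           # assumed: exporter takes the BMC as ?target=
    static_configs:
      - targets: ["idrac-redfish-exporter.monitoring.svc:9348"]   # hypothetical
    metric_relabel_configs:
      # Keep only the gap metrics so nothing duplicates the SNMP series.
      - source_labels: [__name__]
        regex: idrac_(system_indicator_led_on|network_port_speed_mbps|storage_drive_life_left_percent|events_log_entry)
        action: keep
      # Apply the shared r730_idrac_ prefix to what remains.
      - source_labels: [__name__]
        regex: (.+)
        target_label: __name__
        replacement: r730_idrac_${1}
```

Because metric_relabel_configs only affects what Prometheus stores, the exporter still serves idrac_sensors_fan_speed to the ha-sofia REST sensor. One thing to verify: with Prometheus's default 5 m lookback, series scraped every 10 m can show gaps in instant queries, so panels on these metrics may need last_over_time() over a wider window.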