infra/docs/plans/2026-06-05-idrac-snmp-migration-design.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

6.4 KiB
Raw Blame History

iDRAC monitoring: Redfish → SNMP migration (design)

Date: 2026-06-05 Status: approved (Viktor) — SNMP primary + thin Redfish remnant Stack: stacks/monitoring

Problem

The R730 iDRAC Redfish exporter (idrac-redfish-exporter, mrlhansen idrac_exporter, image viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix) is configured metrics: all: true. It collects on-demand and walks every Redfish subtree, making dozens of sequential ~12 s requests to a slow BMC.

Measured live (Prometheus scrape_duration_seconds{job="redfish-idrac"}, 24 h):

  • avg 18.5 s, peak 28.3 s, occasional fast-fail 0.085 s.
  • Pinned to a 3 m interval / 45 s timeout because it cannot run at the 2 m global cadence.

The cost is dominated by walks that feed dashboard-only panels (memory 10 DIMMs, network, events/SEL); the operationally important metrics (fan speed, temps, power, voltage) come from cheap single-request collectors.

Decision

Make SNMP the fast primary source and keep a thin, slow Redfish remnant for the few things SNMP cannot serve. SNMP walks are fast (the snmp-ups job runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.

Rejected alternatives: (1) pure collector-trim of Redfish — still BMC-bound and slow; (2) pure SNMP / retire Redfish — would require re-pointing the external ha-sofia sensor.r730_fan_speed REST sensor (collides with a live session editing the fan dashboard) and would drop two cosmetic panels.

Key findings (ground-truthed)

  • The snmp-idrac job was dead. It specified no module param, so snmp_exporter defaulted to if_mib and returned only the iDRAC NIC's interface counters — zero health/power/thermal. Both iDRAC jobs relabel to r730_idrac_*, which hid this. The alert iDRACSNMPMetricsMissing is misnamed — its expr absent(r730_idrac_idrac_system_health) checks a Redfish metric.
  • A generated dell_idrac module already exists, unmounted, in prometheus_snmp_chart_values.yaml (~lines 791628). The mounted config is ups_snmp_values.yaml (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c, community Public0 (already the public_v2 auth in ups_snmp_values.yaml).
  • Live snmpwalk (Public0, 192.168.1.4) confirms these return real data: fan RPM coolingDeviceReading (.4.700.12.1.6 = 7080 RPM), temps temperatureProbeReading (.700.20.1.6, tenths-°C), system watts amperageProbeReading (.600.30.1.6 = 252 W), PSU input voltage powerSupplyCurrentInputVoltage (.600.12.1.16), PSU watts/health, global health globalSystemStatus (.5.2.1), systemState* rollups (.200), physicalDisk* status, memoryDevice* size/status/type/speed (.1100), networkDevice* status/connection (.1100.90), BIOS 2.19.0 (.300.50.1.8), model/service-tag (.5.1.3).
  • Genuine SNMP gaps — but inert or cosmetic today:
    • SSD life-left % (physicalDiskRemainingRatedWriteEndurance .49) → returns 255 (N/A) for every drive incl. the Samsung SSD. Redfish today reports 0 on the one drive that has it, and the SSD-wear alerts guard on > 0, so they already never fire → no functional loss.
    • SEL event log (5.5.2) → NoSuchObject. The idrac_events_log_entry metric is already empty in Prometheus today → no loss.
    • Indicator LED (5.1.4) → absent. Cosmetic ("Off") panel.
    • NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
    • Average watts → no native OID; reconstruct via PromQL avg_over_time().

Conclusion: every metric with real, used data today has an SNMP equivalent.

Naming / enum strategy

snmp_exporter names metrics after MIB objects (temperatureProbeReading, coolingDeviceReading, globalSystemStatus, …) → after the r730_idrac_ relabel they are r730_idrac_<mibName>, different from today's r730_idrac_idrac_* / r730_idrac_redfish_*. Re-point consumers (not alias): aliasing via metric_relabel_configs only renames __name__ and cannot fix the label-set mismatch (Redfish member_id/name vs SNMP numeric indexes) nor the enum-value mismatch (DellStatus 3=OK vs Redfish 1; systemPowerState 4=on vs Redfish 2). Alert exprs must change regardless, so re-pointing is the honest path. The module adds lookups: so SNMP series carry human labels (probe/fan location, disk display name) like today.

Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)

REGEN = OID returns data but must be added to the module walk.

Consumed (today) Source after migration
fan health REGEN coolingDeviceStatus .700.12.1.5
consumed watts DIRECT amperageProbeReading (System Board Pwr Consumption)
system health rollup DIRECT globalSystemStatus .5.2.1
PSU health DIRECT powerSupplyStatus/powerSupplySensorState
memory health DIRECT systemStateMemoryDeviceStatusCombined .200.10.1.27
storage drive health DIRECT physicalDiskComponentStatus .5.5.1.20.130.4.1.24
SSD life % remnant (SNMP=255 N/A; already inert)
system power state DIRECT systemPowerState .5.2.4 (enum 4=on)
PSU input voltage DIRECT powerSupplyCurrentInputVoltage .600.12.1.16
system health (absent-probe) DIRECT globalSystemStatus
fan speed RPM (HA) DIRECT via remnant (HA reads exporter directly); SNMP REGEN coolingDeviceReading for Grafana
temperature DIRECT temperatureProbeReading .700.20.1.6 (÷10)
avg watts PromQL avg_over_time(amperageProbeReading)
SEL log remnant (already empty)
machine/bios info REGEN model/svctag .5.1.3, BIOS .300.50.1.8
memory size / cpu count DIRECT memoryDeviceSize (sum) / processorDeviceStatus (count)
indicator LED remnant (cosmetic)
storage drive info/health/capacity DIRECT physicalDisk*
memory module info/health/cap/speed DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15})
network port health/link / Mbps REGEN networkDevice* (.1100.90); Mbps → remnant/drop
PSU output/input/capacity watts DIRECT powerSupplyOutputWatts/RatedInputWattage

Remnant role

The Redfish exporter stays alive (so the external ha-sofia sensor.r730_fan_speed REST poll is unchanged — no ha-sofia edit, no collision). It is trimmed to sensors,system,network,storage,events and its Prometheus scrape slows to 10 m, keeping only the gap metrics (indicator LED, NIC Mbps, SSD-life, SEL) via metric_relabel_configs to avoid duplicate series with SNMP.