6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6.4 KiB
iDRAC monitoring: Redfish → SNMP migration (design)
Date: 2026-06-05
Status: approved (Viktor) — SNMP primary + thin Redfish remnant
Stack: stacks/monitoring
Problem
The R730 iDRAC Redfish exporter (idrac-redfish-exporter, mrlhansen
idrac_exporter, image viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix)
is configured metrics: all: true. It collects on-demand and walks every
Redfish subtree, making dozens of sequential ~1–2 s requests to a slow BMC.
Measured live (Prometheus scrape_duration_seconds{job="redfish-idrac"}, 24 h):
- avg 18.5 s, peak 28.3 s, occasional fast-fail 0.085 s.
- Pinned to a 3 m interval / 45 s timeout because it cannot run at the 2 m global cadence.
The cost is dominated by walks that feed dashboard-only panels (memory
10 DIMMs, network, events/SEL); the operationally important metrics (fan
speed, temps, power, voltage) come from cheap single-request collectors.
Decision
Make SNMP the fast primary source and keep a thin, slow Redfish remnant
for the few things SNMP cannot serve. SNMP walks are fast (the snmp-ups job
runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.
Rejected alternatives: (1) pure collector-trim of Redfish — still BMC-bound and
slow; (2) pure SNMP / retire Redfish — would require re-pointing the external
ha-sofia sensor.r730_fan_speed REST sensor (collides with a live session
editing the fan dashboard) and would drop two cosmetic panels.
Key findings (ground-truthed)
- The
snmp-idracjob was dead. It specified nomoduleparam, sosnmp_exporterdefaulted toif_miband returned only the iDRAC NIC's interface counters — zero health/power/thermal. Both iDRAC jobs relabel tor730_idrac_*, which hid this. The alertiDRACSNMPMetricsMissingis misnamed — its exprabsent(r730_idrac_idrac_system_health)checks a Redfish metric. - A generated
dell_idracmodule already exists, unmounted, inprometheus_snmp_chart_values.yaml(~lines 79–1628). The mounted config isups_snmp_values.yaml(huawei/if_mib/ip_mib only). iDRAC SNMP = v2c, communityPublic0(already thepublic_v2auth inups_snmp_values.yaml). - Live snmpwalk (Public0, 192.168.1.4) confirms these return real data:
fan RPM
coolingDeviceReading(.4.700.12.1.6 = 7080 RPM), tempstemperatureProbeReading(.700.20.1.6, tenths-°C), system wattsamperageProbeReading(.600.30.1.6 = 252 W), PSU input voltagepowerSupplyCurrentInputVoltage(.600.12.1.16), PSU watts/health, global healthglobalSystemStatus(.5.2.1),systemState*rollups (.200),physicalDisk*status,memoryDevice*size/status/type/speed (.1100),networkDevice*status/connection (.1100.90), BIOS2.19.0(.300.50.1.8), model/service-tag (.5.1.3). - Genuine SNMP gaps — but inert or cosmetic today:
- SSD life-left % (
physicalDiskRemainingRatedWriteEndurance.49) → returns255(N/A) for every drive incl. the Samsung SSD. Redfish today reports0on the one drive that has it, and the SSD-wear alerts guard on> 0, so they already never fire → no functional loss. - SEL event log (
5.5.2) →NoSuchObject. Theidrac_events_log_entrymetric is already empty in Prometheus today → no loss. - Indicator LED (
5.1.4) → absent. Cosmetic ("Off") panel. - NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
- Average watts → no native OID; reconstruct via PromQL
avg_over_time().
- SSD life-left % (
Conclusion: every metric with real, used data today has an SNMP equivalent.
Naming / enum strategy
snmp_exporter names metrics after MIB objects (temperatureProbeReading,
coolingDeviceReading, globalSystemStatus, …) → after the r730_idrac_
relabel they are r730_idrac_<mibName>, different from today's
r730_idrac_idrac_* / r730_idrac_redfish_*. Re-point consumers (not
alias): aliasing via metric_relabel_configs only renames __name__ and
cannot fix the label-set mismatch (Redfish member_id/name vs SNMP numeric
indexes) nor the enum-value mismatch (DellStatus 3=OK vs Redfish 1;
systemPowerState 4=on vs Redfish 2). Alert exprs must change regardless, so
re-pointing is the honest path. The module adds lookups: so SNMP series carry
human labels (probe/fan location, disk display name) like today.
Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)
REGEN = OID returns data but must be added to the module walk.
| Consumed (today) | Source after migration |
|---|---|
| fan health | REGEN coolingDeviceStatus .700.12.1.5 |
| consumed watts | DIRECT amperageProbeReading (System Board Pwr Consumption) |
| system health rollup | DIRECT globalSystemStatus .5.2.1 |
| PSU health | DIRECT powerSupplyStatus/powerSupplySensorState |
| memory health | DIRECT systemStateMemoryDeviceStatusCombined .200.10.1.27 |
| storage drive health | DIRECT physicalDiskComponentStatus .5.5.1.20.130.4.1.24 |
| SSD life % | remnant (SNMP=255 N/A; already inert) |
| system power state | DIRECT systemPowerState .5.2.4 (enum 4=on) |
| PSU input voltage | DIRECT powerSupplyCurrentInputVoltage .600.12.1.16 |
| system health (absent-probe) | DIRECT globalSystemStatus |
| fan speed RPM (HA) | DIRECT via remnant (HA reads exporter directly); SNMP REGEN coolingDeviceReading for Grafana |
| temperature | DIRECT temperatureProbeReading .700.20.1.6 (÷10) |
| avg watts | PromQL avg_over_time(amperageProbeReading) |
| SEL log | remnant (already empty) |
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
| memory size / cpu count | DIRECT memoryDeviceSize (sum) / processorDeviceStatus (count) |
| indicator LED | remnant (cosmetic) |
| storage drive info/health/capacity | DIRECT physicalDisk* |
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15}) |
| network port health/link / Mbps | REGEN networkDevice* (.1100.90); Mbps → remnant/drop |
| PSU output/input/capacity watts | DIRECT powerSupplyOutputWatts/RatedInputWattage |
Remnant role
The Redfish exporter stays alive (so the external ha-sofia
sensor.r730_fan_speed REST poll is unchanged — no ha-sofia edit, no
collision). It is trimmed to sensors,system,network,storage,events and its
Prometheus scrape slows to 10 m, keeping only the gap metrics (indicator
LED, NIC Mbps, SSD-life, SEL) via metric_relabel_configs to avoid duplicate
series with SNMP.