infra/docs/plans/2026-06-05-idrac-snmp-migration-plan.md at 8a3bbde38cc0fa592dc3338c5fc24bca167beef4

Viktor Barzin 6b1d23abbd monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant

The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

4.1 KiB

Raw Blame History

iDRAC Redfish → SNMP migration (plan)

Phase A — additive SNMP source (low risk)

Phase B — re-point consumers to verified SNMP names (riskier)

Phase C — thin Redfish remnant

Phase D — docs + ship

Rollback

Presence

4.1 KiB Raw Blame History Unescape Escape

iDRAC Redfish → SNMP migration (plan)

Phase A — additive SNMP source (low risk)

Phase B — re-point consumers to verified SNMP names (riskier)

Phase C — thin Redfish remnant

Phase D — docs + ship

Rollback

Presence

4.1 KiB

Raw Blame History