The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
iDRAC Redfish → SNMP migration (plan)
Companion to 2026-06-05-idrac-snmp-migration-design.md. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.
Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~79–1628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`
Phase A — additive SNMP source (low risk)
- A1. Extract `dell_idrac` (lines 79–1628) from `prometheus_snmp_chart_values.yaml`; strip its embedded `auth:`/`version:` (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
  - `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
  - `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
  - `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
  - `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
  - DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
  - `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- A4. Validate before any repoint: apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
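Steps A2 and A3 might look roughly like the two fragments below: a `dell_idrac` module excerpt for `ups_snmp_values.yaml`, then the `snmp-idrac` scrape job. This is a sketch under assumptions, not the applied config. The full OID prefix `1.3.6.1.4.1.674.10892.5` (the plan abbreviates it), the index label names, the `help` text, the exporter relabel chain, and the exact form of the `r730_idrac_` rename are not taken from the repo and should be checked against the A4 output.

```yaml
# Fragment 1: snmp_exporter module excerpt (nesting inside the Helm values file omitted).
modules:
  dell_idrac:
    walk:
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.6   # coolingDeviceReading (fan RPM)
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.8   # coolingDeviceLocationName (lookup source)
    metrics:
      - name: coolingDeviceReading
        oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.6
        type: gauge
        help: Fan speed in RPM                 # assumed wording
        indexes:                               # assumed label names for chassis+device index
          - labelname: coolingDevicechassisIndex
            type: gauge
          - labelname: coolingDeviceIndex
            type: gauge
        lookups:                               # attaches the readable location as a label
          - labels: [coolingDevicechassisIndex, coolingDeviceIndex]
            labelname: coolingDeviceLocationName
            oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.8
            type: DisplayString
---
# Fragment 2: scrape job entry (its position inside prometheus_chart_values.tpl omitted).
- job_name: snmp-idrac
  scrape_interval: 1m
  scrape_timeout: 30s
  metrics_path: /snmp
  params:
    module: [dell_idrac]
    auth: [public_v2]
  static_configs:
    - targets: ['192.168.1.4:161']             # iDRAC address as used in the A4 curl
  relabel_configs:                             # usual snmp_exporter indirection (assumed)
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc:9116
  metric_relabel_configs:                      # assumed form of the r730_idrac_ rename
    - source_labels: [__name__]
      regex: (.+)
      target_label: __name__
      replacement: r730_idrac_$1
```

With the lookup in place, a fan series carries a `coolingDeviceLocationName` label, so panels and alerts can select a fan by name instead of by numeric index.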
Phase B — re-point consumers to verified SNMP names (riskier)
- B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + SNMP enums (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing` → `absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing` → `absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
- B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.
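The B1 rewrite could be sketched as the rule group below. The two `absent(...)` expressions and the enum values (3 = OK, 4 = on) come from the plan; the group name, the two status alert names, the `r730_idrac_systemPowerState` metric name, the `for:` durations, and the severity labels are assumptions for illustration and need to be reconciled with the names confirmed in A4.

```yaml
groups:
  - name: idrac-snmp                           # hypothetical group name
    rules:
      # Absent-probes re-targeted to SNMP series (expressions from B1).
      - alert: iDRACRedfishMetricsMissing
        expr: absent(r730_idrac_powerSupplyCurrentInputVoltage)
        for: 10m                               # assumed duration
        labels:
          severity: warning                    # assumed severity
      - alert: iDRACSNMPMetricsMissing
        expr: absent(r730_idrac_globalSystemStatus)
        for: 10m                               # assumed duration
        labels:
          severity: warning                    # assumed severity
      # DellStatus enum: 3 means OK, so anything else is a problem.
      - alert: iDRACGlobalStatusNotOK          # hypothetical alert name
        expr: r730_idrac_globalSystemStatus != 3
        for: 5m                                # assumed duration
        labels:
          severity: critical                   # assumed severity
      # Power-state enum: 4 means on.
      - alert: iDRACSystemNotPoweredOn         # hypothetical alert name
        expr: r730_idrac_systemPowerState != 4 # metric name is an assumption
        for: 5m                                # assumed duration
        labels:
          severity: critical                   # assumed severity
```

Comparing against the "good" value (`!= 3`, `!= 4`) rather than listing bad values keeps the rule correct for every non-OK enum member, including `other` and `unknown`.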
Phase C — thin Redfish remnant
- C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra, which now come from SNMP). (HA reads `sensors` directly, so it is unchanged.)
- C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to keep only the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
- C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter, so it should be untouched).
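A rough shape for C1 and C2 is shown below. The collector list and the 10m interval come from the plan; the layout of the exporter's `metrics:` block is assumed from the plan's `metrics: all: false` wording, the `scrape_timeout` is an assumption, and every metric-name pattern in the keep regex is a hypothetical placeholder to be replaced with the names actually emitted on the exporter's `/metrics` endpoint.

```yaml
# Fragment 1: Redfish exporter config rendered by the idrac.tf config map (assumed layout).
metrics:
  all: false
  sensors: true        # still needed: HA reads idrac_sensors_fan_speed from the exporter
  system: true
  network: true
  storage: true
  events: true
---
# Fragment 2: redfish-idrac scrape job, only the keys changed in C2
# (target configuration is unchanged and omitted here).
- job_name: redfish-idrac
  scrape_interval: 10m
  scrape_timeout: 1m                           # assumed value
  metric_relabel_configs:
    # Keep only the gap series; everything else is now served by SNMP.
    # All four name patterns are placeholders, not confirmed metric names.
    - source_labels: [__name__]
      regex: idrac_(system_indicator_.*|network_port_.*|storage_drive_life_.*|events_.*)
      action: keep
```

Because `metric_relabel_configs` filters only what Prometheus stores, dropping the sensor series here does not affect HA, which the plan says reads the exporter directly.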
Phase D — docs + ship
- D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook.
- D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.
Rollback
All Terraform-managed. Revert the monitoring-stack commit + scripts/tg apply
restores the Redfish-primary state. Phase A is additive (safe to leave even if
B/C are reverted).
Presence
Claim stack:monitoring + service:idrac-redfish-exporter before each apply.