infra/docs/plans/2026-06-05-idrac-snmp-migration-plan.md
Viktor Barzin 6b1d23abbd monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.

- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
  (coolingDeviceReading + location lookup) and an amperageProbeLocationName
  lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
  primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
  fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
  temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
  system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
  metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
  table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
  idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.

SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 16:33:20 +00:00


iDRAC Redfish → SNMP migration (plan)

Companion to 2026-06-05-idrac-snmp-migration-design.md. Execute the phases in order: applies are staged so that the safe, additive work lands and is verified before any consumer is re-pointed.

Files:

  • stacks/monitoring/modules/monitoring/ups_snmp_values.yaml (merge target)
  • stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml (dell_idrac source, lines ~791–1628)
  • stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl (scrape jobs ~3150/3170, alerts ~811–1186)
  • stacks/monitoring/modules/monitoring/idrac.tf (Redfish exporter / remnant)
  • stacks/monitoring/modules/monitoring/dashboards/idrac.json, cluster_health.json

Phase A — additive SNMP source (low risk)

  • A1. Extract dell_idrac (lines 791–1628) from prometheus_snmp_chart_values.yaml; strip its embedded auth:/version: (the merge target uses the split auths: format) and append the module under modules: in ups_snmp_values.yaml.
  • A2. Hand-add the following to the dell_idrac walk: and metrics: sections (with lookups: for readable labels):
    • coolingDeviceReading .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup coolingDeviceLocationName .8)
    • coolingDeviceStatus .4.700.12.1.5 (fan health, enum)
    • networkDeviceStatus / networkDeviceConnectionStatus (.1100.90.1.{3,17})
    • systemBIOSVersionName .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
    • DIMM .1100.50.1.{5 status, 7 type, 8 location, 15 speed}
    • physicalDiskRemainingRatedWriteEndurance .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
  • A3. snmp-idrac job (prometheus_chart_values.tpl ~3150): add params: { module: [dell_idrac], auth: [public_v2] }, scrape_interval: 1m, scrape_timeout: 30s. Keep the r730_idrac_ relabel.
  • A4. Validate before any repoint: apply monitoring stack; curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161' returns all REGEN/DIRECT metrics with readable labels; scrape_duration_seconds{job="snmp-idrac"} < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
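The A2 additions might look roughly like the fragment below, written in the generated snmp_exporter module format. It is a minimal sketch covering only the fan-RPM entry. The enterprise prefix 1.3.6.1.4.1.674.10892.5, the index label names, and the exact nesting inside ups_snmp_values.yaml are assumptions based on the Dell iDRAC MIB, not values copied from the repository; confirm them against the live output in A4.

```yaml
# Sketch only: fragment of the dell_idrac module in snmp_exporter's generated format.
# OID prefix and label names are assumptions to verify against the real walk.
modules:
  dell_idrac:
    walk:
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.6   # coolingDeviceReading (fan RPM)
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.8   # coolingDeviceLocationName (lookup source)
    metrics:
      - name: coolingDeviceReading
        oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.6
        type: gauge
        help: Fan speed reading in RPM
        indexes:
          # Table rows are indexed by chassis + device, as noted in A2.
          - labelname: coolingDevicechassisIndex
            type: gauge
          - labelname: coolingDeviceIndex
            type: gauge
        lookups:
          # Resolve the numeric index pair to a readable location label.
          - labels:
              - coolingDevicechassisIndex
              - coolingDeviceIndex
            labelname: coolingDeviceLocationName
            oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.8
            type: DisplayString
```

The lookup is what makes the metric label-selectable: without it, each fan is identified only by its numeric index pair.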

Phase B — re-point consumers to verified SNMP names (riskier)

  • B1. Rewrite ~12 alert exprs (prometheus_chart_values.tpl 811–1186) to SNMP names + SNMP enums (3=OK not 1; power 4=on not 2). Re-target absent-probes: iDRACRedfishMetricsMissing → absent(r730_idrac_powerSupplyCurrentInputVoltage); iDRACSNMPMetricsMissing → absent(r730_idrac_globalSystemStatus) (also fixes the misnomer).
  • B2. Re-point ~26 panels in idrac.json + cluster_health.json to SNMP names/labels; avg-watts → avg_over_time(...amperageProbeReading...[$__interval]).
  • B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep prometheus-server configmap / prometheus_chart_values.tpl keep rules) so they aren't silently dropped.
  • B4. Apply; verify each re-pointed alert has data (no spurious absent firing) and panels render.
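As an illustration of the B1 enum change, re-pointed rules could take roughly the following shape. The two absent() expressions come from B1. The group name, the iDRACSystemUnhealthy and iDRACPoweredOff alert names, the r730_idrac_systemPowerState metric name, and the for: durations are hypothetical placeholders; the real names are whatever A4 confirms.

```yaml
# Sketch only: Prometheus alerting rules using the SNMP DellStatus enums.
groups:
  - name: idrac-snmp            # hypothetical group name
    rules:
      - alert: iDRACSNMPMetricsMissing
        expr: absent(r730_idrac_globalSystemStatus)
        for: 10m                # assumption
      - alert: iDRACRedfishMetricsMissing
        expr: absent(r730_idrac_powerSupplyCurrentInputVoltage)
        for: 10m                # assumption
      - alert: iDRACSystemUnhealthy   # hypothetical alert name
        # SNMP reports OK as 3 (the old Redfish-based rule compared against 1).
        expr: r730_idrac_globalSystemStatus != 3
        for: 5m                 # assumption
      - alert: iDRACPoweredOff        # hypothetical alert name
        # SNMP reports power-on as 4 (previously 2); metric name is assumed.
        expr: r730_idrac_systemPowerState != 4
        for: 5m                 # assumption
```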

Phase C — thin Redfish remnant

  • C1. idrac.tf config map: metrics: all: false + enable only sensors, system, network, storage, events (drop power/memory/processors/manager/extra — now SNMP). (HA reads sensors directly — unchanged.)
  • C2. redfish-idrac job: scrape_interval: 10m; add metric_relabel_configs to keep only the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
  • C3. Apply; verify HA sensor.r730_fan_speed still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter — should be untouched).
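The C2 change could be sketched as the scrape-job fragment below. The job name and interval come from C2. The scrape_timeout value and the metric-name patterns in the regex are placeholders, not verified exporter metric names; substitute the names observed on the exporter's /metrics output for the four gap series.

```yaml
# Sketch only: thin Redfish remnant job keeping just the gap series.
- job_name: redfish-idrac
  scrape_interval: 10m
  scrape_timeout: 1m            # assumption; must not exceed scrape_interval
  metric_relabel_configs:
    # Keep only series SNMP cannot serve; everything else is dropped after the
    # scrape, which avoids duplicate series alongside snmp-idrac.
    - source_labels: [__name__]
      # Placeholder patterns for indicator LED, NIC Mbps, SSD-life, SEL.
      regex: idrac_(system_indicator_led|network_port_speed|storage_drive_life_left|events_log).*
      action: keep
```

metric_relabel_configs filters only what Prometheus stores, so the exporter still serves the full sensor set to anything that reads it directly, such as Home Assistant. One caveat when verifying that the gap panels render: a 10m interval is longer than Prometheus's default 5m lookback, so instant queries on these series may return nothing between scrapes unless the panel uses something like last_over_time(...[15m]).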

Phase D — docs + ship

  • D1. Update docs/architecture/monitoring.md (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, and update any affected runbook.
  • D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.

Rollback

Everything is Terraform-managed: reverting the monitoring-stack commit and running scripts/tg apply restores the Redfish-primary state. Phase A is additive, so it is safe to leave in place even if B/C are reverted.

Presence

Claim stack:monitoring + service:idrac-redfish-exporter before each apply.