The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
53 lines
4.1 KiB
Markdown
# iDRAC Redfish → SNMP migration (plan)

Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.

Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~79–1628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`

## Phase A — additive SNMP source (low risk)

- [ ] A1. Extract `dell_idrac` (lines 79–1628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
  - `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
  - `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
  - `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
  - `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
  - DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
  - `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
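
Steps A1–A3 might end up looking roughly like the two fragments below. This is a sketch, not the as-built config: the full OIDs assume the Dell iDRAC enterprise prefix `1.3.6.1.4.1.674.10892.5`, the index label names are what the snmp_exporter generator usually emits for this table and should be checked against the extracted module, the `community` value is a placeholder, and the `help` text is illustrative.

```yaml
# Fragment 1: ups_snmp_values.yaml (split auths/modules format).
auths:
  public_v2:
    version: 2
    community: public          # placeholder; use the real community string
modules:
  dell_idrac:
    # No embedded auth:/version: here; the auth is chosen per scrape (A1).
    walk:
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.5   # coolingDeviceStatus
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.6   # coolingDeviceReading
      - 1.3.6.1.4.1.674.10892.5.4.700.12.1.8   # coolingDeviceLocationName
    metrics:
      - name: coolingDeviceReading
        oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.6
        type: gauge
        help: Fan speed in RPM
        indexes:                               # assumed label names
          - labelname: coolingDevicechassisIndex
            type: gauge
          - labelname: coolingDeviceIndex
            type: gauge
        lookups:                               # adds a readable location label
          - labels: [coolingDevicechassisIndex, coolingDeviceIndex]
            labelname: coolingDeviceLocationName
            oid: 1.3.6.1.4.1.674.10892.5.4.700.12.1.8
            type: DisplayString
---
# Fragment 2: snmp-idrac scrape job in prometheus_chart_values.tpl (A3).
- job_name: snmp-idrac
  metrics_path: /snmp
  scrape_interval: 1m
  scrape_timeout: 30s
  params:
    module: [dell_idrac]
    auth: [public_v2]
  static_configs:
    - targets: ["192.168.1.4:161"]
  relabel_configs:                             # usual snmp_exporter indirection
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc:9116
  # The existing r730_idrac_ metric relabel stays as-is (not shown).
```

The `params` block sends the same `module` and `auth` query parameters as the A4 `curl` check, so a successful manual fetch is a reasonable predictor that the job will scrape.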
## Phase B — re-point consumers to verified SNMP names (riskier)

- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing`→`absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing`→`absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.
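
The B1 rewrite could be sketched as the rule group below. Only the two absent-probe names and their expressions come from this plan; the group name, the `for:` durations, the other two alert names, and the `r730_idrac_systemPowerState` metric name are assumptions to be replaced with whatever A4 confirms is actually emitted.

```yaml
groups:
  - name: idrac-snmp                    # hypothetical group name
    rules:
      # Absent-probes, re-targeted as described in B1.
      - alert: iDRACSNMPMetricsMissing
        expr: absent(r730_idrac_globalSystemStatus)
        for: 10m                        # assumed duration
      - alert: iDRACRedfishMetricsMissing
        expr: absent(r730_idrac_powerSupplyCurrentInputVoltage)
        for: 10m                        # assumed duration
      # DellStatus enum: 3 means OK, so any other value is treated as a fault.
      - alert: iDRACGlobalStatusNotOK   # hypothetical alert name
        expr: r730_idrac_globalSystemStatus != 3
        for: 5m                         # assumed duration
      # Power-state enum: 4 means on.
      - alert: iDRACSystemNotPoweredOn  # hypothetical alert name
        expr: r730_idrac_systemPowerState != 4   # metric name assumed
        for: 5m                         # assumed duration
```

Comparing against the enum value directly keeps the old Redfish constants (`1` and `2`) out of the expressions, which is the swap B4 is meant to catch if it was missed anywhere.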
## Phase C — thin Redfish remnant

- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra, which SNMP now serves). (HA reads `sensors` directly, so that path is unchanged.)
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, and the fan-control daemon is unaffected (it uses IPMI, not this exporter, so it should be untouched).
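
C1 and C2 might translate into something like the two fragments below. Treat both as a sketch: the `metrics:` keys follow the collector names listed in C1, and the metric-name prefixes in the keep regex are guesses at what the exporter emits for the gap series, so they need to be confirmed against its `/metrics` output before applying.

```yaml
# Fragment 1: exporter config rendered by idrac.tf (C1).
metrics:
  all: false
  sensors: true     # HA's sensor.r730_fan_speed reads idrac_sensors_fan_speed
  system: true
  network: true
  storage: true
  events: true
---
# Fragment 2: redfish-idrac scrape job (C2); other job fields unchanged.
- job_name: redfish-idrac
  scrape_interval: 10m
  metric_relabel_configs:
    - source_labels: [__name__]
      # Assumed prefixes for indicator LED, NIC Mbps, SSD-life and SEL.
      regex: idrac_system_indicator.*|idrac_network_port.*|idrac_storage_drive.*|idrac_events.*
      action: keep
```

Because `keep` drops every series whose name does not match, Prometheus stores only the gap metrics, while HA keeps reading `idrac_sensors_fan_speed` from the exporter itself. One thing worth checking during C3: with Prometheus's default 5m query lookback, a 10m scrape interval can leave instant queries empty between scrapes, so gap panels may need a range wrapper such as `last_over_time(...[15m])`.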
## Phase D — docs + ship

- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, and update any affected runbook.
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.

## Rollback

Everything is Terraform-managed: reverting the monitoring-stack commit and
running `scripts/tg apply` restores the Redfish-primary state. Phase A is
additive (safe to leave even if B/C are reverted).

## Presence

Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.