# iDRAC monitoring: Redfish → SNMP migration (design)
**Date:** 2026-06-05
**Status:** approved (Viktor) — SNMP primary + thin Redfish remnant
**Stack:** `stacks/monitoring`
## Problem
The R730 iDRAC Redfish exporter (`idrac-redfish-exporter`, mrlhansen
`idrac_exporter`, image `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix`)
is configured with `metrics: all: true`. It collects on demand and walks every
Redfish subtree, making dozens of sequential ~1-2 s requests to a slow BMC.
Measured live (Prometheus `scrape_duration_seconds{job="redfish-idrac"}`, 24 h):
- **avg 18.5 s, peak 28.3 s**, occasional fast-fail 0.085 s.
- Pinned to a **3 m interval / 45 s timeout** because it cannot run at the 2 m
  global cadence.

The cost is dominated by walks that feed **dashboard-only** panels (`memory`
10 DIMMs, `network`, `events`/SEL); the operationally important metrics (fan
speed, temps, power, voltage) come from cheap single-request collectors.
## Decision
Make **SNMP the fast primary source** and keep a **thin, slow Redfish remnant**
for the few things SNMP cannot serve. SNMP walks are fast (the `snmp-ups` job
runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.

Rejected alternatives: (1) trimming the Redfish collectors only, which stays
BMC-bound and slow; (2) pure SNMP with Redfish retired, which would require
re-pointing the **external ha-sofia** `sensor.r730_fan_speed` REST sensor
(colliding with a live session editing the fan dashboard) and would drop two
cosmetic panels.
## Key findings (ground-truthed)
- **The `snmp-idrac` job was dead.** It specified **no `module`** param, so
`snmp_exporter` defaulted to `if_mib` and returned only the iDRAC NIC's
interface counters — zero health/power/thermal. Both iDRAC jobs relabel to
`r730_idrac_*`, which hid this. The alert `iDRACSNMPMetricsMissing` is
**misnamed** — its expr `absent(r730_idrac_idrac_system_health)` checks a
*Redfish* metric.
- **A generated `dell_idrac` module already exists**, unmounted, in
`prometheus_snmp_chart_values.yaml` (~lines 791628). The mounted config is
`ups_snmp_values.yaml` (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c,
community `Public0` (already the `public_v2` auth in `ups_snmp_values.yaml`).
- **Live snmpwalk (Public0, 192.168.1.4) confirms** these return real data:
fan RPM `coolingDeviceReading` (.4.700.12.1.6 = 7080 RPM), temps
`temperatureProbeReading` (.700.20.1.6, tenths-°C), system watts
`amperageProbeReading` (.600.30.1.6 = 252 W), PSU input voltage
`powerSupplyCurrentInputVoltage` (.600.12.1.16), PSU watts/health, global
health `globalSystemStatus` (.5.2.1), `systemState*` rollups (.200),
`physicalDisk*` status, `memoryDevice*` size/status/type/speed (.1100),
`networkDevice*` status/connection (.1100.90), BIOS `2.19.0` (.300.50.1.8),
model/service-tag (.5.1.3).
- **Genuine SNMP gaps, but inert or cosmetic today:**
  - SSD life-left % (`physicalDiskRemainingRatedWriteEndurance` .49) → returns
    `255` (N/A) for every drive incl. the Samsung SSD. **Redfish today reports
    `0`** on the one drive that has it, and the SSD-wear alerts guard on `> 0`,
    so they **already never fire** → no functional loss.
  - SEL event log (`5.5.2`) → `NoSuchObject`. The `idrac_events_log_entry`
    metric is **already empty in Prometheus** today → no loss.
  - Indicator LED (`5.1.4`) → absent. Cosmetic ("Off") panel.
  - NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
  - Average watts → no native OID; reconstruct via PromQL `avg_over_time()`.

Conclusion: **every metric with real, used data today has an SNMP equivalent.**
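
The dead-job finding comes down to one missing `module` parameter. A minimal sketch of a corrected scrape job follows. It assumes a hypothetical exporter address `snmp-exporter:9116`, borrows the 30 s interval from the `snmp-ups` job, and guesses that the `r730_idrac_` prefix is applied by a catch-all `metric_relabel_configs` rule; the actual relabeling in `stacks/monitoring` may differ.

```yaml
scrape_configs:
  - job_name: snmp-idrac
    scrape_interval: 30s              # assumption: same cadence as snmp-ups
    metrics_path: /snmp
    params:
      module: [dell_idrac]            # without this, snmp_exporter falls back to if_mib
      auth: [public_v2]               # v2c auth entry named in ups_snmp_values.yaml
    static_configs:
      - targets: ["192.168.1.4"]      # iDRAC address from the live snmpwalk
    relabel_configs:
      # Usual snmp_exporter pattern: pass the device as ?target=, scrape the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116   # hypothetical exporter service address
    metric_relabel_configs:
      # Assumed form of the r730_idrac_ prefix relabel.
      - source_labels: [__name__]
        regex: (.+)
        target_label: __name__
        replacement: r730_idrac_${1}
```

The job only returns health, power, and thermal data once the `dell_idrac` module is part of the config the exporter actually mounts; with only `ups_snmp_values.yaml` mounted, the `module` parameter would name a module the exporter does not know.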
## Naming / enum strategy
`snmp_exporter` names metrics after MIB objects (`temperatureProbeReading`,
`coolingDeviceReading`, `globalSystemStatus`, …) → after the `r730_idrac_`
relabel they are `r730_idrac_<mibName>`, different from today's
`r730_idrac_idrac_*` / `r730_idrac_redfish_*`. **Re-point consumers** (not
alias): aliasing via `metric_relabel_configs` only renames `__name__` and
cannot fix the label-set mismatch (Redfish `member_id`/`name` vs SNMP numeric
indexes) nor the **enum-value mismatch** (DellStatus `3=OK` vs Redfish `1`;
`systemPowerState 4=on` vs Redfish `2`). Alert exprs must change regardless, so
re-pointing is the honest path. The module adds `lookups:` so SNMP series carry
human labels (probe/fan location, disk display name) like today.
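
As an illustration of the `lookups:` idea, a generator module fragment might look roughly like the sketch below. The walked objects come from the findings above; the index and location object names (`coolingDevicechassisIndex`, `coolingDeviceLocationName`, and the temperature-probe equivalents) are assumptions that should be checked against the Dell iDRAC MIB before use.

```yaml
modules:
  dell_idrac:
    walk:
      - coolingDeviceReading          # fan RPM (.700.12.1.6)
      - coolingDeviceStatus           # fan health (.700.12.1.5), REGEN
      - temperatureProbeReading       # tenths of a degree C (.700.20.1.6)
    lookups:
      # Replace numeric table indexes with the human-readable location string.
      # Index/lookup object names are assumed, not verified against the MIB.
      - source_indexes: [coolingDevicechassisIndex, coolingDeviceIndex]
        lookup: coolingDeviceLocationName
      - source_indexes: [temperatureProbechassisIndex, temperatureProbeIndex]
        lookup: temperatureProbeLocationName
```

With lookups like these, each series carries a location label instead of bare numeric indexes, which is what lets dashboards and alerts keep showing probe and fan names.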
## Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)
DIRECT = already covered by the generated module walk; REGEN = OID returns data
but must be added to the module walk; remnant = stays on the Redfish exporter.

| Consumed (today) | Source after migration |
|---|---|
| fan health | REGEN `coolingDeviceStatus` .700.12.1.5 |
| consumed watts | DIRECT `amperageProbeReading` (System Board Pwr Consumption) |
| system health rollup | DIRECT `globalSystemStatus` .5.2.1 |
| PSU health | DIRECT `powerSupplyStatus`/`powerSupplySensorState` |
| memory health | DIRECT `systemStateMemoryDeviceStatusCombined` .200.10.1.27 |
| storage drive health | DIRECT `physicalDiskComponentStatus` .5.5.1.20.130.4.1.24 |
| **SSD life %** | **remnant** (SNMP=255 N/A; already inert) |
| system power state | DIRECT `systemPowerState` .5.2.4 (enum 4=on) |
| PSU input voltage | DIRECT `powerSupplyCurrentInputVoltage` .600.12.1.16 |
| system health (absent-probe) | DIRECT `globalSystemStatus` |
| **fan speed RPM (HA)** | DIRECT via remnant (HA reads exporter directly); SNMP REGEN `coolingDeviceReading` for Grafana |
| temperature | DIRECT `temperatureProbeReading` .700.20.1.6 (÷10) |
| avg watts | PromQL `avg_over_time(amperageProbeReading)` |
| **SEL log** | **remnant** (already empty) |
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
| memory size / cpu count | DIRECT `memoryDeviceSize` (sum) / `processorDeviceStatus` (count) |
| **indicator LED** | **remnant** (cosmetic) |
| storage drive info/health/capacity | DIRECT `physicalDisk*` |
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/speed .1100.50.1.{5,7,8,15}) |
| network port health/link / **Mbps** | REGEN `networkDevice*` (.1100.90); **Mbps → remnant/drop** |
| PSU output/input/capacity watts | DIRECT `powerSupplyOutputWatts`/`RatedInputWattage` |
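
A few rows of the mapping, written out as Prometheus rules to show what re-pointing looks like in practice. This is a sketch: the metric names follow the `r730_idrac_<mibName>` pattern described earlier, while the recording-rule names, the `amperageProbeLocationName` label (which depends on the module's lookups), the 1 h averaging window, the `for:` durations, and the `iDRACSystemHealthNotOK` alert name are all assumptions.

```yaml
groups:
  - name: idrac-snmp
    rules:
      # Temperature in degrees C: the SNMP probe reports tenths.
      - record: r730_idrac:temperature_celsius
        expr: r730_idrac_temperatureProbeReading / 10

      # Average watts, reconstructed in PromQL because SNMP has no native OID.
      - record: r730_idrac:system_watts:avg1h
        expr: |
          avg_over_time(
            r730_idrac_amperageProbeReading{amperageProbeLocationName="System Board Pwr Consumption"}[1h]
          )

      # Enum re-pointing: DellStatus uses 3 = OK, not the Redfish value 1.
      - alert: iDRACSystemHealthNotOK   # hypothetical alert name
        expr: r730_idrac_globalSystemStatus != 3
        for: 10m

      # The misnamed alert, pointed at an SNMP metric instead of a Redfish one.
      - alert: iDRACSNMPMetricsMissing
        expr: absent(r730_idrac_globalSystemStatus)
        for: 10m
```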
## Remnant role
The Redfish exporter stays alive, so the external ha-sofia
`sensor.r730_fan_speed` REST poll is **unchanged** (no ha-sofia edit, no
collision). Its collectors are trimmed to `sensors,system,network,storage,events`
and its Prometheus scrape slows to 10 m. `metric_relabel_configs` keeps **only**
the gap metrics (indicator LED, NIC Mbps, SSD-life, SEL), which avoids duplicate
series with SNMP.
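
The remnant scrape could be sketched as below. The job name `redfish-idrac` and `idrac_events_log_entry` come from this document; the exporter address, the `target` parameter, and the other metric-name patterns in the `keep` regex are assumptions and need to be matched against what the exporter actually emits.

```yaml
scrape_configs:
  - job_name: redfish-idrac
    scrape_interval: 10m
    scrape_timeout: 45s               # current timeout, left unchanged in this sketch
    metrics_path: /metrics
    params:
      target: ["192.168.1.4"]         # assumption: exporter takes the BMC as ?target=
    static_configs:
      - targets: ["idrac-redfish-exporter:9348"]   # hypothetical service address
    metric_relabel_configs:
      # Keep only the gap metrics; everything else is now served by SNMP.
      # All names except idrac_events_log_entry are assumed patterns.
      - source_labels: [__name__]
        regex: idrac_(events_log_entry|system_indicator_led_.*|network_port_.*speed.*|storage_drive_life_left.*)
        action: keep
      # Assumed form of the r730_idrac_ prefix relabel.
      - source_labels: [__name__]
        regex: (.+)
        target_label: __name__
        replacement: r730_idrac_${1}
```

Dropping series here does not affect the ha-sofia sensor, since it reads the exporter directly rather than Prometheus. One thing to watch at a 10 m interval: Prometheus instant queries only look back 5 m by default, so panels or alerts on these remnant metrics would likely need a range function such as `last_over_time(...[15m])` to avoid gaps between scrapes.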