6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
53 lines
4.1 KiB
Markdown
# iDRAC Redfish → SNMP migration (plan)

Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.

Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~79–1628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`

## Phase A — additive SNMP source (low risk)

- [ ] A1. Extract `dell_idrac` (lines 79–1628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
  - `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
  - `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
  - `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
  - `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
  - DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
  - `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).

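As a rough illustration of A3, the `snmp-idrac` scrape job might end up looking like the fragment below. This is a sketch only: the `static_configs` target list and the three `relabel_configs` entries follow the usual snmp_exporter scrape pattern and are assumptions, not copied from `prometheus_chart_values.tpl`; the exporter address and the iDRAC IP are taken from the A4 curl. The existing `r730_idrac_` relabel is left as a placeholder comment because its exact form is not shown here.

```yaml
# Sketch of the snmp-idrac job after A3 (assumed layout, not the real file).
- job_name: snmp-idrac
  scrape_interval: 1m
  scrape_timeout: 30s          # must stay below scrape_interval
  metrics_path: /snmp
  params:
    module: [dell_idrac]       # module appended to ups_snmp_values.yaml in A1
    auth: [public_v2]          # entry under the split `auths:` section
  static_configs:
    - targets: ["192.168.1.4"] # iDRAC address from the A4 curl (assumed target list)
  relabel_configs:
    # Standard snmp_exporter pattern: pass the device as ?target=...,
    # keep it as the instance label, then scrape the exporter itself.
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: snmp-exporter.monitoring.svc:9116
  # metric_relabel_configs: keep the existing r730_idrac_ relabel here unchanged.
```

With `scrape_timeout: 30s` the 5 s budget checked in A4 leaves plenty of headroom; if the walk ever approaches the timeout, trimming `walk:` entries is likely a better fix than raising the timeout.
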
## Phase B — re-point consumers to verified SNMP names (riskier)

- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing`→`absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing`→`absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.

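To make the B1 enum change concrete, here is a hedged sketch of two rewritten rules. The `absent(...)` expression and the `3=OK` enum come from B1; the second alert's name (`iDRACSystemStatusNotOK`), the `for:` durations, and the `severity` labels are hypothetical placeholders, and the final metric names must be whatever A4 confirms the exporter emits.

```yaml
# Sketch of re-pointed rules (names other than the absent-probe are hypothetical).
- alert: iDRACSNMPMetricsMissing
  # Re-targeted absent-probe from B1: fires when the SNMP series disappears.
  expr: absent(r730_idrac_globalSystemStatus)
  for: 10m                     # assumed duration
  labels:
    severity: warning          # assumed label
- alert: iDRACSystemStatusNotOK   # hypothetical name
  # SNMP enum: 3 = OK (the Redfish-based rule compared against 1).
  expr: r730_idrac_globalSystemStatus != 3
  for: 5m                      # assumed duration
  labels:
    severity: critical         # assumed label
```

Comparing with `!= 3` rather than listing the bad values keeps the rule correct for any non-OK state the enum can report, which is probably the safer default while the enum mapping is still being verified.
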
## Phase C — thin Redfish remnant

- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra, which SNMP now covers). (HA reads `sensors` directly; unchanged.)
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter, so it should be untouched).

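The keep-only filter in C2 could be sketched as below. The metric-name regex is a hypothetical placeholder: the real Redfish exporter series names for the indicator LED, NIC speed, SSD life, and SEL entries need to be read off the exporter's `/metrics` output before this is written for real.

```yaml
# Sketch of the thinned redfish-idrac job (regex alternatives are hypothetical names).
- job_name: redfish-idrac
  scrape_interval: 10m
  metric_relabel_configs:
    # Keep only the gap series; everything else from this exporter is dropped
    # at ingestion, so it cannot duplicate the SNMP-sourced series.
    - source_labels: [__name__]
      regex: "idrac_(indicator_led|network_port_speed|storage_drive_life|events_log).*"
      action: keep
```

Metric relabeling does not apply to Prometheus's automatically generated series such as `up`, so target-health checks for this job keep working even with a strict `keep` rule.
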
## Phase D — docs + ship

- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook.
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.

## Rollback

Everything is Terraform-managed: reverting the monitoring-stack commit and
running `scripts/tg apply` restores the Redfish-primary state. Phase A is
additive (safe to leave even if B/C are reverted).

## Presence

Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.