infra/docs/plans/2026-06-05-idrac-snmp-migration-plan.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

53 lines
4.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# iDRAC Redfish → SNMP migration (plan)
Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.
Files:
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~791628)
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~8111186)
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`
## Phase A — additive SNMP source (low risk)
- [ ] A1. Extract `dell_idrac` (lines 791628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
- `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
- `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
- `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
- `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
- DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
- `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
## Phase B — re-point consumers to verified SNMP names (riskier)
- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 8111186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing``absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing``absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts `avg_over_time(...amperageProbeReading...[$__interval])`.
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.
## Phase C — thin Redfish remnant
- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra now SNMP). (HA reads `sensors` directly unchanged.)
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) avoids duplicate series with SNMP.
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter should be untouched).
## Phase D — docs + ship
- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook.
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.
## Rollback
All Terraform-managed. Revert the monitoring-stack commit + `scripts/tg apply`
restores the Redfish-primary state. Phase A is additive (safe to leave even if
B/C are reverted).
## Presence
Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.