6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4.1 KiB
4.1 KiB
iDRAC Redfish → SNMP migration (plan)
Companion to 2026-06-05-idrac-snmp-migration-design.md. Execute in order;
applies are staged so the safe/additive work lands and is verified before any
consumer re-pointing.
Files:
stacks/monitoring/modules/monitoring/ups_snmp_values.yaml(merge target)stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml(dell_idrac source, ~79–1628)stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl(scrape jobs ~3150/3170, alerts ~811–1186)stacks/monitoring/modules/monitoring/idrac.tf(Redfish exporter / remnant)stacks/monitoring/modules/monitoring/dashboards/idrac.json,cluster_health.json
Phase A — additive SNMP source (low risk)
- A1. Extract
dell_idrac(lines 79–1628) fromprometheus_snmp_chart_values.yaml; strip its embeddedauth:/version:(the merge target uses the splitauths:format) and append the module undermodules:inups_snmp_values.yaml. - A2. Hand-add to dell_idrac
walk:+metrics:(withlookups:for labels):coolingDeviceReading.4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookupcoolingDeviceLocationName.8)coolingDeviceStatus.4.700.12.1.5 (fan health, enum)networkDeviceStatus/networkDeviceConnectionStatus(.1100.90.1.{3,17})systemBIOSVersionName.300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2- DIMM
.1100.50.1.{5 status, 7 type, 8 location, 15 speed} physicalDiskRemainingRatedWriteEndurance.5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
- A3.
snmp-idracjob (prometheus_chart_values.tpl~3150): addparams: { module: [dell_idrac], auth: [public_v2] },scrape_interval: 1m,scrape_timeout: 30s. Keep ther730_idrac_relabel. - A4. Validate before any repoint: apply monitoring stack;
curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'returns all REGEN/DIRECT metrics with readable labels;scrape_duration_seconds{job="snmp-idrac"}< 5 s; confirm exact emitted metric names + label keys (feeds B/C).
Phase B — re-point consumers to verified SNMP names (riskier)
- B1. Rewrite ~12 alert exprs (
prometheus_chart_values.tpl811–1186) to SNMP names + SNMP enums (3=OKnot1; power4=onnot2). Re-target absent-probes:iDRACRedfishMetricsMissing→absent(r730_idrac_powerSupplyCurrentInputVoltage);iDRACSNMPMetricsMissing→absent(r730_idrac_globalSystemStatus)(also fixes the misnomer). - B2. Re-point ~26 panels in
idrac.json+cluster_health.jsonto SNMP names/labels; avg-watts →avg_over_time(...amperageProbeReading...[$__interval]). - B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep
prometheus-serverconfigmap /prometheus_chart_values.tplkeep rules) so they aren't silently dropped. - B4. Apply; verify each re-pointed alert has data (no spurious
absentfiring) and panels render.
Phase C — thin Redfish remnant
- C1.
idrac.tfconfig map:metrics: all: false+ enable onlysensors, system, network, storage, events(drop power/memory/processors/manager/extra — now SNMP). (HA readssensorsdirectly — unchanged.) - C2.
redfish-idracjob:scrape_interval: 10m; addmetric_relabel_configsto keep only the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP. - C3. Apply; verify HA
sensor.r730_fan_speedstill updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter — should be untouched).
Phase D — docs + ship
- D1. Update
docs/architecture/monitoring.md(iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook. - D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.
Rollback
All Terraform-managed. Revert the monitoring-stack commit + scripts/tg apply
restores the Redfish-primary state. Phase A is additive (safe to leave even if
B/C are reverted).
Presence
Claim stack:monitoring + service:idrac-redfish-exporter before each apply.