infra/docs/plans/2026-06-04-pve-fan-control-design.md
Viktor Barzin fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00

7.7 KiB
Raw Blame History

PVE R730 presence-aware fan control — design

Date: 2026-06-04 Status: implemented Scripts: infra/scripts/fan-control.{sh,service,env.example}, test-fan-control.sh Runbook: infra/docs/runbooks/fan-control.md

Problem

The Dell R730 PVE host (192.168.1.127) runs its CPU at ~7277°C under normal cluster load. That is safe (firmware warning at 88°C, critical 93°C) but the iDRAC's stock fan curve optimises for quiet, not cool — it pins the fans at the ~7080 RPM floor even at 72°C / load 30 and only ramps near ~80°C. We want the CPU to run cooler when it costs nothing (the box is in the garage, usually empty) while staying quiet when someone is physically in the garage.

Measured fan/temp relationship (manual IPMI sweep, 2026-06-04)

At a comparable CPU load (~4553 % busy):

Fan setting Fan RPM CPU temp
Auto (floor) 7,080 7172°C
50 % 9,360 6566°C
70 % 12,800 6061°C
100 % 17,000 5556°C

Best °C-per-RPM is the first step; beyond ~70 % it is mostly noise. ~16°C of swing is available.

Power characterization (sweep 2026-06-05)

Averaged wall power (iDRAC DCMI) + temp at each fan setting:

Fan RPM Power CPU load
auto 7,080 296 W 68°C 21
20 % 4,800 281 W 73°C 20
30 % 6,360 288 W 72°C 19
50 % 9,360 299 W 65°C 18
60 % 11,040 303 W 61°C 17
70 % 12,720 324 W 59°C 16
100 % 16,920 378 W 59°C 17

The cooling-per-watt knee is ~60 %. Fan power follows ~RPM³: 60→70 % costs +21 W for 2°C; 70→100 % costs +54 W for 0°C (the CPU floors ~59°C at cluster load — more airflow does nothing). Full speed draws ~97 W (~850 kWh/yr) over the floor and buys nothing past 60 %.

Decision (2026-06-05): the COOL curve caps its normal band at 60 % (~303 W, ~61°C) — capturing essentially all achievable cooling while avoiding the wasteful 80100 % zone, now reserved as a high-load safety ramp (≥73/79°C) before the 83°C ceiling. QUIET is unchanged (already at the low-power floor: 20 % / 4,800 RPM / 281 W). Verified live after re-tune: 63°C, 60 %, ~267 W.

Decisions

  1. Custom bash daemon + systemd service, deployed to the PVE host the same way as apply-mbps-caps / daily-backup (source in infra/scripts/, scp to /usr/local/bin). It cannot be Terraform/k8s — it runs on the bare host where IPMI lives. (OSS tigerblue77/Dell-iDRAC-fan-controller was considered; rejected — it is a Docker container, off-pattern here, and unaware of our constraints.)

  2. CPU temperature is the only control input. The Tesla T4 has its own always-on fan (owner-confirmed), so it self-cools and does not depend on chassis airflow — no GPU coupling needed.

  3. Presence = the garage door, because the server is in the garage (memory id=1723); noise only matters to people physically there. Signal: ha-sofia sensor.garage_door_state_bg. Open now, or last changed within HOLD_SECS (15 min) ⇒ someone's around ⇒ QUIET; otherwise COOL. house_mode was rejected — it tracks apartment occupancy, irrelevant to garage noise.

  4. Two continuous LINEAR curves, picked by presence. (Originally discrete step-bands; replaced 2026-06-05 — the bands flapped at edges, e.g. 45↔65%. Web research: a linear curve + 23°C hysteresis is the homelab standard; PID is overkill for this slow thermal loop and even PID projects "only lower, don't chase a setpoint".) fan% interpolates between per-mode anchors, clamped flat outside; both reach 100% right at the 83°C ceiling:

    Mode T_LO → P_LO T_HI → P_HI slope
    COOL (garage empty) 50°C → 30% 83°C → 100% ~2.1%/°C (≈51% at the ~60°C equilibrium)
    QUIET (occupied) 68°C → 20% 83°C → 100% ~4.7%/°C (near-silent until ~70°C)

    Anchors are env-tunable (COOL_T_LO/P_LO/T_HI/P_HI, QUIET_*). Under normal load the COOL equilibrium (~60°C → ~51%) sits near the measured ~60% power knee; the ramp toward 100% only engages at genuinely high temp (safety). Anti-oscillation: asymmetric hysteresis (ramp up immediately, ease down only once the curve wants lower 3°C hotter) plus a MIN_STEP (3%) min-change threshold so 12% wiggles don't churn IPMI writes.

Safety

Manual fan mode bypasses the iDRAC's own protection, so it is backstopped:

  • Daemon exit/crash/stop → bash EXIT trap + systemd ExecStopPost both run ipmitool raw 0x30 0x30 0x01 0x01 (restore Dell auto). Restart=on-failure.
  • CPU ≥ CEILING (83°C) → hand back to Dell auto until temp holds below RESUME_BELOW (75°C) for RESUME_STABLE (120 s), then resume manual.
  • IPMI read failures ≥ MAX_IPMI_FAILS → restore Dell auto.
  • ha-sofia unreachable → keep the last good presence decision; default COOL at cold start (thermally safe).

Observability

Pushes to the Pushgateway (http://10.0.20.100:30091, job fan_control): pve_fan_control_cpu_temp_celsius, _fan_percent, _mode (1 quiet / 2 cool / 3 manual / 0 fallback), _ha_reachable, _fallback, _fan_rpm, and _fan_watts_est.

Fan power is ESTIMATED — the iDRAC exposes only total DCMI watts + RPM (no per-fan power), so _fan_watts_est models it from RPM via the fan affinity law (power ∝ RPM³), calibrated to the 2026-06-05 sweep: fan_W ≈ 0.0205·(RPM/1000)³ (≈2 W at the floor → ~99 W at full; fits the sweep within ~3 W). Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the dashboard-it Server view, next to total power (sensor.r730_power_consumption, redfish) — so the fan tax of the control curve is visible. The existing CPU-temp alert is unaffected.

Testing

test-fan-control.sh sources the script (main is guarded by a BASH_SOURCE check) and unit-tests the pure functions: both curves, hysteresis up/down, presence open/recent/stale, temperature parsing, jq-free JSON field extraction, and percent→hex. 36 assertions, no hardware needed. The daemon also supports DRY_RUN=1 and RUN_ONCE=1 for integration checks.

HA control (added 2026-06-05, on the host daemon)

Delivered ahead of the cron migration (which is Vault-gated) by teaching the host daemon to poll two ha-sofia helpers each loop (fc_resolve): input_select.r730_fan_mode (auto/cool/quiet/manual) + input_number.r730_fan_manual_pct. auto = the garage-presence curve above; cool/quiet force that curve; manual holds a fixed %; CEILING still overrides. The simplified dashboard (2026-06-05) exposes just three things — fan speed (%/RPM), an Override % slider, and a Lock toggle. Lock = "freeze current speed / algo off": automation.r730_fan_lock_freeze_current_speed_resume_algo snapshots the live target % into Override and sets mode=manual on lock-ON, and mode=auto on lock-OFF — the daemon needs no change, the toggle just drives the mode. cool/quiet stay reachable via the entity but are off the dashboard. The 60-min automation.r730_fan_mode_auto_revert is retained as a dormant safety net (manual now only happens while locked, which it skips). The daemon just polls and actuates. Monitoring + control live on the dashboard-it "Server" view (REST sensors: fan RPM from the redfish exporter; mode/target-% from the Pushgateway). The same logic already exists in the Python controller (r730-fan-control/) for the eventual in-cluster CronJob; when that deploys it supersedes the host daemon.

Rollback

systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01 on the host returns the box to stock firmware fan control. See the runbook.