The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/ cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5.7 KiB
Runbook — PVE R730 fan control
The control logic lives in Home Assistant; the PVE host runs only a thin
actuator. HA computes the fan setpoint from the CPU temperature and the
dashboard inputs and publishes ONE number, sensor.r730_fan_command_pct. The
host daemon reads that number each loop and applies it over IPMI — it does no
math. Design + history: infra/docs/plans/2026-06-04-pve-fan-control-design.md.
History: (1) 2026-06-04/05 presence-aware two-curve controller (COOL/QUIET by garage door). (2) 2026-06-07 single linear curve, presence removed. (3) 2026-06-08 all control moved into HA, host became a thin actuator, additive bias replaced the ease-down hysteresis. (4) 2026-06-15 daemon anti-flap: holds the last command through transient HA losses instead of dumping to Dell auto.
What it is
- HA (brain), on ha-sofia — NOT in this repo: the
input_numbersliders, the command template sensor, the display/equilibrium sensors, the Lock/Override controls, and the dashboard cards. Auto-git-tracked on ha-sofia by the version-control add-on. /usr/local/bin/fan-control— bash actuator (source:infra/scripts/fan-control.sh).fan-control.service— systemd unit (Type=simple, restarts on failure)./etc/fan-control.env— config incl. the ha-sofia token (chmod 600, not in git).
HA brain — where the curve lives (dashboard-it → "Server" view → Fans)
sensor.r730_fan_command_pct (template) computes:
command% = clamp( curve(temp) + bias, 0..100 ), where curve(temp) is a linear
ramp from (Temp min, Duty min) to (Temp max, Duty max) over
sensor.r730_cpu_temperature, plus an asymmetric output deadband (rise
immediately; ease down only once it would drop ≥ Hysteresis). When Lock is
on it outputs the Override % directly.
Inputs (input_number sliders): r730_fan_temp_min, r730_fan_temp_max,
r730_fan_duty_min, r730_fan_duty_max, r730_fan_bias (flat % added on top —
guarantees a floor), r730_fan_hysteresis (output deadband %).
Slope = (Duty max − Duty min)/(Temp max − Temp min) — steeper/higher-bias/lower-Temp-min
⇒ lower steady-state CPU temp (it's a P controller; the curve sets the equilibrium).
Manual override: input_boolean.r730_fan_lock (Lock — freeze) + input_number.r730_fan_manual_pct (Override %).
Readout sensors: sensor.r730_fan_command_display ("Fan set point", "X % (Y rpm)"),
sensor.r730_expected_equilibrium_temp (predicted equilibrium at current load),
sensor.r730_cpu_load, sensor.r730_fan_speed_avg (mean of 6 fans),
sensor.r730_fan_power_avg (cube-law estimate). The Prometheus-backed REST
sensors live in rest_resources/idrac_redfish_exporter.yaml on ha-sofia and have
value-template fallbacks so they don't blink unavailable on a transient empty.
Actuator (host) — what the daemon does
Loop every ~15 s, using only the existing IPMI + HA-REST methods:
- read
command%from HA (/api/states/$COMMAND_ENTITY), validate (numeric + not stale >STALE_SECS); - apply it via
ipmitool raw 0x30 0x30 0x02 0xff 0x<NN>(writes only if the change clearsMIN_STEP); - read CPU temp + fan rpm for safety + telemetry (Pushgateway).
Anti-flap: on a missing/stale command it holds the last applied % for up
to HA_GRACE_SECS (300 s) instead of falling back; only sustained loss hands the
fans to Dell auto.
Safety (on the host, independent of HA)
CPU ≥ CEILING (83 °C), repeated IPMI failures, sustained HA loss, or daemon
stop/crash → hand the fans back to Dell auto (raw 0x30 0x30 0x01 0x01;
EXIT trap + systemd ExecStopPost). The 83 °C ceiling uses the daemon's own
IPMI temp read, so it protects even if HA is wrong/unreachable.
Quick status
ssh root@192.168.1.127 systemctl status fan-control
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
Log line: temp=64C cmd=49% rpm=9380 (was -1%) (cmd = the % read from HA and
applied). HA command miss — holding 49% = a transient HA blip being ridden out;
HA command lost (...) — Dell auto = sustained loss.
Tune
The whole curve (anchors + bias + hysteresis) is tuned live from the HA
dashboard — no host access needed. /etc/fan-control.env only holds the
actuator plumbing + safety knobs (COMMAND_ENTITY, STALE_SECS, HA_GRACE_SECS,
MIN_STEP, CEILING); edit it then systemctl restart fan-control.
Deploy / update (daemon source)
scp -i ~/.ssh/pve_root scripts/fan-control.sh root@192.168.1.127:/tmp/fan-control.new
ssh -i ~/.ssh/pve_root root@192.168.1.127 'install -m0755 /tmp/fan-control.new /usr/local/bin/fan-control && systemctl restart fan-control'
(fan-control.service only on a unit change → also systemctl daemon-reload.)
Symptoms & checks
| Symptom | Check |
|---|---|
| Fans surge then crash to ~7100 then surge | flapping to Dell auto — journalctl -u fan-control | grep -E 'holding|Dell auto'; pre-2026-06-15 this was the stale-command bug (now fixed). |
| Fans stuck loud | journalctl — CEILING breach or HA command lost? Check CPU temp + HA reachability. |
A readout blinks unavailable |
the REST value-template fallback should hold it; a 1×/8h blip at ~02:00 (backup window) is a benign fetch hiccup. |
| Slider changes ignored | does sensor.r730_fan_command_pct change in HA? token valid? |
| Box left in manual after crash | ipmitool raw 0x30 0x30 0x01 0x01 to force Dell auto. |
Verify wiring
ssh -i ~/.ssh/pve_root root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
The log cmd=% should equal sensor.r730_fan_command_pct. Move a slider so the
HA sensor changes, re-run, and the applied cmd=% should follow.