infra/docs/runbooks/fan-control.md
Emil Barzin 1ba453c65d
All checks were successful
ci/woodpecker/push/default Pipeline was successful
fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model
The committed docs still described the 2026-06-04 presence-aware daemon. Bring
them in line with what is actually deployed: HA computes the setpoint, the host
is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias,
anti-flap hold-last, and the new HA readout sensors (command/equilibrium/
cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone
lost in the workstation reshuffle; re-created here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:11:48 +00:00

5.7 KiB
Raw Blame History

Runbook — PVE R730 fan control

The control logic lives in Home Assistant; the PVE host runs only a thin actuator. HA computes the fan setpoint from the CPU temperature and the dashboard inputs and publishes ONE number, sensor.r730_fan_command_pct. The host daemon reads that number each loop and applies it over IPMI — it does no math. Design + history: infra/docs/plans/2026-06-04-pve-fan-control-design.md.

History: (1) 2026-06-04/05 presence-aware two-curve controller (COOL/QUIET by garage door). (2) 2026-06-07 single linear curve, presence removed. (3) 2026-06-08 all control moved into HA, host became a thin actuator, additive bias replaced the ease-down hysteresis. (4) 2026-06-15 daemon anti-flap: holds the last command through transient HA losses instead of dumping to Dell auto.

What it is

  • HA (brain), on ha-sofia — NOT in this repo: the input_number sliders, the command template sensor, the display/equilibrium sensors, the Lock/Override controls, and the dashboard cards. Auto-git-tracked on ha-sofia by the version-control add-on.
  • /usr/local/bin/fan-control — bash actuator (source: infra/scripts/fan-control.sh).
  • fan-control.service — systemd unit (Type=simple, restarts on failure).
  • /etc/fan-control.env — config incl. the ha-sofia token (chmod 600, not in git).

HA brain — where the curve lives (dashboard-it → "Server" view → Fans)

sensor.r730_fan_command_pct (template) computes: command% = clamp( curve(temp) + bias, 0..100 ), where curve(temp) is a linear ramp from (Temp min, Duty min) to (Temp max, Duty max) over sensor.r730_cpu_temperature, plus an asymmetric output deadband (rise immediately; ease down only once it would drop ≥ Hysteresis). When Lock is on it outputs the Override % directly.

Inputs (input_number sliders): r730_fan_temp_min, r730_fan_temp_max, r730_fan_duty_min, r730_fan_duty_max, r730_fan_bias (flat % added on top — guarantees a floor), r730_fan_hysteresis (output deadband %). Slope = (Duty max Duty min)/(Temp max Temp min) — steeper/higher-bias/lower-Temp-min ⇒ lower steady-state CPU temp (it's a P controller; the curve sets the equilibrium).

Manual override: input_boolean.r730_fan_lock (Lock — freeze) + input_number.r730_fan_manual_pct (Override %).

Readout sensors: sensor.r730_fan_command_display ("Fan set point", "X % (Y rpm)"), sensor.r730_expected_equilibrium_temp (predicted equilibrium at current load), sensor.r730_cpu_load, sensor.r730_fan_speed_avg (mean of 6 fans), sensor.r730_fan_power_avg (cube-law estimate). The Prometheus-backed REST sensors live in rest_resources/idrac_redfish_exporter.yaml on ha-sofia and have value-template fallbacks so they don't blink unavailable on a transient empty.

Actuator (host) — what the daemon does

Loop every ~15 s, using only the existing IPMI + HA-REST methods:

  1. read command% from HA (/api/states/$COMMAND_ENTITY), validate (numeric + not stale > STALE_SECS);
  2. apply it via ipmitool raw 0x30 0x30 0x02 0xff 0x<NN> (writes only if the change clears MIN_STEP);
  3. read CPU temp + fan rpm for safety + telemetry (Pushgateway).

Anti-flap: on a missing/stale command it holds the last applied % for up to HA_GRACE_SECS (300 s) instead of falling back; only sustained loss hands the fans to Dell auto.

Safety (on the host, independent of HA)

CPU ≥ CEILING (83 °C), repeated IPMI failures, sustained HA loss, or daemon stop/crash → hand the fans back to Dell auto (raw 0x30 0x30 0x01 0x01; EXIT trap + systemd ExecStopPost). The 83 °C ceiling uses the daemon's own IPMI temp read, so it protects even if HA is wrong/unreachable.

Quick status

ssh root@192.168.1.127 systemctl status fan-control
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'

Log line: temp=64C cmd=49% rpm=9380 (was -1%) (cmd = the % read from HA and applied). HA command miss — holding 49% = a transient HA blip being ridden out; HA command lost (...) — Dell auto = sustained loss.

Tune

The whole curve (anchors + bias + hysteresis) is tuned live from the HA dashboard — no host access needed. /etc/fan-control.env only holds the actuator plumbing + safety knobs (COMMAND_ENTITY, STALE_SECS, HA_GRACE_SECS, MIN_STEP, CEILING); edit it then systemctl restart fan-control.

Deploy / update (daemon source)

scp -i ~/.ssh/pve_root scripts/fan-control.sh root@192.168.1.127:/tmp/fan-control.new
ssh -i ~/.ssh/pve_root root@192.168.1.127 'install -m0755 /tmp/fan-control.new /usr/local/bin/fan-control && systemctl restart fan-control'

(fan-control.service only on a unit change → also systemctl daemon-reload.)

Symptoms & checks

Symptom Check
Fans surge then crash to ~7100 then surge flapping to Dell auto — journalctl -u fan-control | grep -E 'holding|Dell auto'; pre-2026-06-15 this was the stale-command bug (now fixed).
Fans stuck loud journalctlCEILING breach or HA command lost? Check CPU temp + HA reachability.
A readout blinks unavailable the REST value-template fallback should hold it; a 1×/8h blip at ~02:00 (backup window) is a benign fetch hiccup.
Slider changes ignored does sensor.r730_fan_command_pct change in HA? token valid?
Box left in manual after crash ipmitool raw 0x30 0x30 0x01 0x01 to force Dell auto.

Verify wiring

ssh -i ~/.ssh/pve_root root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'

The log cmd=% should equal sensor.r730_fan_command_pct. Move a slider so the HA sensor changes, re-run, and the applied cmd=% should follow.