All checks were successful
ci/woodpecker/push/default Pipeline was successful
The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/ cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
103 lines
5.7 KiB
Markdown
103 lines
5.7 KiB
Markdown
# Runbook — PVE R730 fan control
|
||
|
||
**The control logic lives in Home Assistant; the PVE host runs only a thin
|
||
actuator.** HA computes the fan setpoint from the CPU temperature and the
|
||
dashboard inputs and publishes ONE number, `sensor.r730_fan_command_pct`. The
|
||
host daemon reads that number each loop and applies it over IPMI — it does **no**
|
||
math. Design + history: `infra/docs/plans/2026-06-04-pve-fan-control-design.md`.
|
||
|
||
> **History:** (1) 2026-06-04/05 presence-aware two-curve controller (COOL/QUIET
|
||
> by garage door). (2) 2026-06-07 single linear curve, presence removed.
|
||
> (3) 2026-06-08 **all control moved into HA**, host became a thin actuator,
|
||
> additive **bias** replaced the ease-down hysteresis. (4) 2026-06-15 daemon
|
||
> **anti-flap**: holds the last command through transient HA losses instead of
|
||
> dumping to Dell auto.
|
||
|
||
## What it is
|
||
|
||
- **HA (brain), on ha-sofia — NOT in this repo:** the `input_number` sliders, the
|
||
command template sensor, the display/equilibrium sensors, the Lock/Override
|
||
controls, and the dashboard cards. Auto-git-tracked on ha-sofia by the
|
||
version-control add-on.
|
||
- `/usr/local/bin/fan-control` — bash **actuator** (source: `infra/scripts/fan-control.sh`).
|
||
- `fan-control.service` — systemd unit (`Type=simple`, restarts on failure).
|
||
- `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git).
|
||
|
||
## HA brain — where the curve lives (dashboard-it → "Server" view → Fans)
|
||
|
||
`sensor.r730_fan_command_pct` (template) computes:
|
||
`command% = clamp( curve(temp) + bias, 0..100 )`, where `curve(temp)` is a linear
|
||
ramp from `(Temp min, Duty min)` to `(Temp max, Duty max)` over
|
||
`sensor.r730_cpu_temperature`, plus an **asymmetric output deadband** (rise
|
||
immediately; ease down only once it would drop ≥ Hysteresis). When **Lock** is
|
||
on it outputs the Override % directly.
|
||
|
||
**Inputs** (`input_number` sliders): `r730_fan_temp_min`, `r730_fan_temp_max`,
|
||
`r730_fan_duty_min`, `r730_fan_duty_max`, `r730_fan_bias` (flat % added on top —
|
||
guarantees a floor), `r730_fan_hysteresis` (output deadband %).
|
||
Slope = `(Duty max − Duty min)/(Temp max − Temp min)` — steeper/higher-bias/lower-Temp-min
|
||
⇒ lower steady-state CPU temp (it's a P controller; the curve sets the equilibrium).
|
||
|
||
**Manual override:** `input_boolean.r730_fan_lock` (Lock — freeze) + `input_number.r730_fan_manual_pct` (Override %).
|
||
|
||
**Readout sensors:** `sensor.r730_fan_command_display` ("Fan set point", "X % (Y rpm)"),
|
||
`sensor.r730_expected_equilibrium_temp` (predicted equilibrium at current load),
|
||
`sensor.r730_cpu_load`, `sensor.r730_fan_speed_avg` (mean of 6 fans),
|
||
`sensor.r730_fan_power_avg` (cube-law estimate). The Prometheus-backed REST
|
||
sensors live in `rest_resources/idrac_redfish_exporter.yaml` on ha-sofia and have
|
||
value-template fallbacks so they don't blink `unavailable` on a transient empty.
|
||
|
||
## Actuator (host) — what the daemon does
|
||
|
||
Loop every ~15 s, using only the existing IPMI + HA-REST methods:
|
||
1. read `command%` from HA (`/api/states/$COMMAND_ENTITY`), validate (numeric + not stale > `STALE_SECS`);
|
||
2. apply it via `ipmitool raw 0x30 0x30 0x02 0xff 0x<NN>` (writes only if the change clears `MIN_STEP`);
|
||
3. read CPU temp + fan rpm for safety + telemetry (Pushgateway).
|
||
|
||
**Anti-flap:** on a missing/stale command it **holds the last applied %** for up
|
||
to `HA_GRACE_SECS` (300 s) instead of falling back; only sustained loss hands the
|
||
fans to Dell auto.
|
||
|
||
## Safety (on the host, independent of HA)
|
||
`CPU ≥ CEILING (83 °C)`, repeated IPMI failures, sustained HA loss, or daemon
|
||
stop/crash → hand the fans back to **Dell auto** (`raw 0x30 0x30 0x01 0x01`;
|
||
EXIT trap + systemd `ExecStopPost`). The 83 °C ceiling uses the daemon's own
|
||
IPMI temp read, so it protects even if HA is wrong/unreachable.
|
||
|
||
## Quick status
|
||
```bash
|
||
ssh root@192.168.1.127 systemctl status fan-control
|
||
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
|
||
```
|
||
Log line: `temp=64C cmd=49% rpm=9380 (was -1%)` (`cmd` = the % read from HA and
|
||
applied). `HA command miss — holding 49%` = a transient HA blip being ridden out;
|
||
`HA command lost (...) — Dell auto` = sustained loss.
|
||
|
||
## Tune
|
||
The whole curve (anchors + bias + hysteresis) is tuned **live from the HA
|
||
dashboard** — no host access needed. `/etc/fan-control.env` only holds the
|
||
actuator plumbing + safety knobs (`COMMAND_ENTITY`, `STALE_SECS`, `HA_GRACE_SECS`,
|
||
`MIN_STEP`, `CEILING`); edit it then `systemctl restart fan-control`.
|
||
|
||
## Deploy / update (daemon source)
|
||
```bash
|
||
scp -i ~/.ssh/pve_root scripts/fan-control.sh root@192.168.1.127:/tmp/fan-control.new
|
||
ssh -i ~/.ssh/pve_root root@192.168.1.127 'install -m0755 /tmp/fan-control.new /usr/local/bin/fan-control && systemctl restart fan-control'
|
||
```
|
||
(`fan-control.service` only on a unit change → also `systemctl daemon-reload`.)
|
||
|
||
## Symptoms & checks
|
||
| Symptom | Check |
|
||
|---------|-------|
|
||
| Fans surge then crash to ~7100 then surge | flapping to Dell auto — `journalctl -u fan-control \| grep -E 'holding\|Dell auto'`; pre-2026-06-15 this was the stale-command bug (now fixed). |
|
||
| Fans stuck loud | `journalctl` — `CEILING` breach or `HA command lost`? Check CPU temp + HA reachability. |
|
||
| A readout blinks `unavailable` | the REST value-template fallback should hold it; a 1×/8h blip at ~02:00 (backup window) is a benign fetch hiccup. |
|
||
| Slider changes ignored | does `sensor.r730_fan_command_pct` change in HA? token valid? |
|
||
| Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. |
|
||
|
||
## Verify wiring
|
||
```bash
|
||
ssh -i ~/.ssh/pve_root root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
|
||
```
|
||
The log `cmd=%` should equal `sensor.r730_fan_command_pct`. Move a slider so the
|
||
HA sensor changes, re-run, and the applied `cmd=%` should follow.
|