diff --git a/docs/plans/2026-06-04-pve-fan-control-design.md b/docs/plans/2026-06-04-pve-fan-control-design.md
index a2d0216f..ccc972b3 100644
--- a/docs/plans/2026-06-04-pve-fan-control-design.md
+++ b/docs/plans/2026-06-04-pve-fan-control-design.md
@@ -1,10 +1,32 @@
 # PVE R730 presence-aware fan control — design

 **Date:** 2026-06-04
-**Status:** implemented
+**Status:** implemented; **redesigned 2026-06-08, anti-flap 2026-06-15** (see update below)
 **Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh`
 **Runbook:** `infra/docs/runbooks/fan-control.md`

+> ## Update — control moved to HA; host is a thin actuator
+>
+> - **2026-06-07:** presence/two-curve scheme replaced by a single linear curve;
+>   all garage-presence logic removed.
+> - **2026-06-08:** **all control moved into Home Assistant.** HA owns the curve
+>   thresholds, duty %, an additive **bias** (replaces the ease-down hysteresis),
+>   plus manual/lock, and publishes `sensor.r730_fan_command_pct =
+>   clamp(curve(temp)+bias, 0..100)` with an asymmetric output deadband. The host
+>   `fan-control.sh` is now a **thin actuator**: read that one number, validate,
+>   apply over IPMI — no local math. Independent host safety (CPU≥83 °C, IPMI
+>   fail, HA loss) hands the fans to Dell auto. It's a P controller, so the curve
+>   slope/offset set the steady-state equilibrium temperature (not a setpoint).
+> - **2026-06-15:** daemon **anti-flap** — on a transient HA miss it HOLDS the
+>   last applied % for `HA_GRACE_SECS` (300 s) instead of dumping to Dell auto,
+>   and `STALE_SECS` loosened 120→1800 (staleness only happens at flat temp,
+>   where the held value is still valid). Killed a ~14%-of-the-time flap to the
+>   Dell floor; verified fallback 14%→0%, command std 16→3 over 8 h.
+>
+> The HA objects (sliders, command template, display/equilibrium sensors,
+> Lock/Override, dashboard cards, REST sensors) live on ha-sofia, not this repo.
+> Sections below are retained as historical context.
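
The formula in the update note, `clamp(curve(temp)+bias, 0..100)` with an asymmetric output deadband, can be sketched in shell for illustration. This is not the HA template itself (that lives on ha-sofia and is not part of this diff). Every function name is hypothetical, the anchors used in the examples are sample numbers only, and the flat behaviour outside the two anchors is an assumption.

```bash
# Illustrative sketch only; the real computation is an HA template sensor.
# All names below are hypothetical.

# clamp VALUE LO HI -> VALUE limited to the range [LO, HI]
clamp() (
  v=$1; lo=$2; hi=$3
  [ "$v" -lt "$lo" ] && v=$lo
  [ "$v" -gt "$hi" ] && v=$hi
  echo "$v"
)

# curve_pct TEMP T_MIN T_MAX D_MIN D_MAX -> duty % on the linear ramp.
# Assumption: the curve is flat below T_MIN and above T_MAX.
# Integer math, so the result truncates toward zero.
curve_pct() (
  t=$1; tmin=$2; tmax=$3; dmin=$4; dmax=$5
  if [ "$t" -le "$tmin" ]; then
    echo "$dmin"
  elif [ "$t" -ge "$tmax" ]; then
    echo "$dmax"
  else
    echo $(( dmin + (dmax - dmin) * (t - tmin) / (tmax - tmin) ))
  fi
)

# command_pct TEMP T_MIN T_MAX D_MIN D_MAX BIAS -> clamp(curve(temp) + bias, 0..100)
command_pct() (
  base=$(curve_pct "$1" "$2" "$3" "$4" "$5")
  clamp $(( base + $6 )) 0 100
)

# deadband PREV NEW HYST -> asymmetric output deadband:
# a rise passes through immediately; a drop is ignored until it is
# at least HYST points below the previous output.
deadband() (
  prev=$1; new=$2; hyst=$3
  if [ "$new" -gt "$prev" ] || [ $(( prev - new )) -ge "$hyst" ]; then
    echo "$new"
  else
    echo "$prev"
  fi
)

# Sample anchors 50 C / 30 % and 83 C / 100 %, bias 5:
command_pct 64 50 83 30 100 5   # 30 + 70*14/33 = 59, plus bias = 64
deadband 60 58 5                # drop of 2 is inside the deadband, stays 60
```

Because the output depends only on the current temperature (plus a constant bias), a steeper ramp or a larger bias means more fan at any given temperature, which is why those knobs shift the equilibrium temperature rather than setting a target.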
+
 ## Problem

 The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72–77°C under normal
diff --git a/docs/runbooks/fan-control.md b/docs/runbooks/fan-control.md
index 390c349a..d243b9fc 100644
--- a/docs/runbooks/fan-control.md
+++ b/docs/runbooks/fan-control.md
@@ -1,122 +1,103 @@
-# Runbook — PVE R730 fan-control daemon
+# Runbook — PVE R730 fan control

-Presence-aware IPMI fan controller on the PVE host (192.168.1.127). Runs the
-CPU cool when the garage is empty, quiet when someone's in the garage. Design:
-`infra/docs/plans/2026-06-04-pve-fan-control-design.md`.
+**The control logic lives in Home Assistant; the PVE host runs only a thin
+actuator.** HA computes the fan setpoint from the CPU temperature and the
+dashboard inputs and publishes ONE number, `sensor.r730_fan_command_pct`. The
+host daemon reads that number each loop and applies it over IPMI — it does **no**
+math. Design + history: `infra/docs/plans/2026-06-04-pve-fan-control-design.md`.
+
+> **History:** (1) 2026-06-04/05 presence-aware two-curve controller (COOL/QUIET
+> by garage door). (2) 2026-06-07 single linear curve, presence removed.
+> (3) 2026-06-08 **all control moved into HA**, host became a thin actuator,
+> additive **bias** replaced the ease-down hysteresis. (4) 2026-06-15 daemon
+> **anti-flap**: holds the last command through transient HA losses instead of
+> dumping to Dell auto.

 ## What it is

-- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`).
+- **HA (brain), on ha-sofia — NOT in this repo:** the `input_number` sliders, the
+  command template sensor, the display/equilibrium sensors, the Lock/Override
+  controls, and the dashboard cards. Auto-git-tracked on ha-sofia by the
+  version-control add-on.
+- `/usr/local/bin/fan-control` — bash **actuator** (source: `infra/scripts/fan-control.sh`).
 - `fan-control.service` — systemd unit (`Type=simple`, restarts on failure).
 - `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git).

-## HA control (Home Assistant)
+## HA brain — where the curve lives (dashboard-it → "Server" view → Fans)

-Drive the fans from **dashboard-it → "Server" view → Fans**. The view is
-deliberately minimal — it shows the current **fan speed** (% of capacity +
-absolute RPM) and two controls:
+`sensor.r730_fan_command_pct` (template) computes:
+`command% = clamp( curve(temp) + bias, 0..100 )`, where `curve(temp)` is a linear
+ramp from `(Temp min, Duty min)` to `(Temp max, Duty max)` over
+`sensor.r730_cpu_temperature`, plus an **asymmetric output deadband** (rise
+immediately; ease down only once it would drop ≥ Hysteresis). When **Lock** is
+on it outputs the Override % directly.

-- **Override %** (`input_number.r730_fan_manual_pct`) — the fan % to hold. While
-  **unlocked** it continuously mirrors the live commanded fan %, so it always
-  shows the actual *absolute* speed and updates as the fan moves (NOT a stale
-  value or a delta) — `automation.r730_fan_override_track_live_speed_while_unlocked`
-  syncs it to `sensor.r730_fan_control_target` (guarded to ignore
-  unavailable/unknown). While **locked** it stops tracking and becomes your
-  editable setpoint. A readout under the slider shows the live `% · rpm`.
-- **Lock — freeze speed** (`input_boolean.r730_fan_lock`) — turn the algorithm
-  off and hold a fixed speed. Toggling it **ON** snapshots the *current*
-  commanded % into Override and switches the daemon to `manual`
-  (`automation.r730_fan_lock_freeze_current_speed_resume_algo`); toggling it
-  **OFF** switches back to `auto`, resuming the presence curve. Fine-tune the
-  held % with Override while locked. A 🔒 reminder appears on the view while
-  locked.
+**Inputs** (`input_number` sliders): `r730_fan_temp_min`, `r730_fan_temp_max`,
+`r730_fan_duty_min`, `r730_fan_duty_max`, `r730_fan_bias` (flat % added on top —
+guarantees a floor), `r730_fan_hysteresis` (output deadband %).
+Slope = `(Duty max − Duty min)/(Temp max − Temp min)` — steeper/higher-bias/lower-Temp-min
+⇒ lower steady-state CPU temp (it's a P controller; the curve sets the equilibrium).

-Under the hood the daemon still reads `input_select.r730_fan_mode`
-(auto/cool/quiet/manual) + `input_number.r730_fan_manual_pct` each loop; the Lock
-toggle just drives `mode` between `manual` (locked) and `auto` (unlocked).
-`cool`/`quiet` remain valid modes if set directly (via the entity) but are no
-longer surfaced on the simplified dashboard. `CEILING` (83 °C) still overrides
-everything → Dell auto, **even when locked**. A stale non-`auto` mode left while
-*unlocked* still auto-reverts to `auto` after 60 min
-(`automation.r730_fan_mode_auto_revert`, now a dormant safety net). An HA change
-is applied within one daemon loop (~15 s).
+**Manual override:** `input_boolean.r730_fan_lock` (Lock — freeze) + `input_number.r730_fan_manual_pct` (Override %).

-Monitoring sensors on the same view: `sensor.r730_fan_speed` (redfish exporter),
-`sensor.r730_fan_control_target` + `sensor.r730_fan_control_mode` +
-`sensor.r730_fan_power_est` (Pushgateway). Fan **% and RPM are merged into one
-"Fan speed" card** (the two had identical trend shapes) — the % trend comes from
-the stable Pushgateway sensor, while RPM reads `sensor.r730_fan_speed` but **falls
-back to a calibrated estimate (shown with a `~` prefix) whenever the Redfish
-sensor is `unavailable`** (it blips out intermittently), so the readout never goes
-blank. `r730_fan_power_est` is an ESTIMATE of
-total fan power (the iDRAC reports no per-fan power) — modelled from RPM via the
-fan affinity law (∝ RPM³), calibrated to the power sweep (~2 W floor → ~99 W full).
+**Readout sensors:** `sensor.r730_fan_command_display` ("Fan set point", "X % (Y rpm)"),
+`sensor.r730_expected_equilibrium_temp` (predicted equilibrium at current load),
+`sensor.r730_cpu_load`, `sensor.r730_fan_speed_avg` (mean of 6 fans),
+`sensor.r730_fan_power_avg` (cube-law estimate). The Prometheus-backed REST
+sensors live in `rest_resources/idrac_redfish_exporter.yaml` on ha-sofia and have
+value-template fallbacks so they don't blink `unavailable` on a transient empty.

-The HA objects (helpers, the auto-revert automation, the REST sensors in
-`rest_resources/{idrac_redfish_exporter,fan_control}.yaml`, and the dashboard
-cards) live on **ha-sofia** and are auto-git-tracked there by the version-control
-add-on — they are NOT in this repo.
+## Actuator (host) — what the daemon does
+
+Loop every ~15 s, using only the existing IPMI + HA-REST methods:
+1. read `command%` from HA (`/api/states/$COMMAND_ENTITY`), validate (numeric + not stale > `STALE_SECS`);
+2. apply it via `ipmitool raw 0x30 0x30 0x02 0xff 0x` (writes only if the change clears `MIN_STEP`);
+3. read CPU temp + fan rpm for safety + telemetry (Pushgateway).
+
+**Anti-flap:** on a missing/stale command it **holds the last applied %** for up
+to `HA_GRACE_SECS` (300 s) instead of falling back; only sustained loss hands the
+fans to Dell auto.
+
+## Safety (on the host, independent of HA)
+`CPU ≥ CEILING (83 °C)`, repeated IPMI failures, sustained HA loss, or daemon
+stop/crash → hand the fans back to **Dell auto** (`raw 0x30 0x30 0x01 0x01`;
+EXIT trap + systemd `ExecStopPost`). The 83 °C ceiling uses the daemon's own
+IPMI temp read, so it protects even if HA is wrong/unreachable.
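
Steps 1 and 2 of the actuator loop (validate the number, then write it only if the change is large enough) might look roughly like the sketch below. The real `fan-control.sh` is not shown in this diff, so every helper name here is hypothetical. Two assumptions are made: the final byte of the raw command is the duty percentage in hexadecimal (the value is cut off in the text above), and "clears `MIN_STEP`" means a change of at least `MIN_STEP` points. The sketch only prints the command instead of running `ipmitool`.

```bash
# Hypothetical helpers; not taken from the real fan-control.sh.

# is_valid_pct VALUE -> success only for a plain integer 0..100.
# Rejects "unavailable", "unknown", empty strings, and floats such as "49.0"
# (a float-valued HA state would need rounding before this check).
is_valid_pct() {
  case "$1" in
    0|[1-9]|[1-9][0-9]|100) return 0 ;;
    *) return 1 ;;
  esac
}

# pct_to_hex PCT -> two-digit hex byte, e.g. 49 -> 0x31
pct_to_hex() { printf '0x%02x\n' "$1"; }

# needs_write LAST NEW MIN_STEP -> success if |NEW - LAST| >= MIN_STEP.
# LAST may be -1 to mean "nothing applied yet", which always forces a write.
needs_write() (
  d=$(( $2 - $1 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  [ "$d" -ge "$3" ]
)

# ipmi_cmd_for PCT -> prints the command that would be run (dry run only).
# Assumption: the last byte is the duty % in hex.
ipmi_cmd_for() {
  is_valid_pct "$1" || { echo "invalid"; return 1; }
  echo "ipmitool raw 0x30 0x30 0x02 0xff $(pct_to_hex "$1")"
}

ipmi_cmd_for 49   # prints: ipmitool raw 0x30 0x30 0x02 0xff 0x31
```

Whitelisting the exact integer forms keeps non-numeric HA states away from `printf` and from the IPMI call.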
 ## Quick status
-
 ```bash
 ssh root@192.168.1.127 systemctl status fan-control
 ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
-ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "'
 ```
-Log lines look like `temp=60C ha_mode=auto eff=cool fan=50% (was 70%)`
-(`ha_mode` = the HA setpoint; `eff` = the effective curve applied).
-
-## Disable / roll back to stock firmware control
-
-```bash
-ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01'
-```
-The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit
-`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve.
+Log line: `temp=64C cmd=49% rpm=9380 (was -1%)` (`cmd` = the % read from HA and
+applied). `HA command miss — holding 49%` = a transient HA blip being ridden out;
+`HA command lost (...) — Dell auto` = sustained loss.

 ## Tune
+The whole curve (anchors + bias + hysteresis) is tuned **live from the HA
+dashboard** — no host access needed. `/etc/fan-control.env` only holds the
+actuator plumbing + safety knobs (`COMMAND_ENTITY`, `STALE_SECS`, `HA_GRACE_SECS`,
+`MIN_STEP`, `CEILING`); edit it then `systemctl restart fan-control`.

-Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`.
-Common knobs:
-- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min).
-- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83).
-- Curve shape: **linear anchors** near the top of the script — `COOL_T_LO/COOL_P_LO/COOL_T_HI/COOL_P_HI` (default 50°C/30% → 83°C/100%) and `QUIET_*` (68°C/20% → 83°C/100%); fan% interpolates linearly between them (replaced the old discrete step-bands). `MIN_STEP` (default 3%) = smallest fan-% change worth an IPMI write (anti-jitter); `DEADBAND` (3°C) = ease-down hysteresis. Lower `COOL_P_HI` or raise `COOL_T_HI` to run the top end quieter; steepen by raising `COOL_P_LO` / lowering `COOL_T_LO`.
-
-## Deploy / update
-
+## Deploy / update (daemon source)
 ```bash
-cd infra
-scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control
-ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control
-scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service
-# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token
-ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control'
+scp -i ~/.ssh/pve_root scripts/fan-control.sh root@192.168.1.127:/tmp/fan-control.new
+ssh -i ~/.ssh/pve_root root@192.168.1.127 'install -m0755 /tmp/fan-control.new /usr/local/bin/fan-control && systemctl restart fan-control'
 ```
-
-## HA token
-
-`/etc/fan-control.env` holds a long-lived ha-sofia token used to read
-`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security →
-Long-lived access tokens, or reuse the existing ha-sofia token. If the token is
-missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs
-`ha_reachable=0`.
+(`fan-control.service` only on a unit change → also `systemctl daemon-reload`.)

 ## Symptoms & checks
-
 | Symptom | Check |
 |---------|-------|
-| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. |
-| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? |
-| Fans flapping | Increase `DEADBAND`. |
-| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. |
+| Fans surge then crash to ~7100 then surge | flapping to Dell auto — `journalctl -u fan-control \| grep -E 'holding\|Dell auto'`; pre-2026-06-15 this was the stale-command bug (now fixed). |
+| Fans stuck loud | `journalctl` — `CEILING` breach or `HA command lost`? Check CPU temp + HA reachability. |
+| A readout blinks `unavailable` | the REST value-template fallback should hold it; a 1×/8h blip at ~02:00 (backup window) is a benign fetch hiccup. |
+| Slider changes ignored | does `sensor.r730_fan_command_pct` change in HA? token valid? |
 | Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. |

-## Verify presence wiring
-
+## Verify wiring
 ```bash
-# one iteration, real IPMI + HA, no daemon loop:
-ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
+ssh -i ~/.ssh/pve_root root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
 ```
-With the garage closed for >15 min you should see `mode=cool`; within 15 min of
-the door moving, `mode=quiet`.
+The log `cmd=%` should equal `sensor.r730_fan_command_pct`. Move a slider so the
+HA sensor changes, re-run, and the applied `cmd=%` should follow.
diff --git a/scripts/fan-control.env.example b/scripts/fan-control.env.example
index 3c2565c2..24394aec 100644
--- a/scripts/fan-control.env.example
+++ b/scripts/fan-control.env.example
@@ -1,21 +1,27 @@
-# /etc/fan-control.env — config for the fan-control daemon (chmod 600).
+# /etc/fan-control.env — config for the fan-control ACTUATOR (chmod 600).
 # Deployed manually to the PVE host; the real file holds a secret token and is
 # NOT committed. Copy this template, fill HA_TOKEN, scp to /etc/fan-control.env.
+#
+# The control logic lives in Home Assistant (curve + bias + hysteresis +
+# setpoint). This daemon only reads the HA-computed % and applies it over IPMI.

 # Long-lived ha-sofia access token (Home Assistant -> Profile -> Security ->
-# Long-lived access tokens). Empty => presence disabled, daemon runs COOL-only.
+# Long-lived access tokens). Used to read COMMAND_ENTITY. Empty/unreachable =>
+# the actuator hands the fans to Dell auto (it cannot compute a setpoint itself).
 HA_TOKEN=

 # --- optional overrides (defaults shown) ---
 # HA_URL=http://192.168.1.8:8123
-# GARAGE_ENTITY=sensor.garage_door_state_bg
-# GARAGE_OPEN_STATE=Отворена
-# HOLD_SECS=900 # quiet-mode hold after last garage activity (15 min)
+# COMMAND_ENTITY=sensor.r730_fan_command_pct # HA-computed fan %; we only apply it
+# STALE_SECS=1800 # command older than this => stale. Loose on purpose:
+#                 # staleness only happens when CPU temp is flat (so the
+#                 # held value is still valid); a rising temp re-renders it.
+# HA_GRACE_SECS=300 # on a transient HA miss, HOLD the last applied % this
+#                   # long before handing the fans to Dell auto (anti-flap)
 # LOOP_INTERVAL=15
-# PRESENCE_INTERVAL=30
-# DEADBAND=3
-# CEILING=83 # degC: hand back to Dell auto at/above this
+# CEILING=83 # degC: hand back to Dell auto at/above this (hardware safety)
 # RESUME_BELOW=75
 # RESUME_STABLE=120
 # MAX_IPMI_FAILS=3
+# MIN_STEP=3 # smallest fan-% change worth an IPMI write (anti-jitter)
 PUSHGATEWAY_URL=http://10.0.20.100:30091
diff --git a/scripts/fan-control.service b/scripts/fan-control.service
index f337ff4d..3b649751 100644
--- a/scripts/fan-control.service
+++ b/scripts/fan-control.service
@@ -1,5 +1,5 @@
 [Unit]
-Description=Presence-aware IPMI fan controller (Dell R730, garage)
+Description=IPMI fan actuator (Dell R730) — applies the HA-computed setpoint
 Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/fan-control.sh
 After=network-online.target
 Wants=network-online.target
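
The `STALE_SECS` and `HA_GRACE_SECS` comments in `fan-control.env.example` describe a three-way decision per loop: apply a fresh command, hold the last applied % through a short miss, or hand the fans to Dell auto after a sustained loss. A minimal sketch of that decision follows, using the defaults from the template. The function name, its arguments, and the way the command age is measured are assumptions, since the daemon source is not part of this diff.

```bash
# Defaults as shown in fan-control.env.example; overridable from the environment.
STALE_SECS=${STALE_SECS:-1800}
HA_GRACE_SECS=${HA_GRACE_SECS:-300}

# decide CMD_OK CMD_AGE MISS_SECS -> apply | hold | dell-auto   (hypothetical helper)
#   CMD_OK     1 if HA returned a valid number this loop, else 0
#   CMD_AGE    seconds since HA last updated the command value
#   MISS_SECS  seconds since the last good command was seen
decide() {
  if [ "$1" -eq 1 ] && [ "$2" -le "$STALE_SECS" ]; then
    echo apply        # fresh, valid command: use it
  elif [ "$3" -le "$HA_GRACE_SECS" ]; then
    echo hold         # transient miss or stale value: keep the last applied %
  else
    echo dell-auto    # sustained loss: give control back to the firmware
  fi
}

decide 1 10 0      # apply
decide 0 0 120     # hold
decide 0 0 301     # dell-auto
```

The `CEILING` check is deliberately left out of this sketch: per the runbook it runs on the host's own IPMI temperature read and overrides whatever this decision returns.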