fan-control: thin actuator — HA computes the setpoint, host only applies it

The R730 fan-control logic now lives entirely in Home Assistant: the curve
thresholds, duty %, bias and asymmetric deadband, plus manual/lock, are set on
the dashboard and published as sensor.r730_fan_command_pct. The host daemon is
reduced to a thin actuator: each loop it reads that one number, validates it
(numeric and no older than STALE_SECS) and applies it over IPMI. Removed the
presence-aware two-curve logic and the garage-door coupling.
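
For reference, the validation step described above can be sketched as one pure function. This is an illustration only: `validate_cmd` is a hypothetical name, not a function in the script (the daemon splits this work across fc_num, fc_fresh and ha_command_pct), and the age is assumed to be computed by the caller from the entity's last_updated timestamp.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this commit: validate one HA command reading.
# validate_cmd <state> <age_secs> <stale_secs>
# Prints an integer 0..100 when the value is numeric and fresh; prints nothing
# otherwise, which the caller would treat as "hand the fans back to Dell auto".
validate_cmd() {
  local state="$1" age="$2" max_age="$3"
  # Accept plain non-negative decimals only ("42", "42.7"); reject "unknown", "".
  [[ "$state" =~ ^[0-9]+(\.[0-9]+)?$ ]] || return 0
  # Reject readings older than the staleness limit.
  (( age <= max_age )) || return 0
  local v="${state%%.*}"   # truncate the fraction
  v=$((10#$v))             # force base 10 so "08" is not parsed as octal
  (( v > 100 )) && v=100   # clamp to the valid duty range
  echo "$v"
}

validate_cmd 42.7 10 120      # prints 42
validate_cmd unknown 10 120   # prints nothing (non-numeric)
validate_cmd 55 600 120       # prints nothing (stale)
```

An empty result, rather than a fallback number, keeps "no valid command" distinguishable from a genuine 0% command.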

Safety stays independent on the host: CPU>=CEILING, repeated IPMI failures, or
HA unreachable/stale all hand the fans back to Dell auto. RPM telemetry now
averages all 6 chassis fans. Deployed and verified live on pve (applies the HA
command; fans follow).
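
The handback rules above amount to a small precedence check. A minimal sketch, assuming a hypothetical `decide_action` function (the real loop inlines these checks and additionally holds Dell auto until the CPU stays below RESUME_BELOW for RESUME_STABLE seconds, which is omitted here):

```shell
#!/usr/bin/env bash
# Hypothetical pure decision function (illustration only, not the daemon's code).
# decide_action <temp> <ceiling> <ipmi_fails> <max_fails> <cmd_pct_or_empty>
# Prints "auto" (hand back to Dell firmware control) or "apply <pct>".
decide_action() {
  local temp="$1" ceiling="$2" fails="$3" max_fails="$4" cmd="$5"
  if (( fails >= max_fails )); then echo "auto"; return 0; fi   # IPMI unreliable
  if (( temp >= ceiling )); then echo "auto"; return 0; fi      # hardware ceiling
  if [[ -z "$cmd" ]]; then echo "auto"; return 0; fi            # HA down or stale
  echo "apply $cmd"
}

decide_action 60 83 0 3 45    # prints: apply 45
decide_action 84 83 0 3 45    # prints: auto
decide_action 60 83 0 3 ""    # prints: auto
```

Every unsafe or unknown condition resolves to "auto", so the HA command is only applied when all host-side checks pass.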

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Emil Barzin 2026-06-13 12:59:57 +00:00
parent 3c3e6bfc95
commit 082bdfcc77
2 changed files with 94 additions and 165 deletions


@ -1,20 +1,23 @@
#!/usr/bin/env bash
# Presence-aware IPMI fan controller for the Dell R730 PVE host (192.168.1.127).
# IPMI fan ACTUATOR for the Dell R730 PVE host (192.168.1.127).
#
# The server lives in the GARAGE (memory id=1723). Two curves, picked by
# whether someone is physically in the garage:
# - COOL : garage empty -> minimise CPU temp, noise is free.
# - QUIET : someone in the garage -> minimise noise, accept a warmer CPU.
# Presence comes from the ha-sofia garage-door sensor: door open now, OR it
# last changed within HOLD_SECS, => QUIET. Otherwise COOL.
# THIN ACTUATOR — the control logic lives entirely in Home Assistant. HA owns
# the curve thresholds, the duty %, the bias, and the final setpoint: it
# publishes ONE number, `sensor.r730_fan_command_pct` (= computed fan % incl.
# bias and any manual/lock override). This daemon does NOT compute anything — it
# just reads that command each loop and applies it over IPMI, and reads the raw
# sensors (temp/rpm) that feed HA/Prometheus.
# (Until 2026-06-07 the curve+hysteresis were computed HERE; moved to HA so
# all tuning + the setpoint determination happen on the dashboard.)
#
# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it):
# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it).
# These are INDEPENDENT of HA — the actuator protects the hardware on its own:
# - On ANY exit (crash/stop/TERM) the EXIT trap hands fans back to Dell
# automatic control (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost
# repeats this belt-and-suspenders.
# automatic control (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost repeats.
# - CPU >= CEILING -> hand back to Dell auto until it recovers (RESUME_BELOW
# held for RESUME_STABLE s). The firmware's own emergency cooling takes over.
# - IPMI read failures (>= MAX_IPMI_FAILS) -> hand back to Dell auto.
# - HA unreachable / command missing / STALE -> hand back to Dell auto.
#
# Deploy: scp to /usr/local/bin/fan-control (strip .sh) + install
# fan-control.service + /etc/fan-control.env. Same pattern as apply-mbps-caps.
@ -26,71 +29,34 @@ set -uo pipefail
# ---- configuration (override via /etc/fan-control.env) ----
: "${IPMITOOL:=ipmitool}"
: "${LOOP_INTERVAL:=15}" # seconds between temperature decisions
: "${PRESENCE_INTERVAL:=30}" # seconds between ha-sofia garage-door polls
: "${DEADBAND:=3}" # degC hysteresis applied to downward fan steps
: "${LOOP_INTERVAL:=15}" # seconds between apply cycles
: "${CEILING:=83}" # degC: hand back to Dell auto at/above this
: "${RESUME_BELOW:=75}" # degC: eligible to resume manual below this...
: "${RESUME_STABLE:=120}" # ...once held that long
: "${HOLD_SECS:=900}" # quiet-mode hold after last garage activity (15 min)
: "${HA_URL:=http://192.168.1.8:8123}"
: "${HA_TOKEN:=}" # long-lived ha-sofia token; empty => presence disabled (COOL only)
: "${GARAGE_ENTITY:=sensor.garage_door_state_bg}"
: "${GARAGE_OPEN_STATE:=Отворена}" # ha state string meaning "open"
# HA control: a mode select + manual % the user drives from Home Assistant.
# auto => garage-presence curve (default); cool/quiet => force that curve;
# manual => hold MANUAL_ENTITY %. Empty HA_TOKEN or unreachable HA => auto.
: "${MODE_ENTITY:=input_select.r730_fan_mode}"
: "${MANUAL_ENTITY:=input_number.r730_fan_manual_pct}"
: "${HA_TOKEN:=}" # long-lived ha-sofia token; empty => Dell auto (no control)
: "${COMMAND_ENTITY:=sensor.r730_fan_command_pct}" # HA-computed fan %; we only apply it
: "${STALE_SECS:=120}" # command older than this => stale => Dell auto
: "${PUSHGATEWAY_URL:=}" # optional Prometheus Pushgateway base URL
: "${MAX_IPMI_FAILS:=3}"
: "${MIN_STEP:=3}" # min fan-% change worth an IPMI write (anti-jitter)
: "${DRY_RUN:=0}" # 1 => log IPMI actions instead of executing
: "${RUN_ONCE:=0}" # 1 => one iteration then exit (testing)
# Continuous LINEAR fan curve (2026-06-05): fan% ramps proportionally with CPU
# temp between (T_LO,P_LO) and (T_HI,P_HI), clamped flat outside. Replaces the old
# discrete step-bands (which flapped at band edges — e.g. 45<->65%). Both modes
# reach 100% right at the 83°C ceiling. Anchors are env-tunable.
# COOL (garage empty): 30% @50°C .. 100% @83°C (~2.1%/°C; equilibrium ~60°C/~51%)
# QUIET (someone there): 20% @68°C .. 100% @83°C (near-silent until ~70°C)
# Web-researched: a linear curve + 2-3°C hysteresis is the homelab standard; PID is
# overkill for this slow thermal loop. See docs/plans/2026-06-04-pve-fan-control-design.md.
: "${COOL_T_LO:=50}"; : "${COOL_P_LO:=30}"; : "${COOL_T_HI:=83}"; : "${COOL_P_HI:=100}"
: "${QUIET_T_LO:=68}"; : "${QUIET_P_LO:=20}"; : "${QUIET_T_HI:=83}"; : "${QUIET_P_HI:=100}"
: "${MIN_STEP:=3}" # min fan-% change worth an IPMI write (anti-jitter on the smooth curve)
log() { printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$*"; }
# ---- pure functions (no side effects; unit-tested) ----
# fc_curve <mode> <temp> -> fan percent (continuous linear interpolation between
# the per-mode (T_LO,P_LO)..(T_HI,P_HI) anchors; clamped flat outside the range).
fc_curve() {
local mode="$1" temp="$2" tlo plo thi phi
if [[ "$mode" == "quiet" ]]; then tlo=$QUIET_T_LO; plo=$QUIET_P_LO; thi=$QUIET_T_HI; phi=$QUIET_P_HI
else tlo=$COOL_T_LO; plo=$COOL_P_LO; thi=$COOL_T_HI; phi=$COOL_P_HI; fi
if (( temp <= tlo )); then echo "$plo"; return 0; fi
if (( temp >= thi )); then echo "$phi"; return 0; fi
echo $(( plo + ( (temp - tlo) * (phi - plo) + (thi - tlo) / 2 ) / (thi - tlo) )) # rounded
# fc_num <value> <fallback> <min> <max> -> validated integer (floats truncated;
# non-numeric => fallback; out-of-range clamped). Sanitises the HA command read.
fc_num() {
local v="${1%%.*}" fb="$2" lo="$3" hi="$4"
[[ "$v" =~ ^-?[0-9]+$ ]] || { echo "$fb"; return 0; }
(( v < lo )) && v="$lo"; (( v > hi )) && v="$hi"; echo "$v"
}
# fc_decide <mode> <temp> <current_pct> <deadband> -> fan percent
# Ramps up immediately; only steps down once the curve still wants a lower
# percent even DEADBAND degrees hotter (prevents flapping at band edges).
fc_decide() {
local mode="$1" temp="$2" current="$3" deadband="$4" target
target="$(fc_curve "$mode" "$temp")"
if (( current < 0 || target >= current )); then echo "$target"; return 0; fi
if (( $(fc_curve "$mode" "$((temp + deadband))") < current )); then echo "$target"; else echo "$current"; fi
}
# fc_presence_mode <state> <last_changed_epoch> <now_epoch> <hold_secs> <open_state> -> quiet|cool
fc_presence_mode() {
local state="$1" lc="$2" now="$3" hold="$4" open="$5"
if [[ "$state" == "$open" ]]; then echo "quiet"; return 0; fi
if (( now - lc < hold )); then echo "quiet"; return 0; fi
echo "cool"
}
# fc_fresh <age_secs> <max_secs> -> exit 0 if fresh (age <= max), else 1.
fc_fresh() { (( $1 <= $2 )); }
# fc_parse_temp <ipmitool 'Temp' line> -> integer degC
fc_parse_temp() {
@ -115,18 +81,6 @@ fc_clamp() { local p="$1"; (( p < 0 )) && p=0; (( p > 100 )) && p=100; echo "$p"
# (~2W @4800rpm · ~17W @9360 · ~42W @12720 · ~99W @16920). Integer: 0.0205·(rpm/1e3)³.
fc_fan_watts() { echo $(( $1 * $1 * $1 * 205 / 10000000000000 )); }
# fc_resolve <ha_mode> <temp> <manual_pct> <presence> <current> <deadband> -> pct
# HA mode resolution (the hard ceiling is handled by the caller):
# manual -> clamp(manual_pct), no hysteresis
# cool|quiet -> that curve (with hysteresis)
# auto (else) -> presence-driven curve (garage door)
fc_resolve() {
local ha_mode="$1" temp="$2" manual_pct="$3" presence="$4" current="$5" deadband="$6"
if [[ "$ha_mode" == "manual" ]]; then fc_clamp "$manual_pct"; return 0; fi
local eff; [[ "$ha_mode" == "auto" ]] && eff="$presence" || eff="$ha_mode"
fc_decide "$eff" "$temp" "$current" "$deadband"
}
# ---- side-effecting wrappers ----
ipmi_manual_on=0
@ -151,39 +105,34 @@ read_cpu_temp() {
fc_parse_temp "$("$IPMITOOL" sdr type temperature 2>/dev/null | grep -E '^Temp ' | head -1)"
}
read_fan_rpm() { # Fan1 RPM — representative (all 6 fans are set together)
"$IPMITOOL" sdr type fan 2>/dev/null | awk -F'|' '/^Fan1/{gsub(/[^0-9]/,"",$5); print $5+0; exit}'
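read_fan_rpm() { # mean RPM across all 6 chassis fans (Fan1..Fan6). All fans run
# one global duty, so the mean is representative, and a single stalled fan
# shifts it by at most 1/6 instead of zeroing it (as reading only Fan1 could).
# Fans with no numeric reading are skipped. Telemetry only, not a control input.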
"$IPMITOOL" sdr type fan 2>/dev/null | awk -F'|' '
/^Fan[0-9]/ { gsub(/[^0-9]/, "", $5); if ($5 != "") { sum += $5; n++ } }
END { if (n > 0) printf "%d\n", (sum / n) + 0.5 }'
}
presence_cache="cool"; presence_ts=0
get_presence() {
local now; now="$(date +%s)"
if (( now - presence_ts < PRESENCE_INTERVAL )); then echo "$presence_cache"; return 0; fi
presence_ts="$now"
[[ -z "$HA_TOKEN" ]] && { echo "$presence_cache"; return 0; }
local resp state lc_iso lc_epoch
resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \
"$HA_URL/api/states/$GARAGE_ENTITY" 2>/dev/null)" || { echo "$presence_cache"; return 0; }
state="$(fc_json_str_field "$resp" state)"
[[ -z "$state" ]] && { echo "$presence_cache"; return 0; }
lc_iso="$(fc_json_str_field "$resp" last_changed)"
lc_epoch="$(date -d "$lc_iso" +%s 2>/dev/null || echo "$now")"
presence_cache="$(fc_presence_mode "$state" "$lc_epoch" "$now" "$HOLD_SECS" "$GARAGE_OPEN_STATE")"
echo "$presence_cache"
}
# ha_entity_state <entity> -> state string (empty if HA disabled/unreachable)
ha_entity_state() {
# ha_command_pct -> the HA-computed fan % (0..100 int), or EMPTY when HA is
# disabled/unreachable, the value is non-numeric, or the command is STALE
# (last_updated older than STALE_SECS). Empty => caller hands fans to Dell auto.
ha_command_pct() {
[[ -z "$HA_TOKEN" ]] && return 0
local resp
local resp state lu lu_epoch now
resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \
"$HA_URL/api/states/$1" 2>/dev/null)" || return 0
fc_json_str_field "$resp" state
"$HA_URL/api/states/$COMMAND_ENTITY" 2>/dev/null)" || return 0
state="$(fc_json_str_field "$resp" state)"
[[ "$state" =~ ^[0-9]+(\.[0-9]+)?$ ]] || return 0
lu="$(fc_json_str_field "$resp" last_updated)"
lu_epoch="$(date -d "$lu" +%s 2>/dev/null || echo 0)"; now="$(date +%s)"
(( lu_epoch == 0 )) && return 0
fc_fresh "$((now - lu_epoch))" "$STALE_SECS" || return 0
fc_num "$state" 0 0 100
}
push_metrics() { # <temp> <pct> <mode> <ha_ok> <fallback> [fan_rpm] [fan_watts_est]
[[ -z "$PUSHGATEWAY_URL" ]] && return 0
local mode_num; case "$3" in quiet) mode_num=1;; cool) mode_num=2;; manual) mode_num=3;; *) mode_num=0;; esac
local mode_num; case "$3" in applied) mode_num=2;; *) mode_num=0;; esac
curl -fsS --max-time 5 --data-binary @- \
"$PUSHGATEWAY_URL/metrics/job/fan_control/instance/pve-r730" >/dev/null 2>&1 <<EOF || true
# TYPE pve_fan_control_cpu_temp_celsius gauge
@ -204,10 +153,12 @@ EOF
}
main() {
log "fan-control start (loop=${LOOP_INTERVAL}s presence=${PRESENCE_INTERVAL}s hold=${HOLD_SECS}s ceiling=${CEILING}C dry_run=${DRY_RUN})"
log "fan-control start (actuator; loop=${LOOP_INTERVAL}s ceiling=${CEILING}C cmd=${COMMAND_ENTITY} stale=${STALE_SECS}s dry_run=${DRY_RUN})"
trap 'log "exit — restoring Dell auto fan control"; restore_auto' EXIT
local current=-1 fails=0 in_fallback=0 cool_since=0
local current=-1 fails=0 in_fallback=0 cool_since=0 ha_down=0
while true; do
local rpm fan_w; rpm="$(read_fan_rpm)"; rpm="${rpm:-0}"; fan_w="$(fc_fan_watts "$rpm")"
local temp; temp="$(read_cpu_temp)"
if [[ -z "$temp" ]]; then
fails=$((fails + 1)); log "WARN cannot read CPU temp ($fails/$MAX_IPMI_FAILS)"
@ -216,45 +167,41 @@ main() {
fi
fails=0
# Hardware ceiling — independent of HA; firmware emergency cooling takes over.
if (( temp >= CEILING )); then
(( in_fallback == 0 )) && { log "CEILING temp=${temp}C >= ${CEILING}C — Dell auto"; restore_auto; current=-1; in_fallback=1; }
push_metrics "$temp" 0 fallback 1 1
push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w"
(( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; }
fi
if (( in_fallback == 1 )); then
if (( temp < RESUME_BELOW )); then
(( cool_since == 0 )) && cool_since="$(date +%s)"
if (( $(date +%s) - cool_since >= RESUME_STABLE )); then
log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) — resuming manual"; in_fallback=0; cool_since=0
log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) — resuming HA control"; in_fallback=0; cool_since=0
else
push_metrics "$temp" 0 fallback 1 1; (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; }
push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w"; (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; }
fi
else
cool_since=0; push_metrics "$temp" 0 fallback 1 1
cool_since=0; push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w"
(( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; }
fi
fi
# HA-desired mode (auto/cool/quiet/manual); unreachable/unset => auto.
local ha_mode ha_ok=1; ha_mode="$(ha_entity_state "$MODE_ENTITY")"; [[ -z "$HA_TOKEN" ]] && ha_ok=0
[[ -z "$ha_mode" ]] && ha_mode="auto"
case "$ha_mode" in auto|cool|quiet|manual) ;; *) ha_mode="auto" ;; esac
local manual_pct=0
if [[ "$ha_mode" == "manual" ]]; then
manual_pct="$(ha_entity_state "$MANUAL_ENTITY")"; manual_pct="${manual_pct%%.*}"
[[ "$manual_pct" =~ ^[0-9]+$ ]] || manual_pct=0
# The setpoint is whatever HA computed. No local math — just apply it.
local cmd; cmd="$(ha_command_pct)"
if [[ -z "$cmd" ]]; then
(( ha_down == 0 )) && { log "HA command unavailable/stale — Dell auto until it returns"; restore_auto; current=-1; ha_down=1; }
push_metrics "$temp" 0 fallback 0 1 "$rpm" "$fan_w"
(( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; }
fi
local presence="cool"; [[ "$ha_mode" == "auto" ]] && presence="$(get_presence)"
local eff; if [[ "$ha_mode" == "manual" ]]; then eff="manual"; elif [[ "$ha_mode" == "auto" ]]; then eff="$presence"; else eff="$ha_mode"; fi
local pct; pct="$(fc_resolve "$ha_mode" "$temp" "$manual_pct" "$presence" "$current" "$DEADBAND")"
# Only write when first-run or the change clears MIN_STEP (kills 1-2% jitter
# on the continuous curve; fc_decide already gives asymmetric hysteresis).
if (( current < 0 || pct - current >= MIN_STEP || current - pct >= MIN_STEP )); then
if set_manual "$pct"; then log "temp=${temp}C ha_mode=${ha_mode} eff=${eff} fan=${pct}% (was ${current}%)"; current="$pct"
else log "WARN set_manual ${pct}% failed"; fi
(( ha_down == 1 )) && { log "HA command back (${cmd}%) — resuming"; ha_down=0; }
# Only write when first-run or the change clears MIN_STEP (kills 1-2% jitter).
if (( current < 0 || cmd - current >= MIN_STEP || current - cmd >= MIN_STEP )); then
if set_manual "$cmd"; then log "temp=${temp}C cmd=${cmd}% rpm=${rpm} (was ${current}%)"; current="$cmd"
else log "WARN set_manual ${cmd}% failed"; fi
fi
local rpm fan_w; rpm="$(read_fan_rpm)"; rpm="${rpm:-0}"; fan_w="$(fc_fan_watts "$rpm")"
push_metrics "$temp" "$current" "$eff" "$ha_ok" 0 "$rpm" "$fan_w"
push_metrics "$temp" "$current" applied 1 0 "$rpm" "$fan_w"
(( RUN_ONCE == 1 )) && break || sleep "$LOOP_INTERVAL"
done
}
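
The MIN_STEP write gate in the loop above can be read as a small predicate. A minimal sketch, assuming a hypothetical `should_write` helper (the script inlines this test in main):

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): decide whether a new command is worth
# an IPMI write. should_write <current_pct> <cmd_pct> <min_step>
# Exit status 0 means write, 1 means skip. current < 0 means "nothing applied yet".
should_write() {
  local current="$1" cmd="$2" min_step="$3" delta
  (( current < 0 )) && return 0            # first run, or just returned from Dell auto
  delta=$(( cmd - current ))
  (( delta < 0 )) && delta=$(( -delta ))   # absolute difference
  (( delta >= min_step ))                  # exit status of this test is the result
}

should_write -1 40 3 && echo "write"   # prints write (first run)
should_write 40 41 3 || echo "skip"    # prints skip  (1% change is below MIN_STEP)
should_write 40 44 3 && echo "write"   # prints write (4% change clears MIN_STEP)
```

Because the comparison is against the last applied value rather than the previous command, a slow drift is still applied once it accumulates to MIN_STEP.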