fan-control: presence-aware IPMI fan curve for the R730 PVE host
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool).

Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same.

Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks.

Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design: docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent c6f27fa172
commit 90ad6b9125
60 changed files with 640 additions and 9563 deletions
21  scripts/fan-control.env.example
Normal file
@@ -0,0 +1,21 @@
# /etc/fan-control.env - config for the fan-control daemon (chmod 600).
# Deployed manually to the PVE host; the real file holds a secret token and is
# NOT committed. Copy this template, fill HA_TOKEN, scp to /etc/fan-control.env.

# Long-lived ha-sofia access token (Home Assistant -> Profile -> Security ->
# Long-lived access tokens). Empty => presence disabled, daemon runs COOL-only.
HA_TOKEN=

# --- optional overrides (defaults shown) ---
# HA_URL=http://192.168.1.8:8123
# GARAGE_ENTITY=sensor.garage_door_state_bg
# GARAGE_OPEN_STATE=Отворена
# HOLD_SECS=900            # quiet-mode hold after last garage activity (15 min)
# LOOP_INTERVAL=15
# PRESENCE_INTERVAL=30
# DEADBAND=3
# CEILING=83               # degC: hand back to Dell auto at/above this
# RESUME_BELOW=75
# RESUME_STABLE=120
# MAX_IPMI_FAILS=3
PUSHGATEWAY_URL=http://10.0.20.100:30091
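The commented-out overrides in this template work because the daemon assigns each setting with the `: "${VAR:=default}"` idiom, so a value supplied by the environment wins and anything left unset falls back to the built-in default. The following is a minimal sketch of that interaction, not part of the commit: `load_config` is a hypothetical helper, and sourcing the file only approximates what systemd's `EnvironmentFile=` does (systemd parses the file with its own rules rather than running it through a shell).

```shell
#!/usr/bin/env bash
# Sketch only: load_config is a hypothetical helper, not part of fan-control.sh.
# It shows how env-file values and the ':=' defaults combine.
load_config() {  # <env_file> -> one line summarising the effective settings
  local env_file="$1"
  # Subshell: keep the demo variables out of the caller's environment.
  (
    unset HOLD_SECS CEILING HA_TOKEN
    # Assumption: sourcing stands in for systemd's EnvironmentFile= handling.
    [[ -r "$env_file" ]] && . "$env_file"
    : "${HOLD_SECS:=900}"   # default applies only when unset or empty
    : "${CEILING:=83}"
    : "${HA_TOKEN:=}"
    printf 'HOLD_SECS=%s CEILING=%s presence=%s\n' \
      "$HOLD_SECS" "$CEILING" "$([[ -n "$HA_TOKEN" ]] && echo enabled || echo disabled)"
  )
}

tmp="$(mktemp)"
printf 'HA_TOKEN=\n# CEILING=83\nHOLD_SECS=600\n' > "$tmp"
load_config "$tmp"   # prints: HOLD_SECS=600 CEILING=83 presence=disabled
rm -f "$tmp"
```

An empty `HA_TOKEN=` line is therefore harmless: the variable stays empty and presence detection is simply disabled, which matches the COOL-only behaviour the template describes.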
21  scripts/fan-control.service
Normal file
@@ -0,0 +1,21 @@
[Unit]
Description=Presence-aware IPMI fan controller (Dell R730, garage)
Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/fan-control.sh
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
EnvironmentFile=-/etc/fan-control.env
ExecStart=/usr/local/bin/fan-control
# Belt-and-suspenders: whatever happens to the daemon, hand the fans back to
# the iDRAC's own automatic curve so the box is never stuck in manual mode.
ExecStopPost=/usr/bin/ipmitool raw 0x30 0x30 0x01 0x01
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=fan-control

[Install]
WantedBy=multi-user.target
199  scripts/fan-control.sh
Normal file
@@ -0,0 +1,199 @@
#!/usr/bin/env bash
# Presence-aware IPMI fan controller for the Dell R730 PVE host (192.168.1.127).
#
# The server lives in the GARAGE (memory id=1723). Two curves, picked by
# whether someone is physically in the garage:
#   - COOL  : garage empty -> minimise CPU temp, noise is free.
#   - QUIET : someone in the garage -> minimise noise, accept a warmer CPU.
# Presence comes from the ha-sofia garage-door sensor: door open now, OR it
# last changed within HOLD_SECS, => QUIET. Otherwise COOL.
#
# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it):
#   - On any exit the shell can trap (crash/stop/TERM, but not SIGKILL) the
#     EXIT trap hands fans back to Dell automatic control
#     (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost repeats this as a
#     belt-and-suspenders measure and also covers the SIGKILL case.
#   - CPU >= CEILING -> hand back to Dell auto until it recovers (RESUME_BELOW
#     held for RESUME_STABLE s). The firmware's own emergency cooling takes over.
#   - IPMI read failures (>= MAX_IPMI_FAILS) -> hand back to Dell auto.
#
# Deploy: scp to /usr/local/bin/fan-control (strip .sh) + install
#   fan-control.service + /etc/fan-control.env. Same pattern as apply-mbps-caps.
# Tests: test-fan-control.sh (sources this file, exercises the pure functions).
# Design: infra/docs/plans/2026-06-04-pve-fan-control-design.md
# Runbook: infra/docs/runbooks/fan-control.md

set -uo pipefail

# ---- configuration (override via /etc/fan-control.env) ----
: "${IPMITOOL:=ipmitool}"
: "${LOOP_INTERVAL:=15}"          # seconds between temperature decisions
: "${PRESENCE_INTERVAL:=30}"      # seconds between ha-sofia garage-door polls
: "${DEADBAND:=3}"                # degC hysteresis applied to downward fan steps
: "${CEILING:=83}"                # degC: hand back to Dell auto at/above this
: "${RESUME_BELOW:=75}"           # degC: eligible to resume manual below this...
: "${RESUME_STABLE:=120}"         # ...once held that long
: "${HOLD_SECS:=900}"             # quiet-mode hold after last garage activity (15 min)
: "${HA_URL:=http://192.168.1.8:8123}"
: "${HA_TOKEN:=}"                 # long-lived ha-sofia token; empty => presence disabled (COOL only)
: "${GARAGE_ENTITY:=sensor.garage_door_state_bg}"
: "${GARAGE_OPEN_STATE:=Отворена}" # ha state string meaning "open"
: "${PUSHGATEWAY_URL:=}"          # optional Prometheus Pushgateway base URL
: "${MAX_IPMI_FAILS:=3}"
: "${DRY_RUN:=0}"                 # 1 => log IPMI actions instead of executing
: "${RUN_ONCE:=0}"                # 1 => one iteration then exit (testing)

# Curves as "min_temp:pct" entries, descending; first whose min_temp <= temp wins.
COOL_CURVE=(74:100 68:85 61:65 53:45 0:25)
QUIET_CURVE=(82:100 78:65 73:40 0:20)

log() { printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$*"; }

# ---- pure functions (no side effects; unit-tested) ----

# fc_curve <mode> <temp> -> fan percent
fc_curve() {
  local mode="$1" temp="$2"
  local -a curve
  if [[ "$mode" == "quiet" ]]; then curve=("${QUIET_CURVE[@]}"); else curve=("${COOL_CURVE[@]}"); fi
  local entry
  for entry in "${curve[@]}"; do
    if (( temp >= ${entry%%:*} )); then echo "${entry##*:}"; return 0; fi
  done
  echo "${curve[-1]##*:}"
}

# fc_decide <mode> <temp> <current_pct> <deadband> -> fan percent
# Ramps up immediately; only steps down once the curve still wants a lower
# percent even DEADBAND degrees hotter (prevents flapping at band edges).
fc_decide() {
  local mode="$1" temp="$2" current="$3" deadband="$4" target
  target="$(fc_curve "$mode" "$temp")"
  if (( current < 0 || target >= current )); then echo "$target"; return 0; fi
  if (( $(fc_curve "$mode" "$((temp + deadband))") < current )); then echo "$target"; else echo "$current"; fi
}

# fc_presence_mode <state> <last_changed_epoch> <now_epoch> <hold_secs> <open_state> -> quiet|cool
fc_presence_mode() {
  local state="$1" lc="$2" now="$3" hold="$4" open="$5"
  if [[ "$state" == "$open" ]]; then echo "quiet"; return 0; fi
  if (( now - lc < hold )); then echo "quiet"; return 0; fi
  echo "cool"
}

# fc_parse_temp <ipmitool 'Temp' line> -> integer degC
fc_parse_temp() {
  echo "$1" | grep -oE '[0-9]+ degrees C' | grep -oE '^[0-9]+' | head -1
}

# fc_json_str_field <json> <key> -> string value (first match; jq-free)
fc_json_str_field() {
  printf '%s' "$1" | grep -oE "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" | head -1 \
    | sed -E "s/.*:[[:space:]]*\"(.*)\"\$/\1/"
}

# fc_pct_to_hex <pct> -> 0xNN
fc_pct_to_hex() { printf '0x%02x' "$1"; }

# ---- side-effecting wrappers ----

ipmi_manual_on=0

set_manual() { # <pct>
  local pct="$1" hex; hex="$(fc_pct_to_hex "$pct")"
  if (( DRY_RUN == 1 )); then log "DRY set fan ${pct}% (${hex})"; ipmi_manual_on=1; return 0; fi
  if (( ipmi_manual_on == 0 )); then
    "$IPMITOOL" raw 0x30 0x30 0x01 0x00 >/dev/null 2>&1 || return 1
    ipmi_manual_on=1
  fi
  "$IPMITOOL" raw 0x30 0x30 0x02 0xff "$hex" >/dev/null 2>&1
}

# Hand the fans back to Dell automatic control. A failure here is logged and
# returned instead of being swallowed, because this is the safety path.
restore_auto() {
  if (( DRY_RUN == 1 )); then log "DRY restore Dell auto fan control"; ipmi_manual_on=0; return 0; fi
  local rc=0
  "$IPMITOOL" raw 0x30 0x30 0x01 0x01 >/dev/null 2>&1 || rc=$?
  ipmi_manual_on=0
  (( rc == 0 )) || log "WARN restoring Dell auto fan control failed (rc=${rc})"
  return "$rc"
}

read_cpu_temp() {
  fc_parse_temp "$("$IPMITOOL" sdr type temperature 2>/dev/null | grep -E '^Temp ' | head -1)"
}

presence_cache="cool"; presence_ts=0; ha_reachable=0

# get_presence: refresh the globals presence_cache (quiet|cool) and
# ha_reachable (0|1), polling ha-sofia at most once per PRESENCE_INTERVAL.
# Call it directly, never via $(...): a command substitution runs in a
# subshell, so the cache and timestamp updates would be thrown away.
# On any failure the last known mode is kept.
get_presence() {
  local now; now="$(date +%s)"
  if (( now - presence_ts < PRESENCE_INTERVAL )); then return 0; fi
  presence_ts="$now"
  ha_reachable=0
  [[ -z "$HA_TOKEN" ]] && return 0
  local resp state lc_iso lc_epoch
  resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \
    "$HA_URL/api/states/$GARAGE_ENTITY" 2>/dev/null)" || return 0
  state="$(fc_json_str_field "$resp" state)"
  [[ -z "$state" ]] && return 0
  ha_reachable=1
  lc_iso="$(fc_json_str_field "$resp" last_changed)"
  lc_epoch="$(date -d "$lc_iso" +%s 2>/dev/null || echo "$now")"
  presence_cache="$(fc_presence_mode "$state" "$lc_epoch" "$now" "$HOLD_SECS" "$GARAGE_OPEN_STATE")"
}

push_metrics() { # <temp> <pct> <mode> <ha_ok> <fallback>
  [[ -z "$PUSHGATEWAY_URL" ]] && return 0
  local mode_num; case "$3" in quiet) mode_num=1;; cool) mode_num=2;; *) mode_num=0;; esac
  curl -fsS --max-time 5 --data-binary @- \
    "$PUSHGATEWAY_URL/metrics/job/fan_control/instance/pve-r730" >/dev/null 2>&1 <<EOF || true
# TYPE pve_fan_control_cpu_temp_celsius gauge
pve_fan_control_cpu_temp_celsius $1
# TYPE pve_fan_control_fan_percent gauge
pve_fan_control_fan_percent $2
# TYPE pve_fan_control_mode gauge
pve_fan_control_mode $mode_num
# TYPE pve_fan_control_ha_reachable gauge
pve_fan_control_ha_reachable $4
# TYPE pve_fan_control_fallback gauge
pve_fan_control_fallback $5
EOF
}

main() {
  log "fan-control start (loop=${LOOP_INTERVAL}s presence=${PRESENCE_INTERVAL}s hold=${HOLD_SECS}s ceiling=${CEILING}C dry_run=${DRY_RUN})"
  trap 'log "exit - restoring Dell auto fan control"; restore_auto' EXIT
  local current=-1 fails=0 in_fallback=0 cool_since=0 temp mode pct
  while true; do
    temp="$(read_cpu_temp)"
    if [[ -z "$temp" ]]; then
      fails=$((fails + 1)); log "WARN cannot read CPU temp ($fails/$MAX_IPMI_FAILS)"
      if (( fails >= MAX_IPMI_FAILS )); then log "ERR temp unreadable - Dell auto"; restore_auto; current=-1; fi
      if (( RUN_ONCE == 1 )); then break; fi
      sleep "$LOOP_INTERVAL"; continue
    fi
    fails=0

    if (( temp >= CEILING )); then
      if (( in_fallback == 0 )); then
        log "CEILING temp=${temp}≥${CEILING} - Dell auto"; restore_auto; current=-1; in_fallback=1
      fi
      cool_since=0   # a new spike restarts the recovery timer
      push_metrics "$temp" 0 fallback "$ha_reachable" 1
      if (( RUN_ONCE == 1 )); then break; fi
      sleep "$LOOP_INTERVAL"; continue
    fi

    if (( in_fallback == 1 )); then
      # Resume manual control only after temp has stayed below RESUME_BELOW for
      # RESUME_STABLE seconds; any reading at or above it restarts the timer.
      if (( temp < RESUME_BELOW )); then
        (( cool_since == 0 )) && cool_since="$(date +%s)"
      else
        cool_since=0
      fi
      if (( cool_since > 0 && $(date +%s) - cool_since >= RESUME_STABLE )); then
        log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) - resuming manual"; in_fallback=0; cool_since=0
      else
        push_metrics "$temp" 0 fallback "$ha_reachable" 1
        if (( RUN_ONCE == 1 )); then break; fi
        sleep "$LOOP_INTERVAL"; continue
      fi
    fi

    # Direct call (not a command substitution) so the presence cache persists.
    get_presence; mode="$presence_cache"
    pct="$(fc_decide "$mode" "$temp" "$current" "$DEADBAND")"
    if (( pct != current )); then
      if set_manual "$pct"; then log "temp=${temp}C mode=${mode} fan=${pct}% (was ${current}%)"; current="$pct"
      else log "WARN set_manual ${pct}% failed"; fi
    fi
    push_metrics "$temp" "$current" "$mode" "$ha_reachable" 0
    if (( RUN_ONCE == 1 )); then break; fi
    sleep "$LOOP_INTERVAL"
  done
}

if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then main "$@"; fi
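To make the step-down rule in `fc_decide` concrete, the sketch below replays a short temperature sequence around the 68°C band edge of the COOL curve, once with a 3°C deadband and once with none. It is a standalone illustration: `curve_pct`, `decide_pct` and `trace` are renamed copies and helpers written for this example, not functions from the script, and the temperature sequence is invented.

```shell
#!/usr/bin/env bash
# Standalone illustration of the COOL curve and the deadband step-down rule.
# curve_pct/decide_pct mirror fc_curve/fc_decide; trace is a hypothetical helper.
COOL_CURVE=(74:100 68:85 61:65 53:45 0:25)

curve_pct() {  # <temp> -> fan percent of the first band whose floor <= temp
  local temp="$1" entry
  for entry in "${COOL_CURVE[@]}"; do
    if (( temp >= ${entry%%:*} )); then echo "${entry##*:}"; return 0; fi
  done
  echo "${COOL_CURVE[-1]##*:}"
}

decide_pct() {  # <temp> <current_pct> <deadband> -> fan percent
  local temp="$1" current="$2" deadband="$3" target
  target="$(curve_pct "$temp")"
  # Uninitialised (-1) or ramping up: follow the curve immediately.
  if (( current < 0 || target >= current )); then echo "$target"; return 0; fi
  # Stepping down: only if the curve would still ask for less when
  # the CPU were 'deadband' degrees hotter.
  if (( $(curve_pct "$((temp + deadband))") < current )); then echo "$target"; else echo "$current"; fi
}

trace() {  # <deadband> <temp>... -> space-separated fan percents
  local deadband="$1" current=-1 temp out=()
  shift
  for temp in "$@"; do
    current="$(decide_pct "$temp" "$current" "$deadband")"
    out+=("$current")
  done
  echo "${out[*]}"
}

trace 3 66 68 67 68 66 64   # prints: 65 85 85 85 85 65
trace 0 66 68 67 68 66 64   # prints: 65 85 65 85 65 65
```

With the deadband the fan stays at 85% while the temperature wobbles between 66 and 68°C and only drops once 64°C is reached; without it the fan would change speed on almost every reading.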
71  scripts/test-fan-control.sh
Normal file
@@ -0,0 +1,71 @@
#!/usr/bin/env bash
# Unit tests for the pure functions in fan-control.sh.
# Sources the script (main is guarded), exercises curve/decide/presence/parse.
# Run: bash infra/scripts/test-fan-control.sh

set -uo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=/dev/null
source "$DIR/fan-control.sh"

pass=0 fail=0
eq() { # <description> <expected> <actual>
  if [[ "$2" == "$3" ]]; then pass=$((pass + 1)); else
    fail=$((fail + 1)); printf 'FAIL: %s - expected [%s] got [%s]\n' "$1" "$2" "$3"
  fi
}

# --- COOL curve ---
eq "cool 40 -> 25" 25 "$(fc_curve cool 40)"
eq "cool 52 -> 25" 25 "$(fc_curve cool 52)"
eq "cool 53 -> 45" 45 "$(fc_curve cool 53)"
eq "cool 60 -> 45" 45 "$(fc_curve cool 60)"
eq "cool 61 -> 65" 65 "$(fc_curve cool 61)"
eq "cool 67 -> 65" 65 "$(fc_curve cool 67)"
eq "cool 68 -> 85" 85 "$(fc_curve cool 68)"
eq "cool 73 -> 85" 85 "$(fc_curve cool 73)"
eq "cool 74 -> 100" 100 "$(fc_curve cool 74)"
eq "cool 91 -> 100" 100 "$(fc_curve cool 91)"

# --- QUIET curve ---
eq "quiet 50 -> 20" 20 "$(fc_curve quiet 50)"
eq "quiet 72 -> 20" 20 "$(fc_curve quiet 72)"
eq "quiet 73 -> 40" 40 "$(fc_curve quiet 73)"
eq "quiet 77 -> 40" 40 "$(fc_curve quiet 77)"
eq "quiet 78 -> 65" 65 "$(fc_curve quiet 78)"
eq "quiet 81 -> 65" 65 "$(fc_curve quiet 81)"
eq "quiet 82 -> 100" 100 "$(fc_curve quiet 82)"

# --- decide: hysteresis ---
eq "decide uninit -> target" 85 "$(fc_decide cool 68 -1 3)"
eq "decide ramp up now" 85 "$(fc_decide cool 68 25 3)"
eq "decide equal holds" 65 "$(fc_decide cool 65 65 3)"
eq "decide down held in band" 85 "$(fc_decide cool 67 85 3)"  # 67+3=70 still 85% -> hold
eq "decide down past band" 65 "$(fc_decide cool 64 85 3)"     # 64+3=67 -> 65% < 85 -> drop
eq "decide 100 holds at 71" 100 "$(fc_decide cool 71 100 3)"  # 71+3=74 -> 100 -> hold
eq "decide 100 drops at 70" 85 "$(fc_decide cool 70 100 3)"   # 70+3=73 -> 85 < 100 -> drop

# --- presence ---
now=1000000
eq "presence open -> quiet" quiet "$(fc_presence_mode Отворена 0 $now 900 Отворена)"
eq "presence closed recent -> quiet" quiet "$(fc_presence_mode Затворена $((now - 100)) $now 900 Отворена)"
eq "presence closed stale -> cool" cool "$(fc_presence_mode Затворена $((now - 1000)) $now 900 Отворена)"
eq "presence closed edge -> cool" cool "$(fc_presence_mode Затворена $((now - 900)) $now 900 Отворена)"

# --- temp parsing ---
eq "parse temp 74C" 74 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 74 degrees C')"
eq "parse temp 72C" 72 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 72 degrees C')"

# --- json field (jq-free) ---
J='{"entity_id":"sensor.garage_door_state_bg","state":"Отворена","attributes":{"friendly_name":"Garage Door State BG"},"last_changed":"2026-06-04T16:55:20.517745+00:00","last_updated":"2026-06-04T16:55:20.517745+00:00"}'
eq "json state" "Отворена" "$(fc_json_str_field "$J" state)"
eq "json last_changed" "2026-06-04T16:55:20.517745+00:00" "$(fc_json_str_field "$J" last_changed)"

# --- hex conversion ---
eq "hex 20" 0x14 "$(fc_pct_to_hex 20)"
eq "hex 45" 0x2d "$(fc_pct_to_hex 45)"
eq "hex 100" 0x64 "$(fc_pct_to_hex 100)"
eq "hex 5" 0x05 "$(fc_pct_to_hex 5)"

printf '\n%d passed, %d failed\n' "$pass" "$fail"
(( fail == 0 ))
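The two JSON tests above cover the happy path of the jq-free extractor. The sketch below probes where a grep/sed extractor of this shape stops working, which may be useful when deciding whether it is safe for a different entity. `json_str_field` is a renamed standalone copy of `fc_json_str_field`, and the sample payloads are invented for illustration rather than captured from Home Assistant.

```shell
#!/usr/bin/env bash
# Standalone copy of the jq-free extractor, for illustration only.
json_str_field() {  # <json> <key> -> first string value found for that key
  printf '%s' "$1" | grep -oE "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" | head -1 \
    | sed -E "s/.*:[[:space:]]*\"(.*)\"\$/\1/"
}

# Invented sample payloads (not real Home Assistant responses).
flat='{"state":"closed","last_changed":"2026-06-04T16:55:20+00:00"}'
json_str_field "$flat" state; echo          # prints: closed
json_str_field "$flat" last_changed; echo   # prints: 2026-06-04T16:55:20+00:00

# A missing key yields empty output, which the daemon treats as a failed poll.
[[ -z "$(json_str_field "$flat" missing)" ]] && echo "missing -> empty"

# Limitation 1: only string values match; a number is not extracted.
[[ -z "$(json_str_field '{"state":42}' state)" ]] && echo "number -> empty"

# Limitation 2: nesting is ignored, so the first "state" key anywhere wins.
nested='{"attributes":{"state":"inner"},"state":"outer"}'
json_str_field "$nested" state; echo        # prints: inner
```

For the garage-door entity in the test fixture neither limitation applies, since `state` is a string and precedes `attributes`; an entity whose attributes contain their own `state` key would be a reason to switch to a real JSON parser.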