infra/docs/plans/2026-06-04-pve-fan-control-design.md
Viktor Barzin 90ad6b9125 fan-control: presence-aware IPMI fan curve for the R730 PVE host
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design:  docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00


# PVE R730 presence-aware fan control — design
**Date:** 2026-06-04
**Status:** implemented
**Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh`
**Runbook:** `infra/docs/runbooks/fan-control.md`
## Problem
The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72-77°C under normal
cluster load. That is safe (firmware warning at 88°C, critical 93°C) but the
iDRAC's stock fan curve optimises for quiet, not cool — it pins the fans at the
~7080 RPM floor even at 72°C / load 30 and only ramps near ~80°C. We want the
CPU to run cooler when it costs nothing (the box is in the garage, usually
empty) while staying quiet when someone is physically in the garage.
## Measured fan/temp relationship (manual IPMI sweep, 2026-06-04)
At a comparable CPU load (~45-53 % busy):
| Fan setting | Fan RPM | CPU temp |
|-------------|---------|----------|
| Auto (floor) | 7,080 | 71-72°C |
| 50 % | 9,360 | 65-66°C |
| 70 % | 12,800 | 60-61°C |
| 100 % | 17,000 | 55-56°C |
The first step (auto floor to 50 %) gives the best °C-per-RPM return; beyond
~70 % the extra fan speed buys mostly noise. About 16°C of swing is available
in total.
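
The diminishing return can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the midpoint of each measured temperature range (for example 71.5°C for 71-72°C), and `marginal_cooling` is a hypothetical helper name, not part of the deployed scripts.

```bash
#!/usr/bin/env bash
# Marginal cooling between adjacent sweep points, in degrees C per 1000 RPM.
# Assumption: temperatures are the midpoints of the measured ranges.
marginal_cooling() {
  awk 'BEGIN {
    n = split("7080 9360 12800 17000", rpm, " ")
    split("71.5 65.5 60.5 55.5", temp, " ")
    for (i = 2; i <= n; i++) {
      drop = temp[i-1] - temp[i]   # degrees gained by this step
      cost = rpm[i] - rpm[i-1]     # extra RPM spent on this step
      printf "%d->%d RPM: %.2f C per 1000 RPM\n", rpm[i-1], rpm[i], drop / cost * 1000
    }
  }'
}

marginal_cooling
```

Under that midpoint assumption the first step yields about 2.63°C per 1000 RPM, against 1.45 and 1.19 for the next two, which matches the observation that the first step is the cheapest cooling.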
## Decisions
1. **Custom bash daemon + systemd service**, deployed to the PVE host the same
way as `apply-mbps-caps` / `daily-backup` (source in `infra/scripts/`, scp to
`/usr/local/bin`). It cannot be Terraform/k8s — it runs on the bare host where
IPMI lives. (OSS `tigerblue77/Dell-iDRAC-fan-controller` was considered;
rejected — it is a Docker container, off-pattern here, and unaware of our
constraints.)
2. **CPU temperature is the only control input.** The Tesla T4 has its own
always-on fan (owner-confirmed), so it self-cools and does not depend on
chassis airflow — no GPU coupling needed.
3. **Presence = the garage door**, because the server is *in the garage*
(memory id=1723); noise only matters to people physically there. Signal:
ha-sofia `sensor.garage_door_state_bg`. Open now, or last changed within
`HOLD_SECS` (15 min) ⇒ someone's around ⇒ QUIET; otherwise COOL.
`house_mode` was rejected — it tracks *apartment* occupancy, irrelevant to
garage noise.
4. **Two curves**, picked by presence:
| CPU °C | COOL % (empty) | CPU °C | QUIET % (occupied) |
|--------|----------------|--------|--------------------|
| ≤52 | 25 | ≤72 | 20 (≈silent floor) |
| 53-60 | 45 | 73-77 | 40 |
| 61-67 | 65 | 78-81 | 65 |
| 68-73 | 85 | ≥82 | 100 |
| ≥74 | 100 | | |
A 3°C downward hysteresis prevents flapping at band edges: the daemon ramps up
immediately, but steps down only once the curve would still pick the lower
setting at a temperature 3°C hotter than the current reading.
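
As a rough illustration, the two curves and the downward hysteresis could be written as pure bash functions. This is a sketch, not the deployed `fan-control.sh`: the function names (`cool_curve`, `quiet_curve`, `next_percent`) and the `HYST` variable are hypothetical, and the step-down rule is one reading of the hysteresis described above (evaluate the curve at the current temperature plus 3°C).

```bash
#!/usr/bin/env bash
HYST=3  # degrees C of downward hysteresis (name is an assumption)

# COOL curve (garage empty): CPU temp in whole degrees C -> fan percent.
cool_curve() {
  local t=$1
  if   (( t <= 52 )); then echo 25
  elif (( t <= 60 )); then echo 45
  elif (( t <= 67 )); then echo 65
  elif (( t <= 73 )); then echo 85
  else                     echo 100
  fi
}

# QUIET curve (garage occupied): stays at the near-silent floor up to 72 C.
quiet_curve() {
  local t=$1
  if   (( t <= 72 )); then echo 20
  elif (( t <= 77 )); then echo 40
  elif (( t <= 81 )); then echo 65
  else                     echo 100
  fi
}

# next_percent CURVE_FN TEMP CURRENT_PCT
# Ramp up immediately; step down only if the curve evaluated HYST degrees
# hotter than the reading still asks for less than the current setting.
next_percent() {
  local curve=$1 temp=$2 current=$3 want guarded
  want=$("$curve" "$temp")
  if (( want >= current )); then
    echo "$want"
    return
  fi
  guarded=$("$curve" $(( temp + HYST )))
  if (( guarded < current )); then
    echo "$guarded"
  else
    echo "$current"
  fi
}
```

For example, with the COOL curve at 65 %, a reading of 58°C holds at 65 % (the curve at 61°C still wants 65 %), while 57°C steps down to 45 %.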
## Safety
Manual fan mode bypasses the iDRAC's own protection, so it is backstopped:
- **Daemon exit/crash/stop** → bash `EXIT` trap + systemd `ExecStopPost` both
run `ipmitool raw 0x30 0x30 0x01 0x01` (restore Dell auto). `Restart=on-failure`.
- **CPU ≥ `CEILING` (83°C)** → hand back to Dell auto until temp holds below
`RESUME_BELOW` (75°C) for `RESUME_STABLE` (120 s), then resume manual.
- **IPMI read failures ≥ `MAX_IPMI_FAILS`** → restore Dell auto.
- **ha-sofia unreachable** → keep the last good presence decision; default COOL
at cold start (thermally safe).
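
The ceiling/resume rule amounts to a small two-state machine. The sketch below shows one way to express it as a pure function; the name `safety_step`, the state labels `manual`/`fallback`, and the idea of passing the poll interval in seconds are assumptions for illustration. The IPMI-failure backstop is left out because the value of `MAX_IPMI_FAILS` is not stated here.

```bash
#!/usr/bin/env bash
CEILING=83         # hand back to Dell auto at or above this temp (C)
RESUME_BELOW=75    # temp must stay below this to resume manual control
RESUME_STABLE=120  # seconds the temp must hold below RESUME_BELOW

# safety_step STATE TEMP STABLE_SECS INTERVAL_SECS
# Prints "NEW_STATE NEW_STABLE_SECS".
#   manual   = daemon drives the fans
#   fallback = fans handed back to Dell auto; the caller would run
#              `ipmitool raw 0x30 0x30 0x01 0x01` when entering this state
safety_step() {
  local state=$1 temp=$2 stable=$3 interval=$4
  if (( temp >= CEILING )); then
    echo "fallback 0"            # too hot: fall back and reset the timer
    return
  fi
  if [[ $state == manual ]]; then
    echo "manual 0"              # below ceiling, nothing to do
    return
  fi
  if (( temp < RESUME_BELOW )); then
    stable=$(( stable + interval ))
    if (( stable >= RESUME_STABLE )); then
      echo "manual 0"            # cool for long enough: resume manual
    else
      echo "fallback $stable"    # cool, but not for long enough yet
    fi
  else
    echo "fallback 0"            # still warm: restart the stability timer
  fi
}
```

Keeping the decision free of side effects means it can be unit-tested without hardware; the caller applies the IPMI command only when the state actually changes.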
## Observability
Pushes to the existing Pushgateway (`http://10.0.20.100:30091`, job
`fan_control`): `pve_fan_control_cpu_temp_celsius`, `_fan_percent`, `_mode`
(1 quiet / 2 cool / 0 fallback), `_ha_reachable`, `_fallback`. The existing
CPU-temp alert is unaffected.
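
A minimal sketch of such a push, assuming the shorthand names expand with the `pve_fan_control` prefix and that all five metrics are gauges. `render_metrics`, `push_metrics` and `PUSHGATEWAY_URL` are hypothetical names; the request itself uses the Pushgateway's standard `/metrics/job/<job>` endpoint with a text-format body.

```bash
#!/usr/bin/env bash
PUSHGATEWAY_URL="http://10.0.20.100:30091"

# render_metrics TEMP FAN_PCT MODE HA_REACHABLE FALLBACK
# Emits Prometheus text exposition format on stdout.
render_metrics() {
  cat <<EOF
# TYPE pve_fan_control_cpu_temp_celsius gauge
pve_fan_control_cpu_temp_celsius $1
# TYPE pve_fan_control_fan_percent gauge
pve_fan_control_fan_percent $2
# TYPE pve_fan_control_mode gauge
pve_fan_control_mode $3
# TYPE pve_fan_control_ha_reachable gauge
pve_fan_control_ha_reachable $4
# TYPE pve_fan_control_fallback gauge
pve_fan_control_fallback $5
EOF
}

# push_metrics takes the same arguments and POSTs them under job=fan_control.
# A failed push should not stop fan control, so the caller can ignore errors.
push_metrics() {
  render_metrics "$@" |
    curl -fsS --max-time 5 --data-binary @- \
      "$PUSHGATEWAY_URL/metrics/job/fan_control"
}
```

A call such as `push_metrics 58 65 2 1 0 || true` would report 58°C, 65 % fans, cool mode, ha-sofia reachable and no fallback.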
## Testing
`test-fan-control.sh` sources the script (main is guarded by a `BASH_SOURCE`
check) and unit-tests the pure functions: both curves, hysteresis up/down,
presence open/recent/stale, temperature parsing, jq-free JSON field extraction,
and percent→hex. 36 assertions, no hardware needed. The daemon also supports
`DRY_RUN=1` and `RUN_ONCE=1` for integration checks.
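
To illustrate the pattern, here is a sketch of a few pure helpers together with a `BASH_SOURCE` guard. The function names, the `open` state string and the epoch-second arguments are assumptions for the example, not the actual contents of `fan-control.sh`.

```bash
#!/usr/bin/env bash
HOLD_SECS=900  # 15 minutes

# pct_to_hex 45 -> 0x2d (the byte form a raw IPMI fan command expects)
pct_to_hex() {
  printf '0x%02x' "$1"
}

# json_field KEY JSON: value of a string field, without jq.
# Only handles simple "key":"value" pairs with no escaped quotes.
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" <<<"$2"
}

# is_present STATE LAST_CHANGED_EPOCH NOW_EPOCH
# Succeeds if the door is open now, or changed less than HOLD_SECS ago.
is_present() {
  local state=$1 changed=$2 now=$3
  [[ $state == open ]] && return 0
  (( now - changed < HOLD_SECS ))
}

main() {
  # Placeholder entry point; the real daemon loop would live here.
  pct_to_hex "${1:-45}"
  echo
}

# Run main only when executed directly, so a test file can source this
# script and call the helpers without starting the daemon.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  main "$@"
fi
```

A test file can then `source` the script and assert on each helper, for example that `pct_to_hex 100` prints `0x64` or that a door closed 1000 seconds ago no longer counts as presence.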
## Rollback
`systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01` on the
host returns the box to stock firmware fan control. See the runbook.