fan-control: presence-aware IPMI fan curve for the R730 PVE host

The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design:  docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-04 21:38:34 +00:00
parent c6f27fa172
commit 90ad6b9125
60 changed files with 640 additions and 9563 deletions


@ -58,10 +58,9 @@ graph LR
### Project Migration Status
**Migrated to GHA (9 projects)**:
**Migrated to GHA (8 projects)**:
- Website
- k8s-portal
- f1-stream
- claude-memory-mcp
- apple-health-data
- audiblez-web
@ -69,6 +68,14 @@ graph LR
- insta2spotify
- book-search (audiobook-search)
**Woodpecker-native owned-app builds** (build + push to the Forgejo private
registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
stays enrolled as a redundant net):
- `tuya_bridge`, `job-hunter`, `f1-stream`
- `f1-stream` was extracted from this monorepo into its own repo
(`viktor/f1-stream`) on 2026-06-04; its Woodpecker repo id is 166 (the old
GHA-era id 10 is defunct).
**Woodpecker-only (infra + large apps)**:
- `travel_blog`: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
@ -92,7 +99,6 @@ Woodpecker API uses numeric IDs (not owner/name):
| travel_blog | 5 |
| webhook-handler | 6 |
| audiblez-web | 9 |
| f1-stream | 10 |
| plotting-book | 43 |
| claude-memory-mcp | 78 |
| infra-onboarding | 79 |


@ -0,0 +1,78 @@
# f1-stream extraction + productionization — design (2026-06-04)
## Problem
`f1-stream` (FastAPI backend serving a SvelteKit SPA; ~15 pluggable stream
extractors + a Playwright/chrome-service playback verifier) lived **inside**
the infra monorepo at `infra/stacks/f1-stream/files/`. It had:
- no standalone repo — source coupled to the Terraform stack;
- **no real CI** — only a manual `redeploy.sh` doing a local `docker buildx`
push to DockerHub (`viktorbarzin/f1-stream`) + `kubectl rollout restart`;
- no README, no tests, a loose unpinned `requirements.txt`, no semver tags;
- a stale CI claim in docs ("migrated to GHA, Woodpecker repo id 10") that did
not match reality (no GHA workflow ever existed for it).
## Goal
Extract the app into its own Forgejo repo `viktor/f1-stream` and productionize
it, mirroring the established owned-app pattern (`tuya_bridge`, `job-hunter`,
`tripit`, `travel-agent`).
## Decisions (with rationale)
- **Registry → Forgejo private** (`forgejo.viktorbarzin.me/viktor/f1-stream`),
matching the fleet standard. Needs the `registry-credentials` pull secret
(Kyverno-synced to every namespace) on the deployment.
- **Packaging → Poetry + ruff + mypy** (replaces the loose pip
`requirements.txt`). Python **package stays `backend`** — imports are
`from backend.x` and the entrypoint is `uvicorn backend.main:app`; renaming
would churn every module + the Dockerfile + the staticfiles path. Python
**3.13 kept** (the live image already runs it; tripit's 3.12 pin is for
zxing-cpp/pymupdf, which f1-stream lacks).
- **Tests → pragmatic pure-logic only**. The extractors + verifier are
network/browser-bound; full coverage is brittle. Unit-test the deterministic
core: `m3u8_rewriter` (incl. the EXT-X tag rewriters), the `proxy` HLS
parsers, `schedule` parsing/status, the extractor `registry`. 63 tests.
- **CI → single `.woodpecker.yml`**: `lint-and-test` (ruff + mypy + pytest on
`python:3.13-slim`) → `build-and-push` (buildx → Forgejo, tags `latest` +
`${CI_COMMIT_SHA:0:8}`) → `deploy` (`kubectl set image` + `rollout status`).
**Keel stays enrolled** as a redundant net. This is the `tuya_bridge`
"build drives the rollout" model + a `travel-agent`-style test gate.
- A Slack-notify step was prototyped but **dropped**: the
`environment: { from_secret }` form is rejected by this Woodpecker
version's pipeline-struct decoder (`yaml: did not find expected key`), and
the canonical owned-app refs (`tuya_bridge`, `job-hunter`) have no Slack
step. Deploy success is confirmed by `rollout status`.
- **Versioning → first git tag `v2.0.1`** (continuity with the existing image
lineage; a fresh `v0.1.0` on a production 2.x app would mislead
monitoring/homepage). Deviates deliberately from the `v0.1.0` precedent of
tripit/travel-agent.
- **Runtime stays root** (matching the prior working image) to avoid a
non-root regression on the `/data` NFS write path and the Playwright browser
cache. Non-root is a possible future hardening.
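
The build-and-deploy flow described in the CI decision can be sketched as a dry run in shell. This is an illustration only, not the contents of `.woodpecker.yml`: `short_sha` and `plan_release` are hypothetical helpers, the build context `.` is assumed, and deploying the short-SHA tag (rather than `latest`) is an assumption. The registry path, the tag scheme and the `kubectl` subcommands come from this document.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the commands a build-and-deploy step might run.
# short_sha and plan_release are hypothetical, not part of the real pipeline.

IMAGE=forgejo.viktorbarzin.me/viktor/f1-stream

# short_sha <full_sha>: first 8 characters, as in ${CI_COMMIT_SHA:0:8}.
short_sha() {
  local sha=$1
  echo "${sha:0:8}"
}

# plan_release <full_sha>: print the build, push and rollout commands.
# Assumption: the rollout pins the short-SHA tag, not `latest`.
plan_release() {
  local tag
  tag=$(short_sha "$1")
  echo "docker buildx build --push -t $IMAGE:latest -t $IMAGE:$tag ."
  echo "kubectl -n f1-stream set image deployment/f1-stream f1-stream=$IMAGE:$tag"
  echo "kubectl -n f1-stream rollout status deployment/f1-stream"
}
```

Running `plan_release "$CI_COMMIT_SHA"` prints the three commands without executing them, which makes the tag derivation easy to check before wiring it into a pipeline step.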
## Terraform delta (the only infra change)
`infra/stacks/f1-stream/main.tf`:
- image `viktorbarzin/f1-stream:latest` (DockerHub) →
`forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}` (new
`var.image_tag`, default `latest`);
- add `image_pull_secrets { name = "registry-credentials" }` to the pod spec;
- delete `files/` (source now lives in the standalone repo) and `redeploy.sh`.
The image field is in the deployment's `ignore_changes` (KEEL_IGNORE_IMAGE), so
the live tag is managed by CI/Keel, not Terraform. Everything else — namespace,
ExternalSecrets (`f1-stream-secrets`, `chrome-service-client-secrets`), NFS data
volume, Anubis PoW policy, `ingress_factory`, homepage + x402 annotations,
Discord + chrome-service env — is unchanged.
## Blast radius
- The `f1-stream` K8s service is the only consumer; no other stack references
`viktorbarzin/f1-stream` or the `files/` dir (verified: no `path.module` /
`archive_file` / `null_resource` references the dir).
- Adding `imagePullSecrets` triggers one Recreate rollout that pulls the
*current* (still-DockerHub, public) image — safe; CI then switches it to the
Forgejo image.


@ -0,0 +1,54 @@
# f1-stream extraction + productionization — plan (2026-06-04)
Companion to `2026-06-04-f1-stream-extraction-design.md`.
## Steps
1. **Scaffold** `/home/wizard/code/f1-stream/` — copy `backend/`, `frontend/`,
`Dockerfile`, `.dockerignore` from `infra/stacks/f1-stream/files/` by name
(exclude the `.claude/` marker + `redeploy.sh`); add `README.md`,
`.gitignore`. ✅
2. **Poetry conversion** — `pyproject.toml` (dist `f1-stream` v2.0.1,
`packages=[{include="backend"}]`, pinned deps), `poetry.lock`, ruff/mypy/
pytest config (E501 per-file-ignored on the embedded-JS/scraper modules).
Rewrite the Dockerfile to a Poetry multi-stage build (Poetry 2.1.3 to match
the lock; python:3.13; keep Chromium libs + `playwright install chromium`;
keep `backend/` + `frontend/build/` siblings under `/app`). ✅
3. **Tests** — 63 pytest unit tests over the pure-logic core. ✅
4. **CI** — single `.woodpecker.yml` (lint+test → buildx push to Forgejo →
`kubectl set image` + rollout). ✅
5. **Create + push** — Forgejo repo `viktor/f1-stream` (private), commit, push
`master`, tag `v2.0.1`. ✅
6. **Enable in Woodpecker** — activate via
`scripts/woodpecker-register-forgejo-repo.sh` (Woodpecker repo id 166);
org-level `forgejo_user`/`forgejo_push_token` secrets apply. ✅
7. **Repoint Terraform** — `main.tf` image → Forgejo + `var.image_tag` +
`image_pull_secrets`; `tg apply`. ✅
8. **Untrack from infra** — `git rm -r stacks/f1-stream/files`; add
`/f1-stream/` to the monorepo root `.gitignore`. ✅
9. **Docs** — fix the stale "GHA / repo id 10" claim in `.claude/CLAUDE.md` +
`docs/architecture/ci-cd.md`; update `service-catalog.md`; this design/plan
pair. ✅
10. **Verify** — pipeline green; pod runs the Forgejo image; `/health` 200;
ingress reachable through Anubis.
## Verification commands
```bash
# pipeline
curl -s https://ci.viktorbarzin.me/api/repos/166/pipelines/<n> -H "Authorization: Bearer <jwt>"
# running image is the Forgejo one
kubectl get deploy f1-stream -n f1-stream \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
kubectl get pods -n f1-stream -l app=f1-stream
# health
kubectl exec -n f1-stream deploy/f1-stream -- \
  python -c "import urllib.request;print(urllib.request.urlopen('http://localhost:8000/health').read())"
```
## Rollback
The DockerHub image `viktorbarzin/f1-stream` and its tags still exist. To
revert: `kubectl -n f1-stream set image deployment/f1-stream
f1-stream=viktorbarzin/f1-stream:<tag>` and restore the `main.tf` image string.
The standalone repo + Forgejo image are additive; nothing is destroyed.


@ -0,0 +1,91 @@
# PVE R730 presence-aware fan control — design
**Date:** 2026-06-04
**Status:** implemented
**Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh`
**Runbook:** `infra/docs/runbooks/fan-control.md`
## Problem
The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72–77°C under normal
cluster load. That is safe (firmware warning at 88°C, critical 93°C) but the
iDRAC's stock fan curve optimises for quiet, not cool — it pins the fans at the
~7080 RPM floor even at 72°C / load 30 and only ramps near ~80°C. We want the
CPU to run cooler when it costs nothing (the box is in the garage, usually
empty) while staying quiet when someone is physically in the garage.
## Measured fan/temp relationship (manual IPMI sweep, 2026-06-04)
At a comparable CPU load (~45–53 % busy):
| Fan setting | Fan RPM | CPU temp |
|-------------|---------|----------|
| Auto (floor) | 7,080 | 71–72°C |
| 50 % | 9,360 | 65–66°C |
| 70 % | 12,800 | 60–61°C |
| 100 % | 17,000 | 55–56°C |
Best °C-per-RPM is the first step; beyond ~70 % it is mostly noise. ~16°C of
swing is available.
## Decisions
1. **Custom bash daemon + systemd service**, deployed to the PVE host the same
way as `apply-mbps-caps` / `daily-backup` (source in `infra/scripts/`, scp to
`/usr/local/bin`). It cannot be Terraform/k8s — it runs on the bare host where
IPMI lives. (OSS `tigerblue77/Dell-iDRAC-fan-controller` was considered;
rejected — it is a Docker container, off-pattern here, and unaware of our
constraints.)
2. **CPU temperature is the only control input.** The Tesla T4 has its own
always-on fan (owner-confirmed), so it self-cools and does not depend on
chassis airflow — no GPU coupling needed.
3. **Presence = the garage door**, because the server is *in the garage*
(memory id=1723); noise only matters to people physically there. Signal:
ha-sofia `sensor.garage_door_state_bg`. Open now, or last changed within
`HOLD_SECS` (15 min) ⇒ someone's around ⇒ QUIET; otherwise COOL.
`house_mode` was rejected — it tracks *apartment* occupancy, irrelevant to
garage noise.
4. **Two curves**, picked by presence:
| CPU °C | COOL % (empty) | CPU °C | QUIET % (occupied) |
|--------|----------------|--------|--------------------|
| ≤52 | 25 | ≤72 | 20 (≈silent floor) |
| 53–60 | 45 | 73–77 | 40 |
| 61–67 | 65 | 78–81 | 65 |
| 68–73 | 85 | ≥82 | 100 |
| ≥74 | 100 | | |
A 3°C downward hysteresis prevents flapping at band edges: the daemon ramps up
immediately, but steps down only when the curve would still ask for a lower
setting at a temperature 3°C hotter than the current reading.
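
The two curves and the downward hysteresis can be sketched as pure bash functions. This is a minimal sketch, not the daemon's code: the runbook confirms that `COOL_CURVE`, `QUIET_CURVE` and `DEADBAND` exist, but the `max_temp:percent` array encoding, the function names, and the choice to step down to the value the curve gives at `temp + DEADBAND` are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the array encoding and function names are assumptions.

# Bands from the design tables as "max_temp:percent"; 999 marks the open-ended top band.
COOL_CURVE=(52:25 60:45 67:65 73:85 999:100)
QUIET_CURVE=(72:20 77:40 81:65 999:100)
DEADBAND=3   # degrees C of downward hysteresis

# curve_lookup <mode> <temp>: print the fan percent the curve wants at <temp>.
curve_lookup() {
  local mode=$1 temp=$2 band
  local -a curve
  case $mode in
    quiet) curve=("${QUIET_CURVE[@]}") ;;
    *)     curve=("${COOL_CURVE[@]}") ;;
  esac
  for band in "${curve[@]}"; do
    if (( temp <= ${band%%:*} )); then
      echo "${band##*:}"
      return 0
    fi
  done
}

# next_percent <mode> <temp> <current_percent>: apply the hysteresis rule.
next_percent() {
  local mode=$1 temp=$2 current=$3 want want_hot
  want=$(curve_lookup "$mode" "$temp")
  if (( want >= current )); then
    echo "$want"          # ramp up (or hold) immediately
    return 0
  fi
  # Cooling: only step down if the curve is still lower DEADBAND degrees hotter.
  want_hot=$(curve_lookup "$mode" $(( temp + DEADBAND )))
  if (( want_hot < current )); then
    echo "$want_hot"
  else
    echo "$current"
  fi
}
```

With these tables, `next_percent cool 59 65` holds at 65 because the COOL curve still asks for 65 at 62°C, while `next_percent cool 57 65` steps down to 45.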
## Safety
Manual fan mode bypasses the iDRAC's own protection, so it is backstopped:
- **Daemon exit/crash/stop** → bash `EXIT` trap + systemd `ExecStopPost` both
run `ipmitool raw 0x30 0x30 0x01 0x01` (restore Dell auto). `Restart=on-failure`.
- **CPU ≥ `CEILING` (83°C)** → hand back to Dell auto until temp holds below
`RESUME_BELOW` (75°C) for `RESUME_STABLE` (120 s), then resume manual.
- **IPMI read failures ≥ `MAX_IPMI_FAILS`** → restore Dell auto.
- **ha-sofia unreachable** → keep the last good presence decision; default COOL
at cold start (thermally safe).
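
The exit backstop and the ceiling/resume rule above can be sketched as follows. This is an illustration under stated assumptions, not the daemon's code: `restore_auto`, `update_fallback` and the `fallback` / `cool_since` state variables are hypothetical names, and `DRY_RUN` defaults to 1 here so that running the sketch never touches IPMI. The raw command and the `CEILING`, `RESUME_BELOW` and `RESUME_STABLE` values come from this design.

```bash
#!/usr/bin/env bash
# Sketch only: helper and state names are hypothetical.
DRY_RUN=${DRY_RUN:-1}                 # assumption: default to dry-run in this sketch
CEILING=${CEILING:-83}                # hand back to firmware at or above this temp
RESUME_BELOW=${RESUME_BELOW:-75}      # temp must stay below this before resuming
RESUME_STABLE=${RESUME_STABLE:-120}   # seconds below RESUME_BELOW before resuming

fallback=0     # 1 while Dell auto owns the fans
cool_since=0   # epoch second when temp first dropped below RESUME_BELOW (0 = unset)

# restore_auto: return fan control to the firmware (command from the design).
restore_auto() {
  if [[ $DRY_RUN == 1 ]]; then
    echo "DRY_RUN: ipmitool raw 0x30 0x30 0x01 0x01"
  else
    ipmitool raw 0x30 0x30 0x01 0x01
  fi
}
trap restore_auto EXIT   # first backstop; systemd ExecStopPost is the second

# update_fallback <temp> <now_epoch>: update $fallback for this loop iteration.
# The caller would invoke restore_auto when fallback flips from 0 to 1.
update_fallback() {
  local temp=$1 now=$2
  if (( temp >= CEILING )); then
    fallback=1
    cool_since=0
  elif (( fallback == 1 )); then
    if (( temp < RESUME_BELOW )); then
      if (( cool_since == 0 )); then
        cool_since=$now
      fi
      if (( now - cool_since >= RESUME_STABLE )); then
        fallback=0
        cool_since=0
      fi
    else
      cool_since=0   # warmed up again: restart the stability timer
    fi
  fi
  return 0
}
```

Keeping the resume decision in a function that takes the clock as an argument lets it be tested without waiting out the real 120 s.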
## Observability
Pushes to the existing Pushgateway (`http://10.0.20.100:30091`, job
`fan_control`): `pve_fan_control_cpu_temp_celsius`, `_fan_percent`, `_mode`
(1 quiet / 2 cool / 0 fallback), `_ha_reachable`, `_fallback`. The existing
CPU-temp alert is unaffected.
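
A push along these lines might look like the sketch below. It assumes the abbreviated metric names expand with the `pve_fan_control` prefix and that the daemon uses `curl` against the Pushgateway's standard `/metrics/job/<job>` endpoint; `render_metrics` and `push_metrics` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the exact metric expansion are assumptions.
PUSHGATEWAY=${PUSHGATEWAY:-http://10.0.20.100:30091}

# render_metrics <temp> <percent> <mode> <ha_reachable> <fallback>
# Prints the metrics in Prometheus text exposition format.
render_metrics() {
  local temp=$1 pct=$2 mode=$3 ha=$4 fb=$5 mode_num
  case $mode in
    quiet) mode_num=1 ;;
    cool)  mode_num=2 ;;
    *)     mode_num=0 ;;   # fallback
  esac
  cat <<EOF
pve_fan_control_cpu_temp_celsius $temp
pve_fan_control_fan_percent $pct
pve_fan_control_mode $mode_num
pve_fan_control_ha_reachable $ha
pve_fan_control_fallback $fb
EOF
}

# push_metrics <same args>: POST the rendered metrics under job=fan_control.
push_metrics() {
  render_metrics "$@" |
    curl -fsS --max-time 5 --data-binary @- "$PUSHGATEWAY/metrics/job/fan_control"
}
```

Separating rendering from the network call keeps the formatting testable offline; only `push_metrics` needs the Pushgateway to be reachable.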
## Testing
`test-fan-control.sh` sources the script (main is guarded by a `BASH_SOURCE`
check) and unit-tests the pure functions: both curves, hysteresis up/down,
presence open/recent/stale, temperature parsing, jq-free JSON field extraction,
and percent→hex. 36 assertions, no hardware needed. The daemon also supports
`DRY_RUN=1` and `RUN_ONCE=1` for integration checks.
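
The source-guard pattern and two of the pure helpers might look like the following sketch. The function names are hypothetical, and the sample `ipmitool sdr type temperature` line format used by `parse_temp` is assumed from typical Dell output rather than taken from this repository.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# pct_to_hex <percent>: clamp to 0..100 and print a hex byte such as 0x14.
pct_to_hex() {
  local pct=$1
  if (( pct < 0 )); then pct=0; fi
  if (( pct > 100 )); then pct=100; fi
  printf '0x%02x\n' "$pct"
}

# parse_temp: read sensor lines on stdin and print the integer reading from
# the first line that starts with "Temp " (the prefix the runbook greps for).
# Assumed line shape: "Temp | 0Eh | ok | 3.1 | 63 degrees C".
parse_temp() {
  awk -F'|' '/^Temp / { split($5, f, " "); print f[1]; exit }'
}

main() {
  echo "daemon loop would start here"
}

# Run main only when executed directly, so a test script can source this
# file and call the functions without starting the loop.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  main "$@"
fi
```

A test file can then `source` the script and assert on `pct_to_hex` and `parse_temp` with no hardware attached.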
## Rollback
`systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01` on the
host returns the box to stock firmware fan control. See the runbook.


@ -0,0 +1,74 @@
# Runbook — PVE R730 fan-control daemon
Presence-aware IPMI fan controller on the PVE host (192.168.1.127). Runs the
CPU cool when the garage is empty, quiet when someone's in the garage. Design:
`infra/docs/plans/2026-06-04-pve-fan-control-design.md`.
## What it is
- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`).
- `fan-control.service` — systemd unit (`Type=simple`, restarts on failure).
- `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git).
## Quick status
```bash
ssh root@192.168.1.127 systemctl status fan-control
ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager'
ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "'
```
Log lines look like `temp=63C mode=cool fan=65% (was 45%)`.
## Disable / roll back to stock firmware control
```bash
ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01'
```
The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit
`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve.
## Tune
Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`.
Common knobs:
- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min).
- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83).
- Curves themselves are arrays (`COOL_CURVE`, `QUIET_CURVE`) near the top of the script.
## Deploy / update
```bash
cd infra
scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control
ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control
scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service
# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control'
```
## HA token
`/etc/fan-control.env` holds a long-lived ha-sofia token used to read
`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security →
Long-lived access tokens, or reuse the existing ha-sofia token. If the token is
missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs
`ha_reachable=0`.
## Symptoms & checks
| Symptom | Check |
|---------|-------|
| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. |
| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? |
| Fans flapping | Increase `DEADBAND`. |
| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. |
| Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. |
## Verify presence wiring
```bash
# one iteration, real IPMI + HA, no daemon loop:
ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control'
```
With the garage closed for >15 min you should see `mode=cool`; within 15 min of
the door moving, `mode=quiet`.
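
The presence rule behind those two outcomes can be sketched as below. This is a hedged illustration: `json_field` and `presence_mode` are hypothetical names, the `open` state string is an assumption about what `sensor.garage_door_state_bg` reports, and converting the `last_changed` timestamp to epoch seconds is left to the caller.

```bash
#!/usr/bin/env bash
# Sketch only: names and the "open" state value are assumptions.
HOLD_SECS=${HOLD_SECS:-900}
OPEN_STATE=${OPEN_STATE:-open}   # assumption: the sensor's state string when open

# json_field <key>: print the string value of "key" from a flat JSON object
# on stdin, without jq. Good enough for simple fields, not a JSON parser.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n1
}

# presence_mode <state> <last_changed_epoch> <now_epoch>
# Door open now, or moved within HOLD_SECS: quiet. Otherwise: cool.
presence_mode() {
  local state=$1 changed=$2 now=$3
  if [[ $state == "$OPEN_STATE" ]] || (( now - changed < HOLD_SECS )); then
    echo quiet
  else
    echo cool
  fi
}
```

For example, `presence_mode closed 1000 1899` prints `quiet` (899 s since the door moved) and `presence_mode closed 1000 1900` prints `cool`.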