diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 54b51441..af780d1b 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -104,13 +104,15 @@ have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI `:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy step. **Never** `set image`/`rollout restart` operator-managed StatefulSets (memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`, -`job-hunter`. This reverses decision #12 of +`job-hunter`, `f1-stream` (viktor/f1-stream, extracted from this monorepo +2026-06-04). This reverses decision #12 of `docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream) images. **Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image` -**Migrated to GHA** (10): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints +**Migrated to GHA** (9): Website, k8s-portal, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints +**Woodpecker-native owned-app build** (Forgejo registry, build->deploy in one `.woodpecker.yml`): tuya_bridge, job-hunter, f1-stream (extracted to viktor/f1-stream 2026-06-04; Woodpecker repo id 166) **Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access) **Per-project files**: @@ -119,7 +121,7 @@ images. - `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires) **Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML). 
-Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, f1-stream=10, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD +Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD (f1-stream's old GHA-era id 10 is defunct; it's now a Woodpecker-native build at repo id 166) **Woodpecker YAML gotchas**: - Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index 86016d62..5741b227 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -46,7 +46,7 @@ | nextcloud | File sync/share | nextcloud | | calibre | E-book management (may be merged into ebooks stack) | calibre | | onlyoffice | Document editing | onlyoffice | -| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier) | f1-stream | +| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (extracted 2026-06-04), Woodpecker-native build->deploy | f1-stream | | chrome-service | Headed Chromium WebSocket pool (`ws://chrome-service.chrome-service.svc:3000/`) for sibling services driving anti-bot embeds | chrome-service | | rybbit | Analytics | rybbit | | isponsorblocktv | SponsorBlock for TV | isponsorblocktv | diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 4c0c020b..75699744 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -58,10 +58,9 @@ graph LR ### Project Migration Status -**Migrated to GHA (9 projects)**: +**Migrated to GHA (8 projects)**: - Website - k8s-portal -- f1-stream - claude-memory-mcp - apple-health-data - audiblez-web @@ -69,6 +68,14 @@ graph LR - insta2spotify - book-search (audiobook-search) 
+**Woodpecker-native owned-app builds** (build + push to the Forgejo private +registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel +stays enrolled as a redundant net): +- `tuya_bridge`, `job-hunter`, `f1-stream` +- `f1-stream` was extracted from this monorepo into its own repo + (`viktor/f1-stream`) on 2026-06-04; its Woodpecker repo id is 166 (the old + GHA-era id 10 is defunct). + **Woodpecker-only (infra + large apps)**: - `travel_blog`: 5.7GB content directory exceeds GHA limits - Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli) @@ -92,7 +99,6 @@ Woodpecker API uses numeric IDs (not owner/name): | travel_blog | 5 | | webhook-handler | 6 | | audiblez-web | 9 | -| f1-stream | 10 | | plotting-book | 43 | | claude-memory-mcp | 78 | | infra-onboarding | 79 | diff --git a/docs/plans/2026-06-04-f1-stream-extraction-design.md b/docs/plans/2026-06-04-f1-stream-extraction-design.md new file mode 100644 index 00000000..1ccc7909 --- /dev/null +++ b/docs/plans/2026-06-04-f1-stream-extraction-design.md @@ -0,0 +1,78 @@ +# f1-stream extraction + productionization — design (2026-06-04) + +## Problem + +`f1-stream` (FastAPI backend serving a SvelteKit SPA; ~15 pluggable stream +extractors + a Playwright/chrome-service playback verifier) lived **inside** +the infra monorepo at `infra/stacks/f1-stream/files/`. It had: + +- no standalone repo — source coupled to the Terraform stack; +- **no real CI** — only a manual `redeploy.sh` doing a local `docker buildx` + push to DockerHub (`viktorbarzin/f1-stream`) + `kubectl rollout restart`; +- no README, no tests, a loose unpinned `requirements.txt`, no semver tags; +- a stale CI claim in docs ("migrated to GHA, Woodpecker repo id 10") that did + not match reality (no GHA workflow ever existed for it). 
+ +## Goal + +Extract the app into its own Forgejo repo `viktor/f1-stream` and productionize +it, mirroring the established owned-app pattern (`tuya_bridge`, `job-hunter`, +`tripit`, `travel-agent`). + +## Decisions (with rationale) + +- **Registry → Forgejo private** (`forgejo.viktorbarzin.me/viktor/f1-stream`), + matching the fleet standard. Needs the `registry-credentials` pull secret + (Kyverno-synced to every namespace) on the deployment. +- **Packaging → Poetry + ruff + mypy** (replaces the loose pip + `requirements.txt`). Python **package stays `backend`** — imports are + `from backend.x` and the entrypoint is `uvicorn backend.main:app`; renaming + would churn every module + the Dockerfile + the staticfiles path. Python + **3.13 kept** (the live image already runs it; tripit's 3.12 pin is for + zxing-cpp/pymupdf, which f1-stream lacks). +- **Tests → pragmatic pure-logic only**. The extractors + verifier are + network/browser-bound; full coverage is brittle. Unit-test the deterministic + core: `m3u8_rewriter` (incl. the EXT-X tag rewriters), the `proxy` HLS + parsers, `schedule` parsing/status, the extractor `registry`. 63 tests. +- **CI → single `.woodpecker.yml`**: `lint-and-test` (ruff + mypy + pytest on + `python:3.13-slim`) → `build-and-push` (buildx → Forgejo, tags `latest` + + `${CI_COMMIT_SHA:0:8}`) → `deploy` (`kubectl set image` + `rollout status`). + **Keel stays enrolled** as a redundant net. This is the `tuya_bridge` + "build drives the rollout" model + a `travel-agent`-style test gate. + - A Slack-notify step was prototyped but **dropped**: the + `environment: { from_secret }` form is rejected by this Woodpecker + version's pipeline-struct decoder (`yaml: did not find expected key`), and + the canonical owned-app refs (`tuya_bridge`, `job-hunter`) have no Slack + step. Deploy success is confirmed by `rollout status`. 
+- **Versioning → first git tag `v2.0.1`** (continuity with the existing image + lineage; a fresh `v0.1.0` on a production 2.x app would mislead + monitoring/homepage). Deviates deliberately from the `v0.1.0` precedent of + tripit/travel-agent. +- **Runtime stays root** (matching the prior working image) to avoid a + non-root regression on the `/data` NFS write path and the Playwright browser + cache. Non-root is a possible future hardening. + +## Terraform delta (the only infra change) + +`infra/stacks/f1-stream/main.tf`: + +- image `viktorbarzin/f1-stream:latest` (DockerHub) → + `forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}` (new + `var.image_tag`, default `latest`); +- add `image_pull_secrets { name = "registry-credentials" }` to the pod spec; +- delete `files/` (source now lives in the standalone repo) and `redeploy.sh`. + +The image field is in the deployment's `ignore_changes` (KEEL_IGNORE_IMAGE), so +the live tag is managed by CI/Keel, not Terraform. Everything else — namespace, +ExternalSecrets (`f1-stream-secrets`, `chrome-service-client-secrets`), NFS data +volume, Anubis PoW policy, `ingress_factory`, homepage + x402 annotations, +Discord + chrome-service env — is unchanged. + +## Blast radius + +- The `f1-stream` K8s service is the only consumer; no other stack references + `viktorbarzin/f1-stream` or the `files/` dir (verified: no `path.module` / + `archive_file` / `null_resource` references the dir). +- Adding `imagePullSecrets` triggers one Recreate rollout that pulls the + *current* (still-DockerHub, public) image — safe; CI then switches it to the + Forgejo image. diff --git a/docs/plans/2026-06-04-f1-stream-extraction-plan.md b/docs/plans/2026-06-04-f1-stream-extraction-plan.md new file mode 100644 index 00000000..ba773802 --- /dev/null +++ b/docs/plans/2026-06-04-f1-stream-extraction-plan.md @@ -0,0 +1,54 @@ +# f1-stream extraction + productionization — plan (2026-06-04) + +Companion to `2026-06-04-f1-stream-extraction-design.md`. 
+ +## Steps + +1. **Scaffold** `/home/wizard/code/f1-stream/` — copy `backend/`, `frontend/`, + `Dockerfile`, `.dockerignore` from `infra/stacks/f1-stream/files/` by name + (exclude the `.claude/` marker + `redeploy.sh`); add `README.md`, + `.gitignore`. ✅ +2. **Poetry conversion** — `pyproject.toml` (dist `f1-stream` v2.0.1, + `packages=[{include="backend"}]`, pinned deps), `poetry.lock`, ruff/mypy/ + pytest config (E501 per-file-ignored on the embedded-JS/scraper modules). + Rewrite the Dockerfile to a Poetry multi-stage build (Poetry 2.1.3 to match + the lock; python:3.13; keep Chromium libs + `playwright install chromium`; + keep `backend/` + `frontend/build/` siblings under `/app`). ✅ +3. **Tests** — 63 pytest unit tests over the pure-logic core. ✅ +4. **CI** — single `.woodpecker.yml` (lint+test → buildx push to Forgejo → + `kubectl set image` + rollout). ✅ +5. **Create + push** — Forgejo repo `viktor/f1-stream` (private), commit, push + `master`, tag `v2.0.1`. ✅ +6. **Enable in Woodpecker** — activate via + `scripts/woodpecker-register-forgejo-repo.sh` (Woodpecker repo id 166); + org-level `forgejo_user`/`forgejo_push_token` secrets apply. ✅ +7. **Repoint Terraform** — `main.tf` image → Forgejo + `var.image_tag` + + `image_pull_secrets`; `tg apply`. ✅ +8. **Untrack from infra** — `git rm -r stacks/f1-stream/files`; add + `/f1-stream/` to the monorepo root `.gitignore`. ✅ +9. **Docs** — fix the stale "GHA / repo id 10" claim in `.claude/CLAUDE.md` + + `docs/architecture/ci-cd.md`; update `service-catalog.md`; this design/plan + pair. ✅ +10. **Verify** — pipeline green; pod runs the Forgejo image; `/health` 200; + ingress reachable through Anubis. 
+ +## Verification commands + +```bash +# pipeline +curl -s https://ci.viktorbarzin.me/api/repos/166/pipelines/ -H "Authorization: Bearer " +# running image is the Forgejo one +kubectl get deploy f1-stream -n f1-stream \ + -o jsonpath='{.spec.template.spec.containers[0].image}' +kubectl get pods -n f1-stream -l app=f1-stream +# health +kubectl exec -n f1-stream deploy/f1-stream -- \ + python -c "import urllib.request;print(urllib.request.urlopen('http://localhost:8000/health').read())" +``` + +## Rollback + +The DockerHub image `viktorbarzin/f1-stream` and its tags still exist. To +revert: `kubectl -n f1-stream set image deployment/f1-stream +f1-stream=viktorbarzin/f1-stream:` and restore the `main.tf` image string. +The standalone repo + Forgejo image are additive; nothing is destroyed. diff --git a/docs/plans/2026-06-04-pve-fan-control-design.md b/docs/plans/2026-06-04-pve-fan-control-design.md new file mode 100644 index 00000000..ed4c2ae7 --- /dev/null +++ b/docs/plans/2026-06-04-pve-fan-control-design.md @@ -0,0 +1,91 @@ +# PVE R730 presence-aware fan control — design + +**Date:** 2026-06-04 +**Status:** implemented +**Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh` +**Runbook:** `infra/docs/runbooks/fan-control.md` + +## Problem + +The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72–77°C under normal +cluster load. That is safe (firmware warning at 88°C, critical 93°C) but the +iDRAC's stock fan curve optimises for quiet, not cool — it pins the fans at the +~7080 RPM floor even at 72°C / load 30 and only ramps near ~80°C. We want the +CPU to run cooler when it costs nothing (the box is in the garage, usually +empty) while staying quiet when someone is physically in the garage. 
+ +## Measured fan/temp relationship (manual IPMI sweep, 2026-06-04) + +At a comparable CPU load (~45–53 % busy): + +| Fan setting | Fan RPM | CPU temp | +|-------------|---------|----------| +| Auto (floor) | 7,080 | 71–72°C | +| 50 % | 9,360 | 65–66°C | +| 70 % | 12,800 | 60–61°C | +| 100 % | 17,000 | 55–56°C | + +Best °C-per-RPM is the first step; beyond ~70 % it is mostly noise. ~16°C of +swing is available. + +## Decisions + +1. **Custom bash daemon + systemd service**, deployed to the PVE host the same + way as `apply-mbps-caps` / `daily-backup` (source in `infra/scripts/`, scp to + `/usr/local/bin`). It cannot be Terraform/k8s — it runs on the bare host where + IPMI lives. (OSS `tigerblue77/Dell-iDRAC-fan-controller` was considered; + rejected — it is a Docker container, off-pattern here, and unaware of our + constraints.) +2. **CPU temperature is the only control input.** The Tesla T4 has its own + always-on fan (owner-confirmed), so it self-cools and does not depend on + chassis airflow — no GPU coupling needed. +3. **Presence = the garage door**, because the server is *in the garage* + (memory id=1723); noise only matters to people physically there. Signal: + ha-sofia `sensor.garage_door_state_bg`. Open now, or last changed within + `HOLD_SECS` (15 min) ⇒ someone's around ⇒ QUIET; otherwise COOL. + `house_mode` was rejected — it tracks *apartment* occupancy, irrelevant to + garage noise. +4. **Two curves**, picked by presence: + + | CPU °C | COOL % (empty) | CPU °C | QUIET % (occupied) | + |--------|----------------|--------|--------------------| + | ≤52 | 25 | ≤72 | 20 (≈silent floor) | + | 53–60 | 45 | 73–77 | 40 | + | 61–67 | 65 | 78–81 | 65 | + | 68–73 | 85 | ≥82 | 100 | + | ≥74 | 100 | | | + + 3°C downward hysteresis prevents flapping at band edges (ramp up immediately, + step down only once the curve would still pick a lower percent at 3°C hotter). 
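The band lookup and the downward hysteresis described in decision 4 can be traced with a small standalone sketch. It uses only the COOL column of the table and a 3°C deadband; the function names `cool_pct` and `next_pct` are hypothetical and chosen for illustration, so this should be read as a worked example of the rule rather than as the daemon's actual code.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: COOL curve from the table above plus downward
# hysteresis. Function names are hypothetical, not the daemon's API.

# "min_temp:pct" entries, hottest band first.
COOL=(74:100 68:85 61:65 53:45 0:25)

# cool_pct <temp> -> percent of the first band whose min_temp <= temp
cool_pct() {
  local temp="$1" entry
  for entry in "${COOL[@]}"; do
    if (( temp >= ${entry%%:*} )); then echo "${entry##*:}"; return 0; fi
  done
}

# next_pct <temp> <current_pct> <deadband> -> percent to apply next
next_pct() {
  local temp="$1" current="$2" deadband="$3" target
  target="$(cool_pct "$temp")"
  # Ramp up (or hold) immediately.
  if (( target >= current )); then echo "$target"; return 0; fi
  # Step down only if the curve still asks for less at temp + deadband.
  if (( $(cool_pct "$((temp + deadband))") < current )); then
    echo "$target"
  else
    echo "$current"
  fi
}

next_pct 68 45 3   # entering the 68-73 band ramps up at once
next_pct 67 85 3   # 67+3=70 still maps to 85, so the fan holds
next_pct 64 85 3   # 64+3=67 maps to 65, so the fan steps down
```

The second call is the case the deadband exists for: at 67°C the curve alone would drop to 65 %, but a reading that hovers around the 67/68 boundary would otherwise toggle the fans on every loop.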
+ +## Safety + +Manual fan mode bypasses the iDRAC's own protection, so it is backstopped: + +- **Daemon exit/crash/stop** → bash `EXIT` trap + systemd `ExecStopPost` both + run `ipmitool raw 0x30 0x30 0x01 0x01` (restore Dell auto). `Restart=on-failure`. +- **CPU ≥ `CEILING` (83°C)** → hand back to Dell auto until temp holds below + `RESUME_BELOW` (75°C) for `RESUME_STABLE` (120 s), then resume manual. +- **IPMI read failures ≥ `MAX_IPMI_FAILS`** → restore Dell auto. +- **ha-sofia unreachable** → keep the last good presence decision; default COOL + at cold start (thermally safe). + +## Observability + +Pushes to the existing Pushgateway (`http://10.0.20.100:30091`, job +`fan_control`): `pve_fan_control_cpu_temp_celsius`, `_fan_percent`, `_mode` +(1 quiet / 2 cool / 0 fallback), `_ha_reachable`, `_fallback`. The existing CPU- +temp alert is unaffected. + +## Testing + +`test-fan-control.sh` sources the script (main is guarded by a `BASH_SOURCE` +check) and unit-tests the pure functions: both curves, hysteresis up/down, +presence open/recent/stale, temperature parsing, jq-free JSON field extraction, +and percent→hex. 36 assertions, no hardware needed. The daemon also supports +`DRY_RUN=1` and `RUN_ONCE=1` for integration checks. + +## Rollback + +`systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01` on the +host returns the box to stock firmware fan control. See the runbook. diff --git a/docs/runbooks/fan-control.md b/docs/runbooks/fan-control.md new file mode 100644 index 00000000..cb28dfa2 --- /dev/null +++ b/docs/runbooks/fan-control.md @@ -0,0 +1,74 @@ +# Runbook — PVE R730 fan-control daemon + +Presence-aware IPMI fan controller on the PVE host (192.168.1.127). Runs the +CPU cool when the garage is empty, quiet when someone's in the garage. Design: +`infra/docs/plans/2026-06-04-pve-fan-control-design.md`. + +## What it is + +- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`). 
+- `fan-control.service` — systemd unit (`Type=simple`, restarts on failure). +- `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git). + +## Quick status + +```bash +ssh root@192.168.1.127 systemctl status fan-control +ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager' +ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "' +``` +Log lines look like `temp=63C mode=cool fan=65% (was 45%)`. + +## Disable / roll back to stock firmware control + +```bash +ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01' +``` +The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit +`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve. + +## Tune + +Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`. +Common knobs: +- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min). +- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83). +- Curves themselves are arrays (`COOL_CURVE`, `QUIET_CURVE`) near the top of the script. + +## Deploy / update + +```bash +cd infra +scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control +ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control +scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service +# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token +ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control' +``` + +## HA token + +`/etc/fan-control.env` holds a long-lived ha-sofia token used to read +`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security → +Long-lived access tokens, or reuse the existing ha-sofia token. 
If the token is +missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs +`ha_reachable=0`. + +## Symptoms & checks + +| Symptom | Check | +|---------|-------| +| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. | +| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? | +| Fans flapping | Increase `DEADBAND`. | +| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. | +| Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. | + +## Verify presence wiring + +```bash +# one iteration, real IPMI + HA, no daemon loop: +ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control' +``` +With the garage closed for >15 min you should see `mode=cool`; within 15 min of +the door moving, `mode=quiet`. diff --git a/scripts/fan-control.env.example b/scripts/fan-control.env.example new file mode 100644 index 00000000..3c2565c2 --- /dev/null +++ b/scripts/fan-control.env.example @@ -0,0 +1,21 @@ +# /etc/fan-control.env — config for the fan-control daemon (chmod 600). +# Deployed manually to the PVE host; the real file holds a secret token and is +# NOT committed. Copy this template, fill HA_TOKEN, scp to /etc/fan-control.env. + +# Long-lived ha-sofia access token (Home Assistant -> Profile -> Security -> +# Long-lived access tokens). Empty => presence disabled, daemon runs COOL-only. 
+HA_TOKEN= + +# --- optional overrides (defaults shown) --- +# HA_URL=http://192.168.1.8:8123 +# GARAGE_ENTITY=sensor.garage_door_state_bg +# GARAGE_OPEN_STATE=Отворена +# HOLD_SECS=900 # quiet-mode hold after last garage activity (15 min) +# LOOP_INTERVAL=15 +# PRESENCE_INTERVAL=30 +# DEADBAND=3 +# CEILING=83 # degC: hand back to Dell auto at/above this +# RESUME_BELOW=75 +# RESUME_STABLE=120 +# MAX_IPMI_FAILS=3 +PUSHGATEWAY_URL=http://10.0.20.100:30091 diff --git a/scripts/fan-control.service b/scripts/fan-control.service new file mode 100644 index 00000000..f337ff4d --- /dev/null +++ b/scripts/fan-control.service @@ -0,0 +1,21 @@ +[Unit] +Description=Presence-aware IPMI fan controller (Dell R730, garage) +Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/fan-control.sh +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +EnvironmentFile=-/etc/fan-control.env +ExecStart=/usr/local/bin/fan-control +# Belt-and-suspenders: whatever happens to the daemon, hand the fans back to +# the iDRAC's own automatic curve so the box is never stuck in manual mode. +ExecStopPost=/usr/bin/ipmitool raw 0x30 0x30 0x01 0x01 +Restart=on-failure +RestartSec=10 +StandardOutput=journal +StandardError=journal +SyslogIdentifier=fan-control + +[Install] +WantedBy=multi-user.target diff --git a/scripts/fan-control.sh b/scripts/fan-control.sh new file mode 100644 index 00000000..95942635 --- /dev/null +++ b/scripts/fan-control.sh @@ -0,0 +1,199 @@ +#!/usr/bin/env bash +# Presence-aware IPMI fan controller for the Dell R730 PVE host (192.168.1.127). +# +# The server lives in the GARAGE (memory id=1723). Two curves, picked by +# whether someone is physically in the garage: +# - COOL : garage empty -> minimise CPU temp, noise is free. +# - QUIET : someone in the garage -> minimise noise, accept a warmer CPU. +# Presence comes from the ha-sofia garage-door sensor: door open now, OR it +# last changed within HOLD_SECS, => QUIET. Otherwise COOL. 
+# +# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it): +# - On ANY exit (crash/stop/TERM) the EXIT trap hands fans back to Dell +# automatic control (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost +# repeats this belt-and-suspenders. +# - CPU >= CEILING -> hand back to Dell auto until it recovers (RESUME_BELOW +# held for RESUME_STABLE s). The firmware's own emergency cooling takes over. +# - IPMI read failures (>= MAX_IPMI_FAILS) -> hand back to Dell auto. +# +# Deploy: scp to /usr/local/bin/fan-control (strip .sh) + install +# fan-control.service + /etc/fan-control.env. Same pattern as apply-mbps-caps. +# Tests: test-fan-control.sh (sources this file, exercises the pure functions). +# Design: infra/docs/plans/2026-06-04-pve-fan-control-design.md +# Runbook: infra/docs/runbooks/fan-control.md + +set -uo pipefail + +# ---- configuration (override via /etc/fan-control.env) ---- +: "${IPMITOOL:=ipmitool}" +: "${LOOP_INTERVAL:=15}" # seconds between temperature decisions +: "${PRESENCE_INTERVAL:=30}" # seconds between ha-sofia garage-door polls +: "${DEADBAND:=3}" # degC hysteresis applied to downward fan steps +: "${CEILING:=83}" # degC: hand back to Dell auto at/above this +: "${RESUME_BELOW:=75}" # degC: eligible to resume manual below this... +: "${RESUME_STABLE:=120}" # ...once held that long +: "${HOLD_SECS:=900}" # quiet-mode hold after last garage activity (15 min) +: "${HA_URL:=http://192.168.1.8:8123}" +: "${HA_TOKEN:=}" # long-lived ha-sofia token; empty => presence disabled (COOL only) +: "${GARAGE_ENTITY:=sensor.garage_door_state_bg}" +: "${GARAGE_OPEN_STATE:=Отворена}" # ha state string meaning "open" +: "${PUSHGATEWAY_URL:=}" # optional Prometheus Pushgateway base URL +: "${MAX_IPMI_FAILS:=3}" +: "${DRY_RUN:=0}" # 1 => log IPMI actions instead of executing +: "${RUN_ONCE:=0}" # 1 => one iteration then exit (testing) + +# Curves as "min_temp:pct" entries, descending; first whose min_temp <= temp wins. 
+COOL_CURVE=(74:100 68:85 61:65 53:45 0:25) +QUIET_CURVE=(82:100 78:65 73:40 0:20) + +log() { printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$*"; } + +# ---- pure functions (no side effects; unit-tested) ---- + +# fc_curve <mode> <temp> -> fan percent +fc_curve() { + local mode="$1" temp="$2" + local -a curve + if [[ "$mode" == "quiet" ]]; then curve=("${QUIET_CURVE[@]}"); else curve=("${COOL_CURVE[@]}"); fi + local entry + for entry in "${curve[@]}"; do + if (( temp >= ${entry%%:*} )); then echo "${entry##*:}"; return 0; fi + done + echo "${curve[-1]##*:}" +} + +# fc_decide <mode> <temp> <current> <deadband> -> fan percent +# Ramps up immediately; only steps down once the curve still wants a lower +# percent even DEADBAND degrees hotter (prevents flapping at band edges). +fc_decide() { + local mode="$1" temp="$2" current="$3" deadband="$4" target + target="$(fc_curve "$mode" "$temp")" + if (( current < 0 || target >= current )); then echo "$target"; return 0; fi + if (( $(fc_curve "$mode" "$((temp + deadband))") < current )); then echo "$target"; else echo "$current"; fi +} + +# fc_presence_mode <state> <lc> <now> <hold> <open> -> quiet|cool +fc_presence_mode() { + local state="$1" lc="$2" now="$3" hold="$4" open="$5" + if [[ "$state" == "$open" ]]; then echo "quiet"; return 0; fi + if (( now - lc < hold )); then echo "quiet"; return 0; fi + echo "cool" +} + +# fc_parse_temp <sdr_line> -> integer degC +fc_parse_temp() { + echo "$1" | grep -oE '[0-9]+ degrees C' | grep -oE '^[0-9]+' | head -1 +} + +# fc_json_str_field <json> <field> -> string value (first match; jq-free) +fc_json_str_field() { + printf '%s' "$1" | grep -oE "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" | head -1 \ + | sed -E "s/.*:[[:space:]]*\"(.*)\"\$/\1/" +} + +# fc_pct_to_hex <pct> -> 0xNN +fc_pct_to_hex() { printf '0x%02x' "$1"; } + +# ---- side-effecting wrappers ---- + +ipmi_manual_on=0 + +set_manual() { # <pct> + local pct="$1" hex; hex="$(fc_pct_to_hex "$pct")" + if (( DRY_RUN == 1 )); then log "DRY set fan ${pct}% (${hex})"; ipmi_manual_on=1; return 0; fi + if (( ipmi_manual_on == 0 )); then + "$IPMITOOL" 
raw 0x30 0x30 0x01 0x00 >/dev/null 2>&1 || return 1 + ipmi_manual_on=1 + fi + "$IPMITOOL" raw 0x30 0x30 0x02 0xff "$hex" >/dev/null 2>&1 +} + +restore_auto() { + if (( DRY_RUN == 1 )); then log "DRY restore Dell auto fan control"; ipmi_manual_on=0; return 0; fi + "$IPMITOOL" raw 0x30 0x30 0x01 0x01 >/dev/null 2>&1 + ipmi_manual_on=0 +} + +read_cpu_temp() { + fc_parse_temp "$("$IPMITOOL" sdr type temperature 2>/dev/null | grep -E '^Temp ' | head -1)" +} + +presence_cache="cool"; presence_ts=0 +get_presence() { + local now; now="$(date +%s)" + if (( now - presence_ts < PRESENCE_INTERVAL )); then echo "$presence_cache"; return 0; fi + presence_ts="$now" + [[ -z "$HA_TOKEN" ]] && { echo "$presence_cache"; return 0; } + local resp state lc_iso lc_epoch + resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \ + "$HA_URL/api/states/$GARAGE_ENTITY" 2>/dev/null)" || { echo "$presence_cache"; return 0; } + state="$(fc_json_str_field "$resp" state)" + [[ -z "$state" ]] && { echo "$presence_cache"; return 0; } + lc_iso="$(fc_json_str_field "$resp" last_changed)" + lc_epoch="$(date -d "$lc_iso" +%s 2>/dev/null || echo "$now")" + presence_cache="$(fc_presence_mode "$state" "$lc_epoch" "$now" "$HOLD_SECS" "$GARAGE_OPEN_STATE")" + echo "$presence_cache" +} + +push_metrics() { # + [[ -z "$PUSHGATEWAY_URL" ]] && return 0 + local mode_num; case "$3" in quiet) mode_num=1;; cool) mode_num=2;; *) mode_num=0;; esac + curl -fsS --max-time 5 --data-binary @- \ + "$PUSHGATEWAY_URL/metrics/job/fan_control/instance/pve-r730" >/dev/null 2>&1 <= MAX_IPMI_FAILS )); then log "ERR temp unreadable — Dell auto"; restore_auto; current=-1; fi + (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } + fi + fails=0 + + if (( temp >= CEILING )); then + (( in_fallback == 0 )) && { log "CEILING temp=${temp}≥${CEILING} — Dell auto"; restore_auto; current=-1; in_fallback=1; } + push_metrics "$temp" 0 fallback 1 1 + (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; 
 continue; } + fi + if (( in_fallback == 1 )); then + if (( temp < RESUME_BELOW )); then + (( cool_since == 0 )) && cool_since="$(date +%s)" + if (( $(date +%s) - cool_since >= RESUME_STABLE )); then + log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) — resuming manual"; in_fallback=0; cool_since=0 + else + push_metrics "$temp" 0 fallback 1 1; (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } + fi + else + cool_since=0; push_metrics "$temp" 0 fallback 1 1 + (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } + fi + fi + + local mode ha_ok=1; [[ -z "$HA_TOKEN" ]] && ha_ok=0; get_presence >/dev/null; mode="$presence_cache" # direct call, not $(...): a subshell would discard presence_cache/presence_ts + local pct; pct="$(fc_decide "$mode" "$temp" "$current" "$DEADBAND")" + if (( pct != current )); then + if set_manual "$pct"; then log "temp=${temp}C mode=${mode} fan=${pct}% (was ${current}%)"; current="$pct" + else log "WARN set_manual ${pct}% failed"; fi + fi + push_metrics "$temp" "$current" "$mode" "$ha_ok" 0 + (( RUN_ONCE == 1 )) && break || sleep "$LOOP_INTERVAL" + done +} + +if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then main "$@"; fi diff --git a/scripts/test-fan-control.sh b/scripts/test-fan-control.sh new file mode 100644 index 00000000..1246f588 --- /dev/null +++ b/scripts/test-fan-control.sh @@ -0,0 +1,71 @@ +#!/usr/bin/env bash +# Unit tests for the pure functions in fan-control.sh. +# Sources the script (main is guarded), exercises curve/decide/presence/parse. 
+# Run: bash infra/scripts/test-fan-control.sh + +set -uo pipefail +DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=/dev/null +source "$DIR/fan-control.sh" + +pass=0 fail=0 +eq() { # + if [[ "$2" == "$3" ]]; then pass=$((pass + 1)); else + fail=$((fail + 1)); printf 'FAIL: %s — expected [%s] got [%s]\n' "$1" "$2" "$3" + fi +} + +# --- COOL curve --- +eq "cool 40 -> 25" 25 "$(fc_curve cool 40)" +eq "cool 52 -> 25" 25 "$(fc_curve cool 52)" +eq "cool 53 -> 45" 45 "$(fc_curve cool 53)" +eq "cool 60 -> 45" 45 "$(fc_curve cool 60)" +eq "cool 61 -> 65" 65 "$(fc_curve cool 61)" +eq "cool 67 -> 65" 65 "$(fc_curve cool 67)" +eq "cool 68 -> 85" 85 "$(fc_curve cool 68)" +eq "cool 73 -> 85" 85 "$(fc_curve cool 73)" +eq "cool 74 -> 100" 100 "$(fc_curve cool 74)" +eq "cool 91 -> 100" 100 "$(fc_curve cool 91)" + +# --- QUIET curve --- +eq "quiet 50 -> 20" 20 "$(fc_curve quiet 50)" +eq "quiet 72 -> 20" 20 "$(fc_curve quiet 72)" +eq "quiet 73 -> 40" 40 "$(fc_curve quiet 73)" +eq "quiet 77 -> 40" 40 "$(fc_curve quiet 77)" +eq "quiet 78 -> 65" 65 "$(fc_curve quiet 78)" +eq "quiet 81 -> 65" 65 "$(fc_curve quiet 81)" +eq "quiet 82 -> 100" 100 "$(fc_curve quiet 82)" + +# --- decide: hysteresis --- +eq "decide uninit -> target" 85 "$(fc_decide cool 68 -1 3)" +eq "decide ramp up now" 85 "$(fc_decide cool 68 25 3)" +eq "decide equal holds" 65 "$(fc_decide cool 65 65 3)" +eq "decide down held in band" 85 "$(fc_decide cool 67 85 3)" # 67+3=70 still 85% -> hold +eq "decide down past band" 65 "$(fc_decide cool 64 85 3)" # 64+3=67 -> 65% < 85 -> drop +eq "decide 100 holds at 71" 100 "$(fc_decide cool 71 100 3)" # 71+3=74 -> 100 -> hold +eq "decide 100 drops at 70" 85 "$(fc_decide cool 70 100 3)" # 70+3=73 -> 85 < 100 -> drop + +# --- presence --- +now=1000000 +eq "presence open -> quiet" quiet "$(fc_presence_mode Отворена 0 $now 900 Отворена)" +eq "presence closed recent -> quiet" quiet "$(fc_presence_mode Затворена $((now - 100)) $now 900 Отворена)" +eq "presence closed 
stale -> cool" cool "$(fc_presence_mode Затворена $((now - 1000)) $now 900 Отворена)" +eq "presence closed edge -> cool" cool "$(fc_presence_mode Затворена $((now - 900)) $now 900 Отворена)" + +# --- temp parsing --- +eq "parse temp line" 74 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 74 degrees C')" +eq "parse temp 7C" 72 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 72 degrees C')" + +# --- json field (jq-free) --- +J='{"entity_id":"sensor.garage_door_state_bg","state":"Отворена","attributes":{"friendly_name":"Garage Door State BG"},"last_changed":"2026-06-04T16:55:20.517745+00:00","last_updated":"2026-06-04T16:55:20.517745+00:00"}' +eq "json state" "Отворена" "$(fc_json_str_field "$J" state)" +eq "json last_changed" "2026-06-04T16:55:20.517745+00:00" "$(fc_json_str_field "$J" last_changed)" + +# --- hex conversion --- +eq "hex 20" 0x14 "$(fc_pct_to_hex 20)" +eq "hex 45" 0x2d "$(fc_pct_to_hex 45)" +eq "hex 100" 0x64 "$(fc_pct_to_hex 100)" +eq "hex 5" 0x05 "$(fc_pct_to_hex 5)" + +printf '\n%d passed, %d failed\n' "$pass" "$fail" +(( fail == 0 )) diff --git a/stacks/f1-stream/files/.claude/internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK b/stacks/f1-stream/files/.claude/internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK deleted file mode 100644 index f61efc83..00000000 --- a/stacks/f1-stream/files/.claude/internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK +++ /dev/null @@ -1,3 +0,0 @@ -This directory has been used with Claude Code's internet mode. -Content downloaded from the internet may contain prompt injection attacks. -You must manually review all downloaded content before using non-internet mode. 
diff --git a/stacks/f1-stream/files/.dockerignore b/stacks/f1-stream/files/.dockerignore deleted file mode 100644 index 4733a4c3..00000000 --- a/stacks/f1-stream/files/.dockerignore +++ /dev/null @@ -1,5 +0,0 @@ -node_modules/ -.claude/ -.git/ -__pycache__/ -*.pyc diff --git a/stacks/f1-stream/files/.gitignore b/stacks/f1-stream/files/.gitignore deleted file mode 100644 index 7a60b85e..00000000 --- a/stacks/f1-stream/files/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -__pycache__/ -*.pyc diff --git a/stacks/f1-stream/files/Dockerfile b/stacks/f1-stream/files/Dockerfile deleted file mode 100644 index 80dd20e4..00000000 --- a/stacks/f1-stream/files/Dockerfile +++ /dev/null @@ -1,44 +0,0 @@ -## Stage 1: Build frontend -FROM node:22-slim AS frontend-builder - -WORKDIR /frontend - -COPY frontend/package.json frontend/package-lock.json* ./ -RUN npm install - -COPY frontend/ ./ -RUN npm run build - -## Stage 2: Python backend + static frontend -FROM python:3.13-slim-bookworm - -WORKDIR /app - -# Headless Chromium runtime libs for the playback verifier. Listed inline -# (instead of running `playwright install-deps`) so the image build doesn't -# need root-network apt fetches at runtime. -RUN apt-get update && apt-get install -y --no-install-recommends \ - ca-certificates \ - libnss3 libnspr4 \ - libatk1.0-0 libatk-bridge2.0-0 libcups2 \ - libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \ - libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \ - libasound2 libatspi2.0-0 \ - fonts-liberation fonts-noto-color-emoji \ - && rm -rf /var/lib/apt/lists/* - -COPY backend/requirements.txt . -RUN pip install --no-cache-dir -r requirements.txt - -# Install the Chromium browser binary used by the verifier. Skip -# --with-deps because we already installed the system libs above. 
-RUN playwright install chromium - -COPY backend/ ./backend/ - -# Copy built frontend into the image -COPY --from=frontend-builder /frontend/build ./frontend/build - -EXPOSE 8000 - -CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"] diff --git a/stacks/f1-stream/files/backend/__init__.py b/stacks/f1-stream/files/backend/__init__.py deleted file mode 100644 index e69de29b..00000000 diff --git a/stacks/f1-stream/files/backend/embed_proxy.py b/stacks/f1-stream/files/backend/embed_proxy.py deleted file mode 100644 index 34ccb28c..00000000 --- a/stacks/f1-stream/files/backend/embed_proxy.py +++ /dev/null @@ -1,359 +0,0 @@ -"""Embed iframe-stripping reverse proxy. - -Serves third-party embed pages (e.g. https://hmembeds.one/embed/{hash}, -https://pooembed.eu/embed/{slug}) through our origin so we can: - -1. Strip X-Frame-Options and Content-Security-Policy: frame-ancestors headers, - so the embed loads in our - {:else} - - {/if} - - -
-
- - - - setVolume(i, e)} - class="w-16 h-1 accent-f1-red" - aria-label="Volume" - /> - -
- - -
-
- - - {#if player.error} -
- {player.error} -
- {/if} - - {/each} - - {/if} - - - {#if loading} -
-
- Loading streams... -
- {:else if errorMsg} -
-

Failed to load streams: {errorMsg}

- -
- {:else if streamsData} -
-

- Available Streams - ({streamsData.count}) -

-
- {#if players.length > 0} - {players.length}/{MAX_PLAYERS} streams active - {/if} - -
-
- - {#if streamsData.streams.length === 0} -
-

No streams available right now.

-

Streams appear when a session is live. Check the schedule for upcoming sessions.

- - View Schedule - -
- {:else} -
- {#each streamsData.streams as stream, i} - {@const active = isStreamActive(stream.stream_type === 'embed' ? stream.embed_url : stream.url)} -
-
-
- {stream.site_name || stream.site_key || 'Unknown'} - {#if stream.is_live} - Live - {/if} - {#if stream.stream_type === 'embed'} - Embed - {/if} - {#if active} - Playing - {/if} -
-
- {#if stream.title} - {stream.title} - {/if} - {#if stream.quality} - {stream.quality} - {/if} - {#if stream.response_time_ms != null} - - {stream.response_time_ms}ms - - {/if} -
-
- -
- {#if !active} - - {:else} - Active - {/if} -
-
- {/each} -
- {/if} - {/if} - diff --git a/stacks/f1-stream/files/frontend/svelte.config.js b/stacks/f1-stream/files/frontend/svelte.config.js deleted file mode 100644 index 9088ef3b..00000000 --- a/stacks/f1-stream/files/frontend/svelte.config.js +++ /dev/null @@ -1,19 +0,0 @@ -import adapter from '@sveltejs/adapter-static'; - -/** @type {import('@sveltejs/kit').Config} */ -const config = { - kit: { - adapter: adapter({ - pages: 'build', - assets: 'build', - fallback: 'index.html', - precompress: false, - strict: true - }), - paths: { - base: '' - } - } -}; - -export default config; diff --git a/stacks/f1-stream/files/frontend/vite.config.js b/stacks/f1-stream/files/frontend/vite.config.js deleted file mode 100644 index a39ec5c1..00000000 --- a/stacks/f1-stream/files/frontend/vite.config.js +++ /dev/null @@ -1,10 +0,0 @@ -import { sveltekit } from '@sveltejs/kit/vite'; -import tailwindcss from '@tailwindcss/vite'; -import { defineConfig } from 'vite'; - -export default defineConfig({ - plugins: [ - tailwindcss(), - sveltekit() - ] -}); diff --git a/stacks/f1-stream/files/redeploy.sh b/stacks/f1-stream/files/redeploy.sh deleted file mode 100755 index e436a6ce..00000000 --- a/stacks/f1-stream/files/redeploy.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/usr/bin/env bash -set -e - -docker buildx build --platform linux/amd64 --provenance=false \ - -t viktorbarzin/f1-stream:v2.0.1 -t viktorbarzin/f1-stream:latest \ - --push . -kubectl -n f1-stream rollout restart deployment f1-stream diff --git a/stacks/f1-stream/main.tf b/stacks/f1-stream/main.tf index ff64af71..9cbe6ce5 100644 --- a/stacks/f1-stream/main.tf +++ b/stacks/f1-stream/main.tf @@ -6,6 +6,15 @@ variable "nfs_server" { type = string } variable "discord_f1_guild_id" { type = string } variable "discord_f1_channel_ids" { type = string } +# Image tag for the Forgejo-registry image. CI (.woodpecker.yml in +# viktor/f1-stream) builds + pushes `latest` and ``, then drives the +# rollout via `kubectl set image`. 
Keel stays enrolled as a redundant net, so +# the running tag is managed outside Terraform (see KEEL_IGNORE_IMAGE below). +variable "image_tag" { + type = string + default = "latest" +} + resource "kubernetes_namespace" "f1-stream" { metadata { name = "f1-stream" @@ -13,7 +22,7 @@ resource "kubernetes_namespace" "f1-stream" { "istio-injection" : "disabled" tier = local.tiers.aux "chrome-service.viktorbarzin.me/client" = "true" - "keel.sh/enrolled" = "true" + "keel.sh/enrolled" = "true" } } lifecycle { @@ -118,7 +127,7 @@ resource "kubernetes_deployment" "f1-stream" { } spec { container { - image = "viktorbarzin/f1-stream:latest" + image = "forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}" image_pull_policy = "Always" name = "f1-stream" resources { @@ -176,6 +185,11 @@ resource "kubernetes_deployment" "f1-stream" { claim_name = module.nfs_data_host.claim_name } } + # Pull the (private) Forgejo-registry image. Kyverno syncs + # registry-credentials into every namespace. + image_pull_secrets { + name = "registry-credentials" + } } } }
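Reviewer note on the `image_tag` comment above: the CI step in `viktor/f1-stream` is not part of this diff, so the rollout command can only be sketched. The helper below is hypothetical and only prints the `kubectl set image` invocation implied by the namespace, deployment, and container names in `main.tf`. The tag value passed in is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the repo: prints (does not run) the
# rollout command for a given image tag.
f1_set_image_cmd() {
  local tag=$1
  local image="forgejo.viktorbarzin.me/viktor/f1-stream:${tag}"
  # Namespace, deployment, and container are all named f1-stream in main.tf.
  printf 'kubectl -n f1-stream set image deployment/f1-stream f1-stream=%s\n' "$image"
}
```

Calling `f1_set_image_cmd latest` prints the command for the default tag. Because the running tag is changed outside Terraform this way, the deployment relies on `ignore_changes` for the container image so a later apply does not revert it.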