Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error
source, from the 2026-06-06 log triage):
1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative
smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file
in our postfix-main.cf override were IGNORED and triggered postfix's
'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two
legacy lines (functional no-op; chain_files already wins). Verified via
live postconf.
2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign
public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen
half-open drops, rate-limit-exceeded from unknown). Real delivery logs +
real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so
security posture is unchanged. Validated with 'alloy fmt' (exit 0).
Reversible.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
rybbit's 'rybbit' PG database was missing from CNPG (the role survived a
past cluster rebuild but the database did not), so the app's node-cron
logged 'database "rybbit" does not exist' every minute (found via Loki
2026-06-06). Created the DB manually to restore service (app auto-migrated
11 tables); this adds a self-contained init Job so the DB is recreated on
any future rebuild -- connects as the rybbit role (has CREATEDB) using the
existing rybbit-secrets password, idempotent CREATE DATABASE if absent.
Deployment now depends_on the job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to
return 200 instead of proxying to the scaled-to-0 poison-fountain.
- security.md Layer 1 + tarpit description + troubleshooting (fix stale
stacks/platform path -> traefik stack; drop misleading
restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that
startup-read secret consumers need a Reloader annotation (matrix root
cause, found via Loki 2026-06-05).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bot-block-proxy is the forward-auth target for the ai-bot-block
middleware (applied to every anti-AI ingress). It proxied /auth to the
poison-fountain bot trap with error_page 5xx=200 fail-open. But
poison-fountain is intentionally scaled to 0, so proxy_pass only ever
failed and fell open to '200 allowed' -- while logging ~51k errors/hr
(the #1 Loki source once pod logs began shipping 2026-06-05) and paying
up to 100ms connect-timeout per authed request.
Short-circuit /auth to 'return 200 "allowed"' directly (drop the
upstream + proxy_pass + fallback). Identical effective behaviour
(allow-all), no upstream attempt, no noise, no latency. Reversible:
restore the upstream + proxy_pass and scale poison-fountain up.
Also add the missing configmap.reloader.stakater.com/reload annotation
so openresty picks up ConfigMap changes (it does not hot-reload on its
own -- the root reason stale config ran for days). replicas stays 2:
critical-path forward-auth target (anti-AI ingresses fail closed if it
is down), so HA is retained though each request is now trivial.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Synapse injects the Postgres password into homeserver.yaml only at
startup (inject-db-password initContainer). matrix-db-creds is rotated
by Vault via ESO (15m refresh), so each rotation left the running pod
with a stale password and Synapse DB auth failed silently until a
manual rollout restart. Found today via Loki: ~12.9k/hr 'password
authentication failed for user matrix' lines; secret password verified
working against the DB while the 10-day-old pod held the pre-rotation
value.
Add the explicit secret.reloader.stakater.com/reload annotation so
Reloader rolls the deployment whenever the secret changes (explicit
form, not auto/search, because the secret is referenced only in an
initContainer env var). Live pod already restarted to restore service;
this prevents recurrence on the next rotation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot.
NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) — large-image
services incur a pull-delay blip when migration moves them to a fresh node.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN
with no VolumeAttachment -> invisible oversubscription -> query-pci wedge).
Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks
(Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT
.../config delete=scsiN -> frees the LUN slot, retains the LV).
- detection mirrors cluster-health check #47
- SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5
- scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP
- verified live: read 66 VAs, 0 ghosts, no false positives
- pushes csi_ghosts_detected/detached to Pushgateway
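The detection rule above amounts to a set difference with two guards. A minimal Python sketch, assuming hypothetical inputs (`vm_disks` maps each VM config key to a volume name as read from Proxmox, `attached` is the set of volume names that have a VolumeAttachment); it is an illustration, not the csi-ghost-reconcile code itself:

```python
MAX_DETACH_PER_RUN = 5       # per-run cap from the SAFETY rule
PVC_PREFIX = "vm-9999-pvc"   # only CSI-provisioned disks are candidates


def find_ghosts(vm_disks: dict, attached: set) -> list:
    """Return scsi slots holding a CSI volume that has no VolumeAttachment."""
    return sorted(
        slot
        for slot, volume in vm_disks.items()
        if slot.startswith("scsi")
        and volume.startswith(PVC_PREFIX)
        and volume not in attached
    )


def plan_detach(first_pass: list, second_pass: list) -> list:
    """Keep only ghosts seen in both passes (the re-confirm), capped per run."""
    still_ghost = set(second_pass)
    confirmed = [slot for slot in first_pass if slot in still_ghost]
    return confirmed[:MAX_DETACH_PER_RUN]
```

Requiring a slot to appear in both passes means an attach that is merely in flight during the first read is not detached by mistake.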
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
modules/kubernetes/ebook2audiobook/ held a tracked copy of the audiblez-web
app source (24 files), sourced by no stack and built by no CI — audiblez-web
is GHA-built from its own repo. Swept into the tree by a bulk commit on
2026-04-15; now removed.
Also corrected CONTEXT.md: the "vestigial per-app dirs (immich/, ollama/,
...)" note was wrong — those were untracked local macOS cruft (._main.tf
AppleDouble turds), never in the repo; cleaned from the working tree.
modules/kubernetes/ now holds exactly the four factory modules
(ingress_factory, nfs_volume, anubis_instance, setup_tls_secret).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.
- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
service 80->3000, ingress_factory auth=required + proxied DNS at
trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
uploads) -- encrypted per the sensitive-data rule and to avoid the
SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
companion deferred per solo-trial scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC
mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on
single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS
(tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots.
- storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history,
updated cap-relief levers
- topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the
decision doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Enables Open-Meteo geocoding of lodging addresses (results cached in the
new geocode_cache table) so the itinerary can show per-city weather.
Applied manually via scripts/tg apply.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs):
- Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app
factory modules (never existed); name the real four (ingress_factory,
nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local
/ flat distinction; flag vestigial modules/kubernetes/<app> dirs.
- Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers);
reserve "tier" for State tier + Namespace tier only.
- Add local-path entry (cluster default SC; node-local footgun warning).
- Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico.
- Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC).
- Fix node count 5 -> 7 (k8s-master + k8s-node1..6).
Doc-sync (same commit per repo rules):
- overview.md: replace fictional factory modules with the real shared
modules + the flat/stack-local pattern.
- .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision
table + stale cross-reference (vault migrated off it 2026-04-25).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC
is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden
proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see
docs/plans/2026-06-05-block-storage-harden-nfs-design.md.
- swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC)
- data copied + verified (HTTP 200 on NFS volume); block PVC removed
- block LV retained per SC policy (orphan cleanup tracked in code-dfjn)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tripit outbound (linked-email verification + trip-share invites) was sent
From: spam@viktorbarzin.me. Switch the From to plans@viktorbarzin.me while
keeping SMTP auth as spam@ (its password, unchanged).
docker-mailserver SPOOF_PROTECTION (reject_sender_login_mismatch) requires
the authed login to "own" the From; the @viktorbarzin.me catch-all does NOT
grant that per-address, so add an explicit `plans@ -> spam@` virtual alias to
authorize it (also keeps inbound plans@ routing to spam@ for the mail-ingest
poller). tripit SMTP_FROM flips to plans@.
Verified: sender-login probe (auth spam@, MAIL FROM plans@) now 250 (was 553);
a real send from the tripit pod logs from=<plans@viktorbarzin.me> accepted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.
Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC
panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the
dashboard-it Server view refreshes uniformly at 30s, matching the
fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster
log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3
GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were
landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel.
Reversible (remove the drop rule). Loki ingestion drops ~73%.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.
This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.
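For reference, a minimal Python sketch of what a client of this path-scoped ingress does: build an instant-query URL and pull the first sample out of the standard Prometheus JSON response. The `https` scheme and the metric name used in the usage note are assumptions, not taken from the ha-sofia config:

```python
import json
from typing import Optional
from urllib.parse import urlencode

BASE = "https://prometheus-query.viktorbarzin.lan"  # scheme is an assumption


def instant_query_url(expr: str) -> str:
    """Build a URL for the only path the ingress exposes, /api/v1/query."""
    return f"{BASE}/api/v1/query?{urlencode({'query': expr})}"


def first_value(body: str) -> Optional[float]:
    """Return the first sample value from an instant-query response, if any."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return None
    result = payload["data"]["result"]
    # Each vector sample is {"metric": {...}, "value": [timestamp, "string"]}.
    return float(result[0]["value"][1]) if result else None
```

Calling `instant_query_url("r730_idrac_fan_rpm")` (a hypothetical metric name) yields a URL under `/api/v1/query`, so the request stays inside the read-only path scope.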
The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).
Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the 2026-06-05 cluster-wide log observability work: the Alloy
local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the
new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST
sensors, and the loki.viktorbarzin.lan Technitium-record follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New
dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod
dropdowns + free-text regex search; stats (lines/errors/warns/active-ns),
log-volume-by-namespace, error/warn rate, top-namespaces-by-errors,
top-pods-by-errors, a filterable live-logs panel, and a second row for the
node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel).
Error/warn use case-insensitive regex line-filters so they work regardless of
level-label availability. New "Logs" Grafana folder.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause:
loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly
from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d
the literal `*` path and shipped zero pod logs ("stat failed: no such file" for
every pod). Per Grafana Alloy docs, a local.file_match component must expand the
glob into concrete file targets first. Add it. Also add stage.cri {} so Loki
stores clean messages + real timestamps instead of raw containerd CRI-prefixed
lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26
state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the
IO-storm safeguard.
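The failure mode is easy to reproduce outside Alloy. A minimal Python sketch, using a throwaway directory with hypothetical names in place of /var/log/pods, showing that a glob treated as a literal path matches nothing while expanding it first yields concrete files:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Stand-in for /var/log/pods/<ns>_<pod>_<uid>/<container>/0.log
    container_dir = os.path.join(root, "ns_pod_uid", "app")
    os.makedirs(container_dir)
    open(os.path.join(container_dir, "0.log"), "w").close()

    pattern = os.path.join(root, "*uid", "app", "*.log")

    # Treating the pattern as a literal path finds nothing, which is
    # the behaviour loki.source.file showed with the unexpanded glob.
    literal_exists = os.path.exists(pattern)

    # Expanding the glob first yields real file paths, which is the
    # step local.file_match performs before the files are tailed.
    expanded = glob.glob(pattern)
```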
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %),
so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) +
RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the
calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor
is unavailable — it blips out intermittently, so the readout never goes blank.
The Override readout likewise shows both "% · rpm". HA-side only; daemon
unchanged.
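The fallback logic can be sketched in a few lines of Python. The calibration constants come from the message above; the function name and the exact readout format are assumptions, since the real card is an HA template:

```python
from typing import Optional


def fan_rpm_readout(pct: float, sensor_rpm: Optional[float]) -> str:
    """Prefer the measured RPM; otherwise show the estimate with a '~' prefix."""
    if sensor_rpm is not None:
        return f"{sensor_rpm:.0f} rpm"
    estimate = 160 * pct + 1520  # calibrated fit: rpm ~= 160 * % + 1520
    return f"~{estimate:.0f} rpm"
```

With the sensor unavailable at 70%, this returns `~12720 rpm`, so the readout stays populated during a blip.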
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md
covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the
full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via
the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia
alert group; the dashboard; and the systemd watchdog. Notes the SD-card root
cause and that the Pi-side config is hand-managed + backed up off-box.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dashboard Override slider used to show a stale stored % (e.g. 5%) while the
fans were actually at ~53%, which was confusing. Add
automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it
mirrors the live commanded % (sensor.r730_fan_control_target) into the Override,
so it always shows the actual absolute fan speed and updates as the fan moves.
While locked it stops tracking and is the user's editable setpoint. The readout
under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side
only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants
mail ingested when forwarded to plans@viktorbarzin.me, and only from actual
users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL
from ingest-plans (the app now ignores mail from non-users instead of filing it
under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for
the new sender notifications. Service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":`
rationale comment DIRECTLY above the `auth = "none"` line; mine was in the
module's top block comment, so the guard aborted the whole monitoring apply
(this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the
first push). Move the rationale to the required position.
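A minimal Python sketch of the kind of adjacency check that trips here, assuming the guard only inspects the single line directly above each `auth = "none"` assignment; the regexes and function name are hypothetical and not copied from check-ingress-auth-comments.py:

```python
import re

AUTH_NONE = re.compile(r'^\s*auth\s*=\s*"none"')
RATIONALE = re.compile(r'^\s*#\s*auth\s*=\s*"none"\s*:')


def missing_rationale(lines: list) -> list:
    """Return 1-based numbers of auth = "none" lines with no rationale directly above."""
    bad = []
    for index, line in enumerate(lines):
        if not AUTH_NONE.match(line):
            continue
        if index == 0 or not RATIONALE.match(lines[index - 1]):
            bad.append(index + 1)
    return bad
```

A rationale placed in the module's top block comment fails this check even though the text is present, which matches the abort described above.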
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an
Override % slider, and a Lock toggle. Lock now means "freeze the current speed,
algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo)
snapshots the live target % into Override and sets mode=manual on lock-ON, and
mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the
mode it already reads). cool/quiet stay reachable via the entity but are off the
simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified
live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA
sensors dead, and its local journal had been silent since Apr 27 — a
2017 SD card intermittently flipping the rootfs read-only). Nothing was
captured because logging lived only on the failing card. Ship telemetry
off-box so the next failure is diagnosable centrally:
- Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) —
node_exporter + a vcgencmd textfile collector on the Pi exporting
under-voltage/throttle/SoC-temp as rpi_* metrics.
- Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the
exact SD-failure signature), Under-voltage since boot, High SoC temp.
- LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's
promtail can push its journal — Loki was ClusterIP-only.
- Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/
throttle, temp, load, memory, disk, network.
The Pi separately got a systemd hardware watchdog (auto-reboot on a hard
hang; today it stayed down ~5h until a manual power-cycle).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over
99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app
now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty)
requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m).
Frees ~768Mi of reserved cluster memory.
A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a
Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a
deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on
the dashboard-it Server view so it isn't forgotten. The automation re-checks the
lock after the hour (locking mid-countdown cancels the revert) and the 83 °C
ceiling still wins. HA-side only (helper + automation + dashboard live on
ha-sofia, auto-git-tracked there); these docs are the infra-repo record.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a
cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the
2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads
live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est.
Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the
dashboard-it Server view, next to total power. 46 bash tests green; verified
live (9120rpm -> ~15W est).
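The estimate is a one-line formula. A minimal Python sketch using the coefficient quoted above (the function name is illustrative; the daemon itself is bash):

```python
def fan_watts_est(rpm: float, k: float = 0.0205) -> float:
    """Cube-law fan power estimate in watts: k * (RPM / 1000)^3."""
    return k * (rpm / 1000.0) ** 3
```

At 9120 rpm this gives roughly 15.5 W, in line with the live reading, and at 4800 rpm roughly 2.3 W, in line with the stated floor.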
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the step-band fan curve with a continuous linear ramp — the bands
flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis
is the homelab standard; PID is overkill for this slow thermal loop.
fan% now interpolates between env-tunable anchors:
COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium)
QUIET 68C/20% -> 83C/100% (near-silent until ~70C)
Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric
hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold.
41 bash tests green; deployed + verified live (59C -> 49%, smooth).
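The ramp and the min-change threshold can be sketched in Python as below. The anchors are the ones listed above; holding the low-anchor % below the low anchor is an assumption, and the asymmetric hysteresis in fc_decide is not modelled here:

```python
def ramp_pct(temp_c: float, lo_c: float, lo_pct: float,
             hi_c: float = 83.0, hi_pct: float = 100.0) -> float:
    """Linearly interpolate fan % between two (temperature, %) anchors."""
    if temp_c <= lo_c:
        return lo_pct  # assumption: hold the low-anchor % below the low anchor
    if temp_c >= hi_c:
        return hi_pct
    return lo_pct + (temp_c - lo_c) * (hi_pct - lo_pct) / (hi_c - lo_c)


def apply_min_step(current_pct: float, target_pct: float,
                   min_step: float = 3.0) -> float:
    """Ignore changes smaller than MIN_STEP so the fans do not chase noise."""
    if abs(target_pct - current_pct) >= min_step:
        return target_pct
    return current_pct
```

For the COOL anchors, `ramp_pct(59, 50, 30)` is about 49.1, which agrees with the 59C -> 49% reading verified live.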
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the HA-control feature shipped in 8beca1df: the daemon reads the
ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation,
and the dashboard-it Server-view sensors + control tiles.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The host daemon now polls input_select.r730_fan_mode (auto/cool/quiet/
manual) + input_number.r730_fan_manual_pct from ha-sofia each loop and
routes through fc_resolve: manual holds a fixed %, cool/quiet force that
curve, auto keeps the garage-presence behaviour. CEILING still overrides.
Ships HA control now on the running host daemon (no Vault); the cluster
CronJob migration stays the eventual Terraform home (same logic).
HA side (on ha-sofia, auto-git-tracked there): two helpers, an auto-
revert-to-auto automation (60min), mode + %-slider control tiles on the
dashboard-it Server view. Verified end-to-end: HA manual 70% -> fans
12720rpm; revert to auto -> presence curve 50%.
10 new pure-function tests (fc_resolve/fc_clamp); 46 total green.
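The routing described above can be sketched as a pure function. This Python version is illustrative only: the names are hypothetical analogues of fc_resolve/fc_clamp, the curves are passed in as callables, and forcing 100% at the 83C ceiling is an assumption based on the ramp anchors rather than on the daemon source:

```python
from typing import Callable

CEILING_C = 83.0  # ceiling temperature named in the fan-control commits
Curve = Callable[[float], float]


def clamp_pct(pct: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Keep a requested fan % inside the valid range."""
    return max(lo, min(hi, pct))


def resolve_fan_pct(mode: str, manual_pct: float, temp_c: float,
                    cool: Curve, quiet: Curve, auto: Curve) -> float:
    """Pick the fan % for the current mode; the ceiling overrides every mode."""
    if temp_c >= CEILING_C:
        return 100.0  # assumption: ceiling means full speed
    if mode == "manual":
        return clamp_pct(manual_pct)
    if mode == "cool":
        return cool(temp_c)
    if mode == "quiet":
        return quiet(temp_c)
    return auto(temp_c)  # "auto" and any unknown mode fall back to the presence logic
```

Keeping the resolver free of I/O is what makes it cheap to cover with pure-function tests.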
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
woodpecker-server sets reloader.stakater.com/search="true" but the
woodpecker-db-creds ExternalSecret never carried the matching
reloader.stakater.com/match="true", so Stakater Reloader never
restarted the server when Vault rotated the pg-woodpecker password
(7-day static role). The DB DSN is injected via envFrom, which does not
hot-reload a running pod — so after each rotation the server kept using
the revoked password until some unrelated restart (Keel bump, drain,
manual) recreated it inside the window. A latent weekly DB-outage masked
by incidental restarts.
Add the match annotation to the ESO target template and correct the
stale "rotated every 24h" comment (actual rotation_period is 604800s =
7 days).
Verified end-to-end: forced 'vault write -f database/rotate-role/pg-woodpecker',
ESO updated the secret in ~3s, Reloader auto-restarted woodpecker-server
in ~36s, new pod reconnected with zero DB errors. [ci skip] because the
change was already applied via scripts/tg apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo
registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims:
- CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native
owned-app group; note old GitHub source archived + GHA Woodpecker repo 10
deactivated; f1-stream is now Woodpecker repo 166.
- service-catalog: note the source repo + deploy model.
The actively-developed f1-stream (infra files/ copy: 12 active extractors +
Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is
the deployed app (replacing the stale March GitHub build).
- main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}
+ image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE.
- Remove stacks/f1-stream/files/ (source now in viktor/f1-stream).
- docs/plans: extraction design + plan pair.
Applied via tg + kubectl set image to forgejo:24857a82; live /health green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Power/temp sweep (2026-06-05) located the cooling-per-watt knee at ~60%:
60->70% buys only -2C for +21W, and 70->100% buys 0C for +54W (the CPU
floors ~59C at cluster load, so more airflow does nothing). Re-tune the
COOL curve to cap its normal band at 60% (~303W, ~61C); 80/100% become a
high-load safety ramp (>=73/79C) before the 83C ceiling. QUIET unchanged
(already at the 281W / 4800rpm floor). Saves up to ~75W (~650 kWh/yr) vs
full-tilt for the last ~2C. Tests + design doc updated; verified live
(63C, 60%, ~267W).
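The yearly figure follows from the two sweep deltas, assuming the fans would otherwise run at 100% all year. A short Python check of the arithmetic:

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a constant load, in kWh."""
    return watts * HOURS_PER_YEAR / 1000.0


# 60 -> 70% costs +21 W and 70 -> 100% costs +54 W, per the sweep.
extra_watts = 21 + 54
saving_kwh = annual_kwh(extra_watts)
```

This gives 75 W and 657 kWh per year, which the message rounds to ~650 kWh/yr.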
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>