Commit graph

4009 commits

Author SHA1 Message Date
Viktor Barzin
c059405632 fan-control: simplify HA dashboard + Lock = freeze-current/algo-off [ci skip]
The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an
Override % slider, and a Lock toggle. Lock now means "freeze the current speed,
algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo)
snapshots the live target % into Override and sets mode=manual on lock-ON, and
mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the
mode it already reads). cool/quiet stay reachable via the entity but are off the
simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified
live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:27:46 +00:00
Viktor Barzin
f9376a36ff monitoring: wire rpi-sofia (Sofia Pi) into Prometheus/Loki/alerts
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA
sensors dead, and its local journal had been silent since Apr 27 — a
2017 SD card intermittently flipping the rootfs read-only). Nothing was
captured because logging lived only on the failing card. Ship telemetry
off-box so the next failure is diagnosable centrally:

- Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) —
  node_exporter + a vcgencmd textfile collector on the Pi exporting
  under-voltage/throttle/SoC-temp as rpi_* metrics.
- Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the
  exact SD-failure signature), Under-voltage since boot, High SoC temp.
- LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's
  promtail can push its journal — Loki was ClusterIP-only.
- Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/
  throttle, temp, load, memory, disk, network.

The Pi separately got a systemd hardware watchdog (auto-reboot on a hard
hang; today it stayed down ~5h until a manual power-cycle).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 13:11:40 +00:00
Viktor Barzin
5b96b841fc f1-stream: right-size memory 1Gi -> 256Mi (CDP-only, no bundled Chromium)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Actual usage ~116Mi, Goldilocks/VPA upperBound ~185Mi (incl. live races over
99d). The 1Gi reservation was sized for the old bundled-Chromium image; the app
now drives the remote chrome-service over CDP. 256Mi (upperBound x~1.3, bursty)
requests=limits per convention; cpu request 100m -> 50m (VPA upperBound 49m).
Frees ~768Mi of reserved cluster memory.
2026-06-05 12:57:22 +00:00
Viktor Barzin
d17b25cdcc fan-control: document the HA Fan Lock (opt out of 60-min auto-revert) [ci skip]
A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a
Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a
deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on
the dashboard-it Server view so it isn't forgotten. The automation re-checks the
lock after the hour (locking mid-countdown cancels the revert) and the 83 °C
ceiling still wins. HA-side only (helper + automation + dashboard live on
ha-sofia, auto-git-tracked there); these docs are the infra-repo record.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 12:22:00 +00:00
Viktor Barzin
51456a96f6 fan-control: estimate + expose fan power (fan_watts_est)
The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a
cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the
2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads
live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est.
Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the
dashboard-it Server view, next to total power. 46 bash tests green; verified
live (9120rpm -> ~15W est).

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 11:10:27 +00:00
Viktor Barzin
324f2dc3bf fan-control: continuous linear curve (replaces discrete step-bands)
Replace the step-band fan curve with a continuous linear ramp — the bands
flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis
is the homelab standard; PID is overkill for this slow thermal loop.
fan% now interpolates between env-tunable anchors:
  COOL  50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium)
  QUIET 68C/20% -> 83C/100% (near-silent until ~70C)
Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric
hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold.
41 bash tests green; deployed + verified live (59C -> 49%, smooth).

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 10:29:35 +00:00
Viktor Barzin
945c1936e3 fan-control docs: HA control (mode/manual-% + auto-revert + dashboard)
Document the HA-control feature shipped in 8beca1df: the daemon reads the
ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation,
and the dashboard-it Server-view sensors + control tiles.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:29:35 +00:00
Viktor Barzin
8beca1dfc7 fan-control: read HA mode/manual-% setpoint (HA fan control)
The host daemon now polls input_select.r730_fan_mode (auto/cool/quiet/
manual) + input_number.r730_fan_manual_pct from ha-sofia each loop and
routes through fc_resolve: manual holds a fixed %, cool/quiet force that
curve, auto keeps the garage-presence behaviour. CEILING still overrides.
Ships HA control now on the running host daemon (no Vault); the cluster
CronJob migration stays the eventual Terraform home (same logic).

HA side (on ha-sofia, auto-git-tracked there): two helpers, an auto-
revert-to-auto automation (60min), mode + %-slider control tiles on the
dashboard-it Server view. Verified end-to-end: HA manual 70% -> fans
12720rpm; revert to auto -> presence curve 50%.

10 new pure-function tests (fc_resolve/fc_clamp); 46 total green.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:26:22 +00:00
Viktor Barzin
b958935ee0 woodpecker: reload server on Vault PG password rotation [ci skip]
woodpecker-server sets reloader.stakater.com/search="true" but the
woodpecker-db-creds ExternalSecret never carried the matching
reloader.stakater.com/match="true", so Stakater Reloader never
restarted the server when Vault rotated the pg-woodpecker password
(7-day static role). The DB DSN is injected via envFrom, which does not
hot-reload a running pod — so after each rotation the server kept using
the revoked password until some unrelated restart (Keel bump, drain,
manual) recreated it inside the window. A latent weekly DB-outage masked
by incidental restarts.

Add the match annotation to the ESO target template and correct the
stale "rotated every 24h" comment (actual rotation_period is 604800s =
7 days).

Verified end-to-end: forced 'vault write -f database/rotate-role/pg-woodpecker',
ESO updated the secret in ~3s, Reloader auto-restarted woodpecker-server
in ~36s, new pod reconnected with zero DB errors. [ci skip] because the
change was already applied via scripts/tg apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
3796a84e04 docs: f1-stream is Woodpecker-native (Forgejo viktor/f1-stream), not GHA/repo-10
f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo
registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims:
- CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native
  owned-app group; note old github source archived + GHA Woodpecker repo 10
  deactivated; f1-stream is now Woodpecker repo 166.
- service-catalog: note the source repo + deploy model.
2026-06-05 09:19:12 +00:00
root
aa25dd488c Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:12 +00:00
Viktor Barzin
e8bfb4d06b f1-stream: consume Forgejo-registry image; drop in-monorepo source
The actively-developed f1-stream (infra files/ copy: 12 active extractors +
Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is
the deployed app (replacing the stale March github build).

- main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}
  + image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE.
- Remove stacks/f1-stream/files/ (source now in viktor/f1-stream).
- docs/plans: extraction design + plan pair.

Applied via tg + kubectl set image to forgejo:24857a82; live /health green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
99f9bf8d89 fan-control: power-tune COOL curve to the 60% efficiency knee
Power/temp sweep (2026-06-05) located the cooling-per-watt knee at ~60%:
60->70% buys only -2C for +21W, and 70->100% buys 0C for +54W (the CPU
floors ~59C at cluster load, so more airflow does nothing). Re-tune the
COOL curve to cap its normal band at 60% (~303W, ~61C); 80/100% become a
high-load safety ramp (>=73/79C) before the 83C ceiling. QUIET unchanged
(already at the 281W / 4800rpm floor). Saves up to ~75W (~650 kWh/yr) vs
full-tilt for the last ~2C. Tests + design doc updated; verified live
(63C, 60%, ~267W).

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
17da37cea3 fire-planner: reset bulk ingest toggle after successful run
Job completed: 1,060 examples inserted across 10 FIRE subreddits
(1,080 total), 20/24 sub-runs succeeded. Toggle reset to false.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
deb031cc2c feat(tripit): encrypted personal-document vault PVC + DOCUMENT_ENCRYPTION_KEY
Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at
/data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the
DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A
root chown init-container makes the block volume writable by the non-root app
without touching the NFS doc vault. Backs the new owner-only encrypted personal
document vault in the tripit app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
27989cd9f1 fire-planner: bulk Reddit FIRE examples ingest + qwen3-8b model upgrade
- Enable bulk ingest job (run_examples_bulk_ingest=true) to populate
  fire_example table from top/all + top/year across 12 FIRE subreddits.
  Job fire-planner-examples-bulk-202606042150 is currently running.
- Upgrade examples_llm_model from qwen3vl-4b to qwen3-8b; GPU has 10.7GB
  free (immich-ml using ~4GB of 15GB total), so higher-quality model fits.
- Add LLM_CONCURRENCY=3 to bulk job container — claude-agent-service is
  now bounded-concurrency (MAX_CONCURRENCY=10), no longer single-flight.
  Strictly serial extraction (default 1) is no longer necessary.

TODO: flip run_examples_bulk_ingest=false after job completes and re-apply
to push the weekly CronJob model upgrade (qwen3vl-4b→qwen3-8b) which
didn't land in this apply (TF timed out waiting for Job completion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
147a8cff40 Restore f1-stream stack — undo accidental bundling into 63fe7d2b
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).

This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
90ad6b9125 fan-control: presence-aware IPMI fan curve for the R730 PVE host
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design:  docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
c6f27fa172 wealth dashboard: enlarge returns numbers (drop stat name labels) [ci skip]
At h=4 the two stacked values per window panel were too small because each
also rendered its field-name label. Switch textMode value_and_name -> value
on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix
keep them self-identifying and the window stays in the panel title. Applied
via targeted tg apply of the configmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
dbe10a708c wealth dashboard: shrink returns stat panels to h=4 [ci skip]
The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to
h=4 (matching the overview stat cards directly above) and pull every panel
below up by 4 so the layout stays gap-free. Layout-only change — no panel
content/query touched. Applied via targeted tg apply of the configmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
fc1486c3dd wealth dashboard: replace returns table with per-window stat panels [ci skip]
Swap the single "Returns over time windows" table (panel 9201) for 5 stat
panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the
headline value + Δ market (£, net of contributions) as a second value,
colored red/green by sign.

Same per-window Modified-Dietz math as the old table, just scoped to one
interval per panel — verified against live wealthfolio_sync PG and reproduced
through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% /
£297,846, matching the prior table exactly). Kept the same 24×8 grid footprint
so nothing else on the dashboard reflows.

Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip]
because a full monitoring-stack CI apply would pull in unrelated pre-existing
drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
6cec27f8dc novelapp: bump Keel policy patch -> all (track any upstream version)
Explicitly own the keel.sh/policy annotation in TF (was relying on the
Kyverno-stamped `patch` default). Set policy=all + trigger=poll +
pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover
Keel-written runtime annotations (change-cause, update-time, revision,
match-tag).
2026-06-05 09:19:11 +00:00
Viktor Barzin
9cb609f21a nextcloud-todos: register only the Created webhook (drop Updated)
The agent acts only on newly-created todos; the Updated listener re-fired on
every edit (incl. the agent's own note-append). Live Updated webhook (id=2)
already deleted via OCS API.
2026-06-05 09:19:11 +00:00
Viktor Barzin
3d0cba9dcb openclaw: pin 2026.2.26, resilient startup, SHA-pinned plugin init (recover from agentRuntime + configSchema crashloop)
Surfaced while installing the nextcloud-todos-api plugin (a pod roll):
- 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself
  (commit 4b39cb72) -> crashloop on any restart. Pinned image back to 2026.2.26.
- startup steps (plugins enable / mcp set / memory index) backgrounded +
  timeout-guarded so a hung npm-install can never block the gateway.
- install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull:
  IfNotPresent served a stale cached :latest, so the plugin manifest
  (configSchema) fix never landed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
root
c01a28e23c Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
6cff9bac26 freshrss: migrate extensions PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn)
Frees a per-node SCSI-LUN slot on node6 (20->19, under the check #47 >=20
WARN). FreshRSS extensions are static plugin files (no embedded DB; app DB is
external MySQL) -> NFS-safe. Empty volume (re-installable). Applied
deadlock-safe: -target deployment+module first (Recreate releases old PVC),
then full apply destroys the now-unused proxmox PVC.
2026-06-05 09:19:11 +00:00
root
11b092a589 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
e2d46ebd30 isponsorblocktv: migrate data PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn)
Frees a per-node SCSI-LUN slot on node6 (21->20). Volume holds only
config.json (no embedded DB) -> NFS-safe. config pre-seeded to
/srv/nfs/isponsorblocktv before cutover. RWO-destroy deadlock during apply
(TF deleting the in-use old PVC before rolling the deployment) was broken by
patching the deployment claim to the NFS PVC; TF reconciled to the same value.
2026-06-05 09:19:11 +00:00
Viktor Barzin
9858a1c44b docs(add-user): document dashboard auto-login home-ns scope + foreign-namespace exception [ci skip]
Auto-login covers a user's k8s_users home namespace only (dashboard SA bound
there). For workloads in a separate/pre-existing namespace (gheorghe→novelapp),
that namespace must also grant the dashboard SA, not just the OIDC User. Best
practice: set k8s_users namespace = where the workload runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
root
ace6ee59f9 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
adec2c135f fix(novelapp): also bind gheorghe's dashboard SA to novelapp admin
His app lives in novelapp, but the dashboard injects his SA token
(system:serviceaccount:vabbit81:dashboard-vabbit81), while the existing
binding only granted the OIDC User vabbit81@gmail.com (OIDC blocked). Add the
SA as a second subject so the web dashboard (token-injector) can manage
novelapp. Verified: SA can list/create in novelapp; injector path returns 200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
8f13fdeaf7 docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip]
Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list
namespaces/nodes (for dashboard nav) but not read other tenants' resources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
7114824c06 fix(rbac): tighten dashboard SA cluster-read to namespaces+nodes only
namespace-owners could read all tenants' pods/configmaps/etc cluster-wide
(read-only) via the broad namespace_owner_readonly role. Give the dashboard
SAs a dedicated dashboard-nav-readonly ClusterRole = namespaces + nodes (list)
only — enough for the dashboard namespace-picker/Nodes view, but no
cross-tenant resource reads. Own-namespace access (admin) unchanged. Verified:
gheorghe can list namespaces/nodes + full vabbit81, but list pods/configmaps -A
= no, other namespaces = no.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
ae252f9116 cluster-health: ha_integrations — skip disabled + ignored config entries
check_ha_integrations counted any config entry with state=not_loaded as a
problem, but HA marks intentionally-off entries that way too: disabled_by
set (user/integration disabled it) and source=="ignore" (a discovered
integration the user chose to ignore — never meant to load). On ha-sofia
2026-06-04 this false-WARNed on 6 entries that are all intentional —
wyoming faster-whisper/piper + ollama (disabled_by=user) and
mass_queue/dlna_dms(EMO-LAPTOP2)/yalexs_ble (source=ignore).

Skip disabled/ignored entries; only genuine setup_error/setup_retry/
not_loaded (without disabled/ignore) now flag. Verified: check #27 -> PASS
"All 96 integrations loaded".
2026-06-05 09:19:11 +00:00
Viktor Barzin
dd2a8e640f monitoring: right-size loki memory request 3Gi->1Gi (quota 89%->79%)
monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the
ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real
usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi.
prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup;
OOMs at 3Gi during WAL replay) so it stays; grafana's main container is
already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi
Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the
WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit
(bump the 20Gi quota) as the cluster expands.
2026-06-05 09:19:11 +00:00
Viktor Barzin
90d7c11c16 state(vault): update encrypted state 2026-06-05 09:19:10 +00:00
Viktor Barzin
2707496b37 state(dbaas): update encrypted state 2026-06-05 09:19:10 +00:00
Viktor Barzin
31b8104b43 cluster-health: uptime_kuma check — only count status==0 as down
check_uptime_kuma flagged a monitor as down whenever its last heartbeat
status != 1, and treated "no beats" as down too. But uptime-kuma status 2 =
PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no
data. So a monitor caught in a momentary pending/retry state at check time
produced a false "internal/external down(N)" WARN — observed twice on
2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged
ZERO downs against over 24h (0/2880 and 0/288 beats).

Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending,
maintenance, and no-data are not-down. Real outages still flag (uptime-kuma
persists status==0 beats for genuine downs).
2026-06-05 09:19:10 +00:00
Viktor Barzin
bf6ede2b9e vault: deny secret/data/vault for claude-agent terraform-state policy (executor elevation safety narrowing)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
63ee655c08 monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold)
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but
the alert threshold was 32d (2764800s) -> it false-fired every year for the
~3 days between day-32 and the next run. Raised to 40d (3456000s): clears
the max first-Sunday spacing with margin, still catches a genuinely missed
monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified:
live rule now > 3.456e6, alert state inactive.
2026-06-05 09:19:10 +00:00
Viktor Barzin
c4bd64f88a docs: dashboard now auto-injects per-user SA token (no token-paste)
Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to
reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the
extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map
regen). [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
d649f4f287 feat(k8s-dashboard): auto-inject per-user SA token (no token-paste)
nginx token-injector behind the existing forward-auth: maps X-authentik-username
(the user's email, injected by Authentik) -> that user's ServiceAccount token ->
sets Authorization: Bearer -> kong-proxy. Dashboard auto-authenticates; users
never see the token prompt. Mirrors the t3-dispatch pattern. Token map lives in a
Secret (namespace-owners' cluster-read covers configmaps, not secrets). Verified:
gheorghe->vabbit81 pods 200 + kube-system 200 (cluster-read); viktor->nodes 200
(admin); unmapped->401. namespace-owners auto-derived from k8s_users; admins
hardcoded (their Authentik identity != k8s_users email).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
467eb7d7ee claude-agent: grant shared pod executor powers (Forgejo PR, terragrunt apply, kubectl write, MCP)
Elevates the shared claude-agent-service pod (SA claude-agent, ns
claude-agent) so the nextcloud-todos-exec agent can run autonomously.
Viktor explicitly chose to elevate the SHARED service knowing every
agent on the pod inherits these creds — each grant is security-sensitive
and flagged inline for review.

Vault (stacks/vault/main.tf):
- terraform-state k8s-auth role: add `claude-agent` to
  bound_service_account_names (was only `default` — the pod's own SA
  token could not log in, so scripts/tg apply died fetching the PG
  backend password). `default` kept.
- terraform-state policy broadened from `database/static-creds/pg-terraform-state`
  read only to read on database/static-creds/*, database/creds/*,
  secret/data/* and secret/metadata/* — what stacks read at plan/apply
  time. FLAG: grants the shared pod broad Vault READ (effectively all app
  secrets + rotating DB creds); not denied: secret/data/vault.

claude-agent-service stack (stacks/claude-agent-service/main.tf):
- ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token,
  viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url).
- git-init: add url.insteadOf rewrite to authenticate git pushes to
  forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API).
- New claude-agent-exec ClusterRole+Binding: cluster-wide
  get/list/watch/create/update/patch/delete on core (incl. secrets),
  apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the
  existing read-only claude-agent role; does NOT bind cluster-admin. FLAG:
  very broad — close to cluster-admin in blast radius.
- Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher
  sidecar (k8s-auth login role=terraform-state every 30m -> shared
  emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths.
- MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP,
  ${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster).

Not applied, not pushed — code only, for human review of the privilege grants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
b56a868b4e wealthfolio-sync: podAffinity to co-locate with app pod (RWO multi-attach fix)
The monthly wealthfolio-sync CronJob mounts the same RWO
wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the
always-running wealthfolio app Deployment. RWO attaches to only one node,
but the sync had no affinity — so the 2026-06-01 run landed on node4 while
the app held the volume on node3 and hung in ContainerCreating for 3 days
(FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN.

Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the
sync always schedules onto the app's node, where the volume is already
attached (RWO permits multiple pods on the same node). Verified: a fresh
sync run co-located on node3, attached cleanly, and broker-sync started.
2026-06-05 09:19:10 +00:00
Viktor Barzin
8e44ccaa65 docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked)
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
e4c3fbbbbb feat(authentik): adopt admin-services-restriction policy; admit kubernetes-* groups to k8s dashboard
Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me
was Home-Server-Admins-only. Carve-out: the dashboard host now also admits
kubernetes-admins/power-users/namespace-owners so they can reach the login page;
per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf).
All other admin-only hosts unchanged. Policy adopted from UI into TF via import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
317989f9d5 feat(rbac): per-namespace-owner dashboard SA + long-lived token
Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner
(from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s)
+ cluster read-only, plus a long-lived token to paste into the dashboard
'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency.
Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
4aa6e7a5af chrome-service docs: clarify f1-stream is not a real caller
stacks/f1-stream/files/backend/playback_verifier.py and
chrome_browser.py describe an in-cluster CDP caller, but the deployed
f1-stream image is built from github.com/ViktorBarzin/f1-stream which
has neither file — verified by `kubectl exec ls /app/backend/` and
grepping for 'CHROME' in the deployed pod.

The infra/stacks/f1-stream/files/backend/ tree is a vestigial design
that was never wired up to a build pipeline. Calling it out so the
next reader doesn't waste time debugging why the migration "didn't
take effect" — it took effect on dead code.

The hourly snapshot-harvester CronJob is the only live in-cluster
caller of the CDP endpoint today.
2026-06-05 09:19:10 +00:00
root
d479d5b4f9 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:10 +00:00
Viktor Barzin
deede6dd11 chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.

Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
  fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
  header to bypass Chrome's hardcoded DNS-rebinding protection (no
  `--remote-allow-hosts` flag exists in stock Chrome 130; verified by
  binary string grep). Bridge also forces Connection: close on HTTP
  responses so Node ws opens a fresh TCP for the WS upgrade rather than
  trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
  actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
  GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
  bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
  CDP, dumps storage_state() (cookies + localStorage), writes atomically
  to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.

Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
  env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
  longer used by code; ExternalSecret kept for symmetry with the snapshot
  endpoint).

Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
  so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
  snapshot via the bearer-gated HTTPS endpoint.

Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
  failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.

Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
2026-06-05 09:19:10 +00:00