infra

Author	SHA1	Message	Date
Viktor Barzin	6442978f07	fan-control: merge Fan %/RPM dashboard cards + RPM estimate fallback [ci skip] The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %), so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) + RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor is unavailable — it blips out intermittently, so the readout never goes blank. The Override readout likewise shows both "% · rpm". HA-side only; daemon unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:31:32 +00:00
Viktor Barzin	722a1c9b42	docs(monitoring): document rpi-sofia off-box monitoring + log shipping [ci skip] Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia alert group; the dashboard; and the systemd watchdog. Notes the SD-card root cause and that the Pi-side config is hand-managed + backed up off-box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:25:20 +00:00
Viktor Barzin	405ca79531	fan-control: Override slider now tracks live fan speed while unlocked [ci skip] The dashboard Override slider used to show a stale stored % (e.g. 5%) while the fans were actually at ~53%, which was confusing. Add automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it mirrors the live commanded % (sensor.r730_fan_control_target) into the Override, so it always shows the actual absolute fan speed and updates as the fan moves. While locked it stops tracking and is the user's editable setpoint. The readout under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:20:38 +00:00
Viktor Barzin	c059405632	fan-control: simplify HA dashboard + Lock = freeze-current/algo-off [ci skip] The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an Override % slider, and a Lock toggle. Lock now means "freeze the current speed, algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo) snapshots the live target % into Override and sets mode=manual on lock-ON, and mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the mode it already reads). cool/quiet stay reachable via the entity but are off the simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:27:46 +00:00
Viktor Barzin	d17b25cdcc	fan-control: document the HA Fan Lock (opt out of 60-min auto-revert) [ci skip] A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on the dashboard-it Server view so it isn't forgotten. The automation re-checks the lock after the hour (locking mid-countdown cancels the revert) and the 83 °C ceiling still wins. HA-side only (helper + automation + dashboard live on ha-sofia, auto-git-tracked there); these docs are the infra-repo record. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:22:00 +00:00
Viktor Barzin	51456a96f6	fan-control: estimate + expose fan power (fan_watts_est) The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the 2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est. Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the dashboard-it Server view, next to total power. 46 bash tests green; verified live (9120rpm -> ~15W est). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:10:27 +00:00
Viktor Barzin	324f2dc3bf	fan-control: continuous linear curve (replaces discrete step-bands) Replace the step-band fan curve with a continuous linear ramp — the bands flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis is the homelab standard; PID is overkill for this slow thermal loop. fan% now interpolates between env-tunable anchors: COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium) QUIET 68C/20% -> 83C/100% (near-silent until ~70C) Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold. 41 bash tests green; deployed + verified live (59C -> 49%, smooth). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:29:35 +00:00
Viktor Barzin	945c1936e3	fan-control docs: HA control (mode/manual-% + auto-revert + dashboard) Document the HA-control feature shipped in `8beca1df`: the daemon reads the ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation, and the dashboard-it Server-view sensors + control tiles. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:29:35 +00:00
Viktor Barzin	3796a84e04	docs: f1-stream is Woodpecker-native (Forgejo viktor/f1-stream), not GHA/repo-10 f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims: - CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native owned-app group; note old github source archived + GHA Woodpecker repo 10 deactivated; f1-stream is now Woodpecker repo 166. - service-catalog: note the source repo + deploy model.	2026-06-05 09:19:12 +00:00
Viktor Barzin	e8bfb4d06b	f1-stream: consume Forgejo-registry image; drop in-monorepo source The actively-developed f1-stream (infra files/ copy: 12 active extractors + Playwright/chrome-service verifier) is now its own repo viktor/f1-stream and is the deployed app (replacing the stale March github build). - main.tf: image -> forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag} + image_pull_secrets registry-credentials. Image stays in KEEL_IGNORE_IMAGE. - Remove stacks/f1-stream/files/ (source now in viktor/f1-stream). - docs/plans: extraction design + plan pair. Applied via tg + kubectl set image to forgejo:24857a82; live /health green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	99f9bf8d89	fan-control: power-tune COOL curve to the 60% efficiency knee Power/temp sweep (2026-06-05) located the cooling-per-watt knee at ~60%: 60->70% buys only -2C for +21W, and 70->100% buys 0C for +54W (the CPU floors ~59C at cluster load, so more airflow does nothing). Re-tune the COOL curve to cap its normal band at 60% (~303W, ~61C); 80/100% become a high-load safety ramp (>=73/79C) before the 83C ceiling. QUIET unchanged (already at the 281W / 4800rpm floor). Saves up to ~75W (~650 kWh/yr) vs full-tilt for the last ~2C. Tests + design doc updated; verified live (63C, 60%, ~267W). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	147a8cff40	Restore f1-stream stack — undo accidental bundling into 63fe7d2b Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the shared infra working tree and inadvertently swept in a parallel session's staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals, ci-cd.md + .claude docs, two extraction plan docs). This returns every f1-stream-related path to its pre-63fe7d2b state (3493c347) so that extraction can be committed cleanly by its own session. The fan-control files added in 63fe7d2b are untouched. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	8f13fdeaf7	docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip] Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list namespaces/nodes (for dashboard nav) but not read other tenants' resources. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	c4bd64f88a	docs: dashboard now auto-injects per-user SA token (no token-paste) Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map regen). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	8e44ccaa65	docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked) Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality: apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste. Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state, and add-user skill (onboarding now documents the dashboard token). oauth2-proxy + k8s-dashboard OIDC app noted as idle. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	4aa6e7a5af	chrome-service docs: clarify f1-stream is not a real caller stacks/f1-stream/files/backend/playback_verifier.py and chrome_browser.py describe an in-cluster CDP caller, but the deployed f1-stream image is built from github.com/ViktorBarzin/f1-stream which has neither file — verified by `kubectl exec ls /app/backend/` and grepping for 'CHROME' in the deployed pod. The infra/stacks/f1-stream/files/backend/ tree is a vestigial design that was never wired up to a build pipeline. Calling it out so the next reader doesn't waste time debugging why the migration "didn't take effect" — it took effect on dead code. The hourly snapshot-harvester CronJob is the only live in-cluster caller of the CDP endpoint today.	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ad3432d685	docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver) Update authentication.md (structured multi-issuer AuthenticationConfiguration + dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state (new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth), and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config → re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint pivot from the original dual-aud approach. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	549320f79c	docs(k8s-dashboard): SSO via Authentik oauth2-proxy — implementation plan [ci skip] Task-by-task plan: Vault secret, Authentik OIDC app (TF), oauth2-proxy deploy, ingress cutover with blocking audience-verification gate, docs. Additive + one revertible ingress repoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	8b72eaebb0	docs(k8s-dashboard): SSO via Authentik oauth2-proxy — design [ci skip] Design for letting namespace-owner users (e.g. gheorghe/vabbit81) open the K8s Dashboard with their Authentik account, mapped to their per-user RBAC. oauth2-proxy fronts kong-proxy, runs the OIDC code-flow, and injects the user's id_token as Bearer so the apiserver applies existing namespace-owner bindings. Additive + one ingress repoint; multi-audience scope mapping keeps the CLI flow untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7d7a0ad474	infra: fix stale Traefik LB-IP refs + accurate LB-IP registry Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead .200; this fixes the two in-Terraform ones and replaces the stale networking doc with an accurate registry + a renumber checklist. - woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200 (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and break pipeline creation). Now reads the Traefik ClusterIP dynamically via a kubernetes_service data source -- cannot rot on a future renumber and avoids the ETP=Local hairpin trap. - monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct). - docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	c7cf21a986	Revert mail LAN-redirect approach; pending VIP-based redesign The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203 (Traefik LB IP) as the redirect source. That couples mail's LAN path to Traefik's IP choice — if Traefik moves again (it just moved .200 → .203 on 2026-05-30), the mail path silently breaks. Removing the script and the matching doc paragraph; keeping the networking.md .200 → .203 staleness fix (separate correction). Follow-up: give the mail HAProxy listener a dedicated pfSense Virtual IP (IP Alias on opt1), update Technitium internal zone + WAN port-forwards to target the VIP, so mail's LAN-side path is decoupled from any other service's LB IP.	2026-06-03 10:24:25 +00:00
Viktor Barzin	922d95af9c	Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0843e398b	Revert "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit 4cc9229e716b6683418a148a0f896442d5ab07ad.	2026-06-03 10:24:25 +00:00
Viktor Barzin	0c7ec3d470	tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse Reconciles the tripit stack source with live state and adds the forward flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded), filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's Authentik login identity). Adds an ingest-plans CronJob that polls spam@ filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target) so forwarded bookings are extracted and attached to the matching trip; IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	fd35c4f303	pfSense: LAN-side NAT redirect for mail ports landing on Traefik LB IP Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203 (Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me (and imap./smtp.) get sent to .203 too — where Traefik does not listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs while Roundcube (port 443 via Traefik) keeps working. Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993} gets redirected to 10.0.20.1 (the mail HAProxy listener already serving the public path). Loaded on every incoming interface by pfSense rule generation, so any LAN/VPN client falling into the split-horizon answer lands on the right service unchanged. Includes idempotent reproducer script (mirrors the existing pfsense-haproxy-bootstrap.php pattern) and the networking.md mail carve-out paragraph plus the stale .200 → .203 reference.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0948493b3	claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY) The service now runs agent calls concurrently (bounded semaphore, per-job isolated clones) instead of single-flight. Infra side: - mount git-crypt-key into the main container (each job re-unlocks its own clone) - MAX_CONCURRENCY=10 env (excess calls queue FIFO) - bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom - docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot Service code: viktor/claude-agent-service @ 66104a3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	aa0d6511b2	job-hunter runbook: document two self baselines + taxable_pay gotcha Dashboard now shows two 'Me' bars: realized gross (~£409k, from SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k, levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:13:35 +00:00
Viktor Barzin	50a4ad70f0	job-hunter runbook: self-comp re-seed stores full TC breakdown total_value (what the comparison bar uses) must be full TC; document storing base+bonus+RSU components too so it's verifiable that RSU+bonus are included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 22:23:42 +00:00
Viktor Barzin	deb0dd4778	monitoring: "Your comp vs the market" panel on Job Hunter dashboard Add a barchart (panel 10) ranking every company's London p50 total comp (COALESCE total/base) with the user's current comp shown in line, so it's a direct "how do I compare" view. The user's figure is NOT hardcoded in the dashboard JSON — it's a labeled comp_point in the DB (company_slug 'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number out of git. It's below the £500k alert bar (no Slack ping) and ranks too low to appear in analyze leaders. Runbook documents the panel + how to update the baseline. [ci skip] — dashboard ConfigMap applied locally (targeted). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 21:27:26 +00:00
Viktor Barzin	74313149dd	job-hunter: weekly above-target Slack alert CronJob Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh): `python -m job_hunter alert --threshold 500000 --location london --slack` posts to Slack the companies whose London p50 total comp >= £500k, flagging any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via the job-hunter-secrets ExternalSecret from Vault secret/job-hunter slack_webhook_url (seeded from the shared workspace webhook; repointable to a dedicated channel). Runbook gains an "above-target Slack alert" section. [ci skip] — applied locally (stack-scoped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:49:42 +00:00
Viktor Barzin	fe8db19aaf	job-hunter: build-triggers-deploy model; CronJob :latest + docs CI now drives the Deployment rollout (kubectl set image to the build SHA in .woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment runs whatever CI last set (image ignore_changes keeps TF from fighting it), and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly run). Keel stays enrolled in parallel as a redundant net. Docs: rewrite the runbook "Deploying" section for build-triggers-deploy; record the reversal of decision #12 in the auto-upgrade design doc (owned apps drive their own rollout, Keel parallel — upstream stays Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section. [ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:24:50 +00:00
Viktor Barzin	052c776eba	immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm. Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:16:11 +00:00
Viktor Barzin	cda858d560	job-hunter: weekly refresh CronJob + ops/analyst runbook Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs `refresh --source ats --source hn --source levels_fyi`, which upserts roles/ comp AND appends the dated comp_snapshots/roles_snapshots series consumed by `job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container so a refresh never runs against an un-migrated DB; concurrency Forbid, backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore. Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and analyst (the analyze report, query recipes, SQL trend queries against the snapshot tables, interpretation caveats) sections. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:37:57 +00:00
Viktor Barzin	de09e8f294	immich runbook: note force=false re-kick gotcha after row deletion [ci skip] The videoConversion enqueue is an async scan; deleting encoded_video rows while a prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the first pass). Re-trigger force=false once the queue first drains to waiting:0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	b651f137b9	docs(kms): SXSMSI/1603 is client-machine-specific (VM 300 pilot) + deep-repair/escalation Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap + the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install, CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent [Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep repair, the DeepRepairDone marker + in-place-repair escalation, and the low-disk/guest-agent-drop gotchas hit during the pilot.	2026-06-02 19:24:30 +00:00
Viktor Barzin	481585f6e6	immich: cap streaming transcode bitrate to fix 4K video stutter [ci skip] Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast + targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single stream needs ~10-13.5 MB/s and stuttered for every client, local and remote. Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium, transcode=bitrate. 4K resolution preserved; originals never modified. Existing oversized transcodes regenerated by deleting their asset_file encoded_video rows + videoConversion force=false (concurrency 1). Document config + add runbook docs/runbooks/immich-transcode-bitrate.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	deec540fad	t3code: docs — auto-provisioning service-catalog entry + design status implemented Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	2152430b70	docs(t3code): record discovered t3 web-auth contract	2026-06-02 19:24:30 +00:00
Viktor Barzin	a09b0b3612	docs(t3code): implementation plan for per-user auto-provisioning Task-by-task plan pairing with the design doc: Task 1 discovers the t3 web-auth contract (cookie name + bootstrap body), then systemd template, reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:19:22 +00:00
Viktor Barzin	1a0647c7ed	docs(t3code): design for per-user auto-provisioning (Authentik login → instance + session) Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service template (User=%i enforces file permissions); devvm reconcile; devvm dispatch+auto-pair service (mints + injects the t3 session cookie on first authenticated visit, replacing the in-cluster nginx). Spec for review before writing the implementation plan. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:10:05 +00:00
Viktor Barzin	55ed50b932	docs(plans): wealth dashboard consolidation design Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the 3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into collapsed rows. Also rebuild the projection as a Trend panel (numeric years-from-today x-axis) so it renders regardless of the dashboard time range (fixes empty-by-default). Philosophy: merge duplicates, keep every metric. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:52:59 +00:00
Viktor Barzin	9fb3e6e851	docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip] Real root cause of the 2026-06-01 full-site 502 was not a missed reference but an out-of-band fix that Terraform reverted: the 2026-05-30 Traefik .200->.203 migration repointed the Cloudflare tunnel to the Traefik service DNS via the CF Global API Key, but never landed that change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01 reconciled live back to the stale .200, breaking all external ingress. Rewrite the post-mortem around the "codify out-of-band fixes or TF reverts them" lesson (a Terraform-Only-rule violation). Also fix docs/runbooks/kms-public-exposure.md, which still claimed Traefik served on 10.0.20.200:443 (now .203) — same migration fallout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:25:33 +00:00
Viktor Barzin	f807050eb5	cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip] The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit `0c01adac`). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage. Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move. Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk). Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	30a644d3cd	docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status The bundled consumer Office removal leaves a pending reboot; a same-run VL install (or re-run before rebooting) fails with setup.exe 1603. Document the two guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and the on-disk completion poll. Record that the uninstall path is now verified on a real M365 box (O365HomePremRetail removed) and the install needs a reboot first.	2026-06-01 21:22:05 +00:00
Viktor Barzin	82855848d1	plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief) Decision-support doc, NOT a commitment. Evaluates whether replacing proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling permanently and at what cost. Key trade-off documented: TopoLVM PVCs are pinned to the node where the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs migrate between VMs when pods reschedule. The data-locality penalty matters most for single-replica stateful services (MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft) absorb it. Three disk-layout options: A. Carve per-VM data disks from sdc — simple, no hardware, IO contention unchanged B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free C. Add a dedicated NVMe — also closes beads code-oflt (IO contention), ~£200 hardware investment Effort estimate: 2.5-3 weeks of focused work for the full migration; covers TopoLVM install, lvmd config, per-VM disk provisioning, LUKS plumbing, 5 migration waves (regenerable → huge PVCs), backup-pipeline rewrite, deprecation. Recommended next step before committing: small pilot on k8s-node5/6 with one non-critical PVC to validate the operational pattern end-to-end. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative), beads code-oflt (IO isolation).	2026-06-01 21:22:05 +00:00
Viktor Barzin	599d67db51	docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00

257 commits