infra

Author	SHA1	Message	Date
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	5d9417fbaa	workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip] ADR-0004's premise was wrong: pushing master fires the Woodpecker apply pipeline (require_approval=forks only), so master pushes ARE deploys. Added Forgejo branch protection on master (push/merge whitelist=viktor, deploy keys allowed); non-admins contribute via branches + PRs. emo (ebarzin): write collaborator on viktor/infra, PAT in ~/.git-credentials, forgejo remote + upstream in his locked clone. Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they ARE the skel shared-base mechanism — plan step 4c obsolete). Offboard runbook: revoke PAT + collaborator + group steps added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:30:41 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	dacd9d2d8a	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	bccaa08d8e	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	c611ecf84d	workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip] multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:27:17 +00:00
Viktor Barzin	d4ec5768b2	vault-token-renew: version the devvm renewer + user units in the repo The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.) - scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access). - scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions). - docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery. - docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	bc33cd5ac4	monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook The Synology offsite backup target (/mnt/synology-backup, surfaced via the PVE host NFS mount) sits at ~94% by design and was firing NodeFilesystemFull continuously. Per user request, raise the threshold to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem rule, so this also loosens the warning on k8s node/system disks; BackupDiskFull (sda /mnt/backup) stays at 85%. Also adds docs/runbooks/synology-storage.md: how to assess Synology usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup), btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment (94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup candidates (redundant gphotos Takeout, old laptop VM images, archives). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:18:31 +00:00
Viktor Barzin	6442978f07	fan-control: merge Fan %/RPM dashboard cards + RPM estimate fallback [ci skip] The Fan % and Fan RPM sensor-graph cards had identical trend shapes (RPM ∝ %), so merge them into one "Fan speed" card: % trend (stable Pushgateway sensor) + RPM beneath. RPM reads sensor.r730_fan_speed (Redfish) but falls back to the calibrated estimate (rpm≈160·%+1520, shown with a "~" prefix) when that sensor is unavailable — it blips out intermittently, so the readout never goes blank. The Override readout likewise shows both "% · rpm". HA-side only; daemon unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:31:32 +00:00
Viktor Barzin	405ca79531	fan-control: Override slider now tracks live fan speed while unlocked [ci skip] The dashboard Override slider used to show a stale stored % (e.g. 5%) while the fans were actually at ~53%, which was confusing. Add automation.r730_fan_override_track_live_speed_while_unlocked: while unlocked it mirrors the live commanded % (sensor.r730_fan_control_target) into the Override, so it always shows the actual absolute fan speed and updates as the fan moves. While locked it stops tracking and is the user's editable setpoint. The readout under the slider now shows the live "% · rpm" (actual, not an estimate). HA-side only; daemon unchanged. Verified live: slider forced to 10 → synced to 58 target. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:20:38 +00:00
Viktor Barzin	c059405632	fan-control: simplify HA dashboard + Lock = freeze-current/algo-off [ci skip] The dashboard-it Server → Fans view is now minimal: fan speed (% + RPM), an Override % slider, and a Lock toggle. Lock now means "freeze the current speed, algorithm off" — a new automation (r730_fan_lock_freeze_current_speed_resume_algo) snapshots the live target % into Override and sets mode=manual on lock-ON, and mode=auto on lock-OFF. The host daemon is unchanged (the toggle just drives the mode it already reads). cool/quiet stay reachable via the entity but are off the simplified view; the 60-min auto-revert is kept as a dormant safety net. Verified live: lock ON → mode=manual + Override captured the live 60%; lock OFF → auto. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:27:46 +00:00
Viktor Barzin	d17b25cdcc	fan-control: document the HA Fan Lock (opt out of 60-min auto-revert) [ci skip] A manual/cool/quiet override in HA auto-reverts to `auto` after 60 min. Add a Fan Lock (`input_boolean.r730_fan_lock`) that gates that automation so a deliberate override persists, with a visible "🔒 FAN CONTROL LOCKED" banner on the dashboard-it Server view so it isn't forgotten. The automation re-checks the lock after the hour (locking mid-countdown cancels the revert) and the 83 °C ceiling still wins. HA-side only (helper + automation + dashboard live on ha-sofia, auto-git-tracked there); these docs are the infra-repo record. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:22:00 +00:00
Viktor Barzin	51456a96f6	fan-control: estimate + expose fan power (fan_watts_est) The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the 2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est. Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the dashboard-it Server view, next to total power. 46 bash tests green; verified live (9120rpm -> ~15W est). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:10:27 +00:00
Viktor Barzin	324f2dc3bf	fan-control: continuous linear curve (replaces discrete step-bands) Replace the step-band fan curve with a continuous linear ramp — the bands flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis is the homelab standard; PID is overkill for this slow thermal loop. fan% now interpolates between env-tunable anchors: COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium) QUIET 68C/20% -> 83C/100% (near-silent until ~70C) Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold. 41 bash tests green; deployed + verified live (59C -> 49%, smooth). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:29:35 +00:00
Viktor Barzin	945c1936e3	fan-control docs: HA control (mode/manual-% + auto-revert + dashboard) Document the HA-control feature shipped in `8beca1df`: the daemon reads the ha-sofia r730_fan_mode/manual_pct helpers, the 60-min auto-revert automation, and the dashboard-it Server-view sensors + control tiles. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:29:35 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ad3432d685	docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver) Update authentication.md (structured multi-issuer AuthenticationConfiguration + dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state (new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth), and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config → re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint pivot from the original dual-aud approach. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	f0948493b3	claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY) The service now runs agent calls concurrently (bounded semaphore, per-job isolated clones) instead of single-flight. Infra side: - mount git-crypt-key into the main container (each job re-unlocks its own clone) - MAX_CONCURRENCY=10 env (excess calls queue FIFO) - bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom - docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot Service code: viktor/claude-agent-service @ 66104a3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	aa0d6511b2	job-hunter runbook: document two self baselines + taxable_pay gotcha Dashboard now shows two 'Me' bars: realized gross (~£409k, from SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k, levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:13:35 +00:00
Viktor Barzin	50a4ad70f0	job-hunter runbook: self-comp re-seed stores full TC breakdown total_value (what the comparison bar uses) must be full TC; document storing base+bonus+RSU components too so it's verifiable that RSU+bonus are included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 22:23:42 +00:00
Viktor Barzin	deb0dd4778	monitoring: "Your comp vs the market" panel on Job Hunter dashboard Add a barchart (panel 10) ranking every company's London p50 total comp (COALESCE total/base) with the user's current comp shown in line, so it's a direct "how do I compare" view. The user's figure is NOT hardcoded in the dashboard JSON — it's a labeled comp_point in the DB (company_slug 'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number out of git. It's below the £500k alert bar (no Slack ping) and ranks too low to appear in analyze leaders. Runbook documents the panel + how to update the baseline. [ci skip] — dashboard ConfigMap applied locally (targeted). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 21:27:26 +00:00
Viktor Barzin	74313149dd	job-hunter: weekly above-target Slack alert CronJob Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh): `python -m job_hunter alert --threshold 500000 --location london --slack` posts to Slack the companies whose London p50 total comp >= £500k, flagging any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via the job-hunter-secrets ExternalSecret from Vault secret/job-hunter slack_webhook_url (seeded from the shared workspace webhook; repointable to a dedicated channel). Runbook gains an "above-target Slack alert" section. [ci skip] — applied locally (stack-scoped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:49:42 +00:00
Viktor Barzin	fe8db19aaf	job-hunter: build-triggers-deploy model; CronJob :latest + docs CI now drives the Deployment rollout (kubectl set image to the build SHA in .woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment runs whatever CI last set (image ignore_changes keeps TF from fighting it), and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly run). Keel stays enrolled in parallel as a redundant net. Docs: rewrite the runbook "Deploying" section for build-triggers-deploy; record the reversal of decision #12 in the auto-upgrade design doc (owned apps drive their own rollout, Keel parallel — upstream stays Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section. [ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:24:50 +00:00
Viktor Barzin	cda858d560	job-hunter: weekly refresh CronJob + ops/analyst runbook Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs `refresh --source ats --source hn --source levels_fyi`, which upserts roles/ comp AND appends the dated comp_snapshots/roles_snapshots series consumed by `job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container so a refresh never runs against an un-migrated DB; concurrency Forbid, backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore. Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and analyst (the analyze report, query recipes, SQL trend queries against the snapshot tables, interpretation caveats) sections. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:37:57 +00:00
Viktor Barzin	de09e8f294	immich runbook: note force=false re-kick gotcha after row deletion [ci skip] The videoConversion enqueue is an async scan; deleting encoded_video rows while a prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the first pass). Re-trigger force=false once the queue first drains to waiting:0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	b651f137b9	docs(kms): SXSMSI/1603 is client-machine-specific (VM 300 pilot) + deep-repair/escalation Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap + the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install, CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent [Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep repair, the DeepRepairDone marker + in-place-repair escalation, and the low-disk/guest-agent-drop gotchas hit during the pilot.	2026-06-02 19:24:30 +00:00
Viktor Barzin	481585f6e6	immich: cap streaming transcode bitrate to fix 4K video stutter [ci skip] Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast + targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single stream needs ~10-13.5 MB/s and stuttered for every client, local and remote. Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium, transcode=bitrate. 4K resolution preserved; originals never modified. Existing oversized transcodes regenerated by deleting their asset_file encoded_video rows + videoConversion force=false (concurrency 1). Document config + add runbook docs/runbooks/immich-transcode-bitrate.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	9fb3e6e851	docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip] Real root cause of the 2026-06-01 full-site 502 was not a missed reference but an out-of-band fix that Terraform reverted: the 2026-05-30 Traefik .200->.203 migration repointed the Cloudflare tunnel to the Traefik service DNS via the CF Global API Key, but never landed that change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01 reconciled live back to the stale .200, breaking all external ingress. Rewrite the post-mortem around the "codify out-of-band fixes or TF reverts them" lesson (a Terraform-Only-rule violation). Also fix docs/runbooks/kms-public-exposure.md, which still claimed Traefik served on 10.0.20.200:443 (now .203) — same migration fallout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:25:33 +00:00
Viktor Barzin	30a644d3cd	docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status The bundled consumer Office removal leaves a pending reboot; a same-run VL install (or re-run before rebooting) fails with setup.exe 1603. Document the two guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and the on-disk completion poll. Record that the uninstall path is now verified on a real M365 box (O365HomePremRetail removed) and the install needs a reboot first.	2026-06-01 21:22:05 +00:00
Viktor Barzin	599d67db51	docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	3fa9e2409c	runbook: K8s worker scaling for PVC capacity headroom Documents the 6-worker cluster shape (post 2026-05-26 scale-up after the proxmox-csi LUN-cap incident), the six binding constraints (plugin LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration on node1, PVE host memory, no Terraform management for K8s VMs), and the playbooks for adding/removing workers. Scale-up triggers: - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days - cluster memory requests > 90% - LUN-cap incident - planned ≥3 net-new block PVCs when max VA already ≥ 22 Scale-down conditions: - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days Playbooks lean on scripts/provision-k8s-worker (clones template 2000, cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete node → qm shutdown for removes. Cold-spare option documented. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md, beads code-oflt (IO contention long-term fix).	2026-06-01 19:50:41 +00:00
Viktor Barzin	1c165ce5b4	docs(kms): document the consequence-gated edition switch (changepk + ODT) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:27 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr \| iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	d6590612b2	immich: bulk-import Anca's Elements photo archive into her account Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB of new originals landing under /srv/nfs/immich/upload during the import. Adds: - module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements, consumed only by the import Job (not mounted in immich-server). - kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader posting to immich-server.immich.svc:2283 with Anca's API key (synced via the existing immich-secrets ExternalSecret from secret/immich.anca_api_key). Filters to image extensions, bans the non-photo top-level dirs (filme/, Music/, carti/, courses, installers, docs, etc.), puts every asset in the album "Poze (Elements)". Default `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs. - docs/architecture/storage.md — note the new 4 TB size in 3 places. - docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend procedure (no pve-host TF stack exists for this). Job is removed in the follow-up cleanup commit once the upload completes; the PVC stays for a videos batch later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:12:30 +00:00
Viktor Barzin	34f8c0f537	docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface - docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser" section documenting mount-per-archive + applicable_users model, why mount-level ACL beats Files Access Control on NC 30/31, the manifest shape (with current applicableUsers + enableSharing fields), and the trade-off - docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface a new directory under /srv/nfs/* to specific NC users via the bootstrap Job - scripts/anca-elements-sync.sh: deployed at /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from Synology Anca/Elements to /srv/nfs/anca-elements (idempotent + resumable). The PVE replica is what the NC /anca-elements mount serves; the offsite-sync pipeline excludes this path (committed earlier this session) so we don't write it back to Synology NC usernames are admin/anca/emo (not display names — admin is Viktor). Stale "viktor" references in the manifest example dropped.	2026-05-24 11:45:01 +00:00
Viktor Barzin	6024cfb410	docs: update MySQL restore runbook + CLAUDE.md after 8.4.9 recovery Runbook rewritten for the standalone setup (InnoDB Cluster gone since 2026-04-16) and now covers the full disaster-recovery flow we just executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain → Delete), re-apply TF, restore via in-namespace Job, drop+create static users with fresh Vault passwords, restart dependents. CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 22:51:52 +00:00
Viktor Barzin	01de3babd6	docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads code-8ywc and follow-up commits. Captures: - security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7, S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4. - monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1) and the Loki ruler → Alertmanager → #security routing path. - runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action steps, false-positive triage, and SEV1 escalation. - .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy, rationale for not adopting canary tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 19:10:16 +00:00
Viktor Barzin	48abb7c520	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:29:01 +00:00
Viktor Barzin	01bc16d592	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 23:54:22 +00:00
Viktor Barzin	a58d777059	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-10 19:07:42 +00:00
Viktor Barzin	a245e6e569	docs: add k8s node auto-upgrade runbook + architecture section The OS-side counterpart to the service-upgrade pipeline. Covers the unattended-upgrades + kured + sentinel-gate + Prometheus halt-on-alert design landed in `c0991f7f8`. Runbook: ops procedures (verify health, halt rollout, restore config to a re-imaged node, roll back a bad upgrade, investigate which alert is blocking). Architecture doc: extends the existing service-upgrade flow with a "K8s Node OS Upgrades" section (stack, sources of truth, day-2 mechanism, why-this-design rationale tied to the March 2026 post-mortem). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 17:26:15 +00:00
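
The scale-up triggers listed in commit 3fa9e2409c (K8s worker scaling runbook) can be read as a simple any-of check. The sketch below is illustrative only and is not taken from the repository: the `ClusterSnapshot` type, its field names, and the reason strings are hypothetical, and only the thresholds (29 LUNs per VM, VA count ≥ 25 for ≥ 7 days, memory requests > 90%, ≥ 3 planned block PVCs when max VA is already ≥ 22) come from the commit message.

```python
from dataclasses import dataclass

# Per-VM plugin LUN cap quoted in the commit message.
LUN_CAP_PER_VM = 29


@dataclass
class ClusterSnapshot:
    """Hypothetical input shape; every field name is an assumption."""
    max_node_va_count: int        # highest VolumeAttachment count on any node
    days_at_va_threshold: int     # consecutive days with that count >= 25
    memory_requests_pct: float    # cluster memory requests, percent of allocatable
    lun_cap_incident: bool        # a LUN-cap incident has occurred
    planned_new_block_pvcs: int   # net-new block PVCs planned


def scale_up_reasons(s: ClusterSnapshot) -> list[str]:
    """Return the triggers from the commit message that currently fire."""
    reasons = []
    if s.max_node_va_count >= 25 and s.days_at_va_threshold >= 7:
        reasons.append("sustained-va-pressure")
    if s.memory_requests_pct > 90:
        reasons.append("memory-requests")
    if s.lun_cap_incident:
        reasons.append("lun-cap-incident")
    if s.planned_new_block_pvcs >= 3 and s.max_node_va_count >= 22:
        reasons.append("planned-block-pvcs")
    return reasons


def va_utilisation_pct(va_count: int) -> float:
    """Share of the per-VM LUN cap in use, as a percentage."""
    return 100.0 * va_count / LUN_CAP_PER_VM
```

Any non-empty result would indicate that adding a worker is worth considering under the runbook's criteria. The scale-down conditions are left out because the commit message does not spell out how the two memory percentages are applied.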

83 commits