infra

Author	SHA1	Message	Date
root	d479d5b4f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. 
Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ea1e4f793b	revert(k8s-dashboard): restore forward-auth ingress (apiserver OIDC unresolved) Dashboard back to the working forward-auth + kong-proxy state. The oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured AuthenticationConfiguration), despite verified signature, issuer, audience, email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app left deployed (idle, harmless) pending that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c958f6a589	feat(nextcloud-todos): Phase 4 IaC — service stack, Vault role, DB bootstrap, OpenClaw plugin, monitoring Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the Nextcloud Personal task list; classifies todos via local qwen3-8b and routes research/mutating work through claude-agent-service). Clones the recruiter-responder service pattern end-to-end. Written only — NOT applied. - stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled), two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate deployment with alembic-migrate init-container, ClusterIP svc, /cb-only HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated events -> internal svc URL, Bearer auth). - stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d rotation) + pg-nextcloud-todos in the postgresql allowed_roles array. - stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource (clone of pg_tripit_db) — creates role+DB, pins role search_path, and creates schema nextcloud_todos AUTHORIZATION nextcloud_todos. - stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container, nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path ESO key (secret/nextcloud-todos.webhook_bearer_token). - stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP monitor. Prometheus /metrics scrape via svc annotations in the new stack. - .gitleaksignore: allowlist two curl-auth-user false positives (the OCS webhook curl uses a Vault-sourced shell var, not a literal credential). KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c3c3d5e010	feat(claude-agent-service): seed nextcloud-todos planner + exec agents Add cp lines in the seed-beads-agent init-container so the two new nextcloud-todos agent definitions (baked into the image at /usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in ~/.claude/agents/ at pod start. Phase 3, task 3.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	d0206848cb	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-05 09:19:09 +00:00
Viktor Barzin	55a8b238a0	mam-farming: migrate data volume proxmox-lvm → NFS The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM (k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then blocked every run → 0 grabs → MAMFarmingStuck. NFS (nfs_volume module, matching the other 9 servarr apps) removes this volume from the per-VM SCSI hotplug path entirely: it mounts over the network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across before the switch; verified a grabber run Succeeds on NFS on node4 with the preserved dedup list (tracked IDs carried over). Lever #1 from docs/architecture/storage.md "Per-VM SCSI-LUN cap".	2026-06-05 09:19:09 +00:00
Viktor Barzin	cb96d5d590	fix(k8s-dashboard): use email_verified=true + groups scope mappings The apiserver rejects the email username-claim when email_verified is false (invalid bearer token 401). Authentik external/social users are unverified, so the default scope-email mapping fails. Mirror the proven kubernetes provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and align oauth2-proxy scope to 'openid email profile groups'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	1042c0f082	fix(k8s-dashboard): set RS256 signing_key on Authentik OIDC provider Provider had signing_key=null → Authentik signed id_tokens with HS256 and served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature verification (500 'failed to verify id token signature' on the callback). Use the same 'authentik Self-signed Certificate' keypair the kubernetes provider uses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	e436af8d8c	fix(k8s-dashboard): drop group-restriction policy; RBAC is the gate The Authentik group policy denied admins: it gated on kubernetes-* group membership, but cluster access is email-based RBAC (User bindings from k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group, so the gate locked him out. Apiserver RBAC is now the sole gate — matching the kubelogin CLI (authenticate freely, RBAC decides actions). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	c9b22c7dd3	feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard issuer) and applies each user's own RBAC via the apiserver multi-issuer AuthenticationConfiguration. Committed so CI converges (uncommitted local applies were being reverted by the Woodpecker terragrunt-apply pipeline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ed4ed6bd09	fix(k8s-dashboard): ignore Keel/tier drift on oauth2-proxy deployment	2026-06-05 09:19:09 +00:00
Viktor Barzin	75c2b6dc5e	feat(rbac): apiserver multi-issuer OIDC via structured AuthenticationConfiguration Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped, silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1 AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the dashboard via SSO while keeping the CLI issuer working. Remote script health-gates /livez and auto-rolls-back on failure (single master). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	5b25ce1ec5	priority-pass: bump image_tag to 061a66ad [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `061a66ad3b`	2026-06-05 09:19:09 +00:00
Viktor Barzin	9c4335025d	feat(tripit): linked-email verification (SMTP + confirm carve-out) [ci skip] Adds outbound mail for linked-email verification: EMAIL_PROVIDER=smtp + SMTP_* app env (submits via the cluster mailserver as spam@, relayed by Brevo), SMTP_PASSWORD mapped to the existing PLANS_IMAP_PASSWORD (no new secret), and a token-gated /api/emails/confirm ingress carve-out (auth=none, like the calendar feed). Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	b8c55732e0	feat(k8s-dashboard): deploy oauth2-proxy (not yet wired to ingress) 2 replicas in kubernetes-dashboard ns; OIDC code-flow against the k8s-dashboard Authentik client, injects user id_token as Bearer upstream to kong-proxy. ESO syncs client/cookie secrets from Vault. Ingress still points at kong-proxy — no user-facing change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
root	7c4375d7cd	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:09 +00:00
Viktor Barzin	4ed0c5a834	uptime-kuma: codify Traefik LB internal monitor at .203 (was stale .200) A hand-created (non-TF) uptime-kuma monitor "Traefik LoadBalancer" (id=95) port-checked 10.0.20.200:443 — the shared LB IP Traefik moved OFF on 2026-05-30 when it took its dedicated .203 (ETP=Local). It had been DOWN for ~5 days, surfacing as the cluster-health "uptime_kuma internal down(1)" WARN. Add it to local.internal_monitors as "Traefik LoadBalancer (10.0.20.203)" (port 10.0.20.203:443) so it's managed like the TP-Link/Proxmox direct-IP probes — a direct check of the MetalLB L2 + Traefik bind, complementing the [External] traefik (full CF path) and Traefik Dashboard (in-cluster) monitors. The sync CronJob created it (id=902, reporting UP @1ms); the orphan id=95 was deleted via the uptime-kuma API.	2026-06-05 09:19:08 +00:00
Viktor Barzin	011c63c92d	feat(k8s-dashboard): add Authentik OIDC app for dashboard SSO Confidential client k8s-dashboard + custom scope mapping emitting aud=[kubernetes,k8s-dashboard] + group-restriction policy (kubernetes-* RBAC groups). Additive — dashboard ingress unchanged. Token via Vault secret/k8s-dashboard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	38c77048fd	chore(travel-agent): decommission — merged into tripit [ci skip] travel-agent's transport-to-airport + weather-brief workflows now run inside tripit (DB-driven instead of CalDAV), so the standalone CronJob stack is retired (namespace + ExternalSecret + 2 CronJobs destroyed via scripts/tg). secret/travel-agent left in Vault as an archive. Applied locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	be4ee7315a	feat(tripit): proactive-nudge CronJobs (transport + weather brief) [ci skip] Merge travel-agent's two workflows into tripit (beads code-muqi): adds tripit-transport-nudge (08:00) + tripit-weather-brief (21:00) CronJobs on Europe/London, an optional per-job timezone, and SLACK_BOT_TOKEN + DAWARICH_API_KEY in the tripit-secrets ExternalSecret. Nudges post to Slack #travel + Web Push; Dawarich location via the public host (in-cluster *.svc is 403'd by Rails host-auth). Vault secret/tripit seeded from secret/travel-agent + secret/owntracks. Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:06 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7302cd7908	infra: untrack generated backend.tf (stale PG creds + .200 literal) [CI SKIP] terragrunt generates backend.tf per run (remote_state generate, if_exists=overwrite_terragrunt) from get_env("PG_CONN_STR"); these 72 committed copies are stale artifacts already covered by .gitignore:65. They held a plaintext (Vault-rotated, ~expired) PG password + the .200 state-backend literal and were re-committed by CI on every run. git rm --cached stops that; they regenerate locally from PG_CONN_STR. The live .200:5432 literal now lives only in scripts/tg (its single bootstrap source). Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:52:46 +00:00
Viktor Barzin	7d7a0ad474	infra: fix stale Traefik LB-IP refs + accurate LB-IP registry Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead .200; this fixes the two in-Terraform ones and replaces the stale networking doc with an accurate registry + a renumber checklist. - woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200 (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and break pipeline creation). Now reads the Traefik ClusterIP dynamically via a kubernetes_service data source -- cannot rot on a future renumber and avoids the ETP=Local hairpin trap. - monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct). - docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	dcb7c74531	url/shlink: fix admin UI — pin shlink-web-client 4.7.1 + port 8080 The shlink-web admin SPA (shlink.viktorbarzin.me) showed "Something went wrong while loading short URLs". Root cause: the web client was untagged (:latest) and Keel's 2026-05-26 match-tag rewrite downgraded it to the ancient 0.1.1 (2019 image), which speaks the removed /rest/v1/authenticate API (404) and serves nginx on port 80. Backend (shlink:5.0.2) was healthy. Pin shlink-web-client to 4.7.1 (current stable; :latest/:stable resolve to it) and align container port + both probes + service target_port to 8080 (the port the 4.x nginx listens on). A clean semver anchor can no longer be Keel-downgraded to 0.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	922d95af9c	Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0843e398b	Revert "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit 4cc9229e716b6683418a148a0f896442d5ab07ad.	2026-06-03 10:24:25 +00:00
Viktor Barzin	0c7ec3d470	tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse Reconciles the tripit stack source with live state and adds the forward flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded), filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's Authentik login identity). Adds an ingest-plans CronJob that polls spam@ filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target) so forwarded bookings are extracted and attached to the matching trip; IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	ff26d1c957	openclaw: give recruiter-api plugin the Telegram bot token so it can announce The recruiter-api plugin's announceEvent() sends recruiter cards to Telegram via OPENLOBSTER_CHANNELS_TELEGRAM_TOKEN (its fallback path, since OpenClaw doesn't pass api.bot to "kind: tools" plugins). That env was never set in the container, so every hourly poll threw on the send, events were never marked consumed, and no Telegram notification ever went out — the rest of the "recruiter pipeline has no responses" problem (the GPU/triage half was fixed separately). Wire it from openclaw-secrets.telegram_bot_token (same token as channels.telegram.botToken). Verified: the 3 backlogged events were announced + consumed on the openclaw restart. Drafting (the /api/draft 500 that also degraded the cards) was fixed in parallel by swapping Vault secret/recruiter-responder gpt_mini_model from the slow/timing-out qwen3-coder-480b to meta/llama-3.3-70b-instruct (~1.6s). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
root	c85533d2d9	Woodpecker CI deploy [CI SKIP]	2026-06-03 10:24:25 +00:00
Viktor Barzin	982dc9e63a	openclaw: task-webhook ingress auth required->none (inbound Forgejo webhook) The task-webhook host is an inbound webhook receiver: Forgejo (a machine with no Authentik SSO cookie) POSTs issue/comment events, so forward-auth 302-bounced every delivery and silently dropped all webhooks. Flip only this ingress to auth=none; the do_POST handler gates on payload action + bot-user filtering. Gateway (openclaw) and openlobster stay auth=required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
root	91d110acf5	Woodpecker CI deploy [CI SKIP]	2026-06-03 10:24:24 +00:00
Viktor Barzin	fde2d19bf7	trading-bot: ingress auth required->app (app has own WebAuthn/JWT) The app ships complete auth — WebAuthn/passkey (RP_ID=trading.viktorbarzin.me) + JWT bearer on every /api/* route + a /ws?token=<JWT> WebSocket. Authentik forward-auth on / was 302-bouncing the WebAuthn XHR flow and the WS upgrade, making the app unusable. Flip to auth = "app" so the backend's own auth is the gate (same-origin SPA + bearer-token API, same pattern as immich). Verified all 11 route modules enforce Depends(get_current_user) and dev_mode defaults False before flipping. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	e18e0d51a0	uptime-kuma: public status pages + push monitors bypass Authentik The single uptime ingress gated the ENTIRE site (path "/") behind Authentik forward-auth, so public-by-design endpoints 302-bounced to SSO: status pages (/status/<slug>), push-monitor ingest (/api/push/<key>), status-page API + heartbeat (/api/status-page), badges (/api/badge), and static assets. Status pages are for logged-out viewers and push monitors POST from machines — neither can follow the Authentik OAuth cookie dance, so all were broken. Fix mirrors the meshcentral agent carve-out (9a15f3f2): add a second path-scoped ingress_factory (auth="none") pointing at the same uptime-kuma Service. Traefik routes longest-rule-first, so these out-prioritise the "/" catch-all; the dashboard (/, /dashboard, /manage-*, /settings, etc.) stays Authentik-gated via the original ingress. WebSocket status UI keeps working — the default middleware chain passes Upgrade/Connection through. Verified: /status/infra, /api/status-page/{,heartbeat/}infra, /api/badge no longer 302 (200); / still 302s to authentik. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
root	17f91f6167	Woodpecker CI deploy [CI SKIP]	2026-06-03 10:24:24 +00:00
Viktor Barzin	bc5aba34b6	meshcentral: fix agent connectivity behind Authentik + TLS-offload Traefik Two root causes kept all 8 mesh agents (incl. family laptops) offline: 1. The single ingress gated the ENTIRE site (path "/") behind Authentik forward-auth, so the agent/relay endpoints (/agent.ashx, /meshrelay.ashx, /control.ashx, etc.) got 302-bounced to SSO. Native mesh clients can't do the OAuth cookie dance. Fix: add a second ingress_factory (auth="none") path-scoped to the agent endpoints, pointing at the same meshcentral service. Traefik routes by rule length so these out-prioritise the "/" catch-all; the human web UI stays Authentik-gated. 2. After the auth fix, agents reached /agent.ashx but were rejected with "Agent bad web cert hash" — MeshCentral pins the OUTER TLS cert, but with TLS offload the agent sees Traefik's Let's Encrypt cert (which differs between the internal .203 LB and the external Cloudflare path, and rotates monthly), not MeshCentral's own webserver cert. Fix: set ignoreAgentHashCheck=true in the init-container config so MeshCentral echoes back the agent-reported hash. The separate mesh-certificate (ServerID) handshake still authenticates the server. Verified: agent paths no longer 302->authentik; web UI root still does; laptop "Valia_Laptop" enrolled in group "laptops" and ONLINE. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	01ea7d6fa1	immich: clip-keepalive CronJob to pin smart-search model warm MACHINE_LEARNING_MODEL_TTL=600 is a single global knob, so it unloads the CLIP textual (smart-search) encoder after idle exactly like OCR/face — immich has no per-model pin. This CronJob pings the textual encoder every 5 min (< the 600s TTL) via immich-ml /predict, so a search query never pays the ~1.5s cold-load, while idle OCR/face still free their VRAM on the shared T4. Textual-only (search = text->embedding->pgvector); the visual encoder is import-time and left to unload. curl baked into the image (no runtime install). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	f0948493b3	claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY) The service now runs agent calls concurrently (bounded semaphore, per-job isolated clones) instead of single-flight. Infra side: - mount git-crypt-key into the main container (each job re-unlocks its own clone) - MAX_CONCURRENCY=10 env (excess calls queue FIFO) - bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom - docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot Service code: viktor/claude-agent-service @ 66104a3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	16763464cd	job-hunter dashboard: role panels now respect the $location filter The role panels (Top roles, Top companies by role volume, New roles/day, Roles by source, Salary distribution) had no location filter, so they showed all locations regardless of the $location dropdown. Add 'primary_location IN (${location:sqlstring})' to each (matching the comp panels' pattern). Also switch the 'Your comp vs the market' panel from hardcoded 'london' to the same $location filter for consistency. Data was fine (all london-tagged roles genuinely contain 'london'). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:35:25 +00:00
Viktor Barzin	7a7abe4cbe	uk-payslip dashboard: count gross comp on taxable_pay (P60) basis The 'Yearly receipt' + 'YTD gross salary' panels summed salary+bonus+rsu_vest (rsu_vest = net/partial RSU), understating gross by ~£73k/yr. Switch to COALESCE(taxable_pay, gross_pay) + pension_sacrifice = true P60 gross (verified: 23/24 -> £286,288, 25/26 -> £416,646, matching the P60 + job-hunter realized bar). 'Yearly receipt' rsu_gross is now the real gross RSU (£150k/£271k, not £70k/£128k). Relabel the Sankey RSU inflow 'RSU (net vested)' for honesty; leave cash-flow/net_pay + the (taxable_pay-based) reconciliation/rate panels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:23:15 +00:00
Viktor Barzin	deb0dd4778	monitoring: "Your comp vs the market" panel on Job Hunter dashboard Add a barchart (panel 10) ranking every company's London p50 total comp (COALESCE total/base) with the user's current comp shown in line, so it's a direct "how do I compare" view. The user's figure is NOT hardcoded in the dashboard JSON — it's a labeled comp_point in the DB (company_slug 'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number out of git. It's below the £500k alert bar (no Slack ping) and ranks too low to appear in analyze leaders. Runbook documents the panel + how to update the baseline. [ci skip] — dashboard ConfigMap applied locally (targeted). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 21:27:26 +00:00
Viktor Barzin	74313149dd	job-hunter: weekly above-target Slack alert CronJob Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh): `python -m job_hunter alert --threshold 500000 --location london --slack` posts to Slack the companies whose London p50 total comp >= £500k, flagging any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via the job-hunter-secrets ExternalSecret from Vault secret/job-hunter slack_webhook_url (seeded from the shared workspace webhook; repointable to a dedicated channel). Runbook gains an "above-target Slack alert" section. [ci skip] — applied locally (stack-scoped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:49:42 +00:00
Viktor Barzin	5dc5cd53c0	url/shlink: ingress url.viktorbarzin.me auth required -> none Authentik forward-auth on the shlink REST API + short-link domain (url.viktorbarzin.me) 302s shlink-web's cross-origin API XHR (CORS preflight) and SSO-bounces every public short link. Result: the admin UI showed "Something went wrong while loading short URLs" and short links never resolved for logged-out clients. The shlink REST API is self-gated by its X-Api-Key and short links are public by design, so Authentik must not front this domain. CrowdSec + rate-limit + anti-AI bot-block still apply. The admin web UI (shlink.viktorbarzin.me) stays auth=required via module.ingress-web. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:37:33 +00:00
Viktor Barzin	fe8db19aaf	job-hunter: build-triggers-deploy model; CronJob :latest + docs CI now drives the Deployment rollout (kubectl set image to the build SHA in .woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment runs whatever CI last set (image ignore_changes keeps TF from fighting it), and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly run). Keel stays enrolled in parallel as a redundant net. Docs: rewrite the runbook "Deploying" section for build-triggers-deploy; record the reversal of decision #12 in the auto-upgrade design doc (owned apps drive their own rollout, Keel parallel — upstream stays Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section. [ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:24:50 +00:00
Viktor Barzin	052c776eba	immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm. Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:16:11 +00:00
Viktor Barzin	cda858d560	job-hunter: weekly refresh CronJob + ops/analyst runbook Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs `refresh --source ats --source hn --source levels_fyi`, which upserts roles/ comp AND appends the dated comp_snapshots/roles_snapshots series consumed by `job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container so a refresh never runs against an un-migrated DB; concurrency Forbid, backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore. Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and analyst (the analyze report, query recipes, SQL trend queries against the snapshot tables, interpretation caveats) sections. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:37:57 +00:00
Viktor Barzin	87f1dcb72d	wealth: consolidation chunk 2 — net-pay $grain merge, Trend projection, row reorg Completes the 36->17 consolidation: - 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/ yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg). - Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it renders regardless of the dashboard time range — fixes empty-by-default. Drops the duplicate projection-row stat cards + the how-to-view text panel. - Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns & contributions / Income vs market / Holdings / RSUs (META) / Projections. All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg source. Visual review pending. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	a587f0ee55	t3code: ingress -> devvm dispatch+autopair (retire in-cluster nginx) stacks/t3code now points the Authentik-gated ingress at the DevVM t3-dispatch service (Service+Endpoints -> 10.0.10.10:3780) instead of the in-cluster nginx, which is removed. Per-user routing + session auto-injection now live on DevVM. Verified: external 302->Authentik; in-cluster vbarzin/emil.barzin->302 (auto-pair to own instance), unmapped->403. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	5e4f83d4e7	wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo 36 -> 19 panels (chunk 1 of 2), zero metric loss: - 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)" - 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over time windows" table (window × Δall/Δmkt/return%) - 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line, timeFrom=10y so full history always shows) All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel, row reorg) to follow. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:27:09 +00:00


1201 commits