infra

Author	SHA1	Message	Date
Viktor Barzin	17da37cea3	fire-planner: reset bulk ingest toggle after successful run Job completed: 1,060 examples inserted across 10 FIRE subreddits (1,080 total), 20/24 sub-runs succeeded. Toggle reset to false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	deb031cc2c	feat(tripit): encrypted personal-document vault PVC + DOCUMENT_ENCRYPTION_KEY Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at /data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A root chown init-container makes the block volume writable by the non-root app without touching the NFS doc vault. Backs the new owner-only encrypted personal document vault in the tripit app. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	27989cd9f1	fire-planner: bulk Reddit FIRE examples ingest + qwen3-8b model upgrade - Enable bulk ingest job (run_examples_bulk_ingest=true) to populate fire_example table from top/all + top/year across 12 FIRE subreddits. Job fire-planner-examples-bulk-202606042150 is currently running. - Upgrade examples_llm_model from qwen3vl-4b to qwen3-8b; GPU has 10.7GB free (immich-ml using ~4GB of 15GB total), so higher-quality model fits. - Add LLM_CONCURRENCY=3 to bulk job container — claude-agent-service is now bounded-concurrency (MAX_CONCURRENCY=10), no longer single-flight. Strictly serial extraction (default 1) is no longer necessary. TODO: flip run_examples_bulk_ingest=false after job completes and re-apply to push the weekly CronJob model upgrade (qwen3vl-4b→qwen3-8b) which didn't land in this apply (TF timed out waiting for Job completion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	147a8cff40	Restore f1-stream stack — undo accidental bundling into 63fe7d2b Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the shared infra working tree and inadvertently swept in a parallel session's staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals, ci-cd.md + .claude docs, two extraction plan docs). This returns every f1-stream-related path to its pre-63fe7d2b state (3493c347) so that extraction can be committed cleanly by its own session. The fan-control files added in 63fe7d2b are untouched. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	c6f27fa172	wealth dashboard: enlarge returns numbers (drop stat name labels) [ci skip] At h=4 the two stacked values per window panel were too small because each also rendered its field-name label. Switch textMode value_and_name -> value on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix keep them self-identifying and the window stays in the panel title. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	dbe10a708c	wealth dashboard: shrink returns stat panels to h=4 [ci skip] The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to h=4 (matching the overview stat cards directly above) and pull every panel below up by 4 so the layout stays gap-free. Layout-only change — no panel content/query touched. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	fc1486c3dd	wealth dashboard: replace returns table with per-window stat panels [ci skip] Swap the single "Returns over time windows" table (panel 9201) for 5 stat panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the headline value + Δ market (£, net of contributions) as a second value, colored red/green by sign. Same per-window Modified-Dietz math as the old table, just scoped to one interval per panel — verified against live wealthfolio_sync PG and reproduced through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% / £297,846, matching the prior table exactly). Kept the same 24×8 grid footprint so nothing else on the dashboard reflows. Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip] because a full monitoring-stack CI apply would pull in unrelated pre-existing drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	6cec27f8dc	novelapp: bump Keel policy patch -> all (track any upstream version) Explicitly own the keel.sh/policy annotation in TF (was relying on the Kyverno-stamped `patch` default). Set policy=all + trigger=poll + pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover Keel-written runtime annotations (change-cause, update-time, revision, match-tag).	2026-06-05 09:19:11 +00:00
Viktor Barzin	9cb609f21a	nextcloud-todos: register only the Created webhook (drop Updated) The agent acts only on newly-created todos; the Updated listener re-fired on every edit (incl. the agent's own note-append). Live Updated webhook (id=2) already deleted via OCS API.	2026-06-05 09:19:11 +00:00
Viktor Barzin	3d0cba9dcb	openclaw: pin 2026.2.26, resilient startup, SHA-pinned plugin init (recover from agentRuntime + configSchema crashloop) Surfaced while installing the nextcloud-todos-api plugin (a pod roll): - 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself (commit `4b39cb72`) -> crashloop on any restart. Pinned image back to 2026.2.26. - startup steps (plugins enable / mcp set / memory index) backgrounded + timeout-guarded so a hung npm-install can never block the gateway. - install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull: IfNotPresent served a stale cached :latest, so the plugin manifest (configSchema) fix never landed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
root	c01a28e23c	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	6cff9bac26	freshrss: migrate extensions PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn) Frees a per-node SCSI-LUN slot on node6 (20->19, under the check #47 >=20 WARN). FreshRSS extensions are static plugin files (no embedded DB; app DB is external MySQL) -> NFS-safe. Empty volume (re-installable). Applied deadlock-safe: -target deployment+module first (Recreate releases old PVC), then full apply destroys the now-unused proxmox PVC.	2026-06-05 09:19:11 +00:00
root	11b092a589	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	e2d46ebd30	isponsorblocktv: migrate data PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn) Frees a per-node SCSI-LUN slot on node6 (21->20). Volume holds only config.json (no embedded DB) -> NFS-safe. config pre-seeded to /srv/nfs/isponsorblocktv before cutover. RWO-destroy deadlock during apply (TF deleting the in-use old PVC before rolling the deployment) was broken by patching the deployment claim to the NFS PVC; TF reconciled to the same value.	2026-06-05 09:19:11 +00:00
root	ace6ee59f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	adec2c135f	fix(novelapp): also bind gheorghe's dashboard SA to novelapp admin His app lives in novelapp, but the dashboard injects his SA token (system:serviceaccount:vabbit81:dashboard-vabbit81), while the existing binding only granted the OIDC User vabbit81@gmail.com (OIDC blocked). Add the SA as a second subject so the web dashboard (token-injector) can manage novelapp. Verified: SA can list/create in novelapp; injector path returns 200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	7114824c06	fix(rbac): tighten dashboard SA cluster-read to namespaces+nodes only namespace-owners could read all tenants' pods/configmaps/etc cluster-wide (read-only) via the broad namespace_owner_readonly role. Give the dashboard SAs a dedicated dashboard-nav-readonly ClusterRole = namespaces + nodes (list) only — enough for the dashboard namespace-picker/Nodes view, but no cross-tenant resource reads. Own-namespace access (admin) unchanged. Verified: gheorghe can list namespaces/nodes + full vabbit81, but list pods/configmaps -A = no, other namespaces = no. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	dd2a8e640f	monitoring: right-size loki memory request 3Gi->1Gi (quota 89%->79%) monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi. prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup; OOMs at 3Gi during WAL replay) so it stays; grafana's main container is already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit (bump the 20Gi quota) as the cluster expands.	2026-06-05 09:19:11 +00:00
Viktor Barzin	bf6ede2b9e	vault: deny secret/data/vault for claude-agent terraform-state policy (executor elevation safety narrowing) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	63ee655c08	monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold) The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC. Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but the alert threshold was 32d (2764800s) -> it false-fired every year for the ~3 days between day-32 and the next run. Raised to 40d (3456000s): clears the max first-Sunday spacing with margin, still catches a genuinely missed monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified: live rule now > 3.456e6, alert state inactive.	2026-06-05 09:19:10 +00:00
Viktor Barzin	d649f4f287	feat(k8s-dashboard): auto-inject per-user SA token (no token-paste) nginx token-injector behind the existing forward-auth: maps X-authentik-username (the user's email, injected by Authentik) -> that user's ServiceAccount token -> sets Authorization: Bearer -> kong-proxy. Dashboard auto-authenticates; users never see the token prompt. Mirrors the t3-dispatch pattern. Token map lives in a Secret (namespace-owners' cluster-read covers configmaps, not secrets). Verified: gheorghe->vabbit81 pods 200 + kube-system 200 (cluster-read); viktor->nodes 200 (admin); unmapped->401. namespace-owners auto-derived from k8s_users; admins hardcoded (their Authentik identity != k8s_users email). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	467eb7d7ee	claude-agent: grant shared pod executor powers (Forgejo PR, terragrunt apply, kubectl write, MCP) Elevates the shared claude-agent-service pod (SA claude-agent, ns claude-agent) so the nextcloud-todos-exec agent can run autonomously. Viktor explicitly chose to elevate the SHARED service knowing every agent on the pod inherits these creds — each grant is security-sensitive and flagged inline for review. Vault (stacks/vault/main.tf): - terraform-state k8s-auth role: add `claude-agent` to bound_service_account_names (was only `default` — the pod's own SA token could not log in, so scripts/tg apply died fetching the PG backend password). `default` kept. - terraform-state policy broadened from `database/static-creds/pg-terraform-state` read only to read on database/static-creds/, database/creds/, secret/data/* and secret/metadata/* — what stacks read at plan/apply time. FLAG: grants the shared pod broad Vault READ (effectively all app secrets + rotating DB creds); not denied: secret/data/vault. claude-agent-service stack (stacks/claude-agent-service/main.tf): - ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token, viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url). - git-init: add url.insteadOf rewrite to authenticate git pushes to forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API). - New claude-agent-exec ClusterRole+Binding: cluster-wide get/list/watch/create/update/patch/delete on core (incl. secrets), apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the existing read-only claude-agent role; does NOT bind cluster-admin. FLAG: very broad — close to cluster-admin in blast radius. - Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher sidecar (k8s-auth login role=terraform-state every 30m -> shared emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths. - MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP, ${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster). Not applied, not pushed — code only, for human review of the privilege grants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	b56a868b4e	wealthfolio-sync: podAffinity to co-locate with app pod (RWO multi-attach fix) The monthly wealthfolio-sync CronJob mounts the same RWO wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the always-running wealthfolio app Deployment. RWO attaches to only one node, but the sync had no affinity — so the 2026-06-01 run landed on node4 while the app held the volume on node3 and hung in ContainerCreating for 3 days (FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN. Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the sync always schedules onto the app's node, where the volume is already attached (RWO permits multiple pods on the same node). Verified: a fresh sync run co-located on node3, attached cleanly, and broker-sync started.	2026-06-05 09:19:10 +00:00
Viktor Barzin	e4c3fbbbbb	feat(authentik): adopt admin-services-restriction policy; admit kubernetes-* groups to k8s dashboard Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me was Home-Server-Admins-only. Carve-out: the dashboard host now also admits kubernetes-admins/power-users/namespace-owners so they can reach the login page; per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf). All other admin-only hosts unchanged. Policy adopted from UI into TF via import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	317989f9d5	feat(rbac): per-namespace-owner dashboard SA + long-lived token Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner (from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s) + cluster read-only, plus a long-lived token to paste into the dashboard 'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency. Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
root	d479d5b4f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ea1e4f793b	revert(k8s-dashboard): restore forward-auth ingress (apiserver OIDC unresolved) Dashboard back to the working forward-auth + kong-proxy state. The oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured AuthenticationConfiguration), despite verified signature, issuer, audience, email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app left deployed (idle, harmless) pending that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c958f6a589	feat(nextcloud-todos): Phase 4 IaC — service stack, Vault role, DB bootstrap, OpenClaw plugin, monitoring Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the Nextcloud Personal task list; classifies todos via local qwen3-8b and routes research/mutating work through claude-agent-service). Clones the recruiter-responder service pattern end-to-end. Written only — NOT applied. - stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled), two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate deployment with alembic-migrate init-container, ClusterIP svc, /cb-only HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated events -> internal svc URL, Bearer auth). - stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d rotation) + pg-nextcloud-todos in the postgresql allowed_roles array. - stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource (clone of pg_tripit_db) — creates role+DB, pins role search_path, and creates schema nextcloud_todos AUTHORIZATION nextcloud_todos. - stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container, nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path ESO key (secret/nextcloud-todos.webhook_bearer_token). - stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP monitor. Prometheus /metrics scrape via svc annotations in the new stack. - .gitleaksignore: allowlist two curl-auth-user false positives (the OCS webhook curl uses a Vault-sourced shell var, not a literal credential). KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c3c3d5e010	feat(claude-agent-service): seed nextcloud-todos planner + exec agents Add cp lines in the seed-beads-agent init-container so the two new nextcloud-todos agent definitions (baked into the image at /usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in ~/.claude/agents/ at pod start. Phase 3, task 3.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	d0206848cb	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-05 09:19:09 +00:00
Viktor Barzin	55a8b238a0	mam-farming: migrate data volume proxmox-lvm → NFS The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM (k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then blocked every run → 0 grabs → MAMFarmingStuck. NFS (nfs_volume module, matching the other 9 servarr apps) removes this volume from the per-VM SCSI hotplug path entirely: it mounts over the network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across before the switch; verified a grabber run Succeeds on NFS on node4 with the preserved dedup list (tracked IDs carried over). Lever #1 from docs/architecture/storage.md "Per-VM SCSI-LUN cap".	2026-06-05 09:19:09 +00:00
Viktor Barzin	cb96d5d590	fix(k8s-dashboard): use email_verified=true + groups scope mappings The apiserver rejects the email username-claim when email_verified is false (invalid bearer token 401). Authentik external/social users are unverified, so the default scope-email mapping fails. Mirror the proven kubernetes provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and align oauth2-proxy scope to 'openid email profile groups'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	1042c0f082	fix(k8s-dashboard): set RS256 signing_key on Authentik OIDC provider Provider had signing_key=null → Authentik signed id_tokens with HS256 and served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature verification (500 'failed to verify id token signature' on the callback). Use the same 'authentik Self-signed Certificate' keypair the kubernetes provider uses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	e436af8d8c	fix(k8s-dashboard): drop group-restriction policy; RBAC is the gate The Authentik group policy denied admins: it gated on kubernetes-* group membership, but cluster access is email-based RBAC (User bindings from k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group, so the gate locked him out. Apiserver RBAC is now the sole gate — matching the kubelogin CLI (authenticate freely, RBAC decides actions). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	c9b22c7dd3	feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard issuer) and applies each user's own RBAC via the apiserver multi-issuer AuthenticationConfiguration. Committed so CI converges (uncommitted local applies were being reverted by the Woodpecker terragrunt-apply pipeline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ed4ed6bd09	fix(k8s-dashboard): ignore Keel/tier drift on oauth2-proxy deployment	2026-06-05 09:19:09 +00:00
Viktor Barzin	75c2b6dc5e	feat(rbac): apiserver multi-issuer OIDC via structured AuthenticationConfiguration Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped, silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1 AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the dashboard via SSO while keeping the CLI issuer working. Remote script health-gates /livez and auto-rolls-back on failure (single master). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	5b25ce1ec5	priority-pass: bump image_tag to 061a66ad [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `061a66ad3b`	2026-06-05 09:19:09 +00:00
Viktor Barzin	9c4335025d	feat(tripit): linked-email verification (SMTP + confirm carve-out) [ci skip] Adds outbound mail for linked-email verification: EMAIL_PROVIDER=smtp + SMTP_* app env (submits via the cluster mailserver as spam@, relayed by Brevo), SMTP_PASSWORD mapped to the existing PLANS_IMAP_PASSWORD (no new secret), and a token-gated /api/emails/confirm ingress carve-out (auth=none, like the calendar feed). Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	b8c55732e0	feat(k8s-dashboard): deploy oauth2-proxy (not yet wired to ingress) 2 replicas in kubernetes-dashboard ns; OIDC code-flow against the k8s-dashboard Authentik client, injects user id_token as Bearer upstream to kong-proxy. ESO syncs client/cookie secrets from Vault. Ingress still points at kong-proxy — no user-facing change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
root	7c4375d7cd	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:09 +00:00
Viktor Barzin	4ed0c5a834	uptime-kuma: codify Traefik LB internal monitor at .203 (was stale .200) A hand-created (non-TF) uptime-kuma monitor "Traefik LoadBalancer" (id=95) port-checked 10.0.20.200:443 — the shared LB IP Traefik moved OFF on 2026-05-30 when it took its dedicated .203 (ETP=Local). It had been DOWN for ~5 days, surfacing as the cluster-health "uptime_kuma internal down(1)" WARN. Add it to local.internal_monitors as "Traefik LoadBalancer (10.0.20.203)" (port 10.0.20.203:443) so it's managed like the TP-Link/Proxmox direct-IP probes — a direct check of the MetalLB L2 + Traefik bind, complementing the [External] traefik (full CF path) and Traefik Dashboard (in-cluster) monitors. The sync CronJob created it (id=902, reporting UP @1ms); the orphan id=95 was deleted via the uptime-kuma API.	2026-06-05 09:19:08 +00:00
Viktor Barzin	011c63c92d	feat(k8s-dashboard): add Authentik OIDC app for dashboard SSO Confidential client k8s-dashboard + custom scope mapping emitting aud=[kubernetes,k8s-dashboard] + group-restriction policy (kubernetes-* RBAC groups). Additive — dashboard ingress unchanged. Token via Vault secret/k8s-dashboard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	38c77048fd	chore(travel-agent): decommission — merged into tripit [ci skip] travel-agent's transport-to-airport + weather-brief workflows now run inside tripit (DB-driven instead of CalDAV), so the standalone CronJob stack is retired (namespace + ExternalSecret + 2 CronJobs destroyed via scripts/tg). secret/travel-agent left in Vault as an archive. Applied locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	be4ee7315a	feat(tripit): proactive-nudge CronJobs (transport + weather brief) [ci skip] Merge travel-agent's two workflows into tripit (beads code-muqi): adds tripit-transport-nudge (08:00) + tripit-weather-brief (21:00) CronJobs on Europe/London, an optional per-job timezone, and SLACK_BOT_TOKEN + DAWARICH_API_KEY in the tripit-secrets ExternalSecret. Nudges post to Slack #travel + Web Push; Dawarich location via the public host (in-cluster *.svc is 403'd by Rails host-auth). Vault secret/tripit seeded from secret/travel-agent + secret/owntracks. Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:06 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7302cd7908	infra: untrack generated backend.tf (stale PG creds + .200 literal) [CI SKIP] terragrunt generates backend.tf per run (remote_state generate, if_exists=overwrite_terragrunt) from get_env("PG_CONN_STR"); these 72 committed copies are stale artifacts already covered by .gitignore:65. They held a plaintext (Vault-rotated, ~expired) PG password + the .200 state-backend literal and were re-committed by CI on every run. git rm --cached stops that; they regenerate locally from PG_CONN_STR. The live .200:5432 literal now lives only in scripts/tg (its single bootstrap source). Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:52:46 +00:00

1227 commits
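
Note on commit 63ee655c08 (PrometheusBackupStale threshold): the message reasons that consecutive first-Sunday runs can be about 35 days apart, so a 32d threshold false-fires while 40d does not. The sketch below checks that arithmetic with the Python standard library. The helper names (first_sunday, max_first_sunday_gap) and the 2026-2030 range are illustrative assumptions, not code from the repository.

```python
from datetime import date, timedelta

SECONDS_PER_DAY = 86400


def first_sunday(year: int, month: int) -> date:
    """Return the first Sunday of the given month (hypothetical helper)."""
    first = date(year, month, 1)
    # date.weekday(): Monday == 0 ... Sunday == 6
    return first + timedelta(days=(6 - first.weekday()) % 7)


def max_first_sunday_gap(start_year: int, end_year: int) -> int:
    """Largest gap, in days, between consecutive first Sundays in the range."""
    runs = [
        first_sunday(year, month)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]
    return max((later - earlier).days for earlier, later in zip(runs, runs[1:]))


old_threshold_s = 32 * SECONDS_PER_DAY  # 2764800, the value that false-fired
new_threshold_s = 40 * SECONDS_PER_DAY  # 3456000, the raised value
worst_gap_s = max_first_sunday_gap(2026, 2030) * SECONDS_PER_DAY

print(first_sunday(2026, 5), first_sunday(2026, 6))
print(old_threshold_s < worst_gap_s < new_threshold_s)
```

Because first Sundays always fall on the same weekday, consecutive runs are a whole number of weeks apart, so the gap is either 28 or 35 days; any threshold above 35d plus the backup's own runtime avoids the false alert while still catching a missed monthly run.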