infra

Author	SHA1	Message	Date
Viktor Barzin	27989cd9f1	fire-planner: bulk Reddit FIRE examples ingest + qwen3-8b model upgrade - Enable bulk ingest job (run_examples_bulk_ingest=true) to populate fire_example table from top/all + top/year across 12 FIRE subreddits. Job fire-planner-examples-bulk-202606042150 is currently running. - Upgrade examples_llm_model from qwen3vl-4b to qwen3-8b; GPU has 10.7GB free (immich-ml using ~4GB of 15GB total), so higher-quality model fits. - Add LLM_CONCURRENCY=3 to bulk job container — claude-agent-service is now bounded-concurrency (MAX_CONCURRENCY=10), no longer single-flight. Strictly serial extraction (default 1) is no longer necessary. TODO: flip run_examples_bulk_ingest=false after job completes and re-apply to push the weekly CronJob model upgrade (qwen3vl-4b→qwen3-8b) which didn't land in this apply (TF timed out waiting for Job completion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	147a8cff40	Restore f1-stream stack — undo accidental bundling into 63fe7d2b Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the shared infra working tree and inadvertently swept in a parallel session's staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals, ci-cd.md + .claude docs, two extraction plan docs). This returns every f1-stream-related path to its pre-63fe7d2b state (3493c347) so that extraction can be committed cleanly by its own session. The fan-control files added in 63fe7d2b are untouched. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	c6f27fa172	wealth dashboard: enlarge returns numbers (drop stat name labels) [ci skip] At h=4 the two stacked values per window panel were too small because each also rendered its field-name label. Switch textMode value_and_name -> value on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix keep them self-identifying and the window stays in the panel title. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	dbe10a708c	wealth dashboard: shrink returns stat panels to h=4 [ci skip] The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to h=4 (matching the overview stat cards directly above) and pull every panel below up by 4 so the layout stays gap-free. Layout-only change — no panel content/query touched. Applied via targeted tg apply of the configmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	fc1486c3dd	wealth dashboard: replace returns table with per-window stat panels [ci skip] Swap the single "Returns over time windows" table (panel 9201) for 5 stat panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the headline value + Δ market (£, net of contributions) as a second value, colored red/green by sign. Same per-window Modified-Dietz math as the old table, just scoped to one interval per panel — verified against live wealthfolio_sync PG and reproduced through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% / £297,846, matching the prior table exactly). Kept the same 24×8 grid footprint so nothing else on the dashboard reflows. Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip] because a full monitoring-stack CI apply would pull in unrelated pre-existing drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	6cec27f8dc	novelapp: bump Keel policy patch -> all (track any upstream version) Explicitly own the keel.sh/policy annotation in TF (was relying on the Kyverno-stamped `patch` default). Set policy=all + trigger=poll + pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover Keel-written runtime annotations (change-cause, update-time, revision, match-tag).	2026-06-05 09:19:11 +00:00
Viktor Barzin	9cb609f21a	nextcloud-todos: register only the Created webhook (drop Updated) The agent acts only on newly-created todos; the Updated listener re-fired on every edit (incl. the agent's own note-append). Live Updated webhook (id=2) already deleted via OCS API.	2026-06-05 09:19:11 +00:00
Viktor Barzin	3d0cba9dcb	openclaw: pin 2026.2.26, resilient startup, SHA-pinned plugin init (recover from agentRuntime + configSchema crashloop) Surfaced while installing the nextcloud-todos-api plugin (a pod roll): - 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself (commit `4b39cb72`) -> crashloop on any restart. Pinned image back to 2026.2.26. - startup steps (plugins enable / mcp set / memory index) backgrounded + timeout-guarded so a hung npm-install can never block the gateway. - install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull: IfNotPresent served a stale cached :latest, so the plugin manifest (configSchema) fix never landed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
root	c01a28e23c	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	6cff9bac26	freshrss: migrate extensions PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn) Frees a per-node SCSI-LUN slot on node6 (20->19, under the check #47 >=20 WARN). FreshRSS extensions are static plugin files (no embedded DB; app DB is external MySQL) -> NFS-safe. Empty volume (re-installable). Applied deadlock-safe: -target deployment+module first (Recreate releases old PVC), then full apply destroys the now-unused proxmox PVC.	2026-06-05 09:19:11 +00:00
root	11b092a589	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	e2d46ebd30	isponsorblocktv: migrate data PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn) Frees a per-node SCSI-LUN slot on node6 (21->20). Volume holds only config.json (no embedded DB) -> NFS-safe. config pre-seeded to /srv/nfs/isponsorblocktv before cutover. RWO-destroy deadlock during apply (TF deleting the in-use old PVC before rolling the deployment) was broken by patching the deployment claim to the NFS PVC; TF reconciled to the same value.	2026-06-05 09:19:11 +00:00
Viktor Barzin	9858a1c44b	docs(add-user): document dashboard auto-login home-ns scope + foreign-namespace exception [ci skip] Auto-login covers a user's k8s_users home namespace only (dashboard SA bound there). For workloads in a separate/pre-existing namespace (gheorghe→novelapp), that namespace must also grant the dashboard SA, not just the OIDC User. Best practice: set k8s_users namespace = where the workload runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
root	ace6ee59f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:11 +00:00
Viktor Barzin	adec2c135f	fix(novelapp): also bind gheorghe's dashboard SA to novelapp admin His app lives in novelapp, but the dashboard injects his SA token (system:serviceaccount:vabbit81:dashboard-vabbit81), while the existing binding only granted the OIDC User vabbit81@gmail.com (OIDC blocked). Add the SA as a second subject so the web dashboard (token-injector) can manage novelapp. Verified: SA can list/create in novelapp; injector path returns 200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	8f13fdeaf7	docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip] Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list namespaces/nodes (for dashboard nav) but not read other tenants' resources. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	7114824c06	fix(rbac): tighten dashboard SA cluster-read to namespaces+nodes only namespace-owners could read all tenants' pods/configmaps/etc cluster-wide (read-only) via the broad namespace_owner_readonly role. Give the dashboard SAs a dedicated dashboard-nav-readonly ClusterRole = namespaces + nodes (list) only — enough for the dashboard namespace-picker/Nodes view, but no cross-tenant resource reads. Own-namespace access (admin) unchanged. Verified: gheorghe can list namespaces/nodes + full vabbit81, but list pods/configmaps -A = no, other namespaces = no. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	ae252f9116	cluster-health: ha_integrations — skip disabled + ignored config entries check_ha_integrations counted any config entry with state=not_loaded as a problem, but HA marks intentionally-off entries that way too: disabled_by set (user/integration disabled it) and source=="ignore" (a discovered integration the user chose to ignore — never meant to load). On ha-sofia 2026-06-04 this false-WARNed on 6 entries that are all intentional — wyoming faster-whisper/piper + ollama (disabled_by=user) and mass_queue/dlna_dms(EMO-LAPTOP2)/yalexs_ble (source=ignore). Skip disabled/ignored entries; only genuine setup_error/setup_retry/ not_loaded (without disabled/ignore) now flag. Verified: check #27 -> PASS "All 96 integrations loaded".	2026-06-05 09:19:11 +00:00
Viktor Barzin	dd2a8e640f	monitoring: right-size loki memory request 3Gi->1Gi (quota 89%->79%) monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi. prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup; OOMs at 3Gi during WAL replay) so it stays; grafana's main container is already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit (bump the 20Gi quota) as the cluster expands.	2026-06-05 09:19:11 +00:00
Viktor Barzin	90d7c11c16	state(vault): update encrypted state	2026-06-05 09:19:10 +00:00
Viktor Barzin	2707496b37	state(dbaas): update encrypted state	2026-06-05 09:19:10 +00:00
Viktor Barzin	31b8104b43	cluster-health: uptime_kuma check — only count status==0 as down check_uptime_kuma flagged a monitor as down whenever its last heartbeat status != 1, and treated "no beats" as down too. But uptime-kuma status 2 = PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no data. So a monitor caught in a momentary pending/retry state at check time produced a false "internal/external down(N)" WARN — observed twice on 2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged ZERO downs against over 24h (0/2880 and 0/288 beats). Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending, maintenance, and no-data are not-down. Real outages still flag (uptime-kuma persists status==0 beats for genuine downs).	2026-06-05 09:19:10 +00:00
Viktor Barzin	bf6ede2b9e	vault: deny secret/data/vault for claude-agent terraform-state policy (executor elevation safety narrowing) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	63ee655c08	monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold) The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC. Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but the alert threshold was 32d (2764800s) -> it false-fired every year for the ~3 days between day-32 and the next run. Raised to 40d (3456000s): clears the max first-Sunday spacing with margin, still catches a genuinely missed monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified: live rule now > 3.456e6, alert state inactive.	2026-06-05 09:19:10 +00:00
Viktor Barzin	c4bd64f88a	docs: dashboard now auto-injects per-user SA token (no token-paste) Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map regen). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	d649f4f287	feat(k8s-dashboard): auto-inject per-user SA token (no token-paste) nginx token-injector behind the existing forward-auth: maps X-authentik-username (the user's email, injected by Authentik) -> that user's ServiceAccount token -> sets Authorization: Bearer -> kong-proxy. Dashboard auto-authenticates; users never see the token prompt. Mirrors the t3-dispatch pattern. Token map lives in a Secret (namespace-owners' cluster-read covers configmaps, not secrets). Verified: gheorghe->vabbit81 pods 200 + kube-system 200 (cluster-read); viktor->nodes 200 (admin); unmapped->401. namespace-owners auto-derived from k8s_users; admins hardcoded (their Authentik identity != k8s_users email). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	467eb7d7ee	claude-agent: grant shared pod executor powers (Forgejo PR, terragrunt apply, kubectl write, MCP) Elevates the shared claude-agent-service pod (SA claude-agent, ns claude-agent) so the nextcloud-todos-exec agent can run autonomously. Viktor explicitly chose to elevate the SHARED service knowing every agent on the pod inherits these creds — each grant is security-sensitive and flagged inline for review. Vault (stacks/vault/main.tf): - terraform-state k8s-auth role: add `claude-agent` to bound_service_account_names (was only `default` — the pod's own SA token could not log in, so scripts/tg apply died fetching the PG backend password). `default` kept. - terraform-state policy broadened from `database/static-creds/pg-terraform-state` read only to read on database/static-creds/, database/creds/, secret/data/* and secret/metadata/* — what stacks read at plan/apply time. FLAG: grants the shared pod broad Vault READ (effectively all app secrets + rotating DB creds); not denied: secret/data/vault. claude-agent-service stack (stacks/claude-agent-service/main.tf): - ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token, viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url). - git-init: add url.insteadOf rewrite to authenticate git pushes to forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API). - New claude-agent-exec ClusterRole+Binding: cluster-wide get/list/watch/create/update/patch/delete on core (incl. secrets), apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the existing read-only claude-agent role; does NOT bind cluster-admin. FLAG: very broad — close to cluster-admin in blast radius. - Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher sidecar (k8s-auth login role=terraform-state every 30m -> shared emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths. 
- MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP, ${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster). Not applied, not pushed — code only, for human review of the privilege grants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	b56a868b4e	wealthfolio-sync: podAffinity to co-locate with app pod (RWO multi-attach fix) The monthly wealthfolio-sync CronJob mounts the same RWO wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the always-running wealthfolio app Deployment. RWO attaches to only one node, but the sync had no affinity — so the 2026-06-01 run landed on node4 while the app held the volume on node3 and hung in ContainerCreating for 3 days (FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN. Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the sync always schedules onto the app's node, where the volume is already attached (RWO permits multiple pods on the same node). Verified: a fresh sync run co-located on node3, attached cleanly, and broker-sync started.	2026-06-05 09:19:10 +00:00
Viktor Barzin	8e44ccaa65	docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked) Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality: apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste. Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state, and add-user skill (onboarding now documents the dashboard token). oauth2-proxy + k8s-dashboard OIDC app noted as idle. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	e4c3fbbbbb	feat(authentik): adopt admin-services-restriction policy; admit kubernetes-* groups to k8s dashboard Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me was Home-Server-Admins-only. Carve-out: the dashboard host now also admits kubernetes-admins/power-users/namespace-owners so they can reach the login page; per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf). All other admin-only hosts unchanged. Policy adopted from UI into TF via import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	317989f9d5	feat(rbac): per-namespace-owner dashboard SA + long-lived token Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner (from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s) + cluster read-only, plus a long-lived token to paste into the dashboard 'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency. Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	4aa6e7a5af	chrome-service docs: clarify f1-stream is not a real caller stacks/f1-stream/files/backend/playback_verifier.py and chrome_browser.py describe an in-cluster CDP caller, but the deployed f1-stream image is built from github.com/ViktorBarzin/f1-stream which has neither file — verified by `kubectl exec ls /app/backend/` and grepping for 'CHROME' in the deployed pod. The infra/stacks/f1-stream/files/backend/ tree is a vestigial design that was never wired up to a build pipeline. Calling it out so the next reader doesn't waste time debugging why the migration "didn't take effect" — it took effect on dead code. The hourly snapshot-harvester CronJob is the only live in-cluster caller of the CDP endpoint today.	2026-06-05 09:19:10 +00:00
root	d479d5b4f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. 
Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	b64d8d6168	cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (query-pci QMP timeouts) that the scheduler's 28-LUN guard can't see — exactly the drift that wedged the MAM grabber on node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks. Single `qm list` + one `qm config` per VM keeps it ~3s (the naive once-per-VM version timed out the parallel runner). Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by 138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and tripped `set -e`, killing the check before json_add. It silently dropped from every parallel report and broke --serial entirely (whole run aborted). Guarded both substitutions with `|| true`; the existing `=~` numeric checks already handle the empty case. immich_search now reports PASS/WARN instead of vanishing.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ea1e4f793b	revert(k8s-dashboard): restore forward-auth ingress (apiserver OIDC unresolved) Dashboard back to the working forward-auth + kong-proxy state. The oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured AuthenticationConfiguration), despite verified signature, issuer, audience, email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app left deployed (idle, harmless) pending that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c958f6a589	feat(nextcloud-todos): Phase 4 IaC — service stack, Vault role, DB bootstrap, OpenClaw plugin, monitoring Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the Nextcloud Personal task list; classifies todos via local qwen3-8b and routes research/mutating work through claude-agent-service). Clones the recruiter-responder service pattern end-to-end. Written only — NOT applied. - stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled), two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate deployment with alembic-migrate init-container, ClusterIP svc, /cb-only HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated events -> internal svc URL, Bearer auth). - stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d rotation) + pg-nextcloud-todos in the postgresql allowed_roles array. - stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource (clone of pg_tripit_db) — creates role+DB, pins role search_path, and creates schema nextcloud_todos AUTHORIZATION nextcloud_todos. - stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container, nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path ESO key (secret/nextcloud-todos.webhook_bearer_token). - stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP monitor. Prometheus /metrics scrape via svc annotations in the new stack. - .gitleaksignore: allowlist two curl-auth-user false positives (the OCS webhook curl uses a Vault-sourced shell var, not a literal credential). KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c3c3d5e010	feat(claude-agent-service): seed nextcloud-todos planner + exec agents Add cp lines in the seed-beads-agent init-container so the two new nextcloud-todos agent definitions (baked into the image at /usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in ~/.claude/agents/ at pod start. Phase 3, task 3.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	d0206848cb	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-05 09:19:09 +00:00
Viktor Barzin	55a8b238a0	mam-farming: migrate data volume proxmox-lvm → NFS The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM (k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then blocked every run → 0 grabs → MAMFarmingStuck. NFS (nfs_volume module, matching the other 9 servarr apps) removes this volume from the per-VM SCSI hotplug path entirely: it mounts over the network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across before the switch; verified a grabber run Succeeds on NFS on node4 with the preserved dedup list (tracked IDs carried over). Lever #1 from docs/architecture/storage.md "Per-VM SCSI-LUN cap".	2026-06-05 09:19:09 +00:00
Viktor Barzin	cb96d5d590	fix(k8s-dashboard): use email_verified=true + groups scope mappings The apiserver rejects the email username-claim when email_verified is false (invalid bearer token 401). Authentik external/social users are unverified, so the default scope-email mapping fails. Mirror the proven kubernetes provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and align oauth2-proxy scope to 'openid email profile groups'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	1042c0f082	fix(k8s-dashboard): set RS256 signing_key on Authentik OIDC provider Provider had signing_key=null → Authentik signed id_tokens with HS256 and served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature verification (500 'failed to verify id token signature' on the callback). Use the same 'authentik Self-signed Certificate' keypair the kubernetes provider uses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	e436af8d8c	fix(k8s-dashboard): drop group-restriction policy; RBAC is the gate The Authentik group policy denied admins: it gated on kubernetes-* group membership, but cluster access is email-based RBAC (User bindings from k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group, so the gate locked him out. Apiserver RBAC is now the sole gate — matching the kubelogin CLI (authenticate freely, RBAC decides actions). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ad3432d685	docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver) Update authentication.md (structured multi-issuer AuthenticationConfiguration + dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state (new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth), and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config → re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint pivot from the original dual-aud approach. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	c9b22c7dd3	feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard issuer) and applies each user's own RBAC via the apiserver multi-issuer AuthenticationConfiguration. Committed so CI converges (uncommitted local applies were being reverted by the Woodpecker terragrunt-apply pipeline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ed4ed6bd09	fix(k8s-dashboard): ignore Keel/tier drift on oauth2-proxy deployment	2026-06-05 09:19:09 +00:00
Viktor Barzin	75c2b6dc5e	feat(rbac): apiserver multi-issuer OIDC via structured AuthenticationConfiguration Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped, silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1 AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the dashboard via SSO while keeping the CLI issuer working. Remote script health-gates /livez and auto-rolls-back on failure (single master). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	5b25ce1ec5	priority-pass: bump image_tag to 061a66ad [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `061a66ad3b`	2026-06-05 09:19:09 +00:00
Viktor Barzin	9c4335025d	feat(tripit): linked-email verification (SMTP + confirm carve-out) [ci skip] Adds outbound mail for linked-email verification: EMAIL_PROVIDER=smtp + SMTP_* app env (submits via the cluster mailserver as spam@, relayed by Brevo), SMTP_PASSWORD mapped to the existing PLANS_IMAP_PASSWORD (no new secret), and a token-gated /api/emails/confirm ingress carve-out (auth=none, like the calendar feed). Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
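
Note on 63ee655c08 (PrometheusBackupStale 32d->40d): the threshold arithmetic in that message can be checked with a short sketch. This is not code from the repo; the helper names `first_sunday` and `max_gap_days` are hypothetical, and the only inputs taken from the commit are the schedule (first Sunday of the month, 04:00 UTC) and the two thresholds.

```python
from datetime import datetime, timedelta, timezone


def first_sunday(year: int, month: int) -> datetime:
    """Return 04:00 UTC on the first Sunday of the given month."""
    first = datetime(year, month, 1, 4, 0, tzinfo=timezone.utc)
    # weekday(): Monday=0 ... Sunday=6, so this is the number of days to the next Sunday.
    return first + timedelta(days=(6 - first.weekday()) % 7)


def max_gap_days(year: int) -> int:
    """Largest gap (in days) between consecutive first-Sunday runs starting in `year`."""
    runs = [first_sunday(year, m) for m in range(1, 13)] + [first_sunday(year + 1, 1)]
    return max((later - earlier).days for earlier, later in zip(runs, runs[1:]))


OLD_THRESHOLD_S = 32 * 86400  # 2764800, the old alert threshold from the commit message
NEW_THRESHOLD_S = 40 * 86400  # 3456000, the new alert threshold from the commit message

# The example pair named in the commit: May 3 -> Jun 7, 2026.
gap = first_sunday(2026, 6) - first_sunday(2026, 5)
print(gap.days)                               # 35
print(gap.total_seconds() > OLD_THRESHOLD_S)  # True: the old rule would fire
print(gap.total_seconds() < NEW_THRESHOLD_S)  # True: the new rule stays quiet
```

Consecutive first Sundays are always either 28 or 35 days apart, so any threshold below 35d fires on a healthy schedule, while 40d leaves about five days of margin.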


3994 commits
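
Note on 31b8104b43 (uptime_kuma check): the rule described in that message, "down only on an explicit DOWN beat", can be sketched as below. This is an illustration, not the repo's check script (whose language and structure are not shown in this log); the function names and the sample monitor names are hypothetical, and only the status codes 0/1/2/3 and their meanings come from the commit message.

```python
from typing import Dict, List, Optional

# Heartbeat status codes as described in the commit message.
DOWN, UP, PENDING, MAINTENANCE = 0, 1, 2, 3


def is_down(last_status: Optional[int]) -> bool:
    """True only for an explicit DOWN beat; None means the monitor has no beats yet."""
    return last_status == DOWN


def down_monitors(latest: Dict[str, Optional[int]]) -> List[str]:
    """Names of monitors whose most recent heartbeat is an explicit DOWN."""
    return sorted(name for name, status in latest.items() if is_down(status))


# Hypothetical sample: one monitor in each state, plus one with no data.
sample = {
    "up-monitor": UP,
    "pending-monitor": PENDING,
    "maintenance-monitor": MAINTENANCE,
    "no-data-monitor": None,
    "down-monitor": DOWN,
}
print(down_monitors(sample))  # ['down-monitor']
```

The earlier behaviour described in the commit corresponds to `last_status != UP`, which would have reported four of the five sample monitors as down.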