infra

Author	SHA1	Message	Date
Viktor Barzin	dd2a8e640f	monitoring: right-size loki memory request 3Gi->1Gi (quota 89%->79%) monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi. prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup; OOMs at 3Gi during WAL replay) so it stays; grafana's main container is already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit (bump the 20Gi quota) as the cluster expands.	2026-06-05 09:19:11 +00:00
Viktor Barzin	90d7c11c16	state(vault): update encrypted state	2026-06-05 09:19:10 +00:00
Viktor Barzin	2707496b37	state(dbaas): update encrypted state	2026-06-05 09:19:10 +00:00
Viktor Barzin	31b8104b43	cluster-health: uptime_kuma check — only count status==0 as down check_uptime_kuma flagged a monitor as down whenever its last heartbeat status != 1, and treated "no beats" as down too. But uptime-kuma status 2 = PENDING (mid-retry) and 3 = MAINTENANCE are not outages, and no-beats = no data. So a monitor caught in a momentary pending/retry state at check time produced a false "internal/external down(N)" WARN — observed twice on 2026-06-04 (Novelapp, then ha-sofia) for monitors uptime-kuma itself logged ZERO downs against over 24h (0/2880 and 0/288 beats). Count a monitor as down ONLY on an explicit DOWN beat (status==0); pending, maintenance, and no-data are not-down. Real outages still flag (uptime-kuma persists status==0 beats for genuine downs).	2026-06-05 09:19:10 +00:00
Viktor Barzin	bf6ede2b9e	vault: deny secret/data/vault for claude-agent terraform-state policy (executor elevation safety narrowing) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	63ee655c08	monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold) The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC. Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but the alert threshold was 32d (2764800s) -> it false-fired every year for the ~3 days between day-32 and the next run. Raised to 40d (3456000s): clears the max first-Sunday spacing with margin, still catches a genuinely missed monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified: live rule now > 3.456e6, alert state inactive.	2026-06-05 09:19:10 +00:00
Viktor Barzin	c4bd64f88a	docs: dashboard now auto-injects per-user SA token (no token-paste) Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map regen). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	d649f4f287	feat(k8s-dashboard): auto-inject per-user SA token (no token-paste) nginx token-injector behind the existing forward-auth: maps X-authentik-username (the user's email, injected by Authentik) -> that user's ServiceAccount token -> sets Authorization: Bearer -> kong-proxy. Dashboard auto-authenticates; users never see the token prompt. Mirrors the t3-dispatch pattern. Token map lives in a Secret (namespace-owners' cluster-read covers configmaps, not secrets). Verified: gheorghe->vabbit81 pods 200 + kube-system 200 (cluster-read); viktor->nodes 200 (admin); unmapped->401. namespace-owners auto-derived from k8s_users; admins hardcoded (their Authentik identity != k8s_users email). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	467eb7d7ee	claude-agent: grant shared pod executor powers (Forgejo PR, terragrunt apply, kubectl write, MCP) Elevates the shared claude-agent-service pod (SA claude-agent, ns claude-agent) so the nextcloud-todos-exec agent can run autonomously. Viktor explicitly chose to elevate the SHARED service knowing every agent on the pod inherits these creds — each grant is security-sensitive and flagged inline for review. Vault (stacks/vault/main.tf): - terraform-state k8s-auth role: add `claude-agent` to bound_service_account_names (was only `default` — the pod's own SA token could not log in, so scripts/tg apply died fetching the PG backend password). `default` kept. - terraform-state policy broadened from `database/static-creds/pg-terraform-state` read only to read on database/static-creds/, database/creds/, secret/data/* and secret/metadata/* — what stacks read at plan/apply time. FLAG: grants the shared pod broad Vault READ (effectively all app secrets + rotating DB creds); not denied: secret/data/vault. claude-agent-service stack (stacks/claude-agent-service/main.tf): - ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token, viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url). - git-init: add url.insteadOf rewrite to authenticate git pushes to forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API). - New claude-agent-exec ClusterRole+Binding: cluster-wide get/list/watch/create/update/patch/delete on core (incl. secrets), apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the existing read-only claude-agent role; does NOT bind cluster-admin. FLAG: very broad — close to cluster-admin in blast radius. - Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher sidecar (k8s-auth login role=terraform-state every 30m -> shared emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths. - MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP, ${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster). Not applied, not pushed — code only, for human review of the privilege grants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	b56a868b4e	wealthfolio-sync: podAffinity to co-locate with app pod (RWO multi-attach fix) The monthly wealthfolio-sync CronJob mounts the same RWO wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the always-running wealthfolio app Deployment. RWO attaches to only one node, but the sync had no affinity — so the 2026-06-01 run landed on node4 while the app held the volume on node3 and hung in ContainerCreating for 3 days (FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN. Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the sync always schedules onto the app's node, where the volume is already attached (RWO permits multiple pods on the same node). Verified: a fresh sync run co-located on node3, attached cleanly, and broker-sync started.	2026-06-05 09:19:10 +00:00
Viktor Barzin	8e44ccaa65	docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked) Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality: apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste. Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state, and add-user skill (onboarding now documents the dashboard token). oauth2-proxy + k8s-dashboard OIDC app noted as idle. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	e4c3fbbbbb	feat(authentik): adopt admin-services-restriction policy; admit kubernetes-* groups to k8s dashboard Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me was Home-Server-Admins-only. Carve-out: the dashboard host now also admits kubernetes-admins/power-users/namespace-owners so they can reach the login page; per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf). All other admin-only hosts unchanged. Policy adopted from UI into TF via import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	317989f9d5	feat(rbac): per-namespace-owner dashboard SA + long-lived token Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner (from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s) + cluster read-only, plus a long-lived token to paste into the dashboard 'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency. Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	4aa6e7a5af	chrome-service docs: clarify f1-stream is not a real caller stacks/f1-stream/files/backend/playback_verifier.py and chrome_browser.py describe an in-cluster CDP caller, but the deployed f1-stream image is built from github.com/ViktorBarzin/f1-stream which has neither file — verified by `kubectl exec ls /app/backend/` and grepping for 'CHROME' in the deployed pod. The infra/stacks/f1-stream/files/backend/ tree is a vestigial design that was never wired up to a build pipeline. Calling it out so the next reader doesn't waste time debugging why the migration "didn't take effect" — it took effect on dead code. The hourly snapshot-harvester CronJob is the only live in-cluster caller of the CDP endpoint today.	2026-06-05 09:19:10 +00:00
root	d479d5b4f9	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	b64d8d6168	cluster-health: add #47 ghost-disk drift check; fix immich_search set -e crash Check #47 "Proxmox CSI — Ghost-Disk Drift": per node, compares the real virtio-scsi CSI disk count in `qm config <vmid>` (SSH PVE) against the attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (query-pci QMP timeouts) that the scheduler's 28-LUN guard can't see — exactly the drift that wedged the MAM grabber on node3 (13 tracked vs 23 real). PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near the LUN cap). Already flagging node6 at 21 disks. Single `qm list` + one `qm config` per VM keeps it ~3s (the naive once-per-VM version timed out the parallel runner). Also fixes a PRE-EXISTING set -e crash in #46 immich_search (introduced by 138894cd): `pct=$(kubectl exec … | tr -d ' ')` and the dur_ms probe were unguarded, so with `set -o pipefail` a non-zero psql/exec propagated and tripped `set -e`, killing the check before json_add. It silently dropped from every parallel report and broke --serial entirely (whole run aborted). Guarded both substitutions with `|| true`; the existing `=~` numeric checks already handle the empty case. immich_search now reports PASS/WARN instead of vanishing.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ea1e4f793b	revert(k8s-dashboard): restore forward-auth ingress (apiserver OIDC unresolved) Dashboard back to the working forward-auth + kong-proxy state. The oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured AuthenticationConfiguration), despite verified signature, issuer, audience, email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app left deployed (idle, harmless) pending that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c958f6a589	feat(nextcloud-todos): Phase 4 IaC — service stack, Vault role, DB bootstrap, OpenClaw plugin, monitoring Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the Nextcloud Personal task list; classifies todos via local qwen3-8b and routes research/mutating work through claude-agent-service). Clones the recruiter-responder service pattern end-to-end. Written only — NOT applied. - stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled), two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate deployment with alembic-migrate init-container, ClusterIP svc, /cb-only HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated events -> internal svc URL, Bearer auth). - stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d rotation) + pg-nextcloud-todos in the postgresql allowed_roles array. - stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource (clone of pg_tripit_db) — creates role+DB, pins role search_path, and creates schema nextcloud_todos AUTHORIZATION nextcloud_todos. - stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container, nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path ESO key (secret/nextcloud-todos.webhook_bearer_token). - stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP monitor. Prometheus /metrics scrape via svc annotations in the new stack. - .gitleaksignore: allowlist two curl-auth-user false positives (the OCS webhook curl uses a Vault-sourced shell var, not a literal credential). KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	c3c3d5e010	feat(claude-agent-service): seed nextcloud-todos planner + exec agents Add cp lines in the seed-beads-agent init-container so the two new nextcloud-todos agent definitions (baked into the image at /usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in ~/.claude/agents/ at pod start. Phase 3, task 3.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	d0206848cb	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-05 09:19:09 +00:00
Viktor Barzin	55a8b238a0	mam-farming: migrate data volume proxmox-lvm → NFS The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM (k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then blocked every run → 0 grabs → MAMFarmingStuck. NFS (nfs_volume module, matching the other 9 servarr apps) removes this volume from the per-VM SCSI hotplug path entirely: it mounts over the network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across before the switch; verified a grabber run Succeeds on NFS on node4 with the preserved dedup list (tracked IDs carried over). Lever #1 from docs/architecture/storage.md "Per-VM SCSI-LUN cap".	2026-06-05 09:19:09 +00:00
Viktor Barzin	cb96d5d590	fix(k8s-dashboard): use email_verified=true + groups scope mappings The apiserver rejects the email username-claim when email_verified is false (invalid bearer token 401). Authentik external/social users are unverified, so the default scope-email mapping fails. Mirror the proven kubernetes provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and align oauth2-proxy scope to 'openid email profile groups'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	1042c0f082	fix(k8s-dashboard): set RS256 signing_key on Authentik OIDC provider Provider had signing_key=null → Authentik signed id_tokens with HS256 and served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature verification (500 'failed to verify id token signature' on the callback). Use the same 'authentik Self-signed Certificate' keypair the kubernetes provider uses. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	e436af8d8c	fix(k8s-dashboard): drop group-restriction policy; RBAC is the gate The Authentik group policy denied admins: it gated on kubernetes-* group membership, but cluster access is email-based RBAC (User bindings from k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group, so the gate locked him out. Apiserver RBAC is now the sole gate — matching the kubelogin CLI (authenticate freely, RBAC decides actions). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ad3432d685	docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver) Update authentication.md (structured multi-issuer AuthenticationConfiguration + dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state (new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth), and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config → re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint pivot from the original dual-aud approach. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	c9b22c7dd3	feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard issuer) and applies each user's own RBAC via the apiserver multi-issuer AuthenticationConfiguration. Committed so CI converges (uncommitted local applies were being reverted by the Woodpecker terragrunt-apply pipeline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	ed4ed6bd09	fix(k8s-dashboard): ignore Keel/tier drift on oauth2-proxy deployment	2026-06-05 09:19:09 +00:00
Viktor Barzin	75c2b6dc5e	feat(rbac): apiserver multi-issuer OIDC via structured AuthenticationConfiguration Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped, silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1 AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the dashboard via SSO while keeping the CLI issuer working. Remote script health-gates /livez and auto-rolls-back on failure (single master). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
github-actions[bot]	5b25ce1ec5	priority-pass: bump image_tag to 061a66ad [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `061a66ad3b`	2026-06-05 09:19:09 +00:00
Viktor Barzin	9c4335025d	feat(tripit): linked-email verification (SMTP + confirm carve-out) [ci skip] Adds outbound mail for linked-email verification: EMAIL_PROVIDER=smtp + SMTP_* app env (submits via the cluster mailserver as spam@, relayed by Brevo), SMTP_PASSWORD mapped to the existing PLANS_IMAP_PASSWORD (no new secret), and a token-gated /api/emails/confirm ingress carve-out (auth=none, like the calendar feed). Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	b8c55732e0	feat(k8s-dashboard): deploy oauth2-proxy (not yet wired to ingress) 2 replicas in kubernetes-dashboard ns; OIDC code-flow against the k8s-dashboard Authentik client, injects user id_token as Bearer upstream to kong-proxy. ESO syncs client/cookie secrets from Vault. Ingress still points at kong-proxy — no user-facing change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
root	7c4375d7cd	Woodpecker CI deploy [CI SKIP]	2026-06-05 09:19:09 +00:00
Viktor Barzin	4ed0c5a834	uptime-kuma: codify Traefik LB internal monitor at .203 (was stale .200) A hand-created (non-TF) uptime-kuma monitor "Traefik LoadBalancer" (id=95) port-checked 10.0.20.200:443 — the shared LB IP Traefik moved OFF on 2026-05-30 when it took its dedicated .203 (ETP=Local). It had been DOWN for ~5 days, surfacing as the cluster-health "uptime_kuma internal down(1)" WARN. Add it to local.internal_monitors as "Traefik LoadBalancer (10.0.20.203)" (port 10.0.20.203:443) so it's managed like the TP-Link/Proxmox direct-IP probes — a direct check of the MetalLB L2 + Traefik bind, complementing the [External] traefik (full CF path) and Traefik Dashboard (in-cluster) monitors. The sync CronJob created it (id=902, reporting UP @1ms); the orphan id=95 was deleted via the uptime-kuma API.	2026-06-05 09:19:08 +00:00
Viktor Barzin	011c63c92d	feat(k8s-dashboard): add Authentik OIDC app for dashboard SSO Confidential client k8s-dashboard + custom scope mapping emitting aud=[kubernetes,k8s-dashboard] + group-restriction policy (kubernetes-* RBAC groups). Additive — dashboard ingress unchanged. Token via Vault secret/k8s-dashboard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	549320f79c	docs(k8s-dashboard): SSO via Authentik oauth2-proxy — implementation plan [ci skip] Task-by-task plan: Vault secret, Authentik OIDC app (TF), oauth2-proxy deploy, ingress cutover with blocking audience-verification gate, docs. Additive + one revertible ingress repoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	23d87d8885	cluster-health #20: fix false NFS FAIL on Linux (nc -G is macOS-only) The NFS connectivity check fell through to `nc -z -G 3 192.168.1.127 2049` when `showmount` is absent (the DevVM ships no nfs-common). But `-G` is a macOS/Darwin-only connect-timeout flag — OpenBSD/GNU nc on Linux rejects it with "invalid option -- 'G'", so the elif failed and the check reported "NFS unreachable" on every Linux run even though port 2049 was wide open (confirmed via /dev/tcp). All deployment/PVC/statefulset checks were green throughout — a real PVE NFS outage would have taken down 30+ services. Fix: use the portable `-w` timeout flag, and add a final bash /dev/tcp fallback so the probe is correct even on hosts with neither showmount nor a usable nc.	2026-06-05 09:19:07 +00:00
Viktor Barzin	8b72eaebb0	docs(k8s-dashboard): SSO via Authentik oauth2-proxy — design [ci skip] Design for letting namespace-owner users (e.g. gheorghe/vabbit81) open the K8s Dashboard with their Authentik account, mapped to their per-user RBAC. oauth2-proxy fronts kong-proxy, runs the OIDC code-flow, and injects the user's id_token as Bearer so the apiserver applies existing namespace-owner bindings. Additive + one ingress repoint; multi-audience scope mapping keeps the CLI flow untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	38c77048fd	chore(travel-agent): decommission — merged into tripit [ci skip] travel-agent's transport-to-airport + weather-brief workflows now run inside tripit (DB-driven instead of CalDAV), so the standalone CronJob stack is retired (namespace + ExternalSecret + 2 CronJobs destroyed via scripts/tg). secret/travel-agent left in Vault as an archive. Applied locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	be4ee7315a	feat(tripit): proactive-nudge CronJobs (transport + weather brief) [ci skip] Merge travel-agent's two workflows into tripit (beads code-muqi): adds tripit-transport-nudge (08:00) + tripit-weather-brief (21:00) CronJobs on Europe/London, an optional per-job timezone, and SLACK_BOT_TOKEN + DAWARICH_API_KEY in the tripit-secrets ExternalSecret. Nudges post to Slack #travel + Web Push; Dawarich location via the public host (in-cluster *.svc is 403'd by Rails host-auth). Vault secret/tripit seeded from secret/travel-agent + secret/owntracks. Applied locally via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:06 +00:00
Viktor Barzin	a2fa912b44	cluster-health: add check #45 — HA Sofia Status Dashboard Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.	2026-06-05 09:19:06 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7302cd7908	infra: untrack generated backend.tf (stale PG creds + .200 literal) [CI SKIP] terragrunt generates backend.tf per run (remote_state generate, if_exists=overwrite_terragrunt) from get_env("PG_CONN_STR"); these 72 committed copies are stale artifacts already covered by .gitignore:65. They held a plaintext (Vault-rotated, ~expired) PG password + the .200 state-backend literal and were re-committed by CI on every run. git rm --cached stops that; they regenerate locally from PG_CONN_STR. The live .200:5432 literal now lives only in scripts/tg (its single bootstrap source). Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:52:46 +00:00
Viktor Barzin	7d7a0ad474	infra: fix stale Traefik LB-IP refs + accurate LB-IP registry Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead .200; this fixes the two in-Terraform ones and replaces the stale networking doc with an accurate registry + a renumber checklist. - woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200 (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and break pipeline creation). Now reads the Traefik ClusterIP dynamically via a kubernetes_service data source -- cannot rot on a future renumber and avoids the ETP=Local hairpin trap. - monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct). - docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	dcb7c74531	url/shlink: fix admin UI — pin shlink-web-client 4.7.1 + port 8080 The shlink-web admin SPA (shlink.viktorbarzin.me) showed "Something went wrong while loading short URLs". Root cause: the web client was untagged (:latest) and Keel's 2026-05-26 match-tag rewrite downgraded it to the ancient 0.1.1 (2019 image), which speaks the removed /rest/v1/authenticate API (404) and serves nginx on port 80. Backend (shlink:5.0.2) was healthy. Pin shlink-web-client to 4.7.1 (current stable; :latest/:stable resolve to it) and align container port + both probes + service target_port to 8080 (the port the 4.x nginx listens on). A clean semver anchor can no longer be Keel-downgraded to 0.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	c7cf21a986	Revert mail LAN-redirect approach; pending VIP-based redesign The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203 (Traefik LB IP) as the redirect source. That couples mail's LAN path to Traefik's IP choice — if Traefik moves again (it just moved .200 → .203 on 2026-05-30), the mail path silently breaks. Removing the script and the matching doc paragraph; keeping the networking.md .200 → .203 staleness fix (separate correction). Follow-up: give the mail HAProxy listener a dedicated pfSense Virtual IP (IP Alias on opt1), update Technitium internal zone + WAN port-forwards to target the VIP, so mail's LAN-side path is decoupled from any other service's LB IP.	2026-06-03 10:24:25 +00:00
Viktor Barzin	922d95af9c	Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0843e398b	Revert "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit 4cc9229e716b6683418a148a0f896442d5ab07ad.	2026-06-03 10:24:25 +00:00
Viktor Barzin	0c7ec3d470	tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse Reconciles the tripit stack source with live state and adds the forward flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded), filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's Authentik login identity). Adds an ingest-plans CronJob that polls spam@ filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target) so forwarded bookings are extracted and attached to the matching trip; IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
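
The threshold reasoning in commit 63ee655c08 (PrometheusBackupStale, 32d -> 40d) can be checked with a short calculation. This is a minimal Python sketch, not code from the repository: the helper names `first_sunday` and `max_first_sunday_gap` and the 2020-2040 sample range are illustrative assumptions. Only the two threshold values in seconds come from the commit message.

```python
from datetime import date, timedelta

DAY_S = 86400
OLD_THRESHOLD_S = 2764800  # 32 days, value quoted in the commit message
NEW_THRESHOLD_S = 3456000  # 40 days, value quoted in the commit message


def first_sunday(year: int, month: int) -> date:
    """Return the first Sunday of the given month (hypothetical helper)."""
    first = date(year, month, 1)
    # date.weekday(): Monday == 0 ... Sunday == 6
    return first + timedelta(days=(6 - first.weekday()) % 7)


def max_first_sunday_gap(start_year: int, end_year: int) -> int:
    """Largest gap in days between consecutive first Sundays (hypothetical helper)."""
    runs = [
        first_sunday(year, month)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]
    return max((later - earlier).days for earlier, later in zip(runs, runs[1:]))


if __name__ == "__main__":
    gap_days = max_first_sunday_gap(2020, 2040)  # sample range is an assumption
    print("max gap between runs (days):", gap_days)
    print("exceeds old threshold:", gap_days * DAY_S > OLD_THRESHOLD_S)
    print("within new threshold:", gap_days * DAY_S < NEW_THRESHOLD_S)
```

Consecutive first Sundays are always a whole number of weeks apart, so the gap is either 28 or 35 days. A 32-day threshold therefore sits between the two possible spacings, while 40 days clears the longer one.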


3975 commits