Commit graph

1224 commits

Author SHA1 Message Date
Viktor Barzin
147a8cff40 Restore f1-stream stack — undo accidental bundling into 63fe7d2b
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).

This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:12 +00:00
Viktor Barzin
90ad6b9125 fan-control: presence-aware IPMI fan curve for the R730 PVE host
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).

Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).

Design:  docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
c6f27fa172 wealth dashboard: enlarge returns numbers (drop stat name labels) [ci skip]
At h=4 the two stacked values per window panel were too small because each
also rendered its field-name label. Switch textMode value_and_name -> value
on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix
keep them self-identifying and the window stays in the panel title. Applied
via targeted tg apply of the configmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
dbe10a708c wealth dashboard: shrink returns stat panels to h=4 [ci skip]
The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to
h=4 (matching the overview stat cards directly above) and pull every panel
below up by 4 so the layout stays gap-free. Layout-only change — no panel
content/query touched. Applied via targeted tg apply of the configmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
fc1486c3dd wealth dashboard: replace returns table with per-window stat panels [ci skip]
Swap the single "Returns over time windows" table (panel 9201) for 5 stat
panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the
headline value + Δ market (£, net of contributions) as a second value,
colored red/green by sign.

Same per-window Modified-Dietz math as the old table, just scoped to one
interval per panel — verified against live wealthfolio_sync PG and reproduced
through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% /
£297,846, matching the prior table exactly). Kept the same 24×8 grid footprint
so nothing else on the dashboard reflows.

Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip]
because a full monitoring-stack CI apply would pull in unrelated pre-existing
drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
6cec27f8dc novelapp: bump Keel policy patch -> all (track any upstream version)
Explicitly own the keel.sh/policy annotation in TF (was relying on the
Kyverno-stamped `patch` default). Set policy=all + trigger=poll +
pollSchedule, expand ignore_changes per KEEL_LIFECYCLE_V1 to cover
Keel-written runtime annotations (change-cause, update-time, revision,
match-tag).
2026-06-05 09:19:11 +00:00
Viktor Barzin
9cb609f21a nextcloud-todos: register only the Created webhook (drop Updated)
The agent acts only on newly-created todos; the Updated listener re-fired on
every edit (incl. the agent's own note-append). Live Updated webhook (id=2)
already deleted via OCS API.
2026-06-05 09:19:11 +00:00
Viktor Barzin
3d0cba9dcb openclaw: pin 2026.2.26, resilient startup, SHA-pinned plugin init (recover from agentRuntime + configSchema crashloop)
Surfaced while installing the nextcloud-todos-api plugin (a pod roll):
- 2026.5.4 gateway rejects an openai-codex `agentRuntime` key it writes itself
  (commit 4b39cb72) -> crashloop on any restart. Pinned image back to 2026.2.26.
- startup steps (plugins enable / mcp set / memory index) backgrounded +
  timeout-guarded so a hung npm-install can never block the gateway.
- install-nextcloud-todos-plugin init SHA-pinned (:f85c6de1) + Always pull:
  IfNotPresent served a stale cached :latest, so the plugin manifest
  (configSchema) fix never landed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
root
c01a28e23c Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
6cff9bac26 freshrss: migrate extensions PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn)
Frees a per-node SCSI-LUN slot on node6 (20->19, under the check #47 >=20
WARN). FreshRSS extensions are static plugin files (no embedded DB; app DB is
external MySQL) -> NFS-safe. Empty volume (re-installable). Applied
deadlock-safe: -target deployment+module first (Recreate releases old PVC),
then full apply destroys the now-unused proxmox PVC.
2026-06-05 09:19:11 +00:00
root
11b092a589 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
e2d46ebd30 isponsorblocktv: migrate data PVC proxmox-lvm -> NFS (node6 LUN relief, code-dfjn)
Frees a per-node SCSI-LUN slot on node6 (21->20). Volume holds only
config.json (no embedded DB) -> NFS-safe. config pre-seeded to
/srv/nfs/isponsorblocktv before cutover. RWO-destroy deadlock during apply
(TF deleting the in-use old PVC before rolling the deployment) was broken by
patching the deployment claim to the NFS PVC; TF reconciled to the same value.
2026-06-05 09:19:11 +00:00
root
ace6ee59f9 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:11 +00:00
Viktor Barzin
adec2c135f fix(novelapp): also bind gheorghe's dashboard SA to novelapp admin
His app lives in novelapp, but the dashboard injects his SA token
(system:serviceaccount:vabbit81:dashboard-vabbit81), while the existing
binding only granted the OIDC User vabbit81@gmail.com (OIDC blocked). Add the
SA as a second subject so the web dashboard (token-injector) can manage
novelapp. Verified: SA can list/create in novelapp; injector path returns 200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
7114824c06 fix(rbac): tighten dashboard SA cluster-read to namespaces+nodes only
namespace-owners could read all tenants' pods/configmaps/etc cluster-wide
(read-only) via the broad namespace_owner_readonly role. Give the dashboard
SAs a dedicated dashboard-nav-readonly ClusterRole = namespaces + nodes (list)
only — enough for the dashboard namespace-picker/Nodes view, but no
cross-tenant resource reads. Own-namespace access (admin) unchanged. Verified:
gheorghe can list namespaces/nodes + full vabbit81, but list pods/configmaps -A
= no, other namespaces = no.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
dd2a8e640f monitoring: right-size loki memory request 3Gi->1Gi (quota 89%->79%)
monitoring-quota requests.memory sat at 89% (18.2/20Gi), tripping the
ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real
usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi.
prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup;
OOMs at 3Gi during WAL replay) so it stays; grafana's main container is
already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi
Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the
WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit
(bump the 20Gi quota) as the cluster expands.
2026-06-05 09:19:11 +00:00
Viktor Barzin
bf6ede2b9e vault: deny secret/data/vault for claude-agent terraform-state policy (executor elevation safety narrowing)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
63ee655c08 monitoring: fix PrometheusBackupStale false-fire (32d->40d threshold)
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
Consecutive first-Sundays can be ~35 days apart (e.g. May 3 -> Jun 7), but
the alert threshold was 32d (2764800s) -> it false-fired every year for the
~3 days between day-32 and the next run. Raised to 40d (3456000s): clears
the max first-Sunday spacing with margin, still catches a genuinely missed
monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified:
live rule now > 3.456e6, alert state inactive.
2026-06-05 09:19:10 +00:00
Viktor Barzin
d649f4f287 feat(k8s-dashboard): auto-inject per-user SA token (no token-paste)
nginx token-injector behind the existing forward-auth: maps X-authentik-username
(the user's email, injected by Authentik) -> that user's ServiceAccount token ->
sets Authorization: Bearer -> kong-proxy. Dashboard auto-authenticates; users
never see the token prompt. Mirrors the t3-dispatch pattern. Token map lives in a
Secret (namespace-owners' cluster-read covers configmaps, not secrets). Verified:
gheorghe->vabbit81 pods 200 + kube-system 200 (cluster-read); viktor->nodes 200
(admin); unmapped->401. namespace-owners auto-derived from k8s_users; admins
hardcoded (their Authentik identity != k8s_users email).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
467eb7d7ee claude-agent: grant shared pod executor powers (Forgejo PR, terragrunt apply, kubectl write, MCP)
Elevates the shared claude-agent-service pod (SA claude-agent, ns
claude-agent) so the nextcloud-todos-exec agent can run autonomously.
Viktor explicitly chose to elevate the SHARED service knowing every
agent on the pod inherits these creds — each grant is security-sensitive
and flagged inline for review.

Vault (stacks/vault/main.tf):
- terraform-state k8s-auth role: add `claude-agent` to
  bound_service_account_names (was only `default` — the pod's own SA
  token could not log in, so scripts/tg apply died fetching the PG
  backend password). `default` kept.
- terraform-state policy broadened from `database/static-creds/pg-terraform-state`
  read only to read on database/static-creds/*, database/creds/*,
  secret/data/* and secret/metadata/* — what stacks read at plan/apply
  time. FLAG: grants the shared pod broad Vault READ (effectively all app
  secrets + rotating DB creds); not denied: secret/data/vault.

claude-agent-service stack (stacks/claude-agent-service/main.tf):
- ExternalSecret: add FORGEJO_TOKEN (secret/ci/global -> forgejo_push_token,
  viktor-scoped admin PAT) and HA_MCP_URL (secret/openclaw -> ha_sofia_mcp_url).
- git-init: add url.insteadOf rewrite to authenticate git pushes to
  forgejo.viktorbarzin.me with $FORGEJO_TOKEN (PRs opened via Forgejo API).
- New claude-agent-exec ClusterRole+Binding: cluster-wide
  get/list/watch/create/update/patch/delete on core (incl. secrets),
  apps, batch, networking.k8s.io, rbac roles/rolebindings. Additive to the
  existing read-only claude-agent role; does NOT bind cluster-admin. FLAG:
  very broad — close to cluster-admin in blast radius.
- Vault login: VAULT_ADDR + VAULT_K8S_ROLE env + vault-token-refresher
  sidecar (k8s-auth login role=terraform-state every 30m -> shared
  emptyDir); main container symlinks ~/.vault-token so scripts/tg auto-auths.
- MCP: project-scoped .mcp.json at infra repo root wires `ha` (HTTP,
  ${HA_MCP_URL}) and `paperless` (in-cluster Service, no token in-cluster).

Not applied, not pushed — code only, for human review of the privilege grants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
b56a868b4e wealthfolio-sync: podAffinity to co-locate with app pod (RWO multi-attach fix)
The monthly wealthfolio-sync CronJob mounts the same RWO
wealthfolio-data-encrypted volume (shared wealthfolio.db SQLite) as the
always-running wealthfolio app Deployment. RWO attaches to only one node,
but the sync had no affinity — so the 2026-06-01 run landed on node4 while
the app held the volume on node3 and hung in ContainerCreating for 3 days
(FailedAttachVolume / Multi-Attach), surfacing as a problematic_pods WARN.

Add a required podAffinity (app=wealthfolio, topologyKey hostname) so the
sync always schedules onto the app's node, where the volume is already
attached (RWO permits multiple pods on the same node). Verified: a fresh
sync run co-located on node3, attached cleanly, and broker-sync started.
2026-06-05 09:19:10 +00:00
Viktor Barzin
e4c3fbbbbb feat(authentik): adopt admin-services-restriction policy; admit kubernetes-* groups to k8s dashboard
Namespace-owners (e.g. gheorghe) were blocked at forward-auth — k8s.viktorbarzin.me
was Home-Server-Admins-only. Carve-out: the dashboard host now also admits
kubernetes-admins/power-users/namespace-owners so they can reach the login page;
per-namespace access is still enforced by the pasted SA token (dashboard-sa.tf).
All other admin-only hosts unchanged. Policy adopted from UI into TF via import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
317989f9d5 feat(rbac): per-namespace-owner dashboard SA + long-lived token
Pragmatic dashboard access while OIDC SSO is blocked: each namespace-owner
(from k8s_users) gets a ServiceAccount scoped to admin on their namespace(s)
+ cluster read-only, plus a long-lived token to paste into the dashboard
'Token' login. Real per-namespace isolation, no apiserver-OIDC dependency.
Verified: vabbit81 SA = admin in vabbit81, read-only elsewhere, no cross-ns write.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
root
d479d5b4f9 Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:10 +00:00
Viktor Barzin
deede6dd11 chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline
The chrome-service stack ran `playwright launch-server`, which creates
ephemeral browser contexts per `connect()`. Despite the encrypted PVC
mounted at /profile, no chromium user-data ever persisted — only npm
cache + fontconfig. Logging in via noVNC was effectively a no-op.

Refactor:
- Replace launch-server with direct chromium (TCP CDP on :9223 internal),
  fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host
  header to bypass Chrome's hardcoded DNS-rebinding protection (no
  `--remote-allow-hosts` flag exists in stock Chrome 130; verified by
  binary string grep). Bridge also forces Connection: close on HTTP
  responses so Node ws opens a fresh TCP for the WS upgrade rather than
  trying to reuse the dead keep-alive socket.
- Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage
  actually persist on the encrypted PVC.
- New snapshot-server sidecar (stdlib python HTTP) serves
  GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot,
  bearer-token-gated by the existing api_bearer_token.
- New chrome-service-snapshot-harvester CronJob (hourly) connects via
  CDP, dumps storage_state() (cookies + localStorage), writes atomically
  to /profile/snapshots/storage-state.json.
- NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik.

Caller migration:
- f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`,
  env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no
  longer used by code; ExternalSecret kept for symmetry with the snapshot
  endpoint).

Dev-box side (out of scope for this commit — see ~/.config/systemd/user/):
- playwright-mcp.service flips to `--isolated --storage-state=...`
  so per-Claude-Code-session ephemeral contexts seed from the snapshot.
- playwright-snapshot-refresh.{service,timer} (hourly) pulls the
  snapshot via the bearer-gated HTTPS endpoint.

Docs updated:
- docs/architecture/chrome-service.md — new architecture diagram + wire protocol.
- docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation,
  failure modes, restore).
- stacks/chrome-service/README.md — connect_over_cdp recipe.

Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.
2026-06-05 09:19:10 +00:00
Viktor Barzin
ea1e4f793b revert(k8s-dashboard): restore forward-auth ingress (apiserver OIDC unresolved)
Dashboard back to the working forward-auth + kong-proxy state. The
oauth2-proxy SSO path is blocked by a deeper issue: the apiserver rejects
ALL valid Authentik OIDC tokens (both legacy --oidc-* flags and structured
AuthenticationConfiguration), despite verified signature, issuer, audience,
email_verified, synced clock, and reachable+trusted JWKS. Needs dedicated
apiserver-OIDC investigation. oauth2-proxy + k8s-dashboard Authentik app
left deployed (idle, harmless) pending that.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
c958f6a589 feat(nextcloud-todos): Phase 4 IaC — service stack, Vault role, DB bootstrap, OpenClaw plugin, monitoring
Phase 4 infrastructure-as-code for the nextcloud-todos service (watches the
Nextcloud Personal task list; classifies todos via local qwen3-8b and routes
research/mutating work through claude-agent-service). Clones the
recruiter-responder service pattern end-to-end. Written only — NOT applied.

- stacks/nextcloud-todos/{main.tf,terragrunt.hcl}: new aux stack cloning
  recruiter-responder — ns (tier aux, istio-injection disabled, keel enrolled),
  two ExternalSecrets (vault-kv app secrets + vault-database DSN), Recreate
  deployment with alembic-migrate init-container, ClusterIP svc, /cb-only
  HMAC-gated ingress (auth=none, proxied), and an idempotent webhook-register
  null_resource (OCS webhook_listeners API, both CalendarObject Created/Updated
  events -> internal svc URL, Bearer auth).
- stacks/vault/main.tf: pg_nextcloud_todos static role (nextcloud_todos, 7d
  rotation) + pg-nextcloud-todos in the postgresql allowed_roles array.
- stacks/dbaas/modules/dbaas/main.tf: pg_nextcloud_todos_db null_resource
  (clone of pg_tripit_db) — creates role+DB, pins role search_path, and
  creates schema nextcloud_todos AUTHORIZATION nextcloud_todos.
- stacks/openclaw/main.tf: install-nextcloud-todos-plugin init-container,
  nextcloud-todos-api in plugins.allow + the doctor-fix re-add + plugins
  enable, NEXTCLOUD_TODOS_URL/NEXTCLOUD_TODOS_TOKEN env, and the cross-path
  ESO key (secret/nextcloud-todos.webhook_bearer_token).
- stacks/uptime-kuma/modules/uptime-kuma/main.tf: internal /healthz HTTP
  monitor. Prometheus /metrics scrape via svc annotations in the new stack.
- .gitleaksignore: allowlist two curl-auth-user false positives (the OCS
  webhook curl uses a Vault-sourced shell var, not a literal credential).

KV seed (secret/nextcloud-todos) + applies are deferred to the apply runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
c3c3d5e010 feat(claude-agent-service): seed nextcloud-todos planner + exec agents
Add cp lines in the seed-beads-agent init-container so the two new
nextcloud-todos agent definitions (baked into the image at
/usr/share/agent-seed/ by the claude-agent-service Dockerfile) land in
~/.claude/agents/ at pod start. Phase 3, task 3.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
github-actions[bot]
d0206848cb priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-05 09:19:09 +00:00
Viktor Barzin
55a8b238a0 mam-farming: migrate data volume proxmox-lvm → NFS
The grabber + bp-spender shared a 1Gi proxmox-lvm RWO PVC holding two
plain-text files (mam_id cookie + grabbed_ids.txt dedup list — no embedded
DB). On 2026-06-04 a grabber pod wedged in ContainerCreating for >1h because
its proxmox-lvm disk couldn't hot-plug onto a SCSI-LUN-saturated node VM
(k8s-node3, QEMU `query-pci` QMP timeout); `concurrencyPolicy: Forbid` then
blocked every run → 0 grabs → MAMFarmingStuck.

NFS (nfs_volume module, matching the other 9 servarr apps) removes this
volume from the per-VM SCSI hotplug path entirely: it mounts over the
network, consumes zero LUN slots, and is RWX so the grabber + bp-spender can
co-schedule on any node. Data (mam_id + grabbed_ids.txt) was copied across
before the switch; verified a grabber run Succeeds on NFS on node4 with the
preserved dedup list (tracked IDs carried over). Lever #1 from
docs/architecture/storage.md "Per-VM SCSI-LUN cap".
2026-06-05 09:19:09 +00:00
Viktor Barzin
cb96d5d590 fix(k8s-dashboard): use email_verified=true + groups scope mappings
The apiserver rejects the email username-claim when email_verified is false
(invalid bearer token 401). Authentik external/social users are unverified,
so the default scope-email mapping fails. Mirror the proven kubernetes
provider: use the custom 'Kubernetes Email (verified)' mapping (hardcodes
email_verified=true) + 'Kubernetes Groups'. Drop the now-unneeded dual-aud
mapping (apiserver trusts the k8s-dashboard issuer w/ audience=client_id) and
align oauth2-proxy scope to 'openid email profile groups'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
1042c0f082 fix(k8s-dashboard): set RS256 signing_key on Authentik OIDC provider
Provider had signing_key=null → Authentik signed id_tokens with HS256 and
served an empty JWKS, so oauth2-proxy (and the apiserver) failed signature
verification (500 'failed to verify id token signature' on the callback).
Use the same 'authentik Self-signed Certificate' keypair the kubernetes
provider uses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
e436af8d8c fix(k8s-dashboard): drop group-restriction policy; RBAC is the gate
The Authentik group policy denied admins: it gated on kubernetes-* group
membership, but cluster access is email-based RBAC (User bindings from
k8s_users), not group-based. vbarzin@gmail.com (Home Server Admins) gets
cluster-admin via oidc-admin-vbarzin but isn't in any kubernetes-* group,
so the gate locked him out. Apiserver RBAC is now the sole gate — matching
the kubelogin CLI (authenticate freely, RBAC decides actions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
c9b22c7dd3 feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO
Dashboard now authenticates via Authentik (oauth2-proxy, k8s-dashboard
issuer) and applies each user's own RBAC via the apiserver multi-issuer
AuthenticationConfiguration. Committed so CI converges (uncommitted local
applies were being reverted by the Woodpecker terragrunt-apply pipeline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
ed4ed6bd09 fix(k8s-dashboard): ignore Keel/tier drift on oauth2-proxy deployment 2026-06-05 09:19:09 +00:00
Viktor Barzin
75c2b6dc5e feat(rbac): apiserver multi-issuer OIDC via structured AuthenticationConfiguration
Replace the legacy single --oidc-* flags (which kubeadm v1.34 had wiped,
silently disabling apiserver OIDC) with an apiserver.config.k8s.io/v1
AuthenticationConfiguration trusting BOTH the kubernetes (CLI) and
k8s-dashboard (oauth2-proxy) issuers. Enables per-user RBAC for the
dashboard via SSO while keeping the CLI issuer working. Remote script
health-gates /livez and auto-rolls-back on failure (single master).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
github-actions[bot]
5b25ce1ec5 priority-pass: bump image_tag to 061a66ad [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 061a66ad3b
2026-06-05 09:19:09 +00:00
Viktor Barzin
9c4335025d feat(tripit): linked-email verification (SMTP + confirm carve-out) [ci skip]
Adds outbound mail for linked-email verification: EMAIL_PROVIDER=smtp + SMTP_*
app env (submits via the cluster mailserver as spam@, relayed by Brevo),
SMTP_PASSWORD mapped to the existing PLANS_IMAP_PASSWORD (no new secret), and a
token-gated /api/emails/confirm ingress carve-out (auth=none, like the calendar
feed). Applied locally via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
b8c55732e0 feat(k8s-dashboard): deploy oauth2-proxy (not yet wired to ingress)
2 replicas in kubernetes-dashboard ns; OIDC code-flow against the
k8s-dashboard Authentik client, injects user id_token as Bearer upstream
to kong-proxy. ESO syncs client/cookie secrets from Vault. Ingress still
points at kong-proxy — no user-facing change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
root
7c4375d7cd Woodpecker CI deploy [CI SKIP] 2026-06-05 09:19:09 +00:00
Viktor Barzin
4ed0c5a834 uptime-kuma: codify Traefik LB internal monitor at .203 (was stale .200)
A hand-created (non-TF) uptime-kuma monitor "Traefik LoadBalancer" (id=95)
port-checked 10.0.20.200:443 — the shared LB IP Traefik moved OFF on
2026-05-30 when it took its dedicated .203 (ETP=Local). It had been DOWN
for ~5 days, surfacing as the cluster-health "uptime_kuma internal down(1)"
WARN.

Add it to local.internal_monitors as "Traefik LoadBalancer (10.0.20.203)"
(port 10.0.20.203:443) so it's managed like the TP-Link/Proxmox direct-IP
probes — a direct check of the MetalLB L2 + Traefik bind, complementing the
[External] traefik (full CF path) and Traefik Dashboard (in-cluster)
monitors. The sync CronJob created it (id=902, reporting UP @1ms); the
orphan id=95 was deleted via the uptime-kuma API.
2026-06-05 09:19:08 +00:00
Viktor Barzin
011c63c92d feat(k8s-dashboard): add Authentik OIDC app for dashboard SSO
Confidential client k8s-dashboard + custom scope mapping emitting
aud=[kubernetes,k8s-dashboard] + group-restriction policy (kubernetes-*
RBAC groups). Additive — dashboard ingress unchanged. Token via Vault
secret/k8s-dashboard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:07 +00:00
Viktor Barzin
f201e4573e immich: fix slow context search — prewarm clip_index + latency alert/healthcheck
Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.

- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
  whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
  reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
  ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:07 +00:00
Viktor Barzin
38c77048fd chore(travel-agent): decommission — merged into tripit [ci skip]
travel-agent's transport-to-airport + weather-brief workflows now run inside
tripit (DB-driven instead of CalDAV), so the standalone CronJob stack is
retired (namespace + ExternalSecret + 2 CronJobs destroyed via scripts/tg).
secret/travel-agent left in Vault as an archive. Applied locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:07 +00:00
Viktor Barzin
be4ee7315a feat(tripit): proactive-nudge CronJobs (transport + weather brief) [ci skip]
Merge travel-agent's two workflows into tripit (beads code-muqi): adds
tripit-transport-nudge (08:00) + tripit-weather-brief (21:00) CronJobs on
Europe/London, an optional per-job timezone, and SLACK_BOT_TOKEN +
DAWARICH_API_KEY in the tripit-secrets ExternalSecret. Nudges post to Slack
#travel + Web Push; Dawarich location via the public host (in-cluster *.svc
is 403'd by Rails host-auth). Vault secret/tripit seeded from
secret/travel-agent + secret/owntracks. Applied locally via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:06 +00:00
Viktor Barzin
98f29edf34 technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP
In-cluster pods resolved forgejo.viktorbarzin.me to the public IP
(176.12.22.76) and hairpinned out through the WAN gateway, intermittently
timing out buildkit pushes from Woodpecker build pods (which, unlike
kubelet, don't use the per-node containerd Forgejo mirror). This silently
failed CI build-and-push for Forgejo-hosted repos (recruiter-responder
pipelines #15-#18 at the push step).

Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me
traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP
(reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target
auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's
*.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod
woodpecker-server hostAlias belt-and-suspenders.

Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7
unrelated pre-existing drifts in the stack) + verified:
- pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP)
- recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP

Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo
registry quick-ref). Advances beads code-yh33.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 07:34:30 +00:00
Viktor Barzin
7302cd7908 infra: untrack generated backend.tf (stale PG creds + .200 literal) [CI SKIP]
terragrunt generates backend.tf per run (remote_state generate,
if_exists=overwrite_terragrunt) from get_env("PG_CONN_STR"); these 72 committed
copies are stale artifacts already covered by .gitignore:65. They held a
plaintext (Vault-rotated, ~expired) PG password + the .200 state-backend literal
and were re-committed by CI on every run. git rm --cached stops that; they
regenerate locally from PG_CONN_STR. The live .200:5432 literal now lives only
in scripts/tg (its single bootstrap source).

Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:52:46 +00:00
Viktor Barzin
7d7a0ad474 infra: fix stale Traefik LB-IP refs + accurate LB-IP registry
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.

- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
  (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
  break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
  kubernetes_service data source -- cannot rot on a future renumber and avoids
  the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
  -> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
  KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
  registry + LB-IP renumber checklist (in-band + out-of-band consumers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
Viktor Barzin
dcb7c74531 url/shlink: fix admin UI — pin shlink-web-client 4.7.1 + port 8080
The shlink-web admin SPA (shlink.viktorbarzin.me) showed "Something went
wrong while loading short URLs". Root cause: the web client was untagged
(:latest) and Keel's 2026-05-26 match-tag rewrite downgraded it to the
ancient 0.1.1 (2019 image), which speaks the removed /rest/v1/authenticate
API (404) and serves nginx on port 80. Backend (shlink:5.0.2) was healthy.

Pin shlink-web-client to 4.7.1 (current stable; :latest/:stable resolve to
it) and align container port + both probes + service target_port to 8080
(the port the 4.x nginx listens on). A clean semver anchor can no longer be
Keel-downgraded to 0.1.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
Viktor Barzin
922d95af9c Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse"
This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.
2026-06-03 10:24:25 +00:00