infra

Author	SHA1	Message	Date
Viktor Barzin	bc5aba34b6	meshcentral: fix agent connectivity behind Authentik + TLS-offload Traefik Two root causes kept all 8 mesh agents (incl. family laptops) offline: 1. The single ingress gated the ENTIRE site (path "/") behind Authentik forward-auth, so the agent/relay endpoints (/agent.ashx, /meshrelay.ashx, /control.ashx, etc.) got 302-bounced to SSO. Native mesh clients can't do the OAuth cookie dance. Fix: add a second ingress_factory (auth="none") path-scoped to the agent endpoints, pointing at the same meshcentral service. Traefik routes by rule length so these out-prioritise the "/" catch-all; the human web UI stays Authentik-gated. 2. After the auth fix, agents reached /agent.ashx but were rejected with "Agent bad web cert hash" — MeshCentral pins the OUTER TLS cert, but with TLS offload the agent sees Traefik's Let's Encrypt cert (which differs between the internal .203 LB and the external Cloudflare path, and rotates monthly), not MeshCentral's own webserver cert. Fix: set ignoreAgentHashCheck=true in the init-container config so MeshCentral echoes back the agent-reported hash. The separate mesh-certificate (ServerID) handshake still authenticates the server. Verified: agent paths no longer 302->authentik; web UI root still does; laptop "Valia_Laptop" enrolled in group "laptops" and ONLINE. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	01ea7d6fa1	immich: clip-keepalive CronJob to pin smart-search model warm MACHINE_LEARNING_MODEL_TTL=600 is a single global knob, so it unloads the CLIP textual (smart-search) encoder after idle exactly like OCR/face — immich has no per-model pin. This CronJob pings the textual encoder every 5 min (< the 600s TTL) via immich-ml /predict, so a search query never pays the ~1.5s cold-load, while idle OCR/face still free their VRAM on the shared T4. Textual-only (search = text->embedding->pgvector); the visual encoder is import-time and left to unload. curl baked into the image (no runtime install). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	f0948493b3	claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY) The service now runs agent calls concurrently (bounded semaphore, per-job isolated clones) instead of single-flight. Infra side: - mount git-crypt-key into the main container (each job re-unlocks its own clone) - MAX_CONCURRENCY=10 env (excess calls queue FIFO) - bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom - docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot Service code: viktor/claude-agent-service @ 66104a3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	16763464cd	job-hunter dashboard: role panels now respect the $location filter The role panels (Top roles, Top companies by role volume, New roles/day, Roles by source, Salary distribution) had no location filter, so they showed all locations regardless of the $location dropdown. Add 'primary_location IN (${location:sqlstring})' to each (matching the comp panels' pattern). Also switch the 'Your comp vs the market' panel from hardcoded 'london' to the same $location filter for consistency. Data was fine (all london-tagged roles genuinely contain 'london'). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:35:25 +00:00
Viktor Barzin	7a7abe4cbe	uk-payslip dashboard: count gross comp on taxable_pay (P60) basis The 'Yearly receipt' + 'YTD gross salary' panels summed salary+bonus+rsu_vest (rsu_vest = net/partial RSU), understating gross by ~£73k/yr. Switch to COALESCE(taxable_pay, gross_pay) + pension_sacrifice = true P60 gross (verified: 23/24 -> £286,288, 25/26 -> £416,646, matching the P60 + job-hunter realized bar). 'Yearly receipt' rsu_gross is now the real gross RSU (£150k/£271k, not £70k/£128k). Relabel the Sankey RSU inflow 'RSU (net vested)' for honesty; leave cash-flow/net_pay + the (taxable_pay-based) reconciliation/rate panels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 23:23:15 +00:00
Viktor Barzin	deb0dd4778	monitoring: "Your comp vs the market" panel on Job Hunter dashboard Add a barchart (panel 10) ranking every company's London p50 total comp (COALESCE total/base) with the user's current comp shown in line, so it's a direct "how do I compare" view. The user's figure is NOT hardcoded in the dashboard JSON — it's a labeled comp_point in the DB (company_slug 'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number out of git. It's below the £500k alert bar (no Slack ping) and ranks too low to appear in analyze leaders. Runbook documents the panel + how to update the baseline. [ci skip] — dashboard ConfigMap applied locally (targeted). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 21:27:26 +00:00
Viktor Barzin	74313149dd	job-hunter: weekly above-target Slack alert CronJob Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh): `python -m job_hunter alert --threshold 500000 --location london --slack` posts to Slack the companies whose London p50 total comp >= £500k, flagging any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via the job-hunter-secrets ExternalSecret from Vault secret/job-hunter slack_webhook_url (seeded from the shared workspace webhook; repointable to a dedicated channel). Runbook gains an "above-target Slack alert" section. [ci skip] — applied locally (stack-scoped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:49:42 +00:00
Viktor Barzin	5dc5cd53c0	url/shlink: ingress url.viktorbarzin.me auth required -> none Authentik forward-auth on the shlink REST API + short-link domain (url.viktorbarzin.me) 302s shlink-web's cross-origin API XHR (CORS preflight) and SSO-bounces every public short link. Result: the admin UI showed "Something went wrong while loading short URLs" and short links never resolved for logged-out clients. The shlink REST API is self-gated by its X-Api-Key and short links are public by design, so Authentik must not front this domain. CrowdSec + rate-limit + anti-AI bot-block still apply. The admin web UI (shlink.viktorbarzin.me) stays auth=required via module.ingress-web. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:37:33 +00:00
Viktor Barzin	fe8db19aaf	job-hunter: build-triggers-deploy model; CronJob :latest + docs CI now drives the Deployment rollout (kubectl set image to the build SHA in .woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment runs whatever CI last set (image ignore_changes keeps TF from fighting it), and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly run). Keel stays enrolled in parallel as a redundant net. Docs: rewrite the runbook "Deploying" section for build-triggers-deploy; record the reversal of decision #12 in the auto-upgrade design doc (owned apps drive their own rollout, Keel parallel — upstream stays Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section. [ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:24:50 +00:00
Viktor Barzin	052c776eba	immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm. Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:16:11 +00:00
Viktor Barzin	cda858d560	job-hunter: weekly refresh CronJob + ops/analyst runbook Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs `refresh --source ats --source hn --source levels_fyi`, which upserts roles/comp AND appends the dated comp_snapshots/roles_snapshots series consumed by `job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container so a refresh never runs against an un-migrated DB; concurrency Forbid, backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore. Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and analyst (the analyze report, query recipes, SQL trend queries against the snapshot tables, interpretation caveats) sections. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:37:57 +00:00
Viktor Barzin	87f1dcb72d	wealth: consolidation chunk 2 — net-pay $grain merge, Trend projection, row reorg Completes the 36->17 consolidation: - 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg). - Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it renders regardless of the dashboard time range — fixes empty-by-default. Drops the duplicate projection-row stat cards + the how-to-view text panel. - Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns & contributions / Income vs market / Holdings / RSUs (META) / Projections. All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg source. Visual review pending. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	a587f0ee55	t3code: ingress -> devvm dispatch+autopair (retire in-cluster nginx) stacks/t3code now points the Authentik-gated ingress at the DevVM t3-dispatch service (Service+Endpoints -> 10.0.10.10:3780) instead of the in-cluster nginx, which is removed. Per-user routing + session auto-injection now live on DevVM. Verified: external 302->Authentik; in-cluster vbarzin/emil.barzin->302 (auto-pair to own instance), unmapped->403. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	5e4f83d4e7	wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo 36 -> 19 panels (chunk 1 of 2), zero metric loss: - 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)" - 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over time windows" table (window × Δall/Δmkt/return%) - 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line, timeFrom=10y so full history always shows) All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel, row reorg) to follow. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:27:09 +00:00
Viktor Barzin	73cb0aab8b	t3code: per-user isolation via Authentik + nginx username dispatcher t3 is single-owner (no in-app multi-user), so each person runs their own `t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service), emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the Authentik-injected X-authentik-username to the right instance; unmapped identities get 403 (no shared fallback). Flipped the ingress auth app→required (Authentik forward-auth) — the same-origin self-served UI works behind it (WS carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate. Mirrors the terminal stack's per-user model. Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403; t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes intentionally unsupported here — deferred until the native app is published. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:38:06 +00:00
Viktor Barzin	f807050eb5	cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip] The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit `0c01adac`). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage. Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move. Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk). Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	f364399ede	wealth: add 30y net-worth projection row + align net-pay panel Implements the committed projections design (docs/plans/2026-05-28-wealth-projections-{design,plan}.md): a collapsed "Projections" row on the wealth dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto, horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-3y historical line + a base-rate compounding-only line), 3 stat cards, and a text panel with one-click future time-range links. Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns (~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window is only ~4 months, and the true all-time geomean is skewed by 2021's +86%. Also aligns "Net pay vs market gain — per month" to consecutive month-end deltas (same fix as the other monthly panels). Verified all SQL live. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	32e1042ca8	t3code: expose `t3 serve` (DevVM) publicly at t3.viktorbarzin.me (app-tier) New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints → 10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied, auth="app"). t3 ships its own owner-pairing + bearer-session auth, so Authentik forward-auth is intentionally omitted — it would break the cross-origin native mobile app and app.t3.codes (bearer-only, no Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier) rate-limit the public surface; t3's pairing is the gate. TLS is auto-synced into the namespace by Kyverno's sync-tls-secret policy. Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200. Trade-off (public RCE surface behind app-native auth, no Authentik SSO) accepted 2026-06-01 to keep the native app + app.t3.codes working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	c5e4b1ea71	kms: add /diag anonymous telemetry collector behind Anubis carve-out The PowerShell activation scripts POST small JSON diagnostics to /diag so script execution errors are captured. The collector (python:3.12-alpine, ConfigMap-mounted) prints each event to stdout as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making events searchable in Grafana (Loki only — no Slack, no Prometheus). Like /scripts, /diag needs a second ingress_factory carve-out with full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW challenge that PowerShell/curl can't solve. Without full_host the factory would derive kms-diag.viktorbarzin.me and the carve-out would never match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	5c77482a8c	fire-planner: LLM_MODEL env var → qwen3vl-4b default (fits in current GPU headroom; immich-ml is holding ~10GB)	2026-06-01 19:50:41 +00:00
Viktor Barzin	fb1e47a20a	nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9 bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around both failure modes: - F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true; Job deadline bumped 120->600s so it isn't killed mid-migration. - F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade CrashLoop): chart_values renders the live tag via a plural kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on fresh install/DR), so a re-render never downgrades below live. Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and its background-controller overrides a TF-set value, and patch == minor for Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the per-workload keel.sh/policy override resources to avoid perpetual drift; ns enrollment + Kyverno now own the keel annotations like other workloads. Also bumps the external-storage bootstrap Job create timeout 1m->12m to match its own 10m pod-wait, since Keel bumps now roll the pod mid-apply. Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.	2026-06-01 19:50:41 +00:00
Viktor Barzin	50d0f1affa	kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix) The 2026-05-26 migration flipped the keel default force->patch and dropped match-tag from the inject-keel-annotations patch, but Kyverno's add-only mutate can't remove an annotation that's no longer listed -- 194 workloads kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in multi-image pods: the blog's nginx<->nginx-exporter images were swapped and the site was down 2026-05-26 -> 06-01 (nginx received the exporter's -nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently swapped (app lost its /datastore PVC + env, ran ephemeral for days). - policy now sets keel.sh/match-tag=null (strips on admission, never re-added) - swept the annotation off all 194 existing workloads (kubectl, no pod restart) - AGENTS.md: documents the strip; post-mortem added blog + changedetection un-swapped via kubectl set image (TF-ignored images); both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG state authoritative). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	769ae7a6d3	traefik: bot-block-proxy buffer 256k + document the real HTTP/2 limit Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers to 256k and corrected the rationale. Investigation found the binding limit for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not exposed by Traefik config) — oversized authentik_proxy_* cookie piles are rejected at the h2 layer upstream of bot-block regardless of these buffers. The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation (or clearing cookies); these buffers only prevent bot-block being a tighter bottleneck for sub-64KB piles + HTTP/1.1 clients. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:27 +00:00
Viktor Barzin	3d28870e25	nextcloud: fix backup retention to sort by name, not mtime The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used `ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's mtime, so the freshest backup didn't sort as newest — the retention step deleted the new backup and kept a stale one. Sort lexically (chronological for these names) and keep the last. Also exclude html/ (the app code, reproducible from the now-pinned image; the real config lives at config/config.php, html/config is empty) so the backup is config+data+custom_apps only → ~4.3G (<5G target). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
root	84ab4c998c	Woodpecker CI deploy [CI SKIP]	2026-06-01 15:15:26 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	af4bfbe046	kms: revert files accidentally bundled into the docs commit The previous commit (81a7d804) swept in 23 unrelated working-tree files because a rebase --autostash had left them staged in the index — including 4 files with leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf, url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock + the llama-cpp markers) to their prior committed state; terragrunt regenerates the generated files on the next run. Net effect of the docs commit is now just the runbook doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	170a3bb052	traefik: bump bot-block-proxy large_client_header_buffers to 8x64k The ai-bot-block forward-auth copies the full request (incl. the accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy. With 30+ Authentik Proxy Providers under viktorbarzin.me the combined Cookie header exceeds openresty's default 4x8k buffers, so the auth check returned 400 "Request Header Or Cookie Too Large" (surfaced as error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo OAuth sign-in for affected browsers. Mirror the existing auth-proxy-config fix: 8x64k accepts the pile. Applied live via tg apply + bot-block-proxy rollout restart. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	6f0bdf2993	kms: carve /keys.json out of Anubis for script auto-key-selection The activation scripts now fetch the published GVLK list from /keys.json to auto-select the right key for the detected edition. Like the .ps1 scripts, that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the PoW). Add /keys.json to the ingress_scripts carve-out path list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	7a297deb24	Woodpecker CI deploy [CI SKIP]	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr \| iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	935fb07df7	hermes-agent: gate PVC on parked flag (clears PVCStuckPending) The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at replicas=0 it had no consumer pod and sat Pending forever, falsely tripping PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to drive both replicas and the PVC count, so a parked service has no PVC at all. Empty/never-bound PVC removed; recreated automatically when un-parked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:19:28 +00:00
Viktor Barzin	7b6a0e70af	hermes-agent: opt out of external monitor while parked hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence firing, which halts kured node reboots. Set external_monitor=false so a deliberately-down service stops tripping the divergence gate. Re-enable when the deployment is brought back up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:12:33 +00:00
Viktor Barzin	51313ee088	kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts — just silent oom_events. Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot. Docs: post-mortem + automated-upgrades.md gate note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:49:04 +00:00
Viktor Barzin	0c64fc2948	travel-agent: switch from Slack webhook to bot token (chat.postMessage)	2026-05-30 22:44:11 +00:00
Viktor Barzin	46f63bb70e	infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)	2026-05-30 18:24:13 +00:00
Viktor Barzin	e1ab23193d	redis: revert 3-node Sentinel HA to single standalone instance [ci skip] The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:49:43 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	89561c7779	technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip] Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale .200 literals — breaking every *.viktorbarzin.lan ingress host (internal exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7). - apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong value -> false ViktorBarzinApexDrift "critical"). - split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan hairpin-NAT target). - ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the LIVE Traefik LB IP (queried from svc/traefik) every run, so a future Traefik IP move can't silently break the .lan zone again. Added services get/list to its ClusterRole. Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob triggers; verified apex correct=1 and the .lan anchor self-pins to .203. [ci skip] because a full technitium apply would also pick up unrelated pre-existing deployment drift (DNS pod restart risk) — left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 16:54:09 +00:00
Viktor Barzin	c2b820dc55	postiz: adopt drifted resources into TF state; exclude stuck Helm release The 2026-05-24 apply was interrupted with the Helm release stuck in pending-install, leaving only 2 of ~12 resources in TF state (any apply errored "already exists"). Adopted the live resources back via import {} sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both ingresses, temporal Service, nfs backup PV+PVC); plan now reaches zero. Reconciled code to live reality (zero runtime change to running postiz): - Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_attr_cleanup: the temporal Deployment is gone from the cluster (only the Service survives). Scheduled posts remain unavailable until temporal is restored; immediate posting works. - Removed helm_release.postiz from TF entirely: importing it would force a helm upgrade (provider can't match merged values to config) and the release is stuck pending-install. Left Helm-managed outside TF. - Removed keel.sh/enrolled=true from the namespace (postiz was opted out of Keel on 2026-05-29; this would have re-enrolled it on apply). - Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility DBs don't exist) and no longer depends_on the removed helm_release. Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:36:07 +00:00
Viktor Barzin	01351e4ce2	tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip] - stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment (alembic init + app), Service, NFS document PVC, ingress (Authentik forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated), and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent (skips seen message_ids), owner me@viktorbarzin.me. - stacks/dbaas: create CNPG role+db `tripit`. - stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry. Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied out-of-band via scripts/tg this session; a CI re-apply would also apply unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC). Refs: code-bb9g, code-muqi Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 10:23:11 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	0c01adac95	traefik: dedicate LB IP 10.0.20.203 + externalTrafficPolicy=Local Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF — remote cloudflared tunnel config, done via CF API): tunnel ingress repointed from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443 so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:09:37 +00:00
Viktor Barzin	0f26bf030b	kyverno: exclude postiz namespace from Keel auto-update injection Postiz was generating hourly Slack spam and a wedged rollout, both Keel-driven: - Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless deny/retry/Slack-ping loop. - Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout for 3 days. postiz Terraform state is heavily drifted (~2/30 resources tracked), so per-workload opt-out can't be applied from the postiz stack. Durable guard is here (clean kyverno state). Operational steps applied live via kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from the namespace, set keel.sh/policy=never (annotation+label) on all 4 workloads, rolled postiz back to the running v2.21.7. Keel restarted (scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it no longer tracks postiz. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 19:16:58 +00:00
root	ae72ad51bb	Woodpecker CI deploy [CI SKIP]	2026-05-29 18:07:00 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	b10233975b	llama-cpp: restore replicas to 1; fire-planner: fix llama-swap URL llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery (TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit examples ingest, which needs qwen3-8b for structured extraction. fire-planner's llama_cpp_base_url default pointed at a non-existent service:port (llama-cpp:8000) — the real service is `llama-swap` on port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of this. Correcting.	2026-05-29 06:20:03 +00:00
Viktor Barzin	478629c1ee	keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube), servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver deployment had no resource-level lifecycle at all; added one. Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas` validation `var.replicas == null || (...)` doesn't null-short-circuit in the current TF version, failing apply on every single-replica Anubis site (blog, cyberchef, f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with "argument must not be null". Switched to a null-safe ternary. Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented). The anubis module change triggers a full platform apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 06:02:24 +00:00

1165 commits