infra

Author	SHA1	Message	Date
Viktor Barzin	73cb0aab8b	t3code: per-user isolation via Authentik + nginx username dispatcher t3 is single-owner (no in-app multi-user), so each person runs their own `t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service), emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the Authentik-injected X-authentik-username to the right instance; unmapped identities get 403 (no shared fallback). Flipped the ingress auth app→required (Authentik forward-auth) — the same-origin self-served UI works behind it (WS carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate. Mirrors the terminal stack's per-user model. Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403; t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes intentionally unsupported here — deferred until the native app is published. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:38:06 +00:00
Viktor Barzin	f807050eb5	cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip] The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit `0c01adac`). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage. Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move. Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk). Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	f364399ede	wealth: add 30y net-worth projection row + align net-pay panel Implements the committed projections design (docs/plans/2026-05-28-wealth-projections-{design,plan}.md): a collapsed "Projections" row on the wealth dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto, horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-3y historical line + a base-rate compounding-only line), 3 stat cards, and a text panel with one-click future time-range links. Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns (~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window is only ~4 months, and the true all-time geomean is skewed by 2021's +86%. Also aligns "Net pay vs market gain — per month" to consecutive month-end deltas (same fix as the other monthly panels). Verified all SQL live. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	32e1042ca8	t3code: expose `t3 serve` (DevVM) publicly at t3.viktorbarzin.me (app-tier) New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints → 10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied, auth="app"). t3 ships its own owner-pairing + bearer-session auth, so Authentik forward-auth is intentionally omitted — it would break the cross-origin native mobile app and app.t3.codes (bearer-only, no Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier) rate-limit the public surface; t3's pairing is the gate. TLS is auto-synced into the namespace by Kyverno's sync-tls-secret policy. Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200. Trade-off (public RCE surface behind app-native auth, no Authentik SSO) accepted 2026-06-01 to keep the native app + app.t3.codes working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	c5e4b1ea71	kms: add /diag anonymous telemetry collector behind Anubis carve-out The PowerShell activation scripts POST small JSON diagnostics to /diag so script execution errors are captured. The collector (python:3.12-alpine, ConfigMap-mounted) prints each event to stdout as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making events searchable in Grafana (Loki only — no Slack, no Prometheus). Like /scripts, /diag needs a second ingress_factory carve-out with full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW challenge that PowerShell/curl can't solve. Without full_host the factory would derive kms-diag.viktorbarzin.me and the carve-out would never match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	5c77482a8c	fire-planner: LLM_MODEL env var → qwen3vl-4b default (fits in current GPU headroom; immich-ml is holding ~10GB)	2026-06-01 19:50:41 +00:00
Viktor Barzin	fb1e47a20a	nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9 bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around both failure modes: - F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true; Job deadline bumped 120->600s so it isn't killed mid-migration. - F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade CrashLoop): chart_values renders the live tag via a plural kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on fresh install/DR), so a re-render never downgrades below live. Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and its background-controller overrides a TF-set value, and patch == minor for Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the per-workload keel.sh/policy override resources to avoid perpetual drift; ns enrollment + Kyverno now own the keel annotations like other workloads. Also bumps the external-storage bootstrap Job create timeout 1m->12m to match its own 10m pod-wait, since Keel bumps now roll the pod mid-apply. Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.	2026-06-01 19:50:41 +00:00
Viktor Barzin	50d0f1affa	kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix) The 2026-05-26 migration flipped the keel default force->patch and dropped match-tag from the inject-keel-annotations patch, but Kyverno's add-only mutate can't remove an annotation that's no longer listed -- 194 workloads kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in multi-image pods: the blog's nginx<->nginx-exporter images were swapped and the site was down 2026-05-26 -> 06-01 (nginx received the exporter's -nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently swapped (app lost its /datastore PVC + env, ran ephemeral for days). - policy now sets keel.sh/match-tag=null (strips on admission, never re-added) - swept the annotation off all 194 existing workloads (kubectl, no pod restart) - AGENTS.md: documents the strip; post-mortem added. blog + changedetection un-swapped via kubectl set image (TF-ignored images); both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG state authoritative). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	769ae7a6d3	traefik: bot-block-proxy buffer 256k + document the real HTTP/2 limit Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers to 256k and corrected the rationale. Investigation found the binding limit for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not exposed by Traefik config) — oversized authentik_proxy_* cookie piles are rejected at the h2 layer upstream of bot-block regardless of these buffers. The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation (or clearing cookies); these buffers only prevent bot-block being a tighter bottleneck for sub-64KB piles + HTTP/1.1 clients. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:27 +00:00
Viktor Barzin	3d28870e25	nextcloud: fix backup retention to sort by name, not mtime The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used `ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's mtime, so the freshest backup didn't sort as newest — the retention step deleted the new backup and kept a stale one. Sort lexically (chronological for these names) and keep the last. Also exclude html/ (the app code, reproducible from the now-pinned image; the real config lives at config/config.php, html/config is empty) so the backup is config+data+custom_apps only → ~4.3G (<5G target). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
root	84ab4c998c	Woodpecker CI deploy [CI SKIP]	2026-06-01 15:15:26 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	af4bfbe046	kms: revert files accidentally bundled into the docs commit The previous commit (81a7d804) swept in 23 unrelated working-tree files because a rebase --autostash had left them staged in the index — including 4 files with leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf, url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock + the llama-cpp markers) to their prior committed state; terragrunt regenerates the generated files on the next run. Net effect of the docs commit is now just the runbook doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	170a3bb052	traefik: bump bot-block-proxy large_client_header_buffers to 8x64k The ai-bot-block forward-auth copies the full request (incl. the accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy. With 30+ Authentik Proxy Providers under viktorbarzin.me the combined Cookie header exceeds openresty's default 4x8k buffers, so the auth check returned 400 "Request Header Or Cookie Too Large" (surfaced as error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo OAuth sign-in for affected browsers. Mirror the existing auth-proxy-config fix: 8x64k accepts the pile. Applied live via tg apply + bot-block-proxy rollout restart. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	6f0bdf2993	kms: carve /keys.json out of Anubis for script auto-key-selection The activation scripts now fetch the published GVLK list from /keys.json to auto-select the right key for the detected edition. Like the .ps1 scripts, that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the PoW). Add /keys.json to the ingress_scripts carve-out path list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	7a297deb24	Woodpecker CI deploy [CI SKIP]	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr | iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	935fb07df7	hermes-agent: gate PVC on parked flag (clears PVCStuckPending) The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at replicas=0 it had no consumer pod and sat Pending forever, falsely tripping PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to drive both replicas and the PVC count, so a parked service has no PVC at all. Empty/never-bound PVC removed; recreated automatically when un-parked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:19:28 +00:00
Viktor Barzin	7b6a0e70af	hermes-agent: opt out of external monitor while parked hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence firing, which halts kured node reboots. Set external_monitor=false so a deliberately-down service stops tripping the divergence gate. Re-enable when the deployment is brought back up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:12:33 +00:00
Viktor Barzin	51313ee088	kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts — just silent oom_events. Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot. Docs: post-mortem + automated-upgrades.md gate note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:49:04 +00:00
Viktor Barzin	0c64fc2948	travel-agent: switch from Slack webhook to bot token (chat.postMessage)	2026-05-30 22:44:11 +00:00
Viktor Barzin	46f63bb70e	infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)	2026-05-30 18:24:13 +00:00
Viktor Barzin	e1ab23193d	redis: revert 3-node Sentinel HA to single standalone instance [ci skip] The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:49:43 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	89561c7779	technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip] Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale .200 literals — breaking every *.viktorbarzin.lan ingress host (internal exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7). - apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong value -> false ViktorBarzinApexDrift "critical"). - split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan hairpin-NAT target). - ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the LIVE Traefik LB IP (queried from svc/traefik) every run, so a future Traefik IP move can't silently break the .lan zone again. Added services get/list to its ClusterRole. Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob triggers; verified apex correct=1 and the .lan anchor self-pins to .203. [ci skip] because a full technitium apply would also pick up unrelated pre-existing deployment drift (DNS pod restart risk) — left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 16:54:09 +00:00
Viktor Barzin	c2b820dc55	postiz: adopt drifted resources into TF state; exclude stuck Helm release The 2026-05-24 apply was interrupted with the Helm release stuck in pending-install, leaving only 2 of ~12 resources in TF state (any apply errored "already exists"). Adopted the live resources back via import {} sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both ingresses, temporal Service, nfs backup PV+PVC) — plan now reaches zero. Reconciled code to live reality (zero runtime change to running postiz): - Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_attr_cleanup: the temporal Deployment is gone from the cluster (only the Service survives). Scheduled posts remain unavailable until temporal is restored; immediate posting works. - Removed helm_release.postiz from TF entirely: importing it would force a helm upgrade (provider can't match merged values to config) and the release is stuck pending-install. Left Helm-managed outside TF. - Removed keel.sh/enrolled=true from the namespace (postiz was opted out of Keel on 2026-05-29; this would have re-enrolled it on apply). - Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility DBs don't exist) and no longer depends_on the removed helm_release. Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:36:07 +00:00
Viktor Barzin	01351e4ce2	tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip] - stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment (alembic init + app), Service, NFS document PVC, ingress (Authentik forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated), and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent (skips seen message_ids), owner me@viktorbarzin.me. - stacks/dbaas: create CNPG role+db `tripit`. - stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry. Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied out-of-band via scripts/tg this session; a CI re-apply would also apply unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC). Refs: code-bb9g, code-muqi Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 10:23:11 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	0c01adac95	traefik: dedicate LB IP 10.0.20.203 + externalTrafficPolicy=Local Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF — remote cloudflared tunnel config, done via CF API): tunnel ingress repointed from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443 so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:09:37 +00:00
Viktor Barzin	0f26bf030b	kyverno: exclude postiz namespace from Keel auto-update injection Postiz was generating hourly Slack spam and a wedged rollout, both Keel-driven: - Bundled redis StatefulSets run docker.io/bitnamilegacy/redis; Keel tried 7.4.0->7.4.1/7.4.2 every poll but require-trusted-registries denies bitnamilegacy/* (only bitnami/* allowlisted) -> endless deny/retry/Slack-ping loop. - Keel bumped postiz-app v2.21.7->v2.21.8 on 2026-05-26; the surge pod couldn't schedule under the 3Gi tier-4-aux quota, wedging the rollout for 3 days. postiz Terraform state is heavily drifted (~2/30 resources tracked), so per-workload opt-out can't be applied from the postiz stack. Durable guard is here (clean kyverno state). Operational steps applied live via kubectl (postiz stack can't apply): removed keel.sh/enrolled=true from the namespace, set keel.sh/policy=never (annotation+label) on all 4 workloads, rolled postiz back to the running v2.21.7. Keel restarted (scale 0->1) to drop postiz-app from its in-memory tracker; confirmed it no longer tracks postiz. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 19:16:58 +00:00
root	ae72ad51bb	Woodpecker CI deploy [CI SKIP]	2026-05-29 18:07:00 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	b10233975b	llama-cpp: restore replicas to 1; fire-planner: fix llama-swap URL llama-cpp was scaled to 0 during 2026-05-25 IO-storm recovery (TEMP-SCALEDOWN). Cluster is now stable; only frigate competes for the GPU on k8s-node1. Restoring to 1 to unblock fire-planner's Reddit examples ingest, which needs qwen3-8b for structured extraction. fire-planner's llama_cpp_base_url default pointed at a non-existent service:port (llama-cpp:8000) — the real service is `llama-swap` on port 8080. First 2026-05-28 bulk Job exited 0 with 0 rows because of this. Correcting.	2026-05-29 06:20:03 +00:00
Viktor Barzin	478629c1ee	keel+anubis: extend sweep to non-V2 raw deployments; fix anubis replicas validation Second-tier keel drift: actualbudget, mailserver (docker-mailserver + roundcube), servarr (8 deployments), and authentik pgbouncer are live-enrolled (Kyverno injects keel.sh/policy=patch) and drifting, but never had the V2 block in Terraform. Added the full block (KYVERNO_LIFECYCLE_V2 + keel.sh/match-tag + per-container KEEL_IGNORE_IMAGE + KEEL_LIFECYCLE_V1) to all 13 deployments. The docker-mailserver deployment had no resource-level lifecycle at all — added one. Also fixes a pre-existing bug in modules/kubernetes/anubis_instance: the `replicas` validation `var.replicas == null || (...)` doesn't null-short-circuit in the current TF version, failing apply on every single-replica Anubis site (blog, cyberchef, f1-stream, homepage, jsoncrack, kms, postiz, real-estate-crawler, travel_blog) with "argument must not be null". Switched to a null-safe ternary. Verified: actualbudget plan shows no image drift (http-api 26.5.2 downgrade prevented). The anubis module change triggers a full platform apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 06:02:24 +00:00
root	fe1a16a5f5	Woodpecker CI deploy [CI SKIP]	2026-05-29 05:48:10 +00:00
Viktor Barzin	5bc7a76630	tuya-bridge: switch to Forgejo image + CI-driven deploy Mirrors the kms-website pattern: deployment image now points to forgejo.viktorbarzin.me/viktor/tuya_bridge:${var.image_tag} and the new Woodpecker pipeline in tuya_bridge/.woodpecker.yml drives the rollout via `kubectl set image` on every push. Changes: - Extract `tls_secret_name` and add `image_tag` (default "latest") to a new variables.tf, matching the kms / fire-planner / payslip-ingest convention. - Add `image_pull_secrets { name = "registry-credentials" }` (Kyverno ClusterPolicy sync-registry-credentials already syncs the Secret into every namespace). - Set explicit `image_pull_policy = "IfNotPresent"` — SHA-tagged images are immutable, no need to re-pull on every restart. The image attribute remains in `lifecycle.ignore_changes` (line was already there from the prior Keel-managed era), so future `tg apply`s do not fight Woodpecker's `kubectl set image`. Keel is still enrolled on the namespace but will skip SHA-tagged images under `policy: patch` (non-semver), so the CI pipeline is the sole rollout mechanism. Backstory: the 2026-05-26 cluster-health incident was tuya-bridge crashlooping after Keel rewrote `:latest` to a stale broken `:0.1` tag on Docker Hub (which predated the `prometheus_exporter.py` addition). Manual rebuild + push was the immediate fix; this commit plus tuya_bridge/.woodpecker.yml close the underlying gap so a source change reliably produces a fresh registry image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:45:16 +00:00
Viktor Barzin	7870e62a07	uptime-kuma: declare Proxmox UI monitor in TF Yesterday's session SQL-patched monitor 313 to `https://192.168.1.127:8006/` + ignore_tls=1 because the prior URL `http://proxmox.reverse-proxy.svc.cluster.local:8006` hit a CoreDNS pod-level cache returning stale `10.0.10.1` (pfSense GW) intermittently, false-tripping ExternalAccessDivergence. A kuma DB restore would have lost the SQL fix. Declare the monitor in `internal_monitors` so the existing sync CronJob self-heals it. Extends the schema with optional `url` / `accepted_statuscodes` / `ignore_tls` fields (null on the existing DB/port entries) and teaches the sync script the MonitorType.HTTP branch — url + accepted_statuscodes + ignoreTls (camelCase on the API), matching drift fields the same way PORT does for hostname/port. Verified: manually triggered the sync after apply; it found monitor 313 by name and reported "already in desired state". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 05:40:18 +00:00
Viktor Barzin	7c73c69f9b	keel: add KEEL_LIFECYCLE_V1 + image-ignore to fire-planner Completes the enrolled-workload sweep from `cdb7d9a8`. fire-planner was held back because a parallel session was mid-apply on it (presence board); that claim has since cleared. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:12:49 +00:00
Viktor Barzin	cdb7d9a81a	keel: sweep KEEL_LIFECYCLE_V1 + per-container KEEL_IGNORE_IMAGE across enrolled workloads Every Keel-enrolled workload (policy=patch, match-tag=true, injected by the inject-keel-annotations Kyverno policy) was fighting Terraform: Keel rewrites the image tag and restamps keel.sh/update-time, change-cause and the rollout revision on each poll; without ignore_changes every `tg apply` reverted those — downgrading the image and forcing a spurious rollout that Keel then re-did. Only llama-cpp had the full block (added 2026-05-24); the other ~73 workloads drifted. This sweep adds, to every enrolled deployment/daemonset lifecycle: - container[N].image (one per container index + init_container[N]) # KEEL_IGNORE_IMAGE - keel.sh/match-tag, keel.sh/update-time, kubernetes.io/change-cause, deployment.kubernetes.io/revision # KEEL_LIFECYCLE_V1 Verified via `tg plan` on speedtest (single-container: image downgrade 0.24.3->0.24.1 + annotation strip now gone) and changedetection (multi-container: both container images no longer drift). AGENTS.md drift-suppression section updated with the canonical block + marker legend. fire-planner deferred (parallel session mid-apply per presence board). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 23:09:30 +00:00
Viktor Barzin	4f71ce6bc5	wealth: fix Fidelity Feb-2026 zero-gap + month-boundary contribution smear Two correctness fixes to the wealth dashboard, found while validating contribution data against actual-viktor (source of truth): 1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension. A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16, which cratered net worth and produced a phantom -£97,457 "contribution" in Feb then +£100,458 in Mar. Carry the last non-zero day forward across the gap (a £0 pension valuation is always a scrape gap, never real). 2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual change decomposition" now use consecutive period-end deltas instead of within-period first-to-last-obs, so contributions landing near a period boundary are no longer dropped/mis-attributed. Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212 RSU-proceeds investment, reconciles with actual-viktor), no spurious negatives. Brokerage contributions unchanged (already correct). Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 22:58:59 +00:00
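The two wealth fixes in `4f71ce6bc5` can be illustrated with a minimal Python sketch. This is not the dashboard's actual query code (the commit does not show it); the function names `locf_fill_zero_gaps`, `within_period_changes`, and `period_end_deltas` and the sample figures are hypothetical, chosen only to show the carry-forward and the consecutive period-end delta described above.

```python
from datetime import date
from itertools import groupby


def locf_fill_zero_gaps(series):
    """Carry the last non-zero value forward across zero-valued days.

    `series` is a chronologically sorted list of (date, value) pairs.
    Assumption from the commit message: a zero valuation is always a
    scrape gap, never a real value. Leading zeros are left untouched
    because there is nothing to carry forward yet.
    """
    filled = []
    last_nonzero = None
    for day, value in series:
        if value != 0:
            last_nonzero = value
        elif last_nonzero is not None:
            value = last_nonzero
        filled.append((day, value))
    return filled


def _month_key(day):
    return (day.year, day.month)


def within_period_changes(series):
    """Old approach: last minus first observation inside each month."""
    changes = {}
    for key, group in groupby(series, key=lambda pair: _month_key(pair[0])):
        values = [value for _, value in group]
        changes[key] = values[-1] - values[0]
    return changes


def period_end_deltas(series):
    """New approach: difference between consecutive month-end values."""
    month_ends = {}
    for day, value in series:
        month_ends[_month_key(day)] = value  # sorted input: last one wins
    keys = sorted(month_ends)
    return {cur: month_ends[cur] - month_ends[prev]
            for prev, cur in zip(keys, keys[1:])}


# Hypothetical valuations with a two-day scrape gap.
valuations = [
    (date(2026, 2, 14), 100.0),
    (date(2026, 2, 15), 101.0),
    (date(2026, 2, 16), 0.0),
    (date(2026, 2, 17), 0.0),
    (date(2026, 2, 18), 103.0),
]
filled = locf_fill_zero_gaps(valuations)

# Hypothetical balance where 50 is contributed between 31 Jan and 1 Feb.
balance = [
    (date(2026, 1, 30), 100.0),
    (date(2026, 1, 31), 100.0),
    (date(2026, 2, 1), 150.0),
    (date(2026, 2, 28), 152.0),
]
old_style = within_period_changes(balance)
new_style = period_end_deltas(balance)
```

In this toy data the within-period calculation attributes only 2.0 to February, because the 50.0 that arrived on the month boundary falls between the last January observation and the first February one. The period-end delta attributes 52.0 to February, so nothing is dropped.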
Viktor Barzin	0044c3a8ea	fire-planner: add examples ingest Job (toggled) + weekly CronJob Adds the K8s plumbing for the Reddit FIRE-examples ingest path: - ExternalSecret fire-planner-examples-reddit (Reddit OAuth from Vault secret/viktor.trading_bot_reddit_{client_id,client_secret}). - ExternalSecret fire-planner-examples-claude (claude-agent-service bearer from Vault secret/claude-agent-service.api_bearer_token). - kubernetes_job_v1.examples_bulk_ingest — one-shot bulk Job toggled via var.run_examples_bulk_ingest (default false). Timestamp-named so each (true) transition creates a fresh Job; lifecycle ignores the name so re-plans don't propose phantom renames. - kubernetes_cron_job_v1.examples_weekly_delta — Sunday 04:00 UTC --top=week --limit=200 incremental run. Both runners share the env_from plumbing of the existing recompute CronJob (fire-planner-secrets, fire-planner-db-creds, wealthfolio-sync-db-creds) plus examples-specific vars (REDDIT_USER_AGENT, LLAMA_CPP_BASE_URL, CLAUDE_AGENT_SERVICE_URL, plus the three secret-backed env vars). Plan-only this commit — actual apply lands in Task 17 after the ingest image build.	2026-05-28 22:51:14 +00:00
Viktor Barzin	4dff834c8a	reduce ingress-dns-sync frequency to hourly [ci skip]	2026-05-28 22:30:08 +00:00
Viktor Barzin	5ac8d625b9	add ingress-dns-sync CronJob to auto-create Technitium CNAME records Discovers all *.viktorbarzin.me ingress hosts every 15 minutes and creates matching CNAME records in Technitium if missing. Prevents the desync where Cloudflare has the DNS record (via ingress_factory) but internal DNS returns NXDOMAIN because Technitium was never updated. Includes ServiceAccount + ClusterRole for ingress list permissions.	2026-05-28 22:22:42 +00:00
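The reconcile step in `5ac8d625b9` amounts to a set difference between discovered ingress hosts and existing internal records. The sketch below assumes the hosts and the existing Technitium record names have already been fetched as plain string lists; the function name `missing_cnames` is hypothetical, and no Kubernetes or Technitium API calls are shown because the commit does not include the script.

```python
def missing_cnames(ingress_hosts, existing_records, zone="viktorbarzin.me"):
    """Return hosts under *.zone that have no matching internal record.

    Names are compared case-insensitively and without a trailing dot.
    The apex itself is excluded because it does not match *.zone.
    """
    def normalise(name):
        return name.lower().rstrip(".")

    suffix = "." + zone
    wanted = {normalise(h) for h in ingress_hosts
              if normalise(h).endswith(suffix)}
    have = {normalise(r) for r in existing_records}
    return sorted(wanted - have)


# Hypothetical inputs: one host already present, one missing,
# one outside the zone, one duplicate with different case and a dot.
hosts = [
    "blog.viktorbarzin.me",
    "t3.viktorbarzin.me",
    "example.org",
    "T3.viktorbarzin.me.",
    "viktorbarzin.me",
]
records = ["blog.viktorbarzin.me"]
to_create = missing_cnames(hosts, records)
```

Because the function only reports names that are absent, running it repeatedly is idempotent: once a CNAME has been created, the next run returns an empty list for that host.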
Viktor Barzin	58cced5dab	monitoring: render market-vs-salary periodic panels as lines, not bars	2026-05-28 22:18:59 +00:00
Viktor Barzin	388a7f60c7	monitoring: add net-pay-vs-market-gains panels to wealth dashboard Three new panels comparing employment income to investment returns over time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest, portfolio in wealthfolio_sync — separate DBs, so per-target datasources): - cumulative net take-home pay vs cumulative market gain (line race) - net pay vs market gain per year (grouped bars) - net pay vs market gain per month (grouped bars) Inserted after the "Growth over time" panel; existing panels shifted down, full-width tables remain at the bottom.	2026-05-28 22:13:44 +00:00
Viktor Barzin	1af412b461	trading-bot: bump TRADING_MEET_KEVIN_PROMPT_VERSION v1 -> v2 (forward-looking prompt)	2026-05-28 21:40:17 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00
Viktor Barzin	8b4bcc0ca2	blog: Anubis carve-out for /net-diag.sh curl|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis. Adds a second ingress_factory pointing /net-diag.sh at the bare blog service (port 80), keeping every other path on the existing Anubis chain. Path-prefix specificity wins in Traefik routing — / stays gated. dns_type = "none" because the apex viktorbarzin.me CF record already exists from the main ingress. Doc update: CLAUDE.md Anubis section notes blog now follows the wrongmove carve-out pattern.	2026-05-28 13:22:57 +00:00

1151 commits