infra

Author	SHA1	Message	Date
Viktor Barzin	b3ef0dba76	authentik: ignore Keel-managed image_pull_policy on pgbouncer Keel flip-flops the pgbouncer container's imagePullPolicy, so the declared Always kept re-diffing on every plan. Ignore it like the image tag (KEEL_IGNORE pattern) — plan-to-zero restored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:34:44 +00:00
Viktor Barzin	4e88298976	authentik: incident hardening after the signin-speedup rollout storm The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md. Hardening (all root causes): - values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again - values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s) - values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits) - pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods - authentik_provider.tf: drop the completed import stanza (adoption rule) - traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:26:52 +00:00
Viktor Barzin	97ccdbecb8	authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path) Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time. Changes: - values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering — pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods. - authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API — authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe). - authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age — internal split-horizon users got zero caching). - traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods. - docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service). Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent — all first-party, the 4-weekly consent screen only slowed first-time signin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 21:58:10 +00:00
Viktor Barzin	70442ccdc6	t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+) Bound connection establishment via session ClientTimeout(total=None, connect=15) instead — works on 3.9 through current; total must stay None or the session timeout would kill the long-lived probe WS. Verified by a local 14s smoke run: cloudflare + internal legs both connect.	2026-06-10 21:26:09 +00:00
Viktor Barzin	9b55d53be0	t3: differential drop-attribution probe + devvm metrics Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md.	2026-06-10 21:11:29 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
Viktor Barzin	9fff77cbea	Merge branch 'wizard/budget-rate-limit'	2026-06-10 19:42:19 +00:00
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff — measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget- ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00
Viktor Barzin	176a65d3d2	plotting-book: TF baseline image follows what CI actually builds Viktor asked to verify the book-plotting push->build->deploy chain. The chain itself is healthy, but the Terraform baseline image said ancamilea/book-plotter:latest while CI (GHA on PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply would have resurrected a stale March image. Baseline now viktorbarzin/book-plotter:latest. No live change: the running tag is CI-owned via ignore_changes, plan confirms the image attr is ignored. [ci skip] deliberately: plan shows UNRELATED pre-existing drift on this stack (live ns labels managed-by=vault-user-onboarding + resource-governance/custom-quota=true would be stripped; deployment keel.sh/policy=patch annotations removed) — auto-applying that needs its own reviewed pass.	2026-06-10 18:37:14 +00:00
Viktor Barzin	de1d8b7bf3	technitium: add Brevo DKIM selector CNAMEs to internal zone [ci skip] The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo signs with brevo1/brevo2._domainkey, which are CNAMEs to b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent from the internal zone (the earlier existence check used ANY queries, which Cloudflare refuses per RFC 8482 — false negative). The DKIM permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the 6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX poll never finds them. ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd caches DNS (restart to flush after zone fixes); CoreDNS denial cache holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:07:38 +00:00
Viktor Barzin	00bc1e052d	technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip] Two fixes from the post-DNS-internalization health sweep: 1. The internal viktorbarzin.me zone served only ingress A/CNAME records. Since the mailserver pods now resolve the domain through it (CoreDNS viktorbarzin.me:53 -> Technitium, `59a531b8`), rspamd's SPF checks on inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the Brevo email-roundtrip probe failed from the 16:20 run onward (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also maintains the static mail-auth records (SPF, brevo-code TXT, MX; DMARC + DKIM were already present), idempotently. Principle: the internal zone must be a SUPERSET of the public zone for every record type internal clients consume. Verified in-pod: all four types resolve; roundtrip re-probe green. 2. cluster_healthcheck #30 queried instant `up`, which goes stale for ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant job -> intermittent false "redfish-idrac=missing". Now uses last_over_time(up[15m]) — same answers for fast jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:46:37 +00:00
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	a1b7b0ca53	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] The keep-set (newest 10 versions + latest + cache tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:22:47 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	87702bdce8	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	90b8312a29	tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0452611b5	forgejo: survive CI-build registry-push storms (mem 3Gi + working retention) Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt deferred): - Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it kept OOMing against. Size for the push spike. - Activate registry retention (DRY_RUN false). Verified the delete list against all running viktor/* images first: 0 running images affected. Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling. - FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo scopes container packages per-user, so DELETE on viktor/* returned 403 (the dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to viktor's write:package PAT. Retention had never actually worked. - Protect buildkit cache tags from retention (cleanup.sh keep-set) so the gentler-builds layer cache survives daily pruning. [ci skip] — already applied via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	98fe65e345	storage: migrate priority-pass uploads off proxmox-lvm-encrypted to NFS (Phase 1) Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted). 21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:47:07 +00:00
Viktor Barzin	5c378dd5e3	workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4) New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API). Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:50:40 +00:00
Viktor Barzin	173b1fc116	workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2) New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged. Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:47:00 +00:00
Viktor Barzin	6504911a77	matrix: open (tokenless) registration + bot mitigations + #security alert User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser challenges break native clients). Bot defense is layered instead: - Traefik rate-limit Middleware on a path-scoped /register ingress carve-out, keyed on request Host (GLOBAL /register cap) not source IP — the host is reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min, burst 20, per replica; CrowdSec is the hard backstop on both paths. - Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing #security Slack receiver (matches "registered on this server", never the rejection line). tuwunel's admin bot also posts signups to the admin room. Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert). Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so [ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:27:02 +00:00
Viktor Barzin	23602f393e	matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated) Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB drops the CNPG dependency (both init-containers, the db ESO, the Reloader annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation on, tuwunel-served well-known delegation to :443. server_name unchanged (matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path). Registered @viktor admin then disabled registration (403). Cleanup: removed the orphaned pg-matrix Vault static role and dropped the matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*. Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so [ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC tune-TTL drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:58:17 +00:00
Viktor Barzin	7501ea286b	tripit: wire planner subsystem (merged trip-planner) secrets + Slack webhook ingress - ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD / CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges). - New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated, same pattern as the calendar + emails-confirm carve-outs). - Remove the superseded standalone stacks/trip-planner (merged into tripit per the "future travel logic goes in tripit" policy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 09:26:21 +00:00
Viktor Barzin	f9d5cd6243	feat(tripit): wire real flight (AeroDataBox) + rail (RealtimeTrains) status Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was fabricated from a hash and never matched reality. Switch to real providers: - FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the tripit-secrets ExternalSecret) - RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault) - poll-flights cron */30 -> hourly to respect the free 600 req/month cap (provider also self-throttles to <=1 req/sec) Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with real schedule/terminal/aircraft. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	0d445d948c	stem95su: host STEM platform for 95. СУ (public NFS-backed static site) New public static site at stem95su.viktorbarzin.me serving the school's Bulgarian STEM platform (dashboard + lessons/games, externally authored HTML/media exported from Gemini). - Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume), NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool" or rsync), no rebuild; auto-backed-up offsite by nfs-mirror. - ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge), dns_type="proxied" (Cloudflare CNAME auto-created). - nginx ConfigMap sets index stem_board.html (the dashboard) for "/". - Docs: service-catalog entry + new "Static Site Hosting" pattern (NFS-backed vs image-baked) in patterns.md. Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB page, video byte-range, no Authentik redirect) through the public Cloudflare path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 15:21:21 +00:00
Viktor Barzin	c7ffbaa204	aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix) Root cause of "barely serving films": Real-Debrid's May-2026 infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate new content), while degraded sources starved candidates. RD account + popular-title availability were healthy throughout (library 32/36 unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams). Runtime config (AIOStreams PG, applied via API — not in this diff): - Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title) and was silently dropping the bulk of its results at the 5s cutoff; Interstellar 430 -> 987 streams after the bump. - Removed MediaFusion preset: broken upstream ("Invalid configuration" -> 500 Internal Server Error), contributed 0 usable streams, only a dead [X] entry in every list. This diff (Terraform): - Harden aiostreams-stream-probe: test series AND movie paths, per-source breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count, success gated on Comet being alive. The old probe counted only Breaking Bad streams and stayed green while new-content playback was broken. - service-catalog: reflect source set + probe behaviour. [ci skip] — probe already applied via targeted `tg apply` + verified (series=378 movie=898 comet=206 errors=0 success=1); skipping the full servarr reconcile to avoid touching unrelated pre-existing drift (qbittorrent MetalLB annotation, tls_secret cert revert). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 07:21:42 +00:00
Viktor Barzin	4cdb9e1886	novelapp: switch Keel to semver (policy=major) now upstream tags are valid Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so drop the :latest+force+match-tag digest workaround and track semver properly: policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to climb to higher semver tags), image floor pinned to v1.1.3. Pull policy -> IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for the mutable :latest). Running v1.1.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 22:56:46 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	3696ff5922	novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as `v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see past the highest parseable tag. :latest correctly points at the newest release, so switch to force + match-tag digest-tracking of :latest (Kyverno does not manage match-tag, contrary to the stale code comment). Imports the live Deployment (recreated out-of-band 2026-06-06) back into TF state; running image flipped to :latest -> now on v.1.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4d8b782df1	feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress) Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database (static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m cpu request), ClusterIP service port 8080, and ingress_factory with auth=none (Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied; requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	7c12fbba95	monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1 Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1 Endpoints API will essentially never be removed (clients keep working indefinitely), and even the latest Calico still watches Endpoints (projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact deprecation message), mirroring the mailserver drop. Real calico warnings/errors kept; reversible. Validated with alloy fmt (exit 0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4b13be6d48	dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables) dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit rails_pulse_routes (no primary key) and retry-looped, logging ~125 UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream removed RailsPulse entirely in 1.7.x (commit a5172cc) with a DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only auto-applies patch bumps within 1.6.x, so the minor jump is manual. Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm. The 5 rails_pulse_* tables are empty (feature never collected data), so cleanup is zero-data-risk; location data (tracks/points/visits/places) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	8a3bbde38c	mailserver: silence mixed-TLS-directive warning + drop SMTP scanner noise from Loki Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error source, from the 2026-06-06 log triage): 1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file in our postfix-main.cf override were IGNORED and triggered postfix's 'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two legacy lines (functional no-op; chain_files already wins). Verified via live postconf. 2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen half-open drops, rate-limit-exceeded from unknown). Real delivery logs + real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so security posture is unchanged. Validated with 'alloy fmt' (exit 0). Reversible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	de181a9afc	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	27211acda1	rybbit: recreate missing Postgres database via idempotent init Job rybbit's 'rybbit' PG database was missing from CNPG (the role survived a past cluster rebuild but the database did not), so the app's node-cron logged 'database "rybbit" does not exist' every minute (found via Loki 2026-06-06). Created the DB manually to restore service (app auto-migrated 11 tables); this adds a self-contained init Job so the DB is recreated on any future rebuild -- connects as the rybbit role (has CREATEDB) using the existing rybbit-secrets password, idempotent CREATE DATABASE if absent. Deployment now depends_on the job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9ad7756a94	traefik: make bot-block-proxy a clean no-op while poison-fountain is at 0 bot-block-proxy is the forward-auth target for the ai-bot-block middleware (applied to every anti-AI ingress). It proxied /auth to the poison-fountain bot trap with error_page 5xx=200 fail-open. But poison-fountain is intentionally scaled to 0, so proxy_pass only ever failed and fell open to '200 allowed' -- while logging ~51k errors/hr (the #1 Loki source once pod logs began shipping 2026-06-05) and paying up to 100ms connect-timeout per authed request. Short-circuit /auth to 'return 200 "allowed"' directly (drop the upstream + proxy_pass + fallback). Identical effective behaviour (allow-all), no upstream attempt, no noise, no latency. Reversible: restore the upstream + proxy_pass and scale poison-fountain up. Also add the missing configmap.reloader.stakater.com/reload annotation so openresty picks up ConfigMap changes (it does not hot-reload on its own -- the root reason stale config ran for days). replicas stays 2: critical-path forward-auth target (anti-AI ingresses fail closed if it is down), so HA is retained though each request is now trivial. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	d70a99dc48	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	d661d074ef	matrix: auto-reload Synapse on DB credential rotation (Reloader) Synapse injects the Postgres password into homeserver.yaml only at startup (inject-db-password initContainer). matrix-db-creds is rotated by Vault via ESO (15m refresh), so each rotation left the running pod with a stale password and Synapse DB auth failed silently until a manual rollout restart. Found today via Loki: ~12.9k/hr 'password authentication failed for user matrix' lines; secret password verified working against the DB while the 10-day-old pod held the pre-rotation value. Add the explicit secret.reloader.stakater.com/reload annotation so Reloader rolls the deployment whenever the secret changes (explicit form, not auto/search, because the secret is referenced only in an initContainer env var). Live pod already restarted to restore service; this prevents recurrence on the next rotation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
root	e7ece3eaf9	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
root	02366103ef	Woodpecker CI deploy [CI SKIP]	2026-06-06 16:51:26 +00:00
Viktor Barzin	1b9d4f1233	storage: migrate insta2spotify off proxmox-lvm to NFS (LUN relief, Phase 1) Config-only PVC (no embedded DB), preflighted. Frees one proxmox-csi slot. NB: pod reschedule re-pulled the 3.26GB backend image (~6min stall) -- large-image services incur a pull-delay blip when migration moves them to a fresh node. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:38:01 +00:00
Viktor Barzin	355ca3ee91	proxmox-csi: auto-reconcile CronJob to detach ghost disks (code-dfjn prevention) Closes the ghost-disk doom loop by construction (failed detach -> orphan scsiN with no VolumeAttachment -> invisible oversubscription -> query-pci wedge). Every 15min csi-ghost-reconcile compares each worker VM's real scsi disks (Proxmox API) vs k8s VolumeAttachments and safely detaches ghosts (PUT .../config delete=scsiN -> frees the LUN slot, retains the LV). - detection mirrors cluster-health check #47 - SAFETY: only vm-9999-pvc scsi with no matching VA; 60s re-confirm; per-run cap 5 - scoped CSI API token (VM.Config.Disk), not root SSH; k8s API via injected ClusterIP - verified live: read 66 VAs, 0 ghosts, no false positives - pushes csi_ghosts_detected/detached to Pushgateway Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:25:36 +00:00
Viktor Barzin	a42f4f7b26	trek: trial-deploy TREK group-trip planner behind Authentik (solo eval) Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment trial to evaluate the self-hosted group-trip use case before building a custom app. Solo, single shared instance, Authentik forward-auth. - stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel), service 80->3000, ingress_factory auth=required + proxied DNS at trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data + uploads) -- encrypted per the sensitive-data rule and to avoid the SQLite-over-NFS locking hazard. - Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC, bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO). - kyverno: add mauriceboe/* to require-trusted-registries allowlist (the policy is Enforce since 2026-05-19 -- also fixed the stale "stays in Audit" header comment that said otherwise and misled the deploy). - Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll companion deferred per solo-trial scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:30:07 +00:00
Viktor Barzin	a0b34750ee	storage: migrate hackmd uploads off proxmox-lvm-encrypted to NFS (LUN-cap relief) codimd is MySQL-backed; this PVC holds only pasted image uploads (subPath hackmd, 4.5M) -- no embedded DB, NFS-safe. Drops LUKS-at-rest for these low-sensitivity images (accepted). Frees one proxmox-csi SCSI-LUN slot on node6. - swap hackmd-data-encrypted -> nfs_volume module (subPath preserved) - uploads copied + verified (20 files, HTTP 200, codimd listening) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:11:31 +00:00
Viktor Barzin	e35d693972	storage: migrate send off proxmox-lvm to NFS (LUN-cap relief) Send (timvisee/send) stores encrypted upload blobs on disk with metadata in Redis -- no embedded DB, NFS-safe. Frees one proxmox-csi SCSI-LUN slot on node2. - swap send-data-proxmox -> nfs_volume module - blobs copied + verified (273M, 22 entries, HTTP 200 on NFS) - block PVC removed; LV retained per SC policy (code-dfjn cleanup) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:04:37 +00:00
Viktor Barzin	bf3608052b	tripit: GEOCODER_PROVIDER=openmeteo for per-city itinerary weather Enables Open-Meteo geocoding of lodging addresses (results cached in the new geocode_cache table) so the itinerary can show per-city weather. Applied manually via scripts/tg apply. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:01:31 +00:00


1295 commits