infra

Author	SHA1	Message	Date
Viktor Barzin	5e8a988858	android-emulator: api36-v2 — marker-file install idempotency + retries First boot crashed mid-SDK-install, and the dir-existence check then skipped reinstall forever: avdmanager saw the partial tree and died with 'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks install completion with a marker file written only after sdkmanager succeeds + package.xml exists, wipes partial system-image trees before reinstalling, and retries sdkmanager 3x.	2026-06-11 20:59:08 +00:00
Viktor Barzin	3fac45febc	android-emulator: drop applied import stanzas; deployment recreates fresh The five imports from the last recovery commit are in state now (verified serial 4: everything except the deployment). The deployment kept falling out of state between runs, so instead of a third import round the broken 0-replica deployment object was deleted live (transient recovery step, presence-claimed) and this apply recreates it Terraform-owned with the quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors on importing already-managed addresses.	2026-06-11 20:49:37 +00:00
Viktor Barzin	6b7efcd2d6	android-emulator: import the five resources still missing from state Pipeline 88 imported the namespace but its refresh dropped the PVC, both services, the ingress and the tls secret from state (PG-backend state races on this new stack's first applies), so the apply again died on 'already exists' conflicts. State now holds namespace+deployment; adopt the missing five with import blocks (TF 1.5 errors on importing already-managed addresses, so only the missing set is listed). Stanzas come out once applied.	2026-06-11 20:44:09 +00:00
Viktor Barzin	b948224008	android-emulator: import orphaned namespace into state (lock-race recovery) Pipeline 85 created the namespace but a Terraform pg-backend workspace-creation lock race (new stack schema initializing while other stacks applied concurrently) left it out of the recorded state — every later apply then died with 'namespaces android-emulator already exists'. Adopt it with an import block per the house recovery pattern; stanza gets removed once it has applied.	2026-06-11 20:38:46 +00:00
Viktor Barzin	99c19584f7	android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory) First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi, limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like tiers 3/4 do, instead of opting the namespace out via custom-quota.	2026-06-11 19:56:09 +00:00
Viktor Barzin	6bf216751b	Merge forgejo/master (tts stack) into wizard/android-emulator # Conflicts: # stacks/tripit/main.tf	2026-06-11 19:53:07 +00:00
Viktor Barzin	8b7c77c794	android-emulator: new stack — shared in-cluster Android 16 testing instance Viktor is setting up an Android app development pipeline (tripit is the first app) and wants agents to natively test changes on Android before shipping. This adds the testing environment: an API-36 Google emulator under KVM as a privileged pod (namespace joins the Kyverno exclude list), SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP 10.0.20.200:5555 (LAN only), noVNC screen view at android-emulator.viktorbarzin.lan. Image is built manually from the stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo rejected).	2026-06-11 19:51:57 +00:00
Viktor Barzin	798b025580	tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector) The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD; on a merge commit that is the first-parent diff, which contained only the concurrently-landed files — stacks/tts never got applied (namespace still absent) and the kyverno re-trigger push got no pipeline at all. Single non-merge commit touching both stacks so the detector sees them; the sorted loop applies kyverno before tts, the order tripit#26 requires. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:08:23 +00:00
Viktor Barzin	4a8c4f9a14	tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC Viktor's tour-guide redo (tripit#26): `87702bdc` committed this stack with [ci skip] so it was never applied — prod tripit has been pointing at a nonexistent chatterbox-tts service since. This commit triggers the apply and fixes the voices path: config pointed predefined_voices_path at the NFS PVC (/data/voices), which nobody can seed without NFS-host shell access and which would leave /v1/audio/voices empty (it gates readiness). Use the 28 voices bundled in the image at /app/voices instead; /data keeps reference audio (future cloning) and the HF model cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:27:44 +00:00
Viktor Barzin	7a1cc64898	kyverno: re-trigger apply of tts GPU-priority exclusion (`87702bdc` was [ci skip]'d) Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit `87702bdc` carried [ci skip], so CI never applied the kyverno change that keeps the tts namespace out of low-GPU-priority injection. This comment-only commit makes CI apply the already-committed change — step 1 of the kyverno -> tts -> tripit apply order. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:29 +00:00
Viktor Barzin	50eff3ca39	tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer) Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped dark on 2026-06-08 because these three env vars were never set, so prod ran the fake test-fixture providers — the only sight users ever saw was the placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia GeoSearch, story material to the five real web sources, and script-writing to claude-agent-service (token already present in tripit-secrets). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:22:10 +00:00
Viktor Barzin	5486b9d438	tripit: wire calendar-conflict column to Nextcloud CalDAV (#19) CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 18:13:01 +00:00
Viktor Barzin	81e01ec1c4	tripit: label namespace as chrome-service CDP client The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:42:53 +00:00
Viktor Barzin	980ec55418	tripit: enable live flight-fare scrape via shared chrome-service CDP Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:23:53 +00:00
Viktor Barzin	b3ef0dba76	authentik: ignore Keel-managed image_pull_policy on pgbouncer Keel flip-flops the pgbouncer container's imagePullPolicy, so the declared Always kept re-diffing on every plan. Ignore it like the image tag (KEEL_IGNORE pattern) — plan-to-zero restored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:34:44 +00:00
Viktor Barzin	4e88298976	authentik: incident hardening after the signin-speedup rollout storm The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md. Hardening (all root causes): - values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again - values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s) - values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits) - pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods - authentik_provider.tf: drop the completed import stanza (adoption rule) - traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:26:52 +00:00
Viktor Barzin	97ccdbecb8	authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path) Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time. Changes: - values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering — pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods. - authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API — authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe). - authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age — internal split-horizon users got zero caching). - traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods. - docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service). Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent — all first-party, the 4-weekly consent screen only slowed first-time signin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 21:58:10 +00:00
Viktor Barzin	70442ccdc6	t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+) Bound connection establishment via session ClientTimeout(total=None, connect=15) instead — works on 3.9 through current; total must stay None or the session timeout would kill the long-lived probe WS. Verified by a local 14s smoke run: cloudflare + internal legs both connect.	2026-06-10 21:26:09 +00:00
Viktor Barzin	9b55d53be0	t3: differential drop-attribution probe + devvm metrics Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md.	2026-06-10 21:11:29 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
Viktor Barzin	9fff77cbea	Merge branch 'wizard/budget-rate-limit'	2026-06-10 19:42:19 +00:00
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff — measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget- ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00
Viktor Barzin	176a65d3d2	plotting-book: TF baseline image follows what CI actually builds Viktor asked to verify the book-plotting push->build->deploy chain. The chain itself is healthy, but the Terraform baseline image said ancamilea/book-plotter:latest while CI (GHA on PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply would have resurrected a stale March image. Baseline now viktorbarzin/book-plotter:latest. No live change: the running tag is CI-owned via ignore_changes, plan confirms the image attr is ignored. [ci skip] deliberately: plan shows UNRELATED pre-existing drift on this stack (live ns labels managed-by=vault-user-onboarding + resource-governance/custom-quota=true would be stripped; deployment keel.sh/policy=patch annotations removed) — auto-applying that needs its own reviewed pass.	2026-06-10 18:37:14 +00:00
Viktor Barzin	de1d8b7bf3	technitium: add Brevo DKIM selector CNAMEs to internal zone [ci skip] The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo signs with brevo1/brevo2._domainkey, which are CNAMEs to b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent from the internal zone (the earlier existence check used ANY queries, which Cloudflare refuses per RFC 8482 — false negative). The DKIM permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the 6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX poll never finds them. ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd caches DNS (restart to flush after zone fixes); CoreDNS denial cache holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:07:38 +00:00
Viktor Barzin	00bc1e052d	technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip] Two fixes from the post-DNS-internalization health sweep: 1. The internal viktorbarzin.me zone served only ingress A/CNAME records. Since the mailserver pods now resolve the domain through it (CoreDNS viktorbarzin.me:53 -> Technitium, `59a531b8`), rspamd's SPF checks on inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the Brevo email-roundtrip probe failed from the 16:20 run onward (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also maintains the static mail-auth records (SPF, brevo-code TXT, MX; DMARC + DKIM were already present), idempotently. Principle: the internal zone must be a SUPERSET of the public zone for every record type internal clients consume. Verified in-pod: all four types resolve; roundtrip re-probe green. 2. cluster_healthcheck #30 queried instant `up`, which goes stale for ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant job -> intermittent false "redfish-idrac=missing". Now uses last_over_time(up[15m]) — same answers for fast jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:46:37 +00:00
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	a1b7b0ca53	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] The keep-set (newest 10 versions + latest + cache tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:22:47 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	87702bdce8	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	90b8312a29	tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0452611b5	forgejo: survive CI-build registry-push storms (mem 3Gi + working retention) Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt deferred): - Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it kept OOMing against. Size for the push spike. - Activate registry retention (DRY_RUN false). Verified the delete list against all running viktor/* images first: 0 running images affected. Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling. - FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo scopes container packages per-user, so DELETE on viktor/* returned 403 (the dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to viktor's write:package PAT. Retention had never actually worked. - Protect buildkit cache tags from retention (cleanup.sh keep-set) so the gentler-builds layer cache survives daily pruning. [ci skip] — already applied via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	98fe65e345	storage: migrate priority-pass uploads off proxmox-lvm-encrypted to NFS (Phase 1) Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted). 21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:47:07 +00:00
Viktor Barzin	5c378dd5e3	workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4) New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API). Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:50:40 +00:00
Viktor Barzin	173b1fc116	workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2) New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged. Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:47:00 +00:00
Viktor Barzin	6504911a77	matrix: open (tokenless) registration + bot mitigations + #security alert User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser challenges break native clients). Bot defense is layered instead: - Traefik rate-limit Middleware on a path-scoped /register ingress carve-out, keyed on request Host (GLOBAL /register cap) not source IP — the host is reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min, burst 20, per replica; CrowdSec is the hard backstop on both paths. - Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing #security Slack receiver (matches "registered on this server", never the rejection line). tuwunel's admin bot also posts signups to the admin room. Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert). Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so [ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:27:02 +00:00
Viktor Barzin	23602f393e	matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated) Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB drops the CNPG dependency (both init-containers, the db ESO, the Reloader annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation on, tuwunel-served well-known delegation to :443. server_name unchanged (matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path). Registered @viktor admin then disabled registration (403). Cleanup: removed the orphaned pg-matrix Vault static role and dropped the matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*. Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so [ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC tune-TTL drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:58:17 +00:00
Viktor Barzin	7501ea286b	tripit: wire planner subsystem (merged trip-planner) secrets + Slack webhook ingress [CI: all checks successful (ci/woodpecker/push/default, ci/woodpecker/push/build-cli)] - ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD / CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges). - New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated, same pattern as the calendar + emails-confirm carve-outs). - Remove the superseded standalone stacks/trip-planner (merged into tripit per the "future travel logic goes in tripit" policy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 09:26:21 +00:00
Viktor Barzin	f9d5cd6243	feat(tripit): wire real flight (AeroDataBox) + rail (RealtimeTrains) status Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was fabricated from a hash and never matched reality. Switch to real providers: - FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the tripit-secrets ExternalSecret) - RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault) - poll-flights cron */30 -> hourly to respect the free 600 req/month cap (provider also self-throttles to <=1 req/sec) Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with real schedule/terminal/aircraft. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	0d445d948c	stem95su: host STEM platform for 95. СУ (public NFS-backed static site) [CI: all checks successful (ci/woodpecker/push/default, ci/woodpecker/push/build-cli)] New public static site at stem95su.viktorbarzin.me serving the school's Bulgarian STEM platform (dashboard + lessons/games, externally authored HTML/media exported from Gemini). - Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume), NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool" or rsync), no rebuild; auto-backed-up offsite by nfs-mirror. - ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge), dns_type="proxied" (Cloudflare CNAME auto-created). - nginx ConfigMap sets index stem_board.html (the dashboard) for "/". - Docs: service-catalog entry + new "Static Site Hosting" pattern (NFS-backed vs image-baked) in patterns.md. Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB page, video byte-range, no Authentik redirect) through the public Cloudflare path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 15:21:21 +00:00
Viktor Barzin	c7ffbaa204	aiostreams: harden stream-probe + repair sources (RD-451 "few films" fix) Root cause of "barely serving films": Real-Debrid's May-2026 infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate new content), while degraded sources starved candidates. RD account + popular-title availability were healthy throughout (library 32/36 unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams). Runtime config (AIOStreams PG, applied via API — not in this diff): - Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title) and was silently dropping the bulk of its results at the 5s cutoff; Interstellar 430 -> 987 streams after the bump. - Removed MediaFusion preset: broken upstream ("Invalid configuration" -> 500 Internal Server Error), contributed 0 usable streams, only a dead [X] entry in every list. This diff (Terraform): - Harden aiostreams-stream-probe: test series AND movie paths, per-source breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count, success gated on Comet being alive. The old probe counted only Breaking Bad streams and stayed green while new-content playback was broken. - service-catalog: reflect source set + probe behaviour. [ci skip] — probe already applied via targeted `tg apply` + verified (series=378 movie=898 comet=206 errors=0 success=1); skipping the full servarr reconcile to avoid touching unrelated pre-existing drift (qbittorrent MetalLB annotation, tls_secret cert revert). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 07:21:42 +00:00
Viktor Barzin	4cdb9e1886	novelapp: switch Keel to semver (policy=major) now upstream tags are valid [CI: all checks successful (ci/woodpecker/push/default, ci/woodpecker/push/build-cli)] Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so drop the :latest+force+match-tag digest workaround and track semver properly: policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to climb to higher semver tags), image floor pinned to v1.1.3. Pull policy -> IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for the mutable :latest). Running v1.1.3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 22:56:46 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki [CI: some checks failed (ci/woodpecker/push/default failed; ci/woodpecker/push/build-cli successful)] Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	3696ff5922	novelapp: track :latest by digest (Keel force+match-tag), adopt into TF state Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as `v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see past the highest parseable tag. :latest correctly points at the newest release, so switch to force + match-tag digest-tracking of :latest (Kyverno does not manage match-tag, contrary to the stale code comment). Imports the live Deployment (recreated out-of-band 2026-06-06) back into TF state; running image flipped to :latest -> now on v.1.1.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4d8b782df1	feat(trip-planner): app stack (Tier-1, CNPG, Slack-signed webhook ingress) Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database (static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m cpu request), ClusterIP service port 8080, and ingress_factory with auth=none (Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied; requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	7c12fbba95	monitoring/alloy: drop cosmetic calico-typha 'Endpoints deprecated' warning calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1 Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1 Endpoints API will essentially never be removed (clients keep working indefinitely), and even the latest Calico still watches Endpoints (projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact deprecation message), mirroring the mailserver drop. Real calico warnings/errors kept; reversible. Validated with alloy fmt (exit 0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	4b13be6d48	dawarich: upgrade 1.6.1 -> 1.7.11 (removes RailsPulse, drops orphan tables) dawarich 1.6.1 shipped the RailsPulse perf-monitoring gem, which scheduled an hourly Sidekiq SummaryJob INDEPENDENT of its disabled flag; the job hit rails_pulse_routes (no primary key) and retry-looped, logging ~125 UnknownPrimaryKey lines/hr (found via Loki triage 2026-06-06). Upstream removed RailsPulse entirely in 1.7.x (commit a5172cc) with a DropRailsPulseTables migration; 1.7.11 is latest stable. Keel only auto-applies patch bumps within 1.6.x, so the minor jump is manual. Pre-upgrade pg_dump of dawarich (79.9MB) + dawarich_queue taken to devvm. The 5 rails_pulse_* tables are empty (feature never collected data), so cleanup is zero-data-risk; location data (tracks/points/visits/places) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00

1309 commits