infra

Author	SHA1	Message	Date
Viktor Barzin	9af3e8860e	Merge origin/master (CI state-sync commits) into wizard/android-emulator-public	2026-06-12 00:08:14 +00:00
Viktor Barzin	43d2107760	android-emulator: public Authentik-gated ingress for the noVNC screen Viktor wants the emulator screen reachable over the web: adds android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik forward-auth — same-origin WebSockets through forward-auth are proven by the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains LAN-only since it is unauthenticated.	2026-06-12 00:07:49 +00:00
Viktor Barzin	9a2124f105	tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23) Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 23:53:49 +00:00
Viktor Barzin	02ed3062f6	android-emulator: non-merge apply trigger for v4 image rollout Pipeline 96 applied only tripit: the v4 bump (`577267cd`) entered master inside a merge whose first-parent diff hid stacks/android-emulator from the stack detector — same failure mode as the tts `798b0255` trigger. This plain commit touches the stack so the detector picks it up.	2026-06-11 23:48:16 +00:00
Viktor Barzin	2f8addc63b	Merge forgejo/master into wizard/android-emulator	2026-06-11 22:53:11 +00:00
Viktor Barzin	577267cd97	android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP Two final fixes from the live debugging session: (1) sdkmanager-latest emulator 36.6.11 hangs before executing a single guest instruction in this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off) while 36.1.9 boots Android in ~107s — the entrypoint now pins build 13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555, so socat's wildcard bind died with EADDRINUSE and its exit restarted the pod right after a successful boot — socat now binds the pod IP only.	2026-06-11 22:52:54 +00:00
Viktor Barzin	fba1659611	tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live) Viktor's tour-redo (tripit#29): the new image is rolled out, so the two new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch merged with claude-agent-service proposals, Focus-steered) and the Wikipedia place resolver (manual sight search + LLM-proposal resolution) leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:30:24 +00:00
Viktor Barzin	f74e421283	tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London) Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only now — the fill-tour-audio worker synthesizes the queued (story, telling, voice) audio while the tts stack's off-peak window (02:00-06:00) has Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model load, 04:30 insurance against a skipped window or guard yield. Daytime runs record tts_unreachable and exit quietly by design. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:24:29 +00:00
Viktor Barzin	85dbec6108	android-emulator: api36-v3 — avdmanager must run from inside the SDK root v2's marker fix proved the install completes, but avdmanager still saw no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root), deriving the SDK root from its own toolsdir — /opt/android in our image, while packages live on the PVC at /sdk. v3 seeds cmdline-tools into /sdk/cmdline-tools/latest once and runs avdmanager from there, so it resolves the PVC as the SDK root.	2026-06-11 21:15:50 +00:00
Viktor Barzin	5e8a988858	android-emulator: api36-v2 — marker-file install idempotency + retries First boot crashed mid-SDK-install, and the dir-existence check then skipped reinstall forever: avdmanager saw the partial tree and died with 'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks install completion with a marker file written only after sdkmanager succeeds + package.xml exists, wipes partial system-image trees before reinstalling, and retries sdkmanager 3x.	2026-06-11 20:59:08 +00:00
Viktor Barzin	3fac45febc	android-emulator: drop applied import stanzas; deployment recreates fresh The five imports from the last recovery commit are in state now (verified serial 4: everything except the deployment). The deployment kept falling out of state between runs, so instead of a third import round the broken 0-replica deployment object was deleted live (transient recovery step, presence-claimed) and this apply recreates it Terraform-owned with the quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors on importing already-managed addresses.	2026-06-11 20:49:37 +00:00
Viktor Barzin	6b7efcd2d6	android-emulator: import the five resources still missing from state Pipeline 88 imported the namespace but its refresh dropped the PVC, both services, the ingress and the tls secret from state (PG-backend state races on this new stack's first applies), so the apply again died on 'already exists' conflicts. State now holds namespace+deployment; adopt the missing five with import blocks (TF 1.5 errors on importing already-managed addresses, so only the missing set is listed). Stanzas come out once applied.	2026-06-11 20:44:09 +00:00
Viktor Barzin	b948224008	android-emulator: import orphaned namespace into state (lock-race recovery) Pipeline 85 created the namespace but a Terraform pg-backend workspace-creation lock race (new stack schema initializing while other stacks applied concurrently) left it out of the recorded state — every later apply then died with 'namespaces android-emulator already exists'. Adopt it with an import block per the house recovery pattern; stanza gets removed once it has applied.	2026-06-11 20:38:46 +00:00
Viktor Barzin	99c19584f7	android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory) First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi, limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like tiers 3/4 do, instead of opting the namespace out via custom-quota.	2026-06-11 19:56:09 +00:00
Viktor Barzin	6bf216751b	Merge forgejo/master (tts stack) into wizard/android-emulator # Conflicts: # stacks/tripit/main.tf	2026-06-11 19:53:07 +00:00
Viktor Barzin	8b7c77c794	android-emulator: new stack — shared in-cluster Android 16 testing instance Viktor is setting up an Android app development pipeline (tripit is the first app) and wants agents to natively test changes on Android before shipping. This adds the testing environment: an API-36 Google emulator under KVM as a privileged pod (namespace joins the Kyverno exclude list), SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP 10.0.20.200:5555 (LAN only), noVNC screen view at android-emulator.viktorbarzin.lan. Image is built manually from the stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo rejected).	2026-06-11 19:51:57 +00:00
Viktor Barzin	798b025580	tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector) The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD; on a merge commit that is the first-parent diff, which contained only the concurrently-landed files — stacks/tts never got applied (namespace still absent) and the kyverno re-trigger push got no pipeline at all. Single non-merge commit touching both stacks so the detector sees them; the sorted loop applies kyverno before tts, the order tripit#26 requires. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:08:23 +00:00
Viktor Barzin	a66aeac3b8	Merge remote-tracking branch 'forgejo/master' into wizard/tour-redo-env	2026-06-11 18:27:53 +00:00
Viktor Barzin	4a8c4f9a14	tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC Viktor's tour-guide redo (tripit#26): `87702bdc` committed this stack with [ci skip] so it was never applied — prod tripit has been pointing at a nonexistent chatterbox-tts service since. This commit triggers the apply and fixes the voices path: config pointed predefined_voices_path at the NFS PVC (/data/voices), which nobody can seed without NFS-host shell access and which would leave /v1/audio/voices empty (it gates readiness). Use the 28 voices bundled in the image at /app/voices instead; /data keeps reference audio (future cloning) and the HF model cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:27:44 +00:00
Viktor Barzin	318ce9b909	Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign	2026-06-11 18:23:40 +00:00
Viktor Barzin	df332b59e6	break-glass SSH: drop port-knock for exposed key-only :52222; version host config Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:39 +00:00
Viktor Barzin	7a1cc64898	kyverno: re-trigger apply of tts GPU-priority exclusion (`87702bdc` was [ci skip]'d) Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit `87702bdc` carried [ci skip], so CI never applied the kyverno change that keeps the tts namespace out of low-GPU-priority injection. This comment-only commit makes CI apply the already-committed change — step 1 of the kyverno -> tts -> tripit apply order. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:29 +00:00
Viktor Barzin	50eff3ca39	tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer) Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped dark on 2026-06-08 because these three env vars were never set, so prod ran the fake test-fixture providers — the only sight users ever saw was the placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia GeoSearch, story material to the five real web sources, and script-writing to claude-agent-service (token already present in tripit-secrets). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:22:10 +00:00
Viktor Barzin	5486b9d438	tripit: wire calendar-conflict column to Nextcloud CalDAV (#19) CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 18:13:01 +00:00
Viktor Barzin	e2788d1b2d	workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip] Viktor's agent-rules cleanup: the org claudeMd now carries only governance red-lines (RBAC tiers, per-user secrets, Terraform-only, git audit-trail rules, code-layout detection) and points to ~/.claude/rules/execution.md for the worktree lifecycle, which was previously duplicated here in full. Settings precedence and the model key are unchanged. Also refreshes a .gitignore comment that cited the old execution.md section numbering. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:02:43 +00:00
Viktor Barzin	c3a63fcd38	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] The raw string compare never matched qm config's canonical key order, so the hourly timer re-issued 'qm set' against every running capped VM, live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU (blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi controller path with no iothread. Viktor asked to root-cause the freeze before choosing fixes, then approved mitigating via VM settings: this commit fixes the hourly trigger and documents the incident; the controller swap (virtio-scsi-single + iothread=1 + aio=threads) is staged on VM 102 separately, pending his cold stop/start. Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain, ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md + proxmox-inventory.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:00:08 +00:00
Viktor Barzin	2e0cebff87	docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories: - compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified) - storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit `484b4c71`) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox - proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 17:50:43 +00:00
Viktor Barzin	81e01ec1c4	tripit: label namespace as chrome-service CDP client The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:42:53 +00:00
Viktor Barzin	980ec55418	tripit: enable live flight-fare scrape via shared chrome-service CDP Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:23:53 +00:00
Viktor Barzin	9b19caff47	t3: connection logging across the path for drop attribution Viktor asked to add connection logs (Traefik/Cloudflare) to catch the real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean while real tunnel sessions cycle every 15-35s, so the drop originates above t3-serve and we need to see which layer cuts the socket. Traefik (/ws duration) and cloudflared (WS close events) already ship to Loki; the gap was the devvm side. This adds: - t3-dispatch logs every /ws open/close with dur_ms + cause: downstream_closed (client/CF/Traefik hung up = last-mile/network), upstream_closed (t3-serve closed/reset), or graceful. Graceful closes previously left no trace (default ReverseProxy only logs on error), so a watchdog-driven reconnect was invisible. Helpers unit-tested. - devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch + t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the pve/rpi-sofia shippers. devvm was never in Loki (standalone VM). Joined in Loki the three layers attribute any future drop to a segment with no repro needed. Runbook + service-catalog updated.	2026-06-11 13:48:10 +00:00
Viktor Barzin	933e4649fb	Merge remote-tracking branch 'forgejo/master' into wizard/authentik-signin-speed	2026-06-11 00:35:56 +00:00
Viktor Barzin	b3ef0dba76	authentik: ignore Keel-managed image_pull_policy on pgbouncer Keel flip-flops the pgbouncer container's imagePullPolicy, so the declared Always kept re-diffing on every plan. Ignore it like the image tag (KEEL_IGNORE pattern) — plan-to-zero restored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:34:44 +00:00
Viktor Barzin	4e88298976	authentik: incident hardening after the signin-speedup rollout storm The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md. Hardening (all root causes): - values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again - values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s) - values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits) - pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods - authentik_provider.tf: drop the completed import stanza (adoption rule) - traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:26:52 +00:00
Viktor Barzin	bd60c3d5e0	pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin Follow-up to the pve-host Loki shipper (`aac807fb`). The host reached Loki via an /etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution (no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host; promtail now resolves the name purely via DNS (verified still shipping to Loki). insecure_skip_verify stays — the internal .lan cert isn't publicly trusted. Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin references. The DNS record is API-managed (the viktorbarzin.lan zone convention), not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync remains a noted follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 22:55:20 +00:00
Viktor Barzin	97ccdbecb8	authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path) Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time. Changes: - values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering — pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods. - authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API — authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe). - authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age — internal split-horizon users got zero caching). - traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods. - docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service). Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent — all first-party, the 4-weekly consent screen only slowed first-time signin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 21:58:10 +00:00
Viktor Barzin	93ba67c84a	devvm: install prometheus-node-exporter (was never installed) The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop attribution work, but the box had no node_exporter at all — installed via apt and persisted here so reprovisioning keeps it.	2026-06-10 21:29:17 +00:00
Viktor Barzin	046a4a32f3	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:26:10 +00:00
Viktor Barzin	70442ccdc6	t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+) Bound connection establishment via session ClientTimeout(total=None, connect=15) instead — works on 3.9 through current; total must stay None or the session timeout would kill the long-lived probe WS. Verified by a local 14s smoke run: cloudflare + internal legs both connect.	2026-06-10 21:26:09 +00:00
Viktor Barzin	4af5eff043	docs(multi-tenancy): note the on-demand web restore button The tmux-persist paragraph only described the boot-time restore. Document the new manual path — the web terminal's "Restore sessions" button (tmux-api POST /restore -> tmux-restore-user wrapper -> `tmux-persist restore <user>`) — and why it exists: an OOM that kills a user's tmux server WITHOUT a reboot never triggers the boot-only restore service, which is the common case under multi-user memory pressure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:22:41 +00:00
Viktor Barzin	a734155fb5	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:11:30 +00:00
Viktor Barzin	9b55d53be0	t3: differential drop-attribution probe + devvm metrics Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md.	2026-06-10 21:11:29 +00:00
Viktor Barzin	ecef09ab87	tmux-persist: add single-user restore mode (`restore [user]`) The web-terminal will get a "Restore sessions" button (common ask after an OOM kills a user's tmux server without a reboot, which the boot-only restore service doesn't catch). The button needs to restore ONE user's saved sessions on demand, so teach `restore` an optional <user> argument: with no arg it restores every terminal user (unchanged — the boot service path), with a <user> arg it validates the name against /etc/ttyd-user-map and restores only that user. Reuses the existing restore loop (single source of restore truth). The terminal-lobby tmux-api will invoke this as root via a validated tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to restoring everyone), no-arg path unchanged, shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:08:57 +00:00
Viktor Barzin	b5c6639272	t3-serve@: contain agent memory storms; survive child OOM kills Same t3-disconnect root-cause work: a runaway claude agent child grew to 10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off its spinning disk (system-wide multi-10s freezes = every t3 client's 20s watchdog firing = the 'frequent disconnects that self-recover'), then the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min because the default OOMPolicy=stop fails the unit when ANY cgroup child is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue so a runaway agent dies alone while the WS server keeps serving.	2026-06-10 21:00:06 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
Viktor Barzin	ac6f19dd3b	tmux-persist: never let an empty snapshot clobber a saved manifest [CI: all checks successful] emo's 5 web-terminal tmux sessions were OOM-killed (the server died, no reboot), and the 5-minute save tick then overwrote his session manifest with 0 bytes, wiping the record that restore needs. Root cause: the save guard only checked that the tmux socket file existed, but an OOM-killed server leaves a stale /tmp/tmux-<uid>/default behind; list-panes then returns nothing and that empty capture was installed over the good manifest. Because the restore service only runs at boot, an OOM (not a reboot) skips restore entirely, so the clobbered manifest was the only record left, and it was already gone. Fix: only overwrite <user>.tsv when the snapshot captured >=1 live session; otherwise keep the last good manifest (now covers no-server AND stale-socket/dead-server). Verified by reproducing the 0-byte clobber on the old script and confirming the new one preserves the manifest, plus a live save that still captures every active session. emo's 5 sessions were recovered from their transcripts and are back; this keeps the next OOM from destroying the manifest again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 20:38:59 +00:00
Viktor Barzin	9fff77cbea	Merge branch 'wizard/budget-rate-limit' [CI: some checks failed; ci/woodpecker/push/default failed, ci/woodpecker/push/build-cli successful]	2026-06-10 19:42:19 +00:00
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/*.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff, measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget-* ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	8304ef0f70	Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master [CI: all checks successful] Reconciles the two live infra remotes after the pve-host logging change landed on forgejo (which was a commit behind origin). Non-destructive merge: keeps both `eae35c51` (pfsense webmail SNI routing) and `aac807fb` (pve-host Loki shipping).	2026-06-10 19:35:55 +00:00
Viktor Barzin	aac807fb3a	pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH [CI: all checks successful] Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated shared-root key emo-pve-agent@devvm) so he can manage the host (e.g. the R730 fan daemon) through his agent. To keep an audit trail of what that agent does, and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its systemd journal to cluster Loki: - snoopy logs every execve() to journald (identifier=snoopy), enabled via /etc/ld.so.preload; config scripts/pve-snoopy.ini. - promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"} (full host journal; filter identifier="snoopy" for the command audit), and relabels sshd auth to {job="sshd-pve"}, which ACTIVATES S1 (it was PENDING only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}. S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to 192.168.1.2, which is already in the S1 source-IP allowlist. Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan); follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers. Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia promtail; these files are the source of truth. Docs updated: security.md (S1 LIVE) and monitoring.md ("External host: pve"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 19:31:45 +00:00
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00


4192 commits
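
Note on ac6f19dd3b (tmux-persist save guard): the fix described there, overwriting <user>.tsv only when the snapshot captured at least one live session, can be sketched as below. This is an illustration only, not the script from the repository, which is a shell script. The function name `install_snapshot`, the one-session-per-line manifest layout, and the temp-file-plus-rename step are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def install_snapshot(manifest: Path, snapshot: str) -> bool:
    """Install `snapshot` as the new manifest only if it lists >= 1 session.

    Hypothetical helper: assumes one non-blank line per live session.
    Returns True when the manifest was replaced, False when the last
    good manifest was kept.
    """
    rows = [line for line in snapshot.splitlines() if line.strip()]
    if not rows:
        # Empty capture (no server, or a stale socket left by a dead
        # server): keep the previous manifest untouched.
        return False

    # Write to a temp file in the same directory, then rename over the
    # manifest, so a crash mid-write cannot leave a truncated file.
    fd, tmp_name = tempfile.mkstemp(dir=manifest.parent, prefix=manifest.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(rows) + "\n")
        os.replace(tmp_name, manifest)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return True
```

The point of the guard is that it keys on the content of the capture rather than on whether the socket file exists, so the no-server and stale-socket cases take the same "keep the last good manifest" path.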