infra

Author	SHA1	Message	Date
Viktor Barzin	39a22b352e	tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever First real window (2026-06-12 02:00): the chatterbox pod sat in ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported whole-tree but the chatterbox SUBDIR never existed on the host (the go-live runbook step needed NFS-host shell nobody doing the apply had). One-shot busybox Job mounts the export root and mkdir -p's the subtree; kubelet's mount retry then self-heals the pod. Audio queue (27 items) drains as soon as the model loads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 02:51:14 +00:00
Viktor Barzin	db63cd7501	android-emulator+traefik: non-merge apply trigger for the rate-limit fix Pipeline 102 applied nothing — the rate-limit commit entered master under a merge head and the changed-stack detector is blind to merge diffs. Plain commit touching both stacks so they apply.	2026-06-12 00:33:10 +00:00
Viktor Barzin	4d844d6fd4	Merge forgejo/master into wizard/emu-ratelimit	2026-06-12 00:26:05 +00:00
Viktor Barzin	152dad0a40	android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is unbundled and fetches ~60 ES modules in parallel on page open; the shared Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's loader waits on the missing modules indefinitely (reproduced: 38x429 in a 90-request burst through the ingress). Adds a dedicated 50/300 android-emulator-rate-limit middleware (actualbudget/immich pattern) and opts both emulator ingresses out of the shared limiter.	2026-06-12 00:25:44 +00:00
Viktor Barzin	d3d37a15ec	tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard Viktor asked 'can't we make it live? why the cronjob?' — the overnight window guaranteed VRAM room on the shared T4, but immich/frigate models idle-unload during the day so the card often has room (measured 10.3 GiB free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to 0 when the queue empties (also frees the card early inside the nightly window). Failed metrics scrape fail-safes to no-scale-up, same as the window preflight. The guard moves to all-day */5 — live synthesis can hold the card at any hour, so the yield-on-pressure watchdog must watch at any hour. tripit exposes the unauthenticated in-cluster queue count; a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up stays as the guaranteed nightly catch-up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 00:25:35 +00:00
Viktor Barzin	d818f7ed3b	android-emulator: README — measured resource profile + remote access + screen-off etiquette	2026-06-12 00:10:03 +00:00
Viktor Barzin	9af3e8860e	Merge origin/master (CI state-sync commits) into wizard/android-emulator-public	2026-06-12 00:08:14 +00:00
Viktor Barzin	43d2107760	android-emulator: public Authentik-gated ingress for the noVNC screen Viktor wants the emulator screen reachable over the web: adds android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik forward-auth — same-origin WebSockets through forward-auth are proven by the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains LAN-only since it is unauthenticated.	2026-06-12 00:07:49 +00:00
Viktor Barzin	9a2124f105	tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23) Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 23:53:49 +00:00
Viktor Barzin	02ed3062f6	android-emulator: non-merge apply trigger for v4 image rollout Pipeline 96 applied only tripit: the v4 bump (`577267cd`) entered master inside a merge whose first-parent diff hid stacks/android-emulator from the stack detector — same failure mode as the tts `798b0255` trigger. This plain commit touches the stack so the detector picks it up.	2026-06-11 23:48:16 +00:00
Viktor Barzin	2f8addc63b	Merge forgejo/master into wizard/android-emulator	2026-06-11 22:53:11 +00:00
Viktor Barzin	577267cd97	android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP Two final fixes from the live debugging session: (1) sdkmanager-latest emulator 36.6.11 hangs before executing a single guest instruction in this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off) while 36.1.9 boots Android in ~107s — the entrypoint now pins build 13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555, so socat's wildcard bind died with EADDRINUSE and its exit restarted the pod right after a successful boot — socat now binds the pod IP only.	2026-06-11 22:52:54 +00:00
Viktor Barzin	fba1659611	tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live) Viktor's tour-redo (tripit#29): the new image is rolled out, so the two new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch merged with claude-agent-service proposals, Focus-steered) and the Wikipedia place resolver (manual sight search + LLM-proposal resolution) leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:30:24 +00:00
Viktor Barzin	f74e421283	tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London) Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only now — the fill-tour-audio worker synthesizes the queued (story, telling, voice) audio while the tts stack's off-peak window (02:00-06:00) has Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model load, 04:30 insurance against a skipped window or guard yield. Daytime runs record tts_unreachable and exit quietly by design. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:24:29 +00:00
Viktor Barzin	85dbec6108	android-emulator: api36-v3 — avdmanager must run from inside the SDK root v2's marker fix proved the install completes, but avdmanager still saw no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root), deriving the SDK root from its own toolsdir — /opt/android in our image, while packages live on the PVC at /sdk. v3 seeds cmdline-tools into /sdk/cmdline-tools/latest once and runs avdmanager from there, so it resolves the PVC as the SDK root.	2026-06-11 21:15:50 +00:00
Viktor Barzin	5e8a988858	android-emulator: api36-v2 — marker-file install idempotency + retries First boot crashed mid-SDK-install, and the dir-existence check then skipped reinstall forever: avdmanager saw the partial tree and died with 'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks install completion with a marker file written only after sdkmanager succeeds + package.xml exists, wipes partial system-image trees before reinstalling, and retries sdkmanager 3x.	2026-06-11 20:59:08 +00:00
Viktor Barzin	3fac45febc	android-emulator: drop applied import stanzas; deployment recreates fresh The five imports from the last recovery commit are in state now (verified serial 4: everything except the deployment). The deployment kept falling out of state between runs, so instead of a third import round the broken 0-replica deployment object was deleted live (transient recovery step, presence-claimed) and this apply recreates it Terraform-owned with the quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors on importing already-managed addresses.	2026-06-11 20:49:37 +00:00
Viktor Barzin	6b7efcd2d6	android-emulator: import the five resources still missing from state Pipeline 88 imported the namespace but its refresh dropped the PVC, both services, the ingress and the tls secret from state (PG-backend state races on this new stack's first applies), so the apply again died on 'already exists' conflicts. State now holds namespace+deployment; adopt the missing five with import blocks (TF 1.5 errors on importing already-managed addresses, so only the missing set is listed). Stanzas come out once applied.	2026-06-11 20:44:09 +00:00
Viktor Barzin	b948224008	android-emulator: import orphaned namespace into state (lock-race recovery) Pipeline 85 created the namespace but a Terraform pg-backend workspace-creation lock race (new stack schema initializing while other stacks applied concurrently) left it out of the recorded state — every later apply then died with 'namespaces android-emulator already exists'. Adopt it with an import block per the house recovery pattern; stanza gets removed once it has applied.	2026-06-11 20:38:46 +00:00
Viktor Barzin	99c19584f7	android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory) First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi, limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like tiers 3/4 do, instead of opting the namespace out via custom-quota.	2026-06-11 19:56:09 +00:00
Viktor Barzin	6bf216751b	Merge forgejo/master (tts stack) into wizard/android-emulator # Conflicts: # stacks/tripit/main.tf	2026-06-11 19:53:07 +00:00
Viktor Barzin	8b7c77c794	android-emulator: new stack — shared in-cluster Android 16 testing instance Viktor is setting up an Android app development pipeline (tripit is the first app) and wants agents to natively test changes on Android before shipping. This adds the testing environment: an API-36 Google emulator under KVM as a privileged pod (namespace joins the Kyverno exclude list), SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP 10.0.20.200:5555 (LAN only), noVNC screen view at android-emulator.viktorbarzin.lan. Image is built manually from the stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo rejected).	2026-06-11 19:51:57 +00:00
Viktor Barzin	798b025580	tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector) The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD; on a merge commit that is the first-parent diff, which contained only the concurrently-landed files — stacks/tts never got applied (namespace still absent) and the kyverno re-trigger push got no pipeline at all. Single non-merge commit touching both stacks so the detector sees them; the sorted loop applies kyverno before tts, the order tripit#26 requires. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:08:23 +00:00
Viktor Barzin	a66aeac3b8	Merge remote-tracking branch 'forgejo/master' into wizard/tour-redo-env	2026-06-11 18:27:53 +00:00
Viktor Barzin	4a8c4f9a14	tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC Viktor's tour-guide redo (tripit#26): `87702bdc` committed this stack with [ci skip] so it was never applied — prod tripit has been pointing at a nonexistent chatterbox-tts service since. This commit triggers the apply and fixes the voices path: config pointed predefined_voices_path at the NFS PVC (/data/voices), which nobody can seed without NFS-host shell access and which would leave /v1/audio/voices empty (it gates readiness). Use the 28 voices bundled in the image at /app/voices instead; /data keeps reference audio (future cloning) and the HF model cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:27:44 +00:00
Viktor Barzin	318ce9b909	Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign	2026-06-11 18:23:40 +00:00
Viktor Barzin	df332b59e6	break-glass SSH: drop port-knock for exposed key-only :52222; version host config Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:39 +00:00
Viktor Barzin	7a1cc64898	kyverno: re-trigger apply of tts GPU-priority exclusion (`87702bdc` was [ci skip]'d) Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit `87702bdc` carried [ci skip], so CI never applied the kyverno change that keeps the tts namespace out of low-GPU-priority injection. This comment-only commit makes CI apply the already-committed change — step 1 of the kyverno -> tts -> tripit apply order. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:29 +00:00
Viktor Barzin	50eff3ca39	tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer) Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped dark on 2026-06-08 because these three env vars were never set, so prod ran the fake test-fixture providers — the only sight users ever saw was the placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia GeoSearch, story material to the five real web sources, and script-writing to claude-agent-service (token already present in tripit-secrets). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:22:10 +00:00
Viktor Barzin	5486b9d438	tripit: wire calendar-conflict column to Nextcloud CalDAV (#19) CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 18:13:01 +00:00
Viktor Barzin	e2788d1b2d	workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip] Viktor's agent-rules cleanup: the org claudeMd now carries only governance red-lines (RBAC tiers, per-user secrets, Terraform-only, git audit-trail rules, code-layout detection) and points to ~/.claude/rules/execution.md for the worktree lifecycle, which was previously duplicated here in full. Settings precedence and the model key are unchanged. Also refreshes a .gitignore comment that cited the old execution.md section numbering. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:02:43 +00:00
Viktor Barzin	c3a63fcd38	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] The raw string compare never matched qm config's canonical key order, so the hourly timer re-issued 'qm set' against every running capped VM, live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU (blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi controller path with no iothread. Viktor asked to root-cause the freeze before choosing fixes, then approved mitigating via VM settings: this commit fixes the hourly trigger and documents the incident; the controller swap (virtio-scsi-single + iothread=1 + aio=threads) is staged on VM 102 separately, pending his cold stop/start. Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain, ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md + proxmox-inventory.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:00:08 +00:00
Viktor Barzin	2e0cebff87	docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories: - compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified) - storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit `484b4c71`) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox - proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 17:50:43 +00:00
Viktor Barzin	81e01ec1c4	tripit: label namespace as chrome-service CDP client The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:42:53 +00:00
Viktor Barzin	980ec55418	tripit: enable live flight-fare scrape via shared chrome-service CDP Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 14:23:53 +00:00
Viktor Barzin	9b19caff47	t3: connection logging across the path for drop attribution Viktor asked to add connection logs (Traefik/Cloudflare) to catch the real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean while real tunnel sessions cycle every 15-35s, so the drop originates above t3-serve and we need to see which layer cuts the socket. Traefik (/ws duration) and cloudflared (WS close events) already ship to Loki; the gap was the devvm side. This adds: - t3-dispatch logs every /ws open/close with dur_ms + cause: downstream_closed (client/CF/Traefik hung up = last-mile/network), upstream_closed (t3-serve closed/reset), or graceful. Graceful closes previously left no trace (default ReverseProxy only logs on error), so a watchdog-driven reconnect was invisible. Helpers unit-tested. - devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch + t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the pve/rpi-sofia shippers. devvm was never in Loki (standalone VM). Joined in Loki the three layers attribute any future drop to a segment with no repro needed. Runbook + service-catalog updated.	2026-06-11 13:48:10 +00:00
Viktor Barzin	933e4649fb	Merge remote-tracking branch 'forgejo/master' into wizard/authentik-signin-speed	2026-06-11 00:35:56 +00:00
Viktor Barzin	b3ef0dba76	authentik: ignore Keel-managed image_pull_policy on pgbouncer Keel flip-flops the pgbouncer container's imagePullPolicy, so the declared Always kept re-diffing on every plan. Ignore it like the image tag (KEEL_IGNORE pattern) — plan-to-zero restored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:34:44 +00:00
Viktor Barzin	4e88298976	authentik: incident hardening after the signin-speedup rollout storm The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md. Hardening (all root causes): - values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again - values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s) - values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits) - pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods - authentik_provider.tf: drop the completed import stanza (adoption rule) - traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:26:52 +00:00
Viktor Barzin	bd60c3d5e0	pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin Follow-up to the pve-host Loki shipper (`aac807fb`). The host reached Loki via an /etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution (no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host; promtail now resolves the name purely via DNS (verified still shipping to Loki). insecure_skip_verify stays — the internal .lan cert isn't publicly trusted. Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin references. The DNS record is API-managed (the viktorbarzin.lan zone convention), not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync remains a noted follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 22:55:20 +00:00
Viktor Barzin	97ccdbecb8	authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path) Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time. Changes: - values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering — pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods. - authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API — authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe). - authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age — internal split-horizon users got zero caching). - traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods. - docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service). Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent — all first-party, the 4-weekly consent screen only slowed first-time signin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 21:58:10 +00:00
Viktor Barzin	93ba67c84a	devvm: install prometheus-node-exporter (was never installed) The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop attribution work, but the box had no node_exporter at all — installed via apt and persisted here so reprovisioning keeps it.	2026-06-10 21:29:17 +00:00
Viktor Barzin	046a4a32f3	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:26:10 +00:00
Viktor Barzin	70442ccdc6	t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+) Bound connection establishment via session ClientTimeout(total=None, connect=15) instead — works on 3.9 through current; total must stay None or the session timeout would kill the long-lived probe WS. Verified by a local 14s smoke run: cloudflare + internal legs both connect.	2026-06-10 21:26:09 +00:00
Viktor Barzin	4af5eff043	docs(multi-tenancy): note the on-demand web restore button The tmux-persist paragraph only described the boot-time restore. Document the new manual path — the web terminal's "Restore sessions" button (tmux-api POST /restore -> tmux-restore-user wrapper -> `tmux-persist restore <user>`) — and why it exists: an OOM that kills a user's tmux server WITHOUT a reboot never triggers the boot-only restore service, which is the common case under multi-user memory pressure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:22:41 +00:00
Viktor Barzin	a734155fb5	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:11:30 +00:00
Viktor Barzin	9b55d53be0	t3: differential drop-attribution probe + devvm metrics Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md.	2026-06-10 21:11:29 +00:00
Viktor Barzin	ecef09ab87	tmux-persist: add single-user restore mode (`restore [user]`) The web-terminal will get a "Restore sessions" button (common ask after an OOM kills a user's tmux server without a reboot, which the boot-only restore service doesn't catch). The button needs to restore ONE user's saved sessions on demand, so teach `restore` an optional <user> argument: with no arg it restores every terminal user (unchanged — the boot service path), with a <user> arg it validates the name against /etc/ttyd-user-map and restores only that user. Reuses the existing restore loop (single source of restore truth). The terminal-lobby tmux-api will invoke this as root via a validated tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to restoring everyone), no-arg path unchanged, shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:08:57 +00:00
Viktor Barzin	b5c6639272	t3-serve@: contain agent memory storms; survive child OOM kills Same t3-disconnect root-cause work: a runaway claude agent child grew to 10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off its spinning disk (system-wide multi-10s freezes = every t3 client's 20s watchdog firing = the 'frequent disconnects that self-recover'), then the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min because the default OOMPolicy=stop fails the unit when ANY cgroup child is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue so a runaway agent dies alone while the WS server keeps serving.	2026-06-10 21:00:06 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
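
Note on 9b55d53be0: the differential attribution described in that commit message ("whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra") can be sketched roughly as below. This is an illustrative sketch only, not the t3-probe's actual code. The leg names come from the commit message; the function name `attribute_drop`, the segment labels, and the rule that a failing 'internal' leg takes precedence over a failing 'cloudflare' leg (because both share the Traefik/dispatch segment) are assumptions.

```python
from typing import List, Mapping

# Leg names as given in the commit message.
LEGS = ("cloudflare", "internal", "t3serve")


def attribute_drop(leg_up: Mapping[str, bool]) -> List[str]:
    """Return the path segments implicated by the failing probe legs.

    An empty list means every leg is clean, i.e. infra is exonerated
    for a drop observed at the same time.
    Segment labels are hypothetical, chosen for illustration.
    """
    missing = [leg for leg in LEGS if leg not in leg_up]
    if missing:
        raise ValueError(f"missing leg status: {missing}")

    convicted: List[str] = []

    # The 't3serve' leg talks HTTP straight to the serve process.
    if not leg_up["t3serve"]:
        convicted.append("t3-serve")

    # 'internal' and 'cloudflare' share Traefik -> dispatch. If the
    # internal leg is down, the shared segment is the suspect (assumption).
    if not leg_up["internal"]:
        convicted.append("traefik/dispatch")
    elif not leg_up["cloudflare"]:
        # Only the Cloudflare-only part of the path differs.
        convicted.append("wan/cloudflare/tunnel")

    return convicted
```

For example, a status of cloudflare down with internal and t3serve up points at the WAN/Cloudflare/tunnel segment, while all three legs up during a user-visible drop returns an empty list.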

4198 commits