Commit graph

1336 commits

Author SHA1 Message Date
Viktor Barzin
3978eec53a Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 19:45:06 +00:00
Viktor Barzin
b2bd859a8e android-emulator: NVIDIA_DRIVER_CAPABILITIES=all — graphics libs for -gpu host
First GPU boot verified qemu attached to the T4, but the guest GL
translator reported llvmpipe: the GPU operator injects only
compute,utility by default, so the NVIDIA EGL/GL vendor libraries were
absent and gfxstream silently fell back to software GL. The graphics
capability completes the hardware rendering path.
2026-06-12 19:43:25 +00:00
Viktor Barzin
0216e993dc etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:

1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
   (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
   writes + a pod-creation admission webhook, purely to feed a dashboard. krr
   (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.

2. Disable kyverno reporting (admission/aggregate/background). policyReports were
   already off, so the pipeline generated ephemeralreports + an hourly
   all-resource etcd re-scan for NO user-facing output. Admission enforcement
   (deny-* policies) and Keel mutation are unaffected; violations surface via
   Loki->Slack.

3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).

Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.

Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 19:41:22 +00:00
Viktor Barzin
16adda2c48 android-emulator: gate reaches the kube API via env vars, not DNS
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real wake attempt 500'd: kubernetes.default.svc does not resolve
from the gate's alpine pod (musl + injected dns_config ndots quirk), so
every kube call failed with 'Name does not resolve'. Use the injected
KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster
endpoint, no DNS dependency. ConfigMap checksum annotation rolls the
gate automatically.
2026-06-12 19:32:34 +00:00
Viktor Barzin
b1b9de90e4 tripit: tripit-api ingress joins the dedicated 100/1000 rate-limit
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to eef4dc7f: the Android Shell's dedicated bearer-auth host
(tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was
still on the default 10/50 limiter — the shell's photo grid would have
hit the identical 429 wall Viktor just reported on the PWA host.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:18:40 +00:00
Viktor Barzin
eef4dc7f63 tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich
thumbnail proxies through tripit's /api, so a few-hundred-photo trip is
that many parallel GETs from one IP — far past the shared Traefik
limiter's average 10 / burst 50. Fourth instance of the parallel-asset
pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated
tripit-rate-limit middleware (average 100, burst 1000) +
skip_default_rate_limit on the main tripit ingress only. The token-gated
calendar/email/slack carve-outs keep the strict default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:15:56 +00:00
Viktor Barzin
e8a4eb0f05 tripit: satisfy the auth-comment lint on the tripit-api ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The previous commit (c5631cff) failed CI's ingress_factory guard: the
'# auth = "none": <why>' justification must sit directly above the auth
line inside the module, not above the module block. Same content, moved
to where the lint looks; no functional change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:53:02 +00:00
Viktor Barzin
c5631cff74 tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell
cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017
it logs in with OIDC Code+PKCE and calls the API with bearer JWTs:

- authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK
  holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback,
  RS256, 1h access / 90d refresh (offline_access mapping attached so refresh
  tokens are issued), plus the TripIt App application.
- main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit
  Service, no forward-auth (backend validates the JWTs itself once tripit
  AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the
  existing traefik strip-auth-headers middleware so the header fallback can
  never be spoofed through this host.

Closes nothing here; tracked as viktor/tripit#49.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:47:46 +00:00
Viktor Barzin
b985686661 android-emulator: non-merge apply trigger (GPU + wake gate)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 07:53:38 +00:00
Viktor Barzin
18ccd57b63 Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 07:53:12 +00:00
Viktor Barzin
f4dd515fd7 android-emulator: GPU rendering on node1 + scale-to-zero wake gate
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
2026-06-12 07:52:50 +00:00
Viktor Barzin
b598c61c61 android-emulator: scale to 0 — its CPU burn was starving etcd
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 07:31:46 +00:00
Viktor Barzin
39a22b352e tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real window (2026-06-12 02:00): the chatterbox pod sat in
ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported
whole-tree but the chatterbox SUBDIR never existed on the host (the
go-live runbook step needed NFS-host shell nobody doing the apply had).
One-shot busybox Job mounts the export root and mkdir -p's the subtree;
kubelet's mount retry then self-heals the pod. Audio queue (27 items)
drains as soon as the model loads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 02:51:14 +00:00
Viktor Barzin
db63cd7501 android-emulator+traefik: non-merge apply trigger for the rate-limit fix
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
2026-06-12 00:33:10 +00:00
Viktor Barzin
4d844d6fd4 Merge forgejo/master into wizard/emu-ratelimit
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline failed
2026-06-12 00:26:05 +00:00
Viktor Barzin
152dad0a40 android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
2026-06-12 00:25:44 +00:00
Viktor Barzin
d3d37a15ec tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Viktor asked 'can't we make it live? why the cronjob?' — the overnight
window guaranteed VRAM room on the shared T4, but immich/frigate models
idle-unload during the day so the card often has room (measured 10.3 GiB
free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when
tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to
0 when the queue empties (also frees the card early inside the nightly
window). Failed metrics scrape fail-safes to no-scale-up, same as the
window preflight. The guard moves to all-day */5 — live synthesis can
hold the card at any hour, so the yield-on-pressure watchdog must watch
at any hour. tripit exposes the unauthenticated in-cluster queue count;
a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up
stays as the guaranteed nightly catch-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 00:25:35 +00:00
Viktor Barzin
d818f7ed3b android-emulator: README — measured resource profile + remote access + screen-off etiquette
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 00:10:03 +00:00
Viktor Barzin
9af3e8860e Merge origin/master (CI state-sync commits) into wizard/android-emulator-public
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 00:08:14 +00:00
Viktor Barzin
43d2107760 android-emulator: public Authentik-gated ingress for the noVNC screen
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
2026-06-12 00:07:49 +00:00
Viktor Barzin
9a2124f105 tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23)
Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 23:53:49 +00:00
Viktor Barzin
02ed3062f6 android-emulator: non-merge apply trigger for v4 image rollout
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
2026-06-11 23:48:16 +00:00
Viktor Barzin
2f8addc63b Merge forgejo/master into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline failed
2026-06-11 22:53:11 +00:00
Viktor Barzin
577267cd97 android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
2026-06-11 22:52:54 +00:00
Viktor Barzin
fba1659611 tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-redo (tripit#29): the new image is rolled out, so the two
new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch
merged with claude-agent-service proposals, Focus-steered) and the
Wikipedia place resolver (manual sight search + LLM-proposal resolution)
leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:30:24 +00:00
Viktor Barzin
f74e421283 tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:24:29 +00:00
Viktor Barzin
85dbec6108 android-emulator: api36-v3 — avdmanager must run from inside the SDK root
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
2026-06-11 21:15:50 +00:00
Viktor Barzin
5e8a988858 android-emulator: api36-v2 — marker-file install idempotency + retries
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
ci/woodpecker/push/default Pipeline failed
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
2026-06-11 20:59:08 +00:00
Viktor Barzin
3fac45febc android-emulator: drop applied import stanzas; deployment recreates fresh
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
2026-06-11 20:49:37 +00:00
Viktor Barzin
6b7efcd2d6 android-emulator: import the five resources still missing from state
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
2026-06-11 20:44:09 +00:00
Viktor Barzin
b948224008 android-emulator: import orphaned namespace into state (lock-race recovery)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
2026-06-11 20:38:46 +00:00
Viktor Barzin
99c19584f7 android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory)
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
2026-06-11 19:56:09 +00:00
Viktor Barzin
6bf216751b Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
798b025580 tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD;
on a merge commit that is the first-parent diff, which contained only the
concurrently-landed files — stacks/tts never got applied (namespace still
absent) and the kyverno re-trigger push got no pipeline at all. Single
non-merge commit touching both stacks so the detector sees them; the
sorted loop applies kyverno before tts, the order tripit#26 requires.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:08:23 +00:00
Viktor Barzin
4a8c4f9a14 tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC
Viktor's tour-guide redo (tripit#26): 87702bdc committed this stack with
[ci skip] so it was never applied — prod tripit has been pointing at a
nonexistent chatterbox-tts service since. This commit triggers the apply
and fixes the voices path: config pointed predefined_voices_path at the
NFS PVC (/data/voices), which nobody can seed without NFS-host shell
access and which would leave /v1/audio/voices empty (it gates readiness).
Use the 28 voices bundled in the image at /app/voices instead; /data
keeps reference audio (future cloning) and the HF model cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:27:44 +00:00
Viktor Barzin
7a1cc64898 kyverno: re-trigger apply of tts GPU-priority exclusion (87702bdc was [ci skip]'d)
Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit
87702bdc carried [ci skip], so CI never applied the kyverno change that
keeps the tts namespace out of low-GPU-priority injection. This comment-only
commit makes CI apply the already-committed change — step 1 of the
kyverno -> tts -> tripit apply order.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:29 +00:00
Viktor Barzin
50eff3ca39 tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped
dark on 2026-06-08 because these three env vars were never set, so prod ran
the fake test-fixture providers — the only sight users ever saw was the
placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia
GeoSearch, story material to the five real web sources, and script-writing
to claude-agent-service (token already present in tripit-secrets).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:22:10 +00:00
Viktor Barzin
5486b9d438 tripit: wire calendar-conflict column to Nextcloud CalDAV (#19)
CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 18:13:01 +00:00
Viktor Barzin
81e01ec1c4 tripit: label namespace as chrome-service CDP client
The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:42:53 +00:00
Viktor Barzin
980ec55418 tripit: enable live flight-fare scrape via shared chrome-service CDP
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:23:53 +00:00
Viktor Barzin
b3ef0dba76 authentik: ignore Keel-managed image_pull_policy on pgbouncer
Keel flip-flops the pgbouncer container's imagePullPolicy, so the
declared Always kept re-diffing on every plan. Ignore it like the
image tag (KEEL_IGNORE pattern) — plan-to-zero restored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:34:44 +00:00
Viktor Barzin
4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00
Viktor Barzin
97ccdbecb8 authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:58:10 +00:00
Viktor Barzin
70442ccdc6 t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+)
Bound connection establishment via session ClientTimeout(total=None,
connect=15) instead — works on 3.9 through current; total must stay None
or the session timeout would kill the long-lived probe WS. Verified by a
local 14s smoke run: cloudflare + internal legs both connect.
2026-06-10 21:26:09 +00:00
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
d5fdc7ffe9 cloudflared: disable in-place autoupdate (--no-autoupdate)
Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
2026-06-10 21:00:05 +00:00
Viktor Barzin
9fff77cbea Merge branch 'wizard/budget-rate-limit'
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 19:42:19 +00:00
Viktor Barzin
acb847b858 actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).

Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:36:42 +00:00
Viktor Barzin
eae35c511a pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.

Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.

Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:41:07 +00:00