Commit graph

4205 commits

Author SHA1 Message Date
Viktor Barzin
eef4dc7f63 tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich
thumbnail proxies through tripit's /api, so a few-hundred-photo trip is
that many parallel GETs from one IP — far past the shared Traefik
limiter's average 10 / burst 50. Fourth instance of the parallel-asset
pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated
tripit-rate-limit middleware (average 100, burst 1000) +
skip_default_rate_limit on the main tripit ingress only. The token-gated
calendar/email/slack carve-outs keep the strict default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:15:56 +00:00
Viktor Barzin
e8a4eb0f05 tripit: satisfy the auth-comment lint on the tripit-api ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The previous commit (c5631cff) failed CI's ingress_factory guard: the
'# auth = "none": <why>' justification must sit directly above the auth
line inside the module, not above the module block. Same content, moved
to where the lint looks; no functional change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:53:02 +00:00
Viktor Barzin
c5631cff74 tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell
cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017
it logs in with OIDC Code+PKCE and calls the API with bearer JWTs:

- authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK
  holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback,
  RS256, 1h access / 90d refresh (offline_access mapping attached so refresh
  tokens are issued), plus the TripIt App application.
- main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit
  Service, no forward-auth (backend validates the JWTs itself once tripit
  AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the
  existing traefik strip-auth-headers middleware so the header fallback can
  never be spoofed through this host.

Closes nothing here; tracked as viktor/tripit#49.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:47:46 +00:00
Viktor Barzin
b985686661 android-emulator: non-merge apply trigger (GPU + wake gate)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 07:53:38 +00:00
Viktor Barzin
18ccd57b63 Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 07:53:12 +00:00
Viktor Barzin
f4dd515fd7 android-emulator: GPU rendering on node1 + scale-to-zero wake gate
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
2026-06-12 07:52:50 +00:00
Viktor Barzin
b598c61c61 android-emulator: scale to 0 — its CPU burn was starving etcd
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 07:31:46 +00:00
Viktor Barzin
39a22b352e tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real window (2026-06-12 02:00): the chatterbox pod sat in
ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported
whole-tree but the chatterbox SUBDIR never existed on the host (the
go-live runbook step needed NFS-host shell nobody doing the apply had).
One-shot busybox Job mounts the export root and mkdir -p's the subtree;
kubelet's mount retry then self-heals the pod. Audio queue (27 items)
drains as soon as the model loads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 02:51:14 +00:00
Viktor Barzin
db63cd7501 android-emulator+traefik: non-merge apply trigger for the rate-limit fix
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
2026-06-12 00:33:10 +00:00
Viktor Barzin
4d844d6fd4 Merge forgejo/master into wizard/emu-ratelimit
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline failed
2026-06-12 00:26:05 +00:00
Viktor Barzin
152dad0a40 android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
2026-06-12 00:25:44 +00:00
Viktor Barzin
d3d37a15ec tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Viktor asked 'can't we make it live? why the cronjob?' — the overnight
window guaranteed VRAM room on the shared T4, but immich/frigate models
idle-unload during the day so the card often has room (measured 10.3 GiB
free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when
tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to
0 when the queue empties (also frees the card early inside the nightly
window). Failed metrics scrape fail-safes to no-scale-up, same as the
window preflight. The guard moves to all-day */5 — live synthesis can
hold the card at any hour, so the yield-on-pressure watchdog must watch
at any hour. tripit exposes the unauthenticated in-cluster queue count;
a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up
stays as the guaranteed nightly catch-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 00:25:35 +00:00
Viktor Barzin
d818f7ed3b android-emulator: README — measured resource profile + remote access + screen-off etiquette
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 00:10:03 +00:00
Viktor Barzin
9af3e8860e Merge origin/master (CI state-sync commits) into wizard/android-emulator-public
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 00:08:14 +00:00
Viktor Barzin
43d2107760 android-emulator: public Authentik-gated ingress for the noVNC screen
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
2026-06-12 00:07:49 +00:00
Viktor Barzin
9a2124f105 tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23)
Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 23:53:49 +00:00
Viktor Barzin
02ed3062f6 android-emulator: non-merge apply trigger for v4 image rollout
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
2026-06-11 23:48:16 +00:00
Viktor Barzin
2f8addc63b Merge forgejo/master into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline failed
2026-06-11 22:53:11 +00:00
Viktor Barzin
577267cd97 android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
2026-06-11 22:52:54 +00:00
Viktor Barzin
fba1659611 tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-redo (tripit#29): the new image is rolled out, so the two
new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch
merged with claude-agent-service proposals, Focus-steered) and the
Wikipedia place resolver (manual sight search + LLM-proposal resolution)
leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:30:24 +00:00
Viktor Barzin
f74e421283 tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:24:29 +00:00
Viktor Barzin
85dbec6108 android-emulator: api36-v3 — avdmanager must run from inside the SDK root
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
2026-06-11 21:15:50 +00:00
Viktor Barzin
5e8a988858 android-emulator: api36-v2 — marker-file install idempotency + retries
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
ci/woodpecker/push/default Pipeline failed
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
2026-06-11 20:59:08 +00:00
Viktor Barzin
3fac45febc android-emulator: drop applied import stanzas; deployment recreates fresh
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
2026-06-11 20:49:37 +00:00
Viktor Barzin
6b7efcd2d6 android-emulator: import the five resources still missing from state
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
2026-06-11 20:44:09 +00:00
Viktor Barzin
b948224008 android-emulator: import orphaned namespace into state (lock-race recovery)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
2026-06-11 20:38:46 +00:00
Viktor Barzin
99c19584f7 android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory)
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
2026-06-11 19:56:09 +00:00
Viktor Barzin
6bf216751b Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
798b025580 tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD;
on a merge commit that is the first-parent diff, which contained only the
concurrently-landed files — stacks/tts never got applied (namespace still
absent) and the kyverno re-trigger push got no pipeline at all. Single
non-merge commit touching both stacks so the detector sees them; the
sorted loop applies kyverno before tts, the order tripit#26 requires.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:08:23 +00:00
Viktor Barzin
a66aeac3b8 Merge remote-tracking branch 'forgejo/master' into wizard/tour-redo-env
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:27:53 +00:00
Viktor Barzin
4a8c4f9a14 tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC
Viktor's tour-guide redo (tripit#26): 87702bdc committed this stack with
[ci skip] so it was never applied — prod tripit has been pointing at a
nonexistent chatterbox-tts service since. This commit triggers the apply
and fixes the voices path: config pointed predefined_voices_path at the
NFS PVC (/data/voices), which nobody can seed without NFS-host shell
access and which would leave /v1/audio/voices empty (it gates readiness).
Use the 28 voices bundled in the image at /app/voices instead; /data
keeps reference audio (future cloning) and the HF model cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:27:44 +00:00
Viktor Barzin
318ce9b909 Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:23:40 +00:00
Viktor Barzin
df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00
Viktor Barzin
7a1cc64898 kyverno: re-trigger apply of tts GPU-priority exclusion (87702bdc was [ci skip]'d)
Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit
87702bdc carried [ci skip], so CI never applied the kyverno change that
keeps the tts namespace out of low-GPU-priority injection. This comment-only
commit makes CI apply the already-committed change — step 1 of the
kyverno -> tts -> tripit apply order.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:29 +00:00
Viktor Barzin
50eff3ca39 tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped
dark on 2026-06-08 because these three env vars were never set, so prod ran
the fake test-fixture providers — the only sight users ever saw was the
placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia
GeoSearch, story material to the five real web sources, and script-writing
to claude-agent-service (token already present in tripit-secrets).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:22:10 +00:00
Viktor Barzin
5486b9d438 tripit: wire calendar-conflict column to Nextcloud CalDAV (#19)
CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 18:13:01 +00:00
Viktor Barzin
e2788d1b2d workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip]
Viktor's agent-rules cleanup: the org claudeMd now carries only
governance red-lines (RBAC tiers, per-user secrets, Terraform-only,
git audit-trail rules, code-layout detection) and points to
~/.claude/rules/execution.md for the worktree lifecycle, which was
previously duplicated here in full. Settings precedence and the
model key are unchanged. Also refreshes a .gitignore comment that
cited the old execution.md section numbering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:02:43 +00:00
Viktor Barzin
c3a63fcd38 apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:00:08 +00:00
Viktor Barzin
2e0cebff87 docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip]
Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories:

- compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified)
- storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit 484b4c71) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox
- proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 17:50:43 +00:00
Viktor Barzin
81e01ec1c4 tripit: label namespace as chrome-service CDP client
The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:42:53 +00:00
Viktor Barzin
980ec55418 tripit: enable live flight-fare scrape via shared chrome-service CDP
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:23:53 +00:00
Viktor Barzin
9b19caff47 t3: connection logging across the path for drop attribution
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00
Viktor Barzin
933e4649fb Merge remote-tracking branch 'forgejo/master' into wizard/authentik-signin-speed
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
2026-06-11 00:35:56 +00:00
Viktor Barzin
b3ef0dba76 authentik: ignore Keel-managed image_pull_policy on pgbouncer
Keel flip-flops the pgbouncer container's imagePullPolicy, so the
declared Always kept re-diffing on every plan. Ignore it like the
image tag (KEEL_IGNORE pattern) — plan-to-zero restored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:34:44 +00:00
Viktor Barzin
4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00
Viktor Barzin
bd60c3d5e0 pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to the pve-host Loki shipper (aac807fb). The host reached Loki via an
/etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution
(no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan
in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record
auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium
API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host;
promtail now resolves the name purely via DNS (verified still shipping to Loki).
insecure_skip_verify stays — the internal .lan cert isn't publicly trusted.

Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin
references. The DNS record is API-managed (the viktorbarzin.lan zone convention),
not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync
remains a noted follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 22:55:20 +00:00
Viktor Barzin
97ccdbecb8 authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:58:10 +00:00
Viktor Barzin
93ba67c84a devvm: install prometheus-node-exporter (was never installed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop
attribution work, but the box had no node_exporter at all — installed
via apt and persisted here so reprovisioning keeps it.
2026-06-10 21:29:17 +00:00
Viktor Barzin
046a4a32f3 Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 21:26:10 +00:00