Commit graph

4221 commits

Author SHA1 Message Date
Viktor Barzin
f3cb5661a6 Merge forgejo/master into wizard/errorpages-buffer
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 20:31:22 +00:00
Viktor Barzin
aa1fccb883 traefik/error-pages: READ_BUFFER_SIZE 5KB -> 128KB — 431s for cookie-heavy users
Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a
routed host during a brief 503 window, and sees it periodically across
ingresses: Authentik forward-auth accumulates one authentik_proxy_*
cookie per protected service on .viktorbarzin.me, so established
browsers carry multi-10KB Cookie headers — over error-pages' 5120-byte
default read buffer, which doubles as its max header size. Any error-
middleware dispatch then 431'd instead of rendering the styled page.
Same root cause class as the 2026-06-01 large_client_header_buffers
fixes on bot-block-proxy and auth-proxy-config; error-pages was the
remaining small-buffer backend on the shared chain.
2026-06-12 20:31:01 +00:00
Viktor Barzin
523e18c127 kyverno: sync-ghcr-credentials to private-ghcr namespaces; tripit consumes the clone
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12)
without waiting on a UI-minted token: GitHub has no token-mint API, so
the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token —
swap the alias value when a scoped token is ever minted) becomes the
platform credential. Because the PAT is broad, the new ClusterPolicy
clones ghcr-credentials ONLY to an explicit allowlist of namespaces
running private ghcr images (tripit, f1-stream, job-hunter,
instagram-poster, payslip-ingest, wealthfolio, fire-planner,
recruiter-responder) — NOT cluster-wide like registry-credentials.
generateExisting+synchronize so existing namespaces get the clone.
tripit's hand-declared ns-scoped secret is removed in favour of the
clone (imagePullSecrets now reference the name literally).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:28:11 +00:00
Viktor Barzin
12fd1fcbc9 android-emulator: api36-v7 — noVNC defaults: scaled view, autoconnect, reconnect
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint
now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect,
reconnect with 2s delay, shared) so every load behaves right without URL
params, and viewers self-heal across pod restarts/wakes. Already applied
live to the running pod; this makes it survive the next wake.
2026-06-12 20:18:26 +00:00
Viktor Barzin
ff08c685cd tts: image is TF-owned — drop the copied KEEL ignore so the GHCR switch applies
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The deployment's lifecycle.ignore_changes still ignored the container
image (copied from the keel-managed tripit pattern), which would have
made the previous commit's GHCR switch a silent no-op on apply. Keel
cannot poll the private GHCR repo anyway; the pinned sha tag is
terraform's to manage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:50 +00:00
Viktor Barzin
dbb4572112 tts: pull Chatterbox from GHCR — the Forgejo-registry copy is unpullable
Some checks are pending
ci/woodpecker/push/build-cli Pipeline is pending
ci/woodpecker/push/default Pipeline is pending
Viktor reports the voice still isn't from the TTS service — correct:
zero story_audio rows exist; the pod has sat in ImagePullBackOff since
the first window because the 2026-06-09 Forgejo-registry push has a
corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and identical
digests can't heal corrupt registry storage. The off-infra GHA rebuild
(tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now
lives in private GHCR: switch the image there, pin the upstream-sha tag,
and add the vault-backed ghcr-credentials pull secret (mirrors
stacks/tripit). tripit's drain loop has 27 narrations queued and picks
them up the moment the pod goes Ready.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:19 +00:00
Viktor Barzin
8919835c5d beads-server: track claude-agent-service :latest (was pruned SHA → ImagePullBackOff)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
cluster-health found beads-dispatcher + beads-reaper CronJobs in ImagePullBackOff
for ~7h: they pinned claude-agent-service:2fd7670d, a SHA tag that Forgejo
retention (keeps newest 10) pruned. claude-agent-service itself runs :latest
(KEEL_IGNORE_IMAGE). Point the beads tag at :latest so it tracks the live image
and can't go stale again — the dispatcher/reaper only need bd+curl+jq, which the
image ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:12:24 +00:00
Viktor Barzin
0491fc43f2 android-emulator: README — final measured profile; honest GL story
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Trues the runbook up to reality: guest GL stays software (llvmpipe)
under Xvfb by deliberate choice (NVIDIA headless GL would need a
different streaming architecture), the GPU slice costs ~100MiB VRAM only
while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with
scale-to-zero covering idle.
2026-06-12 20:11:55 +00:00
Viktor Barzin
10a52a2683 gitignore: timestamped terraform.tfstate.*.backup (plaintext Tier-0 secrets) [ci skip]
Viktor's off-infra-builds wave 0 (infra#11): two untracked
terraform.tfstate.<ts>.backup files with live plaintext Tier-0 secrets
were sitting in stacks/infra/ unmatched by the existing *.tfstate.backup
patterns — one stray git add from the public repo. Pattern added;
the on-disk files are deleted separately.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:11:41 +00:00
Viktor Barzin
3802967290 android-emulator: api36-v6 — cap RLIMIT_NOFILE; x11vnc -nolookup
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed
Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB
banner — strace showed it sweeping the container's fd table with one
fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so
each connection effectively never completed. The entrypoint now sets
ulimit -n 65536 for everything it launches (verified live: banner
answers instantly under the capped limit); x11vnc also gets -nolookup
so client reverse-DNS can never stall handshakes.
2026-06-12 20:04:42 +00:00
Viktor Barzin
623d34628a docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip]
Viktor asked to evaluate fully external image builders because in-cluster
CI builds keep destabilising the homelab (Forgejo OOM under registry-push
load, hairpin push timeouts, build IO on the shared sdc HDD, registry PVC
at its 50Gi ceiling). The evaluation was grilled to a decision set:

- every owned image builds on GitHub Actions and lives on ghcr.io
  (extends the 2026-06-09 tripit pilot to the whole fleet)
- per-repo visibility: 9 public mirrors + images (gated on a clean
  gitleaks/PII history scan), the personal/finance/gray ones stay private
- clean cut: no in-cluster fallback build pipelines; existing
  build-fallback.yml files are deleted
- Woodpecker becomes deploy-only; Forgejo registry freezes to one
  last-known-good tag per Service after a manual cleanup pass
- dead builders (terminal-lobby, webhook-handler, hmrc-sync, trading-bot,
  travel-agent, trip-planner) are decommissioned, not migrated;
  travel_blog is decommissioned outright; manual images (x402-gateway,
  chrome-service-novnc, chatterbox-tts, android-emulator) get formalized
  GHA builds; infra-ci + CLI builds move to GHA on the public infra repo

CONTEXT.md: updated 'GHA build + Woodpecker deploy', added 'Canonical
repo', 'GitHub mirror', 'Forgejo registry' terms, image-path relationship,
and a 'registry' ambiguity entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:55:47 +00:00
Viktor Barzin
3978eec53a Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 19:45:06 +00:00
Viktor Barzin
b2bd859a8e android-emulator: NVIDIA_DRIVER_CAPABILITIES=all — graphics libs for -gpu host
First GPU boot verified qemu attached to the T4, but the guest GL
translator reported llvmpipe: the GPU operator injects only
compute,utility by default, so the NVIDIA EGL/GL vendor libraries were
absent and gfxstream silently fell back to software GL. The graphics
capability completes the hardware rendering path.
2026-06-12 19:43:25 +00:00
Viktor Barzin
0216e993dc etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:

1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
   (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
   writes + a pod-creation admission webhook, purely to feed a dashboard. krr
   (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.

2. Disable kyverno reporting (admission/aggregate/background). policyReports were
   already off, so the pipeline generated ephemeralreports + an hourly
   all-resource etcd re-scan for NO user-facing output. Admission enforcement
   (deny-* policies) and Keel mutation are unaffected; violations surface via
   Loki->Slack.

3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).

Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.

Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 19:41:22 +00:00
Viktor Barzin
16adda2c48 android-emulator: gate reaches the kube API via env vars, not DNS
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real wake attempt 500'd: kubernetes.default.svc does not resolve
from the gate's alpine pod (musl + injected dns_config ndots quirk), so
every kube call failed with 'Name does not resolve'. Use the injected
KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster
endpoint, no DNS dependency. ConfigMap checksum annotation rolls the
gate automatically.
2026-06-12 19:32:34 +00:00
Viktor Barzin
b1b9de90e4 tripit: tripit-api ingress joins the dedicated 100/1000 rate-limit
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to eef4dc7f: the Android Shell's dedicated bearer-auth host
(tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was
still on the default 10/50 limiter — the shell's photo grid would have
hit the identical 429 wall Viktor just reported on the PWA host.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:18:40 +00:00
Viktor Barzin
eef4dc7f63 tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich
thumbnail proxies through tripit's /api, so a few-hundred-photo trip is
that many parallel GETs from one IP — far past the shared Traefik
limiter's average 10 / burst 50. Fourth instance of the parallel-asset
pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated
tripit-rate-limit middleware (average 100, burst 1000) +
skip_default_rate_limit on the main tripit ingress only. The token-gated
calendar/email/slack carve-outs keep the strict default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:15:56 +00:00
Viktor Barzin
e8a4eb0f05 tripit: satisfy the auth-comment lint on the tripit-api ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The previous commit (c5631cff) failed CI's ingress_factory guard: the
'# auth = "none": <why>' justification must sit directly above the auth
line inside the module, not above the module block. Same content, moved
to where the lint looks; no functional change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:53:02 +00:00
Viktor Barzin
c5631cff74 tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell
cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017
it logs in with OIDC Code+PKCE and calls the API with bearer JWTs:

- authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK
  holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback,
  RS256, 1h access / 90d refresh (offline_access mapping attached so refresh
  tokens are issued), plus the TripIt App application.
- main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit
  Service, no forward-auth (backend validates the JWTs itself once tripit
  AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the
  existing traefik strip-auth-headers middleware so the header fallback can
  never be spoofed through this host.

Closes nothing here; tracked as viktor/tripit#49.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:47:46 +00:00
Viktor Barzin
b985686661 android-emulator: non-merge apply trigger (GPU + wake gate)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 07:53:38 +00:00
Viktor Barzin
18ccd57b63 Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 07:53:12 +00:00
Viktor Barzin
f4dd515fd7 android-emulator: GPU rendering on node1 + scale-to-zero wake gate
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
2026-06-12 07:52:50 +00:00
Viktor Barzin
b598c61c61 android-emulator: scale to 0 — its CPU burn was starving etcd
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 07:31:46 +00:00
Viktor Barzin
39a22b352e tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real window (2026-06-12 02:00): the chatterbox pod sat in
ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported
whole-tree but the chatterbox SUBDIR never existed on the host (the
go-live runbook step needed NFS-host shell nobody doing the apply had).
One-shot busybox Job mounts the export root and mkdir -p's the subtree;
kubelet's mount retry then self-heals the pod. Audio queue (27 items)
drains as soon as the model loads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 02:51:14 +00:00
Viktor Barzin
db63cd7501 android-emulator+traefik: non-merge apply trigger for the rate-limit fix
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
2026-06-12 00:33:10 +00:00
Viktor Barzin
4d844d6fd4 Merge forgejo/master into wizard/emu-ratelimit
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline failed
2026-06-12 00:26:05 +00:00
Viktor Barzin
152dad0a40 android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
2026-06-12 00:25:44 +00:00
Viktor Barzin
d3d37a15ec tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Viktor asked 'can't we make it live? why the cronjob?' — the overnight
window guaranteed VRAM room on the shared T4, but immich/frigate models
idle-unload during the day so the card often has room (measured 10.3 GiB
free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when
tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to
0 when the queue empties (also frees the card early inside the nightly
window). Failed metrics scrape fail-safes to no-scale-up, same as the
window preflight. The guard moves to all-day */5 — live synthesis can
hold the card at any hour, so the yield-on-pressure watchdog must watch
at any hour. tripit exposes the unauthenticated in-cluster queue count;
a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up
stays as the guaranteed nightly catch-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 00:25:35 +00:00
Viktor Barzin
d818f7ed3b android-emulator: README — measured resource profile + remote access + screen-off etiquette
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 00:10:03 +00:00
Viktor Barzin
9af3e8860e Merge origin/master (CI state-sync commits) into wizard/android-emulator-public
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 00:08:14 +00:00
Viktor Barzin
43d2107760 android-emulator: public Authentik-gated ingress for the noVNC screen
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
2026-06-12 00:07:49 +00:00
Viktor Barzin
9a2124f105 tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23)
Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 23:53:49 +00:00
Viktor Barzin
02ed3062f6 android-emulator: non-merge apply trigger for v4 image rollout
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
2026-06-11 23:48:16 +00:00
Viktor Barzin
2f8addc63b Merge forgejo/master into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline failed
2026-06-11 22:53:11 +00:00
Viktor Barzin
577267cd97 android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
2026-06-11 22:52:54 +00:00
Viktor Barzin
fba1659611 tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-redo (tripit#29): the new image is rolled out, so the two
new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch
merged with claude-agent-service proposals, Focus-steered) and the
Wikipedia place resolver (manual sight search + LLM-proposal resolution)
leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:30:24 +00:00
Viktor Barzin
f74e421283 tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:24:29 +00:00
Viktor Barzin
85dbec6108 android-emulator: api36-v3 — avdmanager must run from inside the SDK root
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
2026-06-11 21:15:50 +00:00
Viktor Barzin
5e8a988858 android-emulator: api36-v2 — marker-file install idempotency + retries
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
ci/woodpecker/push/default Pipeline failed
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
2026-06-11 20:59:08 +00:00
Viktor Barzin
3fac45febc android-emulator: drop applied import stanzas; deployment recreates fresh
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
2026-06-11 20:49:37 +00:00
Viktor Barzin
6b7efcd2d6 android-emulator: import the five resources still missing from state
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
2026-06-11 20:44:09 +00:00
Viktor Barzin
b948224008 android-emulator: import orphaned namespace into state (lock-race recovery)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
2026-06-11 20:38:46 +00:00
Viktor Barzin
99c19584f7 android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory)
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
2026-06-11 19:56:09 +00:00
Viktor Barzin
6bf216751b Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
798b025580 tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD;
on a merge commit that is the first-parent diff, which contained only the
concurrently-landed files — stacks/tts never got applied (namespace still
absent) and the kyverno re-trigger push got no pipeline at all. Single
non-merge commit touching both stacks so the detector sees them; the
sorted loop applies kyverno before tts, the order tripit#26 requires.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:08:23 +00:00
Viktor Barzin
a66aeac3b8 Merge remote-tracking branch 'forgejo/master' into wizard/tour-redo-env
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:27:53 +00:00
Viktor Barzin
4a8c4f9a14 tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC
Viktor's tour-guide redo (tripit#26): 87702bdc committed this stack with
[ci skip] so it was never applied — prod tripit has been pointing at a
nonexistent chatterbox-tts service since. This commit triggers the apply
and fixes the voices path: config pointed predefined_voices_path at the
NFS PVC (/data/voices), which nobody can seed without NFS-host shell
access and which would leave /v1/audio/voices empty (it gates readiness).
Use the 28 voices bundled in the image at /app/voices instead; /data
keeps reference audio (future cloning) and the HF model cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:27:44 +00:00
Viktor Barzin
318ce9b909 Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:23:40 +00:00
Viktor Barzin
df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00