infra

Author	SHA1	Message	Date
Viktor Barzin	e5291f97c8	android-emulator: api36-v8 — auto-fit emulator window to the display noVNC scaled correctly but the emulator's Qt window opened small (~411x914) and floated inside the 1080x2280 Xvfb, so the user saw a tiny phone in a sea of black. v8 bakes a background fitter (wmctrl+xdotool) that, after boot, auto-OKs the one-shot nested-virtualization warning dialog, fills the phone window to the display, and parks the control strip off the right edge — re-running to catch window/dialog timing then maintaining every 30s. Applied live to the running pod already; this makes it survive the next wake.	2026-06-12 20:44:29 +00:00
Viktor Barzin	97dcf49b8e	monitoring: reduce Slack alert noise (alert-on-change + daily digest) Reviewed the last 24h of Slack alerts after the midday node-pressure blip: the volume came far less from the outage than from (a) alerts re-pinging every few hours while nothing changed and (b) a pod cascade that fired uninhibited. This hardens the alerting system so recurrences are quiet, rather than just clearing today's broken services. Changes (all in the monitoring module): * Alert-on-change routing. warning/info repeat_interval -> 8760h (notify once, then only on a membership change or resolve); critical 1h -> 6h (a slow nag, not an hourly drip). send_resolved stays on. The bulk of the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired continuously for ~24h, re-notifying every 4h). * Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at 08:00 Europe/London: the full current board grouped by severity + what resolved in the last 24h. This is the standing-state safety net for the alert-on-change model. Stock python:3.12-alpine, pure-stdlib script (no pip/apk at runtime -> none of the per-run disk-write footprint that disabled status-page-pusher). Reuses the existing Alertmanager Slack webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus. * Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff, PodsStuckContainerCreating, ScrapeTargetDown, ReplicasMismatch, ...). The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14 PodImagePullBackOff uninhibited because only NodeDown was a source. * T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst for the same leg — two alerts described one condition and were the #1 noise source (~3,400 alert-minutes over 24h). * ScrapeTargetDown false positives. Scrape only Ready endpoints, so completed CronJob pods that linger in EndpointSlices as NotReady addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready pod with a genuinely broken metrics endpoint still fires. * for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/NfsMirror/Vzdump Failing) and DNS spike detectors, so a single transient Pushgateway/scrape blip no longer fires-and-resolves. Added an Alertmanager scrape target: it carried no prometheus.io/scrape annotation, so notification volume was unmeasurable — now we can verify this change worked (alertmanager_notifications_total et al.). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 20:35:56 +00:00
Viktor Barzin	87a8a393fe	tts: demand gate treats a failed queue probe as no-action, not queue-empty The demand-gate script defaulted an unreadable/unparseable tts-queue response to QUEUED=0, which the scale-down arm reads as 'queue empty'. One transient curl failure at 20:30 UTC today idled chatterbox-tts to 0 the very minute the pod first went Ready, with 27 narrations still queued (tripit kept logging tts_unreachable). Probe failure now exits without touching replicas: scale-up still needs a real count > 0, and scale-down now needs an explicitly parsed 0. Worst case after this change is a stale-up deployment idling until the 06:00 window-down. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:35:02 +00:00
Viktor Barzin	18f524c265	docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip] Same-change doc sync for infra#12: the tripit-ns-scoped interim secret paragraph described the pre-ClusterPolicy state. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:31:55 +00:00
Viktor Barzin	68c7be8653	traefik: non-merge apply trigger (error-pages buffer fix)	2026-06-12 20:31:24 +00:00
Viktor Barzin	f3cb5661a6	Merge forgejo/master into wizard/errorpages-buffer	2026-06-12 20:31:22 +00:00
Viktor Barzin	aa1fccb883	traefik/error-pages: READ_BUFFER_SIZE 5KB -> 128KB — 431s for cookie-heavy users Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a routed host during a brief 503 window, and sees it periodically across ingresses: Authentik forward-auth accumulates one authentik_proxy_* cookie per protected service on .viktorbarzin.me, so established browsers carry multi-10KB Cookie headers — over error-pages' 5120-byte default read buffer, which doubles as its max header size. Any error- middleware dispatch then 431'd instead of rendering the styled page. Same root cause class as the 2026-06-01 large_client_header_buffers fixes on bot-block-proxy and auth-proxy-config; error-pages was the remaining small-buffer backend on the shared chain.	2026-06-12 20:31:01 +00:00
Viktor Barzin	523e18c127	kyverno: sync-ghcr-credentials to private-ghcr namespaces; tripit consumes the clone Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12) without waiting on a UI-minted token: GitHub has no token-mint API, so the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token — swap the alias value when a scoped token is ever minted) becomes the platform credential. Because the PAT is broad, the new ClusterPolicy clones ghcr-credentials ONLY to an explicit allowlist of namespaces running private ghcr images (tripit, f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio, fire-planner, recruiter-responder) — NOT cluster-wide like registry-credentials. generateExisting+synchronize so existing namespaces get the clone. tripit's hand-declared ns-scoped secret is removed in favour of the clone (imagePullSecrets now reference the name literally). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:28:11 +00:00
Viktor Barzin	12fd1fcbc9	android-emulator: api36-v7 — noVNC defaults: scaled view, autoconnect, reconnect Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect, reconnect with 2s delay, shared) so every load behaves right without URL params, and viewers self-heal across pod restarts/wakes. Already applied live to the running pod; this makes it survive the next wake.	2026-06-12 20:18:26 +00:00
Viktor Barzin	ff08c685cd	tts: image is TF-owned — drop the copied KEEL ignore so the GHCR switch applies The deployment's lifecycle.ignore_changes still ignored the container image (copied from the keel-managed tripit pattern), which would have made the previous commit's GHCR switch a silent no-op on apply. Keel cannot poll the private GHCR repo anyway; the pinned sha tag is terraform's to manage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:13:50 +00:00
Viktor Barzin	dbb4572112	tts: pull Chatterbox from GHCR — the Forgejo-registry copy is unpullable Viktor reports the voice still isn't from the TTS service — correct: zero story_audio rows exist; the pod has sat in ImagePullBackOff since the first window because the 2026-06-09 Forgejo-registry push has a corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and identical digests can't heal corrupt registry storage. The off-infra GHA rebuild (tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now lives in private GHCR: switch the image there, pin the upstream-sha tag, and add the vault-backed ghcr-credentials pull secret (mirrors stacks/tripit). tripit's drain loop has 27 narrations queued and picks them up the moment the pod goes Ready. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:13:19 +00:00
Viktor Barzin	8919835c5d	beads-server: track claude-agent-service :latest (was pruned SHA → ImagePullBackOff) cluster-health found beads-dispatcher + beads-reaper CronJobs in ImagePullBackOff for ~7h: they pinned claude-agent-service:2fd7670d, a SHA tag that Forgejo retention (keeps newest 10) pruned. claude-agent-service itself runs :latest (KEEL_IGNORE_IMAGE). Point the beads tag at :latest so it tracks the live image and can't go stale again — the dispatcher/reaper only need bd+curl+jq, which the image ships. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 20:12:24 +00:00
Viktor Barzin	0491fc43f2	android-emulator: README — final measured profile; honest GL story Trues the runbook up to reality: guest GL stays software (llvmpipe) under Xvfb by deliberate choice (NVIDIA headless GL would need a different streaming architecture), the GPU slice costs ~100MiB VRAM only while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with scale-to-zero covering idle.	2026-06-12 20:11:55 +00:00
Viktor Barzin	10a52a2683	gitignore: timestamped terraform.tfstate.<ts>.backup (plaintext Tier-0 secrets) [ci skip] Viktor's off-infra-builds wave 0 (infra#11): two untracked terraform.tfstate.<ts>.backup files with live plaintext Tier-0 secrets were sitting in stacks/infra/ unmatched by the existing .tfstate.backup patterns — one stray git add from the public repo. Pattern added; the on-disk files are deleted separately. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:11:41 +00:00
Viktor Barzin	3802967290	android-emulator: api36-v6 — cap RLIMIT_NOFILE; x11vnc -nolookup Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB banner — strace showed it sweeping the container's fd table with one fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so each connection effectively never completed. The entrypoint now sets ulimit -n 65536 for everything it launches (verified live: banner answers instantly under the capped limit); x11vnc also gets -nolookup so client reverse-DNS can never stall handshakes.	2026-06-12 20:04:42 +00:00
Viktor Barzin	623d34628a	docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip] Viktor asked to evaluate fully external image builders because in-cluster CI builds keep destabilising the homelab (Forgejo OOM under registry-push load, hairpin push timeouts, build IO on the shared sdc HDD, registry PVC at its 50Gi ceiling). The evaluation was grilled to a decision set: - every owned image builds on GitHub Actions and lives on ghcr.io (extends the 2026-06-09 tripit pilot to the whole fleet) - per-repo visibility: 9 public mirrors + images (gated on a clean gitleaks/PII history scan), the personal/finance/gray ones stay private - clean cut: no in-cluster fallback build pipelines; existing build-fallback.yml files are deleted - Woodpecker becomes deploy-only; Forgejo registry freezes to one last-known-good tag per Service after a manual cleanup pass - dead builders (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned, not migrated; travel_blog is decommissioned outright; manual images (x402-gateway, chrome-service-novnc, chatterbox-tts, android-emulator) get formalized GHA builds; infra-ci + CLI builds move to GHA on the public infra repo CONTEXT.md: updated 'GHA build + Woodpecker deploy', added 'Canonical repo', 'GitHub mirror', 'Forgejo registry' terms, image-path relationship, and a 'registry' ambiguity entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:55:47 +00:00
Viktor Barzin	3978eec53a	Merge forgejo/master into wizard/emu-gpu	2026-06-12 19:45:06 +00:00
Viktor Barzin	b2bd859a8e	android-emulator: NVIDIA_DRIVER_CAPABILITIES=all — graphics libs for -gpu host First GPU boot verified qemu attached to the T4, but the guest GL translator reported llvmpipe: the GPU operator injects only compute,utility by default, so the NVIDIA EGL/GL vendor libraries were absent and gfxstream silently fell back to software GL. The graphics capability completes the hardware rendering path.	2026-06-12 19:43:25 +00:00
Viktor Barzin	0216e993dc	etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes removable. These are the big, clean cuts: 1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender writes + a pod-creation admission webhook, purely to feed a dashboard. krr (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431. 2. Disable kyverno reporting (admission/aggregate/background). policyReports were already off, so the pipeline generated ephemeralreports + an hourly all-resource etcd re-scan for NO user-facing output. Admission enforcement (deny-* policies) and Keel mutation are unaffected; violations surface via Loki->Slack. 3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent). Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a ~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a mutate-existing policy and its churn is apply-time not steady-state. Both filed as follow-up beads. Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs. Then measure etcd apply-latency and revert the timeouts. Docs updated (VPA/Goldilocks -> krr). See memory 5402-5407. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 19:41:22 +00:00
Viktor Barzin	16adda2c48	android-emulator: gate reaches the kube API via env vars, not DNS First real wake attempt 500'd: kubernetes.default.svc does not resolve from the gate's alpine pod (musl + injected dns_config ndots quirk), so every kube call failed with 'Name does not resolve'. Use the injected KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster endpoint, no DNS dependency. ConfigMap checksum annotation rolls the gate automatically.	2026-06-12 19:32:34 +00:00
Viktor Barzin	b1b9de90e4	tripit: tripit-api ingress joins the dedicated 100/1000 rate-limit Follow-up to `eef4dc7f`: the Android Shell's dedicated bearer-auth host (tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was still on the default 10/50 limiter — the shell's photo grid would have hit the identical 429 wall Viktor just reported on the PWA host. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:18:40 +00:00
Viktor Barzin	eef4dc7f63	tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50 Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich thumbnail proxies through tripit's /api, so a few-hundred-photo trip is that many parallel GETs from one IP — far past the shared Traefik limiter's average 10 / burst 50. Fourth instance of the parallel-asset pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated tripit-rate-limit middleware (average 100, burst 1000) + skip_default_rate_limit on the main tripit ingress only. The token-gated calendar/email/slack carve-outs keep the strict default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:15:56 +00:00
Viktor Barzin	e8a4eb0f05	tripit: satisfy the auth-comment lint on the tripit-api ingress The previous commit (`c5631cff`) failed CI's ingress_factory guard: the '# auth = "none": <why>' justification must sit directly above the auth line inside the module, not above the module block. Same content, moved to where the lint looks; no functional change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 08:53:02 +00:00
Viktor Barzin	c5631cff74	tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017 it logs in with OIDC Code+PKCE and calls the API with bearer JWTs: - authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback, RS256, 1h access / 90d refresh (offline_access mapping attached so refresh tokens are issued), plus the TripIt App application. - main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit Service, no forward-auth (backend validates the JWTs itself once tripit AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the existing traefik strip-auth-headers middleware so the header fallback can never be spoofed through this host. Closes nothing here; tracked as viktor/tripit#49. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 08:47:46 +00:00
Viktor Barzin	b985686661	android-emulator: non-merge apply trigger (GPU + wake gate)	2026-06-12 07:53:38 +00:00
Viktor Barzin	18ccd57b63	Merge forgejo/master into wizard/emu-gpu	2026-06-12 07:53:12 +00:00
Viktor Barzin	f4dd515fd7	android-emulator: GPU rendering on node1 + scale-to-zero wake gate Viktor's direction (2026-06-12): the emulator is dev-only, so it should be on-demand, and it should use the T4 where applicable. (1) api36-v5 runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs; automatic swiftshader fallback if GPU init dies) — screen-on rendering moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib python, owns / on both hostnames) scales the deployment 0→1 on visit and hands the browser to noVNC when ready; agents GET /wake + /status. The idle-sleeper CronJob counts established adb/noVNC connections via /proc/net/tcp (excluding the in-container loopback adb client) and scales to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost (~0.5-1GiB) is held only while awake, protecting llama-swap headroom.	2026-06-12 07:52:50 +00:00
Viktor Barzin	b598c61c61	android-emulator: scale to 0 — its CPU burn was starving etcd The cluster-health check found the control plane flapping: kube-scheduler and kube-controller-manager were crashlooping (220+ restarts) on lost leader-election leases, with "etcdserver: request timed out" in the logs. Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU) CPU burn on node3, together with frigate on node1, saturated the single Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM — so etcd timed out and the leader-election controllers died and restarted in a loop. The emulator is a shared test instance, not a 24/7 service, so scaling it to 0 is the right relief: spin it back to replicas=1 on-demand for a testing session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load 64->51, control-plane restarts frozen. Durable structural fix (etcd/critical VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 07:31:46 +00:00
Viktor Barzin	39a22b352e	tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever First real window (2026-06-12 02:00): the chatterbox pod sat in ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported whole-tree but the chatterbox SUBDIR never existed on the host (the go-live runbook step needed NFS-host shell nobody doing the apply had). One-shot busybox Job mounts the export root and mkdir -p's the subtree; kubelet's mount retry then self-heals the pod. Audio queue (27 items) drains as soon as the model loads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 02:51:14 +00:00
Viktor Barzin	db63cd7501	android-emulator+traefik: non-merge apply trigger for the rate-limit fix Pipeline 102 applied nothing — the rate-limit commit entered master under a merge head and the changed-stack detector is blind to merge diffs. Plain commit touching both stacks so they apply.	2026-06-12 00:33:10 +00:00
Viktor Barzin	4d844d6fd4	Merge forgejo/master into wizard/emu-ratelimit	2026-06-12 00:26:05 +00:00
Viktor Barzin	152dad0a40	android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is unbundled and fetches ~60 ES modules in parallel on page open; the shared Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's loader waits on the missing modules indefinitely (reproduced: 38x429 in a 90-request burst through the ingress). Adds a dedicated 50/300 android-emulator-rate-limit middleware (actualbudget/immich pattern) and opts both emulator ingresses out of the shared limiter.	2026-06-12 00:25:44 +00:00
Viktor Barzin	d3d37a15ec	tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard Viktor asked 'can't we make it live? why the cronjob?' — the overnight window guaranteed VRAM room on the shared T4, but immich/frigate models idle-unload during the day so the card often has room (measured 10.3 GiB free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to 0 when the queue empties (also frees the card early inside the nightly window). Failed metrics scrape fail-safes to no-scale-up, same as the window preflight. The guard moves to all-day */5 — live synthesis can hold the card at any hour, so the yield-on-pressure watchdog must watch at any hour. tripit exposes the unauthenticated in-cluster queue count; a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up stays as the guaranteed nightly catch-up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 00:25:35 +00:00
Viktor Barzin	d818f7ed3b	android-emulator: README — measured resource profile + remote access + screen-off etiquette	2026-06-12 00:10:03 +00:00
Viktor Barzin	9af3e8860e	Merge origin/master (CI state-sync commits) into wizard/android-emulator-public	2026-06-12 00:08:14 +00:00
Viktor Barzin	43d2107760	android-emulator: public Authentik-gated ingress for the noVNC screen Viktor wants the emulator screen reachable over the web: adds android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik forward-auth — same-origin WebSockets through forward-auth are proven by the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains LAN-only since it is unauthenticated.	2026-06-12 00:07:49 +00:00
Viktor Barzin	9a2124f105	tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23) Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 23:53:49 +00:00
Viktor Barzin	02ed3062f6	android-emulator: non-merge apply trigger for v4 image rollout Pipeline 96 applied only tripit: the v4 bump (`577267cd`) entered master inside a merge whose first-parent diff hid stacks/android-emulator from the stack detector — same failure mode as the tts `798b0255` trigger. This plain commit touches the stack so the detector picks it up.	2026-06-11 23:48:16 +00:00
Viktor Barzin	2f8addc63b	Merge forgejo/master into wizard/android-emulator	2026-06-11 22:53:11 +00:00
Viktor Barzin	577267cd97	android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP Two final fixes from the live debugging session: (1) sdkmanager-latest emulator 36.6.11 hangs before executing a single guest instruction in this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off) while 36.1.9 boots Android in ~107s — the entrypoint now pins build 13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555, so socat's wildcard bind died with EADDRINUSE and its exit restarted the pod right after a successful boot — socat now binds the pod IP only.	2026-06-11 22:52:54 +00:00
Viktor Barzin	fba1659611	tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live) Viktor's tour-redo (tripit#29): the new image is rolled out, so the two new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch merged with claude-agent-service proposals, Focus-steered) and the Wikipedia place resolver (manual sight search + LLM-proposal resolution) leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:30:24 +00:00
Viktor Barzin	f74e421283	tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London) Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only now — the fill-tour-audio worker synthesizes the queued (story, telling, voice) audio while the tts stack's off-peak window (02:00-06:00) has Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model load, 04:30 insurance against a skipped window or guard yield. Daytime runs record tts_unreachable and exit quietly by design. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:24:29 +00:00
Viktor Barzin	85dbec6108	android-emulator: api36-v3 — avdmanager must run from inside the SDK root v2's marker fix proved the install completes, but avdmanager still saw no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root), deriving the SDK root from its own toolsdir — /opt/android in our image, while packages live on the PVC at /sdk. v3 seeds cmdline-tools into /sdk/cmdline-tools/latest once and runs avdmanager from there, so it resolves the PVC as the SDK root.	2026-06-11 21:15:50 +00:00
Viktor Barzin	5e8a988858	android-emulator: api36-v2 — marker-file install idempotency + retries First boot crashed mid-SDK-install, and the dir-existence check then skipped reinstall forever: avdmanager saw the partial tree and died with 'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks install completion with a marker file written only after sdkmanager succeeds + package.xml exists, wipes partial system-image trees before reinstalling, and retries sdkmanager 3x.	2026-06-11 20:59:08 +00:00
Viktor Barzin	3fac45febc	android-emulator: drop applied import stanzas; deployment recreates fresh The five imports from the last recovery commit are in state now (verified serial 4: everything except the deployment). The deployment kept falling out of state between runs, so instead of a third import round the broken 0-replica deployment object was deleted live (transient recovery step, presence-claimed) and this apply recreates it Terraform-owned with the quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors on importing already-managed addresses.	2026-06-11 20:49:37 +00:00
Viktor Barzin	6b7efcd2d6	android-emulator: import the five resources still missing from state Pipeline 88 imported the namespace but its refresh dropped the PVC, both services, the ingress and the tls secret from state (PG-backend state races on this new stack's first applies), so the apply again died on 'already exists' conflicts. State now holds namespace+deployment; adopt the missing five with import blocks (TF 1.5 errors on importing already-managed addresses, so only the missing set is listed). Stanzas come out once applied.	2026-06-11 20:44:09 +00:00
Viktor Barzin	b948224008	android-emulator: import orphaned namespace into state (lock-race recovery) Pipeline 85 created the namespace but a Terraform pg-backend workspace-creation lock race (new stack schema initializing while other stacks applied concurrently) left it out of the recorded state — every later apply then died with 'namespaces android-emulator already exists'. Adopt it with an import block per the house recovery pattern; stanza gets removed once it has applied.	2026-06-11 20:38:46 +00:00
Viktor Barzin	99c19584f7	android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory) First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi, limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like tiers 3/4 do, instead of opting the namespace out via custom-quota.	2026-06-11 19:56:09 +00:00
Viktor Barzin	6bf216751b	Merge forgejo/master (tts stack) into wizard/android-emulator # Conflicts: # stacks/tripit/main.tf	2026-06-11 19:53:07 +00:00
Viktor Barzin	8b7c77c794	android-emulator: new stack — shared in-cluster Android 16 testing instance Viktor is setting up an Android app development pipeline (tripit is the first app) and wants agents to natively test changes on Android before shipping. This adds the testing environment: an API-36 Google emulator under KVM as a privileged pod (namespace joins the Kyverno exclude list), SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP 10.0.20.200:5555 (LAN only), noVNC screen view at android-emulator.viktorbarzin.lan. Image is built manually from the stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo rejected).	2026-06-11 19:51:57 +00:00
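
The tts demand-gate commits in this log (d3d37a15ec, then the fix in 87a8a393fe) describe a three-way decision: scale up only on a real queue count above zero with free VRAM at or above the floor, scale down only on an explicitly parsed 0, and touch nothing when a probe fails. A minimal Python sketch of that rule is below. It is not the script from the repository: the function names, the plain-integer response format, and the 6.0 GiB floor used in the usage note are assumptions made for illustration.

```python
from typing import Optional


def parse_queue_count(raw: Optional[str]) -> Optional[int]:
    """Return the queued count, or None when the probe response is unusable.

    Assumption: the queue endpoint answers with a bare non-negative integer.
    """
    if raw is None:
        return None  # probe failed outright (e.g. curl error)
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # empty or unparseable body: NOT the same as "0"
    return int(text)


def decide(raw_queue: Optional[str],
           free_vram_gib: Optional[float],
           floor_gib: float) -> str:
    """Return "up", "down", or "none" for one demand-gate tick."""
    queued = parse_queue_count(raw_queue)
    if queued is None:
        return "none"  # failed probe: leave replicas untouched
    if queued == 0:
        return "down"  # only an explicitly parsed 0 idles the deployment
    # queued > 0: scale up only when the VRAM reading exists and clears the floor
    if free_vram_gib is not None and free_vram_gib >= floor_gib:
        return "up"
    return "none"  # failed or low VRAM reading fail-safes to no-scale-up
```

With a hypothetical 6.0 GiB floor, `decide("27", 10.3, 6.0)` yields `"up"`, `decide("0", None, 6.0)` yields `"down"`, and `decide(None, 10.3, 6.0)` yields `"none"`, which is the case the 87a8a393fe fix is about.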


4226 commits
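
Commit f4dd515fd7 says the idle-sleeper counts established adb/noVNC connections via /proc/net/tcp, excludes the in-container loopback adb client, and scales to zero after 4 idle checks. The sketch below shows one way such a count could work in stdlib Python. It is an illustration, not the CronJob's code: the function names and the noVNC port 6080 in the usage note are assumptions, and the loopback test assumes the little-endian hex address layout that /proc/net/tcp uses on x86 hosts.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def count_external_established(proc_net_tcp: str, ports: set[int]) -> int:
    """Count ESTABLISHED IPv4 connections on the given local ports,
    skipping peers in 127.0.0.0/8 (e.g. a loopback adb client)."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if state != ESTABLISHED:
            continue
        local_port = int(local.rsplit(":", 1)[1], 16)
        if local_port not in ports:
            continue
        remote_ip_hex = remote.rsplit(":", 1)[0]
        # Little-endian layout: 127.0.0.1 is printed as 0100007F,
        # so any 127.x.x.x peer ends in "7F".
        if remote_ip_hex.upper().endswith("7F"):
            continue
        count += 1
    return count


def update_idle_streak(streak: int, active: int, limit: int = 4) -> tuple[int, bool]:
    """Advance the consecutive-idle counter; the flag says when to scale to zero."""
    streak = 0 if active > 0 else streak + 1
    return streak, streak >= limit
```

A caller could read `/proc/net/tcp` as text, pass it with `{5555, 6080}` (adb plus an assumed noVNC port), and feed the result into `update_idle_streak` on each check; the fourth consecutive idle check returns `True`.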