infra

Author	SHA1	Message	Date
Viktor Barzin	d02ca4f2db	ci-pipeline-health: daily sweep of the off-infra CI chain (ADR-0002) Viktor asked to monitor the pipelines closely as builds move off-infra (PRD infra#10). New aux stack: daily 07:30 UTC CronJob on the claude-agent-service image running a deterministic shell sweep — GitHub Actions failures/stuck runs across owned repos, Woodpecker pipeline failures, GHA free-tier minutes burn. Healthy = one quiet Slack line; issues = Slack alert + comment on infra#10. In-cluster (not a cloud routine) because Vault + the Woodpecker token are LAN-only. Secrets via ExternalSecret (github_pat deliberately, not the ghcr_pull_token alias — a scoped packages-only rotation couldn't read Actions runs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:45:28 +00:00
Viktor Barzin	98f1f7fc24	tts: seed extension-less voice copies so tripit's bare stems resolve First live drain failed all 27 queued narrations with 404 'Voice file 'Emily' not found': tripit's catalog sends bare stems (Emily) but the devnen server resolves the voice as a literal filename (Emily.wav) in predefined_voices_path then reference_audio — no stem fallback exists upstream (HEAD == our pinned sha), and symlinks can't bridge it because safe_resolve_within() resolves them out of the containment check. New initContainer on the chatterbox deployment copies the 28 bundled voices to /data/reference_audio/<stem> on the PVC (second lookup path). Same image as the main container so no extra pull; idempotent; ~15 MB. Verified live before committing: an extension-less copy synthesizes 200 audio/mp3 (5.3s warm) where voice=Emily 404'd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:41:51 +00:00
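Note on 98f1f7fc24: the initContainer's copy step could look roughly like the sketch below. This is an illustration in Python, not the commit's actual script; the function name `seed_stem_copies` and the size-based skip rule are assumptions. It makes real copies rather than symlinks, since the commit notes that symlinks are resolved out of the containment check.

```python
import shutil
from pathlib import Path
from typing import List


def seed_stem_copies(src_dir: Path, dst_dir: Path) -> List[str]:
    """Copy every <name>.wav in src_dir to dst_dir/<name> (no extension).

    Returns the stems that were copied on this run. Hypothetical helper:
    the names and the skip rule are illustrative, not taken from the repo.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(src_dir.glob("*.wav")):
        target = dst_dir / wav.stem
        # Idempotency: skip a copy that already exists with the same size.
        if target.is_file() and target.stat().st_size == wav.stat().st_size:
            continue
        shutil.copyfile(wav, target)
        copied.append(wav.stem)
    return copied
```

A second run over an unchanged source directory returns an empty list, which is the idempotent behaviour the commit message describes.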
Viktor Barzin	bb0f9f59ef	docs: CI-compute doctrine — leverage external infra for builds AND tests [ci skip] Viktor's standing instruction (2026-06-12): lean on external infra as much as possible for CI — builds, running tests, lint, releases all on GitHub Actions hosted runners, never on cluster nodes; in-cluster pipelines only for cluster-touching steps (deploys, terragrunt, certbot). Also: watch any triggered pipeline chain to completion and fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md CI sections (ADR-0002 companions). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:39:27 +00:00
Viktor Barzin	97dcf49b8e	monitoring: reduce Slack alert noise (alert-on-change + daily digest) Reviewed the last 24h of Slack alerts after the midday node-pressure blip: the volume came far less from the outage than from (a) alerts re-pinging every few hours while nothing changed and (b) a pod cascade that fired uninhibited. This hardens the alerting system so recurrences are quiet, rather than just clearing today's broken services. Changes (all in the monitoring module): * Alert-on-change routing. warning/info repeat_interval -> 8760h (notify once, then only on a membership change or resolve); critical 1h -> 6h (a slow nag, not an hourly drip). send_resolved stays on. The bulk of the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired continuously for ~24h, re-notifying every 4h). * Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at 08:00 Europe/London: the full current board grouped by severity + what resolved in the last 24h. This is the standing-state safety net for the alert-on-change model. Stock python:3.12-alpine, pure-stdlib script (no pip/apk at runtime -> none of the per-run disk-write footprint that disabled status-page-pusher). Reuses the existing Alertmanager Slack webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus. * Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff, PodsStuckContainerCreating, ScrapeTargetDown, ReplicasMismatch, ...). The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14 PodImagePullBackOff uninhibited because only NodeDown was a source. T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst for the same leg — two alerts described one condition and were the #1 noise source (~3,400 alert-minutes over 24h). * ScrapeTargetDown false positives. Scrape only Ready endpoints, so completed CronJob pods that linger in EndpointSlices as NotReady addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready pod with a genuinely broken metrics endpoint still fires. * for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/NfsMirror/Vzdump Failing) and DNS spike detectors, so a single transient Pushgateway/scrape blip no longer fires-and-resolves. Added an Alertmanager scrape target: it carried no prometheus.io/scrape annotation, so notification volume was unmeasurable — now we can verify this change worked (alertmanager_notifications_total et al.). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 20:35:56 +00:00
Viktor Barzin	87a8a393fe	tts: demand gate treats a failed queue probe as no-action, not queue-empty The demand-gate script defaulted an unreadable/unparseable tts-queue response to QUEUED=0, which the scale-down arm reads as 'queue empty'. One transient curl failure at 20:30 UTC today idled chatterbox-tts to 0 the very minute the pod first went Ready, with 27 narrations still queued (tripit kept logging tts_unreachable). Probe failure now exits without touching replicas: scale-up still needs a real count > 0, and scale-down now needs an explicitly parsed 0. Worst case after this change is a stale-up deployment idling until the 06:00 window-down. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:35:02 +00:00
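Note on 87a8a393fe: the three-way decision this commit describes can be sketched as below. The real gate is a shell script; this Python version is only an illustration, and the names `parse_queue_count`, `decide` and the `vram_ok` flag (standing in for the free-VRAM floor from d3d37a15ec) are assumptions.

```python
from typing import Optional


def parse_queue_count(raw: Optional[str]) -> Optional[int]:
    """Return the queued count, or None if the probe failed or is unparseable."""
    if raw is None:
        return None
    try:
        count = int(raw.strip())
    except ValueError:
        return None
    # A negative count is treated as a bad probe, never as "empty".
    return count if count >= 0 else None


def decide(raw: Optional[str], vram_ok: bool) -> str:
    """Map a probe result to an action (hypothetical action names)."""
    queued = parse_queue_count(raw)
    if queued is None:
        return "no-action"   # probe failure: leave replicas untouched
    if queued > 0:
        return "scale-up" if vram_ok else "no-action"
    return "scale-down"      # only an explicitly parsed 0 gets here
```

The point of the shape is that "unknown" is a distinct third state: an empty or garbled response never collapses into 0, so it can never trigger the scale-down arm.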
Viktor Barzin	18f524c265	docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip] Same-change doc sync for infra#12: the tripit-ns-scoped interim secret paragraph described the pre-ClusterPolicy state. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:31:55 +00:00
Viktor Barzin	68c7be8653	traefik: non-merge apply trigger (error-pages buffer fix)	2026-06-12 20:31:24 +00:00
Viktor Barzin	f3cb5661a6	Merge forgejo/master into wizard/errorpages-buffer	2026-06-12 20:31:22 +00:00
Viktor Barzin	aa1fccb883	traefik/error-pages: READ_BUFFER_SIZE 5KB -> 128KB — 431s for cookie-heavy users Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a routed host during a brief 503 window, and sees it periodically across ingresses: Authentik forward-auth accumulates one authentik_proxy_* cookie per protected service on .viktorbarzin.me, so established browsers carry multi-10KB Cookie headers — over error-pages' 5120-byte default read buffer, which doubles as its max header size. Any error-middleware dispatch then 431'd instead of rendering the styled page. Same root cause class as the 2026-06-01 large_client_header_buffers fixes on bot-block-proxy and auth-proxy-config; error-pages was the remaining small-buffer backend on the shared chain.	2026-06-12 20:31:01 +00:00
Viktor Barzin	523e18c127	kyverno: sync-ghcr-credentials to private-ghcr namespaces; tripit consumes the clone Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12) without waiting on a UI-minted token: GitHub has no token-mint API, so the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token — swap the alias value when a scoped token is ever minted) becomes the platform credential. Because the PAT is broad, the new ClusterPolicy clones ghcr-credentials ONLY to an explicit allowlist of namespaces running private ghcr images (tripit, f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio, fire-planner, recruiter-responder) — NOT cluster-wide like registry-credentials. generateExisting+synchronize so existing namespaces get the clone. tripit's hand-declared ns-scoped secret is removed in favour of the clone (imagePullSecrets now reference the name literally). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:28:11 +00:00
Viktor Barzin	12fd1fcbc9	android-emulator: api36-v7 — noVNC defaults: scaled view, autoconnect, reconnect Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect, reconnect with 2s delay, shared) so every load behaves right without URL params, and viewers self-heal across pod restarts/wakes. Already applied live to the running pod; this makes it survive the next wake.	2026-06-12 20:18:26 +00:00
Viktor Barzin	ff08c685cd	tts: image is TF-owned — drop the copied KEEL ignore so the GHCR switch applies The deployment's lifecycle.ignore_changes still ignored the container image (copied from the keel-managed tripit pattern), which would have made the previous commit's GHCR switch a silent no-op on apply. Keel cannot poll the private GHCR repo anyway; the pinned sha tag is terraform's to manage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:13:50 +00:00
Viktor Barzin	dbb4572112	tts: pull Chatterbox from GHCR — the Forgejo-registry copy is unpullable Viktor reports the voice still isn't from the TTS service — correct: zero story_audio rows exist; the pod has sat in ImagePullBackOff since the first window because the 2026-06-09 Forgejo-registry push has a corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and identical digests can't heal corrupt registry storage. The off-infra GHA rebuild (tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now lives in private GHCR: switch the image there, pin the upstream-sha tag, and add the vault-backed ghcr-credentials pull secret (mirrors stacks/tripit). tripit's drain loop has 27 narrations queued and picks them up the moment the pod goes Ready. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:13:19 +00:00
Viktor Barzin	8919835c5d	beads-server: track claude-agent-service :latest (was pruned SHA → ImagePullBackOff) cluster-health found beads-dispatcher + beads-reaper CronJobs in ImagePullBackOff for ~7h: they pinned claude-agent-service:2fd7670d, a SHA tag that Forgejo retention (keeps newest 10) pruned. claude-agent-service itself runs :latest (KEEL_IGNORE_IMAGE). Point the beads tag at :latest so it tracks the live image and can't go stale again — the dispatcher/reaper only need bd+curl+jq, which the image ships. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 20:12:24 +00:00
Viktor Barzin	0491fc43f2	android-emulator: README — final measured profile; honest GL story Trues the runbook up to reality: guest GL stays software (llvmpipe) under Xvfb by deliberate choice (NVIDIA headless GL would need a different streaming architecture), the GPU slice costs ~100MiB VRAM only while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with scale-to-zero covering idle.	2026-06-12 20:11:55 +00:00
Viktor Barzin	10a52a2683	gitignore: timestamped terraform.tfstate..backup (plaintext Tier-0 secrets) [ci skip] Viktor's off-infra-builds wave 0 (infra#11): two untracked terraform.tfstate.<ts>.backup files with live plaintext Tier-0 secrets were sitting in stacks/infra/ unmatched by the existing .tfstate.backup patterns — one stray git add from the public repo. Pattern added; the on-disk files are deleted separately. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:11:41 +00:00
Viktor Barzin	3802967290	android-emulator: api36-v6 — cap RLIMIT_NOFILE; x11vnc -nolookup Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB banner — strace showed it sweeping the container's fd table with one fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so each connection effectively never completed. The entrypoint now sets ulimit -n 65536 for everything it launches (verified live: banner answers instantly under the capped limit); x11vnc also gets -nolookup so client reverse-DNS can never stall handshakes.	2026-06-12 20:04:42 +00:00
Viktor Barzin	623d34628a	docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip] Viktor asked to evaluate fully external image builders because in-cluster CI builds keep destabilising the homelab (Forgejo OOM under registry-push load, hairpin push timeouts, build IO on the shared sdc HDD, registry PVC at its 50Gi ceiling). The evaluation was grilled to a decision set: - every owned image builds on GitHub Actions and lives on ghcr.io (extends the 2026-06-09 tripit pilot to the whole fleet) - per-repo visibility: 9 public mirrors + images (gated on a clean gitleaks/PII history scan), the personal/finance/gray ones stay private - clean cut: no in-cluster fallback build pipelines; existing build-fallback.yml files are deleted - Woodpecker becomes deploy-only; Forgejo registry freezes to one last-known-good tag per Service after a manual cleanup pass - dead builders (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned, not migrated; travel_blog is decommissioned outright; manual images (x402-gateway, chrome-service-novnc, chatterbox-tts, android-emulator) get formalized GHA builds; infra-ci + CLI builds move to GHA on the public infra repo CONTEXT.md: updated 'GHA build + Woodpecker deploy', added 'Canonical repo', 'GitHub mirror', 'Forgejo registry' terms, image-path relationship, and a 'registry' ambiguity entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:55:47 +00:00
Viktor Barzin	3978eec53a	Merge forgejo/master into wizard/emu-gpu	2026-06-12 19:45:06 +00:00
Viktor Barzin	b2bd859a8e	android-emulator: NVIDIA_DRIVER_CAPABILITIES=all — graphics libs for -gpu host First GPU boot verified qemu attached to the T4, but the guest GL translator reported llvmpipe: the GPU operator injects only compute,utility by default, so the NVIDIA EGL/GL vendor libraries were absent and gfxstream silently fell back to software GL. The graphics capability completes the hardware rendering path.	2026-06-12 19:43:25 +00:00
Viktor Barzin	0216e993dc	etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes removable. These are the big, clean cuts: 1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender writes + a pod-creation admission webhook, purely to feed a dashboard. krr (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431. 2. Disable kyverno reporting (admission/aggregate/background). policyReports were already off, so the pipeline generated ephemeralreports + an hourly all-resource etcd re-scan for NO user-facing output. Admission enforcement (deny-* policies) and Keel mutation are unaffected; violations surface via Loki->Slack. 3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent). Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a ~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a mutate-existing policy and its churn is apply-time not steady-state. Both filed as follow-up beads. Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs. Then measure etcd apply-latency and revert the timeouts. Docs updated (VPA/Goldilocks -> krr). See memory 5402-5407. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 19:41:22 +00:00
Viktor Barzin	16adda2c48	android-emulator: gate reaches the kube API via env vars, not DNS First real wake attempt 500'd: kubernetes.default.svc does not resolve from the gate's alpine pod (musl + injected dns_config ndots quirk), so every kube call failed with 'Name does not resolve'. Use the injected KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster endpoint, no DNS dependency. ConfigMap checksum annotation rolls the gate automatically.	2026-06-12 19:32:34 +00:00
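Note on 16adda2c48: a minimal sketch of building the API base URL from the injected variables instead of a DNS name. The helper name `kube_api_base` and the fallback port are assumptions for illustration; the gate's actual code is not shown in this log.

```python
import os
from typing import Mapping, Optional


def kube_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the in-cluster API URL from the kubelet-injected env vars.

    Hypothetical helper. Raises KeyError outside a pod, where
    KUBERNETES_SERVICE_HOST is not set.
    """
    env = os.environ if env is None else env
    host = env["KUBERNETES_SERVICE_HOST"]
    port = env.get("KUBERNETES_SERVICE_PORT", "443")  # fallback is an assumption
    if ":" in host:
        host = f"[{host}]"  # bracket an IPv6 literal for use in a URL
    return f"https://{host}:{port}"
```

Because the host is an IP address handed to the container at start, no resolver is involved, which sidesteps the musl and ndots issue the commit describes.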
Viktor Barzin	b1b9de90e4	tripit: tripit-api ingress joins the dedicated 100/1000 rate-limit Follow-up to `eef4dc7f`: the Android Shell's dedicated bearer-auth host (tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was still on the default 10/50 limiter — the shell's photo grid would have hit the identical 429 wall Viktor just reported on the PWA host. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:18:40 +00:00
Viktor Barzin	eef4dc7f63	tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50 Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich thumbnail proxies through tripit's /api, so a few-hundred-photo trip is that many parallel GETs from one IP — far past the shared Traefik limiter's average 10 / burst 50. Fourth instance of the parallel-asset pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated tripit-rate-limit middleware (average 100, burst 1000) + skip_default_rate_limit on the main tripit ingress only. The token-gated calendar/email/slack carve-outs keep the strict default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 19:15:56 +00:00
Viktor Barzin	e8a4eb0f05	tripit: satisfy the auth-comment lint on the tripit-api ingress The previous commit (`c5631cff`) failed CI's ingress_factory guard: the '# auth = "none": <why>' justification must sit directly above the auth line inside the module, not above the module block. Same content, moved to where the lint looks; no functional change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 08:53:02 +00:00
Viktor Barzin	c5631cff74	tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017 it logs in with OIDC Code+PKCE and calls the API with bearer JWTs: - authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback, RS256, 1h access / 90d refresh (offline_access mapping attached so refresh tokens are issued), plus the TripIt App application. - main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit Service, no forward-auth (backend validates the JWTs itself once tripit AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the existing traefik strip-auth-headers middleware so the header fallback can never be spoofed through this host. Closes nothing here; tracked as viktor/tripit#49. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 08:47:46 +00:00
Viktor Barzin	b985686661	android-emulator: non-merge apply trigger (GPU + wake gate)	2026-06-12 07:53:38 +00:00
Viktor Barzin	18ccd57b63	Merge forgejo/master into wizard/emu-gpu	2026-06-12 07:53:12 +00:00
Viktor Barzin	f4dd515fd7	android-emulator: GPU rendering on node1 + scale-to-zero wake gate Viktor's direction (2026-06-12): the emulator is dev-only, so it should be on-demand, and it should use the T4 where applicable. (1) api36-v5 runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs; automatic swiftshader fallback if GPU init dies) — screen-on rendering moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib python, owns / on both hostnames) scales the deployment 0→1 on visit and hands the browser to noVNC when ready; agents GET /wake + /status. The idle-sleeper CronJob counts established adb/noVNC connections via /proc/net/tcp (excluding the in-container loopback adb client) and scales to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost (~0.5-1GiB) is held only while awake, protecting llama-swap headroom.	2026-06-12 07:52:50 +00:00
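Note on f4dd515fd7: the idle-sleeper's connection count can be sketched as below, assuming the standard Linux /proc/net/tcp layout (hex `address:port` pairs, state `01` for ESTABLISHED, 127.0.0.1 encoded as `0100007F` on little-endian hosts). The function name and the noVNC port 6080 used in the usage lines are assumptions; only adb's 5555 appears in this log.

```python
from typing import Iterable

ESTABLISHED = "01"        # TCP state code used by /proc/net/tcp
LOOPBACK_V4 = "0100007F"  # 127.0.0.1 as printed on little-endian hosts


def count_established(proc_net_tcp: str, ports: Iterable[int]) -> int:
    """Count established IPv4 connections to the given local ports,
    ignoring peers on loopback (e.g. an in-container adb client)."""
    wanted = set(ports)
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        local_port = int(local.rsplit(":", 1)[1], 16)
        remote_ip = remote.rsplit(":", 1)[0]
        if state == ESTABLISHED and local_port in wanted and remote_ip != LOOPBACK_V4:
            count += 1
    return count


SAMPLE = """\
  sl  local_address rem_address   st
   0: 00000000:15B3 00000000:0000 0A
   1: 0A0A0005:15B3 0A0A0009:D2F0 01
   2: 0100007F:15B3 0100007F:A1B2 01
   3: 0A0A0005:17C0 0A0A0007:C350 01
"""
print(count_established(SAMPLE, [5555, 6080]))  # prints 2
```

In the sample, row 0 is a listener, row 2 is the loopback client that gets excluded, and rows 1 and 3 are the two remote sessions that count as activity.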
Viktor Barzin	b598c61c61	android-emulator: scale to 0 — its CPU burn was starving etcd The cluster-health check found the control plane flapping: kube-scheduler and kube-controller-manager were crashlooping (220+ restarts) on lost leader-election leases, with "etcdserver: request timed out" in the logs. Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU) CPU burn on node3, together with frigate on node1, saturated the single Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM — so etcd timed out and the leader-election controllers died and restarted in a loop. The emulator is a shared test instance, not a 24/7 service, so scaling it to 0 is the right relief: spin it back to replicas=1 on-demand for a testing session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load 64->51, control-plane restarts frozen. Durable structural fix (etcd/critical VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 07:31:46 +00:00
Viktor Barzin	39a22b352e	tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever First real window (2026-06-12 02:00): the chatterbox pod sat in ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported whole-tree but the chatterbox SUBDIR never existed on the host (the go-live runbook step needed NFS-host shell nobody doing the apply had). One-shot busybox Job mounts the export root and mkdir -p's the subtree; kubelet's mount retry then self-heals the pod. Audio queue (27 items) drains as soon as the model loads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 02:51:14 +00:00
Viktor Barzin	db63cd7501	android-emulator+traefik: non-merge apply trigger for the rate-limit fix Pipeline 102 applied nothing — the rate-limit commit entered master under a merge head and the changed-stack detector is blind to merge diffs. Plain commit touching both stacks so they apply.	2026-06-12 00:33:10 +00:00
Viktor Barzin	4d844d6fd4	Merge forgejo/master into wizard/emu-ratelimit	2026-06-12 00:26:05 +00:00
Viktor Barzin	152dad0a40	android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is unbundled and fetches ~60 ES modules in parallel on page open; the shared Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's loader waits on the missing modules indefinitely (reproduced: 38x429 in a 90-request burst through the ingress). Adds a dedicated 50/300 android-emulator-rate-limit middleware (actualbudget/immich pattern) and opts both emulator ingresses out of the shared limiter.	2026-06-12 00:25:44 +00:00
Viktor Barzin	d3d37a15ec	tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard Viktor asked 'can't we make it live? why the cronjob?' — the overnight window guaranteed VRAM room on the shared T4, but immich/frigate models idle-unload during the day so the card often has room (measured 10.3 GiB free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to 0 when the queue empties (also frees the card early inside the nightly window). Failed metrics scrape fail-safes to no-scale-up, same as the window preflight. The guard moves to all-day */5 — live synthesis can hold the card at any hour, so the yield-on-pressure watchdog must watch at any hour. tripit exposes the unauthenticated in-cluster queue count; a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up stays as the guaranteed nightly catch-up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 00:25:35 +00:00
Viktor Barzin	d818f7ed3b	android-emulator: README — measured resource profile + remote access + screen-off etiquette	2026-06-12 00:10:03 +00:00
Viktor Barzin	9af3e8860e	Merge origin/master (CI state-sync commits) into wizard/android-emulator-public	2026-06-12 00:08:14 +00:00
Viktor Barzin	43d2107760	android-emulator: public Authentik-gated ingress for the noVNC screen Viktor wants the emulator screen reachable over the web: adds android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik forward-auth — same-origin WebSockets through forward-auth are proven by the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains LAN-only since it is unauthenticated.	2026-06-12 00:07:49 +00:00
Viktor Barzin	9a2124f105	tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23) Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 23:53:49 +00:00
Viktor Barzin	02ed3062f6	android-emulator: non-merge apply trigger for v4 image rollout Pipeline 96 applied only tripit: the v4 bump (`577267cd`) entered master inside a merge whose first-parent diff hid stacks/android-emulator from the stack detector — same failure mode as the tts `798b0255` trigger. This plain commit touches the stack so the detector picks it up.	2026-06-11 23:48:16 +00:00
Viktor Barzin	2f8addc63b	Merge forgejo/master into wizard/android-emulator	2026-06-11 22:53:11 +00:00
Viktor Barzin	577267cd97	android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP Two final fixes from the live debugging session: (1) sdkmanager-latest emulator 36.6.11 hangs before executing a single guest instruction in this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off) while 36.1.9 boots Android in ~107s — the entrypoint now pins build 13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555, so socat's wildcard bind died with EADDRINUSE and its exit restarted the pod right after a successful boot — socat now binds the pod IP only.	2026-06-11 22:52:54 +00:00
Viktor Barzin	fba1659611	tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live) Viktor's tour-redo (tripit#29): the new image is rolled out, so the two new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch merged with claude-agent-service proposals, Focus-steered) and the Wikipedia place resolver (manual sight search + LLM-proposal resolution) leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:30:24 +00:00
Viktor Barzin	f74e421283	tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London) Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only now — the fill-tour-audio worker synthesizes the queued (story, telling, voice) audio while the tts stack's off-peak window (02:00-06:00) has Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model load, 04:30 insurance against a skipped window or guard yield. Daytime runs record tts_unreachable and exit quietly by design. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:24:29 +00:00
Viktor Barzin	85dbec6108	android-emulator: api36-v3 — avdmanager must run from inside the SDK root v2's marker fix proved the install completes, but avdmanager still saw no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root), deriving the SDK root from its own toolsdir — /opt/android in our image, while packages live on the PVC at /sdk. v3 seeds cmdline-tools into /sdk/cmdline-tools/latest once and runs avdmanager from there, so it resolves the PVC as the SDK root.	2026-06-11 21:15:50 +00:00
Viktor Barzin	5e8a988858	android-emulator: api36-v2 — marker-file install idempotency + retries First boot crashed mid-SDK-install, and the dir-existence check then skipped reinstall forever: avdmanager saw the partial tree and died with 'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks install completion with a marker file written only after sdkmanager succeeds + package.xml exists, wipes partial system-image trees before reinstalling, and retries sdkmanager 3x.	2026-06-11 20:59:08 +00:00
Viktor Barzin	3fac45febc	android-emulator: drop applied import stanzas; deployment recreates fresh The five imports from the last recovery commit are in state now (verified serial 4: everything except the deployment). The deployment kept falling out of state between runs, so instead of a third import round the broken 0-replica deployment object was deleted live (transient recovery step, presence-claimed) and this apply recreates it Terraform-owned with the quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors on importing already-managed addresses.	2026-06-11 20:49:37 +00:00
Viktor Barzin	6b7efcd2d6	android-emulator: import the five resources still missing from state Pipeline 88 imported the namespace but its refresh dropped the PVC, both services, the ingress and the tls secret from state (PG-backend state races on this new stack's first applies), so the apply again died on 'already exists' conflicts. State now holds namespace+deployment; adopt the missing five with import blocks (TF 1.5 errors on importing already-managed addresses, so only the missing set is listed). Stanzas come out once applied.	2026-06-11 20:44:09 +00:00
Viktor Barzin	b948224008	android-emulator: import orphaned namespace into state (lock-race recovery) Pipeline 85 created the namespace but a Terraform pg-backend workspace-creation lock race (new stack schema initializing while other stacks applied concurrently) left it out of the recorded state — every later apply then died with 'namespaces android-emulator already exists'. Adopt it with an import block per the house recovery pattern; stanza gets removed once it has applied.	2026-06-11 20:38:46 +00:00
Viktor Barzin	99c19584f7	android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory) First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi, limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like tiers 3/4 do, instead of opting the namespace out via custom-quota.	2026-06-11 19:56:09 +00:00

4228 commits