The devnen server runs chunked synthesis as a blocking call inside its
async handler, so the event loop (and every HTTP probe) hangs for the
whole multi-minute story. Kubelet's HTTP liveness probe (1s timeout)
then killed the container mid-story (exit 137, twice within 10 min of
the first real drain), which reset the engine, so every following pass
started cold and tripit's 120s synthesis budget could never be met —
the queue would never drain.
TCP probes keep the meaning that matters: uvicorn binds 8004 only
after the model finishes loading in the lifespan hook, so readiness
still gates 'model loaded', while a GPU-busy server is left alive.
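
Why a TCP probe survives a busy server can be sketched in Python: the
kernel completes the handshake for any listening socket, even if the
application never gets around to accept(). tcp_probe is a hypothetical
helper that illustrates the probe semantics; it is not kubelet's
implementation.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Only the kernel-level handshake is checked, so a server whose event
    loop is blocked still passes as long as its socket is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: nothing is listening.
        return False
```

An HTTP probe, by contrast, also needs the handler to answer within the
timeout, which is exactly what a blocked event loop cannot do.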
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
HEAD~1 on a merge commit is the feature-branch parent, so the
changed-stack detection diffed the WRONG side and silently skipped the
stacks the push actually changed — pipeline 128 'succeeded' without
applying the new ci-pipeline-health stack. Use the push's true
before-state (CI_PREV_COMMIT_SHA) when it resolves, HEAD~1 as fallback
(first build / shallow edge cases). Also touches the ci-pipeline-health
stack so THIS push applies it.
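
The base selection can be sketched as below. This is an illustration in
Python, not the pipeline's actual script: diff_base and commit_resolves
are hypothetical names, and checking the commit with git cat-file is an
assumption about how "resolves" is tested.

```python
import subprocess
from typing import Callable, Mapping


def commit_resolves(sha: str) -> bool:
    """True if `sha` names a commit present in the local clone."""
    result = subprocess.run(
        ["git", "cat-file", "-e", f"{sha}^{{commit}}"],
        capture_output=True,
    )
    return result.returncode == 0


def diff_base(env: Mapping[str, str],
              resolves: Callable[[str], bool] = commit_resolves) -> str:
    """Pick the ref to diff HEAD against for changed-stack detection."""
    prev = env.get("CI_PREV_COMMIT_SHA", "").strip()
    if prev and resolves(prev):
        # The push's true before-state, independent of merge parents.
        return prev
    # First build or shallow clone: fall back to the old behaviour.
    return "HEAD~1"
```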
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to monitor the pipelines closely as builds move off-infra
(PRD infra#10). New aux stack: daily 07:30 UTC CronJob on the
claude-agent-service image running a deterministic shell sweep —
GitHub Actions failures/stuck runs across owned repos, Woodpecker
pipeline failures, GHA free-tier minutes burn. Healthy = one quiet
Slack line; issues = Slack alert + comment on infra#10. In-cluster
(not a cloud routine) because Vault + the Woodpecker token are
LAN-only. Secrets via ExternalSecret (github_pat deliberately, not the
ghcr_pull_token alias — a scoped packages-only rotation couldn't read
Actions runs).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
noVNC scaled correctly but the emulator's Qt window opened small (~411x914)
and floated inside the 1080x2280 Xvfb, so the user saw a tiny phone in a sea
of black. v8 bakes a background fitter (wmctrl+xdotool) that, after boot,
auto-OKs the one-shot nested-virtualization warning dialog, fills the phone
window to the display, and parks the control strip off the right edge. It
re-runs during startup to catch window/dialog timing, then re-applies the
layout every 30s. Applied live to the running pod already; this makes it
survive the next wake.
First live drain failed all 27 queued narrations with 404 'Voice file
'Emily' not found': tripit's catalog sends bare stems (Emily) but the
devnen server resolves the voice as a literal filename (Emily.wav) in
predefined_voices_path then reference_audio — no stem fallback exists
upstream (HEAD == our pinned sha), and symlinks can't bridge it because
safe_resolve_within() resolves them out of the containment check.
New initContainer on the chatterbox deployment copies the 28 bundled
voices to /data/reference_audio/<stem> on the PVC (second lookup path).
Same image as the main container so no extra pull; idempotent; ~15 MB.
Verified live before committing: an extension-less copy synthesizes
200 audio/mp3 (5.3s warm) where voice=Emily 404'd.
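
The copy step amounts to the following, shown as a Python sketch rather
than the initContainer's real command. copy_voices_as_stems is a
hypothetical name, the source directory of the bundled voices is left
as a parameter, and the same-size check is one assumed way to make the
copy idempotent.

```python
import shutil
from pathlib import Path


def copy_voices_as_stems(src_dir: Path, dst_dir: Path) -> list[str]:
    """Copy every <stem>.wav in src_dir to dst_dir/<stem> (no extension).

    Returns the stems copied on this run; files already present with the
    same size are skipped, so repeated runs are no-ops.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(src_dir.glob("*.wav")):
        target = dst_dir / wav.stem  # e.g. Emily.wav -> Emily
        if target.exists() and target.stat().st_size == wav.stat().st_size:
            continue
        shutil.copyfile(wav, target)
        copied.append(wav.stem)
    return copied
```

Real copies are used instead of symlinks because the server's
containment check resolves symlinks away.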
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.
Changes (all in the monitoring module):
* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
once, then only on a membership change or resolve); critical 1h -> 6h
(a slow nag, not an hourly drip). send_resolved stays on. The bulk of
the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
continuously for ~24h, re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
08:00 Europe/London: the full current board grouped by severity + what
resolved in the last 24h. This is the standing-state safety net for the
alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
(no pip/apk at runtime -> none of the per-run disk-write footprint that
disabled status-page-pusher). Reuses the existing Alertmanager Slack
webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
PodImagePullBackOff uninhibited because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
for the same leg — two alerts described one condition and were the #1
noise source (~3,400 alert-minutes over 24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
completed CronJob pods that linger in EndpointSlices as NotReady
addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
transient Pushgateway/scrape blip no longer fires-and-resolves.
* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
annotation, so notification volume was unmeasurable — now we can verify
this change worked (alertmanager_notifications_total et al.).
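
For orientation, the severity grouping in the digest can be sketched as
below. This is not the contents of alert_digest.py: the function names
are hypothetical, and the only assumption about the Alertmanager v2
payload is that each alert carries a "labels" map with "alertname" and
"severity".

```python
from collections import defaultdict

# Known severities come first, in this order; anything else follows.
SEVERITY_ORDER = ("critical", "warning", "info")


def group_by_severity(alerts: list[dict]) -> dict[str, list[str]]:
    """Group alert names by their severity label."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        groups[severity].append(labels.get("alertname", "<unnamed>"))
    ordered = {}
    for severity in SEVERITY_ORDER:
        if severity in groups:
            ordered[severity] = sorted(groups.pop(severity))
    for severity in sorted(groups):
        ordered[severity] = sorted(groups[severity])
    return ordered


def render_digest(alerts: list[dict]) -> str:
    """Render one plain-text line per severity for the Slack message."""
    groups = group_by_severity(alerts)
    if not groups:
        return "No alerts firing."
    return "\n".join(
        f"{severity} ({len(names)}): {', '.join(names)}"
        for severity, names in groups.items()
    )
```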
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The demand-gate script defaulted an unreadable/unparseable tts-queue
response to QUEUED=0, which the scale-down arm reads as 'queue empty'.
One transient curl failure at 20:30 UTC today idled chatterbox-tts to 0
the very minute the pod first went Ready, with 27 narrations still
queued (tripit kept logging tts_unreachable). Probe failure now exits
without touching replicas: scale-up still needs a real count > 0, and
scale-down now needs an explicitly parsed 0. Worst case after this
change is a deployment left scaled up and idling until the 06:00 window-down.
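
The three-way outcome, sketched in Python (the real gate is a shell
script). parse_queued and decide are hypothetical names, and the sketch
assumes the queue endpoint returns a bare non-negative integer.

```python
from typing import Optional


def parse_queued(raw: Optional[str]) -> Optional[int]:
    """Return the queue count, or None if the response is unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        # Empty body, HTML error page, curl failure: no opinion.
        return None
    return int(text)


def decide(raw: Optional[str]) -> str:
    """Map a probe response to 'scale-up', 'scale-down' or 'noop'."""
    queued = parse_queued(raw)
    if queued is None:
        return "noop"  # never touch replicas on a failed probe
    return "scale-up" if queued > 0 else "scale-down"
```

The key point is that "unknown" is a distinct value from 0, so a failed
probe can no longer be read as an empty queue.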
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a
routed host during a brief 503 window, and sees it periodically across
ingresses: Authentik forward-auth accumulates one authentik_proxy_*
cookie per protected service on .viktorbarzin.me, so established
browsers carry Cookie headers of tens of KB, far over error-pages'
5120-byte default read buffer, which doubles as its max header size. Any
error-middleware dispatch then 431'd instead of rendering the styled page.
Same root cause class as the 2026-06-01 large_client_header_buffers
fixes on bot-block-proxy and auth-proxy-config; error-pages was the
remaining small-buffer backend on the shared chain.
Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12)
without waiting on a UI-minted token: GitHub has no token-mint API, so
the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token —
swap the alias value when a scoped token is ever minted) becomes the
platform credential. Because the PAT is broad, the new ClusterPolicy
clones ghcr-credentials ONLY to an explicit allowlist of namespaces
running private ghcr images (tripit, f1-stream, job-hunter,
instagram-poster, payslip-ingest, wealthfolio, fire-planner,
recruiter-responder) — NOT cluster-wide like registry-credentials.
generateExisting+synchronize so existing namespaces get the clone.
tripit's hand-declared ns-scoped secret is removed in favour of the
clone (imagePullSecrets now reference the name literally).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint
now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect,
reconnect with 2s delay, shared) so every load behaves right without URL
params, and viewers self-heal across pod restarts/wakes. Already applied
live to the running pod; this makes it survive the next wake.
The deployment's lifecycle.ignore_changes still ignored the container
image (copied from the keel-managed tripit pattern), which would have
made the previous commit's GHCR switch a silent no-op on apply. Keel
cannot poll the private GHCR repo anyway; the pinned sha tag is
terraform's to manage.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor reports the voice still isn't from the TTS service — correct:
zero story_audio rows exist; the pod has sat in ImagePullBackOff since
the first window because the 2026-06-09 Forgejo-registry push has a
corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and identical
digests can't heal corrupt registry storage. The off-infra GHA rebuild
(tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now
lives in private GHCR: switch the image there, pin the upstream-sha tag,
and add the vault-backed ghcr-credentials pull secret (mirrors
stacks/tripit). tripit's drain loop has 27 narrations queued and picks
them up the moment the pod goes Ready.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cluster-health found beads-dispatcher + beads-reaper CronJobs in ImagePullBackOff
for ~7h: they pinned claude-agent-service:2fd7670d, a SHA tag that Forgejo
retention (keeps newest 10) pruned. claude-agent-service itself runs :latest
(KEEL_IGNORE_IMAGE). Point the beads tag at :latest so it tracks the live image
and can't go stale again — the dispatcher/reaper only need bd+curl+jq, which the
image ships.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trues the runbook up to reality: guest GL stays software (llvmpipe)
under Xvfb by deliberate choice (NVIDIA headless GL would need a
different streaming architecture), the GPU slice costs ~100MiB VRAM only
while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with
scale-to-zero covering idle.
Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed
Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB
banner — strace showed it sweeping the container's fd table with one
fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so
each connection effectively never completed. The entrypoint now sets
ulimit -n 65536 for everything it launches (verified live: banner
answers instantly under the capped limit); x11vnc also gets -nolookup
so client reverse-DNS can never stall handshakes.
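
The effect of the cap, as a Python sketch of what ulimit -n 65536 does
for child processes (the entrypoint itself is shell; capped_nofile and
apply_nofile_cap are hypothetical names):

```python
import resource


def capped_nofile(soft: int, hard: int, cap: int = 65536) -> tuple[int, int]:
    """Clamp RLIMIT_NOFILE soft/hard values to at most `cap`."""
    def clamp(value: int) -> int:
        # RLIM_INFINITY may be -1, so compare it explicitly.
        if value == resource.RLIM_INFINITY or value > cap:
            return cap
        return value
    return clamp(soft), clamp(hard)


def apply_nofile_cap(cap: int = 65536) -> tuple[int, int]:
    """Lower this process's fd limit; children inherit the new limit."""
    limits = capped_nofile(*resource.getrlimit(resource.RLIMIT_NOFILE), cap)
    resource.setrlimit(resource.RLIMIT_NOFILE, limits)
    return limits
```

With the limit at 65536, a sweep of one fcntl per possible fd is bounded
at 65536 calls instead of roughly 2.1 billion.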
First GPU boot verified qemu attached to the T4, but the guest GL
translator reported llvmpipe: the GPU operator injects only
compute,utility by default, so the NVIDIA EGL/GL vendor libraries were
absent and gfxstream silently fell back to software GL. The graphics
capability completes the hardware rendering path.
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:
1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
(no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
writes + a pod-creation admission webhook, purely to feed a dashboard. krr
(Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.
2. Disable kyverno reporting (admission/aggregate/background). policyReports were
already off, so the pipeline generated ephemeralreports + an hourly
all-resource etcd re-scan for NO user-facing output. Admission enforcement
(deny-* policies) and Keel mutation are unaffected; violations surface via
Loki->Slack.
3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).
Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.
Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First real wake attempt 500'd: kubernetes.default.svc does not resolve
from the gate's alpine pod (musl + injected dns_config ndots quirk), so
every kube call failed with 'Name does not resolve'. Use the injected
KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster
endpoint, no DNS dependency. ConfigMap checksum annotation rolls the
gate automatically.
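
A minimal sketch of building the endpoint from the injected variables;
kube_api_base is a hypothetical helper name, and the bracket handling
for IPv6 hosts is an addition for completeness, not something the gate
is known to need.

```python
import os
from typing import Mapping


def kube_api_base(env: Mapping[str, str] = os.environ) -> str:
    """Build the API server URL from the variables kubelet injects."""
    host = env["KUBERNETES_SERVICE_HOST"]
    port = env["KUBERNETES_SERVICE_PORT"]
    if ":" in host:
        # IPv6 literals need brackets inside a URL.
        host = f"[{host}]"
    return f"https://{host}:{port}"
```

Because the host is already an IP address, no resolver (and no ndots
search-path behaviour) is involved.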
Follow-up to eef4dc7f: the Android Shell's dedicated bearer-auth host
(tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was
still on the default 10/50 limiter — the shell's photo grid would have
hit the identical 429 wall Viktor just reported on the PWA host.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich
thumbnail proxies through tripit's /api, so a few-hundred-photo trip is
that many parallel GETs from one IP — far past the shared Traefik
limiter's average 10 / burst 50. Fourth instance of the parallel-asset
pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated
tripit-rate-limit middleware (average 100, burst 1000) +
skip_default_rate_limit on the main tripit ingress only. The token-gated
calendar/email/slack carve-outs keep the strict default.
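
To see why the burst size is what matters for a photo grid, here is a
simplified token-bucket model in Python. It is an illustration only,
not Traefik's implementation; rejected_in_burst is a hypothetical name
and the 300-request grid is an assumed example size.

```python
def rejected_in_burst(requests: int, average: float, burst: int,
                      duration_s: float = 0.0) -> int:
    """Count rejections when `requests` arrive evenly over `duration_s`.

    The bucket starts full with `burst` tokens and refills at `average`
    tokens per second; a request without a token would get a 429.
    """
    tokens = float(burst)
    rejected = 0
    gap = duration_s / requests if requests else 0.0
    for _ in range(requests):
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1
        # Refill for the time that passes before the next request.
        tokens = min(float(burst), tokens + average * gap)
    return rejected


# A 300-thumbnail grid arriving at once (illustrative numbers).
shared = rejected_in_burst(300, average=10, burst=50)
dedicated = rejected_in_burst(300, average=100, burst=1000)
```

In this model the shared limiter rejects everything past the first 50
simultaneous requests, while the dedicated one absorbs the whole grid.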
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The previous commit (c5631cff) failed CI's ingress_factory guard: the
'# auth = "none": <why>' justification must sit directly above the auth
line inside the module, not above the module block. Same content, moved
to where the lint looks; no functional change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell
cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017
it logs in with OIDC Code+PKCE and calls the API with bearer JWTs:
- authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK
holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback,
RS256, 1h access / 90d refresh (offline_access mapping attached so refresh
tokens are issued), plus the TripIt App application.
- main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit
Service, no forward-auth (backend validates the JWTs itself once tripit
AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the
existing traefik strip-auth-headers middleware so the header fallback can
never be spoofed through this host.
Closes nothing here; tracked as viktor/tripit#49.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
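
The connection count can be sketched in Python as below. It is not the
CronJob's actual script: the function names are hypothetical, only IPv4
/proc/net/tcp is parsed, the little-endian address decoding assumes an
x86 host, and any port other than adb's 5555 (for example 6080 for
noVNC) is an assumption.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def parse_addr(field: str) -> tuple[str, int]:
    """Decode 'HEXIP:HEXPORT' as found in /proc/net/tcp (IPv4)."""
    ip_hex, port_hex = field.split(":")
    # The address is little-endian hex: read the bytes back to front.
    octets = [int(ip_hex[i:i + 2], 16) for i in range(6, -2, -2)]
    return ".".join(map(str, octets)), int(port_hex, 16)


def count_established(proc_net_tcp: str, ports: set[int]) -> int:
    """Count established non-loopback connections to the given ports."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != ESTABLISHED:
            continue
        _, local_port = parse_addr(fields[1])
        remote_ip, _ = parse_addr(fields[2])
        if local_port in ports and not remote_ip.startswith("127."):
            count += 1
    return count
```

A result of 0 on four consecutive checks is what would trigger the
scale-to-zero described above.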
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.
Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.
The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First real window (2026-06-12 02:00): the chatterbox pod sat in
ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported
whole-tree but the chatterbox SUBDIR never existed on the host (the
go-live runbook step needed NFS-host shell access, which nobody doing the apply had).
One-shot busybox Job mounts the export root and mkdir -p's the subtree;
kubelet's mount retry then self-heals the pod. Audio queue (27 items)
drains as soon as the model loads.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
Viktor asked 'can't we make it live? why the cronjob?' — the overnight
window guaranteed VRAM room on the shared T4, but immich/frigate models
idle-unload during the day so the card often has room (measured 10.3 GiB
free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when
tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to
0 when the queue empties (also frees the card early inside the nightly
window). A failed metrics scrape fails safe to no-scale-up, same as the
window preflight. The guard moves to all-day */5 — live synthesis can
hold the card at any hour, so the yield-on-pressure watchdog must watch
at any hour. tripit exposes the unauthenticated in-cluster queue count;
a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up
stays as the guaranteed nightly catch-up.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
Switches the planning workspace's 'Research this' from the deterministic
Fake to the live claude-agent-service Researcher. Behaviour-reviewed via
a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with
real 2026 UK bank-holiday leave windows + rough fares). Opt-in,
budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing'
on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in
tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
Viktor's tour-redo (tripit#29): the new image is rolled out, so the two
new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch
merged with claude-agent-service proposals, Focus-steered) and the
Wikipedia place resolver (manual sight search + LLM-proposal resolution)
leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
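
The completion-marker logic, as a Python sketch of the shell flow
described above. ensure_installed is a hypothetical name, and install
and wipe stand in for the sdkmanager call and the partial-tree cleanup.

```python
from pathlib import Path
from typing import Callable


def ensure_installed(install: Callable[[], bool],
                     wipe: Callable[[], None],
                     package_xml: Path, marker: Path,
                     attempts: int = 3) -> bool:
    """Install once; trust only the marker, never the directory."""
    if marker.exists():
        return True  # a previous run finished completely
    for _ in range(attempts):
        wipe()  # drop any partial tree left by a crashed run
        if install() and package_xml.exists():
            # Written last, so a crash mid-install leaves no marker.
            marker.touch()
            return True
    return False
```

Because the marker is written only after both checks pass, a crash at
any earlier point makes the next boot reinstall instead of skipping.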
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.