Commit graph

4265 commits

Author SHA1 Message Date
Viktor Barzin
2dde480795 openclaw: install-recruiter-plugin init image forgejo -> ghcr :latest (infra#27)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Second half of the recruiter-responder off-infra migration: the first GHA
build has published ghcr.io/viktorbarzin/recruiter-responder:{1d99a8d5,latest},
so the openclaw plugin-install init container can now follow the ghcr
:latest. The forgejo-side build pipeline was removed by the onboarding
commit, so the old forgejo :latest tag is frozen and would silently serve
stale plugin code. Deferred from the first commit on purpose - flipping it
before the package existed would have wedged the openclaw rollout on
ImagePullBackOff.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:57:30 +00:00
Viktor Barzin
57ff41e47e recruiter-responder: pull image from ghcr + ghcr-credentials on all consumers (ADR-0002, infra#27)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Migrating recruiter-responder off in-cluster Woodpecker builds: GHA will
build and push ghcr.io/viktorbarzin/recruiter-responder (PRIVATE package).
This commit lands the pull-side prerequisites BEFORE the first off-infra
build fires:

- stacks/recruiter-responder: image base forgejo -> ghcr (inert on the live
  Deployment - both containers are ignore_changes'd; the Woodpecker deploy
  moves the tag) + ghcr-credentials imagePullSecrets on the Deployment
  (covers the recruiter-responder container AND the alembic-migrate init
  container, which share the image).
- stacks/openclaw: ghcr-credentials imagePullSecrets on the openclaw
  Deployment - its install-recruiter-plugin init container consumes the
  :latest tag of this image. The image ref itself flips to ghcr in a
  follow-up once the first GHA build has created the package (flipping now
  would ImagePullBackOff on a not-yet-existing package and wedge the apply).
- stacks/kyverno: allowlist openclaw in sync-ghcr-credentials so the pull
  secret is cloned into that namespace too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:43:35 +00:00
Viktor Barzin
c594274c83 ci: re-apply fire-planner stack after pipeline race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Comment-only touch so the changed-stack detection applies
stacks/fire-planner from the current master tree. Pipeline 150 (commit
f18dfa4c — the ghcr image base + ghcr-credentials migration for issue
#26) was auto-killed when the concurrent nextcloud-todos push superseded
it, and pipeline 151 diffed from f18dfa4c onward so the fire-planner
stack changes were never applied (cronjobs still point at the forgejo
image, pod specs lack ghcr-credentials).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:41:20 +00:00
Viktor Barzin
a264a19629 Merge remote-tracking branch 'forgejo/master' into wizard/nextcloud-todos-ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-13 00:38:27 +00:00
Viktor Barzin
d5c328d23c nextcloud-todos: image base forgejo -> ghcr (ADR-0002, infra#18)
The nextcloud-todos build moved off-infra: GHA builds on the public
GitHub mirror and pushes ghcr.io/viktorbarzin/nextcloud-todos (public
package, anonymous pulls); Woodpecker repo 207 is deploy-only. First
ghcr image (:19c22d8c) is already built, deployed and rolled out, so
this repoint lands after the image exists. Both deployment image refs
(main + alembic-migrate init) are ignore_changes'd — no live churn,
the base matters only on resource (re)create. Old image was pulled
from a Forgejo registry package that no longer exists (pods survived
on node image cache only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:38:25 +00:00
Viktor Barzin
f18dfa4c8b fire-planner: pull image from ghcr + add ghcr-credentials to all pod specs
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Migrating fire-planner off in-cluster Woodpecker builds to GitHub
Actions -> ghcr.io (ADR-0002, issue #26). The image base moves
forgejo.viktorbarzin.me/viktor/fire-planner ->
ghcr.io/viktorbarzin/fire-planner (a PRIVATE ghcr package), so the
deployment, all three cronjobs (recompute, col-refresh,
examples-weekly) and the examples bulk job gain the ghcr-credentials
imagePullSecret (the kyverno sync-ghcr-credentials allowlist already
covers the fire-planner namespace). registry-credentials stays
alongside so the currently-running sha-pinned forgejo image can still
be pulled until the first ghcr deploy lands; the cronjob images are TF
literals and flip to ghcr :latest on this apply.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:38:09 +00:00
Viktor Barzin
e696957ebf ci: ancestor guard on DIFF_BASE; gate allowlists the owner's work email [ci skip]
Restarted infra pipelines after master moved diffed in REVERSE and
re-applied stale trees (pipeline 148 reverted payslip-ingest's fresh
ghcr config — repaired by the wave-2 agent). Only trust
CI_PREV_COMMIT_SHA when it is an ancestor of HEAD. publish-gate:
viktorbarzin@meta.com is the owner's own work email (same class as the
allowlisted personal domain), not blockable PII — unblocks infra#18.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:31:33 +00:00
Viktor Barzin
cdd60d9078 ci: re-apply instagram-poster + payslip-ingest stacks after pipeline race
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Comment-only touch of both stacks so the changed-stack detection applies
them from the current master tree. Two pipelines went wrong in sequence
during the parallel ADR-0002 wave-2 migrations (issues #23/#24):

- pipeline 146 (instagram-poster stack prep, commit 29c69250) was
  auto-killed when the concurrent payslip-ingest push superseded it, so
  its apply never ran;
- restarting it as pipeline 148 inherited CI_PREV_COMMIT_SHA = the NEW
  branch head (6928ce0b) with the OLD checkout (29c69250) — a reverse
  diff that re-applied stacks/payslip-ingest from the pre-migration
  tree, stripping the ghcr image base + ghcr-credentials pull secrets
  that pipeline 147 had just applied (2 resources reverted).

This commit restores the committed payslip-ingest config exactly as
issue #24 landed it and finally applies the instagram-poster ghcr prep
from issue #23. Lesson encoded in the comments: do not restart killed
infra pipelines after master has moved — re-trigger with a touch commit
instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:11:17 +00:00
Viktor Barzin
6928ce0be5 Merge remote-tracking branch 'forgejo/master' into wizard/payslip-ingest-ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-13 00:03:29 +00:00
Viktor Barzin
5d236c2352 payslip-ingest: image base forgejo -> ghcr, ghcr-credentials pull secret, cron to :latest+Always
Prep for moving payslip-ingest's image build off-infra to GitHub Actions ->
ghcr.io (ADR-0002 wave 2, issue #24). One stack commit before onboarding:

- image base repointed forgejo.viktorbarzin.me/viktor/payslip-ingest ->
  ghcr.io/viktorbarzin/payslip-ingest (private ghcr package)
- ghcr-credentials imagePullSecrets added on the Deployment AND the
  actualbudget-payroll-sync CronJob pod specs (namespace is already in the
  kyverno sync-ghcr-credentials allowlist; secret verified present)
- the CronJob's SHA pin is retired: terragrunt image_tag 4f70681d -> latest
  plus explicit imagePullPolicy Always on the cron container, per the fleet
  convention for owned-app CronJobs — one less set-image target, and the
  cron can never go back to pulling the dead Forgejo tag

The Deployment keeps KEEL_IGNORE_IMAGE; its concrete :sha8 tag is set by
the Woodpecker deploy pipeline after each GHA build.

Closes: nothing yet — the repo-side onboarding (offinfra-onboard) follows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:03:11 +00:00
Viktor Barzin
29c6925031 instagram-poster: image base forgejo->ghcr + ghcr-credentials pull secret
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Prep for migrating instagram-poster off in-cluster Woodpecker builds to
GitHub Actions -> ghcr.io (ADR-0002, issue #23, PRIVATE-repo path).
Viktor asked for the wave-2 migration of instagram-poster per the wave-1
retro recipe: before onboarding, the stack must (a) carry the
ghcr-credentials imagePullSecret on the Deployment so the cluster can
pull the private ghcr image, and (b) repoint the image base from
forgejo.viktorbarzin.me/viktor to ghcr.io/viktorbarzin.

The Deployment image is KEEL_IGNORE_IMAGE (ignore_changes), so this
apply does NOT roll the pod to a not-yet-existing ghcr image — the live
forgejo-built :da5b4191 keeps running until the first GHA build POSTs
the Woodpecker deploy. The three CronJobs run curlimages/curl (public
DockerHub), not the app image, so they need neither the pull secret nor
a repoint. registry-credentials stays for the transition window.

Closes: nothing (stack prep only; repo onboarding follows)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:02:04 +00:00
Viktor Barzin
72b5843e4b publish-gate: exclude package-lock + beads tracker from email heuristic; beadboard image base -> ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
infra#17: the gate flagged npm deprecation boilerplate (package-lock.json
escapes the *.lock filter) and the upstream fork author's email in tracked
.beads data — both already-public upstream content, ruled false positives.
Lock files excluded properly; .beads moved to the eyeball inventory.
beads-server stack: beadboard image base repointed (deployment image is
KEEL-ignored; no CronJobs use it).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:52:07 +00:00
Viktor Barzin
57ffd0ed8d Merge remote-tracking branch 'forgejo/master' into wizard/freedify-mig
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 23:37:19 +00:00
Viktor Barzin
c16fe56180 freedify: image base forgejo registry -> ghcr (ADR-0002)
Freedify builds moved off-infra per issue #22: GitHub Actions on the
ViktorBarzin/freedify mirror now builds and pushes the public image
ghcr.io/viktorbarzin/freedify, and the Woodpecker deploy pipeline
(repo 202) rolls :sha8 via kubectl set image. Both factory deployments
(music-viktor, music-emo) now seed from ghcr instead of the retired
in-cluster Forgejo build, and the container image joins lifecycle
ignore_changes (KEEL_IGNORE_IMAGE) so terraform applies do not revert
the deployed :sha8. Landed after the first GHA push so ghcr :latest
already existed when this repoint applied. Public package - no pull
secret needed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:37:10 +00:00
Viktor Barzin
9f742b544c kms: image base forgejo registry -> ghcr (ADR-0002 infra#21)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
kms-website moves off in-cluster Woodpecker builds to GHA -> ghcr.
The kms-web-page deployment image is ignore_changes'd (CI sets the live
tag), so this repoint only governs future creates; package is PUBLIC so
no pull secret is wired. No CronJobs in this stack.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:30:07 +00:00
Viktor Barzin
fb88440ec4 ci-pipeline-health: billing moved to the enhanced usage endpoint
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The legacy /settings/billing/actions endpoint now returns 410; sum
Minutes usageItems from /settings/billing/usage instead (found during
the infra#16 retro: June-to-date = 420/2000).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:24:18 +00:00
Viktor Barzin
12bdd06f74 kyverno: force_new on sync-ghcr-credentials — generate rules are immutable
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 138: the validate-policy webhook denies in-place edits of a
generate rule (allowlist additions). force_new = delete+recreate;
generated secrets survive and generateExisting re-adopts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:18:15 +00:00
Viktor Barzin
6b0d42c7bc publish-gate + tuya-bridge ghcr cutover prep (ADR-0002 infra#15)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
publish-gate: gitleaks + trufflehog (full history) + PII heuristics;
CLEAN verdict gates any public flip, DIRTY = stays private. tuya-bridge:
ghcr-credentials pull secret + image base -> ghcr; namespace added to
the ghcr-credentials allowlist as a safety net (new ghcr packages
default PRIVATE even from public repos — prune after visibility flip).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:12:02 +00:00
Viktor Barzin
54dfaf6edc job-hunter: image base forgejo registry -> ghcr (ADR-0002)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
CronJobs track :latest via the TF literal (unlike the ignore_changes'd
deployment), so they kept pulling the dead Forgejo image after the
GHA/ghcr cutover — repoint the stack's image base.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:06:54 +00:00
Viktor Barzin
51682ee939 offinfra-onboard: require clean clone + ff to forgejo master first [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:00:55 +00:00
Viktor Barzin
09bb0b50a1 offinfra-onboard: forgejo token fallback to ~/.git-credentials [ci skip]
job-hunter's clone uses the credential-store helper (no token embedded
in the remote URL, unlike f1-stream).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:59:32 +00:00
Viktor Barzin
1c41781996 job-hunter: ghcr-credentials pull secret on deployment + CronJobs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ADR-0002 wave 1 (infra#14): job-hunter's image moves to private ghcr;
the deployment AND both :latest CronJobs need the Kyverno-cloned pull
secret.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:56:48 +00:00
Viktor Barzin
6f41de71fa offinfra-onboard: normalize Woodpecker repo to untrusted [ci skip]
Trusted repos get netrc injected into every step container; the
non-root bitnami/kubectl deploy step dies with '//.netrc: Permission
denied' (hit live on f1-stream's reactivated old-era repo 10, which
carried trusted=true; tripit 167 is untrusted and works).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:32:08 +00:00
Viktor Barzin
beac1b57a3 offinfra-onboard: re-activate inactive Woodpecker registrations [ci skip]
Hit live on f1-stream: the old GHA-era ViktorBarzin/f1-stream
registration (repo 10) existed but was deactivated; the lookup matched
it and skipped registration, leaving the deploy POST pointed at an
inactive repo. Now checks .active and re-activates in place via
forge_remote_id.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:28:03 +00:00
Viktor Barzin
baff3d7477 offinfra-onboard: per-repo GHA->ghcr migration tool + f1-stream ghcr pull secret
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ADR-0002 tracer bullet (infra#13), per Viktor's go-ahead. Idempotent
script: GitHub mirror repo (create/unarchive/visibility), GHA secrets
via gh, Forgejo push-mirror (sync_on_commit) + initial sync, Woodpecker
mirror registration, renders build.yml/deploy.yml from templates
(single-manifest provenance:false, svu semver to Forgejo, ghcr keep-10
retention, Slack notify-failure, manual-event deploy), removes the old
in-cluster build pipeline, commits on the Canonical side. f1-stream
stack gains the ghcr-credentials imagePullSecret (first consumer).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:21:22 +00:00
Viktor Barzin
3138a0a040 Merge remote-tracking branch 'forgejo/master' into wizard/breakglass
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
2026-06-12 21:41:58 +00:00
Viktor Barzin
32cf75635f claude-breakglass: in-cluster warm break-glass UI for the devvm
Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster
healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix,
and power-cycle VM 102 via the Proxmox host if needed. App half landed in the
claude-agent-service repo.

New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO
syncs only its key, so the pod has zero direct Vault access). Hardened to
survive the pressure it exists to fix: priorityClassName tier-0-core, broad
node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always.
auth="required" ingress so it rides the Authentik resilience proxy and stays
reachable via the basic-auth fallback during an auth-stack outage. Runs the
shared claude-agent-service image with the breakglass entrypoint.
files/breakglass-pve is the PVE forced-command (status|forensics|reset|stop|
start|cycle on VM 102, forensics-first).

Isolation: the shared claude-agent pod's terraform-state Vault policy is
explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a
prompt-injected agent on that pod can't read the root-on-devvm key.

traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls
when the emergency basic-auth password rotates (it's a subPath mount that
doesn't auto-update) — regenerated this session so Viktor has a known
emergency credential, which the auth-stack-outage failure domain requires.

Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure,
incl. the per-host from= NAT quirks) and a security.md note recording the two
new privileged footholds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:40:17 +00:00
Viktor Barzin
1eee2d6eb6 Merge remote-tracking branch 'forgejo/master' into wizard/tripit-sub-mode
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 21:17:09 +00:00
Viktor Barzin
42cd7d8272 tripit: flip AUTH_MODE to hybrid + OTA bundle env (Android Shell live)
The 81a816f7 image (hybrid auth + OTA endpoints) is rolled out, so the
env can flip: AUTH_MODE=hybrid with the tripit-app OIDC knobs makes the
bearer-only tripit-api host actually authenticate Shell logins (browser
cookie path unchanged); BUNDLE_PUBLIC_BASE pins the signed OTA zip URLs
to that host; BUNDLE_TOKEN_SECRET joins the tripit-secrets ES (value
already written to Vault secret/tripit). Part of the Android APK work
(tripit #50/#51).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:16:14 +00:00
Viktor Barzin
02785987dd ci-pipeline-health: image :latest+Always — registry lost the 2fd7670d tag
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The sha tag other claude-agent-service CronJobs pin no longer exists in
the Forgejo registry (node caches mask it); fresh pulls 404. Follow the
owned-app CronJob convention until infra#19 moves this image to ghcr.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:06:20 +00:00
Viktor Barzin
765cfe803f tripit: tripit-app provider issues sub = user email (hybrid-auth identity fix)
Review of tripit slice #50 caught that the provider's default
sub_mode (hashed_user_id) would make Shell JWTs carry a sub that
never matches the email-keyed prod user rows - first app login
would either 500 in placeholder reconciliation or split the user's
identity. sub_mode = user_email makes bearer and forward-auth
resolve the same row. Part of the Android APK work (tripit #50).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:00:33 +00:00
Viktor Barzin
624747cc46 workstation: default Claude model fable-5 → opus-4-8 for all devvm users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to make Opus the default for new Claude sessions — his own,
Emo's, and Anca's — because Fable 5 is overkill for most daily tasks.

The org-wide default lives in the managed-settings `model` key, which
overrides each user's personal ~/.claude/settings.json model (and no
per-user launcher passes --model anymore). So flipping this one value
makes every user's NEXT session default to Opus 4.8; current sessions
keep their model, and a per-session /model still overrides as before.
The hourly t3-provision-users reconcile deploys it to
/etc/claude-code/managed-settings.json within the cycle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:59:03 +00:00
Viktor Barzin
bd0cb71f17 tts: TCP probes — http liveness killed the server mid-synthesis
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The devnen server runs chunked synthesis as a blocking call inside its
async handler, so the event loop (and every HTTP probe) hangs for the
whole multi-minute story. Kubelet's http liveness probe (1s timeout)
then killed the container mid-story (exit 137, twice within 10 min of
the first real drain), which reset the engine, so every following pass
started cold and tripit's 120s synthesis budget could never be met —
the queue would never drain.

TCP probes keep the meaning that matters: uvicorn binds 8004 only
after the model finishes loading in the lifespan hook, so readiness
still gates 'model loaded', while a GPU-busy server is left alive.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:57:28 +00:00
Viktor Barzin
30ff8f2db3 ci: diff changed stacks against CI_PREV_COMMIT_SHA, not HEAD~1
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
HEAD~1 on a merge commit is the feature-branch parent, so the
changed-stack detection diffed the WRONG side and silently skipped the
stacks the push actually changed — pipeline 128 'succeeded' without
applying the new ci-pipeline-health stack. Use the push's true
before-state (CI_PREV_COMMIT_SHA) when it resolves, HEAD~1 as fallback
(first build / shallow edge cases). Also touches the ci-pipeline-health
stack so THIS push applies it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:50:43 +00:00
Viktor Barzin
fb8b6aa2f3 Merge remote-tracking branch 'forgejo/master' into wizard/ci-pipeline-health
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
2026-06-12 20:45:30 +00:00
Viktor Barzin
d02ca4f2db ci-pipeline-health: daily sweep of the off-infra CI chain (ADR-0002)
Viktor asked to monitor the pipelines closely as builds move off-infra
(PRD infra#10). New aux stack: daily 07:30 UTC CronJob on the
claude-agent-service image running a deterministic shell sweep —
GitHub Actions failures/stuck runs across owned repos, Woodpecker
pipeline failures, GHA free-tier minutes burn. Healthy = one quiet
Slack line; issues = Slack alert + comment on infra#10. In-cluster
(not a cloud routine) because Vault + the Woodpecker token are
LAN-only. Secrets via ExternalSecret (github_pat deliberately, not the
ghcr_pull_token alias — a scoped packages-only rotation couldn't read
Actions runs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:45:28 +00:00
Viktor Barzin
5bcad2bf34 Merge forgejo/master into wizard/emu-window
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 20:44:40 +00:00
Viktor Barzin
e5291f97c8 android-emulator: api36-v8 — auto-fit emulator window to the display
noVNC scaled correctly but the emulator's Qt window opened small (~411x914)
and floated inside the 1080x2280 Xvfb, so the user saw a tiny phone in a sea
of black. v8 bakes a background fitter (wmctrl+xdotool) that, after boot,
auto-OKs the one-shot nested-virtualization warning dialog, fills the phone
window to the display, and parks the control strip off the right edge —
re-running to catch window/dialog timing then maintaining every 30s. Applied
live to the running pod already; this makes it survive the next wake.
2026-06-12 20:44:29 +00:00
Viktor Barzin
98f1f7fc24 tts: seed extension-less voice copies so tripit's bare stems resolve
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First live drain failed all 27 queued narrations with 404 'Voice file
'Emily' not found': tripit's catalog sends bare stems (Emily) but the
devnen server resolves the voice as a literal filename (Emily.wav) in
predefined_voices_path then reference_audio — no stem fallback exists
upstream (HEAD == our pinned sha), and symlinks can't bridge it because
safe_resolve_within() resolves them out of the containment check.

New initContainer on the chatterbox deployment copies the 28 bundled
voices to /data/reference_audio/<stem> on the PVC (second lookup path).
Same image as the main container so no extra pull; idempotent; ~15 MB.
Verified live before committing: an extension-less copy synthesizes
200 audio/mp3 (5.3s warm) where voice=Emily 404'd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:41:51 +00:00
Viktor Barzin
bb0f9f59ef docs: CI-compute doctrine — leverage external infra for builds AND tests [ci skip]
Viktor's standing instruction (2026-06-12): lean on external infra as
much as possible for CI — builds, running tests, lint, releases all on
GitHub Actions hosted runners, never on cluster nodes; in-cluster
pipelines only for cluster-touching steps (deploys, terragrunt,
certbot). Also: watch any triggered pipeline chain to completion and
fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md
CI sections (ADR-0002 companions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:39:27 +00:00
Viktor Barzin
97dcf49b8e monitoring: reduce Slack alert noise (alert-on-change + daily digest)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for the
  alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
  (no pip/apk at runtime -> none of the per-run disk-write footprint that
  disabled status-page-pusher). Reuses the existing Alertmanager Slack
  webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg — two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
  pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
  annotation, so notification volume was unmeasurable — now we can verify
  this change worked (alertmanager_notifications_total et al.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:35:56 +00:00
Viktor Barzin
87a8a393fe tts: demand gate treats a failed queue probe as no-action, not queue-empty
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was canceled
The demand-gate script defaulted an unreadable/unparseable tts-queue
response to QUEUED=0, which the scale-down arm reads as 'queue empty'.
One transient curl failure at 20:30 UTC today idled chatterbox-tts to 0
the very minute the pod first went Ready, with 27 narrations still
queued (tripit kept logging tts_unreachable). Probe failure now exits
without touching replicas: scale-up still needs a real count > 0, and
scale-down now needs an explicitly parsed 0. Worst case after this
change is a stale-up deployment idling until the 06:00 window-down.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:35:02 +00:00
Viktor Barzin
18f524c265 docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip]
Same-change doc sync for infra#12: the tripit-ns-scoped interim secret
paragraph described the pre-ClusterPolicy state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:31:55 +00:00
Viktor Barzin
68c7be8653 traefik: non-merge apply trigger (error-pages buffer fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 20:31:24 +00:00
Viktor Barzin
f3cb5661a6 Merge forgejo/master into wizard/errorpages-buffer
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 20:31:22 +00:00
Viktor Barzin
aa1fccb883 traefik/error-pages: READ_BUFFER_SIZE 5KB -> 128KB — 431s for cookie-heavy users
Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a
routed host during a brief 503 window, and sees it periodically across
ingresses: Authentik forward-auth accumulates one authentik_proxy_*
cookie per protected service on .viktorbarzin.me, so established
browsers carry multi-10KB Cookie headers — over error-pages' 5120-byte
default read buffer, which doubles as its max header size. Any error-
middleware dispatch then 431'd instead of rendering the styled page.
Same root cause class as the 2026-06-01 large_client_header_buffers
fixes on bot-block-proxy and auth-proxy-config; error-pages was the
remaining small-buffer backend on the shared chain.
2026-06-12 20:31:01 +00:00
Viktor Barzin
523e18c127 kyverno: sync-ghcr-credentials to private-ghcr namespaces; tripit consumes the clone
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12)
without waiting on a UI-minted token: GitHub has no token-mint API, so
the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token —
swap the alias value when a scoped token is ever minted) becomes the
platform credential. Because the PAT is broad, the new ClusterPolicy
clones ghcr-credentials ONLY to an explicit allowlist of namespaces
running private ghcr images (tripit, f1-stream, job-hunter,
instagram-poster, payslip-ingest, wealthfolio, fire-planner,
recruiter-responder) — NOT cluster-wide like registry-credentials.
generateExisting+synchronize so existing namespaces get the clone.
tripit's hand-declared ns-scoped secret is removed in favour of the
clone (imagePullSecrets now reference the name literally).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:28:11 +00:00
Viktor Barzin
12fd1fcbc9 android-emulator: api36-v7 — noVNC defaults: scaled view, autoconnect, reconnect
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint
now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect,
reconnect with 2s delay, shared) so every load behaves right without URL
params, and viewers self-heal across pod restarts/wakes. Already applied
live to the running pod; this makes it survive the next wake.
2026-06-12 20:18:26 +00:00
Viktor Barzin
ff08c685cd tts: image is TF-owned — drop the copied KEEL ignore so the GHCR switch applies
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The deployment's lifecycle.ignore_changes still ignored the container
image (copied from the keel-managed tripit pattern), which would have
made the previous commit's GHCR switch a silent no-op on apply. Keel
cannot poll the private GHCR repo anyway; the pinned sha tag is
terraform's to manage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:50 +00:00
Viktor Barzin
dbb4572112 tts: pull Chatterbox from GHCR — the Forgejo-registry copy is unpullable
Some checks are pending
ci/woodpecker/push/build-cli Pipeline is pending
ci/woodpecker/push/default Pipeline is pending
Viktor reports the voice still isn't from the TTS service — correct:
zero story_audio rows exist; the pod has sat in ImagePullBackOff since
the first window because the 2026-06-09 Forgejo-registry push has a
corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and identical
digests can't heal corrupt registry storage. The off-infra GHA rebuild
(tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now
lives in private GHCR: switch the image there, pin the upstream-sha tag,
and add the vault-backed ghcr-credentials pull secret (mirrors
stacks/tripit). tripit's drain loop has 27 narrations queued and picks
them up the moment the pod goes Ready.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:19 +00:00