The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts.
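The retry described above could look roughly like the sketch below. It is not the sync scripts' actual code: `connect_and_login`, `make_client`, and the client's `login()` / `disconnect()` methods are hypothetical stand-ins, and the builtin `TimeoutError` stands in for `socketio.exceptions.TimeoutError` (pass the real exception via `retry_on`).

```python
import time


def connect_and_login(make_client, attempts=5, backoff_s=15.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Return a logged-in client, retrying transient connect/login timeouts.

    make_client is a hypothetical factory that connects and returns a client
    object exposing login() and disconnect().
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        client = None
        try:
            client = make_client()
            client.login()
            return client
        except retry_on as exc:
            last_exc = exc
            # Drop the half-open client so the next attempt starts clean.
            if client is not None:
                try:
                    client.disconnect()
                except Exception:
                    pass
            # Back off between attempts, but not after the final one.
            if attempt < attempts:
                sleep(backoff_s)
    raise last_exc
```

With the defaults, a job only fails after five consecutive timeouts, so the worst case adds four 15s waits before the failure surfaces.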
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml
was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled
via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD
section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from
'already on GHA' to the infra-owned private-ghcr images, and add it to the
PRIVATE ghcr allowlist roster. Completes the no-local-builds migration.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
k8s-portal was the last in-cluster image build; it now builds on GHA and
pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo
default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno
allowlist (clones the ghcr-credentials Secret into the namespace) and
reference that secret via imagePullSecrets on the deployment — same wiring
as tripit/recruiter-responder. Completes the no-local-builds migration so
nothing builds container images on the cluster anymore (ADR-0002).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs)
from the running pod. A parent .gitignore force-ignored package.json, so the
git source tree was incomplete and the image only ever built manually. Now
reproducible on GHA (ADR-0002 no-local-builds).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
package-lock.json was never committed to either lineage — npm ci needs it,
so the build only ever worked from a manual devvm build with a local lock.
npm install resolves from package.json, unblocking the GHA build (ADR-0002).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The last in-cluster image build. GHA build-k8s-portal.yml builds
ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile
dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed
to ghcr (ignore_changed); .woodpecker/k8s-portal.yml deleted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Inline note on why the four backup CronJobs moved 10s->600s (bda1bdcb): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. bda1bdcb rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 3Gi->6Gi bump in ff3cc44a was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi.
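The quota arithmetic above can be checked with a small sketch. It is a simplified model, not the admission controller's logic: the helper names are hypothetical, values are in Gi, and it assumes no other pods in the namespace hold memory requests.

```python
def quota_admits(request_gi, hard_gi, used_gi=0.0):
    """True if a new pod's memory request still fits under the quota."""
    return used_gi + request_gi <= hard_gi


def utilisation_pct(request_gi, hard_gi):
    """Share of the quota consumed by the request, in percent."""
    return 100.0 * request_gi / hard_gi


print(quota_admits(6, 4))      # False: the 6Gi request is rejected (FailedCreate)
print(quota_admits(4, 4))      # True: 4Gi is the most the 4Gi quota allows
print(utilisation_pct(4, 4))   # 100.0
```

The last line shows why a 4Gi request against a 4Gi cap is schedulable but leaves the quota completely full.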
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Forgejo OOMKilled twice on 2026-06-13 at the 3Gi cap (exit 137), briefly taking the git remote and OCI registry down and spiking ingress TTFB to 4.7s and the 4xx rate to 51%. Steady-state is ~2.2Gi but it spiked into the cap (true demand above 3.2Gi). The 2026-06-09 bump to 3Gi was sized for tripit buildkit registry pushes, but that driver is gone now that the Forgejo registry was frozen and emptied today (ADR-0002, images on ghcr), so the spike is git ops / the integrity-probe catalog walk / a possible leak. 6Gi gives headroom on the critical git backbone while we watch whether working-set keeps climbing (which would indicate a leak).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The daily full PostgreSQL backup silently skipped its 2026-06-13 00:00 run, leaving the last full dump 37h old and firing the critical PostgreSQLBackupStale alert. Root cause: startingDeadlineSeconds was 10s on all four dbaas backup CronJobs, so when the CronJob controller was more than 10s late to the midnight tick (many IO-heavy backups all fire at 00:00, the known etcd-starvation window) the run was dropped entirely instead of starting late. 600s lets a brief controller lag still launch the job. Applied to all four (mysql + pg, full + per-db) since they share the footgun and the midnight contention.
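The effect of the deadline can be illustrated with a simplified model of the check. This is a sketch, not the CronJob controller's code: `run_would_start` is a hypothetical helper, and the 25s lag is an invented example value.

```python
from datetime import datetime, timedelta


def run_would_start(scheduled, observed, starting_deadline_s):
    """True if a run noticed at `observed` is still within its start deadline."""
    lag_s = (observed - scheduled).total_seconds()
    return lag_s <= starting_deadline_s


midnight = datetime(2026, 6, 13, 0, 0, 0)
late_tick = midnight + timedelta(seconds=25)  # hypothetical controller lag

print(run_would_start(midnight, late_tick, 10))   # False: run is dropped
print(run_would_start(midnight, late_tick, 600))  # True: run starts late
```

Under this model, a 600s deadline tolerates up to ten minutes of controller lag before a run is skipped.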
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.
This brings the docs to the completed reality (closes #33):
- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
CI pipeline). Description/comment text only — no stack logic changed.
Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
openclaw's install-nextcloud-todos-plugin init container still pulled the
forgejo nextcloud-todos image (it would ImagePullBackOff on restart once the
forgejo registry is wiped); repointed to ghcr :latest. The f1-stream stack
base (KEEL_IGNORE'd, live already on ghcr via set-image) is repointed for
fresh-create correctness. Clears the last LIVE forgejo viktor/* refs before
the registry reclaim.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
infra-ci now builds on GHA → ghcr and the ghcr-based apply is PROVEN
(pipeline 165 ran terragrunt apply in the ghcr image). Removing the
Woodpecker build-ci-image.yml (clean cut). The breakglass tarball is
preserved as a MANUAL Woodpecker job pulling ghcr (public) → registry VM;
infra-ci on ghcr is external + node-cached, so the Forgejo-down rationale
for the old auto-tarball is moot — this is belt-and-braces DR.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The terragrunt apply step (default.yml) and drift-detection now pull
ghcr.io/viktorbarzin/infra-ci:latest (GHA-built, verified toolchain:
tf 1.5.7 / tg 0.99.4 / sops / kubectl 1.34 / vault / git-crypt). ghcr is
public + proven pullable in-cluster. build-ci-image.yml (forgejo build)
KEPT as the fallback copy until this ghcr-based apply is proven, so a
revert restores the working forgejo image if needed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
novnc's image was ignore_changed (KEEL_IGNORE) but nothing manages its
tag (keel.sh/policy=never), so the earlier forgejo->ghcr repoint never
took. Removing container[1].image from ignore_changes lets terragrunt
own novnc=ghcr:latest and roll it. container[0]/[2] (pinned playwright)
stay ignored.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both now built by GHA → public ghcr. Repoint stack image bases
forgejo→ghcr:latest (terragrunt-managed, imagePullPolicy Always picks up
rebuilds). android var default api36-v8 -> latest.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Flips the planning workspace's Stay cover photos from the fake provider to live Wikipedia lead-image fetches (downloaded into STORAGE_DIR, served by the backend, editable per Stay). Part of the new-trip flow feature: every picked destination city gets a banner-ready cover. HOLD-ORDER: pushed only after the tripit image containing CityImageMode.wikipedia rolled out.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Service was already scaled 0/0 and unused (Viktor: 'not used anymore').
Live resources destroyed via scripts/tg destroy (10 resources: deployment,
namespace, service, anubis-travel + PDB/cm/svc/secret, ingress, TLS).
Removing the stack dir; old Woodpecker build (repo 5) deactivated
separately. The harmless legacy 'travel' CNAME->apex in config.tfvars is
left (now 404s; removing it would trigger a full-platform apply).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Formalizing x402-gateway CI (was a manual no-CI image). The deployment
lives in the traefik module; its image was NOT in ignore_changes, so a
set-image deploy would be reverted on the next traefik apply — added it
(KEEL_IGNORE_IMAGE). Base repointed to ghcr:latest; the GHA deploy
set-images the :sha8. Public ghcr package = no pull secret. Inert on the
live pod (image now ignored); rolling cutover keeps forwardAuth up.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
claude-memory-mcp's Dockerfile is at docker/Dockerfile, not repo root
(infra#20 build failed: 'open Dockerfile: no such file or directory').
build.yml template gains file: {{DOCKERFILE}} (default ./Dockerfile).
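The placeholder-with-default behaviour could be sketched as follows. The template fragment and the `render_build_yml` helper are hypothetical; only the `{{DOCKERFILE}}` placeholder and its `./Dockerfile` default come from the change above.

```python
def render_build_yml(template, dockerfile=None):
    """Substitute {{DOCKERFILE}}, falling back to ./Dockerfile when unset."""
    return template.replace("{{DOCKERFILE}}", dockerfile or "./Dockerfile")


fragment = "file: {{DOCKERFILE}}"
print(render_build_yml(fragment))                       # file: ./Dockerfile
print(render_build_yml(fragment, "docker/Dockerfile"))  # file: docker/Dockerfile
```

Repos with a root-level Dockerfile need no extra configuration; only outliers such as claude-memory-mcp pass an explicit path.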
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GHA now builds+pushes ghcr.io/viktorbarzin/claude-memory-mcp (public).
Image is KEEL_IGNORE_IMAGE (set-image managed), so this apply is inert
on the live pod; the stale :17 default literal is corrected to :latest.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public
package, anonymous pulls). Repointed: claude-agent-service (deployment +
git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health,
beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest —
the Forgejo registry lost that sha; node caches were the only thing
keeping those CronJobs alive). publish-gate: vendor-contact emails
(licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
broker-sync is a CronJob-only consumer (no deployment): new --no-deploy
mode skips Woodpecker registration and renders build.yml without the
deploy job — :latest+Always CronJobs pick up builds on the next run.
wealthfolio stack: ghcr-credentials pull secret + image base repoint.
The wealthfolio-sync image regains a reproducible rebuild path.
Closes: code-62tm
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Second half of the recruiter-responder off-infra migration: the first GHA
build has published ghcr.io/viktorbarzin/recruiter-responder:{1d99a8d5,latest},
so the openclaw plugin-install init container can now follow the ghcr
:latest. The forgejo-side build pipeline was removed by the onboarding
commit, so the old forgejo :latest tag is frozen and would silently serve
stale plugin code. Deferred from the first commit on purpose - flipping it
before the package existed would have wedged the openclaw rollout on
ImagePullBackOff.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Migrating recruiter-responder off in-cluster Woodpecker builds: GHA will
build and push ghcr.io/viktorbarzin/recruiter-responder (PRIVATE package).
This commit lands the pull-side prerequisites BEFORE the first off-infra
build fires:
- stacks/recruiter-responder: image base forgejo -> ghcr (inert on the live
Deployment - both containers are ignore_changes'd; the Woodpecker deploy
moves the tag) + ghcr-credentials imagePullSecrets on the Deployment
(covers the recruiter-responder container AND the alembic-migrate init
container, which share the image).
- stacks/openclaw: ghcr-credentials imagePullSecrets on the openclaw
Deployment - its install-recruiter-plugin init container consumes the
:latest tag of this image. The image ref itself flips to ghcr in a
follow-up once the first GHA build has created the package (flipping now
would ImagePullBackOff on a not-yet-existing package and wedge the apply).
- stacks/kyverno: allowlist openclaw in sync-ghcr-credentials so the pull
secret is cloned into that namespace too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment-only touch so the changed-stack detection applies
stacks/fire-planner from the current master tree. Pipeline 150 (commit
f18dfa4c — the ghcr image base + ghcr-credentials migration for issue
#26) was auto-killed when the concurrent nextcloud-todos push superseded
it, and pipeline 151 diffed from f18dfa4c onward so the fire-planner
stack changes were never applied (cronjobs still point at the forgejo
image, pod specs lack ghcr-credentials).
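A comment-only touch works because the detection keys off changed paths, not the kind of change. A minimal sketch, assuming the pipeline feeds the diffed file paths into something like the hypothetical `changed_stacks` helper below (the real detection logic is not shown in this log):

```python
def changed_stacks(paths):
    """Return the sorted stack names that own at least one changed file."""
    stacks = set()
    for path in paths:
        parts = path.split("/")
        # Only files inside stacks/<name>/... mark a stack as changed.
        if len(parts) >= 3 and parts[0] == "stacks":
            stacks.add(parts[1])
    return sorted(stacks)


diff = [
    "stacks/fire-planner/main.tf",
    "stacks/fire-planner/variables.tf",
    "docs/architecture/ci-cd.md",
]
print(changed_stacks(diff))  # ['fire-planner']
```

If the diff range starts after the commit that changed a stack, that stack never appears in the result, which matches the skipped apply described above.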
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The nextcloud-todos build moved off-infra: GHA builds on the public
GitHub mirror and pushes ghcr.io/viktorbarzin/nextcloud-todos (public
package, anonymous pulls); Woodpecker repo 207 is deploy-only. First
ghcr image (:19c22d8c) is already built, deployed and rolled out, so
this repoint lands after the image exists. Both deployment image refs
(main + alembic-migrate init) are ignore_changes'd — no live churn,
the base matters only on resource (re)create. Old image was pulled
from a Forgejo registry package that no longer exists (pods survived
on node image cache only).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Migrating fire-planner off in-cluster Woodpecker builds to GitHub
Actions -> ghcr.io (ADR-0002, issue #26). The image base moves
forgejo.viktorbarzin.me/viktor/fire-planner ->
ghcr.io/viktorbarzin/fire-planner (a PRIVATE ghcr package), so the
deployment, all three cronjobs (recompute, col-refresh,
examples-weekly) and the examples bulk job gain the ghcr-credentials
imagePullSecret (the kyverno sync-ghcr-credentials allowlist already
covers the fire-planner namespace). registry-credentials stays
alongside so the currently-running sha-pinned forgejo image can still
be pulled until the first ghcr deploy lands; the cronjob images are TF
literals and flip to ghcr :latest on this apply.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Infra pipelines that were restarted after master had moved diffed in REVERSE
and re-applied stale trees (pipeline 148 reverted payslip-ingest's fresh
ghcr config; repaired by the wave-2 agent). Only trust
CI_PREV_COMMIT_SHA when it is an ancestor of HEAD. publish-gate:
viktorbarzin@meta.com is the owner's own work email (same class as the
allowlisted personal domain), not blockable PII — unblocks infra#18.
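The ancestor guard on CI_PREV_COMMIT_SHA could be sketched as below. The helper names and the `HEAD~1` fallback are assumptions for illustration; `git merge-base --is-ancestor` is the real git command and exits 0 when the first commit is an ancestor of the second.

```python
import subprocess


def git_is_ancestor(candidate, head="HEAD", cwd=None):
    """True if `candidate` is an ancestor of `head` according to git."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", candidate, head],
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def choose_diff_base(prev_sha, fallback="HEAD~1", is_ancestor=git_is_ancestor):
    """Use prev_sha as the diff base only when it precedes the checkout."""
    if prev_sha and is_ancestor(prev_sha):
        return prev_sha
    # A missing or non-ancestor SHA would produce a reverse diff; fall back.
    return fallback
```

In the pipeline 148 case, the inherited SHA was the newer branch head, so it is not an ancestor of the older checkout and the guard would reject it.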
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment-only touch of both stacks so the changed-stack detection applies
them from the current master tree. Two pipelines went wrong in sequence
during the parallel ADR-0002 wave-2 migrations (issues #23/#24):
- pipeline 146 (instagram-poster stack prep, commit 29c69250) was
auto-killed when the concurrent payslip-ingest push superseded it, so
its apply never ran;
- restarting it as pipeline 148 inherited CI_PREV_COMMIT_SHA = the NEW
branch head (6928ce0b) with the OLD checkout (29c69250) — a reverse
diff that re-applied stacks/payslip-ingest from the pre-migration
tree, stripping the ghcr image base + ghcr-credentials pull secrets
that pipeline 147 had just applied (2 resources reverted).
This commit restores the committed payslip-ingest config exactly as
issue #24 landed it and finally applies the instagram-poster ghcr prep
from issue #23. Lesson encoded in the comments: do not restart killed
infra pipelines after master has moved — re-trigger with a touch commit
instead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Prep for moving payslip-ingest's image build off-infra to GitHub Actions ->
ghcr.io (ADR-0002 wave 2, issue #24). One stack commit before onboarding:
- image base repointed forgejo.viktorbarzin.me/viktor/payslip-ingest ->
ghcr.io/viktorbarzin/payslip-ingest (private ghcr package)
- ghcr-credentials imagePullSecrets added on the Deployment AND the
actualbudget-payroll-sync CronJob pod specs (namespace is already in the
kyverno sync-ghcr-credentials allowlist; secret verified present)
- the CronJob's SHA pin is retired: terragrunt image_tag 4f70681d -> latest
plus explicit imagePullPolicy Always on the cron container, per the fleet
convention for owned-app CronJobs — one less set-image target, and the
cron can never go back to pulling the dead Forgejo tag
The Deployment keeps KEEL_IGNORE_IMAGE; its concrete :sha8 tag is set by
the Woodpecker deploy pipeline after each GHA build.
Closes: nothing yet — the repo-side onboarding (offinfra-onboard) follows.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Prep for migrating instagram-poster off in-cluster Woodpecker builds to
GitHub Actions -> ghcr.io (ADR-0002, issue #23, PRIVATE-repo path).
Viktor asked for the wave-2 migration of instagram-poster per the wave-1
retro recipe: before onboarding, the stack must (a) carry the
ghcr-credentials imagePullSecret on the Deployment so the cluster can
pull the private ghcr image, and (b) repoint the image base from
forgejo.viktorbarzin.me/viktor to ghcr.io/viktorbarzin.
The Deployment image is KEEL_IGNORE_IMAGE (ignore_changes), so this
apply does NOT roll the pod to a not-yet-existing ghcr image — the live
forgejo-built :da5b4191 keeps running until the first GHA build POSTs
the Woodpecker deploy. The three CronJobs run curlimages/curl (public
DockerHub), not the app image, so they need neither the pull secret nor
a repoint. registry-credentials stays for the transition window.
Closes: nothing (stack prep only; repo onboarding follows)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
infra#17: the gate flagged npm deprecation boilerplate (package-lock.json
escapes the *.lock filter) and the upstream fork author's email in tracked
.beads data — both already-public upstream content, ruled false positives.
Lock files excluded properly; .beads moved to the eyeball inventory.
beads-server stack: beadboard image base repointed (deployment image is
KEEL-ignored; no CronJobs use it).
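Why package-lock.json slips past a `*.lock` filter can be shown with shell-style glob matching. This sketch assumes the gate's filter behaves like Python's `fnmatch`; the `is_lock_file` helper and the extra pattern are hypothetical, not the gate's actual exclusion list.

```python
from fnmatch import fnmatch

# "*.lock" only matches names that END in ".lock".
print(fnmatch("yarn.lock", "*.lock"))          # True
print(fnmatch("package-lock.json", "*.lock"))  # False

# Hypothetical fix: list the npm lockfile explicitly.
LOCK_PATTERNS = ("*.lock", "package-lock.json")


def is_lock_file(path, patterns=LOCK_PATTERNS):
    """True if the file name matches any lock-file pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in patterns)


print(is_lock_file("web/package-lock.json"))   # True
```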
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Freedify builds moved off-infra per issue #22: GitHub Actions on the
ViktorBarzin/freedify mirror now builds and pushes the public image
ghcr.io/viktorbarzin/freedify, and the Woodpecker deploy pipeline
(repo 202) rolls :sha8 via kubectl set image. Both factory deployments
(music-viktor, music-emo) now seed from ghcr instead of the retired
in-cluster Forgejo build, and the container image joins lifecycle
ignore_changes (KEEL_IGNORE_IMAGE) so terraform applies do not revert
the deployed :sha8. Landed after the first GHA push so ghcr :latest
already existed when this repoint applied. Public package - no pull
secret needed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
kms-website moves off in-cluster Woodpecker builds to GHA -> ghcr.
The kms-web-page deployment image is ignore_changes'd (CI sets the live
tag), so this repoint only governs future creates; package is PUBLIC so
no pull secret is wired. No CronJobs in this stack.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The legacy /settings/billing/actions endpoint now returns 410; sum
Minutes usageItems from /settings/billing/usage instead (found during
the infra#16 retro: June-to-date = 420/2000).
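Summing the minutes could look like the sketch below. The `unitType` and `quantity` field names are assumptions about the response shape, `total_minutes` is a hypothetical helper, and the sample payload is invented for illustration.

```python
def total_minutes(usage_payload):
    """Sum the quantity of every usage item measured in minutes."""
    total = 0.0
    for item in usage_payload.get("usageItems", []):
        # Compare case-insensitively; storage items use other unit types.
        if str(item.get("unitType", "")).lower() == "minutes":
            total += float(item.get("quantity", 0))
    return total


sample = {
    "usageItems": [
        {"unitType": "Minutes", "quantity": 120},
        {"unitType": "Minutes", "quantity": 30},
        {"unitType": "GigabyteHours", "quantity": 7},
    ]
}
print(total_minutes(sample))  # 150.0
```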
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
publish-gate: gitleaks + trufflehog (full history) + PII heuristics;
CLEAN verdict gates any public flip, DIRTY = stays private. tuya-bridge:
ghcr-credentials pull secret + image base -> ghcr; namespace added to
the ghcr-credentials allowlist as a safety net (new ghcr packages
default PRIVATE even from public repos — prune after visibility flip).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CronJobs track :latest via the TF literal (unlike the ignore_changes'd
deployment), so they kept pulling the dead Forgejo image after the
GHA/ghcr cutover — repoint the stack's image base.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
job-hunter's clone uses the credential-store helper (no token embedded
in the remote URL, unlike f1-stream).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0002 wave 1 (infra#14): job-hunter's image moves to private ghcr;
the deployment AND both :latest CronJobs need the Kyverno-cloned pull
secret.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trusted repos get netrc injected into every step container; the
non-root bitnami/kubectl deploy step dies with '//.netrc: Permission
denied' (hit live on f1-stream's reactivated old-era repo 10, which
carried trusted=true; tripit 167 is untrusted and works).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>