Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. * Forgejo integrity probe CronJob (*/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. * Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9.9 KiB
9.9 KiB
Forgejo Registry Consolidation — Plan
Date: 2026-05-07
Status: Approved — execution in progress (Phase 0)
Design: 2026-05-07-forgejo-registry-consolidation-design.md
This is the implementation roadmap for migrating off registry-private
onto Forgejo's OCI registry. See the design doc for problem
statement and rationale. Execution spans 5 phases over ≥3 weeks.
Phase 0 — Prepare Forgejo (1 PR, no cutover risk)
| Task | File / artifact |
|---|---|
| Bump Forgejo memory request+limit 384Mi → 1Gi | infra/stacks/forgejo/main.tf |
Add FORGEJO__packages__ENABLED=true and FORGEJO__packages__CHUNKED_UPLOAD_PATH=/data/tmp/package-upload env vars (defensive — already default in v11) |
infra/stacks/forgejo/main.tf |
| Bump Forgejo PVC 5Gi → 15Gi, auto-resize cap 20Gi → 50Gi | infra/stacks/forgejo/main.tf |
Bump ingress max_body_size = "5g" (wired into ingress_factory as a Buffering middleware) |
infra/stacks/forgejo/main.tf, infra/modules/kubernetes/ingress_factory/main.tf |
Create cluster-puller (read:package), ci-pusher (write:package), and a third cleanup PAT on ci-pusher; store PATs in Vault |
runbook: docs/runbooks/forgejo-registry-setup.md |
Extend registry-credentials Secret with 4th auths entry for forgejo.viktorbarzin.me |
infra/stacks/kyverno/modules/kyverno/registry-credentials.tf |
Add containerd hosts.toml entry redirecting forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 |
infra/stacks/infra/main.tf cloud-init + new infra/scripts/setup-forgejo-containerd-mirror.sh for existing nodes |
Forgejo retention CronJob (0 4 * * *, dry-run for first 7 days) |
new infra/stacks/forgejo/cleanup.tf + infra/stacks/forgejo/files/cleanup.sh |
Forgejo integrity probe CronJob (*/15 * * * *) |
infra/stacks/monitoring/modules/monitoring/main.tf |
| Make existing alerts instance-aware so they cover both registries | infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl |
Smoke test (must pass before declaring Phase 0 done):
docker login forgejo.viktorbarzin.mesucceeds.- Push a hello-world image to
forgejo.viktorbarzin.me/viktor/smoketest:1succeeds. crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1from a k8s node succeeds, using the auto-syncedregistry-credentialsSecret.- A fresh namespace gets the cloned Secret with 4
authsentries. - Delete the smoketest package via API.
- Forgejo integrity probe completes once and pushes metrics.
Phase 1 — Source migration (parallel-safe, no production impact)
For each project the recipe is identical:
git init+ push toforgejo.viktorbarzin.me/viktor/<name>— register in Woodpecker via OAuth.- Add
.woodpecker.ymlbased onpayslip-ingest/.woodpecker.yml. Push step useswoodpeckerci/plugin-docker-buildxwith TWOrepo:entries (dual-push). - Confirm first build pushes to BOTH registries.
Projects (bake clock starts at "all dual-push"):
| Project | Action |
|---|---|
claude-agent-service |
Extract from monorepo to Forgejo. New .woodpecker.yml. |
fire-planner |
Extract from monorepo to Forgejo. New .woodpecker.yml. |
wealthfolio-sync |
Extract from monorepo to Forgejo. New .woodpecker.yml. |
hmrc-sync |
Extract from monorepo to Forgejo. New .woodpecker.yml. |
freedify |
Push from monorepo to Forgejo. New .woodpecker.yml. (Upstream is gone.) |
payslip-ingest |
Already on Forgejo. Add second repo: entry to .woodpecker.yml. |
job-hunter |
Already on Forgejo. Add second repo: entry. |
beadboard |
Push to Forgejo. New .woodpecker.yml. Disable GHA workflow. Don't archive GitHub yet (deferred to Phase 3). |
claude-memory-mcp |
Push to Forgejo. New .woodpecker.yml. |
infra-ci |
Edit .woodpecker/build-ci-image.yml to dual-push. ALSO `docker save |
Break-glass runbook (docs/runbooks/forgejo-registry-breakglass.md)
documents the recovery path.
Phase 2 — Bake (≥14 days)
- No
image=lines change. Pods still pull fromregistry.viktorbarzin.me. - Daily smoke check: pull a recent image from Forgejo as
cluster-puller, verify integrity (HEAD on manifest + each blob). - Bake exit criteria:
- Zero
RegistryManifestIntegrityFailurealerts on Forgejo. - Zero
ContainerNearOOMfor the forgejo pod. - Retention CronJob has run ≥14 times successfully.
- At least one full Sunday GC cycle has elapsed.
- Switch retention CronJob to
DRY_RUN=falseon day 7, observe until day 14.
- Zero
Phase 3 — Cutover (one PR per project, single session)
Order = lowest blast radius first. Each step:
image= flip → kubectl rollout restart → verify pull from Forgejo.
payslip-ingest(infra/stacks/payslip-ingest/main.tf)job-hunter(infra/stacks/job-hunter/main.tf)claude-agent-service(infra/stacks/claude-agent-service/main.tf)fire-planner(infra/stacks/fire-planner/main.tf)wealthfolio-sync(infra/stacks/wealthfolio/main.tf)freedify(infra/stacks/freedify/factory/main.tf)chrome-service(infra/stacks/chrome-service/main.tf)beads-server/beadboard(infra/stacks/beads-server/main.tf). Thengh repo archive ViktorBarzin/beadboard.infra-ci— flipimage:references in 4.woodpecker/*.ymlfiles in the infra repo. Verify next push to master applies cleanly.claude-memory-mcp— updateCLAUDE.mdinstall instruction fromclaude plugins install github:ViktorBarzin/claude-memory-mcptoclaude plugins install https://forgejo.viktorbarzin.me/viktor/claude-memory-mcp.git.gh repo archive ViktorBarzin/claude-memory-mcp.
Phase 4 — Decommission
| Step | File / location |
|---|---|
Stop registry-private container on VM (10.0.20.10): edit /opt/registry/docker-compose.yml, comment out service, docker compose up -d --remove-orphans. (Manual SSH — cloud-init won't redeploy on TF apply per memory id=1078.) |
live VM |
| Update cloud-init template to match the new compose file | infra/stacks/infra/main.tf:288 |
Delete auths entries for registry.viktorbarzin.me / :5050 / 10.0.20.10:5050 from the dockerconfigjson |
infra/stacks/kyverno/modules/kyverno/registry-credentials.tf |
Drop registry.viktorbarzin.me and 10.0.20.10:5050 hosts.toml entries on each node + cloud-init template |
infra/stacks/infra/main.tf cloud-init + ad-hoc script |
After 1 week of no incidents, delete /opt/registry/data/private/ blob storage on the VM (~2.6GB freed) |
manual SSH |
Phase 5 — Docs
In the same commit as the Phase 4 closing:
| Doc | Update |
|---|---|
docs/runbooks/registry-vm.md |
Note registry-private is gone; pull-through caches and break-glass tarballs only |
docs/runbooks/registry-rebuild-image.md |
Replaced by NEW forgejo-registry-rebuild-image.md |
docs/runbooks/forgejo-registry-rebuild-image.md (NEW) |
Forgejo PVC restore procedure |
docs/runbooks/forgejo-registry-breakglass.md (NEW) |
infra-ci tarball recovery |
docs/architecture/ci-cd.md |
Image registry section flips to Forgejo |
docs/architecture/monitoring.md |
Integrity probe target updated |
infra/.claude/CLAUDE.md |
Registry references updated |
CLAUDE.md (monorepo root) |
claude-memory-mcp install URL updated |
infra/.claude/reference/service-catalog.md |
Cross-reference checked |
Critical files modified
| File | Phase | What |
|---|---|---|
infra/stacks/forgejo/main.tf |
0 | Memory bump, packages env vars, PVC bump, ingress max_body_size |
infra/stacks/forgejo/cleanup.tf (NEW) |
0 | Retention CronJob |
infra/stacks/forgejo/files/cleanup.sh (NEW) |
0 | Retention script (mounted via ConfigMap) |
infra/modules/kubernetes/ingress_factory/main.tf |
0 | Wire max_body_size into a Traefik Buffering middleware |
infra/stacks/kyverno/modules/kyverno/registry-credentials.tf |
0 | Add 4th auths entry |
infra/stacks/infra/main.tf |
0 + 4 | Containerd hosts.toml block (add Forgejo, later remove registry-private); compose template update |
infra/scripts/setup-forgejo-containerd-mirror.sh (NEW) |
0 | One-shot rollout for existing nodes |
infra/stacks/monitoring/modules/monitoring/main.tf |
0 | Forgejo integrity probe CronJob |
infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl |
0 | Make alerts instance-aware |
infra/stacks/monitoring/main.tf |
0 | Plumb forgejo_pull_token into module |
infra/.woodpecker/build-ci-image.yml |
1 | Dual-push to add Forgejo target + tarball break-glass |
<each-project>/.woodpecker.yml |
1 | Dual-push (NEW for fire-planner, wealthfolio-sync, hmrc-sync, freedify, beadboard, claude-memory-mcp; EDIT for payslip-ingest, job-hunter, claude-agent-service) |
infra/.woodpecker/{default,drift-detection,build-cli}.yml |
3 | Flip image: to Forgejo for infra-ci |
infra/stacks/{beads-server,chrome-service,claude-agent-service,fire-planner,freedify/factory,job-hunter,payslip-ingest,wealthfolio}/main.tf |
3 | Flip image = to Forgejo |
Verification
- Push (Phase 0/1):
docker push forgejo.viktorbarzin.me/viktor/<name>visible in Forgejo Web UI under viktor/. - Pull (Phase 0):
crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1succeeds with auto-synced Secret. - Dual-push (Phase 1): every Woodpecker pipeline run pushes to BOTH endpoints — confirmed via HEAD checks on
<reg>:<sha>for both. - Bake (Phase 2): existing daily Forgejo
/api/healthzexternal monitor stays green; integrity probe stays green; noContainerNearOOMfor forgejo pod. - Cutover (Phase 3):
kubectl rollout status deploy/<svc> -n <ns>succeeds.kubectl describe podshows the image was pulled fromforgejo.viktorbarzin.me. - Decommission (Phase 4):
docker pson registry VM no longer showsregistry-private. Brand-new namespace gets the Secret with only the Forgejoauthsentry. Pull still works.