End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml still dual-pushes until the next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list. At that
   point the post-push integrity check at lines 33-107 must also
   be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI/CD Pipeline
Overview
The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via kubectl set image.
Architecture Diagram
graph LR
A[Git Push] --> B[GitHub Actions]
B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
C --> D[Push to DockerHub]
D --> E[POST Woodpecker API]
E --> F[Woodpecker Pipeline]
F --> G[Vault K8s Auth<br/>SA JWT]
G --> H[kubectl set image]
H --> I[K8s Deployment]
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
style B fill:#2088ff
style F fill:#4c9e47
style K fill:#f39c12
Components
| Component | Type | Location | Purpose |
|---|---|---|---|
| GitHub Actions | Cloud | .github/workflows/build-and-deploy.yml | Build Docker images, push to DockerHub |
| Woodpecker CI | Self-hosted | ci.viktorbarzin.me | Deploy to Kubernetes cluster |
| DockerHub | Cloud | viktorbarzin/* | Public image registry |
| Private Registry | Forgejo Packages | forgejo.viktorbarzin.me/viktor | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
| Pull-Through Cache | Custom | 10.0.20.10:5000 (docker.io), 10.0.20.10:5010 (ghcr.io) | LAN cache for remote registries |
| Kyverno | Cluster | kyverno namespace | Auto-sync registry credentials to all namespaces |
| Vault | Cluster | vault.viktorbarzin.me | K8s auth for Woodpecker pipelines |
How It Works
Build Flow (GitHub Actions)
- Trigger: Git push to main/master branch
- Build: GHA builds the Docker image for the `linux/amd64` platform only
- Tag: Image tagged with the 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`); `:latest` tags are never used, to prevent stale pull-through cache issues
- Push: Image pushed to DockerHub public registry
- Trigger Deploy: POST request to Woodpecker API with repo ID and commit SHA
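The trigger-deploy step above amounts to a single authenticated POST; a minimal sketch, assuming a placeholder repo ID (real IDs are in the Woodpecker Repository IDs table) and a `WOODPECKER_TOKEN` secret:

```shell
# Build the Woodpecker pipeline-trigger URL (REPO_ID=10 is a placeholder)
REPO_ID=10
URL="https://ci.viktorbarzin.me/api/repos/${REPO_ID}/pipelines"
echo "$URL"

# The actual trigger (commented out; needs a valid token):
# curl -X POST "$URL" -H "Authorization: Bearer ${WOODPECKER_TOKEN}"
```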
Deploy Flow (Woodpecker CI)
- Receive Webhook: Woodpecker API receives deployment trigger from GHA
- Authenticate: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
- Deploy: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
- Notify: Slack notification on success/failure
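The deploy step's tag slicing can be sketched as follows (the deployment and container names are placeholders, not taken from a real pipeline):

```shell
# Woodpecker exposes the full commit SHA; the deploy step slices the first 8 chars
CI_COMMIT_SHA="a1b2c3d4e5f60718293041506172839405162738"  # example value
IMAGE="viktorbarzin/app:${CI_COMMIT_SHA:0:8}"
echo "$IMAGE"   # viktorbarzin/app:a1b2c3d4

# The actual rollout (commented out; needs cluster access):
# kubectl set image deployment/app app="$IMAGE"
```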
Project Migration Status
Migrated to GHA (9 projects):
- Website
- k8s-portal
- f1-stream
- claude-memory-mcp
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- book-search (audiobook-search)
Woodpecker-only (infra + large apps):
- travel_blog: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
Woodpecker Pipeline Files
Each project contains:
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
- `.woodpecker/build-fallback.yml`: legacy full build pipeline (`event: deployment`, never auto-fires)
Woodpecker Repository IDs
Woodpecker API uses numeric IDs (not owner/name):
| Repo | ID |
|---|---|
| infra | 1 |
| Website | 2 |
| finance | 3 |
| health | 4 |
| travel_blog | 5 |
| webhook-handler | 6 |
| audiblez-web | 9 |
| f1-stream | 10 |
| plotting-book | 43 |
| claude-memory-mcp | 78 |
| infra-onboarding | 79 |
Image Registry Flow
- Containerd hosts.toml redirects pulls from docker.io and ghcr.io to the pull-through cache at `10.0.20.10`
- Pull-through cache serves cached images from LAN, fetches from upstream on cache miss
- Kyverno ClusterPolicy auto-syncs the `registry-credentials` Secret to all namespaces for private registry access
- The private registry has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth is via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04; see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
- Integrity probe (`registry-integrity-probe` CronJob in the `monitoring` namespace, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; the `RegistryManifestIntegrityFailure`, `RegistryIntegrityProbeStale`, and `RegistryCatalogInaccessible` alerts page on broken state. This is the authoritative check (HTTP API, not filesystem).
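The containerd redirect in the first step is typically expressed as one hosts.toml per upstream host. A sketch assuming the standard containerd `certs.d` layout — the exact path and capabilities are illustrative, not copied from the repo:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://registry-1.docker.io"

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
```

A sibling file under `certs.d/ghcr.io/` would point at port 5010 per the cache table above.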
Infra Pipelines (Woodpecker-only)
| Pipeline | File | Purpose |
|---|---|---|
| default | .woodpecker/default.yml | Terragrunt apply on push |
| renew-tls | .woodpecker/renew-tls.yml | Certbot renewal cron |
| build-cli | .woodpecker/build-cli.yml | Build and push to dual registries |
| build-ci-image | .woodpecker/build-ci-image.yml | Build infra-ci tooling image (triggered by ci/Dockerfile change or manual); post-push HEADs every blob via verify-integrity step to catch orphan-index pushes |
| k8s-portal | .woodpecker/k8s-portal.yml | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | .woodpecker/registry-config-sync.yml | SCP modules/docker-registry/* to /opt/registry/ on 10.0.20.10 when any managed file changes; bounces containers + nginx per docs/runbooks/registry-vm.md |
| pve-nfs-exports-sync | .woodpecker/pve-nfs-exports-sync.yml | Sync scripts/pve-nfs-exports → /etc/exports on PVE host |
| postmortem-todos | .woodpecker/postmortem-todos.yml | Auto-resolve safe TODOs from new docs/post-mortems/*.md via headless Claude agent |
| drift-detection | .woodpecker/drift-detection.yml | Nightly Terraform drift detection |
| issue-automation | .woodpecker/issue-automation.yml | Triage + respond to ViktorBarzin/infra GitHub issues |
| provision-user | .woodpecker/provision-user.yml | Add namespace-owner user from Vault spec |
Configuration
GitHub Actions
File: .github/workflows/build-and-deploy.yml
name: Build and Deploy
on:
  push:
    branches: [main, master]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Derive 8-char SHA
        run: echo "SHORT_SHA=${GITHUB_SHA::8}" >> "$GITHUB_ENV"
      - name: Log in to DockerHub
        run: docker login -u "${{ secrets.DOCKERHUB_USERNAME }}" -p "${{ secrets.DOCKERHUB_TOKEN }}"
      - name: Build Docker image
        run: docker build --platform linux/amd64 -t "viktorbarzin/app:${SHORT_SHA}" .
      - name: Push to DockerHub
        run: docker push "viktorbarzin/app:${SHORT_SHA}"
      - name: Trigger Woodpecker Deploy
        run: |
          curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
Required GitHub Secrets:
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_TOKEN`
- `WOODPECKER_TOKEN`
Woodpecker Deploy Pipeline
File: .woodpecker/deploy.yml
when:
  event: [deployment]

steps:
  deploy:
    image: bitnami/kubectl:latest
    commands:
      # Quoted per the YAML gotcha below: ${VAR}:${VAR} breaks map parsing when vars are empty
      - "kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}"
    secrets: [k8s_token]
  notify:
    image: plugins/slack
    settings:
      webhook: ${SLACK_WEBHOOK}
    when:
      status: [success, failure]
YAML Gotchas:
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions)
- Global secrets must be manually added to the `secrets:` list in the pipeline
Vault Configuration
K8s Auth for Woodpecker:
- Woodpecker pipelines authenticate using ServiceAccount JWT
- Vault K8s auth mount validates JWT and issues token
- Policies grant access to secrets and dynamic credentials
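The JWT exchange can be sketched as below. The role name `woodpecker-deployer` is borrowed from the troubleshooting section; the Vault Kubernetes auth login endpoint is the standard `/v1/auth/kubernetes/login`:

```shell
# In-cluster, the ServiceAccount JWT lives at the standard projected path:
# JWT="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
JWT="example.jwt.token"   # placeholder for this sketch

# Build the login payload Vault's K8s auth mount expects
PAYLOAD="{\"role\":\"woodpecker-deployer\",\"jwt\":\"${JWT}\"}"
echo "$PAYLOAD"

# Exchange for a Vault token (commented out; needs network + a real JWT):
# curl -s -X POST "https://vault.viktorbarzin.me/v1/auth/kubernetes/login" -d "$PAYLOAD"
```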
CI/CD Secrets Sync
CronJob: Pushes secret/ci/global from Vault → Woodpecker API every 6 hours
- Keeps Woodpecker global secrets in sync with Vault
- Runs in the `woodpecker` namespace
Decisions & Rationale
Why GitHub Actions + Woodpecker?
Alternatives considered:
- Woodpecker-only: Simple, but wastes cluster resources on builds
- GHA-only: No cluster access, requires kubectl from outside (security risk)
- Hybrid (chosen): GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
Benefits:
- Free compute for builds on public repos
- Cluster access stays internal (Woodpecker has direct K8s access)
- Separation of concerns: build vs deploy
Why 8-Character SHA Tags (Not :latest)?
- Pull-through cache serves stale `:latest` tags indefinitely
- SHA tags ensure every deployment pulls the correct image
- 8 characters provide sufficient collision resistance (16^8 ≈ 4.3 billion combinations)
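The collision-resistance figure checks out with shell arithmetic:

```shell
# 8 hex characters give 16^8 = 2^32 possible tags
echo $((16**8))   # 4294967296, i.e. about 4.3 billion
```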
Why Numeric Repo IDs for Woodpecker API?
- Woodpecker API requires numeric IDs (not owner/name slugs)
- IDs are stable across repo renames
- Must be manually looked up from Woodpecker UI or database
Why linux/amd64 Only?
- Cluster runs on x86_64 nodes only
- ARM builds would waste time and storage
- Multi-arch images add complexity without benefit
Troubleshooting
GHA Build Fails: "denied: requested access to the resource is denied"
Cause: DockerHub credentials expired or incorrect
Fix:
# Regenerate DockerHub token
# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
Woodpecker Deploy Fails: "Unauthorized"
Cause: Vault K8s auth token expired or invalid
Fix:
# Restart Woodpecker pipeline (token auto-renewed)
# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
Image Pull Fails: "ErrImagePull"
Cause: Pull-through cache or registry credentials issue
Fix:
# Check pull-through cache is running
curl http://10.0.20.10:5000/v2/_catalog
# Verify registry-credentials Secret exists in namespace
kubectl get secret registry-credentials -n <namespace>
# Manually sync credentials if missing
kubectl get secret registry-credentials -n default -o yaml | \
sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
Woodpecker Pipeline: "YAML: did not find expected key"
Cause: Unquoted command with ${VAR}:${VAR} syntax when VAR is empty
Fix: Quote the command:
commands:
- "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
travel_blog Build Times Out on GHA
Cause: 5.7GB content directory exceeds GHA disk/time limits
Fix: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
CI/CD Secrets Out of Sync
Cause: CronJob failed to sync Vault → Woodpecker
Fix:
# Check CronJob status
kubectl get cronjob -n woodpecker
# Manually trigger sync
kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
Related
- Databases Architecture — Database credentials via Vault
- Multi-Tenancy — Per-user Woodpecker access
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
- Vault documentation: K8s auth configuration
- Woodpecker documentation: API reference