docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip]

Viktor asked to evaluate fully external image builders because in-cluster
CI builds keep destabilising the homelab (Forgejo OOM under registry-push
load, hairpin push timeouts, build IO on the shared sdc HDD, registry PVC
at its 50Gi ceiling). The evaluation was narrowed to these decisions:

- every owned image builds on GitHub Actions and lives on ghcr.io
  (extends the 2026-06-09 tripit pilot to the whole fleet)
- per-repo visibility: 9 public mirrors + images (gated on a clean
  gitleaks/PII history scan); the personal/finance/gray ones stay private
- clean cut: no in-cluster fallback build pipelines; existing
  build-fallback.yml files are deleted
- Woodpecker becomes deploy-only; Forgejo registry freezes to one
  last-known-good tag per Service after a manual cleanup pass
- dead builders (terminal-lobby, webhook-handler, hmrc-sync, trading-bot,
  travel-agent, trip-planner) are decommissioned, not migrated;
  travel_blog is decommissioned outright; manual images (x402-gateway,
  chrome-service-novnc, chatterbox-tts, android-emulator) get formalized
  GHA builds; infra-ci + CLI builds move to GHA on the public infra repo

CONTEXT.md: updated 'GHA build + Woodpecker deploy', added 'Canonical
repo', 'GitHub mirror', 'Forgejo registry' terms, image-path relationship,
and a 'registry' ambiguity entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-12 19:55:47 +00:00
parent 3978eec53a
commit 623d34628a
2 changed files with 40 additions and 2 deletions

@ -169,8 +169,20 @@ A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinc
### CI/CD
**GHA build + Woodpecker deploy**:
The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
**Canonical repo**:
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included.
_Avoid_: "upstream" (ambiguous); committing anywhere else.
**GitHub mirror**:
The GitHub repo a **Canonical repo** push-mirrors to, one-way, so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync.
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
**Forgejo registry**:
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
**Keel**:
The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**).
@ -192,6 +204,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
## Example dialogue
@ -211,3 +224,4 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
- **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine.
- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one.

@ -0,0 +1,24 @@
---
status: accepted
date: 2026-06-12
---
# All owned images build off-infra on GitHub Actions and live on ghcr.io
In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it.
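A minimal sketch of what the mirror's workflow could look like, not the fleet's actual file. The file path, the `main` branch, the secret names (`WOODPECKER_URL`, `WOODPECKER_TOKEN`, `WOODPECKER_REPO_ID`) and the `IMAGE_TAG` variable are all hypothetical, and the deploy step assumes the Woodpecker instance is reachable from GitHub runners and exposes its manual-pipeline endpoint (`POST /api/repos/{repo_id}/pipelines`). It also assumes the mirror's `owner/name` is lowercase, since ghcr.io rejects uppercase image paths.

```yaml
# .github/workflows/build.yml on the GitHub mirror (hypothetical path)
name: build-and-deploy
on:
  push:
    branches: [main]          # assumption: the mirrored default branch is main

permissions:
  contents: read
  packages: write             # lets GITHUB_TOKEN push to ghcr.io

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          platforms: linux/amd64   # single architecture, as in this ADR
          provenance: false        # no attestation manifest, so one manifest per tag
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
      - name: Trigger Woodpecker deploy
        env:
          # Hypothetical secret names; set them on the mirror repo.
          WOODPECKER_URL: ${{ secrets.WOODPECKER_URL }}
          WOODPECKER_TOKEN: ${{ secrets.WOODPECKER_TOKEN }}
          WOODPECKER_REPO_ID: ${{ secrets.WOODPECKER_REPO_ID }}
        run: |
          # IMAGE_TAG is a hypothetical variable the deploy pipeline would read.
          curl --fail -sS -X POST \
            -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
            -H "Content-Type: application/json" \
            -d "{\"branch\":\"main\",\"variables\":{\"IMAGE_TAG\":\"${GITHUB_SHA}\"}}" \
            "${WOODPECKER_URL}/api/repos/${WOODPECKER_REPO_ID}/pipelines"
```

Because the workflow file lives in the repo, it would still be committed on the canonical Forgejo repo and reach GitHub through the push-mirror.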
## Considered options
- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected.
- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes.
- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected.
## Consequences
- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster").
- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally.
- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source).
- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro).
- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it.
- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern.
- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup.
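
The cluster-wide `ghcr-credentials` sync named in the consequences could be expressed as a Kyverno generate rule along the following lines. This is a sketch under assumptions, not the policy the cluster runs: the policy name, the rule name, and the source namespace holding the original Secret (`kyverno` here) are hypothetical.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-ghcr-credentials        # hypothetical name
spec:
  rules:
    - name: clone-ghcr-credentials   # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: v1
        kind: Secret
        name: ghcr-credentials
        # Create the copy in each Namespace that triggers the rule.
        namespace: "{{request.object.metadata.name}}"
        # Keep copies updated when the source Secret changes.
        synchronize: true
        clone:
          namespace: kyverno         # assumption: where the source Secret lives
          name: ghcr-credentials
```

Workloads pulling private images would then reference the Secret through `imagePullSecrets`, the same way the existing `registry-credentials` is consumed.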