From 3e82c64a7659e1aa8d2a108bc2edb806f41605c5 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sat, 13 Jun 2026 12:55:49 +0000
Subject: [PATCH] docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip]
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/, with
Woodpecker reduced to deploy-only. The Forgejo container registry is
frozen and emptied; there are no in-cluster image builds or CI test runs
anywhere.

The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.

This brings the docs to the completed reality (closes #33):

- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD
  reference — the fleet GHA->ghcr->Woodpecker-deploy pattern,
  public/private ghcr package split, infra-owned image workflows (incl.
  infra-ci on ghcr), the frozen Forgejo registry, what Woodpecker still
  runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
  fleet-wide final state; FIX the stale claim that claude-memory-mcp
  builds to DockerHub (it is GHA->ghcr); note owned images now live on
  ghcr and the Forgejo registry is frozen/break-glass near the
  image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr +
  Woodpecker deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf +
  stacks/terminal/main.tf: cosmetic description/comment updates
  (forgejo -> ghcr; terminal-lobby has no CI pipeline).
  Description/comment text only — no stack logic changed.

Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002
itself are left untouched as point-in-time records.

Co-Authored-By: Claude Fable 5
---
 .claude/CLAUDE.md                    | 131 ++++---
 .claude/reference/service-catalog.md |   2 +-
 docs/architecture/ci-cd.md           | 530 ++++++++++++++-------------
 stacks/android-emulator/variables.tf |   2 +-
 stacks/terminal/main.tf              |   7 +-
 stacks/tuya-bridge/variables.tf      |   2 +-
 6 files changed, 379 insertions(+), 295 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index d0bc9444..37ab99f3 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
 - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g.
wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. -- **Private registry**: `forgejo.viktorbarzin.me/viktor/` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). 
The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. 
Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. 
+- **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). 
**In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. 
Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts. - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi. 
 - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
@@ -87,62 +87,103 @@ Violations cause state drift, which causes future applies to break or silently r
 - **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
 - **Quarterly right-sizing**: Run `krr` (Dockerized, against Prometheus) for recommendations; compare to current requests and adjust in TF. (Goldilocks dashboard removed 2026-06-12.)

-## CI/CD Architecture — GHA Builds + Woodpecker Deploy
+## CI/CD Architecture — GHA Builds → ghcr + Woodpecker Deploy

-**Doctrine (ADR-0002): leverage external infra for ALL CI compute.** Builds,
-tests, lint, and release jobs run on GitHub Actions hosted runners (public
-repos: unlimited free; private: 2000 free min/mo) — never on cluster nodes.
-In-cluster pipelines are reserved for cluster-touching steps only: Woodpecker
-deploys (`kubectl set image`), terragrunt applies, certbot. Do not
-(re)introduce in-cluster image builds or CI test runs — the fallback-build
-pattern was deliberately removed (clean cut). **Watch what you trigger**:
-after any push that fires a build chain, monitor it to completion (GHA run →
-Woodpecker deploy → `rollout status`) and fix failures immediately; verify
-via live state, not the checkmark. Fleet migration: PRD infra#10 (ADR-0002).
+**Doctrine (ADR-0002, fleet-wide as of 2026-06-13): ALL image builds + CI
+compute run OFF-infra.** Every owned image is built/linted/tested on GitHub
+Actions (public repos: free; private: 2000 free min/mo) and pushed to
+`ghcr.io/viktorbarzin/`. **No in-cluster image builds or CI test runs
+exist anywhere** — the in-cluster Woodpecker buildkit and the fallback-build
+pattern were removed (clean cut). Woodpecker is **deploy-only** (plus infra
+applies + maintenance crons). Canonical CI/CD reference:
+`docs/architecture/ci-cd.md`; decision: `docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md`.
+**Watch what you trigger**: after a push that fires a build chain, follow it to
+completion (GHA run → Woodpecker deploy → `rollout status`) and fix failures;
+verify via live state, not the checkmark.

-**Owned-app deploy model (build triggers the rollout — 2026-06-02):** For
-self-hosted apps **we build** (Forgejo `viktor/` + Dockerfile +
-`.woodpecker.yml`), the build pipeline ALSO drives the rollout — atomic +
-deterministic, no wait for Keel's poll. Pattern (`build-and-push` tags `latest`
-+ `${CI_COMMIT_SHA:0:8}`, then a `deploy` step): `kubectl set image
-deployment/ =:${CI_COMMIT_SHA:0:8} -n ` +
-`kubectl rollout status ... --timeout=300s`. The `woodpecker-agent` SA is
-`cluster-admin`, so the `bitnami/kubectl` step needs no kubeconfig/RBAC (uses
-its in-cluster SA). **Keel stays enrolled in parallel** as a redundant net
-(finds the deployed SHA already running → no-op). Requires the Deployment to
-have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI
-`set image` doesn't fight `terragrunt apply`. CronJobs in owned apps use
-`:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy
-step. **Never** `set image`/`rollout restart` operator-managed StatefulSets
-(memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`,
-`job-hunter`, `f1-stream` (viktor/f1-stream, extracted from this monorepo
-2026-06-05). This reverses decision #12 of
-`docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream)
-images.
+**The fleet pattern (every owned app):** Forgejo `viktor/` (canonical)
+push-mirrors (`sync_on_commit`) → GitHub `ViktorBarzin/` → GHA
+`.github/workflows/build.yml` (committed on Forgejo, mirrors over): `on: push:
+branches:[master]` ONLY (feature branches mirror but build/deploy nothing — the
+safety valve). The `build` job: lint/test → `svu` cuts the next `vX.Y.Z` tag to
+CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT) + bakes
+`VERSION` → `buildx` `linux/amd64` `provenance:false` (single-manifest, dodges
+the orphaned-index-children class) → push `ghcr.io/viktorbarzin/:` +
+`:latest` → `delete-package-versions` keep-10. The `deploy` job POSTs
+`ci.viktorbarzin.me/api/repos//pipelines` (the GitHub-mirror's Woodpecker
+registration, github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` +
+`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the raw
+Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
+image deployment/ …` in-cluster (woodpecker-agent SA = cluster-admin, no
+kubeconfig). Deployment image is `ignore_changes`/KEEL_IGNORE_IMAGE so the SHA
+sticks vs `terragrunt apply`; CronJobs track `:latest` + `imagePullPolicy:
+Always`. **Keel stays enrolled** as a redundant net (sees the SHA already
+running → no-op). **Never** `set image`/`rollout restart` operator-managed
+StatefulSets (memory id=740). Onboarding tool: `scripts/offinfra-onboard` +
+`scripts/offinfra-templates/`; mirror + workflow commits via the Forgejo API over
+the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`).
+Reference impls: tripit (the original pilot), f1-stream, job-hunter, tuya_bridge.

-**Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
+**Migrated apps (issues #13–#27):** f1-stream, job-hunter, tuya_bridge,
+beadboard, nextcloud-todos, claude-agent-service, **claude-memory-mcp** (GHA →
+ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest,
+broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder,
+x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website,
+k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+audiobook-search, council-complaints) now also land on ghcr.
+- **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service,
+  claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway,
+  chrome-service-novnc, android-emulator.
+- **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest,
+  wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli,
+  infra-ci. Pulled via the Kyverno-synced `ghcr-credentials` allowlist
+  (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred
+  = Vault `secret/viktor/ghcr_pull_token`, an alias of the admin `github_pat` —
+  GitHub has no token-mint API, swap the alias value if a scoped token is ever
+  UI-minted).

-**Migrated to GHA** (9): Website, k8s-portal, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints
-**Woodpecker-native owned-app build** (Forgejo registry, build->deploy in one `.woodpecker.yml`): tuya_bridge, job-hunter, f1-stream (extracted to viktor/f1-stream 2026-06-05; Woodpecker repo id 166; the old github source is archived + its GHA repo-id-10 deactivated)
-**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
-**Private Forgejo repo → off-infra GHA → GHCR** (NEW 2026-06-09 — gentler builds: keeps build IO **and** the registry push OFF the homelab/sdc; replaces in-cluster Woodpecker buildkit for private repos): **tripit** is the pilot. Forgejo `viktor/tripit` (canonical) push-mirrors → PRIVATE `ViktorBarzin/tripit` GitHub repo (`sync_on_commit`); `.github/workflows/build.yml` (committed on Forgejo, mirrors over) builds + pushes `ghcr.io/viktorbarzin/tripit:+latest` on GHA (free, ~2min, GHA-native cache). Cluster pulls of PRIVATE ghcr images use the `ghcr-credentials` dockerconfigjson, cloned by the kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit ALLOWLIST of private-ghcr namespaces only (ADR-0002; source `stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; cred = Vault `secret/viktor/ghcr_pull_token`, currently an alias of the admin `github_pat` — GitHub has no token-mint API, swap the alias value if a scoped token is ever UI-minted). **Auto-deploy** (verified 2026-06-09): the GHA `deploy` job POSTs `ci.viktorbarzin.me/api/repos/167/pipelines` (Woodpecker repo **167** = the GitHub mirror, registered github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG`+`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the Forgejo→GitHub mirror's raw pushes don't fire a tag-less deploy) runs `kubectl set image deployment/tripit tripit=… alembic-migrate=…` in-cluster (woodpecker-agent SA = cluster-admin, no kubeconfig). Image is KEEL_IGNORE_IMAGE so the SHA tag sticks; worker CronJobs track `:latest`. **Semver** (parallel layer): the GHA `build` job runs `svu` v3.4.1 over conventional commits, auto-cuts the next `vX.Y.Z` git tag pushed to CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT, NOT the package-scoped push token) and bakes `VERSION` → app reports it at `/api/version` (verified 0.2.1). Deploy tag stays the 8-char SHA. The old in-cluster `.woodpecker/build.yml` was DELETED (only `.woodpecker/deploy.yml` remains). GitHub default branch must be `master`. **Replicate to f1-stream, tuya_bridge, job-hunter** (currently Woodpecker-native in-cluster builds). Mirror + workflow-file commits are done via the Forgejo API over the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm can't reach forgejo's public hairpin.
+**Infra-owned images (issues #29/#30)** build on GHA workflows IN the infra
+repo's own `.github/workflows/` (added to the GitHub lineage via PR; the
+github↔forgejo divergence was deliberately NOT reconciled):
+`build-chrome-service-novnc.yml` + `build-android-emulator.yml` → public ghcr;
+`build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`;
+`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`. **infra-ci** is the image
+the `.woodpecker/default.yml` apply step + `drift-detection.yml` run in (proven
+by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr.
+The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED;
+infra-ci break-glass is a manual `.woodpecker/breakglass-infra-ci.yml` (ghcr
+pull-and-save to the registry VM).

-**Per-project files**:
-- `.github/workflows/build-and-deploy.yml` — GHA: checkout, build, push DockerHub, POST Woodpecker API
-- `.woodpecker/deploy.yml` — Woodpecker: `kubectl set image` + Slack notify (event: `[manual, push]`)
-- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires)
+**Forgejo container registry: FROZEN + emptied** (issue #32 wiped all `viktor/*`
+container packages). Break-glass-only now; nothing pushes. `forgejo-cleanup`
+stays DRY_RUN. Pull-through caches on `10.0.20.10` are unchanged. Runbook:
+`docs/runbooks/forgejo-registry-breakglass.md`.

-**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML).
-Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD (f1-stream's old GHA-era github repo id 10 is deactivated; it's now a Woodpecker-native Forgejo build at repo id 166)
+**Woodpecker now runs only:** per-app `deploy.yml` (manual, `kubectl set
+image`), `default.yml` (terragrunt apply), `renew-tls.yml` (certbot),
+maintenance crons (drift-detection, provision-user, registry-config-sync,
+pve-nfs-exports-sync, issue-automation, postmortem-todos, k8s-portal), and the
+manual `breakglass-infra-ci.yml`. **No build/test pipeline on any repo — do not
+(re)introduce one.**
+
+**Decommissioned (issue #31):** travel_blog (stack destroyed + dir removed), 6
+dead builders' pipelines (terminal-lobby, webhook-handler, hmrc-sync,
+trading-bot, travel-agent, trip-planner), and all `build-fallback.yml` files
+(only Website had one).
+
+**Woodpecker API**: numeric repo IDs (`/api/repos//pipelines`), NOT
+owner/name (those return HTML). The deploy registration for each app is the
+**GitHub mirror** repo (github-forge). Infra: Forgejo forge = repo 82, legacy
+GitHub forge = repo 1.

 **Woodpecker YAML gotchas**:
 - Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty
 - Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility issues)
 - Global secrets must have `manual` in their events list for API-triggered pipelines

-**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN`
-
-**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker.
+**GitHub repo secrets** (per repo): `WOODPECKER_TOKEN` (POST deploy pipeline),
+`FORGEJO_GIT_TOKEN` (write:repository PAT for the svu tag push). ghcr push uses
+the workflow's built-in `GITHUB_TOKEN` (`packages: write`).

 ## Database Host

diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 632505c0..ec78beac 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -47,7 +47,7 @@
 | nextcloud | File sync/share | nextcloud |
 | calibre | E-book management (may be merged into ebooks stack) | calibre |
 | onlyoffice | Document editing | onlyoffice |
-| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream |
+| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream |
 | chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service |
 | rybbit | Analytics | rybbit |
 | isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md
index e44df43d..c4493f86 100644
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -2,334 +2,374 @@

 ## Overview

-The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
+**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
+owned image is built, tested, and linted on **GitHub Actions** (free on public
+repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/`**.
+Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
+image tag and Woodpecker runs `kubectl set image` from inside the cluster.
+There are **no in-cluster image builds or CI test runs anywhere** — the
+in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
+clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
+and emptied** — break-glass only.
+
+This breaks the old circular dependency (images needed to repair the cluster
+used to be built and stored *inside* it) and keeps build IO + registry pushes
+off the homelab spindle.

 ## Architecture Diagram

 ```mermaid
 graph LR
-    A[Git Push] --> B[GitHub Actions]
-    B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
-    C --> D[Push to DockerHub]
-    D --> E[POST Woodpecker API]
-    E --> F[Woodpecker Pipeline]
-    F --> G[Vault K8s Auth<br/>SA JWT]
-    G --> H[kubectl set image]
-    H --> I[K8s Deployment]
-    I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
+    A[git push Forgejo<br/>viktor/<repo> canonical] --> B[push-mirror sync_on_commit]
+    B --> C[GitHub mirror<br/>ViktorBarzin/<repo>]
+    C --> D[GitHub Actions<br/>.github/workflows/build.yml]
+    D --> E[lint / test]
+    E --> F[buildx linux/amd64<br/>provenance:false]
+    F --> G[push ghcr.io/viktorbarzin/<name><br/>:sha8 + :latest]
+    G --> H[svu tag -> Forgejo canonical]
+    G --> I[POST Woodpecker deploy repo]
+    I --> J[.woodpecker/deploy.yml<br/>event: manual]
+    J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
+    K --> L[K8s Deployment<br/>pulls from ghcr]

-    K[Pull-Through Cache<br/>10.0.20.10] -.-> J
-    L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
-
-    style B fill:#2088ff
-    style F fill:#4c9e47
-    style K fill:#f39c12
+    style D fill:#2088ff
+    style J fill:#4c9e47
+    style G fill:#f39c12
 ```

 ## Components

-| Component | Version | Location | Purpose |
-|-----------|---------|----------|---------|
-| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
-| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
-| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
-| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
-| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
-| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
-| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
+| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
+| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons |
+| Forgejo | `forgejo.viktorbarzin.me/viktor/` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
+| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
+| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
+| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |

 ## How It Works

-### Build Flow (GitHub Actions)
+### The fleet pattern (every owned app)

-1. **Trigger**: Git push to main/master branch
-2. **Build**: GHA builds Docker image for `linux/amd64` platform only
-3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
-   - `:latest` tags are **never used** to prevent stale pull-through cache issues
-4. **Push**: Image pushed to DockerHub public registry
-5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
+1. **Canonical source = Forgejo** `viktor/`. A **push-mirror**
+   (`sync_on_commit`) pushes every commit to the GitHub mirror
+   `ViktorBarzin/`. The `.github/workflows/build.yml` is committed on
+   Forgejo and mirrors over.
+2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
+   branches mirror but build/deploy nothing, the safety valve):
+   - lint + test
+   - `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
+     tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
+     write:repository PAT); `VERSION` is baked into the image
+   - `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
+     avoids the orphaned-index-children failure class), push
+     `ghcr.io/viktorbarzin/:` + `:latest`
+   - `delete-package-versions` keeps the newest ~10 ghcr versions
+3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos//pipelines`
+   (the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
+   secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
+4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
+   Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
+   image deployment/ =` in-cluster. The `woodpecker-agent`
+   SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
+   kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
+   (`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
+   fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
+   instead of a deploy step.

-### Deploy Flow (Woodpecker CI)
+**Keel stays enrolled** as a redundant net (finds the deployed SHA already
+running → no-op).

-1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
-2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
-3. **Deploy**: `kubectl set image deployment/ =viktorbarzin/:`
-4. **Notify**: Slack notification on success/failure
+**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
+scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
+old-pipeline removal, default-branch flip).
Mirror + workflow commits go via +the Forgejo API over the internal Traefik LB +(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm +can't reach Forgejo's public hairpin. -### Project Migration Status +### ghcr package visibility -**Migrated to GHA (8 projects)**: -- Website -- k8s-portal -- claude-memory-mcp -- apple-health-data -- audiblez-web -- plotting-book -- insta2spotify -- book-search (audiobook-search) +| Visibility | Packages | Pull mechanism | +|------------|----------|----------------| +| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous | +| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson | -**Woodpecker-native owned-app builds** (build + push to the Forgejo private -registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel -stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`. -`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on -2026-06-05 (Woodpecker repo id 166); the old github source is archived and its -GHA-era Woodpecker repo (id 10) is deactivated. +Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the +kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit +**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source +`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault +`secret/viktor/ghcr_pull_token` (an alias of the admin `github_pat` — GitHub +has no token-mint API; swap the alias value if a scoped token is ever +UI-minted). 
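For orientation, the `ghcr-credentials` Secret is an ordinary `kubernetes.io/dockerconfigjson` payload. The sketch below shows roughly what that payload looks like for `ghcr.io`; the username and token are placeholders (assumptions for illustration), and in this setup the real value comes from Vault and is cloned by Kyverno rather than built by hand.

```python
import base64
import json


def build_dockerconfigjson(username: str, token: str, registry: str = "ghcr.io") -> str:
    """Return the JSON document stored under the `.dockerconfigjson` key."""
    # `auth` is base64("username:token"), the same value `docker login` writes.
    auth = base64.b64encode(f"{username}:{token}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            registry: {"username": username, "password": token, "auth": auth}
        }
    }
    return json.dumps(config, sort_keys=True)


def secret_data_value(dockerconfigjson: str) -> str:
    """Base64-encode the document, as a Secret's `data` field expects."""
    return base64.b64encode(dockerconfigjson.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials, not real values.
    doc = build_dockerconfigjson("example-user", "example-token")
    print(doc)
    print(secret_data_value(doc))
```

Decoding the live Secret and comparing its shape against this output is a quick way to rule out a malformed credential when a private pull fails.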
-**Woodpecker-only (infra + large apps)**: -- `travel_blog`: 5.7GB content directory exceeds GHA limits -- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli) +### Migrated apps (issues #13–#27) -### Woodpecker Pipeline Files +f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos, +claude-agent-service, claude-memory-mcp, kms-website, Freedify, +instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), +fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original +pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, +k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +audiobook-search, council-complaints) now also land on ghcr. -Each project contains: -- `.woodpecker/deploy.yml`: kubectl set image + Slack notification -- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires) +### Infra-owned images (issues #29 / #30) -### Woodpecker Repository IDs +Images owned by the infra repo build on GHA workflows **in the infra repo's own +`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT +reconciled — the workflows were added to the GitHub lineage via PR): -Woodpecker API uses numeric IDs (not owner/name): +| Image | Workflow | Destination | +|-------|----------|-------------| +| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` | +| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` | +| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` | +| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` | -| Repo | ID | -|------|------| -| infra | 1 | -| Website | 2 | -| finance | 3 | -| health | 4 | -| travel_blog | 5 | -| webhook-handler | 6 | -| audiblez-web | 9 | -| plotting-book | 43 | -| claude-memory-mcp | 78 | -| 
infra-onboarding | 79 | +**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and +`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is +already built by tripit's GHA → ghcr. -### Image Registry Flow +The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were +**REMOVED**. Break-glass for infra-ci is now a manual +`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM). -1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10` -2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss -3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access -4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries. -5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem). 
+### Forgejo container registry — FROZEN -### Infra Pipelines (Woodpecker-only) +Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data` +58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The +`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through +caches on the registry VM (`10.0.20.10`) are unchanged. See +`docs/runbooks/forgejo-registry-breakglass.md`. + +### Image registry / pull path + +1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the + pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io). +2. **Pull-through cache** serves cached images from the LAN, fetches upstream on + a miss. +3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist) + and `registry-credentials` to namespaces. + +## Woodpecker — what it still runs + +Woodpecker is **deploy + cluster-touching steps only**: | Pipeline | File | Purpose | |----------|------|---------| -| default | `.woodpecker/default.yml` | Terragrunt apply on push | -| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron | -| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries | -| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes | -| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory | -| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` | -| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host | -| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` 
via headless Claude agent | -| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection | -| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) | +| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) | +| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) | | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec | +| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change | +| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE | +| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems | +| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal | +| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM | + +**No build/test pipeline exists on any repo.** Do not (re)introduce one. + +### Woodpecker API + +Uses **numeric repo IDs** (`/api/repos//pipelines`), NOT owner/name paths +(those return HTML). The deploy registration for each app is the **GitHub +mirror** repo (registered github-forge). IDs are stable across renames and must +be looked up from the Woodpecker UI/DB. + +### Woodpecker YAML gotchas + +- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers + YAML map parsing when the vars are empty. 
+- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility). +- Global secrets must include `manual` in their events list for API-triggered + pipelines. + +### GitHub repo secrets + +Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN` +(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's +built-in `GITHUB_TOKEN` (`packages: write`). + +## Infra repo CI topology + +The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo +forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id +1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml` +(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push` +Slack audit step. Operational facts (2026-06-10): + +- **Webhook URL is the IN-CLUSTER service**: + `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed + via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`) + resolves to the non-proxied public A record from pods → NAT hairpin → + intermittent `context deadline exceeded`, silently dropping push events. If + Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me` + — re-apply the in-cluster URL. +- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference + repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …). + When registering a new forge repo for infra, clone the secret set too. +- **Empty commits defeat path filters**: a commit with no changed files makes + Woodpecker include ALL workflow files (path conditions can't exclude), so every + repo secret must resolve. Normal commits with real files only compile the + matching workflows. 
+ +The Forgejo trigger is not fully dependable — land infra changes by pushing +Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify +deploys via `scripts/tg` + live cluster state rather than trusting the CI +checkmark. The two remotes have **diverged** (parallel histories under +different SHAs); expect github pushes to reject non-fast-forward and leave them +— never force-push. ## Configuration -### GitHub Actions - -**File**: `.github/workflows/build-and-deploy.yml` +### GitHub Actions (per-app `.github/workflows/build.yml`) ```yaml -name: Build and Deploy +name: build on: push: - branches: [main, master] + branches: [master] jobs: build: runs-on: ubuntu-latest + permissions: + contents: write # svu tag push + packages: write # ghcr push steps: - - name: Build Docker image - run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} . - - name: Push to DockerHub - run: docker push viktorbarzin/app:${SHORT_SHA} - - name: Trigger Woodpecker Deploy + - uses: actions/checkout@v4 + - name: lint + test + run: make lint test + - name: svu tag -> Forgejo run: | - curl -X POST https://ci.viktorbarzin.me/api/repos//pipelines \ - -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" + VERSION=$(svu next) + # ... 
push tag to canonical Forgejo with FORGEJO_GIT_TOKEN + - uses: docker/setup-buildx-action@v3 + - uses: docker/build-push-action@v6 + with: + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/:${{ github.sha }} + ghcr.io/viktorbarzin/:latest + deploy: + needs: build + runs-on: ubuntu-latest + steps: + - name: Trigger Woodpecker deploy + run: | + curl -X POST https://ci.viktorbarzin.me/api/repos//pipelines \ + -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \ + -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}' ``` -**Required GitHub Secrets**: -- `DOCKERHUB_USERNAME` -- `DOCKERHUB_TOKEN` -- `WOODPECKER_TOKEN` - -### Woodpecker Deploy Pipeline - -**File**: `.woodpecker/deploy.yml` +### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`) ```yaml when: - event: [deployment] + event: manual steps: deploy: - image: bitnami/kubectl:latest + image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin) commands: - - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8} - secrets: [k8s_token] - + - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n " + - "kubectl rollout status deployment/app -n --timeout=300s" notify: image: plugins/slack - settings: - webhook: ${SLACK_WEBHOOK} when: status: [success, failure] ``` -**YAML Gotchas**: -- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty -- Use `bitnami/kubectl:latest` (not pinned versions) -- Global secrets must be manually added to `secrets:` list in pipeline +### CI/CD secrets sync -### Vault Configuration - -**K8s Auth for Woodpecker**: -- Woodpecker pipelines authenticate using ServiceAccount JWT -- Vault K8s auth mount validates JWT and issues token -- Policies grant access to secrets and dynamic credentials - -### CI/CD Secrets Sync - -**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours -- Keeps 
Woodpecker global secrets in sync with Vault -- Runs in `woodpecker` namespace - -## Infra repo CI (Woodpecker repo 82 — Forgejo forge) - -The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82, -registered 2026-06-08; the GitHub-side repo id 1 also remains registered). -Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt -apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit -contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10): - -- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` - (PATCHed via the Forgejo API). The Woodpecker-generated default - (`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A - record from pods → NAT hairpin → intermittent `context deadline exceeded`, - silently dropping push events (found when a push produced no pipeline). - If Woodpecker ever "repairs" the repo it will rewrite the hook back to - `ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me` - in the CoreDNS pod carve-out alongside forgejo). -- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference - repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, - …). Repo 82 was registered without them and every all-workflow compile - errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1 - rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where - repo_id=1`). When registering a new forge repo for infra, clone the secret - set too. -- **Empty commits defeat path filters**: a commit with no changed files makes - Woodpecker include ALL workflow files (path conditions can't exclude), so - every repo secret must resolve. Normal commits with real files only compile - the matching workflows. 
+A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault → +the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy +pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA +(cluster-admin); Vault K8s auth backs any secret reads. ## Decisions & Rationale -### Why GitHub Actions + Woodpecker? +### Why all builds off-infra (ADR-0002)? -**Alternatives considered**: -1. **Woodpecker-only**: Simple, but wastes cluster resources on builds -2. **GHA-only**: No cluster access, requires kubectl from outside (security risk) -3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access) +- **Breaks the circular dependency** — the images needed to repair the cluster + no longer live inside it (they're on ghcr, an external registry). +- **Removes build IO + registry push load** from the contended homelab spindle. +- GHA is free on public repos and generous on private; buildx provenance:false + sidesteps the orphaned-index-children failure class that plagued the + in-cluster registry. +- **Clean cut** — no in-cluster fallback builds anywhere; one pattern, + fleet-wide. -**Benefits**: -- Free compute for builds on public repos -- Cluster access stays internal (Woodpecker has direct K8s access) -- Separation of concerns: build vs deploy +### Why ghcr (not push back to Forgejo)? -### Why 8-Character SHA Tags (Not :latest)? +Forgejo's container registry repeatedly orphaned OCI index children +(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware. +ghcr is external (DR-safe), free for this scale, and has native multi-arch +handling. The Forgejo registry was frozen + emptied (issue #32). 
-- Pull-through cache serves stale `:latest` tags indefinitely -- SHA tags ensure every deployment pulls the correct image -- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations) +### Why Woodpecker stays for deploy? -### Why Numeric Repo IDs for Woodpecker API? +`kubectl set image` needs in-cluster privileged access; doing it from GHA would +mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's +`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step +needs no credentials. -- Woodpecker API requires numeric IDs (not owner/name slugs) -- IDs are stable across repo renames -- Must be manually looked up from Woodpecker UI or database +### Why `event: manual` on deploy.yml? -### Why linux/amd64 Only? +The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror. +If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no +image tag. `manual` means only the GHA `deploy` job's explicit API POST (with +`IMAGE_TAG`) deploys. -- Cluster runs on x86_64 nodes only -- ARM builds would waste time and storage -- Multi-arch images add complexity without benefit +### Why linux/amd64 only? + +The cluster runs on x86_64 nodes only; ARM builds waste time and storage. ## Troubleshooting -### GHA Build Fails: "denied: requested access to the resource is denied" +### GHA build fails: ghcr push "denied" -**Cause**: DockerHub credentials expired or incorrect +The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package +must allow the repo to push. Check the workflow `permissions:` block and the +package's "Manage Actions access" settings. 
+ +### Image pull fails: "ErrImagePull" / "ImagePullBackOff" -**Fix**: ```bash -# Regenerate DockerHub token -# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN +# Public image — check the pull-through cache is up +curl http://10.0.20.10:5010/v2/_catalog + +# Private image — verify the ghcr-credentials Secret exists in the namespace +kubectl get secret ghcr-credentials -n +# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the +# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf ``` -### Woodpecker Deploy Fails: "Unauthorized" +If the cause is the internal-DNS hairpin (fresh pulls timing out on the public +Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in +`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`. -**Cause**: Vault K8s auth token expired or invalid +### Deploy didn't happen after a push -**Fix**: -```bash -# Restart Woodpecker pipeline (token auto-renewed) -# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer -``` +Confirm the push was to **master** (feature branches build/deploy nothing). +Check the GHA run completed the `deploy` job, then check Woodpecker received the +manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify +live with `kubectl rollout status` — not the CI checkmark. 
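When checking whether the GHA `deploy` job sent what Woodpecker expects, it can help to rebuild the request body by hand and compare. This is a minimal sketch that assumes `IMAGE_TAG` is the first 8 characters of the commit SHA and `IMAGE_NAME` is the full ghcr repository path; the app name and SHA are hypothetical, and the body shape follows the `curl -d` example in this document.

```python
import json


def short_sha(full_sha: str, length: int = 8) -> str:
    """Truncate a full commit SHA to the short tag form (8 characters here)."""
    if len(full_sha) < length:
        raise ValueError("commit SHA is shorter than the requested tag length")
    return full_sha[:length]


def deploy_request_body(image_name: str, full_sha: str, branch: str = "master") -> str:
    """Build the JSON body for the Woodpecker pipeline POST."""
    if not image_name:
        # An empty IMAGE_NAME would produce a tag-less `kubectl set image`.
        raise ValueError("IMAGE_NAME must not be empty")
    body = {
        "branch": branch,
        "variables": {
            "IMAGE_TAG": short_sha(full_sha),
            "IMAGE_NAME": image_name,
        },
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Hypothetical image name and SHA, for illustration only.
    print(deploy_request_body(
        "ghcr.io/viktorbarzin/example-app",
        "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678",
    ))
```

If the variables in the Woodpecker pipeline view differ from this shape (for example an empty `IMAGE_TAG`), the problem is in the GHA job rather than in the deploy step.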
-### Image Pull Fails: "ErrImagePull" +### Woodpecker deploy fails: "YAML: did not find expected key" -**Cause**: Pull-through cache or registry credentials issue - -**Fix**: -```bash -# Check pull-through cache is running -curl http://10.0.20.10:5000/v2/_catalog - -# Verify registry-credentials Secret exists in namespace -kubectl get secret registry-credentials -n - -# Manually sync credentials if missing -kubectl get secret registry-credentials -n default -o yaml | \ - sed 's/namespace: default/namespace: /' | kubectl apply -f - -``` - -### Woodpecker Pipeline: "YAML: did not find expected key" - -**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty - -**Fix**: Quote the command: -```yaml -commands: - - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}" -``` - -### travel_blog Build Times Out on GHA - -**Cause**: 5.7GB content directory exceeds GHA disk/time limits - -**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources. - -### CI/CD Secrets Out of Sync - -**Cause**: CronJob failed to sync Vault → Woodpecker - -**Fix**: -```bash -# Check CronJob status -kubectl get cronjob -n woodpecker - -# Manually trigger sync -kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker -``` +Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the +command (see the deploy.yml example above). 
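The quoting gotcha can be illustrated with a small sketch. It assumes variable substitution is applied to the raw pipeline text before YAML parsing, which is what the gotcha implies, and it uses a simplified heuristic for YAML plain scalars rather than a real parser; both helper functions are hypothetical.

```python
import re
from typing import Mapping


def substitute(command: str, variables: Mapping[str, str]) -> str:
    """Replace each ${NAME} with its value; unset names become empty strings."""
    return re.sub(r"\$\{(\w+)\}", lambda m: variables.get(m.group(1), ""), command)


def reads_as_mapping(plain_scalar: str) -> bool:
    """Heuristic: an unquoted scalar with ': ' or a trailing ':' parses as a map key."""
    return ": " in plain_scalar or plain_scalar.rstrip().endswith(":")


if __name__ == "__main__":
    template = "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n example"
    filled = substitute(
        template,
        {"IMAGE_NAME": "ghcr.io/viktorbarzin/example-app", "IMAGE_TAG": "a1b2c3d4"},
    )
    empty = substitute(template, {})
    print(filled, reads_as_mapping(filled))
    print(empty, reads_as_mapping(empty))
```

With both variables empty the command collapses to `app=: -n example`, and the colon followed by a space is what turns the unquoted list item into a mapping. Wrapping the whole command in double quotes keeps it a string in either case.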
## Related -- [Databases Architecture](./databases.md) — Database credentials via Vault -- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access -- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app -- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues -- Vault documentation: K8s auth configuration -- Woodpecker documentation: API reference +- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision +- [Databases Architecture](./databases.md) — database credentials via Vault +- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access +- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry +- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging +- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/` diff --git a/stacks/android-emulator/variables.tf b/stacks/android-emulator/variables.tf index bcc24a0d..822b7527 100644 --- a/stacks/android-emulator/variables.tf +++ b/stacks/android-emulator/variables.tf @@ -6,5 +6,5 @@ variable "tls_secret_name" { variable "image_tag" { type = string default = "latest" - description = "android-emulator image tag at forgejo.viktorbarzin.me/viktor/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) -> ghcr.io/viktorbarzin/android-emulator on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build." + description = "android-emulator image tag at ghcr.io/viktorbarzin/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build." 
} diff --git a/stacks/terminal/main.tf b/stacks/terminal/main.tf index c2f3f50b..3737817d 100644 --- a/stacks/terminal/main.tf +++ b/stacks/terminal/main.tf @@ -225,8 +225,11 @@ module "ingress_ro" { # https://forgejo.viktorbarzin.me/viktor/terminal-lobby # # That repo's ./scripts/deploy.sh ships everything to wizard@10.0.10.10 -# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. This stack -# only owns the Kubernetes side: Services, Endpoints pointing at +# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. Deploy is +# MANUAL via that script — there is no CI pipeline (the lobby's +# .woodpecker.yml was removed under ADR-0002, issue #31; it builds no +# image, so it is not part of the GHA->ghcr fleet). This stack only owns +# the Kubernetes side: Services, Endpoints pointing at # 10.0.10.10:{7681,7682,7683,7684}, the IngressRoutes, and the Traefik # middlewares that gate everything behind Authentik forward-auth. # diff --git a/stacks/tuya-bridge/variables.tf b/stacks/tuya-bridge/variables.tf index 5c2be4d3..58e0a005 100644 --- a/stacks/tuya-bridge/variables.tf +++ b/stacks/tuya-bridge/variables.tf @@ -6,5 +6,5 @@ variable "tls_secret_name" { variable "image_tag" { type = string default = "latest" - description = "tuya_bridge image tag pushed to forgejo.viktorbarzin.me/viktor/tuya_bridge. Each Woodpecker run does `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)." + description = "tuya_bridge image tag at ghcr.io/viktorbarzin/tuya_bridge (built by GHA, ADR-0002). The GHA deploy job drives a Woodpecker `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)." }