From fe8db19aaf73f8c3af6975d9dfa2ffc35881d6c1 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 2 Jun 2026 20:24:50 +0000
Subject: [PATCH] job-hunter: build-triggers-deploy model; CronJob :latest + docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI now drives the Deployment rollout (kubectl set image to the build SHA
in .woodpecker.yml), so the stack moves to image_tag = "latest": the
Deployment runs whatever CI last set (image ignore_changes keeps TF from
fighting it), and the CronJob uses :latest + imagePullPolicy=Always
(fresh pod each weekly run). Keel stays enrolled in parallel as a
redundant net.

Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc
(owned apps drive their own rollout, Keel parallel — upstream stays
Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md
CI/CD section.

[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.

Co-Authored-By: Claude Opus 4.8
---
 .claude/CLAUDE.md                             | 21 +++++++++++++-
 .../2026-05-16-auto-upgrade-apps-design.md    | 15 ++++++++++
 docs/runbooks/job-hunter.md                   | 28 +++++++++++++++++++
 stacks/job-hunter/cronjob.tf                  | 18 +++++++-----
 stacks/job-hunter/terragrunt.hcl              | 13 +++++----
 5 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index cdf6f09c..bdfb4470 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -89,7 +89,26 @@ Violations cause state drift, which causes future applies to break or silently r
 
 ## CI/CD Architecture — GHA Builds + Woodpecker Deploy
 
-**Flow**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
+**Owned-app deploy model (build triggers the rollout — 2026-06-02):** For
+self-hosted apps **we build** (Forgejo `viktor/` + Dockerfile +
+`.woodpecker.yml`), the build pipeline ALSO drives the rollout — atomic +
+deterministic, no wait for Keel's poll. Pattern (`build-and-push` tags `latest`
++ `${CI_COMMIT_SHA:0:8}`, then a `deploy` step): `kubectl set image
+deployment/ =:${CI_COMMIT_SHA:0:8} -n ` +
+`kubectl rollout status ... --timeout=300s`. The `woodpecker-agent` SA is
+`cluster-admin`, so the `bitnami/kubectl` step needs no kubeconfig/RBAC (uses
+its in-cluster SA). **Keel stays enrolled in parallel** as a redundant net
+(finds the deployed SHA already running → no-op). Requires the Deployment to
+have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI
+`set image` doesn't fight `terragrunt apply`. CronJobs in owned apps use
+`:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy
+step. **Never** `set image`/`rollout restart` operator-managed StatefulSets
+(memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`,
+`job-hunter`. This reverses decision #12 of
+`docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream)
+images.
+
+**Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image`
 
 **Migrated to GHA** (10): Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints
 **Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access)
diff --git a/docs/plans/2026-05-16-auto-upgrade-apps-design.md b/docs/plans/2026-05-16-auto-upgrade-apps-design.md
index 4ff24d3f..da484dce 100644
--- a/docs/plans/2026-05-16-auto-upgrade-apps-design.md
+++ b/docs/plans/2026-05-16-auto-upgrade-apps-design.md
@@ -3,6 +3,21 @@
 **Date**: 2026-05-16
 **Status**: Approved (brainstorm + grill complete; implementation pending)
 
+> **UPDATE 2026-06-02 — decision #12 / Q1 reversed for OWNED apps.** The
+> original "uniform Keel-only, no per-repo `kubectl set image` step" call held
+> only for **upstream** images (which we can't build, so Keel poll-and-bump is
+> the only option). For **self-hosted apps we build**, CI now ALSO drives the
+> rollout: `build-and-push` tags `latest` + `:`, then a `deploy` step runs
+> `kubectl set image deployment/ ...:` + `rollout status`. Rationale
+> (memory id=3183, proven on tuya-bridge 2026-05-29): the pipeline is atomic
+> and deterministic — no wait for Keel's hourly poll, no risk of Keel resolving
+> `:latest` to a stale concrete tag. **Keel stays enrolled in parallel** as a
+> redundant net (it finds the just-deployed SHA already running → no-op), so
+> upstream apps and owned apps share one mental model. Enabled cluster-wide by
+> the `woodpecker-agent` SA being `cluster-admin` (no per-app RBAC). Owned apps
+> being rolled out to this pattern 2026-06-02; CronJobs in owned apps use
+> `:latest` + `imagePullPolicy: Always` instead of a deploy step.
+
 ## Problem
 
 Three constraints in tension across the cluster's ~70 services:
diff --git a/docs/runbooks/job-hunter.md b/docs/runbooks/job-hunter.md
index 13ce6258..d499d1e2 100644
--- a/docs/runbooks/job-hunter.md
+++ b/docs/runbooks/job-hunter.md
@@ -108,6 +108,34 @@ kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconc
 Changes hit `/webhook/cdio`; comp/role extraction from the diff is manual or
 LLM-side (CDIO only captures the changed text).
 
+### Deploying (build triggers the rollout)
+
+Deploys are **automatic on push to master** — we build the image, so CI also
+drives the rollout (`.woodpecker.yml`: `build-and-push` tags `latest` +
+`${CI_COMMIT_SHA:0:8}`, then a `deploy` step runs
+`kubectl set image deployment/job-hunter ...:${SHA}` + `rollout status`). The
+woodpecker-agent SA is cluster-admin, so no kubeconfig/RBAC is wired into the
+step. Keel stays enrolled in parallel as a redundant net (finds the SHA already
+running → no-op). So to ship code:
+
+```bash
+# in the job-hunter source repo (forgejo viktor/job-hunter)
+git push origin master  # → lint+test → build (latest + :) → set image → rollout
+```
+
+The **Deployment** rolls to the just-built `:`. The **CronJob** runs
+`:latest` with `imagePullPolicy: Always`, so its next scheduled pod pulls the
+newest image (no rollout needed for a CronJob). `image_tag = "latest"` in
+`terragrunt.hcl` is just the TF baseline; the running Deployment digest is
+whatever CI last set (`kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`).
+
+**Versioning** is still semver — bump `pyproject.toml` and cut a `git tag
+vX.Y.Z` to mark a release; that's the human version record, independent of the
+`:` deploy tag (map a running SHA back to a version with `git describe`).
+
+**Rollback**: `kubectl -n job-hunter rollout undo deployment/job-hunter` (last
+ReplicaSet), or push a revert commit (CI redeploys the reverted SHA).
+
 ### Applying the Terraform stack
 
 ```bash
diff --git a/stacks/job-hunter/cronjob.tf b/stacks/job-hunter/cronjob.tf
index 83674e2f..34e80ce8 100644
--- a/stacks/job-hunter/cronjob.tf
+++ b/stacks/job-hunter/cronjob.tf
@@ -5,8 +5,10 @@
 #
 # The alembic-migrate init container mirrors the Deployment so the CronJob can
 # never run a refresh against an un-migrated DB (snapshot inserts would fail).
-# Image is :latest (Keel-managed for the Deployment); the CronJob pulls the
-# current latest at each run, so it always executes the newest code.
+# Image is local.image (:latest via image_tag) with imagePullPolicy=Always: a
+# CronJob spawns a fresh pod each run, so Always pull = it always executes the
+# newest built code. The Deployment is rolled by CI (kubectl set image to the
+# build SHA); the CronJob needs no rollout — Always pull covers it.
 resource "kubernetes_cron_job_v1" "job_hunter_refresh" {
   metadata {
     name = "job-hunter-refresh"
@@ -40,9 +42,10 @@ resource "kubernetes_cron_job_v1" "job_hunter_refresh" {
             }
 
             init_container {
-              name    = "alembic-migrate"
-              image   = local.image
-              command = ["python", "-m", "job_hunter", "migrate"]
+              name              = "alembic-migrate"
+              image             = local.image
+              image_pull_policy = "Always"
+              command           = ["python", "-m", "job_hunter", "migrate"]
               env_from {
                 secret_ref {
                   name = "job-hunter-secrets"
@@ -65,8 +68,9 @@ resource "kubernetes_cron_job_v1" "job_hunter_refresh" {
             }
 
             container {
-              name  = "refresh"
-              image = local.image
+              name              = "refresh"
+              image             = local.image
+              image_pull_policy = "Always"
               command = ["python", "-m", "job_hunter", "refresh", "--source", "ats", "--source", "hn", "--source", "levels_fyi"]
diff --git a/stacks/job-hunter/terragrunt.hcl b/stacks/job-hunter/terragrunt.hcl
index 8f4a32fb..27b08b9d 100644
--- a/stacks/job-hunter/terragrunt.hcl
+++ b/stacks/job-hunter/terragrunt.hcl
@@ -18,9 +18,12 @@ dependency "external-secrets" {
 }
 
 inputs = {
-  # 92afc38d = master HEAD with levels.fyi scraper + comp_table COALESCE
-  # fix + Frankfurter FX backend (exchangerate.host free tier deprecated
-  # in 2026). Built + pushed locally 2026-04-19 while the Woodpecker
-  # Forgejo webhook remains broken.
-  image_tag = "92afc38d"
+  # :latest — CI drives the rollout. On every master push the pipeline builds
+  # latest + : and runs `kubectl set image deployment/job-hunter ...:`
+  # so the Deployment rolls to the just-built code immediately (no wait for
+  # Keel's poll). Keel stays enrolled in parallel as a redundant net. The
+  # CronJob uses :latest + Always pull (fresh pod each run). Project version
+  # lives in pyproject.toml + git tag vX.Y.Z (semver), independent of the
+  # deploy tag. CI OOM that had blocked all builds since 2026-04 is fixed.
+  image_tag = "latest"
 }