infra/stacks/beads-server/main.tf

variable "tls_secret_name" {
  type      = string
  sensitive = true
}

variable "beadboard_image_tag" {
  type    = string
  default = "17a38e43"
}

# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in
# sync when the claude-agent-service image is rebuilt. Reused here because the
# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image
# already ships.
variable "claude_agent_service_image_tag" {
  type    = string
  default = "2fd7670d"
}
# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The
# manual BeadBoard Dispatch button keeps working either way.
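#
# Sketch of the assumed wiring (the CronJob resources are defined later in this
# file; the exact expression used there may differ):
#   spec {
#     suspend = !var.beads_dispatcher_enabled
#   }
# To pause auto-dispatch without editing this file:
#   scripts/tg apply -var=beads_dispatcher_enabled=false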
variable "beads_dispatcher_enabled" {
  type    = bool
  default = true
}

resource "kubernetes_namespace" "beads" {
  metadata {
    name = "beads-server"
    labels = {
      tier = local.tiers.aux
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
}

resource "kubernetes_persistent_volume_claim" "dolt_data" {
  wait_until_bound = false
  metadata {
    name      = "dolt-data"
    namespace = kubernetes_namespace.beads.metadata[0].name
    annotations = {
      "resize.topolvm.io/threshold"     = "80%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "10Gi"
    }
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "proxmox-lvm"
    resources {
      requests = { storage = "2Gi" }
    }
  }
}
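
resource "kubernetes_config_map" "dolt_init" {
  metadata {
    name      = "dolt-init"
    namespace = kubernetes_namespace.beads.metadata[0].name
  }
  # Mounted read-only at /docker-entrypoint-initdb.d in the dolt container.
  # Note: the `beads` user is created with an empty password and full privileges.
  data = {
    "01-create-beads-user.sql" = <<-EOT
      CREATE USER IF NOT EXISTS 'beads'@'%' IDENTIFIED BY '';
      GRANT ALL PRIVILEGES ON *.* TO 'beads'@'%' WITH GRANT OPTION;
    EOT
  }
}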

resource "kubernetes_deployment" "dolt" {
  metadata {
    name      = "dolt"
    namespace = kubernetes_namespace.beads.metadata[0].name
    labels = {
      app  = "dolt"
      tier = local.tiers.aux
    }
  }
  spec {
    replicas = 1
    strategy {
      type = "Recreate"
    }
    selector {
      match_labels = {
        app = "dolt"
      }
    }
    template {
      metadata {
        labels = {
          app = "dolt"
        }
      }
      spec {
        container {
          name  = "dolt"
          image = "dolthub/dolt-sql-server:latest"
          port {
            name           = "mysql"
            container_port = 3306
          }
          env {
            name  = "DOLT_ROOT_HOST"
            value = "%"
          }
          volume_mount {
            name       = "dolt-data"
            mount_path = "/var/lib/dolt"
          }
          volume_mount {
            name       = "init-scripts"
            mount_path = "/docker-entrypoint-initdb.d"
            read_only  = true
          }
          startup_probe {
            tcp_socket {
              port = 3306
            }
            failure_threshold = 30
            period_seconds    = 2
          }
          liveness_probe {
            tcp_socket {
              port = 3306
            }
            initial_delay_seconds = 10
            period_seconds        = 30
          }
          readiness_probe {
            tcp_socket {
              port = 3306
            }
            initial_delay_seconds = 5
            period_seconds        = 10
          }
          resources {
            requests = {
              memory = "256Mi"
              cpu    = "50m"
            }
            limits = {
              memory = "512Mi"
            }
          }
        }
        volume {
          name = "dolt-data"
          persistent_volume_claim {
            claim_name = kubernetes_persistent_volume_claim.dolt_data.metadata[0].name
          }
        }
        volume {
          name = "init-scripts"
          config_map {
            name = kubernetes_config_map.dolt_init.metadata[0].name
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "dolt" {
metadata {
name = "dolt"
namespace = kubernetes_namespace.beads.metadata[0].name
labels = {
app = "dolt"
}
annotations = {
"metallb.universe.tf/loadBalancerIPs" = "10.0.20.200"
"metallb.io/allow-shared-ip" = "shared"
}
}
spec {
type = "LoadBalancer"
external_traffic_policy = "Cluster"
selector = {
app = "dolt"
}
port {
name = "mysql"
port = 3306
target_port = 3306
}
}
}
# ── Dolt Workbench (web UI) ──
resource "kubernetes_config_map" "workbench_store" {
metadata {
name = "workbench-store"
namespace = kubernetes_namespace.beads.metadata[0].name
}
data = {
"store.json" = jsonencode([{
name = "beads"
connectionUrl = "mysql://beads@dolt.beads-server.svc.cluster.local:3306/code"
hideDoltFeatures = false
useSSL = false
type = "Mysql"
}])
}
}
resource "kubernetes_deployment" "workbench" {
metadata {
name = "dolt-workbench"
namespace = kubernetes_namespace.beads.metadata[0].name
labels = {
app = "dolt-workbench"
tier = local.tiers.aux
}
}
spec {
replicas = 1
selector {
match_labels = {
app = "dolt-workbench"
}
}
template {
metadata {
labels = {
app = "dolt-workbench"
}
}
spec {
init_container {
name = "seed-config"
image = "dolthub/dolt-workbench:latest"
command = ["sh", "-c", <<-EOT
# Seed connection store
cp /config/store.json /store/store.json
# Copy static JS to writable volume and patch GraphQL URL
cp -r /app/web/.next/static/* /static/
for f in /static/chunks/pages/_app-*.js; do
sed -i 's|http://localhost:9002/graphql|/graphql|g' "$f"
done
echo "Seeded connection store and patched GraphQL URL"
EOT
]
volume_mount {
name = "store-config"
mount_path = "/config"
read_only = true
}
volume_mount {
name = "store"
mount_path = "/store"
}
volume_mount {
name = "static-patched"
mount_path = "/static"
}
}
container {
name = "workbench"
image = "dolthub/dolt-workbench:latest"
command = ["sh", "-c", <<-EOT
# Bind the GraphQL server explicitly to 0.0.0.0 (IPv4); with no host argument, Node listens on the IPv6 unspecified address (::) when IPv6 is available
sed -i 's|app.listen(9002)|app.listen(9002,"0.0.0.0")|g' /app/graphql-server/dist/main.js
# Start PM2, then auto-connect to Dolt after GraphQL is ready
pm2-runtime /app/process.yml &
PM2_PID=$!
# Poll GraphQL once per second (up to 30 tries); on the first success, register the beads connection
for i in $(seq 1 30); do
if node -e "fetch('http://127.0.0.1:9002/graphql',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({query:'{storedConnections{name}}'})}).then(r=>{if(r.ok)process.exit(0);process.exit(1)}).catch(()=>process.exit(1))" 2>/dev/null; then
node -e "fetch('http://127.0.0.1:9002/graphql',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({query:'mutation{addDatabaseConnection(connectionUrl:\"mysql://beads@dolt.beads-server.svc.cluster.local:3306/code\",name:\"beads\",hideDoltFeatures:false,useSSL:false,type:Mysql){currentDatabase}}'})}).then(r=>r.text()).then(t=>{console.log('Auto-connect:',t);process.exit(0)}).catch(e=>{console.error(e);process.exit(1)})" 2>&1
break
fi
sleep 1
done &
wait $PM2_PID
EOT
]
port {
name = "http"
container_port = 3000
}
port {
name = "graphql"
container_port = 9002
}
env {
name = "NODE_OPTIONS"
value = "--dns-result-order=ipv4first"
}
env {
name = "GRAPHQLAPI_URL"
value = "http://localhost:9002/graphql"
}
volume_mount {
name = "store"
mount_path = "/app/graphql-server/store"
}
volume_mount {
name = "static-patched"
mount_path = "/app/web/.next/static"
}
startup_probe {
http_get {
path = "/"
port = 3000
}
failure_threshold = 30
period_seconds = 2
}
liveness_probe {
http_get {
path = "/"
port = 3000
}
initial_delay_seconds = 10
period_seconds = 30
}
readiness_probe {
http_get {
path = "/"
port = 3000
}
initial_delay_seconds = 5
period_seconds = 10
}
resources {
requests = {
memory = "128Mi"
cpu = "10m"
}
limits = {
memory = "512Mi"
}
}
}
volume {
name = "store-config"
config_map {
name = kubernetes_config_map.workbench_store.metadata[0].name
}
}
volume {
name = "store"
empty_dir {}
}
volume {
name = "static-patched"
empty_dir {}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "workbench" {
metadata {
name = "dolt-workbench"
namespace = kubernetes_namespace.beads.metadata[0].name
labels = {
app = "dolt-workbench"
}
}
spec {
selector = {
app = "dolt-workbench"
}
port {
name = "http"
port = 80
target_port = 3000
}
port {
name = "graphql"
port = 9002
target_port = 9002
}
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.beads.metadata[0].name
tls_secret_name = var.tls_secret_name
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
protected = false
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Dolt Workbench"
"gethomepage.dev/description" = "Beads task database UI"
"gethomepage.dev/icon" = "dolt.png"
"gethomepage.dev/group" = "Core Platform"
"gethomepage.dev/pod-selector" = ""
}
}
# GraphQL API ingress: the frontend JS hardcodes http://localhost:9002/graphql.
# The seed-config init container rewrites that URL to the relative path /graphql,
# and this ingress routes /graphql on the same hostname to port 9002.
resource "kubernetes_ingress_v1" "graphql" {
metadata {
name = "dolt-workbench-graphql"
namespace = kubernetes_namespace.beads.metadata[0].name
annotations = {
# No Authentik — browser fetch() can't follow 302 redirects on POST.
# Main page (/) is still protected. GraphQL has no sensitive data beyond task list.
}
}
spec {
ingress_class_name = "traefik"
tls {
hosts = ["dolt-workbench.viktorbarzin.me"]
secret_name = var.tls_secret_name
}
rule {
host = "dolt-workbench.viktorbarzin.me"
http {
path {
path = "/graphql"
path_type = "Exact"
backend {
service {
name = kubernetes_service.workbench.metadata[0].name
port {
number = 9002
}
}
}
}
}
}
}
}
# ── BeadBoard (task visualization dashboard) ──
resource "kubernetes_config_map" "beadboard_config" {
metadata {
name = "beadboard-beads-config"
namespace = kubernetes_namespace.beads.metadata[0].name
}
data = {
"metadata.json" = jsonencode({
database = "dolt"
backend = "dolt"
dolt_mode = "server"
dolt_server_host = "dolt.beads-server.svc.cluster.local"
dolt_server_port = 3306
dolt_server_user = "root"
dolt_database = "code"
project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da"
})
"dolt-server.port" = "3306"
}
}
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
# dispatch agent jobs via the in-cluster HTTP API.
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "beadboard-agent-service"
      namespace = kubernetes_namespace.beads.metadata[0].name
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-kv"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "beadboard-agent-service"
      }
      data = [
        {
          secretKey = "api_bearer_token"
          remoteRef = {
            key      = "claude-agent-service"
            property = "api_bearer_token"
          }
        },
      ]
    }
  }
}

resource "kubernetes_deployment" "beadboard" {
  metadata {
    name      = "beadboard"
    namespace = kubernetes_namespace.beads.metadata[0].name
    labels = {
      app  = "beadboard"
      tier = local.tiers.aux
    }
    annotations = {
      "reloader.stakater.com/auto" = "true"
    }
  }
  spec {
    replicas = 1
    selector {
      match_labels = {
        app = "beadboard"
      }
    }
    template {
      metadata {
        labels = {
          app = "beadboard"
        }
      }
      spec {
        image_pull_secrets {
          name = "registry-credentials"
        }
        init_container {
          name    = "seed-beads-config"
          image   = "busybox:1.36"
          command = ["sh", "-c", "cp /config/* /beads/ && mkdir -p /beads/templates /beads/archetypes"]
          volume_mount {
            name       = "beads-config"
            mount_path = "/config"
            read_only  = true
          }
          volume_mount {
            name       = "beads-writable"
            mount_path = "/beads"
          }
        }
        container {
          name = "beadboard"
[forgejo] Phases 3+4+5: cutover, decommission, docs sweep

End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure.

Phase 3 (image= flips):
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,fire-planner,freedify/factory,chrome-service,beads-server}/main.tf: image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf: also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml: infra-ci pulled from Forgejo. build-ci-image.yml still dual-pushes until the next build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md: claude-memory-mcp install URL updated.

Phase 4 (decommission registry-private):
* registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by the `setup-forgejo-containerd-mirror.sh` rollout; the cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull-through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now.

Phase 5 (final docs sweep):
* infra/docs/runbooks/registry-vm.md: VM scope reduced to pull-through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md: registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md: Registry Integrity Probe row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md: Private registry section rewritten end-to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl: RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10, edit /opt/registry/docker-compose.yml to match the new template AND run `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list. At that point the post-push integrity check at lines 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
          # Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
          image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
          port {
            name           = "http"
            container_port = 3000
          }
          env {
            name  = "CLAUDE_AGENT_SERVICE_URL"
            value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
          }
          env {
            name = "CLAUDE_AGENT_BEARER_TOKEN"
            value_from {
              secret_key_ref {
                name = "beadboard-agent-service"
                key  = "api_bearer_token"
              }
            }
          }
          volume_mount {
            name       = "beads-writable"
            mount_path = "/app/.beads"
          }
          startup_probe {
            http_get {
              path = "/"
              port = 3000
            }
            failure_threshold = 30
            period_seconds    = 2
          }
          liveness_probe {
            http_get {
              path = "/"
              port = 3000
            }
            initial_delay_seconds = 10
            period_seconds        = 30
          }
          readiness_probe {
            http_get {
              path = "/"
              port = 3000
            }
            initial_delay_seconds = 5
            period_seconds        = 10
          }
          resources {
            requests = {
              memory = "256Mi"
              cpu    = "50m"
            }
            limits = {
              memory = "512Mi"
            }
          }
        }
        volume {
          name = "beads-config"
          config_map {
            name = kubernetes_config_map.beadboard_config.metadata[0].name
          }
        }
        volume {
          name = "beads-writable"
          empty_dir {}
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]

## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }` snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2 override that prevents NxDomain search-domain flooding). 27 occurrences across 19 stacks. Without this suppression, every pod-owning resource shows perpetual TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/` module emitting the ignore-paths list as an output that stacks would consume in their `ignore_changes` blocks. That approach is architecturally impossible: Terraform's `ignore_changes` meta-argument accepts only static attribute paths. It rejects module outputs, locals, variables, and any expression (the HCL spec evaluates `lifecycle` before the regular expression graph). So a DRY module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new resources can be validated for compliance, (b) the existing 27 sites can be grep'd in a single command, and (c) future maintainers understand the convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment. Attached inline on every `spec[0].template[0].spec[0].dns_config` line (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27 existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`): that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes: every `ignore_changes` list is byte-identical save for the inline comment.
- The fallback module the original plan anticipated: skipped because Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files (claude-agent-service, freedify/factory, hermes-agent). Reverted to keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):

```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):

```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated

```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required: HCL comments only. Zero effect on any stack's plan output. Future audits: the count from `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new pod-owning resources are added.

## Reproduce locally

1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
    ]
  }
}

resource "kubernetes_service" "beadboard" {
  metadata {
    name      = "beadboard"
    namespace = kubernetes_namespace.beads.metadata[0].name
    labels = {
      app = "beadboard"
    }
  }
  spec {
    selector = {
      app = "beadboard"
    }
    port {
      name        = "http"
      port        = 80
      target_port = 3000
    }
  }
}

module "beadboard_ingress" {
  source           = "../../modules/kubernetes/ingress_factory"
  dns_type         = "proxied"
  namespace        = kubernetes_namespace.beads.metadata[0].name
  name             = "beadboard"
  tls_secret_name  = var.tls_secret_name
  protected        = true
  exclude_crowdsec = true
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "BeadBoard"
    "gethomepage.dev/description"  = "Agent task visualization dashboard"
    "gethomepage.dev/icon"         = "mdi-chart-gantt"
    "gethomepage.dev/group"        = "Core Platform"
    "gethomepage.dev/pod-selector" = ""
  }
}
[beads-server] Auto-dispatch agent beads via CronJobs

## Context

Until now, handing work to the in-cluster `beads-task-runner` agent required opening BeadBoard and clicking the manual Dispatch button on each bead. We want users to be able to describe work as a bead, set `assignee=agent`, and have the agent pick it up within a couple of minutes, with no clicks.

The existing pieces already provide everything we need:

- `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock`
- BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer
- BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll
- Dolt stores beads and is already in-cluster at `dolt.beads-server:3306`

So the only missing component is a poller that ties them together. This commit adds that poller as two Kubernetes CronJobs, matching the existing infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than introducing n8n or in-service polling.

## Flow

```
user: bd assign <id> agent
   │
   ▼
Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐
   │                                                     │
   ▼                                                     │
CronJob: beads-dispatcher                                │
  1. GET beadboard/api/agent-status (busy? skip)         │
  2. bd query 'assignee=agent AND status=open'           │
  3. bd update -s in_progress (claim)                    │
  4. POST beadboard/api/agent-dispatch                   │
  5. bd note "dispatched: job=…"                         │
   │                                                     │
   ▼                                                     │
claude-agent-service /execute                            │
  beads-task-runner agent runs; notes/closes bead        │
   │                                                     │
   ▼                                                     │
done ──► next tick picks up the next bead ───────────────┘

CronJob: beads-reaper (every 10 min)
  for bead (assignee=agent, status=in_progress, updated_at older than 30 min):
    bd note "reaper: no progress for Nm — blocking"
    bd update -s blocked
```

## Decisions

- **Sentinel assignee `agent`**: free-form, no Beads schema change. Any bd client can set it (`bd assign <id> agent`).
- **Sequential dispatch**: matches the service's `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, throughput is at most ~12 beads/hour (closer to ~10 once the wait for the next tick is counted). Parallelism is a separate plan.
- **Fixed agent `beads-task-runner`**: read-only rails, matches the manual Dispatch button. Broader-privilege agents stay manual via BeadBoard UI.
- **Image reuse**: the claude-agent-service image already ships `bd`, `jq`, `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling. Mirror `claude_agent_service_image_tag` locally; bump on rebuild.
- **ConfigMap-mounted `metadata.json`**: declarative TF rather than reusing the image-seeded file. The script copies it into `/tmp/.beads/` because bd may touch the parent dir and ConfigMap mounts are read-only.
- **Kill switch (`beads_dispatcher_enabled`)**: single bool, default true. When false, `suspend: true` on both CronJobs; manual Dispatch keeps working.
- **Reaper threshold 30 min**: `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` never trips the reaper. Failures trip it; pod crashes (in-memory job state lost) also trip it.

## What is NOT in this change

- No Terraform apply: requires Vault OIDC + cluster access. Apply manually: `cd infra/stacks/beads-server && scripts/tg apply`
- No change to `claude-agent-service/` (already ships bd/jq/curl)
- No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused)
- No change to the `beads-task-runner` agent definition (rails unchanged)
- Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan.

## Deviations from plan

Minor, documented in code comments:

- Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd serializes `notes` as a string (not an array), and every `bd note` bumps `updated_at`, which is equivalent for the reaper's purpose.
- ISO-8601 parsed via `python3`, not `date -d`: Alpine's busybox lacks GNU `-d` and the image has python3.
- `HOME=/tmp` set as a safety net, since bd may try to write state/lock files.

## Test plan

### Automated

```
$ cd infra/stacks/beads-server && terraform init -backend=false
Terraform has been successfully initialized!
$ terraform validate
Warning: Deprecated Resource (kubernetes_namespace → v1)  # pre-existing, unrelated
Success! The configuration is valid, but there were some validation warnings as shown above.
$ terraform fmt stacks/beads-server/main.tf
# (no output, already formatted)
```

### Manual verification

1. **Apply**
   ```
   vault login -method=oidc
   cd infra/stacks/beads-server
   scripts/tg apply
   ```
   Expect: `kubernetes_config_map.beads_metadata`, `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper` created. No changes to existing resources.
2. **CronJobs exist with right schedule**
   ```
   kubectl -n beads-server get cronjob
   ```
   Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`, both with `SUSPEND=False`.
3. **End-to-end smoke**
   ```
   bd create "auto-dispatch smoke test" \
     -d "Read /etc/hostname inside the agent sandbox and close." \
     --acceptance "bd note includes 'hostname=' line and bead is closed."
   bd assign <new-id> agent
   # within 2 min:
   bd show <new-id> --json | jq '{status, notes}'
   ```
   Expect notes to contain `auto-dispatcher claimed at …` and `dispatched: job=<uuid>`, status `in_progress`.
4. **Reaper smoke**
   Assign + dispatch a long bead, then `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within 30 min + one reaper tick, `bd show <id>` shows `blocked` with a `reaper: no progress for Nm — blocking` note.
5. **Kill switch**
   ```
   cd infra/stacks/beads-server
   scripts/tg apply -var=beads_dispatcher_enabled=false
   kubectl -n beads-server get cronjob
   ```
   Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify nothing happens within 5 min. Re-apply with `=true` to re-enable.

Runbook with all above plus reaper semantics + design choices at `infra/docs/runbooks/beads-auto-dispatch.md`.

Closes: code-8sm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:35:46 +00:00
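The kill switch described in the commit message above can be sketched as a bool variable negated into the CronJob's `suspend` field. This is a minimal illustration under stated assumptions, not the stack's actual resource: the `_sketch` resource name, the `example.invalid` image reference, the `Forbid` concurrency policy, and the placeholder command are all hypothetical. Only `beads_dispatcher_enabled` (default `true`), the `*/2 * * * *` schedule, and `HOME=/tmp` come from the commit message.

```hcl
# Sketch only: illustrates the suspend wiring, not the real dispatcher job.
variable "beads_dispatcher_enabled" {
  type    = bool
  default = true
}

resource "kubernetes_cron_job_v1" "beads_dispatcher_sketch" {
  metadata {
    name      = "beads-dispatcher-sketch" # hypothetical name
    namespace = "beads-server"
  }
  spec {
    schedule           = "*/2 * * * *"
    suspend            = !var.beads_dispatcher_enabled # false => CronJob suspended
    concurrency_policy = "Forbid"                      # assumption: never overlap ticks
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name    = "dispatcher"
              image   = "example.invalid/claude-agent-service:placeholder" # hypothetical image
              command = ["sh", "-c", "echo 'dispatcher script would run here'"]
              env {
                name  = "HOME"
                value = "/tmp" # bd may write state/lock files under HOME
              }
            }
          }
        }
      }
    }
  }
}
```

Suspending through `suspend` rather than removing the resource with `count` keeps the CronJob object and its job history in the cluster, so re-enabling is a one-field change.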
# ── Beads auto-dispatch (dispatcher + reaper CronJobs) ──
#
# Flow:
# user: bd assign <id> agent
# └──> CronJob: beads-dispatcher (every 2 min)
# 1. GET BeadBoard /api/agent-status — skip if claude-agent-service busy
# 2. bd query 'assignee=agent AND status=open' — pick highest priority
# 3. bd update -s in_progress (claim; next tick won't re-pick)
# 4. POST BeadBoard /api/agent-dispatch — reuses prompt-build + bearer flow
# 5. bd note "dispatched: job=<id>" (or rollback + note on failure)
#
# CronJob: beads-reaper (every 10 min)
# └── for bead (assignee=agent, status=in_progress, updated_at older than 30m):
# bd update -s blocked + bd note (recover from pod crashes mid-run)
#
# The claude-agent-service image ships bd + jq + curl — no separate image built.
resource "kubernetes_config_map" "beads_metadata" {
  metadata {
    name      = "beads-metadata"
    namespace = kubernetes_namespace.beads.metadata[0].name
  }
  data = {
    "metadata.json" = jsonencode({
      database         = "dolt"
      backend          = "dolt"
      dolt_mode        = "server"
      dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
      dolt_server_port = 3306
      dolt_server_user = "beads"
      dolt_database    = "code"
      project_id       = "a8f8bae7-ce65-4145-a5db-a13d11d297da"
    })
  }
}

locals {
[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
claude_agent_service_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
beads_script_prelude = <<-EOT
set -euo pipefail
# bd with Dolt server mode needs metadata.json in a directory it can walk.
# ConfigMap mounts are read-only — copy to a writable location before use.
mkdir -p /tmp/.beads
cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json
EOT
}
resource "kubernetes_cron_job_v1" "beads_dispatcher" {
metadata {
name = "beads-dispatcher"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/2 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-dispatcher"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "dispatcher"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false')
if [ "$BUSY" != "false" ]; then
echo "claude-agent-service is busy — skipping tick"
exit 0
fi
BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \
| jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)]
| sort_by(.priority, .updated_at)[0].id // empty')
if [ -z "$BEAD" ]; then
echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)"
exit 0
fi
echo "picked bead: $BEAD"
bd --db /tmp/.beads update "$BEAD" -s in_progress
bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
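# "|| true" keeps "set -e" from aborting before the rollback below when curl
# itself fails; a failed transfer then yields a non-200 code (curl reports 000)
# and falls through to the else branch.
RESP=$(curl -sS -w '\n%%{http_code}' -X POST \
-H 'Content-Type: application/json' \
-d "{\"taskId\":\"$BEAD\"}" \
"$${BEADBOARD_URL}/api/agent-dispatch") || true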
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')
if [ "$CODE" = "200" ]; then
JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"')
bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID"
echo "dispatched $BEAD as job $JOB_ID"
else
# Roll the claim back so the next tick can retry.
bd --db /tmp/.beads update "$BEAD" -s open
bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY"
echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2
exit 1
fi
EOT
]
env {
name = "BEADBOARD_URL"
value = local.beadboard_internal_url
}
env {
name = "API_BEARER_TOKEN"
value_from {
secret_key_ref {
name = "beadboard-agent-service"
key = "api_bearer_token"
}
}
}
env {
name = "BEADS_ACTOR"
value = "beads-dispatcher"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
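The dispatcher asks curl to append the HTTP status code on its own last line, then splits that line off from the body. The split can be tried without a cluster. This is a minimal sketch, assuming a made-up response stands in for the real curl output; the `example-job` value is hypothetical.

```shell
#!/bin/sh
# Simulated curl output when a newline plus the HTTP status code is appended
# after the response body. The values here are made up for illustration.
RESP=$(printf '%s\n%s' '{"job_id":"example-job"}' '200')

# The last line is the status code; everything before it is the body.
CODE=$(printf '%s' "$RESP" | tail -n1)
BODY=$(printf '%s' "$RESP" | sed '$d')

echo "code=$CODE"
echo "body=$BODY"
```

Because only the final line is removed, a multi-line body survives the split intact.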
resource "kubernetes_cron_job_v1" "beads_reaper" {
metadata {
name = "beads-reaper"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
schedule = "*/10 * * * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 60
suspend = !var.beads_dispatcher_enabled
job_template {
metadata {}
spec {
backoff_limit = 0
ttl_seconds_after_finished = 600
template {
metadata {
labels = {
app = "beads-reaper"
}
}
spec {
restart_policy = "Never"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "reaper"
image = local.claude_agent_service_image
command = ["/bin/sh", "-c", <<-EOT
${local.beads_script_prelude}
THRESHOLD_MIN=30
NOW=$(date -u +%s)
bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \
| jq -c '.[]' \
| while read -r BEAD_JSON; do
ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id')
LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at')
# Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3.
# The timestamp is passed as argv so it is never spliced into Python source.
LAST_TS=$(python3 -c "import sys; from datetime import datetime; print(int(datetime.fromisoformat(sys.argv[1].replace('Z','+00:00')).timestamp()))" "$LAST_UPDATE")
AGE_MIN=$(( (NOW - LAST_TS) / 60 ))
if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then
bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking"
bd --db /tmp/.beads update "$ID" -s blocked
echo "REAPED $ID (stale $${AGE_MIN}m)"
else
echo "keeping $ID (age $${AGE_MIN}m <= $${THRESHOLD_MIN}m)"
fi
done
EOT
]
env {
name = "BEADS_ACTOR"
value = "beads-reaper"
}
env {
name = "HOME"
value = "/tmp"
}
volume_mount {
name = "beads-metadata"
mount_path = "/etc/beads-metadata"
read_only = true
}
resources {
requests = {
cpu = "50m"
memory = "128Mi"
}
limits = {
memory = "256Mi"
}
}
}
volume {
name = "beads-metadata"
config_map {
name = kubernetes_config_map.beads_metadata.metadata[0].name
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
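The reaper's staleness check is plain integer arithmetic on epoch seconds, and the comparison is strict: a bead that is exactly 30 minutes old is kept. This is a minimal sketch of that decision, assuming timestamps are already converted to epoch seconds; `reaper_decision` is a hypothetical helper, not part of the CronJob script.

```shell
#!/bin/sh
# Hypothetical helper mirroring the reaper's arithmetic: prints "reap" when a
# bead has gone strictly more than THRESHOLD_MIN minutes without an update.
THRESHOLD_MIN=30

reaper_decision() {
  now_ts=$1    # current time, epoch seconds
  last_ts=$2   # bead's updated_at, epoch seconds
  age_min=$(( (now_ts - last_ts) / 60 ))
  if [ "$age_min" -gt "$THRESHOLD_MIN" ]; then
    echo "reap"
  else
    echo "keep"
  fi
}

reaper_decision 1700003600 1700000000   # 60 minutes old
reaper_decision 1700001800 1700000000   # exactly 30 minutes old
```

Integer division truncates, so the age is rounded down to whole minutes before it is compared with the threshold.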