From 1a7f68fe5bc149e304717aa933c6e14da56bff26 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 18 Apr 2026 22:35:46 +0000 Subject: [PATCH] [beads-server] Auto-dispatch agent beads via CronJobs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Context Until now, handing work to the in-cluster `beads-task-runner` agent required opening BeadBoard and clicking the manual Dispatch button on each bead. We want users to be able to describe work as a bead, set `assignee=agent`, and have the agent pick it up within a couple of minutes — no clicks. The existing pieces already provide everything we need: - `claude-agent-service` exposes `/execute` with a single-slot `asyncio.Lock` - BeadBoard's `/api/agent-dispatch` builds the prompt and forwards the bearer - BeadBoard's `/api/agent-status` reports `busy` via a cached `/health` poll - Dolt stores beads and is already in-cluster at `dolt.beads-server:3306` So the only missing component is a poller that ties them together. This commit adds that poller as two Kubernetes CronJobs — matching the existing infra pattern (OpenClaw task-processor, certbot-renewal, backups) rather than introducing n8n or in-service polling. ## Flow ``` user: bd assign agent │ ▼ Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐ │ │ ▼ │ CronJob: beads-dispatcher │ 1. GET beadboard/api/agent-status (busy? skip) │ 2. bd query 'assignee=agent AND status=open' │ 3. bd update -s in_progress (claim) │ 4. POST beadboard/api/agent-dispatch │ 5. bd note "dispatched: job=…" │ │ │ ▼ │ claude-agent-service /execute │ beads-task-runner agent runs; notes/closes bead │ │ │ ▼ │ done ──► next tick picks up the next bead ───────────────┘ CronJob: beads-reaper (every 10 min) for bead (assignee=agent, status=in_progress, updated_at > 30 min): bd note "reaper: no progress for Nm — blocking" bd update -s blocked ``` ## Decisions - **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd client can set it (`bd assign agent`). - **Sequential dispatch** — matches the service's `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, throughput is ~12 beads/hour. Parallelism is a separate plan. - **Fixed agent `beads-task-runner`** — read-only rails, matches the manual Dispatch button. Broader-privilege agents stay manual via BeadBoard UI. - **Image reuse** — the claude-agent-service image already ships `bd`, `jq`, `curl`; a new CronJob-specific image would duplicate 400MB of infra tooling. Mirror `claude_agent_service_image_tag` locally; bump on rebuild. - **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing the image-seeded file. The script copies it into `/tmp/.beads/` because bd may touch the parent dir and ConfigMap mounts are read-only. - **Kill switch (`beads_dispatcher_enabled`)** — single bool, default true. When false, `suspend: true` on both CronJobs; manual Dispatch keeps working. - **Reaper threshold 30 min** — `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` never trips the reaper. Failures trip it; pod crashes (in-memory job state lost) also trip it. ## What is NOT in this change - No Terraform apply — requires Vault OIDC + cluster access. Apply manually: `cd infra/stacks/beads-server && scripts/tg apply` - No change to `claude-agent-service/` (already ships bd/jq/curl) - No change to `beadboard/` (`/api/agent-dispatch` + `/api/agent-status` reused) - No change to the `beads-task-runner` agent definition (rails unchanged) - Parallelism: single-slot is MVP; multi-slot dispatch is a separate plan. ## Deviations from plan Minor, documented in code comments: - Reaper uses `.updated_at` instead of the plan's `.notes[].created_at`. bd serializes `notes` as a string (not an array), and every `bd note` bumps `updated_at` — equivalent for the reaper's purpose. - ISO-8601 parsed via `python3`, not `date -d` — Alpine's busybox lacks GNU `-d` and the image has python3. - `HOME=/tmp` set as a safety net — bd may try to write state/lock files. ## Test plan ### Automated ``` $ cd infra/stacks/beads-server && terraform init -backend=false Terraform has been successfully initialized! $ terraform validate Warning: Deprecated Resource (kubernetes_namespace → v1) # pre-existing, unrelated Success! The configuration is valid, but there were some validation warnings as shown above. $ terraform fmt stacks/beads-server/main.tf # (no output — already formatted) ``` ### Manual verification 1. **Apply** ``` vault login -method=oidc cd infra/stacks/beads-server scripts/tg apply ``` Expect: `kubernetes_config_map.beads_metadata`, `kubernetes_cron_job_v1.beads_dispatcher`, `kubernetes_cron_job_v1.beads_reaper` created. No changes to existing resources. 2. **CronJobs exist with right schedule** ``` kubectl -n beads-server get cronjob ``` Expect `beads-dispatcher */2 * * * *` and `beads-reaper */10 * * * *`, both with `SUSPEND=False`. 3. **End-to-end smoke** ``` bd create "auto-dispatch smoke test" \ -d "Read /etc/hostname inside the agent sandbox and close." \ --acceptance "bd note includes 'hostname=' line and bead is closed." bd assign agent # within 2 min: bd show --json | jq '{status, notes}' ``` Expect notes to contain `auto-dispatcher claimed at …` and `dispatched: job=`, status `in_progress`. 4. **Reaper smoke** Assign + dispatch a long bead, then `kubectl -n claude-agent delete pod -l app=claude-agent-service`. Within 30 min + one reaper tick, `bd show ` shows `blocked` with a `reaper: no progress for Nm — blocking` note. 5. **Kill switch** ``` cd infra/stacks/beads-server scripts/tg apply -var=beads_dispatcher_enabled=false kubectl -n beads-server get cronjob ``` Expect `SUSPEND=True` on both CronJobs. Assign a bead to `agent`; verify nothing happens within 5 min. Re-apply with `=true` to re-enable. Runbook with all above plus reaper semantics + design choices at `infra/docs/runbooks/beads-auto-dispatch.md`. Closes: code-8sm Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/runbooks/beads-auto-dispatch.md | 185 +++++++++++++++++ stacks/beads-server/main.tf | 287 +++++++++++++++++++++++++++ 2 files changed, 472 insertions(+) create mode 100644 docs/runbooks/beads-auto-dispatch.md diff --git a/docs/runbooks/beads-auto-dispatch.md b/docs/runbooks/beads-auto-dispatch.md new file mode 100644 index 00000000..f8bade68 --- /dev/null +++ b/docs/runbooks/beads-auto-dispatch.md @@ -0,0 +1,185 @@ +# Beads Auto-Dispatch Runbook + +Users can hand work to the headless `beads-task-runner` agent by assigning a +bead to the sentinel user `agent`. Two CronJobs in the `beads-server` +namespace drive the pipeline: + +- **`beads-dispatcher`** — every 2 min: picks up the highest-priority + `assignee=agent`/`status=open` bead with non-empty acceptance criteria, + claims it by flipping to `in_progress`, and POSTs it to BeadBoard's + `/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with + the existing bearer-token flow. +- **`beads-reaper`** — every 10 min: flips any `assignee=agent` + + `status=in_progress` bead whose `updated_at` is older than 30 min to + `status=blocked` with an explanatory note. Catches pod crashes mid-run. + +The manual BeadBoard Dispatch button continues to work in parallel. + +## Flow diagram + +``` + user: bd assign agent + │ + ▼ + Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐ + │ │ + ▼ │ + CronJob: beads-dispatcher │ + 1. GET beadboard/api/agent-status (busy?) │ + 2. bd query 'assignee=agent AND status=open' │ + 3. bd update -s in_progress (claim) │ + 4. POST beadboard/api/agent-dispatch │ + 5. bd note "dispatched: job=…" │ + │ │ + ▼ │ + claude-agent-service /execute │ + beads-task-runner agent runs; notes/closes bead │ + │ │ + ▼ │ + done ──► next tick picks up the next bead ───────────────┘ + + + CronJob: beads-reaper (every 10 min) + for bead (assignee=agent, status=in_progress, updated_at > 30 min): + bd note "reaper: no progress for Nm — blocking" + bd update -s blocked +``` + +## Usage + +### Hand a bead to the agent + +``` +bd create "Title" \ + -d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \ + --acceptance "Concrete, verifiable criteria" \ + -p 2 +bd assign agent +``` + +**Acceptance criteria is required.** Beads without it are skipped by the +dispatcher and stay in `open` forever. This is intentional — the +`beads-task-runner` agent expects clear done conditions. + +### Take a bead back (unassign) + +``` +bd assign "" +``` + +If the bead is already `in_progress`, also reset it: + +``` +bd update -s open +``` + +### Pause auto-dispatch + +``` +cd infra/stacks/beads-server +scripts/tg apply -var=beads_dispatcher_enabled=false +``` + +This sets `spec.suspend: true` on both CronJobs. Existing running jobs +continue; no new ticks fire. Re-enable by re-applying with +`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch +remains available while paused. + +### Read the logs + +``` +# Recent dispatcher runs +kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail +kubectl -n beads-server logs job/ + +# Tail the underlying agent once a bead dispatches +kubectl -n claude-agent logs -l app=claude-agent-service -f + +# Inspect reaper decisions +kubectl -n beads-server get jobs | grep beads-reaper | tail +kubectl -n beads-server logs job/ +``` + +### Inspect a specific bead's dispatch history + +``` +bd show --json | jq '{status, assignee, notes, updated_at}' +``` + +Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed +at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail +lives on the bead itself. + +## Reaper semantics — when a bead becomes `blocked` + +The reaper flips a bead to `blocked` if: +- `assignee = agent`, AND +- `status = in_progress`, AND +- `updated_at` is more than **30 minutes** in the past. + +Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` +agent never trips the reaper — it notes progress as it works. A `blocked` +bead is a signal that: +- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test), +- the job hit its 15-minute budget timeout inside `claude-agent-service` + without notes (rare — the agent usually notes failure before exiting), +- `claude-agent-service` was restarted during the run (in-memory job state + is lost; see [known risks](#known-risks)). + +Recovery: read the reaper note, reopen manually if appropriate: + +``` +bd update -s open +bd assign agent # re-arm for next dispatcher tick +``` + +## Design choices + +- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd + client can set it (`bd assign agent`). +- **Sequential dispatch** — matches `claude-agent-service`'s single-slot + `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, + throughput is ~12 beads/hour. Parallelism is a separate plan. +- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's + manual Dispatch button. Broader-privilege agents stay manual. +- **CronJob (not in-service polling, not n8n)** — matches existing infra + pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed, + easy to pause. +- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing + the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/` + because `bd` may touch the parent directory and ConfigMap mounts are + read-only. + +## Known risks + +- **In-memory job state in `claude-agent-service`** — if the pod restarts + mid-run, the job record is lost. The reaper catches this after 30 min. + Persistent job store is deferred. +- **Prompt injection via bead fields** — a malicious bead description could + try to steer the agent. The `beads-task-runner` rails + token budget + + timeout are the defense. Identical exposure as the manual Dispatch button. +- **Image tag drift** — `claude_agent_service_image_tag` in + `stacks/beads-server/main.tf` mirrors `local.image_tag` in + `stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds, + or the dispatcher/reaper will run on an older layer. (They only need + `bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.) +- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and + `.updated_at`. If a future `bd` upgrade renames these, the reaper breaks + silently (no reaping, no alert). `BD_VERSION` is pinned in the image + Dockerfile. + +## Verification after change + +``` +# Both CronJobs exist with the right schedule / SUSPEND state +kubectl -n beads-server get cronjob + +# End-to-end smoke test +bd create "auto-dispatch smoke test" \ + -d "Read /etc/hostname inside the agent sandbox and close." \ + --acceptance "bd note includes 'hostname=' and bead is closed." +bd assign agent +# within 2 min: +bd show --json | jq '.notes' +# → contains 'auto-dispatcher claimed' + 'dispatched: job=' +``` diff --git a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf index 6e19c786..01f75ff4 100644 --- a/stacks/beads-server/main.tf +++ b/stacks/beads-server/main.tf @@ -8,6 +8,22 @@ variable "beadboard_image_tag" { default = "17a38e43" } +# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in +# sync when the claude-agent-service image is rebuilt. Reused here because the +# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image +# already ships. +variable "claude_agent_service_image_tag" { + type = string + default = "0c24c9b6" +} + +# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The +# manual BeadBoard Dispatch button keeps working either way. +variable "beads_dispatcher_enabled" { + type = bool + default = true +} + resource "kubernetes_namespace" "beads" { metadata { name = "beads-server" @@ -671,3 +687,274 @@ module "beadboard_ingress" { "gethomepage.dev/pod-selector" = "" } } + +# ── Beads auto-dispatch (dispatcher + reaper CronJobs) ── +# +# Flow: +# user: bd assign agent +# └──> CronJob: beads-dispatcher (every 2 min) +# 1. GET BeadBoard /api/agent-status — skip if claude-agent-service busy +# 2. bd query 'assignee=agent AND status=open' — pick highest priority +# 3. bd update -s in_progress (claim; next tick won't re-pick) +# 4. POST BeadBoard /api/agent-dispatch — reuses prompt-build + bearer flow +# 5. bd note "dispatched: job=" (or rollback + note on failure) +# +# CronJob: beads-reaper (every 10 min) +# └── for bead (assignee=agent, status=in_progress, updated_at > 30m): +# bd update -s blocked + bd note (recover from pod crashes mid-run) +# +# The claude-agent-service image ships bd + jq + curl — no separate image built. + +resource "kubernetes_config_map" "beads_metadata" { + metadata { + name = "beads-metadata" + namespace = kubernetes_namespace.beads.metadata[0].name + } + data = { + "metadata.json" = jsonencode({ + database = "dolt" + backend = "dolt" + dolt_mode = "server" + dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local" + dolt_server_port = 3306 + dolt_server_user = "beads" + dolt_database = "code" + project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da" + }) + } +} + +locals { + claude_agent_service_image = "registry.viktorbarzin.me/claude-agent-service:${var.claude_agent_service_image_tag}" + beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local" + + beads_script_prelude = <<-EOT + set -euo pipefail + # bd with Dolt server mode needs metadata.json in a directory it can walk. + # ConfigMap mounts are read-only — copy to a writable location before use. + mkdir -p /tmp/.beads + cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json + EOT +} + +resource "kubernetes_cron_job_v1" "beads_dispatcher" { + metadata { + name = "beads-dispatcher" + namespace = kubernetes_namespace.beads.metadata[0].name + } + spec { + schedule = "*/2 * * * *" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 60 + suspend = !var.beads_dispatcher_enabled + job_template { + metadata {} + spec { + backoff_limit = 0 + ttl_seconds_after_finished = 600 + template { + metadata { + labels = { + app = "beads-dispatcher" + } + } + spec { + restart_policy = "Never" + image_pull_secrets { + name = "registry-credentials" + } + container { + name = "dispatcher" + image = local.claude_agent_service_image + command = ["/bin/sh", "-c", <<-EOT + ${local.beads_script_prelude} + + BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false') + if [ "$BUSY" != "false" ]; then + echo "claude-agent-service is busy — skipping tick" + exit 0 + fi + + BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \ + | jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)] + | sort_by(.priority, .updated_at)[0].id // empty') + + if [ -z "$BEAD" ]; then + echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)" + exit 0 + fi + + echo "picked bead: $BEAD" + + bd --db /tmp/.beads update "$BEAD" -s in_progress + bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)" + + RESP=$(curl -sS -w '\n%%{http_code}' -X POST \ + -H 'Content-Type: application/json' \ + -d "{\"taskId\":\"$BEAD\"}" \ + "$${BEADBOARD_URL}/api/agent-dispatch") + CODE=$(printf '%s' "$RESP" | tail -n1) + BODY=$(printf '%s' "$RESP" | sed '$d') + + if [ "$CODE" = "200" ]; then + JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"') + bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID" + echo "dispatched $BEAD as job $JOB_ID" + else + # Roll the claim back so the next tick can retry. + bd --db /tmp/.beads update "$BEAD" -s open + bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY" + echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2 + exit 1 + fi + EOT + ] + env { + name = "BEADBOARD_URL" + value = local.beadboard_internal_url + } + env { + name = "API_BEARER_TOKEN" + value_from { + secret_key_ref { + name = "beadboard-agent-service" + key = "api_bearer_token" + } + } + } + env { + name = "BEADS_ACTOR" + value = "beads-dispatcher" + } + env { + name = "HOME" + value = "/tmp" + } + volume_mount { + name = "beads-metadata" + mount_path = "/etc/beads-metadata" + read_only = true + } + resources { + requests = { + cpu = "50m" + memory = "128Mi" + } + limits = { + memory = "256Mi" + } + } + } + volume { + name = "beads-metadata" + config_map { + name = kubernetes_config_map.beads_metadata.metadata[0].name + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + +resource "kubernetes_cron_job_v1" "beads_reaper" { + metadata { + name = "beads-reaper" + namespace = kubernetes_namespace.beads.metadata[0].name + } + spec { + schedule = "*/10 * * * *" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 60 + suspend = !var.beads_dispatcher_enabled + job_template { + metadata {} + spec { + backoff_limit = 0 + ttl_seconds_after_finished = 600 + template { + metadata { + labels = { + app = "beads-reaper" + } + } + spec { + restart_policy = "Never" + image_pull_secrets { + name = "registry-credentials" + } + container { + name = "reaper" + image = local.claude_agent_service_image + command = ["/bin/sh", "-c", <<-EOT + ${local.beads_script_prelude} + + THRESHOLD_MIN=30 + NOW=$(date -u +%s) + + bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \ + | jq -c '.[]' \ + | while read -r BEAD_JSON; do + ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id') + LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at') + # Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3. + LAST_TS=$(python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('$LAST_UPDATE'.replace('Z','+00:00')).timestamp()))") + AGE_MIN=$(( (NOW - LAST_TS) / 60 )) + if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then + bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking" + bd --db /tmp/.beads update "$ID" -s blocked + echo "REAPED $ID (stale $${AGE_MIN}m)" + else + echo "keeping $ID (age $${AGE_MIN}m < $${THRESHOLD_MIN}m)" + fi + done + EOT + ] + env { + name = "BEADS_ACTOR" + value = "beads-reaper" + } + env { + name = "HOME" + value = "/tmp" + } + volume_mount { + name = "beads-metadata" + mount_path = "/etc/beads-metadata" + read_only = true + } + resources { + requests = { + cpu = "50m" + memory = "128Mi" + } + limits = { + memory = "256Mi" + } + } + } + volume { + name = "beads-metadata" + config_map { + name = kubernetes_config_map.beads_metadata.metadata[0].name + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +}