diff --git a/docs/runbooks/beads-auto-dispatch.md b/docs/runbooks/beads-auto-dispatch.md new file mode 100644 index 00000000..f8bade68 --- /dev/null +++ b/docs/runbooks/beads-auto-dispatch.md @@ -0,0 +1,185 @@ +# Beads Auto-Dispatch Runbook + +Users can hand work to the headless `beads-task-runner` agent by assigning a +bead to the sentinel user `agent`. Two CronJobs in the `beads-server` +namespace drive the pipeline: + +- **`beads-dispatcher`** — every 2 min: picks up the highest-priority + `assignee=agent`/`status=open` bead with non-empty acceptance criteria, + claims it by flipping to `in_progress`, and POSTs it to BeadBoard's + `/api/agent-dispatch`. BeadBoard forwards to `claude-agent-service` with + the existing bearer-token flow. +- **`beads-reaper`** — every 10 min: flips any `assignee=agent` + + `status=in_progress` bead whose `updated_at` is older than 30 min to + `status=blocked` with an explanatory note. Catches pod crashes mid-run. + +The manual BeadBoard Dispatch button continues to work in parallel. + +## Flow diagram + +``` + user: bd assign agent + │ + ▼ + Dolt @ dolt.beads-server.svc:3306 ◄──── every 2 min ────┐ + │ │ + ▼ │ + CronJob: beads-dispatcher │ + 1. GET beadboard/api/agent-status (busy?) │ + 2. bd query 'assignee=agent AND status=open' │ + 3. bd update -s in_progress (claim) │ + 4. POST beadboard/api/agent-dispatch │ + 5. bd note "dispatched: job=…" │ + │ │ + ▼ │ + claude-agent-service /execute │ + beads-task-runner agent runs; notes/closes bead │ + │ │ + ▼ │ + done ──► next tick picks up the next bead ───────────────┘ + + + CronJob: beads-reaper (every 10 min) + for bead (assignee=agent, status=in_progress, updated_at > 30 min): + bd note "reaper: no progress for Nm — blocking" + bd update -s blocked +``` + +## Usage + +### Hand a bead to the agent + +``` +bd create "Title" \ + -d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \ + --acceptance "Concrete, verifiable criteria" \ + -p 2 +bd assign agent +``` + +**Acceptance criteria is required.** Beads without it are skipped by the +dispatcher and stay in `open` forever. This is intentional — the +`beads-task-runner` agent expects clear done conditions. + +### Take a bead back (unassign) + +``` +bd assign "" +``` + +If the bead is already `in_progress`, also reset it: + +``` +bd update -s open +``` + +### Pause auto-dispatch + +``` +cd infra/stacks/beads-server +scripts/tg apply -var=beads_dispatcher_enabled=false +``` + +This sets `spec.suspend: true` on both CronJobs. Existing running jobs +continue; no new ticks fire. Re-enable by re-applying with +`beads_dispatcher_enabled=true` (the default). Manual BeadBoard Dispatch +remains available while paused. + +### Read the logs + +``` +# Recent dispatcher runs +kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail +kubectl -n beads-server logs job/ + +# Tail the underlying agent once a bead dispatches +kubectl -n claude-agent logs -l app=claude-agent-service -f + +# Inspect reaper decisions +kubectl -n beads-server get jobs | grep beads-reaper | tail +kubectl -n beads-server logs job/ +``` + +### Inspect a specific bead's dispatch history + +``` +bd show --json | jq '{status, assignee, notes, updated_at}' +``` + +Both the dispatcher and reaper write dated notes (`auto-dispatcher claimed +at…`, `dispatched: job=…`, `reaper: no progress for…`) so the audit trail +lives on the bead itself. + +## Reaper semantics — when a bead becomes `blocked` + +The reaper flips a bead to `blocked` if: +- `assignee = agent`, AND +- `status = in_progress`, AND +- `updated_at` is more than **30 minutes** in the past. + +Every `bd note` bumps `updated_at`, so a well-behaved `beads-task-runner` +agent never trips the reaper — it notes progress as it works. A `blocked` +bead is a signal that: +- the agent pod crashed mid-run (`kubectl -n claude-agent delete pod` test), +- the job hit its 15-minute budget timeout inside `claude-agent-service` + without notes (rare — the agent usually notes failure before exiting), +- `claude-agent-service` was restarted during the run (in-memory job state + is lost; see [known risks](#known-risks)). + +Recovery: read the reaper note, reopen manually if appropriate: + +``` +bd update -s open +bd assign agent # re-arm for next dispatcher tick +``` + +## Design choices + +- **Sentinel assignee `agent`** — free-form, no Beads schema change. Any bd + client can set it (`bd assign agent`). +- **Sequential dispatch** — matches `claude-agent-service`'s single-slot + `asyncio.Lock`. With a 2-min poll cadence and ~5-min average run, + throughput is ~12 beads/hour. Parallelism is a separate plan. +- **Fixed agent (`beads-task-runner`)** — read-only rails, matches BeadBoard's + manual Dispatch button. Broader-privilege agents stay manual. +- **CronJob (not in-service polling, not n8n)** — matches existing infra + pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed, + easy to pause. +- **ConfigMap-mounted `metadata.json`** — declarative TF rather than reusing + the image-seeded file. The CronJob's init step copies it into `/tmp/.beads/` + because `bd` may touch the parent directory and ConfigMap mounts are + read-only. + +## Known risks + +- **In-memory job state in `claude-agent-service`** — if the pod restarts + mid-run, the job record is lost. The reaper catches this after 30 min. + Persistent job store is deferred. +- **Prompt injection via bead fields** — a malicious bead description could + try to steer the agent. The `beads-task-runner` rails + token budget + + timeout are the defense. Identical exposure as the manual Dispatch button. +- **Image tag drift** — `claude_agent_service_image_tag` in + `stacks/beads-server/main.tf` mirrors `local.image_tag` in + `stacks/claude-agent-service/main.tf`. Bump both when the image rebuilds, + or the dispatcher/reaper will run on an older layer. (They only need + `bd`, `curl`, `jq` — stable across rebuilds — so the drift is low-risk.) +- **`bd` JSON schema changes** — the reaper's `jq` reads `.id` and + `.updated_at`. If a future `bd` upgrade renames these, the reaper breaks + silently (no reaping, no alert). `BD_VERSION` is pinned in the image + Dockerfile. + +## Verification after change + +``` +# Both CronJobs exist with the right schedule / SUSPEND state +kubectl -n beads-server get cronjob + +# End-to-end smoke test +bd create "auto-dispatch smoke test" \ + -d "Read /etc/hostname inside the agent sandbox and close." \ + --acceptance "bd note includes 'hostname=' and bead is closed." +bd assign agent +# within 2 min: +bd show --json | jq '.notes' +# → contains 'auto-dispatcher claimed' + 'dispatched: job=' +``` diff --git a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf index 6e19c786..01f75ff4 100644 --- a/stacks/beads-server/main.tf +++ b/stacks/beads-server/main.tf @@ -8,6 +8,22 @@ variable "beadboard_image_tag" { default = "17a38e43" } +# Mirrors `local.image_tag` in stacks/claude-agent-service/main.tf — keep in +# sync when the claude-agent-service image is rebuilt. Reused here because the +# dispatcher + reaper CronJobs only need bd, curl, and jq, which that image +# already ships. +variable "claude_agent_service_image_tag" { + type = string + default = "0c24c9b6" +} + +# Kill switch for auto-dispatch. When false, both CronJobs are suspended. The +# manual BeadBoard Dispatch button keeps working either way. +variable "beads_dispatcher_enabled" { + type = bool + default = true +} + resource "kubernetes_namespace" "beads" { metadata { name = "beads-server" @@ -671,3 +687,274 @@ module "beadboard_ingress" { "gethomepage.dev/pod-selector" = "" } } + +# ── Beads auto-dispatch (dispatcher + reaper CronJobs) ── +# +# Flow: +# user: bd assign agent +# └──> CronJob: beads-dispatcher (every 2 min) +# 1. GET BeadBoard /api/agent-status — skip if claude-agent-service busy +# 2. bd query 'assignee=agent AND status=open' — pick highest priority +# 3. bd update -s in_progress (claim; next tick won't re-pick) +# 4. POST BeadBoard /api/agent-dispatch — reuses prompt-build + bearer flow +# 5. bd note "dispatched: job=" (or rollback + note on failure) +# +# CronJob: beads-reaper (every 10 min) +# └── for bead (assignee=agent, status=in_progress, updated_at > 30m): +# bd update -s blocked + bd note (recover from pod crashes mid-run) +# +# The claude-agent-service image ships bd + jq + curl — no separate image built. + +resource "kubernetes_config_map" "beads_metadata" { + metadata { + name = "beads-metadata" + namespace = kubernetes_namespace.beads.metadata[0].name + } + data = { + "metadata.json" = jsonencode({ + database = "dolt" + backend = "dolt" + dolt_mode = "server" + dolt_server_host = "${kubernetes_service.dolt.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local" + dolt_server_port = 3306 + dolt_server_user = "beads" + dolt_database = "code" + project_id = "a8f8bae7-ce65-4145-a5db-a13d11d297da" + }) + } +} + +locals { + claude_agent_service_image = "registry.viktorbarzin.me/claude-agent-service:${var.claude_agent_service_image_tag}" + beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local" + + beads_script_prelude = <<-EOT + set -euo pipefail + # bd with Dolt server mode needs metadata.json in a directory it can walk. + # ConfigMap mounts are read-only — copy to a writable location before use. + mkdir -p /tmp/.beads + cp /etc/beads-metadata/metadata.json /tmp/.beads/metadata.json + EOT +} + +resource "kubernetes_cron_job_v1" "beads_dispatcher" { + metadata { + name = "beads-dispatcher" + namespace = kubernetes_namespace.beads.metadata[0].name + } + spec { + schedule = "*/2 * * * *" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 60 + suspend = !var.beads_dispatcher_enabled + job_template { + metadata {} + spec { + backoff_limit = 0 + ttl_seconds_after_finished = 600 + template { + metadata { + labels = { + app = "beads-dispatcher" + } + } + spec { + restart_policy = "Never" + image_pull_secrets { + name = "registry-credentials" + } + container { + name = "dispatcher" + image = local.claude_agent_service_image + command = ["/bin/sh", "-c", <<-EOT + ${local.beads_script_prelude} + + BUSY=$(curl -sf "$${BEADBOARD_URL}/api/agent-status" | jq -r '.busy // false') + if [ "$BUSY" != "false" ]; then + echo "claude-agent-service is busy — skipping tick" + exit 0 + fi + + BEAD=$(bd --db /tmp/.beads query 'assignee=agent AND status=open' --json \ + | jq -r '[.[] | select(.acceptance_criteria and (.acceptance_criteria | length) > 0)] + | sort_by(.priority, .updated_at)[0].id // empty') + + if [ -z "$BEAD" ]; then + echo "no eligible beads (assignee=agent, status=open, has acceptance_criteria)" + exit 0 + fi + + echo "picked bead: $BEAD" + + bd --db /tmp/.beads update "$BEAD" -s in_progress + bd --db /tmp/.beads note "$BEAD" "auto-dispatcher claimed at $(date -u +%Y-%m-%dT%H:%M:%SZ)" + + RESP=$(curl -sS -w '\n%%{http_code}' -X POST \ + -H 'Content-Type: application/json' \ + -d "{\"taskId\":\"$BEAD\"}" \ + "$${BEADBOARD_URL}/api/agent-dispatch") + CODE=$(printf '%s' "$RESP" | tail -n1) + BODY=$(printf '%s' "$RESP" | sed '$d') + + if [ "$CODE" = "200" ]; then + JOB_ID=$(printf '%s' "$BODY" | jq -r '.job_id // "unknown"') + bd --db /tmp/.beads note "$BEAD" "dispatched: job=$JOB_ID" + echo "dispatched $BEAD as job $JOB_ID" + else + # Roll the claim back so the next tick can retry. + bd --db /tmp/.beads update "$BEAD" -s open + bd --db /tmp/.beads note "$BEAD" "dispatch failed HTTP $CODE: $BODY" + echo "dispatch FAILED for $BEAD: HTTP $CODE — $BODY" >&2 + exit 1 + fi + EOT + ] + env { + name = "BEADBOARD_URL" + value = local.beadboard_internal_url + } + env { + name = "API_BEARER_TOKEN" + value_from { + secret_key_ref { + name = "beadboard-agent-service" + key = "api_bearer_token" + } + } + } + env { + name = "BEADS_ACTOR" + value = "beads-dispatcher" + } + env { + name = "HOME" + value = "/tmp" + } + volume_mount { + name = "beads-metadata" + mount_path = "/etc/beads-metadata" + read_only = true + } + resources { + requests = { + cpu = "50m" + memory = "128Mi" + } + limits = { + memory = "256Mi" + } + } + } + volume { + name = "beads-metadata" + config_map { + name = kubernetes_config_map.beads_metadata.metadata[0].name + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + +resource "kubernetes_cron_job_v1" "beads_reaper" { + metadata { + name = "beads-reaper" + namespace = kubernetes_namespace.beads.metadata[0].name + } + spec { + schedule = "*/10 * * * *" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 60 + suspend = !var.beads_dispatcher_enabled + job_template { + metadata {} + spec { + backoff_limit = 0 + ttl_seconds_after_finished = 600 + template { + metadata { + labels = { + app = "beads-reaper" + } + } + spec { + restart_policy = "Never" + image_pull_secrets { + name = "registry-credentials" + } + container { + name = "reaper" + image = local.claude_agent_service_image + command = ["/bin/sh", "-c", <<-EOT + ${local.beads_script_prelude} + + THRESHOLD_MIN=30 + NOW=$(date -u +%s) + + bd --db /tmp/.beads query 'assignee=agent AND status=in_progress' --json \ + | jq -c '.[]' \ + | while read -r BEAD_JSON; do + ID=$(printf '%s' "$BEAD_JSON" | jq -r '.id') + LAST_UPDATE=$(printf '%s' "$BEAD_JSON" | jq -r '.updated_at') + # Alpine's busybox date lacks GNU -d; parse ISO-8601 with python3. + LAST_TS=$(python3 -c "from datetime import datetime; print(int(datetime.fromisoformat('$LAST_UPDATE'.replace('Z','+00:00')).timestamp()))") + AGE_MIN=$(( (NOW - LAST_TS) / 60 )) + if [ "$AGE_MIN" -gt "$THRESHOLD_MIN" ]; then + bd --db /tmp/.beads note "$ID" "reaper: no progress for $${AGE_MIN}m (threshold $${THRESHOLD_MIN}m) — blocking" + bd --db /tmp/.beads update "$ID" -s blocked + echo "REAPED $ID (stale $${AGE_MIN}m)" + else + echo "keeping $ID (age $${AGE_MIN}m < $${THRESHOLD_MIN}m)" + fi + done + EOT + ] + env { + name = "BEADS_ACTOR" + value = "beads-reaper" + } + env { + name = "HOME" + value = "/tmp" + } + volume_mount { + name = "beads-metadata" + mount_path = "/etc/beads-metadata" + read_only = true + } + resources { + requests = { + cpu = "50m" + memory = "128Mi" + } + limits = { + memory = "256Mi" + } + } + } + volume { + name = "beads-metadata" + config_map { + name = kubernetes_config_map.beads_metadata.metadata[0].name + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +}