Viktor Barzin f0948493b3 claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY)

The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
  for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot

Service code: viktor/claude-agent-service @ 66104a3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-03 10:24:24 +00:00

7.3 KiB

Raw Blame History

Beads Auto-Dispatch Runbook

Users can hand work to the headless beads-task-runner agent by assigning a bead to the sentinel user agent. Two CronJobs in the beads-server namespace drive the pipeline:

beads-dispatcher — every 2 min: picks up the highest-priority assignee=agent/status=open bead with non-empty acceptance criteria, claims it by flipping to in_progress, and POSTs it to BeadBoard's /api/agent-dispatch. BeadBoard forwards to claude-agent-service with the existing bearer-token flow.
beads-reaper — every 10 min: flips any assignee=agent + status=in_progress bead whose updated_at is older than 30 min to status=blocked with an explanatory note. Catches pod crashes mid-run.

The manual BeadBoard Dispatch button continues to work in parallel.

Flow diagram

  user: bd assign <id> agent
         │
         ▼
  Dolt @ dolt.beads-server.svc:3306  ◄──── every 2 min ────┐
         │                                                  │
         ▼                                                  │
  CronJob: beads-dispatcher                                 │
    1. GET beadboard/api/agent-status  (busy?)              │
    2. bd query 'assignee=agent AND status=open'            │
    3. bd update -s in_progress   (claim)                   │
    4. POST beadboard/api/agent-dispatch                    │
    5. bd note "dispatched: job=…"                          │
         │                                                  │
         ▼                                                  │
  claude-agent-service /execute                             │
    beads-task-runner agent runs; notes/closes bead         │
         │                                                  │
         ▼                                                  │
  done  ──► next tick picks up the next bead ───────────────┘


  CronJob: beads-reaper  (every 10 min)
    for bead (assignee=agent, status=in_progress, updated_at > 30 min):
      bd note   "reaper: no progress for Nm — blocking"
      bd update -s blocked

Usage

Hand a bead to the agent

bd create "Title" \
    -d "Full context — files, services, error messages. Any agent with no prior context must be able to execute this." \
    --acceptance "Concrete, verifiable criteria" \
    -p 2
bd assign <new-id> agent

Acceptance criteria is required. Beads without it are skipped by the dispatcher and stay in open forever. This is intentional — the beads-task-runner agent expects clear done conditions.

Take a bead back (unassign)

bd assign <id> ""

If the bead is already in_progress, also reset it:

bd update <id> -s open

Pause auto-dispatch

cd infra/stacks/beads-server
scripts/tg apply -var=beads_dispatcher_enabled=false

This sets spec.suspend: true on both CronJobs. Existing running jobs continue; no new ticks fire. Re-enable by re-applying with beads_dispatcher_enabled=true (the default). Manual BeadBoard Dispatch remains available while paused.

Read the logs

# Recent dispatcher runs
kubectl -n beads-server get jobs --selector=job-name --sort-by=.metadata.creationTimestamp | grep beads-dispatcher | tail
kubectl -n beads-server logs job/<dispatcher-job-name>

# Tail the underlying agent once a bead dispatches
kubectl -n claude-agent logs -l app=claude-agent-service -f

# Inspect reaper decisions
kubectl -n beads-server get jobs | grep beads-reaper | tail
kubectl -n beads-server logs job/<reaper-job-name>

Inspect a specific bead's dispatch history

bd show <id> --json | jq '{status, assignee, notes, updated_at}'

Both the dispatcher and reaper write dated notes (auto-dispatcher claimed at…, dispatched: job=…, reaper: no progress for…) so the audit trail lives on the bead itself.

Reaper semantics — when a bead becomes `blocked`

The reaper flips a bead to blocked if:

assignee = agent, AND
status = in_progress, AND
updated_at is more than 30 minutes in the past.

Every bd note bumps updated_at, so a well-behaved beads-task-runner agent never trips the reaper — it notes progress as it works. A blocked bead is a signal that:

the agent pod crashed mid-run (kubectl -n claude-agent delete pod test),
the job hit its 15-minute budget timeout inside claude-agent-service without notes (rare — the agent usually notes failure before exiting),
claude-agent-service was restarted during the run (in-memory job state is lost; see known risks).

Recovery: read the reaper note, reopen manually if appropriate:

bd update <id> -s open
bd assign <id> agent     # re-arm for next dispatcher tick

Design choices

Sentinel assignee agent — free-form, no Beads schema change. Any bd client can set it (bd assign <id> agent).
One-bead-per-tick dispatch — the dispatcher submits at most one bead per 2-min tick, gating on claude-agent-service's /health busy flag. busy now means active >= capacity (bounded semaphore, default 10) — the service no longer single-flight-locks via asyncio.Lock. So up to ~capacity beads can run concurrently; the 2-min poll cadence (not single-slot execution) now bounds ramp-up.
Fixed agent (beads-task-runner) — read-only rails, matches BeadBoard's manual Dispatch button. Broader-privilege agents stay manual.
CronJob (not in-service polling, not n8n) — matches existing infra pattern (OpenClaw task-processor, certbot-renewal, backups), TF-managed, easy to pause.
ConfigMap-mounted metadata.json — declarative TF rather than reusing the image-seeded file. The CronJob's init step copies it into /tmp/.beads/ because bd may touch the parent directory and ConfigMap mounts are read-only.

Known risks

In-memory job state in claude-agent-service — if the pod restarts mid-run, the job record is lost. The reaper catches this after 30 min. Persistent job store is deferred.
Prompt injection via bead fields — a malicious bead description could try to steer the agent. The beads-task-runner rails + token budget + timeout are the defense. Identical exposure as the manual Dispatch button.
Image tag drift — claude_agent_service_image_tag in stacks/beads-server/main.tf mirrors local.image_tag in stacks/claude-agent-service/main.tf. Bump both when the image rebuilds, or the dispatcher/reaper will run on an older layer. (They only need bd, curl, jq — stable across rebuilds — so the drift is low-risk.)
bd JSON schema changes — the reaper's jq reads .id and .updated_at. If a future bd upgrade renames these, the reaper breaks silently (no reaping, no alert). BD_VERSION is pinned in the image Dockerfile.

Verification after change

# Both CronJobs exist with the right schedule / SUSPEND state
kubectl -n beads-server get cronjob

# End-to-end smoke test
bd create "auto-dispatch smoke test" \
    -d "Read /etc/hostname inside the agent sandbox and close." \
    --acceptance "bd note includes 'hostname=' and bead is closed."
bd assign <new-id> agent
# within 2 min:
bd show <new-id> --json | jq '.notes'
# → contains 'auto-dispatcher claimed' + 'dispatched: job=<uuid>'

7.3 KiB Raw Blame History