parallel execution: replace single-flight lock with bounded semaphore + per-job workspace

Multiple agent calls now run concurrently, each in its own isolated git
checkout (local clone of the warm base, hardlinked objects, git-crypt
re-unlocked), so concurrent jobs never share a working tree.

- execution_lock (asyncio.Lock) -> execution_semaphore (default MAX_CONCURRENCY=10);
  excess calls queue FIFO instead of 409/503. MAX_QUEUE_DEPTH safety valve.
- /execute never returns 409; jobs go queued -> running. Timeout covers
  execution only, not queue wait.
- /v1/chat/completions queues for a slot instead of 503-busy.
- /health: busy = at-capacity, plus active/queued/capacity fields.
- per-job workspace prepare/cleanup under a short git lock; the agent run holds none.
- in-memory job registry evicted past JOB_TTL_SECONDS.

Design: docs/2026-06-02-parallel-execution-design.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-02 20:57:41 +00:00

Parallel, independent execution — design

Date: 2026-06-02
Status: approved, in implementation
Scope: claude-agent-service. Remove the single-flight execution lock so multiple agent calls run concurrently, each in its own isolated workspace.

Problem

Today a single global asyncio.Lock (execution_lock) serializes every agent invocation:

POST /execute returns 409 Agent is busy when a job is in flight.
POST /v1/chat/completions returns 503 agent is busy likewise.
All calls run claude -p with cwd=/workspace/infra — one shared working tree, git pull --rebase'd before each call.

The lock exists because two claude -p processes in the same working tree would clobber each other's file edits and git state (.git/index.lock contention, racing git pull --rebase).

Goal

Run calls in parallel, each fully independent of the others, without the git/file collisions that the lock currently prevents — on a single pod (replicas=1), keeping the in-memory job registry coherent for /jobs/{id} polling.

Design

Workspace isolation — per-job local clone

Each job gets its own git checkout so file edits and git operations never touch another job's state:

A warm base clone lives at /workspace/base (created by the existing init container; renamed from /workspace/infra), git-crypt-unlocked.
Per job, under a short-held git_lock:
- Debounced git fetch origin && git reset --hard origin/master on the base (skipped if fetched within FETCH_DEBOUNCE_SECONDS) so bursts share one network fetch.
- git clone --local /workspace/base /workspace/jobs/<id> — objects are hardlinked (near-free disk, no .terraform carried since clone takes tracked content only).
- Re-point origin to the GitHub URL and git-crypt unlock <key> in the job dir.
The job runs claude -p with cwd=/workspace/jobs/<id> holding no lock.
finally → rm -rf /workspace/jobs/<id>.
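
The per-job steps above could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `WorkspaceManager`, the injected `run` callable, the `origin_url` and `key_path` parameters, and the 30-second debounce value are all assumptions made for the example; only the git commands, the paths, and the `git_lock` / `FETCH_DEBOUNCE_SECONDS` names come from the design.

```python
import asyncio
import time

# Assumed value: the design names FETCH_DEBOUNCE_SECONDS but gives no default.
FETCH_DEBOUNCE_SECONDS = 30.0


class WorkspaceManager:
    """Hypothetical helper. `run` is an injected async callable(argv, cwd)."""

    def __init__(self, run, origin_url, key_path,
                 base="/workspace/base", jobs_root="/workspace/jobs",
                 clock=time.monotonic):
        self._run = run
        self._origin_url = origin_url
        self._key_path = key_path
        self._base = base
        self._jobs_root = jobs_root
        self._clock = clock
        self._git_lock = asyncio.Lock()
        self._last_fetch = None  # clock reading at the last base refresh

    async def prepare(self, job_id):
        job_dir = f"{self._jobs_root}/{job_id}"
        async with self._git_lock:  # held only for the short setup
            now = self._clock()
            stale = (self._last_fetch is None
                     or now - self._last_fetch >= FETCH_DEBOUNCE_SECONDS)
            if stale:  # bursts of jobs share one network fetch
                await self._run(["git", "fetch", "origin"], self._base)
                await self._run(["git", "reset", "--hard", "origin/master"], self._base)
                self._last_fetch = now
            # A local clone hardlinks objects and carries tracked content only.
            await self._run(["git", "clone", "--local", self._base, job_dir], self._base)
            await self._run(["git", "remote", "set-url", "origin", self._origin_url], job_dir)
            await self._run(["git-crypt", "unlock", self._key_path], job_dir)
        return job_dir

    async def cleanup(self, job_id):
        async with self._git_lock:  # teardown is also short
            await self._run(["rm", "-rf", f"{self._jobs_root}/{job_id}"], self._jobs_root)
```

A caller would wrap the agent run in `try` / `finally`, calling `prepare()` before and `cleanup()` after, so that the lock is never held while `claude -p` executes.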

git_lock is held only for the fast setup/teardown (under roughly 2 s); execution is fully parallel. Rejected alternatives: git worktree (shares one .git, so agents that git commit/pull still contend and are not truly independent) and cp -a (copies accumulated .terraform provider caches, causing disk blowup).

Distinct cwd per job also isolates Claude CLI per-project state (~/.claude/projects/<cwd-hash>/). The long-lived CLAUDE_CODE_OAUTH_TOKEN avoids credential-file write races in the shared ~/.claude.

Concurrency model

execution_semaphore = asyncio.Semaphore(MAX_CONCURRENCY) replaces execution_lock. Default MAX_CONCURRENCY=10 ("soft-unbounded").
Requests beyond the limit queue FIFO (asyncio fairness) — they are not rejected.
MAX_QUEUE_DEPTH safety valve (default 100): if active + queued exceeds it, reject (429 on /execute, 503 on chat) to bound memory.
A concurrency_slot() async context manager wraps acquire/release and keeps inflight_active / inflight_queued counters for /health.
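
One way the semaphore, counters, and queue-depth valve might fit together is sketched below. The `SlotTracker` class, the `QueueFull` exception, and the `health()` helper are hypothetical names for illustration, and the `"ok"` status value is an assumption; the sketch also reads "exceeds" as "admitting one more request would push active + queued past MAX_QUEUE_DEPTH".

```python
import asyncio
from contextlib import asynccontextmanager

MAX_CONCURRENCY = 10   # default from the design
MAX_QUEUE_DEPTH = 100  # default from the design


class QueueFull(Exception):
    """Hypothetical error; a handler would map it to 429 or 503."""


class SlotTracker:
    def __init__(self, max_concurrency=MAX_CONCURRENCY,
                 max_queue_depth=MAX_QUEUE_DEPTH):
        self.capacity = max_concurrency
        self.max_queue_depth = max_queue_depth
        self.execution_semaphore = asyncio.Semaphore(max_concurrency)
        self.inflight_active = 0
        self.inflight_queued = 0

    @asynccontextmanager
    async def concurrency_slot(self):
        # Safety valve: refuse instead of queueing without bound.
        if self.inflight_active + self.inflight_queued >= self.max_queue_depth:
            raise QueueFull()
        self.inflight_queued += 1
        try:
            await self.execution_semaphore.acquire()
        finally:
            # Runs on success and on cancellation while waiting.
            self.inflight_queued -= 1
        self.inflight_active += 1
        try:
            yield
        finally:
            self.inflight_active -= 1
            self.execution_semaphore.release()

    def health(self):
        # Field names follow the /health shape in the design; "ok" is assumed.
        return {
            "status": "ok",
            "busy": self.inflight_active >= self.capacity,
            "active": self.inflight_active,
            "queued": self.inflight_queued,
            "capacity": self.capacity,
        }
```

Because the counters are only touched between awaits on a single event loop, no extra locking is needed around them in this sketch.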

Endpoint behavior

Endpoint	Before	After
`POST /execute`	`202` or `409` busy	`202` always (unless queue full → `429`); job `status="queued"` until a slot frees, then `running`. Timeout clock starts on execution, not queue-wait.
`POST /v1/chat/completions`	`200` or `503` busy	queues for a slot (caller waits, bounded by the 900 s timeout); still `503` on execution failure/timeout or if queue full
`GET /jobs/{id}`	unchanged	unchanged (can now report `queued`)
`GET /health`	`{status, busy=lock.locked()}`	`{status, busy=(active>=capacity), active, queued, capacity}` — keeps BeadBoard `/api/agent-status` + beads-dispatcher working
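
The "timeout clock starts on execution" rule for `/execute` can be illustrated with a small sketch. The `run_job` function, the `work` callable, and the dict-shaped job record are assumptions for the example; the status strings are the ones the design uses.

```python
import asyncio


async def run_job(job, semaphore, work, timeout):
    """Hypothetical job runner: queue wait is untimed, execution is timed."""
    job["status"] = "queued"
    async with semaphore:  # waiting here does not count against the timeout
        job["status"] = "running"
        try:
            # The timeout clock starts only once a slot has been acquired.
            job["result"] = await asyncio.wait_for(work(), timeout)
            job["status"] = "completed"
        except asyncio.TimeoutError:
            job["status"] = "timeout"
    return job
```

With one slot and two jobs that each take slightly more than half the timeout, the second job spends longer than the timeout between submission and completion, yet still completes, because only its execution time is measured.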

Housekeeping

Job eviction: completed/failed/timeout/error jobs older than JOB_TTL_SECONDS (default 3600) are evicted; the in-memory jobs dict currently grows unbounded and parallelism increases churn.
Pod restart still loses in-flight jobs (pre-existing; out of scope — no shared store, matching the in-pod decision).
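
A minimal sketch of the eviction pass, assuming each job record is a dict with a `status` and a `finished_at` epoch timestamp (the `finished_at` field and the `evict_expired_jobs` name are hypothetical):

```python
import time

JOB_TTL_SECONDS = 3600  # default from the design
TERMINAL_STATUSES = {"completed", "failed", "timeout", "error"}


def evict_expired_jobs(jobs, now=None, ttl=JOB_TTL_SECONDS):
    """Delete terminal jobs older than ttl; return the evicted job ids."""
    now = time.time() if now is None else now
    expired = [
        job_id
        for job_id, job in jobs.items()
        # Queued and running jobs are never evicted, whatever their age.
        if job["status"] in TERMINAL_STATUSES and now - job["finished_at"] > ttl
    ]
    for job_id in expired:
        del jobs[job_id]
    return expired
```

Collecting the ids first and deleting afterwards avoids mutating the dict while iterating over it. Such a pass could run on each job submission or on a periodic timer; the design does not say which.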

Infra (`infra/stacks/claude-agent-service/main.tf`)

Mount the existing git-crypt-key configmap into the main container (today only the init container has it) — needed for per-job unlock.
Pod memory: request 2Gi, limit 12Gi (Burstable, tier-aux); CPU request 1, no CPU limit. Fits node2/3/5 headroom (~22–26 GB free).
Wire the MAX_CONCURRENCY env var. Rename the init-container clone target to /workspace/base and point WORKSPACE_DIR at the base path.
replicas=1, Recreate unchanged.

Blast radius (verified)

All callers handle the busy responses gracefully or fail safely, so removing them is safe:

n8n DIUN (/execute) — rate-limited 5/6h, no retry; 409 was rare.
payslip-ingest (/execute+poll) — 90× retry; big win from parallelism.
recruiter-responder (/execute+poll) — returns busy, OpenClaw retries.
fire-planner (/v1/chat/completions) — client-side semaphore; can be relaxed after this.
BeadBoard (/execute) — UI shows busy via /api/agent-status (/health).
beads-dispatcher CronJob — gates on /health busy; 2-min tick.

Testing (TDD)

Rewrite test_execute_respects_sequential_lock and test_chat_completions_returns_503_when_agent_busy (they encode the removed behavior). New tests: two concurrent /execute both run; safety-queue at MAX_CONCURRENCY=2; concurrent chat-completions both run; /health capacity fields; per-job distinct workspace cwd; timeout excludes queue-wait; job eviction; queue-depth rejection. An autouse fixture resets the semaphore, counters, and jobs between tests.

Docs to update (same change)

infra/docs/architecture/automated-upgrades.md, infra/docs/runbooks/beads-auto-dispatch.md, infra/AGENTS.md, root CLAUDE.md — all currently describe "sequential / single-slot".
