claude-agent-service/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
Viktor Barzin be81005186
docs: capture AFK implementation pipeline design + ADRs 0002-0004
Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to
stop at design + spec: once an issue is triaged ready-for-agent, an agent
should implement it test-first, push it, and see it to a healthy deploy on
its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit, because
  T3 can only show sessions it launched itself -> we dispatch into it
  (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated
T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00

Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path

2026-06-02-parallel-execution-design.md deliberately rejected git worktrees and chose throwaway git clone --local per job, "because worktrees share one .git → agents that git commit/pull still contend — not truly independent". The AFK implementation pipeline (docs/2026-06-14-afk-implementation-pipeline-design.md) reverses that for its own path: each enrolled repo gets a persistent checkout, and each issue runs in a git worktree off it, on a shared SSD-NFS volume. This ADR records why the earlier rejection does not apply here — so the two decisions read as complementary, not contradictory.

Status

accepted (2026-06-14) — for the AFK implementation path only; the existing job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.

Why the 2026-06-02 rejection doesn't bind this path

The rejection's premise was concurrent jobs in the same checkout contending on .git/index.lock and racing git pull. The AFK pipeline's concurrency model is serial within a repo, parallel only across repos (ADR-adjacent decision in the design doc): at most one agent ever touches a given repo's .git at a time, and different repos are different checkouts. The contention the rejection guarded against cannot occur here. With that removed, worktrees become the better choice because they unlock cache reuse the throwaway model can't.
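
The "serial within a repo, parallel across repos" rule can be sketched as one lock per repo. This is a minimal in-process illustration only: the names `run_issue` and `dispatch` are hypothetical, the ADR does not specify the dispatcher's API, and a real control plane spanning several pods would need a cross-process lock instead of `threading.Lock`.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names throughout; not taken from claude-agent-service.
_repo_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()


def _lock_for(repo: str) -> threading.Lock:
    # Guard the registry so two threads never create separate locks for one repo.
    with _registry_lock:
        return _repo_locks[repo]


def run_issue(repo: str, issue: int, work) -> None:
    # Serial within a repo: hold that repo's lock for the whole job,
    # so at most one agent touches the repo's .git at a time.
    with _lock_for(repo):
        work(repo, issue)


def dispatch(jobs, work, max_workers: int = 4) -> None:
    # Parallel across repos: jobs for different repos take different locks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_issue, repo, issue, work) for repo, issue in jobs]
        for future in futures:
            future.result()  # re-raise any job failure
```

Holding the lock for the full job, rather than only around individual git commands, is what matches the premise above: no second agent can start in the same repo until the first has finished.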

Considered options

  • Persistent checkout + worktree per issue, on SSD-NFS (chosen). Warm git objects, persisted node_modules/venv/build caches, and shared package-manager caches survive across jobs, so the TDD loop stops reinstalling deps every run. This compounds with to-issues clustering many slices in one repo: because they are processed serially, slice N reuses slice 1's warm tree.
  • Throwaway git clone --local per job (status quo elsewhere) — rejected for this path: correct for the concurrent job-runner, but re-pays dependency install on every issue, which dominates wall-clock for an implement-test-fix-forward loop.
  • cp -a of a warm tree — rejected (same reason as 2026-06-02): copies accumulated caches → disk blowup, and no git isolation.
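
As a rough sketch of the chosen option, the per-issue preparation amounts to a fetch in the persistent checkout followed by `git worktree add` on a new branch. The directory layout, the `afk/issue-N` branch naming, and the `origin/master` start point are assumptions made for illustration, and whether T3's prepareWorktree issues equivalent commands is still an open question in this ADR.

```python
from pathlib import PurePosixPath


def worktree_commands(checkout: str, issue: int, base: str = "origin/master"):
    """Build the git commands that prepare one issue's worktree.

    Layout and branch naming are hypothetical, not taken from the design doc.
    """
    checkout_path = PurePosixPath(checkout)
    # Assumed layout: worktrees sit beside the persistent checkout.
    worktree = checkout_path.parent / "worktrees" / f"issue-{issue}"
    branch = f"afk/issue-{issue}"
    git = ["git", "-C", str(checkout_path)]
    return [
        # Refresh the persistent checkout's objects before branching.
        git + ["fetch", "--prune", "origin"],
        # Create the worktree on a new branch starting at the fetched base.
        git + ["worktree", "add", "-b", branch, str(worktree), base],
    ]
```

The function only builds argument lists, so a caller can log them or pass each one to `subprocess.run(cmd, check=True)`.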

Considered options — storage

  • SSD-NFS (chosen). The current /persistent PVC is 5Gi HDD NFS (nfs-truenas/srv/nfs) and unused; git checkouts + node_modules are death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed NFS class over /srv/nfs-ssd (other apps already use that path) at a realistic size (tens of GB).
  • HDD NFS / /persistent as-is — rejected: too slow for many small files, too small.
  • Local block (proxmox-lvm) — rejected: faster but HDD and node-pinned (RWO), lost on reschedule; NFS RWX survives and the volume also holds session state.

Consequences

  • One SSD-NFS volume holds, per enrolled repo: the persistent checkout, the warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache env (pip, GOMODCACHE/GOCACHE, PNPM_HOME/npm, cargo) must be wired to it — today caching is off (pip --no-cache-dir, no cache envs set).
  • Housekeeping the throwaway model didn't need: git fetch before each worktree add, periodic git worktree prune + git gc, and cache eviction if the volume fills.
  • infra stays on its own path: it uses git-crypt, and editing encrypted files from a worktree is disallowed. The persistent-worktree model is for the non-infra app repos in the allowlist.
  • Open reconciliation (pilot): whether T3's native prepareWorktree writes into this volume + our persistent checkouts, or we manage the checkout and point T3 at it. Resolve before committing the architecture.
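
The cache wiring and housekeeping described above could look roughly like the following. The `cache/` directory layout under the volume root and both function names are hypothetical. The environment variable names are the standard ones for pip, Go, npm, and cargo, plus PNPM_HOME as named in this ADR. Cache eviction is left out because the ADR does not define a policy.

```python
import os
from pathlib import PurePosixPath


def cache_env(volume_root: str, base_env=None):
    """Return an environment that points tool caches at the shared volume."""
    root = PurePosixPath(volume_root) / "cache"  # assumed layout
    env = dict(os.environ if base_env is None else base_env)
    # PIP_NO_CACHE_DIR is the env form of pip --no-cache-dir; drop it
    # so that PIP_CACHE_DIR below actually takes effect.
    env.pop("PIP_NO_CACHE_DIR", None)
    env.update({
        "PIP_CACHE_DIR": str(root / "pip"),
        "GOMODCACHE": str(root / "go" / "mod"),
        "GOCACHE": str(root / "go" / "build"),
        "npm_config_cache": str(root / "npm"),
        "PNPM_HOME": str(root / "pnpm"),
        "CARGO_HOME": str(root / "cargo"),
    })
    return env


def housekeeping_commands(checkout: str):
    """Periodic maintenance for one persistent checkout."""
    git = ["git", "-C", checkout]
    return [
        git + ["worktree", "prune"],  # drop records of deleted worktrees
        git + ["gc"],                 # repack accumulated objects
    ]
```

Passing the result of `cache_env` as the `env` argument when launching an agent's subprocess keeps the wiring in one place, so the cache location can change without touching each job definition.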