claude-agent-service/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
Viktor Barzin be81005186
docs: capture AFK implementation pipeline design + ADRs 0002-0004
Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to
stop at design + spec: once an issue is triaged ready-for-agent, an agent
should implement it test-first, push it, and see it to a healthy deploy on
its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit, because
  T3 can only show sessions it launched itself -> we dispatch into it
  (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated
T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00


# Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
`2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
and chose throwaway `git clone --local` per job, "because worktrees share one
`.git` → agents that `git commit`/`pull` still contend — not truly independent".
The AFK implementation pipeline
(`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
own path**: each enrolled repo gets a **persistent checkout**, and each issue
runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
records why the earlier rejection does not apply here — so the two decisions
read as complementary, not contradictory.
## Status
accepted (2026-06-14) — for the AFK implementation path only; the existing
job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
## Why the 2026-06-02 rejection doesn't bind this path
The rejection's premise was **concurrent jobs in the same checkout** contending
on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
is **serial within a repo, parallel only across repos** (ADR-adjacent decision in
the design doc): at most one agent ever touches a given repo's `.git` at a time,
and different repos are different checkouts. The contention the rejection guarded
against cannot occur here. With that removed, worktrees become the *better*
choice because they unlock cache reuse the throwaway model can't.
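
The "serial within a repo, parallel only across repos" rule can be illustrated with one lock per repo. This is only a sketch under stated assumptions: the real poller may enforce the rule with a queue or a claim on the issue rather than in-process locks, and every name below (`run_serial_within_repo`, `max_overlap_per_repo`, the repo names) is hypothetical.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-process scheduler: one lock per repo, so jobs for the same
# repo run one at a time while jobs for different repos may overlap.
_repo_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def run_serial_within_repo(repo, job):
    """Run job() while holding the lock that belongs to `repo`."""
    with _registry_guard:  # guard creation of the per-repo lock
        lock = _repo_locks[repo]
    with lock:
        return job()


def max_overlap_per_repo(issues, workers=4):
    """Dispatch (repo, issue) pairs; return peak concurrent jobs seen per repo."""
    active = defaultdict(int)
    peak = defaultdict(int)
    counter_guard = threading.Lock()

    def make_job(repo):
        def job():
            with counter_guard:
                active[repo] += 1
                peak[repo] = max(peak[repo], active[repo])
            time.sleep(0.01)  # stand-in for the agent's work in the repo
            with counter_guard:
                active[repo] -= 1
        return job

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_serial_within_repo, repo, make_job(repo))
            for repo, _issue in issues
        ]
        for future in futures:
            future.result()  # re-raise any job failure
    return dict(peak)
```

With several issues queued against the same repo, the peak stays at one job per repo, which is the property that keeps `.git/index.lock` contention from arising.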
## Considered options
- **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
objects, **persisted `node_modules`/venv/build caches**, and shared
package-manager caches survive across jobs, so the TDD loop stops reinstalling
deps every run. The benefit compounds when `to-issues` clusters many slices in
one repo: processed serially, slice N reuses slice 1's warm tree.
- **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
this path: correct for the concurrent job-runner, but re-pays dependency
install on every issue, which dominates wall-clock for an
implement-test-fix-forward loop.
- **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
accumulated caches → disk blowup, and no git isolation.
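
A minimal sketch of what the chosen option might look like per issue, expressed as the git commands a dispatcher could run. The directory layout (`checkouts/`, `worktrees/`), the `afk/issue-N` branch naming, the `/mnt/agent-cache` mount point, and the `origin/master` base ref are all assumptions for illustration; none of them is fixed by this ADR.

```python
from pathlib import PurePosixPath


def worktree_commands(volume, repo, issue, base_ref="origin/master"):
    """Build the git commands that prepare one worktree for one issue.

    Layout, branch naming, and base_ref are hypothetical placeholders.
    """
    checkout = PurePosixPath(volume) / "checkouts" / repo
    worktree = PurePosixPath(volume) / "worktrees" / repo / f"issue-{issue}"
    branch = f"afk/issue-{issue}"
    return [
        # Refresh the persistent checkout so the worktree starts from a current base.
        ["git", "-C", str(checkout), "fetch", "--prune", "origin"],
        # Create a new branch and check it out in its own worktree directory.
        ["git", "-C", str(checkout), "worktree", "add",
         "-b", branch, str(worktree), base_ref],
    ]


if __name__ == "__main__":
    for cmd in worktree_commands("/mnt/agent-cache", "my-app", 42):
        print(" ".join(cmd))
```

Returning argument lists instead of running the commands keeps the sketch side-effect free; a caller could pass each list to `subprocess.run(cmd, check=True)`.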
## Considered options — storage
- **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
(`nfs-truenas`, `/srv/nfs`) and unused; git checkouts + `node_modules` are
death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
size (tens of GB).
- **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
too small.
- **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
lost on reschedule; NFS RWX survives and the volume also holds session state.
## Consequences
- One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
— today caching is off (`pip --no-cache-dir`, no cache envs set).
- Housekeeping the throwaway model didn't need: `git fetch` before each
`worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
the volume fills.
- **`infra` stays on its own path**: it uses git-crypt, and editing encrypted
files from a worktree is disallowed; the persistent-worktree model is for the
non-`infra` app repos in the allowlist.
- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
this volume + our persistent checkouts, or we manage the checkout and point T3
at it. Resolve before committing the architecture.
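
The cache wiring and housekeeping listed above could be sketched as follows. This is an illustration under stated assumptions: the `cache/` subdirectory layout and the `/mnt/agent-cache` mount point are hypothetical, and the environment variable names are the standard ones for each tool (`PIP_CACHE_DIR`, `GOMODCACHE`, `GOCACHE`, `npm_config_cache`, `CARGO_HOME`) plus `PNPM_HOME` as named in this ADR. `PIP_CACHE_DIR` only helps once `--no-cache-dir` is dropped from the pip invocations.

```python
from pathlib import PurePosixPath


def cache_env(volume):
    """Map package-manager cache variables onto the shared volume.

    The cache/ layout is a hypothetical placeholder, not a decided structure.
    """
    root = PurePosixPath(volume) / "cache"
    return {
        "PIP_CACHE_DIR": str(root / "pip"),
        "GOMODCACHE": str(root / "go" / "mod"),
        "GOCACHE": str(root / "go" / "build"),
        "PNPM_HOME": str(root / "pnpm"),
        "npm_config_cache": str(root / "npm"),
        "CARGO_HOME": str(root / "cargo"),
    }


def housekeeping_commands(checkout):
    """Periodic maintenance for one persistent checkout."""
    return [
        # Drop bookkeeping for worktrees whose directories no longer exist.
        ["git", "-C", checkout, "worktree", "prune"],
        # Repack and clean up unreachable objects in the shared object store.
        ["git", "-C", checkout, "gc"],
    ]


if __name__ == "__main__":
    for name, value in cache_env("/mnt/agent-cache").items():
        print(f"{name}={value}")
```

Cache eviction when the volume fills is left out on purpose, since the ADR does not specify a policy for it.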