Record the architecture for moving code implementation AFK, decided in a design/grilling session.

The owner wants the human-in-the-loop boundary to stop at design + spec: once an issue is triaged ready-for-agent, an agent should implement it test-first, push it, and see it to a healthy deploy on its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety); a dedicated in-cluster T3 Code instance is the executor + cockpit, because T3 can only show sessions it launched itself -> we dispatch into it (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards then freezes the broken state for forensics rather than reverting (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on SSD-NFS for warm caches, reversing the throwaway-clone rule for this path because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
2026-06-02-parallel-execution-design.md deliberately rejected git worktrees
and chose throwaway git clone --local per job, "because worktrees share one
.git → agents that git commit/pull still contend — not truly independent".
The AFK implementation pipeline
(docs/2026-06-14-afk-implementation-pipeline-design.md) reverses that for its
own path: each enrolled repo gets a persistent checkout, and each issue
runs in a git worktree off it, on a shared SSD-NFS volume. This ADR
records why the earlier rejection does not apply here — so the two decisions
read as complementary, not contradictory.
Status
accepted (2026-06-14) — for the AFK implementation path only; the existing job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
Why the 2026-06-02 rejection doesn't bind this path
The rejection's premise was concurrent jobs in the same checkout contending
on .git/index.lock and racing git pull. The AFK pipeline's concurrency model
is serial within a repo, parallel only across repos (ADR-adjacent decision in
the design doc): at most one agent ever touches a given repo's .git at a time,
and different repos are different checkouts. The contention the rejection guarded
against cannot occur here. With that removed, worktrees become the better
choice because they unlock cache reuse the throwaway model can't.
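The "serial within a repo, parallel across repos" rule can be illustrated with one lock per repo. This is only a minimal sketch: `RepoSerialDispatcher` and its methods are hypothetical names, not part of claude-agent-service, and the real poller may enforce the rule differently (for example, with a queue per repo).

```python
import threading
from collections import defaultdict
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class RepoSerialDispatcher:
    """Hypothetical dispatcher: parallel across repos, one job at a time per repo."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, repo: str) -> threading.Lock:
        # Guard the dict so two threads never create two locks for one repo.
        with self._locks_guard:
            return self._locks[repo]

    def submit(self, repo: str, job: Callable[..., Any], *args: Any) -> Future:
        def run() -> Any:
            # Only the holder of this repo's lock may touch its .git.
            with self._lock_for(repo):
                return job(*args)

        return self._pool.submit(run)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

A plain lock guarantees mutual exclusion but not first-in-first-out ordering within a repo; if slice order matters, a per-repo queue would be the safer choice.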
Considered options
- Persistent checkout + worktree per issue, on SSD-NFS (chosen). Warm git objects, persisted `node_modules`/venv/build caches, and shared package-manager caches survive across jobs, so the TDD loop stops reinstalling deps every run. Compounds with `to-issues` clustering many slices in one repo, processed serially: slice N reuses slice 1's warm tree.
- Throwaway `git clone --local` per job (status quo elsewhere). Rejected for this path: correct for the concurrent job-runner, but re-pays dependency install on every issue, which dominates wall-clock for an implement-test-fix-forward loop.
- `cp -a` of a warm tree. Rejected (same reason as 2026-06-02): copies accumulated caches → disk blowup, and no git isolation.
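The chosen option amounts to two git invocations per issue against the persistent checkout. The sketch below only builds the command lists. The directory layout (`checkouts/`, `worktrees/`), the `afk/issue-N` branch name, and the function name are assumptions for illustration; whether T3 or our own code creates the worktree is still open.

```python
from pathlib import PurePosixPath


def worktree_plan(
    volume_root: PurePosixPath,
    repo: str,
    issue: int,
    base: str = "origin/master",
) -> list[list[str]]:
    """Return the git commands that prepare a worktree for one issue."""
    checkout = volume_root / "checkouts" / repo  # hypothetical layout
    worktree = volume_root / "worktrees" / repo / f"issue-{issue}"
    branch = f"afk/issue-{issue}"  # hypothetical branch naming
    return [
        # Refresh the persistent checkout so the worktree starts from current master.
        ["git", "-C", str(checkout), "fetch", "--prune", "origin"],
        # Create a new branch and its worktree from the freshly fetched base.
        ["git", "-C", str(checkout), "worktree", "add", "-b", branch, str(worktree), base],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Because git objects are shared with the persistent checkout, only the working files are new per issue.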
Considered options — storage
- SSD-NFS (chosen). The current `/persistent` PVC is `5Gi` HDD NFS (`nfs-truenas` → `/srv/nfs`) and unused; git checkouts + `node_modules` are death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic size (tens of GB).
- HDD NFS / `/persistent` as-is. Rejected: too slow for many small files, too small.
- Local block (proxmox-lvm). Rejected: faster but HDD and node-pinned (RWO), lost on reschedule; NFS RWX survives and the volume also holds session state.
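The "many small files" claim can be checked on candidate mounts before provisioning. The following is a rough sketch, not a rigorous benchmark: the function name and default counts are assumptions, and it measures only sequential small writes, not reads or metadata-heavy operations such as `git status`.

```python
import tempfile
import time
from pathlib import Path


def small_file_write_seconds(target_dir: Path, count: int = 2000, size: int = 1024) -> float:
    """Time writing `count` files of `size` bytes into a scratch dir under target_dir."""
    payload = b"x" * size
    # The scratch directory is removed on exit, so the mount is left clean.
    with tempfile.TemporaryDirectory(dir=target_dir) as scratch:
        start = time.perf_counter()
        for i in range(count):
            (Path(scratch) / f"f{i}").write_bytes(payload)
        return time.perf_counter() - start
```

Running it once against a directory on `/srv/nfs` and once against `/srv/nfs-ssd` from inside a pod gives a like-for-like comparison of the two storage options.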
Consequences
- One SSD-NFS volume holds, per enrolled repo: the persistent checkout, the warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache env (pip, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it; today caching is off (`pip --no-cache-dir`, no cache envs set).
- Housekeeping the throwaway model didn't need: `git fetch` before each `worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if the volume fills.
- `infra` stays on its own path: it uses git-crypt, and editing encrypted files from a worktree is disallowed; the persistent-worktree model is for the non-`infra` app repos in the allowlist.
- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into this volume + our persistent checkouts, or we manage the checkout and point T3 at it. Resolve before committing the architecture.
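The cache wiring and periodic housekeeping listed above could look roughly like this. It is a sketch under assumptions: the `caches/` layout and both function names are hypothetical, and mapping pip, npm, and cargo to `PIP_CACHE_DIR`, `npm_config_cache`, and `CARGO_HOME` is one plausible reading of "cache env", not something the design doc specifies.

```python
from pathlib import PurePosixPath


def cache_env(volume_root: PurePosixPath) -> dict[str, str]:
    """Environment variables pointing tool caches at the shared volume."""
    caches = volume_root / "caches"  # hypothetical layout on the SSD-NFS volume
    return {
        "PIP_CACHE_DIR": str(caches / "pip"),  # also requires dropping --no-cache-dir
        "GOMODCACHE": str(caches / "go" / "mod"),
        "GOCACHE": str(caches / "go" / "build"),
        "PNPM_HOME": str(caches / "pnpm"),
        "npm_config_cache": str(caches / "npm"),
        "CARGO_HOME": str(caches / "cargo"),
    }


def housekeeping_commands(checkout: PurePosixPath) -> list[list[str]]:
    """Periodic maintenance for one persistent checkout."""
    git = ["git", "-C", str(checkout)]
    return [
        git + ["worktree", "prune"],  # drop records of worktrees whose dirs are gone
        git + ["gc"],  # repack objects accumulated across many issues
    ]
```

The dict would be merged into the agent process environment before the TDD loop starts. Cache eviction is left out because its policy (size cap, age, per-repo quota) is not decided in this ADR.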