Some checks are pending
Record the architecture for moving code implementation AFK, decided in a design/grilling session. The owner wants the human-in-the-loop boundary to stop at design + spec: once an issue is triaged ready-for-agent, an agent should implement it test-first, push it, and see it to a healthy deploy on its own, escalating only when it can't proceed. Decisions captured: - claude-agent-service is the control plane (poller + watcher + safety); a dedicated in-cluster T3 Code instance is the executor + cockpit, because T3 can only show sessions it launched itself -> we dispatch into it (ADR 0003). - AFK code pushes straight to master; on a broken deploy it fix-forwards then freezes the broken state for forensics rather than reverting (ADR 0002). - Implementation agents use persistent per-repo checkouts + git worktrees on SSD-NFS for warm caches, reversing the throwaway-clone rule for this path because concurrency is serial-within-repo (ADR 0004). Pilot-gated: five integration unknowns must be validated against a dedicated T3 instance before the poller is wired. No code yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
68 lines
3.8 KiB
Markdown
68 lines
3.8 KiB
Markdown
# Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
|
|
|
|
`2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
|
|
and chose throwaway `git clone --local` per job, "because worktrees share one
|
|
`.git` → agents that `git commit`/`pull` still contend — not truly independent".
|
|
The AFK implementation pipeline
|
|
(`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
|
|
own path**: each enrolled repo gets a **persistent checkout**, and each issue
|
|
runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
|
|
records why the earlier rejection does not apply here — so the two decisions
|
|
read as complementary, not contradictory.
|
|
|
|
## Status
|
|
|
|
accepted (2026-06-14) — for the AFK implementation path only; the existing
|
|
job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
|
|
|
|
## Why the 2026-06-02 rejection doesn't bind this path
|
|
|
|
The rejection's premise was **concurrent jobs in the same checkout** contending
|
|
on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
|
|
is **serial within a repo, parallel only across repos** (ADR-adjacent decision in
|
|
the design doc): at most one agent ever touches a given repo's `.git` at a time,
|
|
and different repos are different checkouts. The contention the rejection guarded
|
|
against cannot occur here. With that removed, worktrees become the *better*
|
|
choice because they unlock cache reuse the throwaway model can't.
|
|
|
|
## Considered options
|
|
|
|
- **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
|
|
objects, **persisted `node_modules`/venv/build caches**, and shared
|
|
package-manager caches survive across jobs, so the TDD loop stops reinstalling
|
|
deps every run. Compounds with `to-issues` clustering many slices in one repo,
|
|
processed serially — slice N reuses slice 1's warm tree.
|
|
- **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
|
|
this path: correct for the concurrent job-runner, but re-pays dependency
|
|
install on every issue, which dominates wall-clock for an
|
|
implement-test-fix-forward loop.
|
|
- **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
|
|
accumulated caches → disk blowup, and no git isolation.
|
|
|
|
## Considered options — storage
|
|
|
|
- **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
|
|
(`nfs-truenas` → `/srv/nfs`) and unused; git checkouts + `node_modules` are
|
|
death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
|
|
NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
|
|
size (tens of GB).
|
|
- **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
|
|
too small.
|
|
- **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
|
|
lost on reschedule; NFS RWX survives and the volume also holds session state.
|
|
|
|
## Consequences
|
|
|
|
- One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
|
|
warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
|
|
env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
|
|
— today caching is off (`pip --no-cache-dir`, no cache envs set).
|
|
- Housekeeping the throwaway model didn't need: `git fetch` before each
|
|
`worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
|
|
the volume fills.
|
|
- **`infra` stays on its own path** — it is git-crypt, and editing encrypted
|
|
files from a worktree is disallowed; the persistent-worktree model is for the
|
|
non-`infra` app repos in the allowlist.
|
|
- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
|
|
this volume + our persistent checkouts, or we manage the checkout and point T3
|
|
at it. Resolve before committing the architecture.
|