From be810051865a96defd13dd6ba4344b1b77055403 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 14 Jun 2026 19:09:12 +0000
Subject: [PATCH] docs: capture AFK implementation pipeline design + ADRs 0002-0004

Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary
to stop at design + spec: once an issue is triaged ready-for-agent, an
agent should implement it test-first, push it, and see it to a healthy
deploy on its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit,
  because T3 can only show sessions it launched itself -> we dispatch
  into it (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git
  worktrees on SSD-NFS for warm caches, reversing the throwaway-clone
  rule for this path because concurrency is serial-within-repo
  (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a
dedicated T3 instance before the poller is wired. No code yet.
Co-Authored-By: Claude Opus 4.8
---
 ...6-14-afk-implementation-pipeline-design.md | 259 ++++++++++++++++++
 ...fk-autonomous-merge-and-failure-posture.md |  69 +++++
 docs/adr/0003-t3-thin-executor-and-cockpit.md |  70 +++++
 ...ent-worktrees-for-implementation-agents.md |  68 +++++
 4 files changed, 466 insertions(+)
 create mode 100644 docs/2026-06-14-afk-implementation-pipeline-design.md
 create mode 100644 docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
 create mode 100644 docs/adr/0003-t3-thin-executor-and-cockpit.md
 create mode 100644 docs/adr/0004-persistent-worktrees-for-implementation-agents.md

diff --git a/docs/2026-06-14-afk-implementation-pipeline-design.md b/docs/2026-06-14-afk-implementation-pipeline-design.md
new file mode 100644
index 0000000..3c38644
--- /dev/null
+++ b/docs/2026-06-14-afk-implementation-pipeline-design.md
@@ -0,0 +1,259 @@
+# AFK implementation pipeline — design
+
+**Date:** 2026-06-14
+**Status:** proposed — pilot pending (see "Pilot" below; no code yet)
+**Scope:** A new autonomous path that turns a triaged `ready-for-agent` issue
+into tested, deployed code with no human at the keyboard. `claude-agent-service`
+becomes the **control plane**; a dedicated in-cluster **T3 Code** instance
+becomes the **executor + cockpit**. Touches: `claude-agent-service` (new poller
++ dispatch + watcher), a new T3 stack in `infra/`, a shared SSD-NFS volume, and
+the per-repo issue trackers.
+
+> Provenance: this design is the output of a long grilling session
+> (2026-06-14). It records the decisions *and* the alternatives that were
+> considered and dropped, so the reasoning survives. The three hardest-to-reverse
+> calls are split into ADRs 0002–0004.
+
+## Problem
+
+Today the development flow is **grill-with-docs → to-prd → to-issues → triage →
+implement**, and *every* stage is human-in-the-loop (HITL), including
+implementation. The owner wants the HITL boundary to stop at **design + spec**:
+once an issue is triaged `ready-for-agent`, an agent should pick it up and
+implement it **AFK** (away from keyboard) — write it test-first, push it, and
+see it through to a healthy deploy — escalating to a human only when it genuinely
+can't proceed.
+
+Two gaps block this today:
+
+- The only existing issue→agent automation is the **infra `issue-responder`**,
+  which fires on `user-report`/`feature-request` labels on the `infra` repo
+  only — not on `ready-for-agent`, not on the other sub-project repos that the
+  general design flow produces.
+- `claude-agent-service` only ever clones `infra`, runs one-shot fire-and-forget
+  `claude -p` jobs (no session, no live stream, no attach), and has no
+  multi-repo checkout. The owner wants to *watch and steer* in-flight work, which
+  the batch model can't offer.
+
+## Goal
+
+- HITL covers design + spec only. Publishing `ready-for-agent` issues is the
+  release signal (the `to-issues` quiz is the review gate).
+- An autonomous loop picks up unblocked `ready-for-agent` issues from
+  **enrolled** repos, implements them test-first, and lands them — pushing
+  straight to `master` so CI deploys them (see ADR 0002 for the risk posture).
+- The owner can **see all in-flight workers and converse with any of them** from
+  one UI — the T3 cockpit (see ADR 0003).
+- Reuse before building: lean on the existing CI/CD chain, the design skills, T3
+  Code's multi-agent cockpit, and the persistence/worktree machinery — rather
+  than hand-building a session console and a bespoke runtime.
+
+## Design
+
+### Roles: control plane vs executor + cockpit
+
+| Concern | Owner |
+|---|---|
+| When to start, which issue, the prompt, the safety envelope | **claude-agent-service** (control plane) — poller + watcher |
+| Running the agent (Claude Agent SDK), the worktree, the fleet UI | **T3 Code** (executor + cockpit) — one dedicated in-cluster instance |
+| Build → image → deploy → rollout | existing CI/CD (GHA → ghcr → Woodpecker → Keel) |
+| Issue queue + state | the per-repo GitHub issue trackers |
+
+The pivotal constraint that forces this split: **T3 can only display sessions it
+launched itself** — it has no command to adopt an externally-started session. So
+"viewable in T3" ⟺ "launched by T3". To keep `claude-agent-service` in charge
+*and* get the fleet view, the control plane **dispatches into T3** rather than
+running `claude` itself. See ADR 0003.
+
+### End-to-end flow
+
+```
+HUMAN (interactive session)
+  /grill-with-docs → /to-prd → /to-issues → /triage
+    └ produces ready-for-agent issues (dependency-ordered), labeled by a
+      trusted collaborator. Publishing them = the release signal.
+══════════════════════ HANDOFF ══════════════════════
+CONTROL PLANE (claude-agent-service, in-cluster)
+  poller CronJob (every few min):
+    for repo in allowlist:
+      skip repo if it already has an agent-in-progress issue (per-repo lock)
+      pick highest-priority ready-for-agent issue where:
+        • all "Blocked by" closed • labeled by a trusted collaborator
+      → stamp agent-in-progress
+      → POST /api/orchestration/dispatch (thread.turn.start + bootstrap:
+          create thread, prepare worktree, run setup, deliver the prompt)
+EXECUTOR + COCKPIT (dedicated T3 instance, in-cluster)
+  runs the issue-implementer agent (our prompt) in the worktree:
+    read issue + AGENT-BRIEF + repo CONTEXT.md/ADRs → TDD red-green-refactor
+    → commit (paraphrase issue, "Closes #N", AFK trailer) → push master
+  watcher (control plane) polls GET /api/orchestration/snapshot + CI:
+    ├─ healthy ──────► comment + close issue, drop lock, notify ✅
+    ├─ pre-push block ► do NOT push, relabel ready-for-human, escalate
+    └─ post-push red ► fix-forward (≤5 attempts / 60 min)
+         ├─ recovers ► healthy
+         └─ exhausts ► FREEZE broken (preserve forensics),
+                       relabel ready-for-human, hard page
+```
+
+### Trigger & dispatch predicate
+
+A poller CronJob (mirrors the existing `beads-dispatcher` pattern; stays
+in-cluster because neither the service nor T3 has public ingress). It dispatches
+issue *I* in repo *R* iff **all** hold:
+
+- `R` is in the **allowlist** ConfigMap, and the **kill switch** is off;
+- `I` has label `ready-for-agent`, applied by a **trusted collaborator** (the
+  trust gate — on private repos only collaborators can label, so the label *is*
+  the authorization; external/bot issues never auto-run);
+- every issue in `I`'s "Blocked by" is closed;
+- `R` has no issue currently labeled `agent-in-progress` (the per-repo lock).
+
+On dispatch it stamps `agent-in-progress`; on any terminal outcome it removes it.
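
The dispatch predicate above can be sketched as a pure function over data the poller has already fetched. This is a minimal sketch under stated assumptions, not the poller's implementation: the `Issue` and `RepoState` shapes, their field names, and `should_dispatch` are hypothetical, and only the four conditions themselves come from the design.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical view of a GitHub issue; field names are assumptions."""
    number: int
    labels: frozenset
    ready_label_applied_by: str | None  # login that applied `ready-for-agent`
    has_open_blockers: bool             # True if any "Blocked by" issue is open


@dataclass(frozen=True)
class RepoState:
    """Hypothetical per-repo inputs gathered by the poller before deciding."""
    name: str
    allowlisted: bool
    kill_switch_on: bool
    trusted_collaborators: frozenset
    issues: tuple = ()


def repo_is_locked(repo: RepoState) -> bool:
    # The per-repo lock is simply "some issue carries agent-in-progress".
    return any("agent-in-progress" in i.labels for i in repo.issues)


def should_dispatch(issue: Issue, repo: RepoState) -> bool:
    """Return True only if all four conditions from the design hold."""
    return (
        repo.allowlisted
        and not repo.kill_switch_on
        and "ready-for-agent" in issue.labels
        and issue.ready_label_applied_by in repo.trusted_collaborators
        and not issue.has_open_blockers
        and not repo_is_locked(repo)
    )
```

Keeping the predicate free of network calls means the trust gate and the lock can be unit-tested without a GitHub token; the CronJob would only be responsible for filling in the two dataclasses.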
+
+### Concurrency & locking
+
+**Parallel across repos, serial within a repo.** Multiple repos progress at
+once; at most one agent per repo (two agents in one repo would collide on the
+working tree). Enforced by the `agent-in-progress` label as a per-repo lock.
+Starting value; raise later.
+
+### Merge & failure posture — see ADR 0002
+
+- **Always push to master** (no PR gate). Tests-green is the merge gate; CI +
+  rollback are the safety net, matching the human allow-then-audit model.
+- **Pre-push** failure (can't get green / blocked / would need a disallowed op):
+  do *not* push; relabel `ready-for-human`; comment what was tried; page.
+- **Post-push** failure (CI build or rollout red): **fix-forward** up to **5
+  attempts or 60 minutes**, then if still red **freeze in the broken state**
+  (preserve forensics — do not auto-revert), relabel `ready-for-human`, hard
+  page. The owner explicitly chose debuggability over availability here.
+- **Budget:** `max_budget_usd = 100` per issue (time/attempt caps usually bite
+  first).
+
+### Build/test environment & worktrees — see ADR 0004
+
+The agent must run the target repo's test suite (TDD gate) before pushing.
+Therefore:
+
+- **Local toolchains scoped to the allowlist** — the executor image carries only
+  the *enrolled* repos' runtimes; the toolchain set grows in lockstep with the
+  allowlist.
+- **Persistent per-repo checkout + `git worktree` per issue** on a shared
+  **SSD-NFS** volume, so git objects, installed deps, and package-manager caches
+  stay warm across jobs. This **supersedes** the throwaway `git clone --local`
+  model from `2026-06-02-parallel-execution-design.md`; that rejection was
+  correct for *concurrent* same-repo jobs, but the serial-within-repo choice
+  here removes the `.git` contention it guarded against (ADR 0004). It pays off
+  precisely because `to-issues` clusters many slices in one repo, processed
+  serially — slice N reuses the warm checkout slice 1 paid for.
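
The persistent-checkout-plus-worktree model above might translate into git invocations along these lines. This is a sketch under assumptions: the `<volume>/<repo>/checkout` and `<volume>/<repo>/worktrees/issue-N` layout, the `afk/issue-N` branch name, the `origin` remote, and the `worktree_plan` helper are all hypothetical. Only `git fetch` before `git worktree add` is taken from the docs (ADR 0004).

```python
from pathlib import PurePosixPath


def worktree_plan(volume_root: str, repo: str, issue: int) -> list:
    """Build the git argv lists that prepare one issue's worktree.

    Nothing is executed here; the caller decides how and where to run them.
    """
    base = PurePosixPath(volume_root) / repo
    checkout = str(base / "checkout")                        # persistent clone (assumed path)
    worktree = str(base / "worktrees" / f"issue-{issue}")    # per-issue tree (assumed path)
    branch = f"afk/issue-{issue}"                            # hypothetical branch name
    return [
        # Refresh objects in the warm checkout before branching from it.
        ["git", "-C", checkout, "fetch", "origin"],
        # Create a new branch at origin/master, checked out in its own directory.
        ["git", "-C", checkout, "worktree", "add", "-b", branch, worktree, "origin/master"],
    ]
```

A caller could run each entry with `subprocess.run(cmd, check=True)`. A separate branch is used in the sketch because git refuses to check out a branch in a second worktree while another worktree has it checked out; how the result reaches `master` is left to the agent's push step.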
+
+### T3 integration: thin dispatch — see ADR 0003
+
+The control plane holds a capability-scoped **`orchestration:operate`** bearer
+token (minted via `t3 auth`, stored in Vault, refreshed for the 1-hour expiry)
+and calls T3's HTTP API:
+
+- `POST /api/orchestration/dispatch` → `thread.turn.start` with a `bootstrap`
+  that creates the thread, prepares the worktree, optionally runs a setup
+  script, and delivers the prompt — one call spawns a worktree-isolated worker.
+- `GET /api/orchestration/snapshot` → the full fleet read-model (per-thread
+  `running`/`idle`/`error`, `hasPendingUserInput`, `hasPendingApprovals`,
+  `branch`, `worktreePath`). T3 has **no outbound webhooks**, so the watcher
+  **polls** this to drive CI-watch, freeze, and label transitions.
+
+The AFK *behavior and safety* (issue-implementer prompt, guardrails, always-push,
+fix-forward/freeze, issue integration) live in **our** thin layer, so T3 is a
+**swappable, version-pinned backend** — never Keel-auto-upgraded, reversible to a
+self-hosted runtime if it goes sideways.
+
+### Observability & interaction
+
+The "active sessions layer" and the "attach and converse" surface **converge
+into one screen — the T3 cockpit**: a live list of all worker threads grouped by
+project; click one to stream its transcript and send it a turn. This dissolves
+the earlier intermediate ideas of a generalized-breakglass console and a
+raw-tmux hybrid attach — T3 provides converse / approve / resume natively
+(`thread.user-input.respond`, `thread.approval.respond`).
+
+Cross-system, durable signals the control plane still emits:
+
+- **Phase-checklist comment** on the issue, edited in place as phases complete
+  (worktree → tests-red → green → pushed → CI → deployed). Durable, low-noise,
+  lives on the issue, doubles as audit trail.
+- **Loki** logs labeled `{repo, issue}` for deep-dive.
+- **Presence** claim per running session (`repo:`, purpose `AFK #N`),
+  heartbeated — so AFK work shows up next to human sessions in the layer the
+  prompt hook already injects.
+- **Doorbell**: Slack / ntfy ping on terminal states, deep-linking into the T3
+  thread. Notify, not control — the dedicated-Slack-control-plane idea is
+  dropped in favour of the T3 cockpit.
+
+### Safety envelope
+
+- **Trust gate** — only collaborator-labeled `ready-for-agent` issues run.
+- **Allowlist** — a repo is untouchable until enrolled (prereqs: tests + GHA CI
+  + `CONTEXT.md`). Start with 1–2 repos; expand deliberately.
+- **Kill switch** — one ConfigMap flag pauses all pickup (the Keel
+  scale-to-0 reflex, built in from day one).
+- **Per-repo lock** — ≤1 agent per repo.
+- **Guardrails** (reused from `issue-responder`) — no PVC/PV deletes, no direct
+  Vault edits, no force-push to master, infra changes Terraform-only, never
+  `[ci skip]`.
+- **Identity & audit** — shared service identity; each commit body paraphrases
+  the issue and carries `Closes #N` + an AFK-agent trailer, so the commit
+  message stays the audit trail.
+
+## Parameters (chosen starting values — all tunable)
+
+| Knob | Value |
+|---|---|
+| Merge gate | always push to master |
+| Post-push failure | fix-forward, then freeze-broken |
+| Fix-forward cap | 5 attempts **or** 60 minutes |
+| Per-issue budget | `max_budget_usd = 100` |
+| Concurrency | parallel across repos, serial within a repo |
+| Repo scope | opt-in allowlist, start small |
+| Progress detail | phase-checklist on issue + Loki logs |
+| Alert channel | Slack (+ ntfy), as a doorbell into T3 |
+| Executor | dedicated in-cluster T3 (thin dispatch), version-pinned |
+
+## Pilot — validate before wiring the poller
+
+The thin model rests on five unknowns. Stand up the dedicated T3 instance and
+drive a couple of allowlist-repo issues **by hand** via the dispatch API to
+confirm each, *before* building the poller and committing the architecture:
+
+1. **Per-thread custom agent + skip-permissions** — can a dispatched thread
+   carry *our* `issue-implementer` system prompt and run unattended without
+   stalling on T3's approval gating? *(biggest unknown)*
+2. **Dispatch auth** — mint `orchestration:operate`, store in Vault, refresh the
+   1-hour token.
+3. **Status/completion** — drive CI-watch/freeze/labels purely from polling
+   `GET /api/orchestration/snapshot`.
+4. **Worktree reconciliation** — T3's native `prepareWorktree` vs our
+   persistent-checkout-with-warm-caches; pick one or make them cooperate on the
+   volume.
+5. **The in-cluster T3 pod** — headless `t3 serve --no-browser`, version-pinned
+   and **Keel-excluded**, internal ingress + Authentik, with tokens / toolchains
+   / SSD volume / `claude auth` provisioned.
+
+## Relationship to prior decisions
+
+- **Supersedes** the worktree rejection in
+  `2026-06-02-parallel-execution-design.md` (contextualized, not contradicted —
+  ADR 0004).
+- **Drops** two intermediate ideas explored and rejected this session:
+  evolving `claude-agent-service` into its own session/tmux/worktree runtime,
+  and building a bespoke breakglass-generalized console — both replaced by T3.
+- **Reuses** the `issue-responder` guardrails, the CI/CD chain, the
+  `beads-dispatcher` CronJob pattern, presence, Loki, and the design skills.
+
+## Out of scope / open questions
+
+- Raw-terminal "take-over" of a worker (T3 is a GUI cockpit, not a terminal); if
+  ever needed, that's a separate add-on.
+- Multi-tenant T3 (it is single-operator by design — fine, it matches the shared
+  service identity).
+- Cross-repo dependency orchestration beyond per-issue "Blocked by".
+- T3 Code is pre-1.0 (~v0.0.x) and churny; the version-pin + Keel-exclude +
+  swappable-backend discipline (ADR 0003) is the mitigation.
diff --git a/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md b/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
new file mode 100644
index 0000000..cd397cf
--- /dev/null
+++ b/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
@@ -0,0 +1,69 @@
+# AFK agents push straight to master; failures fix-forward then freeze, not revert
+
+The AFK implementation pipeline (see
+`docs/2026-06-14-afk-implementation-pipeline-design.md`) lets an autonomous
+agent land code with no human at the keyboard. The owner deliberately chose the
+most hands-off posture: **AFK-written code pushes straight to `master`** (which
+then deploys via the existing CI/CD chain) with **no pull-request review gate**,
+and when a deploy breaks, the agent **fixes forward and then freezes the broken
+state** rather than auto-reverting. This ADR records that risk posture and why it
+was chosen over the safer alternatives, because it is surprising and not cheap to
+walk back once callers and habits depend on it.
+
+## Status
+
+accepted (2026-06-14) — posture decided; enforced once the pipeline ships
+(pilot-gated).
+
+## Context
+
+`master` on every enrolled repo deploys continuously (GHA build → ghcr →
+Woodpecker → Keel). So "where AFK code lands" is really "what reaches a live
+deploy without a human looking". The owner weighed three merge gates and three
+post-push failure responses and picked the autonomy-maximizing end of both,
+accepting the blast radius explicitly.
+
+## Considered options — merge gate
+
+- **Always push to master (chosen).** Tests-green is the gate; CI + rollback are
+  the safety net. Matches the existing human allow-then-audit model (non-admins
+  already push straight to master). Most hands-off.
+- **Adaptive (push if confident, else PR)** — rejected as the *default* though it
+  is what `issue-responder` does; the owner wanted full hands-off, not a
+  confidence-gated PR for otherwise-working code.
+- **Always open a PR** — rejected: reintroduces a human merge step on every
+  issue, i.e. "AFK implementation, human merge" — not the goal.
+
+## Considered options — post-push failure (CI/rollout goes red after a green push)
+
+- **Fix-forward then freeze (chosen).** Iterate with corrective commits up to
+  **5 attempts or 60 minutes**; if still red, **leave the broken state in place**
+  (do not revert), relabel the issue `ready-for-human`, and hard-page. Same
+  forensics-first instinct as the breakglass (ADR 0001): preserve the exact
+  failing state for debugging rather than auto-cleaning it away.
+- **Auto-revert + escalate** — rejected (was the recommendation): restores green
+  fastest, but destroys the forensic state the owner wants to inspect.
+- **Alert and freeze immediately (no fix-forward)** — rejected: gives up on
+  transient/env-drift failures a corrective commit would clear.
+
+Pre-push failure (can't reach green, blocked, or would need a disallowed op) is
+not a dilemma: the agent does **not** push, relabels `ready-for-human`, comments
+what it tried, and pages.
+
+## Consequences
+
+- An unreviewed logic error can deploy before any human sees it; rollback (not
+  review) is the safety net. Bounded by: tests-as-gate, the start-small
+  allowlist, the per-repo lock, and the kill switch.
+- A frozen-broken deploy can sit unhealthy until the owner answers the page —
+  availability is traded for debuggability, by explicit choice. Acceptable
+  because enrolled repos are non-critical by the allowlist prerequisite, and the
+  owner is paged hard (Slack + ntfy).
+- Fix-forward can stack up to 5 commits on a bad change before freezing; the
+  60-minute cap bounds the churn window.
+- Per-issue spend is capped at `max_budget_usd = 100`.
+- Guardrails still hold underneath this posture: no PVC/PV deletes, no direct
+  Vault edits, no force-push, infra changes Terraform-only, never `[ci skip]`.
+- Reversible: tightening to adaptive/PR or to auto-revert is a config + watcher
+  change, not a re-architecture — but callers/habits will have formed around
+  "it just lands", so flag loudly if reversing.
diff --git a/docs/adr/0003-t3-thin-executor-and-cockpit.md b/docs/adr/0003-t3-thin-executor-and-cockpit.md
new file mode 100644
index 0000000..3251418
--- /dev/null
+++ b/docs/adr/0003-t3-thin-executor-and-cockpit.md
@@ -0,0 +1,70 @@
+# AFK workers run inside a dedicated T3 Code instance; claude-agent-service dispatches into it
+
+The owner wants one UI to see and converse with every in-flight AFK worker, and
+named **T3 Code** (the self-hosted multi-agent cockpit already running at
+`t3.viktorbarzin.me`) as that UI. Research into T3's source
+(`pingdotgg/t3code`, ~v0.0.27) found it is genuinely built for this — a fleet of
+worker "threads" with a live read-model and a scoped HTTP dispatch API — **but**
+it can only display sessions **it launched itself**; there is no command to adopt
+a session another process started. So "viewable in T3" ⟺ "launched by T3". This
+ADR records the resulting architecture: `claude-agent-service` stays the
+**control plane** and **dispatches into a dedicated, in-cluster T3 instance**
+which is the **executor + cockpit**. The agent runs inside T3; we keep the brain.
+
+## Status
+
+accepted (2026-06-14) — direction decided; **gated on a pilot** (the five
+unknowns in the design doc) before the poller is wired and the architecture is
+committed.
+
+## Why T3, and why "thin"
+
+T3 provides, out of the box, what we would otherwise hand-build: a three-panel
+fleet cockpit (`projects → threads → conversation`), an
+`OrchestrationReadModel` with per-thread live status, and
+`POST /api/orchestration/dispatch` whose `thread.turn.start` + `bootstrap` can
+**create a thread, prepare a git worktree, run a setup script, and deliver a
+prompt in one call** — exactly the worker-spawn primitive. Converse / approve /
+resume are native (`thread.user-input.respond`, `thread.approval.respond`). For
+Claude it embeds `@anthropic-ai/claude-agent-sdk`.
+
+"Thin" = the AFK *behavior and safety* (the `issue-implementer` prompt,
+guardrails, always-push, fix-forward/freeze, CI-watch, issue integration) live
+in **our** layer (the poller + watcher), not in T3. T3 is a **swappable backend**
+we drive over its API.
+
+## Considered options
+
+- **Thin: claude-agent-service dispatches into T3 (chosen).** Control plane calls
+  T3's dispatch API; T3 runs the agent in a worktree and shows it. Get the fleet
+  view, keep the brain, least to build. Cost: execution moves into the T3 pod, so
+  T3's runtime is in the *hot path* (not just the window).
+- **claude-agent-service runs the agent, T3 only displays it** — rejected because
+  it is impossible: T3 cannot adopt an externally-started session
+  (`thread.session.set` is server-internal; no external-session-id field). This
+  is the constraint that shaped the whole decision.
+- **Deep: claude-agent-service as a custom T3 provider (ACP-style)** — rejected
+  for now: keeps the runtime ours with a T3 UI, but means building and
+  maintaining a provider against a pre-1.0, internal, no-contributions interface
+  — effectively a fork. Revisit only if "thin" proves too limiting.
+- **Skip T3; build our own console** (generalized breakglass + tmux) — rejected:
+  most stable and fully in-house, but abandons the owner's explicit "see workers
+  in T3" goal and means owning a session console forever.
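
To illustrate how little logic the chosen "thin" option leaves on the control-plane side, one watcher step might classify a thread from the snapshot read-model as below. This is a sketch under assumptions, not T3's API contract: the design doc names the per-thread values `running`/`idle`/`error` and the flags `hasPendingUserInput`/`hasPendingApprovals`, but the `status` key, the dictionary shape, the action names, and `classify_thread` itself are hypothetical.

```python
def classify_thread(thread: dict) -> str:
    """Map one snapshot thread entry to the watcher's next action.

    Returns one of "escalate", "wait", or "check-ci" (hypothetical action names).
    """
    # An unattended worker that is waiting on a human can never finish alone.
    if thread.get("hasPendingUserInput") or thread.get("hasPendingApprovals"):
        return "escalate"

    status = thread.get("status")  # key name is an assumption
    if status == "running":
        return "wait"
    if status == "idle":
        # The turn has ended; whether it succeeded is decided by CI, not by T3.
        return "check-ci"
    # "error" and anything unrecognised are treated the same way: page a human.
    return "escalate"
```

Treating unknown states as `escalate` keeps the sketch fail-closed, which matters while T3 is pre-1.0 and its read-model may change between pinned versions.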
+
+## Consequences
+
+- A **dedicated in-cluster T3 instance** (a pod, consistent with the earlier
+  in-cluster-over-devvm substrate choice) is the worker host, separate from the
+  per-user devvm T3 instances. It needs the SSD worktree volume, git/Anthropic
+  tokens, toolchains, `claude auth`, and an internal Authentik-gated ingress.
+- T3's runtime is now in the **execution hot path** — its maturity affects
+  whether work *runs*, not only whether it can be *seen*. Mitigations: **pin the
+  version and exclude it from Keel** (its churn + hard-cutover auth migrations
+  make auto-upgrade a Keel-class hazard), keep the integration thin and the
+  backend swappable, and **pilot** the five unknowns first.
+- T3 is **single-operator** — fine here: it matches the already-accepted shared
+  service identity for AFK work.
+- No outbound webhooks from T3 → the watcher **polls**
+  `GET /api/orchestration/snapshot`.
+- This supersedes the intermediate ideas of evolving `claude-agent-service` into
+  its own session/tmux/worktree runtime and building a bespoke attach console.
diff --git a/docs/adr/0004-persistent-worktrees-for-implementation-agents.md b/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
new file mode 100644
index 0000000..4488443
--- /dev/null
+++ b/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
@@ -0,0 +1,68 @@
+# Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
+
+`2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
+and chose throwaway `git clone --local` per job, "because worktrees share one
+`.git` → agents that `git commit`/`pull` still contend — not truly independent".
+The AFK implementation pipeline
+(`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
+own path**: each enrolled repo gets a **persistent checkout**, and each issue
+runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
+records why the earlier rejection does not apply here — so the two decisions
+read as complementary, not contradictory.
+
+## Status
+
+accepted (2026-06-14) — for the AFK implementation path only; the existing
+job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
+
+## Why the 2026-06-02 rejection doesn't bind this path
+
+The rejection's premise was **concurrent jobs in the same checkout** contending
+on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
+is **serial within a repo, parallel only across repos** (ADR-adjacent decision in
+the design doc): at most one agent ever touches a given repo's `.git` at a time,
+and different repos are different checkouts. The contention the rejection guarded
+against cannot occur here. With that removed, worktrees become the *better*
+choice because they unlock cache reuse the throwaway model can't.
+
+## Considered options
+
+- **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
+  objects, **persisted `node_modules`/venv/build caches**, and shared
+  package-manager caches survive across jobs, so the TDD loop stops reinstalling
+  deps every run. Compounds with `to-issues` clustering many slices in one repo,
+  processed serially — slice N reuses slice 1's warm tree.
+- **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
+  this path: correct for the concurrent job-runner, but re-pays dependency
+  install on every issue, which dominates wall-clock for an
+  implement-test-fix-forward loop.
+- **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
+  accumulated caches → disk blowup, and no git isolation.
+
+## Considered options — storage
+
+- **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
+  (`nfs-truenas` → `/srv/nfs`) and unused; git checkouts + `node_modules` are
+  death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
+  NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
+  size (tens of GB).
+- **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
+  too small.
+- **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
+  lost on reschedule; NFS RWX survives and the volume also holds session state.
+
+## Consequences
+
+- One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
+  warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
+  env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
+  — today caching is off (`pip --no-cache-dir`, no cache envs set).
+- Housekeeping the throwaway model didn't need: `git fetch` before each
+  `worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
+  the volume fills.
+- **`infra` stays on its own path** — it is git-crypt, and editing encrypted
+  files from a worktree is disallowed; the persistent-worktree model is for the
+  non-`infra` app repos in the allowlist.
+- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
+  this volume + our persistent checkouts, or we manage the checkout and point T3
+  at it. Resolve before committing the architecture.
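
The cache wiring and housekeeping listed in the consequences above could look roughly like this. It is a sketch under assumptions: the `caches/` directory layout, the `/afk-volume` mount point in the usage note, and both helper names are hypothetical. `GOMODCACHE`, `GOCACHE` and `PNPM_HOME` are named in the ADR; `PIP_CACHE_DIR`, `npm_config_cache` and `CARGO_HOME` are this sketch's choice of knobs for the "pip", "npm" and "cargo" entries.

```python
from pathlib import PurePosixPath


def cache_env(volume_root: str) -> dict:
    """Point each package manager's cache at the shared volume."""
    caches = PurePosixPath(volume_root) / "caches"  # hypothetical layout
    return {
        "PIP_CACHE_DIR": str(caches / "pip"),
        "GOMODCACHE": str(caches / "go-mod"),
        "GOCACHE": str(caches / "go-build"),
        "PNPM_HOME": str(caches / "pnpm"),
        "npm_config_cache": str(caches / "npm"),
        "CARGO_HOME": str(caches / "cargo"),
    }


def housekeeping_commands(checkout: str) -> list:
    """Periodic cleanup for one persistent checkout, as git argv lists."""
    return [
        # Drop bookkeeping for worktrees whose directories no longer exist.
        ["git", "-C", checkout, "worktree", "prune"],
        # Repack and collect unreachable objects in the shared .git.
        ["git", "-C", checkout, "gc"],
    ]
```

A job runner might merge the mapping into its environment with `env = {**os.environ, **cache_env("/afk-volume")}` before invoking the test suite. Dropping `pip --no-cache-dir` from the build is a separate change that this sketch does not cover.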