docs: capture AFK implementation pipeline design + ADRs 0002-0004

Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to stop
at design + spec: once an issue is triaged ready-for-agent, an agent should
implement it test-first, push it, and see it to a healthy deploy on its own,
escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety); a
  dedicated in-cluster T3 Code instance is the executor + cockpit, because T3
  can only show sessions it launched itself -> we dispatch into it (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards then
  freezes the broken state for forensics rather than reverting (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated T3
instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00
commit be81005186
parent 9d8afdd884
4 changed files with 466 additions and 0 deletions
--- a/docs/2026-06-14-afk-implementation-pipeline-design.md
+++ b/docs/2026-06-14-afk-implementation-pipeline-design.md
@@ -0,0 +1,259 @@
 # AFK implementation pipeline — design
 **Date:** 2026-06-14
 **Status:** proposed — pilot pending (see "Pilot" below; no code yet)
 **Scope:** A new autonomous path that turns a triaged `ready-for-agent` issue
 into tested, deployed code with no human at the keyboard. `claude-agent-service`
 becomes the **control plane**; a dedicated in-cluster **T3 Code** instance
 becomes the **executor + cockpit**. Touches: `claude-agent-service` (new poller
 + dispatch + watcher), a new T3 stack in `infra/`, a shared SSD-NFS volume, and
 the per-repo issue trackers.
 > Provenance: this design is the output of a long grilling session
 > (2026-06-14). It records the decisions *and* the alternatives that were
 > considered and dropped, so the reasoning survives. The three hardest-to-reverse
 > calls are split into ADRs 0002–0004.
 ## Problem
 Today the development flow is **grill-with-docs → to-prd → to-issues → triage →
 implement**, and *every* stage is human-in-the-loop (HITL), including
 implementation. The owner wants the HITL boundary to stop at **design + spec**:
 once an issue is triaged `ready-for-agent`, an agent should pick it up and
 implement it **AFK** (away from keyboard) — write it test-first, push it, and
 see it through to a healthy deploy — escalating to a human only when it genuinely
 can't proceed.
 Two gaps block this today:
 - The only existing issue→agent automation is the **infra `issue-responder`**,
  which fires on `user-report`/`feature-request` labels on the `infra` repo
  only — not on `ready-for-agent`, not on the other sub-project repos that the
  general design flow produces.
 - `claude-agent-service` only ever clones `infra`, runs one-shot fire-and-forget
  `claude -p` jobs (no session, no live stream, no attach), and has no
  multi-repo checkout. The owner wants to *watch and steer* in-flight work, which
  the batch model can't offer.
 ## Goal
 - HITL covers design + spec only. Publishing `ready-for-agent` issues is the
  release signal (the `to-issues` quiz is the review gate).
 - An autonomous loop picks up unblocked `ready-for-agent` issues from
  **enrolled** repos, implements them test-first, and lands them — pushing
  straight to `master` so CI deploys them (see ADR 0002 for the risk posture).
 - The owner can **see all in-flight workers and converse with any of them** from
  one UI — the T3 cockpit (see ADR 0003).
 - Reuse before building: lean on the existing CI/CD chain, the design skills, T3
  Code's multi-agent cockpit, and the persistence/worktree machinery — rather
  than hand-building a session console and a bespoke runtime.
 ## Design
 ### Roles: control plane vs executor + cockpit
 | Concern | Owner |
 |---|---|
 | When to start, which issue, the prompt, the safety envelope | **claude-agent-service** (control plane) — poller + watcher |
 | Running the agent (Claude Agent SDK), the worktree, the fleet UI | **T3 Code** (executor + cockpit) — one dedicated in-cluster instance |
 | Build → image → deploy → rollout | existing CI/CD (GHA → ghcr → Woodpecker → Keel) |
 | Issue queue + state | the per-repo GitHub issue trackers |
 The pivotal constraint that forces this split: **T3 can only display sessions it
 launched itself** — it has no command to adopt an externally-started session. So
 "viewable in T3" ⟺ "launched by T3". To keep `claude-agent-service` in charge
 *and* get the fleet view, the control plane **dispatches into T3** rather than
 running `claude` itself. See ADR 0003.
 ### End-to-end flow
 ```
 HUMAN (interactive session)
  /grill-with-docs → /to-prd → /to-issues → /triage
     └ produces ready-for-agent issues (dependency-ordered), labeled by a
       trusted collaborator. Publishing them = the release signal.
 ══════════════════════ HANDOFF ══════════════════════
 CONTROL PLANE  (claude-agent-service, in-cluster)
  poller CronJob (every few min):
    for repo in allowlist:
      skip repo if it already has an agent-in-progress issue   (per-repo lock)
      pick highest-priority ready-for-agent issue where:
        • all "Blocked by" closed   • labeled by a trusted collaborator
      → stamp agent-in-progress
      → POST /api/orchestration/dispatch  (thread.turn.start + bootstrap:
            create thread, prepare worktree, run setup, deliver the prompt)
 EXECUTOR + COCKPIT  (dedicated T3 instance, in-cluster)
  runs the issue-implementer agent (our prompt) in the worktree:
    read issue + AGENT-BRIEF + repo CONTEXT.md/ADRs → TDD red-green-refactor
    → commit (paraphrase issue, "Closes #N", AFK trailer) → push master
  watcher (control plane) polls GET /api/orchestration/snapshot + CI:
    ├─ healthy ──────► comment + close issue, drop lock, notify ✅
    ├─ pre-push block ► do NOT push, relabel ready-for-human, escalate
    └─ post-push red ► fix-forward (≤5 attempts / 60 min)
                         ├─ recovers ► healthy
                         └─ exhausts ► FREEZE broken (preserve forensics),
                                       relabel ready-for-human, hard page
 ```
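The three terminal branches of the flow above can be written down as a small lookup. This is a minimal sketch, not the pipeline's code: the `Outcome` enum and the action strings are hypothetical names chosen for illustration, and the only behavior taken from the design is which steps each branch performs and that the `agent-in-progress` lock is dropped on every terminal outcome.

```python
from enum import Enum


class Outcome(Enum):
    """Terminal outcomes named in the flow diagram (enum names are illustrative)."""
    HEALTHY = "healthy"
    PRE_PUSH_BLOCK = "pre-push block"
    FIX_FORWARD_EXHAUSTED = "post-push red, fix-forward exhausted"


def terminal_actions(outcome: Outcome) -> list:
    """Return the ordered watcher actions for a terminal outcome."""
    if outcome is Outcome.HEALTHY:
        return ["comment", "close issue", "remove agent-in-progress", "notify"]
    if outcome is Outcome.PRE_PUSH_BLOCK:
        # Nothing was pushed, so there is no deploy to freeze.
        return [
            "comment what was tried",
            "relabel ready-for-human",
            "remove agent-in-progress",
            "page",
        ]
    # Fix-forward ran out of attempts or time: leave the broken state in place.
    return [
        "freeze broken state",
        "relabel ready-for-human",
        "remove agent-in-progress",
        "hard page",
    ]
```

Keeping the mapping in one function makes it easy to check that no terminal branch forgets to release the per-repo lock.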
 ### Trigger & dispatch predicate
 The trigger is a poller CronJob that mirrors the existing `beads-dispatcher`
 pattern. It stays in-cluster because neither the service nor T3 has public
 ingress. It dispatches issue *I* in repo *R* iff **all** of the following hold:
 - `R` is in the **allowlist** ConfigMap, and the **kill switch** is off;
 - `I` has label `ready-for-agent`, applied by a **trusted collaborator** (the
  trust gate — on private repos only collaborators can label, so the label *is*
  the authorization; external/bot issues never auto-run);
 - every issue in `I`'s "Blocked by" is closed;
 - `R` has no issue currently labeled `agent-in-progress` (the per-repo lock).
 On dispatch it stamps `agent-in-progress`; on any terminal outcome it removes it.
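The predicate above reads naturally as a pure function. The sketch below assumes a hypothetical in-memory `Issue` record; how labels, the labeler's trust status, and "Blocked by" links are actually fetched from GitHub is not specified here and is left out.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical view of a tracker issue, as the poller might see it."""
    number: int
    labels: set
    labeled_by_trusted_collaborator: bool
    open_blockers: int = 0  # count of "Blocked by" issues that are still open


def should_dispatch(repo: str, issue: Issue, repo_issues: list,
                    allowlist: set, kill_switch_on: bool) -> bool:
    """True only if every condition of the dispatch predicate holds."""
    if kill_switch_on or repo not in allowlist:
        return False
    if "ready-for-agent" not in issue.labels:
        return False
    if not issue.labeled_by_trusted_collaborator:
        return False  # trust gate
    if issue.open_blockers > 0:
        return False
    # Per-repo lock: no issue in this repo may already be agent-in-progress.
    return not any("agent-in-progress" in i.labels for i in repo_issues)
```

For example, a ready issue in an allowlisted repo is skipped as long as any sibling issue still carries `agent-in-progress`.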
 ### Concurrency & locking
 **Parallel across repos, serial within a repo.** Multiple repos progress at
 once; at most one agent per repo (two agents in one repo would collide on the
 working tree). Enforced by the `agent-in-progress` label as a per-repo lock.
 This concurrency setting is a starting value; raise it later.
 ### Merge & failure posture — see ADR 0002
 - **Always push to master** (no PR gate). Tests-green is the merge gate; CI +
  rollback are the safety net, matching the human allow-then-audit model.
 - **Pre-push** failure (can't get green / blocked / would need a disallowed op):
  do *not* push; relabel `ready-for-human`; comment what was tried; page.
 - **Post-push** failure (CI build or rollout red): **fix-forward** up to **5
  attempts or 60 minutes**, then if still red **freeze in the broken state**
  (preserve forensics — do not auto-revert), relabel `ready-for-human`, hard
  page. The owner explicitly chose debuggability over availability here.
 - **Budget:** `max_budget_usd = 100` per issue (time/attempt caps usually bite
  first).
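The post-push rule ("5 attempts or 60 minutes, then freeze") can be sketched as a single decision function. The function name and return strings are illustrative, and the sketch assumes the 60-minute clock starts when CI first goes red, which the design does not pin down.

```python
MAX_ATTEMPTS = 5          # fix-forward attempt cap from the design
MAX_SECONDS = 60 * 60     # fix-forward time cap from the design


def next_step(ci_green: bool, attempts_made: int,
              seconds_since_first_red: float) -> str:
    """Decide what the watcher does after each CI or rollout result."""
    if ci_green:
        return "healthy"
    # Whichever cap is hit first ends fix-forward.
    if attempts_made >= MAX_ATTEMPTS or seconds_since_first_red >= MAX_SECONDS:
        return "freeze"
    return "fix-forward"
```

For instance, `next_step(False, 2, 600.0)` yields `"fix-forward"`, while `next_step(False, 2, 3600.0)` yields `"freeze"` because the time cap has been reached even though attempts remain.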
 ### Build/test environment & worktrees — see ADR 0004
 The agent must run the target repo's test suite (TDD gate) before pushing.
 Therefore:
 - **Local toolchains scoped to the allowlist** — the executor image carries only
  the *enrolled* repos' runtimes; the toolchain set grows in lockstep with the
  allowlist.
 - **Persistent per-repo checkout + `git worktree` per issue** on a shared
  **SSD-NFS** volume, so git objects, installed deps, and package-manager caches
  stay warm across jobs. This **supersedes** the throwaway `git clone --local`
  model from `2026-06-02-parallel-execution-design.md`; that rejection was
  correct for *concurrent* same-repo jobs, but the serial-within-repo choice
  here removes the `.git` contention it guarded against (ADR 0004). It pays off
  precisely because `to-issues` clusters many slices in one repo, processed
  serially — slice N reuses the warm checkout slice 1 paid for.
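A per-issue worktree off a persistent checkout might be prepared with the two git commands built below. The directory layout (`checkouts/`, `worktrees/`) and the `afk/issue-N` branch name are assumptions made for this sketch, and T3's own `prepareWorktree` may end up doing this step instead (see the pilot list).

```python
from pathlib import PurePosixPath


def worktree_commands(volume_root: str, repo: str, issue_number: int) -> list:
    """Build (not run) the git commands that prepare one issue's worktree."""
    root = PurePosixPath(volume_root)
    checkout = root / "checkouts" / repo  # persistent, stays warm across issues
    worktree = root / "worktrees" / repo / f"issue-{issue_number}"
    branch = f"afk/issue-{issue_number}"  # hypothetical branch naming
    return [
        # Refresh the persistent checkout so the worktree starts from current master.
        ["git", "-C", str(checkout), "fetch", "origin"],
        # Create the worktree on a new branch based on origin/master.
        ["git", "-C", str(checkout), "worktree", "add",
         "-b", branch, str(worktree), "origin/master"],
    ]
```

Returning argument lists rather than running them keeps the sketch side-effect free; a caller could pass each list to `subprocess.run(cmd, check=True)`.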
 ### T3 integration: thin dispatch — see ADR 0003
 The control plane holds a capability-scoped **`orchestration:operate`** bearer
 token (minted via `t3 auth`, stored in Vault, and refreshed before its 1-hour
 expiry) and calls T3's HTTP API:
 - `POST /api/orchestration/dispatch` → `thread.turn.start` with a `bootstrap`
  that creates the thread, prepares the worktree, optionally runs a setup
  script, and delivers the prompt — one call spawns a worktree-isolated worker.
 - `GET /api/orchestration/snapshot` → the full fleet read-model (per-thread
  `running`/`idle`/`error`, `hasPendingUserInput`, `hasPendingApprovals`,
  `branch`, `worktreePath`). T3 has **no outbound webhooks**, so the watcher
  **polls** this to drive CI-watch, freeze, and label transitions.
 The AFK *behavior and safety* (issue-implementer prompt, guardrails, always-push,
 fix-forward/freeze, issue integration) live in **our** thin layer, so T3 is a
 **swappable, version-pinned backend** — never Keel-auto-upgraded, reversible to a
 self-hosted runtime if it goes sideways.
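The snapshot fields listed above could drive the watcher roughly as follows. This is a sketch under assumptions: the design names the values `running`/`idle`/`error` but not the key that holds them, so the `status` key is a guess, and the returned action names are hypothetical.

```python
def watcher_action(thread: dict) -> str:
    """Map one thread entry from the fleet snapshot to a watcher action."""
    # A worker waiting on input or approval cannot progress unattended.
    if thread.get("hasPendingUserInput") or thread.get("hasPendingApprovals"):
        return "escalate"
    status = thread.get("status")  # assumed key name
    if status == "running":
        return "wait"
    if status == "idle":
        # The turn finished; whether it succeeded is decided by CI, not by T3.
        return "check-ci"
    # "error" and anything unrecognized both go to a human.
    return "escalate"
```

Treating unknown statuses as `escalate` is a deliberately conservative choice for a pre-1.0 backend whose read-model may change between versions.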
 ### Observability & interaction
 The "active sessions layer" and the "attach and converse" surface **converge
 into one screen — the T3 cockpit**: a live list of all worker threads grouped by
 project; click one to stream its transcript and send it a turn. This dissolves
 the earlier intermediate ideas of a generalized-breakglass console and a
 raw-tmux hybrid attach — T3 provides converse / approve / resume natively
 (`thread.user-input.respond`, `thread.approval.respond`).
 Cross-system, durable signals the control plane still emits:
 - **Phase-checklist comment** on the issue, edited in place as phases complete
  (worktree → tests-red → green → pushed → CI → deployed). Durable, low-noise,
  lives on the issue, doubles as audit trail.
 - **Loki** logs labeled `{repo, issue}` for deep-dive.
 - **Presence** claim per running session (`repo:<name>`, purpose `AFK #N`),
  heartbeated — so AFK work shows up next to human sessions in the layer the
  prompt hook already injects.
 - **Doorbell**: Slack / ntfy ping on terminal states, deep-linking into the T3
  thread. Notify, not control — the dedicated-Slack-control-plane idea is
  dropped in favour of the T3 cockpit.
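The phase-checklist comment is simple to render as a Markdown task list. The phase names below come from the list above; the function name and the idea of passing a set of completed phases are illustrative choices, and posting or editing the comment on GitHub is out of scope for the sketch.

```python
PHASES = ["worktree", "tests-red", "green", "pushed", "CI", "deployed"]


def render_checklist(completed: set) -> str:
    """Render the phases as a Markdown task list, ticking completed ones."""
    lines = []
    for phase in PHASES:
        mark = "x" if phase in completed else " "
        lines.append(f"- [{mark}] {phase}")
    return "\n".join(lines)
```

Because the whole comment is regenerated from state each time, editing it in place stays idempotent: re-rendering with the same completed set produces the same body.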
 ### Safety envelope
 - **Trust gate** — only collaborator-labeled `ready-for-agent` issues run.
 - **Allowlist** — a repo is untouchable until enrolled (prereqs: tests + GHA CI
  + `CONTEXT.md`). Start with 1–2 repos; expand deliberately.
 - **Kill switch** — one ConfigMap flag pauses all pickup (the Keel
  scale-to-0 reflex, built in from day one).
 - **Per-repo lock** — ≤1 agent per repo.
 - **Guardrails** (reused from `issue-responder`) — no PVC/PV deletes, no direct
  Vault edits, no force-push to master, infra changes Terraform-only, never
  `[ci skip]`.
 - **Identity & audit** — shared service identity; each commit body paraphrases
  the issue and carries `Closes #N` + an AFK-agent trailer, so the commit
  message stays the audit trail.
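The commit-message convention in the last bullet could be assembled like this. The `AFK-Agent` trailer key and its value are placeholders, since the design only says "an AFK-agent trailer" without naming one.

```python
def commit_message(issue_number: int, subject: str, paraphrase: str) -> str:
    """Build a commit message: subject, issue paraphrase, Closes line, trailer."""
    return "\n".join([
        subject,
        "",
        paraphrase,
        "",
        f"Closes #{issue_number}",
        "",
        # Hypothetical trailer key; git expects trailers in the final paragraph.
        "AFK-Agent: issue-implementer",
    ]) + "\n"
```

Putting the trailer in its own final paragraph means `git interpret-trailers --parse` can pick it up later when auditing which commits were AFK-authored.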
 ## Parameters (chosen starting values — all tunable)
 | Knob | Value |
 |---|---|
 | Merge gate | always push to master |
 | Post-push failure | fix-forward, then freeze-broken |
 | Fix-forward cap | 5 attempts **or** 60 minutes |
 | Per-issue budget | `max_budget_usd = 100` |
 | Concurrency | parallel across repos, serial within a repo |
 | Repo scope | opt-in allowlist, start small |
 | Progress detail | phase-checklist on issue + Loki logs |
 | Alert channel | Slack (+ ntfy), as a doorbell into T3 |
 | Executor | dedicated in-cluster T3 (thin dispatch), version-pinned |
 ## Pilot — validate before wiring the poller
 The thin model rests on five unknowns. Stand up the dedicated T3 instance and
 drive a couple of allowlist-repo issues **by hand** via the dispatch API to
 confirm each, *before* building the poller and committing the architecture:
 1. **Per-thread custom agent + skip-permissions** — can a dispatched thread
   carry *our* `issue-implementer` system prompt and run unattended without
   stalling on T3's approval gating? *(biggest unknown)*
 2. **Dispatch auth** — mint `orchestration:operate`, store in Vault, refresh the
   1-hour token.
 3. **Status/completion** — drive CI-watch/freeze/labels purely from polling
   `GET /api/orchestration/snapshot`.
 4. **Worktree reconciliation** — T3's native `prepareWorktree` vs our
   persistent-checkout-with-warm-caches; pick one or make them cooperate on the
   volume.
 5. **The in-cluster T3 pod** — headless `t3 serve --no-browser`, version-pinned
   and **Keel-excluded**, internal ingress + Authentik, with tokens / toolchains
   / SSD volume / `claude auth` provisioned.
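For pilot item 2, the refresh logic around the 1-hour token might look like the cache below. How a token is actually minted and read back from Vault is not shown: `mint` is a caller-supplied placeholder, and the 5-minute refresh margin is an arbitrary assumption.

```python
import time
from typing import Callable, Optional

TOKEN_TTL_SECONDS = 60 * 60        # 1-hour expiry stated in the design
REFRESH_MARGIN_SECONDS = 5 * 60    # assumed safety margin before expiry


class TokenCache:
    """Return a cached bearer token, re-minting shortly before it expires."""

    def __init__(self, mint: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._mint = mint
        self._clock = clock
        self._token: Optional[str] = None
        self._minted_at = 0.0

    def get(self) -> str:
        now = self._clock()
        age = now - self._minted_at
        if self._token is None or age >= TOKEN_TTL_SECONDS - REFRESH_MARGIN_SECONDS:
            self._token = self._mint()
            self._minted_at = now
        return self._token
```

Injecting the clock keeps the cache testable without sleeping, and using a monotonic clock by default avoids surprises from wall-clock adjustments.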
 ## Relationship to prior decisions
 - **Supersedes** the worktree rejection in
  `2026-06-02-parallel-execution-design.md` (contextualized, not contradicted —
  ADR 0004).
 - **Drops** two intermediate ideas explored and rejected this session:
  evolving `claude-agent-service` into its own session/tmux/worktree runtime,
  and building a bespoke breakglass-generalized console — both replaced by T3.
 - **Reuses** the `issue-responder` guardrails, the CI/CD chain, the
  `beads-dispatcher` CronJob pattern, presence, Loki, and the design skills.
 ## Out of scope / open questions
 - Raw-terminal "take-over" of a worker (T3 is a GUI cockpit, not a terminal); if
  ever needed, that's a separate add-on.
 - Multi-tenant T3 (it is single-operator by design — fine, it matches the shared
  service identity).
 - Cross-repo dependency orchestration beyond per-issue "Blocked by".
 - T3 Code is pre-1.0 (~v0.0.x) and churny; the version-pin + Keel-exclude +
  swappable-backend discipline (ADR 0003) is the mitigation.
--- a/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
+++ b/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
@@ -0,0 +1,69 @@
 # AFK agents push straight to master; failures fix-forward then freeze, not revert
 The AFK implementation pipeline (see
 `docs/2026-06-14-afk-implementation-pipeline-design.md`) lets an autonomous
 agent land code with no human at the keyboard. The owner deliberately chose the
 most hands-off posture: **AFK-written code pushes straight to `master`** (which
 then deploys via the existing CI/CD chain) with **no pull-request review gate**,
 and when a deploy breaks, the agent **fixes forward and then freezes the broken
 state** rather than auto-reverting. This ADR records that risk posture and why it
 was chosen over the safer alternatives, because it is surprising and not cheap to
 walk back once callers and habits depend on it.
 ## Status
 accepted (2026-06-14) — posture decided; enforced once the pipeline ships
 (pilot-gated).
 ## Context
 `master` on every enrolled repo deploys continuously (GHA build → ghcr →
 Woodpecker → Keel). So "where AFK code lands" is really "what reaches a live
 deploy without a human looking". The owner weighed three merge gates and three
 post-push failure responses and picked the autonomy-maximizing end of both,
 accepting the blast radius explicitly.
 ## Considered options — merge gate
 - **Always push to master (chosen).** Tests-green is the gate; CI + rollback are
  the safety net. Matches the existing human allow-then-audit model (non-admins
  already push straight to master). Most hands-off.
 - **Adaptive (push if confident, else PR)** — rejected as the *default* though it
  is what `issue-responder` does; the owner wanted full hands-off, not a
  confidence-gated PR for otherwise-working code.
 - **Always open a PR** — rejected: reintroduces a human merge step on every
  issue, i.e. "AFK implementation, human merge" — not the goal.
 ## Considered options — post-push failure (CI/rollout goes red after a green push)
 - **Fix-forward then freeze (chosen).** Iterate with corrective commits up to
  **5 attempts or 60 minutes**; if still red, **leave the broken state in place**
  (do not revert), relabel the issue `ready-for-human`, and hard-page. Same
  forensics-first instinct as the breakglass (ADR 0001): preserve the exact
  failing state for debugging rather than auto-cleaning it away.
 - **Auto-revert + escalate** — rejected, even though it was the recommended
  option: it restores green fastest, but destroys the forensic state the owner
  wants to inspect.
 - **Alert and freeze immediately (no fix-forward)** — rejected: gives up on
  transient/env-drift failures a corrective commit would clear.
 Pre-push failure (can't reach green, blocked, or would need a disallowed op) is
 not a dilemma: the agent does **not** push, relabels `ready-for-human`, comments
 what it tried, and pages.
 ## Consequences
 - An unreviewed logic error can deploy before any human sees it; rollback (not
  review) is the safety net. Bounded by: tests-as-gate, the start-small
  allowlist, the per-repo lock, and the kill switch.
 - A frozen-broken deploy can sit unhealthy until the owner answers the page —
  availability is traded for debuggability, by explicit choice. Acceptable
  because enrolled repos are non-critical by the allowlist prerequisite, and the
  owner is paged hard (Slack + ntfy).
 - Fix-forward can stack up to 5 commits on a bad change before freezing; the
  60-minute cap bounds the churn window.
 - Per-issue spend is capped at `max_budget_usd = 100`.
 - Guardrails still hold underneath this posture: no PVC/PV deletes, no direct
  Vault edits, no force-push, infra changes Terraform-only, never `[ci skip]`.
 - Reversible: tightening to adaptive/PR or to auto-revert is a config + watcher
  change, not a re-architecture — but callers/habits will have formed around
  "it just lands", so flag loudly if reversing.
--- a/docs/adr/0003-t3-thin-executor-and-cockpit.md
+++ b/docs/adr/0003-t3-thin-executor-and-cockpit.md
@@ -0,0 +1,70 @@
 # AFK workers run inside a dedicated T3 Code instance; claude-agent-service dispatches into it
 The owner wants one UI to see and converse with every in-flight AFK worker, and
 named **T3 Code** (the self-hosted multi-agent cockpit already running at
 `t3.viktorbarzin.me`) as that UI. Research into T3's source
 (`pingdotgg/t3code`, ~v0.0.27) found it is genuinely built for this — a fleet of
 worker "threads" with a live read-model and a scoped HTTP dispatch API — **but**
 it can only display sessions **it launched itself**; there is no command to adopt
 a session another process started. So "viewable in T3" ⟺ "launched by T3". This
 ADR records the resulting architecture: `claude-agent-service` stays the
 **control plane** and **dispatches into a dedicated, in-cluster T3 instance**
 which is the **executor + cockpit**. The agent runs inside T3; we keep the brain.
 ## Status
 accepted (2026-06-14) — direction decided; **gated on a pilot** (the five
 unknowns in the design doc) before the poller is wired and the architecture is
 committed.
 ## Why T3, and why "thin"
 T3 provides, out of the box, what we would otherwise hand-build: a three-panel
 fleet cockpit (`projects → threads → conversation`), an
 `OrchestrationReadModel` with per-thread live status, and
 `POST /api/orchestration/dispatch` whose `thread.turn.start` + `bootstrap` can
 **create a thread, prepare a git worktree, run a setup script, and deliver a
 prompt in one call** — exactly the worker-spawn primitive. Converse / approve /
 resume are native (`thread.user-input.respond`, `thread.approval.respond`). For
 Claude it embeds `@anthropic-ai/claude-agent-sdk`.
 "Thin" = the AFK *behavior and safety* (the `issue-implementer` prompt,
 guardrails, always-push, fix-forward/freeze, CI-watch, issue integration) live
 in **our** layer (the poller + watcher), not in T3. T3 is a **swappable backend**
 we drive over its API.
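One way to keep the backend swappable is for the poller and watcher to depend on a narrow interface rather than on T3 directly. The sketch below is illustrative only: the `Executor` protocol, its two methods, and the in-memory `FakeExecutor` are hypothetical, and a real T3-backed implementation would translate these calls into the dispatch and snapshot endpoints.

```python
from typing import Protocol


class Executor(Protocol):
    """The narrow surface the control plane needs from any executor backend."""

    def dispatch(self, repo: str, issue_number: int, prompt: str) -> str: ...

    def snapshot(self) -> dict: ...


class FakeExecutor:
    """In-memory stand-in, useful for testing the poller without T3."""

    def __init__(self) -> None:
        self._threads = {}

    def dispatch(self, repo: str, issue_number: int, prompt: str) -> str:
        thread_id = f"{repo}#{issue_number}"
        self._threads[thread_id] = "running"
        return thread_id

    def snapshot(self) -> dict:
        return dict(self._threads)  # copy, so callers cannot mutate our state


def start_issue(executor: Executor, repo: str, issue_number: int, prompt: str) -> str:
    """Control-plane code sees only the Executor interface, never T3 itself."""
    return executor.dispatch(repo, issue_number, prompt)
```

With this shape, replacing T3 with a self-hosted runtime would mean writing one new class that satisfies `Executor`, leaving the poller and watcher untouched.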
 ## Considered options
 - **Thin: claude-agent-service dispatches into T3 (chosen).** Control plane calls
  T3's dispatch API; T3 runs the agent in a worktree and shows it. Get the fleet
  view, keep the brain, least to build. Cost: execution moves into the T3 pod, so
  T3's runtime is in the *hot path* (not just the window).
 - **claude-agent-service runs the agent, T3 only displays it** — rejected because
  it is impossible: T3 cannot adopt an externally-started session
  (`thread.session.set` is server-internal; no external-session-id field). This
  is the constraint that shaped the whole decision.
 - **Deep: claude-agent-service as a custom T3 provider (ACP-style)** — rejected
  for now: keeps the runtime ours with a T3 UI, but means building and
  maintaining a provider against a pre-1.0, internal, no-contributions interface
  — effectively a fork. Revisit only if "thin" proves too limiting.
 - **Skip T3; build our own console** (generalized breakglass + tmux) — rejected:
  most stable and fully in-house, but abandons the owner's explicit "see workers
  in T3" goal and means owning a session console forever.
 ## Consequences
 - A **dedicated in-cluster T3 instance** (a pod, consistent with the earlier
  in-cluster-over-devvm substrate choice) is the worker host, separate from the
  per-user devvm T3 instances. It needs the SSD worktree volume, git/Anthropic
  tokens, toolchains, `claude auth`, and an internal Authentik-gated ingress.
 - T3's runtime is now in the **execution hot path** — its maturity affects
  whether work *runs*, not only whether it can be *seen*. Mitigations: **pin the
  version and exclude it from Keel** (its churn + hard-cutover auth migrations
  make auto-upgrade a Keel-class hazard), keep the integration thin and the
  backend swappable, and **pilot** the five unknowns first.
 - T3 is **single-operator** — fine here: it matches the already-accepted shared
  service identity for AFK work.
 - No outbound webhooks from T3 → the watcher **polls**
  `GET /api/orchestration/snapshot`.
 - This supersedes the intermediate ideas of evolving `claude-agent-service` into
  its own session/tmux/worktree runtime and building a bespoke attach console.
--- a/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
+++ b/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
@@ -0,0 +1,68 @@
 # Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
 `2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
 and chose throwaway `git clone --local` per job, "because worktrees share one
 `.git` → agents that `git commit`/`pull` still contend — not truly independent".
 The AFK implementation pipeline
 (`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
 own path**: each enrolled repo gets a **persistent checkout**, and each issue
 runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
 records why the earlier rejection does not apply here — so the two decisions
 read as complementary, not contradictory.
 ## Status
 accepted (2026-06-14) — for the AFK implementation path only; the existing
 job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
 ## Why the 2026-06-02 rejection doesn't bind this path
 The rejection's premise was **concurrent jobs in the same checkout** contending
 on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
 is **serial within a repo, parallel only across repos** (decided in the design
 doc's "Concurrency & locking" section): at most one agent ever touches a given
 repo's `.git` at a time,
 and different repos are different checkouts. The contention the rejection guarded
 against cannot occur here. With that removed, worktrees become the *better*
 choice because they unlock cache reuse the throwaway model can't.
 ## Considered options
 - **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
  objects, **persisted `node_modules`/venv/build caches**, and shared
  package-manager caches survive across jobs, so the TDD loop stops reinstalling
  deps every run. Compounds with `to-issues` clustering many slices in one repo,
  processed serially — slice N reuses slice 1's warm tree.
 - **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
  this path: correct for the concurrent job-runner, but re-pays dependency
  install on every issue, which dominates wall-clock for an
  implement-test-fix-forward loop.
 - **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
  accumulated caches → disk blowup, and no git isolation.
 ## Considered options — storage
 - **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
  (`nfs-truenas` → `/srv/nfs`) and unused; git checkouts + `node_modules` are
  death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
  NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
  size (tens of GB).
 - **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
  too small.
 - **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
  lost on reschedule; NFS RWX survives and the volume also holds session state.
 ## Consequences
 - One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
  warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
  env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
  — today caching is off (`pip --no-cache-dir`, no cache envs set).
 - Housekeeping the throwaway model didn't need: `git fetch` before each
  `worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
  the volume fills.
 - **`infra` stays on its own path** — it is git-crypt, and editing encrypted
  files from a worktree is disallowed; the persistent-worktree model is for the
  non-`infra` app repos in the allowlist.
 - Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
  this volume + our persistent checkouts, or we manage the checkout and point T3
  at it. Resolve before committing the architecture.
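Wiring the cache environment to the volume, as the first consequence requires, might look like the helper below. The environment variable names are the standard ones for each tool; the `caches/` subdirectory layout is an assumption, and the existing `pip --no-cache-dir` flag would also have to be dropped for the pip entry to take effect.

```python
def cache_env(volume_root: str) -> dict:
    """Environment variables pointing package-manager caches at the shared volume."""
    base = volume_root.rstrip("/") + "/caches"  # hypothetical layout
    return {
        "PIP_CACHE_DIR": f"{base}/pip",
        "GOMODCACHE": f"{base}/go/mod",
        "GOCACHE": f"{base}/go/build",
        "PNPM_HOME": f"{base}/pnpm",
        "npm_config_cache": f"{base}/npm",
        "CARGO_HOME": f"{base}/cargo",
    }
```

A worker launcher could merge this into its process environment, for example `env = {**os.environ, **cache_env("/mnt/ssd")}`, where the mount path is again only an example.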