docs: capture AFK implementation pipeline design + ADRs 0002-0004

Record the architecture for moving code implementation AFK, decided in a design/grilling session. The owner wants the human-in-the-loop boundary to stop at design + spec: once an issue is triaged ready-for-agent, an agent should implement it test-first, push it, and see it to a healthy deploy on its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety); a dedicated in-cluster T3 Code instance is the executor + cockpit, because T3 can only show sessions it launched itself -> we dispatch into it (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards then freezes the broken state for forensics rather than reverting (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on SSD-NFS for warm caches, reversing the throwaway-clone rule for this path because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00
commit be81005186
parent 9d8afdd884
4 changed files with 466 additions and 0 deletions
--- a/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
+++ b/docs/adr/0002-afk-autonomous-merge-and-failure-posture.md
@@ -0,0 +1,69 @@
+# AFK agents push straight to master; failures fix-forward then freeze, not revert
+
+The AFK implementation pipeline (see
+`docs/2026-06-14-afk-implementation-pipeline-design.md`) lets an autonomous
+agent land code with no human at the keyboard. The owner deliberately chose the
+most hands-off posture: **AFK-written code pushes straight to `master`** (which
+then deploys via the existing CI/CD chain) with **no pull-request review gate**,
+and when a deploy breaks, the agent **fixes forward and then freezes the broken
+state** rather than auto-reverting. This ADR records that risk posture and why it
+was chosen over the safer alternatives, because it is surprising and not cheap to
+walk back once callers and habits depend on it.
+
+## Status
+
+accepted (2026-06-14) — posture decided; enforced once the pipeline ships
+(pilot-gated).
+
+## Context
+
+`master` on every enrolled repo deploys continuously (GHA build → ghcr →
+Woodpecker → Keel). So "where AFK code lands" is really "what reaches a live
+deploy without a human looking". The owner weighed three merge gates and three
+post-push failure responses and picked the autonomy-maximizing end of both,
+accepting the blast radius explicitly.
+
+## Considered options — merge gate
+
+- **Always push to master (chosen).** Tests-green is the gate; CI + rollback are
+  the safety net. Matches the existing human allow-then-audit model (non-admins
+  already push straight to master). Most hands-off.
+- **Adaptive (push if confident, else PR)** — rejected as the *default* though it
+  is what `issue-responder` does; the owner wanted full hands-off, not a
+  confidence-gated PR for otherwise-working code.
+- **Always open a PR** — rejected: reintroduces a human merge step on every
+  issue, i.e. "AFK implementation, human merge" — not the goal.
+
+## Considered options — post-push failure (CI/rollout goes red after a green push)
+
+- **Fix-forward then freeze (chosen).** Iterate with corrective commits up to
+  **5 attempts or 60 minutes**; if still red, **leave the broken state in place**
+  (do not revert), relabel the issue `ready-for-human`, and hard-page. Same
+  forensics-first instinct as the breakglass (ADR 0001): preserve the exact
+  failing state for debugging rather than auto-cleaning it away.
+- **Auto-revert + escalate** — rejected (was the recommendation): restores green
+  fastest, but destroys the forensic state the owner wants to inspect.
+- **Alert and freeze immediately (no fix-forward)** — rejected: gives up on
+  transient/env-drift failures a corrective commit would clear.
+
+Pre-push failure (can't reach green, blocked, or would need a disallowed op) is
+not a dilemma: the agent does **not** push, relabels `ready-for-human`, comments
+what it tried, and pages.
+
+## Consequences
+
+- An unreviewed logic error can deploy before any human sees it; rollback (not
+  review) is the safety net. Bounded by: tests-as-gate, the start-small
+  allowlist, the per-repo lock, and the kill switch.
+- A frozen-broken deploy can sit unhealthy until the owner answers the page —
+  availability is traded for debuggability, by explicit choice. Acceptable
+  because enrolled repos are non-critical by the allowlist prerequisite, and the
+  owner is paged hard (Slack + ntfy).
+- Fix-forward can stack up to 5 commits on a bad change before freezing; the
+  60-minute cap bounds the churn window.
+- Per-issue spend is capped at `max_budget_usd = 100`.
+- Guardrails still hold underneath this posture: no PVC/PV deletes, no direct
+  Vault edits, no force-push, infra changes Terraform-only, never `[ci skip]`.
+- Reversible: tightening to adaptive/PR or to auto-revert is a config + watcher
+  change, not a re-architecture — but callers/habits will have formed around
+  "it just lands", so flag loudly if reversing.
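The post-push failure policy in this ADR (fix forward for up to 5 attempts or 60 minutes, then freeze without reverting) can be sketched as a small decision function. This is a minimal illustration, not the watcher's actual code: `DeployState`, `Action`, and `next_action` are hypothetical names, and only the two limits come from the ADR text.

```python
from dataclasses import dataclass
from enum import Enum

MAX_ATTEMPTS = 5               # from the ADR: up to 5 corrective attempts
MAX_WINDOW_SECONDS = 60 * 60   # from the ADR: or 60 minutes, whichever comes first


class Action(Enum):
    DONE = "done"                # deploy is healthy, nothing left to do
    FIX_FORWARD = "fix_forward"  # push another corrective commit
    FREEZE = "freeze"            # leave the broken state, relabel, hard-page


@dataclass(frozen=True)
class DeployState:
    """Hypothetical view of one issue's deploy, as a watcher might track it."""
    healthy: bool            # CI and rollout are green
    attempts_used: int       # corrective commits pushed so far
    elapsed_seconds: float   # time since the deploy first went red


def next_action(state: DeployState) -> Action:
    """Pick the next step; there is deliberately no revert branch."""
    if state.healthy:
        return Action.DONE
    budget_spent = (
        state.attempts_used >= MAX_ATTEMPTS
        or state.elapsed_seconds >= MAX_WINDOW_SECONDS
    )
    return Action.FREEZE if budget_spent else Action.FIX_FORWARD
```

On `Action.FREEZE` the caller would relabel the issue `ready-for-human` and page the owner, leaving the failing deploy untouched for inspection.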
--- a/docs/adr/0003-t3-thin-executor-and-cockpit.md
+++ b/docs/adr/0003-t3-thin-executor-and-cockpit.md
@@ -0,0 +1,70 @@
+# AFK workers run inside a dedicated T3 Code instance; claude-agent-service dispatches into it
+
+The owner wants one UI to see and converse with every in-flight AFK worker, and
+named **T3 Code** (the self-hosted multi-agent cockpit already running at
+`t3.viktorbarzin.me`) as that UI. Research into T3's source
+(`pingdotgg/t3code`, ~v0.0.27) found it is genuinely built for this — a fleet of
+worker "threads" with a live read-model and a scoped HTTP dispatch API — **but**
+it can only display sessions **it launched itself**; there is no command to adopt
+a session another process started. So "viewable in T3" ⟺ "launched by T3". This
+ADR records the resulting architecture: `claude-agent-service` stays the
+**control plane** and **dispatches into a dedicated, in-cluster T3 instance**
+which is the **executor + cockpit**. The agent runs inside T3; we keep the brain.
+
+## Status
+
+accepted (2026-06-14) — direction decided; **gated on a pilot** (the five
+unknowns in the design doc) before the poller is wired and the architecture is
+committed.
+
+## Why T3, and why "thin"
+
+T3 provides, out of the box, what we would otherwise hand-build: a three-panel
+fleet cockpit (`projects → threads → conversation`), an
+`OrchestrationReadModel` with per-thread live status, and
+`POST /api/orchestration/dispatch` whose `thread.turn.start` + `bootstrap` can
+**create a thread, prepare a git worktree, run a setup script, and deliver a
+prompt in one call** — exactly the worker-spawn primitive. Converse / approve /
+resume are native (`thread.user-input.respond`, `thread.approval.respond`). For
+Claude it embeds `@anthropic-ai/claude-agent-sdk`.
+
+"Thin" = the AFK *behavior and safety* (the `issue-implementer` prompt,
+guardrails, always-push, fix-forward/freeze, CI-watch, issue integration) live
+in **our** layer (the poller + watcher), not in T3. T3 is a **swappable backend**
+we drive over its API.
+
+## Considered options
+
+- **Thin: claude-agent-service dispatches into T3 (chosen).** Control plane calls
+  T3's dispatch API; T3 runs the agent in a worktree and shows it. Get the fleet
+  view, keep the brain, least to build. Cost: execution moves into the T3 pod, so
+  T3's runtime is in the *hot path* (not just the window).
+- **claude-agent-service runs the agent, T3 only displays it** — rejected because
+  it is impossible: T3 cannot adopt an externally-started session
+  (`thread.session.set` is server-internal; no external-session-id field). This
+  is the constraint that shaped the whole decision.
+- **Deep: claude-agent-service as a custom T3 provider (ACP-style)** — rejected
+  for now: keeps the runtime ours with a T3 UI, but means building and
+  maintaining a provider against a pre-1.0, internal, no-contributions interface
+  — effectively a fork. Revisit only if "thin" proves too limiting.
+- **Skip T3; build our own console** (generalized breakglass + tmux) — rejected:
+  most stable and fully in-house, but abandons the owner's explicit "see workers
+  in T3" goal and means owning a session console forever.
+
+## Consequences
+
+- A **dedicated in-cluster T3 instance** (a pod, consistent with the earlier
+  in-cluster-over-devvm substrate choice) is the worker host, separate from the
+  per-user devvm T3 instances. It needs the SSD worktree volume, git/Anthropic
+  tokens, toolchains, `claude auth`, and an internal Authentik-gated ingress.
+- T3's runtime is now in the **execution hot path** — its maturity affects
+  whether work *runs*, not only whether it can be *seen*. Mitigations: **pin the
+  version and exclude it from Keel** (its churn + hard-cutover auth migrations
+  make auto-upgrade a Keel-class hazard), keep the integration thin and the
+  backend swappable, and **pilot** the five unknowns first.
+- T3 is **single-operator** — fine here: it matches the already-accepted shared
+  service identity for AFK work.
+- No outbound webhooks from T3 → the watcher **polls**
+  `GET /api/orchestration/snapshot`.
+- This supersedes the intermediate ideas of evolving `claude-agent-service` into
+  its own session/tmux/worktree runtime and building a bespoke attach console.
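Because T3 emits no outbound webhooks, the watcher has to poll. The loop below is a minimal sketch of that polling pattern under stated assumptions: the snapshot schema is not given in this ADR, so the HTTP call (`fetch_snapshot`, which would wrap `GET /api/orchestration/snapshot`) and the completion check (`is_terminal`) are injected by the caller. `poll_until` and its parameters are hypothetical names.

```python
import time
from typing import Any, Callable, Optional

Snapshot = Any  # the real snapshot shape is unknown here; treated as opaque


def poll_until(
    fetch_snapshot: Callable[[], Snapshot],
    is_terminal: Callable[[Snapshot], bool],
    interval_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Snapshot]:
    """Poll until is_terminal(snapshot) is true; return None on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        snapshot = fetch_snapshot()
        if is_terminal(snapshot):
            return snapshot
        # Stop if the next poll would land past the deadline.
        if clock() + interval_seconds > deadline:
            return None
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a `None` result is the point where the watcher would escalate rather than wait indefinitely.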
--- a/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
+++ b/docs/adr/0004-persistent-worktrees-for-implementation-agents.md
@@ -0,0 +1,68 @@
+# Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
+
+`2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
+and chose throwaway `git clone --local` per job, "because worktrees share one
+`.git` → agents that `git commit`/`pull` still contend — not truly independent".
+The AFK implementation pipeline
+(`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
+own path**: each enrolled repo gets a **persistent checkout**, and each issue
+runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
+records why the earlier rejection does not apply here — so the two decisions
+read as complementary, not contradictory.
+
+## Status
+
+accepted (2026-06-14) — for the AFK implementation path only; the existing
+job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
+
+## Why the 2026-06-02 rejection doesn't bind this path
+
+The rejection's premise was **concurrent jobs in the same checkout** contending
+on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
+is **serial within a repo, parallel only across repos** (ADR-adjacent decision in
+the design doc): at most one agent ever touches a given repo's `.git` at a time,
+and different repos are different checkouts. The contention the rejection guarded
+against cannot occur here. With that removed, worktrees become the *better*
+choice because they unlock cache reuse the throwaway model can't.
+
+## Considered options
+
+- **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
+  objects, **persisted `node_modules`/venv/build caches**, and shared
+  package-manager caches survive across jobs, so the TDD loop stops reinstalling
+  deps every run. Compounds with `to-issues` clustering many slices in one repo,
+  processed serially — slice N reuses slice 1's warm tree.
+- **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
+  this path: correct for the concurrent job-runner, but re-pays dependency
+  install on every issue, which dominates wall-clock for an
+  implement-test-fix-forward loop.
+- **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
+  accumulated caches → disk blowup, and no git isolation.
+
+## Considered options — storage
+
+- **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
+  (`nfs-truenas` → `/srv/nfs`) and unused; git checkouts + `node_modules` are
+  death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
+  NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
+  size (tens of GB).
+- **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
+  too small.
+- **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
+  lost on reschedule; NFS RWX survives and the volume also holds session state.
+
+## Consequences
+
+- One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
+  warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
+  env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
+  — today caching is off (`pip --no-cache-dir`, no cache envs set).
+- Housekeeping the throwaway model didn't need: `git fetch` before each
+  `worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
+  the volume fills.
+- **`infra` stays on its own path**: it is encrypted with git-crypt, and editing encrypted
+  files from a worktree is disallowed; the persistent-worktree model is for the
+  non-`infra` app repos in the allowlist.
+- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
+  this volume + our persistent checkouts, or we manage the checkout and point T3
+  at it. Resolve before committing the architecture.
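The housekeeping and cache wiring listed under Consequences can be sketched as plain command and environment builders. This is an illustration only, assuming we manage the checkout ourselves (the open pilot question above may change that). The function names, the `caches/` directory layout, and the `origin/master` default base are hypothetical; the environment variable names and git subcommands are the standard ones.

```python
from pathlib import PurePosixPath


def cache_env(volume: str) -> dict[str, str]:
    """Point package-manager caches at the shared volume (layout is assumed)."""
    caches = PurePosixPath(volume) / "caches"
    return {
        # Only takes effect once `pip --no-cache-dir` is dropped.
        "PIP_CACHE_DIR": str(caches / "pip"),
        "GOMODCACHE": str(caches / "go" / "mod"),
        "GOCACHE": str(caches / "go" / "build"),
        "PNPM_HOME": str(caches / "pnpm"),  # variable named in the ADR
        "npm_config_cache": str(caches / "npm"),
        "CARGO_HOME": str(caches / "cargo"),
    }


def worktree_commands(
    checkout: str, worktree: str, branch: str, base: str = "origin/master"
) -> list[list[str]]:
    """Fetch first, then add a worktree on a new branch off the fresh base."""
    git = ["git", "-C", checkout]
    return [
        git + ["fetch", "origin"],
        git + ["worktree", "add", "-b", branch, worktree, base],
    ]


def housekeeping_commands(checkout: str) -> list[list[str]]:
    """Periodic cleanup the throwaway-clone model never needed."""
    git = ["git", "-C", checkout]
    return [git + ["worktree", "prune"], git + ["gc"]]
```

Each command is an argv list, so a caller could run them in order with `subprocess.run(cmd, check=True, env={**os.environ, **cache_env(volume)})`.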