docs: capture AFK implementation pipeline design + ADRs 0002-0004

Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to
stop at design + spec: once an issue is triaged ready-for-agent, an agent
should implement it test-first, push it, and see it to a healthy deploy on
its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit, because
  T3 can only show sessions it launched itself -> we dispatch into it
  (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated
T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-14 19:09:12 +00:00
parent 9d8afdd884
commit be81005186
4 changed files with 466 additions and 0 deletions


@ -0,0 +1,69 @@
# AFK agents push straight to master; failures fix-forward then freeze, not revert
The AFK implementation pipeline (see
`docs/2026-06-14-afk-implementation-pipeline-design.md`) lets an autonomous
agent land code with no human at the keyboard. The owner deliberately chose the
most hands-off posture: **AFK-written code pushes straight to `master`** (which
then deploys via the existing CI/CD chain) with **no pull-request review gate**,
and when a deploy breaks, the agent **fixes forward and then freezes the broken
state** rather than auto-reverting. This ADR records that risk posture and why it
was chosen over the safer alternatives, because it is surprising and not cheap to
walk back once callers and habits depend on it.
## Status
accepted (2026-06-14) — posture decided; enforced once the pipeline ships
(pilot-gated).
## Context
`master` on every enrolled repo deploys continuously (GHA build → ghcr →
Woodpecker → Keel). So "where AFK code lands" is really "what reaches a live
deploy without a human looking". The owner weighed three merge gates and three
post-push failure responses and picked the autonomy-maximizing end of both,
accepting the blast radius explicitly.
## Considered options — merge gate
- **Always push to master (chosen).** Tests-green is the gate; CI + rollback are
the safety net. Matches the existing human allow-then-audit model (non-admins
already push straight to master). Most hands-off.
- **Adaptive (push if confident, else PR)** — rejected as the *default* though it
is what `issue-responder` does; the owner wanted full hands-off, not a
confidence-gated PR for otherwise-working code.
- **Always open a PR** — rejected: reintroduces a human merge step on every
issue, i.e. "AFK implementation, human merge" — not the goal.
## Considered options — post-push failure (CI/rollout goes red after a green push)
- **Fix-forward then freeze (chosen).** Iterate with corrective commits up to
**5 attempts or 60 minutes**; if still red, **leave the broken state in place**
(do not revert), relabel the issue `ready-for-human`, and hard-page. Same
forensics-first instinct as the breakglass (ADR 0001): preserve the exact
failing state for debugging rather than auto-cleaning it away.
- **Auto-revert + escalate** — rejected (was the recommendation): restores green
fastest, but destroys the forensic state the owner wants to inspect.
- **Alert and freeze immediately (no fix-forward)** — rejected: gives up on
transient/env-drift failures a corrective commit would clear.
Pre-push failure (can't reach green, blocked, or would need a disallowed op) is
not a dilemma: the agent does **not** push, relabels `ready-for-human`, comments
what it tried, and pages.
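
The post-push response above can be sketched as a bounded loop. This is a minimal illustration, not the watcher's actual code: the callables `deploy_is_healthy`, `push_corrective_commit`, and `escalate` are hypothetical hooks, and only the limits (5 attempts, 60 minutes) and the `ready-for-human` label come from this ADR. A real watcher would also wait for CI and rollout to settle between attempts.

```python
import time
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 5               # from the ADR: at most 5 corrective commits
MAX_WINDOW_SECONDS = 60 * 60   # from the ADR: or 60 minutes, whichever hits first


@dataclass
class Outcome:
    status: str    # "healthy" or "frozen"
    attempts: int  # number of corrective commits pushed


def fix_forward_then_freeze(
    deploy_is_healthy: Callable[[], bool],          # hypothetical CI/rollout probe
    push_corrective_commit: Callable[[int], None],  # hypothetical agent hook
    escalate: Callable[[str], None],                # hypothetical relabel + page hook
    now: Callable[[], float] = time.monotonic,
) -> Outcome:
    started = now()
    attempts = 0
    while not deploy_is_healthy():
        out_of_attempts = attempts >= MAX_ATTEMPTS
        out_of_time = now() - started >= MAX_WINDOW_SECONDS
        if out_of_attempts or out_of_time:
            # Freeze: deliberately no revert, so the failing state stays inspectable.
            escalate("ready-for-human")
            return Outcome("frozen", attempts)
        attempts += 1
        push_corrective_commit(attempts)
    return Outcome("healthy", attempts)
```

Injecting the clock and the three hooks keeps the budget logic testable without a real deploy; the freeze branch intentionally contains no cleanup step.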
## Consequences
- An unreviewed logic error can deploy before any human sees it; rollback (not
review) is the safety net. Bounded by: tests-as-gate, the start-small
allowlist, the per-repo lock, and the kill switch.
- A frozen-broken deploy can sit unhealthy until the owner answers the page —
availability is traded for debuggability, by explicit choice. Acceptable
because enrolled repos are non-critical by the allowlist prerequisite, and the
owner is paged hard (Slack + ntfy).
- Fix-forward can stack up to 5 commits on a bad change before freezing; the
60-minute cap bounds the churn window.
- Per-issue spend is capped at `max_budget_usd = 100`.
- Guardrails still hold underneath this posture: no PVC/PV deletes, no direct
Vault edits, no force-push, infra changes Terraform-only, never `[ci skip]`.
- Reversible: tightening to adaptive/PR or to auto-revert is a config + watcher
change, not a re-architecture — but callers/habits will have formed around
"it just lands", so flag loudly if reversing.

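Some of the guardrails listed above are mechanically checkable before a push. The sketch below is illustrative only: `guardrail_violations` and its return labels are hypothetical names, it covers just the force-push, `[ci skip]`, and budget rules, and its flag matching is not exhaustive (for example it does not catch a `+refspec` force push). The PVC/PV, Vault, and Terraform-only rules need enforcement elsewhere.

```python
from typing import List

MAX_BUDGET_USD = 100  # per-issue spend cap stated in the ADR

FORCE_PUSH_FLAGS = {"--force", "-f", "--force-with-lease"}


def guardrail_violations(git_push_args: List[str], commit_message: str,
                         spent_usd: float) -> List[str]:
    """Return the names of violated guardrails; an empty list means OK to push."""
    problems: List[str] = []
    forced = any(
        arg in FORCE_PUSH_FLAGS or arg.startswith("--force-with-lease=")
        for arg in git_push_args
    )
    if forced:
        problems.append("force-push")
    if "[ci skip]" in commit_message.lower():
        problems.append("ci-skip")
    if spent_usd > MAX_BUDGET_USD:
        problems.append("budget-exceeded")
    return problems
```

For example, `guardrail_violations(["origin", "master"], "fix: retry flaky probe", 12.5)` returns an empty list, while a forced push with `[ci skip]` in the message reports both violations.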

@ -0,0 +1,70 @@
# AFK workers run inside a dedicated T3 Code instance; claude-agent-service dispatches into it
The owner wants one UI to see and converse with every in-flight AFK worker, and
named **T3 Code** (the self-hosted multi-agent cockpit already running at
`t3.viktorbarzin.me`) as that UI. Research into T3's source
(`pingdotgg/t3code`, ~v0.0.27) found it is genuinely built for this — a fleet of
worker "threads" with a live read-model and a scoped HTTP dispatch API — **but**
it can only display sessions **it launched itself**; there is no command to adopt
a session another process started. So "viewable in T3" ⟺ "launched by T3". This
ADR records the resulting architecture: `claude-agent-service` stays the
**control plane** and **dispatches into a dedicated, in-cluster T3 instance**
which is the **executor + cockpit**. The agent runs inside T3; we keep the brain.
## Status
accepted (2026-06-14) — direction decided; **gated on a pilot** (the five
unknowns in the design doc) before the poller is wired and the architecture is
committed.
## Why T3, and why "thin"
T3 provides, out of the box, what we would otherwise hand-build: a three-panel
fleet cockpit (`projects → threads → conversation`), an
`OrchestrationReadModel` with per-thread live status, and
`POST /api/orchestration/dispatch` whose `thread.turn.start` + `bootstrap` can
**create a thread, prepare a git worktree, run a setup script, and deliver a
prompt in one call** — exactly the worker-spawn primitive. Converse / approve /
resume are native (`thread.user-input.respond`, `thread.approval.respond`). For
Claude it embeds `@anthropic-ai/claude-agent-sdk`.
"Thin" = the AFK *behavior and safety* (the `issue-implementer` prompt,
guardrails, always-push, fix-forward/freeze, CI-watch, issue integration) live
in **our** layer (the poller + watcher), not in T3. T3 is a **swappable backend**
we drive over its API.
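
To make the worker-spawn primitive concrete, the sketch below builds (but does not send) a dispatch request. Only the endpoint path and the names `thread.turn.start` and `bootstrap` come from this ADR. The payload field names (`project`, `worktree`, `setupScript`, `prompt`), the bearer-token auth, and the setup script path are assumptions for illustration, since the real request shape is one of the things the pilot has to confirm.

```python
import json
import urllib.request
from typing import Any, Dict


def build_dispatch_request(base_url: str, token: str, repo_path: str,
                           issue_number: int, prompt: str) -> urllib.request.Request:
    # Field names inside `command` are placeholders, not T3's confirmed schema.
    command: Dict[str, Any] = {
        "type": "thread.turn.start",
        "bootstrap": {
            "project": repo_path,                     # hypothetical: which checkout
            "worktree": f"issue-{issue_number}",      # hypothetical: worktree name
            "setupScript": "./scripts/afk-setup.sh",  # hypothetical: setup script
        },
        "prompt": prompt,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/orchestration/dispatch",
        data=json.dumps(command).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed auth scheme
        },
        method="POST",
    )
```

Keeping request construction in one small function is what makes T3 swappable: the poller only needs to know "dispatch this issue", not the backend's wire format.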
## Considered options
- **Thin: claude-agent-service dispatches into T3 (chosen).** Control plane calls
T3's dispatch API; T3 runs the agent in a worktree and shows it. This gets the fleet
view, keeps the brain in our layer, and is the least to build. Cost: execution moves into the T3 pod, so
T3's runtime is in the *hot path* (not just the window).
- **claude-agent-service runs the agent, T3 only displays it** — rejected because
it is impossible: T3 cannot adopt an externally-started session
(`thread.session.set` is server-internal; no external-session-id field). This
is the constraint that shaped the whole decision.
- **Deep: claude-agent-service as a custom T3 provider (ACP-style)** — rejected
for now: keeps the runtime ours with a T3 UI, but means building and
maintaining a provider against a pre-1.0, internal, no-contributions interface
— effectively a fork. Revisit only if "thin" proves too limiting.
- **Skip T3; build our own console** (generalized breakglass + tmux) — rejected:
most stable and fully in-house, but abandons the owner's explicit "see workers
in T3" goal and means owning a session console forever.
## Consequences
- A **dedicated in-cluster T3 instance** (a pod, consistent with the earlier
in-cluster-over-devvm substrate choice) is the worker host, separate from the
per-user devvm T3 instances. It needs the SSD worktree volume, git/Anthropic
tokens, toolchains, `claude auth`, and an internal Authentik-gated ingress.
- T3's runtime is now in the **execution hot path** — its maturity affects
whether work *runs*, not only whether it can be *seen*. Mitigations: **pin the
version and exclude it from Keel** (its churn + hard-cutover auth migrations
make auto-upgrade a Keel-class hazard), keep the integration thin and the
backend swappable, and **pilot** the five unknowns first.
- T3 is **single-operator** — fine here: it matches the already-accepted shared
service identity for AFK work.
- No outbound webhooks from T3 → the watcher **polls**
`GET /api/orchestration/snapshot`.
- This supersedes the intermediate ideas of evolving `claude-agent-service` into
its own session/tmux/worktree runtime and building a bespoke attach console.
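
Because T3 emits no webhooks, the watcher's view of a worker reduces to a polling loop over the snapshot endpoint. The following is a sketch under assumptions: `fetch_snapshot` is a hypothetical callable wrapping `GET /api/orchestration/snapshot`, and the snapshot layout (`threads` keyed by id, each with a `status`) and the terminal status names are invented for illustration, not taken from T3.

```python
import time
from typing import Any, Callable, Dict, Optional

TERMINAL_STATUSES = {"completed", "failed"}  # hypothetical status names


def wait_for_thread(
    fetch_snapshot: Callable[[], Dict[str, Any]],  # wraps the snapshot GET
    thread_id: str,
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the thread reaches a terminal status; None means timed out."""
    deadline = now() + timeout_s
    while True:
        snapshot = fetch_snapshot()
        status = snapshot.get("threads", {}).get(thread_id, {}).get("status")
        if status in TERMINAL_STATUSES:
            return status
        if now() >= deadline:
            return None
        sleep(interval_s)
```

A `None` result is a natural trigger for the same escalation path as any other stuck worker.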


@ -0,0 +1,68 @@
# Implementation agents use persistent per-repo checkouts + git worktrees, reversing the throwaway-clone rule for this path
`2026-06-02-parallel-execution-design.md` deliberately **rejected git worktrees**
and chose throwaway `git clone --local` per job, "because worktrees share one
`.git` → agents that `git commit`/`pull` still contend — not truly independent".
The AFK implementation pipeline
(`docs/2026-06-14-afk-implementation-pipeline-design.md`) **reverses that for its
own path**: each enrolled repo gets a **persistent checkout**, and each issue
runs in a **`git worktree`** off it, on a shared **SSD-NFS** volume. This ADR
records why the earlier rejection does not apply here — so the two decisions
read as complementary, not contradictory.
## Status
accepted (2026-06-14) — for the AFK implementation path only; the existing
job-runner (recruiter-triage, nextcloud-todos, etc.) keeps throwaway clones.
## Why the 2026-06-02 rejection doesn't bind this path
The rejection's premise was **concurrent jobs in the same checkout** contending
on `.git/index.lock` and racing `git pull`. The AFK pipeline's concurrency model
is **serial within a repo, parallel only across repos** (ADR-adjacent decision in
the design doc): at most one agent ever touches a given repo's `.git` at a time,
and different repos are different checkouts. The contention the rejection guarded
against cannot occur here. With that removed, worktrees become the *better*
choice because they unlock cache reuse the throwaway model can't.
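
The "serial within a repo, parallel across repos" rule can be illustrated with one lock per repo. This is a minimal in-process sketch: `RepoSerializer` and the repo names are hypothetical, and a real control plane spanning multiple pods would need a lock that lives outside process memory.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class RepoSerializer:
    """Run jobs one at a time per repo, while different repos run in parallel."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks = {}                # repo name -> threading.Lock

    def run(self, repo: str, job: Callable[[], T]) -> T:
        with self._guard:
            lock = self._locks.setdefault(repo, threading.Lock())
        with lock:  # only one job per repo touches its .git at a time
            return job()


def demo() -> List[str]:
    """Return the repos where two jobs overlapped (expected: none)."""
    serializer = RepoSerializer()
    busy: Dict[str, bool] = {}
    overlaps: List[str] = []

    def implement(repo: str) -> Callable[[], str]:
        def job() -> str:
            if busy.get(repo):
                overlaps.append(repo)
            busy[repo] = True
            time.sleep(0.01)  # stand-in for the agent's work
            busy[repo] = False
            return repo
        return job

    repos = ["app-a", "app-a", "app-b", "app-a"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda r: serializer.run(r, implement(r)), repos))
    return overlaps
```

With the per-repo lock in place, the `.git/index.lock` contention that motivated the 2026-06-02 rejection has no window in which to occur.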
## Considered options
- **Persistent checkout + worktree per issue, on SSD-NFS (chosen).** Warm git
objects, **persisted `node_modules`/venv/build caches**, and shared
package-manager caches survive across jobs, so the TDD loop stops reinstalling
deps every run. Compounds with `to-issues` clustering many slices in one repo,
processed serially — slice N reuses slice 1's warm tree.
- **Throwaway `git clone --local` per job (status quo elsewhere)** — rejected for
this path: correct for the concurrent job-runner, but re-pays dependency
install on every issue, which dominates wall-clock for an
implement-test-fix-forward loop.
- **`cp -a` of a warm tree** — rejected (same reason as 2026-06-02): copies
accumulated caches → disk blowup, and no git isolation.
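
As an illustration of the chosen option, the per-issue setup can be expressed as two git commands against the persistent checkout. The paths, the `afk/issue-N` branch name, and the helper name are assumptions; the git subcommands themselves (`fetch`, `worktree add -b`) are standard. Whether T3's own `prepareWorktree` or our layer runs these is still an open pilot question.

```python
from pathlib import PurePosixPath
from typing import List


def worktree_commands(checkout: PurePosixPath, worktrees_root: PurePosixPath,
                      issue: int, base: str = "origin/master") -> List[List[str]]:
    """Commands to prepare a fresh worktree for one issue off a warm checkout."""
    branch = f"afk/issue-{issue}"              # hypothetical branch naming
    target = worktrees_root / f"issue-{issue}"
    git = ["git", "-C", str(checkout)]
    return [
        git + ["fetch", "origin"],  # refresh the persistent checkout first
        git + ["worktree", "add", "-b", branch, str(target), base],
    ]
```

Returning argument lists rather than running them keeps the sketch side-effect free; a caller could pass each list to `subprocess.run(cmd, check=True)`.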
## Considered options — storage
- **SSD-NFS (chosen).** The current `/persistent` PVC is `5Gi` **HDD NFS**
(`nfs-truenas`, `/srv/nfs`) and unused; git checkouts + `node_modules` are
death-by-small-files on HDD NFS and 5Gi is too small. Provision an SSD-backed
NFS class over `/srv/nfs-ssd` (other apps already use that path) at a realistic
size (tens of GB).
- **HDD NFS / `/persistent` as-is** — rejected: too slow for many small files,
too small.
- **Local block (proxmox-lvm)** — rejected: faster but HDD and node-pinned (RWO),
lost on reschedule; NFS RWX survives and the volume also holds session state.
## Consequences
- One **SSD-NFS volume** holds, per enrolled repo: the persistent checkout, the
warm dep/package caches, and (under ADR 0003) the worktrees T3 prepares. Cache
env (`pip`, `GOMODCACHE`/`GOCACHE`, `PNPM_HOME`/npm, cargo) must be wired to it
— today caching is off (`pip --no-cache-dir`, no cache envs set).
- Housekeeping the throwaway model didn't need: `git fetch` before each
`worktree add`, periodic `git worktree prune` + `git gc`, and cache eviction if
the volume fills.
- **`infra` stays on its own path**: it uses git-crypt, and editing encrypted
files from a worktree is disallowed; the persistent-worktree model is for the
non-`infra` app repos in the allowlist.
- Open reconciliation (pilot): whether T3's native `prepareWorktree` writes into
this volume + our persistent checkouts, or we manage the checkout and point T3
at it. Resolve before committing the architecture.
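
The cache wiring and housekeeping named in the consequences above might look like the following. This is a sketch under assumptions: the directory layout under the volume is invented, and the choice of `PIP_CACHE_DIR`, `npm_config_cache`, and `CARGO_HOME` as the variables for pip, npm, and cargo is mine (the ADR names only `GOMODCACHE`, `GOCACHE`, and `PNPM_HOME` explicitly). Dropping `pip --no-cache-dir` from existing invocations would still be needed for the pip cache to take effect.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def cache_env(volume_root: PurePosixPath) -> Dict[str, str]:
    """Environment variables pointing tool caches at the shared SSD-NFS volume."""
    caches = volume_root / "caches"  # hypothetical layout
    return {
        "PIP_CACHE_DIR": str(caches / "pip"),
        "GOMODCACHE": str(caches / "go" / "mod"),
        "GOCACHE": str(caches / "go" / "build"),
        "PNPM_HOME": str(caches / "pnpm"),
        "npm_config_cache": str(caches / "npm"),
        "CARGO_HOME": str(caches / "cargo"),
    }


def housekeeping_commands(checkout: PurePosixPath) -> List[List[str]]:
    """Periodic maintenance for one persistent checkout."""
    git = ["git", "-C", str(checkout)]
    return [
        git + ["worktree", "prune"],  # drop records of deleted worktrees
        git + ["gc"],                 # repack and clean up objects
    ]
```

Cache eviction when the volume fills is left out here because the ADR does not specify a policy.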