claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md
Viktor Barzin be81005186
Some checks are pending
Build and Push / lint-and-test (push) Waiting to run
Build and Push / build (push) Blocked by required conditions
Build and Push / deploy (push) Blocked by required conditions
Build and Push / notify-failure (push) Blocked by required conditions
docs: capture AFK implementation pipeline design + ADRs 0002-0004
Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to
stop at design + spec: once an issue is triaged ready-for-agent, an agent
should implement it test-first, push it, and see it to a healthy deploy on
its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit, because
  T3 can only show sessions it launched itself -> we dispatch into it
  (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated
T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00

259 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# AFK implementation pipeline — design
**Date:** 2026-06-14
**Status:** proposed — pilot pending (see "Pilot" below; no code yet)
**Scope:** A new autonomous path that turns a triaged `ready-for-agent` issue
into tested, deployed code with no human at the keyboard. `claude-agent-service`
becomes the **control plane**; a dedicated in-cluster **T3 Code** instance
becomes the **executor + cockpit**. Touches: `claude-agent-service` (new poller
+ dispatch + watcher), a new T3 stack in `infra/`, a shared SSD-NFS volume, and
the per-repo issue trackers.
> Provenance: this design is the output of a long grilling session
> (2026-06-14). It records the decisions *and* the alternatives that were
> considered and dropped, so the reasoning survives. The three hardest-to-reverse
> calls are split into ADRs 00020004.
## Problem
Today the development flow is **grill-with-docs → to-prd → to-issues → triage →
implement**, and *every* stage is human-in-the-loop (HITL), including
implementation. The owner wants the HITL boundary to stop at **design + spec**:
once an issue is triaged `ready-for-agent`, an agent should pick it up and
implement it **AFK** (away from keyboard) — write it test-first, push it, and
see it through to a healthy deploy — escalating to a human only when it genuinely
can't proceed.
Two gaps block this today:
- The only existing issue→agent automation is the **infra `issue-responder`**,
which fires on `user-report`/`feature-request` labels on the `infra` repo
only — not on `ready-for-agent`, not on the other sub-project repos that the
general design flow produces.
- `claude-agent-service` only ever clones `infra`, runs one-shot fire-and-forget
`claude -p` jobs (no session, no live stream, no attach), and has no
multi-repo checkout. The owner wants to *watch and steer* in-flight work, which
the batch model can't offer.
## Goal
- HITL covers design + spec only. Publishing `ready-for-agent` issues is the
release signal (the `to-issues` quiz is the review gate).
- An autonomous loop picks up unblocked `ready-for-agent` issues from
**enrolled** repos, implements them test-first, and lands them — pushing
straight to `master` so CI deploys them (see ADR 0002 for the risk posture).
- The owner can **see all in-flight workers and converse with any of them** from
one UI — the T3 cockpit (see ADR 0003).
- Reuse before building: lean on the existing CI/CD chain, the design skills, T3
Code's multi-agent cockpit, and the persistence/worktree machinery — rather
than hand-building a session console and a bespoke runtime.
## Design
### Roles: control plane vs executor + cockpit
| Concern | Owner |
|---|---|
| When to start, which issue, the prompt, the safety envelope | **claude-agent-service** (control plane) — poller + watcher |
| Running the agent (Claude Agent SDK), the worktree, the fleet UI | **T3 Code** (executor + cockpit) — one dedicated in-cluster instance |
| Build → image → deploy → rollout | existing CI/CD (GHA → ghcr → Woodpecker → Keel) |
| Issue queue + state | the per-repo GitHub issue trackers |
The pivotal constraint that forces this split: **T3 can only display sessions it
launched itself** — it has no command to adopt an externally-started session. So
"viewable in T3" ⟺ "launched by T3". To keep `claude-agent-service` in charge
*and* get the fleet view, the control plane **dispatches into T3** rather than
running `claude` itself. See ADR 0003.
### End-to-end flow
```
HUMAN (interactive session)
/grill-with-docs → /to-prd → /to-issues → /triage
└ produces ready-for-agent issues (dependency-ordered), labeled by a
trusted collaborator. Publishing them = the release signal.
══════════════════════ HANDOFF ══════════════════════
CONTROL PLANE (claude-agent-service, in-cluster)
poller CronJob (every few min):
for repo in allowlist:
skip repo if it already has an agent-in-progress issue (per-repo lock)
pick highest-priority ready-for-agent issue where:
• all "Blocked by" closed • labeled by a trusted collaborator
→ stamp agent-in-progress
→ POST /api/orchestration/dispatch (thread.turn.start + bootstrap:
create thread, prepare worktree, run setup, deliver the prompt)
EXECUTOR + COCKPIT (dedicated T3 instance, in-cluster)
runs the issue-implementer agent (our prompt) in the worktree:
read issue + AGENT-BRIEF + repo CONTEXT.md/ADRs → TDD red-green-refactor
→ commit (paraphrase issue, "Closes #N", AFK trailer) → push master
watcher (control plane) polls GET /api/orchestration/snapshot + CI:
├─ healthy ──────► comment + close issue, drop lock, notify ✅
├─ pre-push block ► do NOT push, relabel ready-for-human, escalate
└─ post-push red ► fix-forward (≤5 attempts / 60 min)
├─ recovers ► healthy
└─ exhausts ► FREEZE broken (preserve forensics),
relabel ready-for-human, hard page
```
### Trigger & dispatch predicate
A poller CronJob (mirrors the existing `beads-dispatcher` pattern; stays
in-cluster because neither the service nor T3 has public ingress). It dispatches
issue *I* in repo *R* iff **all** hold:
- `R` is in the **allowlist** ConfigMap, and the **kill switch** is off;
- `I` has label `ready-for-agent`, applied by a **trusted collaborator** (the
trust gate — on private repos only collaborators can label, so the label *is*
the authorization; external/bot issues never auto-run);
- every issue in `I`'s "Blocked by" is closed;
- `R` has no issue currently labeled `agent-in-progress` (the per-repo lock).
On dispatch it stamps `agent-in-progress`; on any terminal outcome it removes it.
### Concurrency & locking
**Parallel across repos, serial within a repo.** Multiple repos progress at
once; at most one agent per repo (two agents in one repo would collide on the
working tree). Enforced by the `agent-in-progress` label as a per-repo lock.
Starting value; raise later.
### Merge & failure posture — see ADR 0002
- **Always push to master** (no PR gate). Tests-green is the merge gate; CI +
rollback are the safety net, matching the human allow-then-audit model.
- **Pre-push** failure (can't get green / blocked / would need a disallowed op):
do *not* push; relabel `ready-for-human`; comment what was tried; page.
- **Post-push** failure (CI build or rollout red): **fix-forward** up to **5
attempts or 60 minutes**, then if still red **freeze in the broken state**
(preserve forensics — do not auto-revert), relabel `ready-for-human`, hard
page. The owner explicitly chose debuggability over availability here.
- **Budget:** `max_budget_usd = 100` per issue (time/attempt caps usually bite
first).
### Build/test environment & worktrees — see ADR 0004
The agent must run the target repo's test suite (TDD gate) before pushing.
Therefore:
- **Local toolchains scoped to the allowlist** — the executor image carries only
the *enrolled* repos' runtimes; the toolchain set grows in lockstep with the
allowlist.
- **Persistent per-repo checkout + `git worktree` per issue** on a shared
**SSD-NFS** volume, so git objects, installed deps, and package-manager caches
stay warm across jobs. This **supersedes** the throwaway `git clone --local`
model from `2026-06-02-parallel-execution-design.md`; that rejection was
correct for *concurrent* same-repo jobs, but the serial-within-repo choice
here removes the `.git` contention it guarded against (ADR 0004). It pays off
precisely because `to-issues` clusters many slices in one repo, processed
serially — slice N reuses the warm checkout slice 1 paid for.
### T3 integration: thin dispatch — see ADR 0003
The control plane holds a capability-scoped **`orchestration:operate`** bearer
token (minted via `t3 auth`, stored in Vault, refreshed for the 1-hour expiry)
and calls T3's HTTP API:
- `POST /api/orchestration/dispatch``thread.turn.start` with a `bootstrap`
that creates the thread, prepares the worktree, optionally runs a setup
script, and delivers the prompt — one call spawns a worktree-isolated worker.
- `GET /api/orchestration/snapshot` → the full fleet read-model (per-thread
`running`/`idle`/`error`, `hasPendingUserInput`, `hasPendingApprovals`,
`branch`, `worktreePath`). T3 has **no outbound webhooks**, so the watcher
**polls** this to drive CI-watch, freeze, and label transitions.
The AFK *behavior and safety* (issue-implementer prompt, guardrails, always-push,
fix-forward/freeze, issue integration) live in **our** thin layer, so T3 is a
**swappable, version-pinned backend** — never Keel-auto-upgraded, reversible to a
self-hosted runtime if it goes sideways.
### Observability & interaction
The "active sessions layer" and the "attach and converse" surface **converge
into one screen — the T3 cockpit**: a live list of all worker threads grouped by
project; click one to stream its transcript and send it a turn. This dissolves
the earlier intermediate ideas of a generalized-breakglass console and a
raw-tmux hybrid attach — T3 provides converse / approve / resume natively
(`thread.user-input.respond`, `thread.approval.respond`).
Cross-system, durable signals the control plane still emits:
- **Phase-checklist comment** on the issue, edited in place as phases complete
(worktree → tests-red → green → pushed → CI → deployed). Durable, low-noise,
lives on the issue, doubles as audit trail.
- **Loki** logs labeled `{repo, issue}` for deep-dive.
- **Presence** claim per running session (`repo:<name>`, purpose `AFK #N`),
heartbeated — so AFK work shows up next to human sessions in the layer the
prompt hook already injects.
- **Doorbell**: Slack / ntfy ping on terminal states, deep-linking into the T3
thread. Notify, not control — the dedicated-Slack-control-plane idea is
dropped in favour of the T3 cockpit.
### Safety envelope
- **Trust gate** — only collaborator-labeled `ready-for-agent` issues run.
- **Allowlist** — a repo is untouchable until enrolled (prereqs: tests + GHA CI
+ `CONTEXT.md`). Start with 12 repos; expand deliberately.
- **Kill switch** — one ConfigMap flag pauses all pickup (the Keel
scale-to-0 reflex, built in from day one).
- **Per-repo lock** — ≤1 agent per repo.
- **Guardrails** (reused from `issue-responder`) — no PVC/PV deletes, no direct
Vault edits, no force-push to master, infra changes Terraform-only, never
`[ci skip]`.
- **Identity & audit** — shared service identity; each commit body paraphrases
the issue and carries `Closes #N` + an AFK-agent trailer, so the commit
message stays the audit trail.
## Parameters (chosen starting values — all tunable)
| Knob | Value |
|---|---|
| Merge gate | always push to master |
| Post-push failure | fix-forward, then freeze-broken |
| Fix-forward cap | 5 attempts **or** 60 minutes |
| Per-issue budget | `max_budget_usd = 100` |
| Concurrency | parallel across repos, serial within a repo |
| Repo scope | opt-in allowlist, start small |
| Progress detail | phase-checklist on issue + Loki logs |
| Alert channel | Slack (+ ntfy), as a doorbell into T3 |
| Executor | dedicated in-cluster T3 (thin dispatch), version-pinned |
## Pilot — validate before wiring the poller
The thin model rests on five unknowns. Stand up the dedicated T3 instance and
drive a couple of allowlist-repo issues **by hand** via the dispatch API to
confirm each, *before* building the poller and committing the architecture:
1. **Per-thread custom agent + skip-permissions** — can a dispatched thread
carry *our* `issue-implementer` system prompt and run unattended without
stalling on T3's approval gating? *(biggest unknown)*
2. **Dispatch auth** — mint `orchestration:operate`, store in Vault, refresh the
1-hour token.
3. **Status/completion** — drive CI-watch/freeze/labels purely from polling
`GET /api/orchestration/snapshot`.
4. **Worktree reconciliation** — T3's native `prepareWorktree` vs our
persistent-checkout-with-warm-caches; pick one or make them cooperate on the
volume.
5. **The in-cluster T3 pod** — headless `t3 serve --no-browser`, version-pinned
and **Keel-excluded**, internal ingress + Authentik, with tokens / toolchains
/ SSD volume / `claude auth` provisioned.
## Relationship to prior decisions
- **Supersedes** the worktree rejection in
`2026-06-02-parallel-execution-design.md` (contextualized, not contradicted —
ADR 0004).
- **Drops** two intermediate ideas explored and rejected this session:
evolving `claude-agent-service` into its own session/tmux/worktree runtime,
and building a bespoke breakglass-generalized console — both replaced by T3.
- **Reuses** the `issue-responder` guardrails, the CI/CD chain, the
`beads-dispatcher` CronJob pattern, presence, Loki, and the design skills.
## Out of scope / open questions
- Raw-terminal "take-over" of a worker (T3 is a GUI cockpit, not a terminal); if
ever needed, that's a separate add-on.
- Multi-tenant T3 (it is single-operator by design — fine, it matches the shared
service identity).
- Cross-repo dependency orchestration beyond per-issue "Blocked by".
- T3 Code is pre-1.0 (~v0.0.x) and churny; the version-pin + Keel-exclude +
swappable-backend discipline (ADR 0003) is the mitigation.