claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md
Viktor Barzin be81005186
docs: capture AFK implementation pipeline design + ADRs 0002-0004
Record the architecture for moving code implementation AFK, decided in a
design/grilling session. The owner wants the human-in-the-loop boundary to
stop at design + spec: once an issue is triaged ready-for-agent, an agent
should implement it test-first, push it, and see it to a healthy deploy on
its own, escalating only when it can't proceed.

Decisions captured:
- claude-agent-service is the control plane (poller + watcher + safety);
  a dedicated in-cluster T3 Code instance is the executor + cockpit, because
  T3 can only show sessions it launched itself -> we dispatch into it
  (ADR 0003).
- AFK code pushes straight to master; on a broken deploy it fix-forwards
  then freezes the broken state for forensics rather than reverting
  (ADR 0002).
- Implementation agents use persistent per-repo checkouts + git worktrees on
  SSD-NFS for warm caches, reversing the throwaway-clone rule for this path
  because concurrency is serial-within-repo (ADR 0004).

Pilot-gated: five integration unknowns must be validated against a dedicated
T3 instance before the poller is wired. No code yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:09:12 +00:00


AFK implementation pipeline — design

Date: 2026-06-14
Status: proposed; pilot pending (see "Pilot" below; no code yet)
Scope: A new autonomous path that turns a triaged ready-for-agent issue into tested, deployed code with no human at the keyboard. claude-agent-service becomes the control plane; a dedicated in-cluster T3 Code instance becomes the executor + cockpit.
Touches: claude-agent-service (new poller + dispatch + watcher), a new T3 stack in infra/, a shared SSD-NFS volume, and the per-repo issue trackers.

Provenance: this design is the output of a long grilling session (2026-06-14). It records the decisions and the alternatives that were considered and dropped, so the reasoning survives. The three hardest-to-reverse calls are split into ADRs 0002-0004.

Problem

Today the development flow is grill-with-docs → to-prd → to-issues → triage → implement, and every stage is human-in-the-loop (HITL), including implementation. The owner wants the HITL boundary to stop at design + spec: once an issue is triaged ready-for-agent, an agent should pick it up and implement it AFK (away from keyboard) — write it test-first, push it, and see it through to a healthy deploy — escalating to a human only when it genuinely can't proceed.

Two gaps block this today:

  • The only existing issue→agent automation is the infra issue-responder, which fires on user-report/feature-request labels on the infra repo only — not on ready-for-agent, not on the other sub-project repos that the general design flow produces.
  • claude-agent-service only ever clones infra, runs one-shot fire-and-forget claude -p jobs (no session, no live stream, no attach), and has no multi-repo checkout. The owner wants to watch and steer in-flight work, which the batch model can't offer.

Goal

  • HITL covers design + spec only. Publishing ready-for-agent issues is the release signal (the to-issues quiz is the review gate).
  • An autonomous loop picks up unblocked ready-for-agent issues from enrolled repos, implements them test-first, and lands them — pushing straight to master so CI deploys them (see ADR 0002 for the risk posture).
  • The owner can see all in-flight workers and converse with any of them from one UI — the T3 cockpit (see ADR 0003).
  • Reuse before building: lean on the existing CI/CD chain, the design skills, T3 Code's multi-agent cockpit, and the persistence/worktree machinery — rather than hand-building a session console and a bespoke runtime.

Design

Roles: control plane vs executor + cockpit

| Concern | Owner |
| --- | --- |
| When to start, which issue, the prompt, the safety envelope | claude-agent-service (control plane): poller + watcher |
| Running the agent (Claude Agent SDK), the worktree, the fleet UI | T3 Code (executor + cockpit): one dedicated in-cluster instance |
| Build → image → deploy → rollout | existing CI/CD (GHA → ghcr → Woodpecker → Keel) |
| Issue queue + state | the per-repo GitHub issue trackers |

The pivotal constraint that forces this split: T3 can only display sessions it launched itself — it has no command to adopt an externally-started session. So "viewable in T3" ⟺ "launched by T3". To keep claude-agent-service in charge and get the fleet view, the control plane dispatches into T3 rather than running claude itself. See ADR 0003.

End-to-end flow

HUMAN (interactive session)
  /grill-with-docs → /to-prd → /to-issues → /triage
     └ produces ready-for-agent issues (dependency-ordered), labeled by a
       trusted collaborator. Publishing them = the release signal.
══════════════════════ HANDOFF ══════════════════════
CONTROL PLANE  (claude-agent-service, in-cluster)
  poller CronJob (every few min):
    for repo in allowlist:
      skip repo if it already has an agent-in-progress issue   (per-repo lock)
      pick highest-priority ready-for-agent issue where:
        • all "Blocked by" closed   • labeled by a trusted collaborator
      → stamp agent-in-progress
      → POST /api/orchestration/dispatch  (thread.turn.start + bootstrap:
            create thread, prepare worktree, run setup, deliver the prompt)
EXECUTOR + COCKPIT  (dedicated T3 instance, in-cluster)
  runs the issue-implementer agent (our prompt) in the worktree:
    read issue + AGENT-BRIEF + repo CONTEXT.md/ADRs → TDD red-green-refactor
    → commit (paraphrase issue, "Closes #N", AFK trailer) → push master
  watcher (control plane) polls GET /api/orchestration/snapshot + CI:
    ├─ healthy ──────► comment + close issue, drop lock, notify ✅
    ├─ pre-push block ► do NOT push, relabel ready-for-human, escalate
    └─ post-push red ► fix-forward (≤5 attempts / 60 min)
                         ├─ recovers ► healthy
                         └─ exhausts ► FREEZE broken (preserve forensics),
                                       relabel ready-for-human, hard page

Trigger & dispatch predicate

A poller CronJob (mirrors the existing beads-dispatcher pattern; stays in-cluster because neither the service nor T3 has public ingress). It dispatches issue I in repo R iff all hold:

  • R is in the allowlist ConfigMap, and the kill switch is off;
  • I has label ready-for-agent, applied by a trusted collaborator (the trust gate — on private repos only collaborators can label, so the label is the authorization; external/bot issues never auto-run);
  • every issue in I's "Blocked by" is closed;
  • R has no issue currently labeled agent-in-progress (the per-repo lock).

On dispatch it stamps agent-in-progress; on any terminal outcome it removes it.
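
The four conditions above can be read as a single boolean predicate. The following is a minimal sketch, not the service's code: the `Issue` and `Repo` types, the `labeled_by_trusted` flag, and the `blocked_by_open` counter are hypothetical stand-ins for whatever the poller derives from the issue tracker.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    number: int
    labels: set
    labeled_by_trusted: bool   # assumption: resolved from who applied the label
    blocked_by_open: int = 0   # assumption: count of still-open "Blocked by" issues


@dataclass
class Repo:
    name: str
    issues: list = field(default_factory=list)


def repo_is_locked(repo: Repo) -> bool:
    """Per-repo lock: any issue labeled agent-in-progress blocks new pickup."""
    return any("agent-in-progress" in issue.labels for issue in repo.issues)


def should_dispatch(issue: Issue, repo: Repo, allowlist: set, kill_switch: bool) -> bool:
    """Return True only if every condition of the dispatch predicate holds."""
    return (
        repo.name in allowlist
        and not kill_switch
        and "ready-for-agent" in issue.labels
        and issue.labeled_by_trusted
        and issue.blocked_by_open == 0
        and not repo_is_locked(repo)
    )
```

Keeping the predicate a pure function of already-fetched state would let the poller be tested without touching GitHub or T3.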

Concurrency & locking

Parallel across repos, serial within a repo. Multiple repos progress at once, with at most one agent per repo (two agents in one repo would collide on the working tree). The agent-in-progress label enforces this as a per-repo lock. One agent per repo is the starting value; raise it later.

Merge & failure posture — see ADR 0002

  • Always push to master (no PR gate). Tests-green is the merge gate; CI + rollback are the safety net, matching the human allow-then-audit model.
  • Pre-push failure (can't get green / blocked / would need a disallowed op): do not push; relabel ready-for-human; comment what was tried; page.
  • Post-push failure (CI build or rollout red): fix-forward up to 5 attempts or 60 minutes, then if still red freeze in the broken state (preserve forensics — do not auto-revert), relabel ready-for-human, hard page. The owner explicitly chose debuggability over availability here.
  • Budget: max_budget_usd = 100 per issue (time/attempt caps usually bite first).
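
The post-push rule can be sketched as a small decision function. This assumes "5 attempts or 60 minutes" means whichever limit is hit first, and that the clock starts at the first red signal; `FixForwardState` and the action names are illustrative, not taken from the service.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 5          # fix-forward attempt cap from the posture above
MAX_SECONDS = 60 * 60     # 60-minute cap from the posture above


@dataclass
class FixForwardState:
    started_at: float     # assumption: time of the first post-push red signal
    attempts: int = 0     # fix-forward attempts already made


def next_action(state: FixForwardState, ci_green: bool, now: float) -> str:
    """Decide what the watcher does after a post-push CI or rollout result."""
    if ci_green:
        return "healthy"
    out_of_attempts = state.attempts >= MAX_ATTEMPTS
    out_of_time = (now - state.started_at) >= MAX_SECONDS
    if out_of_attempts or out_of_time:
        # Freeze the broken state for forensics; never auto-revert.
        return "freeze"
    return "fix-forward"
```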

Build/test environment & worktrees — see ADR 0004

The agent must run the target repo's test suite (TDD gate) before pushing. Therefore:

  • Local toolchains scoped to the allowlist — the executor image carries only the enrolled repos' runtimes; the toolchain set grows in lockstep with the allowlist.
  • Persistent per-repo checkout + git worktree per issue on a shared SSD-NFS volume, so git objects, installed deps, and package-manager caches stay warm across jobs. This supersedes the throwaway git clone --local model from 2026-06-02-parallel-execution-design.md; that rejection was correct for concurrent same-repo jobs, but the serial-within-repo choice here removes the .git contention it guarded against (ADR 0004). It pays off precisely because to-issues clusters many slices in one repo, processed serially — slice N reuses the warm checkout slice 1 paid for.
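
As a rough illustration of the persistent-checkout model, the sketch below builds the git commands for one issue's worktree. The mount point, directory layout, and branch naming are hypothetical; only `git fetch` and `git worktree add -b` are standard git. Whether T3's own worktree preparation replaces this step is one of the pilot unknowns.

```python
from pathlib import PurePosixPath

VOLUME_ROOT = PurePosixPath("/mnt/afk")  # hypothetical SSD-NFS mount point


def worktree_commands(repo: str, issue: int, base: str = "origin/master") -> list:
    """Return the git invocations that add a per-issue worktree to a warm checkout."""
    checkout = VOLUME_ROOT / "checkouts" / repo                      # persistent clone
    worktree = VOLUME_ROOT / "worktrees" / repo / f"issue-{issue}"   # per-issue tree
    branch = f"afk/issue-{issue}"                                    # hypothetical naming
    return [
        # Refresh the shared object store; objects stay warm across jobs.
        ["git", "-C", str(checkout), "fetch", "origin"],
        # Create a new branch and working tree from the freshly fetched base.
        ["git", "-C", str(checkout), "worktree", "add", "-b", branch, str(worktree), base],
    ]
```

Each command list could be passed to `subprocess.run(cmd, check=True)`; because work is serial within a repo, only one such worktree per checkout would exist at a time.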

T3 integration: thin dispatch — see ADR 0003

The control plane holds a capability-scoped orchestration:operate bearer token (minted via t3 auth, stored in Vault, refreshed for the 1-hour expiry) and calls T3's HTTP API:

  • POST /api/orchestration/dispatch → thread.turn.start with a bootstrap that creates the thread, prepares the worktree, optionally runs a setup script, and delivers the prompt; one call spawns a worktree-isolated worker.
  • GET /api/orchestration/snapshot → the full fleet read-model (per-thread running/idle/error, hasPendingUserInput, hasPendingApprovals, branch, worktreePath). T3 has no outbound webhooks, so the watcher polls this to drive CI-watch, freeze, and label transitions.
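
Since the watcher is poll-driven, its core is a mapping from one thread's snapshot entry to a next step. The sketch below assumes each thread arrives as a dict with a `status` key holding running/idle/error; that key name and the returned action names are assumptions, while `hasPendingUserInput` and `hasPendingApprovals` are the field names listed above. The real response shape still needs to be confirmed in the pilot.

```python
def classify_thread(thread: dict) -> str:
    """Map one snapshot thread entry to the watcher's next step."""
    # A worker waiting on input or approval cannot make progress unattended.
    if thread.get("hasPendingApprovals") or thread.get("hasPendingUserInput"):
        return "needs-human"
    status = thread.get("status")  # assumption: key name for running/idle/error
    if status == "error":
        return "failed"
    if status == "running":
        return "in-progress"
    if status == "idle":
        # The turn has ended; CI and rollout state decide the outcome from here.
        return "check-ci"
    return "unknown"
```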

The AFK behavior and safety (issue-implementer prompt, guardrails, always-push, fix-forward/freeze, issue integration) live in our thin layer, so T3 is a swappable, version-pinned backend — never Keel-auto-upgraded, reversible to a self-hosted runtime if it goes sideways.

Observability & interaction

The "active sessions layer" and the "attach and converse" surface converge into one screen — the T3 cockpit: a live list of all worker threads grouped by project; click one to stream its transcript and send it a turn. This dissolves the earlier intermediate ideas of a generalized-breakglass console and a raw-tmux hybrid attach — T3 provides converse / approve / resume natively (thread.user-input.respond, thread.approval.respond).

Cross-system, durable signals the control plane still emits:

  • Phase-checklist comment on the issue, edited in place as phases complete (worktree → tests-red → green → pushed → CI → deployed). Durable, low-noise, lives on the issue, doubles as audit trail.
  • Loki logs labeled {repo, issue} for deep-dive.
  • Presence claim per running session (repo:<name>, purpose AFK #N), heartbeated — so AFK work shows up next to human sessions in the layer the prompt hook already injects.
  • Doorbell: Slack / ntfy ping on terminal states, deep-linking into the T3 thread. Notify, not control — the dedicated-Slack-control-plane idea is dropped in favour of the T3 cockpit.
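
The phase-checklist comment could be rendered from a single counter and re-posted as an edit each time a phase completes. This is a sketch using GitHub task-list syntax; the phase names come from the list above, and the function name is illustrative.

```python
PHASES = ["worktree", "tests-red", "green", "pushed", "CI", "deployed"]


def render_checklist(completed: int) -> str:
    """Render the issue comment body with the first `completed` phases ticked."""
    lines = []
    for index, phase in enumerate(PHASES):
        mark = "x" if index < completed else " "
        lines.append(f"- [{mark}] {phase}")
    return "\n".join(lines)
```

Editing one comment in place, rather than posting a new one per phase, is what keeps the signal low-noise.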

Safety envelope

  • Trust gate — only collaborator-labeled ready-for-agent issues run.
  • Allowlist — a repo is untouchable until enrolled (prereqs: tests + GHA CI + CONTEXT.md). Start with 1-2 repos; expand deliberately.
  • Kill switch — one ConfigMap flag pauses all pickup (the Keel scale-to-0 reflex, built in from day one).
  • Per-repo lock — ≤1 agent per repo.
  • Guardrails (reused from issue-responder) — no PVC/PV deletes, no direct Vault edits, no force-push to master, infra changes Terraform-only, never [ci skip].
  • Identity & audit — shared service identity; each commit body paraphrases the issue and carries Closes #N + an AFK-agent trailer, so the commit message stays the audit trail.
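
The audit-trail convention and the "never [ci skip]" guardrail can be combined in one helper. This is a minimal sketch: the trailer key `AFK-Agent` and its value are hypothetical, since the design only says the commit carries an AFK-agent trailer.

```python
FORBIDDEN_MARKER = "[ci skip]"  # guardrail: AFK commits must always run CI


def build_commit_message(subject: str, paraphrase: str, issue: int,
                         trailer: str = "AFK-Agent: claude-agent-service") -> str:
    """Compose a commit message: subject, issue paraphrase, Closes #N, trailer."""
    message = f"{subject}\n\n{paraphrase}\n\nCloses #{issue}\n\n{trailer}\n"
    if FORBIDDEN_MARKER in message.lower():
        raise ValueError("commit message must not contain [ci skip]")
    return message
```

Keeping the trailer in the final paragraph follows the usual git trailer placement, so it stays machine-readable for later audits.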

Parameters (chosen starting values — all tunable)

| Knob | Value |
| --- | --- |
| Merge gate | always push to master |
| Post-push failure | fix-forward, then freeze-broken |
| Fix-forward cap | 5 attempts or 60 minutes |
| Per-issue budget | max_budget_usd = 100 |
| Concurrency | parallel across repos, serial within a repo |
| Repo scope | opt-in allowlist, start small |
| Progress detail | phase-checklist on issue + Loki logs |
| Alert channel | Slack (+ ntfy), as a doorbell into T3 |
| Executor | dedicated in-cluster T3 (thin dispatch), version-pinned |

Pilot — validate before wiring the poller

The thin model rests on five unknowns. Stand up the dedicated T3 instance and drive a couple of allowlist-repo issues by hand via the dispatch API to confirm each, before building the poller and committing the architecture:

  1. Per-thread custom agent + skip-permissions — can a dispatched thread carry our issue-implementer system prompt and run unattended without stalling on T3's approval gating? (biggest unknown)
  2. Dispatch auth — mint orchestration:operate, store in Vault, refresh the 1-hour token.
  3. Status/completion — drive CI-watch/freeze/labels purely from polling GET /api/orchestration/snapshot.
  4. Worktree reconciliation — T3's native prepareWorktree vs our persistent-checkout-with-warm-caches; pick one or make them cooperate on the volume.
  5. The in-cluster T3 pod — headless t3 serve --no-browser, version-pinned and Keel-excluded, internal ingress + Authentik, with tokens / toolchains / SSD volume / claude auth provisioned.
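
For pilot item 2, the refresh logic around the 1-hour token might look like the sketch below. How the token is actually minted and stored (t3 auth, Vault) is deliberately hidden behind a caller-supplied `mint` callable; the 5-minute refresh margin and the class names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TOKEN_TTL_SECONDS = 60 * 60       # the 1-hour expiry stated in the design
REFRESH_MARGIN_SECONDS = 5 * 60   # assumption: refresh a little before expiry


@dataclass
class CachedToken:
    value: str
    minted_at: float


class TokenCache:
    """Hands out a bearer token, re-minting it shortly before it expires."""

    def __init__(self, mint: Callable[[], str], clock: Callable[[], float]):
        self._mint = mint      # hypothetical: wraps the real mint-and-store step
        self._clock = clock    # injectable clock, e.g. time.time
        self._token: Optional[CachedToken] = None

    def get(self) -> str:
        now = self._clock()
        stale_after = TOKEN_TTL_SECONDS - REFRESH_MARGIN_SECONDS
        if self._token is None or now - self._token.minted_at >= stale_after:
            self._token = CachedToken(self._mint(), now)
        return self._token.value
```

Injecting the clock keeps the expiry behavior testable without waiting an hour.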

Relationship to prior decisions

  • Supersedes the worktree rejection in 2026-06-02-parallel-execution-design.md (contextualized, not contradicted — ADR 0004).
  • Drops two intermediate ideas explored and rejected this session: evolving claude-agent-service into its own session/tmux/worktree runtime, and building a bespoke breakglass-generalized console — both replaced by T3.
  • Reuses the issue-responder guardrails, the CI/CD chain, the beads-dispatcher CronJob pattern, presence, Loki, and the design skills.

Out of scope / open questions

  • Raw-terminal "take-over" of a worker (T3 is a GUI cockpit, not a terminal); if ever needed, that's a separate add-on.
  • Multi-tenant T3 (it is single-operator by design — fine, it matches the shared service identity).
  • Cross-repo dependency orchestration beyond per-issue "Blocked by".
  • T3 Code is pre-1.0 (~v0.0.x) and churny; the version-pin + Keel-exclude + swappable-backend discipline (ADR 0003) is the mitigation.