afk: add the autonomous issue-implementer loop (SHIPS DISABLED)

Adds app/afk/ — the "away-from-keyboard" control plane that watches the
issue tracker for ready-for-agent issues, dispatches each to a fresh
full-access T3 thread (with the issue-implementer preamble prepended,
because T3 does not honour ~/.claude/CLAUDE.md), and drives the resulting
run through its lifecycle: tests-red -> green -> pushed -> CI -> deployed,
escalating or fix-forwarding via a small pure state machine.

The loop is split into pure cores (no I/O, exhaustively unit-tested) and
thin injected adapters (the only edges that ever touch T3, the tracker,
CI, or Slack — faked in every test, so nothing here talks to a real
server, GitHub/Forgejo, or the cluster):

  pure:     types, dispatch_policy, run_state_machine, phase_checklist,
            config, issue_implementer_prompt
  adapters: t3_client (two-POST dispatch + snapshot), tracker, ci_watcher,
            notifier
  loops:    poller  — CronJob tick #1: list_ready -> select_dispatchable
                      -> dispatch + stamp the in-progress lock (label only
                      AFTER a successful dispatch, so a failed dispatch
                      never leaves a phantom lock). Per-repo lock derived
                      from the ready set, since the CronJob is stateless
                      between ticks.
            watcher — CronJob tick #2: assemble RunState from snapshot +
                      CI -> next_action -> act (close on success; relabel
                      ready-for-human + ring the doorbell on the two
                      escalations; dispatch a corrective turn on
                      fix-forward; refresh the progress checklist).

SHIPS DISABLED, on purpose: Config defaults to kill_switch=True AND an
empty allowlist, so a freshly-loaded config dispatches nothing and does
zero I/O. The package is not imported by the running service and has no
auto-enable path. Arming it is a deliberate, later, manual step requiring
BOTH gates (clear the kill switch AND enrol the exact repos) so one
fat-fingered env var can't arm every repo.

Test-first throughout: 412 tests pass (poller + watcher add integration
tests wiring the real pure cores to in-memory fakes). mypy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-15 21:15:11 +00:00
parent 171857da6b
commit 2ef0db9a96
23 changed files with 4717 additions and 0 deletions

159
app/afk/t3_client.py Normal file
View file

@ -0,0 +1,159 @@
"""Adapter for the in-cluster T3 Code instance — the AFK executor + cockpit.
The control plane keeps the brain; T3 runs the agent. This module is the thin
wire between them: it turns "implement issue N of repo R with this prompt" into
the TWO HTTP commands T3's orchestration API needs, and reads the fleet
snapshot the watcher polls. It owns no AFK behaviour the agent's standing
rules ride in as the ``ISSUE_IMPLEMENTER_PREAMBLE`` prepended to the turn
message, because T3's full-access ``claudeAgent`` runtime does NOT honour
``~/.claude/CLAUDE.md`` (see ``issue_implementer_prompt``).
Two operations, both against the dedicated in-cluster T3 pod:
* ``dispatch(repo, issue, prompt) -> thread_id`` POSTs ``thread.create``
then ``thread.turn.start`` to ``/api/orchestration/dispatch``. The create
command selects the ``claudeAgent`` instance in ``full-access`` runtime mode
and returns a thread id; the turn command targets that thread and delivers
``ISSUE_IMPLEMENTER_PREAMBLE + prompt`` as ``message.text``. One dispatch =
one worktree-isolated worker.
* ``snapshot() -> dict`` GETs ``/api/orchestration/snapshot``, the full fleet
read-model. T3 has no outbound webhooks, so the watcher polls this for
per-thread ``running``/``idle``/``error`` status.
The HTTP transport and the bearer provider are **injected** (constructor
args), so the production wiring hands in an ``httpx.Client`` plus a Vault-backed
token reader, while tests hand in an in-memory fake nothing here ever opens a
socket on its own. The bearer is re-read from the provider on **every** request
because T3's ``orchestration:operate`` token expires hourly and is refreshed out
of band.
"""
from collections.abc import Callable
from typing import Protocol
from .issue_implementer_prompt import ISSUE_IMPLEMENTER_PREAMBLE
# Orchestration API paths, relative to the configured base URL.
_DISPATCH_PATH = "/api/orchestration/dispatch"
_SNAPSHOT_PATH = "/api/orchestration/snapshot"
# Pilot-baked dispatch envelope: which backend instance runs the thread and in
# which runtime mode. Constants (not config) — every AFK thread is identical.
_INSTANCE_ID = "claudeAgent"
_RUNTIME_MODE = "full-access"
# JSON shapes. Command bodies and the snapshot read-model are open string-keyed
# objects; ``object`` values keep us honest without a bare ``Any``.
type Json = dict[str, object]
class HttpResponse(Protocol):
"""The httpx-shaped response surface this adapter relies on.
Both ``httpx.Response`` and the test fake satisfy it: ``raise_for_status``
turns a non-2xx into an exception (so a failed ``thread.create`` aborts
before ``thread.turn.start`` ever fires) and ``json`` parses the body.
"""
def raise_for_status(self) -> object: ...
def json(self) -> Json: ...
class HttpClient(Protocol):
"""Minimal injected transport: a JSON ``post`` and a ``get``, both taking
explicit headers. Deliberately a strict subset of ``httpx.Client`` so the
real client passes one straight through and tests pass a recorder."""
def post(self, url: str, json: Json, headers: dict[str, str]) -> HttpResponse: ...
def get(self, url: str, headers: dict[str, str]) -> HttpResponse: ...
class T3Client:
"""Dispatch/snapshot adapter for one in-cluster T3 instance.
``base_url`` is the T3 service root (a trailing slash is tolerated);
``http`` is the injected transport; ``bearer_provider`` returns the current
``orchestration:operate`` token, re-read per request for hourly rotation.
"""
def __init__(
self,
base_url: str,
http: HttpClient,
bearer_provider: Callable[[], str],
) -> None:
self._base_url = base_url.rstrip("/")
self._http = http
self._bearer_provider = bearer_provider
# ----------------------------------------------------------------- #
# Public API (the ``t3_client.T3Client`` contract).
# ----------------------------------------------------------------- #
def dispatch(self, repo: str, issue: int, prompt: str) -> str:
"""Spawn one worker thread for ``issue`` of ``repo`` and return its id.
Two POSTs to ``/api/orchestration/dispatch``: ``thread.create`` (selects
the ``claudeAgent`` instance, ``full-access`` runtime) yields the thread
id; ``thread.turn.start`` then delivers ``ISSUE_IMPLEMENTER_PREAMBLE +
prompt`` to that thread. A failed create raises and short-circuits the
turn (we never fire a turn at a thread that wasn't created).
"""
create_resp = self._post(
_DISPATCH_PATH,
{
"command": "thread.create",
"repo": repo,
"issue": issue,
"modelSelection": {"instanceId": _INSTANCE_ID},
"runtimeMode": _RUNTIME_MODE,
},
)
thread_id = self._thread_id_of(create_resp.json())
self._post(
_DISPATCH_PATH,
{
"command": "thread.turn.start",
"threadId": thread_id,
"message": {"text": ISSUE_IMPLEMENTER_PREAMBLE + prompt},
},
)
return thread_id
def snapshot(self) -> Json:
"""Return the parsed fleet read-model from ``/api/orchestration/snapshot``."""
return self._get(_SNAPSHOT_PATH).json()
# ----------------------------------------------------------------- #
# Internals.
# ----------------------------------------------------------------- #
def _post(self, path: str, body: Json) -> HttpResponse:
resp = self._http.post(self._url(path), json=body, headers=self._headers())
resp.raise_for_status()
return resp
def _get(self, path: str) -> HttpResponse:
resp = self._http.get(self._url(path), headers=self._headers())
resp.raise_for_status()
return resp
def _url(self, path: str) -> str:
return f"{self._base_url}{path}"
def _headers(self) -> dict[str, str]:
return {"Authorization": f"Bearer {self._bearer_provider()}"}
@staticmethod
def _thread_id_of(create_response: Json) -> str:
"""Extract the new thread id from a ``thread.create`` reply.
T3 returns it as ``threadId``; we fail loudly on a malformed reply rather
than dispatch a turn at an empty/None id.
"""
thread_id = create_response.get("threadId")
if not isinstance(thread_id, str) or not thread_id:
raise ValueError(
f"thread.create response missing a usable threadId: {create_response!r}"
)
return thread_id