claude-agent-service/CONTEXT.md
Viktor Barzin 68cee55594
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
docs: capture breakglass design (CONTEXT glossary + ADR 0001)
Viktor wants a Claude-driven web UI on the agent service to act as a
breakglass: when the devvm is down he can open it, have Claude SSH in to
diagnose/repair, and power-cycle the VM via the Proxmox host if needed.

Grilling settled the design. Recording it now as the design record before
implementation:

- CONTEXT.md: glossary for the breakglass language (breakglass agent,
  warm/cold case, forced-command verb, cycle vs reset, forensics).
- ADR 0001: the security architecture — isolated deployment in its own
  namespace + narrow Vault policy (the existing claude-agent namespace's
  terraform-state policy grants secret/data/* to Bash-wielding agents that
  ingest untrusted input, so co-locating root-on-devvm keys would be
  exfiltratable); warm-case-only scope (devvm wedged, cluster healthy —
  the in-cluster UI can't survive the shared PVE host going down, which
  stays the separate cold-path SSH design); and bounded-but-broad host
  capability (full sudo on devvm, autonomous forced-command PVE power
  verbs, forensics-first).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:59:13 +00:00

3.9 KiB

Claude Agent Service

In-cluster FastAPI wrapper that runs the Claude CLI headlessly for other services (issue automation, recruiter triage, nextcloud todos, …). This glossary covers the breakglass capability layered on top of it; the existing job-runner concepts (Job, Execute, OpenAI-compat) are documented in the code.

Language

Breakglass

Breakglass: The emergency capability for regaining control of the devvm when it is down but the cluster is healthy — a Claude-driven web UI that SSHes into the devvm to diagnose/repair and can power-cycle it via the PVE host. Avoid: "disaster recovery", "the cold breakglass" (that is the separate cluster-down SSH path — see Warm case / Cold case).

Breakglass agent: The single, isolated Claude agent the breakglass UI talks to. It has host access (sudo on the devvm, PVE power verbs) and a deliberately narrow tool surface — no web/untrusted-input tools — so it carries no prompt-injection vector. Avoid: reusing the general job-runner agents (recruiter-triage, nextcloud-todos-exec) for breakglass — those ingest untrusted input.

Warm case / Cold case: The warm case is "devvm wedged, cluster healthy" — the breakglass's entire scope. The cold case is "cluster or PVE host down", which an in-cluster UI cannot survive (devvm and all nodes are guests of one PVE host) and is handled elsewhere (knock-gated PVE SSH design + iDRAC), explicitly out of scope here. Avoid: calling the in-cluster UI a general "devvm is down" tool — it only covers the warm case.

Forced-command verb: A single whitelisted operation a breakglass SSH key may invoke — enforced by command="…" restrict in the host's authorized_keys, never a free shell on the PVE host. The verbs are status | forensics | reset | stop | start | cycle, scoped to VM 102 only. Avoid: "remote command", "ssh command" (those imply an open shell).

Cycle: A full stop→start of VM 102 — distinct from a warm reset/reboot because it spawns a fresh QEMU process and so applies staged VM config (the fix for the 2026-06-11 QEMU I/O stall). A warm reset reuses the wedged process. Avoid: using "reset" or "reboot" to mean a stop→start.

Forensics: The unconditional pre-mutation state capture (qm status/config/pending + QMP query, guest diagnostics) that runs before any mutating verb, so an erroneous reset never destroys the evidence of why the devvm was wedged. Avoid: "logs", "snapshot" (this is a point-in-time diagnostic dump, not a disk snapshot).

Relationships

  • The Breakglass UI is served by an in-cluster pod and reaches the devvm over SSH; it does not proxy to anything hosted on the devvm (unlike terminal.viktorbarzin.me), so it survives the devvm being down.
  • A Breakglass agent invokes Forced-command verbs on the PVE host; every mutating verb runs Forensics first.
  • A Cycle is the verb that applies staged VM config; a reset is the warm variant that does not.
  • Breakglass covers only the Warm case; the Cold case is a separate, out-of-scope recovery path.

Example dialogue

Dev: "If the devvm OOMs, can the Breakglass agent just reset it?" Owner: "It can, autonomously — but a reset is a warm reboot. If the QEMU process is wedged (the 2026-06-11 class), it needs a cycle — stop→start — to apply the staged config. Either way it captures Forensics first." Dev: "And if the whole cluster is down?" Owner: "Then the breakglass is down too — that's the Cold case, not this tool. This one assumes the cluster is healthy."

Flagged ambiguities

  • "reset" was used to mean both a warm reboot and a stop→start — resolved: reset is warm, cycle is stop→start (and is what applies staged config).
  • "breakglass" was used for both this warm UI and the cluster-down SSH path — resolved: this context's Breakglass is the Warm case UI only.