Viktor wants a Claude-driven web UI on the agent service to act as a breakglass: when the devvm is down he can open it, have Claude SSH in to diagnose/repair, and power-cycle the VM via the Proxmox host if needed. Grilling settled the design. Recording it now as the design record before implementation: - CONTEXT.md: glossary for the breakglass language (breakglass agent, warm/cold case, forced-command verb, cycle vs reset, forensics). - ADR 0001: the security architecture — isolated deployment in its own namespace + narrow Vault policy (the existing claude-agent namespace's terraform-state policy grants secret/data/* to Bash-wielding agents that ingest untrusted input, so co-locating root-on-devvm keys would be exfiltratable); warm-case-only scope (devvm wedged, cluster healthy — the in-cluster UI can't survive the shared PVE host going down, which stays the separate cold-path SSH design); and bounded-but-broad host capability (full sudo on devvm, autonomous forced-command PVE power verbs, forensics-first). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
3.9 KiB
Claude Agent Service
In-cluster FastAPI wrapper that runs the Claude CLI headlessly for other services (issue automation, recruiter triage, nextcloud todos, …). This glossary covers the breakglass capability layered on top of it; the existing job-runner concepts (Job, Execute, OpenAI-compat) are documented in the code.
Language
Breakglass
Breakglass: The emergency capability for regaining control of the devvm when it is down but the cluster is healthy — a Claude-driven web UI that SSHes into the devvm to diagnose/repair and can power-cycle it via the PVE host. Avoid: "disaster recovery", "the cold breakglass" (that is the separate cluster-down SSH path — see Warm case / Cold case).
Breakglass agent: The single, isolated Claude agent the breakglass UI talks to. It has host access (sudo on the devvm, PVE power verbs) and a deliberately narrow tool surface — no web/untrusted-input tools — so it carries no prompt-injection vector. Avoid: reusing the general job-runner agents (recruiter-triage, nextcloud-todos-exec) for breakglass — those ingest untrusted input.
Warm case / Cold case: The warm case is "devvm wedged, cluster healthy" — the breakglass's entire scope. The cold case is "cluster or PVE host down", which an in-cluster UI cannot survive (devvm and all nodes are guests of one PVE host) and is handled elsewhere (knock-gated PVE SSH design + iDRAC), explicitly out of scope here. Avoid: calling the in-cluster UI a general "devvm is down" tool — it only covers the warm case.
Forced-command verb:
A single whitelisted operation a breakglass SSH key may invoke — enforced by
command="…" restrict in the host's authorized_keys, never a free shell on
the PVE host. The verbs are status | forensics | reset | stop | start | cycle, scoped to VM 102 only.
Avoid: "remote command", "ssh command" (those imply an open shell).
Cycle: A full stop→start of VM 102 — distinct from a warm reset/reboot because it spawns a fresh QEMU process and so applies staged VM config (the fix for the 2026-06-11 QEMU I/O stall). A warm reset reuses the wedged process. Avoid: using "reset" or "reboot" to mean a stop→start.
Forensics:
The unconditional pre-mutation state capture (qm status/config/pending + QMP
query, guest diagnostics) that runs before any mutating verb, so an erroneous
reset never destroys the evidence of why the devvm was wedged.
Avoid: "logs", "snapshot" (this is a point-in-time diagnostic dump, not a
disk snapshot).
Relationships
- The Breakglass UI is served by an in-cluster pod and reaches the
devvm over SSH; it does not proxy to anything hosted on the devvm
(unlike
terminal.viktorbarzin.me), so it survives the devvm being down. - A Breakglass agent invokes Forced-command verbs on the PVE host; every mutating verb runs Forensics first.
- A Cycle is the verb that applies staged VM config; a reset is the warm variant that does not.
- Breakglass covers only the Warm case; the Cold case is a separate, out-of-scope recovery path.
Example dialogue
Dev: "If the devvm OOMs, can the Breakglass agent just reset it?" Owner: "It can, autonomously — but a reset is a warm reboot. If the QEMU process is wedged (the 2026-06-11 class), it needs a cycle — stop→start — to apply the staged config. Either way it captures Forensics first." Dev: "And if the whole cluster is down?" Owner: "Then the breakglass is down too — that's the Cold case, not this tool. This one assumes the cluster is healthy."
Flagged ambiguities
- "reset" was used to mean both a warm reboot and a stop→start — resolved: reset is warm, cycle is stop→start (and is what applies staged config).
- "breakglass" was used for both this warm UI and the cluster-down SSH path — resolved: this context's Breakglass is the Warm case UI only.