docs: capture breakglass design (CONTEXT glossary + ADR 0001)
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Viktor wants a Claude-driven web UI on the agent service to act as a breakglass: when the devvm is down he can open it, have Claude SSH in to diagnose/repair, and power-cycle the VM via the Proxmox host if needed. Grilling settled the design. Recording it now as the design record before implementation: - CONTEXT.md: glossary for the breakglass language (breakglass agent, warm/cold case, forced-command verb, cycle vs reset, forensics). - ADR 0001: the security architecture — isolated deployment in its own namespace + narrow Vault policy (the existing claude-agent namespace's terraform-state policy grants secret/data/* to Bash-wielding agents that ingest untrusted input, so co-locating root-on-devvm keys would be exfiltratable); warm-case-only scope (devvm wedged, cluster healthy — the in-cluster UI can't survive the shared PVE host going down, which stays the separate cold-path SSH design); and bounded-but-broad host capability (full sudo on devvm, autonomous forced-command PVE power verbs, forensics-first). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
7495b46f60
commit
68cee55594
2 changed files with 144 additions and 0 deletions
85
CONTEXT.md
Normal file
85
CONTEXT.md
Normal file
|
|
@ -0,0 +1,85 @@
|
|||
# Claude Agent Service
|
||||
|
||||
In-cluster FastAPI wrapper that runs the Claude CLI headlessly for other
|
||||
services (issue automation, recruiter triage, nextcloud todos, …). This
|
||||
glossary covers the **breakglass** capability layered on top of it; the
|
||||
existing job-runner concepts (Job, Execute, OpenAI-compat) are documented in
|
||||
the code.
|
||||
|
||||
## Language
|
||||
|
||||
### Breakglass
|
||||
|
||||
**Breakglass**:
|
||||
The emergency capability for regaining control of the **devvm** when it is down
|
||||
but the cluster is healthy — a Claude-driven web UI that SSHes *into* the devvm
|
||||
to diagnose/repair and can power-cycle it via the PVE host.
|
||||
_Avoid_: "disaster recovery", "the cold breakglass" (that is the separate
|
||||
cluster-down SSH path — see **Warm case / Cold case**).
|
||||
|
||||
**Breakglass agent**:
|
||||
The single, isolated Claude agent the breakglass UI talks to. It has host
|
||||
access (sudo on the devvm, PVE power verbs) and a deliberately narrow tool
|
||||
surface — no web/untrusted-input tools — so it carries no prompt-injection
|
||||
vector.
|
||||
_Avoid_: reusing the general job-runner agents (recruiter-triage,
|
||||
nextcloud-todos-exec) for breakglass — those ingest untrusted input.
|
||||
|
||||
**Warm case** / **Cold case**:
|
||||
The **warm case** is "devvm wedged, cluster healthy" — the breakglass's entire
|
||||
scope. The **cold case** is "cluster or PVE host down", which an in-cluster UI
|
||||
cannot survive (devvm and all nodes are guests of one PVE host) and is handled
|
||||
elsewhere (knock-gated PVE SSH design + iDRAC), explicitly out of scope here.
|
||||
_Avoid_: calling the in-cluster UI a general "devvm is down" tool — it only
|
||||
covers the warm case.
|
||||
|
||||
**Forced-command verb**:
|
||||
A single whitelisted operation a breakglass SSH key may invoke — enforced by
|
||||
`command="…" restrict` in the host's `authorized_keys`, never a free shell on
|
||||
the PVE host. The verbs are `status | forensics | reset | stop | start |
|
||||
cycle`, scoped to VM 102 only.
|
||||
_Avoid_: "remote command", "ssh command" (those imply an open shell).
|
||||
|
||||
**Cycle**:
|
||||
A full **stop→start** of VM 102 — distinct from a warm reset/reboot because it
|
||||
spawns a fresh QEMU process and so applies staged VM config (the fix for the
|
||||
2026-06-11 QEMU I/O stall). A warm reset reuses the wedged process.
|
||||
_Avoid_: using "reset" or "reboot" to mean a stop→start.
|
||||
|
||||
**Forensics**:
|
||||
The unconditional pre-mutation state capture (`qm status/config/pending` + QMP
|
||||
query, guest diagnostics) that runs *before* any mutating verb, so an erroneous
|
||||
reset never destroys the evidence of why the devvm was wedged.
|
||||
_Avoid_: "logs", "snapshot" (this is a point-in-time diagnostic dump, not a
|
||||
disk snapshot).
|
||||
|
||||
## Relationships
|
||||
|
||||
- The **Breakglass** UI is served by an in-cluster pod and reaches the
|
||||
**devvm** over SSH; it does **not** proxy to anything hosted on the devvm
|
||||
(unlike `terminal.viktorbarzin.me`), so it survives the devvm being down.
|
||||
- A **Breakglass agent** invokes **Forced-command verbs** on the PVE host;
|
||||
every mutating verb runs **Forensics** first.
|
||||
- A **Cycle** is the verb that applies staged VM config; a **reset** is the
|
||||
warm variant that does not.
|
||||
- **Breakglass** covers only the **Warm case**; the **Cold case** is a
|
||||
separate, out-of-scope recovery path.
|
||||
|
||||
## Example dialogue
|
||||
|
||||
> **Dev:** "If the devvm OOMs, can the **Breakglass agent** just **reset** it?"
|
||||
> **Owner:** "It can, autonomously — but a **reset** is a warm reboot. If the
|
||||
> QEMU process is wedged (the 2026-06-11 class), it needs a **cycle** —
|
||||
> stop→start — to apply the staged config. Either way it captures
|
||||
> **Forensics** first."
|
||||
> **Dev:** "And if the whole cluster is down?"
|
||||
> **Owner:** "Then the breakglass is down too — that's the **Cold case**, not
|
||||
> this tool. This one assumes the cluster is healthy."
|
||||
|
||||
## Flagged ambiguities
|
||||
|
||||
- "reset" was used to mean both a warm reboot and a stop→start — resolved:
|
||||
**reset** is warm, **cycle** is stop→start (and is what applies staged
|
||||
config).
|
||||
- "breakglass" was used for both this warm UI and the cluster-down SSH path —
|
||||
resolved: this context's **Breakglass** is the **Warm case** UI only.
|
||||
59
docs/adr/0001-breakglass-security-architecture.md
Normal file
59
docs/adr/0001-breakglass-security-architecture.md
Normal file
|
|
@ -0,0 +1,59 @@
|
|||
# Breakglass: isolated deployment, warm-case scope, bounded host capabilities
|
||||
|
||||
We are adding a Claude-driven web UI ("breakglass") to recover the devvm when
|
||||
it is down. It runs as a **separate deployment in its own `claude-breakglass`
|
||||
namespace** (own ServiceAccount, own Vault role/policy scoped to *only* the
|
||||
breakglass SSH keys), **not** in the existing `claude-agent` pod, because that
|
||||
pod runs agents that ingest untrusted input (recruiter emails, nextcloud
|
||||
todos) with `Bash`, and the shared `terraform-state` Vault policy grants the
|
||||
whole namespace `secret/data/*` — so co-locating the keys would let a
|
||||
prompt-injected agent read root-on-devvm credentials. We also add an explicit
|
||||
`deny` on the breakglass key path to `terraform-state`.
|
||||
|
||||
## Status
|
||||
|
||||
accepted (2026-06-12)
|
||||
|
||||
## Scope decision: warm case only
|
||||
|
||||
The devvm (VM 102) and all 7 Kubernetes nodes are guests of the **same single
|
||||
PVE host**. An in-cluster UI therefore cannot be a true breakglass for
|
||||
cluster- or host-down events — it would be dead exactly when needed. We scope
|
||||
it deliberately to the **warm case**: devvm wedged (OOM / disk-full / stuck
|
||||
service / QEMU I/O stall) while the cluster is healthy. The owner accepted this
|
||||
limitation explicitly. The **cold case** (cluster/host down) stays with the
|
||||
separate knock-gated PVE-SSH design (`infra/docs/plans/2026-05-30-breakglass-ssh-access-design.md`)
|
||||
and the `server-lifecycle` iDRAC CLI — out of scope here.
|
||||
|
||||
## Considered options
|
||||
|
||||
- **Same pod, gate by endpoint** — rejected: endpoint-gating is HTTP-layer,
|
||||
but key exfiltration is filesystem/Vault-layer; a `Bash` agent reads the key
|
||||
regardless of which route is exposed.
|
||||
- **App-level bearer login** — rejected in favour of reusing the ingress
|
||||
`auth = "required"` resilience proxy, which already does Authentik SSO with
|
||||
an HTTP basic-auth fallback when Authentik is down (the chosen failure
|
||||
domain), plus CrowdSec + rate-limit by default.
|
||||
- **Proxmox API token instead of SSH** — rejected: weaker forensics (no
|
||||
QMP/console capture) and would duplicate the SSH mechanism still needed for
|
||||
devvm diagnostics.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Host capabilities are intentionally broad but **bounded**: full sudo shell on
|
||||
the devvm (any soft repair), and **autonomous** PVE power verbs
|
||||
(`status|forensics|reset|stop|start|cycle` on VM 102 only) via a
|
||||
`command="…" restrict` forced-command — never a free shell on the
|
||||
hypervisor. Every mutating verb captures forensics first, unconditionally.
|
||||
- The breakglass agent *can* trigger a reset on its own judgement (the owner
|
||||
chose autonomy over a human-confirm gate). In the isolated pod there is no
|
||||
untrusted-input injection vector; the residual risk is a model misread
|
||||
rebooting a devvm that did not strictly need it — bounded and recoverable.
|
||||
- The SSH private key is loaded into an in-pod `ssh-agent` (not written to
|
||||
disk). This is an availability/hygiene measure, **not** the primary control —
|
||||
the dedicated narrow Vault policy is, since any in-pod process could
|
||||
otherwise re-fetch the key from Vault.
|
||||
- The pod is hardened against the very pressure event it exists to fix:
|
||||
high `priorityClassName` (anti-eviction), broad tolerations, anti-affinity
|
||||
off the contended GPU node, `imagePullPolicy: IfNotPresent`, hardcoded target
|
||||
IPs (no DNS dependency), emptyDir-only (no NFS dependency).
|
||||
Loading…
Add table
Add a link
Reference in a new issue