claude-agent-service/docs/adr/0001-breakglass-security-architecture.md

# Breakglass: isolated deployment, warm-case scope, bounded host capabilities

We are adding a Claude-driven web UI ("breakglass") to recover the devvm when
it is down. It runs as a **separate deployment in its own `claude-breakglass`
namespace** (own ServiceAccount, own Vault role/policy scoped to *only* the
breakglass SSH keys), **not** in the existing `claude-agent` pod, because that
pod runs agents that ingest untrusted input (recruiter emails, nextcloud
todos) with `Bash`, and the shared `terraform-state` Vault policy grants the
whole namespace `secret/data/*` — so co-locating the keys would let a
prompt-injected agent read root-on-devvm credentials. We also add an explicit
`deny` on the breakglass key path to `terraform-state`.

## Status

accepted (2026-06-12)

## Scope decision: warm case only

The devvm (VM 102) and all 7 Kubernetes nodes are guests of the **same single
PVE host**. An in-cluster UI therefore cannot be a true breakglass for
cluster- or host-down events — it would be dead exactly when needed. We scope
it deliberately to the **warm case**: devvm wedged (OOM / disk-full / stuck
service / QEMU I/O stall) while the cluster is healthy. The owner accepted this
limitation explicitly. The **cold case** (cluster/host down) stays with the
separate knock-gated PVE-SSH design (`infra/docs/plans/2026-05-30-breakglass-ssh-access-design.md`)
and the `server-lifecycle` iDRAC CLI — out of scope here.

## Considered options

- **Same pod, gate by endpoint** — rejected: endpoint-gating is HTTP-layer,
  but key exfiltration is filesystem/Vault-layer; a `Bash` agent reads the key
  regardless of which route is exposed.
- **App-level bearer login** — rejected in favour of reusing the ingress
  `auth = "required"` resilience proxy, which already does Authentik SSO with
  an HTTP basic-auth fallback when Authentik is down (the chosen failure
  domain), plus CrowdSec + rate-limit by default.
- **Proxmox API token instead of SSH** — rejected: weaker forensics (no
  QMP/console capture) and would duplicate the SSH mechanism still needed for
  devvm diagnostics.

## Consequences

- Host capabilities are intentionally broad but **bounded**: full sudo shell on
  the devvm (any soft repair), and **autonomous** PVE power verbs
  (`status|forensics|reset|stop|start|cycle` on VM 102 only) via a
  `command="…" restrict` forced-command — never a free shell on the
  hypervisor. Every mutating verb captures forensics first, unconditionally.
- The breakglass agent *can* trigger a reset on its own judgement (the owner
  chose autonomy over a human-confirm gate). In the isolated pod there is no
  untrusted-input injection vector; the residual risk is a model misread
  rebooting a devvm that did not strictly need it — bounded and recoverable.
- The SSH private key is loaded into an in-pod `ssh-agent` (not written to
  disk). This is an availability/hygiene measure, **not** the primary control —
  the dedicated narrow Vault policy is, since any in-pod process could
  otherwise re-fetch the key from Vault.
- The pod is hardened against the very pressure event it exists to fix:
  high `priorityClassName` (anti-eviction), broad tolerations, anti-affinity
  off the contended GPU node, `imagePullPolicy: IfNotPresent`, hardcoded target
  IPs (no DNS dependency), emptyDir-only (no NFS dependency).
docs: capture breakglass design (CONTEXT glossary + ADR 0001) Viktor wants a Claude-driven web UI on the agent service to act as a breakglass: when the devvm is down he can open it, have Claude SSH in to diagnose/repair, and power-cycle the VM via the Proxmox host if needed. Grilling settled the design. Recording it now as the design record before implementation: - CONTEXT.md: glossary for the breakglass language (breakglass agent, warm/cold case, forced-command verb, cycle vs reset, forensics). - ADR 0001: the security architecture — isolated deployment in its own namespace + narrow Vault policy (the existing claude-agent namespace's terraform-state policy grants secret/data/* to Bash-wielding agents that ingest untrusted input, so co-locating root-on-devvm keys would be exfiltratable); warm-case-only scope (devvm wedged, cluster healthy — the in-cluster UI can't survive the shared PVE host going down, which stays the separate cold-path SSH design); and bounded-but-broad host capability (full sudo on devvm, autonomous forced-command PVE power verbs, forensics-first). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> 2026-06-12 20:59:13 +00:00			`# Breakglass: isolated deployment, warm-case scope, bounded host capabilities`

			`We are adding a Claude-driven web UI ("breakglass") to recover the devvm when`
			it is down. It runs as a **separate deployment in its own `claude-breakglass`
			`namespace** (own ServiceAccount, own Vault role/policy scoped to only the`
			breakglass SSH keys), not in the existing `claude-agent` pod, because that
			`pod runs agents that ingest untrusted input (recruiter emails, nextcloud`
			todos) with `Bash`, and the shared `terraform-state` Vault policy grants the
			whole namespace `secret/data/*` — so co-locating the keys would let a
			`prompt-injected agent read root-on-devvm credentials. We also add an explicit`
			`deny` on the breakglass key path to `terraform-state`.

			`## Status`

			`accepted (2026-06-12)`

			`## Scope decision: warm case only`

			`The devvm (VM 102) and all 7 Kubernetes nodes are guests of the **same single`
			`PVE host**. An in-cluster UI therefore cannot be a true breakglass for`
			`cluster- or host-down events — it would be dead exactly when needed. We scope`
			`it deliberately to the warm case: devvm wedged (OOM / disk-full / stuck`
			`service / QEMU I/O stall) while the cluster is healthy. The owner accepted this`
			`limitation explicitly. The cold case (cluster/host down) stays with the`
			separate knock-gated PVE-SSH design (`infra/docs/plans/2026-05-30-breakglass-ssh-access-design.md`)
			and the `server-lifecycle` iDRAC CLI — out of scope here.

			`## Considered options`

			`- Same pod, gate by endpoint — rejected: endpoint-gating is HTTP-layer,`
			but key exfiltration is filesystem/Vault-layer; a `Bash` agent reads the key
			`regardless of which route is exposed.`
			`- App-level bearer login — rejected in favour of reusing the ingress`
			`auth = "required"` resilience proxy, which already does Authentik SSO with
			`an HTTP basic-auth fallback when Authentik is down (the chosen failure`
			`domain), plus CrowdSec + rate-limit by default.`
			`- Proxmox API token instead of SSH — rejected: weaker forensics (no`
			`QMP/console capture) and would duplicate the SSH mechanism still needed for`
			`devvm diagnostics.`

			`## Consequences`

			`- Host capabilities are intentionally broad but bounded: full sudo shell on`
			`the devvm (any soft repair), and autonomous PVE power verbs`
			(`status\|forensics\|reset\|stop\|start\|cycle` on VM 102 only) via a
			`command="…" restrict` forced-command — never a free shell on the
			`hypervisor. Every mutating verb captures forensics first, unconditionally.`
			`- The breakglass agent can trigger a reset on its own judgement (the owner`
			`chose autonomy over a human-confirm gate). In the isolated pod there is no`
			`untrusted-input injection vector; the residual risk is a model misread`
			`rebooting a devvm that did not strictly need it — bounded and recoverable.`
			- The SSH private key is loaded into an in-pod `ssh-agent` (not written to
			`disk). This is an availability/hygiene measure, not the primary control —`
			`the dedicated narrow Vault policy is, since any in-pod process could`
			`otherwise re-fetch the key from Vault.`
			`- The pod is hardened against the very pressure event it exists to fix:`
			high `priorityClassName` (anti-eviction), broad tolerations, anti-affinity
			off the contended GPU node, `imagePullPolicy: IfNotPresent`, hardcoded target
			`IPs (no DNS dependency), emptyDir-only (no NFS dependency).`