Viktor wants a Claude-driven web UI on the agent service to act as a breakglass: when the devvm is down he can open it, have Claude SSH in to diagnose/repair, and power-cycle the VM via the Proxmox host if needed. Grilling settled the design. Recording it now as the design record before implementation:

- CONTEXT.md: glossary for the breakglass language (breakglass agent, warm/cold case, forced-command verb, cycle vs reset, forensics).
- ADR 0001: the security architecture, in three parts: isolated deployment in its own namespace + narrow Vault policy (the existing claude-agent namespace's terraform-state policy grants secret/data/* to Bash-wielding agents that ingest untrusted input, so co-locating root-on-devvm keys would be exfiltratable); warm-case-only scope (devvm wedged, cluster healthy; the in-cluster UI can't survive the shared PVE host going down, which stays the separate cold-path SSH design); and bounded-but-broad host capability (full sudo on devvm, autonomous forced-command PVE power verbs, forensics-first).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Breakglass: isolated deployment, warm-case scope, bounded host capabilities
We are adding a Claude-driven web UI ("breakglass") to recover the devvm when
it is down. It runs as a separate deployment in its own claude-breakglass
namespace (own ServiceAccount, own Vault role/policy scoped to only the
breakglass SSH keys), not in the existing claude-agent pod. That pod runs
Bash-capable agents that ingest untrusted input (recruiter emails, Nextcloud
todos), and the shared terraform-state Vault policy grants the whole
namespace secret/data/*, so co-locating the keys would let a
prompt-injected agent read root-on-devvm credentials. We also add an explicit
deny on the breakglass key path to the terraform-state policy.
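The policy reasoning above can be illustrated with a small model of path-based policy evaluation. This is a simplified sketch, not Vault's actual matcher: it assumes only exact paths and trailing-`*` prefix globs, that the most specific matching rule wins, and that `deny` overrides any other capability. The key path `secret/data/breakglass/ssh-key` and the `read` capability are hypothetical placeholders; the source names only `secret/data/*`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    path: str                # exact path, or a prefix glob ending in "*"
    capabilities: frozenset  # e.g. frozenset({"read"}) or frozenset({"deny"})


def matches(rule_path: str, request_path: str) -> bool:
    if rule_path.endswith("*"):
        return request_path.startswith(rule_path[:-1])
    return rule_path == request_path


def specificity(rule_path: str) -> tuple:
    # Simplified ordering: exact paths beat globs, longer prefixes beat shorter.
    return (not rule_path.endswith("*"), len(rule_path))


def allowed(rules, request_path: str, capability: str) -> bool:
    candidates = [r for r in rules if matches(r.path, request_path)]
    if not candidates:
        return False  # nothing matches: default deny
    best = max(specificity(r.path) for r in candidates)
    caps = set()
    for r in candidates:
        if specificity(r.path) == best:
            caps |= r.capabilities
    return "deny" not in caps and capability in caps


# Hypothetical key path, for illustration only.
BREAKGLASS_KEY = "secret/data/breakglass/ssh-key"

# terraform-state as described: the whole namespace gets secret/data/*.
terraform_state_before = [Rule("secret/data/*", frozenset({"read"}))]
# terraform-state with the explicit deny added on the breakglass path.
terraform_state_after = terraform_state_before + [
    Rule("secret/data/breakglass/*", frozenset({"deny"}))
]
# The narrow breakglass policy: only the breakglass keys.
breakglass = [Rule("secret/data/breakglass/*", frozenset({"read"}))]
```

Under these assumptions, `allowed(terraform_state_before, BREAKGLASS_KEY, "read")` is true (the exfiltration risk), while the same call against `terraform_state_after` is false and the `breakglass` policy can read the key but nothing else under `secret/data/`.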
Status
accepted (2026-06-12)
Scope decision: warm case only
The devvm (VM 102) and all 7 Kubernetes nodes are guests of the same single
PVE host. An in-cluster UI therefore cannot be a true breakglass for
cluster- or host-down events — it would be dead exactly when needed. We scope
it deliberately to the warm case: devvm wedged (OOM / disk-full / stuck
service / QEMU I/O stall) while the cluster is healthy. The owner accepted this
limitation explicitly. The cold case (cluster/host down) stays with the
separate knock-gated PVE-SSH design (infra/docs/plans/2026-05-30-breakglass-ssh-access-design.md)
and the server-lifecycle iDRAC CLI — out of scope here.
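The scope split can be summarised as a small decision function. This is only an illustrative sketch of the reasoning above; the function name, parameter names, and enum labels are hypothetical and not part of any implementation.

```python
from enum import Enum


class RecoveryPath(Enum):
    NONE = "no recovery needed"
    WARM = "breakglass web UI (in-cluster)"
    COLD = "knock-gated PVE-SSH / iDRAC CLI (out of scope here)"


def recovery_path(host_up: bool, cluster_healthy: bool,
                  devvm_healthy: bool) -> RecoveryPath:
    # The UI runs inside the cluster, and the cluster nodes are guests of the
    # same PVE host as the devvm, so losing either one also loses the UI.
    if not host_up or not cluster_healthy:
        return RecoveryPath.COLD
    if not devvm_healthy:
        return RecoveryPath.WARM
    return RecoveryPath.NONE
```

For example, a wedged devvm with a healthy cluster maps to `RecoveryPath.WARM`, while any host-down or cluster-down event maps to `RecoveryPath.COLD` regardless of the devvm's state.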
Considered options
- Same pod, gate by endpoint: rejected, because endpoint-gating is HTTP-layer
  but key exfiltration is filesystem/Vault-layer; a `Bash` agent reads the key
  regardless of which route is exposed.
- App-level bearer login: rejected in favour of reusing the ingress
  `auth = "required"` resilience proxy, which already does Authentik SSO with
  an HTTP basic-auth fallback when Authentik is down (the chosen failure
  domain), plus CrowdSec + rate-limit by default.
- Proxmox API token instead of SSH: rejected, because it gives weaker forensics
  (no QMP/console capture) and would duplicate the SSH mechanism still needed
  for devvm diagnostics.
Consequences
- Host capabilities are intentionally broad but bounded: full sudo shell on
  the devvm (any soft repair), and autonomous PVE power verbs
  (`status|forensics|reset|stop|start|cycle` on VM 102 only) via a
  `command="…" restrict` forced-command, never a free shell on the hypervisor.
  Every mutating verb captures forensics first, unconditionally.
- The breakglass agent can trigger a reset on its own judgement (the owner
  chose autonomy over a human-confirm gate). In the isolated pod there is no
  untrusted-input injection vector; the residual risk is a model misread
  rebooting a devvm that did not strictly need it, which is bounded and
  recoverable.
- The SSH private key is loaded into an in-pod `ssh-agent` (not written to
  disk). This is an availability/hygiene measure, not the primary control; the
  dedicated narrow Vault policy is, since any in-pod process could otherwise
  re-fetch the key from Vault.
- The pod is hardened against the very pressure event it exists to fix: high
  `priorityClassName` (anti-eviction), broad tolerations, anti-affinity off the
  contended GPU node, `imagePullPolicy: IfNotPresent`, hardcoded target IPs (no
  DNS dependency), emptyDir-only (no NFS dependency).
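The forced-command behaviour in the first consequence can be sketched as a verb planner. This is a minimal sketch under stated assumptions, not the actual wrapper: it assumes the wrapper receives the requested verb via `SSH_ORIGINAL_COMMAND` (which sshd sets for forced commands), that power verbs map onto Proxmox `qm` subcommands, that `cycle` means stop followed by start, and that forensics capture is a separate helper. The helper path `/usr/local/sbin/breakglass-forensics` is hypothetical.

```python
VMID = "102"
MUTATING = {"reset", "stop", "start", "cycle"}

# Hypothetical forensics helper; the real capture mechanism is not specified.
FORENSICS = ["/usr/local/sbin/breakglass-forensics", VMID]


def plan(original_command: str) -> list[list[str]]:
    """Map the client's requested verb to the argv lists to run, in order."""
    words = original_command.split()
    if len(words) != 1:
        # No arguments, no shell syntax: exactly one bare verb is accepted.
        raise ValueError("exactly one verb expected")
    verb = words[0]
    if verb == "status":
        return [["qm", "status", VMID]]
    if verb == "forensics":
        return [FORENSICS]
    if verb not in MUTATING:
        raise ValueError(f"verb not allowed: {verb!r}")
    if verb == "cycle":
        # Assumption: "cycle" is a stop followed by a start.
        action = [["qm", "stop", VMID], ["qm", "start", VMID]]
    else:
        action = [["qm", verb, VMID]]
    # Every mutating verb captures forensics first, unconditionally.
    return [FORENSICS] + action
```

A wrapper built this way would call `plan(os.environ.get("SSH_ORIGINAL_COMMAND", ""))` and execute each argv list without a shell, so the VM ID is fixed on the hypervisor side and anything other than the six verbs is rejected before a command runs.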