infra/docs/runbooks/breakglass-ui.md
Viktor Barzin 32cf75635f claude-breakglass: in-cluster warm break-glass UI for the devvm
Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster
healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix,
and power-cycle VM 102 via the Proxmox host if needed. App half landed in the
claude-agent-service repo.

New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO
syncs only its key, so the pod has zero direct Vault access). Hardened to
survive the pressure it exists to fix: priorityClassName tier-0-core, broad
node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always.
auth="required" ingress so it rides the Authentik resilience proxy and stays
reachable via the basic-auth fallback during an auth-stack outage. Runs the
shared claude-agent-service image with the breakglass entrypoint.
files/breakglass-pve is the PVE forced-command (status|forensics|reset|stop|
start|cycle on VM 102, forensics-first).

Isolation: the shared claude-agent pod's terraform-state Vault policy is
explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a
prompt-injected agent on that pod can't read the root-on-devvm key.

traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls
when the emergency basic-auth password rotates (it's a subPath mount that
doesn't auto-update) — regenerated this session so Viktor has a known
emergency credential, which the auth-stack-outage failure domain requires.

Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure,
incl. the per-host from= NAT quirks) and a security.md note recording the two
new privileged footholds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:40:17 +00:00

Runbook: devvm breakglass UI (claude-breakglass)

Last updated: 2026-06-12

What this is

breakglass.viktorbarzin.me — an in-cluster Claude-driven web UI for recovering the devvm (Proxmox VM 102) when it is wedged but the cluster is healthy (the warm case). You chat with a Claude agent that SSHes into the devvm to diagnose/repair it, and there are manual buttons that power-cycle the VM via the Proxmox host even if the Anthropic API is down.

This is NOT the cold breakglass. If the cluster or PVE host is down, this UI is down too (it's a cluster workload). For that case use the cold path:

  • ssh -p 52222 root@<wan>, then qm stop 102 && qm start 102 (docs/runbooks/breakglass-ssh.md)
  • server-lifecycle iDRAC CLI (192.168.1.4) to power-cycle the whole host.

Architecture

browser ─► Cloudflare ─► Traefik ─► auth-proxy (Authentik, basic-auth fallback)
                                      └─► claude-breakglass Service (in-cluster)
claude-breakglass pod (ns claude-breakglass, own SA, NO Vault role):
  • app.breakglass.server (FastAPI) serves the Svelte UI + /api
  • chat → claude -p --agent breakglass (stream-json → SSE)
  • ssh-agent holds the breakglass key (synced by ESO, never on disk)
  • ssh devvm  → breakglass@10.0.10.10 (full sudo)         [diagnose/repair]
  • ssh pve <verb> → root@192.168.1.127 forced-command     [VM 102 power verbs]
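
The stream-json → SSE hop in the diagram amounts to re-framing a line-oriented JSON stream as Server-Sent Events. The real conversion lives in the FastAPI app. The shell filter below only illustrates the framing, assuming the agent output is newline-delimited JSON. The function name ndjson_to_sse is hypothetical.

```shell
#!/bin/sh
# Illustration only: the real conversion happens inside app.breakglass.server.
# Re-frame each non-empty line on stdin as one SSE "data:" event.
ndjson_to_sse() {
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    # An SSE event is terminated by a blank line, hence the double newline.
    printf 'data: %s\n\n' "$line"
  done
}
```

Any producer of line-delimited JSON can be piped through ndjson_to_sse to see the event framing a browser EventSource would receive.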

Image: forgejo.viktorbarzin.me/viktor/claude-agent-service:latest (shared with claude-agent-service; the deployment overrides the command with /srv/docker-entrypoint-breakglass.sh). Code: claude-agent-service/app/breakglass/. Stack: stacks/claude-breakglass/. ADR: claude-agent-service/docs/adr/0001-*.

Auth (how to get in)

  • Normal: Authentik SSO (you normally already have an SSO session).
  • Authentik down: the auth-proxy falls back to HTTP basic-auth ("Emergency Access"). The username is admin; the password is the shared auth_fallback_htpasswd (Vault secret/platform). This same credential gates every auth="required" app. To rotate: regenerate the htpasswd, run vault kv patch secret/platform auth_fallback_htpasswd=..., then apply the traefik stack (the auth-proxy rolls on the checksum/auth-proxy-htpasswd annotation).

The PVE forced-command (the reset path)

The breakglass SSH key's entry in PVE /root/.ssh/authorized_keys is pinned to command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2". It only accepts the bare verbs status | forensics | reset | stop | start | cycle against VM 102 — anything else is rejected and logged to /var/log/breakglass-pve.log. Every mutating verb captures forensics first.

  • cycle = stop→start (fresh QEMU, applies staged config) — the fix for a QEMU I/O stall (2026-06-11). If a clean stop fails, it kills the wedged QEMU PID then starts. Prefer cycle over reset for a wedged VM.
  • reset is a warm reset (reuses QEMU) — only for a normal guest hang.
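
As a rough sketch of the cycle behaviour described in the list above (not the contents of the real breakglass-pve script): try a clean stop, fall back to killing the QEMU process, then start. The pidfile location /var/run/qemu-server/<vmid>.pid and the PIDFILE_DIR override are assumptions made for illustration. The forensics capture that precedes every mutating verb is omitted here.

```shell
#!/bin/sh
# Hypothetical sketch of "cycle"; the deployed script may differ.
# The body runs in a subshell so its variables do not leak into the caller.
cycle_vm() (
  vmid=$1
  # Assumed pidfile location; PIDFILE_DIR exists only to make this testable.
  pidfile=${PIDFILE_DIR:-/var/run/qemu-server}/$vmid.pid
  if ! qm stop "$vmid"; then
    # Clean stop failed: kill the wedged QEMU process if its pidfile is readable.
    if [ -r "$pidfile" ]; then
      kill -9 "$(cat "$pidfile")" 2>/dev/null || true
      # A real script would wait here until the process has actually exited.
    fi
  fi
  # A fresh QEMU process picks up any staged config changes.
  qm start "$vmid"
)
```

Usage on the PVE host would be cycle_vm 102. A reset, by contrast, keeps the same QEMU process, which is why it does not help when QEMU itself is stalled.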

Script source: stacks/claude-breakglass/files/breakglass-pve (deploy via scp … root@192.168.1.127:/usr/local/bin/breakglass-pve).
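
The verb filtering can be sketched as follows. This is a minimal illustration of the forced-command pattern, not the script at the path above: sshd puts the client's requested command in SSH_ORIGINAL_COMMAND, and the handler accepts only an exact match against the allowed verbs. The function names, the LOG override and the echoed output are hypothetical, and treating status and forensics as the non-mutating verbs is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of a forced-command verb filter; NOT the real breakglass-pve.
VMID=102
LOG=${BREAKGLASS_LOG:-/var/log/breakglass-pve.log}

# Print the class of a verb; return 1 for anything that is not a bare allowed verb.
classify_verb() {
  case $1 in
    status|forensics)       echo non-mutating ;;
    reset|stop|start|cycle) echo mutating ;;
    *)                      return 1 ;;
  esac
}

# Decide what to do with the command the SSH client asked for.
handle_request() {
  verb=${SSH_ORIGINAL_COMMAND:-}
  if ! class=$(classify_verb "$verb"); then
    # Rejected requests are logged with a UTC timestamp.
    printf '%s rejected: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$verb" >>"$LOG"
    echo "rejected" >&2
    return 1
  fi
  if [ "$class" = mutating ]; then
    # Mutating verbs capture forensics before touching the VM.
    echo "forensics, then $verb on VM $VMID"
  else
    echo "$verb on VM $VMID"
  fi
}
```

Because the case patterns match the whole string, a request such as "status; id" or "qm destroy 102" is rejected rather than partially executed.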

NAT quirks (why from= differs per host)

Discovered during bring-up — both verified from a real in-cluster pod:

  • pod → PVE (192.168.1.127): pfSense SNATs inter-VLAN traffic to its 192.168.1.2 interface, so PVE sees 192.168.1.2 for ALL cluster (and devvm) SSH. Hence the PVE key uses from="192.168.1.2". The devvm itself is NOT a permitted source (it's the box being recovered).
  • pod → devvm (10.0.10.10): the devvm sees the Calico-SNATed node IP, which falls in 10.0.20.0/24. Hence the devvm key uses from="10.0.20.0/24".
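
To reason about which source addresses a from= pattern admits, the CIDR match can be reproduced in a few lines of shell. This is an illustrative helper, not part of the stack. The function names are hypothetical, and it handles only plain IPv4 addresses and address/masklen patterns, not the wildcard or negated patterns sshd also accepts.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  set -f          # no globbing while the address is split on dots
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 if address $1 falls inside $2 (a CIDR block or a single address).
ip_in_cidr() (
  case $2 in
    */*) net=${2%/*}; bits=${2#*/} ;;
    *)   net=$2; bits=32 ;;
  esac
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
)
```

For example, ip_in_cidr 10.0.20.17 10.0.20.0/24 succeeds, while ip_in_cidr 10.0.10.10 10.0.20.0/24 fails, which is why a connection that arrives un-NATed from the devvm's own address would not match the devvm key's from= restriction.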

Host bootstrap (one-time; redo on devvm rebuild / key rotation)

The keypair lives in Vault secret/claude-breakglass/ssh_key (private_key/public_key). To re-provision after a rebuild:

PUB=$(vault kv get -field=public_key secret/claude-breakglass/ssh_key)

# devvm (full-sudo recovery user):
sudo useradd -m -s /bin/bash breakglass 2>/dev/null || true
sudo install -d -m700 -o breakglass -g breakglass /home/breakglass/.ssh
printf 'from="10.0.20.0/24" %s\n' "$PUB" | sudo tee /home/breakglass/.ssh/authorized_keys
sudo chown breakglass:breakglass /home/breakglass/.ssh/authorized_keys
sudo chmod 600 /home/breakglass/.ssh/authorized_keys
echo 'breakglass ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/breakglass && sudo chmod 440 /etc/sudoers.d/breakglass

# PVE (forced-command power verbs):
scp stacks/claude-breakglass/files/breakglass-pve root@192.168.1.127:/usr/local/bin/breakglass-pve
ssh root@192.168.1.127 chmod 0755 /usr/local/bin/breakglass-pve
# then append to /root/.ssh/authorized_keys on PVE:
#   command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2" <PUB>

Host-key checking is OFF in the pod's ssh config: a devvm rebuild rotates the host key, and strict checking would lock the breakglass out mid-incident. The trade-off is accepted because the path is a trusted internal LAN and key authentication still applies.
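
The config file shipped in the image is not reproduced in this runbook. The sketch below shows what a client config with these properties might look like, using the devvm and pve aliases, users and addresses from the Architecture section. The option set and the render_ssh_config helper are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: print an ssh client config with host-key checking disabled.
render_ssh_config() {
  cat <<'EOF'
Host devvm
  HostName 10.0.10.10
  User breakglass

Host pve
  HostName 192.168.1.127
  User root

Host devvm pve
  # A rebuilt host presents a new key; do not refuse or remember it.
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
EOF
}
```

Sending known hosts to /dev/null keeps a stale entry from a previous devvm build from ever blocking a connection.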

Verify

kubectl -n claude-breakglass get pods                 # Running
kubectl -n claude-breakglass logs deploy/claude-breakglass | grep -i ssh-add
curl -sk https://breakglass.viktorbarzin.me/health    # (through the edge)
# from a pod, the PVE path:  ssh pve status  → "status: running"
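
For scripted checks, the first command can be reduced to an exit code. The helper below is a hypothetical sketch: it reads pod phases from stdin and succeeds only if there is at least one and all are Running. Note that a Running phase does not by itself mean the container is Ready.

```shell
#!/bin/sh
# Return 0 only if stdin holds at least one phase and every phase is "Running".
all_running() (
  phases=$(cat)
  [ -n "$phases" ] || exit 1
  for p in $phases; do
    [ "$p" = Running ] || exit 1
  done
)

# Usage (needs cluster access):
#   kubectl -n claude-breakglass get pods \
#     -o jsonpath='{.items[*].status.phase}' | all_running && echo ok
```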

Isolation (why a separate deployment)

The shared claude-agent pod runs agents that ingest untrusted input (recruiter emails, nextcloud todos) and have Bash access. Co-locating the root-on-devvm key there would let a prompt injection exfiltrate it. The breakglass therefore runs in its own namespace with its own SA and no Vault role (ESO syncs only its key); the terraform-state Vault policy is explicitly DENIED secret/claude-breakglass/*.