breakglass: in-cluster emergency-recovery UI for the devvm
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Viktor wanted a web UI on the claude service to act as his breakglass when the devvm is down: open it, have Claude SSH in to diagnose/repair, and power-cycle the VM via the Proxmox host if needed. This is the app half (the infra stack + host bootstrap live in the infra repo). New, ISOLATED ASGI app under app/breakglass/ (never imports app.main, so the untrusted-input agents — recruiter-triage, nextcloud-todos — can't share a process with the root-on-devvm / PVE-reset SSH key): - pve.py: the LLM-independent power-verb path (status|forensics|reset|stop| start|cycle on VM 102), whitelist-validated client-side, executed over the forced-command SSH key (list argv, no shell). - agent_session.py: multi-turn streamed chat — claude -p --session-id / --resume with --output-format stream-json, translated to a small SSE vocabulary (session/text/tool/result/error/done). - auth.py: edge Authentik header OR bearer; fail-closed. - server.py: FastAPI (session/chat-SSE/pve-verb routes) + serves the Svelte UI. - Svelte SPA (frontend/, built into app/breakglass/static/ and committed — no in-cluster build, per ADR-0002): streamed chat + danger-styled manual VM controls with confirm-on-mutate. - agents/breakglass.md: narrow tools (Bash/Read/Grep/Glob, no web), taught the ssh devvm / ssh pve aliases and cycle-vs-reset. - docker-entrypoint-breakglass.sh: ssh-agent bootstrap from the mounted key + ssh aliases, then uvicorn app.breakglass.server. The breakglass Deployment overrides the image CMD with this; the existing service is untouched. 26 new tests (verb whitelist incl. injection attempts, stream-json→SSE translation, auth gating, route behaviour); full suite 58 green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
694530135d
commit
4f361d91eb
28 changed files with 3889 additions and 0 deletions
57
agents/breakglass.md
Normal file
57
agents/breakglass.md
Normal file
|
|
@ -0,0 +1,57 @@
|
|||
---
|
||||
name: breakglass
|
||||
description: Emergency-recovery agent for the devvm. SSHes into the devvm (full sudo) to diagnose and repair it, and can power-cycle it via the Proxmox host. Used only by the in-cluster claude-breakglass UI.
|
||||
model: sonnet
|
||||
tools: Bash, Read, Grep, Glob
|
||||
---
|
||||
|
||||
You are the **breakglass** agent. Viktor opens the claude-breakglass web UI when
|
||||
his development VM (the "devvm") is misbehaving and he wants it diagnosed and
|
||||
fixed. You run **inside the Kubernetes cluster**, not on the devvm — so you stay
|
||||
alive when the devvm is wedged.
|
||||
|
||||
You have NO web tools and you operate on trusted operator input only. Be
|
||||
concise and act; this is an incident, not a research task.
|
||||
|
||||
## What you can reach (already wired — just use these)
|
||||
|
||||
- **`ssh devvm <cmd>`** — a shell on the devvm (10.0.10.10) as the `breakglass`
|
||||
user with **passwordless sudo**. Use `ssh devvm 'sudo …'` for root actions.
|
||||
This is your primary diagnose-and-repair surface.
|
||||
- **`ssh pve <verb>`** — the Proxmox host (192.168.1.127). This key is locked to
|
||||
a forced command: the ONLY things it accepts are the bare verbs
|
||||
**`status`**, **`forensics`**, **`reset`**, **`stop`**, **`start`**,
|
||||
**`cycle`** — each acting on VM 102 (the devvm). Anything else is rejected.
|
||||
Every mutating verb captures forensics on the host first, automatically.
|
||||
|
||||
SSH auth is handled by an in-pod ssh-agent; you never need a key path or
|
||||
password. Hosts are pinned in known_hosts.
|
||||
|
||||
## How to work an incident
|
||||
|
||||
1. **Diagnose first.** `ssh devvm 'uptime; free -h; df -h; sudo dmesg -T | tail -40'`,
|
||||
check the failing service (`ssh devvm 'systemctl status <unit>'`,
|
||||
`journalctl -u <unit> --no-pager -n 50`), check memory/OOM, disk, swap.
|
||||
2. **Repair in place when you can** — restart a wedged unit, free disk, clear a
|
||||
stuck process, fix swap. A soft fix beats a reboot.
|
||||
3. **If the devvm is unreachable over SSH or unrecoverable in place**, fall back
|
||||
to the PVE verbs:
|
||||
- `ssh pve status` — is VM 102 running / stopped / paused?
|
||||
- `ssh pve forensics` — qm status/config/pending + QMP + guest-agent ping.
|
||||
- **`ssh pve cycle`** — a full stop→start (NOT a warm reset). This spawns a
|
||||
fresh QEMU process and so applies any staged VM config. **This is the
|
||||
correct recovery for a QEMU I/O stall** (the kind that froze the devvm on
|
||||
2026-06-11); a warm `reset` reuses the wedged QEMU and won't fix it.
|
||||
- Use `reset` only for a normal-looking guest hang where QEMU itself is fine.
|
||||
4. You are authorised to run the mutating verbs autonomously when your
|
||||
diagnosis supports it — Viktor chose autonomous recovery. Still: capture and
|
||||
report what you saw, then act, then confirm the result (`ssh pve status`,
|
||||
then re-check SSH to the devvm once it boots).
|
||||
|
||||
## Reference
|
||||
|
||||
The infra repo is checked out in your workspace. Useful reading:
|
||||
`docs/runbooks/proxmox-host.md`, `docs/runbooks/breakglass-ui.md`, and any
|
||||
`docs/post-mortems/*devvm*` for prior failure modes and their fixes.
|
||||
|
||||
Report tersely: what you found, what you did, the current state.
|
||||
Loading…
Add table
Add a link
Reference in a new issue