breakglass: in-cluster emergency-recovery UI for the devvm
Viktor wanted a web UI on the claude service to act as his breakglass when
the devvm is down: open it, have Claude SSH in to diagnose/repair, and
power-cycle the VM via the Proxmox host if needed. This is the app half
(the infra stack + host bootstrap live in the infra repo).
New, ISOLATED ASGI app under app/breakglass/ (never imports app.main, so the
untrusted-input agents — recruiter-triage, nextcloud-todos — can't share a
process with the root-on-devvm / PVE-reset SSH key):
- pve.py: the LLM-independent power-verb path (status|forensics|reset|stop|
start|cycle on VM 102), whitelist-validated client-side, executed over the
forced-command SSH key (list argv, no shell).
- agent_session.py: multi-turn streamed chat — claude -p --session-id /
--resume with --output-format stream-json, translated to a small SSE
vocabulary (session/text/tool/result/error/done).
- auth.py: edge Authentik header OR bearer; fail-closed.
- server.py: FastAPI (session/chat-SSE/pve-verb routes) + serves the Svelte UI.
- Svelte SPA (frontend/, built into app/breakglass/static/ and committed — no
in-cluster build, per ADR-0002): streamed chat + danger-styled manual VM
controls with confirm-on-mutate.
- agents/breakglass.md: narrow tools (Bash/Read/Grep/Glob, no web), taught the
ssh devvm / ssh pve aliases and cycle-vs-reset.
- docker-entrypoint-breakglass.sh: ssh-agent bootstrap from the mounted key +
ssh aliases, then uvicorn app.breakglass.server. The breakglass Deployment
overrides the image CMD with this; the existing service is untouched.
26 new tests (verb whitelist incl. injection attempts, stream-json→SSE
translation, auth gating, route behaviour); full suite 58 green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:36:05 +00:00
|
|
|
"""Environment-driven config for the breakglass app.
|
|
|
|
|
|
|
|
|
|
Targets are hardcoded IPs by default (the breakglass must not depend on cluster
|
|
|
|
|
DNS — it has to work when things are broken). Everything is overridable via env
|
|
|
|
|
for tests and future re-IPing.
|
|
|
|
|
"""
|
|
|
|
|
import os
|
|
|
|
|
|
|
|
|
|
# SSH targets. IPs, not names — no DNS dependency in an incident.
|
|
|
|
|
DEVVM_HOST = os.environ.get("BREAKGLASS_DEVVM_HOST", "10.0.10.10")
|
|
|
|
|
DEVVM_USER = os.environ.get("BREAKGLASS_DEVVM_USER", "breakglass")
|
|
|
|
|
PVE_HOST = os.environ.get("BREAKGLASS_PVE_HOST", "192.168.1.127")
|
|
|
|
|
PVE_USER = os.environ.get("BREAKGLASS_PVE_USER", "root")
|
|
|
|
|
|
|
|
|
|
# The Claude agent the breakglass UI drives. Narrow tool surface, no web tools.
|
|
|
|
|
BREAKGLASS_AGENT = os.environ.get("BREAKGLASS_AGENT", "breakglass")
|
|
|
|
|
DEFAULT_MODEL = os.environ.get("BREAKGLASS_MODEL", "sonnet")
|
|
|
|
|
|
|
|
|
|
# Where claude session state + per-session scratch live. emptyDir in prod.
|
|
|
|
|
SESSIONS_DIR = os.environ.get("BREAKGLASS_SESSIONS_DIR", "/workspace/sessions")
|
|
|
|
|
|
|
|
|
|
# A single human operator per incident — no need for the job-runner's fan-out.
|
|
|
|
|
MAX_CONCURRENT_TURNS = int(os.environ.get("BREAKGLASS_MAX_CONCURRENT_TURNS", "2"))
|
|
|
|
|
# A chat turn that runs longer than this is killed (the agent is wedged).
|
|
|
|
|
TURN_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_TURN_TIMEOUT_SECONDS", "1800"))
|
|
|
|
|
# A single PVE power verb must return fast; a wedged host shouldn't hang the UI.
|
|
|
|
|
PVE_VERB_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_PVE_VERB_TIMEOUT_SECONDS", "120"))
|
breakglass UI v2: attachable sessions (tmux model) + mobile-first redesign
Full audit-driven rework. Keeps the proven SSE-translation + verb logic; everything else upgraded for phone-primary use.
Backend — server owns the session, clients attach (Viktor's tmux idea):
- session.py: SessionManager + Session with an event log, subscriber pub/sub, and turns that run DETACHED (keep going if the client disconnects).
- GET /api/session/{id}/stream = attach (SSE): replays the transcript then tails live; per-event id: lines so an EventSource auto-reconnect resumes from Last-Event-ID (free re-attach). POST /{id}/prompt starts a detached turn; POST /{id}/cancel = Stop. Replaces the old one-shot /api/chat.
- agent_session trimmed to the argv + translate_event helpers; 21 new/updated tests (replay, Last-Event-ID resume, broadcast, detached turn, resume, cancel, routes) — 53 green.
Frontend — mobile-first via the frontend-design skill (emergency-console aesthetic):
- EventSource attach (native auto-reconnect, zero client reconnect logic); transcript.js folds events->messages with id-dedupe so replays never double-render (30 unit assertions).
- Installable PWA: manifest + icons (wrench/break-glass mark) + apple-mobile-web-app meta + theme-color; viewport-fit=cover + safe-area; 100dvh; 16px composer (no iOS zoom).
- One-tap diagnosis presets (Triage / Memory-OOM / Disk / Services / QEMU-wedged) mapped to the devvm's real failure modes; Stop button while a turn runs.
- Foldable VM-control sheet, cycle the dominant recovery action w/ confirm, output capped 46vh.
- a11y: fixed --ink-faint contrast 3.6:1 -> 6.1:1 (WCAG AA); >=44px tap targets. Deleted the obsolete fetch-reader sse.js (EventSource replaces it).
Verified: 53 backend tests + 30 transcript assertions; Playwright @390x844 (input on-screen y=721-821, presets/sheet/fold/cap); local integration smoke vs the real backend (attach->caught-up, 404, verbs, PWA served).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-14 19:19:03 +00:00
|
|
|
# How long an idle attach stream waits before emitting an SSE keepalive comment
|
|
|
|
|
# (keeps proxies/CDN from closing the long-lived connection).
|
|
|
|
|
SSE_KEEPALIVE_SECONDS = int(os.environ.get("BREAKGLASS_SSE_KEEPALIVE_SECONDS", "20"))
|
breakglass: in-cluster emergency-recovery UI for the devvm
Viktor wanted a web UI on the claude service to act as his breakglass when
the devvm is down: open it, have Claude SSH in to diagnose/repair, and
power-cycle the VM via the Proxmox host if needed. This is the app half
(the infra stack + host bootstrap live in the infra repo).
New, ISOLATED ASGI app under app/breakglass/ (never imports app.main, so the
untrusted-input agents — recruiter-triage, nextcloud-todos — can't share a
process with the root-on-devvm / PVE-reset SSH key):
- pve.py: the LLM-independent power-verb path (status|forensics|reset|stop|
start|cycle on VM 102), whitelist-validated client-side, executed over the
forced-command SSH key (list argv, no shell).
- agent_session.py: multi-turn streamed chat — claude -p --session-id /
--resume with --output-format stream-json, translated to a small SSE
vocabulary (session/text/tool/result/error/done).
- auth.py: edge Authentik header OR bearer; fail-closed.
- server.py: FastAPI (session/chat-SSE/pve-verb routes) + serves the Svelte UI.
- Svelte SPA (frontend/, built into app/breakglass/static/ and committed — no
in-cluster build, per ADR-0002): streamed chat + danger-styled manual VM
controls with confirm-on-mutate.
- agents/breakglass.md: narrow tools (Bash/Read/Grep/Glob, no web), taught the
ssh devvm / ssh pve aliases and cycle-vs-reset.
- docker-entrypoint-breakglass.sh: ssh-agent bootstrap from the mounted key +
ssh aliases, then uvicorn app.breakglass.server. The breakglass Deployment
overrides the image CMD with this; the existing service is untouched.
26 new tests (verb whitelist incl. injection attempts, stream-json→SSE
translation, auth gating, route behaviour); full suite 58 green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:36:05 +00:00
|
|
|
|
|
|
|
|
# Auth. The app sits behind the ingress `auth = "required"` resilience proxy
|
|
|
|
|
# (Authentik SSO, basic-auth fallback when Authentik is down). We additionally
|
|
|
|
|
# accept a bearer token for machine/CLI callers. Either gate is sufficient;
|
|
|
|
|
# the edge is the primary one for the browser UI.
|
|
|
|
|
API_TOKEN = os.environ.get("API_BEARER_TOKEN", "")
|
|
|
|
|
# Header the auth-proxy injects for an authenticated human (set by Authentik, or
|
|
|
|
|
# by the basic-auth fallback's `$remote_user`). Presence ⇒ edge-authenticated.
|
|
|
|
|
TRUSTED_USER_HEADER = "x-authentik-username"
|