claude-agent-service/app/breakglass/config.py
Viktor Barzin 4f361d91eb
breakglass: in-cluster emergency-recovery UI for the devvm
Viktor wanted a web UI on the claude service to act as his breakglass when
the devvm is down: open it, have Claude SSH in to diagnose/repair, and
power-cycle the VM via the Proxmox host if needed. This is the app half
(the infra stack + host bootstrap live in the infra repo).

New, ISOLATED ASGI app under app/breakglass/ (never imports app.main, so the
untrusted-input agents — recruiter-triage, nextcloud-todos — can't share a
process with the root-on-devvm / PVE-reset SSH key):
- pve.py: the LLM-independent power-verb path (status|forensics|reset|stop|
  start|cycle on VM 102), validated against a whitelist in the app before any
  SSH call, then executed over the forced-command SSH key (list argv, no shell).
- agent_session.py: multi-turn streamed chat — claude -p --session-id /
  --resume with --output-format stream-json, translated to a small SSE
  vocabulary (session/text/tool/result/error/done).
- auth.py: edge Authentik header OR bearer; fail-closed.
- server.py: FastAPI (session/chat-SSE/pve-verb routes) + serves the Svelte UI.
- Svelte SPA (frontend/, built into app/breakglass/static/ and committed — no
  in-cluster build, per ADR-0002): streamed chat + danger-styled manual VM
  controls with confirm-on-mutate.
- agents/breakglass.md: narrow tools (Bash/Read/Grep/Glob, no web), taught the
  ssh devvm / ssh pve aliases and cycle-vs-reset.
- docker-entrypoint-breakglass.sh: ssh-agent bootstrap from the mounted key +
  ssh aliases, then uvicorn app.breakglass.server. The breakglass Deployment
  overrides the image CMD with this; the existing service is untouched.

26 new tests (verb whitelist incl. injection attempts, stream-json→SSE
translation, auth gating, route behaviour); full suite 58 green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:36:05 +00:00


"""Environment-driven config for the breakglass app.

Targets are hardcoded IPs by default (the breakglass must not depend on cluster
DNS — it has to work when things are broken). Everything is overridable via env
for tests and future re-IPing.
"""

import os

# SSH targets. IPs, not names — no DNS dependency in an incident.
DEVVM_HOST = os.environ.get("BREAKGLASS_DEVVM_HOST", "10.0.10.10")
DEVVM_USER = os.environ.get("BREAKGLASS_DEVVM_USER", "breakglass")
PVE_HOST = os.environ.get("BREAKGLASS_PVE_HOST", "192.168.1.127")
PVE_USER = os.environ.get("BREAKGLASS_PVE_USER", "root")

# The Claude agent the breakglass UI drives. Narrow tool surface, no web tools.
BREAKGLASS_AGENT = os.environ.get("BREAKGLASS_AGENT", "breakglass")
DEFAULT_MODEL = os.environ.get("BREAKGLASS_MODEL", "sonnet")

# Where claude session state + per-session scratch live. emptyDir in prod.
SESSIONS_DIR = os.environ.get("BREAKGLASS_SESSIONS_DIR", "/workspace/sessions")

# A single human operator per incident — no need for the job-runner's fan-out.
MAX_CONCURRENT_TURNS = int(os.environ.get("BREAKGLASS_MAX_CONCURRENT_TURNS", "2"))

# A chat turn that runs longer than this is killed (the agent is wedged).
TURN_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_TURN_TIMEOUT_SECONDS", "1800"))

# A single PVE power verb must return fast; a wedged host shouldn't hang the UI.
PVE_VERB_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_PVE_VERB_TIMEOUT_SECONDS", "120"))

# Auth. The app sits behind the ingress `auth = "required"` resilience proxy
# (Authentik SSO, basic-auth fallback when Authentik is down). We additionally
# accept a bearer token for machine/CLI callers. Either gate is sufficient;
# the edge is the primary one for the browser UI.
API_TOKEN = os.environ.get("API_BEARER_TOKEN", "")

# Header the auth-proxy injects for an authenticated human (set by Authentik, or
# by the basic-auth fallback's `$remote_user`). Presence ⇒ edge-authenticated.
TRUSTED_USER_HEADER = "x-authentik-username"
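
The auth comments above describe two gates, either of which is sufficient, and the commit message calls auth.py fail-closed. auth.py itself is not shown here, so the following is only a minimal sketch of how such a check might look under those stated rules. The function name `is_authorized` and its signature are hypothetical, and the rule that an empty configured token disables the bearer gate is an assumption rather than something the source states.

```python
import hmac

# Same value as TRUSTED_USER_HEADER in config.py; repeated so the sketch is self-contained.
TRUSTED_USER_HEADER = "x-authentik-username"


def is_authorized(headers: dict[str, str], api_token: str) -> bool:
    """Hypothetical helper: True if either gate passes, otherwise deny."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Gate 1: the edge proxy injected a non-empty username header.
    if lowered.get(TRUSTED_USER_HEADER, "").strip():
        return True

    # Gate 2: bearer token. Assumption: an empty configured token disables
    # this gate, so an unset API_BEARER_TOKEN never matches anything.
    scheme, _, presented = lowered.get("authorization", "").partition(" ")
    if api_token and scheme.lower() == "bearer" and presented:
        # Constant-time comparison avoids leaking the token via timing.
        return hmac.compare_digest(presented.encode(), api_token.encode())

    # Fail closed: no gate passed.
    return False
```

Trusting the header is only safe if the ingress strips any client-supplied copy of it before forwarding, which this sketch assumes the auth proxy does.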