claude-agent-service/app/breakglass/config.py

"""Environment-driven config for the breakglass app.

Targets are hardcoded IPs by default: the breakglass must not depend on cluster
DNS, because it has to work when things are broken. Every setting except
TRUSTED_USER_HEADER is overridable via env, for tests and future re-IPing.
"""
import os

# SSH targets. IPs, not names — no DNS dependency in an incident.
DEVVM_HOST = os.environ.get("BREAKGLASS_DEVVM_HOST", "10.0.10.10")
DEVVM_USER = os.environ.get("BREAKGLASS_DEVVM_USER", "breakglass")
PVE_HOST = os.environ.get("BREAKGLASS_PVE_HOST", "192.168.1.127")
PVE_USER = os.environ.get("BREAKGLASS_PVE_USER", "root")

# The Claude agent the breakglass UI drives. Narrow tool surface, no web tools.
BREAKGLASS_AGENT = os.environ.get("BREAKGLASS_AGENT", "breakglass")
DEFAULT_MODEL = os.environ.get("BREAKGLASS_MODEL", "sonnet")

# Where claude session state + per-session scratch live. emptyDir in prod.
SESSIONS_DIR = os.environ.get("BREAKGLASS_SESSIONS_DIR", "/workspace/sessions")

# A single human operator per incident — no need for the job-runner's fan-out.
MAX_CONCURRENT_TURNS = int(os.environ.get("BREAKGLASS_MAX_CONCURRENT_TURNS", "2"))
# A chat turn that runs longer than this is killed (the agent is wedged).
TURN_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_TURN_TIMEOUT_SECONDS", "1800"))
# A single PVE power verb must return fast; a wedged host shouldn't hang the UI.
PVE_VERB_TIMEOUT_SECONDS = int(os.environ.get("BREAKGLASS_PVE_VERB_TIMEOUT_SECONDS", "120"))

# Auth. The app sits behind the ingress `auth = "required"` resilience proxy
# (Authentik SSO, basic-auth fallback when Authentik is down). We additionally
# accept a bearer token for machine/CLI callers. Either gate is sufficient;
# the edge is the primary one for the browser UI.
API_TOKEN = os.environ.get("API_BEARER_TOKEN", "")
# Header the auth-proxy injects for an authenticated human (set by Authentik, or
# by the basic-auth fallback's `$remote_user`). Presence ⇒ edge-authenticated.
TRUSTED_USER_HEADER = "x-authentik-username"
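
The auth comments above describe two independent gates: an edge-injected username header, or a bearer token. A minimal sketch of how a check might consume `API_TOKEN` and `TRUSTED_USER_HEADER` follows. The function `is_authorized` and its signature are hypothetical and are not taken from the real `auth.py`. The sketch also assumes the proxy strips any client-supplied copy of the trusted header, since the header is otherwise spoofable.

```python
import hmac
from typing import Mapping

# Stand-ins for the values defined in config.py above.
API_TOKEN = ""  # empty means no bearer token is configured
TRUSTED_USER_HEADER = "x-authentik-username"


def is_authorized(headers: Mapping[str, str], api_token: str = API_TOKEN) -> bool:
    """Hypothetical gate: True if either check passes, otherwise False (fail-closed)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Gate 1: the edge proxy injected a non-empty username header.
    if lowered.get(TRUSTED_USER_HEADER, "").strip():
        return True

    # Gate 2: bearer token. An unset (empty) token must never match anything.
    scheme, _, presented = lowered.get("authorization", "").partition(" ")
    if api_token and scheme.lower() == "bearer":
        # Constant-time comparison avoids leaking the token through timing.
        return hmac.compare_digest(presented.strip().encode(), api_token.encode())

    return False
```

The explicit `api_token and ...` guard matters because `API_TOKEN` defaults to the empty string: without it, a request carrying `Authorization: Bearer ` with an empty credential would compare equal to the unset token.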
breakglass: in-cluster emergency-recovery UI for the devvm

Viktor wanted a web UI on the claude service to act as his breakglass when the
devvm is down: open it, have Claude SSH in to diagnose/repair, and power-cycle
the VM via the Proxmox host if needed. This is the app half (the infra stack +
host bootstrap live in the infra repo).

New, ISOLATED ASGI app under app/breakglass/ (never imports app.main, so the
untrusted-input agents, recruiter-triage and nextcloud-todos, can't share a
process with the root-on-devvm / PVE-reset SSH key):

- pve.py: the LLM-independent power-verb path
  (status|forensics|reset|stop|start|cycle on VM 102), whitelist-validated
  client-side, executed over the forced-command SSH key (list argv, no shell).
- agent_session.py: multi-turn streamed chat: claude -p --session-id / --resume
  with --output-format stream-json, translated to a small SSE vocabulary
  (session/text/tool/result/error/done).
- auth.py: edge Authentik header OR bearer; fail-closed.
- server.py: FastAPI (session/chat-SSE/pve-verb routes) + serves the Svelte UI.
- Svelte SPA (frontend/, built into app/breakglass/static/ and committed; no
  in-cluster build, per ADR-0002): streamed chat + danger-styled manual VM
  controls with confirm-on-mutate.
- agents/breakglass.md: narrow tools (Bash/Read/Grep/Glob, no web), taught the
  ssh devvm / ssh pve aliases and cycle-vs-reset.
- docker-entrypoint-breakglass.sh: ssh-agent bootstrap from the mounted key +
  ssh aliases, then uvicorn app.breakglass.server. The breakglass Deployment
  overrides the image CMD with this; the existing service is untouched.

26 new tests (verb whitelist incl. injection attempts, stream-json→SSE
translation, auth gating, route behaviour); full suite 58 green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 21:36:05 +00:00
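
The commit message describes the `pve.py` power-verb path as whitelist-validated and executed with a list argv and no shell, bounded by `PVE_VERB_TIMEOUT_SECONDS`. The sketch below illustrates that pattern under stated assumptions. The names `ALLOWED_VERBS`, `build_pve_argv`, and `run_pve_verb` are hypothetical, and passing the verb as the remote command (for the forced-command key to read on the host) is a guess at the wiring, not the actual implementation.

```python
import subprocess

# Stand-ins for the config.py values; the verb set is the one listed in the commit message.
PVE_HOST = "192.168.1.127"
PVE_USER = "root"
PVE_VERB_TIMEOUT_SECONDS = 120
ALLOWED_VERBS = frozenset({"status", "forensics", "reset", "stop", "start", "cycle"})


def build_pve_argv(verb: str) -> list[str]:
    """Return the ssh argv for a verb, rejecting anything outside the whitelist."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {verb!r}")
    # A list argv is handed straight to exec; no local shell ever parses the verb.
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{PVE_USER}@{PVE_HOST}", verb]


def run_pve_verb(verb: str, runner=subprocess.run):
    """Run one verb with a hard timeout so a wedged host cannot hang the caller."""
    argv = build_pve_argv(verb)
    # subprocess.run raises subprocess.TimeoutExpired once the timeout elapses.
    return runner(
        argv,
        capture_output=True,
        text=True,
        timeout=PVE_VERB_TIMEOUT_SECONDS,
        check=False,
    )
```

Exact-match membership in a fixed set rejects injection attempts such as `"status; reboot"` before any process is spawned. The `runner` parameter is only there so tests can substitute a fake for `subprocess.run`.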