claude-breakglass: in-cluster warm break-glass UI for the devvm

Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster
healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix,
and power-cycle VM 102 via the Proxmox host if needed. App half landed in the
claude-agent-service repo.

New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO
syncs only its key, so the pod has zero direct Vault access). Hardened to
survive the pressure it exists to fix: priorityClassName tier-0-core, broad
node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always.
auth="required" ingress so it rides the Authentik resilience proxy and stays
reachable via the basic-auth fallback during an auth-stack outage. Runs the
shared claude-agent-service image with the breakglass entrypoint.
files/breakglass-pve is the PVE forced-command (status|forensics|reset|stop|
start|cycle on VM 102, forensics-first).

Isolation: the shared claude-agent pod's terraform-state Vault policy is
explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a
prompt-injected agent on that pod can't read the root-on-devvm key.

traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls
when the emergency basic-auth password rotates (it's a subPath mount that
doesn't auto-update) — regenerated this session so Viktor has a known
emergency credential, which the auth-stack-outage failure domain requires.

Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure,
incl. the per-host from= NAT quirks) and a security.md note recording the two
new privileged footholds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-12 21:40:17 +00:00
parent 02785987dd
commit 32cf75635f
7 changed files with 629 additions and 0 deletions

View file

@ -257,6 +257,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same
**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort``authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
**Two privileged footholds for the warm break-glass UI (2026-06-12):** the in-cluster `claude-breakglass` service (`breakglass.viktorbarzin.me`, warm case = devvm wedged, cluster healthy) holds one ed25519 key (Vault `secret/claude-breakglass/ssh_key`) authorising: (1) a `breakglass` user on the **devvm** with NOPASSWD sudo (`from="10.0.20.0/24"` — the Calico-SNAT node subnet); (2) a **PVE** `authorized_keys` entry pinned to `command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2"` (pfSense's inter-VLAN SNAT IP) that only runs the verbs `status|forensics|reset|stop|start|cycle` against VM 102. The key is reachable ONLY by the breakglass pod (own namespace, no Vault role, ESO-synced); the shared `claude-agent` pod's `terraform-state` Vault policy is explicitly DENIED `secret/claude-breakglass/*`. Reset is autonomous (the agent may fire it), forensics-first. Reachable via Authentik or the basic-auth fallback — LAN-routed, not WAN-exposed. Runbook: `docs/runbooks/breakglass-ui.md`; ADR: `claude-agent-service/docs/adr/0001-breakglass-security-architecture.md`.
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.