infra/docs/runbooks/breakglass-ssh.md
Viktor Barzin df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00


Runbook: Break-glass SSH

Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes cluster and its remote-access tunnels (Headscale, cloudflared) are down but the Proxmox host + edge router are up. Redesigned 2026-06-11 — the previous port-knock design is decommissioned (see "History" below).

Model (as built)

your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
                                              │ WAN tcp/52222 ─▶ 192.168.1.127:52222
                                              ▼
                                       Proxmox host 192.168.1.127
                                          sshd :52222 (key-only, break-glass key ONLY)
                                          → full LAN via ssh -J / ssh -D
  • No port-knock. Plain ssh -p 52222. The SSH key is the only gate.
  • Key-only, brute-force-proof. The exposed :52222 trusts only the dedicated break-glass key (/root/.ssh/authorized_keys.breakglass), which is kept separate from root's normal LAN-admin keys. It is therefore independently revocable, and a leak of any other root key does not grant access from the internet.
  • Rate-limited per source IP (iptables hashlimit) + fail2ban. These trim scanner noise only; key-only auth is the real protection.
  • Exposed, not hidden. :52222 answers on the WAN (Shodan-visible). This is a deliberate, documented exception to the Wave-1 "no public-IP access" policy (see docs/architecture/security.md), chosen for self-containment: it has no dependency on the cluster (unlike Headscale/cloudflared) and nothing to remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
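
The key isolation described above relies on sshd's `Match LocalPort`. The versioned drop-in is scripts/sshd-10-breakglass.conf; the sketch below is not a copy of it, only an illustration of the technique. It assumes that the LAN port 22 stays open with root's normal keys, that the main sshd_config sets no `Port` itself, and that the hardening options listed are the ones in use. It writes to a scratch file (the hypothetical `OUT` variable), not to /etc/ssh.

```shell
#!/bin/sh
# Sketch only: illustrates Match LocalPort isolation, NOT the real
# scripts/sshd-10-breakglass.conf. OUT is a hypothetical scratch path.
OUT="${OUT:-./10-breakglass.sketch.conf}"

cat > "$OUT" <<'EOF'
# Listen on both ports. Once any Port line is set, the default 22 must be
# listed explicitly or sshd stops listening on it.
# (Assumes the main sshd_config does not set Port itself.)
Port 22
Port 52222

# Connections arriving on :52222 are checked against the break-glass key
# file only; root's normal authorized_keys is not consulted for them.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    PubkeyAuthentication yes
    PasswordAuthentication no
    KbdInteractiveAuthentication no
    PermitRootLogin prohibit-password
EOF

echo "wrote $OUT"
```

Because `AuthorizedKeysFile` is overridden only inside the `Match` block, connections on port 22 keep using the default key file, which is what makes the two key sets independently revocable.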

Secrets (Vault secret/viktor)

| Key | Use |
|---|---|
| breakglass_ssh_pubkey | authorized on the host (authorized_keys.breakglass) |
| breakglass_ssh_privkey | the private key (also on your laptop at ~/.ssh/breakglass_ed25519) |

The key has no passphrase (so it works in a true cold event without anything to recall). Treat the private key as the sole credential — guard the laptop copy.

Leftover: breakglass_knock_sequence is dead (knock decommissioned). It is inert; remove it when you have a Vault token with the patch capability (vault kv patch / merge-patch — the everyday token lacks it).

Connect

Client ~/.ssh/config:

Host breakglass
    HostName viktorbarzin.ddns.net        # follows the dynamic WAN IP
    Port 52222
    User root
    IdentityFile ~/.ssh/breakglass_ed25519
    IdentitiesOnly yes

Then:

ssh breakglass                              # shell on the Proxmox host
ssh -J breakglass root@10.0.20.1            # jump to pfSense (or any LAN host)
ssh -D 1080 breakglass                      # SOCKS5 → reach any internal IP

There is no bg() knock function anymore — delete it from your shell rc if you added it under the old design.

Cold-event IP cheat sheet (cluster DNS is down)

| Host | IP |
|---|---|
| Proxmox host | 192.168.1.127 |
| pfSense | 10.0.20.1 (WAN 192.168.1.2) |
| k8s API | 10.0.20.100 |
| Synology NAS | 192.168.1.13 (reach via ssh -J breakglass) |
| edge router | 192.168.1.1 |

Deploy / re-provision the host config

Source of truth lives in infra/scripts/. To (re)deploy, run the following from the infra/ repo root (the scripts/ paths are relative to it):

# 1. break-glass key authorized for the exposed port
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"

# 2. sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'

# 3. firewall (rate-limit) + boot unit
scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'

# 4. fail2ban jail
scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
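
After step 2 it can be worth confirming that the port-specific settings actually took effect. `sshd -T` prints the effective configuration, and `-C` supplies a hypothetical connection so that `Match` blocks are evaluated. The helper below is a sketch, not part of the repo: `check_breakglass_effective` is a made-up name, and the client address in the usage comment is a documentation placeholder.

```shell
#!/bin/sh
# Hypothetical helper: reads `sshd -T` output on stdin and checks the two
# settings that matter for the exposed port. Returns non-zero on mismatch.
check_breakglass_effective() {
    awk '
        $1 == "authorizedkeysfile"     { keys = $2 }
        $1 == "passwordauthentication" { pw = $2 }
        END {
            ok = (keys == "/root/.ssh/authorized_keys.breakglass" && pw == "no")
            if (!ok) print "unexpected: authorizedkeysfile=" keys " passwordauthentication=" pw
            exit ok ? 0 : 1
        }
    '
}

# Usage (from the LAN; 203.0.113.10 and client.example are placeholders):
#   ssh root@192.168.1.127 \
#     'sshd -T -C user=root,host=client.example,addr=203.0.113.10,lport=52222' \
#     | check_breakglass_effective
```

Running the same command with `lport=22` should fail this check if port 22 still uses root's normal key file, which is the expected outcome.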

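The breakglass-firewall.service unit that step 3 enables (Type=oneshot, RemainAfterExit=yes, ordered around network-pre.target so the rules load early in boot) is a manual host unit that none of the steps above install. Recreate it before running step 3 if the host is rebuilt: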

[Unit]
Description=Break-glass base firewall (key-only SSH on :52222)
After=network-pre.target
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
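
The script this unit runs is versioned as scripts/breakglass-firewall.sh and is not reproduced here. As a rough illustration of per-source-IP rate limiting with the iptables hashlimit match, a minimal sketch might look like the following. The rate, burst and bucket name are assumptions, and the `APPLY` switch is hypothetical; by default the sketch only prints the rules.

```shell
#!/bin/sh
# Sketch only, NOT the versioned scripts/breakglass-firewall.sh.
# Assumed values: 6 new connections/minute per source IP, burst of 3.
PORT=52222
RATE="6/minute"
BURST=3

# Print the iptables commands, one per line.
breakglass_rules() {
    # Drop NEW connections from any single source IP that exceeds the rate.
    echo "iptables -A INPUT -p tcp --dport $PORT -m conntrack --ctstate NEW" \
         "-m hashlimit --hashlimit-name breakglass --hashlimit-mode srcip" \
         "--hashlimit-above $RATE --hashlimit-burst $BURST -j DROP"
    # Everything else to the port is accepted; key-only sshd is the real gate.
    echo "iptables -A INPUT -p tcp --dport $PORT -j ACCEPT"
}

# Nothing is applied unless APPLY=1 (hypothetical switch); needs root.
if [ "${APPLY:-0}" = "1" ]; then
    breakglass_rules | sh -e
else
    breakglass_rules
fi
```

Note that `-A` appends, so running such a script twice by hand would duplicate the rules; a production script would typically check with `iptables -C` or flush a dedicated chain first.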

Edge-router forward (manual — live device, not Terraform)

TP-Link Archer AX6000 (192.168.1.1) → Advanced → NAT Forwarding → Port Forwarding. The break-glass rule:

| Service Name | Device IP | External Port | Internal Port | Protocol |
|---|---|---|---|---|
| breakglass-ssh | 192.168.1.127 | 52222 | 52222 | TCP |

AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):

  • External port must equal internal port. The firmware rejects any remap (e.g. 22 → 52222) with "External Port: This item conflicts with existed ones." Hence ext==int 52222.
  • Port 22 is reserved — even 22 → 22 is refused. Break-glass cannot use 22.
  • Row delete is immediate (no confirm dialog) — clicking the trash icon removes the rule and toasts "Operation succeeded".
  • Automation: ~/wizard/tools/insecure-browse/add-forward.{sh,js} (dockerized Playwright; double-gated save DRY_RUN=0 CONFIRM_SAVE=1; supports RULES_JSON add, EDIT_RULES_JSON protocol-edit, DELETE_RULES_JSON identity-guarded delete). Router password: Vault secret/viktor/edge_router_192_168_1_1_password.
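
The first two quirks can be encoded as a pre-check to run before touching the router. This is a hypothetical helper for illustration only; it is not part of add-forward.{sh,js} and makes no assumption about that tool's JSON format.

```shell
#!/bin/sh
# Hypothetical pre-check encoding the two AX6000 constraints noted above.
# Usage: check_forward EXTERNAL_PORT INTERNAL_PORT
check_forward() {
    ext="$1"; int="$2"
    if [ "$ext" != "$int" ]; then
        echo "rejected: external port $ext must equal internal port $int" >&2
        return 1
    fi
    if [ "$ext" = "22" ]; then
        echo "rejected: port 22 is reserved on the router" >&2
        return 1
    fi
    echo "ok: tcp/$ext -> $int"
}
```

For example, `check_forward 52222 52222` passes, while `check_forward 22 52222` and `check_forward 22 22` are both rejected, matching the behaviour observed on the device.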

Rotate / revoke

  • Revoke instantly: remove the line from /root/.ssh/authorized_keys.breakglass.
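  • Rotate the key: generate a new pair with ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519 (leave the passphrase empty; the key is deliberately passphrase-less), store both halves with vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=... (needs a token with the patch capability), then redeploy step 1 above. Step 1 overwrites authorized_keys.breakglass, so the old key is revoked at the same time.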
  • Router reset wipes forwards: re-add the breakglass-ssh rule above.
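
Revocation is a one-line edit, but when the file holds more than one key it is easy to remove the wrong line by hand. A minimal sketch of a helper that drops lines by fixed-string match follows; `revoke_key` and the key comment in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop every line containing PATTERN (a fixed string,
# e.g. the key comment) from an authorized_keys-style file.
# Usage: revoke_key FILE PATTERN
revoke_key() {
    file="$1"; pattern="$2"
    tmp="$(mktemp "${file}.XXXXXX")" || return 1
    # grep -v exits 1 when no lines remain; that is a valid result here
    # (an empty file authorizes nobody), so only exit codes >1 are errors.
    grep -vF -- "$pattern" "$file" > "$tmp"
    [ $? -le 1 ] || { rm -f "$tmp"; return 1; }
    # Keep the file private, then replace the original in one rename.
    chmod 600 "$tmp" && mv "$tmp" "$file"
}
```

On the host this would be called as, for example, `revoke_key /root/.ssh/authorized_keys.breakglass old-laptop`. sshd reads the file on each authentication attempt, so new logins with the removed key fail immediately; sessions that are already open are not terminated.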

History

  • 2026-05-30: original design — key-only SSH on :52222 gated behind a UDP port-knock (knockd). Decommissioned 2026-06-11: the knock added no real security (the SSH key already makes the port brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency: the knock sequence lived only in in-cluster Vault, unreachable in the exact cold/away scenario break-glass exists for. That caused a real lockout. The knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22) were removed.