infra/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
Viktor Barzin df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00


# Break-glass SSH — Redesign
- **Date**: 2026-06-11
- **Status**: Implemented
- **Owner**: Viktor
- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
## Why redesign
The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
**port-knock** (knockd). It caused a real lockout, for a structural reason:
- The knock sequence was 3 random ports stored **only** in Vault, and the client
helper fetched it from Vault at connect time.
- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
exact scenario break-glass exists for — away from home, cluster/tunnels down —
the knock sequence is unreachable and unmemorable. Circular dependency.

The knock's only benefit was hiding an already brute-force-proof port; its cost
was that fragility. For a *recovery* path, robustness beats stealth.
## Decision
**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
Hardened with: the exposed port trusts only a dedicated break-glass key
(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
"pfSense wedged" / "host down" tiers are explicitly out of scope).

Alternatives considered and rejected: keeping the knock (fragile, circular);
Tailscale-on-pfSense (briefly chosen, then dropped: it reintroduces the upstream
dependency that self-hosting Headscale was meant to avoid, and Viktor preferred a
self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
self-contained SSH path was preferred).
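
The per-port key isolation can be sketched as an sshd drop-in. This is a minimal illustration, not the contents of `scripts/sshd-10-breakglass.conf`: the absolute path of `authorized_keys.breakglass` and the `MaxAuthTries` value are assumptions, and the script writes to a scratch directory rather than to `/etc/ssh/`.

```shell
#!/bin/sh
# Sketch only: writes an illustrative sshd drop-in to a scratch directory,
# not to /etc/ssh/sshd_config.d/. Key path and MaxAuthTries are assumptions.
set -eu

out_dir="${OUT_DIR:-$(mktemp -d)}"
conf="$out_dir/10-breakglass.conf"

cat > "$conf" <<'EOF'
# Listen on both ports: :22 for LAN admin, :52222 for the WAN break-glass path.
Port 22
Port 52222

# Key-only everywhere.
PasswordAuthentication no
KbdInteractiveAuthentication no

# Connections arriving on :52222 may only use the dedicated break-glass key.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    MaxAuthTries 3

# Close the Match block so later settings are not scoped to :52222.
Match all
EOF

echo "wrote $conf"
```

The trailing `Match all` matters when the drop-in is included ahead of other settings: without it, everything parsed afterwards would apply only to `:52222`. A real deployment would presumably validate the result with `sshd -t` before reloading.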
## Components
| Layer | Change | Source of truth |
|---|---|---|
| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
| knockd | **removed** (package purged, config deleted) | — |
| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
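
The `BREAKGLASS` chain in the host-firewall row might look roughly like the following. This is a hedged sketch, not `scripts/breakglass-firewall.sh`: the rate (`6/minute`), burst, and LAN subnet are assumptions, and by default the script only prints the `iptables` commands instead of applying them.

```shell
#!/bin/sh
# Sketch only: prints the iptables commands instead of running them.
# Set IPT=iptables (as root) to apply. Rate, burst and LAN subnet are assumptions.
set -eu

IPT="${IPT:-echo iptables}"
PORT=52222
LAN_NET="${LAN_NET:-192.168.1.0/24}"   # assumed LAN range

build_rules() {
    # Dedicated chain so the break-glass rules can be rebuilt as a unit.
    $IPT -N BREAKGLASS
    $IPT -A INPUT -p tcp --dport "$PORT" -m conntrack --ctstate NEW -j BREAKGLASS

    # LAN sources bypass the rate limit.
    $IPT -A BREAKGLASS -s "$LAN_NET" -j ACCEPT

    # Per-source limit: each source IP gets its own bucket.
    $IPT -A BREAKGLASS -m hashlimit --hashlimit-name breakglass \
        --hashlimit-mode srcip --hashlimit-upto 6/minute --hashlimit-burst 3 \
        -j ACCEPT

    # New connections over the limit are dropped.
    $IPT -A BREAKGLASS -j DROP
}

rules="$(build_rules)"
printf '%s\n' "$rules"
```

Only new connections are sent through the chain, so an established session is not throttled by the limit that governs connection attempts.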
## Edge-router constraints discovered (TP-Link AX6000)
- **No port remapping** — external port must equal internal port (rejects e.g.
`22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
sides.
- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
(Viktor's initial preference); `:52222` is the landed port.
- **Row delete is immediate** (no confirm dialog).
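
The first two constraints can be expressed as a small pre-flight check before touching the router UI. `check_forward` is a hypothetical helper written for illustration; it does not talk to the router and is not part of the repo.

```shell
#!/bin/sh
# Sketch only: encodes the two router constraints as a local check.
# check_forward is a hypothetical helper; it does not contact the router.

check_forward() {
    ext="$1"
    int="$2"
    if [ "$ext" -ne "$int" ]; then
        echo "rejected: external port $ext must equal internal port $int"
        return 1
    fi
    if [ "$ext" -eq 22 ]; then
        echo "rejected: port 22 is reserved by the router"
        return 1
    fi
    echo "ok: $ext -> $int"
}

check_forward 52222 52222
check_forward 22 52222 || true
check_forward 22 22 || true
```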
## Security posture
- **Brute force: not practical** (key-only, password authentication disabled).
- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
`authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
snoopy execve to Loki).
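
Revoking a stolen key amounts to deleting its line from `authorized_keys.breakglass`. The sketch below demonstrates that on a scratch copy. The key material and comment fields are placeholders, and matching on the comment field is an assumption about how the keys are labelled.

```shell
#!/bin/sh
# Sketch only: demonstrates revocation on a scratch copy, not the live file.
# Key material and comments below are placeholders.
set -eu

keys="$(mktemp)"
cat > "$keys" <<'EOF'
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEXAMPLEKEEP breakglass-current
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEXAMPLEDROP breakglass-stolen
EOF

revoke="breakglass-stolen"   # comment field of the key to revoke (placeholder)

# Keep every line whose last field is not the revoked key's comment.
awk -v c="$revoke" '$NF != c' "$keys" > "$keys.new"
mv "$keys.new" "$keys"

cat "$keys"
```

sshd reads the authorized-keys file at each authentication attempt, so no reload is needed after the edit. Sessions already opened with the revoked key stay open until they are closed or killed.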
## Verification (2026-06-11)
- `:52222` reachable; break-glass key authenticates (`root@pve`).
- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.