infra/docs/plans/2026-05-30-breakglass-ssh-access-plan.md


break-glass SSH: drop port-knock for exposed key-only :52222; version host config Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00
# Break-Glass SSH Access — Implementation Plan
> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
> for history. As-built: `docs/runbooks/breakglass-ssh.md`.
> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.
**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.
**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.
**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).
**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.
---
## Pre-flight (read before starting)
- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 5 gates on the Phase 4-pre verification).
- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.
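
The anti-lockout rule can additionally be backed by an automatic rollback. The sketch below is not one of this plan's steps: `apply_with_rollback` is a hypothetical helper that backs up a config file, installs a candidate, runs a caller-supplied verify command, and restores the previous state if verification fails.

```bash
# Hypothetical helper (not part of the plan): install a new config file,
# verify it, and restore the previous version if verification fails.
# Usage: apply_with_rollback <target> <candidate> <verify command...>
apply_with_rollback() {
  local target="$1" candidate="$2"; shift 2
  local backup existed=0
  backup="$(mktemp)" || return 1
  if [ -e "$target" ]; then cp -p "$target" "$backup"; existed=1; fi
  cp "$candidate" "$target" || { rm -f "$backup"; return 1; }
  if "$@"; then
    rm -f "$backup"
    echo "APPLIED"
  else
    # Verification failed: put things back exactly as they were.
    if [ "$existed" -eq 1 ]; then cp -p "$backup" "$target"; else rm -f "$target"; fi
    rm -f "$backup"
    echo "ROLLED_BACK"
    return 1
  fi
}
```

Run on the host, this might look like `apply_with_rollback /etc/ssh/sshd_config.d/10-breakglass.conf ./10-breakglass.conf sshd -t`. It complements holding an open root session; it does not replace it.
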
---
## Phase 0 — Generate secrets (no live changes)
### Task 0.1: Break-glass SSH keypair
**Files:** none in repo (secrets → Vault).
- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
```bash
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
```
- [ ] **Step 2: Store the private key + public key in Vault**
```bash
vault kv patch secret/viktor \
breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
```
- [ ] **Step 3: Verify the keys are retrievable**
```bash
vault kv get -field=breakglass_ssh_pubkey secret/viktor
```
Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
### Task 0.2: Knock sequence
- [ ] **Step 1: Generate 3 random UDP knock ports**
```bash
KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
```
- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
```bash
vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
vault kv get -field=breakglass_knock_sequence secret/viktor
```
Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
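
Before the value goes into Vault, it can be sanity-checked. The function below is a sketch under this plan's own assumptions (exactly three distinct ports, each in the 20000-60000 range used by the generator); `valid_knock_sequence` is a name invented here.

```bash
# Hypothetical check: a knock sequence must be exactly three distinct
# ports inside the 20000-60000 range used by the generator above.
valid_knock_sequence() {
  local seq="$1" p count=0 seen=","
  local IFS=','
  for p in $seq; do
    case "$p" in ''|*[!0-9]*) return 1 ;; esac   # digits only
    [ "$p" -ge 20000 ] && [ "$p" -le 60000 ] || return 1
    case "$seen" in *",$p,"*) return 1 ;; esac   # no repeated port
    seen="$seen$p,"
    count=$((count + 1))
  done
  [ "$count" -eq 3 ]
}
```

One way to use it is to guard the write: `valid_knock_sequence "$KNOCK" && vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"`.
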
---
## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
### Task 1.1: Pre-checks (no changes yet)
- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
The break-glass key is authorized later (Task 1.2). For now, confirm from your laptop that your *existing* admin key works:
```bash
ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
```
Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
```bash
ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
```
Expected: note whether the output shows `Status: enabled/running`. If **enabled**, add the Task 1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
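
The decision in Step 2 can be written down as a small function. This is a sketch that assumes the status output contains `Status: enabled` when the PVE firewall is on, as in the expected output above; `firewall_approach` is a hypothetical name.

```bash
# Sketch: map `pve-firewall status` output to the approach described above.
# Assumes the output contains "Status: enabled/..." when the firewall is on.
firewall_approach() {
  case "$1" in
    *"Status: enabled"*) echo "pve-firewall" ;;   # add the rules via Datacenter→Firewall
    *)                   echo "raw-iptables" ;;   # proceed with Task 1.4 as written
  esac
}
```

For example: `firewall_approach "$(ssh root@192.168.1.127 'pve-firewall status 2>/dev/null')"`.
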
### Task 1.2: Authorize the break-glass key
- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
```bash
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
```
- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
```
Expected: `BREAKGLASS_KEY_OK`.
### Task 1.3: sshd dual-port + key-only
**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
- [ ] **Step 1: Write the sshd drop-in**
```bash
ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
EOF
```
- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
```bash
ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
```
Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
- [ ] **Step 3: Reload sshd (current session stays alive)**
```bash
ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
```
Expected: `RELOADED`.
- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22'
ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
```
Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
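
`sshd -t` only checks syntax. Because sshd uses the first value it obtains for most keywords, another included file could still override the drop-in. The sketch below reads `sshd -T` output (the effective configuration, printed as lowercase `keyword value` lines) and checks the settings from Step 1; `check_effective_sshd` is a hypothetical helper.

```bash
# Sketch: read `sshd -T` output on stdin and confirm the drop-in took effect.
# `sshd -T` prints the effective configuration as lowercase "keyword value" lines.
check_effective_sshd() {
  local dump want
  dump="$(cat)"
  for want in "port 22" "port 52222" "passwordauthentication no" \
              "kbdinteractiveauthentication no" "pubkeyauthentication yes"; do
    # -x: whole-line match, so "port 22" does not match "port 2222"
    printf '%s\n' "$dump" | grep -qix "$want" || { echo "MISSING: $want"; return 1; }
  done
  echo "EFFECTIVE_CONFIG_OK"
}
```

Usage, from the laptop: `ssh root@192.168.1.127 'sshd -T' | check_effective_sshd`.
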
### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
```bash
ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
# established/related always allowed
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
EOF
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
```
- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
```bash
ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
[Unit]
Description=Break-glass base firewall (SSH knock gate)
After=network-pre.target
Before=knockd.service
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
```
Expected: `FW_APPLIED`.
- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works
nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock
```
Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
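
To illustrate why knockd's per-source ACCEPT has to sit ahead of the `BREAKGLASS` jump, here is a toy first-match evaluator. It is a simplified model, not iptables: rules are `source|dport|target` lines (a format invented for this sketch), `*` is a wildcard, and connection tracking is ignored.

```bash
# Toy model (not iptables): evaluate "source|dport|target" rules from stdin
# top to bottom and print the target of the first rule that matches.
verdict() {
  local src="$1" dport="$2" r_src r_port r_target
  while IFS='|' read -r r_src r_port r_target; do
    [ "$r_src" = "*" ] || [ "$r_src" = "$src" ] || continue
    [ "$r_port" = "*" ] || [ "$r_port" = "$dport" ] || continue
    echo "$r_target"
    return 0
  done
  echo "POLICY"   # no rule matched; the chain policy would decide
}
```

With the rules `203.0.113.7|52222|ACCEPT`, `*|22|ACCEPT`, `*|52222|DROP` in that order, the knocker `203.0.113.7` gets `ACCEPT` on `52222` while every other source gets `DROP`. If the per-source rule came after the DROP it would never be reached.
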
### Task 1.5: knockd
**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
```bash
ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
```
Expected: `KNOCKD_INSTALLED`.
- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
```bash
KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180
read K1 K2 K3 <<<"$(echo "$KNOCK" | tr ',' ' ')"
ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
[options]
UseSyslog
Interface = vmbr0
[breakglass]
sequence = ${K1}:udp,${K2}:udp,${K3}:udp
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
EOF
```
- [ ] **Step 3: Enable + start knockd**
```bash
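# sed -i exits 0 even when nothing matched, so test for the line first and append only if it is missing
ssh root@192.168.1.127 "grep -q '^START_KNOCKD=' /etc/default/knockd 2>/dev/null && sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd || echo 'START_KNOCKD=1' >> /etc/default/knockd"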
ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
```
Expected: `active`.
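
The `sequence` line in Step 2 is built from the comma-separated Vault value. As a sketch, the same transformation can be written without hard-coding three variables; `knockd_sequence` is a hypothetical helper, and the `:udp` suffix follows the config above.

```bash
# Sketch: turn the comma-separated Vault value into knockd's `sequence` syntax,
# e.g. "28411,49027,33180" -> "28411:udp,49027:udp,33180:udp".
knockd_sequence() {
  local IFS=',' p out=""
  for p in $1; do
    out="${out:+$out,}${p}:udp"   # prepend a comma only after the first port
  done
  printf '%s\n' "$out"
}
```

For example, `knockd_sequence "$KNOCK"` produces the value written after `sequence =`.
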
### Task 1.6: fail2ban (defense-in-depth)
- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
```bash
ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
```
Expected: `F2B_OK` (sshd jail active).
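
One thing worth checking is which ports the jail actually bans on. The sketch below assumes the stock `[sshd]` jail lists only the `ssh` port (22), in which case a ban would not cover `:52222`. `write_breakglass_jail` is a hypothetical helper that writes an override listing both ports.

```bash
# Sketch: write a jail override that lists both SSH ports.
# Assumption: the stock [sshd] jail bans on `port = ssh` (22) only.
write_breakglass_jail() {
  cat > "$1" <<'EOF'
[sshd]
enabled = true
port = 22,52222
EOF
}
```

The resulting file could be installed on the host as `/etc/fail2ban/jail.d/breakglass.local`, followed by `fail2ban-client reload`. Confirm the stock jail's port list first rather than relying on this assumption.
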
---
## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
- [ ] **Step 1: Add the SSH break-glass forward**
- Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
- For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
After Phase 3 connects once, on the host check the observed source:
```bash
ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
```
If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (the knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
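
The check in Step 3 amounts to classifying the address knockd logged. A minimal sketch, assuming `192.168.1.1` is the edge router's LAN address as elsewhere in this plan; `classify_knock_source` is a hypothetical name.

```bash
# Sketch: classify the source address knockd logged for a knock.
# 192.168.1.1 is the edge router's LAN address (".1") in this plan.
classify_knock_source() {
  case "$1" in
    192.168.1.1) echo "snat" ;;                        # router rewrote the source
    10.*|192.168.*) echo "lan" ;;                      # other RFC 1918 address
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "lan" ;;
    *) echo "preserved" ;;                             # public address seen end to end
  esac
}
```

`snat` means the 30 s window opens for the shared router address; `preserved` means it opens per knocker IP.
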
---
## Phase 3 — Client config (laptop, no live infra change)
**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
- [ ] **Step 1: Add the SSH host block**
```bash
cat >> ~/.ssh/config <<'EOF'
Host breakglass
HostName viktorbarzin.ddns.net
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
EOF
```
(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
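- [ ] **Step 2: Add the knock+connect function**
```bash
cat >> ~/.zshrc <<'EOF'
bg() {
  local host="viktorbarzin.ddns.net"
  local seq p
  seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
  # Split via tr + command substitution (word-splits in both bash and zsh).
  # Send through bash: /dev/udp is a bash redirection, and a datagram only
  # leaves when something is written to it in the same shell that opened it.
  for p in $(printf '%s' "$seq" | tr ',' ' '); do
    bash -c "echo x > /dev/udp/$host/$p" 2>/dev/null
    sleep 0.4
  done
  sleep 0.5
  ssh breakglass "$@"
}
EOF
```
> Note: `/dev/udp/<host>/<port>` is a bash redirection feature (`/bin/bash` on macOS + Linux), so the function sends each knock through `bash -c` and works when sourced from either shell. zsh does not provide the `/dev/udp` pseudo-path, and it does not word-split an unquoted `${seq//,/ }`, which is why the loop splits with `tr`. If bash is unavailable, use `echo x | nc -u -w1 $host $p` (with empty stdin, `nc -u` has no payload to send). Defining `bg` shadows the shell's job-control builtin of the same name.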
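
To check the client side without sending packets, a dry-run variant can print what a knock would do. This is a sketch only: `bg_plan` is a hypothetical helper, and it takes the host and sequence as arguments instead of reading Vault.

```bash
# Sketch: print the knock plan instead of sending it (hypothetical helper).
# Splitting goes through tr + command substitution so it behaves the same
# in bash and zsh.
bg_plan() {
  local host="$1" seq="$2" p
  [ -n "$seq" ] || { echo "no knock sequence"; return 1; }
  for p in $(printf '%s' "$seq" | tr ',' ' '); do
    echo "udp ${host}:${p}"
  done
  echo "ssh breakglass"
}
```

For example, `bg_plan viktorbarzin.ddns.net "$(vault kv get -field=breakglass_knock_sequence secret/viktor)"` lists the three UDP targets in order, followed by the SSH step.
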
---
## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 5)
> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
- [ ] **Step 1: Without knocking, the port is silent**
```bash
nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
```
Expected: `SILENT_OK`.
- [ ] **Step 2: Knock + connect succeeds**
```bash
bg 'hostname; echo BREAKGLASS_E2E_OK'
```
Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
```bash
ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
```
Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
- [ ] **Step 4: LAN admin unaffected**
From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
**GATE:** Only proceed to Phase 5 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
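
The gate can be made mechanical with a small runner. The sketch below is an assumption, not part of the plan: `run_gate` is a hypothetical helper that takes name/command pairs, runs every check, and reports success only if all of them pass.

```bash
# Hypothetical gate runner: run each named check, report every result,
# and succeed only if all checks passed.
# Usage: run_gate <name> <command string> [<name> <command string> ...]
run_gate() {
  local failed=0 name
  while [ "$#" -ge 2 ]; do
    name="$1"
    if eval "$2" >/dev/null 2>&1; then
      echo "PASS $name"
    else
      echo "FAIL $name"
      failed=1
    fi
    shift 2
  done
  if [ "$failed" -eq 0 ]; then echo "GATE_OPEN"; else echo "GATE_CLOSED"; return 1; fi
}
```

From the external network this might be called as `run_gate silent '! nc -z -w3 viktorbarzin.ddns.net 52222' connect 'bg true'`. The LAN check in Step 4 still has to be run separately from home.
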
---
## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
> AX6000 UI. One pass, all three changes.
- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
- Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
- Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
- [ ] **Step 3: Disable UPnP**
- Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
From an external network:
```bash
nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
bg 'echo BREAKGLASS_STILL_OK'
```
Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
---
## Phase 6 — Docs + commit (AFTER infra repo is clean)
- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
```bash
git -C /home/wizard/code/infra add \
docs/plans/2026-05-30-breakglass-ssh-access-design.md \
docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
docs/architecture/vpn.md docs/architecture/security.md \
docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
git -C /home/wizard/code/infra push origin master
```
---
## Self-review
- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).