break-glass SSH: drop port-knock for exposed key-only :52222; version host config

Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-11 18:23:39 +00:00
parent e2788d1b2d
commit df332b59e6
9 changed files with 989 additions and 1 deletions


@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
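
A minimal sketch of what the `Match LocalPort` carve-out could look like, printed by a small shell helper so it can be reviewed before it is written anywhere. The helper name `breakglass_match_block` and the key-file path `/root/.ssh/authorized_keys.breakglass` are assumptions for illustration; the as-built runbook is the authority on the real values.

```bash
# Hypothetical helper: print an sshd drop-in fragment for the exposed port.
# Path and helper name are assumptions, not taken from the as-built config.
breakglass_match_block() {
  cat <<'EOF'
# Applies only to connections that arrive on the WAN-forwarded port
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    PasswordAuthentication no
EOF
}

# Usage (review first, then redirect into a file under /etc/ssh/sshd_config.d/
# and run `sshd -t` before reloading):
#   breakglass_match_block
```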
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.


@ -0,0 +1,285 @@
# Break-Glass SSH Access — Design
> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`.
> The port-knock was removed: it added no real security (the SSH key already
> makes the port brute-force-proof) and its knock sequence lived only in
> in-cluster Vault — unreachable in the exact cold/away scenario break-glass
> exists for, which caused a real lockout. Retained for history. As-built:
> `docs/runbooks/breakglass-ssh.md`.
- **Date**: 2026-05-30
- **Status**: Draft — pending user review
- **Owner**: Viktor
- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1)
## 1. Goal
Provide a **cold, brute-force-proof backdoor onto the home LAN from the public
internet** for the case where the Kubernetes cluster and every cluster-hosted
remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster
WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**.
### Hard requirements (from the user)
1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are
down. The path must touch **nothing in the cluster** (no Authentik, Traefik,
Technitium/AdGuard DNS, cloudflared).
2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology,
k8s API, etc.).
3. **No brute force**: no password-guessable surface.
4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard /
Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only.
5. **Minimal effort**, and ideally **honor the locked Wave 1 policy**
(`no public-IP access — … PVE sshd must transit LAN or Headscale`).
## 2. Decision
**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.**
- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box
(`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control)
and it sits directly on the `192.168.1.0/24` segment, so the path **does not
traverse pfSense or the cluster** — it survives a wedged pfSense too, not just
a down cluster.
- SSH is the only externally-usable remote tool **pre-installed on every
Linux/macOS box**, satisfying requirement 4.
- **Key-only auth** (no passwords anywhere) makes password brute force
impossible → requirement 3.
- A **port-knock** keeps the external SSH port **closed/invisible to scanners**
until a knock sequence is sent. This restores the "no standing public service"
property we'd have had with WireGuard and keeps us within the **intent** of the
Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a
**bash `/dev/udp` one-liner** — zero install.
### Alternatives rejected
| Option | Why rejected |
|---|---|
| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. |
| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). |
| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. |
| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. |
| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). |
## 3. Architecture
```
Your laptop (anywhere) — stock ssh + bash, nothing installed
│ (1) UDP knock sequence → bash: echo > /dev/udp/<pub>/<port> (instant, no handshake)
│ (2) ssh -p 52222 root@<pub>
Edge router 192.168.1.1 (the box the stored password unlocks)
│ forwards: UDP <k1>,<k2>,<k3> + TCP 52222 → 192.168.1.127
Proxmox host 192.168.1.127 ← path bypasses pfSense entirely
├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s)
├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only
└─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN
```
**Why it meets "cold + full LAN":** the host is up by definition of the chosen
failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host
you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to
the VLANs via pfSense when pfSense is up) or by using SSH's built-in
`-J`/`-D` — both stock, no install.
## 4. Components
### 4.1 Edge router @ 192.168.1.1 (manual, in the browser)
Add port-forwards (same place the existing `51821` WireGuard forward lives):
- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale)
- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault)
If the router supports a **port range** forward, a single range covering the
knock ports + 52222 is tidier than four rules.
> **Verify (#1 implementation check):** whether `.1` **preserves the source IP**
> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by
> knocking + connecting from an external network and checking `/var/log/auth.log`
> + `knockd` syslog for the observed source IP. The design works either way (see
> §4.3), but it determines knock granularity.
### 4.2 SSH keys & Vault layout
- Mint a **dedicated** break-glass keypair (ed25519), separate from
`secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly
labelled.
- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=`
restriction — break-glass is from-anywhere; the knock + key are the gate).
- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for
re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519`
(chmod 600).
- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out
of git — obscurity value only; see §5).
### 4.3 Proxmox host — sshd hardening
`/etc/ssh/sshd_config.d/10-breakglass.conf`:
```
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
# key-only root (PVE recovery norm)
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
```
- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external,
knock-gated)**. Using a dedicated external port (not a DNAT rewrite to 22)
lets the firewall distinguish LAN vs external **regardless of `.1` SNAT
behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate.
- **Default to root key-only** for recovery practicality. *Alternative for
review:* a dedicated `breakglass` sudo user instead of root.
> **Verify (#2):** key login already works for your normal access **before**
> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs
> already use keys, so this is likely already effectively true.)
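
One way to make this check repeatable is to inspect the effective configuration that `sshd -T` prints rather than the drop-in file itself. The sketch below assumes the lowercase `keyword value` output format of `sshd -T`; `check_effective_sshd` is a hypothetical helper name, not an OpenSSH tool.

```bash
# Sketch: read `sshd -T`-style output on stdin and confirm the hardening took effect.
check_effective_sshd() (
  cfg="$(cat)"
  # sshd -T prints lowercase "keyword value" pairs, one per line
  if ! printf '%s\n' "$cfg" | grep -qx 'passwordauthentication no'; then
    echo "password auth still enabled"; exit 1
  fi
  for port in 22 52222; do
    if ! printf '%s\n' "$cfg" | grep -qx "port $port"; then
      echo "port $port not configured"; exit 1
    fi
  done
  echo EFFECTIVE_CONFIG_OK
)

# Usage on the host (needs root, so it is left commented out here):
#   sshd -T | check_effective_sshd
```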
### 4.4 Host firewall (knock gate)
Default-drop the external SSH port; knockd punches a per-source hole. LAN admin
(`:22`) and established sessions are untouched:
```
# allow established / related
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin + backups: SSH on :22 always allowed
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default — knockd opens it per-source
iptables -A INPUT -p tcp --dport 52222 -j DROP
```
- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables
drops them — the knock ports stay **silent/closed** to scanners.
- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is
enabled. If it is, express these rules through it (or a dedicated chain) so a
pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs
often have it off at datacenter level.
### 4.5 knockd
`apt install knockd` (Debian/PVE). `/etc/knockd.conf`:
```
[options]
UseSyslog
Interface = vmbr0 # the 192.168.1.127 interface
[breakglass]
sequence = <k1>:udp,<k2>:udp,<k3>:udp # real ports from Vault
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
```
- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang
on the client (a TCP knock to a dropped port would block until timeout).
- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session
established within that window **persists** via conntrack ESTABLISHED after the
rule is removed. Enable + start the `knockd` service.
### 4.6 fail2ban (defense-in-depth)
`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures).
Local to the host, **no cluster dependency**. Catches anything that gets past the
knock to the sshd listener.
### 4.7 Client side (laptop — stock tools only)
`~/.ssh/config`:
```
Host breakglass
HostName <public-ip-or-dyndns>
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
```
Knock + connect — a shell function using **bash builtins only** (works on
macOS `/bin/bash` + Linux; UDP send is instant):
```sh
bg() {
local host=<public-ip-or-dyndns>
for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done
sleep 0.5
ssh breakglass "$@"
}
```
- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or
`ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080`
to reach any internal IP. From the host shell you already have everything.
- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in
the `Host breakglass` block so plain `ssh breakglass` knocks automatically.
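
The optional `ProxyCommand` variant could be sketched roughly as follows. This is an illustration under assumptions, not the shipped helper: `knock_and_relay`, the `DRY_RUN` switch, and the script path in the comment are hypothetical names, and it assumes `nc` is available on the client (it is stock on macOS and most Linux distributions).

```bash
# Sketch of a ProxyCommand wrapper; names and paths are illustrative.
# Intended ~/.ssh/config usage (script path is an assumption):
#   Host breakglass
#       ProxyCommand /usr/local/bin/bg-proxy %h %p <k1> <k2> <k3>
# As a standalone script, the file would end with: knock_and_relay "$@"
knock_and_relay() (
  host="$1"; port="$2"; shift 2
  for p in "$@"; do
    if [ -n "${DRY_RUN:-}" ]; then
      echo "knock udp/$host/$p"
    else
      # /dev/udp is a bash feature, so the send is delegated to bash explicitly
      bash -c 'echo -n x > "/dev/udp/$0/$1"' "$host" "$p"
      sleep 0.4
    fi
  done
  if [ -n "${DRY_RUN:-}" ]; then
    echo "relay tcp/$host/$port"
  else
    sleep 0.5
    exec nc "$host" "$port"   # ssh talks to sshd through nc's stdin/stdout
  fi
)
```

Setting `DRY_RUN=1` prints the planned knocks and the relay target without sending any packet, which is a cheap way to check the argument order before trusting it in a real outage.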
### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down)
Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold
event. Use IPs:
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API server | `10.0.20.100` |
| Synology NAS | `192.168.1.13` |
| Edge router | `192.168.1.1` |
| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` |
## 5. Security analysis
- **Brute force: solved.** No password auth anywhere → password guessing is
impossible; key brute force is cryptographically infeasible.
- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is
default-dropped and the knock ports are pcap-sniffed (never answered), so a
scanner sees a closed/silent host — PVE sshd is **not internet-scannable**,
honouring the spirit of "no public-IP access to PVE sshd".
- **The knock is obscurity, not cryptography.** A port-knock sequence is
plaintext and replayable by a passive on-path observer. **The SSH key is the
real access control** — the knock only removes the standing/scannable surface.
(Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the
knock sequence as a secret-ish convenience, not a second cryptographic factor.
- **Residual risks** (none are brute force):
1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep
PVE patched; short `cmd_timeout`; fail2ban.
2. **Private key theft** → mitigation: key has a passphrase; revoke by removing
the line from `authorized_keys`.
3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared
`192.168.1.1` source — anyone else arriving via `.1` in that window could
reach the sshd banner, but still needs your key. Mitigated by the short
window + key-only + fail2ban.
- **Deliberate, documented exception** to the Wave 1 "no public-IP access"
policy, scoped to this single knock-gated port. To be recorded in
`security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation.
## 6. What's automated vs manual
- **I do**: generate the keypair + knock sequence, store them in Vault, produce
the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client
`~/.ssh/config` + `bg()` function, and write the runbook + doc updates.
- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by
you in the browser (out-of-Terraform, live device). The Proxmox host changes
(sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login
verified first** to avoid lockout; pfSense is **not** touched. None of this is
a `tg apply` — pfSense and the edge router are not Terraform-managed.
## 7. Testing & verification
1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog
shows the sequence + opens `:52222`; SSH succeeds.
2. **Without** knocking: `ssh -p 52222` from external → connection refused/timed
out (port closed). A plain port scan of `52222` + the knock ports → silent.
3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected.
4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to
an internal IP.
5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity
note accordingly.
## 8. Failure modes & rotation
- **Proxmox host down** (not just cluster): this path is gone — that's the
out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**.
- **`.1` router config reset**: forwards lost → re-add from this doc; consider
exporting the `.1` config for backup.
- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it
auto-follows; keep the raw IP as fallback.
- **Key/knock compromise**: remove the `authorized_keys` line (kills access
instantly); rotate the knock sequence in `knockd.conf` + Vault.
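
Revoking the key by hand-editing `authorized_keys` during an incident is error-prone, so a small helper may be worth keeping next to the runbook. The sketch below assumes the break-glass key carries a unique, space-free comment (such as the `breakglass-YYYYMMDD` label used at generation time); `revoke_key_by_comment` is a hypothetical name.

```bash
# Sketch: drop every authorized_keys line whose last field equals the given comment.
revoke_key_by_comment() (
  file="$1"; comment="$2"
  tmp="$(mktemp)" || exit 1
  # Keep every line whose last whitespace-separated field is not the comment
  awk -v c="$comment" '$NF != c' "$file" > "$tmp" || { rm -f "$tmp"; exit 1; }
  # Write back in place so the file keeps its owner and mode
  cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Usage on the host (example comment value):
#   revoke_key_by_comment /root/.ssh/authorized_keys breakglass-20260530
```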
## 9. Out of scope
- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier.
- Phone access (would need an SSH **app**, e.g. Termius — outside the
"pre-installed Linux/macOS" constraint; laptop is the target).
## 10. Docs to update on implementation
- `docs/architecture/vpn.md` — add a "Break-glass SSH" section.
- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md`:
record the deliberate knock-gated exception to "no public PVE sshd".
- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure.


@ -0,0 +1,395 @@
# Break-Glass SSH Access — Implementation Plan
> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
> for history. As-built: `docs/runbooks/breakglass-ssh.md`.
> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.
**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.
**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.
**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).
**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.
---
## Pre-flight (read before starting)
- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 5 gates on Phase 4-pre verification).
- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.
---
## Phase 0 — Generate secrets (no live changes)
### Task 0.1: Break-glass SSH keypair
**Files:** none in repo (secrets → Vault).
- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
```bash
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
```
- [ ] **Step 2: Store the private key + public key in Vault**
```bash
vault kv patch secret/viktor \
breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
```
- [ ] **Step 3: Verify the keys are retrievable**
```bash
vault kv get -field=breakglass_ssh_pubkey secret/viktor
```
Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
### Task 0.2: Knock sequence
- [ ] **Step 1: Generate 3 random UDP knock ports**
```bash
KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
```
- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
```bash
vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
vault kv get -field=breakglass_knock_sequence secret/viktor
```
Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
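
Before the value is used to render `knockd.conf`, a quick format check can catch a truncated or hand-edited sequence. This is a sketch assuming the comma-separated format and the `20000-60000` range from Step 1; `valid_knock_sequence` is a hypothetical helper, not part of knockd or Vault.

```bash
# Sketch: succeed only for exactly three distinct ports within 20000-60000.
valid_knock_sequence() (
  seq="$1"
  # Reject empty input and anything other than digits and commas
  case "$seq" in
    ''|*[!0-9,]*) exit 1 ;;
  esac
  IFS=,
  set -- $seq            # split on commas into $1 $2 $3
  [ "$#" -eq 3 ] || exit 1
  for p in "$@"; do
    [ -n "$p" ] || exit 1
    [ "$p" -ge 20000 ] && [ "$p" -le 60000 ] || exit 1
  done
  # shuf -n 3 never repeats a value, so a duplicate means the entry was altered
  [ "$1" != "$2" ] && [ "$1" != "$3" ] && [ "$2" != "$3" ]
)

# Usage:
#   valid_knock_sequence "$KNOCK" || echo "bad knock sequence"
```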
---
## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
### Task 1.1: Pre-checks (no changes yet)
- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
From your laptop, with the break-glass key authorized later — for now confirm your *existing* admin key works:
```bash
ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
```
Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
```bash
ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
```
Expected: note whether `Status: enabled/running`. If **enabled**, add the Phase-1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
### Task 1.2: Authorize the break-glass key
- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
```bash
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
```
- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
```
Expected: `BREAKGLASS_KEY_OK`.
### Task 1.3: sshd dual-port + key-only
**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
- [ ] **Step 1: Write the sshd drop-in**
```bash
ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
EOF
```
- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
```bash
ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
```
Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
- [ ] **Step 3: Reload sshd (current session stays alive)**
```bash
ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
```
Expected: `RELOADED`.
- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22'
ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
```
Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
```bash
ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
# established/related always allowed
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
EOF
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
```
- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
```bash
ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
[Unit]
Description=Break-glass base firewall (SSH knock gate)
After=network-pre.target
Before=knockd.service
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
```
Expected: `FW_APPLIED`.
- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works
nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock
```
Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
### Task 1.5: knockd
**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
```bash
ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
```
Expected: `KNOCKD_INSTALLED`.
- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
```bash
KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180
read K1 K2 K3 <<<"$(echo "$KNOCK" | tr ',' ' ')"
ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
[options]
UseSyslog
Interface = vmbr0
[breakglass]
sequence = ${K1}:udp,${K2}:udp,${K3}:udp
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
EOF
```
- [ ] **Step 3: Enable + start knockd**
```bash
ssh root@192.168.1.127 "grep -q '^START_KNOCKD=' /etc/default/knockd 2>/dev/null && sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd || echo 'START_KNOCKD=1' >> /etc/default/knockd"
ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
```
Expected: `active`.
### Task 1.6: fail2ban (defense-in-depth)
- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
```bash
ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
```
Expected: `F2B_OK` (sshd jail active).
---
## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
- [ ] **Step 1: Add the SSH break-glass forward**
- Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
- For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
After Phase 3 connects once, on the host check the observed source:
```bash
ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
```
If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
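
The public-versus-private decision in this step can be written down as a tiny classifier, which is handy when the address is copied out of the knockd log. A minimal sketch, assuming IPv4 only and no input validation; `classify_source_ip` is a hypothetical name.

```bash
# Sketch: "private" (RFC 1918) suggests the router SNATs,
# "public" suggests the original source address is preserved.
classify_source_ip() (
  ip="$1"
  case "$ip" in
    10.*|192.168.*) echo private ;;
    172.*)
      # 172.16.0.0/12 covers second octets 16 through 31
      second="${ip#172.}"; second="${second%%.*}"
      if [ "$second" -ge 16 ] 2>/dev/null && [ "$second" -le 31 ]; then
        echo private
      else
        echo public
      fi ;;
    *) echo public ;;
  esac
)

# Usage:
#   classify_source_ip 192.168.1.1
```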
---
## Phase 3 — Client config (laptop, no live infra change)
**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
- [ ] **Step 1: Add the SSH host block**
```bash
cat >> ~/.ssh/config <<'EOF'
Host breakglass
HostName viktorbarzin.ddns.net
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
EOF
```
(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
- [ ] **Step 2: Add the knock+connect function**
```bash
cat >> ~/.zshrc <<'EOF'
bg() {
local host="viktorbarzin.ddns.net"
local seq; seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
[ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
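# /dev/udp is a bash feature and zsh does not word-split ${seq//,/ }, so run the knock loop in bash
bash -c 'for p in ${1//,/ }; do echo -n x > "/dev/udp/$0/$p"; sleep 0.4; done' "$host" "$seq" 2>/dev/null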
sleep 0.5
ssh breakglass "$@"
}
EOF
```
> Note: the `/dev/udp` redirection is a bash feature (`/bin/bash` on macOS + Linux). zsh does not implement `/dev/udp`, so under zsh the knock has to run through bash (for example via `bash -c`) or be sent with `echo x | nc -u -w1 "$host" "$p"`.
---
## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 5)
> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
- [ ] **Step 1: Without knocking, the port is silent**
```bash
nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
```
Expected: `SILENT_OK`.
- [ ] **Step 2: Knock + connect succeeds**
```bash
bg 'hostname; echo BREAKGLASS_E2E_OK'
```
Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
```bash
ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
```
Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
- [ ] **Step 4: LAN admin unaffected**
From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
**GATE:** Only proceed to Phase 5 once Steps 1-4 pass. If any fail, fix before removing the legacy forward.
---
## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
> AX6000 UI. One pass, all three changes.
- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
- Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
- Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
- [ ] **Step 3: Disable UPnP**
- Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
From an external network:
```bash
nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
bg 'echo BREAKGLASS_STILL_OK'
```
Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
---
## Phase 6 — Docs + commit (AFTER infra repo is clean)
- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
```bash
git -C /home/wizard/code/infra add \
docs/plans/2026-05-30-breakglass-ssh-access-design.md \
docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
docs/architecture/vpn.md docs/architecture/security.md \
docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
git -C /home/wizard/code/infra push origin master
```
---
## Self-review
- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).

@ -0,0 +1,73 @@
# Break-glass SSH — Redesign
- **Date**: 2026-06-11
- **Status**: Implemented
- **Owner**: Viktor
- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
## Why redesign
The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
**port-knock** (knockd). It caused a real lockout, for a structural reason:
- The knock sequence was 3 random ports stored **only** in Vault, and the client
helper fetched it from Vault at connect time.
- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
exact scenario break-glass exists for — away from home, cluster/tunnels down —
the knock sequence is unreachable and unmemorable. Circular dependency.
The knock's only benefit was hiding an already brute-force-proof port; its cost
was that fragility. For a *recovery* path, robustness beats stealth.
## Decision
**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
Hardened with: the exposed port trusts only a dedicated break-glass key
(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
"pfSense wedged" / "host down" tiers are explicitly out of scope).
Alternatives considered and rejected: keeping the knock (fragile, circular);
Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
dependency Headscale is self-hosted to avoid, and the user preferred a
self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
self-contained SSH path was preferred).
## Components
| Layer | Change | Source of truth |
|---|---|---|
| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
| knockd | **removed** (package purged, config deleted) | — |
| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
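The sshd row can be illustrated with a small generator that prints a candidate drop-in. This is a sketch only, not the contents of `scripts/sshd-10-breakglass.conf`: the function name `render_breakglass_dropin` and the `MaxAuthTries 3` value are assumptions, while the ports, the `Match LocalPort` / `Match all` structure and the `authorized_keys.breakglass` path come from this design.

```bash
# Hypothetical sketch: print a candidate sshd drop-in to stdout.
# NOT the real scripts/sshd-10-breakglass.conf; MaxAuthTries is an assumed value.
render_breakglass_dropin() {
  cat <<'EOF'
# Listen on both ports: 22 for LAN admin, 52222 for the WAN forward.
Port 22
Port 52222

# Key-only everywhere.
PasswordAuthentication no
KbdInteractiveAuthentication no

# Connections arriving on :52222 may use only the break-glass key file.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    # Assumed value; the design only says "low MaxAuthTries".
    MaxAuthTries 3

# End the conditional block so it cannot swallow later settings.
Match all
EOF
}

render_breakglass_dropin
```

Whatever the real file contains, running `sshd -t` against it before reloading is what keeps a syntax error from closing both ports at once.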
## Edge-router constraints discovered (TP-Link AX6000)
- **No port remapping** — external port must equal internal port (rejects e.g.
`22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
sides.
- **Port 22 is reserved**: `22 → 22` is also refused. Break-glass cannot use 22
(Viktor's initial preference); `:52222` is the landed port.
- **Row delete is immediate** (no confirm dialog).
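The first two constraints are easy to check before touching the router UI. The helper below is a hypothetical sketch (the name `check_forward` and its messages are not part of any existing tooling); it only encodes the two rules listed above.

```bash
# Hypothetical pre-check for an AX6000 port forward.
# Usage: check_forward EXT_PORT INT_PORT
# Prints "ok" and returns 0 if the router is expected to accept the rule,
# otherwise prints the reason and returns 1.
check_forward() {
  ext="$1"; int="$2"
  for p in "$ext" "$int"; do
    case "$p" in
      ''|*[!0-9]*) echo "invalid port: '$p'"; return 1 ;;
    esac
    if [ "$p" -lt 1 ] || [ "$p" -gt 65535 ]; then
      echo "port out of range: $p"; return 1
    fi
  done
  # Rule 1: the firmware refuses any remap.
  if [ "$ext" -ne "$int" ]; then
    echo "remap refused: external port must equal internal port"; return 1
  fi
  # Rule 2: port 22 is reserved, even as 22 -> 22.
  if [ "$ext" -eq 22 ]; then
    echo "port 22 is reserved"; return 1
  fi
  echo "ok"
}
```

For example, `check_forward 52222 52222` passes, while `check_forward 22 52222` and `check_forward 22 22` are both rejected.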
## Security posture
- **Brute force: impossible** (key-only, no password).
- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
`authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
snoopy execve to Loki).
## Verification (2026-06-11)
- `:52222` reachable; break-glass key authenticates (`root@pve`).
- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.
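The Match-isolation check (break-glass key accepted, any other key rejected) could be repeated with something like the sketch below. The function names and the 5-second timeout are assumptions; the ssh options used (`BatchMode`, `IdentitiesOnly`, `IdentityAgent`, `ConnectTimeout`) are standard OpenSSH client options.

```bash
# Sketch of the isolation check. The caller supplies the host and both key paths.

# Succeeds only if KEY alone logs in as root on HOST:52222.
key_accepted() {
  host="$1"; key="$2"
  # IdentitiesOnly + IdentityAgent=none: offer exactly this key, nothing from an agent.
  # BatchMode: never prompt, so a rejected key fails instead of hanging.
  ssh -p 52222 -i "$key" \
      -o BatchMode=yes -o IdentitiesOnly=yes -o IdentityAgent=none \
      -o ConnectTimeout=5 "root@$host" true
}

# Prints PASS only if the break-glass key gets in and the other key does not.
check_isolation() {
  host="$1"; bg_key="$2"; other_key="$3"
  if ! key_accepted "$host" "$bg_key"; then
    echo "FAIL: break-glass key rejected (or host unreachable)"; return 1
  fi
  if key_accepted "$host" "$other_key"; then
    echo "FAIL: non-break-glass key accepted on :52222"; return 1
  fi
  echo "PASS"
}
```

Because the positive check runs first, a later rejection of the second key cannot be mistaken for a network failure. With `BatchMode=yes` an unknown host key also fails the check, so the host key needs to be in `known_hosts` already.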

@ -0,0 +1,158 @@
# Runbook: Break-glass SSH
Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
port-knock design is decommissioned (see "History" below).
## Model (as built)
```
your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
│ WAN tcp/52222 ─▶ 192.168.1.127:52222
Proxmox host 192.168.1.127
sshd :52222 (key-only, break-glass key ONLY)
→ full LAN via ssh -J / ssh -D
```
- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
from root's normal LAN-admin keys, so it is independently revocable and a leak
of any other root key does not grant internet access.
- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
scanner noise only; key-only auth is the real protection.
- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
a deliberate, documented exception to the Wave-1 "no public-IP access" policy
(see `docs/architecture/security.md`), chosen for self-containment: it has **no
dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
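As a rough illustration of the per-source rate limit, the sketch below prints (it does not apply) a set of iptables rules built on the `hashlimit` match. It is not the deployed `scripts/breakglass-firewall.sh`: the LAN range, the rate, the burst and the hashlimit name are all assumed values, and only the `BREAKGLASS` chain name, the port and the LAN bypass come from this document.

```bash
# Hypothetical sketch: print iptables rules for a per-source-IP rate limit.
# Nothing is applied; pipe the output to a root shell only after review.
breakglass_rules() {
  port=52222
  lan_cidr="192.168.1.0/24"   # assumed LAN range
  rate="6/minute"             # assumed budget of new connections per source IP
  burst=6                     # assumed burst allowance
  cat <<EOF
iptables -N BREAKGLASS
iptables -A INPUT -p tcp --dport $port -m conntrack --ctstate NEW -j BREAKGLASS
iptables -A BREAKGLASS -s $lan_cidr -j ACCEPT
iptables -A BREAKGLASS -m hashlimit --hashlimit-name breakglass --hashlimit-mode srcip --hashlimit-upto $rate --hashlimit-burst $burst -j ACCEPT
iptables -A BREAKGLASS -j DROP
EOF
}

breakglass_rules
```

Only new connections enter the chain, so an established session is never throttled; a source that exceeds its budget is dropped until its bucket refills.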
## Secrets (Vault `secret/viktor`)
| Key | Use |
|---|---|
| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
The key has **no passphrase** (so it works in a true cold event without anything
to recall). Treat the private key as the sole credential — guard the laptop copy.
> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
> inert; remove it when you have a Vault token with the `patch` capability
> (`vault kv patch` / merge-patch — the everyday token lacks it).
## Connect
Client `~/.ssh/config`:
```
Host breakglass
HostName viktorbarzin.ddns.net # follows the dynamic WAN IP
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
IdentitiesOnly yes
```
Then:
```bash
ssh breakglass # shell on the Proxmox host
ssh -J breakglass root@10.0.20.1 # jump to pfSense (or any LAN host)
ssh -D 1080 breakglass # SOCKS5 → reach any internal IP
```
There is **no `bg()` knock function** anymore — delete it from your shell rc if
you added it under the old design.
## Cold-event IP cheat sheet (cluster DNS is down)
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API | `10.0.20.100` |
| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
| edge router | `192.168.1.1` |
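Since DNS is unavailable in a cold event, the table can also live in a shell rc as a tiny lookup function. This is a convenience sketch; the function name `bg_ip` and the short host labels are hypothetical, and the addresses are copied from the table above.

```bash
# Hypothetical helper: map a short label to a LAN IP without DNS.
bg_ip() {
  case "$1" in
    pve)      echo 192.168.1.127 ;;
    pfsense)  echo 10.0.20.1 ;;
    k8s-api)  echo 10.0.20.100 ;;
    synology) echo 192.168.1.13 ;;
    router)   echo 192.168.1.1 ;;
    *) echo "unknown host: $1" >&2; return 1 ;;
  esac
}
```

Usage would then look like `ssh -J breakglass "root@$(bg_ip pfsense)"`.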
## Deploy / re-provision the host config
Source of truth lives in `infra/scripts/`. To (re)deploy:
```bash
# 1. break-glass key authorized for the exposed port
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"
# 2. sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
# 3. firewall (rate-limit) + boot unit
scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'
# 4. fail2ban jail
scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
```
The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`,
ordered after `network-pre.target`) is a manual host unit; recreate it if the
host is rebuilt:
```ini
[Unit]
Description=Break-glass base firewall (key-only SSH on :52222)
After=network-pre.target
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
## Edge-router forward (manual — live device, not Terraform)
TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
Forwarding. The break-glass rule:
| Service Name | Device IP | External Port | Internal Port | Protocol |
|---|---|---|---|---|
| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
- **External port must equal internal port.** The firmware rejects any remap
(e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
ones."* Hence ext==int 52222.
- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
removes the rule and toasts "Operation succeeded".
- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
`RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
identity-guarded delete). Router password: Vault
`secret/viktor/edge_router_192_168_1_1_password`.
## Rotate / revoke
- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
`vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`,
redeploy step 1 above.
- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
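Revocation is a one-line file edit, but doing it by hand under pressure is error-prone. A minimal sketch of a scripted version follows; `revoke_key` is a hypothetical helper, not an existing script in `infra/scripts/`. It rewrites the file through a temporary copy so a failed run never leaves a half-written `authorized_keys` file.

```bash
# Hypothetical helper: remove every line containing KEY_BLOB from an
# authorized_keys-style file. KEY_BLOB is the base64 field of the public key
# (the second field of the .pub file).
# Usage: revoke_key FILE KEY_BLOB
revoke_key() {
  file="$1"; blob="$2"
  [ -n "$blob" ] || { echo "empty key blob" >&2; return 1; }
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  # grep -v exits 1 when no lines remain; that is expected when the file
  # held only the revoked key. Any other non-zero status is a real error.
  grep -vF -- "$blob" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp" && mv "$tmp" "$file"
}
```

On the host this would be called as `revoke_key /root/.ssh/authorized_keys.breakglass "<base64 blob>"`; since that file is described as holding only the break-glass key, the result is an empty file and `:52222` accepts no key until step 1 of the deploy is rerun.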
## History
- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
**UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
security (the SSH key already makes the port brute-force-proof) and its only
benefit — hiding the port — came at the cost of a **circular dependency**: the
knock sequence lived only in in-cluster Vault, unreachable in the exact
cold/away scenario break-glass exists for. That caused a real lockout. The
knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
were removed.