From df332b59e65409b741fa6fbba8e4a301e7736ada Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Thu, 11 Jun 2026 18:23:39 +0000 Subject: [PATCH] break-glass SSH: drop port-knock for exposed key-only :52222; version host config MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. 
Co-Authored-By: Claude Fable 5 --- .claude/CLAUDE.md | 2 +- docs/architecture/security.md | 2 + ...2026-05-30-breakglass-ssh-access-design.md | 285 +++++++++++++ .../2026-05-30-breakglass-ssh-access-plan.md | 395 ++++++++++++++++++ ...26-06-11-breakglass-ssh-redesign-design.md | 73 ++++ docs/runbooks/breakglass-ssh.md | 158 +++++++ scripts/breakglass-firewall.sh | 26 ++ scripts/fail2ban-breakglass-sshd.local | 18 + scripts/sshd-10-breakglass.conf | 31 ++ 9 files changed, 989 insertions(+), 1 deletion(-) create mode 100644 docs/plans/2026-05-30-breakglass-ssh-access-design.md create mode 100644 docs/plans/2026-05-30-breakglass-ssh-access-plan.md create mode 100644 docs/plans/2026-06-11-breakglass-ssh-redesign-design.md create mode 100644 docs/runbooks/breakglass-ssh.md create mode 100644 scripts/breakglass-firewall.sh create mode 100644 scripts/fail2ban-breakglass-sshd.local create mode 100644 scripts/sshd-10-breakglass.conf diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index eeedf2dc..be8adb82 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -178,7 +178,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`. - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming. -- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. +- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. 
**One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 4a29638d..a092b14c 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same **Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert. +**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. 
It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.) + #### Why no canary tokens Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden. diff --git a/docs/plans/2026-05-30-breakglass-ssh-access-design.md b/docs/plans/2026-05-30-breakglass-ssh-access-design.md new file mode 100644 index 00000000..1b8b2070 --- /dev/null +++ b/docs/plans/2026-05-30-breakglass-ssh-access-design.md @@ -0,0 +1,285 @@ +# Break-Glass SSH Access — Design + +> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`. +> The port-knock was removed: it added no real security (the SSH key already +> makes the port brute-force-proof) and its knock sequence lived only in +> in-cluster Vault — unreachable in the exact cold/away scenario break-glass +> exists for, which caused a real lockout. Retained for history. As-built: +> `docs/runbooks/breakglass-ssh.md`. + +- **Date**: 2026-05-30 +- **Status**: Draft — pending user review +- **Owner**: Viktor +- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1) + +## 1. 
Goal + +Provide a **cold, brute-force-proof backdoor onto the home LAN from the public +internet** for the case where the Kubernetes cluster and every cluster-hosted +remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster +WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**. + +### Hard requirements (from the user) + +1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are + down. The path must touch **nothing in the cluster** (no Authentik, Traefik, + Technitium/AdGuard DNS, cloudflared). +2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology, + k8s API, etc.). +3. **No brute force**: no password-guessable surface. +4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard / + Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only. +5. **Minimal effort**, and ideally **honor the locked Wave 1 policy** + (`no public-IP access — … PVE sshd must transit LAN or Headscale`). + +## 2. Decision + +**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.** + +- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box + (`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control) + and it sits directly on the `192.168.1.0/24` segment, so the path **does not + traverse pfSense or the cluster** — it survives a wedged pfSense too, not just + a down cluster. +- SSH is the only externally-usable remote tool **pre-installed on every + Linux/macOS box**, satisfying requirement 4. +- **Key-only auth** (no passwords anywhere) makes password brute force + impossible → requirement 3. +- A **port-knock** keeps the external SSH port **closed/invisible to scanners** + until a knock sequence is sent. This restores the "no standing public service" + property we'd have had with WireGuard and keeps us within the **intent** of the + Wave 1 policy (PVE sshd is not internet-scannable). 
The knock is sent with a + **bash `/dev/udp` one-liner** — zero install. + +### Alternatives rejected + +| Option | Why rejected | +|---|---| +| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. | +| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). | +| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. | +| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. | +| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). | + +## 3. Architecture + +``` + Your laptop (anywhere) — stock ssh + bash, nothing installed + │ (1) UDP knock sequence → bash: echo > /dev/udp// (instant, no handshake) + │ (2) ssh -p 52222 root@ + ▼ + Edge router 192.168.1.1 (the box the stored password unlocks) + │ forwards: UDP ,, + TCP 52222 → 192.168.1.127 + ▼ + Proxmox host 192.168.1.127 ← path bypasses pfSense entirely + ├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s) + ├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only + └─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN +``` + +**Why it meets "cold + full LAN":** the host is up by definition of the chosen +failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host +you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to +the VLANs via pfSense when pfSense is up) or by using SSH's built-in +`-J`/`-D` — both stock, no install. + +## 4. 
Components + +### 4.1 Edge router @ 192.168.1.1 (manual, in the browser) +Add port-forwards (same place the existing `51821` WireGuard forward lives): +- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale) +- **UDP ``, ``, `` → 192.168.1.127** (knock ports; actual numbers in Vault) + +If the router supports a **port range** forward, a single range covering the +knock ports + 52222 is tidier than four rules. + +> **Verify (#1 implementation check):** whether `.1` **preserves the source IP** +> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by +> knocking + connecting from an external network and checking `/var/log/auth.log` +> + `knockd` syslog for the observed source IP. The design works either way (see +> §4.3), but it determines knock granularity. + +### 4.2 SSH keys & Vault layout +- Mint a **dedicated** break-glass keypair (ed25519), separate from + `secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly + labelled. +- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=` + restriction — break-glass is from-anywhere; the knock + key are the gate). +- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for + re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519` + (chmod 600). +- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out + of git — obscurity value only; see §5). + +### 4.3 Proxmox host — sshd hardening +`/etc/ssh/sshd_config.d/10-breakglass.conf`: +``` +Port 22 +Port 52222 +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +PermitRootLogin prohibit-password # key-only root (PVE recovery norm) +MaxAuthTries 3 +LoginGraceTime 20 +``` +- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external, + knock-gated)**. 
Using a dedicated external port (not a DNAT rewrite to 22) + lets the firewall distinguish LAN vs external **regardless of `.1` SNAT + behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate. +- **Default to root key-only** for recovery practicality. *Alternative for + review:* a dedicated `breakglass` sudo user instead of root. + +> **Verify (#2):** key login already works for your normal access **before** +> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs +> already use keys, so this is likely already effectively true.) + +### 4.4 Host firewall (knock gate) +Default-drop the external SSH port; knockd punches a per-source hole. LAN admin +(`:22`) and established sessions are untouched: +``` +# allow established / related +iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT +# LAN admin + backups: SSH on :22 always allowed +iptables -A INPUT -p tcp --dport 22 -j ACCEPT +# external SSH on :52222 closed by default — knockd opens it per-source +iptables -A INPUT -p tcp --dport 52222 -j DROP +``` +- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables + drops them — the knock ports stay **silent/closed** to scanners. +- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is + enabled. If it is, express these rules through it (or a dedicated chain) so a + pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs + often have it off at datacenter level. + +### 4.5 knockd +`apt install knockd` (Debian/PVE). 
`/etc/knockd.conf`: +``` +[options] + UseSyslog + Interface = vmbr0 # the 192.168.1.127 interface + +[breakglass] + sequence = :udp,:udp,:udp # real ports from Vault + seq_timeout = 10 + start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT + cmd_timeout = 30 + stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT +``` +- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang + on the client (a TCP knock to a dropped port would block until timeout). +- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session + established within that window **persists** via conntrack ESTABLISHED after the + rule is removed. Enable + start the `knockd` service. + +### 4.6 fail2ban (defense-in-depth) +`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures). +Local to the host, **no cluster dependency**. Catches anything that gets past the +knock to the sshd listener. + +### 4.7 Client side (laptop — stock tools only) +`~/.ssh/config`: +``` +Host breakglass + HostName + Port 52222 + User root + IdentityFile ~/.ssh/breakglass_ed25519 +``` +Knock + connect — a shell function using **bash builtins only** (works on +macOS `/bin/bash` + Linux; UDP send is instant): +```sh +bg() { + local host= + for p in ; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done + sleep 0.5 + ssh breakglass "$@" +} +``` +- **Full LAN, no install:** `ssh -J breakglass ` (jump), or + `ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080` + to reach any internal IP. From the host shell you already have everything. +- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in + the `Host breakglass` block so plain `ssh breakglass` knocks automatically. + +### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down) +Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold +event. 
Use IPs: + +| Host | IP | +|---|---| +| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) | +| pfSense | `10.0.20.1` (WAN `192.168.1.2`) | +| k8s API server | `10.0.20.100` | +| Synology NAS | `192.168.1.13` | +| Edge router | `192.168.1.1` | +| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` | + +## 5. Security analysis + +- **Brute force: solved.** No password auth anywhere → password guessing is + impossible; key brute force is cryptographically infeasible. +- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is + default-dropped and the knock ports are pcap-sniffed (never answered), so a + scanner sees a closed/silent host — PVE sshd is **not internet-scannable**, + honouring the spirit of "no public-IP access to PVE sshd". +- **The knock is obscurity, not cryptography.** A port-knock sequence is + plaintext and replayable by a passive on-path observer. **The SSH key is the + real access control** — the knock only removes the standing/scannable surface. + (Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the + knock sequence as a secret-ish convenience, not a second cryptographic factor. +- **Residual risks** (none are brute force): + 1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep + PVE patched; short `cmd_timeout`; fail2ban. + 2. **Private key theft** → mitigation: key has a passphrase; revoke by removing + the line from `authorized_keys`. + 3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared + `192.168.1.1` source — anyone else arriving via `.1` in that window could + reach the sshd banner, but still needs your key. Mitigated by the short + window + key-only + fail2ban. +- **Deliberate, documented exception** to the Wave 1 "no public-IP access" + policy, scoped to this single knock-gated port. To be recorded in + `security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation. + +## 6. 
What's automated vs manual + +- **I do**: generate the keypair + knock sequence, store them in Vault, produce + the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client + `~/.ssh/config` + `bg()` function, and write the runbook + doc updates. +- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by + you in the browser (out-of-Terraform, live device). The Proxmox host changes + (sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login + verified first** to avoid lockout; pfSense is **not** touched. None of this is + a `tg apply` — pfSense and the edge router are not Terraform-managed. + +## 7. Testing & verification +1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog + shows the sequence + opens `:52222`; SSH succeeds. +2. **Without** knocking: `ssh -p 52222` from external → connection refused/timed + out (port closed). A plain port scan of `52222` + the knock ports → silent. +3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected. +4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to + an internal IP. +5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity + note accordingly. + +## 8. Failure modes & rotation +- **Proxmox host down** (not just cluster): this path is gone — that's the + out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**. +- **`.1` router config reset**: forwards lost → re-add from this doc; consider + exporting the `.1` config for backup. +- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it + auto-follows; keep the raw IP as fallback. +- **Key/knock compromise**: remove the `authorized_keys` line (kills access + instantly); rotate the knock sequence in `knockd.conf` + Vault. + +## 9. Out of scope +- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier. +- Phone access (would need an SSH **app**, e.g. 
Termius — outside the + "pre-installed Linux/macOS" constraint; laptop is the target). + +## 10. Docs to update on implementation +- `docs/architecture/vpn.md` — add a "Break-glass SSH" section. +- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md` — + record the deliberate knock-gated exception to "no public PVE sshd". +- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure. diff --git a/docs/plans/2026-05-30-breakglass-ssh-access-plan.md b/docs/plans/2026-05-30-breakglass-ssh-access-plan.md new file mode 100644 index 00000000..c4db48e2 --- /dev/null +++ b/docs/plans/2026-05-30-breakglass-ssh-access-plan.md @@ -0,0 +1,395 @@ +# Break-Glass SSH Access — Implementation Plan + +> **⚠️ SUPERSEDED 2026-06-11** by the redesign in +> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained +> for history. As-built: `docs/runbooks/breakglass-ssh.md`. + +> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes. + +**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP. + +**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`. 
+ +**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation). + +**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`. + +--- + +## Pre-flight (read before starting) + +- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step. +- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes. +- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 5 gates on Phase 4-pre verification). +- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN. +- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean. + +--- + +## Phase 0 — Generate secrets (no live changes) + +### Task 0.1: Break-glass SSH keypair + +**Files:** none in repo (secrets → Vault). 
+ +- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)** + +```bash +mkdir -p ~/.ssh +ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519 +# set a passphrase when prompted (so a stolen laptop key isn't instantly usable) +``` + +- [ ] **Step 2: Store the private key + public key in Vault** + +```bash +vault kv patch secret/viktor \ + breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \ + breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)" +``` + +- [ ] **Step 3: Verify the keys are retrievable** + +```bash +vault kv get -field=breakglass_ssh_pubkey secret/viktor +``` +Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line. + +### Task 0.2: Knock sequence + +- [ ] **Step 1: Generate 3 random UDP knock ports** + +```bash +KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK" +``` + +- [ ] **Step 2: Store the sequence in Vault (keep it out of git)** + +```bash +vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK" +vault kv get -field=breakglass_knock_sequence secret/viktor +``` +Expected: prints three comma-separated ports, e.g. `28411,49027,33180`. + +--- + +## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change) + +> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase. + +### Task 1.1: Pre-checks (no changes yet) + +- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)** + +From your laptop, with the break-glass key authorized later — for now confirm your *existing* admin key works: +```bash +ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK' +``` +Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first. 
+ +- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)** + +```bash +ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head' +``` +Expected: note whether `Status: enabled/running`. If **enabled**, add the Phase-1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below. + +### Task 1.2: Authorize the break-glass key + +- [ ] **Step 1: Append the break-glass public key to root's authorized_keys** + +```bash +PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)" +ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys" +``` + +- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK' +``` +Expected: `BREAKGLASS_KEY_OK`. + +### Task 1.3: sshd dual-port + key-only + +**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf` + +- [ ] **Step 1: Write the sshd drop-in** + +```bash +ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF' +Port 22 +Port 52222 +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +PermitRootLogin prohibit-password +MaxAuthTries 3 +LoginGraceTime 20 +EOF +``` + +- [ ] **Step 2: Validate config syntax (do NOT reload yet)** + +```bash +ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK' +``` +Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading. + +- [ ] **Step 3: Reload sshd (current session stays alive)** + +```bash +ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED' +``` +Expected: `RELOADED`. 
+ +- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22' +ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222' +``` +Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop. + +### Task 1.4: Base firewall (default-drop :52222, allow :22 + established) + +**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service` + +- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)** + +```bash +ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF' +#!/usr/bin/env bash +set -euo pipefail +# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT. +iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS +iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS +# established/related always allowed +iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT +# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only) +iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT +# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1 +iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP +EOF +ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh' +``` + +- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)** + +```bash +ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF' +[Unit] +Description=Break-glass base firewall (SSH knock gate) +After=network-pre.target +Before=knockd.service +Wants=network-pre.target + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/breakglass-firewall.sh 
+RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +EOF +ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED' +``` +Expected: `FW_APPLIED`. + +- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works +nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock +``` +Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`. + +### Task 1.5: knockd + +**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd` + +- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)** + +```bash +ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED' +``` +Expected: `KNOCKD_INSTALLED`. + +- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)** + +```bash +KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180 +read K1 K2 K3 <<<"$(echo "$KNOCK" | tr ',' ' ')" +ssh root@192.168.1.127 "cat > /etc/knockd.conf" </dev/null || echo 'START_KNOCKD=1' >> /etc/default/knockd" +ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd' +``` +Expected: `active`. + +### Task 1.6: fail2ban (defense-in-depth) + +- [ ] **Step 1: Install + enable fail2ban with the default sshd jail** + +```bash +ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK' +``` +Expected: `F2B_OK` (sshd jail active). + +--- + +## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes) + +> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet. 
+ +- [ ] **Step 1: Add the SSH break-glass forward** + - Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable. + +- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`) + - For each of the 3 ports: Name `bg-knock-N`, External Port ``, Internal IP `192.168.1.127`, Internal Port ``, Protocol `UDP`, Enable. + +- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs** + +After Phase 3 connects once, on the host check the observed source: +```bash +ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"' +``` +If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook. + +--- + +## Phase 3 — Client config (laptop, no live infra change) + +**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`. + +- [ ] **Step 1: Add the SSH host block** + +```bash +cat >> ~/.ssh/config <<'EOF' + +Host breakglass + HostName viktorbarzin.ddns.net + Port 52222 + User root + IdentityFile ~/.ssh/breakglass_ed25519 +EOF +``` +(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.) 
+
+- [ ] **Step 2: Add the knock+connect function**
+
+```bash
+cat >> ~/.zshrc <<'EOF'
+
+bg() {
+  local host="viktorbarzin.ddns.net"
+  local seq; seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
+  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
+  for p in ${seq//,/ }; do (exec 3<>/dev/udp/$host/$p) 2>/dev/null && echo "x" >&3; sleep 0.4; done
+  sleep 0.5
+  ssh breakglass "$@"
+}
+EOF
+```
+> Note: the bash `/dev/udp` redirection works under bash (`/bin/bash` on macOS + Linux). Under zsh, `/dev/udp` is also supported by zsh's builtin in recent versions; if your zsh build lacks it, define `bg` in bash or use `nc -u -w1 $host $p Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
+
+- [ ] **Step 1: Without knocking, the port is silent**
+
+```bash
+nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
+```
+Expected: `SILENT_OK`.
+
+- [ ] **Step 2: Knock + connect succeeds**
+
+```bash
+bg 'hostname; echo BREAKGLASS_E2E_OK'
+```
+Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
+
+- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
+
+```bash
+ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
+ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
+```
+Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
+
+- [ ] **Step 4: LAN admin unaffected**
+
+From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
+
+**GATE:** Only proceed to Phase 4 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
+
+---
+
+## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
+
+> AX6000 UI. One pass, all three changes.
+
+- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
+  - Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
+
+- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
+  - Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
+
+- [ ] **Step 3: Disable UPnP**
+  - Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
+
+- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
+
+From an external network:
+```bash
+nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
+bg 'echo BREAKGLASS_STILL_OK'
+```
+Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
+
+---
+
+## Phase 6 — Docs + commit (AFTER infra repo is clean)
+
+- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
+- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
+- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
+- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
+
+```bash
+git -C /home/wizard/code/infra add \
+  docs/plans/2026-05-30-breakglass-ssh-access-design.md \
+  docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
+  docs/architecture/vpn.md docs/architecture/security.md \
+  docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
+git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
+git -C /home/wizard/code/infra push origin master
+```
+
+---
+
+## Self-review
+
+- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
+- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
+- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
+- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).
diff --git a/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md b/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
new file mode 100644
index 00000000..d555d971
--- /dev/null
+++ b/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
@@ -0,0 +1,73 @@
+# Break-glass SSH — Redesign
+
+- **Date**: 2026-06-11
+- **Status**: Implemented
+- **Owner**: Viktor
+- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
+- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
+
+## Why redesign
+
+The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
+**port-knock** (knockd).
It caused a real lockout, for a structural reason:
+
+- The knock sequence was 3 random ports stored **only** in Vault, and the client
+  helper fetched it from Vault at connect time.
+- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
+  exact scenario break-glass exists for — away from home, cluster/tunnels down —
+  the knock sequence is unreachable and unmemorable. Circular dependency.
+
+The knock's only benefit was hiding an already brute-force-proof port; its cost
+was that fragility. For a *recovery* path, robustness beats stealth.
+
+## Decision
+
+**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
+Hardened with: the exposed port trusts only a dedicated break-glass key
+(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
+and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
+up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
+"pfSense wedged" / "host down" tiers are explicitly out of scope).
+
+Alternatives considered and rejected: keeping the knock (fragile, circular);
+Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
+dependency Headscale is self-hosted to avoid, and the user preferred a
+self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
+self-contained SSH path was preferred).
+
+## Components
+
+| Layer | Change | Source of truth |
+|---|---|---|
+| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
+| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
+| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
+| knockd | **removed** (package purged, config deleted) | — |
+| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
+| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
+
+## Edge-router constraints discovered (TP-Link AX6000)
+
+- **No port remapping** — external port must equal internal port (rejects e.g.
+  `22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
+  sides.
+- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
+  (Viktor's initial preference); `:52222` is the landed port.
+- **Row delete is immediate** (no confirm dialog).
+
+## Security posture
+
+- **Brute force: impossible** (key-only, no password).
+- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
+- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
+  fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
+  `authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
+  snoopy execve to Loki).
+
+## Verification (2026-06-11)
+
+- `:52222` reachable; break-glass key authenticates (`root@pve`).
+- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
+- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
+- Full WAN path: `ssh -p 52222 ` with the break-glass key → `root@pve`.
+- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.
diff --git a/docs/runbooks/breakglass-ssh.md b/docs/runbooks/breakglass-ssh.md
new file mode 100644
index 00000000..348586f8
--- /dev/null
+++ b/docs/runbooks/breakglass-ssh.md
@@ -0,0 +1,158 @@
+# Runbook: Break-glass SSH
+
+Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
+cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
+**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
+port-knock design is decommissioned (see "History" below).
+
+## Model (as built)
+
+```
+your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
+                                              │  WAN tcp/52222 ─▶ 192.168.1.127:52222
+                                              ▼
+                                    Proxmox host 192.168.1.127
+                                    sshd :52222 (key-only, break-glass key ONLY)
+                                    → full LAN via ssh -J / ssh -D
+```
+
+- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
+- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
+  dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
+  from root's normal LAN-admin keys, so it is independently revocable and a leak
+  of any other root key does not grant internet access.
+- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
+  scanner noise only; key-only auth is the real protection.
+- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
+  a deliberate, documented exception to the Wave-1 "no public-IP access" policy
+  (see `docs/architecture/security.md`), chosen for self-containment: it has **no
+  dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
+  remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
+
+## Secrets (Vault `secret/viktor`)
+
+| Key | Use |
+|---|---|
+| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
+| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
+
+The key has **no passphrase** (so it works in a true cold event without anything
+to recall). Treat the private key as the sole credential — guard the laptop copy.
+
+> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
+> inert; remove it when you have a Vault token with the `patch` capability
+> (`vault kv patch` / merge-patch — the everyday token lacks it).
+
+## Connect
+
+Client `~/.ssh/config`:
+
+```
+Host breakglass
+  HostName viktorbarzin.ddns.net # follows the dynamic WAN IP
+  Port 52222
+  User root
+  IdentityFile ~/.ssh/breakglass_ed25519
+  IdentitiesOnly yes
+```
+
+Then:
+
+```bash
+ssh breakglass # shell on the Proxmox host
+ssh -J breakglass root@10.0.20.1 # jump to pfSense (or any LAN host)
+ssh -D 1080 breakglass # SOCKS5 → reach any internal IP
+```
+
+There is **no `bg()` knock function** anymore — delete it from your shell rc if
+you added it under the old design.
+
+## Cold-event IP cheat sheet (cluster DNS is down)
+
+| Host | IP |
+|---|---|
+| Proxmox host | `192.168.1.127` |
+| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
+| k8s API | `10.0.20.100` |
+| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
+| edge router | `192.168.1.1` |
+
+## Deploy / re-provision the host config
+
+Source of truth lives in `infra/scripts/`. To (re)deploy:
+
+```bash
+# 1. break-glass key authorized for the exposed port
+PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
+ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"
+
+# 2.
sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
+scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
+ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
+
+# 3. firewall (rate-limit) + boot unit
+scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
+ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'
+
+# 4. fail2ban jail
+scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
+ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
+```
+
+The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`,
+`Before=network-online`-ish ordering) is a manual host unit — recreate it if the
+host is rebuilt:
+
+```ini
+[Unit]
+Description=Break-glass base firewall (key-only SSH on :52222)
+After=network-pre.target
+Wants=network-pre.target
+[Service]
+Type=oneshot
+ExecStart=/usr/local/sbin/breakglass-firewall.sh
+RemainAfterExit=yes
+[Install]
+WantedBy=multi-user.target
+```
+
+## Edge-router forward (manual — live device, not Terraform)
+
+TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
+Forwarding. The break-glass rule:
+
+| Service Name | Device IP | External Port | Internal Port | Protocol |
+|---|---|---|---|---|
+| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
+
+**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
+- **External port must equal internal port.** The firmware rejects any remap
+  (e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
+  ones."* Hence ext==int 52222.
+- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
+- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
+  removes the rule and toasts "Operation succeeded".
+- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
+  Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
+  `RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
+  identity-guarded delete). Router password: Vault
+  `secret/viktor/edge_router_192_168_1_1_password`.
+
+## Rotate / revoke
+
+- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
+- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
+  `vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`,
+  redeploy step 1 above.
+- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
+
+## History
+
+- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
+  **UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
+  security (the SSH key already makes the port brute-force-proof) and its only
+  benefit — hiding the port — came at the cost of a **circular dependency**: the
+  knock sequence lived only in in-cluster Vault, unreachable in the exact
+  cold/away scenario break-glass exists for. That caused a real lockout. The
+  knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
+  were removed.
diff --git a/scripts/breakglass-firewall.sh b/scripts/breakglass-firewall.sh
new file mode 100644
index 00000000..51260cb9
--- /dev/null
+++ b/scripts/breakglass-firewall.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+set -euo pipefail
+# Break-glass base firewall (redesigned 2026-06-11; replaced the port-knock gate).
+#
+# Source of truth. Deploy to the PVE host with:
+#   scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
+#   ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl restart breakglass-firewall.service'
+# The breakglass-firewall.service oneshot runs this at boot (RemainAfterExit).
+#
+# Model: key-only SSH break-glass on :52222, openly reachable from the WAN, NO
+# port-knock. The SSH key is the gate (brute-force-proof); the rate-limit below
+# only trims scanner noise / slows a hypothetical sshd 0-day.
+#   :22    -> LAN admin (all of root's keys), always allowed.
+#   :52222 -> WAN break-glass. LAN/VLAN sources bypass the limit; external NEW
+#             connections are rate-limited per source IP, then accepted.
+iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
+iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
+
+iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -s 192.168.1.0/24 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -s 10.0.0.0/8 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \
+  -m hashlimit --hashlimit-name bg_ssh --hashlimit-mode srcip \
+  --hashlimit-above 6/min --hashlimit-burst 3 -j DROP
+iptables -A BREAKGLASS -p tcp --dport 52222 -j ACCEPT
diff --git a/scripts/fail2ban-breakglass-sshd.local b/scripts/fail2ban-breakglass-sshd.local
new file mode 100644
index 00000000..19066295
--- /dev/null
+++ b/scripts/fail2ban-breakglass-sshd.local
@@ -0,0 +1,18 @@
+# Break-glass SSH fail2ban jail (redesigned 2026-06-11). Source of truth.
+# Deploy to the PVE host with:
+#   scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
+#   ssh root@192.168.1.127 'systemctl restart fail2ban'
+#
+# GOTCHA (Debian 13 / OpenSSH 9.x): auth lines are logged under
+# _COMM=sshd-session, NOT _COMM=sshd. The stock Debian jail keys journalmatch on
+# `_SYSTEMD_UNIT=ssh.service + _COMM=sshd` and therefore silently NEVER bans.
+# Match by unit only so both sshd and sshd-session lines are seen. Ban on both
+# SSH ports (the WAN break-glass listener is :52222).
+[sshd]
+enabled = true
+backend = systemd
+journalmatch = _SYSTEMD_UNIT=ssh.service
+port = ssh,52222
+maxretry = 4
+findtime = 10m
+bantime = 1h
diff --git a/scripts/sshd-10-breakglass.conf b/scripts/sshd-10-breakglass.conf
new file mode 100644
index 00000000..96663d2b
--- /dev/null
+++ b/scripts/sshd-10-breakglass.conf
@@ -0,0 +1,31 @@
+# Break-glass SSH drop-in (redesigned 2026-06-11). Source of truth.
+# Deploy to the PVE host with:
+#   scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
+#   ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
+#
+#   :22    = LAN admin, all of root's keys (default AuthorizedKeysFile).
+#   :52222 = WAN-exposed break-glass. The edge router forwards WAN tcp/52222 ->
+#            192.168.1.127:52222 (external port MUST equal internal port on the
+#            TP-Link AX6000 — it rejects remaps; port 22 itself is reserved).
+#            The Match LocalPort block trusts ONLY the dedicated break-glass key
+#            (authorized_keys.breakglass), so a leak of any other root key does
+#            NOT grant internet access. Rate-limited by the BREAKGLASS iptables
+#            chain + fail2ban. No port-knock.
+#
+# NOTE: the trailing `Match all` is REQUIRED. /etc/ssh/sshd_config has
+# `Include sshd_config.d/*.conf` near the top but a global `PermitRootLogin`
+# further down; without `Match all` resetting context, that later global
+# directive would be swallowed into the `Match LocalPort 52222` condition.
+Port 22
+Port 52222
+PasswordAuthentication no
+KbdInteractiveAuthentication no
+PubkeyAuthentication yes
+PermitRootLogin prohibit-password
+MaxAuthTries 3
+LoginGraceTime 20
+
+Match LocalPort 52222
+  AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
+  PermitRootLogin prohibit-password
+Match all