Merge forgejo/master (tts stack) into wizard/android-emulator

# Conflicts:
#	stacks/tripit/main.tf

commit 6bf216751b
37 changed files with 1774 additions and 86 deletions

@ -40,10 +40,10 @@ graph TB

| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |

@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
When `auth = "required"`, an unauthenticated request flows:

1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
3. Authentik checks for valid session cookie
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
5. User authenticates via social provider (Google/GitHub/Facebook)
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
6. Authentik creates session, sets cookie, redirects back to original URL
7. Subsequent requests include session cookie, pass auth check, reach backend

Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.

### First-time signin performance (2026-06-10)

Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:

- **Single-screen login**: the identification stage carries `password_stage`,
  so username+password is one round trip. The separate password-stage binding
  was removed from `default-authentication-flow` (required by authentik when
  embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
  15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
  TCP setup on the forward-auth subrequest path.

**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
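
The comment convention described above can be sketched as a small shell check. This is an illustration only, not the contents of `scripts/check-ingress-auth-comments.py`: the function name `check_auth_comments` and the exact matching rules (the comment must sit on the line directly above) are assumptions.

```bash
# Hypothetical sketch of the anti-exposure guard; the real check lives in
# scripts/check-ingress-auth-comments.py and may match differently.

# Print one line per violation in FILE and return 1 if any were found.
check_auth_comments() {
  awk '
    # An auth line set to one of the two tiers that need a justification.
    /^[ \t]*auth[ \t]*=[ \t]*"(app|none)"/ {
      tier = ($0 ~ /"app"/) ? "app" : "none"
      # The line directly above must read: # auth = "<tier>": <reason>
      want = "^[ \t]*# auth = \"" tier "\": .+"
      if (prev !~ want) {
        printf "%s:%d: missing comment for auth tier %s\n", FILENAME, FNR, tier
        bad = 1
      }
    }
    { prev = $0 }
    END { exit bad + 0 }
  ' "$1"
}
```

A wrapper in the spirit of `scripts/tg` could run this over each `.tf` file in a stack and refuse to continue on a non-zero return.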

### Social Login & Invitation Flow

@ -22,9 +22,11 @@ graph TB
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
end

subgraph K8s["Kubernetes Cluster v1.34.2"]
subgraph K8s["Kubernetes Cluster v1.34.8"]
direction TB

subgraph VPA["VPA (Goldilocks - Initial Mode)"]

@ -62,7 +64,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |

@ -76,8 +78,10 @@ graph TB
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |

**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)
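
As a quick sanity check, the updated totals can be re-derived with shell arithmetic. The sketch below uses only the figures quoted in the line above (k8s-node1 at 16c/48GB, plus the master and five workers at 8c/32GB each); the variable names are illustrative.

```bash
# Re-derive the cluster totals from the per-VM figures quoted above.
node1_cpu=16; node1_ram=48          # k8s-node1
std_vms=6; std_cpu=8; std_ram=32    # k8s-master plus k8s-node2..6

total_cpu=$(( node1_cpu + std_vms * std_cpu ))
total_ram=$(( node1_ram + std_vms * std_ram ))
echo "${total_cpu} vCPUs, ${total_ram}GB RAM"   # prints: 64 vCPUs, 240GB RAM
```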

> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2

@ -97,7 +101,12 @@ graph TB
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> set`, fresh clone) self-heals within the hour. The script compares
> *normalized option sets*, so an unchanged config is a true no-op —
> until 2026-06-11 a raw string compare (defeated by `qm config`'s
> canonical key order) re-issued `qm set` hourly against running VMs,
> live-rewriting QEMU throttle state via QMP (implicated in the devvm
> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
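
The difference between a raw string compare and a compare of normalized option sets can be shown with a minimal sketch. This is not the script from `infra/scripts/`: the helper name `normalize_opts` and the sample option strings are assumptions made for illustration.

```bash
# Hypothetical helper: canonicalize a comma-separated option list so that
# key order no longer affects equality.
normalize_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d' | LC_ALL=C sort | paste -sd, -
}

desired='mbps_rd=60,mbps_wr=60,iothread=1'
current='iothread=1,mbps_rd=60,mbps_wr=60'   # same options, different order

# A raw string compare would report drift here; the normalized compare does not.
if [ "$(normalize_opts "$desired")" = "$(normalize_opts "$current")" ]; then
  echo "no-op: config already matches"
else
  echo "drift: would re-apply"
fi
```

Sorting before comparing is what makes an unchanged config a true no-op regardless of the order in which the options are reported.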
@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same

**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.

**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
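
To make the `Match LocalPort` idea concrete, the sketch below renders a per-port sshd stanza from a shell function. It is an assumption-laden illustration, not the as-built config from `docs/runbooks/breakglass-ssh.md`: the function name `render_breakglass_match`, the directory of the key file, and the exact set of directives are hypothetical.

```bash
# Hypothetical generator for a per-port sshd stanza. Connections arriving on
# PORT are restricted to public-key auth against KEYFILE only.
render_breakglass_match() {
  local port="$1" keyfile="$2"
  cat <<EOF
Match LocalPort ${port}
    PasswordAuthentication no
    KbdInteractiveAuthentication no
    AuthorizedKeysFile ${keyfile}
EOF
}

# The key file location below is assumed, not taken from the runbook.
render_breakglass_match 52222 /root/.ssh/authorized_keys.breakglass
```

If output like this were written under `/etc/ssh/sshd_config.d/`, running `sshd -t` before reloading would catch syntax errors while an existing session is still open.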

#### Why no canary tokens

Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.

@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)

Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)

**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).

@ -47,7 +47,7 @@ graph TB
end

subgraph K8s["Kubernetes Cluster"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]

NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]

@ -85,8 +85,7 @@ graph TB
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |

@ -113,7 +112,7 @@ graph TB

**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.

**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.

### Block Storage Flow (Proxmox CSI) — NEW

285
docs/plans/2026-05-30-breakglass-ssh-access-design.md
Normal file
@ -0,0 +1,285 @@
# Break-Glass SSH Access — Design

> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`.
> The port-knock was removed: it added no real security (the SSH key already
> makes the port brute-force-proof) and its knock sequence lived only in
> in-cluster Vault — unreachable in the exact cold/away scenario break-glass
> exists for, which caused a real lockout. Retained for history. As-built:
> `docs/runbooks/breakglass-ssh.md`.

- **Date**: 2026-05-30
- **Status**: Draft — pending user review
- **Owner**: Viktor
- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1)

## 1. Goal

Provide a **cold, brute-force-proof backdoor onto the home LAN from the public
internet** for the case where the Kubernetes cluster and every cluster-hosted
remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster
WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**.

### Hard requirements (from the user)

1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are
   down. The path must touch **nothing in the cluster** (no Authentik, Traefik,
   Technitium/AdGuard DNS, cloudflared).
2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology,
   k8s API, etc.).
3. **No brute force**: no password-guessable surface.
4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard /
   Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only.
5. **Minimal effort**, and ideally **honor the locked Wave 1 policy**
   (`no public-IP access — … PVE sshd must transit LAN or Headscale`).

## 2. Decision

**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.**

- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box
  (`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control)
  and it sits directly on the `192.168.1.0/24` segment, so the path **does not
  traverse pfSense or the cluster** — it survives a wedged pfSense too, not just
  a down cluster.
- SSH is the only externally-usable remote tool **pre-installed on every
  Linux/macOS box**, satisfying requirement 4.
- **Key-only auth** (no passwords anywhere) makes password brute force
  impossible → requirement 3.
- A **port-knock** keeps the external SSH port **closed/invisible to scanners**
  until a knock sequence is sent. This restores the "no standing public service"
  property we'd have had with WireGuard and keeps us within the **intent** of the
  Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a
  **bash `/dev/udp` one-liner** — zero install.

### Alternatives rejected

| Option | Why rejected |
|---|---|
| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. |
| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). |
| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. |
| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. |
| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). |

## 3. Architecture

```
Your laptop (anywhere) — stock ssh + bash, nothing installed
   │  (1) UDP knock sequence → bash: echo > /dev/udp/<pub>/<port>   (instant, no handshake)
   │  (2) ssh -p 52222 root@<pub>
   ▼
Edge router 192.168.1.1   (the box the stored password unlocks)
   │  forwards: UDP <k1>,<k2>,<k3> + TCP 52222 → 192.168.1.127
   ▼
Proxmox host 192.168.1.127   ← path bypasses pfSense entirely
   ├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s)
   ├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only
   └─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN
```

**Why it meets "cold + full LAN":** the host is up by definition of the chosen
failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host
you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to
the VLANs via pfSense when pfSense is up) or by using SSH's built-in
`-J`/`-D` — both stock, no install.

## 4. Components

### 4.1 Edge router @ 192.168.1.1 (manual, in the browser)
Add port-forwards (same place the existing `51821` WireGuard forward lives):
- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale)
- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault)

If the router supports a **port range** forward, a single range covering the
knock ports + 52222 is tidier than four rules.

> **Verify (#1 implementation check):** whether `.1` **preserves the source IP**
> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by
> knocking + connecting from an external network and checking `/var/log/auth.log`
> + `knockd` syslog for the observed source IP. The design works either way (see
> §4.3), but it determines knock granularity.
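
One way to read the observed source IP out of the sshd log for the check above is sketched below. The helper name `accepted_sources` is hypothetical, and the sample line only imitates the usual OpenSSH "Accepted publickey" log shape with documentation addresses; real log lines on the host may carry different prefixes.

```bash
# Print the source IP of every successful public-key login read from stdin.
accepted_sources() {
  sed -n 's/.*Accepted publickey for [^ ]* from \([^ ]*\) port .*/\1/p'
}

# Sample line in the usual OpenSSH log shape (addresses are documentation values).
sample='May 30 12:00:01 pve sshd[4242]: Accepted publickey for root from 203.0.113.7 port 51234 ssh2: ED25519 SHA256:abc'
printf '%s\n' "$sample" | accepted_sources   # prints: 203.0.113.7
```

Fed with `/var/log/auth.log` after an external test login, an output of `192.168.1.1` would indicate that the router SNATs forwarded traffic, while the client's public address would indicate plain DNAT.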

### 4.2 SSH keys & Vault layout
- Mint a **dedicated** break-glass keypair (ed25519), separate from
  `secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly
  labelled.
- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=`
  restriction — break-glass is from-anywhere; the knock + key are the gate).
- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for
  re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519`
  (chmod 600).
- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out
  of git — obscurity value only; see §5).

### 4.3 Proxmox host — sshd hardening
`/etc/ssh/sshd_config.d/10-breakglass.conf`:
```
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password   # key-only root (PVE recovery norm)
MaxAuthTries 3
LoginGraceTime 20
```
- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external,
  knock-gated)**. Using a dedicated external port (not a DNAT rewrite to 22)
  lets the firewall distinguish LAN vs external **regardless of `.1` SNAT
  behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate.
- **Default to root key-only** for recovery practicality. *Alternative for
  review:* a dedicated `breakglass` sudo user instead of root.

> **Verify (#2):** key login already works for your normal access **before**
> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs
> already use keys, so this is likely already effectively true.)

### 4.4 Host firewall (knock gate)
Default-drop the external SSH port; knockd punches a per-source hole. LAN admin
(`:22`) and established sessions are untouched:
```
# allow established / related
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin + backups: SSH on :22 always allowed
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default — knockd opens it per-source
iptables -A INPUT -p tcp --dport 52222 -j DROP
```
- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables
  drops them — the knock ports stay **silent/closed** to scanners.
- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is
  enabled. If it is, express these rules through it (or a dedicated chain) so a
  pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs
  often have it off at datacenter level.

### 4.5 knockd
`apt install knockd` (Debian/PVE). `/etc/knockd.conf`:
```
[options]
    UseSyslog
    Interface = vmbr0            # the 192.168.1.127 interface

[breakglass]
    sequence      = <k1>:udp,<k2>:udp,<k3>:udp     # real ports from Vault
    seq_timeout   = 10
    start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
    cmd_timeout   = 30
    stop_command  = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
```
- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang
  on the client (a TCP knock to a dropped port would block until timeout).
- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session
  established within that window **persists** via conntrack ESTABLISHED after the
  rule is removed. Enable + start the `knockd` service.

### 4.6 fail2ban (defense-in-depth)
`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures).
Local to the host, **no cluster dependency**. Catches anything that gets past the
knock to the sshd listener.

### 4.7 Client side (laptop — stock tools only)
`~/.ssh/config`:
```
Host breakglass
    HostName <public-ip-or-dyndns>
    Port 52222
    User root
    IdentityFile ~/.ssh/breakglass_ed25519
```
Knock + connect — a shell function using **bash builtins only** (works on
macOS `/bin/bash` + Linux; UDP send is instant):
```sh
bg() {
  local host=<public-ip-or-dyndns>
  for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done
  sleep 0.5
  ssh breakglass "$@"
}
```
- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or
  `ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080`
  to reach any internal IP. From the host shell you already have everything.
- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in
  the `Host breakglass` block so plain `ssh breakglass` knocks automatically.

### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down)
Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold
event. Use IPs:

| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API server | `10.0.20.100` |
| Synology NAS | `192.168.1.13` |
| Edge router | `192.168.1.1` |
| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` |

## 5. Security analysis

- **Brute force: solved.** No password auth anywhere → password guessing is
  impossible; key brute force is cryptographically infeasible.
- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is
  default-dropped and the knock ports are pcap-sniffed (never answered), so a
  scanner sees a closed/silent host — PVE sshd is **not internet-scannable**,
  honouring the spirit of "no public-IP access to PVE sshd".
- **The knock is obscurity, not cryptography.** A port-knock sequence is
  plaintext and replayable by a passive on-path observer. **The SSH key is the
  real access control** — the knock only removes the standing/scannable surface.
  (Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the
  knock sequence as a secret-ish convenience, not a second cryptographic factor.
- **Residual risks** (none are brute force):
  1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep
     PVE patched; short `cmd_timeout`; fail2ban.
  2. **Private key theft** → mitigation: key has a passphrase; revoke by removing
     the line from `authorized_keys`.
  3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared
     `192.168.1.1` source — anyone else arriving via `.1` in that window could
     reach the sshd banner, but still needs your key. Mitigated by the short
     window + key-only + fail2ban.
- **Deliberate, documented exception** to the Wave 1 "no public-IP access"
  policy, scoped to this single knock-gated port. To be recorded in
  `security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation.

## 6. What's automated vs manual

- **I do**: generate the keypair + knock sequence, store them in Vault, produce
  the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client
  `~/.ssh/config` + `bg()` function, and write the runbook + doc updates.
- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by
  you in the browser (out-of-Terraform, live device). The Proxmox host changes
  (sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login
  verified first** to avoid lockout; pfSense is **not** touched. None of this is
  a `tg apply` — pfSense and the edge router are not Terraform-managed.

## 7. Testing & verification
1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog
   shows the sequence + opens `:52222`; SSH succeeds.
2. **Without** knocking: `ssh -p 52222` from external → connection refused/timed
   out (port closed). A plain port scan of `52222` + the knock ports → silent.
3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected.
4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to
   an internal IP.
5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity
   note accordingly.

## 8. Failure modes & rotation
- **Proxmox host down** (not just cluster): this path is gone — that's the
  out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**.
- **`.1` router config reset**: forwards lost → re-add from this doc; consider
  exporting the `.1` config for backup.
- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it
  auto-follows; keep the raw IP as fallback.
- **Key/knock compromise**: remove the `authorized_keys` line (kills access
  instantly); rotate the knock sequence in `knockd.conf` + Vault.
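
The key-revocation step can be sketched as a small helper. This is a hypothetical illustration rather than the runbook procedure: the name `revoke_key` is made up, and it assumes the key's comment is a single token such as `breakglass-YYYYMMDD` in the last field of its `authorized_keys` line.

```bash
# Hypothetical helper: drop every authorized_keys entry whose last field
# (the key comment) equals TAG, keeping a .bak copy of the original file.
revoke_key() {
  local file="$1" tag="$2" tmp rc
  tmp="$(mktemp)" || return 1
  cp -p "$file" "$file.bak" || return 1
  # Keep every line whose last field differs from the tag, then write the
  # result back in place so the file keeps its owner and permissions.
  awk -v tag="$tag" '$NF != tag' "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

A call such as `revoke_key /root/.ssh/authorized_keys breakglass-20260530` would remove only the entry carrying that comment and leave other keys untouched.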

## 9. Out of scope
- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier.
- Phone access (would need an SSH **app**, e.g. Termius — outside the
  "pre-installed Linux/macOS" constraint; laptop is the target).

## 10. Docs to update on implementation
- `docs/architecture/vpn.md` — add a "Break-glass SSH" section.
- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md` —
  record the deliberate knock-gated exception to "no public PVE sshd".
- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure.

395
docs/plans/2026-05-30-breakglass-ssh-access-plan.md
Normal file
@ -0,0 +1,395 @@
# Break-Glass SSH Access — Implementation Plan

> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
> for history. As-built: `docs/runbooks/breakglass-ssh.md`.

> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.

**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.

**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.

**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).

**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.

---

## Pre-flight (read before starting)

- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 4 gates on Phase 4-pre verification).
- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.

---

## Phase 0 — Generate secrets (no live changes)
|
||||
|
||||
### Task 0.1: Break-glass SSH keypair
|
||||
|
||||
**Files:** none in repo (secrets → Vault).
|
||||
|
||||
- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
|
||||
|
||||
```bash
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
```
|
||||
|
||||
- [ ] **Step 2: Store the private key + public key in Vault**
|
||||
|
||||
```bash
vault kv patch secret/viktor \
  breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
  breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
```
|
||||
|
||||
- [ ] **Step 3: Verify the keys are retrievable**
|
||||
|
||||
```bash
vault kv get -field=breakglass_ssh_pubkey secret/viktor
```
|
||||
Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
|
||||
|
||||
### Task 0.2: Knock sequence
|
||||
|
||||
- [ ] **Step 1: Generate 3 random UDP knock ports**
|
||||
|
||||
```bash
KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
```
|
||||
|
||||
- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
|
||||
|
||||
```bash
vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
vault kv get -field=breakglass_knock_sequence secret/viktor
```
|
||||
Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
|
||||
|
||||
> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
|
||||
|
||||
### Task 1.1: Pre-checks (no changes yet)
|
||||
|
||||
- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
|
||||
|
||||
From your laptop, confirm your *existing* admin key works (the break-glass key is authorized later, in Task 1.2):
|
||||
```bash
ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
```
|
||||
Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
|
||||
|
||||
- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
```
|
||||
Expected: note whether `Status: enabled/running`. If **enabled**, add the Phase-1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
|
||||
|
||||
### Task 1.2: Authorize the break-glass key
|
||||
|
||||
- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
|
||||
|
||||
```bash
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
```
|
||||
|
||||
- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
|
||||
|
||||
```bash
ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
```
|
||||
Expected: `BREAKGLASS_KEY_OK`.
|
||||
|
||||
### Task 1.3: sshd dual-port + key-only
|
||||
|
||||
**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
|
||||
|
||||
- [ ] **Step 1: Write the sshd drop-in**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
EOF
```
|
||||
|
||||
- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
```
|
||||
Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
|
||||
|
||||
- [ ] **Step 3: Reload sshd (current session stays alive)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
```
|
||||
Expected: `RELOADED`.
|
||||
|
||||
- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
|
||||
|
||||
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22'
ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
```
|
||||
Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
|
||||
|
||||
### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
|
||||
|
||||
**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
|
||||
|
||||
- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
# established/related always allowed
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
EOF
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
```
|
||||
|
||||
- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
[Unit]
Description=Break-glass base firewall (SSH knock gate)
After=network-pre.target
Before=knockd.service
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
```
|
||||
Expected: `FW_APPLIED`.
|
||||
|
||||
- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
|
||||
|
||||
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22'   # works
nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED"   # closed pre-knock
```
|
||||
Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
|
||||
|
||||
### Task 1.5: knockd
|
||||
|
||||
**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
|
||||
|
||||
- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
```
|
||||
Expected: `KNOCKD_INSTALLED`.
|
||||
|
||||
- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
|
||||
|
||||
```bash
KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)"   # e.g. 28411,49027,33180
read K1 K2 K3 <<<"$(echo "$KNOCK" | tr ',' ' ')"
ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
[options]
UseSyslog
Interface = vmbr0

[breakglass]
sequence = ${K1}:udp,${K2}:udp,${K3}:udp
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
EOF
```
|
||||
|
||||
- [ ] **Step 3: Enable + start knockd**
|
||||
|
||||
```bash
# sed exits 0 even when no line matches, so test for the setting first and append it if absent
ssh root@192.168.1.127 "grep -q '^START_KNOCKD=' /etc/default/knockd 2>/dev/null && sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd || echo 'START_KNOCKD=1' >> /etc/default/knockd"
ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
```
|
||||
Expected: `active`.
|
||||
|
||||
### Task 1.6: fail2ban (defense-in-depth)
|
||||
|
||||
- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
|
||||
|
||||
```bash
ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
```
|
||||
Expected: `F2B_OK` (sshd jail active).
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
|
||||
|
||||
> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
|
||||
|
||||
- [ ] **Step 1: Add the SSH break-glass forward**
|
||||
- Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
|
||||
|
||||
- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
|
||||
- For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
|
||||
|
||||
- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
|
||||
|
||||
After Phase 3 connects once, on the host check the observed source:
|
||||
```bash
ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
```
|
||||
If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
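
The SNAT check above comes down to whether the address knockd logged is an RFC 1918 private address or a public one. A minimal sketch of that classification follows; the function name `classify_knock_source` is illustrative and not part of knockd, and the sample addresses are the router and WAN IPs mentioned in this plan.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the source address seen in the knockd log.
# "private" (RFC 1918) suggests the edge router SNATs the forward;
# "public" suggests the original client address is preserved.
classify_knock_source() {
  local ip="$1"
  case "$ip" in
    10.*|192.168.*) echo private ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) echo private ;;
    *) echo public ;;
  esac
}

classify_knock_source 192.168.1.1    # prints: private
classify_knock_source 176.12.22.76   # prints: public
```

Feeding it the address from the `journalctl` output gives the answer to record in the runbook.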
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Client config (laptop, no live infra change)
|
||||
|
||||
**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
|
||||
|
||||
- [ ] **Step 1: Add the SSH host block**
|
||||
|
||||
```bash
cat >> ~/.ssh/config <<'EOF'

Host breakglass
    HostName viktorbarzin.ddns.net
    Port 52222
    User root
    IdentityFile ~/.ssh/breakglass_ed25519
EOF
```
|
||||
(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
|
||||
|
||||
- [ ] **Step 2: Add the knock+connect function**
|
||||
|
||||
```bash
cat >> ~/.zshrc <<'EOF'

bg() {
  local host="viktorbarzin.ddns.net"
  local seq; seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
  # Send each knock from a bash child: /dev/udp is a bash redirection, and an fd
  # opened inside a ( subshell ) would not be visible to a later echo.
  for p in $(echo "$seq" | tr ',' ' '); do bash -c 'echo x > "/dev/udp/$1/$2"' _ "$host" "$p" 2>/dev/null; sleep 0.4; done
  sleep 0.5
  ssh breakglass "$@"
}
EOF
```
|
||||
> Note: `/dev/udp` redirection is a bash feature (`/bin/bash` on macOS + Linux); zsh does not implement it. Under zsh, send the knock from a bash child (`bash -c 'echo x > /dev/udp/HOST/PORT'`) or use `nc -u -w1 $host $p </dev/null`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 5)
|
||||
|
||||
> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
|
||||
|
||||
- [ ] **Step 1: Without knocking, the port is silent**
|
||||
|
||||
```bash
nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
```
|
||||
Expected: `SILENT_OK`.
|
||||
|
||||
- [ ] **Step 2: Knock + connect succeeds**
|
||||
|
||||
```bash
bg 'hostname; echo BREAKGLASS_E2E_OK'
```
|
||||
Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
|
||||
|
||||
- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
|
||||
|
||||
```bash
ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
```
|
||||
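Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing). `ssh -J` does not knock, so run each command within the 30 s window after a `bg` knock, or knock again first.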
|
||||
|
||||
- [ ] **Step 4: LAN admin unaffected**
|
||||
|
||||
From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
|
||||
|
||||
**GATE:** Only proceed to Phase 5 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
|
||||
|
||||
> AX6000 UI. One pass, all three changes.
|
||||
|
||||
- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
|
||||
- Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
|
||||
|
||||
- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
|
||||
- Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
|
||||
|
||||
- [ ] **Step 3: Disable UPnP**
|
||||
- Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
|
||||
|
||||
- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
|
||||
|
||||
From an external network:
|
||||
```bash
nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
bg 'echo BREAKGLASS_STILL_OK'
```
|
||||
Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — Docs + commit (AFTER infra repo is clean)
|
||||
|
||||
- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
|
||||
- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
|
||||
- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
|
||||
- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
|
||||
|
||||
```bash
git -C /home/wizard/code/infra add \
  docs/plans/2026-05-30-breakglass-ssh-access-design.md \
  docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
  docs/architecture/vpn.md docs/architecture/security.md \
  docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
git -C /home/wizard/code/infra push origin master
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review
|
||||
|
||||
- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
|
||||
- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
|
||||
- **Consistency:** port `52222`, knock from the `breakglass_knock_sequence` field of `secret/viktor`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
|
||||
- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).
|
||||
73
docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
Normal file
|
|
@ -0,0 +1,73 @@
|
|||
# Break-glass SSH — Redesign
|
||||
|
||||
- **Date**: 2026-06-11
|
||||
- **Status**: Implemented
|
||||
- **Owner**: Viktor
|
||||
- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
|
||||
- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
|
||||
|
||||
## Why redesign
|
||||
|
||||
The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
|
||||
**port-knock** (knockd). It caused a real lockout, for a structural reason:
|
||||
|
||||
- The knock sequence was 3 random ports stored **only** in Vault, and the client
|
||||
helper fetched it from Vault at connect time.
|
||||
- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
|
||||
exact scenario break-glass exists for — away from home, cluster/tunnels down —
|
||||
the knock sequence is unreachable and unmemorable. Circular dependency.
|
||||
|
||||
The knock's only benefit was hiding an already brute-force-proof port; its cost
|
||||
was that fragility. For a *recovery* path, robustness beats stealth.
|
||||
|
||||
## Decision
|
||||
|
||||
**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
|
||||
Hardened with: the exposed port trusts only a dedicated break-glass key
|
||||
(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
|
||||
and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
|
||||
up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
|
||||
"pfSense wedged" / "host down" tiers are explicitly out of scope).
|
||||
|
||||
Alternatives considered and rejected: keeping the knock (fragile, circular);
|
||||
Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
|
||||
dependency Headscale is self-hosted to avoid, and the user preferred a
|
||||
self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
|
||||
self-contained SSH path was preferred).
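
The `Match LocalPort` isolation named in the Decision can be pictured roughly as below. This is a sketch under assumptions, not the shipped file: the real drop-in is `scripts/sshd-10-breakglass.conf`, which may differ, and the directory in the `AuthorizedKeysFile` path is a guess (only the file name `authorized_keys.breakglass` appears in this document). The script writes to a scratch file rather than to `/etc/ssh`.

```bash
#!/usr/bin/env bash
set -euo pipefail
# Sketch only: writes to a scratch file, NOT to /etc/ssh/sshd_config.d/.
out="${1:-$(mktemp)}"
cat > "$out" <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

# On the WAN-facing port, trust only the dedicated break-glass key file.
# (Path is an assumption for illustration.)
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    MaxAuthTries 3

# "Match all" closes the conditional block so later directives are global again.
Match all
EOF
echo "$out"
```

Without the closing `Match all`, any directive appended after the block would silently apply only to `:52222`, which is the reset the Verification section checks for on `:22`.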
|
||||
|
||||
## Components
|
||||
|
||||
| Layer | Change | Source of truth |
|
||||
|---|---|---|
|
||||
| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
|
||||
| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
|
||||
| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
|
||||
| knockd | **removed** (package purged, config deleted) | — |
|
||||
| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
|
||||
| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
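
For the host-firewall row, a per-source rate limit with LAN bypass might look like the rules printed below. This is a hedged sketch, not the contents of `scripts/breakglass-firewall.sh`: the rate, burst, LAN CIDR and the bucket name `bg-ssh` are assumptions chosen for illustration, and the function only prints the commands instead of applying them.

```bash
#!/usr/bin/env bash
# Sketch: print (do not apply) a BREAKGLASS chain with per-source rate limiting.
# All default values are illustrative assumptions.
emit_breakglass_rules() {
  local port="${1:-52222}" lan="${2:-192.168.1.0/24}" rate="${3:-6/minute}" burst="${4:-3}"
  cat <<EOF
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A BREAKGLASS -s ${lan} -p tcp --dport ${port} -j ACCEPT
iptables -A BREAKGLASS -p tcp --dport ${port} -m conntrack --ctstate NEW -m hashlimit --hashlimit-name bg-ssh --hashlimit-mode srcip --hashlimit-above ${rate} --hashlimit-burst ${burst} -j DROP
iptables -A BREAKGLASS -p tcp --dport ${port} -j ACCEPT
EOF
}

emit_breakglass_rules
```

The ordering carries the design: established traffic and the LAN are accepted first, new connections above the per-source rate are dropped, and everything else on the port reaches sshd, where key-only auth and fail2ban take over.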
|
||||
|
||||
## Edge-router constraints discovered (TP-Link AX6000)
|
||||
|
||||
- **No port remapping** — external port must equal internal port (rejects e.g.
|
||||
`22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
|
||||
sides.
|
||||
- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
|
||||
(Viktor's initial preference); `:52222` is the landed port.
|
||||
- **Row delete is immediate** (no confirm dialog).
|
||||
|
||||
## Security posture
|
||||
|
||||
- **Brute force: impossible** (key-only, no password).
|
||||
- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
|
||||
- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
|
||||
fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
|
||||
`authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
|
||||
snoopy execve to Loki).
|
||||
|
||||
## Verification (2026-06-11)
|
||||
|
||||
- `:52222` reachable; break-glass key authenticates (`root@pve`).
|
||||
- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
|
||||
- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
|
||||
- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
|
||||
- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.
|
||||
|
|
@ -0,0 +1,76 @@
|
|||
# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
|
||||
|
||||
**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
|
||||
ingresses and every OIDC app) degraded/unavailable for ~50 minutes
|
||||
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
|
||||
prompts during outpost-check failures. The shared CNPG primary failed over
|
||||
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
|
||||
tenant.
|
||||
|
||||
**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
|
||||
signin speedup work — env tuning, outpost config, static-asset ingress).
|
||||
|
||||
## Root causes (three stacked)
|
||||
|
||||
1. **Helm/Keel version split → silent downgrade.** Keel (namespace
|
||||
`keel.sh/enrolled` + diun annotations) had upgraded the live authentik
|
||||
image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
|
||||
appVersion drives the image tag). The values-only apply therefore rolled
|
||||
every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
|
||||
database. Cores never came up healthy (`failed to proxy to backend`, plus
|
||||
Django cross-version serialized-cache warnings), and mid-storm Keel
|
||||
re-upgraded the image, adding a third ReplicaSet to the churn.
|
||||
|
||||
2. **Liveness budget too small for authentik's boot.** The chart-default
|
||||
   liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer
|
||||
passes the startup probe — but during a rolling restart the Python core
|
||||
still waits on authentik's DB **migration advisory lock** (60–120s+ under
|
||||
contention). kubelet kill-looped every booting pod, and each kill increased
|
||||
lock contention for the rest (thundering herd).
|
||||
|
||||
3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
|
||||
server connections `idle in transaction` still **holding the migration
|
||||
advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
|
||||
idle 2+ min). Every subsequent boot serialized behind a dead client.
|
||||
PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
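
Root cause 1 reduces to a tag comparison that can be scripted before an apply. The sketch below assumes a tagged, digest-free image reference; the function names and the registry path in the example are illustrative, while the two version numbers are the ones from this incident.

```bash
#!/usr/bin/env bash
# Sketch: compare the tag of a live image reference with the chart's pinned version.
image_tag() {
  # Strip everything up to the last ':' (assumes a tagged, digest-free reference).
  printf '%s\n' "${1##*:}"
}

check_version_split() {
  local live_image="$1" chart_version="$2" live_tag
  live_tag="$(image_tag "$live_image")"
  if [ "$live_tag" = "$chart_version" ]; then
    echo "OK: live ${live_tag} matches chart pin"
  else
    echo "MISMATCH: live ${live_tag} vs chart pin ${chart_version} (apply would change the version)"
    return 1
  fi
}

# Example with the versions from this incident (image path is an assumption):
check_version_split "ghcr.io/goauthentik/server:2026.2.4" "2026.2.2" || true
```

In practice the first argument would come from the `kubectl get deploy` jsonpath query quoted in the Lessons section.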
|
||||
|
||||
**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
|
||||
every Django thread hold its connection persistently; with PgBouncer in
|
||||
*session* mode each one pins a server connection 1:1, so the restart churn
|
||||
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
|
||||
75 of 108 connections on the new primary). The shared primary's
|
||||
restart/failover at 22:40 fits this storm window.
|
||||
|
||||
## Resolution
|
||||
|
||||
- Scaled workers to 0 (transient) to free pool capacity; rollout converged
|
||||
once, then re-degraded when workers returned.
|
||||
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
|
||||
6×10s/5s) — final state codified in Helm values in the same session.
|
||||
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
|
||||
(twice).
|
||||
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
|
||||
to 3 — converged cleanly (51s boots, zero restarts).
|
||||
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
|
||||
removed, liveness in values, pgbouncer reaper config).
|
||||
|
||||
## Prevention (all landed in this change)
|
||||
|
||||
| Cause | Fix |
|
||||
|---|---|
|
||||
| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
|
||||
| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
|
||||
| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
|
||||
| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
|
||||
|
||||
## Lessons
|
||||
|
||||
- **Check the live image tag against the chart pin before ANY helm-managed
|
||||
apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
|
||||
jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
|
||||
is a version change, not a config change.
|
||||
- A "stuck rollout" of authentik is usually the migration advisory lock:
|
||||
check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
|
||||
holders before blaming probes or resources.
|
||||
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
|
||||
Access path); without it every protected app would have hard-failed.
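
The `pg_locks` / `pg_stat_activity` check from the second lesson could be kept at hand as a small helper. This is a sketch: the function name is hypothetical, and it only prints the query so it can be piped into `psql` against the PostgreSQL primary (not through PgBouncer, where the sessions of interest are PgBouncer's own server connections).

```bash
#!/usr/bin/env bash
# Sketch: print a query listing sessions that sit "idle in transaction"
# while holding a granted advisory lock.
ghost_lock_query() {
  cat <<'SQL'
SELECT a.pid,
       a.usename,
       now() - a.state_change AS idle_for,
       left(a.query, 80)      AS last_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.granted
  AND a.state = 'idle in transaction'
ORDER BY idle_for DESC;
SQL
}

ghost_lock_query
```

Any pid it returns with a long `idle_for` is a candidate for the `pg_terminate_backend()` step used in the Resolution.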
|
||||
116
docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path
|
||||
|
||||
## Impact
|
||||
|
||||
- devvm (VM 102, the shared multi-user Claude Code workstation) effectively
|
||||
dead 15:21–16:48 UTC (18:21–19:48 EEST): all ssh/tmux and t3 sessions for
|
||||
wizard/emo/anca lost, every in-flight agent killed.
|
||||
- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert
|
||||
exists (follow-up below).
|
||||
- Recovery was manual: kill of the wedged QEMU process + `qm start` (the
|
||||
kill left no autopsy — see "What we could not prove").
|
||||
|
||||
## Timeline (UTC; host journal runs EEST = UTC+3)
|
||||
|
||||
- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0
|
||||
throttle via `qm set` (as it had done every hour for weeks — see Root
|
||||
cause #4).
|
||||
- **15:18–15:20** — guest healthy by every metric: CPU 7–16% of 16 vCPUs,
|
||||
load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 2–8%
|
||||
utilized. Heavy claude/bwrap sandbox activity (normal workload).
|
||||
- **15:19:08** — last journal line the guest ever writes (mid normal
|
||||
traffic, zero kernel distress — not even a hung-task warning).
|
||||
- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows
|
||||
`diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even
|
||||
journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP
|
||||
the whole time** — the process and its main loop are alive; only the
|
||||
block path is dead.
|
||||
- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus:
|
||||
processes progressively piling up behind dead storage (dirty-page
|
||||
writeback stuck → direct reclaim spins). Classic starvation cascade, not
|
||||
a panic (a panic halts or spins flat from t=0).
|
||||
- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band
|
||||
(root shell; no PVE task, no snoopy line — shell-builtin `kill`).
|
||||
- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124
|
||||
(wedged boot ran 6.8.0-117).
|
||||
|
||||
## Ruled out (evidence, not vibes)
|
||||
|
||||
- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus)
|
||||
and per-minute host RRD.
|
||||
- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel
|
||||
I/O errors on the host all day; `sdc` quiet through the window.
|
||||
- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash
|
||||
log; 113 of 114 monitored targets stayed up. Only the devvm died.
|
||||
- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at
|
||||
0 while netout ACKs trickle; and the guest kernel logged nothing.
|
||||
|
||||
## Root cause
|
||||
|
||||
**Class pinned, exact line unprovable** (see below): the devvm's disk I/O
|
||||
stalled *inside the QEMU process* — below the guest kernel (all guest I/O
|
||||
froze simultaneously with nothing logged) and above host storage (host
|
||||
clean, neighbors fine, QEMU main loop responsive). Contributing stack,
|
||||
unique to this VM:
|
||||
|
||||
1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy
|
||||
default for OSes without virtio drivers). The devvm was the **only VM
|
||||
on the host** running its disk through this path; every healthy
|
||||
neighbor uses `virtio-scsi-pci`. The LSI model is documented as
|
||||
hang-prone under intensive I/O.
|
||||
2. **No `iothread`** — all disk emulation ran on QEMU's single main event
|
||||
loop, sharing it with timers and QMP.
|
||||
3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose
|
||||
queued I/O completes only when its re-arm timer fires.
|
||||
4. **Hourly live throttle rewrites** — `apply-mbps-caps.sh`'s idempotency
|
||||
check compared raw config strings, but `qm config` prints keys in its
|
||||
own canonical order, so the check **never matched** and the script
|
||||
re-issued `qm set` (→ live QMP `block_set_io_throttle` against the
|
||||
running QEMU) every hour, 24×/day, for weeks — each poke a chance to
|
||||
race the throttle machinery while queued I/O is in flight. The wedge
|
||||
came 20 min after the 15:01 poke.
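
The failure in item 4 is an order-sensitive string compare on an order-insensitive option list. A minimal sketch of the normalized comparison follows; it is not the shipped `apply-mbps-caps.sh` logic, the function names are illustrative, and the volume name and size in the example are made up (only the 60/60 caps come from this document).

```bash
#!/usr/bin/env bash
# Sketch: order-insensitive comparison of comma-separated option strings.
normalize_opts() {
  # Split on commas, sort, re-join: the same canonical form for any key order.
  printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort | paste -sd, -
}

opts_equal() {
  [ "$(normalize_opts "$1")" = "$(normalize_opts "$2")" ]
}

# Hypothetical desired vs. reported config, same options in a different order:
want="local-lvm:vm-102-disk-0,mbps_rd=60,mbps_wr=60,size=100G"
have="local-lvm:vm-102-disk-0,size=100G,mbps_wr=60,mbps_rd=60"
if opts_equal "$want" "$have"; then echo "no-op"; else echo "would run qm set"; fi
```

With a raw string compare these two values differ and `qm set` fires every hour; compared as sets they are equal and the run is a true no-op.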
|
||||
|
||||
## What we could not prove
|
||||
|
||||
Whether the stuck queue was the LSI device model, the throttle-group
|
||||
timer, or their interaction. The discriminating evidence (QMP
|
||||
`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47
|
||||
and was destroyed by the recovery kill. If a wedge recurs, **autopsy before
|
||||
shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`,
|
||||
`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the
|
||||
kvm process pin it to the line.
|
||||
|
||||
## Fixes
|
||||
|
||||
| Status | Fix |
|
||||
|---|---|
|
||||
| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. |
|
||||
| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. |
|
||||
|
||||
## Open follow-ups (discussed 2026-06-11, not yet built)
|
||||
|
||||
- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the
|
||||
90-min detection gap.
|
||||
- Freeze forensics: netconsole → pve listener, serial console,
|
||||
`kernel.panic=60`, and a capture-before-kill runbook (above) so any
|
||||
recurrence is pinned, not mourned.
|
||||
- The recurring *crawl* class (agent storms → swap-thrash; journald
|
||||
watchdog-killed 3× on 2026-06-10) is a separate failure mode —
|
||||
ssh/tmux sessions remain memory-uncontained by explicit decision
|
||||
(swap-only, 2026-06-10).
|
||||
|
||||
## Lessons
|
||||
|
||||
- **A VM can die of QEMU-userspace causes that no guest or host kernel log
|
||||
will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the
|
||||
only witness — `diskwrite=0` with a live QMP socket is the signature.
|
||||
- **"Idempotent" reconcilers must prove idempotency against the system's
|
||||
canonical output format**, not against the string they themselves
|
||||
constructed. A compare that never matches turns a safety net into a
|
||||
24×/day fault injector — and its own journal said `updating scsi0`
|
||||
every hour, in plain sight, for weeks.
|
||||
- The May-26 mbps caps fixed the sdc-saturation freeze class and
|
||||
introduced this one's trigger surface. Layered mitigations fail in
|
||||
layers — audit what a fix *adds*, not only what it removes.
|
||||
- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every
|
||||
cross-machine correlation in this incident initially looked 3h off.
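
For the last point, a small helper makes the offset explicit when lining up host and guest log lines. This is a sketch assuming GNU `date` and a fixed `+0300` offset (EEST, as stated above); the function name `pve_to_utc` and the sample timestamp are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a pve host wall-clock timestamp (EEST, UTC+3)
# to UTC so it can be compared with guest logs. Requires GNU date (-d).
# The offset is hard-coded for summer time; outside DST the host would be UTC+2.
pve_to_utc() {
  date -u -d "$1 +0300" '+%Y-%m-%d %H:%M:%S'
}

pve_to_utc '2026-06-11 12:30:00'   # prints 2026-06-11 09:30:00
```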
|
||||
158
docs/runbooks/breakglass-ssh.md
Normal file
|
|
@ -0,0 +1,158 @@
|
|||
# Runbook: Break-glass SSH
|
||||
|
||||
Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
|
||||
cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
|
||||
**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
|
||||
port-knock design is decommissioned (see "History" below).
|
||||
|
||||
## Model (as built)
|
||||
|
||||
```
your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
                                             │ WAN tcp/52222 ─▶ 192.168.1.127:52222
                                             ▼
                                    Proxmox host 192.168.1.127
                                    sshd :52222 (key-only, break-glass key ONLY)
                                    → full LAN via ssh -J / ssh -D
```
|
||||
|
||||
- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
|
||||
- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
|
||||
dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
|
||||
from root's normal LAN-admin keys, so it is independently revocable and a leak
|
||||
of any other root key does not grant internet access.
|
||||
- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
|
||||
scanner noise only; key-only auth is the real protection.
|
||||
- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
|
||||
a deliberate, documented exception to the Wave-1 "no public-IP access" policy
|
||||
(see `docs/architecture/security.md`), chosen for self-containment: it has **no
|
||||
dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
|
||||
remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
|
||||
|
||||
## Secrets (Vault `secret/viktor`)
|
||||
|
||||
| Key | Use |
|---|---|
| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
|
||||
|
||||
The key has **no passphrase** (so it works in a true cold event without anything
|
||||
to recall). Treat the private key as the sole credential — guard the laptop copy.
|
||||
|
||||
> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
|
||||
> inert; remove it when you have a Vault token with the `patch` capability
|
||||
> (`vault kv patch` / merge-patch — the everyday token lacks it).
|
||||
|
||||
## Connect
|
||||
|
||||
Client `~/.ssh/config`:
|
||||
|
||||
```
Host breakglass
    # HostName follows the dynamic WAN IP
    HostName viktorbarzin.ddns.net
    Port 52222
    User root
    IdentityFile ~/.ssh/breakglass_ed25519
    IdentitiesOnly yes
```
|
||||
|
||||
Then:
|
||||
|
||||
```bash
ssh breakglass                      # shell on the Proxmox host
ssh -J breakglass root@10.0.20.1    # jump to pfSense (or any LAN host)
ssh -D 1080 breakglass              # SOCKS5 → reach any internal IP
```
|
||||
|
||||
There is **no `bg()` knock function** anymore — delete it from your shell rc if
|
||||
you added it under the old design.
|
||||
|
||||
## Cold-event IP cheat sheet (cluster DNS is down)
|
||||
|
||||
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API | `10.0.20.100` |
| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
| edge router | `192.168.1.1` |
|
||||
|
||||
## Deploy / re-provision the host config
|
||||
|
||||
Source of truth lives in `infra/scripts/`. To (re)deploy:
|
||||
|
||||
```bash
# 1. break-glass key authorized for the exposed port
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"

# 2. sshd drop-in (dual-port, Match-isolated); validate before reload (anti-lockout)
scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'

# 3. firewall (rate-limit) + boot unit
scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'

# 4. fail2ban jail
scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
```
|
||||
|
||||
The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`,
ordered after `network-pre.target`) is a manual host unit; recreate it if the
host is rebuilt:
|
||||
|
||||
```ini
[Unit]
Description=Break-glass base firewall (key-only SSH on :52222)
After=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```
|
||||
|
||||
## Edge-router forward (manual — live device, not Terraform)
|
||||
|
||||
TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
|
||||
Forwarding. The break-glass rule:
|
||||
|
||||
| Service Name | Device IP | External Port | Internal Port | Protocol |
|---|---|---|---|---|
| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
|
||||
|
||||
**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
|
||||
- **External port must equal internal port.** The firmware rejects any remap
|
||||
(e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
|
||||
ones."* Hence ext==int 52222.
|
||||
- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
|
||||
- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
|
||||
removes the rule and toasts "Operation succeeded".
|
||||
- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
|
||||
Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
|
||||
`RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
|
||||
identity-guarded delete). Router password: Vault
|
||||
`secret/viktor/edge_router_192_168_1_1_password`.
|
||||
|
||||
## Rotate / revoke
|
||||
|
||||
- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
|
||||
- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
|
||||
`vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`,
|
||||
redeploy step 1 above.
|
||||
- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
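
The revoke step can be scripted so it removes exactly one key and leaves the file's mode intact. This is a minimal sketch, meant to run on the Proxmox host; the function name `revoke_key` is hypothetical, and it matches on the base64 key blob (the second field of the public key line) rather than on the comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: drop every line containing the given key blob from an
# authorized_keys-style file. Writes back in place so mode/owner are preserved.
revoke_key() {
  local file="$1" blob="$2" tmp
  [ -n "$blob" ] || { echo "refusing to match an empty blob" >&2; return 1; }
  tmp="$(mktemp)" || return 1
  # grep -v exits 1 when nothing is left to print; that is the expected
  # outcome when the last key is revoked, so only exit codes >1 are errors.
  grep -vF -- "$blob" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Called as `revoke_key /root/.ssh/authorized_keys.breakglass "<key blob>"`, it empties the file when the break-glass key is the only entry, which closes `:52222` to that key without touching sshd.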
|
||||
|
||||
## History
|
||||
|
||||
- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
|
||||
**UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
|
||||
security (the SSH key already makes the port brute-force-proof) and its only
|
||||
benefit — hiding the port — came at the cost of a **circular dependency**: the
|
||||
knock sequence lived only in in-cluster Vault, unreachable in the exact
|
||||
cold/away scenario break-glass exists for. That caused a real lockout. The
|
||||
knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
|
||||
were removed.
|
||||
|
|
@ -35,6 +35,41 @@ Attribution table:
|
|||
|
||||
Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.
|
||||
|
||||
## 1b. Connection logs in Loki (passive, always-on — catch a real drop)
|
||||
|
||||
Three layers of the real path log every t3 `/ws` connection to Loki, so a drop
|
||||
the user actually experienced is attributable after the fact without a repro. A
|
||||
drop is **a short-lived `/ws` connection** (a healthy session holds one socket
|
||||
for hours); the client's 20s heartbeat watchdog reconnects on any break.
|
||||
|
||||
| Layer | Loki stream | What it tells you |
|---|---|---|
| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP |
| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) |
| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`**: the discriminator below |
|
||||
|
||||
`cause` on the dispatch `ws close` line:
|
||||
- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down
|
||||
(`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile /
|
||||
network-quality** drop (or CF/tunnel blip); t3-serve was fine.
|
||||
- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF
|
||||
/ refused) → t3-serve stall/restart/OOM.
|
||||
- **`graceful`** — clean close from either side (e.g. the client watchdog's
|
||||
`disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+
|
||||
graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose
|
||||
stall was NOT on devvm → last-mile.
|
||||
|
||||
Triage query (Grafana Explore → Loki) — every short t3 socket in a window:
|
||||
|
||||
```logql
{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
  | regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000
```
|
||||
|
||||
Line the timestamp up against `{job="traefik"}` (duration + edge IP) and
|
||||
`{job="cloudflared"}` (CF-side close) for the same second to localise the layer.
|
||||
devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`.
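
When Loki itself is unreachable, the same filter can be approximated on devvm against raw journal output. This is a sketch only: the function name `short_ws_closes` is hypothetical, and the line layout is assumed from the `regexp` stage above (`dur_ms=<n> cause=<word>` on lines containing `ws close`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print "<dur_ms> <cause>"
# for every "ws close" line shorter than the threshold (default 120000 ms,
# the same cut-off as the LogQL triage query).
short_ws_closes() {
  local max_ms="${1:-120000}"
  grep -F 'ws close' \
    | sed -nE 's/.*dur_ms=([0-9]+) cause=([^[:space:]]+).*/\1 \2/p' \
    | awk -v max="$max_ms" '$1 < max { print $1, $2 }'
}

# Possible use on devvm (not executed here):
#   journalctl -u t3-dispatch.service --since '1 hour ago' -o cat | short_ws_closes
```

Lines that lack both fields are dropped silently, so the output contains only sockets that can be classified by `cause`.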
|
||||
|
||||
## 2. Server-side log recipe (per-event forensics)
|
||||
|
||||
On devvm (timestamps in UTC):
|
||||
|
|
|
|||