Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful

# Conflicts:
#	stacks/tripit/main.tf
Viktor Barzin 2026-06-11 19:53:07 +00:00
commit 6bf216751b
37 changed files with 1774 additions and 86 deletions


@ -40,10 +40,10 @@ graph TB
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
When `auth = "required"`, an unauthenticated request flows:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
3. Authentik checks for valid session cookie
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
5. User authenticates via social provider (Google/GitHub/Facebook)
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
6. Authentik creates session, sets cookie, redirects back to original URL
7. Subsequent requests include session cookie, pass auth check, reach backend
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
### First-time signin performance (2026-06-10)
Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:
- **Single-screen login**: the identification stage carries `password_stage`,
so username+password is one round trip. The separate password-stage binding
was removed from `default-authentication-flow` (required by authentik when
embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
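
A minimal sketch of the kind of check this guard performs, written in shell with `awk` purely for illustration. The real enforcement lives in `scripts/check-ingress-auth-comments.py`, whose implementation is not shown here; the function name `check_auth_comments`, the regular expressions, and the message format are all assumptions.

```bash
#!/usr/bin/env bash
# Illustrative only: NOT the real scripts/check-ingress-auth-comments.py.
# Flags every `auth = "app"` / `auth = "none"` line whose previous line is not
# a `# auth = "<same tier>": <reason>` comment. Returns non-zero on any hit.
check_auth_comments() {
  awk '
    FNR == 1 { prev = "" }                      # do not carry context across files
    /^[ \t]*auth[ \t]*=[ \t]*"(app|none)"/ {
      tier = ($0 ~ /"app"/) ? "app" : "none"
      want = "^[ \t]*#[ \t]*auth[ \t]*=[ \t]*\"" tier "\":[ \t]*[^ \t]"
      if (prev !~ want) {
        printf "%s:%d: auth = \"%s\" lacks a preceding reason comment\n", FILENAME, FNR, tier
        bad = 1
      }
    }
    { prev = $0 }
    END { exit (bad ? 1 : 0) }
  ' "$@"
}
```

Usage would look like `check_auth_comments stacks/tripit/main.tf`; a wrapper such as `scripts/tg` could refuse to continue when the function returns non-zero.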
### Social Login & Invitation Flow


@ -22,9 +22,11 @@ graph TB
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
end
subgraph K8s["Kubernetes Cluster v1.34.2"]
subgraph K8s["Kubernetes Cluster v1.34.8"]
direction TB
subgraph VPA["VPA (Goldilocks - Initial Mode)"]
@ -62,7 +64,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@ -76,8 +78,10 @@ graph TB
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)
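
The totals above can be re-derived with shell arithmetic. This is only a sanity-check sketch; the variable names are illustrative, and the per-VM figures are the ones quoted in the line above (k8s-node1 at 16c/48GB, the master plus five workers at 8c/32GB each).

```bash
#!/usr/bin/env bash
# Sanity check of the cluster totals; figures come from the summary line above.
node1_vcpus=16; node1_ram_gb=48      # k8s-node1
other_vms=6                          # master + 5 workers
other_vcpus=8;  other_ram_gb=32      # each of the remaining VMs

total_vcpus=$(( node1_vcpus + other_vms * other_vcpus ))
total_ram_gb=$(( node1_ram_gb + other_vms * other_ram_gb ))

echo "${total_vcpus} vCPUs, ~${total_ram_gb}GB RAM"
```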
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
@ -97,7 +101,12 @@ graph TB
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> set`, fresh clone) self-heals within the hour. The script compares
> *normalized option sets*, so an unchanged config is a true no-op —
> until 2026-06-11 a raw string compare (defeated by `qm config`'s
> canonical key order) re-issued `qm set` hourly against running VMs,
> live-rewriting QEMU throttle state via QMP (implicated in the devvm
> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
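
The difference between a raw string compare and a normalized option-set compare can be sketched as follows. This is not the script from `infra/scripts/`; the function names are hypothetical, and the `mbps_rd`/`mbps_wr` keys are used only as example throttle options.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: normalize a comma-separated key=value option string
# by sorting its entries, so key order no longer affects equality.
normalize_opts() {
  printf '%s' "$1" | tr ',' '\n' | sed '/^$/d' | LC_ALL=C sort | paste -sd, -
}

# Succeeds when both strings carry the same option set, regardless of order.
opts_equal() {
  [ "$(normalize_opts "$1")" = "$(normalize_opts "$2")" ]
}

desired='mbps_wr=60,mbps_rd=60'
current='mbps_rd=60,mbps_wr=60'      # same options, different key order

if opts_equal "$desired" "$current"; then
  echo "no-op: options already match"
else
  echo "drift: would re-apply options"
fi
```

A raw `[ "$desired" = "$current" ]` would report drift for this pair on every run, which is the failure mode described above.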


@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
**Documented exception (break-glass SSH, 2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster: the one case the "must transit LAN/Headscale" rule cannot serve. Password brute force is impossible (no password auth); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.


@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
@ -47,7 +47,7 @@ graph TB
end
subgraph K8s["Kubernetes Cluster"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -85,8 +85,7 @@ graph TB
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
@ -113,7 +112,7 @@ graph TB
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.
### Block Storage Flow (Proxmox CSI) — NEW


@ -0,0 +1,285 @@
# Break-Glass SSH Access — Design
> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`.
> The port-knock was removed: it added no real security (the SSH key already
> makes the port brute-force-proof) and its knock sequence lived only in
> in-cluster Vault — unreachable in the exact cold/away scenario break-glass
> exists for, which caused a real lockout. Retained for history. As-built:
> `docs/runbooks/breakglass-ssh.md`.
- **Date**: 2026-05-30
- **Status**: Draft — pending user review
- **Owner**: Viktor
- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1)
## 1. Goal
Provide a **cold, brute-force-proof backdoor onto the home LAN from the public
internet** for the case where the Kubernetes cluster and every cluster-hosted
remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster
WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**.
### Hard requirements (from the user)
1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are
down. The path must touch **nothing in the cluster** (no Authentik, Traefik,
Technitium/AdGuard DNS, cloudflared).
2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology,
k8s API, etc.).
3. **No brute force**: no password-guessable surface.
4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard /
Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only.
5. **Minimal effort**, and ideally **honor the locked Wave 1 policy**
(`no public-IP access — … PVE sshd must transit LAN or Headscale`).
## 2. Decision
**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.**
- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box
(`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control)
and it sits directly on the `192.168.1.0/24` segment, so the path **does not
traverse pfSense or the cluster** — it survives a wedged pfSense too, not just
a down cluster.
- SSH is the only externally-usable remote tool **pre-installed on every
Linux/macOS box**, satisfying requirement 4.
- **Key-only auth** (no passwords anywhere) makes password brute force
impossible → requirement 3.
- A **port-knock** keeps the external SSH port **closed/invisible to scanners**
until a knock sequence is sent. This restores the "no standing public service"
property we'd have had with WireGuard and keeps us within the **intent** of the
Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a
**bash `/dev/udp` one-liner** — zero install.
### Alternatives rejected
| Option | Why rejected |
|---|---|
| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. |
| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). |
| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. |
| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. |
| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). |
## 3. Architecture
```
Your laptop (anywhere) — stock ssh + bash, nothing installed
│ (1) UDP knock sequence → bash: echo > /dev/udp/<pub>/<port> (instant, no handshake)
│ (2) ssh -p 52222 root@<pub>
Edge router 192.168.1.1 (the box the stored password unlocks)
│ forwards: UDP <k1>,<k2>,<k3> + TCP 52222 → 192.168.1.127
Proxmox host 192.168.1.127 ← path bypasses pfSense entirely
├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s)
├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only
└─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN
```
**Why it meets "cold + full LAN":** the host is up by definition of the chosen
failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host
you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to
the VLANs via pfSense when pfSense is up) or by using SSH's built-in
`-J`/`-D` — both stock, no install.
## 4. Components
### 4.1 Edge router @ 192.168.1.1 (manual, in the browser)
Add port-forwards (same place the existing `51821` WireGuard forward lives):
- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale)
- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault)
If the router supports a **port range** forward, a single range covering the
knock ports + 52222 is tidier than four rules.
> **Verify (#1 implementation check):** whether `.1` **preserves the source IP**
> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by
> knocking + connecting from an external network and checking `/var/log/auth.log`
> + `knockd` syslog for the observed source IP. The design works either way (see
> §4.3), but it determines knock granularity.
### 4.2 SSH keys & Vault layout
- Mint a **dedicated** break-glass keypair (ed25519), separate from
`secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly
labelled.
- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=`
restriction — break-glass is from-anywhere; the knock + key are the gate).
- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for
re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519`
(chmod 600).
- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out
of git — obscurity value only; see §5).
### 4.3 Proxmox host — sshd hardening
`/etc/ssh/sshd_config.d/10-breakglass.conf`:
```
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password # key-only root (PVE recovery norm)
MaxAuthTries 3
LoginGraceTime 20
```
- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external,
knock-gated)**. Using a dedicated external port (not a DNAT rewrite to 22)
lets the firewall distinguish LAN vs external **regardless of `.1` SNAT
behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate.
- **Default to root key-only** for recovery practicality. *Alternative for
review:* a dedicated `breakglass` sudo user instead of root.
> **Verify (#2):** key login already works for your normal access **before**
> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs
> already use keys, so this is likely already effectively true.)
### 4.4 Host firewall (knock gate)
Default-drop the external SSH port; knockd punches a per-source hole. LAN admin
(`:22`) and established sessions are untouched:
```
# allow established / related
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin + backups: SSH on :22 always allowed
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default — knockd opens it per-source
iptables -A INPUT -p tcp --dport 52222 -j DROP
```
- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables
drops them — the knock ports stay **silent/closed** to scanners.
- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is
enabled. If it is, express these rules through it (or a dedicated chain) so a
pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs
often have it off at datacenter level.
### 4.5 knockd
`apt install knockd` (Debian/PVE). `/etc/knockd.conf`:
```
[options]
UseSyslog
Interface = vmbr0 # the 192.168.1.127 interface
[breakglass]
sequence = <k1>:udp,<k2>:udp,<k3>:udp # real ports from Vault
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
```
- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang
on the client (a TCP knock to a dropped port would block until timeout).
- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session
established within that window **persists** via conntrack ESTABLISHED after the
rule is removed. Enable + start the `knockd` service.
### 4.6 fail2ban (defense-in-depth)
`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures).
Local to the host, **no cluster dependency**. Catches anything that gets past the
knock to the sshd listener.
### 4.7 Client side (laptop — stock tools only)
`~/.ssh/config`:
```
Host breakglass
HostName <public-ip-or-dyndns>
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
```
Knock + connect — a shell function using **bash builtins only** (works on
macOS `/bin/bash` + Linux; UDP send is instant):
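```bash
# NOTE: a function named `bg` shadows the shell's job-control builtin `bg`
# in any shell session that defines it.
bg() {
local host='<public-ip-or-dyndns>'   # placeholder: substitute the real endpoint
local p
# Fire-and-forget UDP knocks via bash's /dev/udp pseudo-device (bash, not POSIX sh).
for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done
sleep 0.5   # give knockd a moment to insert the ACCEPT rule
ssh breakglass "$@"
}
```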
- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or
`ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080`
to reach any internal IP. From the host shell you already have everything.
- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in
the `Host breakglass` block so plain `ssh breakglass` knocks automatically.
### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down)
Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold
event. Use IPs:
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API server | `10.0.20.100` |
| Synology NAS | `192.168.1.13` |
| Edge router | `192.168.1.1` |
| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` |
## 5. Security analysis
- **Brute force: solved.** No password auth anywhere → password guessing is
impossible; key brute force is cryptographically infeasible.
- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is
default-dropped and the knock ports are pcap-sniffed (never answered), so a
scanner sees a closed/silent host — PVE sshd is **not internet-scannable**,
honouring the spirit of "no public-IP access to PVE sshd".
- **The knock is obscurity, not cryptography.** A port-knock sequence is
plaintext and replayable by a passive on-path observer. **The SSH key is the
real access control** — the knock only removes the standing/scannable surface.
(Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the
knock sequence as a secret-ish convenience, not a second cryptographic factor.
- **Residual risks** (none are brute force):
1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep
PVE patched; short `cmd_timeout`; fail2ban.
2. **Private key theft** → mitigation: key has a passphrase; revoke by removing
the line from `authorized_keys`.
3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared
`192.168.1.1` source — anyone else arriving via `.1` in that window could
reach the sshd banner, but still needs your key. Mitigated by the short
window + key-only + fail2ban.
- **Deliberate, documented exception** to the Wave 1 "no public-IP access"
policy, scoped to this single knock-gated port. To be recorded in
`security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation.
## 6. What's automated vs manual
- **I do**: generate the keypair + knock sequence, store them in Vault, produce
the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client
`~/.ssh/config` + `bg()` function, and write the runbook + doc updates.
- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by
you in the browser (out-of-Terraform, live device). The Proxmox host changes
(sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login
verified first** to avoid lockout; pfSense is **not** touched. None of this is
a `tg apply` — pfSense and the edge router are not Terraform-managed.
## 7. Testing & verification
1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog
shows the sequence + opens `:52222`; SSH succeeds.
2. **Without** knocking: `ssh -p 52222` from external → connection times out (the
`DROP` rule discards packets silently, so nothing is refused). A plain port scan of `52222` + the knock ports → silent.
3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected.
4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to
an internal IP.
5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity
note accordingly.
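
Checks 1 and 2 can be scripted with a small TCP probe. The sketch below is an assumption-laden helper, not part of the design: the name `probe_tcp` is hypothetical, and it relies on GNU `timeout` (coreutils), which is not stock on macOS. It distinguishes an open port, a silently dropped one (probe times out), and a refused or unreachable one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a TCP port as open / filtered / closed.
# Uses bash's /dev/tcp redirection; requires GNU `timeout` (an assumption).
probe_tcp() {
  local host=$1 port=$2 wait=${3:-3} rc=0
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?
  case $rc in
    0)   echo open ;;       # TCP handshake completed
    124) echo filtered ;;   # no answer before the timeout (e.g. a DROP rule)
    *)   echo closed ;;     # refused, unreachable, or name lookup failed
  esac
}
```

Run from an external network, `probe_tcp <public-ip-or-dyndns> 52222` would be expected to report `filtered` before a knock and `open` shortly after one, assuming the firewall rules from §4.4 are in place.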
## 8. Failure modes & rotation
- **Proxmox host down** (not just cluster): this path is gone — that's the
out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**.
- **`.1` router config reset**: forwards lost → re-add from this doc; consider
exporting the `.1` config for backup.
- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it
auto-follows; keep the raw IP as fallback.
- **Key/knock compromise**: remove the `authorized_keys` line (kills access
instantly); rotate the knock sequence in `knockd.conf` + Vault.
## 9. Out of scope
- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier.
- Phone access (would need an SSH **app**, e.g. Termius — outside the
"pre-installed Linux/macOS" constraint; laptop is the target).
## 10. Docs to update on implementation
- `docs/architecture/vpn.md` — add a "Break-glass SSH" section.
- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md`:
record the deliberate knock-gated exception to "no public PVE sshd".
- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure.


@ -0,0 +1,395 @@
# Break-Glass SSH Access — Implementation Plan
> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
> for history. As-built: `docs/runbooks/breakglass-ssh.md`.
> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.
**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.
**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.
**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).
**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.
---
## Pre-flight (read before starting)
- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 4 gates on Phase 4-pre verification).
- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.
---
## Phase 0 — Generate secrets (no live changes)
### Task 0.1: Break-glass SSH keypair
**Files:** none in repo (secrets → Vault).
- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
```bash
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
```
- [ ] **Step 2: Store the private key + public key in Vault**
```bash
vault kv patch secret/viktor \
breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
```
- [ ] **Step 3: Verify the keys are retrievable**
```bash
vault kv get -field=breakglass_ssh_pubkey secret/viktor
```
Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
### Task 0.2: Knock sequence
- [ ] **Step 1: Generate 3 random UDP knock ports**
```bash
KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
```
- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
```bash
vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
vault kv get -field=breakglass_knock_sequence secret/viktor
```
Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
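
The comma-separated value stored in Vault is not yet in the `port:udp` form that the `sequence =` line of `knockd.conf` expects (see the design doc). A minimal conversion sketch, with a hypothetical function name `knock_to_knockd` and basic validation added as an assumption:

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn "28411,49027,33180" into
# "28411:udp,49027:udp,33180:udp" for knockd.conf's `sequence =` line.
knock_to_knockd() {
  local seq=$1 out='' p
  local IFS=,
  for p in $seq; do
    # reject anything that is not a plain decimal number
    case $p in
      ''|*[!0-9]*) echo "invalid port: '$p'" >&2; return 1 ;;
    esac
    # force base 10 so a leading zero is not parsed as octal
    (( 10#$p >= 1 && 10#$p <= 65535 )) || { echo "port out of range: $p" >&2; return 1; }
    out+="${out:+,}${p}:udp"
  done
  printf '%s\n' "$out"
}
```

For example, `knock_to_knockd "$KNOCK"` could feed a templated `knockd.conf` without the sequence ever being written to git.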
---
## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
### Task 1.1: Pre-checks (no changes yet)
- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
The break-glass key is authorized later (Task 1.2); for now, from your laptop, confirm your *existing* admin key works:
```bash
ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
```
Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
```bash
ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
```
Expected: note whether `Status: enabled/running`. If **enabled**, add the Phase-1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
### Task 1.2: Authorize the break-glass key
- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
```bash
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
```
- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
```
Expected: `BREAKGLASS_KEY_OK`.
### Task 1.3: sshd dual-port + key-only
**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
- [ ] **Step 1: Write the sshd drop-in**
```bash
ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
EOF
```
- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
```bash
ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
```
Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
- [ ] **Step 3: Reload sshd (current session stays alive)**
```bash
ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
```
Expected: `RELOADED`.
- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22'
ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
```
Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
```bash
ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
# established/related always allowed
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
EOF
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
```
- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
```bash
ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
[Unit]
Description=Break-glass base firewall (SSH knock gate)
After=network-pre.target
Before=knockd.service
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
```
Expected: `FW_APPLIED`.
- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works
nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock
```
Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
### Task 1.5: knockd
**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
```bash
ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
```
Expected: `KNOCKD_INSTALLED`.
- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
```bash
KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180
IFS=, read -r K1 K2 K3 <<<"$KNOCK"   # split the comma-separated sequence into three ports
ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
[options]
UseSyslog
Interface = vmbr0
[breakglass]
sequence = ${K1}:udp,${K2}:udp,${K3}:udp
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
EOF
```
- [ ] **Step 3: Enable + start knockd**
```bash
ssh root@192.168.1.127 "sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd 2>/dev/null || echo 'START_KNOCKD=1' >> /etc/default/knockd"
ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
```
Expected: `active`.
### Task 1.6: fail2ban (defense-in-depth)
- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
```bash
ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
```
Expected: `F2B_OK` (sshd jail active).
---
## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
- [ ] **Step 1: Add the SSH break-glass forward**
- Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
- For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
After Phase 3 connects once, on the host check the observed source:
```bash
ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
```
If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
---
## Phase 3 — Client config (laptop, no live infra change)
**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
- [ ] **Step 1: Add the SSH host block**
```bash
cat >> ~/.ssh/config <<'EOF'
Host breakglass
HostName viktorbarzin.ddns.net
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
EOF
```
(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
- [ ] **Step 2: Add the knock+connect function**
```bash
cat >> ~/.zshrc <<'EOF'
bg() {
  local host="viktorbarzin.ddns.net"
  local seq p
  seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
  # Send one UDP datagram per knock port. nc is used instead of bash's
  # /dev/udp so the function behaves the same under zsh and bash.
  for p in $(printf '%s' "$seq" | tr ',' ' '); do
    printf 'x' | nc -u -w1 "$host" "$p"
    sleep 0.4
  done
  sleep 0.5
  ssh breakglass "$@"
}
EOF
```
> Note: `/dev/udp/<host>/<port>` redirection is a bash feature that zsh does not implement, so the function uses `nc -u` and works in either shell. Each knock needs a payload (`printf 'x'`); with stdin redirected from `/dev/null`, `nc -u` typically sends no datagram at all.
---
## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 4)
> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
- [ ] **Step 1: Without knocking, the port is silent**
```bash
nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
```
Expected: `SILENT_OK`.
- [ ] **Step 2: Knock + connect succeeds**
```bash
bg 'hostname; echo BREAKGLASS_E2E_OK'
```
Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
```bash
ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
```
Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
- [ ] **Step 4: LAN admin unaffected**
From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
**GATE:** Only proceed to Phase 4 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
---
## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
> AX6000 UI. One pass, all three changes.
- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
- Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
- Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
- [ ] **Step 3: Disable UPnP**
- Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
From an external network:
```bash
nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
bg 'echo BREAKGLASS_STILL_OK'
```
Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
---
## Phase 6 — Docs + commit (AFTER infra repo is clean)
- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
```bash
git -C /home/wizard/code/infra add \
docs/plans/2026-05-30-breakglass-ssh-access-design.md \
docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
docs/architecture/vpn.md docs/architecture/security.md \
docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
git -C /home/wizard/code/infra push origin master
```
---
## Self-review
- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).

@ -0,0 +1,73 @@
# Break-glass SSH — Redesign
- **Date**: 2026-06-11
- **Status**: Implemented
- **Owner**: Viktor
- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
## Why redesign
The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
**port-knock** (knockd). It caused a real lockout, for a structural reason:
- The knock sequence was 3 random ports stored **only** in Vault, and the client
helper fetched it from Vault at connect time.
- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
exact scenario break-glass exists for — away from home, cluster/tunnels down —
the knock sequence is unreachable and unmemorable. Circular dependency.
The knock's only benefit was hiding an already brute-force-proof port; its cost
was that fragility. For a *recovery* path, robustness beats stealth.
## Decision
**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
Hardened with: the exposed port trusts only a dedicated break-glass key
(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
"pfSense wedged" / "host down" tiers are explicitly out of scope).
Alternatives considered and rejected: keeping the knock (fragile, circular);
Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
dependency Headscale is self-hosted to avoid, and the user preferred a
self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
self-contained SSH path was preferred).
## Components
| Layer | Change | Source of truth |
|---|---|---|
| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
| knockd | **removed** (package purged, config deleted) | — |
| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
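As a rough illustration of the `Match LocalPort` isolation described in the sshd row, the drop-in could look like the sketch below. The as-built file is `scripts/sshd-10-breakglass.conf`, which is not reproduced here, so the exact directive values and the `write_breakglass_dropin` helper name are assumptions.

```bash
# Hypothetical sketch: emit an sshd drop-in that isolates :52222 to one key file.
# The real scripts/sshd-10-breakglass.conf may differ in detail.
write_breakglass_dropin() {
  local out="$1"
  cat > "$out" <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

# Connections arriving on the WAN-forwarded port trust only the break-glass key.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    MaxAuthTries 3

# "Match all" closes the block above so later settings apply globally again.
Match all
EOF
}
```

Writing to a scratch path first (for example `write_breakglass_dropin /tmp/10-breakglass.conf`) lets the file be reviewed before it is copied to the host and checked with `sshd -t`.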
## Edge-router constraints discovered (TP-Link AX6000)
- **No port remapping** — external port must equal internal port (rejects e.g.
`22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
sides.
- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
(Viktor's initial preference); `:52222` is the landed port.
- **Row delete is immediate** (no confirm dialog).
## Security posture
- **Password brute force: not possible** (key-only; password authentication is disabled).
- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
`authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
snoopy execve to Loki).
## Verification (2026-06-11)
- `:52222` reachable; break-glass key authenticates (`root@pve`).
- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.

@ -0,0 +1,76 @@
# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
ingresses and every OIDC app) degraded/unavailable for ~50 minutes
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
prompts during outpost-check failures. The shared CNPG primary failed over
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
tenant.
**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
signin speedup work — env tuning, outpost config, static-asset ingress).
## Root causes (three stacked)
1. **Helm/Keel version split → silent downgrade.** Keel (namespace
`keel.sh/enrolled` + diun annotations) had upgraded the live authentik
image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
appVersion drives the image tag). The values-only apply therefore rolled
every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
database. Cores never came up healthy (`failed to proxy to backend`, plus
Django cross-version serialized-cache warnings), and mid-storm Keel
re-upgraded the image, adding a third ReplicaSet to the churn.
2. **Liveness budget too small for authentik's boot.** The chart-default
liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer
passes the startup probe — but during a rolling restart the Python core
still waits on authentik's DB **migration advisory lock** (60–120s+ under
contention). kubelet kill-looped every booting pod, and each kill increased
lock contention for the rest (thundering herd).
3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
server connections `idle in transaction` still **holding the migration
advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
idle 2+ min). Every subsequent boot serialized behind a dead client.
PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
every Django thread hold its connection persistently; with PgBouncer in
*session* mode each one pins a server connection 1:1, so the restart churn
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
75 of 108 connections on the new primary). The shared primary's
restart/failover at 22:40 fits this storm window.
## Resolution
- Scaled workers to 0 (transient) to free pool capacity; rollout converged
once, then re-degraded when workers returned.
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
6×10s/5s) — final state codified in Helm values in the same session.
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
(twice).
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
to 3 — converged cleanly (51s boots, zero restarts).
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
removed, liveness in values, pgbouncer reaper config).
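The ghost lock holders terminated above can be listed with a query along the following lines. This is a sketch only: the column names come from the standard `pg_locks` and `pg_stat_activity` views, while the `ghost_lock_query` helper name and the database name in the usage note are assumptions.

```bash
# Sketch: print a query that lists sessions holding a granted advisory lock
# while sitting "idle in transaction" (the ghost-holder signature).
ghost_lock_query() {
  cat <<'SQL'
SELECT a.pid,
       a.usename,
       a.state,
       now() - a.state_change AS idle_for,
       left(a.query, 80)      AS last_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.granted
  AND a.state = 'idle in transaction'
ORDER BY a.state_change;
SQL
}
```

Piping it into `psql` on the primary (for example `ghost_lock_query | psql -d authentik`, database name assumed) shows the candidate PIDs; each can then be passed to `pg_terminate_backend()` once confirmed stale.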
## Prevention (all landed in this change)
| Cause | Fix |
|---|---|
| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~12ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
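The probe figures in the table reduce to simple arithmetic. As an approximation (ignoring per-probe timeouts), kubelet acts after `failureThreshold × periodSeconds` of consecutive failures; the `probe_budget` helper below is purely illustrative.

```bash
# Approximate seconds of consecutive probe failures tolerated before kubelet
# acts: failureThreshold x periodSeconds (probe timeouts add a little on top).
probe_budget() {
  local failure_threshold="$1" period_seconds="$2"
  echo $(( failure_threshold * period_seconds ))
}

probe_budget 3 10    # chart-default liveness -> 30
probe_budget 6 10    # liveness after the fix -> 60
probe_budget 60 10   # startup probe bound    -> 600
```

The 30s default sits well below the 60–120s advisory-lock wait seen during the incident, which is why every booting pod was killed; the 60s budget plus the 600s startup bound gives the core room to finish.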
## Lessons
- **Check the live image tag against the chart pin before ANY helm-managed
apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
is a version change, not a config change.
- A "stuck rollout" of authentik is usually the migration advisory lock:
check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
holders before blaming probes or resources.
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
Access path); without it every protected app would have hard-failed.
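The first lesson can be turned into a pre-apply check. The sketch below compares the tag of a live image reference against the pinned tag; the helper names are hypothetical, and the image reference in the comment is only an example.

```bash
# Extract the tag from an image reference such as
# ghcr.io/goauthentik/server:2026.2.4 (a trailing @sha256 digest is dropped).
image_tag() {
  local ref="${1%%@*}"        # strip digest, if any
  local last="${ref##*/}"     # drop registry/repository path
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;     # untagged references default to latest
  esac
}

# Return 0 when the live tag matches the pin, 1 (with a warning) otherwise.
check_pin() {
  local live pinned="$2"
  live="$(image_tag "$1")"
  if [ "$live" = "$pinned" ]; then
    echo "OK: live tag $live matches pin"
  else
    echo "MISMATCH: live=$live pinned=$pinned (apply would change the version)"
    return 1
  fi
}
```

In practice the first argument would come from the `kubectl get deploy <x> -o jsonpath='{..image}'` call mentioned above, and the second from the chart's appVersion or the pinned `global.image.tag`.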

@ -0,0 +1,116 @@
# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) effectively
dead 15:21–16:48 UTC (18:21–19:48 EEST): all ssh/tmux and t3 sessions for
wizard/emo/anca lost, every in-flight agent killed.
- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert
exists (follow-up below).
- Recovery was manual: kill of the wedged QEMU process + `qm start` (the
kill left no autopsy — see "What we could not prove").
## Timeline (UTC; host journal runs EEST = UTC+3)
- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0
throttle via `qm set` (as it had done every hour for weeks — see Root
cause #4).
- **15:18–15:20** — guest healthy by every metric: CPU 7–16% of 16 vCPUs,
load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 28%
utilized. Heavy claude/bwrap sandbox activity (normal workload).
- **15:19:08** — last journal line the guest ever writes (mid normal
traffic, zero kernel distress — not even a hung-task warning).
- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows
`diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even
journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP
the whole time** — the process and its main loop are alive; only the
block path is dead.
- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus:
processes progressively piling up behind dead storage (dirty-page
writeback stuck → direct reclaim spins). Classic starvation cascade, not
a panic (a panic halts or spins flat from t=0).
- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band
(root shell; no PVE task, no snoopy line — shell-builtin `kill`).
- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124
(wedged boot ran 6.8.0-117).
## Ruled out (evidence, not vibes)
- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus)
and per-minute host RRD.
- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel
I/O errors on the host all day; `sdc` quiet through the window.
- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash
log; 113 of 114 monitored targets stayed up. Only the devvm died.
- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at
0 while netout ACKs trickle; and the guest kernel logged nothing.
## Root cause
**Class pinned, exact line unprovable** (see below): the devvm's disk I/O
stalled *inside the QEMU process* — below the guest kernel (all guest I/O
froze simultaneously with nothing logged) and above host storage (host
clean, neighbors fine, QEMU main loop responsive). Contributing stack,
unique to this VM:
1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy
default for OSes without virtio drivers). The devvm was the **only VM
on the host** running its disk through this path; every healthy
neighbor uses `virtio-scsi-pci`. The LSI model is documented as
hang-prone under intensive I/O.
2. **No `iothread`** — all disk emulation ran on QEMU's single main event
loop, sharing it with timers and QMP.
3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose
queued I/O completes only when its re-arm timer fires.
4. **Hourly live throttle rewrites** — `apply-mbps-caps.sh`'s idempotency
check compared raw config strings, but `qm config` prints keys in its
own canonical order, so the check **never matched** and the script
re-issued `qm set` (→ live QMP `block_set_io_throttle` against the
running QEMU) every hour, 24×/day, for weeks — each poke a chance to
race the throttle machinery while queued I/O is in flight. The wedge
came 20 min after the 15:01 poke.
## What we could not prove
Whether the stuck queue was the LSI device model, the throttle-group
timer, or their interaction. The discriminating evidence (QMP
`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47
and was destroyed by the recovery kill. If a wedge recurs, **autopsy before
shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`,
`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the
kvm process pin it to the line.
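A capture-before-kill helper might start from the sketch below. It only prints the commands (a dry run) so they can be reviewed first; the pidfile path and output directory are assumptions, and `autopsy_plan` is a hypothetical name.

```bash
# Sketch of a capture-before-kill plan for a wedged VM. Prints commands only.
autopsy_plan() {
  local vmid="$1"
  local pidfile="/run/qemu-server/${vmid}.pid"   # assumed PVE pidfile location
  local out="/root/autopsy-${vmid}"              # hypothetical output directory
  echo "mkdir -p ${out}"
  echo "qm config ${vmid} > ${out}/config.txt"
  echo "qm monitor ${vmid}   # interactively: info block, info blockstats, info status"
  echo "gdb -p \$(cat ${pidfile}) -batch -ex 'thread apply all bt' > ${out}/threads.txt"
}
```

Running `autopsy_plan 102` lists the steps in order; the recovery kill comes only after the thread backtrace has been saved.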
## Fixes
| Status | Fix |
|---|---|
| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. |
| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. |
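The normalized-set comparison in the first row can be sketched as follows. This is not the shipped `apply-mbps-caps.sh`; the function names and the sample option strings are hypothetical, and the sketch assumes options are comma-separated as in `qm config` output.

```bash
# Normalize a comma-separated option string into a sorted, canonical form so
# that key order no longer affects the comparison.
normalize_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort | paste -sd, -
}

# Succeeds only when the two option strings differ as sets of options.
needs_update() {
  [ "$(normalize_opts "$1")" != "$(normalize_opts "$2")" ]
}
```

With this shape, a desired string and a live string that differ only in key order compare equal and `qm set` is skipped, while a changed value such as a different `mbps_rd` still triggers an update.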
## Open follow-ups (discussed 2026-06-11, not yet built)
- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the
90-min detection gap.
- Freeze forensics: netconsole → pve listener, serial console,
`kernel.panic=60`, and a capture-before-kill runbook (above) so any
recurrence is pinned, not mourned.
- The recurring *crawl* class (agent storms → swap-thrash; journald
watchdog-killed 3× on 2026-06-10) is a separate failure mode —
ssh/tmux sessions remain memory-uncontained by explicit decision
(swap-only, 2026-06-10).
## Lessons
- **A VM can die of QEMU-userspace causes that no guest or host kernel log
will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the
only witness — `diskwrite=0` with a live QMP socket is the signature.
- **"Idempotent" reconcilers must prove idempotency against the system's
canonical output format**, not against the string they themselves
constructed. A compare that never matches turns a safety net into a
24×/day fault injector — and its own journal said `updating scsi0`
every hour, in plain sight, for weeks.
- The May-26 mbps caps fixed the sdc-saturation freeze class and
introduced this one's trigger surface. Layered mitigations fail in
layers — audit what a fix *adds*, not only what it removes.
- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every
cross-machine correlation in this incident initially looked 3h off.
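For the timezone lesson, a tiny helper avoids doing the offset by hand when lining up host and guest timestamps. It assumes a fixed UTC+3 offset (EEST) and `HH:MM` input; the `eest_to_utc` name is illustrative.

```bash
# Convert an HH:MM wall-clock time from EEST (UTC+3) to UTC.
eest_to_utc() {
  local h="${1%%:*}" m="${1##*:}"
  h="${h#0}"                                   # drop a leading zero ("08" -> "8")
  printf '%02d:%s\n' $(( (h + 24 - 3) % 24 )) "$m"
}

eest_to_utc 18:21   # 15:21, the start of the outage in UTC
```

Times shortly after midnight wrap to the previous UTC day, which the helper does not track; only the clock time is converted.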

@ -0,0 +1,158 @@
# Runbook: Break-glass SSH
Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
port-knock design is decommissioned (see "History" below).
## Model (as built)
```
your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
│ WAN tcp/52222 ─▶ 192.168.1.127:52222
Proxmox host 192.168.1.127
sshd :52222 (key-only, break-glass key ONLY)
→ full LAN via ssh -J / ssh -D
```
- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
from root's normal LAN-admin keys, so it is independently revocable and a leak
of any other root key does not grant internet access.
- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
scanner noise only; key-only auth is the real protection.
- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
a deliberate, documented exception to the Wave-1 "no public-IP access" policy
(see `docs/architecture/security.md`), chosen for self-containment: it has **no
dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
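The per-source rate limit could be expressed roughly as below. This is a sketch, not the as-built `scripts/breakglass-firewall.sh`: the 6/minute rate, the LAN range, and the `breakglass_rules` name are assumptions. By default it only prints the rules (`IPT` defaults to `echo iptables`).

```bash
# Hypothetical sketch of a rate-limited BREAKGLASS chain (dry run by default).
breakglass_rules() {
  local ipt="${IPT:-echo iptables}"
  local lan="${LAN_CIDR:-192.168.1.0/24}"   # assumed LAN range
  # Replies and already-open sessions are never rate-limited.
  $ipt -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
  # LAN bypass: local admin traffic skips the limiter.
  $ipt -A BREAKGLASS -s "$lan" -p tcp -m multiport --dports 22,52222 -j ACCEPT
  # Drop NEW connections from any single source IP exceeding the rate.
  $ipt -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \
       -m hashlimit --hashlimit-name bg-ssh --hashlimit-mode srcip \
       --hashlimit-above 6/minute --hashlimit-burst 6 -j DROP
  $ipt -A BREAKGLASS -p tcp --dport 52222 -j ACCEPT
}
```

Calling `breakglass_rules` prints four `iptables` lines for review; setting `IPT=iptables` on the host would apply them to an existing `BREAKGLASS` chain.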
## Secrets (Vault `secret/viktor`)
| Key | Use |
|---|---|
| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
The key has **no passphrase** (so it works in a true cold event without anything
to recall). Treat the private key as the sole credential — guard the laptop copy.
> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
> inert; remove it when you have a Vault token with the `patch` capability
> (`vault kv patch` / merge-patch — the everyday token lacks it).
## Connect
Client `~/.ssh/config`:
```
Host breakglass
# HostName follows the dynamic WAN IP (router's DDNS name)
HostName viktorbarzin.ddns.net
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
IdentitiesOnly yes
```
Then:
```bash
ssh breakglass # shell on the Proxmox host
ssh -J breakglass root@10.0.20.1 # jump to pfSense (or any LAN host)
ssh -D 1080 breakglass # SOCKS5 → reach any internal IP
```
There is **no `bg()` knock function** anymore — delete it from your shell rc if
you added it under the old design.
## Cold-event IP cheat sheet (cluster DNS is down)
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API | `10.0.20.100` |
| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
| edge router | `192.168.1.1` |
## Deploy / re-provision the host config
Source of truth lives in `infra/scripts/`. To (re)deploy:
```bash
# 1. break-glass key authorized for the exposed port
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"
# 2. sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
# 3. firewall (rate-limit) + boot unit
scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'
# 4. fail2ban jail
scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
```
The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`,
ordered after `network-pre.target`) is a manual host unit; recreate it if the
host is rebuilt:
```ini
[Unit]
Description=Break-glass base firewall (key-only SSH on :52222)
After=network-pre.target
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
## Edge-router forward (manual — live device, not Terraform)
TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
Forwarding. The break-glass rule:
| Service Name | Device IP | External Port | Internal Port | Protocol |
|---|---|---|---|---|
| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
- **External port must equal internal port.** The firmware rejects any remap
(e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
ones."* Hence ext==int 52222.
- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
removes the rule and toasts "Operation succeeded".
- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
`RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
identity-guarded delete). Router password: Vault
`secret/viktor/edge_router_192_168_1_1_password`.
## Rotate / revoke
- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
`vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`,
redeploy step 1 above.
- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
## History
- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
**UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
security (the SSH key already makes the port brute-force-proof) and its only
benefit — hiding the port — came at the cost of a **circular dependency**: the
knock sequence lived only in in-cluster Vault, unreachable in the exact
cold/away scenario break-glass exists for. That caused a real lockout. The
knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
were removed.

@ -35,6 +35,41 @@ Attribution table:
Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.
## 1b. Connection logs in Loki (passive, always-on — catch a real drop)
Three layers of the real path log every t3 `/ws` connection to Loki, so a drop
the user actually experienced is attributable after the fact without a repro. A
drop shows up as **a short-lived `/ws` connection** (a healthy session holds one
socket for hours); the client's 20s heartbeat watchdog reconnects on any break.
| Layer | Loki stream | What it tells you |
|---|---|---|
| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP |
| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) |
| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`** — the discriminator below |
`cause` on the dispatch `ws close` line:
- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down
(`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile /
network-quality** drop (or CF/tunnel blip); t3-serve was fine.
- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF
/ refused) → t3-serve stall/restart/OOM.
- **`graceful`** — clean close from either side (e.g. the client watchdog's
`disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+
graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose
stall was NOT on devvm → last-mile.
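The cause rules above can be applied mechanically to exported log lines. This is a rough sketch under stated assumptions: it reuses the `dur_ms=… cause=…` field pattern from the triage query, treats anything under 120000 ms as short (the same cutoff the query uses), and the sample lines in the usage section are invented for illustration. The real `ws close` line may carry other fields around these two.

```python
import re
from typing import Optional

# Same field pattern as the regexp stage of the triage query.
WS_CLOSE = re.compile(r"dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)")

# Cutoff for a "short" socket, taken from the triage query (an assumption
# about what counts as a drop; tune as needed).
SHORT_MS = 120_000

# One-line reading of each cause, paraphrased from the list above.
HINTS = {
    "downstream_closed": "client/Cloudflare/Traefik closed the socket; t3-serve was fine",
    "upstream_closed": "t3 serve closed or reset; suspect stall, restart or OOM",
    "graceful": "clean close; check devvm pressure to rule out a devvm-side stall",
}


def triage(line: str) -> Optional[dict]:
    """Parse one dispatch log line; return None if it is not a ws close line."""
    if "ws close" not in line:
        return None
    match = WS_CLOSE.search(line)
    if match is None:
        return None
    dur_ms = int(match["dur"])
    cause = match["cause"]
    return {
        "dur_ms": dur_ms,
        "cause": cause,
        "short": dur_ms < SHORT_MS,
        "hint": HINTS.get(cause, "unknown cause"),
    }


if __name__ == "__main__":
    # Hypothetical sample lines, not real log output.
    samples = [
        "ws close dur_ms=21034 cause=graceful",
        "ws close dur_ms=7200000 cause=downstream_closed",
    ]
    for sample in samples:
        print(triage(sample))
```

The `short` flag only narrows the candidates. A short `graceful` close still needs the devvm pressure cross-check before it is called last-mile.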
Triage query (Grafana Explore → Loki) — every short t3 socket in a window:
```logql
{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
| regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000
```
Line the timestamp up against `{job="traefik"}` (duration + edge IP) and
`{job="cloudflared"}` (CF-side close) for the same second to localise the layer.
devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`.
## 2. Server-side log recipe (per-event forensics)
On devvm (timestamps in UTC):