Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful

# Conflicts:
#	stacks/tripit/main.tf
Viktor Barzin 2026-06-11 19:53:07 +00:00
commit 6bf216751b
37 changed files with 1774 additions and 86 deletions


@ -135,7 +135,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
## Database Host
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.
@ -159,7 +159,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77-264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10-13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@ -178,7 +178,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
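The source-IP allowlist above can be sketched as a membership check. This is only an illustration, assuming Python's standard `ipaddress` module; the function name `is_allowed` is hypothetical, and only the two CIDRs spelled out in the text are included, because the pod CIDR, service CIDR and Headscale tailnet ranges are not given here.

```python
import ipaddress

# CIDRs quoted from the allowlist bullet. The K8s pod CIDR, service CIDR and
# Headscale tailnet ranges are not stated in this document, so they are omitted.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.20.0/22"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True if source_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    # Sample addresses: the control plane, the NFS host, and a public resolver.
    for ip in ("10.0.20.100", "192.168.1.127", "8.8.8.8"):
        print(ip, is_allowed(ip))
```

Note that `10.0.20.0/22` spans `10.0.20.0` through `10.0.23.255`, so it is wider than a single /24 VLAN range.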


@ -5,17 +5,26 @@
## Applications (11)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Cloudflare Access | OAuth2/OIDC | implicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | explicit consent |
| Forgejo | OAuth2/OIDC | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Headscale | OAuth2/OIDC | implicit consent |
| Immich | OAuth2/OIDC | implicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| linkwarden | OAuth2/OIDC | implicit consent |
| Vault | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
> and Vault (53) were switched from
> `default-provider-authorization-explicit-consent` via the API (these
> providers are UI-managed, not in TF). All are first-party apps; the
> expiring consent screen (re-shown every 4 weeks per app) only slowed
> first-time signin.
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
@ -60,8 +69,27 @@
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
## Authentication Flow (single-screen login, 2026-06-10)
`default-authentication-flow` bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (`default-authentication-identification`, pk
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
`default-authentication-password`, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
was DELETED via the API — authentik requires removing it when the
identification stage embeds the password field. `password_stage` is pinned in
Terraform (`authentik_stage_identification.default_identification` in
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
UI-managed via `ignore_changes`. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users would briefly see a second password prompt — delete the
binding again.
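The drift check described above (an upgrade or blueprint re-adding the order-20 password binding) can be sketched as a comparison of live bindings against the expected set. This is a hypothetical model, not the authentik API: the `Binding` type, the short stage labels and the `unexpected_bindings` helper are all assumptions made for illustration, and fetching the live bindings is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """One stage binding of a flow: its order and a short stage label."""
    order: int
    stage: str


# Expected bindings of default-authentication-flow, as described in the text.
EXPECTED = [
    Binding(10, "identification"),
    Binding(30, "mfa-validation"),
    Binding(100, "user-login"),
]


def unexpected_bindings(live: list[Binding]) -> list[Binding]:
    """Return live bindings that are not in the expected set, sorted by order."""
    expected = set(EXPECTED)
    return sorted((b for b in live if b not in expected), key=lambda b: b.order)


if __name__ == "__main__":
    # Simulate an upgrade that re-added the separate password binding.
    live = EXPECTED + [Binding(20, "password")]
    for binding in unexpected_bindings(live):
        print(f"unexpected binding: order={binding.order} stage={binding.stage}")
```

An empty result means the flow still matches the single-screen layout; any entry at order 20 is the binding to delete again.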
## Invitation Enrollment Flow
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@ -149,7 +177,12 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
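Two of the numbers in the notes above are easy to sanity-check with arithmetic. The sketch below treats the liveness budget as roughly `failureThreshold × periodSeconds`, which is an approximation (real kubelet timing also depends on the probe timeout and scheduling), and confirms that the `max-age` value equals one 365-day year. The function and constant names are illustrative only.

```python
def failure_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time an unresponsive container survives before a restart."""
    return failure_threshold * period_seconds


# Chart default (3 x 10s) versus the tuned server.livenessProbe (6 x 10s).
CHART_DEFAULT_BUDGET = failure_budget_seconds(3, 10)
TUNED_BUDGET = failure_budget_seconds(6, 10)

# Cache-Control max-age used for /static, expressed as one 365-day year.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60

if __name__ == "__main__":
    print("chart default budget (s):", CHART_DEFAULT_BUDGET)
    print("tuned budget (s):", TUNED_BUDGET)
    print("max-age matches one year:", ONE_YEAR_SECONDS == 31536000)
```

Under this approximation the tuned probe doubles the time a pod may wait on the migration advisory lock before the kubelet restarts it.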
## Upgrade Validation Checklist
@ -161,8 +194,9 @@ Run after **any** of these:
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
# 1. Service routes to the outpost pods (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes


@ -92,19 +92,21 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).
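The ~288 GB figure can be reproduced from the updated table. The sketch below sums the RAM column for running VMs only (the stopped `pbs` VM and the decommissioned truenas row are excluded) and compares the total with the 272 GB of physical memory; the variable names are illustrative.

```python
# (name, RAM in GB, status) copied from the updated rows of the VM table.
VMS = [
    ("pfsense", 4, "running"),
    ("devvm", 24, "running"),
    ("home-assistant", 8, "running"),
    ("pbs", 8, "stopped"),
    ("k8s-master", 32, "running"),
    ("k8s-node1", 48, "running"),
    ("k8s-node2", 32, "running"),
    ("k8s-node3", 32, "running"),
    ("k8s-node4", 32, "running"),
    ("k8s-node5", 32, "running"),
    ("k8s-node6", 32, "running"),
    ("docker-registry", 4, "running"),
    ("Windows10", 8, "running"),
]

PHYSICAL_GB = 272

# Only running VMs count toward the nominal allocation.
allocated_gb = sum(ram for _, ram, status in VMS if status == "running")
overcommit_ratio = allocated_gb / PHYSICAL_GB

if __name__ == "__main__":
    print(f"allocated: {allocated_gb} GB")
    print(f"overcommit: {overcommit_ratio:.1%} of physical RAM")
```

The nominal allocation comes out about 6% above physical memory, which is why the note relies on ballooning and host swap.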
## VM Templates
| VMID | Name | Purpose |


@ -32,7 +32,7 @@
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
| reverse-proxy | Generic reverse proxy | reverse-proxy |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role``scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. | t3code |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role``scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. 
**Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works but dispatch **auto-pair is 401-broken on v0.0.26** (latent; live 30-day cookies mask it). | t3code |
## Active Use
| Service | Description | Stack |

.gitignore

@ -104,5 +104,5 @@ stacks/terminal/clipboard-upload/clipboard-upload
terraform.tfstate
terraform.tfstate.backup
# Per-feature git worktrees (worktree-first workflow — execution.md §3)
# Per-feature git worktrees (worktree-first workflow — execution.md)
.worktrees/


@ -149,7 +149,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed
**CNPG** / **pg-cluster**:
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
### Secrets


@ -40,10 +40,10 @@ graph TB
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
When `auth = "required"`, an unauthenticated request flows:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
3. Authentik checks for valid session cookie
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
5. User authenticates via social provider (Google/GitHub/Facebook)
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
6. Authentik creates session, sets cookie, redirects back to original URL
7. Subsequent requests include session cookie, pass auth check, reach backend
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
### First-time signin performance (2026-06-10)
Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:
- **Single-screen login**: the identification stage carries `password_stage`,
so username+password is one round trip. The separate password-stage binding
was removed from `default-authentication-flow` (required by authentik when
embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
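
The `/static` carve-out can be spot-checked from the response headers. The helper below is a minimal sketch, not part of the repo: `has_immutable_cache` is a hypothetical name, and the asset path in the usage comment is a placeholder to fill in with a real fingerprinted file.

```bash
# Hypothetical helper: reads HTTP response headers on stdin and succeeds only
# if a Cache-Control header carries both max-age=31536000 and immutable.
has_immutable_cache() {
  tr -d '\r' \
    | grep -iE '^cache-control:' \
    | grep -qiE 'max-age=31536000.*immutable|immutable.*max-age=31536000'
}

# Usage (placeholder asset path):
#   curl -sI "https://authentik.viktorbarzin.me/static/<asset>" \
#     | has_immutable_cache && echo STATIC_CACHE_OK
```

A failing check on a `/static` URL suggests the request is not hitting the carve-out ingress, since authentik itself sends no max-age.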
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
### Social Login & Invitation Flow


@ -22,9 +22,11 @@ graph TB
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
end
subgraph K8s["Kubernetes Cluster v1.34.2"]
subgraph K8s["Kubernetes Cluster v1.34.8"]
direction TB
subgraph VPA["VPA (Goldilocks - Initial Mode)"]
@ -62,7 +64,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@ -76,8 +78,10 @@ graph TB
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
@ -97,7 +101,12 @@ graph TB
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> set`, fresh clone) self-heals within the hour. The script compares
> *normalized option sets*, so an unchanged config is a true no-op —
> until 2026-06-11 a raw string compare (defeated by `qm config`'s
> canonical key order) re-issued `qm set` hourly against running VMs,
> live-rewriting QEMU throttle state via QMP (implicated in the devvm
> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
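
The difference between a raw string compare and a normalized option-set compare can be sketched as follows. This is illustrative only: the real script lives in `infra/scripts/` and is not reproduced here, `normalize_opts` and `opts_differ` are hypothetical names, and the comma-separated `key=value` form is an assumption about how the options are written.

```bash
# Hypothetical helper: canonicalise a comma-separated option string by
# splitting on commas, dropping empty fields, sorting, and re-joining.
normalize_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d' | LC_ALL=C sort | paste -sd, -
}

# Succeeds only when the two option sets really differ, regardless of key order.
opts_differ() {
  [ "$(normalize_opts "$1")" != "$(normalize_opts "$2")" ]
}

# Usage: only re-apply when the desired and current sets differ.
#   opts_differ "$desired" "$current" && echo "drift detected, would run qm set"
```

With this shape, a config that `qm config` reports in a different key order compares equal, so the hourly run issues no `qm set`.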


@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.


@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
@ -47,7 +47,7 @@ graph TB
end
subgraph K8s["Kubernetes Cluster"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -85,8 +85,7 @@ graph TB
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
@ -113,7 +112,7 @@ graph TB
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.
### Block Storage Flow (Proxmox CSI) — NEW


@ -0,0 +1,285 @@
# Break-Glass SSH Access — Design
> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`.
> The port-knock was removed: it added no real security (the SSH key already
> makes the port brute-force-proof) and its knock sequence lived only in
> in-cluster Vault — unreachable in the exact cold/away scenario break-glass
> exists for, which caused a real lockout. Retained for history. As-built:
> `docs/runbooks/breakglass-ssh.md`.
- **Date**: 2026-05-30
- **Status**: Draft — pending user review
- **Owner**: Viktor
- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1)
## 1. Goal
Provide a **cold, brute-force-proof backdoor onto the home LAN from the public
internet** for the case where the Kubernetes cluster and every cluster-hosted
remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster
WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**.
### Hard requirements (from the user)
1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are
down. The path must touch **nothing in the cluster** (no Authentik, Traefik,
Technitium/AdGuard DNS, cloudflared).
2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology,
k8s API, etc.).
3. **No brute force**: no password-guessable surface.
4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard /
Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only.
5. **Minimal effort**, and ideally **honor the locked Wave 1 policy**
(`no public-IP access — … PVE sshd must transit LAN or Headscale`).
## 2. Decision
**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.**
- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box
(`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control)
and it sits directly on the `192.168.1.0/24` segment, so the path **does not
traverse pfSense or the cluster** — it survives a wedged pfSense too, not just
a down cluster.
- SSH is the only externally-usable remote tool **pre-installed on every
Linux/macOS box**, satisfying requirement 4.
- **Key-only auth** (no passwords anywhere) makes password brute force
impossible → requirement 3.
- A **port-knock** keeps the external SSH port **closed/invisible to scanners**
until a knock sequence is sent. This restores the "no standing public service"
property we'd have had with WireGuard and keeps us within the **intent** of the
Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a
**bash `/dev/udp` one-liner** — zero install.
### Alternatives rejected
| Option | Why rejected |
|---|---|
| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. |
| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). |
| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. |
| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. |
| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). |
## 3. Architecture
```
Your laptop (anywhere) — stock ssh + bash, nothing installed
│ (1) UDP knock sequence → bash: echo > /dev/udp/<pub>/<port> (instant, no handshake)
│ (2) ssh -p 52222 root@<pub>
Edge router 192.168.1.1 (the box the stored password unlocks)
│ forwards: UDP <k1>,<k2>,<k3> + TCP 52222 → 192.168.1.127
Proxmox host 192.168.1.127 ← path bypasses pfSense entirely
├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s)
├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only
└─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN
```
**Why it meets "cold + full LAN":** the host is up by definition of the chosen
failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host
you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to
the VLANs via pfSense when pfSense is up) or by using SSH's built-in
`-J`/`-D` — both stock, no install.
## 4. Components
### 4.1 Edge router @ 192.168.1.1 (manual, in the browser)
Add port-forwards (same place the existing `51821` WireGuard forward lives):
- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale)
- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault)
If the router supports a **port range** forward, a single range covering the
knock ports + 52222 is tidier than four rules.
> **Verify (#1 implementation check):** whether `.1` **preserves the source IP**
> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by
> knocking + connecting from an external network and checking `/var/log/auth.log`
> + `knockd` syslog for the observed source IP. The design works either way (see
> §4.3), but it determines knock granularity.
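
For verify #1, the observed source IPs can be pulled out of the sshd log lines. The helper below is a sketch under assumptions: `accepted_sources` is a hypothetical name, and it expects the standard OpenSSH `Accepted publickey for <user> from <ip> port <n>` message format on stdin (from `/var/log/auth.log` or `journalctl -u ssh`, whichever the host actually uses).

```bash
# Hypothetical helper: print the unique source IPs of successful key logins.
accepted_sources() {
  awk '/Accepted publickey for/ {
         # The address is the field right after the word "from".
         for (i = 1; i < NF; i++) if ($i == "from") { print $(i + 1); break }
       }' | sort -u
}

# Usage (on the Proxmox host, after an external test login):
#   accepted_sources < /var/log/auth.log
```

If the only address listed for the external test is `192.168.1.1`, the router is SNATing; if the laptop's public IP appears, the source IP is preserved.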
### 4.2 SSH keys & Vault layout
- Mint a **dedicated** break-glass keypair (ed25519), separate from
`secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly
labelled.
- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=`
restriction — break-glass is from-anywhere; the knock + key are the gate).
- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for
re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519`
(chmod 600).
- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out
of git — obscurity value only; see §5).
### 4.3 Proxmox host — sshd hardening
`/etc/ssh/sshd_config.d/10-breakglass.conf`:
```
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password # key-only root (PVE recovery norm)
MaxAuthTries 3
LoginGraceTime 20
```
- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external,
knock-gated)**. Using a dedicated external port (not a DNAT rewrite to 22)
lets the firewall distinguish LAN vs external **regardless of `.1` SNAT
behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate.
- **Default to root key-only** for recovery practicality. *Alternative for
review:* a dedicated `breakglass` sudo user instead of root.
> **Verify (#2):** key login already works for your normal access **before**
> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs
> already use keys, so this is likely already effectively true.)
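
Once the drop-in is live, the effective settings can be confirmed from `sshd -T`, which dumps the resolved configuration with lowercase keywords. The check below is a sketch: `check_key_only` is a hypothetical name, and it assumes an OpenSSH version that reports `kbdinteractiveauthentication` in that dump.

```bash
# Hypothetical helper: reads `sshd -T` output on stdin and succeeds only if
# password and keyboard-interactive auth are off and public-key auth is on.
check_key_only() {
  awk '
    $1 == "passwordauthentication"       { pw  = $2 }
    $1 == "kbdinteractiveauthentication" { kbd = $2 }
    $1 == "pubkeyauthentication"         { pk  = $2 }
    END { exit !(pw == "no" && kbd == "no" && pk == "yes") }
  '
}

# Usage (from a session that is already open, per the anti-lockout note):
#   ssh root@192.168.1.127 'sshd -T' | check_key_only && echo KEY_ONLY_OK
```

Checking the resolved configuration rather than the drop-in file catches the case where another file in `sshd_config.d` overrides the setting.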
### 4.4 Host firewall (knock gate)
Default-drop the external SSH port; knockd punches a per-source hole. LAN admin
(`:22`) and established sessions are untouched:
```
# allow established / related
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin + backups: SSH on :22 always allowed
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default — knockd opens it per-source
iptables -A INPUT -p tcp --dport 52222 -j DROP
```
- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables
drops them — the knock ports stay **silent/closed** to scanners.
- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is
enabled. If it is, express these rules through it (or a dedicated chain) so a
pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs
often have it off at datacenter level.
### 4.5 knockd
`apt install knockd` (Debian/PVE). `/etc/knockd.conf`:
```
[options]
UseSyslog
Interface = vmbr0 # the 192.168.1.127 interface
[breakglass]
sequence = <k1>:udp,<k2>:udp,<k3>:udp # real ports from Vault
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
```
- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang
on the client (a TCP knock to a dropped port would block until timeout).
- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session
established within that window **persists** via conntrack ESTABLISHED after the
rule is removed. Enable + start the `knockd` service.
### 4.6 fail2ban (defense-in-depth)
`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures).
Local to the host, **no cluster dependency**. Catches anything that gets past the
knock to the sshd listener.
### 4.7 Client side (laptop — stock tools only)
`~/.ssh/config`:
```
Host breakglass
HostName <public-ip-or-dyndns>
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
```
Knock + connect — a shell function using **bash builtins only** (works on
macOS `/bin/bash` + Linux; UDP send is instant):
```sh
bg() {
local host=<public-ip-or-dyndns>
for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done
sleep 0.5
ssh breakglass "$@"
}
```
- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or
`ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080`
to reach any internal IP. From the host shell you already have everything.
- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in
the `Host breakglass` block so plain `ssh breakglass` knocks automatically.
### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down)
Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold
event. Use IPs:
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API server | `10.0.20.100` |
| Synology NAS | `192.168.1.13` |
| Edge router | `192.168.1.1` |
| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` |
## 5. Security analysis
- **Brute force: solved.** No password auth anywhere → password guessing is
impossible; key brute force is cryptographically infeasible.
- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is
default-dropped and the knock ports are pcap-sniffed (never answered), so a
scanner sees a closed/silent host — PVE sshd is **not internet-scannable**,
honouring the spirit of "no public-IP access to PVE sshd".
- **The knock is obscurity, not cryptography.** A port-knock sequence is
plaintext and replayable by a passive on-path observer. **The SSH key is the
real access control** — the knock only removes the standing/scannable surface.
(Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the
knock sequence as a secret-ish convenience, not a second cryptographic factor.
- **Residual risks** (none are brute force):
1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep
PVE patched; short `cmd_timeout`; fail2ban.
2. **Private key theft** → mitigation: key has a passphrase; revoke by removing
the line from `authorized_keys`.
3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared
`192.168.1.1` source — anyone else arriving via `.1` in that window could
reach the sshd banner, but still needs your key. Mitigated by the short
window + key-only + fail2ban.
- **Deliberate, documented exception** to the Wave 1 "no public-IP access"
policy, scoped to this single knock-gated port. To be recorded in
`security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation.
## 6. What's automated vs manual
- **I do**: generate the keypair + knock sequence, store them in Vault, produce
the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client
`~/.ssh/config` + `bg()` function, and write the runbook + doc updates.
- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by
you in the browser (out-of-Terraform, live device). The Proxmox host changes
(sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login
verified first** to avoid lockout; pfSense is **not** touched. None of this is
a `tg apply` — pfSense and the edge router are not Terraform-managed.
## 7. Testing & verification
1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog
shows the sequence + opens `:52222`; SSH succeeds.
2. **Without** knocking: `ssh -p 52222` from external → the connection times out
(the default `DROP` rule discards the SYN silently, so expect a timeout rather
than a refusal). A plain port scan of `52222` + the knock ports → silent.
3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected.
4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to
an internal IP.
5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity
note accordingly.
## 8. Failure modes & rotation
- **Proxmox host down** (not just cluster): this path is gone — that's the
out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**.
- **`.1` router config reset**: forwards lost → re-add from this doc; consider
exporting the `.1` config for backup.
- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it
auto-follows; keep the raw IP as fallback.
- **Key/knock compromise**: remove the `authorized_keys` line (kills access
instantly); rotate the knock sequence in `knockd.conf` + Vault.
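
Revoking the key amounts to filtering one line out of `authorized_keys`. The function below is a sketch, assuming the break-glass line can be identified by a unique marker such as its key comment; `revoke_key` is a hypothetical name, not an existing script.

```bash
# Hypothetical helper: remove every line containing a fixed marker string
# from an authorized_keys-style file, keeping the file's owner and mode.
revoke_key() {
  local file=$1 marker=$2 tmp
  tmp="$(mktemp)" || return 1
  # grep -v exits 1 when no lines remain; that still counts as success here.
  grep -vF -- "$marker" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$file"   # overwrite in place instead of mv, so permissions stay
  rm -f "$tmp"
}

# Usage (on the Proxmox host; the marker is whatever comment the key carries):
#   revoke_key /root/.ssh/authorized_keys '<breakglass-key-comment>'
```

Sessions that are already established are not closed by this; it only blocks new logins with that key.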
## 9. Out of scope
- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier.
- Phone access (would need an SSH **app**, e.g. Termius — outside the
"pre-installed Linux/macOS" constraint; laptop is the target).
## 10. Docs to update on implementation
- `docs/architecture/vpn.md` — add a "Break-glass SSH" section.
- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md`:
record the deliberate knock-gated exception to "no public PVE sshd".
- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure.


@ -0,0 +1,395 @@
# Break-Glass SSH Access — Implementation Plan
> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
> for history. As-built: `docs/runbooks/breakglass-ssh.md`.
> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.
**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.
**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.
**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).
**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.
---
## Pre-flight (read before starting)
- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 4 gates on Phase 4-pre verification).
- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.
---
## Phase 0 — Generate secrets (no live changes)
### Task 0.1: Break-glass SSH keypair
**Files:** none in repo (secrets → Vault).
- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
```bash
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
```
- [ ] **Step 2: Store the private key + public key in Vault**
```bash
vault kv patch secret/viktor \
breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
```
- [ ] **Step 3: Verify the keys are retrievable**
```bash
vault kv get -field=breakglass_ssh_pubkey secret/viktor
```
Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
### Task 0.2: Knock sequence
- [ ] **Step 1: Generate 3 random UDP knock ports**
```bash
KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
```
- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
```bash
vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
vault kv get -field=breakglass_knock_sequence secret/viktor
```
Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
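
Before storing or deploying the sequence, it can be sanity-checked. The validator below is a sketch: `valid_knock_sequence` is a hypothetical name, and the rules it enforces (exactly three distinct numeric ports in 1-65535) are an assumption drawn from the three-port sequence described above, not a knockd requirement.

```bash
# Hypothetical helper: succeed only for exactly three distinct, numeric,
# in-range ports given as one comma-separated string.
valid_knock_sequence() {
  local seq=$1 p n=0 seen=","
  local IFS=,
  for p in $seq; do
    case $p in ''|*[!0-9]*) return 1 ;; esac                  # digits only
    [ "$p" -ge 1 ] && [ "$p" -le 65535 ] || return 1          # valid port range
    case $seen in *",$p,"*) return 1 ;; esac                  # no duplicates
    seen="$seen$p,"
    n=$((n + 1))
  done
  [ "$n" -eq 3 ]
}

# Usage:
#   valid_knock_sequence "$KNOCK" && echo KNOCK_OK
```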
---
## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
### Task 1.1: Pre-checks (no changes yet)
- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
From your laptop, confirm that your *existing* admin key works (the break-glass key is only authorized later, in Task 1.2):
```bash
ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
```
Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
```bash
ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
```
Expected: note whether the output reports `Status: enabled/running`. If **enabled**, add the Task 1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
### Task 1.2: Authorize the break-glass key
- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
```bash
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
```
- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
```
Expected: `BREAKGLASS_KEY_OK`.
### Task 1.3: sshd dual-port + key-only
**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
- [ ] **Step 1: Write the sshd drop-in**
```bash
ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
EOF
```
- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
```bash
ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
```
Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
- [ ] **Step 3: Reload sshd (current session stays alive)**
```bash
ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
```
Expected: `RELOADED`.
- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22'
ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
```
Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
```bash
ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
# established/related always allowed
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
EOF
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
```
- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
```bash
ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
[Unit]
Description=Break-glass base firewall (SSH knock gate)
After=network-pre.target
Before=knockd.service
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
```
Expected: `FW_APPLIED`.
- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
```bash
ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works
nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock
```
Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
### Task 1.5: knockd
**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
```bash
ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
```
Expected: `KNOCKD_INSTALLED`.
- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
```bash
KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180
IFS=, read -r K1 K2 K3 <<<"$KNOCK"   # split the comma-separated sequence into three ports
ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
[options]
UseSyslog
Interface = vmbr0
[breakglass]
sequence = ${K1}:udp,${K2}:udp,${K3}:udp
seq_timeout = 10
start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
cmd_timeout = 30
stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
EOF
```
- [ ] **Step 3: Enable + start knockd**
```bash
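# sed exits 0 even when nothing matches, so test for the line first and append only if it is missing
ssh root@192.168.1.127 "grep -q '^START_KNOCKD=' /etc/default/knockd 2>/dev/null && sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd || echo 'START_KNOCKD=1' >> /etc/default/knockd"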
ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
```
Expected: `active`.
### Task 1.6: fail2ban (defense-in-depth)
- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
```bash
ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
```
Expected: `F2B_OK` (sshd jail active).
---
## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
- [ ] **Step 1: Add the SSH break-glass forward**
- Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
- For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
After Phase 3 connects once, on the host check the observed source:
```bash
ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
```
If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
---
## Phase 3 — Client config (laptop, no live infra change)
**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
- [ ] **Step 1: Add the SSH host block**
```bash
cat >> ~/.ssh/config <<'EOF'
Host breakglass
HostName viktorbarzin.ddns.net
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
EOF
```
(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
- [ ] **Step 2: Add the knock+connect function**
```bash
cat >> ~/.zshrc <<'EOF'
bg() {
  local host="viktorbarzin.ddns.net"
  local seq p
  seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
  # One UDP datagram per knock port. /dev/udp is a bash feature, so send via bash;
  # the write happens in the same shell that opens the socket.
  for p in $(echo "$seq" | tr ',' ' '); do
    bash -c 'echo x > "/dev/udp/$1/$2"' _ "$host" "$p" 2>/dev/null
    sleep 0.4
  done
  sleep 0.5
  ssh breakglass "$@"
}
EOF
```
> Note: `/dev/udp` redirection is a bash feature (`/bin/bash` on macOS + Linux). zsh does not implement it, so under zsh send the datagram through bash (`bash -c 'echo x > /dev/udp/HOST/PORT'`) or with `nc`, e.g. `echo x | nc -u -w1 $host $p`.
---
## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 4)
> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
- [ ] **Step 1: Without knocking, the port is silent**
```bash
nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
```
Expected: `SILENT_OK`.
- [ ] **Step 2: Knock + connect succeeds**
```bash
bg 'hostname; echo BREAKGLASS_E2E_OK'
```
Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
```bash
ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
```
Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
- [ ] **Step 4: LAN admin unaffected**
From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
**GATE:** Only proceed to Phase 4 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
---
## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
> AX6000 UI. One pass, all three changes.
- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
- Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
- Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
- [ ] **Step 3: Disable UPnP**
- Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
From an external network:
```bash
nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
bg 'echo BREAKGLASS_STILL_OK'
```
Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
---
## Phase 6 — Docs + commit (AFTER infra repo is clean)
- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
```bash
git -C /home/wizard/code/infra add \
docs/plans/2026-05-30-breakglass-ssh-access-design.md \
docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
docs/architecture/vpn.md docs/architecture/security.md \
docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
git -C /home/wizard/code/infra push origin master
```
---
## Self-review
- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).

@ -0,0 +1,73 @@
# Break-glass SSH — Redesign
- **Date**: 2026-06-11
- **Status**: Implemented
- **Owner**: Viktor
- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
## Why redesign
The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
**port-knock** (knockd). It caused a real lockout, for a structural reason:
- The knock sequence was 3 random ports stored **only** in Vault, and the client
helper fetched it from Vault at connect time.
- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
exact scenario break-glass exists for — away from home, cluster/tunnels down —
the knock sequence is unreachable and unmemorable. Circular dependency.
The knock's only benefit was hiding an already brute-force-proof port; its cost
was that fragility. For a *recovery* path, robustness beats stealth.
## Decision
**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
Hardened with: the exposed port trusts only a dedicated break-glass key
(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
"pfSense wedged" / "host down" tiers are explicitly out of scope).
Alternatives considered and rejected: keeping the knock (fragile, circular);
Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
dependency Headscale is self-hosted to avoid, and the user preferred a
self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
self-contained SSH path was preferred).
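A minimal sketch of what the `Match LocalPort` isolation could look like. The real drop-in is `scripts/sshd-10-breakglass.conf`, which is not reproduced here, so the directive values and the `render_dropin` helper name are illustrative assumptions, not the as-built file.

```bash
#!/usr/bin/env bash
# Sketch only: renders a HYPOTHETICAL sshd drop-in. The as-built file is
# scripts/sshd-10-breakglass.conf and may differ.
render_dropin() {
  cat <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3

# Connections arriving on :52222 may only use the break-glass key file.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass

# Close the Match block so anything read afterwards applies globally again.
Match all
EOF
}

# Write to a scratch file (or the path given as $1), never straight into /etc/ssh.
out="${1:-$(mktemp)}"
render_dropin > "$out"
echo "wrote $out"
```

On the host, a file like this can be checked with `sshd -t -f <file>` before it is installed and sshd is reloaded, which keeps the anti-lockout habit from the original plan.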
## Components
| Layer | Change | Source of truth |
|---|---|---|
| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
| knockd | **removed** (package purged, config deleted) | — |
| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
## Edge-router constraints discovered (TP-Link AX6000)
- **No port remapping** — external port must equal internal port (rejects e.g.
`22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
sides.
- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
(Viktor's initial preference); `:52222` is the landed port.
- **Row delete is immediate** (no confirm dialog).
## Security posture
- **Brute force: impossible** (key-only, no password).
- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
`authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
snoopy execve to Loki).
## Verification (2026-06-11)
- `:52222` reachable; break-glass key authenticates (`root@pve`).
- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.

@ -0,0 +1,76 @@
# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
ingresses and every OIDC app) degraded/unavailable for ~50 minutes
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
prompts during outpost-check failures. The shared CNPG primary failed over
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
tenant.
**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
signin speedup work — env tuning, outpost config, static-asset ingress).
## Root causes (three stacked)
1. **Helm/Keel version split → silent downgrade.** Keel (namespace
`keel.sh/enrolled` + diun annotations) had upgraded the live authentik
image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
appVersion drives the image tag). The values-only apply therefore rolled
every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
database. Cores never came up healthy (`failed to proxy to backend`, plus
Django cross-version serialized-cache warnings), and mid-storm Keel
re-upgraded the image, adding a third ReplicaSet to the churn.
2. **Liveness budget too small for authentik's boot.** The chart-default
liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer
passes the startup probe — but during a rolling restart the Python core
still waits on authentik's DB **migration advisory lock** (60–120s+ under
contention). kubelet kill-looped every booting pod, and each kill increased
lock contention for the rest (thundering herd).
3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
server connections `idle in transaction` still **holding the migration
advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
idle 2+ min). Every subsequent boot serialized behind a dead client.
PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
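The ghost holders from item 3 can be listed with a query along these lines. This is a sketch, assuming direct `psql` access to the primary; the `ghost_lock_sql` helper and the `PGURL` variable are hypothetical names, not part of the stack.

```bash
#!/usr/bin/env bash
# Sketch: list sessions that sit "idle in transaction" while holding an advisory lock.
ghost_lock_sql() {
  cat <<'SQL'
SELECT a.pid,
       a.usename,
       a.state,
       now() - a.state_change AS idle_for,
       left(a.query, 80)      AS last_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.granted
  AND a.state = 'idle in transaction'
ORDER BY idle_for DESC;
SQL
}

# Print the query by default; run it only when PGURL (hypothetical) is set.
if [ -n "${PGURL:-}" ]; then
  ghost_lock_sql | psql "$PGURL"
else
  ghost_lock_sql
fi
```

Any long-idle row returned here is a candidate for the `pg_terminate_backend()` step described under Resolution.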
**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
every Django thread hold its connection persistently; with PgBouncer in
*session* mode each one pins a server connection 1:1, so the restart churn
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
75 of 108 connections on the new primary). The shared primary's
restart/failover at 22:40 fits this storm window.
## Resolution
- Scaled workers to 0 (transient) to free pool capacity; rollout converged
once, then re-degraded when workers returned.
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
6×10s/5s) — final state codified in Helm values in the same session.
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
(twice).
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
to 3 — converged cleanly (51s boots, zero restarts).
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
removed, liveness in values, pgbouncer reaper config).
## Prevention (all landed in this change)
| Cause | Fix |
|---|---|
| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~12ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
## Lessons
- **Check the live image tag against the chart pin before ANY helm-managed
apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
is a version change, not a config change.
- A "stuck rollout" of authentik is usually the migration advisory lock:
check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
holders before blaming probes or resources.
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
Access path); without it every protected app would have hard-failed.
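The first lesson can be turned into a pre-apply guard. A minimal sketch, assuming the live image string and the pinned tag are passed in by hand; `image_tag`, `check_pin` and the example image name are hypothetical, and images referenced by digest are not handled.

```bash
#!/usr/bin/env bash
# Sketch: flag a "values-only" apply that would silently change the image version.
image_tag() {   # "example/app:1.2.3" -> "1.2.3"
  printf '%s\n' "${1##*:}"
}

check_pin() {
  local live_image="$1" pinned_tag="$2" live_tag
  live_tag="$(image_tag "$live_image")"
  if [ "$live_tag" = "$pinned_tag" ]; then
    echo "OK: live tag $live_tag matches the pin"
  else
    echo "MISMATCH: live $live_tag vs pinned $pinned_tag (apply would change the version)"
    return 1
  fi
}

# Example with hypothetical values. In practice the first argument would come from
#   kubectl get deploy <x> -o jsonpath='{..image}'
check_pin "example/authentik-server:2026.2.4" "2026.2.2" || true
```

A non-zero return from `check_pin` is the signal to refresh the pin (or bump the chart) before running the apply.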

@ -0,0 +1,116 @@
# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) effectively
dead 15:21–16:48 UTC (18:21–19:48 EEST): all ssh/tmux and t3 sessions for
wizard/emo/anca lost, every in-flight agent killed.
- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert
exists (follow-up below).
- Recovery was manual: kill of the wedged QEMU process + `qm start` (the
kill left no autopsy — see "What we could not prove").
## Timeline (UTC; host journal runs EEST = UTC+3)
- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0
throttle via `qm set` (as it had done every hour for weeks — see Root
cause #4).
- **15:18–15:20** — guest healthy by every metric: CPU 7–16% of 16 vCPUs,
load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 28%
utilized. Heavy claude/bwrap sandbox activity (normal workload).
- **15:19:08** — last journal line the guest ever writes (mid normal
traffic, zero kernel distress — not even a hung-task warning).
- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows
`diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even
journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP
the whole time** — the process and its main loop are alive; only the
block path is dead.
- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus:
processes progressively piling up behind dead storage (dirty-page
writeback stuck → direct reclaim spins). Classic starvation cascade, not
a panic (a panic halts or spins flat from t=0).
- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band
(root shell; no PVE task, no snoopy line — shell-builtin `kill`).
- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124
(wedged boot ran 6.8.0-117).
## Ruled out (evidence, not vibes)
- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus)
and per-minute host RRD.
- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel
I/O errors on the host all day; `sdc` quiet through the window.
- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash
log; 113 of 114 monitored targets stayed up. Only the devvm died.
- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at
0 while netout ACKs trickle; and the guest kernel logged nothing.
## Root cause
**Class pinned, exact line unprovable** (see below): the devvm's disk I/O
stalled *inside the QEMU process* — below the guest kernel (all guest I/O
froze simultaneously with nothing logged) and above host storage (host
clean, neighbors fine, QEMU main loop responsive). Contributing stack,
unique to this VM:
1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy
default for OSes without virtio drivers). The devvm was the **only VM
on the host** running its disk through this path; every healthy
neighbor uses `virtio-scsi-pci`. The LSI model is documented as
hang-prone under intensive I/O.
2. **No `iothread`** — all disk emulation ran on QEMU's single main event
loop, sharing it with timers and QMP.
3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose
queued I/O completes only when its re-arm timer fires.
4. **Hourly live throttle rewrites** — `apply-mbps-caps.sh`'s idempotency
check compared raw config strings, but `qm config` prints keys in its
own canonical order, so the check **never matched** and the script
re-issued `qm set` (→ live QMP `block_set_io_throttle` against the
running QEMU) every hour, 24×/day, for weeks — each poke a chance to
race the throttle machinery while queued I/O is in flight. The wedge
came 20 min after the 15:01 poke.
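The comparison failure in item 4 can be reproduced without a VM. A minimal sketch: `normalized` is the helper this commit adds to `apply-mbps-caps.sh`; the two disk specs and the `raw_equal` / `specs_equal` wrappers are hypothetical, for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical disk specs: same option set, different key order.
want="local-lvm:vm-102-disk-0,mbps_rd=60,mbps_wr=60,size=256G"
have="local-lvm:vm-102-disk-0,size=256G,mbps_rd=60,mbps_wr=60"

# Sort the comma-separated options so key order no longer matters.
normalized() {
  tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, -
}

# The old check: raw string comparison.
raw_equal() { [ "$1" = "$2" ]; }

# The new check: compare the normalized option sets.
specs_equal() { [ "$(normalized "$1")" = "$(normalized "$2")" ]; }

raw_equal   "$want" "$have" && echo "raw: match"        || echo "raw: mismatch, would re-issue qm set"
specs_equal "$want" "$have" && echo "normalized: match" || echo "normalized: mismatch"
```

With these inputs the raw comparison reports a mismatch while the normalized one matches, which is the difference between an hourly `qm set` and a true no-op.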
## What we could not prove
Whether the stuck queue was the LSI device model, the throttle-group
timer, or their interaction. The discriminating evidence (QMP
`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47
and was destroyed by the recovery kill. If a wedge recurs **autopsy before
shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`,
`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the
kvm process pin it to the line.
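A capture-before-kill helper could look roughly like this. It is a sketch under stated assumptions: the pid-file path `/var/run/qemu-server/<vmid>.pid` is assumed, the function names are hypothetical, and the interactive `qm monitor` / QMP `query-block` step named above is left manual.

```bash
#!/usr/bin/env bash
# Sketch: collect evidence from a wedged VM's QEMU process BEFORE killing it.
# DRY_RUN=1 (the default) only prints what would run.
autopsy_cmds() {
  local vmid="$1" pid="$2"
  printf '%s\n' \
    "qm status ${vmid} --verbose" \
    "gdb -p ${pid} -batch -ex 'thread apply all bt'"
}

autopsy() {
  local vmid="$1" pid="$2" cmd
  if [ "${DRY_RUN:-1}" != "1" ] && ! [[ "$pid" =~ ^[0-9]+$ ]]; then
    echo "no numeric QEMU pid for VM ${vmid}; refusing to run" >&2
    return 1
  fi
  while IFS= read -r cmd; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $cmd"
    else
      echo "### $cmd"
      bash -c "$cmd" || echo "(command failed; continuing)"
    fi
  done < <(autopsy_cmds "$vmid" "$pid")
}

# Assumption: PVE keeps the QEMU pid in /var/run/qemu-server/<vmid>.pid.
vmid="${1:-102}"
pid="$(cat "/var/run/qemu-server/${vmid}.pid" 2>/dev/null || echo "PID_UNKNOWN")"
autopsy "$vmid" "$pid"
```

Running it with `DRY_RUN=0` and redirecting the output to a file preserves the stack trace that the 16:47 kill destroyed.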
## Fixes
| Status | Fix |
|---|---|
| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. |
| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. |
## Open follow-ups (discussed 2026-06-11, not yet built)
- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the
90-min detection gap.
- Freeze forensics: netconsole → pve listener, serial console,
`kernel.panic=60`, and a capture-before-kill runbook (above) so any
recurrence is pinned, not mourned.
- The recurring *crawl* class (agent storms → swap-thrash; journald
watchdog-killed 3× on 2026-06-10) is a separate failure mode —
ssh/tmux sessions remain memory-uncontained by explicit decision
(swap-only, 2026-06-10).
## Lessons
- **A VM can die of QEMU-userspace causes that no guest or host kernel log
will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the
only witness — `diskwrite=0` with a live QMP socket is the signature.
- **"Idempotent" reconcilers must prove idempotency against the system's
canonical output format**, not against the string they themselves
constructed. A compare that never matches turns a safety net into a
24×/day fault injector — and its own journal said `updating scsi0`
every hour, in plain sight, for weeks.
- The May-26 mbps caps fixed the sdc-saturation freeze class and
introduced this one's trigger surface. Layered mitigations fail in
layers — audit what a fix *adds*, not only what it removes.
- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every
cross-machine correlation in this incident initially looked 3h off.

@ -0,0 +1,158 @@
# Runbook: Break-glass SSH
Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
port-knock design is decommissioned (see "History" below).
## Model (as built)
```
your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
│ WAN tcp/52222 ─▶ 192.168.1.127:52222
Proxmox host 192.168.1.127
sshd :52222 (key-only, break-glass key ONLY)
→ full LAN via ssh -J / ssh -D
```
- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
from root's normal LAN-admin keys, so it is independently revocable and a leak
of any other root key does not grant internet access.
- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
scanner noise only; key-only auth is the real protection.
- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
a deliberate, documented exception to the Wave-1 "no public-IP access" policy
(see `docs/architecture/security.md`), chosen for self-containment: it has **no
dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
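For orientation, per-source rate limiting with iptables hashlimit might be expressed as below. This is a sketch, not the contents of `scripts/breakglass-firewall.sh`: the LAN subnet, the rate, the burst, and the `ipt` / `breakglass_rules` helpers are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the real rules live in scripts/breakglass-firewall.sh and may differ.
# DRY_RUN=1 (the default) prints the iptables commands instead of running them.
LAN_CIDR="${LAN_CIDR:-192.168.1.0/24}"  # assumption: LAN subnet
RATE="${RATE:-6/minute}"                # assumption: new connections allowed per source
BURST="${BURST:-3}"                     # assumption: burst before the limit bites

ipt() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "iptables $*"
  else
    iptables "$@"
  fi
}

breakglass_rules() {
  # Assumes the BREAKGLASS chain already exists and is hooked into INPUT.
  ipt -F BREAKGLASS
  ipt -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
  # LAN bypass: LAN sources skip the rate limit.
  ipt -A BREAKGLASS -s "$LAN_CIDR" -j RETURN
  # Drop NEW connections to :52222 from any single source above the rate.
  ipt -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \
    -m hashlimit --hashlimit-name breakglass --hashlimit-mode srcip \
    --hashlimit-above "$RATE" --hashlimit-burst "$BURST" -j DROP
}

breakglass_rules
```

`--hashlimit-mode srcip` keeps one bucket per source address, so a noisy scanner is throttled without affecting other sources.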
## Secrets (Vault `secret/viktor`)
| Key | Use |
|---|---|
| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
The key has **no passphrase** (so it works in a true cold event without anything
to recall). Treat the private key as the sole credential — guard the laptop copy.
> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
> inert; remove it when you have a Vault token with the `patch` capability
> (`vault kv patch` / merge-patch — the everyday token lacks it).
## Connect
Client `~/.ssh/config`:
```
Host breakglass
HostName viktorbarzin.ddns.net # follows the dynamic WAN IP
Port 52222
User root
IdentityFile ~/.ssh/breakglass_ed25519
IdentitiesOnly yes
```
Then:
```bash
ssh breakglass # shell on the Proxmox host
ssh -J breakglass root@10.0.20.1 # jump to pfSense (or any LAN host)
ssh -D 1080 breakglass # SOCKS5 → reach any internal IP
```
There is **no `bg()` knock function** anymore — delete it from your shell rc if
you added it under the old design.
## Cold-event IP cheat sheet (cluster DNS is down)
| Host | IP |
|---|---|
| Proxmox host | `192.168.1.127` |
| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
| k8s API | `10.0.20.100` |
| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
| edge router | `192.168.1.1` |
## Deploy / re-provision the host config
Source of truth lives in `infra/scripts/`. To (re)deploy:
```bash
# 1. break-glass key authorized for the exposed port
PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"
# 2. sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
# 3. firewall (rate-limit) + boot unit
scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'
# 4. fail2ban jail
scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
```
The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`,
ordered `After=network-pre.target`) is a manual host unit; recreate it if the
host is rebuilt:
```ini
[Unit]
Description=Break-glass base firewall (key-only SSH on :52222)
After=network-pre.target
Wants=network-pre.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/breakglass-firewall.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
## Edge-router forward (manual — live device, not Terraform)
TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
Forwarding. The break-glass rule:
| Service Name | Device IP | External Port | Internal Port | Protocol |
|---|---|---|---|---|
| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
- **External port must equal internal port.** The firmware rejects any remap
(e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
ones."* Hence ext==int 52222.
- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
removes the rule and toasts "Operation succeeded".
- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
`RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
identity-guarded delete). Router password: Vault
`secret/viktor/edge_router_192_168_1_1_password`.
## Rotate / revoke
- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
`vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`,
redeploy step 1 above.
- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
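Revocation is a one-line file edit; a small helper makes it repeatable. A minimal sketch, assuming the key is identified by a unique substring of its public-key line; `revoke_key` and the sample keys are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: drop one public key from an authorized_keys-style file, keeping the rest.
revoke_key() {
  local file="$1" pubkey="$2" tmp
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  # grep -v exits 1 when no lines remain, which is a valid outcome here.
  grep -vF -- "$pubkey" "$file" > "$tmp" || true
  chmod 600 "$tmp" && mv "$tmp" "$file"
}

# Demo on a scratch file with two made-up keys.
demo="$(mktemp)"
printf '%s\n' \
  'ssh-ed25519 AAAAexampleKEEP keep@example' \
  'ssh-ed25519 AAAAexampleDROP drop@example' > "$demo"
revoke_key "$demo" 'AAAAexampleDROP'
cat "$demo"
```

On the host the target would be `/root/.ssh/authorized_keys.breakglass`; since that file normally holds a single key, revoking it leaves the file empty and `:52222` accepts no key at all.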
## History
- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
**UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
security (the SSH key already makes the port brute-force-proof) and its only
benefit — hiding the port — came at the cost of a **circular dependency**: the
knock sequence lived only in in-cluster Vault, unreachable in the exact
cold/away scenario break-glass exists for. That caused a real lockout. The
knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
were removed.

@ -35,6 +35,41 @@ Attribution table:
Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.
## 1b. Connection logs in Loki (passive, always-on — catch a real drop)
Three layers of the real path log every t3 `/ws` connection to Loki, so a drop
the user actually experienced is attributable after the fact without a repro. A
drop is **a short-lived `/ws` connection** (a healthy session holds one socket
for hours); the client's 20s heartbeat watchdog reconnects on any break.
| Layer | Loki stream | What it tells you |
|---|---|---|
| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP |
| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) |
| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`** — the discriminator below |
`cause` on the dispatch `ws close` line:
- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down
(`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile /
network-quality** drop (or CF/tunnel blip); t3-serve was fine.
- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF
/ refused) → t3-serve stall/restart/OOM.
- **`graceful`** — clean close from either side (e.g. the client watchdog's
`disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+
graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose
stall was NOT on devvm → last-mile.
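The rules above can be sketched as a small lookup. This is only an illustration, not part of t3-dispatch: `likelyLayer` and its return strings are hypothetical names, and treating a graceful close shorter than 20s as carrying no signal is an assumption the runbook does not state.

```go
package main

import "fmt"

// heartbeatGapMs mirrors the client's 20s heartbeat watchdog from the runbook.
const heartbeatGapMs = 20_000

// likelyLayer is a hypothetical helper: it maps the dur_ms and cause fields of
// a dispatch "ws close" line to the layer the runbook would suspect first.
func likelyLayer(durMs int64, cause string) string {
	switch cause {
	case "downstream_closed":
		// Client / Cloudflare / Traefik tore the socket down; t3-serve was fine.
		return "last-mile or CF/tunnel"
	case "upstream_closed":
		// The user's t3 serve closed or reset the connection.
		return "t3-serve"
	case "graceful":
		if durMs >= heartbeatGapMs {
			// Possibly a heartbeat timeout; cross-check devvm pressure (§3).
			return "heartbeat timeout, check devvm pressure"
		}
		// Assumption: a short clean close carries no further signal.
		return "clean close"
	default:
		// classifyClose passes unknown errors through verbatim.
		return "unclassified"
	}
}

func main() {
	fmt.Println(likelyLayer(4200, "downstream_closed"))
	fmt.Println(likelyLayer(21034, "graceful"))
	fmt.Println(likelyLayer(900, "upstream_closed"))
}
```

The mapping is deliberately coarse: it only says where to look first, and the Traefik and cloudflared streams still decide which side actually hung up.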
Triage query (Grafana Explore → Loki) — every short t3 socket in a window:
```logql
{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
| regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000
```
Line the timestamp up against `{job="traefik"}` (duration + edge IP) and
`{job="cloudflared"}` (CF-side close) for the same second to localise the layer.
devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`.
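For offline triage (for example on lines exported from Loki or read from `journalctl`), the same filter as the query above can be approximated in Go. This is a minimal sketch under stated assumptions: `parseClose`, `shortCloses` and `closeEvent` are hypothetical names, and the line layout is taken from the `ws close user=… ip=… dur_ms=… cause=…` format that t3-dispatch logs.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// wsClose extracts dur_ms and cause from a dispatch "ws close" line. The ip
// field may hold a comma-separated X-Forwarded-For chain, so it is skipped
// rather than parsed.
var wsClose = regexp.MustCompile(`ws close .*dur_ms=([0-9]+) cause=(\S+)`)

// closeEvent is a hypothetical record for one closed /ws connection.
type closeEvent struct {
	DurMs int64
	Cause string
}

// parseClose returns the event for a "ws close" line, or false for any other line.
func parseClose(line string) (closeEvent, bool) {
	m := wsClose.FindStringSubmatch(line)
	if m == nil {
		return closeEvent{}, false
	}
	d, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return closeEvent{}, false
	}
	return closeEvent{DurMs: d, Cause: m[2]}, true
}

// shortCloses keeps only connections that lived less than maxMs milliseconds.
func shortCloses(lines []string, maxMs int64) []closeEvent {
	var out []closeEvent
	for _, l := range lines {
		if ev, ok := parseClose(l); ok && ev.DurMs < maxMs {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	lines := []string{
		"ws open user=wizard ip=1.2.3.4",
		"ws close user=wizard ip=1.2.3.4, 10.10.1.1 dur_ms=21034 cause=graceful",
		"ws close user=wizard ip=1.2.3.4 dur_ms=7200000 cause=downstream_closed",
	}
	for _, ev := range shortCloses(lines, 120000) {
		fmt.Printf("dur_ms=%d cause=%s\n", ev.DurMs, ev.Cause)
	}
}
```

Note that an unclassified error is logged verbatim as `cause`, so `\S+` captures only its first word; that is enough for filtering, but the full line is needed to read the actual error.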
## 2. Server-side log recipe (per-event forensics)
On devvm (timestamps in UTC):


@ -27,6 +27,12 @@ TARGETS=(
"220:scsi0:40:40" # docker-registry
)
# Sort a disk spec's comma-separated options so two specs with the same
# option set but different key order compare equal.
normalized() {
tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, -
}
apply_one() {
local spec="$1"
local vmid slot rd wr
@ -49,8 +55,13 @@ apply_one() {
newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}"
# Skip the qm-set call entirely when state already matches — keeps
# journal noise low under the hourly timer.
if [[ "$current" == "$newvalue" ]]; then
# journal noise low under the hourly timer. Compare option SETS, not raw
# strings: `qm config` prints keys in its own canonical order, so a raw
# compare never matched and every hourly run re-issued `qm set`, which
# live-rewrites the running VM's QEMU throttle state via QMP (implicated
# in the 2026-06-11 devvm I/O stall — see
# docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md).
if [[ "$(normalized "$current")" == "$(normalized "$newvalue")" ]]; then
echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op"
return 0
fi


@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -euo pipefail
# Break-glass base firewall (redesigned 2026-06-11; replaced the port-knock gate).
#
# Source of truth. Deploy to the PVE host with:
# scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
# ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl restart breakglass-firewall.service'
# The breakglass-firewall.service oneshot runs this at boot (RemainAfterExit).
#
# Model: key-only SSH break-glass on :52222, openly reachable from the WAN, NO
# port-knock. The SSH key is the gate (brute-force-proof); the rate-limit below
# only trims scanner noise / slows a hypothetical sshd 0-day.
# :22 -> LAN admin (all of root's keys), always allowed.
# :52222 -> WAN break-glass. LAN/VLAN sources bypass the limit; external NEW
# connections are rate-limited per source IP, then accepted.
iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
iptables -A BREAKGLASS -p tcp --dport 52222 -s 192.168.1.0/24 -j ACCEPT
iptables -A BREAKGLASS -p tcp --dport 52222 -s 10.0.0.0/8 -j ACCEPT
iptables -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \
-m hashlimit --hashlimit-name bg_ssh --hashlimit-mode srcip \
--hashlimit-above 6/min --hashlimit-burst 3 -j DROP
iptables -A BREAKGLASS -p tcp --dport 52222 -j ACCEPT


@ -0,0 +1,17 @@
# systemd unit for promtail on the devvm (10.0.10.10). Install to
# /etc/systemd/system/promtail.service. See scripts/devvm-promtail.yaml for the full deploy.
[Unit]
Description=Promtail (ships devvm journal -> cluster Loki)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure
RestartSec=5
User=root
Group=root
[Install]
WantedBy=multi-user.target


@ -0,0 +1,59 @@
# Promtail config for the devvm (10.0.10.10) — ships the systemd journal to cluster Loki.
#
# devvm is a standalone VM (NOT a k8s node), so its journal — including the t3
# stack (t3-dispatch, t3-serve@<user>) — was never in Loki. Added 2026-06-11 for
# t3 drop forensics: t3-dispatch now logs each /ws connection's open/close with
# duration + which side hung up (downstream_closed = client/CF/Traefik went away;
# upstream_closed = t3-serve closed/stalled; graceful = clean close). Joined with
# Traefik's per-/ws duration (already in Loki) this attributes every drop to a layer.
#
# NOT Terraform-managed (devvm is outside k8s) — same hand-deployed pattern as
# scripts/pve-promtail.* and the rpi-sofia promtail. This file is source-of-truth.
#
# Deploy (on devvm, as root via sudo):
# sudo install -d -m 0755 /etc/promtail /var/lib/promtail
# sudo install -m 0644 scripts/devvm-promtail.yaml /etc/promtail/config.yml
# sudo install -m 0644 scripts/devvm-promtail.service /etc/systemd/system/promtail.service
# # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755).
# sudo systemctl daemon-reload && sudo systemctl enable --now promtail
# # Loki reach: loki.viktorbarzin.lan (Technitium CNAME -> live Traefik LB; insecure cert).
#
# Streams produced:
# {job="devvm-journal"} — full devvm journal
# {job="devvm-journal", unit="t3-dispatch.service"} — dispatch (ws open/close lines)
# {job="devvm-journal", unit="t3-serve@wizard.service"} — per-user t3 serve
# {job="sshd-devvm"} — sshd auth lines (parity with sshd-pve)
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: warn
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: https://loki.viktorbarzin.lan/loki/api/v1/push
    tls_config:
      insecure_skip_verify: true
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      json: false
      path: /var/log/journal
      labels:
        host: devvm
        job: devvm-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal_priority_keyword']
        target_label: level
      - source_labels: ['__journal_syslog_identifier']
        target_label: identifier
      # sshd auth lines -> job=sshd-devvm (parity with the pve shipper's sshd-pve).
      - source_labels: ['__journal_syslog_identifier']
        regex: 'sshd.*'
        target_label: job
        replacement: 'sshd-devvm'


@ -0,0 +1,18 @@
# Break-glass SSH fail2ban jail (redesigned 2026-06-11). Source of truth.
# Deploy to the PVE host with:
# scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
# ssh root@192.168.1.127 'systemctl restart fail2ban'
#
# GOTCHA (Debian 13 / OpenSSH 9.8+): auth lines are logged under
# _COMM=sshd-session, NOT _COMM=sshd. The stock Debian jail keys journalmatch on
# `_SYSTEMD_UNIT=ssh.service + _COMM=sshd` and therefore silently NEVER bans.
# Match by unit only so both sshd and sshd-session lines are seen. Ban on both
# SSH ports (the WAN break-glass listener is :52222).
[sshd]
enabled = true
backend = systemd
journalmatch = _SYSTEMD_UNIT=ssh.service
port = ssh,52222
maxretry = 4
findtime = 10m
bantime = 1h


@ -0,0 +1,31 @@
# Break-glass SSH drop-in (redesigned 2026-06-11). Source of truth.
# Deploy to the PVE host with:
# scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
# ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
#
# :22 = LAN admin, all of root's keys (default AuthorizedKeysFile).
# :52222 = WAN-exposed break-glass. The edge router forwards WAN tcp/52222 ->
# 192.168.1.127:52222 (external port MUST equal internal port on the
# TP-Link AX6000 — it rejects remaps; port 22 itself is reserved).
# The Match LocalPort block trusts ONLY the dedicated break-glass key
# (authorized_keys.breakglass), so a leak of any other root key does
# NOT grant internet access. Rate-limited by the BREAKGLASS iptables
# chain + fail2ban. No port-knock.
#
# NOTE: the trailing `Match all` is REQUIRED. /etc/ssh/sshd_config has
# `Include sshd_config.d/*.conf` near the top but a global `PermitRootLogin`
# further down; without `Match all` resetting context, that later global
# directive would be swallowed into the `Match LocalPort 52222` condition.
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
MaxAuthTries 3
LoginGraceTime 20
Match LocalPort 52222
AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
PermitRootLogin prohibit-password
Match all


@ -2,4 +2,4 @@ module t3-dispatch
go 1.22
require github.com/gorilla/websocket v1.5.3 // indirect
require github.com/gorilla/websocket v1.5.3


@ -212,7 +212,64 @@ func handler(w http.ResponseWriter, r *http.Request) {
}
// Steady state: reverse-proxy (incl. WebSocket upgrade) to the user's instance.
target, _ := url.Parse(fmt.Sprintf("http://127.0.0.1:%d", e.Port))
httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
proxy := httputil.NewSingleHostReverseProxy(target)
// WebSocket connection logging: t3 drops manifest as the client's 20s
// heartbeat watchdog reconnecting, so a flood of short-lived /ws connections
// IS the symptom. Log each WS open + close (duration + which side hung up) so
// a drop is attributable from logs alone — graceful closes otherwise leave no
// trace (the default ReverseProxy only logs on error). cause stays "graceful"
// unless ErrorHandler fires; ErrorHandler runs within ServeHTTP, so reading
// cause after ServeHTTP returns needs no synchronisation.
if isWebSocket(r) {
start := time.Now()
ip := clientIP(r)
cause := "graceful"
proxy.ErrorHandler = func(rw http.ResponseWriter, _ *http.Request, err error) {
cause = classifyClose(err)
}
log.Printf("ws open user=%s ip=%s", e.OsUser, ip)
proxy.ServeHTTP(w, r)
log.Printf("ws close user=%s ip=%s dur_ms=%d cause=%s",
e.OsUser, ip, time.Since(start).Milliseconds(), cause)
return
}
proxy.ServeHTTP(w, r)
}
// isWebSocket reports whether r is a WebSocket upgrade request.
func isWebSocket(r *http.Request) bool {
return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") &&
strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade")
}
// clientIP returns the forwarded client chain (X-Forwarded-For, set by
// Traefik/CF) when present, else the immediate peer — for correlating a drop
// to a specific client/edge.
func clientIP(r *http.Request) string {
if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
return xff
}
return r.RemoteAddr
}
// classifyClose maps a reverse-proxy copy error to which side ended the socket:
// downstream (client/CF/Traefik went away) vs upstream (the user's t3 serve
// closed/reset). Distinguishes a last-mile/client drop from a t3-serve stall.
func classifyClose(err error) string {
if err == nil {
return "graceful"
}
s := err.Error()
switch {
case strings.Contains(s, "context canceled"):
return "downstream_closed" // client / CF / Traefik tore down
case strings.Contains(s, "reset by peer"), strings.Contains(s, "broken pipe"),
strings.Contains(s, "EOF"), strings.Contains(s, "connection refused"):
return "upstream_closed" // t3 serve closed / unreachable
default:
return s
}
}
func main() {


@ -301,3 +301,63 @@ func TestProbeWSEcho(t *testing.T) {
}
}
}
func TestIsWebSocket(t *testing.T) {
cases := []struct {
up, conn string
want bool
}{
{"websocket", "Upgrade", true},
{"websocket", "keep-alive, Upgrade", true},
{"WebSocket", "upgrade", true},
{"", "keep-alive", false},
{"h2c", "Upgrade", false},
{"websocket", "keep-alive", false},
}
for _, c := range cases {
r, _ := http.NewRequest("GET", "/ws", nil)
if c.up != "" {
r.Header.Set("Upgrade", c.up)
}
r.Header.Set("Connection", c.conn)
if got := isWebSocket(r); got != c.want {
t.Errorf("isWebSocket(up=%q conn=%q)=%v want %v", c.up, c.conn, got, c.want)
}
}
}
func TestClassifyClose(t *testing.T) {
cases := []struct {
in error
want string
}{
{nil, "graceful"},
{errTest("context canceled"), "downstream_closed"},
{errTest("read tcp 127.0.0.1:60664->127.0.0.1:3773: read: connection reset by peer"), "upstream_closed"},
{errTest("write: broken pipe"), "upstream_closed"},
{errTest("unexpected EOF"), "upstream_closed"},
{errTest("dial tcp 127.0.0.1:3773: connect: connection refused"), "upstream_closed"},
{errTest("some novel error"), "some novel error"},
}
for _, c := range cases {
if got := classifyClose(c.in); got != c.want {
t.Errorf("classifyClose(%v)=%q want %q", c.in, got, c.want)
}
}
}
type errTest string
func (e errTest) Error() string { return string(e) }
func TestClientIP(t *testing.T) {
r, _ := http.NewRequest("GET", "/ws", nil)
r.RemoteAddr = "10.0.0.5:1234"
if got := clientIP(r); got != "10.0.0.5:1234" {
t.Errorf("clientIP no-xff = %q", got)
}
r.Header.Set("X-Forwarded-For", "1.2.3.4, 10.10.1.1")
if got := clientIP(r); got != "1.2.3.4, 10.10.1.1" {
t.Errorf("clientIP xff = %q", got)
}
}


@ -1,4 +1,4 @@
{
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. Your kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; you can verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Feature-sized work is done in an isolated git worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>`) and merged into master when finished, so several agents can work the same project at once — full lifecycle in ~/.claude/rules/execution.md §3; trivial single-commit fixes may go straight to master. When you finish a change in a repo under ~/code (or ~/code itself when it IS the clone): commit it ON master and push to the forgejo remote. THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request) — this matters more than the change itself. Never use [ci skip] as a non-admin (it would hide the change from the audit feed; harmless no-op applies are fine). If the push is rejected non-fast-forward, git pull --rebase forgejo master and push again. 
If it is rejected by branch protection (user not whitelisted), fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials). Keep every clone on a clean master when done so background auto-refresh keeps working. Tell the user in plain words what happened ('done — your change is live/recorded'). Full recipe: AGENTS.md → 'Non-admin workstation users' in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning, quality) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code, in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it (e.g. ~/code/tripit). [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.",
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
"model": "claude-fable-5"
}


@ -91,14 +91,21 @@ resource "authentik_outpost" "embedded" {
protocol_providers = [authentik_provider_proxy.catchall.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "trace"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
# info, not trace: the outpost sits on the hot path of every request to
# every auth="required" ingress; trace logging is per-request overhead
# with no operational value (request access lines are emitted at info).
log_level = "info"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
# 2 replicas: removes the single-pod hot path for all forward-auth
# subrequests. Safe since sessions moved to the shared Postgres backend
# (authentik_providers_proxy_proxysession, 2026-05-10), so there is no
# pod-local session state anymore.
kubernetes_replicas = 2
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
@ -198,3 +205,39 @@ resource "authentik_stage_user_login" "default_login" {
]
}
}
# -----------------------------------------------------------------------------
# Default Identification stage, adopted 2026-06-10 to embed the password
# field on the identification screen (single-screen login: one round trip and
# one screen instead of two). Per authentik docs, when an Identification stage
# carries a password stage the Password stage must NOT be bound separately;
# the redundant order-20 binding on default-authentication-flow (pk
# 0fc677db-a23f-4ee7-8648-da342e14573b) was deleted via the API in the same
# change. Social-login users are unaffected: source buttons stay on the same
# screen and bypass the password field.
# -----------------------------------------------------------------------------
data "authentik_stage" "default_authentication_password" {
name = "default-authentication-password"
}
resource "authentik_stage_identification" "default_identification" {
name = "default-authentication-identification"
password_stage = data.authentik_stage.default_authentication_password.id
lifecycle {
# Pin only password_stage; everything else stays UI-managed (same pattern
# as authentik_stage_user_login.default_login above).
ignore_changes = [
user_fields,
case_insensitive_matching,
show_matched_user,
show_source_labels,
sources,
enrollment_flow,
recovery_flow,
passwordless_flow,
pretend_user_exists,
captcha_stage,
]
}
}


@ -29,7 +29,7 @@ resource "kubernetes_namespace" "authentik" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -111,3 +111,44 @@ module "ingress-outpost" {
anti_ai_scraping = false
exclude_crowdsec = true
}
# Immutable caching for the flow-executor static assets. Authentik serves
# /static/dist/* with version-fingerprinted filenames (e.g. poly-2026.2.4.js)
# but no max-age, so browsers re-validate the login JS bundle on every signin
# and split-horizon internal users (direct to Traefik, no Cloudflare) get no
# edge cache at all. Long-lived immutable caching is safe: every authentik
# upgrade changes the asset URLs.
resource "kubernetes_manifest" "static_cache_headers" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "static-cache-headers"
namespace = kubernetes_namespace.authentik.metadata[0].name
}
spec = {
headers = {
customResponseHeaders = {
"Cache-Control" = "public, max-age=31536000, immutable"
}
}
}
}
}
module "ingress-static" {
source = "../../../../modules/kubernetes/ingress_factory"
# Same-host path carve-out of the public authentik UI ingress above, only
# adding the cache-headers middleware for the static asset prefix.
# auth = "none": versioned static assets of the (already public) Authentik login UI.
auth = "none"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik-static"
host = "authentik"
service_name = "goauthentik-server"
ingress_path = ["/static"]
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
homepage_enabled = false
extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
}


@ -12,3 +12,7 @@ default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 5
ignore_startup_parameters = extra_float_digits
; Reap server connections stuck "idle in transaction" (e.g. an authentik pod
; killed mid-migration leaves a ghost transaction holding the migration
; advisory lock, serializing every subsequent pod boot — 2026-06-10 incident).
idle_transaction_timeout = 300


@ -48,6 +48,11 @@ resource "kubernetes_deployment" "pgbouncer" {
labels = {
app = "pgbouncer"
}
annotations = {
# pgbouncer reads its ini only at startup (subPath mount never
# propagates updates anyway), so roll the pods on config change.
"checksum/pgbouncer-config" = sha1(kubernetes_config_map.pgbouncer_config.data["pgbouncer.ini"])
}
}
spec {
@ -157,7 +162,8 @@ resource "kubernetes_deployment" "pgbouncer" {
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
spec[0].template[0].spec[0].container[0].image_pull_policy, # Keel flip-flops this between Always/IfNotPresent
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1


@ -1,4 +1,10 @@
authentik:
# NOTE: because we set existingSecret below, the chart does NOT render the
# authentik.* values into an AUTHENTIK_* env Secret — the live env comes
# from the orphaned, helm-keep-policy `goauthentik` Secret created by chart
# 2025.10.3. Anything under authentik.* here is effectively INERT. All new
# or tuned config MUST go through server.env / worker.env instead (see
# .claude/reference/authentik-state.md).
log_level: warning
# log_level: trace
secret_key: ""
@ -14,38 +20,47 @@ authentik:
port: 6432
user: authentik
password: ""
# Persistent client-side connections (safe with PgBouncer session mode;
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
# setup overhead off the ~70 sequential ORM ops per flow stage.
conn_max_age: 60
conn_health_checks: true
cache:
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
# moved cache storage from Redis to Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
timeout_flows: 1800
timeout_policies: 900
web:
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
workers: 3
threads: 4
worker:
# Celery-equivalent worker threads per pod (default 2, renamed from
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
threads: 4
server:
replicas: 3
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
env:
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
# Gunicorn: 3 workers × 4 threads per server pod (defaults 2×4).
# Pairs with the server memory limit of 2Gi (each worker preloads
# Django ~500Mi).
- name: AUTHENTIK_WEB__WORKERS
value: "3"
- name: AUTHENTIK_WEB__THREADS
value: "4"
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-planning the flow
# (~70 sequential ORM ops per flow stage POST).
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
# Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE here. With PgBouncer in
# session mode every persistent Django connection pins a server connection
# 1:1, so the 3x(20+5) pool saturated during the 2026-06-10 rolling
# restart (58s pool waits, readiness flapping, and the shared CNPG primary
# failed over mid-storm). The ~1-2ms/request connection-setup saving is
# not worth that risk on the shared PG substrate.
# Liveness budget sized for slow boots (2026-06-10 incident): during a
# rolling restart pods queue on authentik's DB migration lock; the go layer
# answers /-/health/live before the core is up, so with the default 3x10s
# budget kubelet kill-looped every booting pod and amplified the contention.
# Startup probe still bounds total boot time (60x10s).
livenessProbe:
failureThreshold: 6
timeoutSeconds: 5
strategy:
type: RollingUpdate
rollingUpdate:
@ -76,17 +91,36 @@ server:
minAvailable: 2
global:
addPrometheusAnnotations: true
image:
# Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
# namespace) bumps the IMAGE between chart releases, while helm defaults
# the tag to the chart appVersion — so any helm upgrade silently
# DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
# apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
# DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
# boot-storm.md). Keep this tag in sync with what Keel has deployed when
# touching this chart; clear it only when bumping the chart version itself.
tag: "2026.2.4"
worker:
# 2 replicas: workers handle background tasks (LDAP sync, email,
# certificate renewal) — no user-facing traffic, so 2-of-3 isn't
# needed for availability. Drop saves ~100m sustained CPU.
replicas: 2
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
# Dramatiq worker threads per pod (default 2).
- name: AUTHENTIK_WORKER__THREADS
value: "4"
# Keep cache settings in lockstep with server.env. (No CONN_MAX_AGE —
# see the server.env note: session-mode PgBouncer pins persistent conns.)
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
strategy:
type: RollingUpdate
rollingUpdate:


@ -1170,3 +1170,9 @@ resource "kubectl_manifest" "mutate_strip_cpu_limits" {
}
})
}
# Apply re-trigger 2026-06-11: 87702bdc landed with [ci skip], so this stack was
# never CI-applied; tripit#26 (tour-guide redo) needs the tts GPU-priority
# exclusion live before the tts stack applies. No functional change in this commit.
# (See stacks/tts/main.tf for the same apply-trigger note, tripit#26.)


@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
]
}
}


@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}
@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}
@ -720,6 +738,11 @@ resource "kubernetes_config_map" "auth_proxy_config" {
"default.conf" = <<-EOT
upstream authentik {
server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
# Reuse connections to the outpost. Without this every forward-auth
# subrequest (= every request to every auth="required" ingress) opens
# a fresh TCP connection. Requires HTTP/1.1 + cleared Connection
# header on the proxy_pass locations below.
keepalive 32;
}
server {
listen 9000;
@ -734,6 +757,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
location /outpost.goauthentik.io/auth/traefik {
proxy_pass http://authentik;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
@ -764,6 +789,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
location /outpost.goauthentik.io/ {
proxy_pass http://authentik;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 3s;
proxy_read_timeout 10s;
proxy_set_header Host $host;
@ -820,6 +847,11 @@ resource "kubernetes_deployment" "auth_proxy" {
labels = {
app = "auth-proxy"
}
annotations = {
# nginx only reads its config at startup, so roll the pods whenever
# the ConfigMap content changes.
"checksum/auth-proxy-config" = sha1(kubernetes_config_map.auth_proxy_config.data["default.conf"])
}
}
spec {
topology_spread_constraint {
@ -908,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}


@ -96,6 +96,14 @@ locals {
CALENDAR_CONFLICT_PROVIDER = "nextcloud"
NEXTCLOUD_CALDAV_BASE = "https://nextcloud.viktorbarzin.me/remote.php/dav"
NEXTCLOUD_CALDAV_USER = "admin"
# Tour-guide content pipeline (tripit#24/#25): these three default to `fake`
# in tripit's config, which is what shipped dark on 2026-06-08: prod only
# ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch
# discovery, the five web story sources, and the claude-agent-service script
# writer (CLAUDE_AGENT_TOKEN already in tripit-secrets).
SIGHT_DISCOVERY_PROVIDER = "wikipedia"
STORY_SOURCE_MODE = "web"
SCRIPT_WRITER_MODE = "chat"
}
}


@ -73,8 +73,14 @@ locals {
repo_id = "chatterbox-multilingual"
}
tts_engine = {
device = "cuda"
predefined_voices_path = "/data/voices"
device = "cuda"
# Predefined voices come from the IMAGE's bundled set (28 reference WAVs
# under the devnen server's /app/voices) rather than the NFS PVC: nobody
# can seed /data/voices without NFS-host shell access, and an empty
# predefined dir means /v1/audio/voices serves nothing (it gates the
# readiness probe). tripit's Voice catalog (tripit#30) names a subset of
# these stems. /data keeps reference_audio (future cloning) + HF cache.
predefined_voices_path = "/app/voices"
reference_audio_path = "/data/reference_audio"
}
})
@ -472,3 +478,7 @@ resource "kubernetes_cron_job_v1" "offpeak" {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Apply trigger 2026-06-11 (tripit#26): the previous push was a merge commit, so
# the changed-stack detector (git diff HEAD~1 HEAD = first-parent diff) missed
# stacks/tts entirely. This is a non-merge commit, so the diff names this stack.
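
The apply-trigger note above depends on a changed-stack detector that maps the file names from a `git diff` onto stack directories. That detector is not part of this diff, so the following is only a minimal sketch of the idea, assuming stacks live under `stacks/<name>/`; the function name `changed_stacks` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def changed_stacks(paths, root="stacks"):
    """Return the sorted stack names touched by a list of changed file paths."""
    stacks = set()
    for raw in paths:
        parts = PurePosixPath(raw.strip()).parts
        # Count a path only if it is a file inside <root>/<stack>/...
        if len(parts) >= 3 and parts[0] == root:
            stacks.add(parts[1])
    return sorted(stacks)


# Hypothetical output of `git diff --name-only <base> <head>`.
diff_output = """\
stacks/tts/main.tf
stacks/tripit/main.tf
stacks/tts/modules/tts/values.yaml
docs/post-mortems/example.md
"""

print(changed_stacks(diff_output.splitlines()))
```

A stack appears in the result only when the diff lists at least one file beneath its directory, which is why a comment-only change inside `stacks/tts` is enough to get the stack applied.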