Merge forgejo/master (tts stack) into wizard/android-emulator

# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
commit 6bf216751b
parent 8b7c77c794 798b025580
37 changed files with 1774 additions and 86 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -135,7 +135,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle

 ## Database Host

-**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
+**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.

 **CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.

@@ -159,7 +159,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
-| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
+| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
 | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@@ -178,7 +178,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.

 - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
-- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
+- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
 - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -5,17 +5,26 @@
 ## Applications (11)
 | Application | Provider Type | Auth Flow |
 |-------------|--------------|-----------|
-| Cloudflare Access | OAuth2/OIDC | explicit consent |
+| Cloudflare Access | OAuth2/OIDC | implicit consent |
 | Domain wide catch all | Proxy (forward auth) | implicit consent |
-| Forgejo | OAuth2/OIDC | explicit consent |
+| Forgejo | OAuth2/OIDC | implicit consent |
 | Grafana | OAuth2/OIDC | implicit consent |
-| Headscale | OAuth2/OIDC | explicit consent |
-| Immich | OAuth2/OIDC | explicit consent |
+| Headscale | OAuth2/OIDC | implicit consent |
+| Immich | OAuth2/OIDC | implicit consent |
 | Kubernetes | OAuth2/OIDC (public) | implicit consent |
 | Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
-| linkwarden | OAuth2/OIDC | explicit consent |
+| linkwarden | OAuth2/OIDC | implicit consent |
+| Vault | OAuth2/OIDC | implicit consent |
 | wrongmove | OAuth2/OIDC | implicit consent |

+> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
+> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
+> and Vault (53) were switched from
+> `default-provider-authorization-explicit-consent` via the API (these
+> providers are UI-managed, not in TF). All are first-party apps; the
+> expiring consent screen (re-shown every 4 weeks per app) only slowed
+> first-time signin.
+
 > **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
 > confidential client `k8s-dashboard`, built for seamless dashboard SSO via
 > oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
@@ -60,8 +69,27 @@
 - All sources use `invitation-enrollment` as enrollment flow (new users require invitation)

 ## Authorization Flows
-- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
-- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
+- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
+- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
+
+## Authentication Flow (single-screen login, 2026-06-10)
+
+`default-authentication-flow` bindings: identification (order 10) →
+mfa-validation (order 30) → user-login (order 100). The identification
+stage (`default-authentication-identification`, pk
+`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
+`default-authentication-password`, so username + password render on ONE
+screen (one round trip instead of two). The previously separate
+password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
+was DELETED via the API — authentik requires removing it when the
+identification stage embeds the password field. `password_stage` is pinned in
+Terraform (`authentik_stage_identification.default_identification` in
+`stacks/authentik/authentik_provider.tf`); all other stage fields stay
+UI-managed via `ignore_changes`. Social-login buttons remain on the same
+screen and bypass the password field, so Google/GitHub/Facebook users are
+unaffected. If a future authentik upgrade/blueprint re-adds the order-20
+binding, users would briefly see a second password prompt — delete the
+binding again.

 ## Invitation Enrollment Flow
 Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@@ -149,7 +177,12 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
-- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
+- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
+- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
+- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
+- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
+- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
+- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).

 ## Upgrade Validation Checklist

@@ -161,8 +194,9 @@ Run after **any** of these:
 The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.

 ```bash
-# 1. Service routes to the outpost pod (NOT the server pods).
-#    Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
+# 1. Service routes to the outpost pods (NOT the server pods).
+#    Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
+#    (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
 kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost

 # 2. Service selector still excludes the server pods. Expected: includes
--- a/.claude/reference/proxmox-inventory.md
+++ b/.claude/reference/proxmox-inventory.md
@@ -92,19 +92,21 @@ Channel 3:  A4 [32G] ──── A8 [32G]  ──── A12[ 8G ]     = 72 GB
 | VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
 |------|------|--------|------|-----|---------|------|-------|
 | 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
-| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
+| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
 | 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
 | 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
-| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
-| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
-| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
+| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
+| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
+| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
 | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
 | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
 | ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |

-**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
+**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).

 ## VM Templates
 | VMID | Name | Purpose |
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. | t3code |
+| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. 
**Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works but dispatch **auto-pair is 401-broken on v0.0.26** (latent; live 30-day cookies mask it). | t3code |

 ## Active Use
 | Service | Description | Stack |
--- a/.gitignore
+++ b/.gitignore
@@ -104,5 +104,5 @@ stacks/terminal/clipboard-upload/clipboard-upload
 terraform.tfstate
 terraform.tfstate.backup

-# Per-feature git worktrees (worktree-first workflow — execution.md §3)
+# Per-feature git worktrees (worktree-first workflow — execution.md)
 .worktrees/
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -149,7 +149,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed

 **CNPG** / **pg-cluster**:
 **CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
-_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
+_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.

 ### Secrets

--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@ -40,10 +40,10 @@ graph TB

 | Component | Version | Location | Purpose |
 |-----------|---------|----------|---------|
-| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
+| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
 | Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
 | PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
-| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
+| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
 | Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
 | Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
 | Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
 When `auth = "required"`, an unauthenticated request flows:

 1. Request hits Traefik ingress
-2. ForwardAuth middleware calls Authentik embedded outpost
-3. Authentik checks for valid session cookie
+2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
+3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
 4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
-5. User authenticates via social provider (Google/GitHub/Facebook)
+5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
 6. Authentik creates session, sets cookie, redirects back to original URL
 7. Subsequent requests include session cookie, pass auth check, reach backend

 Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.

+### First-time signin performance (2026-06-10)
+
+Signin latency is dominated by screen count and round trips, not server time
+(DB avg 1.6ms). Standing decisions:
+
+- **Single-screen login**: the identification stage carries `password_stage`,
+  so username+password is one round trip. The separate password-stage binding
+  was removed from `default-authentication-flow` (required by authentik when
+  embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
+- **Implicit consent everywhere**: all OIDC providers are first-party, so none
+  use the explicit-consent flow (it re-prompted every 4 weeks per app).
+- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
+  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
+  15m policy cache, 60s persistent DB connections.
+- **Static assets cached immutable**: `/static` ingress carve-out adds
+  `Cache-Control: public, max-age=31536000, immutable` (assets are
+  version-fingerprinted; authentik itself sends no max-age).
+- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
+- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
+  TCP setup on the forward-auth subrequest path.
+
 **Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.

 ### Social Login & Invitation Flow
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@ -22,9 +22,11 @@ graph TB
        NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
        NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
        NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
+        NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
+        NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
    end

-    subgraph K8s["Kubernetes Cluster v1.34.2"]
+    subgraph K8s["Kubernetes Cluster v1.34.8"]
        direction TB

        subgraph VPA["VPA (Goldilocks - Initial Mode)"]
@ -62,7 +64,7 @@ graph TB
 | Model | Dell PowerEdge R730 |
 | CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
 | Total Cores/Threads | 22 cores / 44 threads |
-| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
+| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
 | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
 | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
 | Hypervisor | Proxmox VE |
@ -76,8 +78,10 @@ graph TB
 | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
+| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
+| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |

-**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
+**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)

 > **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
 > (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
@ -97,7 +101,12 @@ graph TB
 > PVE host (sources in `infra/scripts/`, install pattern per
 > `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
 > `OnCalendar=hourly`, so any drift (config restore, manual `qm
-> set`, fresh clone) self-heals within the hour. Current caps:
+> set`, fresh clone) self-heals within the hour. The script compares
+> *normalized option sets*, so an unchanged config is a true no-op —
+> until 2026-06-11 a raw string compare (defeated by `qm config`'s
+> canonical key order) re-issued `qm set` hourly against running VMs,
+> live-rewriting QEMU throttle state via QMP (implicated in the devvm
+> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
 > 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
 > 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
 > 204 k8s-node4 150/120, 220 docker-registry 40/40.
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -255,6 +255,8 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same

 **Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.

+**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
+
 #### Why no canary tokens

 Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
--- a/docs/architecture/storage.md
+++ b/docs/architecture/storage.md
@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
 - **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
 - **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)

-Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
+`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)

 **Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).

@ -47,7 +47,7 @@ graph TB
    end

    subgraph K8s["Kubernetes Cluster"]
-        CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
+        CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
        CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]

        NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -85,8 +85,7 @@ graph TB
 | Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
 | Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
 | nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
-| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
-| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
+| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
 | TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
 | ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
 | ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
@ -113,7 +112,7 @@ graph TB

 **Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.

-**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
+**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.

 ### Block Storage Flow (Proxmox CSI) — NEW

--- a/docs/plans/2026-05-30-breakglass-ssh-access-design.md
+++ b/docs/plans/2026-05-30-breakglass-ssh-access-design.md
@ -0,0 +1,285 @@
+# Break-Glass SSH Access — Design
+
+> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`.
+> The port-knock was removed: it added no real security (the SSH key already
+> makes the port brute-force-proof) and its knock sequence lived only in
+> in-cluster Vault — unreachable in the exact cold/away scenario break-glass
+> exists for, which caused a real lockout. Retained for history. As-built:
+> `docs/runbooks/breakglass-ssh.md`.
+
+- **Date**: 2026-05-30
+- **Status**: Draft — pending user review
+- **Owner**: Viktor
+- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1)
+
+## 1. Goal
+
+Provide a **cold, brute-force-proof backdoor onto the home LAN from the public
+internet** for the case where the Kubernetes cluster and every cluster-hosted
+remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster
+WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**.
+
+### Hard requirements (from the user)
+
+1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are
+   down. The path must touch **nothing in the cluster** (no Authentik, Traefik,
+   Technitium/AdGuard DNS, cloudflared).
+2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology,
+   k8s API, etc.).
+3. **No brute force**: no password-guessable surface.
+4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard /
+   Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only.
+5. **Minimal effort**, and ideally **honor the locked Wave 1 policy**
+   (`no public-IP access — … PVE sshd must transit LAN or Headscale`).
+
+## 2. Decision
+
+**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.**
+
+- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box
+  (`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control)
+  and it sits directly on the `192.168.1.0/24` segment, so the path **does not
+  traverse pfSense or the cluster** — it survives a wedged pfSense too, not just
+  a down cluster.
+- SSH is the only externally-usable remote tool **pre-installed on every
+  Linux/macOS box**, satisfying requirement 4.
+- **Key-only auth** (no passwords anywhere) makes password brute force
+  impossible → requirement 3.
+- A **port-knock** keeps the external SSH port **closed/invisible to scanners**
+  until a knock sequence is sent. This restores the "no standing public service"
+  property we'd have had with WireGuard and keeps us within the **intent** of the
+  Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a
+  **bash `/dev/udp` one-liner** — zero install.
+
+### Alternatives rejected
+
+| Option | Why rejected |
+|---|---|
+| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. |
+| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). |
+| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. |
+| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure at the cost of ~15 min more setup. |
+| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). |
+
+## 3. Architecture
+
+```
+  Your laptop (anywhere) — stock ssh + bash, nothing installed
+     │  (1) UDP knock sequence  →  bash: echo > /dev/udp/<pub>/<port>   (instant, no handshake)
+     │  (2) ssh -p 52222 root@<pub>
+     ▼
+  Edge router 192.168.1.1   (the box the stored password unlocks)
+     │  forwards:  UDP <k1>,<k2>,<k3>  +  TCP 52222   →   192.168.1.127
+     ▼
+  Proxmox host 192.168.1.127   ← path bypasses pfSense entirely
+     ├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s)
+     ├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only
+     └─ once in:  virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN
+```
+
+**Why it meets "cold + full LAN":** the host is up by definition of the chosen
+failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host
+you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to
+the VLANs via pfSense when pfSense is up) or by using SSH's built-in
+`-J`/`-D` — both stock, no install.
+
+## 4. Components
+
+### 4.1 Edge router @ 192.168.1.1 (manual, in the browser)
+Add port-forwards (same place the existing `51821` WireGuard forward lives):
+- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale)
+- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault)
+
+If the router supports a **port range** forward, a single range covering the
+knock ports + 52222 is tidier than four rules.
+
+> **Verify (#1 implementation check):** whether `.1` **preserves the source IP**
+> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by
+> knocking + connecting from an external network and checking `/var/log/auth.log`
+> + `knockd` syslog for the observed source IP. The design works either way (see
+> §4.3), but it determines knock granularity.
+
+### 4.2 SSH keys & Vault layout
+- Mint a **dedicated** break-glass keypair (ed25519), separate from
+  `secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly
+  labelled.
+- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=`
+  restriction — break-glass is from-anywhere; the knock + key are the gate).
+- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for
+  re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519`
+  (chmod 600).
+- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out
+  of git — obscurity value only; see §5).
+
+### 4.3 Proxmox host — sshd hardening
+`/etc/ssh/sshd_config.d/10-breakglass.conf`:
+```
+Port 22
+Port 52222
+PasswordAuthentication no
+KbdInteractiveAuthentication no
+PubkeyAuthentication yes
+PermitRootLogin prohibit-password     # key-only root (PVE recovery norm)
+MaxAuthTries 3
+LoginGraceTime 20
+```
+- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external,
+  knock-gated)**. Using a dedicated external port (not a DNAT rewrite to 22)
+  lets the firewall distinguish LAN vs external **regardless of `.1` SNAT
+  behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate.
+- **Default to root key-only** for recovery practicality. *Alternative for
+  review:* a dedicated `breakglass` sudo user instead of root.
+
+> **Verify (#2):** key login already works for your normal access **before**
+> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs
+> already use keys, so this is likely already effectively true.)
+
+### 4.4 Host firewall (knock gate)
+Default-drop the external SSH port; knockd punches a per-source hole. LAN admin
+(`:22`) and established sessions are untouched:
+```
+# allow established / related
+iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
+# LAN admin + backups: SSH on :22 always allowed
+iptables -A INPUT -p tcp --dport 22 -j ACCEPT
+# external SSH on :52222 closed by default — knockd opens it per-source
+iptables -A INPUT -p tcp --dport 52222 -j DROP
+```
+- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables
+  drops them — the knock ports stay **silent/closed** to scanners.
+- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is
+  enabled. If it is, express these rules through it (or a dedicated chain) so a
+  pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs
+  often have it off at datacenter level.
+
+### 4.5 knockd
+`apt install knockd` (Debian/PVE). `/etc/knockd.conf`:
+```
+[options]
+    UseSyslog
+    Interface = vmbr0          # the 192.168.1.127 interface
+
+[breakglass]
+    sequence      = <k1>:udp,<k2>:udp,<k3>:udp     # real ports from Vault
+    seq_timeout   = 10
+    start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
+    cmd_timeout   = 30
+    stop_command  = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
+```
+- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang
+  on the client (a TCP knock to a dropped port would block until timeout).
+- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session
+  established within that window **persists** via conntrack ESTABLISHED after the
+  rule is removed. Enable + start the `knockd` service.
+
+### 4.6 fail2ban (defense-in-depth)
+`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures).
+Local to the host, **no cluster dependency**. Catches anything that gets past the
+knock to the sshd listener.
+
+### 4.7 Client side (laptop — stock tools only)
+`~/.ssh/config`:
+```
+Host breakglass
+    HostName <public-ip-or-dyndns>
+    Port 52222
+    User root
+    IdentityFile ~/.ssh/breakglass_ed25519
+```
+Knock + connect: a shell function using **stock tools only** (bash's `/dev/udp` redirection, `sleep`, and `ssh`; works on
+macOS `/bin/bash` + Linux; UDP send is instant):
+```bash
+bg() {
+  local host=<public-ip-or-dyndns>
+  for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done
+  sleep 0.5
+  ssh breakglass "$@"
+}
+```
+- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or
+  `ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080`
+  to reach any internal IP. From the host shell you already have everything.
+- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in
+  the `Host breakglass` block so plain `ssh breakglass` knocks automatically.
+
+### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down)
+Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold
+event. Use IPs:
+
+| Host | IP |
+|---|---|
+| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) |
+| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
+| k8s API server | `10.0.20.100` |
+| Synology NAS | `192.168.1.13` |
+| Edge router | `192.168.1.1` |
+| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` |
+
+## 5. Security analysis
+
+- **Brute force: solved.** No password auth anywhere → password guessing is
+  impossible; key brute force is cryptographically infeasible.
+- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is
+  default-dropped and the knock ports are pcap-sniffed (never answered), so a
+  scanner sees a closed/silent host — PVE sshd is **not internet-scannable**,
+  honouring the spirit of "no public-IP access to PVE sshd".
+- **The knock is obscurity, not cryptography.** A port-knock sequence is
+  plaintext and replayable by a passive on-path observer. **The SSH key is the
+  real access control** — the knock only removes the standing/scannable surface.
+  (Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the
+  knock sequence as a secret-ish convenience, not a second cryptographic factor.
+- **Residual risks** (none are brute force):
+  1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep
+     PVE patched; short `cmd_timeout`; fail2ban.
+  2. **Private key theft** → mitigation: key has a passphrase; revoke by removing
+     the line from `authorized_keys`.
+  3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared
+     `192.168.1.1` source — anyone else arriving via `.1` in that window could
+     reach the sshd banner, but still needs your key. Mitigated by the short
+     window + key-only + fail2ban.
+- **Deliberate, documented exception** to the Wave 1 "no public-IP access"
+  policy, scoped to this single knock-gated port. To be recorded in
+  `security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation.
+
+## 6. What's automated vs manual
+
+- **I do**: generate the keypair + knock sequence, store them in Vault, produce
+  the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client
+  `~/.ssh/config` + `bg()` function, and write the runbook + doc updates.
+- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by
+  you in the browser (out-of-Terraform, live device). The Proxmox host changes
+  (sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login
+  verified first** to avoid lockout; pfSense is **not** touched. None of this is
+  a `tg apply` — pfSense and the edge router are not Terraform-managed.
+
+## 7. Testing & verification
+1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog
+   shows the sequence + opens `:52222`; SSH succeeds.
+2. **Without** knocking: `ssh -p 52222` from external → connection refused/timed
+   out (port closed). A plain port scan of `52222` + the knock ports → silent.
+3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected.
+4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to
+   an internal IP.
+5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity
+   note accordingly.
+
+## 8. Failure modes & rotation
+- **Proxmox host down** (not just cluster): this path is gone — that's the
+  out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**.
+- **`.1` router config reset**: forwards lost → re-add from this doc; consider
+  exporting the `.1` config for backup.
+- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it
+  auto-follows; keep the raw IP as fallback.
+- **Key/knock compromise**: remove the `authorized_keys` line (kills access
+  instantly); rotate the knock sequence in `knockd.conf` + Vault.
+
+## 9. Out of scope
+- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier.
+- Phone access (would need an SSH **app**, e.g. Termius — outside the
+  "pre-installed Linux/macOS" constraint; laptop is the target).
+
+## 10. Docs to update on implementation
+- `docs/architecture/vpn.md` — add a "Break-glass SSH" section.
+- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md` —
+  record the deliberate knock-gated exception to "no public PVE sshd".
+- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure.
--- a/docs/plans/2026-05-30-breakglass-ssh-access-plan.md
+++ b/docs/plans/2026-05-30-breakglass-ssh-access-plan.md
@ -0,0 +1,395 @@
+# Break-Glass SSH Access — Implementation Plan
+
+> **⚠️ SUPERSEDED 2026-06-11** by the redesign in
+> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained
+> for history. As-built: `docs/runbooks/breakglass-ssh.md`.
+
+> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes.
+
+**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP.
+
+**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`.
+
+**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation).
+
+**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`.
+
+---
+
+## Pre-flight (read before starting)
+
+- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step.
+- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes.
+- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 4 gates on Phase 4-pre verification).
+- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN.
+- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean.
+
+---
+
+## Phase 0 — Generate secrets (no live changes)
+
+### Task 0.1: Break-glass SSH keypair
+
+**Files:** none in repo (secrets → Vault).
+
+- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)**
+
+```bash
+mkdir -p ~/.ssh
+ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519
+# set a passphrase when prompted (so a stolen laptop key isn't instantly usable)
+```
+
+- [ ] **Step 2: Store the private key + public key in Vault**
+
+```bash
+vault kv patch secret/viktor \
+  breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \
+  breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)"
+```
+
+- [ ] **Step 3: Verify the keys are retrievable**
+
+```bash
+vault kv get -field=breakglass_ssh_pubkey secret/viktor
+```
+Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line.
+
+### Task 0.2: Knock sequence
+
+- [ ] **Step 1: Generate 3 random UDP knock ports**
+
+```bash
+KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK"
+```
+
+- [ ] **Step 2: Store the sequence in Vault (keep it out of git)**
+
+```bash
+vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK"
+vault kv get -field=breakglass_knock_sequence secret/viktor
+```
+Expected: prints three comma-separated ports, e.g. `28411,49027,33180`.
+
+---
+
+## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change)
+
+> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase.
+
+### Task 1.1: Pre-checks (no changes yet)
+
+- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)**
+
+From your laptop, confirm that your *existing* admin key works (the break-glass key is authorized later, in Task 1.2):
+```bash
+ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK'
+```
+Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first.
+
+- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)**
+
+```bash
+ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head'
+```
+Expected: note whether the output reports `Status: enabled/running`. If **enabled**, add the Task 1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below.
+
+### Task 1.2: Authorize the break-glass key
+
+- [ ] **Step 1: Append the break-glass public key to root's authorized_keys**
+
+```bash
+PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
+ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys"
+```
+
+- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)**
+
+```bash
+ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK'
+```
+Expected: `BREAKGLASS_KEY_OK`.
+
+### Task 1.3: sshd dual-port + key-only
+
+**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf`
+
+- [ ] **Step 1: Write the sshd drop-in**
+
+```bash
+ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF'
+Port 22
+Port 52222
+PasswordAuthentication no
+KbdInteractiveAuthentication no
+PubkeyAuthentication yes
+PermitRootLogin prohibit-password
+MaxAuthTries 3
+LoginGraceTime 20
+EOF
+```
+
+- [ ] **Step 2: Validate config syntax (do NOT reload yet)**
+
+```bash
+ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK'
+```
+Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading.
+
+- [ ] **Step 3: Reload sshd (current session stays alive)**
+
+```bash
+ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED'
+```
+Expected: `RELOADED`.
+
+- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it**
+
+```bash
+ssh -i ~/.ssh/breakglass_ed25519 -p 22    root@192.168.1.127 'echo OK22'
+ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222'
+```
+Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop.
+
+### Task 1.4: Base firewall (default-drop :52222, allow :22 + established)
+
+**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service`
+
+- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)**
+
+```bash
+ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF'
+#!/usr/bin/env bash
+set -euo pipefail
+# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT.
+iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
+iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
+# established/related always allowed
+iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
+# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only)
+iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
+# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1
+iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP
+EOF
+ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh'
+```
+
+- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)**
+
+```bash
+ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF'
+[Unit]
+Description=Break-glass base firewall (SSH knock gate)
+After=network-pre.target
+Before=knockd.service
+Wants=network-pre.target
+
+[Service]
+Type=oneshot
+ExecStart=/usr/local/sbin/breakglass-firewall.sh
+RemainAfterExit=yes
+
+[Install]
+WantedBy=multi-user.target
+EOF
+ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED'
+```
+Expected: `FW_APPLIED`.
+
+- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN**
+
+```bash
+ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22'         # works
+nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED"      # closed pre-knock
+```
+Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`.
+
+### Task 1.5: knockd
+
+**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd`
+
+- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)**
+
+```bash
+ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED'
+```
+Expected: `KNOCKD_INSTALLED`.
+
+- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)**
+
+```bash
+KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)"   # e.g. 28411,49027,33180
+IFS=, read -r K1 K2 K3 <<<"$KNOCK"
+ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF
+[options]
+    UseSyslog
+    Interface = vmbr0
+
+[breakglass]
+    sequence      = ${K1}:udp,${K2}:udp,${K3}:udp
+    seq_timeout   = 10
+    start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT
+    cmd_timeout   = 30
+    stop_command  = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT
+EOF
+```
+
+- [ ] **Step 3: Enable + start knockd**
+
+```bash
+ssh root@192.168.1.127 "grep -q '^START_KNOCKD=' /etc/default/knockd 2>/dev/null && sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd || echo 'START_KNOCKD=1' >> /etc/default/knockd"
+ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd'
+```
+Expected: `active`.
+
+### Task 1.6: fail2ban (defense-in-depth)
+
+- [ ] **Step 1: Install + enable fail2ban with the default sshd jail**
+
+```bash
+ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK'
+```
+Expected: `F2B_OK` (sshd jail active).
+
+---
+
+## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes)
+
+> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet.
+
+- [ ] **Step 1: Add the SSH break-glass forward**
+  - Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable.
+
+- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`)
+  - For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable.
+
+- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs**
+
+After Phase 3 connects once, on the host check the observed source:
+```bash
+ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"'
+```
+If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook.
+
+---
+
+## Phase 3 — Client config (laptop, no live infra change)
+
+**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`.
+
+- [ ] **Step 1: Add the SSH host block**
+
+```bash
+cat >> ~/.ssh/config <<'EOF'
+
+Host breakglass
+    HostName viktorbarzin.ddns.net
+    Port 52222
+    User root
+    IdentityFile ~/.ssh/breakglass_ed25519
+EOF
+```
+(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.)
+
+- [ ] **Step 2: Add the knock+connect function**
+
+```bash
+cat >> ~/.zshrc <<'EOF'
+
+bg() {
+  local host="viktorbarzin.ddns.net"
+  local seq; seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")"
+  [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; }
+  for p in $(printf '%s' "$seq" | tr ',' ' '); do printf x | nc -u -w1 "$host" "$p"; sleep 0.4; done
+  sleep 0.5
+  ssh breakglass "$@"
+}
+EOF
+```
+> Note: `bg` sends each knock with `nc -u` rather than bash's `/dev/udp` redirection, because the function lives in `~/.zshrc` and zsh has no `/dev/udp` pseudo-device. The port list is split with `tr` inside a command substitution because zsh does not word-split unquoted parameter expansions such as `${seq//,/ }`. Defining `bg` also shadows the shell's job-control builtin of the same name.
+
+---
+
+## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 5)
+
+> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN.
+
+- [ ] **Step 1: Without knocking, the port is silent**
+
+```bash
+nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK"
+```
+Expected: `SILENT_OK`.
+
+- [ ] **Step 2: Knock + connect succeeds**
+
+```bash
+bg 'hostname; echo BREAKGLASS_E2E_OK'
+```
+Expected: the PVE hostname + `BREAKGLASS_E2E_OK`.
+
+- [ ] **Step 3: Full-LAN reach via the jump (no extra install)**
+
+```bash
+ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh"
+ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh"
+```
+Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing).
+
+- [ ] **Step 4: LAN admin unaffected**
+
+From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`.
+
+**GATE:** Only proceed to Phase 5 once Steps 1–4 pass. If any fail, fix before removing the legacy forward.
+
+---
+
+## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes)
+
+> AX6000 UI. One pass, all three changes.
+
+- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)**
+  - Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**.
+
+- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)**
+  - Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**.
+
+- [ ] **Step 3: Disable UPnP**
+  - Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.)
+
+- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works**
+
+From an external network:
+```bash
+nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK"
+bg 'echo BREAKGLASS_STILL_OK'
+```
+Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`.
+
+---
+
+## Phase 6 — Docs + commit (AFTER infra repo is clean)
+
+- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs).
+- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off).
+- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset.
+- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable):
+
+```bash
+git -C /home/wizard/code/infra add \
+  docs/plans/2026-05-30-breakglass-ssh-access-design.md \
+  docs/plans/2026-05-30-breakglass-ssh-access-plan.md \
+  docs/architecture/vpn.md docs/architecture/security.md \
+  docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md
+git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]"
+git -C /home/wizard/code/infra push origin master
+```
+
+---
+
+## Self-review
+
+- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task.
+- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders).
+- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout.
+- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2).
--- a/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
+++ b/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md
@ -0,0 +1,73 @@
+# Break-glass SSH — Redesign
+
+- **Date**: 2026-06-11
+- **Status**: Implemented
+- **Owner**: Viktor
+- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design)
+- **As-built runbook**: `docs/runbooks/breakglass-ssh.md`
+
+## Why redesign
+
+The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP
+**port-knock** (knockd). It caused a real lockout, for a structural reason:
+
+- The knock sequence was 3 random ports stored **only** in Vault, and the client
+  helper fetched it from Vault at connect time.
+- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the
+  exact scenario break-glass exists for — away from home, cluster/tunnels down —
+  the knock sequence is unreachable and unmemorable. Circular dependency.
+
+The knock's only benefit was hiding an already brute-force-proof port; its cost
+was that fragility. For a *recovery* path, robustness beats stealth.
+
+## Decision
+
+**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.**
+Hardened with: the exposed port trusts only a dedicated break-glass key
+(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit),
+and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router
+up* (the common "I'm away and need in" case — confirmed with Viktor; deeper
+"pfSense wedged" / "host down" tiers are explicitly out of scope).
+
+Alternatives considered and rejected: keeping the knock (fragile, circular);
+Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream
+dependency Headscale is self-hosted to avoid, and the user preferred a
+self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the
+self-contained SSH path was preferred).
+
+## Components
+
+| Layer | Change | Source of truth |
+|---|---|---|
+| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` |
+| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) |
+| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` |
+| knockd | **removed** (package purged, config deleted) | — |
+| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) |
+| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` |
+
+## Edge-router constraints discovered (TP-Link AX6000)
+
+- **No port remapping** — external port must equal internal port (rejects e.g.
+  `22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both
+  sides.
+- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22
+  (Viktor's initial preference); `:52222` is the landed port.
+- **Row delete is immediate** (no confirm dialog).
+
+## Security posture
+
+- **Brute force: infeasible** (key-only, no password authentication).
+- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`).
+- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit,
+  fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the
+  `authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth +
+  snoopy execve to Loki).
+
+## Verification (2026-06-11)
+
+- `:52222` reachable; break-glass key authenticates (`root@pve`).
+- Non-break-glass keys **rejected** on `:52222` (Match isolation works).
+- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact).
+- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`.
+- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines.
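The `Match LocalPort` isolation listed in the Components table can be sketched as a drop-in. This is a minimal illustration assembled from the settings named in this document, not the contents of `scripts/sshd-10-breakglass.conf` (which this design does not reproduce and which may differ). `render_dropin` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print an illustrative sshd drop-in to stdout.
# Assumption: the real scripts/sshd-10-breakglass.conf may differ.
render_dropin() {
  cat <<'EOF'
Port 22
Port 52222
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes

# Connections arriving on the exposed port trust only the break-glass key file.
Match LocalPort 52222
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass

# Close the Match block so any later directives are global again.
Match all
EOF
}

render_dropin
```

Writing the output to a temporary file and running `sshd -t -f <file>` on the host is one way to check the syntax before a reload.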
--- a/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md
+++ b/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md
@ -0,0 +1,76 @@
+# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
+
+**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
+ingresses and every OIDC app) degraded/unavailable for ~50 minutes
+(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
+prompts during outpost-check failures. The shared CNPG primary failed over
+(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
+tenant.
+
+**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
+signin speedup work — env tuning, outpost config, static-asset ingress).
+
+## Root causes (three stacked)
+
+1. **Helm/Keel version split → silent downgrade.** Keel (namespace
+   `keel.sh/enrolled` + diun annotations) had upgraded the live authentik
+   image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
+   appVersion drives the image tag). The values-only apply therefore rolled
+   every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
+   database. Cores never came up healthy (`failed to proxy to backend`, plus
+   Django cross-version serialized-cache warnings), and mid-storm Keel
+   re-upgraded the image, adding a third ReplicaSet to the churn.
+
+2. **Liveness budget too small for authentik's boot.** The chart-default
+   liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer
+   passes the startup probe — but during a rolling restart the Python core
+   still waits on authentik's DB **migration advisory lock** (60–120s+ under
+   contention). kubelet kill-looped every booting pod, and each kill increased
+   lock contention for the rest (thundering herd).
+
+3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
+   server connections `idle in transaction` still **holding the migration
+   advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
+   idle 2+ min). Every subsequent boot serialized behind a dead client.
+   PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
+
+**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
+every Django thread hold its connection persistently; with PgBouncer in
+*session* mode each one pins a server connection 1:1, so the restart churn
+saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
+75 of 108 connections on the new primary). The shared primary's
+restart/failover at 22:40 fits this storm window.
+
+## Resolution
+
+- Scaled workers to 0 (transient) to free pool capacity; rollout converged
+  once, then re-degraded when workers returned.
+- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
+  6×10s/5s) — final state codified in Helm values in the same session.
+- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
+  (twice).
+- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
+  to 3 — converged cleanly (51s boots, zero restarts).
+- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
+  removed, liveness in values, pgbouncer reaper config).
+
+## Prevention (all landed in this change)
+
+| Cause | Fix |
+|---|---|
+| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
+| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
+| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
+| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
+
+## Lessons
+
+- **Check the live image tag against the chart pin before ANY helm-managed
+  apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
+  jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
+  is a version change, not a config change.
+- A "stuck rollout" of authentik is usually the migration advisory lock:
+  check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
+  holders before blaming probes or resources.
+- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
+  Access path); without it every protected app would have hard-failed.
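The first lesson above can be sketched as a pre-apply check. `image_tag` and `check_pin` are hypothetical helper names, the image reference is only an example, and the tag parsing assumes a plain `name:tag` reference with no digest.

```shell
#!/bin/sh
# Hypothetical helpers for a pre-apply check; not part of the repo.

# Print the tag of an image reference such as registry/name:tag.
# Assumption: the reference ends in ":tag" (no @sha256 digest).
image_tag() {
  printf '%s\n' "${1##*:}"
}

# check_pin <live-image-ref> <chart-appVersion>
# Returns 0 when the tags match, 1 when an apply would change the version.
check_pin() {
  live="$(image_tag "$1")"
  if [ "$live" = "$2" ]; then
    echo "OK: live tag $live matches the chart pin"
  else
    echo "MISMATCH: live $live vs chart $2; this apply is a version change"
    return 1
  fi
}

# Example with the versions from this incident (image name is illustrative).
check_pin "example.registry/authentik-server:2026.2.4" "2026.2.2" || true
```

In practice the first argument would come from `kubectl get deploy <x> -o jsonpath='{..image}'` (which can print several images when a pod has more than one container) and the second from the chart's appVersion.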
--- a/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md
+++ b/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md
@ -0,0 +1,116 @@
+# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path
+
+## Impact
+
+- devvm (VM 102, the shared multi-user Claude Code workstation) effectively
+  dead 15:21–16:48 UTC (18:21–19:48 EEST): all ssh/tmux and t3 sessions for
+  wizard/emo/anca lost, every in-flight agent killed.
+- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert
+  exists (follow-up below).
+- Recovery was manual: kill of the wedged QEMU process + `qm start` (the
+  kill left no autopsy — see "What we could not prove").
+
+## Timeline (UTC; host journal runs EEST = UTC+3)
+
+- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0
+  throttle via `qm set` (as it had done every hour for weeks — see Root
+  cause #4).
+- **15:18–15:20** — guest healthy by every metric: CPU 7–16% of 16 vCPUs,
+  load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 2–8%
+  utilized. Heavy claude/bwrap sandbox activity (normal workload).
+- **15:19:08** — last journal line the guest ever writes (mid normal
+  traffic, zero kernel distress — not even a hung-task warning).
+- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows
+  `diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even
+  journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP
+  the whole time** — the process and its main loop are alive; only the
+  block path is dead.
+- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus:
+  processes progressively piling up behind dead storage (dirty-page
+  writeback stuck → direct reclaim spins). Classic starvation cascade, not
+  a panic (a panic halts or spins flat from t=0).
+- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band
+  (root shell; no PVE task, no snoopy line — shell-builtin `kill`).
+- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124
+  (wedged boot ran 6.8.0-117).
+
+## Ruled out (evidence, not vibes)
+
+- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus)
+  and per-minute host RRD.
+- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel
+  I/O errors on the host all day; `sdc` quiet through the window.
+- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash
+  log; 113 of 114 monitored targets stayed up. Only the devvm died.
+- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at
+  0 while netout ACKs trickle; and the guest kernel logged nothing.
+
+## Root cause
+
+**Class pinned, exact line unprovable** (see below): the devvm's disk I/O
+stalled *inside the QEMU process* — below the guest kernel (all guest I/O
+froze simultaneously with nothing logged) and above host storage (host
+clean, neighbors fine, QEMU main loop responsive). Contributing stack,
+unique to this VM:
+
+1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy
+   default for OSes without virtio drivers). The devvm was the **only VM
+   on the host** running its disk through this path; every healthy
+   neighbor uses `virtio-scsi-pci`. The LSI model is documented as
+   hang-prone under intensive I/O.
+2. **No `iothread`** — all disk emulation ran on QEMU's single main event
+   loop, sharing it with timers and QMP.
+3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose
+   queued I/O completes only when its re-arm timer fires.
+4. **Hourly live throttle rewrites** — `apply-mbps-caps.sh`'s idempotency
+   check compared raw config strings, but `qm config` prints keys in its
+   own canonical order, so the check **never matched** and the script
+   re-issued `qm set` (→ live QMP `block_set_io_throttle` against the
+   running QEMU) every hour, 24×/day, for weeks — each poke a chance to
+   race the throttle machinery while queued I/O is in flight. The wedge
+   came 20 min after the 15:01 poke.
+
+## What we could not prove
+
+Whether the stuck queue was the LSI device model, the throttle-group
+timer, or their interaction. The discriminating evidence (QMP
+`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47
+and was destroyed by the recovery kill. If a wedge recurs **autopsy before
+shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`,
+`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the
+kvm process pin it to the line.
+
+## Fixes
+
+| Status | Fix |
+|---|---|
+| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. |
+| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. |
+
+## Open follow-ups (discussed 2026-06-11, not yet built)
+
+- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the
+  90-min detection gap.
+- Freeze forensics: netconsole → pve listener, serial console,
+  `kernel.panic=60`, and a capture-before-kill runbook (above) so any
+  recurrence is pinned, not mourned.
+- The recurring *crawl* class (agent storms → swap-thrash; journald
+  watchdog-killed 3× on 2026-06-10) is a separate failure mode —
+  ssh/tmux sessions remain memory-uncontained by explicit decision
+  (swap-only, 2026-06-10).
+
+## Lessons
+
+- **A VM can die of QEMU-userspace causes that no guest or host kernel log
+  will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the
+  only witness — `diskwrite=0` with a live QMP socket is the signature.
+- **"Idempotent" reconcilers must prove idempotency against the system's
+  canonical output format**, not against the string they themselves
+  constructed. A compare that never matches turns a safety net into a
+  24×/day fault injector — and its own journal said `updating scsi0`
+  every hour, in plain sight, for weeks.
+- The May-26 mbps caps fixed the sdc-saturation freeze class and
+  introduced this one's trigger surface. Layered mitigations fail in
+  layers — audit what a fix *adds*, not only what it removes.
+- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every
+  cross-machine correlation in this incident initially looked 3h off.
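The idempotency lesson above can be sketched as an order-insensitive comparison of comma-separated option strings. `normalize_opts` and `opts_equal` are hypothetical names, not the functions in `apply-mbps-caps.sh`, and the sample strings (including the volume name) are illustrative.

```shell
#!/bin/sh
# Hypothetical helpers; a sketch of comparing option *sets*, not raw strings.

# Sort a comma-separated option string so key order no longer matters.
normalize_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort | paste -sd, -
}

# Succeed when both strings contain the same options in any order.
opts_equal() {
  [ "$(normalize_opts "$1")" = "$(normalize_opts "$2")" ]
}

# Illustrative values: same options, different key order.
want="example-vol,mbps_rd=60,mbps_wr=60"
have="example-vol,mbps_wr=60,mbps_rd=60"

if opts_equal "$want" "$have"; then
  echo "no-op: config already matches"
else
  echo "drift: would run qm set"
fi
```

Comparing normalized sets means a reordered but equivalent config is treated as unchanged, while a real difference in any option still registers as drift.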
--- a/docs/runbooks/breakglass-ssh.md
+++ b/docs/runbooks/breakglass-ssh.md
@ -0,0 +1,158 @@
+# Runbook: Break-glass SSH
+
+Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes
+cluster and its remote-access tunnels (Headscale, cloudflared) are down but the
+**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous
+port-knock design is decommissioned (see "History" below).
+
+## Model (as built)
+
+```
+your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1
+                                              │ WAN tcp/52222 ─▶ 192.168.1.127:52222
+                                              ▼
+                                       Proxmox host 192.168.1.127
+                                          sshd :52222 (key-only, break-glass key ONLY)
+                                          → full LAN via ssh -J / ssh -D
+```
+
+- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate.
+- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the
+  dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate
+  from root's normal LAN-admin keys, so it is independently revocable and a leak
+  of any other root key does not grant internet access.
+- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim
+  scanner noise only; key-only auth is the real protection.
+- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is
+  a deliberate, documented exception to the Wave-1 "no public-IP access" policy
+  (see `docs/architecture/security.md`), chosen for self-containment: it has **no
+  dependency on the cluster** (unlike Headscale/cloudflared) and nothing to
+  remember (unlike the old knock, whose sequence lived only in in-cluster Vault).
+
+## Secrets (Vault `secret/viktor`)
+
+| Key | Use |
+|---|---|
+| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) |
+| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) |
+
+The key has **no passphrase** (so it works in a true cold event without anything
+to recall). Treat the private key as the sole credential — guard the laptop copy.
+
+> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is
+> inert; remove it when you have a Vault token with the `patch` capability
+> (`vault kv patch` / merge-patch — the everyday token lacks it).
+
+## Connect
+
+Client `~/.ssh/config`:
+
+```
+Host breakglass
+    HostName viktorbarzin.ddns.net        # follows the dynamic WAN IP
+    Port 52222
+    User root
+    IdentityFile ~/.ssh/breakglass_ed25519
+    IdentitiesOnly yes
+```
+
+Then:
+
+```bash
+ssh breakglass                              # shell on the Proxmox host
+ssh -J breakglass root@10.0.20.1            # jump to pfSense (or any LAN host)
+ssh -D 1080 breakglass                      # SOCKS5 → reach any internal IP
+```
+
+There is **no `bg()` knock function** anymore — delete it from your shell rc if
+you added it under the old design.
+
+## Cold-event IP cheat sheet (cluster DNS is down)
+
+| Host | IP |
+|---|---|
+| Proxmox host | `192.168.1.127` |
+| pfSense | `10.0.20.1` (WAN `192.168.1.2`) |
+| k8s API | `10.0.20.100` |
+| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) |
+| edge router | `192.168.1.1` |
+
+## Deploy / re-provision the host config
+
+Source of truth lives in `infra/scripts/`. To (re)deploy:
+
+```bash
+# 1. break-glass key authorized for the exposed port
+PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)"
+ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass"
+
+# 2. sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout)
+scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
+ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
+
+# 3. firewall (rate-limit) + boot unit
+scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
+ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service'
+
+# 4. fail2ban jail
+scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
+ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd'
+```
+
+The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`, ordered
+`After=network-pre.target`) is a manual host unit; recreate it if the host is
+rebuilt:
+
+```ini
+[Unit]
+Description=Break-glass base firewall (key-only SSH on :52222)
+After=network-pre.target
+Wants=network-pre.target
+[Service]
+Type=oneshot
+ExecStart=/usr/local/sbin/breakglass-firewall.sh
+RemainAfterExit=yes
+[Install]
+WantedBy=multi-user.target
+```
+
+## Edge-router forward (manual — live device, not Terraform)
+
+TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port
+Forwarding. The break-glass rule:
+
+| Service Name | Device IP | External Port | Internal Port | Protocol |
+|---|---|---|---|---|
+| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP |
+
+**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):**
+- **External port must equal internal port.** The firmware rejects any remap
+  (e.g. `22 → 52222`) with *"External Port: This item conflicts with existed
+  ones."* Hence ext==int 52222.
+- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22.
+- **Row delete is immediate** (no confirm dialog) — clicking the trash icon
+  removes the rule and toasts "Operation succeeded".
+- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized
+  Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports
+  `RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON`
+  identity-guarded delete). Router password: Vault
+  `secret/viktor/edge_router_192_168_1_1_password`.
+
+## Rotate / revoke
+
+- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`.
+- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`,
+  `vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`
+  (needs a token with the `patch` capability, which the everyday token lacks),
+  then redeploy step 1 above.
+- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above.
+
+## History
+
+- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a
+  **UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real
+  security (the SSH key already makes the port brute-force-proof) and its only
+  benefit — hiding the port — came at the cost of a **circular dependency**: the
+  knock sequence lived only in in-cluster Vault, unreachable in the exact
+  cold/away scenario break-glass exists for. That caused a real lockout. The
+  knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22)
+  were removed.
--- a/docs/runbooks/t3-drop-attribution.md
+++ b/docs/runbooks/t3-drop-attribution.md
@ -35,6 +35,41 @@ Attribution table:

 Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.

+## 1b. Connection logs in Loki (passive, always-on — catch a real drop)
+
+Three layers of the real path log every t3 `/ws` connection to Loki, so a drop
+the user actually experienced is attributable after the fact without a repro. A
+drop is **a short-lived `/ws` connection** (a healthy session holds one socket
+for hours); the client's 20s heartbeat watchdog reconnects on any break.
+
+| Layer | Loki stream | What it tells you |
+|---|---|---|
+| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP |
+| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) |
+| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`** — the discriminator below |
+
+`cause` on the dispatch `ws close` line:
+- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down
+  (`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile /
+  network-quality** drop (or CF/tunnel blip); t3-serve was fine.
+- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF
+  / refused) → t3-serve stall/restart/OOM.
+- **`graceful`** — clean close from either side (e.g. the client watchdog's
+  `disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+
+  graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose
+  stall was NOT on devvm → last-mile.
+
+Triage query (Grafana Explore → Loki) — every short t3 socket in a window:
+
+```logql
+{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
+  | regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000
+```
+
+Line the timestamp up against `{job="traefik"}` (duration + edge IP) and
+`{job="cloudflared"}` (CF-side close) for the same second to localise the layer.
+devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`.
+
 ## 2. Server-side log recipe (per-event forensics)

 On devvm (timestamps in UTC):
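
Annotation on §1b: the extraction step of the triage query can be reproduced offline against lines copied out of the journal. The Go sketch below reuses the same regular expression and the same 120000 ms cut-off as the LogQL query. The `wsClose` type, the helper names, and the sample lines are hypothetical and are not part of t3-dispatch.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// wsClose holds the two fields the triage query extracts from a "ws close" line.
type wsClose struct {
	DurMs int64
	Cause string
}

// Same pattern as the regexp stage of the LogQL triage query.
var wsCloseRe = regexp.MustCompile(`dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)`)

// parseWSClose extracts dur_ms and cause; ok is false when the line does not match.
func parseWSClose(line string) (c wsClose, ok bool) {
	m := wsCloseRe.FindStringSubmatch(line)
	if m == nil {
		return wsClose{}, false
	}
	dur, err := strconv.ParseInt(m[wsCloseRe.SubexpIndex("dur")], 10, 64)
	if err != nil {
		return wsClose{}, false
	}
	return wsClose{DurMs: dur, Cause: m[wsCloseRe.SubexpIndex("cause")]}, true
}

// isShort mirrors the `dur < 120000` label filter.
func isShort(c wsClose) bool { return c.DurMs < 120000 }

func main() {
	// Hypothetical sample lines in the dispatch log format.
	lines := []string{
		"ws close user=alice ip=203.0.113.7 dur_ms=21034 cause=downstream_closed",
		"ws close user=alice ip=203.0.113.7 dur_ms=7200000 cause=graceful",
	}
	for _, l := range lines {
		if c, ok := parseWSClose(l); ok && isShort(c) {
			fmt.Printf("short socket: dur_ms=%d cause=%s\n", c.DurMs, c.Cause)
		}
	}
}
```

Because the pattern captures `\S+`, a cause that contains spaces is cut at the first space. `classifyClose` returns the raw error text for unrecognised errors, so such a line would surface only the first word of the message as its `cause`.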
--- a/scripts/apply-mbps-caps.sh
+++ b/scripts/apply-mbps-caps.sh
@ -27,6 +27,12 @@ TARGETS=(
  "220:scsi0:40:40"      # docker-registry
 )

+# Sort a disk spec's comma-separated options so two specs with the same
+# option set but different key order compare equal.
+normalized() {
+  tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, -
+}
+
 apply_one() {
  local spec="$1"
  local vmid slot rd wr
@ -49,8 +55,13 @@ apply_one() {
  newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}"

  # Skip the qm-set call entirely when state already matches — keeps
-  # journal noise low under the hourly timer.
-  if [[ "$current" == "$newvalue" ]]; then
+  # journal noise low under the hourly timer. Compare option SETS, not raw
+  # strings: `qm config` prints keys in its own canonical order, so a raw
+  # compare never matched and every hourly run re-issued `qm set`, which
+  # live-rewrites the running VM's QEMU throttle state via QMP (implicated
+  # in the 2026-06-11 devvm I/O stall — see
+  # docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md).
+  if [[ "$(normalized "$current")" == "$(normalized "$newvalue")" ]]; then
    echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op"
    return 0
  fi
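
Annotation: the fix compares option sets rather than raw strings. A minimal Go sketch of the same idea follows, assuming options are plain comma-separated tokens as in the script; the `sameOptions` helper and the sample disk specs are hypothetical. `sort.Strings` orders bytewise, which corresponds to `LC_ALL=C sort` in the shell version.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalized sorts a disk spec's comma-separated options so that two specs
// with the same option set but a different key order compare equal.
func normalized(spec string) string {
	opts := strings.Split(spec, ",")
	sort.Strings(opts)
	return strings.Join(opts, ",")
}

// sameOptions reports whether two specs carry the same set of options.
func sameOptions(a, b string) bool {
	return normalized(a) == normalized(b)
}

func main() {
	// Hypothetical specs: identical options, different order.
	current := "local-lvm:vm-220-disk-0,mbps_rd=40,mbps_wr=40,size=64G"
	newvalue := "local-lvm:vm-220-disk-0,size=64G,mbps_rd=40,mbps_wr=40"

	fmt.Println("raw compare:", current == newvalue)
	fmt.Println("set compare:", sameOptions(current, newvalue))
}
```

The raw comparison is false and the set comparison is true, which is the case that made every hourly run re-issue `qm set` before the fix.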
--- a/scripts/breakglass-firewall.sh
+++ b/scripts/breakglass-firewall.sh
@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+set -euo pipefail
+# Break-glass base firewall (redesigned 2026-06-11; replaced the port-knock gate).
+#
+# Source of truth. Deploy to the PVE host with:
+#   scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh
+#   ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl restart breakglass-firewall.service'
+# The breakglass-firewall.service oneshot runs this at boot (RemainAfterExit).
+#
+# Model: key-only SSH break-glass on :52222, openly reachable from the WAN, NO
+# port-knock. The SSH key is the gate (brute-force-proof); the rate-limit below
+# only trims scanner noise / slows a hypothetical sshd 0-day.
+#   :22    -> LAN admin (all of root's keys), always allowed.
+#   :52222 -> WAN break-glass. LAN/VLAN sources bypass the limit; external NEW
+#             connections are rate-limited per source IP, then accepted.
+iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS
+iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS
+
+iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -s 192.168.1.0/24 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -s 10.0.0.0/8 -j ACCEPT
+iptables -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \
+  -m hashlimit --hashlimit-name bg_ssh --hashlimit-mode srcip \
+  --hashlimit-above 6/min --hashlimit-burst 3 -j DROP
+iptables -A BREAKGLASS -p tcp --dport 52222 -j ACCEPT
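
Annotation: the hashlimit rule can be reasoned about as a per-source token bucket with a burst of 3 and one new token every 10 seconds (6/min). The Go sketch below is a conceptual model only. The kernel's own accounting differs in detail, and the `limiter` type and its methods are hypothetical names.

```go
package main

import "fmt"

// bucket tracks one source IP: remaining tokens and the second of the last refill.
type bucket struct {
	tokens     int
	lastRefill int64
}

// limiter is a simplified per-source token bucket.
type limiter struct {
	burst       int
	intervalSec int64 // seconds needed to earn one token
	buckets     map[string]*bucket
}

func newLimiter(perMinute, burst int) *limiter {
	return &limiter{
		burst:       burst,
		intervalSec: int64(60 / perMinute),
		buckets:     make(map[string]*bucket),
	}
}

// allow reports whether a NEW connection from src at second `now` stays within
// the limit: true falls through to the final ACCEPT, false would hit the DROP rule.
func (l *limiter) allow(src string, now int64) bool {
	b, ok := l.buckets[src]
	if !ok {
		b = &bucket{tokens: l.burst, lastRefill: now}
		l.buckets[src] = b
	}
	if earned := (now - b.lastRefill) / l.intervalSec; earned > 0 {
		b.tokens += int(earned)
		if b.tokens > l.burst {
			b.tokens = l.burst
		}
		b.lastRefill += earned * l.intervalSec
	}
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	l := newLimiter(6, 3)
	for _, sec := range []int64{0, 1, 2, 3, 13} {
		fmt.Printf("t=%2ds allowed=%v\n", sec, l.allow("203.0.113.7", sec))
	}
}
```

In this model the first three attempts pass, the fourth (t=3s) is dropped, and the attempt at t=13s passes again because one token has been earned. A legitimate operator retrying a handful of times is therefore unaffected, while a scanner hammering the port is throttled per source address.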
--- a/scripts/devvm-promtail.service
+++ b/scripts/devvm-promtail.service
@ -0,0 +1,17 @@
+# systemd unit for promtail on the devvm (10.0.10.10). Install to
+# /etc/systemd/system/promtail.service. See scripts/devvm-promtail.yaml for the full deploy.
+[Unit]
+Description=Promtail (ships devvm journal -> cluster Loki)
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
+Restart=on-failure
+RestartSec=5
+User=root
+Group=root
+
+[Install]
+WantedBy=multi-user.target
--- a/scripts/devvm-promtail.yaml
+++ b/scripts/devvm-promtail.yaml
@ -0,0 +1,59 @@
+# Promtail config for the devvm (10.0.10.10) — ships the systemd journal to cluster Loki.
+#
+# devvm is a standalone VM (NOT a k8s node), so its journal — including the t3
+# stack (t3-dispatch, t3-serve@<user>) — was never in Loki. Added 2026-06-11 for
+# t3 drop forensics: t3-dispatch now logs each /ws connection's open/close with
+# duration + which side hung up (downstream_closed = client/CF/Traefik went away;
+# upstream_closed = t3-serve closed/stalled; graceful = clean close). Joined with
+# Traefik's per-/ws duration (already in Loki) this attributes every drop to a layer.
+#
+# NOT Terraform-managed (devvm is outside k8s) — same hand-deployed pattern as
+# scripts/pve-promtail.* and the rpi-sofia promtail. This file is source-of-truth.
+#
+# Deploy (on devvm, as root via sudo):
+#   sudo install -d -m 0755 /etc/promtail /var/lib/promtail
+#   sudo install -m 0644 scripts/devvm-promtail.yaml    /etc/promtail/config.yml
+#   sudo install -m 0644 scripts/devvm-promtail.service /etc/systemd/system/promtail.service
+#   # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755).
+#   sudo systemctl daemon-reload && sudo systemctl enable --now promtail
+#   # Loki reach: loki.viktorbarzin.lan (Technitium CNAME -> live Traefik LB; insecure cert).
+#
+# Streams produced:
+#   {job="devvm-journal"}                     — full devvm journal
+#   {job="devvm-journal", unit="t3-dispatch.service"}        — dispatch (ws open/close lines)
+#   {job="devvm-journal", unit="t3-serve@wizard.service"}    — per-user t3 serve
+#   {job="sshd-devvm"}                        — sshd auth lines (parity with sshd-pve)
+server:
+  http_listen_port: 9080
+  grpc_listen_port: 0
+  log_level: warn
+
+positions:
+  filename: /var/lib/promtail/positions.yaml
+
+clients:
+  - url: https://loki.viktorbarzin.lan/loki/api/v1/push
+    tls_config:
+      insecure_skip_verify: true
+
+scrape_configs:
+  - job_name: journal
+    journal:
+      max_age: 12h
+      json: false
+      path: /var/log/journal
+      labels:
+        host: devvm
+        job: devvm-journal
+    relabel_configs:
+      - source_labels: ['__journal__systemd_unit']
+        target_label: unit
+      - source_labels: ['__journal_priority_keyword']
+        target_label: level
+      - source_labels: ['__journal_syslog_identifier']
+        target_label: identifier
+      # sshd auth lines -> job=sshd-devvm (parity with the pve shipper's sshd-pve).
+      - source_labels: ['__journal_syslog_identifier']
+        regex: 'sshd.*'
+        target_label: job
+        replacement: 'sshd-devvm'
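
Annotation: the `relabel_configs` rules apply in order, and the last rule overrides `job` for sshd lines. The Go sketch below mimics that outcome for a single journal entry. It is an illustration rather than Promtail's implementation, and `labelsFor` is a hypothetical helper. The regular expression is anchored to the whole value because Prometheus-style relabel regexes are fully anchored.

```go
package main

import (
	"fmt"
	"regexp"
)

// Relabel regexes match the whole value, hence the explicit ^(?:...)$.
var sshdIdent = regexp.MustCompile(`^(?:sshd.*)$`)

// labelsFor applies the rules from relabel_configs, in order, to one journal
// entry. Input keys are the __journal_* names used in the config.
func labelsFor(entry map[string]string) map[string]string {
	// Static labels from the journal scrape config.
	labels := map[string]string{"host": "devvm", "job": "devvm-journal"}

	// An empty source value yields an empty label, which is dropped.
	copyIfSet := func(src, dst string) {
		if v := entry[src]; v != "" {
			labels[dst] = v
		}
	}
	copyIfSet("__journal__systemd_unit", "unit")
	copyIfSet("__journal_priority_keyword", "level")
	copyIfSet("__journal_syslog_identifier", "identifier")

	// Final rule: sshd auth lines move to their own job.
	if sshdIdent.MatchString(entry["__journal_syslog_identifier"]) {
		labels["job"] = "sshd-devvm"
	}
	return labels
}

func main() {
	fmt.Println(labelsFor(map[string]string{
		"__journal__systemd_unit":     "t3-dispatch.service",
		"__journal_syslog_identifier": "t3-dispatch",
	}))
	fmt.Println(labelsFor(map[string]string{
		"__journal__systemd_unit":     "ssh.service",
		"__journal_syslog_identifier": "sshd-session",
	}))
}
```

Since `sshd.*` also matches `sshd-session`, both sshd process names end up under `{job="sshd-devvm"}` and no longer appear under `{job="devvm-journal"}`.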
--- a/scripts/fail2ban-breakglass-sshd.local
+++ b/scripts/fail2ban-breakglass-sshd.local
@ -0,0 +1,18 @@
+# Break-glass SSH fail2ban jail (redesigned 2026-06-11). Source of truth.
+# Deploy to the PVE host with:
+#   scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local
+#   ssh root@192.168.1.127 'systemctl restart fail2ban'
+#
+# GOTCHA (Debian 13 / OpenSSH 9.8+): auth lines are logged under
+# _COMM=sshd-session, NOT _COMM=sshd. The stock Debian jail keys journalmatch on
+# `_SYSTEMD_UNIT=ssh.service + _COMM=sshd` and therefore silently NEVER bans.
+# Match by unit only so both sshd and sshd-session lines are seen. Ban on both
+# SSH ports (the WAN break-glass listener is :52222).
+[sshd]
+enabled = true
+backend = systemd
+journalmatch = _SYSTEMD_UNIT=ssh.service
+port = ssh,52222
+maxretry = 4
+findtime = 10m
+bantime = 1h
--- a/scripts/sshd-10-breakglass.conf
+++ b/scripts/sshd-10-breakglass.conf
@ -0,0 +1,31 @@
+# Break-glass SSH drop-in (redesigned 2026-06-11). Source of truth.
+# Deploy to the PVE host with:
+#   scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf
+#   ssh root@192.168.1.127 'sshd -t && systemctl reload ssh'
+#
+#   :22    = LAN admin, all of root's keys (default AuthorizedKeysFile).
+#   :52222 = WAN-exposed break-glass. The edge router forwards WAN tcp/52222 ->
+#            192.168.1.127:52222 (external port MUST equal internal port on the
+#            TP-Link AX6000 — it rejects remaps; port 22 itself is reserved).
+#            The Match LocalPort block trusts ONLY the dedicated break-glass key
+#            (authorized_keys.breakglass), so a leak of any other root key does
+#            NOT grant internet access. Rate-limited by the BREAKGLASS iptables
+#            chain + fail2ban. No port-knock.
+#
+# NOTE: the trailing `Match all` is REQUIRED. /etc/ssh/sshd_config has
+# `Include sshd_config.d/*.conf` near the top but a global `PermitRootLogin`
+# further down; without `Match all` resetting context, that later global
+# directive would be swallowed into the `Match LocalPort 52222` condition.
+Port 22
+Port 52222
+PasswordAuthentication no
+KbdInteractiveAuthentication no
+PubkeyAuthentication yes
+PermitRootLogin prohibit-password
+MaxAuthTries 3
+LoginGraceTime 20
+
+Match LocalPort 52222
+    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
+    PermitRootLogin prohibit-password
+Match all
--- a/scripts/t3-dispatch/go.mod
+++ b/scripts/t3-dispatch/go.mod
@ -2,4 +2,4 @@ module t3-dispatch

 go 1.22

-require github.com/gorilla/websocket v1.5.3 // indirect
+require github.com/gorilla/websocket v1.5.3
--- a/scripts/t3-dispatch/main.go
+++ b/scripts/t3-dispatch/main.go
@ -212,7 +212,64 @@ func handler(w http.ResponseWriter, r *http.Request) {
 	}
 	// Steady state: reverse-proxy (incl. WebSocket upgrade) to the user's instance.
 	target, _ := url.Parse(fmt.Sprintf("http://127.0.0.1:%d", e.Port))
-	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
+	proxy := httputil.NewSingleHostReverseProxy(target)
+
+	// WebSocket connection logging: t3 drops manifest as the client's 20s
+	// heartbeat watchdog reconnecting, so a flood of short-lived /ws connections
+	// IS the symptom. Log each WS open + close (duration + which side hung up) so
+	// a drop is attributable from logs alone — graceful closes otherwise leave no
+	// trace (the default ReverseProxy only logs on error). cause stays "graceful"
+	// unless ErrorHandler fires; ErrorHandler runs within ServeHTTP, so reading
+	// cause after ServeHTTP returns needs no synchronisation.
+	if isWebSocket(r) {
+		start := time.Now()
+		ip := clientIP(r)
+		cause := "graceful"
+		proxy.ErrorHandler = func(rw http.ResponseWriter, _ *http.Request, err error) {
+			cause = classifyClose(err)
+		}
+		log.Printf("ws open user=%s ip=%s", e.OsUser, ip)
+		proxy.ServeHTTP(w, r)
+		log.Printf("ws close user=%s ip=%s dur_ms=%d cause=%s",
+			e.OsUser, ip, time.Since(start).Milliseconds(), cause)
+		return
+	}
+	proxy.ServeHTTP(w, r)
+}
+
+// isWebSocket reports whether r is a WebSocket upgrade request.
+func isWebSocket(r *http.Request) bool {
+	return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") &&
+		strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade")
+}
+
+// clientIP returns the forwarded client chain (X-Forwarded-For, set by
+// Traefik/CF) when present, else the immediate peer — for correlating a drop
+// to a specific client/edge.
+func clientIP(r *http.Request) string {
+	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
+		return xff
+	}
+	return r.RemoteAddr
+}
+
+// classifyClose maps a reverse-proxy copy error to which side ended the socket:
+// downstream (client/CF/Traefik went away) vs upstream (the user's t3 serve
+// closed/reset). Distinguishes a last-mile/client drop from a t3-serve stall.
+// An unrecognised error is returned as its raw message.
+func classifyClose(err error) string {
+	if err == nil {
+		return "graceful"
+	}
+	s := err.Error()
+	switch {
+	case strings.Contains(s, "context canceled"):
+		return "downstream_closed" // client / CF / Traefik tore down
+	case strings.Contains(s, "reset by peer"), strings.Contains(s, "broken pipe"),
+		strings.Contains(s, "EOF"), strings.Contains(s, "connection refused"):
+		return "upstream_closed" // t3 serve closed / unreachable
+	default:
+		return s
+	}
 }

 func main() {
--- a/scripts/t3-dispatch/main_test.go
+++ b/scripts/t3-dispatch/main_test.go
@ -301,3 +301,63 @@ func TestProbeWSEcho(t *testing.T) {
 		}
 	}
 }
+
+func TestIsWebSocket(t *testing.T) {
+	cases := []struct {
+		up, conn string
+		want     bool
+	}{
+		{"websocket", "Upgrade", true},
+		{"websocket", "keep-alive, Upgrade", true},
+		{"WebSocket", "upgrade", true},
+		{"", "keep-alive", false},
+		{"h2c", "Upgrade", false},
+		{"websocket", "keep-alive", false},
+	}
+	for _, c := range cases {
+		r, _ := http.NewRequest("GET", "/ws", nil)
+		if c.up != "" {
+			r.Header.Set("Upgrade", c.up)
+		}
+		r.Header.Set("Connection", c.conn)
+		if got := isWebSocket(r); got != c.want {
+			t.Errorf("isWebSocket(up=%q conn=%q)=%v want %v", c.up, c.conn, got, c.want)
+		}
+	}
+}
+
+func TestClassifyClose(t *testing.T) {
+	cases := []struct {
+		in   error
+		want string
+	}{
+		{nil, "graceful"},
+		{errTest("context canceled"), "downstream_closed"},
+		{errTest("read tcp 127.0.0.1:60664->127.0.0.1:3773: read: connection reset by peer"), "upstream_closed"},
+		{errTest("write: broken pipe"), "upstream_closed"},
+		{errTest("unexpected EOF"), "upstream_closed"},
+		{errTest("dial tcp 127.0.0.1:3773: connect: connection refused"), "upstream_closed"},
+		{errTest("some novel error"), "some novel error"},
+	}
+	for _, c := range cases {
+		if got := classifyClose(c.in); got != c.want {
+			t.Errorf("classifyClose(%v)=%q want %q", c.in, got, c.want)
+		}
+	}
+}
+
+type errTest string
+
+func (e errTest) Error() string { return string(e) }
+
+func TestClientIP(t *testing.T) {
+	r, _ := http.NewRequest("GET", "/ws", nil)
+	r.RemoteAddr = "10.0.0.5:1234"
+	if got := clientIP(r); got != "10.0.0.5:1234" {
+		t.Errorf("clientIP no-xff = %q", got)
+	}
+	r.Header.Set("X-Forwarded-For", "1.2.3.4, 10.10.1.1")
+	if got := clientIP(r); got != "1.2.3.4, 10.10.1.1" {
+		t.Errorf("clientIP xff = %q", got)
+	}
+}
--- a/scripts/workstation/managed-settings.json
+++ b/scripts/workstation/managed-settings.json
@ -1,4 +1,4 @@
 {
-  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. Your kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; you can verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Feature-sized work is done in an isolated git worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>`) and merged into master when finished, so several agents can work the same project at once — full lifecycle in ~/.claude/rules/execution.md §3; trivial single-commit fixes may go straight to master. When you finish a change in a repo under ~/code (or ~/code itself when it IS the clone): commit it ON master and push to the forgejo remote. THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request) — this matters more than the change itself. Never use [ci skip] as a non-admin (it would hide the change from the audit feed; harmless no-op applies are fine). If the push is rejected non-fast-forward, git pull --rebase forgejo master and push again. 
If it is rejected by branch protection (user not whitelisted), fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials). Keep every clone on a clean master when done so background auto-refresh keeps working. Tell the user in plain words what happened ('done — your change is live/recorded'). Full recipe: AGENTS.md → 'Non-admin workstation users' in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning, quality) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code, in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it (e.g. ~/code/tripit). [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.",
+  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n  - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n  - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n  - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n  - Keep every clone on a clean master when done; tell the user in plain words what happened.\n  - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
  "model": "claude-fable-5"
 }
--- a/stacks/authentik/authentik_provider.tf
+++ b/stacks/authentik/authentik_provider.tf
@ -91,14 +91,21 @@ resource "authentik_outpost" "embedded" {
  protocol_providers = [authentik_provider_proxy.catchall.id]
  service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
  config = jsonencode({
-    log_level                        = "trace"
-    docker_labels                    = null
-    authentik_host                   = "https://authentik.viktorbarzin.me/"
-    docker_network                   = null
-    container_image                  = null
-    docker_map_ports                 = true
-    refresh_interval                 = "minutes=5"
-    kubernetes_replicas              = 1
+    # info, not trace: the outpost sits on the hot path of every request to
+    # every auth="required" ingress — trace logging is per-request overhead
+    # with no operational value (request access lines are emitted at info).
+    log_level        = "info"
+    docker_labels    = null
+    authentik_host   = "https://authentik.viktorbarzin.me/"
+    docker_network   = null
+    container_image  = null
+    docker_map_ports = true
+    refresh_interval = "minutes=5"
+    # 2 replicas: removes the single-pod hot path for all forward-auth
+    # subrequests. Safe since sessions moved to the shared Postgres backend
+    # (authentik_providers_proxy_proxysession, 2026-05-10) — no pod-local
+    # session state anymore.
+    kubernetes_replicas              = 2
    kubernetes_namespace             = "authentik"
    authentik_host_browser           = ""
    object_naming_template           = "ak-outpost-%(name)s"
@ -198,3 +205,39 @@ resource "authentik_stage_user_login" "default_login" {
    ]
  }
 }
+
+# -----------------------------------------------------------------------------
+# Default Identification stage — adopted 2026-06-10 to embed the password
+# field on the identification screen (single-screen login: one round trip and
+# one screen instead of two). Per authentik docs, when an Identification stage
+# carries a password stage the Password stage must NOT be bound separately —
+# the redundant order-20 binding on default-authentication-flow (pk
+# 0fc677db-a23f-4ee7-8648-da342e14573b) was deleted via the API in the same
+# change. Social-login users are unaffected: source buttons stay on the same
+# screen and bypass the password field.
+# -----------------------------------------------------------------------------
+
+data "authentik_stage" "default_authentication_password" {
+  name = "default-authentication-password"
+}
+
+resource "authentik_stage_identification" "default_identification" {
+  name           = "default-authentication-identification"
+  password_stage = data.authentik_stage.default_authentication_password.id
+  lifecycle {
+    # Pin only password_stage; everything else stays UI-managed (same pattern
+    # as authentik_stage_user_login.default_login above).
+    ignore_changes = [
+      user_fields,
+      case_insensitive_matching,
+      show_matched_user,
+      show_source_labels,
+      sources,
+      enrollment_flow,
+      recovery_flow,
+      passwordless_flow,
+      pretend_user_exists,
+      captcha_stage,
+    ]
+  }
+}
--- a/stacks/authentik/modules/authentik/main.tf
+++ b/stacks/authentik/modules/authentik/main.tf
@ -29,7 +29,7 @@ resource "kubernetes_namespace" "authentik" {
    labels = {
      tier                               = var.tier
      "resource-governance/custom-quota" = "true"
-      "keel.sh/enrolled" = "true"
+      "keel.sh/enrolled"                 = "true"
    }
  }
  lifecycle {
@ -111,3 +111,44 @@ module "ingress-outpost" {
  anti_ai_scraping = false
  exclude_crowdsec = true
 }
+
+# Immutable caching for the flow-executor static assets. Authentik serves
+# /static/dist/* with version-fingerprinted filenames (e.g. poly-2026.2.4.js)
+# but no max-age, so browsers re-validate the login JS bundle on every signin
+# — and split-horizon internal users (direct to Traefik, no Cloudflare) get no
+# edge cache at all. Long-lived immutable caching is safe: every authentik
+# upgrade changes the asset URLs.
+resource "kubernetes_manifest" "static_cache_headers" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+    metadata = {
+      name      = "static-cache-headers"
+      namespace = kubernetes_namespace.authentik.metadata[0].name
+    }
+    spec = {
+      headers = {
+        customResponseHeaders = {
+          "Cache-Control" = "public, max-age=31536000, immutable"
+        }
+      }
+    }
+  }
+}
+
+module "ingress-static" {
+  source = "../../../../modules/kubernetes/ingress_factory"
+  # Same-host path carve-out of the public authentik UI ingress above, only
+  # adding the cache-headers middleware for the static asset prefix.
+  # auth = "none": versioned static assets of the (already public) Authentik login UI.
+  auth              = "none"
+  namespace         = kubernetes_namespace.authentik.metadata[0].name
+  name              = "authentik-static"
+  host              = "authentik"
+  service_name      = "goauthentik-server"
+  ingress_path      = ["/static"]
+  tls_secret_name   = var.tls_secret_name
+  anti_ai_scraping  = false
+  homepage_enabled  = false
+  extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
+}
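
Annotation: the effect of the `static-cache-headers` middleware on the `/static` carve-out can be pictured as an HTTP handler wrapper. The Go sketch below only illustrates the resulting header and is not how Traefik implements `customResponseHeaders`; `withStaticCache`, `cacheControlFor`, and the stand-in upstream are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Header value taken from the Middleware manifest.
const immutableCache = "public, max-age=31536000, immutable"

// withStaticCache adds the long-lived Cache-Control header to responses for
// paths under /static and leaves every other path untouched.
func withStaticCache(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p := r.URL.Path
		if p == "/static" || strings.HasPrefix(p, "/static/") {
			w.Header().Set("Cache-Control", immutableCache)
		}
		next.ServeHTTP(w, r)
	})
}

// cacheControlFor runs one GET through the wrapped handler and returns the
// Cache-Control header a client would see.
func cacheControlFor(path string) string {
	// Stand-in upstream that, like the case described above, sets no max-age.
	upstream := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	withStaticCache(upstream).ServeHTTP(rec, req)
	return rec.Result().Header.Get("Cache-Control")
}

func main() {
	for _, p := range []string{
		"/static/dist/poly-2026.2.4.js",
		"/if/flow/default-authentication-flow/",
	} {
		fmt.Printf("%-40s Cache-Control=%q\n", p, cacheControlFor(p))
	}
}
```

Only the fingerprinted asset path receives the header. The flow pages themselves stay uncached, which matters because their content is not versioned by URL.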
--- a/stacks/authentik/modules/authentik/pgbouncer.ini
+++ b/stacks/authentik/modules/authentik/pgbouncer.ini
@ -12,3 +12,7 @@ default_pool_size = 20
 reserve_pool_size = 5
 reserve_pool_timeout = 5
 ignore_startup_parameters = extra_float_digits
+; Disconnect clients left "idle in transaction" for more than 300 seconds, which
+; also frees their server connection (e.g. an authentik pod killed mid-migration
+; leaves a ghost transaction holding the migration advisory lock, serializing
+; every subsequent pod boot: the 2026-06-10 incident).
+idle_transaction_timeout = 300
--- a/stacks/authentik/modules/authentik/pgbouncer.tf
+++ b/stacks/authentik/modules/authentik/pgbouncer.tf
@ -48,6 +48,11 @@ resource "kubernetes_deployment" "pgbouncer" {
        labels = {
          app = "pgbouncer"
        }
+        annotations = {
+          # pgbouncer reads its ini only at startup (subPath mount never
+          # propagates updates anyway) — roll the pods on config change.
+          "checksum/pgbouncer-config" = sha1(kubernetes_config_map.pgbouncer_config.data["pgbouncer.ini"])
+        }
      }

      spec {
@ -157,7 +162,8 @@ resource "kubernetes_deployment" "pgbouncer" {
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      spec[0].template[0].spec[0].container[0].image,             # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      spec[0].template[0].spec[0].container[0].image_pull_policy, # Keel flip-flops this between Always/IfNotPresent
      metadata[0].annotations["kubernetes.io/change-cause"],
      metadata[0].annotations["deployment.kubernetes.io/revision"],
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
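
Annotation: the `checksum/pgbouncer-config` annotation works because any change to the ini text changes its SHA-1 hex digest, which changes the pod template and so triggers a rollout. A small Go sketch of that property follows; `configChecksum` is a hypothetical helper that produces a hex SHA-1 digest, the same kind of value Terraform's `sha1()` returns, and the sample ini strings are shortened stand-ins.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// configChecksum returns the hex-encoded SHA-1 digest of the config text.
func configChecksum(ini string) string {
	sum := sha1.Sum([]byte(ini))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Shortened stand-ins for the pgbouncer.ini content before and after the change.
	before := "reserve_pool_timeout = 5\n"
	after := before + "idle_transaction_timeout = 300\n"

	fmt.Println("before:", configChecksum(before))
	fmt.Println("after: ", configChecksum(after))
	fmt.Println("rollout needed:", configChecksum(before) != configChecksum(after))
}
```

An unchanged config yields the same digest on every apply, so the annotation causes no restarts unless the ini content actually differs.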
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@ -1,4 +1,10 @@
 authentik:
+  # NOTE: because we set existingSecret below, the chart does NOT render the
+  # authentik.* values into an AUTHENTIK_* env Secret — the live env comes
+  # from the orphaned, helm-keep-policy `goauthentik` Secret created by chart
+  # 2025.10.3. Anything under authentik.* here is effectively INERT. All new
+  # or tuned config MUST go through server.env / worker.env instead (see
+  # .claude/reference/authentik-state.md).
  log_level: warning
  # log_level: trace
  secret_key: ""
@ -14,38 +20,47 @@ authentik:
    port: 6432
    user: authentik
    password: ""
-    # Persistent client-side connections (safe with PgBouncer session mode;
-    # must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
-    # setup overhead off the ~70 sequential ORM ops per flow stage.
-    conn_max_age: 60
-    conn_health_checks: true
-  cache:
-    # Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
-    # moved cache storage from Redis to Postgres, so a TTL hit is still a
-    # SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
-    timeout_flows: 1800
-    timeout_policies: 900
-  web:
-    # Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
-    # Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
-    workers: 3
-    threads: 4
-  worker:
-    # Celery-equivalent worker threads per pod (default 2, renamed from
-    # AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
-    threads: 4

 server:
  replicas: 3
-  # Anonymous Django sessions (no completed login: bots, healthcheckers,
-  # partial flows) expire in 2h. Default is days=1. Once login completes,
-  # UserLoginStage.session_duration takes over via request.session.set_expiry.
-  # Injected via server.env (not authentik.sessions.*) because we use
-  # authentik.existingSecret.secretName, which makes the chart skip
-  # rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
  env:
+    # Anonymous Django sessions (no completed login: bots, healthcheckers,
+    # partial flows) expire in 2h. Default is days=1. Once login completes,
+    # UserLoginStage.session_duration takes over via request.session.set_expiry.
+    # Injected via server.env (not authentik.sessions.*) because we use
+    # authentik.existingSecret.secretName, which makes the chart skip
+    # rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
    - name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
      value: "hours=2"
+    # Gunicorn: 3 workers × 4 threads per server pod (defaults 2×4).
+    # Pairs with the server memory limit of 2Gi (each worker preloads
+    # Django ~500Mi).
+    - name: AUTHENTIK_WEB__WORKERS
+      value: "3"
+    - name: AUTHENTIK_WEB__THREADS
+      value: "4"
+    # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
+    # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
+    # SELECT — but a single indexed lookup beats re-planning the flow
+    # (~70 sequential ORM ops per flow stage POST).
+    - name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
+      value: "1800"
+    - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
+      value: "900"
+    # Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE here. With PgBouncer in
+    # session mode every persistent Django connection pins a server connection
+    # 1:1, so the 3x(20+5) pool saturated during the 2026-06-10 rolling
+    # restart (58s pool waits, readiness flapping, and the shared CNPG primary
+    # failed over mid-storm). The ~1-2ms/request connection-setup saving is
+    # not worth that risk on the shared PG substrate.
+  # Liveness budget sized for slow boots (2026-06-10 incident): during a
+  # rolling restart pods queue on authentik's DB migration lock; the Go layer
+  # answers /-/health/live before the core is up, so with the default 3x10s
+  # budget kubelet kill-looped every booting pod and amplified the contention.
+  # Startup probe still bounds total boot time (60x10s).
+  livenessProbe:
+    failureThreshold: 6
+    timeoutSeconds: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
@ -76,17 +91,36 @@ server:
    minAvailable: 2
 global:
  addPrometheusAnnotations: true
+  image:
+    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
+    # namespace) bumps the IMAGE between chart releases, while helm defaults
+    # the tag to the chart appVersion — so any helm upgrade silently
+    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
+    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
+    # DB → boot storm, see
+    # docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md).
+    # Keep this tag in sync with what Keel has deployed when
+    # touching this chart; clear it only when bumping the chart version itself.
+    tag: "2026.2.4"

 worker:
  # 2 replicas: workers handle background tasks (LDAP sync, email,
  # certificate renewal) — no user-facing traffic, so 2-of-3 isn't
  # needed for availability. Drop saves ~100m sustained CPU.
  replicas: 2
-  # Same unauthenticated_age cap as server — both the server (Django session
-  # middleware) and worker (cleanup tasks) need to see the value.
  env:
+    # Same unauthenticated_age cap as server — both the server (Django session
+    # middleware) and worker (cleanup tasks) need to see the value.
    - name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
      value: "hours=2"
+    # Dramatiq worker threads per pod (default 2).
+    - name: AUTHENTIK_WORKER__THREADS
+      value: "4"
+    # Keep cache settings in lockstep with server.env. (No CONN_MAX_AGE —
+    # see the server.env note: session-mode PgBouncer pins persistent conns.)
+    - name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
+      value: "1800"
+    - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
+      value: "900"
  strategy:
    type: RollingUpdate
    rollingUpdate:
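The liveness sizing described in the values.yaml comments above comes down to simple arithmetic. A minimal sketch in Python, assuming the probe period stays at 10s (taken from the "3x10s" default named in the comment; `periodSeconds` is not set in this diff) and that the tolerated failure window is roughly `failureThreshold` times the period. The function name is hypothetical.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time kubelet tolerates consecutive probe failures."""
    return failure_threshold * period_seconds


# Assumed period of 10s, taken from the "3x10s" default named in the comment.
PERIOD_SECONDS = 10

default_liveness = probe_budget_seconds(3, PERIOD_SECONDS)   # chart default
tuned_liveness = probe_budget_seconds(6, PERIOD_SECONDS)     # failureThreshold: 6
startup_bound = probe_budget_seconds(60, PERIOD_SECONDS)     # "60x10s" startup probe

print(default_liveness, tuned_liveness, startup_bound)
```

Under these assumptions the tuned liveness budget doubles from 30s to 60s, while the 600s startup bound is unchanged.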
--- a/stacks/kyverno/modules/kyverno/resource-governance.tf
+++ b/stacks/kyverno/modules/kyverno/resource-governance.tf
@ -1170,3 +1170,9 @@ resource "kubectl_manifest" "mutate_strip_cpu_limits" {
    }
  })
 }
+
+# Apply re-trigger 2026-06-11: 87702bdc landed with [ci skip], so this stack was
+# never CI-applied; tripit#26 (tour-guide redo) needs the tts GPU-priority
+# exclusion live before the tts stack applies. No functional change in this commit.
+
+# (See stacks/tts/main.tf — same apply-trigger note, tripit#26.)
--- a/stacks/traefik/modules/traefik/error-pages.tf
+++ b/stacks/traefik/modules/traefik/error-pages.tf
@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
  }

  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+    ]
  }
 }

--- a/stacks/traefik/modules/traefik/main.tf
+++ b/stacks/traefik/modules/traefik/main.tf
@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
  }
 }

@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {

  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
  }
 }

@ -720,6 +738,11 @@ resource "kubernetes_config_map" "auth_proxy_config" {
    "default.conf" = <<-EOT
      upstream authentik {
          server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
+          # Reuse connections to the outpost. Without this every forward-auth
+          # subrequest (= every request to every auth="required" ingress) opens
+          # a fresh TCP connection. Requires HTTP/1.1 + cleared Connection
+          # header on the proxy_pass locations below.
+          keepalive 32;
      }
      server {
          listen 9000;
@ -734,6 +757,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {

          location /outpost.goauthentik.io/auth/traefik {
              proxy_pass http://authentik;
+              proxy_http_version 1.1;
+              proxy_set_header Connection "";
              proxy_connect_timeout 3s;
              proxy_read_timeout 5s;
              proxy_send_timeout 5s;
@ -764,6 +789,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {

          location /outpost.goauthentik.io/ {
              proxy_pass http://authentik;
+              proxy_http_version 1.1;
+              proxy_set_header Connection "";
              proxy_connect_timeout 3s;
              proxy_read_timeout 10s;
              proxy_set_header Host $host;
@ -820,6 +847,11 @@ resource "kubernetes_deployment" "auth_proxy" {
        labels = {
          app = "auth-proxy"
        }
+        annotations = {
+          # nginx only reads its config at startup — roll the pods whenever
+          # the ConfigMap content changes.
+          "checksum/auth-proxy-config" = sha1(kubernetes_config_map.auth_proxy_config.data["default.conf"])
+        }
      }
      spec {
        topology_spread_constraint {
@ -908,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
  }
 }
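The `checksum/auth-proxy-config` annotation added to the auth_proxy pod template works because any change to the ConfigMap content changes the hash, and a changed pod-template annotation triggers a rollout. A rough Python sketch of that idea, assuming the hash is a hex SHA-1 over the UTF-8 config text (which is what Terraform's `sha1()` computes). The config strings below are hypothetical, heavily abbreviated stand-ins for `default.conf`.

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """Hex SHA-1 of the config content, like Terraform's sha1() on a string."""
    return hashlib.sha1(config_text.encode("utf-8")).hexdigest()


# Hypothetical before/after contents of default.conf (heavily abbreviated).
before = "upstream authentik {\n    server outpost:9000;\n}\n"
after = "upstream authentik {\n    server outpost:9000;\n    keepalive 32;\n}\n"

annotations_before = {"checksum/auth-proxy-config": config_checksum(before)}
annotations_after = {"checksum/auth-proxy-config": config_checksum(after)}

# A changed annotation changes the pod template, which triggers a rollout.
needs_rollout = annotations_before != annotations_after
print(needs_rollout)
```

An unchanged config yields the same checksum, so applies that do not touch `default.conf` leave the pods alone.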

--- a/stacks/tripit/main.tf
+++ b/stacks/tripit/main.tf
@ -96,6 +96,14 @@ locals {
    CALENDAR_CONFLICT_PROVIDER = "nextcloud"
    NEXTCLOUD_CALDAV_BASE      = "https://nextcloud.viktorbarzin.me/remote.php/dav"
    NEXTCLOUD_CALDAV_USER      = "admin"
+    # Tour-guide content pipeline (tripit#24/#25): these three default to `fake`
+    # in tripit's config, which is what shipped dark on 2026-06-08 — prod only
+    # ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch
+    # discovery, the five web story sources, and the claude-agent-service script
+    # writer (CLAUDE_AGENT_TOKEN already in tripit-secrets).
+    SIGHT_DISCOVERY_PROVIDER = "wikipedia"
+    STORY_SOURCE_MODE        = "web"
+    SCRIPT_WRITER_MODE       = "chat"
  }
 }

--- a/stacks/tts/main.tf
+++ b/stacks/tts/main.tf
@ -73,8 +73,14 @@ locals {
      repo_id = "chatterbox-multilingual"
    }
    tts_engine = {
-      device                 = "cuda"
-      predefined_voices_path = "/data/voices"
+      device = "cuda"
+      # Predefined voices come from the IMAGE's bundled set (28 reference WAVs
+      # under the devnen server's /app/voices) rather than the NFS PVC: nobody
+      # can seed /data/voices without NFS-host shell access, and an empty
+      # predefined dir means /v1/audio/voices serves nothing (it gates the
+      # readiness probe). tripit's Voice catalog (tripit#30) names a subset of
+      # these stems. /data keeps reference_audio (future cloning) + HF cache.
+      predefined_voices_path = "/app/voices"
      reference_audio_path   = "/data/reference_audio"
    }
  })
@ -472,3 +478,7 @@ resource "kubernetes_cron_job_v1" "offpeak" {
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
 }
+
+# Apply trigger 2026-06-11 (tripit#26): the previous push was a merge commit, so
+# the changed-stack detector (git diff HEAD~1 HEAD = first-parent diff) missed
+# stacks/tts entirely. This is a non-merge commit, so the diff names this stack.
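The apply-trigger notes rely on a changed-stack detector that maps the paths in a diff to stack names. The real detector is not shown in this commit, so the following is only a sketch of the mapping step in Python: the function name, the `stacks/<name>/...` path convention, and the sample path list are assumptions for illustration.

```python
def changed_stacks(changed_paths):
    """Return sorted stack names for paths shaped like stacks/<name>/..."""
    stacks = set()
    for path in changed_paths:
        parts = path.split("/")
        # Only paths inside a stack directory count; top-level files do not.
        if len(parts) >= 3 and parts[0] == "stacks":
            stacks.add(parts[1])
    return sorted(stacks)


# Hypothetical output of `git diff --name-only HEAD~1 HEAD` for a commit.
paths = [
    "stacks/tts/main.tf",
    "stacks/kyverno/modules/kyverno/resource-governance.tf",
    ".claude/CLAUDE.md",
]
print(changed_stacks(paths))
```

A stack is only applied if at least one of its files appears in the diff, which is why a comment-only change in `stacks/tts/main.tf` is enough to name the stack.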