diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index f33430be..218f7af7 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
+| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. | t3code |
 
 ## Active Use
 | Service | Description | Stack |
diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md
new file mode 100644
index 00000000..192061bb
--- /dev/null
+++ b/docs/runbooks/t3-drop-attribution.md
@@ -0,0 +1,91 @@
+# t3 drop attribution — "is it infra or my network?"
+
+When a t3 user reports "disconnects, then self-recovers after a few seconds",
+that is the t3 **client watchdog**: the browser heartbeats every 10s and force-
+reconnects after >20s without a response. Any stall or break anywhere on
+browser → Cloudflare → tunnel → Traefik → t3-dispatch → `t3 serve` produces
+the identical symptom. This runbook attributes a drop to a segment in minutes.
+
+## 1. Check the probe (first stop)
+
+The in-cluster `t3-probe` (stacks/t3code, scrape job `t3-probe`) holds three
+permanent legs that differ only in path segment:
+
+| leg | path under test | drop means |
+|---|---|---|
+| `cloudflare` | WAN → CF edge → tunnel → cloudflared → Traefik → dispatch | Cloudflare/WAN segment |
+| `internal` | Traefik LB (10.0.20.203) → dispatch (no Cloudflare) | Traefik / dispatch / devvm network |
+| `t3serve` | HTTP straight to devvm:3773 (`t3 serve` process) | the serve process itself (event-loop stall) |
+
+Prometheus queries:
+
+```promql
+increase(t3probe_disconnects_total[1h])          # drops per leg+reason
+t3probe_connected                                # current state per leg
+histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m]))
+```
+
+Attribution table:
+
+- `cloudflare` drops, `internal` clean → Cloudflare edge / QUIC tunnel / WAN.
+- both WS legs drop together → Traefik, dispatch, or devvm reachability.
+- `t3serve` RTT spikes / timeouts → the user's `t3 serve` stalled (see §3).
+- **all legs clean while the user dropped → their last mile / device. Infra
+  is exonerated, with data.**
+
+Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.
+
+## 2. Server-side log recipe (per-event forensics)
+
+On devvm (timestamps in UTC):
+
+```bash
+# dispatch view — error class identifies which side died:
+#   "context canceled"                      = front/client side tore down
+#   "connection reset by peer 127.0.0.1:PORT" = that user's serve closed
+#   "connection refused"                    = that user's serve was down
+journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error"
+
+# mass-cancel bursts (many same-second cancels = shared-segment break):
+journalctl -u t3-dispatch --since "6 hours ago" \
+  | grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \
+  | awk '{print $6}' | sort | uniq -c | awk '$1>=5'
+
+# serve-side starvation markers (git taking >5s = devvm frozen):
+journalctl -u t3-serve@<user> --since "6 hours ago" | grep "timed out"
+
+# tunnel-side: cloudflared pod restarts + per-connection events
+kubectl -n cloudflared get pods
+kubectl -n cloudflared logs <pod> --since=6h | grep -E "ERR|reconnect"
+```
+
+## 3. devvm pressure correlation
+
+devvm node_exporter is scraped as job `devvm` (since 2026-06-10). The known
+high-frequency drop mechanism is **memory+IO pressure on devvm**: agent
+processes live inside `t3-serve@<user>`'s cgroup; a runaway agent swap-thrashes
+the spinning root disk and freezes the box in multi-10s windows — every
+connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global
+OOM → 8.5min hard outage).
+
+```promql
+rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m])
+rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
+node_memory_SwapFree_bytes{instance="devvm"}
+```
+
+Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
+`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` —
+a runaway agent now OOMs alone inside the cgroup instead of taking the box
+(and the WS server) with it.
+
+## 4. Known root causes (2026-06-10 investigation)
+
+1. **devvm memory/IO storms** (high-frequency mechanism) — §3.
+2. **cloudflared in-place autoupdate** — fixed: `--no-autoupdate`
+   (stacks/cloudflared). Before the fix every CF release exited all 3 pods
+   (code 11), severing all tunnel WebSockets.
+3. **QUIC tunnel churn** (~1–2/day, "no recent network activity") — inherent;
+   visible as `cloudflare`-leg-only blips.
+4. **t3 nightly autoupdate** — pinned after the 2026-06-09 outage, see
+   `docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md`.
diff --git a/scripts/t3-dispatch/go.mod b/scripts/t3-dispatch/go.mod
index 26e9388b..6b629374 100644
--- a/scripts/t3-dispatch/go.mod
+++ b/scripts/t3-dispatch/go.mod
@@ -1,3 +1,5 @@
 module t3-dispatch
 
 go 1.22
+
+require github.com/gorilla/websocket v1.5.3 // indirect
diff --git a/scripts/t3-dispatch/go.sum b/scripts/t3-dispatch/go.sum
new file mode 100644
index 00000000..25a9fc4b
--- /dev/null
+++ b/scripts/t3-dispatch/go.sum
@@ -0,0 +1,2 @@
+github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg=
+github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE=
diff --git a/scripts/t3-dispatch/main.go b/scripts/t3-dispatch/main.go
index a36e2da0..35d38ec9 100644
--- a/scripts/t3-dispatch/main.go
+++ b/scripts/t3-dispatch/main.go
@@ -228,6 +228,7 @@ func main() {
 	}()
 	mux := http.NewServeMux()
 	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("ok\n")) })
+	registerProbe(mux)
 	mux.HandleFunc("/", handler)
 	log.Printf("t3-dispatch listening on %s", listenAddr)
 	log.Fatal(http.ListenAndServe(listenAddr, mux))
diff --git a/scripts/t3-dispatch/main_test.go b/scripts/t3-dispatch/main_test.go
index ee43266e..81b522ba 100644
--- a/scripts/t3-dispatch/main_test.go
+++ b/scripts/t3-dispatch/main_test.go
@@ -5,7 +5,10 @@ import (
 	"net/http/httptest"
 	"net/url"
 	"strconv"
+	"strings"
 	"testing"
+
+	"github.com/gorilla/websocket"
 )
 
 func portOf(t *testing.T, ts *httptest.Server) int {
@@ -258,3 +261,43 @@ func TestAutoPairAcrossVersions(t *testing.T) {
 		})
 	}
 }
+
+func TestProbeHealthz(t *testing.T) {
+	mux := http.NewServeMux()
+	registerProbe(mux)
+	ts := httptest.NewServer(mux)
+	defer ts.Close()
+	resp, err := http.Get(ts.URL + "/probe/healthz")
+	if err != nil {
+		t.Fatalf("GET /probe/healthz: %v", err)
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode != http.StatusOK {
+		t.Errorf("status = %d, want 200", resp.StatusCode)
+	}
+}
+
+func TestProbeWSEcho(t *testing.T) {
+	mux := http.NewServeMux()
+	registerProbe(mux)
+	ts := httptest.NewServer(mux)
+	defer ts.Close()
+	wsURL := "ws" + strings.TrimPrefix(ts.URL, "http") + "/probe/ws"
+	c, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
+	if err != nil {
+		t.Fatalf("dial %s: %v", wsURL, err)
+	}
+	defer c.Close()
+	for _, msg := range []string{"ping 1718000000", "ping 1718000010"} {
+		if err := c.WriteMessage(websocket.TextMessage, []byte(msg)); err != nil {
+			t.Fatalf("write: %v", err)
+		}
+		_, got, err := c.ReadMessage()
+		if err != nil {
+			t.Fatalf("read: %v", err)
+		}
+		if string(got) != msg {
+			t.Errorf("echo = %q, want %q", got, msg)
+		}
+	}
+}
diff --git a/scripts/t3-dispatch/probe.go b/scripts/t3-dispatch/probe.go
new file mode 100644
index 00000000..df689bce
--- /dev/null
+++ b/scripts/t3-dispatch/probe.go
@@ -0,0 +1,49 @@
+// probe.go: unauthenticated path-health surface for the in-cluster t3-probe.
+// /probe/* is carved out of Authentik (stacks/t3code `module "ingress_probe"`)
+// so a synthetic client can hold a long-lived WebSocket here via two routes
+// (Cloudflare edge vs internal Traefik) and attribute connection drops to a
+// path segment. It echoes tiny frames and reaches no t3 instance — nothing
+// user-grade is exposed.
+package main
+
+import (
+	"net/http"
+	"time"
+
+	"github.com/gorilla/websocket"
+)
+
+// Reap connections whose client went silent; the probe pings every 10s, so 90s
+// of silence means the peer is gone even if TCP never noticed.
+const probeIdleLimit = 90 * time.Second
+
+var probeUpgrader = websocket.Upgrader{
+	// No cookies or credentials are at stake on an echo endpoint, and the
+	// probe connects without a browser Origin — checking it would only break it.
+	CheckOrigin: func(*http.Request) bool { return true },
+}
+
+func registerProbe(mux *http.ServeMux) {
+	mux.HandleFunc("/probe/healthz", func(w http.ResponseWriter, _ *http.Request) {
+		_, _ = w.Write([]byte("ok\n"))
+	})
+	mux.HandleFunc("/probe/ws", func(w http.ResponseWriter, r *http.Request) {
+		c, err := probeUpgrader.Upgrade(w, r, nil)
+		if err != nil {
+			return // Upgrade has already written the HTTP error
+		}
+		defer c.Close()
+		for {
+			if err := c.SetReadDeadline(time.Now().Add(probeIdleLimit)); err != nil {
+				return
+			}
+			mt, msg, err := c.ReadMessage()
+			if err != nil {
+				return
+			}
+			if err := c.WriteMessage(mt, msg); err != nil {
+				return
+			}
+		}
+	})
+}
diff --git a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
index 583900fe..5244db0e 100644
--- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
+++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
@@ -57,6 +57,9 @@ locals {
     "instagram-poster-image" = "https://instagram-poster.viktorbarzin.me/image"
     # trading-bot app root (auth="app"): WebAuthn/JWT in-app; was walled, now 200.
     "trading-bot-app" = "https://trading.viktorbarzin.me/"
+    # t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo
+    # + healthz for the t3-probe drop-attribution client (stacks/t3code).
+    "t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz"
     # NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed — it
     # has no public DNS record (NXDOMAIN, external_monitor=false), so there is no
     # externally GET-able URL to probe. Its carve-out is internal-only.
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 732b4e68..c9176867 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -2541,6 +2541,22 @@ serverFiles:
               severity: warning
             annotations:
               summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
+          - alert: T3ProbeLegDown
+            expr: t3probe_connected{job="t3-probe"} == 0
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "A t3 path-probe leg has been down >5m (leg label says which)"
+              description: "cloudflare-only = Cloudflare/WAN segment; cloudflare+internal = Traefik/dispatch/devvm; t3serve = the serve process. See docs/runbooks/t3-drop-attribution.md."
+          - alert: T3ProbeDropBurst
+            expr: increase(t3probe_disconnects_total{job="t3-probe"}[15m]) > 6
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "A t3 path-probe leg is dropping repeatedly (>6 in 15m; see leg/reason labels)"
+              description: "Users on the same segment are seeing 'disconnected, reconnecting' at this rate. Compare legs to attribute; correlate with devvm node_pressure_* metrics."
           - alert: ViktorBarzinApexDrift
             expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0
             for: 10m
@@ -3110,6 +3126,30 @@ extraScrapeConfigs: |
       - source_labels: [__address__]
         target_label: instance
         replacement: 'pve-node-r730' # Giving it a friendly name
+  # devvm: the shared workstation VM hosting per-user t3-serve + Claude agents.
+  # Its node_exporter ran unscraped until 2026-06-10 — the t3 disconnect
+  # root-cause work had NO memory/IO-pressure history for the very box whose
+  # stalls fire every t3 client's watchdog. Pressure/swap/load here is the
+  # primary correlate for t3probe drop events.
+  - job_name: 'devvm'
+    static_configs:
+      - targets:
+        - "10.0.10.10:9100"
+        labels:
+          node: 'devvm'
+    metrics_path: '/metrics'
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: instance
+        replacement: 'devvm' # Giving it a friendly name
+  # t3-probe: differential t3 path-health prober (stacks/t3code). Legs:
+  # cloudflare (full public path), internal (Traefik only), t3serve (the
+  # serve process). See docs/runbooks/t3-drop-attribution.md.
+  - job_name: 't3-probe'
+    static_configs:
+      - targets:
+        - "t3-probe.t3code.svc.cluster.local:9108"
+    metrics_path: '/metrics'
   # rpi-sofia: external Raspberry Pi 3 at the Sofia home site (Frigate camera
   # DNAT passthrough + solar inverter path + HA MQTT sensors). node_exporter
   # installed via apt; the rpi_* metrics come from a vcgencmd textfile collector
diff --git a/stacks/t3code/main.tf b/stacks/t3code/main.tf
index 9e23ff1f..75a8e542 100644
--- a/stacks/t3code/main.tf
+++ b/stacks/t3code/main.tf
@@ -94,3 +94,118 @@ module "ingress" {
     "gethomepage.dev/pod-selector" = ""
   }
 }
+
+# === Drop-attribution probe surface ==========================================
+# /probe/* on the t3 host is dispatch's unauthenticated echo surface (see
+# scripts/t3-dispatch/probe.go) for the t3-probe below. Guarded against
+# Authentik re-walling by `authentik_walloff_targets` in stacks/monitoring.
+module "ingress_probe" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "none": WS echo + healthz for the in-cluster path-health probe; no
+  # user data, no t3 instance reachable — auth would break the synthetic client.
+  auth             = "none"
+  anti_ai_scraping = false  # the probe IS a bot; PoW/UA filtering would block it
+  dns_type         = "none" # main `module.ingress` owns the DNS record for this host
+  namespace        = kubernetes_namespace.t3code.metadata[0].name
+  name             = "t3-probe"
+  service_name     = kubernetes_service.t3code.metadata[0].name
+  full_host        = "t3.viktorbarzin.me"
+  ingress_path     = ["/probe"]
+  tls_secret_name  = var.tls_secret_name
+}
+
+# t3-probe: differential WS/HTTP prober (see probe.py docstring for the
+# attribution model). Runs in-cluster so it measures the shared path WITHOUT
+# any user's last mile; Prometheus scrapes it via the static `t3-probe` job
+# in stacks/monitoring.
+resource "kubernetes_config_map_v1" "t3_probe" {
+  metadata {
+    name      = "t3-probe"
+    namespace = kubernetes_namespace.t3code.metadata[0].name
+  }
+  data = {
+    "probe.py" = file("${path.module}/probe.py")
+  }
+}
+
+resource "kubernetes_deployment_v1" "t3_probe" {
+  metadata {
+    name      = "t3-probe"
+    namespace = kubernetes_namespace.t3code.metadata[0].name
+    labels    = { app = "t3-probe" }
+  }
+  spec {
+    replicas = 1
+    selector {
+      match_labels = { app = "t3-probe" }
+    }
+    template {
+      metadata {
+        labels = { app = "t3-probe" }
+        annotations = {
+          "checksum/probe" = sha256(file("${path.module}/probe.py"))
+        }
+      }
+      spec {
+        container {
+          name  = "probe"
+          image = "python:3.12-alpine"
+          # Long-running pod, not a high-cadence CronJob: a one-time pinned
+          # pip install at start (with retries against transient DNS) is the
+          # lightweight alternative to owning a registry image for ~200 lines.
+          command = ["sh", "-c", <<-EOT
+            for i in 1 2 3 4 5; do
+              pip install --no-cache-dir --quiet aiohttp==3.9.5 prometheus-client==0.20.0 && break
+              echo "pip attempt $i failed; retrying" >&2; sleep 10
+            done
+            exec python /app/probe.py
+          EOT
+          ]
+          port {
+            container_port = 9108
+            name           = "metrics"
+          }
+          volume_mount {
+            name       = "app"
+            mount_path = "/app"
+            read_only  = true
+          }
+          resources {
+            requests = {
+              cpu    = "10m"
+              memory = "64Mi"
+            }
+            limits = {
+              memory = "192Mi"
+            }
+          }
+        }
+        volume {
+          name = "app"
+          config_map {
+            name = kubernetes_config_map_v1.t3_probe.metadata[0].name
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
+}
+
+resource "kubernetes_service" "t3_probe" {
+  metadata {
+    name      = "t3-probe"
+    namespace = kubernetes_namespace.t3code.metadata[0].name
+    labels    = { app = "t3-probe" }
+  }
+  spec {
+    selector = { app = "t3-probe" }
+    port {
+      name        = "metrics"
+      port        = 9108
+      target_port = 9108
+    }
+  }
+}
diff --git a/stacks/t3code/probe.py b/stacks/t3code/probe.py
new file mode 100644
index 00000000..ea3da93b
--- /dev/null
+++ b/stacks/t3code/probe.py
@@ -0,0 +1,201 @@
+"""t3-probe: differential path-health probe behind the t3 drop attribution.
+
+Holds long-lived WebSockets to t3-dispatch's /probe/ws echo endpoint via two
+routes that differ ONLY in the Cloudflare segment, plus an HTTP heartbeat
+against the t3-serve process itself:
+
+  leg=cloudflare  wss://T3_HOST/probe/ws connected to the address PUBLIC DNS
+                  returns (DoH @1.1.1.1) -> WAN -> CF edge -> tunnel ->
+                  cloudflared -> Traefik -> t3-dispatch
+  leg=internal    same URL pinned to the internal Traefik LB -> Traefik ->
+                  t3-dispatch (no Cloudflare)
+  leg=t3serve     GET http://DEVVM:3773/api/auth/session every 10s; an
+                  event-loop stall in the user's `t3 serve` delays/times-out
+                  this regardless of auth
+
+Attribution: cloudflare drops alone -> Cloudflare/WAN segment; cloudflare +
+internal together -> Traefik/dispatch/devvm network; t3serve latency spikes ->
+the serve process (memory/IO stalls); all legs clean while a human drops ->
+their last mile, infra exonerated. Mirrors the real t3 client's resilience
+protocol (10s heartbeat, ~20s watchdog) so probe drops mean a real client
+would have dropped too.
+"""
+
+import asyncio
+import json
+import logging
+import time
+
+import aiohttp
+from aiohttp.abc import AbstractResolver
+from prometheus_client import Counter, Gauge, Histogram, start_http_server
+
+T3_HOST = "t3.viktorbarzin.me"
+TRAEFIK_LB = "10.0.20.203"
+DEVVM = "10.0.10.10"
+T3_SERVE_PORT = 3773
+DOH_URL = "https://1.1.1.1/dns-query"
+HEARTBEAT_SECONDS = 10
+RTT_TIMEOUT_SECONDS = 20  # mirror the t3 client watchdog
+METRICS_PORT = 9108
+
+log = logging.getLogger("t3probe")
+
+CONNECTED = Gauge("t3probe_connected", "1 while the leg's connection is up", ["leg"])
+DISCONNECTS = Counter(
+    "t3probe_disconnects_total", "Connection deaths by leg and reason", ["leg", "reason"]
+)
+RTT = Histogram(
+    "t3probe_rtt_seconds",
+    "Heartbeat round-trip (WS echo / HTTP GET) per leg",
+    ["leg"],
+    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20],
+)
+CONNECTION_AGE = Histogram(
+    "t3probe_connection_age_seconds",
+    "Age of a WS connection when it died",
+    ["leg"],
+    buckets=[10, 30, 60, 300, 900, 3600, 14400, 86400],
+)
+LAST_DISCONNECT = Gauge(
+    "t3probe_last_disconnect_timestamp", "Unix time of the leg's last death", ["leg"]
+)
+
+
+class PinnedResolver(AbstractResolver):
+    """Resolve T3_HOST to one fixed address; Host/SNI/cert stay hostname-true."""
+
+    def __init__(self, address):
+        self.address = address
+
+    async def resolve(self, host, port=0, family=0):
+        return [
+            {
+                "hostname": host,
+                "host": self.address,
+                "port": port,
+                "family": 2,  # AF_INET
+                "proto": 0,
+                "flags": 0,
+            }
+        ]
+
+    async def close(self):
+        pass
+
+
+class DoHResolver(AbstractResolver):
+    """Resolve via Cloudflare DoH so the answer is the PUBLIC (proxied) one.
+
+    The cluster's own DNS is split-horizon since 2026-06-10 (pods get internal
+    answers for *.viktorbarzin.me), which would silently collapse this leg
+    onto the internal route — public resolution must bypass it.
+    """
+
+    async def resolve(self, host, port=0, family=0):
+        async with aiohttp.ClientSession() as s:
+            async with s.get(
+                DOH_URL,
+                params={"name": host, "type": "A"},
+                headers={"accept": "application/dns-json"},
+                timeout=aiohttp.ClientTimeout(total=10),
+            ) as resp:
+                answers = (await resp.json(content_type=None)).get("Answer", [])
+        addrs = [a["data"] for a in answers if a.get("type") == 1]
+        if not addrs:
+            raise OSError(f"DoH returned no A records for {host}")
+        return [
+            {
+                "hostname": host,
+                "host": addrs[0],
+                "port": port,
+                "family": 2,
+                "proto": 0,
+                "flags": 0,
+            }
+        ]
+
+    async def close(self):
+        pass
+
+
+async def ws_leg(leg, resolver):
+    url = f"wss://{T3_HOST}/probe/ws"
+    attempts = 0
+    while True:
+        CONNECTED.labels(leg).set(0)
+        established = None
+        reason = "connect_failed"
+        try:
+            connector = aiohttp.TCPConnector(resolver=resolver, force_close=True)
+            async with aiohttp.ClientSession(connector=connector) as session:
+                async with session.ws_connect(
+                    url, timeout=aiohttp.ClientWSTimeout(ws_close=10), heartbeat=None
+                ) as ws:
+                    established = time.monotonic()
+                    attempts = 0
+                    CONNECTED.labels(leg).set(1)
+                    log.info("%s: connected", leg)
+                    while True:
+                        sent = time.monotonic()
+                        await ws.send_str(f"ping {time.time_ns()}")
+                        msg = await ws.receive(timeout=RTT_TIMEOUT_SECONDS)
+                        if msg.type != aiohttp.WSMsgType.TEXT:
+                            reason = f"closed_{msg.type.name.lower()}"
+                            break
+                        RTT.labels(leg).observe(time.monotonic() - sent)
+                        await asyncio.sleep(HEARTBEAT_SECONDS)
+        except asyncio.TimeoutError:
+            reason = "rtt_timeout" if established else "connect_timeout"
+        except (aiohttp.ClientError, OSError) as e:
+            reason = "connect_failed" if not established else "io_error"
+            log.warning("%s: %s: %s", leg, reason, e)
+        CONNECTED.labels(leg).set(0)
+        DISCONNECTS.labels(leg, reason).inc()
+        LAST_DISCONNECT.labels(leg).set(time.time())
+        if established is not None:
+            CONNECTION_AGE.labels(leg).observe(time.monotonic() - established)
+            log.info("%s: died after %.0fs (%s)", leg, time.monotonic() - established, reason)
+        attempts += 1
+        await asyncio.sleep(min(3 * attempts, 30))
+
+
+async def t3serve_leg():
+    leg = "t3serve"
+    url = f"http://{DEVVM}:{T3_SERVE_PORT}/api/auth/session"
+    timeout = aiohttp.ClientTimeout(total=RTT_TIMEOUT_SECONDS)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        while True:
+            sent = time.monotonic()
+            try:
+                async with session.get(url) as resp:
+                    await resp.read()
+                    RTT.labels(leg).observe(time.monotonic() - sent)
+                    CONNECTED.labels(leg).set(1 if resp.status == 200 else 0)
+                    if resp.status != 200:
+                        DISCONNECTS.labels(leg, f"http_{resp.status}").inc()
+                        LAST_DISCONNECT.labels(leg).set(time.time())
+            except asyncio.TimeoutError:
+                CONNECTED.labels(leg).set(0)
+                DISCONNECTS.labels(leg, "rtt_timeout").inc()
+                LAST_DISCONNECT.labels(leg).set(time.time())
+            except (aiohttp.ClientError, OSError) as e:
+                CONNECTED.labels(leg).set(0)
+                DISCONNECTS.labels(leg, "connect_failed").inc()
+                LAST_DISCONNECT.labels(leg).set(time.time())
+                log.warning("%s: %s", leg, e)
+            await asyncio.sleep(HEARTBEAT_SECONDS)
+
+
+async def main():
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
+    start_http_server(METRICS_PORT)
+    await asyncio.gather(
+        ws_leg("cloudflare", DoHResolver()),
+        ws_leg("internal", PinnedResolver(TRAEFIK_LB)),
+        t3serve_leg(),
+    )
+
+
+if __name__ == "__main__":
+    asyncio.run(main())