From 9b55d53be077d40a1f05eed266f8f18c32233c59 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Wed, 10 Jun 2026 21:11:29 +0000 Subject: [PATCH] t3: differential drop-attribution probe + devvm metrics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md. --- .claude/reference/service-catalog.md | 2 +- docs/runbooks/t3-drop-attribution.md | 91 ++++++++ scripts/t3-dispatch/go.mod | 2 + scripts/t3-dispatch/go.sum | 2 + scripts/t3-dispatch/main.go | 1 + scripts/t3-dispatch/main_test.go | 43 ++++ scripts/t3-dispatch/probe.go | 49 +++++ .../monitoring/authentik_walloff_probe.tf | 3 + .../monitoring/prometheus_chart_values.tpl | 40 ++++ stacks/t3code/main.tf | 115 ++++++++++ stacks/t3code/probe.py | 201 ++++++++++++++++++ 11 files changed, 548 insertions(+), 1 deletion(-) create mode 100644 docs/runbooks/t3-drop-attribution.md create mode 100644 scripts/t3-dispatch/go.sum create mode 100644 scripts/t3-dispatch/probe.go create mode 100644 stacks/t3code/probe.py diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index f33430be..218f7af7 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -32,7 +32,7 @@ |---------|-------------|-------| | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard | | reverse-proxy | Generic reverse proxy | reverse-proxy | -| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code | +| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. | t3code | ## Active Use | Service | Description | Stack | diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md new file mode 100644 index 00000000..192061bb --- /dev/null +++ b/docs/runbooks/t3-drop-attribution.md @@ -0,0 +1,91 @@ +# t3 drop attribution — "is it infra or my network?" + +When a t3 user reports "disconnects, then self-recovers after a few seconds", +that is the t3 **client watchdog**: the browser heartbeats every 10s and force- +reconnects after >20s without a response. Any stall or break anywhere on +browser → Cloudflare → tunnel → Traefik → t3-dispatch → `t3 serve` produces +the identical symptom. This runbook attributes a drop to a segment in minutes. + +## 1. Check the probe (first stop) + +The in-cluster `t3-probe` (stacks/t3code, scrape job `t3-probe`) holds three +permanent legs that differ only in path segment: + +| leg | path under test | drop means | +|---|---|---| +| `cloudflare` | WAN → CF edge → tunnel → cloudflared → Traefik → dispatch | Cloudflare/WAN segment | +| `internal` | Traefik LB (10.0.20.203) → dispatch (no Cloudflare) | Traefik / dispatch / devvm network | +| `t3serve` | HTTP straight to devvm:3773 (`t3 serve` process) | the serve process itself (event-loop stall) | + +Prometheus queries: + +```promql +increase(t3probe_disconnects_total[1h]) # drops per leg+reason +t3probe_connected # current state per leg +histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m])) +``` + +Attribution table: + +- `cloudflare` drops, `internal` clean → Cloudflare edge / QUIC tunnel / WAN. +- both WS legs drop together → Traefik, dispatch, or devvm reachability. +- `t3serve` RTT spikes / timeouts → the user's `t3 serve` stalled (see §3). +- **all legs clean while the user dropped → their last mile / device. Infra + is exonerated, with data.** + +Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage. + +## 2. Server-side log recipe (per-event forensics) + +On devvm (timestamps in UTC): + +```bash +# dispatch view — error class identifies which side died: +# "context canceled" = front/client side tore down +# "connection reset by peer 127.0.0.1:PORT" = that user's serve closed +# "connection refused" = that user's serve was down +journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error" + +# mass-cancel bursts (many same-second cancels = shared-segment break): +journalctl -u t3-dispatch --since "6 hours ago" \ + | grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \ + | awk '{print $6}' | sort | uniq -c | awk '$1>=5' + +# serve-side starvation markers (git taking >5s = devvm frozen): +journalctl -u t3-serve@ --since "6 hours ago" | grep "timed out" + +# tunnel-side: cloudflared pod restarts + per-connection events +kubectl -n cloudflared get pods +kubectl -n cloudflared logs --since=6h | grep -E "ERR|reconnect" +``` + +## 3. devvm pressure correlation + +devvm node_exporter is scraped as job `devvm` (since 2026-06-10). The known +high-frequency drop mechanism is **memory+IO pressure on devvm**: agent +processes live inside `t3-serve@`'s cgroup; a runaway agent swap-thrashes +the spinning root disk and freezes the box in multi-10s windows — every +connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global +OOM → 8.5min hard outage). + +```promql +rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m]) +rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m]) +node_memory_SwapFree_bytes{instance="devvm"} +``` + +Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit +`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` — +a runaway agent now OOMs alone inside the cgroup instead of taking the box +(and the WS server) with it. + +## 4. Known root causes (2026-06-10 investigation) + +1. **devvm memory/IO storms** (high-frequency mechanism) — §3. +2. **cloudflared in-place autoupdate** — fixed: `--no-autoupdate` + (stacks/cloudflared). Before the fix every CF release exited all 3 pods + (code 11), severing all tunnel WebSockets. +3. **QUIC tunnel churn** (~1–2/day, "no recent network activity") — inherent; + visible as `cloudflare`-leg-only blips. +4. **t3 nightly autoupdate** — pinned after the 2026-06-09 outage, see + `docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md`. diff --git a/scripts/t3-dispatch/go.mod b/scripts/t3-dispatch/go.mod index 26e9388b..6b629374 100644 --- a/scripts/t3-dispatch/go.mod +++ b/scripts/t3-dispatch/go.mod @@ -1,3 +1,5 @@ module t3-dispatch go 1.22 + +require github.com/gorilla/websocket v1.5.3 // indirect diff --git a/scripts/t3-dispatch/go.sum b/scripts/t3-dispatch/go.sum new file mode 100644 index 00000000..25a9fc4b --- /dev/null +++ b/scripts/t3-dispatch/go.sum @@ -0,0 +1,2 @@ +github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg= +github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= diff --git a/scripts/t3-dispatch/main.go b/scripts/t3-dispatch/main.go index a36e2da0..35d38ec9 100644 --- a/scripts/t3-dispatch/main.go +++ b/scripts/t3-dispatch/main.go @@ -228,6 +228,7 @@ func main() { }() mux := http.NewServeMux() mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("ok\n")) }) + registerProbe(mux) mux.HandleFunc("/", handler) log.Printf("t3-dispatch listening on %s", listenAddr) log.Fatal(http.ListenAndServe(listenAddr, mux)) diff --git a/scripts/t3-dispatch/main_test.go b/scripts/t3-dispatch/main_test.go index ee43266e..81b522ba 100644 --- a/scripts/t3-dispatch/main_test.go +++ b/scripts/t3-dispatch/main_test.go @@ -5,7 +5,10 @@ import ( "net/http/httptest" "net/url" "strconv" + "strings" "testing" + + "github.com/gorilla/websocket" ) func portOf(t *testing.T, ts *httptest.Server) int { @@ -258,3 +261,43 @@ func TestAutoPairAcrossVersions(t *testing.T) { }) } } + +func TestProbeHealthz(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + resp, err := http.Get(ts.URL + "/probe/healthz") + if err != nil { + t.Fatalf("GET /probe/healthz: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Errorf("status = %d, want 200", resp.StatusCode) + } +} + +func TestProbeWSEcho(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + wsURL := "ws" + strings.TrimPrefix(ts.URL, "http") + "/probe/ws" + c, _, err := websocket.DefaultDialer.Dial(wsURL, nil) + if err != nil { + t.Fatalf("dial %s: %v", wsURL, err) + } + defer c.Close() + for _, msg := range []string{"ping 1718000000", "ping 1718000010"} { + if err := c.WriteMessage(websocket.TextMessage, []byte(msg)); err != nil { + t.Fatalf("write: %v", err) + } + _, got, err := c.ReadMessage() + if err != nil { + t.Fatalf("read: %v", err) + } + if string(got) != msg { + t.Errorf("echo = %q, want %q", got, msg) + } + } +} diff --git a/scripts/t3-dispatch/probe.go b/scripts/t3-dispatch/probe.go new file mode 100644 index 00000000..df689bce --- /dev/null +++ b/scripts/t3-dispatch/probe.go @@ -0,0 +1,49 @@ +// probe.go: unauthenticated path-health surface for the in-cluster t3-probe. +// /probe/* is carved out of Authentik (stacks/t3code `module "ingress_probe"`) +// so a synthetic client can hold a long-lived WebSocket here via two routes +// (Cloudflare edge vs internal Traefik) and attribute connection drops to a +// path segment. It echoes tiny frames and reaches no t3 instance — nothing +// user-grade is exposed. +package main + +import ( + "net/http" + "time" + + "github.com/gorilla/websocket" +) + +// Reap connections whose client went silent; the probe pings every 10s, so 90s +// of silence means the peer is gone even if TCP never noticed. +const probeIdleLimit = 90 * time.Second + +var probeUpgrader = websocket.Upgrader{ + // No cookies or credentials are at stake on an echo endpoint, and the + // probe connects without a browser Origin — checking it would only break it. + CheckOrigin: func(*http.Request) bool { return true }, +} + +func registerProbe(mux *http.ServeMux) { + mux.HandleFunc("/probe/healthz", func(w http.ResponseWriter, _ *http.Request) { + _, _ = w.Write([]byte("ok\n")) + }) + mux.HandleFunc("/probe/ws", func(w http.ResponseWriter, r *http.Request) { + c, err := probeUpgrader.Upgrade(w, r, nil) + if err != nil { + return // Upgrade has already written the HTTP error + } + defer c.Close() + for { + if err := c.SetReadDeadline(time.Now().Add(probeIdleLimit)); err != nil { + return + } + mt, msg, err := c.ReadMessage() + if err != nil { + return + } + if err := c.WriteMessage(mt, msg); err != nil { + return + } + } + }) +} diff --git a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf index 583900fe..5244db0e 100644 --- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf +++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf @@ -57,6 +57,9 @@ locals { "instagram-poster-image" = "https://instagram-poster.viktorbarzin.me/image" # trading-bot app root (auth="app"): WebAuthn/JWT in-app; was walled, now 200. "trading-bot-app" = "https://trading.viktorbarzin.me/" + # t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo + # + healthz for the t3-probe drop-attribution client (stacks/t3code). + "t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz" # NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed — it # has no public DNS record (NXDOMAIN, external_monitor=false), so there is no # externally GET-able URL to probe. Its carve-out is internal-only. diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index 732b4e68..c9176867 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -2541,6 +2541,22 @@ serverFiles: severity: warning annotations: summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace" + - alert: T3ProbeLegDown + expr: t3probe_connected{job="t3-probe"} == 0 + for: 5m + labels: + severity: warning + annotations: + summary: "A t3 path-probe leg has been down >5m (leg label says which)" + description: "cloudflare-only = Cloudflare/WAN segment; cloudflare+internal = Traefik/dispatch/devvm; t3serve = the serve process. See docs/runbooks/t3-drop-attribution.md." + - alert: T3ProbeDropBurst + expr: increase(t3probe_disconnects_total{job="t3-probe"}[15m]) > 6 + for: 1m + labels: + severity: warning + annotations: + summary: "A t3 path-probe leg is dropping repeatedly (>6 in 15m; see leg/reason labels)" + description: "Users on the same segment are seeing 'disconnected, reconnecting' at this rate. Compare legs to attribute; correlate with devvm node_pressure_* metrics." - alert: ViktorBarzinApexDrift expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0 for: 10m @@ -3110,6 +3126,30 @@ extraScrapeConfigs: | - source_labels: [__address__] target_label: instance replacement: 'pve-node-r730' # Giving it a friendly name + # devvm: the shared workstation VM hosting per-user t3-serve + Claude agents. + # Its node_exporter ran unscraped until 2026-06-10 — the t3 disconnect + # root-cause work had NO memory/IO-pressure history for the very box whose + # stalls fire every t3 client's watchdog. Pressure/swap/load here is the + # primary correlate for t3probe drop events. + - job_name: 'devvm' + static_configs: + - targets: + - "10.0.10.10:9100" + labels: + node: 'devvm' + metrics_path: '/metrics' + relabel_configs: + - source_labels: [__address__] + target_label: instance + replacement: 'devvm' # Giving it a friendly name + # t3-probe: differential t3 path-health prober (stacks/t3code). Legs: + # cloudflare (full public path), internal (Traefik only), t3serve (the + # serve process). See docs/runbooks/t3-drop-attribution.md. + - job_name: 't3-probe' + static_configs: + - targets: + - "t3-probe.t3code.svc.cluster.local:9108" + metrics_path: '/metrics' # rpi-sofia: external Raspberry Pi 3 at the Sofia home site (Frigate camera # DNAT passthrough + solar inverter path + HA MQTT sensors). node_exporter # installed via apt; the rpi_* metrics come from a vcgencmd textfile collector diff --git a/stacks/t3code/main.tf b/stacks/t3code/main.tf index 9e23ff1f..75a8e542 100644 --- a/stacks/t3code/main.tf +++ b/stacks/t3code/main.tf @@ -94,3 +94,118 @@ module "ingress" { "gethomepage.dev/pod-selector" = "" } } + +# === Drop-attribution probe surface ========================================== +# /probe/* on the t3 host is dispatch's unauthenticated echo surface (see +# scripts/t3-dispatch/probe.go) for the t3-probe below. Guarded against +# Authentik re-walling by `authentik_walloff_targets` in stacks/monitoring. +module "ingress_probe" { + source = "../../modules/kubernetes/ingress_factory" + # auth = "none": WS echo + healthz for the in-cluster path-health probe; no + # user data, no t3 instance reachable — auth would break the synthetic client. + auth = "none" + anti_ai_scraping = false # the probe IS a bot; PoW/UA filtering would block it + dns_type = "none" # main `module.ingress` owns the DNS record for this host + namespace = kubernetes_namespace.t3code.metadata[0].name + name = "t3-probe" + service_name = kubernetes_service.t3code.metadata[0].name + full_host = "t3.viktorbarzin.me" + ingress_path = ["/probe"] + tls_secret_name = var.tls_secret_name +} + +# t3-probe: differential WS/HTTP prober (see probe.py docstring for the +# attribution model). Runs in-cluster so it measures the shared path WITHOUT +# any user's last mile; Prometheus scrapes it via the static `t3-probe` job +# in stacks/monitoring. +resource "kubernetes_config_map_v1" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + } + data = { + "probe.py" = file("${path.module}/probe.py") + } +} + +resource "kubernetes_deployment_v1" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + labels = { app = "t3-probe" } + } + spec { + replicas = 1 + selector { + match_labels = { app = "t3-probe" } + } + template { + metadata { + labels = { app = "t3-probe" } + annotations = { + "checksum/probe" = sha256(file("${path.module}/probe.py")) + } + } + spec { + container { + name = "probe" + image = "python:3.12-alpine" + # Long-running pod, not a high-cadence CronJob: a one-time pinned + # pip install at start (with retries against transient DNS) is the + # lightweight alternative to owning a registry image for ~200 lines. + command = ["sh", "-c", <<-EOT + for i in 1 2 3 4 5; do + pip install --no-cache-dir --quiet aiohttp==3.9.5 prometheus-client==0.20.0 && break + echo "pip attempt $i failed; retrying" >&2; sleep 10 + done + exec python /app/probe.py + EOT + ] + port { + container_port = 9108 + name = "metrics" + } + volume_mount { + name = "app" + mount_path = "/app" + read_only = true + } + resources { + requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "192Mi" + } + } + } + volume { + name = "app" + config_map { + name = kubernetes_config_map_v1.t3_probe.metadata[0].name + } + } + } + } + } + lifecycle { + ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 + } +} + +resource "kubernetes_service" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + labels = { app = "t3-probe" } + } + spec { + selector = { app = "t3-probe" } + port { + name = "metrics" + port = 9108 + target_port = 9108 + } + } +} diff --git a/stacks/t3code/probe.py b/stacks/t3code/probe.py new file mode 100644 index 00000000..ea3da93b --- /dev/null +++ b/stacks/t3code/probe.py @@ -0,0 +1,201 @@ +"""t3-probe: differential path-health probe behind the t3 drop attribution. + +Holds long-lived WebSockets to t3-dispatch's /probe/ws echo endpoint via two +routes that differ ONLY in the Cloudflare segment, plus an HTTP heartbeat +against the t3-serve process itself: + + leg=cloudflare wss://T3_HOST/probe/ws connected to the address PUBLIC DNS + returns (DoH @1.1.1.1) -> WAN -> CF edge -> tunnel -> + cloudflared -> Traefik -> t3-dispatch + leg=internal same URL pinned to the internal Traefik LB -> Traefik -> + t3-dispatch (no Cloudflare) + leg=t3serve GET http://DEVVM:3773/api/auth/session every 10s; an + event-loop stall in the user's `t3 serve` delays/times-out + this regardless of auth + +Attribution: cloudflare drops alone -> Cloudflare/WAN segment; cloudflare + +internal together -> Traefik/dispatch/devvm network; t3serve latency spikes -> +the serve process (memory/IO stalls); all legs clean while a human drops -> +their last mile, infra exonerated. Mirrors the real t3 client's resilience +protocol (10s heartbeat, ~20s watchdog) so probe drops mean a real client +would have dropped too. +""" + +import asyncio +import json +import logging +import time + +import aiohttp +from aiohttp.abc import AbstractResolver +from prometheus_client import Counter, Gauge, Histogram, start_http_server + +T3_HOST = "t3.viktorbarzin.me" +TRAEFIK_LB = "10.0.20.203" +DEVVM = "10.0.10.10" +T3_SERVE_PORT = 3773 +DOH_URL = "https://1.1.1.1/dns-query" +HEARTBEAT_SECONDS = 10 +RTT_TIMEOUT_SECONDS = 20 # mirror the t3 client watchdog +METRICS_PORT = 9108 + +log = logging.getLogger("t3probe") + +CONNECTED = Gauge("t3probe_connected", "1 while the leg's connection is up", ["leg"]) +DISCONNECTS = Counter( + "t3probe_disconnects_total", "Connection deaths by leg and reason", ["leg", "reason"] +) +RTT = Histogram( + "t3probe_rtt_seconds", + "Heartbeat round-trip (WS echo / HTTP GET) per leg", + ["leg"], + buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20], +) +CONNECTION_AGE = Histogram( + "t3probe_connection_age_seconds", + "Age of a WS connection when it died", + ["leg"], + buckets=[10, 30, 60, 300, 900, 3600, 14400, 86400], +) +LAST_DISCONNECT = Gauge( + "t3probe_last_disconnect_timestamp", "Unix time of the leg's last death", ["leg"] +) + + +class PinnedResolver(AbstractResolver): + """Resolve T3_HOST to one fixed address; Host/SNI/cert stay hostname-true.""" + + def __init__(self, address): + self.address = address + + async def resolve(self, host, port=0, family=0): + return [ + { + "hostname": host, + "host": self.address, + "port": port, + "family": 2, # AF_INET + "proto": 0, + "flags": 0, + } + ] + + async def close(self): + pass + + +class DoHResolver(AbstractResolver): + """Resolve via Cloudflare DoH so the answer is the PUBLIC (proxied) one. + + The cluster's own DNS is split-horizon since 2026-06-10 (pods get internal + answers for *.viktorbarzin.me), which would silently collapse this leg + onto the internal route — public resolution must bypass it. + """ + + async def resolve(self, host, port=0, family=0): + async with aiohttp.ClientSession() as s: + async with s.get( + DOH_URL, + params={"name": host, "type": "A"}, + headers={"accept": "application/dns-json"}, + timeout=aiohttp.ClientTimeout(total=10), + ) as resp: + answers = (await resp.json(content_type=None)).get("Answer", []) + addrs = [a["data"] for a in answers if a.get("type") == 1] + if not addrs: + raise OSError(f"DoH returned no A records for {host}") + return [ + { + "hostname": host, + "host": addrs[0], + "port": port, + "family": 2, + "proto": 0, + "flags": 0, + } + ] + + async def close(self): + pass + + +async def ws_leg(leg, resolver): + url = f"wss://{T3_HOST}/probe/ws" + attempts = 0 + while True: + CONNECTED.labels(leg).set(0) + established = None + reason = "connect_failed" + try: + connector = aiohttp.TCPConnector(resolver=resolver, force_close=True) + async with aiohttp.ClientSession(connector=connector) as session: + async with session.ws_connect( + url, timeout=aiohttp.ClientWSTimeout(ws_close=10), heartbeat=None + ) as ws: + established = time.monotonic() + attempts = 0 + CONNECTED.labels(leg).set(1) + log.info("%s: connected", leg) + while True: + sent = time.monotonic() + await ws.send_str(f"ping {time.time_ns()}") + msg = await ws.receive(timeout=RTT_TIMEOUT_SECONDS) + if msg.type != aiohttp.WSMsgType.TEXT: + reason = f"closed_{msg.type.name.lower()}" + break + RTT.labels(leg).observe(time.monotonic() - sent) + await asyncio.sleep(HEARTBEAT_SECONDS) + except asyncio.TimeoutError: + reason = "rtt_timeout" if established else "connect_timeout" + except (aiohttp.ClientError, OSError) as e: + reason = "connect_failed" if not established else "io_error" + log.warning("%s: %s: %s", leg, reason, e) + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, reason).inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + if established is not None: + CONNECTION_AGE.labels(leg).observe(time.monotonic() - established) + log.info("%s: died after %.0fs (%s)", leg, time.monotonic() - established, reason) + attempts += 1 + await asyncio.sleep(min(3 * attempts, 30)) + + +async def t3serve_leg(): + leg = "t3serve" + url = f"http://{DEVVM}:{T3_SERVE_PORT}/api/auth/session" + timeout = aiohttp.ClientTimeout(total=RTT_TIMEOUT_SECONDS) + async with aiohttp.ClientSession(timeout=timeout) as session: + while True: + sent = time.monotonic() + try: + async with session.get(url) as resp: + await resp.read() + RTT.labels(leg).observe(time.monotonic() - sent) + CONNECTED.labels(leg).set(1 if resp.status == 200 else 0) + if resp.status != 200: + DISCONNECTS.labels(leg, f"http_{resp.status}").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + except asyncio.TimeoutError: + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, "rtt_timeout").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + except (aiohttp.ClientError, OSError) as e: + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, "connect_failed").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + log.warning("%s: %s", leg, e) + await asyncio.sleep(HEARTBEAT_SECONDS) + + +async def main(): + logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s") + start_http_server(METRICS_PORT) + await asyncio.gather( + ws_leg("cloudflare", DoHResolver()), + ws_leg("internal", PinnedResolver(TRAEFIK_LB)), + t3serve_leg(), + ) + + +if __name__ == "__main__": + asyncio.run(main())