diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index f33430be..218f7af7 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -32,7 +32,7 @@ |---------|-------------|-------| | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard | | reverse-proxy | Generic reverse proxy | reverse-proxy | -| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code | +| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. | t3code | ## Active Use | Service | Description | Stack | diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md new file mode 100644 index 00000000..192061bb --- /dev/null +++ b/docs/runbooks/t3-drop-attribution.md @@ -0,0 +1,91 @@ +# t3 drop attribution — "is it infra or my network?" + +When a t3 user reports "disconnects, then self-recovers after a few seconds", +that is the t3 **client watchdog**: the browser heartbeats every 10s and force- +reconnects after >20s without a response. Any stall or break anywhere on +browser → Cloudflare → tunnel → Traefik → t3-dispatch → `t3 serve` produces +the identical symptom. This runbook attributes a drop to a segment in minutes. + +## 1. Check the probe (first stop) + +The in-cluster `t3-probe` (stacks/t3code, scrape job `t3-probe`) holds three +permanent legs that differ only in path segment: + +| leg | path under test | drop means | +|---|---|---| +| `cloudflare` | WAN → CF edge → tunnel → cloudflared → Traefik → dispatch | Cloudflare/WAN segment | +| `internal` | Traefik LB (10.0.20.203) → dispatch (no Cloudflare) | Traefik / dispatch / devvm network | +| `t3serve` | HTTP straight to devvm:3773 (`t3 serve` process) | the serve process itself (event-loop stall) | + +Prometheus queries: + +```promql +increase(t3probe_disconnects_total[1h]) # drops per leg+reason +t3probe_connected # current state per leg +histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m])) +``` + +Attribution table: + +- `cloudflare` drops, `internal` clean → Cloudflare edge / QUIC tunnel / WAN. +- both WS legs drop together → Traefik, dispatch, or devvm reachability. +- `t3serve` RTT spikes / timeouts → the user's `t3 serve` stalled (see §3). +- **all legs clean while the user dropped → their last mile / device. Infra + is exonerated, with data.** + +Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage. + +## 2. Server-side log recipe (per-event forensics) + +On devvm (timestamps in UTC): + +```bash +# dispatch view — error class identifies which side died: +# "context canceled" = front/client side tore down +# "connection reset by peer 127.0.0.1:PORT" = that user's serve closed +# "connection refused" = that user's serve was down +journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error" + +# mass-cancel bursts (many same-second cancels = shared-segment break): +journalctl -u t3-dispatch --since "6 hours ago" \ + | grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \ + | awk '{print $6}' | sort | uniq -c | awk '$1>=5' + +# serve-side starvation markers (git taking >5s = devvm frozen): +journalctl -u t3-serve@ --since "6 hours ago" | grep "timed out" + +# tunnel-side: cloudflared pod restarts + per-connection events +kubectl -n cloudflared get pods +kubectl -n cloudflared logs --since=6h | grep -E "ERR|reconnect" +``` + +## 3. devvm pressure correlation + +devvm node_exporter is scraped as job `devvm` (since 2026-06-10). The known +high-frequency drop mechanism is **memory+IO pressure on devvm**: agent +processes live inside `t3-serve@`'s cgroup; a runaway agent swap-thrashes +the spinning root disk and freezes the box in multi-10s windows — every +connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global +OOM → 8.5min hard outage). + +```promql +rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m]) +rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m]) +node_memory_SwapFree_bytes{instance="devvm"} +``` + +Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit +`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` — +a runaway agent now OOMs alone inside the cgroup instead of taking the box +(and the WS server) with it. + +## 4. Known root causes (2026-06-10 investigation) + +1. **devvm memory/IO storms** (high-frequency mechanism) — §3. +2. **cloudflared in-place autoupdate** — fixed: `--no-autoupdate` + (stacks/cloudflared). Before the fix every CF release exited all 3 pods + (code 11), severing all tunnel WebSockets. +3. **QUIC tunnel churn** (~1–2/day, "no recent network activity") — inherent; + visible as `cloudflare`-leg-only blips. +4. **t3 nightly autoupdate** — pinned after the 2026-06-09 outage, see + `docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md`. diff --git a/scripts/t3-dispatch/go.mod b/scripts/t3-dispatch/go.mod index 26e9388b..6b629374 100644 --- a/scripts/t3-dispatch/go.mod +++ b/scripts/t3-dispatch/go.mod @@ -1,3 +1,5 @@ module t3-dispatch go 1.22 + +require github.com/gorilla/websocket v1.5.3 // indirect diff --git a/scripts/t3-dispatch/go.sum b/scripts/t3-dispatch/go.sum new file mode 100644 index 00000000..25a9fc4b --- /dev/null +++ b/scripts/t3-dispatch/go.sum @@ -0,0 +1,2 @@ +github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg= +github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= diff --git a/scripts/t3-dispatch/main.go b/scripts/t3-dispatch/main.go index a36e2da0..35d38ec9 100644 --- a/scripts/t3-dispatch/main.go +++ b/scripts/t3-dispatch/main.go @@ -228,6 +228,7 @@ func main() { }() mux := http.NewServeMux() mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("ok\n")) }) + registerProbe(mux) mux.HandleFunc("/", handler) log.Printf("t3-dispatch listening on %s", listenAddr) log.Fatal(http.ListenAndServe(listenAddr, mux)) diff --git a/scripts/t3-dispatch/main_test.go b/scripts/t3-dispatch/main_test.go index ee43266e..81b522ba 100644 --- a/scripts/t3-dispatch/main_test.go +++ b/scripts/t3-dispatch/main_test.go @@ -5,7 +5,10 @@ import ( "net/http/httptest" "net/url" "strconv" + "strings" "testing" + + "github.com/gorilla/websocket" ) func portOf(t *testing.T, ts *httptest.Server) int { @@ -258,3 +261,43 @@ func TestAutoPairAcrossVersions(t *testing.T) { }) } } + +func TestProbeHealthz(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + resp, err := http.Get(ts.URL + "/probe/healthz") + if err != nil { + t.Fatalf("GET /probe/healthz: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Errorf("status = %d, want 200", resp.StatusCode) + } +} + +func TestProbeWSEcho(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + wsURL := "ws" + strings.TrimPrefix(ts.URL, "http") + "/probe/ws" + c, _, err := websocket.DefaultDialer.Dial(wsURL, nil) + if err != nil { + t.Fatalf("dial %s: %v", wsURL, err) + } + defer c.Close() + for _, msg := range []string{"ping 1718000000", "ping 1718000010"} { + if err := c.WriteMessage(websocket.TextMessage, []byte(msg)); err != nil { + t.Fatalf("write: %v", err) + } + _, got, err := c.ReadMessage() + if err != nil { + t.Fatalf("read: %v", err) + } + if string(got) != msg { + t.Errorf("echo = %q, want %q", got, msg) + } + } +} diff --git a/scripts/t3-dispatch/probe.go b/scripts/t3-dispatch/probe.go new file mode 100644 index 00000000..df689bce --- /dev/null +++ b/scripts/t3-dispatch/probe.go @@ -0,0 +1,49 @@ +// probe.go: unauthenticated path-health surface for the in-cluster t3-probe. +// /probe/* is carved out of Authentik (stacks/t3code `module "ingress_probe"`) +// so a synthetic client can hold a long-lived WebSocket here via two routes +// (Cloudflare edge vs internal Traefik) and attribute connection drops to a +// path segment. It echoes tiny frames and reaches no t3 instance — nothing +// user-grade is exposed. +package main + +import ( + "net/http" + "time" + + "github.com/gorilla/websocket" +) + +// Reap connections whose client went silent; the probe pings every 10s, so 90s +// of silence means the peer is gone even if TCP never noticed. +const probeIdleLimit = 90 * time.Second + +var probeUpgrader = websocket.Upgrader{ + // No cookies or credentials are at stake on an echo endpoint, and the + // probe connects without a browser Origin — checking it would only break it. + CheckOrigin: func(*http.Request) bool { return true }, +} + +func registerProbe(mux *http.ServeMux) { + mux.HandleFunc("/probe/healthz", func(w http.ResponseWriter, _ *http.Request) { + _, _ = w.Write([]byte("ok\n")) + }) + mux.HandleFunc("/probe/ws", func(w http.ResponseWriter, r *http.Request) { + c, err := probeUpgrader.Upgrade(w, r, nil) + if err != nil { + return // Upgrade has already written the HTTP error + } + defer c.Close() + for { + if err := c.SetReadDeadline(time.Now().Add(probeIdleLimit)); err != nil { + return + } + mt, msg, err := c.ReadMessage() + if err != nil { + return + } + if err := c.WriteMessage(mt, msg); err != nil { + return + } + } + }) +} diff --git a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf index 583900fe..5244db0e 100644 --- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf +++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf @@ -57,6 +57,9 @@ locals { "instagram-poster-image" = "https://instagram-poster.viktorbarzin.me/image" # trading-bot app root (auth="app"): WebAuthn/JWT in-app; was walled, now 200. "trading-bot-app" = "https://trading.viktorbarzin.me/" + # t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo + # + healthz for the t3-probe drop-attribution client (stacks/t3code). + "t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz" # NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed — it # has no public DNS record (NXDOMAIN, external_monitor=false), so there is no # externally GET-able URL to probe. Its carve-out is internal-only. diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl index 732b4e68..c9176867 100755 --- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl +++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl @@ -2541,6 +2541,22 @@ serverFiles: severity: warning annotations: summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace" + - alert: T3ProbeLegDown + expr: t3probe_connected{job="t3-probe"} == 0 + for: 5m + labels: + severity: warning + annotations: + summary: "A t3 path-probe leg has been down >5m (leg label says which)" + description: "cloudflare-only = Cloudflare/WAN segment; cloudflare+internal = Traefik/dispatch/devvm; t3serve = the serve process. See docs/runbooks/t3-drop-attribution.md." + - alert: T3ProbeDropBurst + expr: increase(t3probe_disconnects_total{job="t3-probe"}[15m]) > 6 + for: 1m + labels: + severity: warning + annotations: + summary: "A t3 path-probe leg is dropping repeatedly (>6 in 15m; see leg/reason labels)" + description: "Users on the same segment are seeing 'disconnected, reconnecting' at this rate. Compare legs to attribute; correlate with devvm node_pressure_* metrics." - alert: ViktorBarzinApexDrift expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0 for: 10m @@ -3110,6 +3126,30 @@ extraScrapeConfigs: | - source_labels: [__address__] target_label: instance replacement: 'pve-node-r730' # Giving it a friendly name + # devvm: the shared workstation VM hosting per-user t3-serve + Claude agents. + # Its node_exporter ran unscraped until 2026-06-10 — the t3 disconnect + # root-cause work had NO memory/IO-pressure history for the very box whose + # stalls fire every t3 client's watchdog. Pressure/swap/load here is the + # primary correlate for t3probe drop events. + - job_name: 'devvm' + static_configs: + - targets: + - "10.0.10.10:9100" + labels: + node: 'devvm' + metrics_path: '/metrics' + relabel_configs: + - source_labels: [__address__] + target_label: instance + replacement: 'devvm' # Giving it a friendly name + # t3-probe: differential t3 path-health prober (stacks/t3code). Legs: + # cloudflare (full public path), internal (Traefik only), t3serve (the + # serve process). See docs/runbooks/t3-drop-attribution.md. + - job_name: 't3-probe' + static_configs: + - targets: + - "t3-probe.t3code.svc.cluster.local:9108" + metrics_path: '/metrics' # rpi-sofia: external Raspberry Pi 3 at the Sofia home site (Frigate camera # DNAT passthrough + solar inverter path + HA MQTT sensors). node_exporter # installed via apt; the rpi_* metrics come from a vcgencmd textfile collector diff --git a/stacks/t3code/main.tf b/stacks/t3code/main.tf index 9e23ff1f..75a8e542 100644 --- a/stacks/t3code/main.tf +++ b/stacks/t3code/main.tf @@ -94,3 +94,118 @@ module "ingress" { "gethomepage.dev/pod-selector" = "" } } + +# === Drop-attribution probe surface ========================================== +# /probe/* on the t3 host is dispatch's unauthenticated echo surface (see +# scripts/t3-dispatch/probe.go) for the t3-probe below. Guarded against +# Authentik re-walling by `authentik_walloff_targets` in stacks/monitoring. +module "ingress_probe" { + source = "../../modules/kubernetes/ingress_factory" + # auth = "none": WS echo + healthz for the in-cluster path-health probe; no + # user data, no t3 instance reachable — auth would break the synthetic client. + auth = "none" + anti_ai_scraping = false # the probe IS a bot; PoW/UA filtering would block it + dns_type = "none" # main `module.ingress` owns the DNS record for this host + namespace = kubernetes_namespace.t3code.metadata[0].name + name = "t3-probe" + service_name = kubernetes_service.t3code.metadata[0].name + full_host = "t3.viktorbarzin.me" + ingress_path = ["/probe"] + tls_secret_name = var.tls_secret_name +} + +# t3-probe: differential WS/HTTP prober (see probe.py docstring for the +# attribution model). Runs in-cluster so it measures the shared path WITHOUT +# any user's last mile; Prometheus scrapes it via the static `t3-probe` job +# in stacks/monitoring. +resource "kubernetes_config_map_v1" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + } + data = { + "probe.py" = file("${path.module}/probe.py") + } +} + +resource "kubernetes_deployment_v1" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + labels = { app = "t3-probe" } + } + spec { + replicas = 1 + selector { + match_labels = { app = "t3-probe" } + } + template { + metadata { + labels = { app = "t3-probe" } + annotations = { + "checksum/probe" = sha256(file("${path.module}/probe.py")) + } + } + spec { + container { + name = "probe" + image = "python:3.12-alpine" + # Long-running pod, not a high-cadence CronJob: a one-time pinned + # pip install at start (with retries against transient DNS) is the + # lightweight alternative to owning a registry image for ~200 lines. + command = ["sh", "-c", <<-EOT + for i in 1 2 3 4 5; do + pip install --no-cache-dir --quiet aiohttp==3.9.5 prometheus-client==0.20.0 && break + echo "pip attempt $i failed; retrying" >&2; sleep 10 + done + exec python /app/probe.py + EOT + ] + port { + container_port = 9108 + name = "metrics" + } + volume_mount { + name = "app" + mount_path = "/app" + read_only = true + } + resources { + requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "192Mi" + } + } + } + volume { + name = "app" + config_map { + name = kubernetes_config_map_v1.t3_probe.metadata[0].name + } + } + } + } + } + lifecycle { + ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 + } +} + +resource "kubernetes_service" "t3_probe" { + metadata { + name = "t3-probe" + namespace = kubernetes_namespace.t3code.metadata[0].name + labels = { app = "t3-probe" } + } + spec { + selector = { app = "t3-probe" } + port { + name = "metrics" + port = 9108 + target_port = 9108 + } + } +} diff --git a/stacks/t3code/probe.py b/stacks/t3code/probe.py new file mode 100644 index 00000000..ea3da93b --- /dev/null +++ b/stacks/t3code/probe.py @@ -0,0 +1,201 @@ +"""t3-probe: differential path-health probe behind the t3 drop attribution. + +Holds long-lived WebSockets to t3-dispatch's /probe/ws echo endpoint via two +routes that differ ONLY in the Cloudflare segment, plus an HTTP heartbeat +against the t3-serve process itself: + + leg=cloudflare wss://T3_HOST/probe/ws connected to the address PUBLIC DNS + returns (DoH @1.1.1.1) -> WAN -> CF edge -> tunnel -> + cloudflared -> Traefik -> t3-dispatch + leg=internal same URL pinned to the internal Traefik LB -> Traefik -> + t3-dispatch (no Cloudflare) + leg=t3serve GET http://DEVVM:3773/api/auth/session every 10s; an + event-loop stall in the user's `t3 serve` delays/times-out + this regardless of auth + +Attribution: cloudflare drops alone -> Cloudflare/WAN segment; cloudflare + +internal together -> Traefik/dispatch/devvm network; t3serve latency spikes -> +the serve process (memory/IO stalls); all legs clean while a human drops -> +their last mile, infra exonerated. Mirrors the real t3 client's resilience +protocol (10s heartbeat, ~20s watchdog) so probe drops mean a real client +would have dropped too. +""" + +import asyncio +import json +import logging +import time + +import aiohttp +from aiohttp.abc import AbstractResolver +from prometheus_client import Counter, Gauge, Histogram, start_http_server + +T3_HOST = "t3.viktorbarzin.me" +TRAEFIK_LB = "10.0.20.203" +DEVVM = "10.0.10.10" +T3_SERVE_PORT = 3773 +DOH_URL = "https://1.1.1.1/dns-query" +HEARTBEAT_SECONDS = 10 +RTT_TIMEOUT_SECONDS = 20 # mirror the t3 client watchdog +METRICS_PORT = 9108 + +log = logging.getLogger("t3probe") + +CONNECTED = Gauge("t3probe_connected", "1 while the leg's connection is up", ["leg"]) +DISCONNECTS = Counter( + "t3probe_disconnects_total", "Connection deaths by leg and reason", ["leg", "reason"] +) +RTT = Histogram( + "t3probe_rtt_seconds", + "Heartbeat round-trip (WS echo / HTTP GET) per leg", + ["leg"], + buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20], +) +CONNECTION_AGE = Histogram( + "t3probe_connection_age_seconds", + "Age of a WS connection when it died", + ["leg"], + buckets=[10, 30, 60, 300, 900, 3600, 14400, 86400], +) +LAST_DISCONNECT = Gauge( + "t3probe_last_disconnect_timestamp", "Unix time of the leg's last death", ["leg"] +) + + +class PinnedResolver(AbstractResolver): + """Resolve T3_HOST to one fixed address; Host/SNI/cert stay hostname-true.""" + + def __init__(self, address): + self.address = address + + async def resolve(self, host, port=0, family=0): + return [ + { + "hostname": host, + "host": self.address, + "port": port, + "family": 2, # AF_INET + "proto": 0, + "flags": 0, + } + ] + + async def close(self): + pass + + +class DoHResolver(AbstractResolver): + """Resolve via Cloudflare DoH so the answer is the PUBLIC (proxied) one. + + The cluster's own DNS is split-horizon since 2026-06-10 (pods get internal + answers for *.viktorbarzin.me), which would silently collapse this leg + onto the internal route — public resolution must bypass it. + """ + + async def resolve(self, host, port=0, family=0): + async with aiohttp.ClientSession() as s: + async with s.get( + DOH_URL, + params={"name": host, "type": "A"}, + headers={"accept": "application/dns-json"}, + timeout=aiohttp.ClientTimeout(total=10), + ) as resp: + answers = (await resp.json(content_type=None)).get("Answer", []) + addrs = [a["data"] for a in answers if a.get("type") == 1] + if not addrs: + raise OSError(f"DoH returned no A records for {host}") + return [ + { + "hostname": host, + "host": addrs[0], + "port": port, + "family": 2, + "proto": 0, + "flags": 0, + } + ] + + async def close(self): + pass + + +async def ws_leg(leg, resolver): + url = f"wss://{T3_HOST}/probe/ws" + attempts = 0 + while True: + CONNECTED.labels(leg).set(0) + established = None + reason = "connect_failed" + try: + connector = aiohttp.TCPConnector(resolver=resolver, force_close=True) + async with aiohttp.ClientSession(connector=connector) as session: + async with session.ws_connect( + url, timeout=aiohttp.ClientWSTimeout(ws_close=10), heartbeat=None + ) as ws: + established = time.monotonic() + attempts = 0 + CONNECTED.labels(leg).set(1) + log.info("%s: connected", leg) + while True: + sent = time.monotonic() + await ws.send_str(f"ping {time.time_ns()}") + msg = await ws.receive(timeout=RTT_TIMEOUT_SECONDS) + if msg.type != aiohttp.WSMsgType.TEXT: + reason = f"closed_{msg.type.name.lower()}" + break + RTT.labels(leg).observe(time.monotonic() - sent) + await asyncio.sleep(HEARTBEAT_SECONDS) + except asyncio.TimeoutError: + reason = "rtt_timeout" if established else "connect_timeout" + except (aiohttp.ClientError, OSError) as e: + reason = "connect_failed" if not established else "io_error" + log.warning("%s: %s: %s", leg, reason, e) + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, reason).inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + if established is not None: + CONNECTION_AGE.labels(leg).observe(time.monotonic() - established) + log.info("%s: died after %.0fs (%s)", leg, time.monotonic() - established, reason) + attempts += 1 + await asyncio.sleep(min(3 * attempts, 30)) + + +async def t3serve_leg(): + leg = "t3serve" + url = f"http://{DEVVM}:{T3_SERVE_PORT}/api/auth/session" + timeout = aiohttp.ClientTimeout(total=RTT_TIMEOUT_SECONDS) + async with aiohttp.ClientSession(timeout=timeout) as session: + while True: + sent = time.monotonic() + try: + async with session.get(url) as resp: + await resp.read() + RTT.labels(leg).observe(time.monotonic() - sent) + CONNECTED.labels(leg).set(1 if resp.status == 200 else 0) + if resp.status != 200: + DISCONNECTS.labels(leg, f"http_{resp.status}").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + except asyncio.TimeoutError: + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, "rtt_timeout").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + except (aiohttp.ClientError, OSError) as e: + CONNECTED.labels(leg).set(0) + DISCONNECTS.labels(leg, "connect_failed").inc() + LAST_DISCONNECT.labels(leg).set(time.time()) + log.warning("%s: %s", leg, e) + await asyncio.sleep(HEARTBEAT_SECONDS) + + +async def main(): + logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s") + start_http_server(METRICS_PORT) + await asyncio.gather( + ws_leg("cloudflare", DoHResolver()), + ws_leg("internal", PinnedResolver(TRAEFIK_LB)), + t3serve_leg(), + ) + + +if __name__ == "__main__": + asyncio.run(main())