infra/docs/runbooks/t3-drop-attribution.md
Viktor Barzin 9b19caff47
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
t3: connection logging across the path for drop attribution
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00

6 KiB
Raw Blame History

t3 drop attribution — "is it infra or my network?"

When a t3 user reports "disconnects, then self-recovers after a few seconds", that is the t3 client watchdog: the browser heartbeats every 10s and force- reconnects after >20s without a response. Any stall or break anywhere on browser → Cloudflare → tunnel → Traefik → t3-dispatch → t3 serve produces the identical symptom. This runbook attributes a drop to a segment in minutes.

1. Check the probe (first stop)

The in-cluster t3-probe (stacks/t3code, scrape job t3-probe) holds three permanent legs that differ only in path segment:

leg path under test drop means
cloudflare WAN → CF edge → tunnel → cloudflared → Traefik → dispatch Cloudflare/WAN segment
internal Traefik LB (10.0.20.203) → dispatch (no Cloudflare) Traefik / dispatch / devvm network
t3serve HTTP straight to devvm:3773 (t3 serve process) the serve process itself (event-loop stall)

Prometheus queries:

increase(t3probe_disconnects_total[1h])          # drops per leg+reason
t3probe_connected                                # current state per leg
histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m]))

Attribution table:

  • cloudflare drops, internal clean → Cloudflare edge / QUIC tunnel / WAN.
  • both WS legs drop together → Traefik, dispatch, or devvm reachability.
  • t3serve RTT spikes / timeouts → the user's t3 serve stalled (see §3).
  • all legs clean while the user dropped → their last mile / device. Infra is exonerated, with data.

Alerts T3ProbeLegDown / T3ProbeDropBurst fire on sustained breakage.

1b. Connection logs in Loki (passive, always-on — catch a real drop)

Three layers of the real path log every t3 /ws connection to Loki, so a drop the user actually experienced is attributable after the fact without a repro. A drop is a short-lived /ws connection (a healthy session holds one socket for hours); the client's 20s heartbeat watchdog reconnects on any break.

Layer Loki stream What it tells you
Traefik {job="traefik"} ⟶ filter t3code-t3 + GET /ws per-connection duration (trailing …ms) + edge (cloudflared pod) IP
cloudflared {job="cloudflared"} ⟶ filter t3.viktorbarzin.me/ws CF-tunnel-side close (ended abruptly: context canceled = browser/CF side hung up)
t3-dispatch {job="devvm-journal",unit="t3-dispatch.service"} |= "ws close" dur_ms + cause — the discriminator below

cause on the dispatch ws close line:

  • downstream_closed — client / Cloudflare / Traefik tore the socket down (context canceled). Short dur_ms = client watchdog firing → a last-mile / network-quality drop (or CF/tunnel blip); t3-serve was fine.
  • upstream_closed — the user's t3 serve closed/reset (reset by peer / EOF / refused) → t3-serve stall/restart/OOM.
  • graceful — clean close from either side (e.g. the client watchdog's disconnect() after a >20s heartbeat gap). Cross-check dur_ms: a ~20s+ graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose stall was NOT on devvm → last-mile.

Triage query (Grafana Explore → Loki) — every short t3 socket in a window:

{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
  | regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000

Line the timestamp up against {job="traefik"} (duration + edge IP) and {job="cloudflared"} (CF-side close) for the same second to localise the layer. devvm journald (incl. t3-serve@<user>) ships via scripts/devvm-promtail.*.

2. Server-side log recipe (per-event forensics)

On devvm (timestamps in UTC):

# dispatch view — error class identifies which side died:
#   "context canceled"                      = front/client side tore down
#   "connection reset by peer 127.0.0.1:PORT" = that user's serve closed
#   "connection refused"                    = that user's serve was down
journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error"

# mass-cancel bursts (many same-second cancels = shared-segment break):
journalctl -u t3-dispatch --since "6 hours ago" \
  | grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \
  | awk '{print $6}' | sort | uniq -c | awk '$1>=5'

# serve-side starvation markers (git taking >5s = devvm frozen):
journalctl -u t3-serve@<user> --since "6 hours ago" | grep "timed out"

# tunnel-side: cloudflared pod restarts + per-connection events
kubectl -n cloudflared get pods
kubectl -n cloudflared logs <pod> --since=6h | grep -E "ERR|reconnect"

3. devvm pressure correlation

devvm node_exporter is scraped as job devvm (since 2026-06-10). The known high-frequency drop mechanism is memory+IO pressure on devvm: agent processes live inside t3-serve@<user>'s cgroup; a runaway agent swap-thrashes the spinning root disk and freezes the box in multi-10s windows — every connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global OOM → 8.5min hard outage).

rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m])
rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
node_memory_SwapFree_bytes{instance="devvm"}

Guardrails in place (2026-06-10, scripts/t3-serve@.service): per-unit MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0, OOMPolicy=continue — a runaway agent now OOMs alone inside the cgroup instead of taking the box (and the WS server) with it.

4. Known root causes (2026-06-10 investigation)

  1. devvm memory/IO storms (high-frequency mechanism) — §3.
  2. cloudflared in-place autoupdate — fixed: --no-autoupdate (stacks/cloudflared). Before the fix every CF release exited all 3 pods (code 11), severing all tunnel WebSockets.
  3. QUIC tunnel churn (~12/day, "no recent network activity") — inherent; visible as cloudflare-leg-only blips.
  4. t3 nightly autoupdate — pinned after the 2026-06-09 outage, see docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md.