infra/docs/runbooks/t3-drop-attribution.md
Viktor Barzin 9b19caff47
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
t3: connection logging across the path for drop attribution
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00

126 lines
6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# t3 drop attribution — "is it infra or my network?"
When a t3 user reports "disconnects, then self-recovers after a few seconds",
that is the t3 **client watchdog**: the browser heartbeats every 10s and force-
reconnects after >20s without a response. Any stall or break anywhere on
browser → Cloudflare → tunnel → Traefik → t3-dispatch → `t3 serve` produces
the identical symptom. This runbook attributes a drop to a segment in minutes.
## 1. Check the probe (first stop)
The in-cluster `t3-probe` (stacks/t3code, scrape job `t3-probe`) holds three
permanent legs that differ only in path segment:
| leg | path under test | drop means |
|---|---|---|
| `cloudflare` | WAN → CF edge → tunnel → cloudflared → Traefik → dispatch | Cloudflare/WAN segment |
| `internal` | Traefik LB (10.0.20.203) → dispatch (no Cloudflare) | Traefik / dispatch / devvm network |
| `t3serve` | HTTP straight to devvm:3773 (`t3 serve` process) | the serve process itself (event-loop stall) |
Prometheus queries:
```promql
increase(t3probe_disconnects_total[1h]) # drops per leg+reason
t3probe_connected # current state per leg
histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m]))
```
Attribution table:
- `cloudflare` drops, `internal` clean → Cloudflare edge / QUIC tunnel / WAN.
- both WS legs drop together → Traefik, dispatch, or devvm reachability.
- `t3serve` RTT spikes / timeouts → the user's `t3 serve` stalled (see §3).
- **all legs clean while the user dropped → their last mile / device. Infra
is exonerated, with data.**
Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage.
## 1b. Connection logs in Loki (passive, always-on — catch a real drop)
Three layers of the real path log every t3 `/ws` connection to Loki, so a drop
the user actually experienced is attributable after the fact without a repro. A
drop is **a short-lived `/ws` connection** (a healthy session holds one socket
for hours); the client's 20s heartbeat watchdog reconnects on any break.
| Layer | Loki stream | What it tells you |
|---|---|---|
| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP |
| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) |
| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`** — the discriminator below |
`cause` on the dispatch `ws close` line:
- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down
(`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile /
network-quality** drop (or CF/tunnel blip); t3-serve was fine.
- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF
/ refused) → t3-serve stall/restart/OOM.
- **`graceful`** — clean close from either side (e.g. the client watchdog's
`disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+
graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose
stall was NOT on devvm → last-mile.
Triage query (Grafana Explore → Loki) — every short t3 socket in a window:
```logql
{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close"
| regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000
```
Line the timestamp up against `{job="traefik"}` (duration + edge IP) and
`{job="cloudflared"}` (CF-side close) for the same second to localise the layer.
devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`.
## 2. Server-side log recipe (per-event forensics)
On devvm (timestamps in UTC):
```bash
# dispatch view — error class identifies which side died:
# "context canceled" = front/client side tore down
# "connection reset by peer 127.0.0.1:PORT" = that user's serve closed
# "connection refused" = that user's serve was down
journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error"
# mass-cancel bursts (many same-second cancels = shared-segment break):
journalctl -u t3-dispatch --since "6 hours ago" \
| grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \
| awk '{print $6}' | sort | uniq -c | awk '$1>=5'
# serve-side starvation markers (git taking >5s = devvm frozen):
journalctl -u t3-serve@<user> --since "6 hours ago" | grep "timed out"
# tunnel-side: cloudflared pod restarts + per-connection events
kubectl -n cloudflared get pods
kubectl -n cloudflared logs <pod> --since=6h | grep -E "ERR|reconnect"
```
## 3. devvm pressure correlation
devvm node_exporter is scraped as job `devvm` (since 2026-06-10). The known
high-frequency drop mechanism is **memory+IO pressure on devvm**: agent
processes live inside `t3-serve@<user>`'s cgroup; a runaway agent swap-thrashes
the spinning root disk and freezes the box in multi-10s windows — every
connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global
OOM → 8.5min hard outage).
```promql
rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m])
rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m])
node_memory_SwapFree_bytes{instance="devvm"}
```
Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit
`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue`
a runaway agent now OOMs alone inside the cgroup instead of taking the box
(and the WS server) with it.
## 4. Known root causes (2026-06-10 investigation)
1. **devvm memory/IO storms** (high-frequency mechanism) — §3.
2. **cloudflared in-place autoupdate** — fixed: `--no-autoupdate`
(stacks/cloudflared). Before the fix every CF release exited all 3 pods
(code 11), severing all tunnel WebSockets.
3. **QUIC tunnel churn** (~12/day, "no recent network activity") — inherent;
visible as `cloudflare`-leg-only blips.
4. **t3 nightly autoupdate** — pinned after the 2026-06-09 outage, see
`docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md`.