t3: differential drop-attribution probe + devvm metrics

Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; if every leg stays clean
while a user drops, the data exonerates infra. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
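
The leg-comparison rule above can be sketched as a small lookup. This is an illustrative sketch only, not the t3-probe's actual code: the function name `attribute` and the verdict strings are hypothetical, and the mapping simply restates the leg descriptions in this commit (cloudflare-only, cloudflare+internal, t3serve-only, all clean).

```python
from typing import Iterable

# Leg names are taken from the commit message; everything else is assumed.
LEGS = ("cloudflare", "internal", "t3serve")


def attribute(down_legs: Iterable[str]) -> str:
    """Map the set of currently-down probe legs to the suspected segment."""
    down = set(down_legs)
    unknown = down - set(LEGS)
    if unknown:
        raise ValueError(f"unknown legs: {sorted(unknown)}")
    if not down:
        # Every leg is clean: if a user still drops, infra is exonerated.
        return "infra clean"
    if down == {"cloudflare"}:
        # Only the full public path broke; the internal path is healthy.
        return "Cloudflare/WAN"
    if down == {"cloudflare", "internal"}:
        # Both WS legs broke, so the fault is at or behind the Traefik LB.
        return "Traefik/dispatch/devvm"
    if down == {"t3serve"}:
        return "t3-serve process"
    # Any other combination is not covered by the stated rule.
    return "inconclusive"
```

Combinations the commit does not describe (for example, 'internal' down alone) are deliberately reported as inconclusive rather than guessed at.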

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
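
As a rough illustration of what the two alert conditions test, the sketch below evaluates them over raw (timestamp, value) samples. It is a simplification under stated assumptions: the helper names are hypothetical, samples are assumed sorted by time, and unlike PromQL's increase() it does not extrapolate to the window edges. The DropBurst rule's additional 'for: 1m' hold is not modelled.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix seconds, metric value), sorted by time


def counter_increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Sum of counter growth inside the window, tolerating counter resets."""
    recent = [v for t, v in samples if now - window_s <= t <= now]
    total = 0.0
    for prev, cur in zip(recent, recent[1:]):
        # A decrease means the process restarted and the counter began at 0.
        total += cur - prev if cur >= prev else cur
    return total


def drop_burst(samples: List[Sample], now: float,
               window_s: float = 15 * 60, threshold: float = 6) -> bool:
    """Approximates: increase(t3probe_disconnects_total[15m]) > 6."""
    return counter_increase(samples, now, window_s) > threshold


def leg_down(samples: List[Sample], now: float, for_s: float = 5 * 60) -> bool:
    """Approximates: t3probe_connected == 0, held for the whole 'for' period."""
    recent = [v for t, v in samples if now - for_s <= t <= now]
    return bool(recent) and all(v == 0 for v in recent)
```

For example, a disconnect counter sampled as 0, 3, 8 over ten minutes gives an increase of 8 and trips the burst condition, while a leg that reconnects even once inside the five-minute window does not count as down.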
Viktor Barzin 2026-06-10 21:11:29 +00:00
parent b5c6639272
commit 9b55d53be0
11 changed files with 548 additions and 1 deletions

@@ -57,6 +57,9 @@ locals {
"instagram-poster-image" = "https://instagram-poster.viktorbarzin.me/image"
# trading-bot app root (auth="app"): WebAuthn/JWT in-app; was walled, now 200.
"trading-bot-app" = "https://trading.viktorbarzin.me/"
# t3 dispatch probe surface (auth="none" path carve-out on /probe): WS echo
# + healthz for the t3-probe drop-attribution client (stacks/t3code).
"t3-probe-ws" = "https://t3.viktorbarzin.me/probe/healthz"
# NOTE: openclaw task-webhook (auth="none") is intentionally NOT probed: it
# has no public DNS record (NXDOMAIN, external_monitor=false), so there is no
# externally GET-able URL to probe. Its carve-out is internal-only.

@@ -2541,6 +2541,22 @@ serverFiles:
    severity: warning
  annotations:
    summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
- alert: T3ProbeLegDown
  expr: t3probe_connected{job="t3-probe"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "A t3 path-probe leg has been down >5m (leg label says which)"
    description: "cloudflare-only = Cloudflare/WAN segment; cloudflare+internal = Traefik/dispatch/devvm; t3serve = the serve process. See docs/runbooks/t3-drop-attribution.md."
- alert: T3ProbeDropBurst
  expr: increase(t3probe_disconnects_total{job="t3-probe"}[15m]) > 6
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "A t3 path-probe leg is dropping repeatedly (>6 in 15m; see leg/reason labels)"
    description: "Users on the same segment are seeing 'disconnected, reconnecting' at this rate. Compare legs to attribute; correlate with devvm node_pressure_* metrics."
- alert: ViktorBarzinApexDrift
  expr: viktorbarzin_apex_correct{job="viktorbarzin-apex-probe"} == 0
  for: 10m
@@ -3110,6 +3126,30 @@ extraScrapeConfigs: |
    - source_labels: [__address__]
      target_label: instance
      replacement: 'pve-node-r730' # Giving it a friendly name
# devvm: the shared workstation VM hosting per-user t3-serve + Claude agents.
# Its node_exporter ran unscraped until 2026-06-10: the t3 disconnect
# root-cause work had NO memory/IO-pressure history for the very box whose
# stalls fire every t3 client's watchdog. Pressure/swap/load here is the
# primary correlate for t3probe drop events.
- job_name: 'devvm'
  static_configs:
    - targets:
        - "10.0.10.10:9100"
      labels:
        node: 'devvm'
  metrics_path: '/metrics'
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      replacement: 'devvm' # Giving it a friendly name
# t3-probe: differential t3 path-health prober (stacks/t3code). Legs:
# cloudflare (full public path), internal (Traefik only), t3serve (the
# serve process). See docs/runbooks/t3-drop-attribution.md.
- job_name: 't3-probe'
  static_configs:
    - targets:
        - "t3-probe.t3code.svc.cluster.local:9108"
  metrics_path: '/metrics'
# rpi-sofia: external Raspberry Pi 3 at the Sofia home site (Frigate camera
# DNAT passthrough + solar inverter path + HA MQTT sensors). node_exporter
# installed via apt; the rpi_* metrics come from a vcgencmd textfile collector