Closes the loop on Viktor's ask to find the root cause of the t3 disconnects and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled': every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment:
- 'cloudflare': WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch
- 'internal': the same WS pinned to the Traefik LB, no Cloudflare
- 't3serve': HTTP straight to the serve process
Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data.

Dispatch gains an unauthenticated /probe/ws echo and /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe.

Also starts scraping devvm's node_exporter (job 'devvm'). It ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history.

Alerts: T3ProbeLegDown, T3ProbeDropBurst. Runbook: docs/runbooks/t3-drop-attribution.md.
49 lines
1.5 KiB
Go
// probe.go: unauthenticated path-health surface for the in-cluster t3-probe.
// /probe/* is carved out of Authentik (stacks/t3code `module "ingress_probe"`)
// so a synthetic client can hold a long-lived WebSocket here via two routes
// (Cloudflare edge vs internal Traefik) and attribute connection drops to a
// path segment. It echoes tiny frames and reaches no t3 instance, so nothing
// user-grade is exposed.
package main

import (
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

// Reap connections whose client went silent; the probe pings every 10s, so 90s
// of silence means the peer is gone even if TCP never noticed.
const probeIdleLimit = 90 * time.Second

var probeUpgrader = websocket.Upgrader{
	// No cookies or credentials are at stake on an echo endpoint, and the
	// probe connects without a browser Origin; checking it would only break it.
	CheckOrigin: func(*http.Request) bool { return true },
}

func registerProbe(mux *http.ServeMux) {
	mux.HandleFunc("/probe/healthz", func(w http.ResponseWriter, _ *http.Request) {
		_, _ = w.Write([]byte("ok\n"))
	})
	mux.HandleFunc("/probe/ws", func(w http.ResponseWriter, r *http.Request) {
		c, err := probeUpgrader.Upgrade(w, r, nil)
		if err != nil {
			return // Upgrade has already written the HTTP error
		}
		defer c.Close()
		for {
			if err := c.SetReadDeadline(time.Now().Add(probeIdleLimit)); err != nil {
				return
			}
			mt, msg, err := c.ReadMessage()
			if err != nil {
				return
			}
			if err := c.WriteMessage(mt, msg); err != nil {
				return
			}
		}
	})
}