Commit graph

8 commits

Author SHA1 Message Date
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
d27df1f321 t3code: dispatch — strip @domain from X-authentik-username (Authentik injects email)
Authentik injects the full email (e.g. vbarzin@gmail.com), but /etc/ttyd-user-map
and dispatch.json key on the local part (vbarzin), so every real login hit
403 'no instance provisioned'. Strip @domain before lookup, matching the
terminal stack's tmux-attach.sh. Verified: vbarzin@gmail.com / emil.barzin@gmail.com
-> 302 (own instance); unmapped/no-header -> 403.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
9f551e3c13 t3code: harden dispatch — dedicated user + validated t3-mint + scoped sudoers
Run t3-dispatch as an unprivileged dedicated user instead of wizard (who has
full sudo). Privileged minting goes through /usr/local/bin/t3-mint, which
validates the target against /etc/ttyd-user-map before minting as that user;
sudoers permits t3-dispatch to run only that wrapper. Compromise of the
network-facing service can mint pairing tokens for mapped users at most.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
0472f67d49 t3code: devvm dispatch + auto-pair service (Go)
Routes X-authentik-username -> per-user t3 instance; on no t3_session
cookie, mints a pairing token (as the OS user) and exchanges it at
/api/auth/bootstrap, injecting the session cookie. Listens :3780, reads
/etc/t3-serve/dispatch.json. Constants from the Task-1 auth-contract spike.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00