Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)
t3 on the devvm auto-tracks the nightly npm dist-tag (Viktor, 2026-06-16,
risk explicitly accepted), via the daily t3-autoupdate timer. Every bump is
GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses
the post-incident pin decision — read 2026-06-09-t3-nightly-autoupdate-auth-outage.md
for why every guard below exists. t3 is still pre-1.0 and ships breaking changes
between builds; the gate is what makes auto-tracking safe.
How the tracker gates each bump (scripts/t3-autoupdate.sh)
1. Freeze gate: /etc/t3-autoupdate.freeze present (or T3_PIN=<ver> set) → hold at current, do nothing.
2. Resolve + downgrade-guard: npm view t3@nightly version; proceed only if the target is strictly newer than installed AND a "-nightly." build (the tag is mutable and can point backward).
3. Pre-bump backup: online VACUUM INTO of every user's state.sqlite to /var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite (runs AS the owner; never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
4. Install + health-check: npm i -g t3@<ver>, then start a throwaway serve SEEDED WITH A COPY of wizard's real populated state.sqlite (scratch on /var/tmp, not the 2 GB tmpfs /tmp) so it exercises the forward MIGRATION (the 2026-06-09 failure class) + the real mint→exchange→t3_session pairing handshake. Fail → roll back binary to last-good, exit (no serve migrated yet → clean).
5. Canary rollout: restart IDLE instances one at a time, verifying pairing through the real dispatch after each. First failure → roll back binary + restore that user's DB from the pre-bump backup + self-freeze (touch the freeze file) so it cannot re-flap onto bad builds. Active-agent instances are DEFERRED (never killed) and migrate on their next idle restart.
6. Last-good: advanced to the new version only on full success (/var/lib/t3-autoupdate/last-good); it is the rollback target.
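The freeze gate and the downgrade-guard can be sketched roughly as below. This is an illustration, not the contents of scripts/t3-autoupdate.sh: the function names is_frozen and should_bump, the FREEZE_FILE override, and the use of sort -V for ordering are all assumptions, and sort -V is not semver-aware.

```shell
# Hypothetical helpers illustrating the freeze gate and downgrade-guard;
# the real script may implement these checks differently.
FREEZE_FILE="${FREEZE_FILE:-/etc/t3-autoupdate.freeze}"

# Freeze gate: frozen if the freeze file exists or a pin is set.
is_frozen() {
  [ -e "$FREEZE_FILE" ] || [ -n "${T3_PIN:-}" ]
}

# Downgrade-guard: succeed only if target ($2) is a nightly build
# strictly newer than installed ($1).
should_bump() {
  local installed="$1" target="$2" newest
  case "$target" in
    *-nightly.*) ;;          # only nightly builds are eligible
    *) return 1 ;;
  esac
  [ "$installed" != "$target" ] || return 1   # equal is not "strictly newer"
  # Assumption: version sort orders these strings correctly, newest last.
  newest=$(printf '%s\n%s\n' "$installed" "$target" | sort -V | tail -n 1)
  [ "$newest" = "$target" ]
}
```

With made-up version strings, should_bump 0.0.25-nightly.20260610 0.0.25-nightly.20260609 fails, so a dist-tag that moves backward would be ignored.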
Detection backstop (real-user pairing failures / endpoint fallback): the dispatch
logs every outcome (paired user=.. endpoint=.. fallback=.., plus mint/pairing ... failed) → Loki alerts T3PairingBroken / T3PairFallbackHigh /
T3AutoUpdateRolledBack / T3AutoUpdateRollbackFailed / T3AutoUpdateFrozen →
Alertmanager → Slack.
Idle migrator — draining deferrals (scripts/t3-migrate-idle.sh)
Step 5 DEFERS any instance with an active agent, recording /var/lib/t3-autoupdate/deferred/<user> (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows "Client and server versions differ" for days. t3-migrate-idle.timer (overnight, every 20 min 01:00–05:40) drains those markers:
- Per marker: skip + clear if the unit is gone or was already restarted after the deferral; otherwise restart the still-stale t3-serve@<u> onto the current binary only when that user is idle: state.sqlite shows zero active_turn_id (no in-flight turn) AND ≥ T3_MIGRATE_QUIET_SECONDS (default 900 = 15 min) since the last thread activity. Then verify pairing and clear the marker. Fail-closed: any query/parse doubt → skip, retry next tick.
- It restarts via the SAME safe_restart_unit the daily canary uses (sourced t3-safe-restart.sh: backup → restart → verify → recover). The shared /etc/t3-autoupdate.freeze halts it too.
- Force / preview:
  sudo systemctl start t3-migrate-idle.service            # run a drain pass now (still idle-gated)
  sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle    # log decisions, act on nothing
- Rare-tail failure: if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), safe_restart_unit restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
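The idle test from the first bullet can be sketched as a small fail-closed predicate. The function name is_idle and its three arguments are hypothetical; how the migrator actually reads the active-turn count and last-activity time out of state.sqlite is not described here, so the sketch takes them as inputs.

```shell
# Hypothetical sketch of the idle decision; not the code in t3-migrate-idle.sh.
T3_MIGRATE_QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}"

# is_idle ACTIVE_TURNS LAST_ACTIVITY_EPOCH NOW_EPOCH
# Succeeds only when there is no in-flight turn and the quiet period has passed.
is_idle() {
  local active="$1" last="$2" now="$3" v
  for v in "$active" "$last" "$now"; do
    case "$v" in
      # Fail closed: reject empty values, non-digits, and leading zeros
      # (which shell arithmetic would treat as octal).
      ''|*[!0-9]*|0?*) return 1 ;;
    esac
  done
  [ "$active" -eq 0 ] && [ $((now - last)) -ge "$T3_MIGRATE_QUIET_SECONDS" ]
}
```

Any input that does not parse as a plain integer counts as "not idle", which matches the skip-and-retry-next-tick behaviour described above.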
Operations
Freeze / revert (stop tracking right now — the fast "make it stop"):
sudo touch /etc/t3-autoupdate.freeze # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze # resume tracking
Pin to an exact version (instead of tracking nightly): set T3_PIN=<ver> in
the unit environment (or the scripts/t3-autoupdate.sh default) — the tracker
enforces it and stops following nightly. Keep in sync with setup-devvm.sh.
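One way to set T3_PIN in the unit environment is a systemd drop-in. The sketch below assumes the tracker picks T3_PIN up from its environment as described above; the helper name write_pin_dropin and the drop-in file name pin.conf are made up for illustration.

```shell
# Hypothetical helper: write a drop-in that pins the tracker to one version.
# write_pin_dropin VERSION [SYSTEMD_DIR]
write_pin_dropin() {
  local ver="$1" dir="${2:-/etc/systemd/system}/t3-autoupdate.service.d"
  mkdir -p "$dir" || return 1
  printf '[Service]\nEnvironment=T3_PIN=%s\n' "$ver" > "$dir/pin.conf"
}

# Usage (as root), then reload so systemd sees the drop-in:
#   write_pin_dropin <ver> && systemctl daemon-reload
```

Removing the drop-in and reloading would undo the pin; the note about keeping setup-devvm.sh in sync still applies.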
Preview the current nightly without touching anything (no global change, no restarts):
sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL
Force a run now (instead of waiting for 04:00):
sudo systemctl start t3-autoupdate.service # runs in its own cgroup, isolated from the t3-serve@ instances it manages
What a bump touches (still true)
- Pairing API: t3 renamed POST /api/auth/bootstrap → /api/auth/browser-session in 0.0.25. t3-dispatch is version-agnostic (pairEndpoints in scripts/t3-dispatch/main.go tries browser-session, falls back to bootstrap). If a future build renames it AGAIN, the health-check + canary fail the bump and self-freeze; then add the new path to pairEndpoints, rebuild + redeploy the dispatch, and clear the freeze.
- Schema: 0.0.25+ migrate every ~/.t3/userdata/state.sqlite FORWARD, a one-way door. A binary downgrade alone does NOT roll it back; you must restore the DB. The tracker does this automatically on a canary failure; do it by hand (below) if a problem surfaces after a successful bump.
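The endpoint fallback in the first bullet follows a try-in-order pattern, sketched here in shell for illustration only. The real logic is Go code in scripts/t3-dispatch/main.go and is not reproduced; pick_pair_endpoint and the probe callback are hypothetical names, and the printed fallback= field only mirrors the log format mentioned earlier.

```shell
# Endpoints in preference order, as named in the text above.
PAIR_ENDPOINTS="/api/auth/browser-session /api/auth/bootstrap"

# pick_pair_endpoint PROBE_CMD
# Runs PROBE_CMD <endpoint> for each endpoint in order and prints
# "<endpoint> fallback=<true|false>" for the first one that succeeds.
pick_pair_endpoint() {
  local probe="$1" ep first=true
  for ep in $PAIR_ENDPOINTS; do
    if "$probe" "$ep"; then
      if $first; then
        echo "$ep fallback=false"
      else
        echo "$ep fallback=true"
      fi
      return 0
    fi
    first=false
  done
  return 1   # no known endpoint worked: the case that should fail the gate
}
```

Under this pattern, supporting a renamed endpoint amounts to putting the new path at the front of the list, which corresponds to the "add the new path to pairEndpoints" step.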
Manual rollback (problem surfaces after a bump the gate let through)
GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want
sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST
sudo npm i -g "t3@$GOOD"
# Restore + restart each user's serve. The wizard/admin instance: run this from
# OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
# or just let it pick up $GOOD on its next natural restart.
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
    # newest pre-bump backup for this user; skip users that have none
    bak=$(sudo ls -1t "/var/backups/t3-state/$u"/state-prebump-* 2>/dev/null | head -1)
    [ -n "$bak" ] || continue
    sudo systemctl stop "t3-serve@$u"
    sudo install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite"
    # drop stale WAL/SHM files so they are not replayed over the restored DB
    sudo rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
    sudo systemctl start "t3-serve@$u"
done
Verify (any user pairs cleanly through the dispatch)
for u in vbarzin emil.barzin ancaelena98; do
    curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version
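To turn the eyeball check above into a pass/fail result, the response headers can be piped through a small filter. This is a sketch: pairing_ok is a hypothetical helper, and it assumes the 302 status and the t3_session cookie are the only two things worth checking.

```shell
# pairing_ok: reads HTTP response headers on stdin and succeeds only if
# the status is 302 and a t3_session cookie is being set.
pairing_ok() {
  awk '
    { sub(/\r$/, "") }                                   # strip CR from CRLF headers
    NR == 1 && $2 == "302" { redirect = 1 }
    tolower($0) ~ /^set-cookie:[ \t]*t3_session/ { cookie = 1 }
    END { exit !(redirect && cookie) }
  '
}

# Usage, with the same request as above:
#   curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | pairing_ok \
#     || echo "pairing FAILED for $u"
```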
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
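For reference, an online snapshot of this kind can be taken with the sqlite3 CLI roughly as follows. This is a sketch under assumptions, not the tracker's code: backup_state is a hypothetical name, the timestamp format is a guess, and the run-as-owner step is left out.

```shell
# backup_state SRC_DB DEST_DIR VERSION
# Writes an online snapshot of SRC_DB and prints the path of the new file.
backup_state() {
  local src="$1" dest_dir="$2" ver="$3" ts out
  ts=$(date +%Y%m%dT%H%M%S)      # assumed format for the <ts> part of the name
  out="$dest_dir/state-prebump-$ver-$ts.sqlite"
  mkdir -p "$dest_dir" || return 1
  # VACUUM INTO copies a live database without stopping its writer and fails
  # if the target already exists and is not empty, so nothing is overwritten.
  # Caveat: a single quote in the path would break this SQL string.
  sqlite3 "$src" "VACUUM INTO '$out'" && printf '%s\n' "$out"
}
```

VACUUM INTO needs SQLite 3.27 or newer; the resulting file is a standalone database with no WAL or SHM sidecar files, which is why a restore is a plain file copy.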