Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker (how each bump is gated, plus freeze/revert/pin/dry-run/manual-rollback ops).
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker; replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy 2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)
t3 on the devvm auto-tracks the nightly npm dist-tag (Viktor, 2026-06-16,
risk explicitly accepted), via the daily t3-autoupdate timer. Every bump is
GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses
the post-incident pin decision — read 2026-06-09-t3-nightly-autoupdate-auth-outage.md
for why every guard below exists. t3 is still pre-1.0 and ships breaking changes
between builds; the gate is what makes auto-tracking safe.
How the tracker gates each bump (scripts/t3-autoupdate.sh)
- Freeze gate: /etc/t3-autoupdate.freeze present (or T3_PIN=<ver> set) → hold at current, do nothing.
- Resolve + downgrade-guard: npm view t3@nightly version; proceed only if the target is strictly newer than installed AND a -nightly. build (the tag is mutable and can point backward).
- Pre-bump backup: online VACUUM INTO of every user's state.sqlite to /var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite (runs AS the owner; never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
- Install + health-check: npm i -g t3@<ver>, then start a throwaway serve SEEDED WITH A COPY of wizard's real populated state.sqlite (scratch on /var/tmp, not the 2 GB tmpfs /tmp) so it exercises the forward MIGRATION (the 2026-06-09 failure class) + the real mint→exchange→t3_session pairing handshake. Fail → roll back binary to last-good, exit (no serve migrated yet → clean).
- Canary rollout: restart IDLE instances one at a time, verifying pairing through the real dispatch after each. First failure → roll back binary + restore that user's DB from the pre-bump backup + self-freeze (touch the freeze file) so it cannot re-flap onto bad builds. Active-agent instances are DEFERRED (never killed) and migrate on their next idle restart.
- Last-good: advanced to the new version only on full success (/var/lib/t3-autoupdate/last-good); it is the rollback target.
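The resolve + downgrade-guard step can be sketched roughly as follows. This is an illustration only, not the code in scripts/t3-autoupdate.sh: the function names are hypothetical, and GNU `sort -V` is assumed as the version comparator. `sort -V` does not implement full semver pre-release precedence, which is acceptable here only because the guard compares nightly builds against each other.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the downgrade guard; names are assumptions.

# Succeeds if $2 (target) is strictly newer than $1 (installed) per GNU `sort -V`.
is_strictly_newer() {
    local installed=$1 target=$2
    [ "$installed" != "$target" ] || return 1
    [ "$(printf '%s\n%s\n' "$installed" "$target" | sort -V | tail -n 1)" = "$target" ]
}

# Succeeds if the version string carries a "-nightly." pre-release tag.
is_nightly_build() {
    case $1 in
        *-nightly.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds only when both conditions of the guard hold.
should_bump() {
    local installed=$1 target=$2
    is_nightly_build "$target" && is_strictly_newer "$installed" "$target"
}
```

A caller might feed it the installed version and the output of `npm view t3@nightly version`, holding at the current build whenever `should_bump` fails. That covers the case where the mutable tag has been moved backward.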
Detection backstop (real-user pairing failures / endpoint fallback): the dispatch
logs every outcome (paired user=.. endpoint=.. fallback=.., plus mint/pairing ... failed) → Loki alerts T3PairingBroken / T3PairFallbackHigh /
T3AutoUpdateRolledBack / T3AutoUpdateRollbackFailed / T3AutoUpdateFrozen →
Alertmanager → Slack.
Operations
Freeze / revert (stop tracking right now — the fast "make it stop"):
sudo touch /etc/t3-autoupdate.freeze # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze # resume tracking
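To check by hand whether the tracker is currently held, a small helper along these lines may be useful. It is a sketch under assumptions: the function name and output wording are hypothetical, and it only inspects the freeze file and T3_PIN as this runbook describes them. It does not read the systemd unit environment.

```shell
# Hypothetical helper: prints why the tracker would hold, or nothing if it would proceed.
# $1 = freeze file path (defaults to the runbook's path), $2 = pin (defaults to $T3_PIN).
hold_reason() {
    local freeze_file=${1:-/etc/t3-autoupdate.freeze} pin=${2:-${T3_PIN:-}}
    if [ -e "$freeze_file" ]; then
        echo "frozen: $freeze_file present"
    elif [ -n "$pin" ]; then
        echo "pinned: T3_PIN=$pin"
    fi
}
```

Running `hold_reason` with no arguments checks the default freeze file and the caller's own T3_PIN. Empty output means neither hold condition is visible from the current shell.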
Pin to an exact version (instead of tracking nightly): set T3_PIN=<ver> in
the unit environment (or the scripts/t3-autoupdate.sh default) — the tracker
enforces it and stops following nightly. Keep in sync with setup-devvm.sh.
Preview the current nightly without touching anything (no global change, no restarts):
sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL
Force a run now (instead of waiting for 04:00):
sudo systemctl start t3-autoupdate.service # runs in its own cgroup, isolated from the t3-serve@ instances it manages
What a bump touches (still true)
- Pairing API: t3 renamed POST /api/auth/bootstrap → /api/auth/browser-session in 0.0.25. t3-dispatch is version-agnostic (pairEndpoints in scripts/t3-dispatch/main.go tries browser-session, falls back to bootstrap). If a future build renames it AGAIN, the health-check + canary fail the bump and self-freeze. Then add the new path to pairEndpoints, rebuild + redeploy the dispatch, and clear the freeze.
- Schema: 0.0.25+ migrate every ~/.t3/userdata/state.sqlite FORWARD, which is a one-way door. A binary downgrade alone does NOT roll it back; you must restore the DB. The tracker does this automatically on a canary failure; do it by hand (below) if a problem surfaces after a successful bump.
Manual rollback (problem surfaces after a bump the gate let through)
GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want
sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST
sudo npm i -g "t3@$GOOD"
# Restore + restart each user's serve. The wizard/admin instance: run this from
# OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
# or just let it pick up $GOOD on its next natural restart.
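for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
  # Expand the glob under sudo so it still matches if the backup dir is not readable by the invoking user.
  bak=$(sudo sh -c 'ls -1t "$1"/state-prebump-* 2>/dev/null' _ "/var/backups/t3-state/$u" | head -1)
  [ -n "$bak" ] || continue   # no pre-bump backup for this user: skip
  sudo systemctl stop "t3-serve@$u"
  sudo install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite"
  # Drop stale WAL/SHM files so they are not replayed onto the restored DB.
  sudo rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
  sudo systemctl start "t3-serve@$u"
done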
Verify (any user pairs cleanly through the dispatch)
for u in vbarzin emil.barzin ancaelena98; do
  curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version
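The "302 + t3_session" check above is done by eye. A minimal sketch of turning it into a pass/fail test is shown here. The function name `pairing_ok` is hypothetical, and it only inspects the response headers that `curl -sI` prints.

```shell
# Hypothetical helper: reads `curl -sI` response headers on stdin and succeeds
# only if the status is 302 AND a Set-Cookie header for t3_session is present.
pairing_ok() {
    local headers
    headers=$(tr -d '\r')   # strip CRs from the HTTP line endings
    printf '%s\n' "$headers" | grep -qiE '^HTTP/[0-9.]+ 302' &&
        printf '%s\n' "$headers" | grep -qiE '^set-cookie: *t3_session'
}
```

Inside the verify loop it could be used as `curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | pairing_ok || echo "FAIL $u"`, so a broken user stands out without reading raw headers.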
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
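For reference, a pre-bump snapshot of the kind described could be taken roughly like this. It is a sketch, not the tracker's actual code: the function names and the timestamp format are assumptions, the paths are the ones used elsewhere in this runbook, and it assumes the `sqlite3` CLI is installed and the owner can write to the backup directory.

```shell
# Hypothetical helpers; only the path layout comes from this runbook.

# Prints the backup path for a user, target version and timestamp.
prebump_backup_path() {
    local user=$1 ver=$2 ts=$3
    printf '/var/backups/t3-state/%s/state-prebump-%s-%s.sqlite\n' "$user" "$ver" "$ts"
}

# Takes an online snapshot of one user's state.sqlite, running as that user.
snapshot_state() {
    local user=$1 ver=$2 dst
    dst=$(prebump_backup_path "$user" "$ver" "$(date +%Y%m%dT%H%M%S)")   # timestamp format is an assumption
    # VACUUM INTO writes a consistent copy without stopping the serve;
    # it fails if the destination file already exists.
    sudo -u "$user" sqlite3 "/home/$user/.t3/userdata/state.sqlite" "VACUUM INTO '$dst'"
}
```

Because the snapshot is a complete standalone database file, the manual rollback above can restore it with a plain `install`, with no export or replay step.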