infra/docs/runbooks/t3-version-bump.md
Viktor Barzin cdd9ecd199
t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:33:49 +00:00


Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)

t3 on the devvm auto-tracks the nightly npm dist-tag (Viktor, 2026-06-16, risk explicitly accepted), via the daily t3-autoupdate timer. Every bump is GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses the post-incident pin decision — read 2026-06-09-t3-nightly-autoupdate-auth-outage.md for why every guard below exists. t3 is still pre-1.0 and ships breaking changes between builds; the gate is what makes auto-tracking safe.

How the tracker gates each bump (scripts/t3-autoupdate.sh)

  1. Freeze gate: /etc/t3-autoupdate.freeze present (or T3_PIN=<ver> set) → hold at current, do nothing.
  2. Resolve + downgrade-guard: npm view t3@nightly version; proceed only if the target is strictly newer than installed AND a -nightly. build (the tag is mutable and can point backward).
  3. Pre-bump backup: online VACUUM INTO of every user's state.sqlite to /var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite (runs AS the owner; never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
  4. Install + health-check: npm i -g t3@<ver>, then start a throwaway serve SEEDED WITH A COPY of wizard's real populated state.sqlite (scratch on /var/tmp, not the 2 GB tmpfs /tmp) so it exercises the forward MIGRATION (the 2026-06-09 failure class) + the real mint→exchange→t3_session pairing handshake. Fail → roll back binary to last-good, exit (no serve migrated yet → clean).
  5. Canary rollout: restart IDLE instances one at a time, verifying pairing through the real dispatch after each. First failure → roll back binary + restore that user's DB from the pre-bump backup + self-freeze (touch the freeze file) so it cannot re-flap onto bad builds. Active-agent instances are DEFERRED (never killed) and migrate on their next idle restart.
  6. Last-good: advanced to the new version only on full success (/var/lib/t3-autoupdate/last-good); it is the rollback target.
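
Steps 1 and 2 above can be sketched roughly as below. This is an illustration, not the code in scripts/t3-autoupdate.sh: the function names are hypothetical, the version strings in the usage note are made up, and the comparison relies on GNU `sort -V`, which is version sort rather than strict semver ordering.

```shell
# Hypothetical helpers; names and exact checks are assumptions,
# not copied from scripts/t3-autoupdate.sh.

# Step 1: hold if the freeze file exists or T3_PIN is set.
# The path is a parameter only so the sketch can be tried without root.
should_hold() {
  local freeze_file="${1:-/etc/t3-autoupdate.freeze}"
  [ -e "$freeze_file" ] || [ -n "${T3_PIN:-}" ]
}

# Step 2: accept the target only if it is a -nightly. build AND strictly
# newer than the installed version.
is_newer_nightly() {
  local installed="$1" target="$2"
  case "$target" in
    *-nightly.*) ;;        # looks like a nightly build, keep checking
    *) return 1 ;;         # anything else is rejected
  esac
  # Equal versions are not "strictly newer".
  [ "$installed" != "$target" ] || return 1
  # sort -V puts the highest version last; the target must be that one.
  [ "$(printf '%s\n%s\n' "$installed" "$target" | sort -V | tail -n 1)" = "$target" ]
}
```

Usage, with made-up versions: `is_newer_nightly 0.0.27-nightly.20260615 0.0.27-nightly.20260616` succeeds, while swapping the two arguments fails, which is the "tag pointed backward" case the guard exists for.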

Detection backstop (real-user pairing failures / endpoint fallback): the dispatch logs every outcome (paired user=.. endpoint=.. fallback=.., plus mint/pairing ... failed) → Loki alerts T3PairingBroken / T3PairFallbackHigh / T3AutoUpdateRolledBack / T3AutoUpdateRollbackFailed / T3AutoUpdateFrozen → Alertmanager → Slack.

Operations

Freeze / revert (stop tracking right now — the fast "make it stop"):

sudo touch /etc/t3-autoupdate.freeze     # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze     # resume tracking

Pin to an exact version (instead of tracking nightly): set T3_PIN=<ver> in the unit environment (or the scripts/t3-autoupdate.sh default) — the tracker enforces it and stops following nightly. Keep in sync with setup-devvm.sh.
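
One way to set T3_PIN in the unit environment is a systemd drop-in. A minimal sketch, assuming the unit is t3-autoupdate.service (as in the force-run command in this runbook); the helper name and the drop-in file name pin.conf are hypothetical:

```shell
# Hypothetical helper: writes a systemd drop-in that sets T3_PIN for the unit.
# The drop-in file name (pin.conf) is an assumption; any *.conf name works.
write_pin_dropin() {
  local version="$1"
  # Directory is a parameter so the sketch can be tried outside /etc.
  local dropin_dir="${2:-/etc/systemd/system/t3-autoupdate.service.d}"
  mkdir -p "$dropin_dir" || return 1
  printf '[Service]\nEnvironment=T3_PIN=%s\n' "$version" > "$dropin_dir/pin.conf"
}
```

Run it as root with the version to hold (for example `write_pin_dropin 0.0.24`), then `sudo systemctl daemon-reload` so the next timer run sees the variable. Removing the file and reloading returns the unit to its default environment.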

Preview the current nightly without touching anything (no global change, no restarts):

sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate   # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL

Force a run now (instead of waiting for 04:00):

sudo systemctl start t3-autoupdate.service   # runs in its own cgroup, isolated from the t3-serve@ instances it manages

What a bump touches (still true)

  1. Pairing API: t3 renamed POST /api/auth/bootstrap → /api/auth/browser-session in 0.0.25. t3-dispatch is version-agnostic (pairEndpoints in scripts/t3-dispatch/main.go tries browser-session, falls back to bootstrap). If a future build renames it AGAIN, the health-check + canary fail the bump and self-freeze. Then add the new path to pairEndpoints, rebuild + redeploy the dispatch, and clear the freeze.
  2. Schema: 0.0.25+ migrates every ~/.t3/userdata/state.sqlite FORWARD, a one-way door. A binary downgrade alone does NOT roll it back; you must restore the DB. The tracker does this automatically on a canary failure; do it by hand (below) if a problem surfaces after a successful bump.
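
The try-newest-then-fall-back order from item 1 can be illustrated in shell. The real logic lives in Go (pairEndpoints in scripts/t3-dispatch/main.go) and is not reproduced here; the function name and the probe callback are hypothetical, and only the two endpoint paths come from this runbook.

```shell
# Endpoint order from the runbook: current path first, legacy path as fallback.
PAIR_ENDPOINTS="/api/auth/browser-session /api/auth/bootstrap"

# pick_pair_endpoint PROBE
# PROBE is any command that takes an endpoint path and exits 0 if pairing
# against it works (hypothetical; supply your own check).
# Prints the first endpoint that works, or fails if none does.
pick_pair_endpoint() {
  local probe="$1" ep
  for ep in $PAIR_ENDPOINTS; do
    if "$probe" "$ep"; then
      printf '%s\n' "$ep"
      return 0
    fi
  done
  return 1
}
```

A non-zero exit here corresponds to the situation the gate is meant to catch: neither known path pairs, so a new path has to be added to the list.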

Manual rollback (problem surfaces after a bump the gate let through)

GOOD=$(cat /var/lib/t3-autoupdate/last-good)   # or the known-good version you want
sudo touch /etc/t3-autoupdate.freeze           # stop the tracker FIRST
sudo npm i -g "t3@$GOOD"
# Restore + restart each user's serve. The wizard/admin instance: run this from
# OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
# or just let it pick up $GOOD on its next natural restart.
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
  # Expand the glob under sudo so it still matches when the backup dir is not readable by your own user.
  bak=$(sudo sh -c 'ls -1t "$1"/state-prebump-* 2>/dev/null | head -1' _ "/var/backups/t3-state/$u")
  [ -n "$bak" ] || continue
  sudo systemctl stop "t3-serve@$u"
  sudo install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite"
  sudo rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
  sudo systemctl start "t3-serve@$u"
done
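
To see which users the loop will touch before running it, the awk filter can be tried on its own. A small sketch wrapping the same filter in a hypothetical function; it assumes /etc/ttyd-user-map holds key=value lines with the unix user on the right, which is what the filter implies:

```shell
# list_t3_users FILE
# Same awk filter as the rollback loop: skip comment lines, keep lines with
# exactly one "=", strip spaces from the right-hand side, de-duplicate.
list_t3_users() {
  awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/ /, "", $2); print $2 }' "$1" | sort -u
}
```

`list_t3_users /etc/ttyd-user-map` prints one unix user per line. Lines without an `=` or with more than one are ignored, so a user on a malformed line is silently skipped by the rollback loop too.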

Verify (any user pairs cleanly through the dispatch)

for u in vbarzin emil.barzin ancaelena98; do
  curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done   # each must be 302 + t3_session
t3 --version
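
For a pass/fail answer instead of eyeballing grep output, the same check can be wrapped in a function. A sketch only; the function name is hypothetical and it assumes the response headers look like ordinary `curl -sI` output:

```shell
# pairs_cleanly: reads response headers on stdin and succeeds only if the
# status line is a 302 AND a t3_session cookie is being set.
pairs_cleanly() {
  local headers
  headers=$(tr -d '\r')   # HTTP headers end in CRLF; drop the CRs
  printf '%s\n' "$headers" | grep -qE '^HTTP/[0-9.]+ 302( |$)' || return 1
  printf '%s\n' "$headers" | grep -qiE '^set-cookie: *t3_session=' || return 1
}
```

Usage inside the loop above: `curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | pairs_cleanly || echo "FAIL: $u"`.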

(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
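
To take an extra snapshot by hand in the same layout, something like the following should work. It is a sketch, not the tracker's code: the function names and the timestamp format are assumptions, it needs the sqlite3 CLI, and it assumes the backup directory already exists and is writable by the user.

```shell
# Hypothetical helpers. Only the path layout comes from this runbook;
# the timestamp format is an assumption.

# prebump_backup_path USER VERSION TIMESTAMP
prebump_backup_path() {
  printf '/var/backups/t3-state/%s/state-prebump-%s-%s.sqlite\n' "$1" "$2" "$3"
}

# prebump_backup USER VERSION   (run from an account that can sudo)
prebump_backup() {
  local user="$1" dest
  dest=$(prebump_backup_path "$user" "$2" "$(date -u +%Y%m%dT%H%M%SZ)")
  # VACUUM INTO writes a consistent copy while the serve keeps running.
  # It errors out if the destination already exists with content.
  sudo -u "$user" sqlite3 "/home/$user/.t3/userdata/state.sqlite" "VACUUM INTO '$dest'"
}
```

Because the manual rollback picks the newest state-prebump-* file by modification time, a hand-made snapshot taken after a bad bump would be chosen over the tracker's own; delete or rename it before rolling back if that is not what you want.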