t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)

Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-16 11:33:49 +00:00
parent f4f7705127
commit cdd9ecd199
4 changed files with 126 additions and 95 deletions


@ -1,95 +1,86 @@
# Runbook: bump the pinned t3 version (e.g. 0.0.24 → 0.0.25)
# Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)
t3 on the devvm is **pinned** (`T3_PIN`, default `0.0.24`) and held there by the
`t3-autoupdate` enforcer. t3 is pre-1.0 and ships breaking changes between
builds, so a bump is a **deliberate, verified, reversible** step — never an
auto-update. This runbook makes it calm. Background: post-mortem
`2026-06-09-t3-nightly-autoupdate-auth-outage.md`.
t3 on the devvm **auto-tracks the `nightly` npm dist-tag** (Viktor, 2026-06-16,
risk explicitly accepted), via the daily `t3-autoupdate` timer. Every bump is
GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses
the post-incident pin decision — read `2026-06-09-t3-nightly-autoupdate-auth-outage.md`
for why every guard below exists. t3 is still pre-1.0 and ships breaking changes
between builds; the gate is what makes auto-tracking safe.
## What a bump actually touches
## How the tracker gates each bump (`scripts/t3-autoupdate.sh`)
1. **Freeze gate** — `/etc/t3-autoupdate.freeze` present (or `T3_PIN=<ver>` set) →
hold at current, do nothing.
2. **Resolve + downgrade-guard** — `npm view t3@nightly version`; proceed only if
the target is strictly newer than installed AND a `-nightly.` build (the tag is
mutable and can point backward).
3. **Pre-bump backup** — online `VACUUM INTO` of every user's `state.sqlite` to
`/var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite` (runs AS the owner;
never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
4. **Install + health-check** — `npm i -g t3@<ver>`, then start a throwaway serve
SEEDED WITH A COPY of wizard's real populated `state.sqlite` (scratch on
`/var/tmp`, not the 2 GB tmpfs `/tmp`) so it exercises the forward MIGRATION
(the 2026-06-09 failure class) + the real mint→exchange→`t3_session` pairing
handshake. Fail → roll back binary to last-good, exit (no serve migrated yet →
clean).
5. **Canary rollout** — restart IDLE instances one at a time, verifying pairing
through the real dispatch after each. First failure → roll back binary +
restore that user's DB from the pre-bump backup + **self-freeze** (touch the
freeze file) so it cannot re-flap onto bad builds. Active-agent instances are
DEFERRED (never killed) and migrate on their next idle restart.
6. **Last-good** — advanced to the new version only on full success
(`/var/lib/t3-autoupdate/last-good`); it is the rollback target.
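
The downgrade guard in step 2 can be sketched as a small shell predicate. This is an illustration, not the code in `scripts/t3-autoupdate.sh`: the function name `is_newer_nightly` is hypothetical, and it assumes GNU `sort -V` orders the nightly version strings the way the tracker intends.

```shell
# Hypothetical helper, not taken from scripts/t3-autoupdate.sh.
# Succeeds only if TARGET is a "-nightly." build AND strictly newer than INSTALLED.
is_newer_nightly() {
  local installed=$1 target=$2
  # Reject anything that is not a nightly build.
  case $target in
    *-nightly.*) ;;
    *) return 1 ;;
  esac
  # Same version is not "strictly newer".
  [ "$installed" != "$target" ] || return 1
  # Assumption: GNU sort -V puts the newer version last.
  [ "$(printf '%s\n%s\n' "$installed" "$target" | sort -V | tail -n1)" = "$target" ]
}
```

A caller would feed it the installed version and the output of `npm view t3@nightly version`, and skip the bump when it returns non-zero. Because the `nightly` tag is mutable, the comparison is what keeps a backward-pointing tag from triggering a downgrade.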
Detection backstop (real-user pairing failures / endpoint fallback): the dispatch
logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
... failed`) → Loki alerts `T3PairingBroken` / `T3PairFallbackHigh` /
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen`
→ Alertmanager → Slack.
## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):**
```bash
sudo touch /etc/t3-autoupdate.freeze # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze # resume tracking
```
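
For orientation, the check behind the freeze gate might look like the sketch below. The function name `tracking_disabled` and the `FREEZE_FILE` override are assumptions made for illustration; only the freeze-file path and the `T3_PIN` variable come from this runbook.

```shell
# Hypothetical sketch of the freeze gate, not the tracker's actual code.
# FREEZE_FILE is overridable here only so the check can be exercised safely.
FREEZE_FILE=${FREEZE_FILE:-/etc/t3-autoupdate.freeze}

# Succeeds when nightly tracking should be skipped:
# the freeze file exists, or T3_PIN is set to a non-empty value.
tracking_disabled() {
  [ -e "$FREEZE_FILE" ] || [ -n "${T3_PIN:-}" ]
}
```

Checking for existence rather than content is what lets a plain `touch` freeze the tracker and `rm -f` resume it.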
**Pin to an exact version (instead of tracking nightly):** set `T3_PIN=<ver>` in
the unit environment (or the `scripts/t3-autoupdate.sh` default) — the tracker
enforces it and stops following nightly. Keep in sync with `setup-devvm.sh`.
**Preview the current nightly without touching anything (no global change, no restarts):**
```bash
sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL
```
**Force a run now (instead of waiting for 04:00):**
```bash
sudo systemctl start t3-autoupdate.service # runs in its own cgroup, isolated from the t3-serve@ instances it manages
```
## What a bump touches (still true)
1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session`
in 0.0.25. `t3-dispatch` is now **version-agnostic** (tries `browser-session`,
falls back to `bootstrap`; see `pairEndpoints` in `scripts/t3-dispatch/main.go`),
so 0.0.24↔0.0.25 needs **no dispatch change**. If a *future* build renames it
again, add the new path to `pairEndpoints`, rebuild, redeploy first.
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` **forward**
(`auth_pairing_links`/`auth_sessions` `role` → `scopes`, `+proof_key_thumbprint`).
This is a **one-way door**: a binary downgrade alone will NOT roll it back —
you must restore the DB. Hence the mandatory pre-bump backup below.
in 0.0.25. `t3-dispatch` is version-agnostic (`pairEndpoints` in
`scripts/t3-dispatch/main.go` tries browser-session, falls back to bootstrap).
If a future build renames it AGAIN, the health-check + canary fail the bump and
self-freeze — then add the new path to `pairEndpoints`, rebuild + redeploy the
dispatch, and clear the freeze.
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` FORWARD — a
**one-way door**. A binary downgrade alone does NOT roll it back; you must
restore the DB. The tracker does this automatically on a canary failure; do it
by hand (below) if a problem surfaces *after* a successful bump.
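
The endpoint fallback described in item 1 lives in the Go dispatch (`pairEndpoints`), but the idea can be illustrated in shell for manual probing. The sketch below is an assumption-laden stand-in: `first_working_endpoint`, `pair_post`, and the `PAIR_POST` override are hypothetical names, and only the two endpoint paths, the JSON body shape, and the `t3_session` cookie come from this runbook.

```shell
# Hypothetical probe, not the dispatch's implementation.
# Default transport: POST the credential and print the response headers + body.
pair_post() {
  curl -s -i -X POST -H 'Content-Type: application/json' \
    -d "{\"credential\":\"$2\"}" "$1"
}

# Try each known pairing endpoint in order; print the first one whose
# response sets a t3_session cookie. Returns non-zero if none does.
first_working_endpoint() {
  local base=$1 cred=$2 ep
  for ep in /api/auth/browser-session /api/auth/bootstrap; do
    if "${PAIR_POST:-pair_post}" "$base$ep" "$cred" | grep -qi 'set-cookie: t3_session'; then
      printf '%s\n' "$ep"
      return 0
    fi
  done
  return 1
}
```

A non-zero return corresponds to the "renamed AGAIN" case: no known path yields a session cookie, so a new path has to be added to `pairEndpoints` before the freeze is cleared.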
## Pre-flight (no downtime)
## Manual rollback (problem surfaces after a bump the gate let through)
```bash
# 1. Confirm the dispatch already speaks the new version's pairing API.
# Install the candidate to an isolated prefix (does NOT touch the global pin):
npm install --prefix /tmp/t3-cand t3@<new> # e.g. t3@0.0.25
BIN=/tmp/t3-cand/node_modules/.bin/t3; D=$(mktemp -d)
"$BIN" serve --host 127.0.0.1 --port 3796 --base-dir "$D" >/tmp/cand.log 2>&1 &
CRED=$("$BIN" auth pairing create --base-dir "$D" --ttl 5m --json | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
# Try the dispatch's endpoints; one must give 200 + Set-Cookie: t3_session.
for ep in /api/auth/browser-session /api/auth/bootstrap; do
curl -s -i -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$CRED\"}" \
"http://127.0.0.1:3796$ep" | grep -iE 'HTTP/|set-cookie: t3_session'; done
kill %1; rm -rf "$D" /tmp/t3-cand
# If NO endpoint yields a t3_session cookie -> the API changed again; update
# pairEndpoints in main.go + rebuild the dispatch BEFORE proceeding.
# 2. Dispatch unit tests still green:
( cd ~/code/infra/scripts/t3-dispatch && go test ./... )
```
## The bump
```bash
NEW=0.0.25
# 1. PRE-BUMP BACKUP — the rollback safety net. Per user, stop the serve (so the
# copy is consistent + fast), copy state.sqlite, restart. Do the ACTIVE admin
# instance last / from OUTSIDE its own t3 session (you can't restart the serve
# you're running inside).
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
src=/home/$u/.t3/userdata/state.sqlite; [ -f "$src" ] || continue
sudo systemctl stop t3-serve@$u
sudo install -d -o "$u" -g "$u" -m700 /var/backups/t3-state/$u
sudo cp -a "$src" /var/backups/t3-state/$u/state-prebump-$NEW-$(date +%Y%m%d-%H%M%S).sqlite
sudo systemctl start t3-serve@$u
done
# (t3-backup-state also runs daily; this captures a guaranteed snapshot at T-0.)
# 2. Move the pin in BOTH places (keep them in sync):
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-$NEW/" ~/code/infra/scripts/t3-autoupdate.sh \
~/code/infra/scripts/workstation/setup-devvm.sh
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
# 3. Run the enforcer. It installs t3@$NEW, then HEALTH-CHECKS the real pairing
# handshake (mint -> browser-session/bootstrap -> t3_session). If pairing is
# broken in $NEW, it AUTO-ROLLS-BACK to the previous version and exits non-zero.
sudo /usr/local/bin/t3-autoupdate # restarts idle instances; defers active ones
# 4. Restart any instance the enforcer deferred (active agent), when it's idle.
# The wizard/admin instance: restart from OUTSIDE its own session, or it picks
# up $NEW on its next natural restart (the unit runs the global /usr/bin/t3).
```
## Verify
```bash
for u in vbarzin emil.barzin ancaelena98; do
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version # == $NEW
```
## Rollback (if pairing breaks or $NEW misbehaves)
The enforcer auto-rolls-back the **binary** if its health-check fails. But if a
problem surfaces *after* serves migrated their DBs forward, the binary alone
won't fix it — restore the DBs:
```bash
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-0.0.24/" ~/code/infra/scripts/t3-autoupdate.sh ~/code/infra/scripts/workstation/setup-devvm.sh
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
sudo npm i -g t3@0.0.24
GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want
sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST
sudo npm i -g "t3@$GOOD"
# Restore + restart each user's serve. The wizard/admin instance: run this from
# OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
# or just let it pick up $GOOD on its next natural restart.
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1)
[ -n "$bak" ] || continue
@ -98,8 +89,16 @@ for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/tty
sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm
sudo systemctl start t3-serve@$u
done
# verify 302 + t3_session as above
```
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user
sqlite surgery. With the backup, it's a restore.)
## Verify (any user pairs cleanly through the dispatch)
```bash
for u in vbarzin emil.barzin ancaelena98; do
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version
```
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite
surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
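
A minimal sketch of what such a pre-bump snapshot could look like, assuming the `sqlite3` CLI is installed and supports `VACUUM INTO` (SQLite 3.27 or later). The function names are hypothetical; the paths and the run-as-owner detail follow this runbook, and the timestamp format is borrowed from the earlier manual procedure.

```shell
# Hypothetical helpers, not copied from scripts/t3-autoupdate.sh.

# Build the backup path for a user, target version, and timestamp.
prebump_backup_path() {
  printf '/var/backups/t3-state/%s/state-prebump-%s-%s.sqlite\n' "$1" "$2" "$3"
}

# Snapshot one user's state.sqlite without stopping their serve.
prebump_backup() {
  local u=$1 ver=$2 src dest
  src=/home/$u/.t3/userdata/state.sqlite
  [ -f "$src" ] || return 0   # user has no state yet; nothing to back up
  dest=$(prebump_backup_path "$u" "$ver" "$(date +%Y%m%d-%H%M%S)")
  sudo install -d -o "$u" -g "$u" -m700 "$(dirname "$dest")"
  # VACUUM INTO writes a consistent copy while the database stays in use;
  # it runs as the owning user, and fails if the destination already exists.
  sudo -u "$u" sqlite3 "$src" "VACUUM INTO '$dest'"
}
```

Because the snapshot name embeds the target version and a timestamp, the manual rollback loop can pick the newest `state-prebump-*` file with `ls -1t`.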