t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]

Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves.

So a future pin bump is now safe + reversible (pin STAYS 0.0.24; this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404, so one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 20:00:11 +00:00 · bccaa08d8e
commit bccaa08d8e
parent 5ea238c707
9 changed files with 311 additions and 19 deletions
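The endpoint fallback described in the commit message can be sketched in shell. This is only an illustration, not the t3-dispatch code: `post_credential` and `pair_with_fallback` are hypothetical names, and the HTTP method and credential payload are not stated in the commit, so the helper below sends a bare POST and looks only at the status code.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of t3-dispatch): print the HTTP status code of a
# request to "$1". The real method and credential payload are assumptions here.
post_credential() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$1"
}

# Try the 0.0.25 endpoint first, then the 0.0.24 one. Only a 404 is read as
# "this build does not know the endpoint"; any other status is treated as final.
# Prints "<status> <path>" for the endpoint that answered.
pair_with_fallback() {
  local base="$1" path code
  for path in /api/auth/browser-session /api/auth/bootstrap; do
    code=$(post_credential "$base$path")
    if [[ "$code" != 404 ]]; then
      echo "$code $path"
      return 0
    fi
  done
  # Both endpoints answered 404: report the last one tried and fail.
  echo "404 $path"
  return 1
}
```

Falling back only on 404 keeps genuine failures (5xx, refused credentials) visible instead of masking them with a second attempt against the old endpoint.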
--- a/scripts/t3-backup-state.sh
+++ b/scripts/t3-backup-state.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+# Consistent online backup of each t3 user's ~/.t3 state.sqlite (chat/session
+# history AND auth tables). ~/.t3 lives on the devvm local disk — NOT a K8s PVC and
+# NOT in the 3-2-1 pipeline — so without this it is the only copy and a rebuild
+# loses it. It also makes a t3 version bump REVERSIBLE: 0.0.25+ migrate the schema
+# FORWARD (a one-way door), so a clean pre-bump backup turns rollback into a restore
+# instead of per-user sqlite surgery (see runbooks/t3-version-bump.md). Runs as root
+# via t3-backup-state.timer; the per-user VACUUM INTO runs AS the owning user so the
+# live WAL/-shm files keep their owner and the running t3-serve is never perturbed.
+set -uo pipefail
+DEST="${T3_BACKUP_DEST:-/var/backups/t3-state}"
+KEEP="${T3_BACKUP_KEEP:-14}"
+MAP=/etc/ttyd-user-map
+LOG() { logger -t t3-backup-state "$*"; echo "t3-backup-state: $*"; }
+
+ts=$(date +%Y%m%d-%H%M%S)
+# RHS of each non-comment "authentik=os_user" line = an OS user owning a ~/.t3.
+mapfile -t users < <(awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/,"",$2); print $2 }' "$MAP" 2>/dev/null | sort -u)
+[[ ${#users[@]} -gt 0 ]] || { LOG "no users in $MAP; nothing to back up"; exit 0; }
+
+rc=0
+for u in "${users[@]}"; do
+  src="/home/$u/.t3/userdata/state.sqlite"
+  if [[ ! -f "$src" ]]; then LOG "skip $u (no state.sqlite)"; continue; fi
+  out="$DEST/$u"; dst="$out/state-$ts.sqlite"
+  install -d -o "$u" -g "$u" -m 0700 "$out"
+  # VACUUM INTO takes a consistent read-snapshot copy — unlike .backup it does NOT
+  # restart when the source is written mid-copy, so it finishes in a single pass even
+  # for the actively-used instance (the admin's own live session, which .backup would
+  # loop on forever). Run as the owning user so WAL access keeps the live serve happy.
+  # timeout caps a pathologically-slow copy (huge DB + concurrent writes on a contended
+  # disk) so the daily run can never wedge — it just logs + retries next cycle. The
+  # daily 03:30 slot normally finds instances idle, where even a large DB copies fast.
+  if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [[ -s "$dst" ]]; then
+    LOG "backed up $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
+  else
+    LOG "WARN: backup FAILED for $u ($src)"; rc=1; rm -f "$dst"
+  fi
+  # retention: keep newest $KEEP per user
+  ls -1t "$out"/state-*.sqlite 2>/dev/null | tail -n +$((KEEP+1)) | xargs -r rm -f
+done
+LOG "done (rc=$rc)"
+exit $rc
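The retention step at the end of the loop can be isolated for testing. The sketch below wraps the same `ls -1t | tail | xargs` pipeline in a function; `prune_old` is a hypothetical name that does not appear in the script, and like the original it assumes file names without whitespace and an `xargs` that supports `-r`.

```shell
#!/usr/bin/env bash
# prune_old DIR KEEP (hypothetical helper): delete all but the KEEP newest
# state-*.sqlite files in DIR, ranking them by modification time as ls -t does.
prune_old() {
  local dir="$1" keep="$2"
  # ls -1t lists newest first; tail -n +N starts output at line N, so everything
  # from position KEEP+1 onward is handed to rm. xargs -r skips rm on empty input.
  ls -1t "$dir"/state-*.sqlite 2>/dev/null | tail -n +"$((keep + 1))" | xargs -r rm -f
}
```

For example, `prune_old /var/backups/t3-state/alice 14` would leave the 14 most recent snapshots for a user named `alice` (an illustrative name) and remove the rest.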