Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which still passed on the pairing-broken nightly). A
bad build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (previously the only copy, never backed up) -> rolling back the
one-way forward schema migration becomes a restore, not sqlite surgery.
Timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
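
The autoPair fallback in the t3-dispatch bullet can be sketched in shell. This is a minimal illustration, not the t3-dispatch implementation: pair_endpoint and http_status are hypothetical names, and probing with a curl POST is an assumption. Only the two endpoint paths and the fall-back-on-404 rule come from the notes above.

```shell
# Hypothetical probe: print the HTTP status code for a URL.
# The POST method is an assumption; adjust to whatever the server expects.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$1"
}

# Print the first pairing endpoint that does not answer 404.
# Tries the 0.0.25 name first, then the 0.0.24 name.
# Returns 1 if both paths are missing.
pair_endpoint() {
  local base="$1" path code
  for path in /api/auth/browser-session /api/auth/bootstrap; do
    code=$(http_status "$base$path")
    if [ "$code" != 404 ]; then
      printf '%s\n' "$path"
      return 0
    fi
  done
  return 1
}
```

Because the choice is made per request, the same caller would keep working while some instances still serve the old path and others already serve the new one.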
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
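
The stricter health-check idea (require the pairing response, not just GET / -> 200) could look roughly like the sketch below. It only covers the final assertion, a 302 that sets t3_session, as reported for the live verification above. The function name pairing_ok is hypothetical, and how the credential is minted and submitted is left out because the notes do not specify it.

```shell
# Succeed only if the HTTP response headers on stdin show a 302 status
# and a Set-Cookie header for t3_session.
# Hypothetical usage: curl -s -D - -o /dev/null "$url" | pairing_ok
pairing_ok() {
  awk '
    { sub(/\r$/, "") }                        # tolerate CRLF line endings
    NR == 1 { status = $2 }                   # e.g. "HTTP/1.1 302 Found"
    tolower($0) ~ /^set-cookie:[[:space:]]*t3_session=/ { cookie = 1 }
    END { exit !(status == 302 && cookie) }
  '
}
```

A plain 200 on the landing page fails this check, which is the property the old health-check lacked.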
#!/usr/bin/env bash
# Consistent online backup of each t3 user's ~/.t3 state.sqlite (chat/session
# history AND auth tables). ~/.t3 lives on the devvm local disk (NOT a K8s PVC and
# NOT in the 3-2-1 pipeline), so without this it is the only copy and a rebuild
# loses it. It also makes a t3 version bump REVERSIBLE: 0.0.25+ migrate the schema
# FORWARD (a one-way door), so a clean pre-bump backup turns rollback into a restore
# instead of per-user sqlite surgery (see runbooks/t3-version-bump.md). Runs as root
# via t3-backup-state.timer; the per-user VACUUM INTO runs AS the owning user so the
# live WAL/-shm files keep their owner and the running t3-serve is never perturbed.
set -uo pipefail
DEST="${T3_BACKUP_DEST:-/var/backups/t3-state}"
KEEP="${T3_BACKUP_KEEP:-14}"
MAP=/etc/ttyd-user-map
LOG() { logger -t t3-backup-state "$*"; echo "t3-backup-state: $*"; }

ts=$(date +%Y%m%d-%H%M%S)
# RHS of each non-comment "authentik=os_user" line = an OS user owning a ~/.t3.
mapfile -t users < <(awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/,"",$2); print $2 }' "$MAP" 2>/dev/null | sort -u)
[[ ${#users[@]} -gt 0 ]] || { LOG "no users in $MAP; nothing to back up"; exit 0; }

rc=0
for u in "${users[@]}"; do
  src="/home/$u/.t3/userdata/state.sqlite"
  if [[ ! -f "$src" ]]; then LOG "skip $u (no state.sqlite)"; continue; fi
  out="$DEST/$u"; dst="$out/state-$ts.sqlite"
  install -d -o "$u" -g "$u" -m 0700 "$out"
  # VACUUM INTO takes a consistent read-snapshot copy. Unlike .backup it does NOT
  # restart when the source is written mid-copy, so it finishes in a single pass even
  # for the actively-used instance (the admin's own live session, which .backup would
  # loop on forever). Run as the owning user so WAL access keeps the live serve happy.
  # timeout caps a pathologically-slow copy (huge DB + concurrent writes on a contended
  # disk) so the daily run can never wedge; it just logs + retries next cycle. The
  # daily 03:30 slot normally finds instances idle, where even a large DB copies fast.
  if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [[ -s "$dst" ]]; then
    LOG "backed up $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
  else
    LOG "WARN: backup FAILED for $u ($src)"; rc=1; rm -f "$dst"
  fi
  # retention: keep newest $KEEP per user
  ls -1t "$out"/state-*.sqlite 2>/dev/null | tail -n +$((KEEP+1)) | xargs -r rm -f
done
LOG "done (rc=$rc)"
exit $rc
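
On the restore side, a rollback first has to pick a backup and confirm it is usable. The helpers below are a hedged sketch of that step only; newest_backup and verify_backup are hypothetical names, not part of the script or the runbook. They assume the state-YYYYmmdd-HHMMSS.sqlite naming produced above, which sorts chronologically, and that sqlite3 is on PATH. Stopping t3-serve and swapping the file into place are deliberately left out, since those steps live in runbooks/t3-version-bump.md.

```shell
# Print the newest backup in a per-user backup directory.
# Glob results are sorted, and the timestamped names sort chronologically,
# so the last match is the newest. Returns 1 if there are no backups.
newest_backup() {
  local dir="$1" f newest=""
  for f in "$dir"/state-*.sqlite; do
    [ -e "$f" ] || continue        # glob matched nothing
    newest="$f"
  done
  if [ -n "$newest" ]; then
    printf '%s\n' "$newest"
  else
    return 1
  fi
}

# Succeed only if SQLite reports the backup file as internally consistent.
verify_backup() {
  [ "$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>/dev/null)" = ok ]
}
```

A possible use, with a made-up user name: `b=$(newest_backup /var/backups/t3-state/alice) && verify_backup "$b"` before copying `$b` over the stopped instance's state.sqlite.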