docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 334d8fee5d
commit 5a136c7d53
3 changed files with 15 additions and 2 deletions
File diff suppressed because one or more lines are too long
@@ -1,7 +1,7 @@
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
- **Date:** 2026-06-21
- **Status:** designed 2026-06-21 (brainstorm) — not yet implemented
- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending)
- **Owner:** Viktor (wizard)
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
@@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` →
Alertmanager → Slack.
## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`)
Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00–05:40) drains those markers:
- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle** — `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity — then verify pairing and clear the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick.
- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too.
- **Force / preview:**
```bash
sudo systemctl start t3-migrate-idle.service          # run a drain pass now (still idle-gated)
sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle  # log decisions, act on nothing
```
- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
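
The idle gate described in the first bullet can be sketched as a small decision function. This is an illustration only, not the code in `scripts/t3-migrate-idle.sh`: the function name `is_idle` and its three arguments are hypothetical, and it assumes the caller has already read the count of active turns and the last thread activity time (as a Unix epoch) out of `state.sqlite`. Only `T3_MIGRATE_QUIET_SECONDS` and its default of 900 come from the runbook.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the idle gate; not the real t3-migrate-idle code.
# Args: <active_turn_count> <last_activity_epoch> <now_epoch>
# Returns 0 when the user is idle (safe to restart), 1 when busy or unsure.

is_idle() {
  local active="${1:-}" last="${2:-}" now="${3:-}"
  local quiet="${T3_MIGRATE_QUIET_SECONDS:-900}"
  local re='^[0-9]+$'

  # Fail closed: any empty or non-numeric value means "skip, retry next tick".
  [[ "$active" =~ $re && "$last" =~ $re && "$now" =~ $re && "$quiet" =~ $re ]] || return 1

  # An in-flight turn always blocks the restart.
  # The 10# prefix forces base 10 so leading zeros are not read as octal.
  (( 10#$active == 0 )) || return 1

  # Require the full quiet window since the last thread activity.
  # A timestamp in the future yields a negative gap and is treated as busy.
  (( 10#$now - 10#$last >= 10#$quiet )) || return 1

  return 0
}
```

A caller might use it as `is_idle "$active_turns" "$last_activity" "$(date +%s)" || continue`, so that a failed or garbled query result falls through to the skip path instead of triggering a restart.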
## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):**