t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days. This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
b1bbe42821
commit
9503bed589
1 changed files with 140 additions and 0 deletions
140
docs/plans/2026-06-21-t3-idle-migrate-design.md
Normal file
140
docs/plans/2026-06-21-t3-idle-migrate-design.md
Normal file
|
|
@ -0,0 +1,140 @@
|
|||
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
|
||||
|
||||
- **Date:** 2026-06-21
|
||||
- **Status:** designed 2026-06-21 (brainstorm) — not yet implemented
|
||||
- **Owner:** Viktor (wizard)
|
||||
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
|
||||
|
||||
## Goal
|
||||
|
||||
When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely — their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
|
||||
|
||||
Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns.
|
||||
|
||||
## Background — why the skew persists (root cause, verified 2026-06-21)
|
||||
|
||||
- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6).
|
||||
- Its idle check is coarse — `unit_busy()`:
|
||||
```sh
|
||||
pid=$(systemctl show -p MainPID --value "$unit")
|
||||
pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode'
|
||||
```
|
||||
i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window.
|
||||
- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then.
|
||||
- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`.
|
||||
|
||||
## Decisions (from brainstorm 2026-06-21)
|
||||
|
||||
1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart.
|
||||
2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing.
|
||||
3. **Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic.
|
||||
4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*.
|
||||
|
||||
## Constraints (load-bearing)
|
||||
|
||||
1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery.
|
||||
2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL.
|
||||
3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt.
|
||||
4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today.
|
||||
5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim.
|
||||
|
||||
## Design
|
||||
|
||||
### Components
|
||||
|
||||
Four new files in `scripts/` + a one-line addition to the existing job:
|
||||
|
||||
1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`:
|
||||
pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`.
|
||||
**Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical.
|
||||
|
||||
2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below).
|
||||
|
||||
3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.)
|
||||
|
||||
4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks:
|
||||
```ini
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 01..05:00/20 # fires 01:00,01:20,…,05:40; none at/after 06:00. System TZ (UTC) — tune the window.
|
||||
Persistent=false # never replay a missed migrate-restart at an unpredictable time
|
||||
RandomizedDelaySec=120
|
||||
```
|
||||
|
||||
5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral:
|
||||
```sh
|
||||
LOG "deferring $unit (active agent) — migrates on its next idle restart"
|
||||
mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW
|
||||
deferred=$((deferred+1)); continue
|
||||
```
|
||||
where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction.
|
||||
|
||||
### Why a deferral marker (not version-introspection)
|
||||
|
||||
The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
|
||||
|
||||
### Control flow of `t3-migrate-idle` (per tick)
|
||||
|
||||
```
|
||||
for marker in $DEFER_DIR/*: # nothing deferred → no-op
|
||||
user = basename(marker); unit = t3-serve@<user>.service
|
||||
[ unit is an active running service ] or { rm marker; continue } # gone
|
||||
if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear
|
||||
if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick
|
||||
target = contents(marker)
|
||||
if safe_restart_unit(unit, target): rm marker # success: verified on new binary
|
||||
else: # helper already restored DB + rolled back binary + froze + alerted
|
||||
break # frozen: stop draining; a human investigates
|
||||
```
|
||||
|
||||
### `safe_to_restart(user)` — the gate
|
||||
|
||||
Single read-only query, run as the user:
|
||||
|
||||
```sh
|
||||
runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" "
|
||||
SELECT
|
||||
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
|
||||
CAST((julianday('now')
|
||||
- julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
|
||||
FROM projection_thread_sessions;"
|
||||
```
|
||||
|
||||
- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.)
|
||||
- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing.
|
||||
- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3).
|
||||
|
||||
### Failure recovery
|
||||
|
||||
Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option.
|
||||
|
||||
### Observability
|
||||
|
||||
- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert.
|
||||
- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped.
|
||||
|
||||
### Delivery
|
||||
|
||||
- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units:
|
||||
- `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh`
|
||||
- `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle`
|
||||
- add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`)
|
||||
- add `t3-migrate-idle.timer` to the `systemctl enable --now` list
|
||||
- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm.
|
||||
- No Terraform (hand-managed VM 102).
|
||||
|
||||
## Testing
|
||||
|
||||
- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible.
|
||||
- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live.
|
||||
- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor.
|
||||
|
||||
## Out of scope (YAGNI)
|
||||
|
||||
- Daytime restarts / "around the clock" cadence (de-scoped: overnight only).
|
||||
- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility).
|
||||
- Per-user opt-out file (not needed — the job is self-limiting via markers).
|
||||
- Any change to how `t3-autoupdate` *installs/gates* a build.
|
||||
|
||||
## Open questions
|
||||
|
||||
None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard).
|
||||
Loading…
Add table
Add a link
Reference in a new issue