infra/docs/plans/2026-06-21-t3-idle-migrate-plan.md
Viktor Barzin 0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00

729 lines
34 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# t3 idle-migrate Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.
**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).
**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.
---
## File structure
- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
---
## Task 1: Shared library `t3-safe-restart.sh`
**Files:**
- Create: `scripts/t3-safe-restart.sh`
- [ ] **Step 1: Create the library**
```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}
```
- [ ] **Step 2: Syntax + lint check**
Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.)
- [ ] **Step 3: Source-and-define smoke test**
Run:
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals
**Files:**
- Modify: `scripts/t3-autoupdate.sh` (config block 3242, helpers 44165, step 6 loop 194225)
- [ ] **Step 1: Source the library; drop the now-shared helpers**
Replace lines 3252 (the `T3_*` config block through the `newer()` helper) with — keep the autoupdate-only config, source the lib for the shared bits:
```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
mkdir -p "$STATE_DIR" 2>/dev/null || true
```
(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**
Replace the `backup_all()` definition (lines 90105) with:
```bash
ADMIN_SEED=""
backup_all() {
local u dst
for u in $(osusers); do
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```
Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107108, 146152, 160165) — they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).
- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**
Replace the step-6 loop body (lines 196225) with:
```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done
```
- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**
Run:
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the only remaining defer/restart decisions are unchanged vs HEAD~1 logic:
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`
**Files:**
- Create: `tests/t3-migrate-idle-gate.test.sh`
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)
- [ ] **Step 1: Write the failing test**
Create `tests/t3-migrate-idle-gate.test.sh`:
```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running
pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }
# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe
# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"
# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"
# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500
# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"
echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```
- [ ] **Step 2: Run it to verify it fails**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).
- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**
```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
: # drain loop added in Task 4
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```
- [ ] **Step 4: Run the test to verify it passes**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (exit 0).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 4: The marker-drain loop in `t3-migrate-idle.sh`
**Files:**
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)
- [ ] **Step 1: Implement `main()` (the drain loop)**
Replace the `main() { : ; }` skeleton with:
```bash
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).
- [ ] **Step 3: Syntax + lint**
Run: `bash -n scripts/t3-migrate-idle.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-migrate-idle.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 5: systemd units
**Files:**
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`
- [ ] **Step 1: Create the service unit**
`scripts/t3-migrate-idle.service`:
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Create the timer unit**
`scripts/t3-migrate-idle.timer`:
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target
```
- [ ] **Step 3: Validate unit syntax**
Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).
- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**
Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 0105).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 6: Wire into `setup-devvm.sh`
**Files:**
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
- [ ] **Step 1: Install the lib + the new script (section 9a)**
After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
```bash
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Install the unit files (section 9d loop)**
Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
```bash
t3-migrate-idle.service t3-migrate-idle.timer \
```
- [ ] **Step 3: Enable the timer (section 9 enable line)**
Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
```bash
systemctl enable --now t3-dispatch.service \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
```
- [ ] **Step 4: Syntax check**
Run: `bash -n scripts/workstation/setup-devvm.sh`
Expected: no syntax errors.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 7: Deploy to the devvm + validate (dry-run first)
**Files:** none (operational). Presence-claimed, shared-host mutation.
- [ ] **Step 1: Claim the host**
Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).
- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**
Run:
```bash
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
sudo systemctl daemon-reload
```
Expected: no errors.
- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**
The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib:
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do"
```
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).
- [ ] **Step 3: DRY-RUN the idle migrator against live state**
Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
Expected: with wizard currently busy (mid-turn during the day), a `skipped` count — `idle-migrate pass complete (migrated=0 skipped=N)` — and NO restart. (If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act.)
- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**
The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt:
```bash
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
```
Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting.
- [ ] **Step 5: Enable the timer (live)**
Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
Expected: timer active, next elapse in the 01:0005:40 window.
- [ ] **Step 6: Release the claim**
Run: `homelab release host:devvm`
> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).)
---
## Task 8: Docs
**Files:**
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
- Modify: `.claude/reference/service-catalog.md` (add the unit)
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)
- [ ] **Step 1: Runbook** — add a section after the autoupdate description:
```markdown
## Idle migrator (`t3-migrate-idle.timer`)
`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
`t3-migrate-idle` (overnight, every 20 min 01:0005:40) drains those markers:
it restarts a deferred instance onto the current binary only when that user's
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
user's DB + freezes + alerts (the binary rollback is a no-op since the build was
already accepted); the user's server may crashloop on the restored DB until the
freeze is cleared. Investigate per the rollback section above.
```
- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).
- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 9: Land
- [ ] **Step 1: Merge latest master into the branch**
Run:
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" fetch forgejo
git "${GC[@]}" merge --no-edit forgejo/master
```
Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any.
- [ ] **Step 2: Re-run the gate tests post-merge**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0`.
- [ ] **Step 3: Push to master**
Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
Expected: accepted. Non-fast-forward → fetch/merge/retry.
- [ ] **Step 4: Watch CI to completion**
Run: `homelab ci watch`
Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it).
- [ ] **Step 5: Clean up the worktree**
Run (from the main checkout):
```bash
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
```
---
## Self-review
- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism).
- **Placeholders:** none — every file has complete content; every command has expected output.
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.