t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent
9503bed589
commit
0cebeeb0ee
1 changed files with 729 additions and 0 deletions
729
docs/plans/2026-06-21-t3-idle-migrate-plan.md
Normal file
@ -0,0 +1,729 @@
# t3 idle-migrate Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.

**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
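
The marker handoff between the two jobs can be sketched as below. This is a minimal illustration under assumptions, not the plan's code: `record_deferral`, `pending_users`, and `clear_deferral` are hypothetical names, the version number is made up, and `DEFER_DIR` points at a throwaway temp directory rather than the real state directory.

```bash
#!/usr/bin/env bash
# Sketch of the deferral-marker lifecycle against a throwaway directory.
# ASSUMPTION: record_deferral / pending_users / clear_deferral are illustrative
# names, not functions from the plan; DEFER_DIR is a temp dir here.
set -u
DEFER_DIR="$(mktemp -d)"
trap 'rm -rf "$DEFER_DIR"' EXIT

# Daily job side: remember that <user> was skipped while moving to <target>.
record_deferral() { mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"; }

# Overnight job side: one pending user per line (marker filename == user).
pending_users() {
  local m
  for m in "$DEFER_DIR"/*; do
    [ -e "$m" ] || continue   # an empty dir leaves the glob unexpanded
    basename "$m"
  done
}

# Either job: drop the marker once the user's server runs the current binary.
clear_deferral() { rm -f "$DEFER_DIR/$1"; }

record_deferral alice 1.4.2
record_deferral bob 1.4.2
clear_deferral bob
pending_users   # prints: alice
```

Because the overnight job only iterates over files that exist, it does nothing on nights when the daily job deferred nobody, which is the self-limiting property described above.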

**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).

**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.

---

## File structure

- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
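
The `main`-guard mentioned for `t3-migrate-idle.sh` is a common bash idiom; a minimal sketch of how it behaves follows. The file `demo.sh` and the functions `greet` and `main` inside it are throwaway illustrations, not part of the plan.

```bash
#!/usr/bin/env bash
# Sketch of the main-guard idiom: functions are always defined, but main only
# runs when the file is executed directly. ASSUMPTION: demo.sh is a throwaway file.
set -u
tmp="$(mktemp -d)"
cat >"$tmp/demo.sh" <<'EOF'
greet() { echo "hello from greet"; }
main()  { echo "main ran"; }
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
EOF

# Executed directly: BASH_SOURCE[0] equals $0, so the guard fires.
executed="$(bash "$tmp/demo.sh")"
# Sourced: $0 is the caller ("_" here), so only the explicit greet call runs.
sourced="$(bash -c '. "$1"; greet' _ "$tmp/demo.sh")"

echo "executed: $executed"
echo "sourced:  $sourced"
rm -rf "$tmp"
```

This is what lets a test file source the script to reach the gate functions without triggering any restarts.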

**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
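
The difference between the two paths comes down to whether the installed version and `last_good` are the same string. A small sketch of that distinction, assuming a hypothetical helper `rollback_effect` and made-up version numbers:

```bash
#!/usr/bin/env bash
# Sketch: what rollback_binary amounts to in each path.
# ASSUMPTION: rollback_effect is an illustrative helper; versions are made up.
rollback_effect() { # rollback_effect <installed> <last_good>
  if [ "$1" = "$2" ]; then
    echo "no-op reinstall of $2"
  else
    echo "downgrade $1 -> $2"
  fi
}

rollback_effect 1.4.2 1.4.1   # daily path: canary failed on a new build
rollback_effect 1.4.2 1.4.2   # idle path: build was already accepted
```

The real `rollback_binary` always runs the reinstall; the sketch only classifies what that reinstall means for other users.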

---

## Task 1: Shared library `t3-safe-restart.sh`

**Files:**
- Create: `scripts/t3-safe-restart.sh`

- [ ] **Step 1: Create the library**

```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").

# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"

LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
  local u="$1" src out dst ts
  src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
  ts="$(date +%Y%m%d-%H%M%S)"
  out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
  install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
  if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
    printf '%s\n' "$dst"; return 0
  fi
  rm -f "$dst"; return 1
}

# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }

# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
  LOG "rolling back binary $target -> $last_good"
  if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
  LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}

# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
  local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
  out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
  printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}

# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
  local unit="$1" u="$2" ok=0 _ bak
  systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
  for _ in $(seq 1 15); do
    if verify_pairing "$u"; then ok=1; break; fi
    sleep 2
  done
  if [ "$ok" = "1" ]; then
    LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
  fi
  LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
  rollback_binary
  bak="$(prebump_of "$u")"
  if [ -n "$bak" ]; then
    systemctl stop "$unit" 2>/dev/null
    if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
      rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
      LOG "restored $u state.sqlite from $bak"
    fi
    systemctl start "$unit" 2>/dev/null
  fi
  touch "$FREEZE_FILE" 2>/dev/null
  LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
  return 1
}
```
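
To see what `osusers` and `ak_for` extract from the user map, the sketch below runs the same two awk one-liners against a throwaway map file. The map entries are invented for illustration; only the `authentik=os_user` line format comes from the library's own comment.

```bash
#!/usr/bin/env bash
# Sketch: the library's map parsers run against a sample map.
# ASSUMPTION: the entries below are made up; USER_MAP is a temp file here.
set -u
USER_MAP="$(mktemp)"
trap 'rm -f "$USER_MAP"' EXIT
cat >"$USER_MAP" <<'EOF'
# authentik=os_user
alice.a = alice
bob=bob
  # indented comments are skipped too
carol@example.com=alice
EOF

# Same bodies as in t3-safe-restart.sh.
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

osusers        # alice and bob, one per line (duplicates collapsed)
ak_for alice   # alice.a (first matching line wins)
```

Two authentik identities mapping to the same OS user yield a single entry from `osusers`, so each `~/.t3` is backed up once per run.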

- [ ] **Step 2: Syntax + lint check**

Run: `bash -n scripts/t3-safe-restart.sh && if command -v shellcheck >/dev/null; then shellcheck -x scripts/t3-safe-restart.sh; else echo "shellcheck absent; skipped"; fi`
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references; that is acceptable, since they are documented caller-set globals.)

- [ ] **Step 3: Source-and-define smoke test**

Run:
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).

- [ ] **Step 4: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate

Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```

---

## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals

**Files:**
- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225)

- [ ] **Step 1: Source the library; drop the now-shared helpers**

Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with — keep the autoupdate-only config, source the lib for the shared bits:

```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it

LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"

# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

mkdir -p "$STATE_DIR" 2>/dev/null || true
```

(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
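
The `newer()` helper that stays in the autoupdate script leans on `sort -V` (version sort, a GNU coreutils option). A quick sketch of its behavior on plain dotted versions follows; the version numbers are made up, and pre-release or nightly-style suffixes are deliberately not exercised here.

```bash
#!/usr/bin/env bash
# Sketch: newer() compares versions numerically per component, not as text.
# ASSUMPTION: sample versions are invented; requires sort -V (GNU coreutils).
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

check() {
  if newer "$1" "$2"; then echo "$1 is newer than $2"; else echo "$1 is not newer than $2"; fi
}

check 1.2.10 1.2.9   # 10 > 9 numerically, although "1.2.10" sorts first as text
check 1.2.9 1.2.10
check 1.2.9 1.2.9    # equal versions are never "newer"
```

The explicit inequality test matters: without it, two identical versions would both sort last and the helper would report a spurious upgrade.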

- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**

Replace the `backup_all()` definition (lines 90–105) with:

```bash
ADMIN_SEED=""
backup_all() {
  local u dst
  for u in $(osusers); do
    if dst="$(backup_user "$u")"; then
      LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
      [ "$u" = "wizard" ] && ADMIN_SEED="$dst"
    else
      LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
    fi
  done
  [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```

Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165) — they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).

- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**

Replace the step-6 loop body (lines 196–225) with:

```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
  u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
  if unit_busy "$unit"; then
    LOG "deferring $unit (active agent) — migrates on its next idle restart"
    mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
    deferred=$((deferred+1)); continue
  fi
  if safe_restart_unit "$unit" "$u"; then
    restarted=$((restarted+1))
    rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
  else
    exit 1 # frozen by safe_restart_unit — preserve today's behavior
  fi
done
```

- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**

Run:
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the remaining defer/restart decisions are unchanged vs the committed (HEAD) logic:
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) plus the two marker lines, with no change to the install/health-gate/canary decision logic.

- [ ] **Step 5: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals

Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```

---

## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`

**Files:**
- Create: `tests/t3-migrate-idle-gate.test.sh`
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)

- [ ] **Step 1: Write the failing test**

Create `tests/t3-migrate-idle-gate.test.sh`:

```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running

pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }

# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe

# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
  local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
  while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"

# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"

# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500

# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"

echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```

- [ ] **Step 2: Run it to verify it fails**

Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).

- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**

```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail

LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"

QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"

# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
  [ -z "$idle" ] && return 0 # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}

# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
  local db="$1"
  sqlite3 -batch -noheader -separator '|' "$db" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;"
}

# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
  local u="$1" db row
  db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
  row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;" 2>/dev/null)" || return 1
  gate_is_safe "${row%%|*}" "${row##*|}"
}

main() {
  : # drain loop added in Task 4
}

# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```

- [ ] **Step 4: Run the test to verify it passes**

Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (exit 0).
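
As a cross-check on the gate's `julianday` arithmetic, the same "seconds since the last update" figure can be recomputed outside SQLite. The sketch below is an assumption-laden illustration: `idle_seconds` is a hypothetical helper, the timestamps are made up, and it relies on GNU `date -d` (which the test file already uses).

```bash
#!/usr/bin/env bash
# Sketch: the gate's idle arithmetic redone with GNU date instead of julianday().
# ASSUMPTION: idle_seconds is an illustrative helper; requires GNU date (-d).
idle_seconds() { # idle_seconds <iso8601-utc> [now-epoch]
  local ts="$1" now="${2:-$(date -u +%s)}" last
  last="$(date -u -d "$ts" +%s)" || return 1   # unparseable timestamp -> non-zero
  echo $(( now - last ))
}

# A fixed "now" keeps the example deterministic: 16 min 40 s after the update.
now="$(date -u -d '2026-06-21T01:16:40Z' +%s)"
idle_seconds '2026-06-21T01:00:00.000Z' "$now"   # prints 1000
```

With the default `QUIET_SECONDS=900`, an idle figure of 1000 together with zero active turns is the first case in the test file: `gate_is_safe 0 1000` succeeds.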

- [ ] **Step 5: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD

The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```

---

## Task 4: The marker-drain loop in `t3-migrate-idle.sh`

**Files:**
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)

- [ ] **Step 1: Implement `main()` (the drain loop)**

Replace the `main() { : ; }` skeleton with:

```bash
main() {
  # a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
  if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
  [ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
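  last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
  [ -n "$last_good" ] || last_good="$(ver)" # missing/empty file: fall back to the installed version so rollback never runs `npm i -g t3@`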

  local marker u unit started mwritten migrated=0 skipped=0
  for marker in "$DEFER_DIR"/*; do
    [ -e "$marker" ] || continue # empty-dir glob
    u="$(basename "$marker")"; unit="t3-serve@$u.service"
    if ! systemctl is-active --quiet "$unit"; then
      LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
    fi
    started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
    mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
    if [ "$started" -gt "$mwritten" ]; then
      LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
    fi
    if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi

    target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
    if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
    if ! backup_user "$u" >/dev/null; then
      LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
    fi
    if safe_restart_unit "$unit" "$u"; then
      LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
    else
      LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
    fi
  done
  LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
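
The "already restarted after the deferral" branch is a plain comparison of two epoch values: the unit's `ActiveEnterTimestamp` and the marker's mtime. A minimal sketch on made-up numbers, with `marker_is_stale` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch of the drain loop's "already restarted?" check on plain epoch seconds.
# ASSUMPTION: marker_is_stale is an illustrative helper; the epochs are made up.
marker_is_stale() { # marker_is_stale <unit_started_epoch> <marker_mtime_epoch>
  [ "$1" -gt "$2" ]
}

deferred_at=1782003600   # when the daily job wrote the marker

if marker_is_stale $((deferred_at + 3600)) "$deferred_at"; then
  echo "unit restarted after the deferral: clear marker, no restart needed"
fi
if ! marker_is_stale $((deferred_at - 86400)) "$deferred_at"; then
  echo "unit up since before the deferral: still has to pass the idle gate"
fi
```

Note the effect of the loop's `|| echo 0` fallback: if the start time cannot be read, `started` is 0, the comparison is false, and the marker is kept rather than cleared.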

- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**

Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).

- [ ] **Step 3: Syntax + lint**

Run: `bash -n scripts/t3-migrate-idle.sh && if command -v shellcheck >/dev/null; then shellcheck -x scripts/t3-migrate-idle.sh; else echo "shellcheck absent; skipped"; fi`
Expected: no syntax errors.

- [ ] **Step 4: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe

For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```

---

## Task 5: systemd units

**Files:**
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`

- [ ] **Step 1: Create the service unit**

`scripts/t3-migrate-idle.service`:
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```

- [ ] **Step 2: Create the timer unit**

`scripts/t3-migrate-idle.timer`:
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)

[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false

[Install]
WantedBy=timers.target
```

- [ ] **Step 3: Validate unit syntax**

Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).

- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**

Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01–05).
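
For reference, the full set of nominal slots that expression matches can be enumerated by hand: hours 01 through 05, minutes 00, 20, and 40. The sketch below does that in bash; `timer_slots` is a hypothetical helper, and the listing ignores the up-to-120-second `RandomizedDelaySec` jitter.

```bash
#!/usr/bin/env bash
# Sketch: enumerate the wall-clock slots matched by `*-*-* 01..05:00/20`
# (hours 01-05, minutes 0, 20, 40), ignoring RandomizedDelaySec.
# ASSUMPTION: timer_slots is an illustrative helper, not part of the plan.
timer_slots() {
  local h m
  for h in 01 02 03 04 05; do
    for m in 00 20 40; do
      echo "$h:$m"
    done
  done
}

timer_slots | head -n 3                                   # 01:00, 01:20, 01:40
echo "slots per night: $(( $(timer_slots | wc -l) ))"     # 5 hours x 3 = 15
```

The last slot is 05:40, so a user who goes quiet at any point overnight gets several attempts to clear the 15-minute quiet buffer.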

- [ ] **Step 5: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```

---
|
||||||
|
|
||||||
|
## Task 6: Wire into `setup-devvm.sh`
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
|
||||||
|

- [ ] **Step 1: Install the lib + the new script (section 9a)**

After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
```bash
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
```

- [ ] **Step 2: Install the unit files (section 9d loop)**

Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
```bash
t3-migrate-idle.service t3-migrate-idle.timer \
```

- [ ] **Step 3: Enable the timer (section 9 enable line)**

Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
```bash
systemctl enable --now t3-dispatch.service \
  t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
  log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
```

- [ ] **Step 4: Syntax check**

Run: `bash -n scripts/workstation/setup-devvm.sh`
Expected: no syntax errors.

- [ ] **Step 5: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
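
Once `setup-devvm.sh` has run on a host, the install modes from Step 1 can be spot-checked. The helper below is a hypothetical sketch, not part of the plan's scripts: `check_mode` is an invented name, and it assumes GNU `stat` (the `-c '%a'` form), which is what a systemd Linux host normally ships. A typical call would be `check_mode /usr/local/lib/t3-safe-restart.sh 644` followed by `check_mode /usr/local/bin/t3-migrate-idle 755`.

```bash
#!/usr/bin/env bash
# Hypothetical post-install check (not in the plan): confirm that a file
# exists and carries the expected octal mode. Assumes GNU stat.
check_mode() {
  local path=$1 want=$2 got
  # stat fails when the path does not exist; report that and return non-zero.
  got=$(stat -c '%a' -- "$path" 2>/dev/null) || { echo "MISSING $path"; return 1; }
  if [[ $got == "$want" ]]; then
    echo "OK $path ($got)"
  else
    echo "MODE $path (got $got, want $want)"
    return 1
  fi
}
```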

---

## Task 7: Deploy to the devvm + validate (dry-run first)

**Files:** none (operational). Presence-claimed, shared-host mutation.

- [ ] **Step 1: Claim the host**

Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).

- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**

Run:
```bash
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
sudo systemctl daemon-reload
```
Expected: no errors.

- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**

The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib (`$W` is the worktree path set in Step 2):
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate   # safe: a same-version run exits at "already on <track>=<ver>; nothing to do"
```
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).

- [ ] **Step 3: DRY-RUN the idle migrator against live state**

Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
Expected: NO restart in any case. With wizard currently busy (mid-turn during the day), the pass ends with a `skipped` count: `idle-migrate pass complete (migrated=0 skipped=N)`. If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act. Because the job only acts on recorded deferrals, wizard shows up in either outcome only if a marker already exists for it; Step 4 seeds one.

- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**

The live daily job already deferred wizard, but the marker mechanism is new, so create the marker once by hand to represent the existing `.605→.613` debt:
```bash
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
```
Expected: the pass now considers `wizard`: either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). This confirms marker drain + gate wiring end-to-end without acting.

- [ ] **Step 5: Enable the timer (live)**

Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
Expected: timer active, next elapse in the 01:00–05:40 window.

- [ ] **Step 6: Release the claim**

Run: `homelab release host:devvm`

> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here; see design open-question (a).)

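
For the next-session check, the per-pass counters can be totalled instead of read line by line. The sketch below assumes each pass logs a summary line in exactly the form quoted in Step 3, `idle-migrate pass complete (migrated=M skipped=N)`; if the script's real wording differs, the patterns need adjusting. `summarize_passes` is an illustrative name, not part of the plan. It reads log text on stdin, so a typical call would be `journalctl -u t3-migrate-idle.service --since yesterday | summarize_passes`.

```bash
#!/usr/bin/env bash
# Illustrative helper (not part of the plan): total the counters from
# "idle-migrate pass complete (migrated=M skipped=N)" lines read on stdin.
# The line format is an assumption taken from the Step 3 expected output.
summarize_passes() {
  awk '
    /idle-migrate pass complete/ {
      # "migrated=" is 9 characters and "skipped=" is 8; skip past the key
      # and add the digits that follow it.
      if (match($0, /migrated=[0-9]+/)) m += substr($0, RSTART + 9, RLENGTH - 9)
      if (match($0, /skipped=[0-9]+/))  s += substr($0, RSTART + 8, RLENGTH - 8)
      passes++
    }
    END { printf "passes=%d migrated=%d skipped=%d\n", passes, m, s }
  '
}
```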
---

## Task 8: Docs

**Files:**
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
- Modify: `.claude/reference/service-catalog.md` (add the unit)
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)

- [ ] **Step 1: Runbook.** Add a section after the autoupdate description:

```markdown
## Idle migrator (`t3-migrate-idle.timer`)

`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
`t3-migrate-idle` (overnight, every 20 min 01:00–05:40) drains those markers:
it restarts a deferred instance onto the current binary only when that user's
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
  user's DB + freezes + alerts (the binary rollback is a no-op since the build was
  already accepted); the user's server may crashloop on the restored DB until the
  freeze is cleared. Investigate per the rollback section above.
```

- [ ] **Step 2: service-catalog.** Add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).

- [ ] **Step 3: design doc status.** Change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.

- [ ] **Step 4: Commit**

```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
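
The runbook text in Step 1 states the gate as two conditions: no in-flight turn, and at least 15 minutes of quiet. As a reading aid only, the decision can be sketched as below. This is not the plan's `gate_is_safe`: `sketch_gate` is a hypothetical name, and it assumes the caller has already extracted an in-flight turn count and a last-activity epoch from `state.sqlite` (how those are queried is not shown here). The 900-second default mirrors the runbook's 15 minutes, and any missing or non-numeric input makes the gate fail closed.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the idle-gate decision; NOT the plan's gate_is_safe.
# Args: <in-flight turn count> <last activity, epoch s> <now, epoch s> [quiet s]
# Returns 0 (safe to restart) only when no turn is in flight AND the quiet
# buffer has elapsed. Missing or non-numeric input fails closed (returns 1).
sketch_gate() {
  local active=$1 last=$2 now=$3 quiet=${4:-900}
  local v
  for v in "$active" "$last" "$now" "$quiet"; do
    [[ $v =~ ^[0-9]+$ ]] || return 1   # fail closed on unusable input
  done
  (( active == 0 )) || return 1        # a turn is still in flight
  (( now - last >= quiet )) || return 1  # quiet buffer not yet elapsed
  return 0
}

# Example: last activity 20 minutes ago, nothing in flight.
now=$(date +%s)
if sketch_gate 0 "$(( now - 1200 ))" "$now"; then
  echo "gate open"
else
  echo "gate closed"
fi
```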

---

## Task 9: Land

- [ ] **Step 1: Merge latest master into the branch**

Run:
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" fetch forgejo
git "${GC[@]}" merge --no-edit forgejo/master
```
Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve any conflicts that do appear before continuing.

- [ ] **Step 2: Re-run the gate tests post-merge**

Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0`.

- [ ] **Step 3: Push to master**

Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
Expected: accepted. Non-fast-forward → fetch/merge/retry.

- [ ] **Step 4: Watch CI to completion**

Run: `homelab ci watch`
Expected: green (infra apply pipeline; this change is scripts/docs only, no Terraform, so apply is a no-op for it).

- [ ] **Step 5: Clean up the worktree**

Run (from the main checkout):
```bash
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
# -d refuses to delete a branch that is not fully merged (into its upstream,
# or into HEAD when it has none); update the main checkout first if it refuses.
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
```

---

## Self-review

- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built, which keeps scope to the shipping mechanism).
- **Placeholders:** none; every file has complete content and every command has expected output.
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.