Compare commits

...

10 commits

Author SHA1 Message Date
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
11 changed files with 1148 additions and 63 deletions



@@ -0,0 +1,140 @@
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
- **Date:** 2026-06-21
- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending)
- **Owner:** Viktor (wizard)
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
## Goal
When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely — their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns.
## Background — why the skew persists (root cause, verified 2026-06-21)
- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6).
- Its idle check is coarse — `unit_busy()`:
```sh
pid=$(systemctl show -p MainPID --value "$unit")
pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode'
```
i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window.
- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then.
- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`.
## Decisions (from brainstorm 2026-06-21)
1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart.
2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing.
3. **Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic.
4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*.
## Constraints (load-bearing)
1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery.
2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL.
3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt.
4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today.
5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim.
## Design
### Components
Four new files in `scripts/` + a one-line addition to the existing job:
1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`:
pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`.
**Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical.
2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below).
3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.)
4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks:
```ini
[Timer]
# Fires 01:00, 01:20, …, 05:40; none at/after 06:00. System TZ (UTC); tune the window.
OnCalendar=*-*-* 01..05:00/20
# Never replay a missed migrate-restart at an unpredictable time.
# (systemd unit files do not support trailing comments, so these sit on their own lines.)
Persistent=false
RandomizedDelaySec=120
```
5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral:
```sh
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW
deferred=$((deferred+1)); continue
```
where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction.
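As a sanity check on the timer in item 4, the nominal firing times implied by its `OnCalendar` expression can be enumerated by hand. This is an illustrative sketch only: the `timer_ticks` helper is hypothetical (not part of the design), and it assumes hours 01 through 05 with minutes 00, 20, and 40, ignoring `RandomizedDelaySec`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the nominal ticks of "*-*-* 01..05:00/20".
# Ignores RandomizedDelaySec, which adds up to 120s of jitter per tick.
timer_ticks() {
  local h m
  for h in 1 2 3 4 5; do
    for m in 0 20 40; do
      printf '%02d:%02d\n' "$h" "$m"
    done
  done
}

timer_ticks | head -n 1   # first tick of the night
timer_ticks | tail -n 1   # last tick of the night
timer_ticks | wc -l       # number of gate checks per night
```

That works out to 15 nominal gate checks per night, the first at 01:00 and the last at 05:40. On a host with systemd, `systemd-analyze calendar --iterations=15 '*-*-* 01..05:00/20'` should give the authoritative expansion.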
### Why a deferral marker (not version-introspection)
The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
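The marker lifecycle described here (record on deferral, list what is owed, clear only after a verified restart) can be sketched against a throwaway directory. The function names, the user `alice`, and the version `1.2.3` are hypothetical; the real paths and the verified-pairing check are not exercised.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the marker lifecycle, run against a throwaway directory.
# DEFER_DIR here is a temp dir, not the real /var/lib/t3-autoupdate/deferred.
DEFER_DIR="$(mktemp -d)"

record_deferral() {   # record_deferral <user> <target-version>
  mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"
}
owed_users() {        # one user per line; prints nothing when nothing is deferred
  local f
  for f in "$DEFER_DIR"/*; do
    [ -e "$f" ] && basename "$f"
  done
  return 0
}
clear_deferral() {    # call only after the restart's pairing is verified
  rm -f "$DEFER_DIR/$1"
}

record_deferral alice 1.2.3   # made-up example values
owed_users                    # lists alice
clear_deferral alice
owed_users                    # lists nothing
rm -rf "$DEFER_DIR"
```

Because the set of owed users is just the directory listing, a user who migrated at the daily window never appears and costs the idle job nothing.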
### Control flow of `t3-migrate-idle` (per tick)
```
for marker in $DEFER_DIR/*: # nothing deferred → no-op
user = basename(marker); unit = t3-serve@<user>.service
[ unit is an active running service ] or { rm marker; continue } # gone
if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear
if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick
target = contents(marker)
if safe_restart_unit(unit, target): rm marker # success: verified on new binary
else: # helper already restored DB + rolled back binary + froze + alerted
break # frozen: stop draining; a human investigates
```
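The per-marker branching above can also be written as a pure function, which makes the four outcomes easy to test without systemd. This is a sketch under assumptions: `drain_decision` and its outcome labels are hypothetical, and in a real job the inputs would come from `systemctl`, the marker's mtime, and the idle gate.

```bash
#!/usr/bin/env bash
# Hypothetical pure-function rendering of the per-marker decision.
# drain_decision <unit_running:0|1> <unit_start_epoch> <marker_mtime_epoch> <gate_safe:0|1>
drain_decision() {
  local running="$1" started="$2" marked="$3" safe="$4"
  if [ "$running" != "1" ]; then echo clear-gone; return 0; fi
  if [ "$started" -gt "$marked" ]; then echo clear-already-restarted; return 0; fi
  if [ "$safe" != "1" ]; then echo wait-next-tick; return 0; fi
  echo restart
}

drain_decision 0 0 100 1      # unit no longer running
drain_decision 1 200 100 0    # restarted after the deferral was recorded
drain_decision 1 50 100 0     # still owed, but mid-turn or not quiet
drain_decision 1 50 100 1     # still owed and the gate is open
```

Only the last case reaches the restart helper; the order of the checks means a stale marker is cleared even when the gate is closed.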
### `safe_to_restart(user)` — the gate
Single read-only query, run as the user:
```sh
runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" "
SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now')
- julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
```
- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.)
- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing.
- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3).
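The column rules above reduce to a small predicate over the two values the query returns. The following is a minimal sketch, not the shipped gate: the name `gate_is_safe_sketch` is hypothetical, and it assumes the caller has already split the query output into an active-turn count and an idle-seconds value (empty for NULL).

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the gate predicate; names are illustrative only.
QUIET_SECONDS="${QUIET_SECONDS:-900}"

# gate_is_safe_sketch <active_turns> <idle_seconds>
# Returns 0 (safe) only when active_turns is exactly 0 and idle_seconds is
# either empty (no threads at all) or an integer >= QUIET_SECONDS.
gate_is_safe_sketch() {
  local active="$1" idle="$2"
  [[ "$active" =~ ^[0-9]+$ ]] || return 1   # unparseable count: fail closed
  [ "$active" -eq 0 ] || return 1           # a turn is in flight
  [ -z "$idle" ] && return 0                # NULL idle: no threads, safe
  [[ "$idle" =~ ^[0-9]+$ ]] || return 1     # negative or garbage: fail closed
  [ "$idle" -ge "$QUIET_SECONDS" ]
}

gate_is_safe_sketch 0 1000 && echo "idle and quiet: safe"
gate_is_safe_sketch 1 1000 || echo "turn in flight: not safe"
gate_is_safe_sketch 0 100  || echo "idle but too recent: not safe"
```

Every branch that cannot prove safety returns non-zero, so a new failure mode defaults to "skip this tick" rather than "restart".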
### Failure recovery
Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option.
### Observability
- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert.
- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped.
### Delivery
- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units:
- `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh`
- `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle`
- add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`)
- add `t3-migrate-idle.timer` to the `systemctl enable --now` list
- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm.
- No Terraform (hand-managed VM 102).
## Testing
- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible.
- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live.
- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor.
## Out of scope (YAGNI)
- Daytime restarts / "around the clock" cadence (de-scoped: overnight only).
- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility).
- Per-user opt-out file (not needed — the job is self-limiting via markers).
- Any change to how `t3-autoupdate` *installs/gates* a build.
## Open questions
None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard).


@@ -0,0 +1,729 @@
# t3 idle-migrate Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.
**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).
**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.
---
## File structure
- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
---
## Task 1: Shared library `t3-safe-restart.sh`
**Files:**
- Create: `scripts/t3-safe-restart.sh`
- [ ] **Step 1: Create the library**
```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}
```
- [ ] **Step 2: Syntax + lint check**
Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.)
- [ ] **Step 3: Source-and-define smoke test**
Run:
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
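The return-instead-of-exit contract of `safe_restart_unit` is what lets each caller choose its own failure policy. A minimal sketch, assuming a hypothetical stand-in (`fake_safe_restart`) that performs no real restart and fails for any unit listed in `FAIL_UNITS`:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for safe_restart_unit: "fails" for any unit named in
# FAIL_UNITS. It RETURNS non-zero instead of exiting, so each caller picks policy.
FAIL_UNITS="b"
fake_safe_restart() {
  case " $FAIL_UNITS " in *" $1 "*) return 1 ;; esac
  return 0
}

# Idle-job style caller: stop draining at the first failure, keep the shell alive.
drain_until_failure() {
  local unit done_units=""
  for unit in "$@"; do
    fake_safe_restart "$unit" || break
    done_units="$done_units$unit "
  done
  printf '%s\n' "${done_units% }"
}

drain_until_failure a b c   # only "a" is restarted; "c" is never attempted
```

A daily-job style caller would instead write `fake_safe_restart "$unit" || exit 1`, keeping the original exit-on-failure behavior without the library itself ever calling `exit`.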
---
## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals
**Files:**
- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225)
- [ ] **Step 1: Source the library; drop the now-shared helpers**
Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with the following, which keeps the autoupdate-only config and sources the lib for the shared bits:
```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
mkdir -p "$STATE_DIR" 2>/dev/null || true
```
(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**
Replace the `backup_all()` definition (lines 90–105) with:
```bash
ADMIN_SEED=""
backup_all() {
local u dst
for u in $(osusers); do
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```
Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165); they now come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).
- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**
Replace the step-6 loop body (lines 196–225) with:
```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done
```
- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**
Run:
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the only remaining defer/restart decisions are unchanged vs HEAD~1 logic:
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
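The `newer()` helper retained in Step 1 relies on version sort rather than plain string comparison, which matters once a numeric component gains a digit. The sketch below repeats the helper with made-up version strings (not real t3 releases) and assumes GNU `sort`, which provides `-V`.

```bash
#!/usr/bin/env bash
# Same shape as the newer() helper in this task; the version strings below are
# made-up examples, not real t3 releases. Requires GNU sort for -V.
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

newer 1.2.10 1.2.9 && echo "1.2.10 is newer than 1.2.9"
newer 1.2.9 1.2.10 || echo "1.2.9 is not newer than 1.2.10"
newer 1.2.9 1.2.9  || echo "equal versions are not newer"
```

A lexicographic comparison would rank `1.2.10` below `1.2.9`; version sort compares the numeric components, and the leading inequality test keeps the check strict so an already-current install is not treated as an upgrade.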
---
## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`
**Files:**
- Create: `tests/t3-migrate-idle-gate.test.sh`
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)
- [ ] **Step 1: Write the failing test**
Create `tests/t3-migrate-idle-gate.test.sh`:
```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running
pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }
# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe
# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"
# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"
# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500
# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"
echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```
- [ ] **Step 2: Run it to verify it fails**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).
- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**
```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
: # drain loop added in Task 4
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```
- [ ] **Step 4: Run the test to verify it passes**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (exit 0).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
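The fixture tests in Step 1 bracket the quiet buffer (100 s and 1000 s) but do not pin the exact boundary. The following is a minimal sketch of that boundary, assuming the `gate_is_safe` body from Step 3 is copied unchanged. The `check` helper is hypothetical and exists only for this illustration. Because the final comparison uses `-ge`, an idle time equal to `QUIET_SECONDS` counts as safe, and any value that fails to parse falls through to unsafe.

```bash
#!/usr/bin/env bash
# Boundary sketch for the idle gate. gate_is_safe is copied from Step 3;
# `check` is a hypothetical helper used only for this illustration.
QUIET_SECONDS=900

gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

# check <active> <idle>: print the gate's verdict for one input pair.
check() { if gate_is_safe "$1" "$2" 2>/dev/null; then echo safe; else echo unsafe; fi; }

check 0 899     # one second short of the buffer
check 0 900     # exactly the buffer
check 0 abc     # garbage idle value
check "" 1000   # empty active count
```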
---
## Task 4: The marker-drain loop in `t3-migrate-idle.sh`
**Files:**
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)
- [ ] **Step 1: Implement `main()` (the drain loop)**
Replace the `main() { : ; }` skeleton with:
```bash
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).
- [ ] **Step 3: Syntax + lint**
Run: `bash -n scripts/t3-migrate-idle.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-migrate-idle.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
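The "already restarted after the deferral" branch in `main()` reduces to an integer comparison between the unit's start time and the marker's mtime. The following is a minimal sketch of that decision as a pure function. `marker_is_stale` is a hypothetical name that the plan does not define, and the epochs are made-up values. Non-numeric input is treated as "not stale", so the marker is kept and still has to pass the idle gate, which matches the `|| echo 0` fallbacks in the loop.

```bash
#!/usr/bin/env bash
# marker_is_stale <unit_started_epoch> <marker_mtime_epoch>   (hypothetical helper)
# Succeeds when the unit started strictly after the marker was written, i.e. the
# deferral was already satisfied by some other restart.
marker_is_stale() {
  local started="$1" mwritten="$2"
  case "$started"  in ''|*[!0-9]*) return 1;; esac   # unparseable -> keep the marker
  case "$mwritten" in ''|*[!0-9]*) return 1;; esac   # unparseable -> keep the marker
  [ "$started" -gt "$mwritten" ]
}

marker_is_stale 1750000500 1750000000 && echo "clear marker"   # restarted after the deferral
marker_is_stale 1750000000 1750000500 || echo "keep marker"    # still on the old binary
marker_is_stale 0 1750000500          || echo "keep marker"    # start time unknown (fallback 0)
```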
---
## Task 5: systemd units
**Files:**
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`
- [ ] **Step 1: Create the service unit**
`scripts/t3-migrate-idle.service`:
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Create the timer unit**
`scripts/t3-migrate-idle.timer`:
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target
```
- [ ] **Step 3: Validate unit syntax**
Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).
- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**
Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01-05).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
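As a cross-check on `01..05:00/20`, the slots it should produce can be enumerated by hand. The following is a minimal sketch, assuming the expression means minutes 00, 20 and 40 of hours 01 through 05, which is what the 01:00-05:40 window in the commit subject implies. The `slots` function is hypothetical, and `systemd-analyze calendar` remains the authoritative check.

```bash
#!/usr/bin/env bash
# Enumerate the overnight slots implied by OnCalendar=*-*-* 01..05:00/20.
# Assumption: hours 01..05 inclusive, minutes starting at 00 and stepping by 20.
slots() {
  local h m
  for h in 1 2 3 4 5; do
    for m in 0 20 40; do
      printf '%02d:%02d\n' "$h" "$m"
    done
  done
}

slots | head -n 1   # first slot of the night
slots | tail -n 1   # last slot of the night
slots | wc -l       # number of ticks per night
```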
---
## Task 6: Wire into `setup-devvm.sh`
**Files:**
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
- [ ] **Step 1: Install the lib + the new script (section 9a)**
After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
```bash
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Install the unit files (section 9d loop)**
Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
```bash
t3-migrate-idle.service t3-migrate-idle.timer \
```
- [ ] **Step 3: Enable the timer (section 9 enable line)**
Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
```bash
systemctl enable --now t3-dispatch.service \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
```
- [ ] **Step 4: Syntax check**
Run: `bash -n scripts/workstation/setup-devvm.sh`
Expected: no syntax errors.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
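The lib is installed with mode 0644 while the script gets 0755 because the lib is only ever sourced, and sourcing needs read permission rather than the execute bit. The following is a minimal sketch in a scratch directory. The file name `demo-lib.sh` and the function `demo_hello` are hypothetical stand-ins, nothing under `/usr/local` is touched, and `stat -c` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Show that a 0644 file can be sourced; only executed scripts need 0755.
scratch="$(mktemp -d)"
trap 'rm -rf "$scratch"' EXIT

# Hypothetical stand-in for t3-safe-restart.sh.
printf '%s\n' 'demo_hello() { echo "hello from sourced lib"; }' >"$scratch/demo-lib.src"
install -m 0644 "$scratch/demo-lib.src" "$scratch/demo-lib.sh"

stat -c '%a' "$scratch/demo-lib.sh"   # print the octal mode of the installed copy
. "$scratch/demo-lib.sh"              # sourcing works without the execute bit
demo_hello
```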
---
## Task 7: Deploy to the devvm + validate (dry-run first)
**Files:** none (operational). Presence-claimed, shared-host mutation.
- [ ] **Step 1: Claim the host**
Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).
- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**
Run:
```bash
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
sudo systemctl daemon-reload
```
Expected: no errors.
- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**
The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib:
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do"
```
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).
- [ ] **Step 3: DRY-RUN the idle migrator against live state**
Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
Expected: NO restart. No deferral marker has been written yet, so `wizard` is not considered in this pass: the run either exits 0 silently (the deferred directory does not exist) or logs `idle-migrate pass complete (migrated=0 skipped=0)`. The busy/idle distinction only shows up once a marker is seeded in Step 4.
- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**
The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt:
```bash
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
```
Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting.
- [ ] **Step 5: Enable the timer (live)**
Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
Expected: timer active, next elapse in the 01:00-05:40 window.
- [ ] **Step 6: Release the claim**
Run: `homelab release host:devvm`
> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).)
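
For the next-session check, the counts can be pulled out of the summary line that `main()` logs. The following is a minimal sketch, assuming the line keeps the exact `idle-migrate pass complete (migrated=N skipped=M)` shape from Task 4. `parse_pass_summary` is a hypothetical helper, and the sample line is illustrative rather than captured journal output.

```bash
#!/usr/bin/env bash
# parse_pass_summary <log line>   (hypothetical helper)
# Echoes "<migrated> <skipped>", or fails if the line is not a pass summary.
parse_pass_summary() {
  local re='idle-migrate pass complete \(migrated=([0-9]+) skipped=([0-9]+)\)'
  [[ "$1" =~ $re ]] || return 1
  echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
}

# Illustrative input only (not real journal output).
line='t3-migrate-idle: idle-migrate pass complete (migrated=1 skipped=2)'
read -r migrated skipped < <(parse_pass_summary "$line")
echo "migrated=$migrated skipped=$skipped"
```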
---
## Task 8: Docs
**Files:**
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
- Modify: `.claude/reference/service-catalog.md` (add the unit)
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)
- [ ] **Step 1: Runbook** — add a section after the autoupdate description:
```markdown
## Idle migrator (`t3-migrate-idle.timer`)
`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
`t3-migrate-idle` (overnight, every 20 min 01:00-05:40) drains those markers:
it restarts a deferred instance onto the current binary only when that user's
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
user's DB + freezes + alerts (the binary rollback is a no-op since the build was
already accepted); the user's server may crashloop on the restored DB until the
freeze is cleared. Investigate per the rollback section above.
```
- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).
- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 9: Land
- [ ] **Step 1: Merge latest master into the branch**
Run:
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" fetch forgejo
git "${GC[@]}" merge --no-edit forgejo/master
```
Expected: clean merge (the files are new or autoupdate-only). If conflicts do appear, resolve them before continuing.
- [ ] **Step 2: Re-run the gate tests post-merge**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0`.
- [ ] **Step 3: Push to master**
Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
Expected: accepted. Non-fast-forward → fetch/merge/retry.
- [ ] **Step 4: Watch CI to completion**
Run: `homelab ci watch`
Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it).
- [ ] **Step 5: Clean up the worktree**
Run (from the main checkout):
```bash
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
```
---
## Self-review
- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism).
- **Placeholders:** none — every file has complete content; every command has expected output.
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.


@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen`
Alertmanager → Slack.
## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`)
Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00-05:40) drains those markers:
- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle**: `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity. It then verifies pairing and clears the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick.
- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too.
- **Force / preview:**
```bash
sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated)
sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing
```
- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):**


@ -21,7 +21,7 @@
# - canary rollout: restart idle instances ONE AT A TIME, verifying pairing
# through the real dispatch after each, and roll back (binary + that user's DB)
# + self-freeze on the first failure — active-agent instances are deferred,
# never killed;
# never killed (deferred instances are recorded for t3-migrate-idle to drain);
# - rollback target is the recorded LAST-GOOD build, not "whatever was installed".
# Detection backstop (real-user pairing failure/fallback) lives in the dispatch
# logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*).
@ -29,24 +29,17 @@
# Full procedure + manual rollback: docs/runbooks/t3-version-bump.md.
set -uo pipefail
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}"
STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}"
LAST_GOOD_FILE="$STATE_DIR/last-good"
BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}"
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}"
USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
@ -86,27 +79,21 @@ LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_
# ---- helpers: backup, health-check, rollback, restart-verify --------------------
# Online consistent per-user snapshot (run AS the owner so WAL stays owned; never
# stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health
# check. Mirrors t3-backup-state.sh.
# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.)
ADMIN_SEED=""
backup_all() {
local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)"
local u dst
for u in $(osusers); do
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst"
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
# newest pre-bump backup taken THIS run for a user (for restore-on-rollback).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a
# real populated DB if given, so the forward migration runs on real data), then do
# the real mint -> credential-exchange -> t3_session pairing handshake with the
@ -143,27 +130,12 @@ health_check() {
rm -rf "$dir"; return 1
}
# roll the GLOBAL binary back to last-good. Pre-restart failures need only this
# (no real DB migrated yet); post-restart failures also restore the user's DB.
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those.
unit_busy() {
local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)"
[ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# ---- 3. DRY RUN: preview only (install candidate to temp prefix, gate it) -------
if [ "$DRY_RUN" = "1" ]; then
LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)"
@ -196,31 +168,15 @@ restarted=0; deferred=0
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
ok=0
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1))
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
exit 1
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done


@ -0,0 +1,8 @@
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle


@ -0,0 +1,86 @@
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi

@ -0,0 +1,10 @@
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
# Fires at :00, :20 and :40 of hours 01-05 (01:00 .. 05:40). Check with: systemd-analyze calendar '*-*-* 01..05:00/20'
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target

@ -0,0 +1,96 @@
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (honour the original T3_* override names) -----------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}}"
: "${STATE_DIR:=${T3_STATE_DIR:-/var/lib/t3-autoupdate}}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=${T3_BACKUP_DEST:-/var/backups/t3-state}}"
: "${DISPATCH:=${T3_DISPATCH:-127.0.0.1:3780}}"
: "${USER_MAP:=${T3_USER_MAP:-/etc/ttyd-user-map}}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}

@ -162,6 +162,8 @@ fi
SCRIPTS="$HERE/.."
# 9a) scripts the units exec (t3-provision-users already deployed in section 6)
install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh # sourced lib (t3-autoupdate + t3-migrate-idle)
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
install -m 0755 "$SCRIPTS/t3-backup-state.sh" /usr/local/bin/t3-backup-state
install -m 0755 "$SCRIPTS/t3-mint" /usr/local/bin/t3-mint
install -m 0755 "$HERE/claude-auth-sync.sh" /usr/local/bin/claude-auth-sync
@ -198,6 +200,7 @@ fi
for u in t3-serve@.service \
claude-auth-sync@.service claude-auth-sync@.timer \
t3-autoupdate.service t3-autoupdate.timer \
t3-migrate-idle.service t3-migrate-idle.timer \
t3-backup-state.service t3-backup-state.timer \
t3-provision-users.service t3-provision-users.timer \
t3-dispatch.service; do
@ -216,7 +219,7 @@ done
log "playwright: template units + snapshot-refresh script installed (per-user enable in provisioner)"
systemctl daemon-reload
systemctl enable --now t3-dispatch.service \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer >/dev/null 2>&1 || \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
log "service units installed + enabled (t3-dispatch + 4 timers; t3-serve@ per-user)"

@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running
pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }
# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe
# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"
# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"
# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500
# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"; ok test -z "${res##*|}"
echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]