Compare commits
10 commits
ddbdbca7e9...92ff0b92f1
| Author | SHA1 | Date | |
|---|---|---|---|
| | 92ff0b92f1 | | |
| | 5a136c7d53 | | |
| | 334d8fee5d | | |
| | 3cf09a0fe3 | | |
| | af9f7be297 | | |
| | 06e400522f | | |
| | de97696ff0 | | |
| | 2ab5b94748 | | |
| | 0cebeeb0ee | | |
| | 9503bed589 | | |
11 changed files with 1148 additions and 63 deletions
File diff suppressed because one or more lines are too long
140
docs/plans/2026-06-21-t3-idle-migrate-design.md
Normal file
@@ -0,0 +1,140 @@
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
|
||||
|
||||
- **Date:** 2026-06-21
|
||||
- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending)
|
||||
- **Owner:** Viktor (wizard)
|
||||
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
|
||||
|
||||
## Goal
|
||||
|
||||
When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely — their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
|
||||
|
||||
Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns.
|
||||
|
||||
## Background — why the skew persists (root cause, verified 2026-06-21)
|
||||
|
||||
- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6).
|
||||
- Its idle check is coarse — `unit_busy()`:
|
||||
```sh
pid=$(systemctl show -p MainPID --value "$unit")
pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode'
```
|
||||
i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window.
|
||||
- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then.
|
||||
- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`.
|
||||
|
||||
## Decisions (from brainstorm 2026-06-21)
|
||||
|
||||
1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart.
|
||||
2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing.
|
||||
3. **Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic.
|
||||
4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*.
|
||||
|
||||
## Constraints (load-bearing)
|
||||
|
||||
1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery.
|
||||
2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL.
|
||||
3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt.
|
||||
4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today.
|
||||
5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim.
|
||||
|
||||
## Design
|
||||
|
||||
### Components
|
||||
|
||||
Four new files in `scripts/` + a one-line addition to the existing job:
|
||||
|
||||
1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`:
|
||||
pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`.
|
||||
**Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical.
|
||||
|
||||
2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below).
|
||||
|
||||
3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.)
|
||||
|
||||
4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks:
|
||||
```ini
[Timer]
# Fires 01:00, 01:20, …, 05:40; none at/after 06:00. System TZ (UTC); tune the window.
# (systemd has no inline comments, so each comment sits on its own line.)
OnCalendar=*-*-* 01..05:00/20
# Never replay a missed migrate-restart at an unpredictable time.
Persistent=false
RandomizedDelaySec=120
```
|
||||
|
||||
5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral:
|
||||
```sh
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u"   # NEW
deferred=$((deferred+1)); continue
```
|
||||
where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction.
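To make the timer window in item 4 concrete, the sketch below enumerates the nominal fire times implied by `OnCalendar=*-*-* 01..05:00/20` (hours 01 through 05, minutes 00, 20 and 40). The helper name `idle_ticks` is hypothetical and `RandomizedDelaySec` jitter is ignored; on a host with systemd, `systemd-analyze calendar` remains the authoritative check.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): list the nominal ticks of the
# overnight window, ignoring the RandomizedDelaySec jitter.
idle_ticks() {
  local h m
  for h in 1 2 3 4 5; do        # hours 01..05
    for m in 0 20 40; do        # minute 00, then every 20 minutes
      printf '%02d:%02d\n' "$h" "$m"
    done
  done
}

idle_ticks | head -n 1   # first tick of the night
idle_ticks | tail -n 1   # last tick of the night
idle_ticks | wc -l       # number of ticks per night
```

That is 15 chances per night to find a quiet gap, which is why a single busy moment at one tick only costs 20 minutes rather than a whole day.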
|
||||
|
||||
### Why a deferral marker (not version-introspection)
|
||||
|
||||
The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
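The marker lifecycle described above can be sketched as three small helpers. The function names (`record_deferral`, `owed_users`, `clear_deferral`) are hypothetical and only illustrate the idea; the real scripts inline the equivalent `printf` and `rm` calls.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the marker lifecycle; names are not from the repo.
DEFER_DIR="${DEFER_DIR:-/var/lib/t3-autoupdate/deferred}"

# record_deferral <user> <target>: note that <user> owes a restart.
record_deferral() {
  mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" > "$DEFER_DIR/$1"
}

# owed_users: print one user per line; prints nothing when no marker exists.
owed_users() {
  local marker
  for marker in "$DEFER_DIR"/*; do
    [ -f "$marker" ] || continue   # unmatched glob or non-file: skip
    basename "$marker"
  done
}

# clear_deferral <user>: call only after the restart's pairing is verified.
clear_deferral() { rm -f "$DEFER_DIR/$1"; }
```

Because the set of owed users is just the directory listing, an empty or missing directory naturally makes the idle job a no-op.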
|
||||
|
||||
### Control flow of `t3-migrate-idle` (per tick)
|
||||
|
||||
```
for marker in $DEFER_DIR/*:                                # nothing deferred → no-op
  user = basename(marker); unit = t3-serve@<user>.service
  [ unit is an active running service ] or { rm marker; continue }    # gone
  if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue   # already restarted (manual/other) → just clear
  if not safe_to_restart(user): continue                   # mid-turn or not quiet → try next tick
  target = contents(marker)
  if safe_restart_unit(unit, target): rm marker            # success: verified on new binary
  else:                                                    # helper already restored DB + rolled back binary + froze + alerted
    break                                                  # frozen: stop draining; a human investigates
```
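The "already restarted" branch of the loop above compares the unit's start time with the marker's mtime. A minimal sketch of that comparison follows; `marker_is_stale` is a hypothetical name, and the way the start epoch is obtained (shown in a comment) assumes GNU `date` can parse the `ActiveEnterTimestamp` string that `systemctl show` prints.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not the repo's implementation.
# marker_is_stale <unit_start_epoch> <marker_file>
# Returns 0 when the unit (re)started after the marker was written, i.e. the
# marker can simply be cleared. Any doubt (unknown start time, unreadable
# marker) returns non-zero so the marker is kept.
marker_is_stale() {
  local started="$1" marker="$2" written
  case "$started" in ''|*[!0-9]*) return 1 ;; esac
  written="$(stat -c %Y "$marker" 2>/dev/null)" || return 1
  [ "$started" -gt "$written" ]
}

# Assumed way to obtain the start epoch on the devvm (GNU date):
#   started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit")" +%s)"
```

Keeping the marker on doubt is deliberate: a kept marker only leads to another gated attempt at the next tick, while a wrongly cleared one would leave the instance stale until the next deferral.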
|
||||
|
||||
### `safe_to_restart(user)` — the gate
|
||||
|
||||
Single read-only query, run as the user:
|
||||
|
||||
```sh
runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" "
  SELECT
    (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
    CAST((julianday('now')
          - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
  FROM projection_thread_sessions;"
```
|
||||
|
||||
- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.)
|
||||
- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing.
|
||||
- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3).
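The three rules above reduce to a small fail-closed predicate over the two query columns. The sketch below is one possible shape, not the repository's implementation; it assumes the caller passes an empty string for a NULL idle value, and it treats anything that is not a non-negative integer as unsafe.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the gate predicate; not the repository's implementation.
QUIET_SECONDS="${QUIET_SECONDS:-900}"

# gate_is_safe <active_turns> <idle_seconds>
# Returns 0 (safe) only when no turn is in flight AND the instance has been
# quiet for at least QUIET_SECONDS. An empty idle value stands for SQL NULL.
gate_is_safe() {
  local active="$1" idle="$2"
  # Fail closed: the active-turn count must be a plain non-negative integer.
  case "$active" in ''|*[!0-9]*) return 1 ;; esac
  [ "$active" -eq 0 ] || return 1
  # NULL idle (no threads at all) is safe.
  [ -n "$idle" ] || return 0
  # Fail closed on anything else that is not a non-negative integer
  # (this also rejects a negative idle caused by clock skew).
  case "$idle" in *[!0-9]*) return 1 ;; esac
  [ "$idle" -ge "$QUIET_SECONDS" ]
}
```

Splitting the numeric decision from the SQLite call keeps the predicate testable without a database: a failed or garbled query simply hands it values it rejects.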
|
||||
|
||||
### Failure recovery
|
||||
|
||||
Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option.
|
||||
|
||||
### Observability
|
||||
|
||||
- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert.
|
||||
- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped.
|
||||
|
||||
### Delivery
|
||||
|
||||
- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units:
|
||||
- `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh`
|
||||
- `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle`
|
||||
- add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`)
|
||||
- add `t3-migrate-idle.timer` to the `systemctl enable --now` list
|
||||
- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm.
|
||||
- No Terraform (hand-managed VM 102).
|
||||
|
||||
## Testing
|
||||
|
||||
- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible.
|
||||
- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live.
|
||||
- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor.
|
||||
|
||||
## Out of scope (YAGNI)
|
||||
|
||||
- Daytime restarts / "around the clock" cadence (de-scoped: overnight only).
|
||||
- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility).
|
||||
- Per-user opt-out file (not needed — the job is self-limiting via markers).
|
||||
- Any change to how `t3-autoupdate` *installs/gates* a build.
|
||||
|
||||
## Open questions
|
||||
|
||||
None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard).
|
||||
729
docs/plans/2026-06-21-t3-idle-migrate-plan.md
Normal file
@@ -0,0 +1,729 @@
# t3 idle-migrate Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.
|
||||
|
||||
**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
|
||||
|
||||
**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).
|
||||
|
||||
**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.
|
||||
|
||||
---
|
||||
|
||||
## File structure
|
||||
|
||||
- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
|
||||
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
|
||||
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
|
||||
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
|
||||
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
|
||||
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
|
||||
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
|
||||
|
||||
**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct: a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall**; recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (the build was health-gated at install against a copy of wizard's real DB, which is the failing user's own DB only when that user is wizard), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Shared library `t3-safe-restart.sh`
|
||||
|
||||
**Files:**
|
||||
- Create: `scripts/t3-safe-restart.sh`
|
||||
|
||||
- [ ] **Step 1: Create the library**
|
||||
|
||||
```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").

# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"

LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
  local u="$1" src out dst ts
  src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
  ts="$(date +%Y%m%d-%H%M%S)"
  out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
  install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
  if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
    printf '%s\n' "$dst"; return 0
  fi
  rm -f "$dst"; return 1
}

# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }

# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
  LOG "rolling back binary $target -> $last_good"
  if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
  LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}

# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
  local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
  out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
  printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}

# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
  local unit="$1" u="$2" ok=0 _ bak
  systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
  for _ in $(seq 1 15); do
    if verify_pairing "$u"; then ok=1; break; fi
    sleep 2
  done
  if [ "$ok" = "1" ]; then
    LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
  fi
  LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
  rollback_binary
  bak="$(prebump_of "$u")"
  if [ -n "$bak" ]; then
    systemctl stop "$unit" 2>/dev/null
    if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
      rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
      LOG "restored $u state.sqlite from $bak"
    fi
    systemctl start "$unit" 2>/dev/null
  fi
  touch "$FREEZE_FILE" 2>/dev/null
  LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
  return 1
}
```
|
||||
|
||||
- [ ] **Step 2: Syntax + lint check**
|
||||
|
||||
Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")`
|
||||
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.)
|
||||
|
||||
- [ ] **Step 3: Source-and-define smoke test**
|
||||
|
||||
Run:
|
||||
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
|
||||
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate

Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
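Before moving on, it can help to see how the library's two map helpers read `USER_MAP`. The sketch below copies `osusers` and `ak_for` from Step 1 and runs them against an invented sample map; the authentik names in it are made up for illustration and do not reflect the real `/etc/ttyd-user-map`.

```bash
#!/usr/bin/env bash
# osusers and ak_for are copied from the library in Step 1.
# The sample map below is invented for illustration only.
USER_MAP="$(mktemp)"
cat > "$USER_MAP" <<'EOF'
# authentik = os_user
v.example = wizard
alice=alice
second-login = wizard
EOF

osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

osusers          # distinct OS users, sorted: alice, then wizard
ak_for wizard    # first authentik name mapped to wizard: v.example
ak_for nobody    # prints nothing: no mapping
```

The last case matters for `verify_pairing`: with no mapping it logs and returns success, so an unmapped OS user is restarted without a dispatch check.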
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225)
|
||||
|
||||
- [ ] **Step 1: Source the library; drop the now-shared helpers**
|
||||
|
||||
Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with — keep the autoupdate-only config, source the lib for the shared bits:
|
||||
|
||||
```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}"     # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}"                # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}"    # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it

LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"

# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

mkdir -p "$STATE_DIR" 2>/dev/null || true
```
|
||||
|
||||
(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
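The `newer()` helper kept in this step relies on version sort rather than string comparison. A short sketch with made-up version numbers (not real t3 releases) shows the difference; it assumes a `sort` that supports `-V`, as GNU coreutils does.

```bash
#!/usr/bin/env bash
# newer() is copied from the block above; the versions are illustrative only.
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

newer 1.10.0 1.9.0 && echo "1.10.0 is newer than 1.9.0"
newer 1.9.0 1.10.0 || echo "1.9.0 is not newer than 1.10.0"
newer 1.9.0 1.9.0  || echo "equal versions are not newer"
```

A plain string comparison would rank `1.9.0` above `1.10.0`, which is the mistake the version sort avoids.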
|
||||
|
||||
- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**
|
||||
|
||||
Replace the `backup_all()` definition (lines 90–105) with:
|
||||
|
||||
```bash
ADMIN_SEED=""
backup_all() {
  local u dst
  for u in $(osusers); do
    if dst="$(backup_user "$u")"; then
      LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
      [ "$u" = "wizard" ] && ADMIN_SEED="$dst"
    else
      LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
    fi
  done
  [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```
|
||||
|
||||
Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165) — they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).
|
||||
|
||||
- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**
|
||||
|
||||
Replace the step-6 loop body (lines 196–225) with:
|
||||
|
||||
```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
  u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
  if unit_busy "$unit"; then
    LOG "deferring $unit (active agent) — migrates on its next idle restart"
    mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u"   # record for t3-migrate-idle
    deferred=$((deferred+1)); continue
  fi
  if safe_restart_unit "$unit" "$u"; then
    restarted=$((restarted+1))
    rm -f "$DEFER_DIR/$u" 2>/dev/null   # now current — clear any stale marker
  else
    exit 1   # frozen by safe_restart_unit — preserve today's behavior
  fi
done
```
|
||||
|
||||
- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**
|
||||
|
||||
Run:
|
||||
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the remaining defer/restart decisions are unchanged vs the committed (HEAD) version:
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
|
||||
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic.
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals

Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`
|
||||
|
||||
**Files:**
|
||||
- Create: `tests/t3-migrate-idle-gate.test.sh`
|
||||
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)
|
||||
|
||||
- [ ] **Step 1: Write the failing test**
|
||||
|
||||
Create `tests/t3-migrate-idle-gate.test.sh`:
|
||||
|
||||
```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)"   # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh"        # defines functions; main-guard prevents the drain from running

pass=0; fail=0
ok()   { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }

# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok    gate_is_safe 0 1000   # idle, quiet long enough -> safe
notok gate_is_safe 1 1000   # a turn in flight -> unsafe
notok gate_is_safe 0 100    # idle but not quiet enough -> unsafe
ok    gate_is_safe 0 ""     # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000   # unparseable active -> unsafe
notok gate_is_safe 0 -30    # negative idle (clock skew) -> unsafe

# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
  local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
  while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"

# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"

# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500

# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"

echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```
|
||||
|
||||
- [ ] **Step 2: Run it to verify it fails**
|
||||
|
||||
Run: `bash tests/t3-migrate-idle-gate.test.sh`
|
||||
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).
|
||||
|
||||
- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**
|
||||
|
||||
```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail

LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"

QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}"   # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"

# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
  local db="$1"
  sqlite3 -batch -noheader -separator '|' "$db" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;"
}

# safe_to_restart <user>: run the same SQL as gate_query, but as the user via runuser
# against a read-only file: URI of their DB, then pass the row to gate_is_safe.
safe_to_restart() {
  local u="$1" db row
  db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
  row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;" 2>/dev/null)" || return 1
  gate_is_safe "${row%%|*}" "${row##*|}"
}

main() {
  :   # drain loop added in Task 4
}

# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```
|
||||
|
||||
- [ ] **Step 4: Run the test to verify it passes**
|
||||
|
||||
Run: `bash tests/t3-migrate-idle-gate.test.sh`
|
||||
Expected: `PASS=10 FAIL=0` (exit 0).
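The fixture tests use idle values well away from the threshold, so the exact boundary of the quiet buffer is easy to overlook. The sketch below copies `gate_is_safe` from Step 3 and probes it at and around `QUIET_SECONDS`. The `probe` helper and the probe values are illustrative assumptions, not part of the plan.

```bash
#!/usr/bin/env bash
# Sketch: probe the quiet-buffer boundary of gate_is_safe.
# gate_is_safe is copied from Step 3; probe() is a hypothetical helper.
QUIET_SECONDS=900

gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

probe() {  # probe <active> <idle>: print the gate's verdict
  if gate_is_safe "$1" "$2"; then echo "active=$1 idle=$2 -> safe"
  else echo "active=$1 idle=$2 -> unsafe"; fi
}

probe 0 899    # one second short of the buffer
probe 0 900    # exactly the buffer: -ge admits it
probe 1 5000   # long quiet, but a turn is in flight
```

Because the comparison is `-ge`, an instance that has been quiet for exactly 900 seconds is treated as safe; an in-flight turn overrides any amount of quiet time.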
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
|
||||
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
|
||||
|
||||
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
|
||||
active_turn_id AND the most-recent thread activity is older than the quiet buffer
|
||||
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
|
||||
the boundaries against fixture DBs (no root/bats/Docker).
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: The marker-drain loop in `t3-migrate-idle.sh`
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)
|
||||
|
||||
- [ ] **Step 1: Implement `main()` (the drain loop)**
|
||||
|
||||
Replace the `main() { : ; }` skeleton with:
|
||||
|
||||
```bash
main() {
  # a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
  if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
  [ -d "$DEFER_DIR" ] || exit 0   # nothing deferred
  last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)"   # rollback target for the helper

  local marker u unit started mwritten migrated=0 skipped=0
  for marker in "$DEFER_DIR"/*; do
    [ -e "$marker" ] || continue   # empty-dir glob
    u="$(basename "$marker")"; unit="t3-serve@$u.service"
    if ! systemctl is-active --quiet "$unit"; then
      LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
    fi
    started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
    mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
    if [ "$started" -gt "$mwritten" ]; then
      LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
    fi
    if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi

    target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
    if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
    if ! backup_user "$u" >/dev/null; then
      LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
    fi
    if safe_restart_unit "$unit" "$u"; then
      LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
    else
      LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
    fi
  done
  LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
|
||||
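The drain loop treats a marker as already satisfied when the unit's start time is newer than the marker's mtime. A minimal sketch of that comparison in isolation follows, assuming GNU `stat` and `touch` (the plan already relies on `stat -c %Y`). `marker_is_stale` is a hypothetical helper name and the epoch values are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: the "already restarted after the deferral" test from the drain loop.
# marker_is_stale is a hypothetical name; the epochs below are illustrative.

marker_is_stale() {  # marker_is_stale <unit_start_epoch> <marker_file>
  local started="$1" marker="$2" mwritten
  mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"   # marker mtime, 0 if unreadable
  [ "$started" -gt "$mwritten" ]                             # unit started after the deferral?
}

demo_marker="$(mktemp)"
touch -d '@1000' "$demo_marker"   # pretend the deferral was recorded at epoch 1000

if marker_is_stale 2000 "$demo_marker"; then echo "start=2000: restarted since deferral, clear marker"; fi
if ! marker_is_stale 500 "$demo_marker"; then echo "start=500: still the old process, keep marker"; fi
if ! marker_is_stale 0 "$demo_marker"; then echo "start=0 (unparsed timestamp): keep marker"; fi
rm -f "$demo_marker"
```

The last case mirrors the `|| echo 0` fallback in the loop: a start time that cannot be parsed compares as older than any marker, so the marker is kept and the instance still has to pass the idle gate.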
|
||||
- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**
|
||||
|
||||
Run: `bash tests/t3-migrate-idle-gate.test.sh`
|
||||
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).
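The regression check above depends on the main-guard behaving differently when the file is executed and when it is sourced. The sketch below reproduces that pattern on a throwaway file. `demo_lib.sh`, `greet`, and `guard_demo` are hypothetical names used only for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: a main-guarded script runs main only when executed, not when sourced.
# demo_lib.sh and its functions are hypothetical and live in a temp dir.

guard_demo() {
  local dir; dir="$(mktemp -d)"
  cat >"$dir/demo_lib.sh" <<'EOF'
#!/usr/bin/env bash
greet() { echo "hello from greet"; }
main() { echo "main ran"; }
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
EOF
  # Executed: $0 equals BASH_SOURCE[0], so the guard calls main.
  echo "executed: $(bash "$dir/demo_lib.sh")"
  # Sourced: $0 is the caller ("_" here), so only the definitions are loaded.
  echo "sourced: $(bash -c '. "$1"; greet' _ "$dir/demo_lib.sh")"
  rm -rf "$dir"
}

guard_demo
```

This is why the test file can source `t3-migrate-idle.sh` and call `gate_is_safe` directly without the drain loop ever touching systemd.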
|
||||
|
||||
- [ ] **Step 3: Syntax + lint**
|
||||
|
||||
Run: `bash -n scripts/t3-migrate-idle.sh && if command -v shellcheck >/dev/null; then shellcheck -x scripts/t3-migrate-idle.sh; else echo "shellcheck absent, skipped"; fi`
|
||||
Expected: no syntax errors.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" add scripts/t3-migrate-idle.sh
|
||||
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe
|
||||
|
||||
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
|
||||
gone or was already restarted after the deferral; otherwise, when the idle gate is
|
||||
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
|
||||
clearing the marker on verified success. DRY_RUN logs decisions without acting.
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: systemd units
|
||||
|
||||
**Files:**
|
||||
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`
|
||||
|
||||
- [ ] **Step 1: Create the service unit**
|
||||
|
||||
`scripts/t3-migrate-idle.service`:
|
||||
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```
|
||||
|
||||
- [ ] **Step 2: Create the timer unit**
|
||||
|
||||
`scripts/t3-migrate-idle.timer`:
|
||||
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)

[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false

[Install]
WantedBy=timers.target
```
|
||||
|
||||
- [ ] **Step 3: Validate unit syntax**
|
||||
|
||||
Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
|
||||
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).
|
||||
|
||||
- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**
|
||||
|
||||
Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
|
||||
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01–05).
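As a cross-check on the calendar expression, the nominal slots can be enumerated by hand: hours 01 through 05, minutes 00, 20 and 40. The sketch below does only that arithmetic; it does not call systemd, and it ignores the `RandomizedDelaySec=120` jitter that shifts each real elapse by up to two minutes. `slots` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: enumerate the nominal slots of OnCalendar=*-*-* 01..05:00/20.
# Pure arithmetic, no systemd involved; slots() is a hypothetical helper.

slots() {
  local h m
  for h in 1 2 3 4 5; do          # hour range 01..05
    for m in 0 20 40; do          # minute 00, then every 20 minutes
      printf '%02d:%02d\n' "$h" "$m"
    done
  done
}

slots | tr '\n' ' '; echo
```

That gives 15 passes per night, the first at 01:00 and the last at 05:40, which matches the `01:00-05:40, /20` wording in the commit message.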
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
|
||||
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Wire into `setup-devvm.sh`
|
||||
|
||||
**Files:**
|
||||
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
|
||||
|
||||
- [ ] **Step 1: Install the lib + the new script (section 9a)**
|
||||
|
||||
After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
|
||||
```bash
|
||||
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
|
||||
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Install the unit files (section 9d loop)**
|
||||
|
||||
Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
|
||||
```bash
|
||||
t3-migrate-idle.service t3-migrate-idle.timer \
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Enable the timer (section 9 enable line)**
|
||||
|
||||
Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
|
||||
```bash
|
||||
systemctl enable --now t3-dispatch.service \
|
||||
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
|
||||
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
|
||||
```
|
||||
|
||||
- [ ] **Step 4: Syntax check**
|
||||
|
||||
Run: `bash -n scripts/workstation/setup-devvm.sh`
|
||||
Expected: no syntax errors.
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
|
||||
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 7: Deploy to the devvm + validate (dry-run first)
|
||||
|
||||
**Files:** none (operational). Presence-claimed, shared-host mutation.
|
||||
|
||||
- [ ] **Step 1: Claim the host**
|
||||
|
||||
Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
|
||||
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).
|
||||
|
||||
- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
|
||||
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
|
||||
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
|
||||
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
|
||||
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
|
||||
sudo systemctl daemon-reload
|
||||
```
|
||||
Expected: no errors.
|
||||
|
||||
- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**
|
||||
|
||||
The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib:
|
||||
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate   # safe: a same-version run exits at "already on <track>=<ver>; nothing to do"
```
|
||||
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).
|
||||
|
||||
- [ ] **Step 3: DRY-RUN the idle migrator against live state**
|
||||
|
||||
Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
|
||||
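Expected: NO restart in any case. No deferral marker exists yet (Step 4 seeds the first one), so there is nothing to drain: the run exits 0 silently if `/var/lib/t3-autoupdate/deferred/` is missing, or logs `idle-migrate pass complete (migrated=0 skipped=0)` if the directory exists but is empty. If a marker for wizard is already present, expect it counted in `skipped` while wizard is mid-turn, or `DRY_RUN: would migrate t3-serve@wizard …` if wizard happens to be idle+quiet.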
|
||||
|
||||
- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**
|
||||
|
||||
The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt:
|
||||
```bash
|
||||
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
|
||||
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
|
||||
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
|
||||
```
|
||||
Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting.
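The marker content is whatever the `awk`/`sed` pipeline extracts from `t3 --version`. The sketch below runs the same pipeline on sample strings to show what ends up in the file. The sample strings are assumptions for illustration (the real `t3 --version` output format is not shown in this plan), and `parse_ver` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Sketch: what the version pipeline from the seeding command extracts.
# The input strings are made-up samples, not real t3 output.

parse_ver() { awk '{print $NF}' | sed 's/^v//'; }   # last field, leading "v" stripped

echo "t3 v1.2.3" | parse_ver   # name plus v-prefixed version
echo "1.2.3" | parse_ver       # bare version passes through unchanged
printf '' | parse_ver          # no output at all yields an empty result
```

An empty result would produce a blank marker. The drain loop tolerates that case by falling back to `$(ver)` when the marker content is empty.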
|
||||
|
||||
- [ ] **Step 5: Enable the timer (live)**
|
||||
|
||||
Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
|
||||
Expected: timer active, next elapse in the 01:00–05:40 window.
|
||||
|
||||
- [ ] **Step 6: Release the claim**
|
||||
|
||||
Run: `homelab release host:devvm`
|
||||
|
||||
> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).)
|
||||
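For the next-session check, the per-pass summary lines can be totalled instead of read one by one. The sketch below parses the `idle-migrate pass complete (migrated=N skipped=M)` line that `main()` logs. `summarize_passes` is a hypothetical helper, and the sample lines (including the `1.2.3` version) are illustrative rather than real journal output.

```bash
#!/usr/bin/env bash
# Sketch: total the "pass complete" lines from the migrator's log.
# summarize_passes is a hypothetical helper; the sample lines are made up.

summarize_passes() {  # reads log lines on stdin
  sed -n 's/.*idle-migrate pass complete (migrated=\([0-9]*\) skipped=\([0-9]*\)).*/\1 \2/p' |
    awk '{m += $1; s += $2; n++} END {printf "passes=%d migrated=%d skipped=%d\n", n, m, s}'
}

printf '%s\n' \
  't3-migrate-idle: idle-migrate pass complete (migrated=0 skipped=1)' \
  't3-migrate-idle: migrated t3-serve@wizard.service -> 1.2.3 (idle restart)' \
  't3-migrate-idle: idle-migrate pass complete (migrated=1 skipped=0)' |
  summarize_passes   # prints: passes=2 migrated=1 skipped=1
```

On the host this would be fed from the journal, for example `journalctl -u t3-migrate-idle.service --since yesterday | summarize_passes`. A night with `migrated=0` and a non-zero `skipped` total suggests the user never had an idle+quiet gap.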
|
||||
---
|
||||
|
||||
## Task 8: Docs
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
|
||||
- Modify: `.claude/reference/service-catalog.md` (add the unit)
|
||||
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)
|
||||
|
||||
- [ ] **Step 1: Runbook** — add a section after the autoupdate description:
|
||||
|
||||
```markdown
|
||||
## Idle migrator (`t3-migrate-idle.timer`)
|
||||
|
||||
`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
|
||||
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
|
||||
`t3-migrate-idle` (overnight, every 20 min 01:00–05:40) drains those markers:
|
||||
it restarts a deferred instance onto the current binary only when that user's
|
||||
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
|
||||
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
|
||||
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
|
||||
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
|
||||
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
|
||||
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
|
||||
user's DB + freezes + alerts (the binary rollback is a no-op since the build was
|
||||
already accepted); the user's server may crashloop on the restored DB until the
|
||||
freeze is cleared. Investigate per the rollback section above.
|
||||
```
|
||||
|
||||
- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).
|
||||
|
||||
- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
|
||||
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status
|
||||
|
||||
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 9: Land
|
||||
|
||||
- [ ] **Step 1: Merge latest master into the branch**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
|
||||
git "${GC[@]}" fetch forgejo
|
||||
git "${GC[@]}" merge --no-edit forgejo/master
|
||||
```
|
||||
Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any.
|
||||
|
||||
- [ ] **Step 2: Re-run the gate tests post-merge**
|
||||
|
||||
Run: `bash tests/t3-migrate-idle-gate.test.sh`
|
||||
Expected: `PASS=10 FAIL=0`.
|
||||
|
||||
- [ ] **Step 3: Push to master**
|
||||
|
||||
Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
|
||||
Expected: accepted. Non-fast-forward → fetch/merge/retry.
|
||||
|
||||
- [ ] **Step 4: Watch CI to completion**
|
||||
|
||||
Run: `homelab ci watch`
|
||||
Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it).
|
||||
|
||||
- [ ] **Step 5: Clean up the worktree**
|
||||
|
||||
Run (from the main checkout):
|
||||
```bash
|
||||
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
|
||||
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-review
|
||||
|
||||
- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism).
|
||||
- **Placeholders:** none — every file has complete content; every command has expected output.
|
||||
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.
|
||||
|
|
@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
|
|||
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` →
|
||||
Alertmanager → Slack.
|
||||
|
||||
## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`)
|
||||
|
||||
Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00–05:40) drains those markers:
|
||||
|
||||
- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle** — `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity — then verify pairing and clear the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick.
|
||||
- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too.
|
||||
- **Force / preview:**
|
||||
```bash
|
||||
sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated)
|
||||
sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing
|
||||
```
|
||||
- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
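To see which users are still waiting for a migration, the marker directory can be listed with each marker's target version and age. This is a minimal sketch, assuming GNU `stat` and the marker layout described above (file name is the user, content is the target version). `list_deferrals` is a hypothetical helper, not part of the shipped scripts.

```bash
#!/usr/bin/env bash
# Sketch: list pending deferral markers with target version and age in seconds.
# list_deferrals is a hypothetical helper; the demo runs on a temp dir.

list_deferrals() {  # list_deferrals [dir]
  local dir="${1:-/var/lib/t3-autoupdate/deferred}" marker now age
  now="$(date +%s)"
  for marker in "$dir"/*; do
    [ -e "$marker" ] || continue                        # empty or missing dir
    age=$(( now - $(stat -c %Y "$marker") ))            # seconds since the deferral was recorded
    printf '%s target=%s age=%ss\n' \
      "$(basename "$marker")" "$(tr -d '[:space:]' <"$marker")" "$age"
  done
}

demo_dir="$(mktemp -d)"
printf '%s\n' "1.2.3" >"$demo_dir/wizard"   # illustrative user and version
list_deferrals "$demo_dir"
rm -rf "$demo_dir"
```

A marker whose age keeps growing across several nights is the case the optional marker-age gauge in the design was meant to surface.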
|
||||
|
||||
## Operations
|
||||
|
||||
**Freeze / revert (stop tracking right now — the fast "make it stop"):**
|
||||
|
|
|
|||
|
|
@ -21,7 +21,7 @@
|
|||
# - canary rollout: restart idle instances ONE AT A TIME, verifying pairing
|
||||
# through the real dispatch after each, and roll back (binary + that user's DB)
|
||||
# + self-freeze on the first failure — active-agent instances are deferred,
|
||||
# never killed;
|
||||
# never killed (deferred instances are recorded for t3-migrate-idle to drain);
|
||||
# - rollback target is the recorded LAST-GOOD build, not "whatever was installed".
|
||||
# Detection backstop (real-user pairing failure/fallback) lives in the dispatch
|
||||
# logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*).
|
||||
|
|
@ -29,24 +29,17 @@
|
|||
# Full procedure + manual rollback: docs/runbooks/t3-version-bump.md.
|
||||
set -uo pipefail
|
||||
|
||||
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
|
||||
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
|
||||
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
|
||||
FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}"
|
||||
STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}"
|
||||
LAST_GOOD_FILE="$STATE_DIR/last-good"
|
||||
BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}"
|
||||
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
|
||||
DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}"
|
||||
USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"
|
||||
DRY_RUN="${T3_DRY_RUN:-0}"
|
||||
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
|
||||
|
||||
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
|
||||
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
|
||||
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
|
||||
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
|
||||
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
|
||||
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
|
||||
LOG_TAG=t3-autoupdate
|
||||
# shellcheck source=scripts/t3-safe-restart.sh
|
||||
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
|
||||
|
||||
# is $1 a strictly-newer version than $2 (version-sort)?
|
||||
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
|
||||
|
||||
|
|
@ -86,27 +79,21 @@ LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_
|
|||
# ---- helpers: backup, health-check, rollback, restart-verify --------------------
|
||||
# Online consistent per-user snapshot (run AS the owner so WAL stays owned; never
|
||||
# stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health
|
||||
# check. Mirrors t3-backup-state.sh.
|
||||
# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.)
|
||||
ADMIN_SEED=""
|
||||
backup_all() {
|
||||
local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)"
|
||||
local u dst
|
||||
for u in $(osusers); do
|
||||
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue
|
||||
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
|
||||
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
|
||||
if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
|
||||
if dst="$(backup_user "$u")"; then
|
||||
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
|
||||
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
|
||||
else
|
||||
LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst"
|
||||
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
|
||||
fi
|
||||
done
|
||||
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
|
||||
}
|
||||
|
||||
# newest pre-bump backup taken THIS run for a user (for restore-on-rollback).
|
||||
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
|
||||
|
||||
# health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a
|
||||
# real populated DB if given, so the forward migration runs on real data), then do
|
||||
# the real mint -> credential-exchange -> t3_session pairing handshake with the
|
||||
|
|
@ -143,27 +130,12 @@ health_check() {
|
|||
rm -rf "$dir"; return 1
|
||||
}
|
||||
|
||||
# roll the GLOBAL binary back to last-good. Pre-restart failures need only this
|
||||
# (no real DB migrated yet); post-restart failures also restore the user's DB.
|
||||
rollback_binary() {
|
||||
LOG "rolling back binary $target -> $last_good"
|
||||
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
|
||||
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
|
||||
}
|
||||
|
||||
# is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those.
|
||||
unit_busy() {
|
||||
local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)"
|
||||
[ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'
|
||||
}
|
||||
|
||||
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
|
||||
verify_pairing() {
|
||||
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
|
||||
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
|
||||
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
|
||||
}
|
||||
|
||||
# ---- 3. DRY RUN: preview only (install candidate to temp prefix, gate it) -------
|
||||
if [ "$DRY_RUN" = "1" ]; then
|
||||
LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)"
|
||||
|
|
@@ -196,31 +168,15 @@ restarted=0; deferred=0
 for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
   u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
   if unit_busy "$unit"; then
-    LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue
+    LOG "deferring $unit (active agent) — migrates on its next idle restart"
+    mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u"   # record for t3-migrate-idle
+    deferred=$((deferred+1)); continue
   fi
-  systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
-  ok=0
-  for _ in $(seq 1 15); do
-    if verify_pairing "$u"; then ok=1; break; fi
-    sleep 2
-  done
-  if [ "$ok" = "1" ]; then
-    LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1))
+  if safe_restart_unit "$unit" "$u"; then
+    restarted=$((restarted+1))
+    rm -f "$DEFER_DIR/$u" 2>/dev/null   # now current — clear any stale marker
   else
-    LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
-    rollback_binary
-    bak="$(prebump_of "$u")"
-    if [ -n "$bak" ]; then
-      systemctl stop "$unit" 2>/dev/null
-      if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
-        rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
-        LOG "restored $u state.sqlite from $bak"
-      fi
-      systemctl start "$unit" 2>/dev/null
-    fi
-    touch "$FREEZE_FILE" 2>/dev/null
-    LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
-    exit 1
+    exit 1   # frozen by safe_restart_unit — preserve today's behavior
   fi
 done

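The deferral marker recorded in the hunk above is a one-line file, named after the OS user, whose content is the version the unit still has to move to. The following is a minimal sketch of that lifecycle under stated assumptions: `DEFER_DIR` points at a throwaway directory rather than the real state directory, and the helpers `write_marker` and `read_marker`, the user names, and the version strings are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: a temp dir stands in for the real $DEFER_DIR.
DEFER_DIR="$(mktemp -d)"

# write_marker <user> <target>: hypothetical helper mirroring the
# `mkdir -p ... && printf '%s\n' "$target" >"$DEFER_DIR/$u"` line in the hunk.
write_marker() { mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"; }

# read_marker <user>: hypothetical helper mirroring how the drainer reads the
# marker back (whitespace stripped, so the trailing newline is dropped).
read_marker() { tr -d '[:space:]' <"$DEFER_DIR/$1" 2>/dev/null; }

write_marker alice 1.2.3
echo "deferred target for alice: $(read_marker alice)"

# Once the unit runs the current binary the marker is simply removed.
rm -f "$DEFER_DIR/alice"
```

Keeping the marker as a plain file means both jobs can test for a pending deferral with nothing more than a glob over the directory.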
8
scripts/t3-migrate-idle.service
Normal file
@@ -0,0 +1,8 @@
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
86
scripts/t3-migrate-idle.sh
Normal file
@@ -0,0 +1,86 @@
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail

LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"

QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}"   # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"

# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
  local db="$1"
  sqlite3 -batch -noheader -separator '|' "$db" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;"
}

# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
  local u="$1" db row
  db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
  row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
    "SELECT
       (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
       CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
     FROM projection_thread_sessions;" 2>/dev/null)" || return 1
  gate_is_safe "${row%%|*}" "${row##*|}"
}

main() {
  # a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
  if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
  [ -d "$DEFER_DIR" ] || exit 0   # nothing deferred
  last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)"   # rollback target for the helper

  local marker u unit started mwritten migrated=0 skipped=0
  for marker in "$DEFER_DIR"/*; do
    [ -e "$marker" ] || continue   # empty-dir glob
    u="$(basename "$marker")"; unit="t3-serve@$u.service"
    if ! systemctl is-active --quiet "$unit"; then
      LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
    fi
    started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
    mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
    if [ "$started" -gt "$mwritten" ]; then
      LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
    fi
    if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi

    target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
    if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
    if ! backup_user "$u" >/dev/null; then
      LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
    fi
    if safe_restart_unit "$unit" "$u"; then
      LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
    else
      LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
    fi
  done
  LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}

# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
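`safe_to_restart` hands the `<active_turns>|<idle_seconds>` row to `gate_is_safe` by splitting it with two parameter expansions. The sketch below isolates that split so the edge cases are easy to see; `split_row` is a hypothetical helper that does not exist in the script, and the sample rows are made up.

```shell
#!/usr/bin/env bash
# split_row <row>: hypothetical helper; prints both fields using the same
# ${row%%|*} (everything before the first '|') and ${row##*|} (everything
# after the last '|') expansions that safe_to_restart uses.
split_row() {
  local row="$1"
  printf 'active=%s idle=%s\n' "${row%%|*}" "${row##*|}"
}

split_row "0|3600"   # all sessions idle, last update an hour ago
split_row "1|12"     # one turn in flight
split_row "0|"       # empty table: max(updated_at) is NULL, so idle is empty
split_row ""         # sqlite3 printed nothing: both fields come out empty
```

The last case is the one that matters for failing closed: an empty `active` field is rejected by the first `case` in `gate_is_safe`, so a failed or empty query never authorizes a restart.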
10
scripts/t3-migrate-idle.timer
Normal file
@@ -0,0 +1,10 @@
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)

[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false

[Install]
WantedBy=timers.target
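The `OnCalendar=*-*-* 01..05:00/20` expression selects hours 01 through 05 and, within each hour, minute 00 plus every 20th minute after it. As a rough illustration of what that expands to, the sketch below enumerates the nominal slots; `firing_times` is a hypothetical helper, and the real elapse times are additionally shifted by up to 120 seconds because of `RandomizedDelaySec=120`.

```shell
#!/usr/bin/env bash
# firing_times: hypothetical helper that prints the nominal HH:MM slots for
# hours 01..05 at minutes 00, 20 and 40 (jitter from RandomizedDelaySec ignored).
firing_times() {
  local h m
  for h in 01 02 03 04 05; do
    for m in 00 20 40; do
      printf '%s:%s\n' "$h" "$m"
    done
  done
}

firing_times
```

On a host with systemd, `systemd-analyze calendar --iterations=5 '*-*-* 01..05:00/20'` should list the next elapse times as systemd itself computes them, which is the more authoritative check. Because `Persistent=false`, slots missed while the machine was off are not replayed.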
96
scripts/t3-safe-restart.sh
Normal file
@@ -0,0 +1,96 @@
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").

# ---- shared config defaults (honour the original T3_* override names) -----------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}}"
: "${STATE_DIR:=${T3_STATE_DIR:-/var/lib/t3-autoupdate}}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=${T3_BACKUP_DEST:-/var/backups/t3-state}}"
: "${DISPATCH:=${T3_DISPATCH:-127.0.0.1:3780}}"
: "${USER_MAP:=${T3_USER_MAP:-/etc/ttyd-user-map}}"
: "${T3_BACKUP_TIMEOUT:=900}"

LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
  local u="$1" src out dst ts
  src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
  ts="$(date +%Y%m%d-%H%M%S)"
  out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
  install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
  if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
    printf '%s\n' "$dst"; return 0
  fi
  rm -f "$dst"; return 1
}

# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }

# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
  LOG "rolling back binary $target -> $last_good"
  if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
  LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}

# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
  local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
  out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
  printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}

# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
  local unit="$1" u="$2" ok=0 _ bak
  systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
  for _ in $(seq 1 15); do
    if verify_pairing "$u"; then ok=1; break; fi
    sleep 2
  done
  if [ "$ok" = "1" ]; then
    LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
  fi
  LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
  rollback_binary
  bak="$(prebump_of "$u")"
  if [ -n "$bak" ]; then
    systemctl stop "$unit" 2>/dev/null
    if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
      rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
      LOG "restored $u state.sqlite from $bak"
    fi
    systemctl start "$unit" 2>/dev/null
  fi
  touch "$FREEZE_FILE" 2>/dev/null
  LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
  return 1
}
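The `osusers` and `ak_for` helpers in this library both parse the `authentik=os_user` map file. The sketch below runs the same two awk programs against a throwaway fixture so the handling of comments, stray whitespace, and duplicate OS users is visible. The fixture contents (`viktor.b`, `second.login`, `guest`, `nobody`) are invented for illustration, and `USER_MAP` points at a temp file rather than the real map.

```shell
#!/usr/bin/env bash
# Sketch only: a temp file stands in for the real user map.
USER_MAP="$(mktemp)"
cat >"$USER_MAP" <<'EOF'
# authentik=os_user
viktor.b = wizard
guest=guest
# disabled=nobody
second.login=wizard
EOF

# Copied from the library: RHS of every non-comment line, de-duplicated.
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }

# Copied from the library: first authentik name mapped to the given OS user.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

osusers          # one line per distinct OS user
ak_for wizard    # first mapping wins when several logins share an OS user
```

Note that `verify_pairing` returns success when `ak_for` finds no mapping, so an OS user missing from the map is restarted without a dispatch check.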
@@ -162,6 +162,8 @@ fi
 SCRIPTS="$HERE/.."
 # 9a) scripts the units exec (t3-provision-users already deployed in section 6)
 install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
+install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh   # sourced lib (t3-autoupdate + t3-migrate-idle)
+install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
 install -m 0755 "$SCRIPTS/t3-backup-state.sh" /usr/local/bin/t3-backup-state
 install -m 0755 "$SCRIPTS/t3-mint" /usr/local/bin/t3-mint
 install -m 0755 "$HERE/claude-auth-sync.sh" /usr/local/bin/claude-auth-sync
@@ -198,6 +200,7 @@ fi
 for u in t3-serve@.service \
          claude-auth-sync@.service claude-auth-sync@.timer \
          t3-autoupdate.service t3-autoupdate.timer \
+         t3-migrate-idle.service t3-migrate-idle.timer \
          t3-backup-state.service t3-backup-state.timer \
          t3-provision-users.service t3-provision-users.timer \
          t3-dispatch.service; do
@@ -216,7 +219,7 @@ done
 log "playwright: template units + snapshot-refresh script installed (per-user enable in provisioner)"
 systemctl daemon-reload
 systemctl enable --now t3-dispatch.service \
-  t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer >/dev/null 2>&1 || \
+  t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
   log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
 log "service units installed + enabled (t3-dispatch + 3 timers; t3-serve@ per-user)"

44
tests/t3-migrate-idle-gate.test.sh
Normal file
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)"   # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh"   # defines functions; main-guard prevents the drain from running

pass=0; fail=0
ok()   { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }

# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok    gate_is_safe 0 1000   # idle, quiet long enough -> safe
notok gate_is_safe 1 1000   # a turn in flight -> unsafe
notok gate_is_safe 0 100    # idle but not quiet enough -> unsafe
ok    gate_is_safe 0 ""     # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000   # unparseable active -> unsafe
notok gate_is_safe 0 -30    # negative idle (clock skew) -> unsafe

# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() {   # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
  local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
  while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"

# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"

# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500

# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"

echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
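The gate tests above cover values well inside and well outside the quiet window. A few boundary cases may also be worth keeping in mind; the sketch below copies `gate_is_safe` so it runs without the repository and adds a hypothetical `describe` helper that prints the verdict. The chosen inputs are illustrative and are not part of the test file.

```shell
#!/usr/bin/env bash
QUIET_SECONDS=900

# gate_is_safe copied from scripts/t3-migrate-idle.sh so this block is self-contained.
gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty active -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

# describe <active> <idle>: hypothetical helper that prints "safe" or "unsafe".
describe() { if gate_is_safe "$1" "$2"; then echo "safe"; else echo "unsafe"; fi; }

describe 0 900     # idle exactly equal to the threshold (comparison is -ge)
describe 0 899     # one second short of the threshold
describe "" 1000   # query produced no active_turns value
describe 2 ""      # turns in flight, even though idle is empty
```

The threshold comparison is inclusive, so a session that has been quiet for exactly `QUIET_SECONDS` passes the gate, while any active turn blocks it regardless of the idle value.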