From bccaa08d8e1ce25b984026b53236d6de34b6df6d Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 9 Jun 2026 20:00:11 +0000
Subject: [PATCH] =?UTF-8?q?t3:=20prepare=20to=20adopt=200.0.25=20=E2=80=94?=
 =?UTF-8?q?=20version-agnostic=20dispatch=20+=20real=20pairing=20health-ch?=
 =?UTF-8?q?eck=20+=20state=20backup=20[ci=20skip]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves.

So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL
  mint->credential->cookie handshake (was GET / -> 200, which passed the
  pairing-broken nightly). A bad build now auto-rolls-back. Validated against
  both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8
---
 .claude/reference/service-catalog.md          |   2 +-
 ...06-09-t3-nightly-autoupdate-auth-outage.md |  27 ++++-
 docs/runbooks/t3-version-bump.md              | 105 ++++++++++++++++++
 scripts/t3-autoupdate.sh                      |  33 ++++--
 scripts/t3-backup-state.service               |   6 +
 scripts/t3-backup-state.sh                    |  43 +++++++
 scripts/t3-backup-state.timer                 |  10 ++
 scripts/t3-dispatch/main.go                   |  44 +++++++-
 scripts/t3-dispatch/main_test.go              |  60 ++++++++++
 9 files changed, 311 insertions(+), 19 deletions(-)
 create mode 100644 docs/runbooks/t3-version-bump.md
 create mode 100644 scripts/t3-backup-state.service
 create mode 100644 scripts/t3-backup-state.sh
 create mode 100644 scripts/t3-backup-state.timer

diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index c8022fa1..f33430be 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin requires first verifying `t3-dispatch`'s bootstrap flow against the new build (expect 302 + `t3_session`). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
+| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
 
 ## Active Use
 | Service | Description | Stack |
diff --git a/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md
index e8f0d2d5..4aa4cb77 100644
--- a/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md
+++ b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md
@@ -106,12 +106,29 @@ first run.
 live session and all projection history were untouched. Backup:
 `/home/wizard/.t3/userdata/auth-backup-*.sql`.
 
-### 5. End-to-end pairing health-check (DEFERRED)
+### 5. End-to-end pairing health-check (DONE — 2026-06-09 follow-up)
 
-The smoke test should exercise mint→bootstrap→cookie, not just `GET /`. Not
-done here (the pin makes it moot for the known-good build); needed before the
-enforcer is ever pointed at a new version. A blackbox probe on the dispatch
-auto-pair (expect 302 + `t3_session`) would have alerted within minutes.
+`t3-autoupdate.sh`'s smoke test now exercises the REAL handshake — mint →
+`POST` the credential (trying `browser-session` then `bootstrap`) → require
+`200` + a `t3_session` cookie — not just `GET / → 200`. A build that renames or
+breaks the pairing API now fails the check and **auto-rolls-back**, instead of
+shipping a pairing-broken binary to everyone.
+
+### 6. Version-agnostic dispatch + reversible bumps (DONE — "prepare for 0.0.25")
+
+So the pin can move without another outage:
+- **`t3-dispatch` is now version-agnostic** — `autoPair` tries
+  `/api/auth/browser-session` (0.0.25) and falls back to `/api/auth/bootstrap`
+  (0.0.24), so one binary pairs across the rename and through rolling-restart
+  skew. Covered by `TestAutoPairAcrossVersions`. Investigation confirmed the
+  0.0.25 break was *only* this endpoint rename — the rest of the contract
+  (credential payload, `t3_session` cookie, `/api/auth/session`) is byte-identical.
+- **`~/.t3` state is now backed up** — `t3-backup-state` (daily timer, online
+  `VACUUM INTO`, timeout-guarded) snapshots each user's `state.sqlite` (previously
+  the only copy, unbacked). This turns the one-way forward migration into a
+  *restore*, not sqlite surgery.
+- **Cutover is a checklist** — `docs/runbooks/t3-version-bump.md` (pre-flight
+  verify, pre-bump backup, enforcer install + auto-rollback, verify, restore).
 
 ## Lessons
 
diff --git a/docs/runbooks/t3-version-bump.md b/docs/runbooks/t3-version-bump.md
new file mode 100644
index 00000000..ce0fbcc1
--- /dev/null
+++ b/docs/runbooks/t3-version-bump.md
@@ -0,0 +1,105 @@
+# Runbook: bump the pinned t3 version (e.g. 0.0.24 → 0.0.25)
+
+t3 on the devvm is **pinned** (`T3_PIN`, default `0.0.24`) and held there by the
+`t3-autoupdate` enforcer. t3 is pre-1.0 and ships breaking changes between
+builds, so a bump is a **deliberate, verified, reversible** step — never an
+auto-update. This runbook makes it calm. Background: post-mortem
+`2026-06-09-t3-nightly-autoupdate-auth-outage.md`.
+
+## What a bump actually touches
+
+1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session`
+   in 0.0.25. `t3-dispatch` is now **version-agnostic** (tries `browser-session`,
+   falls back to `bootstrap`; see `pairEndpoints` in `scripts/t3-dispatch/main.go`),
+   so 0.0.24↔0.0.25 needs **no dispatch change**. If a *future* build renames it
+   again, add the new path to `pairEndpoints`, rebuild, redeploy first.
+2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` **forward**
+   (`auth_pairing_links`/`auth_sessions` `role`→`scopes`, `+proof_key_thumbprint`).
+   This is a **one-way door**: a binary downgrade alone will NOT roll it back —
+   you must restore the DB. Hence the mandatory pre-bump backup below.
+
+## Pre-flight (no downtime)
+
+```bash
+# 1. Confirm the dispatch already speaks the new version's pairing API.
+# Install the candidate to an isolated prefix (does NOT touch the global pin):
+npm install --prefix /tmp/t3-cand t3@ # e.g. t3@0.0.25
+BIN=/tmp/t3-cand/node_modules/.bin/t3; D=$(mktemp -d)
+"$BIN" serve --host 127.0.0.1 --port 3796 --base-dir "$D" >/tmp/cand.log 2>&1 &
+CRED=$("$BIN" auth pairing create --base-dir "$D" --ttl 5m --json | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
+# Try the dispatch's endpoints; one must give 200 + Set-Cookie: t3_session.
+for ep in /api/auth/browser-session /api/auth/bootstrap; do
+  curl -s -i -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$CRED\"}" \
+    "http://127.0.0.1:3796$ep" | grep -iE 'HTTP/|set-cookie: t3_session'; done
+kill %1; rm -rf "$D" /tmp/t3-cand
+# If NO endpoint yields a t3_session cookie -> the API changed again; update
+# pairEndpoints in main.go + rebuild the dispatch BEFORE proceeding.
+
+# 2. Dispatch unit tests still green:
+( cd ~/code/infra/scripts/t3-dispatch && go test ./... )
+```
+
+## The bump
+
+```bash
+NEW=0.0.25
+# 1. PRE-BUMP BACKUP — the rollback safety net. Per user, stop the serve (so the
+#    copy is consistent + fast), copy state.sqlite, restart. Do the ACTIVE admin
+#    instance last / from OUTSIDE its own t3 session (you can't restart the serve
+#    you're running inside).
+for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
+  src=/home/$u/.t3/userdata/state.sqlite; [ -f "$src" ] || continue
+  sudo systemctl stop t3-serve@$u
+  sudo install -d -o "$u" -g "$u" -m700 /var/backups/t3-state/$u
+  sudo cp -a "$src" /var/backups/t3-state/$u/state-prebump-$NEW-$(date +%Y%m%d-%H%M%S).sqlite
+  sudo systemctl start t3-serve@$u
+done
+# (t3-backup-state also runs daily; this captures a guaranteed snapshot at T-0.)
+
+# 2. Move the pin in BOTH places (keep them in sync):
+sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-$NEW/" ~/code/infra/scripts/t3-autoupdate.sh \
+  ~/code/infra/scripts/workstation/setup-devvm.sh
+sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
+
+# 3. Run the enforcer. It installs t3@$NEW, then HEALTH-CHECKS the real pairing
+#    handshake (mint -> browser-session/bootstrap -> t3_session). If pairing is
+#    broken in $NEW, it AUTO-ROLLS-BACK to the previous version and exits non-zero.
+sudo /usr/local/bin/t3-autoupdate # restarts idle instances; defers active ones
+
+# 4. Restart any instance the enforcer deferred (active agent), when it's idle.
+#    The wizard/admin instance: restart from OUTSIDE its own session, or it picks
+#    up $NEW on its next natural restart (the unit runs the global /usr/bin/t3).
+```
+
+## Verify
+
+```bash
+for u in vbarzin emil.barzin ancaelena98; do
+  curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
+done # each must be 302 + t3_session
+t3 --version # == $NEW
+```
+
+## Rollback (if pairing breaks or $NEW misbehaves)
+
+The enforcer auto-rolls-back the **binary** if its health-check fails. But if a
+problem surfaces *after* serves migrated their DBs forward, the binary alone
+won't fix it — restore the DBs:
+
+```bash
+sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-0.0.24/" ~/code/infra/scripts/t3-autoupdate.sh ~/code/infra/scripts/workstation/setup-devvm.sh
+sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
+sudo npm i -g t3@0.0.24
+for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
+  bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1)
+  [ -n "$bak" ] || continue
+  sudo systemctl stop t3-serve@$u
+  sudo install -o "$u" -g "$u" -m600 "$bak" /home/$u/.t3/userdata/state.sqlite
+  sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm
+  sudo systemctl start t3-serve@$u
+done
+# verify 302 + t3_session as above
+```
+
+(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user
+sqlite surgery. With the backup, it's a restore.)
diff --git a/scripts/t3-autoupdate.sh b/scripts/t3-autoupdate.sh
index 836605f0..4eac8ddf 100644
--- a/scripts/t3-autoupdate.sh
+++ b/scripts/t3-autoupdate.sh
@@ -9,8 +9,10 @@
 # To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
 # new build (curl the dispatch -> expect 302 + Set-Cookie t3_session). See post-mortem
 # 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
-# CAVEAT: the health-check below only probes GET / (200) — it does NOT exercise the
-# mint/bootstrap/pairing path, so it will NOT catch an auth regression on its own.
+# The health-check below exercises the REAL pairing handshake (mint -> credential
+# exchange -> t3_session cookie), mirroring t3-dispatch's endpoint fallback — so a
+# build that renames or breaks the pairing API fails the check and auto-rolls-back
+# (closes the 2026-06-09 miss, where a GET / probe passed a pairing-broken build).
 set -uo pipefail
 T3_PIN="${T3_PIN:-0.0.24}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
 LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
@@ -27,17 +29,34 @@ fi
 LOG "re-pinned to $after (was $before); health-checking…"
 
 # Health-check the NEW binary on a throwaway port/base-dir before trusting it.
+# Gate 1 = liveness (GET / -> 200); Gate 2 = the REAL pairing handshake t3-dispatch
+# performs (mint -> POST credential -> 200 + t3_session cookie), trying the same
+# endpoint fallback. Gate 2 catches a bootstrap-API rename / pairing regression.
 SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)
 t3 serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$SMOKE_DIR" >/dev/null 2>&1 &
-smoke=$!; ok=0
+smoke=$!; live=0; pair_ok=0
 for _ in $(seq 1 15); do
-  [[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { ok=1; break; }
+  [[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { live=1; break; }
   sleep 2
 done
+if [[ "$live" == "1" ]]; then
+  cred=$(t3 auth pairing create --base-dir "$SMOKE_DIR" --ttl 5m --json 2>/dev/null \
+    | tr -d '\n ' | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
+  if [[ -n "$cred" ]]; then
+    for ep in /api/auth/browser-session /api/auth/bootstrap; do # mirror t3-dispatch's fallback
+      hdr=$(curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' \
+        -d "{\"credential\":\"$cred\"}" "http://127.0.0.1:$SMOKE_PORT$ep" 2>/dev/null)
+      code=$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')
+      [[ "$code" == "404" ]] && continue # endpoint absent in this build — try the next
+      [[ "$code" == "200" ]] && printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session=' && pair_ok=1
+      break
+    done
+  fi
+fi
 kill "$smoke" 2>/dev/null; wait "$smoke" 2>/dev/null; rm -rf "$SMOKE_DIR"
 
-if [[ "$ok" != "1" ]]; then
-  LOG "HEALTH-CHECK FAILED for $after — rolling back to $before"
+if [[ "$live" != "1" || "$pair_ok" != "1" ]]; then
+  LOG "HEALTH-CHECK FAILED for $after (live=$live pair=$pair_ok) — rolling back to $before"
   if [[ -n "$before" ]] && npm i -g "t3@$before" >/dev/null 2>&1; then
     LOG "rolled back to $before"
   else
@@ -45,7 +64,7 @@ if [[ "$ok" != "1" ]]; then
   fi
   exit 1
 fi
-LOG "health OK; restarting idle instances"
+LOG "health OK (live + pairing handshake); restarting idle instances"
 
 # Restart only IDLE per-user instances; defer any with an active agent child.
 for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' | awk '{print $1}'); do
diff --git a/scripts/t3-backup-state.service b/scripts/t3-backup-state.service
new file mode 100644
index 00000000..5f590942
--- /dev/null
+++ b/scripts/t3-backup-state.service
@@ -0,0 +1,6 @@
+[Unit]
+Description=Consistent backup of per-user t3 ~/.t3 state.sqlite (history + auth)
+
+[Service]
+Type=oneshot
+ExecStart=/usr/local/bin/t3-backup-state
diff --git a/scripts/t3-backup-state.sh b/scripts/t3-backup-state.sh
new file mode 100644
index 00000000..7ade2892
--- /dev/null
+++ b/scripts/t3-backup-state.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+# Consistent online backup of each t3 user's ~/.t3 state.sqlite (chat/session
+# history AND auth tables). ~/.t3 lives on the devvm local disk — NOT a K8s PVC and
+# NOT in the 3-2-1 pipeline — so without this it is the only copy and a rebuild
+# loses it. It also makes a t3 version bump REVERSIBLE: 0.0.25+ migrate the schema
+# FORWARD (a one-way door), so a clean pre-bump backup turns rollback into a restore
+# instead of per-user sqlite surgery (see runbooks/t3-version-bump.md). Runs as root
+# via t3-backup-state.timer; the per-user VACUUM INTO runs AS the owning user so the live
+# WAL/-shm files keep their owner and the running t3-serve is never perturbed.
+set -uo pipefail
+DEST="${T3_BACKUP_DEST:-/var/backups/t3-state}"
+KEEP="${T3_BACKUP_KEEP:-14}"
+MAP=/etc/ttyd-user-map
+LOG() { logger -t t3-backup-state "$*"; echo "t3-backup-state: $*"; }
+
+ts=$(date +%Y%m%d-%H%M%S)
+# RHS of each non-comment "authentik=os_user" line = an OS user owning a ~/.t3.
+mapfile -t users < <(awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/,"",$2); print $2 }' "$MAP" 2>/dev/null | sort -u)
+[[ ${#users[@]} -gt 0 ]] || { LOG "no users in $MAP; nothing to back up"; exit 0; }
+
+rc=0
+for u in "${users[@]}"; do
+  src="/home/$u/.t3/userdata/state.sqlite"
+  if [[ ! -f "$src" ]]; then LOG "skip $u (no state.sqlite)"; continue; fi
+  out="$DEST/$u"; dst="$out/state-$ts.sqlite"
+  install -d -o "$u" -g "$u" -m 0700 "$out"
+  # VACUUM INTO takes a consistent read-snapshot copy — unlike .backup it does NOT
+  # restart when the source is written mid-copy, so it finishes in a single pass even
+  # for the actively-used instance (the admin's own live session, which .backup would
+  # loop on forever). Run as the owning user so WAL access keeps the live serve happy.
+  # timeout caps a pathologically-slow copy (huge DB + concurrent writes on a contended
+  # disk) so the daily run can never wedge — it just logs + retries next cycle. The
+  # daily 03:30 slot normally finds instances idle, where even a large DB copies fast.
+  if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [[ -s "$dst" ]]; then
+    LOG "backed up $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
+  else
+    LOG "WARN: backup FAILED for $u ($src)"; rc=1; rm -f "$dst"
+  fi
+  # retention: keep newest $KEEP per user
+  ls -1t "$out"/state-*.sqlite 2>/dev/null | tail -n +$((KEEP+1)) | xargs -r rm -f
+done
+LOG "done (rc=$rc)"
+exit $rc
diff --git a/scripts/t3-backup-state.timer b/scripts/t3-backup-state.timer
new file mode 100644
index 00000000..72ac48e5
--- /dev/null
+++ b/scripts/t3-backup-state.timer
@@ -0,0 +1,10 @@
+[Unit]
+Description=Daily t3 state.sqlite backup (the only copy of ~/.t3; enables version-bump rollback)
+
+[Timer]
+OnCalendar=*-*-* 03:30:00
+RandomizedDelaySec=20m
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/scripts/t3-dispatch/main.go b/scripts/t3-dispatch/main.go
index 401b0edb..a36e2da0 100644
--- a/scripts/t3-dispatch/main.go
+++ b/scripts/t3-dispatch/main.go
@@ -113,9 +113,42 @@ func isDocumentNav(r *http.Request) bool {
 	return strings.Contains(r.Header.Get("Accept"), "text/html")
 }
 
+// pairEndpoints are the instance's session-bootstrap paths in preference order.
+// t3 renamed /api/auth/bootstrap -> /api/auth/browser-session in 0.0.25; trying the
+// new name first and falling back to the old lets ONE dispatch binary pair against
+// either version — so the t3 pin can move forward (and survive a rolling-restart
+// skew where some instances are already on the new version) without a 502 storm.
+var pairEndpoints = []string{"/api/auth/browser-session", "/api/auth/bootstrap"}
+
+// exchangeCredential POSTs the pairing credential to the user's instance, trying
+// each pairEndpoint in turn. A 404 means "absent in this t3 version" -> try the
+// next; any other status is that endpoint's verdict, returned as-is. Caller owns
+// resp.Body.
+func exchangeCredential(port int, credential string) (*http.Response, error) {
+	body, _ := json.Marshal(map[string]string{"credential": credential})
+	var lastErr error
+	for _, ep := range pairEndpoints {
+		resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d%s", port, ep),
+			"application/json", bytes.NewReader(body))
+		if err != nil {
+			lastErr = err
+			continue
+		}
+		if resp.StatusCode == http.StatusNotFound {
+			resp.Body.Close() // endpoint absent in this t3 version — try the next
+			continue
+		}
+		return resp, nil
+	}
+	if lastErr != nil {
+		return nil, lastErr
+	}
+	return nil, fmt.Errorf("no pairing endpoint accepted the request (all returned 404)")
+}
+
 // autoPair mints a one-time pairing token for the user's instance (as that OS
-// user, via the scoped sudoers entry) and exchanges it at the instance's
-// /api/auth/bootstrap, relaying the returned t3_session Set-Cookie to the browser.
+// user, via the scoped sudoers entry) and exchanges it at the instance's pairing
+// endpoint, relaying the returned t3_session Set-Cookie to the browser.
 func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
 	// t3-mint (root, via scoped sudoers) validates the OS user is in
 	// /etc/ttyd-user-map, then mints as that user. The dispatch service itself
@@ -133,16 +166,15 @@ func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
 		http.Error(w, "unparseable pairing output", http.StatusInternalServerError)
 		return
 	}
-	body, _ := json.Marshal(map[string]string{"credential": pc.Credential})
-	resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d/api/auth/bootstrap", e.Port),
-		"application/json", bytes.NewReader(body))
+	resp, err := exchangeCredential(e.Port, pc.Credential)
 	if err != nil {
+		log.Printf("pairing exchange for %s failed: %v", e.OsUser, err)
 		http.Error(w, "bootstrap request failed", http.StatusBadGateway)
 		return
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusOK {
-		log.Printf("bootstrap for %s returned %d", e.OsUser, resp.StatusCode)
+		log.Printf("pairing for %s returned %d", e.OsUser, resp.StatusCode)
 		http.Error(w, "bootstrap rejected", http.StatusBadGateway)
 		return
 	}
diff --git a/scripts/t3-dispatch/main_test.go b/scripts/t3-dispatch/main_test.go
index 81ca26a9..ee43266e 100644
--- a/scripts/t3-dispatch/main_test.go
+++ b/scripts/t3-dispatch/main_test.go
@@ -117,6 +117,8 @@ func fakeInstance(authenticated bool, bootstrapCalled *bool) *httptest.Server {
 			}
 			http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"})
 			_, _ = w.Write([]byte(`{"authenticated":true}`))
+		case "/api/auth/browser-session":
+			http.NotFound(w, r) // models a 0.0.24 instance: the 0.0.25 endpoint is absent
 		default:
 			_, _ = w.Write([]byte("APP"))
 		}
@@ -198,3 +200,61 @@ func TestHandlerProxiesXHREvenIfCookieInvalid(t *testing.T) {
 		t.Fatalf("XHR should proxy through, got %d %q", w.Code, w.Body.String())
 	}
 }
+
+// pairInstance simulates a t3 instance that exposes pairing at exactly one path
+// (200 + t3_session) and 404s the other known path — modeling the 0.0.25 rename of
+// /api/auth/bootstrap -> /api/auth/browser-session. records which path was hit.
+func pairInstance(pairPath string, hit *string) *httptest.Server {
+	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		switch r.URL.Path {
+		case "/api/auth/browser-session", "/api/auth/bootstrap":
+			if r.URL.Path != pairPath {
+				http.NotFound(w, r) // endpoint absent in this t3 version
+				return
+			}
+			if hit != nil {
+				*hit = r.URL.Path
+			}
+			http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"})
+			_, _ = w.Write([]byte(`{"authenticated":true}`))
+		default:
+			http.NotFound(w, r)
+		}
+	}))
+}
+
+// TestAutoPairAcrossVersions: one dispatch binary must pair against BOTH the
+// 0.0.24 endpoint (/api/auth/bootstrap) and the 0.0.25 one (/api/auth/browser-session),
+// so the pin can move forward (and survive rolling-restart skew) without a 502 storm.
+func TestAutoPairAcrossVersions(t *testing.T) {
+	orig := mintToken
+	mintToken = func(string) ([]byte, error) { return []byte(`{"credential":"tok"}`), nil }
+	defer func() { mintToken = orig }()
+
+	for _, tc := range []struct{ name, pairPath string }{
+		{"0.0.25 browser-session", "/api/auth/browser-session"},
+		{"0.0.24 bootstrap", "/api/auth/bootstrap"},
+	} {
+		t.Run(tc.name, func(t *testing.T) {
+			var hit string
+			ts := pairInstance(tc.pairPath, &hit)
+			defer ts.Close()
+			setTable(portOf(t, ts))
+
+			r := httptest.NewRequest("GET", "/", nil)
+			r.Header.Set("X-authentik-username", "vbarzin@gmail.com") // no cookie -> autoPair
+			w := httptest.NewRecorder()
+			handler(w, r)
+
+			if w.Code != http.StatusFound {
+				t.Fatalf("want 302 re-pair, got %d body=%q", w.Code, w.Body.String())
+			}
+			if hit != tc.pairPath {
+				t.Fatalf("want pairing via %s, hit=%q", tc.pairPath, hit)
+			}
+			if cs := w.Result().Cookies(); len(cs) == 0 || cs[0].Value != "fresh" {
+				t.Fatalf("want fresh t3_session relayed, got %+v", cs)
+			}
+		})
+	}
+}
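
Reviewer note, not part of the patch: the pairing verdict described for gate 2 of the health-check and for the runbook pre-flight (a response counts as paired only when it is `200` and carries a `t3_session` cookie) can be sketched as a pure function and exercised without a running serve. The helper names `http_status` and `pairing_ok` are hypothetical and do not exist in `t3-autoupdate.sh`; both take the raw output of `curl -s -i` as a single string argument.

```shell
#!/bin/sh
# Hypothetical helpers (assumed names, not from the patch): classify one raw
# HTTP response as captured by `curl -s -i`.

# http_status RESPONSE: print the 3-digit status code from the status line.
# Prints nothing when the first line is not an HTTP status line.
http_status() {
  printf '%s\n' "$1" | sed -n '1s#^HTTP/[0-9.]* \([0-9][0-9][0-9]\).*#\1#p'
}

# pairing_ok RESPONSE: succeed only for status 200 plus a Set-Cookie header
# that sets t3_session. Carriage returns are stripped before matching.
pairing_ok() {
  [ "$(http_status "$1")" = "200" ] || return 1
  printf '%s\n' "$1" | tr -d '\r' | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
```

A caller would capture a response with something like `resp=$(curl -s -i -X POST ... "$url")`, treat `http_status "$resp"` equal to `404` as "endpoint absent, try the next one", and otherwise let `pairing_ok "$resp"` decide pass or fail.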