t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
5ea238c707
commit
bccaa08d8e
9 changed files with 311 additions and 19 deletions
|
|
@ -32,7 +32,7 @@
|
|||
|---------|-------------|-------|
|
||||
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
|
||||
| reverse-proxy | Generic reverse proxy | reverse-proxy |
|
||||
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin requires first verifying `t3-dispatch`'s bootstrap flow against the new build (expect 302 + `t3_session`). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
|
||||
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
|
||||
|
||||
## Active Use
|
||||
| Service | Description | Stack |
|
||||
|
|
|
|||
|
|
@ -106,12 +106,29 @@ first run.
|
|||
live session and all projection history were untouched. Backup:
|
||||
`/home/wizard/.t3/userdata/auth-backup-*.sql`.
|
||||
|
||||
### 5. End-to-end pairing health-check (DEFERRED)
|
||||
### 5. End-to-end pairing health-check (DONE — 2026-06-09 follow-up)
|
||||
|
||||
The smoke test should exercise mint→bootstrap→cookie, not just `GET /`. Not
|
||||
done here (the pin makes it moot for the known-good build); needed before the
|
||||
enforcer is ever pointed at a new version. A blackbox probe on the dispatch
|
||||
auto-pair (expect 302 + `t3_session`) would have alerted within minutes.
|
||||
`t3-autoupdate.sh`'s smoke test now exercises the REAL handshake — mint →
|
||||
`POST` the credential (trying `browser-session` then `bootstrap`) → require
|
||||
`200` + a `t3_session` cookie — not just `GET / → 200`. A build that renames or
|
||||
breaks the pairing API now fails the check and **auto-rolls-back**, instead of
|
||||
shipping a pairing-broken binary to everyone.
|
||||
|
||||
### 6. Version-agnostic dispatch + reversible bumps (DONE — "prepare for 0.0.25")
|
||||
|
||||
So the pin can move without another outage:
|
||||
- **`t3-dispatch` is now version-agnostic** — `autoPair` tries
|
||||
`/api/auth/browser-session` (0.0.25) and falls back to `/api/auth/bootstrap`
|
||||
(0.0.24), so one binary pairs across the rename and through rolling-restart
|
||||
skew. Covered by `TestAutoPairAcrossVersions`. Investigation confirmed the
|
||||
0.0.25 break was *only* this endpoint rename — the rest of the contract
|
||||
(credential payload, `t3_session` cookie, `/api/auth/session`) is byte-identical.
|
||||
- **`~/.t3` state is now backed up** — `t3-backup-state` (daily timer, online
|
||||
`VACUUM INTO`, timeout-guarded) snapshots each user's `state.sqlite` (previously
|
||||
the only copy, unbacked). This turns the one-way forward migration into a
|
||||
*restore*, not sqlite surgery.
|
||||
- **Cutover is a checklist** — `docs/runbooks/t3-version-bump.md` (pre-flight
|
||||
verify, pre-bump backup, enforcer install + auto-rollback, verify, restore).
|
||||
|
||||
## Lessons
|
||||
|
||||
|
|
|
|||
105
docs/runbooks/t3-version-bump.md
Normal file
105
docs/runbooks/t3-version-bump.md
Normal file
|
|
@ -0,0 +1,105 @@
|
|||
# Runbook: bump the pinned t3 version (e.g. 0.0.24 → 0.0.25)
|
||||
|
||||
t3 on the devvm is **pinned** (`T3_PIN`, default `0.0.24`) and held there by the
|
||||
`t3-autoupdate` enforcer. t3 is pre-1.0 and ships breaking changes between
|
||||
builds, so a bump is a **deliberate, verified, reversible** step — never an
|
||||
auto-update. This runbook makes it calm. Background: post-mortem
|
||||
`2026-06-09-t3-nightly-autoupdate-auth-outage.md`.
|
||||
|
||||
## What a bump actually touches
|
||||
|
||||
1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session`
|
||||
in 0.0.25. `t3-dispatch` is now **version-agnostic** (tries `browser-session`,
|
||||
falls back to `bootstrap`; see `pairEndpoints` in `scripts/t3-dispatch/main.go`),
|
||||
so 0.0.24↔0.0.25 needs **no dispatch change**. If a *future* build renames it
|
||||
again, add the new path to `pairEndpoints`, rebuild, redeploy first.
|
||||
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` **forward**
|
||||
(`auth_pairing_links`/`auth_sessions` `role`→`scopes`, `+proof_key_thumbprint`).
|
||||
This is a **one-way door**: a binary downgrade alone will NOT roll it back —
|
||||
you must restore the DB. Hence the mandatory pre-bump backup below.
|
||||
|
||||
## Pre-flight (no downtime)
|
||||
|
||||
```bash
|
||||
# 1. Confirm the dispatch already speaks the new version's pairing API.
|
||||
# Install the candidate to an isolated prefix (does NOT touch the global pin):
|
||||
npm install --prefix /tmp/t3-cand t3@<new> # e.g. t3@0.0.25
|
||||
BIN=/tmp/t3-cand/node_modules/.bin/t3; D=$(mktemp -d)
|
||||
"$BIN" serve --host 127.0.0.1 --port 3796 --base-dir "$D" >/tmp/cand.log 2>&1 &
|
||||
CRED=$("$BIN" auth pairing create --base-dir "$D" --ttl 5m --json | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
|
||||
# Try the dispatch's endpoints; one must give 200 + Set-Cookie: t3_session.
|
||||
for ep in /api/auth/browser-session /api/auth/bootstrap; do
|
||||
curl -s -i -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$CRED\"}" \
|
||||
"http://127.0.0.1:3796$ep" | grep -iE 'HTTP/|set-cookie: t3_session'; done
|
||||
kill %1; rm -rf "$D" /tmp/t3-cand
|
||||
# If NO endpoint yields a t3_session cookie -> the API changed again; update
|
||||
# pairEndpoints in main.go + rebuild the dispatch BEFORE proceeding.
|
||||
|
||||
# 2. Dispatch unit tests still green:
|
||||
( cd ~/code/infra/scripts/t3-dispatch && go test ./... )
|
||||
```
|
||||
|
||||
## The bump
|
||||
|
||||
```bash
|
||||
NEW=0.0.25
|
||||
# 1. PRE-BUMP BACKUP — the rollback safety net. Per user, stop the serve (so the
|
||||
# copy is consistent + fast), copy state.sqlite, restart. Do the ACTIVE admin
|
||||
# instance last / from OUTSIDE its own t3 session (you can't restart the serve
|
||||
# you're running inside).
|
||||
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
|
||||
src=/home/$u/.t3/userdata/state.sqlite; [ -f "$src" ] || continue
|
||||
sudo systemctl stop t3-serve@$u
|
||||
sudo install -d -o "$u" -g "$u" -m700 /var/backups/t3-state/$u
|
||||
sudo cp -a "$src" /var/backups/t3-state/$u/state-prebump-$NEW-$(date +%Y%m%d-%H%M%S).sqlite
|
||||
sudo systemctl start t3-serve@$u
|
||||
done
|
||||
# (t3-backup-state also runs daily; this captures a guaranteed snapshot at T-0.)
|
||||
|
||||
# 2. Move the pin in BOTH places (keep them in sync):
|
||||
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-$NEW/" ~/code/infra/scripts/t3-autoupdate.sh \
|
||||
~/code/infra/scripts/workstation/setup-devvm.sh
|
||||
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
|
||||
|
||||
# 3. Run the enforcer. It installs t3@$NEW, then HEALTH-CHECKS the real pairing
|
||||
# handshake (mint -> browser-session/bootstrap -> t3_session). If pairing is
|
||||
# broken in $NEW, it AUTO-ROLLS-BACK to the previous version and exits non-zero.
|
||||
sudo /usr/local/bin/t3-autoupdate # restarts idle instances; defers active ones
|
||||
|
||||
# 4. Restart any instance the enforcer deferred (active agent), when it's idle.
|
||||
# The wizard/admin instance: restart from OUTSIDE its own session, or it picks
|
||||
# up $NEW on its next natural restart (the unit runs the global /usr/bin/t3).
|
||||
```
|
||||
|
||||
## Verify
|
||||
|
||||
```bash
|
||||
for u in vbarzin emil.barzin ancaelena98; do
|
||||
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
|
||||
done # each must be 302 + t3_session
|
||||
t3 --version # == $NEW
|
||||
```
|
||||
|
||||
## Rollback (if pairing breaks or $NEW misbehaves)
|
||||
|
||||
The enforcer auto-rolls-back the **binary** if its health-check fails. But if a
|
||||
problem surfaces *after* serves migrated their DBs forward, the binary alone
|
||||
won't fix it — restore the DBs:
|
||||
|
||||
```bash
|
||||
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-0.0.24/" ~/code/infra/scripts/t3-autoupdate.sh ~/code/infra/scripts/workstation/setup-devvm.sh
|
||||
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
|
||||
sudo npm i -g t3@0.0.24
|
||||
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
|
||||
bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1)
|
||||
[ -n "$bak" ] || continue
|
||||
sudo systemctl stop t3-serve@$u
|
||||
sudo install -o "$u" -g "$u" -m600 "$bak" /home/$u/.t3/userdata/state.sqlite
|
||||
sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm
|
||||
sudo systemctl start t3-serve@$u
|
||||
done
|
||||
# verify 302 + t3_session as above
|
||||
```
|
||||
|
||||
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user
|
||||
sqlite surgery. With the backup, it's a restore.)
|
||||
|
|
@ -9,8 +9,10 @@
|
|||
# To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
|
||||
# new build (curl the dispatch -> expect 302 + Set-Cookie t3_session). See post-mortem
|
||||
# 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
|
||||
# CAVEAT: the health-check below only probes GET / (200) — it does NOT exercise the
|
||||
# mint/bootstrap/pairing path, so it will NOT catch an auth regression on its own.
|
||||
# The health-check below exercises the REAL pairing handshake (mint -> credential
|
||||
# exchange -> t3_session cookie), mirroring t3-dispatch's endpoint fallback — so a
|
||||
# build that renames or breaks the pairing API fails the check and auto-rolls-back
|
||||
# (closes the 2026-06-09 miss, where a GET / probe passed a pairing-broken build).
|
||||
set -uo pipefail
|
||||
T3_PIN="${T3_PIN:-0.0.24}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
|
||||
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
|
||||
|
|
@ -27,17 +29,34 @@ fi
|
|||
LOG "re-pinned to $after (was $before); health-checking…"
|
||||
|
||||
# Health-check the NEW binary on a throwaway port/base-dir before trusting it.
|
||||
# Gate 1 = liveness (GET / -> 200); Gate 2 = the REAL pairing handshake t3-dispatch
|
||||
# performs (mint -> POST credential -> 200 + t3_session cookie), trying the same
|
||||
# endpoint fallback. Gate 2 catches a bootstrap-API rename / pairing regression.
|
||||
SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)
|
||||
t3 serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$SMOKE_DIR" >/dev/null 2>&1 &
|
||||
smoke=$!; ok=0
|
||||
smoke=$!; live=0; pair_ok=0
|
||||
for _ in $(seq 1 15); do
|
||||
[[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { ok=1; break; }
|
||||
[[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { live=1; break; }
|
||||
sleep 2
|
||||
done
|
||||
if [[ "$live" == "1" ]]; then
|
||||
cred=$(t3 auth pairing create --base-dir "$SMOKE_DIR" --ttl 5m --json 2>/dev/null \
|
||||
| tr -d '\n ' | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
|
||||
if [[ -n "$cred" ]]; then
|
||||
for ep in /api/auth/browser-session /api/auth/bootstrap; do # mirror t3-dispatch's fallback
|
||||
hdr=$(curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' \
|
||||
-d "{\"credential\":\"$cred\"}" "http://127.0.0.1:$SMOKE_PORT$ep" 2>/dev/null)
|
||||
code=$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')
|
||||
[[ "$code" == "404" ]] && continue # endpoint absent in this build — try the next
|
||||
printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session=' && pair_ok=1
|
||||
break
|
||||
done
|
||||
fi
|
||||
fi
|
||||
kill "$smoke" 2>/dev/null; wait "$smoke" 2>/dev/null; rm -rf "$SMOKE_DIR"
|
||||
|
||||
if [[ "$ok" != "1" ]]; then
|
||||
LOG "HEALTH-CHECK FAILED for $after — rolling back to $before"
|
||||
if [[ "$live" != "1" || "$pair_ok" != "1" ]]; then
|
||||
LOG "HEALTH-CHECK FAILED for $after (live=$live pair=$pair_ok) — rolling back to $before"
|
||||
if [[ -n "$before" ]] && npm i -g "t3@$before" >/dev/null 2>&1; then
|
||||
LOG "rolled back to $before"
|
||||
else
|
||||
|
|
@ -45,7 +64,7 @@ if [[ "$ok" != "1" ]]; then
|
|||
fi
|
||||
exit 1
|
||||
fi
|
||||
LOG "health OK; restarting idle instances"
|
||||
LOG "health OK (live + pairing handshake); restarting idle instances"
|
||||
|
||||
# Restart only IDLE per-user instances; defer any with an active agent child.
|
||||
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' | awk '{print $1}'); do
|
||||
|
|
|
|||
6
scripts/t3-backup-state.service
Normal file
6
scripts/t3-backup-state.service
Normal file
|
|
@ -0,0 +1,6 @@
|
|||
[Unit]
|
||||
Description=Consistent backup of per-user t3 ~/.t3 state.sqlite (history + auth)
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/usr/local/bin/t3-backup-state
|
||||
43
scripts/t3-backup-state.sh
Normal file
43
scripts/t3-backup-state.sh
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
#!/usr/bin/env bash
|
||||
# Consistent online backup of each t3 user's ~/.t3 state.sqlite (chat/session
|
||||
# history AND auth tables). ~/.t3 lives on the devvm local disk — NOT a K8s PVC and
|
||||
# NOT in the 3-2-1 pipeline — so without this it is the only copy and a rebuild
|
||||
# loses it. It also makes a t3 version bump REVERSIBLE: 0.0.25+ migrate the schema
|
||||
# FORWARD (a one-way door), so a clean pre-bump backup turns rollback into a restore
|
||||
# instead of per-user sqlite surgery (see runbooks/t3-version-bump.md). Runs as root
|
||||
# via t3-backup-state.timer; the per-user .backup runs AS the owning user so the live
|
||||
# WAL/-shm files keep their owner and the running t3-serve is never perturbed.
|
||||
set -uo pipefail
|
||||
DEST="${T3_BACKUP_DEST:-/var/backups/t3-state}"
|
||||
KEEP="${T3_BACKUP_KEEP:-14}"
|
||||
MAP=/etc/ttyd-user-map
|
||||
LOG() { logger -t t3-backup-state "$*"; echo "t3-backup-state: $*"; }
|
||||
|
||||
ts=$(date +%Y%m%d-%H%M%S)
|
||||
# RHS of each non-comment "authentik=os_user" line = an OS user owning a ~/.t3.
|
||||
mapfile -t users < <(awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/,"",$2); print $2 }' "$MAP" 2>/dev/null | sort -u)
|
||||
[[ ${#users[@]} -gt 0 ]] || { LOG "no users in $MAP; nothing to back up"; exit 0; }
|
||||
|
||||
rc=0
|
||||
for u in "${users[@]}"; do
|
||||
src="/home/$u/.t3/userdata/state.sqlite"
|
||||
if [[ ! -f "$src" ]]; then LOG "skip $u (no state.sqlite)"; continue; fi
|
||||
out="$DEST/$u"; dst="$out/state-$ts.sqlite"
|
||||
install -d -o "$u" -g "$u" -m 0700 "$out"
|
||||
# VACUUM INTO takes a consistent read-snapshot copy — unlike .backup it does NOT
|
||||
# restart when the source is written mid-copy, so it finishes in a single pass even
|
||||
# for the actively-used instance (the admin's own live session, which .backup would
|
||||
# loop on forever). Run as the owning user so WAL access keeps the live serve happy.
|
||||
# timeout caps a pathologically-slow copy (huge DB + concurrent writes on a contended
|
||||
# disk) so the daily run can never wedge — it just logs + retries next cycle. The
|
||||
# daily 03:30 slot normally finds instances idle, where even a large DB copies fast.
|
||||
if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [[ -s "$dst" ]]; then
|
||||
LOG "backed up $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
|
||||
else
|
||||
LOG "WARN: backup FAILED for $u ($src)"; rc=1; rm -f "$dst"
|
||||
fi
|
||||
# retention: keep newest $KEEP per user
|
||||
ls -1t "$out"/state-*.sqlite 2>/dev/null | tail -n +$((KEEP+1)) | xargs -r rm -f
|
||||
done
|
||||
LOG "done (rc=$rc)"
|
||||
exit $rc
|
||||
10
scripts/t3-backup-state.timer
Normal file
10
scripts/t3-backup-state.timer
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
[Unit]
|
||||
Description=Daily t3 state.sqlite backup (the only copy of ~/.t3; enables version-bump rollback)
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 03:30:00
|
||||
RandomizedDelaySec=20m
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
|
|
@ -113,9 +113,42 @@ func isDocumentNav(r *http.Request) bool {
|
|||
return strings.Contains(r.Header.Get("Accept"), "text/html")
|
||||
}
|
||||
|
||||
// pairEndpoints are the instance's session-bootstrap paths in preference order.
|
||||
// t3 renamed /api/auth/bootstrap -> /api/auth/browser-session in 0.0.25; trying the
|
||||
// new name first and falling back to the old lets ONE dispatch binary pair against
|
||||
// either version — so the t3 pin can move forward (and survive a rolling-restart
|
||||
// skew where some instances are already on the new version) without a 502 storm.
|
||||
var pairEndpoints = []string{"/api/auth/browser-session", "/api/auth/bootstrap"}
|
||||
|
||||
// exchangeCredential POSTs the pairing credential to the user's instance, trying
|
||||
// each pairEndpoint in turn. A 404 means "absent in this t3 version" -> try the
|
||||
// next; any other status is that endpoint's verdict, returned as-is. Caller owns
|
||||
// resp.Body.
|
||||
func exchangeCredential(port int, credential string) (*http.Response, error) {
|
||||
body, _ := json.Marshal(map[string]string{"credential": credential})
|
||||
var lastErr error
|
||||
for _, ep := range pairEndpoints {
|
||||
resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d%s", port, ep),
|
||||
"application/json", bytes.NewReader(body))
|
||||
if err != nil {
|
||||
lastErr = err
|
||||
continue
|
||||
}
|
||||
if resp.StatusCode == http.StatusNotFound {
|
||||
resp.Body.Close() // endpoint absent in this t3 version — try the next
|
||||
continue
|
||||
}
|
||||
return resp, nil
|
||||
}
|
||||
if lastErr != nil {
|
||||
return nil, lastErr
|
||||
}
|
||||
return nil, fmt.Errorf("no pairing endpoint accepted the request (all returned 404)")
|
||||
}
|
||||
|
||||
// autoPair mints a one-time pairing token for the user's instance (as that OS
|
||||
// user, via the scoped sudoers entry) and exchanges it at the instance's
|
||||
// /api/auth/bootstrap, relaying the returned t3_session Set-Cookie to the browser.
|
||||
// user, via the scoped sudoers entry) and exchanges it at the instance's pairing
|
||||
// endpoint, relaying the returned t3_session Set-Cookie to the browser.
|
||||
func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
|
||||
// t3-mint (root, via scoped sudoers) validates the OS user is in
|
||||
// /etc/ttyd-user-map, then mints as that user. The dispatch service itself
|
||||
|
|
@ -133,16 +166,15 @@ func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
|
|||
http.Error(w, "unparseable pairing output", http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
body, _ := json.Marshal(map[string]string{"credential": pc.Credential})
|
||||
resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d/api/auth/bootstrap", e.Port),
|
||||
"application/json", bytes.NewReader(body))
|
||||
resp, err := exchangeCredential(e.Port, pc.Credential)
|
||||
if err != nil {
|
||||
log.Printf("pairing exchange for %s failed: %v", e.OsUser, err)
|
||||
http.Error(w, "bootstrap request failed", http.StatusBadGateway)
|
||||
return
|
||||
}
|
||||
defer resp.Body.Close()
|
||||
if resp.StatusCode != http.StatusOK {
|
||||
log.Printf("bootstrap for %s returned %d", e.OsUser, resp.StatusCode)
|
||||
log.Printf("pairing for %s returned %d", e.OsUser, resp.StatusCode)
|
||||
http.Error(w, "bootstrap rejected", http.StatusBadGateway)
|
||||
return
|
||||
}
|
||||
|
|
|
|||
|
|
@ -117,6 +117,8 @@ func fakeInstance(authenticated bool, bootstrapCalled *bool) *httptest.Server {
|
|||
}
|
||||
http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"})
|
||||
_, _ = w.Write([]byte(`{"authenticated":true}`))
|
||||
case "/api/auth/browser-session":
|
||||
http.NotFound(w, r) // models a 0.0.24 instance: the 0.0.25 endpoint is absent
|
||||
default:
|
||||
_, _ = w.Write([]byte("APP"))
|
||||
}
|
||||
|
|
@ -198,3 +200,61 @@ func TestHandlerProxiesXHREvenIfCookieInvalid(t *testing.T) {
|
|||
t.Fatalf("XHR should proxy through, got %d %q", w.Code, w.Body.String())
|
||||
}
|
||||
}
|
||||
|
||||
// pairInstance simulates a t3 instance that exposes pairing at exactly one path
|
||||
// (200 + t3_session) and 404s the other known path — modeling the 0.0.25 rename of
|
||||
// /api/auth/bootstrap -> /api/auth/browser-session. records which path was hit.
|
||||
func pairInstance(pairPath string, hit *string) *httptest.Server {
|
||||
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
switch r.URL.Path {
|
||||
case "/api/auth/browser-session", "/api/auth/bootstrap":
|
||||
if r.URL.Path != pairPath {
|
||||
http.NotFound(w, r) // endpoint absent in this t3 version
|
||||
return
|
||||
}
|
||||
if hit != nil {
|
||||
*hit = r.URL.Path
|
||||
}
|
||||
http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"})
|
||||
_, _ = w.Write([]byte(`{"authenticated":true}`))
|
||||
default:
|
||||
http.NotFound(w, r)
|
||||
}
|
||||
}))
|
||||
}
|
||||
|
||||
// TestAutoPairAcrossVersions: one dispatch binary must pair against BOTH the
|
||||
// 0.0.24 endpoint (/api/auth/bootstrap) and the 0.0.25 one (/api/auth/browser-session),
|
||||
// so the pin can move forward (and survive rolling-restart skew) without a 502 storm.
|
||||
func TestAutoPairAcrossVersions(t *testing.T) {
|
||||
orig := mintToken
|
||||
mintToken = func(string) ([]byte, error) { return []byte(`{"credential":"tok"}`), nil }
|
||||
defer func() { mintToken = orig }()
|
||||
|
||||
for _, tc := range []struct{ name, pairPath string }{
|
||||
{"0.0.25 browser-session", "/api/auth/browser-session"},
|
||||
{"0.0.24 bootstrap", "/api/auth/bootstrap"},
|
||||
} {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
var hit string
|
||||
ts := pairInstance(tc.pairPath, &hit)
|
||||
defer ts.Close()
|
||||
setTable(portOf(t, ts))
|
||||
|
||||
r := httptest.NewRequest("GET", "/", nil)
|
||||
r.Header.Set("X-authentik-username", "vbarzin@gmail.com") // no cookie -> autoPair
|
||||
w := httptest.NewRecorder()
|
||||
handler(w, r)
|
||||
|
||||
if w.Code != http.StatusFound {
|
||||
t.Fatalf("want 302 re-pair, got %d body=%q", w.Code, w.Body.String())
|
||||
}
|
||||
if hit != tc.pairPath {
|
||||
t.Fatalf("want pairing via %s, hit=%q", tc.pairPath, hit)
|
||||
}
|
||||
if cs := w.Result().Cookies(); len(cs) == 0 || cs[0].Value != "fresh" {
|
||||
t.Fatalf("want fresh t3_session relayed, got %+v", cs)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue