t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)

Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-16 11:33:49 +00:00
parent f4f7705127
commit cdd9ecd199
4 changed files with 126 additions and 95 deletions

@ -32,7 +32,7 @@
|---------|-------------|-------| |---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard | | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
| reverse-proxy | Generic reverse proxy | reverse-proxy | | reverse-proxy | Generic reverse proxy | reverse-proxy |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. 
**Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works but dispatch **auto-pair is 401-broken on v0.0.26** (latent; live 30-day cookies mask it). | t3code | | t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). 
Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=<ver>` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. 
`~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. | t3code |
## Active Use ## Active Use
| Service | Description | Stack | | Service | Description | Stack |

@ -148,8 +148,39 @@ So the pin can move without another outage:
## References ## References
- `infra/scripts/t3-autoupdate.sh` (pinned enforcer), `.service`, `.timer` - `infra/scripts/t3-autoupdate.sh` (gated nightly TRACKER since 2026-06-16; was the pinned enforcer), `.service`, `.timer`
- `infra/scripts/t3-provision-users.sh` step 5b - `infra/scripts/t3-provision-users.sh` step 5b
- `infra/scripts/workstation/setup-devvm.sh` step 2b - `infra/scripts/workstation/setup-devvm.sh` step 2b
- `infra/.claude/reference/service-catalog.md` (t3 serving layer) - `infra/.claude/reference/service-catalog.md` (t3 serving layer)
- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql` - Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql`
## 2026-06-16 update: gated nightly tracking deliberately re-enabled
Viktor chose to **reverse the pin** and auto-track `t3@nightly` again — accepting
the churn risk — with the explicit requirement "make sure session auth works and
revert if the fallback/failure rate climbs." The naive nightly tracking that
caused this incident is now replaced by a GATED tracker that closes every gap the
root-cause + lessons sections named:
- **Detection gap (was still open)** → the dispatch now logs every pairing
outcome (success endpoint + fallback) and the enforcer logs rollbacks/freezes;
Loki alerts (`T3PairingBroken`, `T3PairFallbackHigh`, `T3AutoUpdate*`) page on
real breakage. The pre-existing `t3-probe` only checks `GET /api/auth/session
== 200`, which stays 200 even when pairing is dead — it never caught this class.
- **"A liveness probe is not a correctness probe"** → the health-check now SEEDS
a throwaway serve with a COPY of a real populated `state.sqlite` and runs the
forward MIGRATION + real pairing handshake before trusting a build.
- **"A binary downgrade is not a schema rollback"** → mandatory pre-bump
`VACUUM INTO` backup; rollback restores the DB; a canary failure auto-restores
+ self-freezes.
- **All-at-once blast radius** → canary rollout (idle instances one at a time,
pairing-verified through the dispatch; active-agent sessions deferred, never killed).
- **`enable --now` / boot-catchup firing a missed bump mid-day** → `Persistent=true`
dropped from the timer.
Mechanism + freeze/revert/rollback ops: `docs/runbooks/t3-version-bump.md`.
First live cutover 2026-06-16: `0.0.26` → `0.0.28-nightly.20260616.571`, gated —
emo + ancamilea migrated + pairing-verified, wizard deferred (active session).
The headless `t3 serve` has **no in-app self-updater** (verified: no update-check
/ npm shell-out in `dist/bin.mjs`), so the npm install is the sole version
authority; the t3 UI's Stable/Nightly toggle governs the unused **desktop** app.

@ -1,95 +1,86 @@
# Runbook: bump the pinned t3 version (e.g. 0.0.24 → 0.0.25) # Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)
t3 on the devvm is **pinned** (`T3_PIN`, default `0.0.24`) and held there by the t3 on the devvm **auto-tracks the `nightly` npm dist-tag** (Viktor, 2026-06-16,
`t3-autoupdate` enforcer. t3 is pre-1.0 and ships breaking changes between risk explicitly accepted), via the daily `t3-autoupdate` timer. Every bump is
builds, so a bump is a **deliberate, verified, reversible** step — never an GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses
auto-update. This runbook makes it calm. Background: post-mortem the post-incident pin decision — read `2026-06-09-t3-nightly-autoupdate-auth-outage.md`
`2026-06-09-t3-nightly-autoupdate-auth-outage.md`. for why every guard below exists. t3 is still pre-1.0 and ships breaking changes
between builds; the gate is what makes auto-tracking safe.
## What a bump actually touches ## How the tracker gates each bump (`scripts/t3-autoupdate.sh`)
1. **Freeze gate** — `/etc/t3-autoupdate.freeze` present (or `T3_PIN=<ver>` set) →
hold at current, do nothing.
2. **Resolve + downgrade-guard** — `npm view t3@nightly version`; proceed only if
the target is strictly newer than installed AND a `-nightly.` build (the tag is
mutable and can point backward).
3. **Pre-bump backup** — online `VACUUM INTO` of every user's `state.sqlite` to
`/var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite` (runs AS the owner;
never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
4. **Install + health-check** — `npm i -g t3@<ver>`, then start a throwaway serve
SEEDED WITH A COPY of wizard's real populated `state.sqlite` (scratch on
`/var/tmp`, not the 2 GB tmpfs `/tmp`) so it exercises the forward MIGRATION
(the 2026-06-09 failure class) + the real mint→exchange→`t3_session` pairing
handshake. Fail → roll back binary to last-good, exit (no serve migrated yet →
clean).
5. **Canary rollout** — restart IDLE instances one at a time, verifying pairing
through the real dispatch after each. First failure → roll back binary +
restore that user's DB from the pre-bump backup + **self-freeze** (touch the
freeze file) so it cannot re-flap onto bad builds. Active-agent instances are
DEFERRED (never killed) and migrate on their next idle restart.
6. **Last-good** — advanced to the new version only on full success
(`/var/lib/t3-autoupdate/last-good`); it is the rollback target.
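Steps 1 and 2 above can be sketched in a few lines of bash. This is an illustrative sketch only, not the contents of `scripts/t3-autoupdate.sh`: the function names `is_frozen` and `should_bump` are hypothetical, and the comparison relies on GNU `sort -V`, which orders nightly build stamps correctly but is not semver-aware (it does not rank a plain release above its own pre-release), so treat it as an approximation.

```bash
#!/usr/bin/env bash
# Hypothetical helpers sketching the freeze gate and downgrade-guard.

# is_frozen FREEZE_FILE [PIN]
# Succeeds (the tracker should stop following nightly) when the freeze file
# exists or a pin version is set.
is_frozen() {
  local freeze_file=$1 pin=${2:-}
  [ -e "$freeze_file" ] || [ -n "$pin" ]
}

# should_bump INSTALLED CANDIDATE
# Succeeds only when CANDIDATE is a -nightly. build that sorts strictly after
# INSTALLED. The dist-tag is mutable, so an older or equal target is refused.
should_bump() {
  local installed=$1 candidate=$2 newest
  case "$candidate" in
    *-nightly.*) ;;        # nightly builds only
    *) return 1 ;;
  esac
  [ "$installed" != "$candidate" ] || return 1
  # Version-sort both; the candidate must come out last.
  newest=$(printf '%s\n%s\n' "$installed" "$candidate" | sort -V | tail -n 1)
  [ "$newest" = "$candidate" ]
}
```

A caller would resolve the candidate with `npm view t3@nightly version`, then exit early unless `! is_frozen /etc/t3-autoupdate.freeze "$T3_PIN"` and `should_bump "$installed" "$candidate"` both hold.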
Detection backstop (real-user pairing failures / endpoint fallback): the dispatch
logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
... failed`) → Loki alerts `T3PairingBroken` / `T3PairFallbackHigh` /
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` →
Alertmanager → Slack.
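For an ad-hoc look at the same signal outside Loki, the fallback share can be computed from dispatch log lines on stdin. This is a rough sketch, not the alert rule itself: the function name `fallback_rate` is hypothetical, and it assumes the elided `fallback=..` field is logged as `fallback=true` or `fallback=false`, which the runbook does not spell out.

```bash
#!/usr/bin/env bash
# fallback_rate: reads dispatch log lines on stdin and prints the integer
# percentage of "paired" events that used the fallback endpoint.
# ASSUMPTION: the fallback field is the literal fallback=true / fallback=false.
fallback_rate() {
  awk '
    /paired user=/ {
      total++
      if ($0 ~ /fallback=true/) fell++
    }
    END {
      if (total == 0) { print "0"; exit }
      printf "%d\n", (fell * 100) / total   # truncated toward zero
    }'
}
```

Fed from the journal, for example `journalctl -u t3-dispatch --since -1h | fallback_rate` (the unit name here is also an assumption), a rising number suggests the preferred endpoint has stopped working and only the fallback is carrying pairing.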
## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):**
```bash
sudo touch /etc/t3-autoupdate.freeze # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze # resume tracking
```
**Pin to an exact version (instead of tracking nightly):** set `T3_PIN=<ver>` in
the unit environment (or the `scripts/t3-autoupdate.sh` default) — the tracker
enforces it and stops following nightly. Keep in sync with `setup-devvm.sh`.
**Preview the current nightly without touching anything (no global change, no restarts):**
```bash
sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL
```
**Force a run now (instead of waiting for 04:00):**
```bash
sudo systemctl start t3-autoupdate.service # runs in its own cgroup, isolated from the t3-serve@ instances it manages
```
## What a bump touches (still true)
1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session` 1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session`
in 0.0.25. `t3-dispatch` is now **version-agnostic** (tries `browser-session`, in 0.0.25. `t3-dispatch` is version-agnostic (`pairEndpoints` in
falls back to `bootstrap`; see `pairEndpoints` in `scripts/t3-dispatch/main.go`), `scripts/t3-dispatch/main.go` tries browser-session, falls back to bootstrap).
so 0.0.24↔0.0.25 needs **no dispatch change**. If a *future* build renames it If a future build renames it AGAIN, the health-check + canary fail the bump and
again, add the new path to `pairEndpoints`, rebuild, redeploy first. self-freeze — then add the new path to `pairEndpoints`, rebuild + redeploy the
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` **forward** dispatch, and clear the freeze.
(`auth_pairing_links`/`auth_sessions` `role`→`scopes`, `+proof_key_thumbprint`). 2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` FORWARD — a
This is a **one-way door**: a binary downgrade alone will NOT roll it back — **one-way door**. A binary downgrade alone does NOT roll it back; you must
you must restore the DB. Hence the mandatory pre-bump backup below. restore the DB. The tracker does this automatically on a canary failure; do it
by hand (below) if a problem surfaces *after* a successful bump.
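The endpoint order described in item 1 can be illustrated in bash. The real logic lives in Go (`pairEndpoints` in `scripts/t3-dispatch/main.go`); this is only a sketch of the same try-then-fall-back idea. The names `PAIR_ENDPOINTS`, `post_credential` and `pair_with_fallback` are hypothetical, and the `curl` invocation mirrors the one the previous runbook used for its pre-flight check.

```bash
#!/usr/bin/env bash
# Endpoints in preference order: newest API first, older one as fallback.
PAIR_ENDPOINTS="/api/auth/browser-session /api/auth/bootstrap"

# post_credential BASE ENDPOINT CREDENTIAL
# Performs the real HTTP call and prints the response including headers.
post_credential() {
  curl -s -i -X POST -H 'Content-Type: application/json' \
    -d "{\"credential\":\"$3\"}" "$1$2"
}

# pair_with_fallback BASE CREDENTIAL [POSTER]
# Prints the first endpoint whose response sets a t3_session cookie.
# Returns 1 if none does, which means the API was renamed again.
# POSTER defaults to post_credential and is swappable for offline testing.
pair_with_fallback() {
  local base=$1 cred=$2 poster=${3:-post_credential} ep
  for ep in $PAIR_ENDPOINTS; do
    if "$poster" "$base" "$ep" "$cred" | grep -qi '^set-cookie: t3_session'; then
      printf '%s\n' "$ep"
      return 0
    fi
  done
  return 1
}
```

If a future build renames the path again, a new entry goes at the front of the list so the newest API is tried first and older builds still pair through the fallbacks.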
## Pre-flight (no downtime) ## Manual rollback (problem surfaces after a bump the gate let through)
```bash ```bash
# 1. Confirm the dispatch already speaks the new version's pairing API. GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want
# Install the candidate to an isolated prefix (does NOT touch the global pin): sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST
npm install --prefix /tmp/t3-cand t3@<new> # e.g. t3@0.0.25 sudo npm i -g "t3@$GOOD"
BIN=/tmp/t3-cand/node_modules/.bin/t3; D=$(mktemp -d) # Restore + restart each user's serve. The wizard/admin instance: run this from
"$BIN" serve --host 127.0.0.1 --port 3796 --base-dir "$D" >/tmp/cand.log 2>&1 & # OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
CRED=$("$BIN" auth pairing create --base-dir "$D" --ttl 5m --json | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p') # or just let it pick up $GOOD on its next natural restart.
# Try the dispatch's endpoints; one must give 200 + Set-Cookie: t3_session.
for ep in /api/auth/browser-session /api/auth/bootstrap; do
curl -s -i -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$CRED\"}" \
"http://127.0.0.1:3796$ep" | grep -iE 'HTTP/|set-cookie: t3_session'; done
kill %1; rm -rf "$D" /tmp/t3-cand
# If NO endpoint yields a t3_session cookie -> the API changed again; update
# pairEndpoints in main.go + rebuild the dispatch BEFORE proceeding.
# 2. Dispatch unit tests still green:
( cd ~/code/infra/scripts/t3-dispatch && go test ./... )
```
## The bump
```bash
NEW=0.0.25
# 1. PRE-BUMP BACKUP — the rollback safety net. Per user, stop the serve (so the
# copy is consistent + fast), copy state.sqlite, restart. Do the ACTIVE admin
# instance last / from OUTSIDE its own t3 session (you can't restart the serve
# you're running inside).
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
src=/home/$u/.t3/userdata/state.sqlite; [ -f "$src" ] || continue
sudo systemctl stop t3-serve@$u
sudo install -d -o "$u" -g "$u" -m700 /var/backups/t3-state/$u
sudo cp -a "$src" /var/backups/t3-state/$u/state-prebump-$NEW-$(date +%Y%m%d-%H%M%S).sqlite
sudo systemctl start t3-serve@$u
done
# (t3-backup-state also runs daily; this captures a guaranteed snapshot at T-0.)
# 2. Move the pin in BOTH places (keep them in sync):
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-$NEW/" ~/code/infra/scripts/t3-autoupdate.sh \
~/code/infra/scripts/workstation/setup-devvm.sh
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
# 3. Run the enforcer. It installs t3@$NEW, then HEALTH-CHECKS the real pairing
# handshake (mint -> browser-session/bootstrap -> t3_session). If pairing is
# broken in $NEW, it AUTO-ROLLS-BACK to the previous version and exits non-zero.
sudo /usr/local/bin/t3-autoupdate # restarts idle instances; defers active ones
# 4. Restart any instance the enforcer deferred (active agent), when it's idle.
# The wizard/admin instance: restart from OUTSIDE its own session, or it picks
# up $NEW on its next natural restart (the unit runs the global /usr/bin/t3).
```
## Verify
```bash
for u in vbarzin emil.barzin ancaelena98; do
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version # == $NEW
```
## Rollback (if pairing breaks or $NEW misbehaves)
The enforcer auto-rolls-back the **binary** if its health-check fails. But if a
problem surfaces *after* serves migrated their DBs forward, the binary alone
won't fix it — restore the DBs:
```bash
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-0.0.24/" ~/code/infra/scripts/t3-autoupdate.sh ~/code/infra/scripts/workstation/setup-devvm.sh
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
sudo npm i -g t3@0.0.24
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1) bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1)
[ -n "$bak" ] || continue [ -n "$bak" ] || continue
@ -98,8 +89,16 @@ for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/tty
sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm
sudo systemctl start t3-serve@$u sudo systemctl start t3-serve@$u
done done
# verify 302 + t3_session as above
``` ```
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user ## Verify (any user pairs cleanly through the dispatch)
sqlite surgery. With the backup, it's a restore.)
```bash
for u in vbarzin emil.barzin ancaelena98; do
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
done # each must be 302 + t3_session
t3 --version
```
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite
surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
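The verify loop prints headers for a human to read. For scripted use, the same "302 + `t3_session`" criterion can be turned into an exit status. This is a small sketch with a hypothetical function name, `is_paired_response`; it only inspects header text on stdin and makes no request of its own.

```bash
#!/usr/bin/env bash
# is_paired_response: reads HTTP response headers on stdin (for example from
# `curl -sI`) and succeeds only if the status is 302 AND a t3_session cookie
# is being set. Either condition alone counts as a failure.
is_paired_response() {
  local headers
  headers=$(cat)
  printf '%s\n' "$headers" | grep -qE '^HTTP/[0-9.]+ 302' || return 1
  printf '%s\n' "$headers" | grep -qi '^set-cookie: t3_session' || return 1
}

# Example use, with the same header and address as the verify loop:
#   curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ \
#     | is_paired_response || echo "pairing FAILED for $u"
```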

@ -396,11 +396,12 @@ while IFS=$'\t' read -r os_user port; do
id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true
done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file") done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
# 5b) machine-wide (once, not per-user): keep the t3 pinned-version ENFORCER enabled (it # 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it
# re-asserts T3_PIN daily; a no-op when already correct). NOT --now: with Persistent=true # follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md).
# a `--now` enable fires the missed daily job IMMEDIATELY, which on 2026-06-09 pulled a # NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing
# breaking nightly mid-day and took out auth for everyone. `enable` (no --now) just arms # a missed run mid-day with users active is exactly the 2026-06-09 shape. `enable` (no --now)
# the 04:00 schedule; fresh boxes get t3 from setup-devvm.sh's pinned install, not here. # just arms the 04:00 schedule (the timer also dropped Persistent=true so a boot can't fire a
# missed bump). Fresh boxes get t3 from setup-devvm.sh's nightly install, not here.
run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
# tmux session persistence: periodic snapshot + boot-time restore (reboot # tmux session persistence: periodic snapshot + boot-time restore (reboot
# survival for users' named claude sessions). Safe to --now: save is a # survival for users' named claude sessions). Safe to --now: save is a