t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 2125651aaa
commit 5ea238c707
7 changed files with 174 additions and 13 deletions
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary tracks `nightly`** via `t3-autoupdate` (daily systemd timer; health-check + auto-rollback on a bad build; restarts only idle instances) — so new models (e.g. Opus 4.8) land as t3 ships them. Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
+| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin requires first verifying `t3-dispatch`'s bootstrap flow against the new build (expect 302 + `t3_session`). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
 
 ## Active Use
 | Service | Description | Stack |

@@ -0,0 +1,138 @@
# Post-Mortem: t3 Nightly Auto-Update (0.0.25) Migrated `state.sqlite` Forward → mint/pairing Broke for All Devvm Users

## Summary

The devvm t3 auto-updater (`t3-autoupdate.timer`) pulled the `t3@nightly`
build `0.0.25-nightly.20260608.497`. That build ran two forward schema
migrations on every per-user `~/.t3/userdata/state.sqlite` (renaming
`role`→`scopes` in `auth_pairing_links` + `auth_sessions`, adding
`proof_key_thumbprint`) **and** changed the bootstrap API. The result was a
binary-vs-schema mismatch that broke `t3-mint` (pairing-credential issuance)
for **all** users — every fresh login landed on the t3 pairing prompt instead
of an authenticated session.

## Impact

- **Who:** every devvm t3 user — `wizard` (Viktor), `emo`, `ancamilea`.
- **What:** `t3 auth pairing create` failed (`AuthControlPlaneError:
  Failed to create pairing link` → `PersistenceSqlError` on
  `auth_pairing_links`), so `t3-dispatch` auto-pair returned 500/502 and the
  browser showed the pairing prompt. Existing *already-authenticated* sessions
  kept working (validated against `auth_sessions`, not the pairing path).
- **When:** ~13:56 (bad nightly installed) → ~15:16 (all users verified 302).
- **Trigger of the report:** Anca could not log in ("gets the pair prompt,
  session broken").

## Timeline (devvm clock)

- **13:56** — `t3-provision-users` step 5b ran `systemctl enable --now
  t3-autoupdate.timer`. The timer is `OnCalendar=04:00 … Persistent=true`;
  `--now` + a missed 04:00 schedule fired the daily job **immediately**.
- **13:56** — updater installed `t3@nightly` = `0.0.25-nightly.20260608.497`
  (was `0.0.24`). The `GET / → 200` health-check **passed** (it never
  exercises mint/bootstrap), so no auto-rollback. It restarted *idle* serves
  (emo) onto 0.0.25 and deferred *active* ones (wizard, ancamilea).
- **~14:38** — `t3-mint` (now global 0.0.25) ran migrations 31
  (`AuthAuthorizationScopes`) + 32 (`AuthPairingProofKeyThumbprint`) against
  each `state.sqlite` it touched → schemas moved to "level 32".
- **~14:40** — first recovery action rolled the **binary** back to `0.0.24`.
  This did **not** help: the DBs were still at level 32, so the level-30
  binary's INSERT hit `no column named role` / `NOT NULL constraint failed:
  scopes`. (Downgrading a binary after a forward migration is not a rollback.)
- **~15:01–15:16** — diagnosed the binary-vs-schema mismatch, confirmed
  `0.0.25` *stable* is **also** dispatch-incompatible (auto-pair → 502, the
  bootstrap API moved), pinned to `0.0.24`, reset the two new users' disposable
  DBs, surgically reverted wizard's two auth tables to level 30. All three
  users verified 302 + `Set-Cookie: t3_session`.

## Root Cause

Three compounding factors:

1. **Auto-tracking a pre-1.0 tool's nightly.** `t3-autoupdate.sh` ran
   `npm i -g t3@nightly`. t3 ships breaking schema-migration and bootstrap-API
   changes between builds; our `t3-dispatch` (Go) speaks a fixed bootstrap
   contract (`POST /api/auth/bootstrap {"credential":…}` → `Set-Cookie`).
2. **`enable --now` on a `Persistent=true` timer.** The provisioner's
   re-assertion of the timer didn't just *arm* the schedule — it fired the
   missed daily job on the spot, mid-afternoon, with users active.
3. **A health-check that proves nothing about auth.** The smoke test only
   probes `GET / → 200`. The 0.0.25 server answers 200 while its pairing/mint
   path is incompatible, so the "auto-rollback on bad build" never triggered.

Forward migrations + a binary downgrade = a DB the old binary can't write.
`state.sqlite` also holds the precious projection tables (session history), so
a blanket "delete and re-pair" was only safe for the brand-new users.

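Before choosing between a binary rollback and a table-level repair, it helps to know which side of the migration a given DB is on. The following is a minimal sketch, not the procedure used during the incident: it classifies an auth table from `PRAGMA table_info` output by looking for the `role` or `scopes` column named above. The function name `auth_schema_flavor` and the three labels it prints are hypothetical.

```shell
# Sketch: classify an auth table's schema from `PRAGMA table_info` output.
# Input (stdin): sqlite3's default pipe-separated rows,
#   cid|name|type|notnull|dflt_value|pk
# Prints "pre-migration" if a `role` column exists, "migrated" if a `scopes`
# column exists, and "unknown" otherwise. Name and labels are illustrative.
auth_schema_flavor() {
  # Field 2 of each row is the column name.
  cols=$(awk -F'|' '{print $2}')
  if printf '%s\n' "$cols" | grep -qx 'role'; then
    echo "pre-migration"
  elif printf '%s\n' "$cols" | grep -qx 'scopes'; then
    echo "migrated"
  else
    echo "unknown"
  fi
}
```

One possible invocation, opening the DB read-only: `sqlite3 -readonly ~/.t3/userdata/state.sqlite 'PRAGMA table_info(auth_pairing_links);' | auth_schema_flavor`. A result of `migrated` while the installed binary is `0.0.24` would indicate the mismatch described in the timeline.
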
## Detection

User report (Anca on the pairing prompt). No alert fired — the auto-updater's
own health-check is the only automated gate and it passed. **Gap:** nothing
monitors the end-to-end pairing flow.

## Fixes & Mitigations

### 1. Pin t3, stop tracking nightly (DONE)

`infra/scripts/t3-autoupdate.sh` is now a **pinned-version enforcer**:
`T3_PIN="${T3_PIN:-0.0.24}"`, `npm i -g "t3@$T3_PIN"`. It re-asserts the pin
(a no-op when already correct) instead of chasing nightly. Unit `Description`s
updated. To move the pin: bump `T3_PIN` **and first** verify `t3-dispatch`'s
bootstrap flow against the new build (`curl` the dispatch → expect 302 +
`Set-Cookie: t3_session`).

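The pin is declared in two scripts that have to agree. A drift check could look like the sketch below. It assumes each script declares the default in the form `T3_PIN="${T3_PIN:-X.Y.Z}"`; the helper names `pin_default` and `pins_in_sync` are hypothetical and not part of this commit.

```shell
# Sketch: confirm two scripts declare the same default T3_PIN.
# Assumes a declaration of the form: T3_PIN="${T3_PIN:-X.Y.Z}"

# Print the default pin declared in the given script (empty if none found).
pin_default() {
  sed -n 's/^T3_PIN="\${T3_PIN:-\([^}]*\)}".*/\1/p' "$1" | head -n 1
}

# Return 0 only when both files declare the same, non-empty default pin.
pins_in_sync() {
  pin_a=$(pin_default "$1")
  pin_b=$(pin_default "$2")
  [ -n "$pin_a" ] && [ "$pin_a" = "$pin_b" ]
}
```

For example, `pins_in_sync infra/scripts/t3-autoupdate.sh infra/scripts/workstation/setup-devvm.sh || echo "T3_PIN drift"` could run in CI or as a pre-commit check. A missing declaration in either file counts as drift.
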
### 2. Drop `--now` from the provisioner (DONE)

`infra/scripts/t3-provision-users.sh` step 5b now runs `systemctl enable
t3-autoupdate.timer` (no `--now`) — it arms the 04:00 schedule without firing a
missed job immediately.

### 3. Pinned install at machine setup (DONE)

`infra/scripts/workstation/setup-devvm.sh` installs `t3@$T3_PIN` directly, so a
fresh box has the pinned t3 immediately rather than depending on the enforcer's
first run.

### 4. Recovery actions taken on the host (DONE)

- Global `t3` rolled to `0.0.24`; enforcer redeployed + timer re-enabled
  (verified the enforcer is a no-op at the pin).
- New users (`emo` 0 threads, `ancamilea` 1 trivial thread): `state.sqlite`
  parked aside; serve restarted → fresh level-30 DB.
- `wizard` (96 threads, and the serve hosting the recovery session — cannot be
  restarted): the two auth tables were atomically rebuilt to the level-30
  schema (copied from a fresh DB) and migration records 31/32 removed.
  `auth_sessions` had 0 rows and the 0.0.24 serve never reads `scopes`, so the
  live session and all projection history were untouched. Backup:
  `/home/wizard/.t3/userdata/auth-backup-*.sql`.

### 5. End-to-end pairing health-check (DEFERRED)

The smoke test should exercise mint→bootstrap→cookie, not just `GET /`. Not
done here (the pin makes it moot for the known-good build); needed before the
enforcer is ever pointed at a new version. A blackbox probe on the dispatch
auto-pair (expect 302 + `t3_session`) would have alerted within minutes.

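The pass/fail rule for such a probe (302 plus a `t3_session` cookie) can be sketched as a small header check. This is an illustration under stated assumptions, not the deferred implementation: the function name `pairing_ok` is hypothetical, and it only inspects response headers that something else (for example `curl`) has already fetched.

```shell
# Sketch: decide whether a dispatch response shows a working auto-pair.
# Reads raw HTTP response headers on stdin and succeeds only when the status
# line is 302 AND a Set-Cookie header sets t3_session.
# The name `pairing_ok` is illustrative, not part of the repo.
pairing_ok() {
  # Strip carriage returns so header lines match cleanly.
  headers=$(tr -d '\r')
  printf '%s\n' "$headers" | grep -qE '^HTTP/[0-9.]+ 302( |$)' &&
    printf '%s\n' "$headers" | grep -qiE '^set-cookie:[[:space:]]*t3_session='
}
```

A probe might feed it with something like `curl -s -D - -o /dev/null -H 'X-authentik-username: <user>' http://10.0.10.10:3780/ | pairing_ok`. The exact path and headers that trigger auto-pair are an assumption here and would need checking against `t3-dispatch`. Unlike `GET / → 200`, this check fails on a 502 and on a redirect that sets no session cookie.
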
## Lessons

- **Don't auto-track a pre-1.0 tool's nightly.** Pin to a known-good,
  contract-verified build; upgrades are a deliberate, tested act.
- **`enable --now` on a `Persistent=true` timer fires the missed job now.**
  Use plain `enable` to arm a schedule without a surprise immediate run.
- **A liveness probe (`GET /`) is not a readiness/correctness probe.** If a
  feature (auth/pairing) can break while `/` stays 200, the health-check must
  exercise that feature or it gives false confidence.
- **A binary downgrade is not a schema rollback.** Once a forward migration
  runs, the data is migrated; the old binary now mismatches its own DB.
- **Separate disposable state from precious state before resetting.** t3's
  `state.sqlite` mixes ephemeral auth (`auth_pairing_links`, `auth_sessions`)
  with precious history (`projection_*`); surgical table-level repair
  preserved 8k+ messages that a blanket reset would have destroyed.

## References

- `infra/scripts/t3-autoupdate.sh` (pinned enforcer), `.service`, `.timer`
- `infra/scripts/t3-provision-users.sh` step 5b
- `infra/scripts/workstation/setup-devvm.sh` step 2b
- `infra/.claude/reference/service-catalog.md` (t3 serving layer)
- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql`

@@ -1,5 +1,5 @@
 [Unit]
-Description=Track latest t3 nightly (health-checked, idle-only restart)
+Description=Enforce pinned t3 version (health-checked, idle-only restart)
 After=network-online.target
 Wants=network-online.target
 

@@ -1,20 +1,30 @@
 #!/usr/bin/env bash
-# Track the latest t3 nightly — with a health-check + auto-rollback (lesson from
-# the Keel auto-update incidents: never blindly trust a new build) and idle-only
-# restarts (never kill an in-flight coding session). Runs as root via the unit.
+# Enforce the PINNED t3 version ($T3_PIN) across the box — NOT "latest/nightly".
+# t3 is pre-1.0 and ships breaking schema-migration + bootstrap-API changes between
+# builds that our t3-dispatch can't follow blind. 2026-06-09: a nightly auto-update
+# (0.0.25) migrated every ~/.t3 state.sqlite forward (auth_pairing_links/auth_sessions
+# role->scopes) AND changed the bootstrap API, breaking mint/pairing for ALL users.
+# So we PIN; this unit just re-asserts the pin (a no-op when already correct) with a
+# health-check + auto-rollback and idle-only restarts (never kill an in-flight session).
+# To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
+# new build (curl the dispatch -> expect 302 + Set-Cookie t3_session). See post-mortem
+# 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
+# CAVEAT: the health-check below only probes GET / (200) — it does NOT exercise the
+# mint/bootstrap/pairing path, so it will NOT catch an auth regression on its own.
 set -uo pipefail
+T3_PIN="${T3_PIN:-0.0.24}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
 LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
 
 ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
 
-before=$(ver); LOG "current: ${before:-unknown}"
-npm i -g t3@nightly >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
+before=$(ver); LOG "current: ${before:-unknown}; pin: $T3_PIN"
+npm i -g "t3@$T3_PIN" >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
 after=$(ver)
 
 if [[ -z "$after" || "$after" == "$before" ]]; then
-  LOG "already latest (${before:-?}); nothing to do"; exit 0
+  LOG "already at pin $T3_PIN (${before:-?}); nothing to do"; exit 0
 fi
-LOG "installed $after (was $before); health-checking…"
+LOG "re-pinned to $after (was $before); health-checking…"
 
 # Health-check the NEW binary on a throwaway port/base-dir before trusting it.
 SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)

@@ -1,5 +1,5 @@
 [Unit]
-Description=Daily t3 nightly auto-update
+Description=Daily t3 pinned-version enforcer (re-asserts T3_PIN; no-op when correct)
 
 [Timer]
 OnCalendar=*-*-* 04:00:00

@@ -191,9 +191,12 @@ while IFS=$'\t' read -r os_user port; do
   id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true
 done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
 
-# 5b) machine-wide (once, not per-user): keep the t3 nightly auto-updater enabled so it
-# self-heals hourly — a `disabled` timer silently freezes every instance on an old build.
-run systemctl enable --now t3-autoupdate.timer >/dev/null 2>&1 || true
+# 5b) machine-wide (once, not per-user): keep the t3 pinned-version ENFORCER enabled (it
+# re-asserts T3_PIN daily; a no-op when already correct). NOT --now: with Persistent=true
+# a `--now` enable fires the missed daily job IMMEDIATELY, which on 2026-06-09 pulled a
+# breaking nightly mid-day and took out auth for everyone. `enable` (no --now) just arms
+# the 04:00 schedule; fresh boxes get t3 from setup-devvm.sh's pinned install, not here.
+run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
 
 # 6) regenerate /etc/ttyd-user-map + dispatch.json from the desired state (SSoT:
 # a roster entry removed here DISAPPEARS, which is what the offboarding cut relies on)

@@ -33,6 +33,16 @@ if [[ $need_node -eq 1 ]]; then
 fi
 command -v claude >/dev/null || { log "npm: installing @anthropic-ai/claude-code"; npm install -g @anthropic-ai/claude-code >/dev/null; }
 
+# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
+# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
+# (2026-06-09 outage: a nightly auto-update broke pairing for ALL users). The daily
+# t3-autoupdate ENFORCER re-asserts this same pin; install it here so a fresh box has t3
+# immediately. Keep T3_PIN in sync with t3-autoupdate.sh.
+T3_PIN="${T3_PIN:-0.0.24}"
+if [[ "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//')" != "$T3_PIN" ]]; then
+  log "npm: installing pinned t3@$T3_PIN"; npm install -g "t3@$T3_PIN" >/dev/null
+fi
+
 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool)
 if [[ ! -x /usr/local/bin/kubelogin ]]; then
   log "kubelogin: installing int128/kubelogin"