diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 0ba680cb..c8022fa1 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit).
New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary tracks `nightly`** via `t3-autoupdate` (daily systemd timer; health-check + auto-rollback on a bad build; restarts only idle instances) — so new models (e.g. Opus 4.8) land as t3 ships them. Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
+| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie).
**Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin requires first verifying `t3-dispatch`'s bootstrap flow against the new build (expect 302 + `t3_session`). 
Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
 
 ## Active Use
 | Service | Description | Stack |
diff --git a/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md
new file mode 100644
index 00000000..e8f0d2d5
--- /dev/null
+++ b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md
@@ -0,0 +1,138 @@
+# Post-Mortem: t3 Nightly Auto-Update (0.0.25) Migrated `state.sqlite` Forward → mint/pairing Broke for All Devvm Users
+
+## Summary
+
+The devvm t3 auto-updater (`t3-autoupdate.timer`) pulled the `t3@nightly`
+build `0.0.25-nightly.20260608.497`. That build ran two forward schema
+migrations on every per-user `~/.t3/userdata/state.sqlite` (renaming
+`role`→`scopes` in `auth_pairing_links` + `auth_sessions`, adding
+`proof_key_thumbprint`) **and** changed the bootstrap API. The result was a
+binary-vs-schema mismatch that broke `t3-mint` (pairing-credential issuance)
+for **all** users — every fresh login landed on the t3 pairing prompt instead
+of an authenticated session.
+
+## Impact
+
+- **Who:** every devvm t3 user — `wizard` (Viktor), `emo`, `ancamilea`.
+- **What:** `t3 auth pairing create` failed (`AuthControlPlaneError:
+  Failed to create pairing link` → `PersistenceSqlError` on
+  `auth_pairing_links`), so `t3-dispatch` auto-pair returned 500/502 and the
+  browser showed the pairing prompt. Existing *already-authenticated* sessions
+  kept working (validated against `auth_sessions`, not the pairing path).
+- **When:** ~13:56 (bad nightly installed) → ~15:16 (all users verified 302).
+- **Trigger of the report:** Anca could not log in ("gets the pair prompt,
+  session broken").
+
+## Timeline (devvm clock)
+
+- **13:56** — `t3-provision-users` step 5b ran `systemctl enable --now
+  t3-autoupdate.timer`.
The timer is `OnCalendar=04:00 … Persistent=true`;
+  `--now` + a missed 04:00 schedule fired the daily job **immediately**.
+- **13:56** — updater installed `t3@nightly` = `0.0.25-nightly.20260608.497`
+  (was `0.0.24`). The `GET / → 200` health-check **passed** (it never
+  exercises mint/bootstrap), so no auto-rollback. It restarted *idle* serves
+  (emo) onto 0.0.25 and deferred *active* ones (wizard, ancamilea).
+- **~14:38** — `t3-mint` (now global 0.0.25) ran migrations 31
+  (`AuthAuthorizationScopes`) + 32 (`AuthPairingProofKeyThumbprint`) against
+  each `state.sqlite` it touched → schemas moved to "level 32".
+- **~14:40** — first recovery action rolled the **binary** back to `0.0.24`.
+  This did **not** help: the DBs were still at level 32, so the level-30
+  binary's INSERT hit `no column named role` / `NOT NULL constraint failed:
+  scopes`. (Downgrading a binary after a forward migration is not a rollback.)
+- **~15:01–15:16** — diagnosed the binary-vs-schema mismatch, confirmed
+  `0.0.25` *stable* is **also** dispatch-incompatible (auto-pair → 502, the
+  bootstrap API moved), pinned to `0.0.24`, reset the two new users' disposable
+  DBs, surgically reverted wizard's two auth tables to level 30. All three
+  users verified 302 + `Set-Cookie: t3_session`.
+
+## Root Cause
+
+Three compounding factors:
+
+1. **Auto-tracking a pre-1.0 tool's nightly.** `t3-autoupdate.sh` ran
+   `npm i -g t3@nightly`. t3 ships breaking schema-migration and bootstrap-API
+   changes between builds; our `t3-dispatch` (Go) speaks a fixed bootstrap
+   contract (`POST /api/auth/bootstrap {"credential":…}` → `Set-Cookie`).
+2. **`enable --now` on a `Persistent=true` timer.** The provisioner's
+   re-assertion of the timer didn't just *arm* the schedule — it fired the
+   missed daily job on the spot, mid-afternoon, with users active.
+3. **A health-check that proves nothing about auth.** The smoke test only
+   probes `GET / → 200`.
The 0.0.25 server answers 200 while its pairing/mint
+   path is incompatible, so the "auto-rollback on bad build" never triggered.
+
+Forward migrations + a binary downgrade = a DB the old binary can't write.
+`state.sqlite` also holds the precious projection tables (session history), so
+a blanket "delete and re-pair" was only safe for the brand-new users.
+
+## Detection
+
+User report (Anca on the pairing prompt). No alert fired — the auto-updater's
+own health-check is the only automated gate and it passed. **Gap:** nothing
+monitors the end-to-end pairing flow.
+
+## Fixes & Mitigations
+
+### 1. Pin t3, stop tracking nightly (DONE)
+
+`infra/scripts/t3-autoupdate.sh` is now a **pinned-version enforcer**:
+`T3_PIN="${T3_PIN:-0.0.24}"`, `npm i -g "t3@$T3_PIN"`. It re-asserts the pin
+(a no-op when already correct) instead of chasing nightly. Unit `Description`s
+updated. To move the pin: bump `T3_PIN` **and first** verify `t3-dispatch`'s
+bootstrap flow against the new build (`curl` the dispatch → expect 302 +
+`Set-Cookie: t3_session`).
+
+### 2. Drop `--now` from the provisioner (DONE)
+
+`infra/scripts/t3-provision-users.sh` step 5b now runs `systemctl enable
+t3-autoupdate.timer` (no `--now`) — it arms the 04:00 schedule without firing a
+missed job immediately.
+
+### 3. Pinned install at machine setup (DONE)
+
+`infra/scripts/workstation/setup-devvm.sh` installs `t3@$T3_PIN` directly, so a
+fresh box has the pinned t3 immediately rather than depending on the enforcer's
+first run.
+
+### 4. Recovery actions taken on the host (DONE)
+
+- Global `t3` rolled to `0.0.24`; enforcer redeployed + timer re-enabled
+  (verified the enforcer is a no-op at the pin).
+- New users (`emo` 0 threads, `ancamilea` 1 trivial thread): `state.sqlite`
+  parked aside; serve restarted → fresh level-30 DB.
+- `wizard` (96 threads, and the serve hosting the recovery session — cannot be
+  restarted): the two auth tables were atomically rebuilt to the level-30
+  schema (copied from a fresh DB) and migration records 31/32 removed.
+  `auth_sessions` had 0 rows and the 0.0.24 serve never reads `scopes`, so the
+  live session and all projection history were untouched. Backup:
+  `/home/wizard/.t3/userdata/auth-backup-*.sql`.
+
+### 5. End-to-end pairing health-check (DEFERRED)
+
+The smoke test should exercise mint→bootstrap→cookie, not just `GET /`. Not
+done here (the pin makes it moot for the known-good build); needed before the
+enforcer is ever pointed at a new version. A blackbox probe on the dispatch
+auto-pair (expect 302 + `t3_session`) would have alerted within minutes.
+
+## Lessons
+
+- **Don't auto-track a pre-1.0 tool's nightly.** Pin to a known-good,
+  contract-verified build; upgrades are a deliberate, tested act.
+- **`enable --now` on a `Persistent=true` timer fires the missed job now.**
+  Use plain `enable` to arm a schedule without a surprise immediate run.
+- **A liveness probe (`GET /`) is not a readiness/correctness probe.** If a
+  feature (auth/pairing) can break while `/` stays 200, the health-check must
+  exercise that feature or it gives false confidence.
+- **A binary downgrade is not a schema rollback.** Once a forward migration
+  runs, the data is migrated; the old binary now mismatches its own DB.
+- **Separate disposable state from precious state before resetting.** t3's
+  `state.sqlite` mixes ephemeral auth (`auth_pairing_links`, `auth_sessions`)
+  with precious history (`projection_*`); surgical table-level repair
+  preserved 8k+ messages that a blanket reset would have destroyed.
+
+## References
+
+- `infra/scripts/t3-autoupdate.sh` (pinned enforcer), `.service`, `.timer`
+- `infra/scripts/t3-provision-users.sh` step 5b
+- `infra/scripts/workstation/setup-devvm.sh` step 2b
+- `infra/.claude/reference/service-catalog.md` (t3 serving layer)
+- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql`
diff --git a/scripts/t3-autoupdate.service b/scripts/t3-autoupdate.service
index d3306da7..7b043f13 100644
--- a/scripts/t3-autoupdate.service
+++ b/scripts/t3-autoupdate.service
@@ -1,5 +1,5 @@
 [Unit]
-Description=Track latest t3 nightly (health-checked, idle-only restart)
+Description=Enforce pinned t3 version (health-checked, idle-only restart)
 After=network-online.target
 Wants=network-online.target
 
diff --git a/scripts/t3-autoupdate.sh b/scripts/t3-autoupdate.sh
index 962f3fc4..836605f0 100644
--- a/scripts/t3-autoupdate.sh
+++ b/scripts/t3-autoupdate.sh
@@ -1,20 +1,30 @@
 #!/usr/bin/env bash
-# Track the latest t3 nightly — with a health-check + auto-rollback (lesson from
-# the Keel auto-update incidents: never blindly trust a new build) and idle-only
-# restarts (never kill an in-flight coding session). Runs as root via the unit.
+# Enforce the PINNED t3 version ($T3_PIN) across the box — NOT "latest/nightly".
+# t3 is pre-1.0 and ships breaking schema-migration + bootstrap-API changes between
+# builds that our t3-dispatch can't follow blind. 2026-06-09: a nightly auto-update
+# (0.0.25) migrated every ~/.t3 state.sqlite forward (auth_pairing_links/auth_sessions
+# role->scopes) AND changed the bootstrap API, breaking mint/pairing for ALL users.
+# So we PIN; this unit just re-asserts the pin (a no-op when already correct) with a
+# health-check + auto-rollback and idle-only restarts (never kill an in-flight session).
+# To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
+# new build (curl the dispatch -> expect 302 + Set-Cookie t3_session).
See post-mortem
+# 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
+# CAVEAT: the health-check below only probes GET / (200) — it does NOT exercise the
+# mint/bootstrap/pairing path, so it will NOT catch an auth regression on its own.
 set -uo pipefail
+T3_PIN="${T3_PIN:-0.0.24}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
 LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
 ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
 
-before=$(ver); LOG "current: ${before:-unknown}"
-npm i -g t3@nightly >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
+before=$(ver); LOG "current: ${before:-unknown}; pin: $T3_PIN"
+npm i -g "t3@$T3_PIN" >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
 after=$(ver)
 if [[ -z "$after" || "$after" == "$before" ]]; then
-  LOG "already latest (${before:-?}); nothing to do"; exit 0
+  LOG "already at pin $T3_PIN (${before:-?}); nothing to do"; exit 0
 fi
 
-LOG "installed $after (was $before); health-checking…"
+LOG "re-pinned to $after (was $before); health-checking…"
 
 # Health-check the NEW binary on a throwaway port/base-dir before trusting it.
 SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)
diff --git a/scripts/t3-autoupdate.timer b/scripts/t3-autoupdate.timer
index a59135f7..ccdbd4c6 100644
--- a/scripts/t3-autoupdate.timer
+++ b/scripts/t3-autoupdate.timer
@@ -1,5 +1,5 @@
 [Unit]
-Description=Daily t3 nightly auto-update
+Description=Daily t3 pinned-version enforcer (re-asserts T3_PIN; no-op when correct)
 
 [Timer]
 OnCalendar=*-*-* 04:00:00
diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh
index 8c269bdd..37689153 100644
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@@ -191,9 +191,12 @@ while IFS=$'\t' read -r os_user port; do
   id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true
 done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
 
-# 5b) machine-wide (once, not per-user): keep the t3 nightly auto-updater enabled so it
-# self-heals hourly — a `disabled` timer silently freezes every instance on an old build.
-run systemctl enable --now t3-autoupdate.timer >/dev/null 2>&1 || true
+# 5b) machine-wide (once, not per-user): keep the t3 pinned-version ENFORCER enabled (it
+# re-asserts T3_PIN daily; a no-op when already correct). NOT --now: with Persistent=true
+# a `--now` enable fires the missed daily job IMMEDIATELY, which on 2026-06-09 pulled a
+# breaking nightly mid-day and took out auth for everyone. `enable` (no --now) just arms
+# the 04:00 schedule; fresh boxes get t3 from setup-devvm.sh's pinned install, not here.
+run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
 
 # 6) regenerate /etc/ttyd-user-map + dispatch.json from the desired state (SSoT:
 # a roster entry removed here DISAPPEARS, which is what the offboarding cut relies on)
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh
index f929b30a..faf7b7bc 100755
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@@ -33,6 +33,16 @@ if [[ $need_node -eq 1 ]]; then
 fi
 command -v claude >/dev/null || { log "npm: installing @anthropic-ai/claude-code"; npm install -g @anthropic-ai/claude-code >/dev/null; }
 
+# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
+# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
+# (2026-06-09 outage: a nightly auto-update broke pairing for ALL users). The daily
+# t3-autoupdate ENFORCER re-asserts this same pin; install it here so a fresh box has t3
+# immediately. Keep T3_PIN in sync with t3-autoupdate.sh.
+T3_PIN="${T3_PIN:-0.0.24}"
+if [[ "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//')" != "$T3_PIN" ]]; then
+  log "npm: installing pinned t3@$T3_PIN"; npm install -g "t3@$T3_PIN" >/dev/null
+fi
+
 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool)
 if [[ ! -x /usr/local/bin/kubelogin ]]; then
   log "kubelogin: installing int128/kubelogin"
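
Follow-up note on the deferred item (post-mortem fix 5): the post-mortem asks for a probe that checks the pairing contract (302 plus a `t3_session` cookie) rather than `GET / → 200`. The check itself could be sketched roughly as below. This is not part of the change: `pairing_ok` is a hypothetical helper name, and the sketch assumes the cookie arrives as a standard `Set-Cookie: t3_session=<value>` header.

```shell
#!/usr/bin/env bash
# Sketch only: pairing_ok is a hypothetical helper, not code from this diff.
# It reads raw HTTP response headers on stdin (for example the output of
# `curl -sS -o /dev/null -D - ...`) and succeeds only when the response is a
# 302 that also sets a t3_session cookie, the contract named in the post-mortem.
pairing_ok() {
  local headers status
  # Normalise CRLF line endings so the patterns below match cleanly.
  headers=$(tr -d '\r')
  # The status line looks like "HTTP/1.1 302 Found"; field 2 is the code.
  status=$(printf '%s\n' "$headers" | awk 'NR==1 {print $2}')
  [[ "$status" == "302" ]] || return 1
  # Header names are case-insensitive, hence -i.
  printf '%s\n' "$headers" | grep -qiE '^set-cookie:[[:space:]]*t3_session='
}

# Assumed usage. The address and header name come from the service catalog
# entry; whether t3-dispatch accepts a direct request like this (bypassing
# the Authentik ingress) is an assumption to verify before relying on it:
#   curl -sS -o /dev/null -D - -H "X-authentik-username: emo" \
#     http://10.0.10.10:3780/ | pairing_ok || echo "pairing probe FAILED"
```

Keeping the header check separate from the `curl` call means the same function could back both the updater's smoke test and an external blackbox probe, and it can be exercised with canned responses without touching a live instance.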