Compare commits: 994d305d04...cdd9ecd199 (3 commits)
| Author | SHA1 | Date |
|---|---|---|
| | cdd9ecd199 | |
| | f4f7705127 | |
| | 36521839fc | |
8 changed files with 372 additions and 167 deletions
|
|
@@ -32,7 +32,7 @@
|
|||
|---------|-------------|-------|
|
||||
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
|
||||
| reverse-proxy | Generic reverse proxy | reverse-proxy |
|
||||
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. 
**Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works but dispatch **auto-pair is 401-broken on v0.0.26** (latent; live 30-day cookies mask it). | t3code |
|
||||
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=<ver>` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. 
`t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. | t3code |
|
||||
|
||||
## Active Use
|
||||
| Service | Description | Stack |
@@ -148,8 +148,39 @@ So the pin can move without another outage:
|
|||
|
||||
## References
|
||||
|
||||
- `infra/scripts/t3-autoupdate.sh` (pinned enforcer), `.service`, `.timer`
|
||||
- `infra/scripts/t3-autoupdate.sh` (gated nightly TRACKER since 2026-06-16; was the pinned enforcer), `.service`, `.timer`
|
||||
- `infra/scripts/t3-provision-users.sh` step 5b
|
||||
- `infra/scripts/workstation/setup-devvm.sh` step 2b
|
||||
- `infra/.claude/reference/service-catalog.md` (t3 serving layer)
|
||||
- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql`
|
||||
|
||||
## 2026-06-16 update: gated nightly tracking deliberately re-enabled

Viktor chose to **reverse the pin** and auto-track `t3@nightly` again, accepting
the churn risk, with the explicit requirement "make sure session auth works and
revert if the fallback/failure rate climbs." The naive nightly tracking that
caused this incident is now replaced by a GATED tracker that closes every gap the
root-cause + lessons sections named:

- **Detection gap (was still open)** → the dispatch now logs every pairing
  outcome (success endpoint + fallback) and the enforcer logs rollbacks/freezes;
  Loki alerts (`T3PairingBroken`, `T3PairFallbackHigh`, `T3AutoUpdate*`) page on
  real breakage. The pre-existing `t3-probe` only checks
  `GET /api/auth/session == 200`, which stays 200 even when pairing is dead, so
  it never caught this class.
- **"A liveness probe is not a correctness probe"** → the health-check now SEEDS
  a throwaway serve with a COPY of a real populated `state.sqlite` and runs the
  forward MIGRATION + real pairing handshake before trusting a build.
- **"A binary downgrade is not a schema rollback"** → mandatory pre-bump
  `VACUUM INTO` backup; rollback restores the DB; a canary failure auto-restores
  and self-freezes.
- **All-at-once blast radius** → canary rollout (idle instances one at a time,
  pairing-verified through the dispatch; active-agent sessions deferred, never killed).
- **`enable --now` / boot-catchup firing a missed bump mid-day** → `Persistent=true`
  dropped from the timer.

Mechanism + freeze/revert/rollback ops: `docs/runbooks/t3-version-bump.md`.
First live cutover 2026-06-16: `0.0.26` → `0.0.28-nightly.20260616.571`, gated:
emo + ancamilea migrated + pairing-verified, wizard deferred (active session).
The headless `t3 serve` has **no in-app self-updater** (verified: no update-check
/ npm shell-out in `dist/bin.mjs`), so the npm install is the sole version
authority; the t3 UI's Stable/Nightly toggle governs the unused **desktop** app.
@@ -1,95 +1,86 @@
|
|||
# Runbook: bump the pinned t3 version (e.g. 0.0.24 → 0.0.25)
|
||||
# Runbook: t3 version — gated nightly tracker (freeze / revert / roll back)
|
||||
|
||||
t3 on the devvm is **pinned** (`T3_PIN`, default `0.0.24`) and held there by the
|
||||
`t3-autoupdate` enforcer. t3 is pre-1.0 and ships breaking changes between
|
||||
builds, so a bump is a **deliberate, verified, reversible** step — never an
|
||||
auto-update. This runbook makes it calm. Background: post-mortem
|
||||
`2026-06-09-t3-nightly-autoupdate-auth-outage.md`.
|
||||
t3 on the devvm **auto-tracks the `nightly` npm dist-tag** (Viktor, 2026-06-16,
|
||||
risk explicitly accepted), via the daily `t3-autoupdate` timer. Every bump is
|
||||
GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses
|
||||
the post-incident pin decision — read `2026-06-09-t3-nightly-autoupdate-auth-outage.md`
|
||||
for why every guard below exists. t3 is still pre-1.0 and ships breaking changes
|
||||
between builds; the gate is what makes auto-tracking safe.
|
||||
|
||||
## What a bump actually touches
|
||||
## How the tracker gates each bump (`scripts/t3-autoupdate.sh`)
|
||||
|
||||
1. **Freeze gate**: `/etc/t3-autoupdate.freeze` present → hold at current, do
   nothing. A set `T3_PIN=<ver>` disables tracking and enforces that exact
   version instead.
2. **Resolve + downgrade-guard**: `npm view t3@nightly version`; proceed only if
   the target is strictly newer than installed AND a `-nightly.` build (the tag is
   mutable and can point backward).
3. **Pre-bump backup**: online `VACUUM INTO` of every user's `state.sqlite` to
   `/var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite` (runs AS the owner;
   never stops a serve). Rollback is then a RESTORE, not sqlite surgery.
4. **Install + health-check**: `npm i -g t3@<ver>`, then start a throwaway serve
   SEEDED WITH A COPY of wizard's real populated `state.sqlite` (scratch on
   `/var/tmp`, not the 2 GB tmpfs `/tmp`) so it exercises the forward MIGRATION
   (the 2026-06-09 failure class) and the real mint→exchange→`t3_session` pairing
   handshake. Fail → roll back binary to last-good, exit (no serve migrated yet →
   clean).
5. **Canary rollout**: restart IDLE instances one at a time, verifying pairing
   through the real dispatch after each. First failure → roll back binary,
   restore that user's DB from the pre-bump backup, and **self-freeze** (touch the
   freeze file) so it cannot re-flap onto bad builds. Active-agent instances are
   DEFERRED (never killed) and migrate on their next idle restart.
6. **Last-good**: advanced to the new version only on full success
   (`/var/lib/t3-autoupdate/last-good`); it is the rollback target.

Detection backstop (real-user pairing failures / endpoint fallback): the dispatch
logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus
`mint/pairing ... failed`) → Loki alerts `T3PairingBroken` / `T3PairFallbackHigh` /
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` →
Alertmanager → Slack.
|
||||
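The downgrade and channel guard from step 2 can be sketched in isolation. This is a minimal sketch, not the tracker itself: `newer` mirrors the helper of the same name in `scripts/t3-autoupdate.sh`, while `accept_target` is a hypothetical wrapper added only for illustration. It assumes GNU `sort -V`, which is version sort rather than a semver comparison, so pre-release ordering may differ from npm's.

```shell
#!/usr/bin/env bash
# Sketch of the downgrade + channel guard (accept_target is a hypothetical name).

# True when $1 is strictly newer than $2 under GNU version sort.
newer() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]
}

# accept_target <target> <current> <track>: exit 0 if the bump may proceed.
accept_target() {
  local target="$1" current="$2" track="$3"
  newer "$target" "$current" || return 1   # refuse same-or-older builds
  if [ "$track" = "nightly" ]; then
    case "$target" in
      *-nightly.*) ;;                      # version string looks like a nightly build
      *) return 1 ;;                       # tag resolved to a non-nightly version
    esac
  fi
  return 0
}

accept_target "0.0.28-nightly.20260616.571" "0.0.26" nightly && echo "bump allowed"
accept_target "0.0.24" "0.0.26" nightly || echo "downgrade refused"
```

Keeping the guard as a pure function of three strings makes it easy to exercise with made-up version numbers before trusting it against the mutable dist-tag.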
|
||||
## Operations
|
||||
|
||||
**Freeze / revert (stop tracking right now; the fast "make it stop"):**

```bash
sudo touch /etc/t3-autoupdate.freeze   # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen
sudo rm -f /etc/t3-autoupdate.freeze   # resume tracking
```

**Pin to an exact version (instead of tracking nightly):** set `T3_PIN=<ver>` in
the unit environment (or the `scripts/t3-autoupdate.sh` default); the tracker
enforces it and stops following nightly. Keep in sync with `setup-devvm.sh`.

**Preview the current nightly without touching anything (no global change, no restarts):**

```bash
sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate   # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL
```

**Force a run now (instead of waiting for 04:00):**

```bash
sudo systemctl start t3-autoupdate.service   # runs in its own cgroup, isolated from the t3-serve@ instances it manages
```
|
||||
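How the freeze file, the hard pin, and the tracked channel interact can be summarised in a few lines. The sketch below assumes the precedence visible in `scripts/t3-autoupdate.sh` (freeze gate first, then `T3_PIN`, then `T3_TRACK`); `resolve_mode` is a hypothetical helper, not part of the script.

```shell
#!/usr/bin/env bash
# Sketch: which mode a run ends up in (resolve_mode is a hypothetical name).

# resolve_mode <freeze_file> <pin> <track>
# Prints "frozen", "pin:<ver>" or "track:<tag>".
resolve_mode() {
  local freeze_file="$1" pin="$2" track="${3:-nightly}"
  if [ -e "$freeze_file" ]; then
    echo "frozen"          # the freeze file wins over everything else
  elif [ -n "$pin" ]; then
    echo "pin:$pin"        # a hard pin disables tracking
  else
    echo "track:$track"    # default: follow the npm dist-tag
  fi
}

resolve_mode "${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}" "${T3_PIN:-}" "${T3_TRACK:-nightly}"
```

Run as-is, it reports the mode the next timer run would most likely take on the current host, which can be a quick sanity check after touching or removing the freeze file.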
|
||||
## What a bump touches (still true)
|
||||
|
||||
1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session`
|
||||
in 0.0.25. `t3-dispatch` is now **version-agnostic** (tries `browser-session`,
|
||||
falls back to `bootstrap`; see `pairEndpoints` in `scripts/t3-dispatch/main.go`),
|
||||
so 0.0.24↔0.0.25 needs **no dispatch change**. If a *future* build renames it
|
||||
again, add the new path to `pairEndpoints`, rebuild, redeploy first.
|
||||
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` **forward**
|
||||
(`auth_pairing_links`/`auth_sessions` `role`→`scopes`, `+proof_key_thumbprint`).
|
||||
This is a **one-way door**: a binary downgrade alone will NOT roll it back —
|
||||
you must restore the DB. Hence the mandatory pre-bump backup below.
|
||||
in 0.0.25. `t3-dispatch` is version-agnostic (`pairEndpoints` in
|
||||
`scripts/t3-dispatch/main.go` tries browser-session, falls back to bootstrap).
|
||||
If a future build renames it AGAIN, the health-check + canary fail the bump and
|
||||
self-freeze — then add the new path to `pairEndpoints`, rebuild + redeploy the
|
||||
dispatch, and clear the freeze.
|
||||
2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` FORWARD — a
|
||||
**one-way door**. A binary downgrade alone does NOT roll it back; you must
|
||||
restore the DB. The tracker does this automatically on a canary failure; do it
|
||||
by hand (below) if a problem surfaces *after* a successful bump.
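The endpoint fallback described under "Pairing API" can be sketched in shell, following the loop the health-check uses. This is an illustration under stated assumptions, not the dispatch's Go code: `pair_via` and `fetch_curl` are hypothetical names, and port 3799 and `$CRED` are taken from the health-check context.

```shell
#!/usr/bin/env bash
# Sketch of the pairing-endpoint fallback (pair_via / fetch_curl are hypothetical).

# pair_via <fetch_fn>: tries each known endpoint in order. <fetch_fn> <endpoint>
# must print the raw HTTP response headers. Prints the endpoint that paired.
pair_via() {
  local fetch="$1" ep hdr code
  for ep in /api/auth/browser-session /api/auth/bootstrap; do
    hdr="$("$fetch" "$ep")"
    code="$(printf '%s\n' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')"
    [ "$code" = "404" ] && continue   # endpoint absent in this build: try the next
    if printf '%s\n' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session='; then
      echo "$ep"
      return 0
    fi
    return 1                          # endpoint exists but issued no session cookie
  done
  return 1                            # every known endpoint returned 404
}

# Real fetcher: assumes $CRED holds a pairing credential and a throwaway serve
# listens on 127.0.0.1:3799.
fetch_curl() {
  curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' \
    -d "{\"credential\":\"$CRED\"}" "http://127.0.0.1:3799$1" 2>/dev/null
}
```

With a serve running, `pair_via fetch_curl` prints the endpoint that paired. A future rename would show up as a non-zero exit with nothing printed, which is the case where `pairEndpoints` needs a new entry.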
|
||||
|
||||
## Pre-flight (no downtime)
|
||||
## Manual rollback (problem surfaces after a bump the gate let through)
|
||||
|
||||
```bash
|
||||
# 1. Confirm the dispatch already speaks the new version's pairing API.
|
||||
# Install the candidate to an isolated prefix (does NOT touch the global pin):
|
||||
npm install --prefix /tmp/t3-cand t3@<new> # e.g. t3@0.0.25
|
||||
BIN=/tmp/t3-cand/node_modules/.bin/t3; D=$(mktemp -d)
|
||||
"$BIN" serve --host 127.0.0.1 --port 3796 --base-dir "$D" >/tmp/cand.log 2>&1 &
|
||||
CRED=$("$BIN" auth pairing create --base-dir "$D" --ttl 5m --json | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
|
||||
# Try the dispatch's endpoints; one must give 200 + Set-Cookie: t3_session.
|
||||
for ep in /api/auth/browser-session /api/auth/bootstrap; do
|
||||
curl -s -i -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$CRED\"}" \
|
||||
"http://127.0.0.1:3796$ep" | grep -iE 'HTTP/|set-cookie: t3_session'; done
|
||||
kill %1; rm -rf "$D" /tmp/t3-cand
|
||||
# If NO endpoint yields a t3_session cookie -> the API changed again; update
|
||||
# pairEndpoints in main.go + rebuild the dispatch BEFORE proceeding.
|
||||
|
||||
# 2. Dispatch unit tests still green:
|
||||
( cd ~/code/infra/scripts/t3-dispatch && go test ./... )
|
||||
```
|
||||
|
||||
## The bump
|
||||
|
||||
```bash
|
||||
NEW=0.0.25
|
||||
# 1. PRE-BUMP BACKUP — the rollback safety net. Per user, stop the serve (so the
|
||||
# copy is consistent + fast), copy state.sqlite, restart. Do the ACTIVE admin
|
||||
# instance last / from OUTSIDE its own t3 session (you can't restart the serve
|
||||
# you're running inside).
|
||||
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
|
||||
src=/home/$u/.t3/userdata/state.sqlite; [ -f "$src" ] || continue
|
||||
sudo systemctl stop t3-serve@$u
|
||||
sudo install -d -o "$u" -g "$u" -m700 /var/backups/t3-state/$u
|
||||
sudo cp -a "$src" /var/backups/t3-state/$u/state-prebump-$NEW-$(date +%Y%m%d-%H%M%S).sqlite
|
||||
sudo systemctl start t3-serve@$u
|
||||
done
|
||||
# (t3-backup-state also runs daily; this captures a guaranteed snapshot at T-0.)
|
||||
|
||||
# 2. Move the pin in BOTH places (keep them in sync):
|
||||
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-$NEW/" ~/code/infra/scripts/t3-autoupdate.sh \
|
||||
~/code/infra/scripts/workstation/setup-devvm.sh
|
||||
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
|
||||
|
||||
# 3. Run the enforcer. It installs t3@$NEW, then HEALTH-CHECKS the real pairing
|
||||
# handshake (mint -> browser-session/bootstrap -> t3_session). If pairing is
|
||||
# broken in $NEW, it AUTO-ROLLS-BACK to the previous version and exits non-zero.
|
||||
sudo /usr/local/bin/t3-autoupdate # restarts idle instances; defers active ones
|
||||
|
||||
# 4. Restart any instance the enforcer deferred (active agent), when it's idle.
|
||||
# The wizard/admin instance: restart from OUTSIDE its own session, or it picks
|
||||
# up $NEW on its next natural restart (the unit runs the global /usr/bin/t3).
|
||||
```
|
||||
|
||||
## Verify
|
||||
|
||||
```bash
|
||||
for u in vbarzin emil.barzin ancaelena98; do
|
||||
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
|
||||
done # each must be 302 + t3_session
|
||||
t3 --version # == $NEW
|
||||
```
|
||||
|
||||
## Rollback (if pairing breaks or $NEW misbehaves)
|
||||
|
||||
The enforcer auto-rolls-back the **binary** if its health-check fails. But if a
|
||||
problem surfaces *after* serves migrated their DBs forward, the binary alone
|
||||
won't fix it — restore the DBs:
|
||||
|
||||
```bash
|
||||
sed -i "s/T3_PIN:-[0-9.]*/T3_PIN:-0.0.24/" ~/code/infra/scripts/t3-autoupdate.sh ~/code/infra/scripts/workstation/setup-devvm.sh
|
||||
sudo install -m0755 ~/code/infra/scripts/t3-autoupdate.sh /usr/local/bin/t3-autoupdate
|
||||
sudo npm i -g t3@0.0.24
|
||||
GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want
|
||||
sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST
|
||||
sudo npm i -g "t3@$GOOD"
|
||||
# Restore + restart each user's serve. The wizard/admin instance: run this from
|
||||
# OUTSIDE its own t3 session (stopping the serve you're running inside kills you);
|
||||
# or just let it pick up $GOOD on its next natural restart.
|
||||
for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do
|
||||
bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1)
|
||||
[ -n "$bak" ] || continue
|
||||
|
|
@@ -98,8 +89,16 @@ for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/tty
|
|||
sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm
|
||||
sudo systemctl start t3-serve@$u
|
||||
done
|
||||
# verify 302 + t3_session as above
|
||||
```
|
||||
|
||||
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user
|
||||
sqlite surgery. With the backup, it's a restore.)
|
||||
## Verify (any user pairs cleanly through the dispatch)
|
||||
|
||||
```bash
|
||||
for u in vbarzin emil.barzin ancaelena98; do
|
||||
curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session'
|
||||
done # each must be 302 + t3_session
|
||||
t3 --version
|
||||
```
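The verify loop hardcodes three usernames. Since `/etc/ttyd-user-map` is generated from the roster, the list could instead be derived from it. The sketch below assumes the `authentik_user=os_user` line format described in `t3-autoupdate.sh`; `ak_users` and `verify_all` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: derive the verify list from the generated user map
# (ak_users / verify_all are hypothetical names).

USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"

# Print every authentik username (left-hand side) on non-comment map lines.
ak_users() {
  awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/, "", $1); print $1 }' "$1" 2>/dev/null | sort -u
}

# Report whether the dispatch hands each mapped user a t3_session cookie.
verify_all() {
  local u
  for u in $(ak_users "$USER_MAP"); do
    if curl -sI --max-time 5 -H "X-authentik-username: $u" "http://10.0.10.10:3780/" \
        | grep -qi '^set-cookie:[[:space:]]*t3_session='; then
      echo "ok   $u"
    else
      echo "FAIL $u"
    fi
  done
}
```

Calling `verify_all` on the devvm would then cover whoever the roster currently maps, so newly added users are checked without editing the runbook.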
|
||||
|
||||
(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite
|
||||
surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.)
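Before restoring, it can help to confirm which snapshot would be used and that it is readable. A minimal sketch, assuming the `/var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite` layout above and an installed `sqlite3` CLI; `latest_prebump` and `backup_ok` are hypothetical helpers, and the default user `wizard` is only an example.

```shell
#!/usr/bin/env bash
# Sketch: choose the newest pre-bump snapshot for one user
# (latest_prebump / backup_ok are hypothetical names).

# Print the most recently modified state-prebump-* file in <dir>, if any.
latest_prebump() {
  ls -1t "$1"/state-prebump-* 2>/dev/null | head -1
}

# Return 0 when SQLite reports the snapshot as structurally sound.
backup_ok() {
  [ "$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ]
}

u="${1:-wizard}"
bak="$(latest_prebump "/var/backups/t3-state/$u")"
if [ -n "$bak" ]; then
  echo "newest snapshot for $u: $bak"
else
  echo "no pre-bump snapshot for $u"
fi
```

The backup directories are created mode 0700 and owned by each user, so on the real host this likely needs to run as that user or under `sudo`, as the rollback loop does.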
@@ -1,78 +1,229 @@
|
|||
#!/usr/bin/env bash
|
||||
# Enforce the PINNED t3 version ($T3_PIN) across the box — NOT "latest/nightly".
|
||||
# t3 is pre-1.0 and ships breaking schema-migration + bootstrap-API changes between
|
||||
# builds that our t3-dispatch can't follow blind. 2026-06-09: a nightly auto-update
|
||||
# (0.0.25) migrated every ~/.t3 state.sqlite forward (auth_pairing_links/auth_sessions
|
||||
# role->scopes) AND changed the bootstrap API, breaking mint/pairing for ALL users.
|
||||
# So we PIN; this unit just re-asserts the pin (a no-op when already correct) with a
|
||||
# health-check + auto-rollback and idle-only restarts (never kill an in-flight session).
|
||||
# To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
|
||||
# new build (curl the dispatch -> expect 302 + Set-Cookie t3_session). See post-mortem
|
||||
# 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
|
||||
# The health-check below exercises the REAL pairing handshake (mint -> credential
|
||||
# exchange -> t3_session cookie), mirroring t3-dispatch's endpoint fallback — so a
|
||||
# build that renames or breaks the pairing API fails the check and auto-rolls-back
|
||||
# (closes the 2026-06-09 miss, where a GET / probe passed a pairing-broken build).
|
||||
# t3 GATED NIGHTLY TRACKER (daily, via t3-autoupdate.timer).
|
||||
#
|
||||
# t3 is pre-1.0 and ships breaking schema-migration + pairing-API changes between
|
||||
# builds. On 2026-06-09 a blind `npm i -g t3@nightly` migrated every ~/.t3
|
||||
# state.sqlite FORWARD and moved the bootstrap API, breaking pairing for ALL users
|
||||
# with no alert (post-mortem 2026-06-09-t3-nightly-autoupdate-auth-outage.md). We
|
||||
# pinned in response.
|
||||
#
|
||||
# 2026-06-16 (Viktor's call, risk explicitly accepted): re-enable nightly tracking,
|
||||
# but GATED so a bad nightly self-heals instead of breaking everyone. This script
|
||||
# now follows the `nightly` npm dist-tag (T3_TRACK) under these guards:
|
||||
# - freeze switch (/etc/t3-autoupdate.freeze) + optional hard pin (T3_PIN) for
|
||||
# instant manual revert; a canary failure also self-freezes;
|
||||
# - downgrade-guard (the nightly tag is mutable — never move backward);
|
||||
# - pre-bump per-user state.sqlite backup BEFORE install (rollback => restore,
|
||||
# not sqlite surgery), via the same online VACUUM INTO as t3-backup-state;
|
||||
# - a health-check that seeds a throwaway instance with a COPY of a real
|
||||
# POPULATED state.sqlite, so it exercises the forward MIGRATION (the actual
|
||||
# 2026-06-09 failure class) + the real pairing handshake before trusting a build;
|
||||
# - canary rollout: restart idle instances ONE AT A TIME, verifying pairing
|
||||
# through the real dispatch after each, and roll back (binary + that user's DB)
|
||||
# + self-freeze on the first failure — active-agent instances are deferred,
|
||||
# never killed;
|
||||
# - rollback target is the recorded LAST-GOOD build, not "whatever was installed".
|
||||
# Detection backstop (real-user pairing failure/fallback) lives in the dispatch
|
||||
# logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*).
|
||||
# To stop tracking: `sudo touch /etc/t3-autoupdate.freeze` (or set T3_PIN=<ver>).
|
||||
# Full procedure + manual rollback: docs/runbooks/t3-version-bump.md.
|
||||
set -uo pipefail
|
||||
T3_PIN="${T3_PIN:-0.0.26}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
|
||||
|
||||
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
|
||||
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
|
||||
FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}"
|
||||
STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}"
|
||||
LAST_GOOD_FILE="$STATE_DIR/last-good"
|
||||
BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}"
|
||||
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
|
||||
DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}"
|
||||
USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"
|
||||
DRY_RUN="${T3_DRY_RUN:-0}"
|
||||
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
|
||||
|
||||
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
|
||||
|
||||
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
|
||||
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
|
||||
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
|
||||
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
|
||||
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
|
||||
# is $1 a strictly-newer version than $2 (version-sort)?
|
||||
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
|
||||
|
||||
before=$(ver); LOG "current: ${before:-unknown}; pin: $T3_PIN"
|
||||
npm i -g "t3@$T3_PIN" >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
|
||||
after=$(ver)
|
||||
mkdir -p "$STATE_DIR" 2>/dev/null || true
|
||||
|
||||
if [[ -z "$after" || "$after" == "$before" ]]; then
|
||||
LOG "already at pin $T3_PIN (${before:-?}); nothing to do"; exit 0
|
||||
# ---- 0. freeze gate -------------------------------------------------------------
|
||||
if [ -e "$FREEZE_FILE" ]; then
|
||||
LOG "FROZEN: $FREEZE_FILE present — holding at $(ver), not tracking $T3_TRACK"; exit 0
|
||||
fi
|
||||
LOG "re-pinned to $after (was $before); health-checking…"
|
||||
|
||||
# Health-check the NEW binary on a throwaway port/base-dir before trusting it.
|
||||
# Gate 1 = liveness (GET / -> 200); Gate 2 = the REAL pairing handshake t3-dispatch
|
||||
# performs (mint -> POST credential -> 200 + t3_session cookie), trying the same
|
||||
# endpoint fallback. Gate 2 catches a bootstrap-API rename / pairing regression.
|
||||
SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)
|
||||
t3 serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$SMOKE_DIR" >/dev/null 2>&1 &
|
||||
smoke=$!; live=0; pair_ok=0
|
||||
for _ in $(seq 1 15); do
|
||||
[[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { live=1; break; }
|
||||
sleep 2
|
||||
done
|
||||
if [[ "$live" == "1" ]]; then
|
||||
cred=$(t3 auth pairing create --base-dir "$SMOKE_DIR" --ttl 5m --json 2>/dev/null \
|
||||
| tr -d '\n ' | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')
|
||||
if [[ -n "$cred" ]]; then
|
||||
for ep in /api/auth/browser-session /api/auth/bootstrap; do # mirror t3-dispatch's fallback
|
||||
hdr=$(curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' \
|
||||
-d "{\"credential\":\"$cred\"}" "http://127.0.0.1:$SMOKE_PORT$ep" 2>/dev/null)
|
||||
code=$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')
|
||||
[[ "$code" == "404" ]] && continue # endpoint absent in this build — try the next
|
||||
printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session=' && pair_ok=1
|
||||
break
|
||||
done
|
||||
current="$(ver)"
|
||||
[ -n "$current" ] || { LOG "cannot read current t3 version — aborting (is t3 installed?)"; exit 0; }
|
||||
[ -s "$LAST_GOOD_FILE" ] || echo "$current" >"$LAST_GOOD_FILE" # seed last-good on first run
|
||||
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)"
|
||||
[ -n "$last_good" ] || last_good="$current"
|
||||
|
||||
# ---- 1. resolve target ----------------------------------------------------------
|
||||
if [ -n "$T3_PIN" ]; then
|
||||
target="$T3_PIN"
|
||||
LOG "T3_PIN=$T3_PIN set — enforcing pin (tracking disabled)"
|
||||
else
|
||||
target="$(npm view "t3@$T3_TRACK" version 2>/dev/null | tail -1 | tr -d '[:space:]')"
|
||||
[ -n "$target" ] || { LOG "could not resolve t3@$T3_TRACK from npm — staying on $current"; exit 0; }
|
||||
fi
|
||||
|
||||
[ "$target" = "$current" ] && { LOG "already on $T3_TRACK=$current; nothing to do"; exit 0; }
|
||||
|
||||
# ---- 2. downgrade + channel guard (mutable nightly tag can point backward) ------
|
||||
if [ -z "$T3_PIN" ]; then
|
||||
newer "$target" "$current" || { LOG "resolved $T3_TRACK=$target is NOT newer than installed $current — refusing downgrade"; exit 0; }
|
||||
if [ "$T3_TRACK" = "nightly" ]; then
|
||||
case "$target" in *-nightly.*) : ;; *) LOG "resolved nightly target '$target' is not a nightly build — refusing"; exit 0;; esac
|
||||
fi
|
||||
fi
|
||||
kill "$smoke" 2>/dev/null; wait "$smoke" 2>/dev/null; rm -rf "$SMOKE_DIR"
|
||||
LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_run=$DRY_RUN)"
|
||||
|
||||
if [[ "$live" != "1" || "$pair_ok" != "1" ]]; then
|
||||
LOG "HEALTH-CHECK FAILED for $after (live=$live pair=$pair_ok) — rolling back to $before"
|
||||
if [[ -n "$before" ]] && npm i -g "t3@$before" >/dev/null 2>&1; then
|
||||
LOG "rolled back to $before"
|
||||
else
|
||||
LOG "ROLLBACK FAILED — manual fix needed (t3 may be broken)"
|
||||
# ---- helpers: backup, health-check, rollback, restart-verify --------------------
|
||||
# Online consistent per-user snapshot (run AS the owner so WAL stays owned; never
|
||||
# stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health
|
||||
# check. Mirrors t3-backup-state.sh.
|
||||
ADMIN_SEED=""
|
||||
backup_all() {
|
||||
local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)"
|
||||
for u in $(osusers); do
|
||||
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue
|
||||
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
|
||||
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
|
||||
if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
|
||||
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
|
||||
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
|
||||
else
|
||||
LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst"
|
||||
fi
|
||||
done
|
||||
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
|
||||
}
|
||||
|
||||
# newest pre-bump backup taken THIS run for a user (for restore-on-rollback).
|
||||
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
|
||||
|
||||
# health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a
|
||||
# real populated DB if given, so the forward migration runs on real data), then do
|
||||
# the real mint -> credential-exchange -> t3_session pairing handshake with the
|
||||
# dispatch's endpoint fallback, and sniff the serve log for a migration failure.
|
||||
health_check() {
|
||||
local t3bin="$1" seed="${2:-}" dir logf pid live=0 pair=0 migerr=0 cred ep hdr code seeded=fresh
|
||||
dir="$(mktemp -d -p "$TMPROOT")"; mkdir -p "$dir/userdata"; logf="$dir/serve.log"
|
||||
if [ -n "$seed" ] && [ -f "$seed" ]; then cp "$seed" "$dir/userdata/state.sqlite"; seeded=populated; fi
|
||||
"$t3bin" serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$dir" >"$logf" 2>&1 &
|
||||
pid=$!
|
||||
for _ in $(seq 1 15); do
|
||||
[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" = "200" ] && { live=1; break; }
|
||||
sleep 2
|
||||
done
|
||||
if [ "$live" = "1" ]; then
|
||||
cred="$("$t3bin" auth pairing create --base-dir "$dir" --ttl 5m --json 2>/dev/null | tr -d '\n ' | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')"
|
||||
if [ -n "$cred" ]; then
|
||||
for ep in /api/auth/browser-session /api/auth/bootstrap; do
|
||||
hdr="$(curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$cred\"}" "http://127.0.0.1:$SMOKE_PORT$ep" 2>/dev/null)"
|
||||
code="$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')"
|
||||
[ "$code" = "404" ] && continue
|
||||
printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session=' && pair=1
|
||||
break
|
||||
done
|
||||
fi
|
||||
fi
|
||||
exit 1
|
||||
fi
|
||||
LOG "health OK (live + pairing handshake); restarting idle instances"
|
||||
grep -qiE 'migration failed|failed to migrate|no column named|NOT NULL constraint failed|PersistenceSqlError' "$logf" 2>/dev/null && migerr=1
|
||||
kill "$pid" 2>/dev/null; wait "$pid" 2>/dev/null
|
||||
if [ "$live" = "1" ] && [ "$pair" = "1" ] && [ "$migerr" = "0" ]; then
|
||||
LOG "health OK ($seeded: live + pairing handshake + clean migration)"
|
||||
rm -rf "$dir"; return 0
|
||||
fi
|
||||
LOG "HEALTH-CHECK FAILED ($seeded: live=$live pair=$pair migerr=$migerr); serve log: $(tail -3 "$logf" 2>/dev/null | tr '\n' '|')"
|
||||
rm -rf "$dir"; return 1
|
||||
}
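The endpoint fallback inside `health_check` turns one raw `curl -s -i` response into one of three outcomes: the endpoint is absent (404, try the next one), the pairing succeeded (a `t3_session` cookie was set), or it failed. A minimal sketch of that classification as a standalone function follows; the name `pairing_result` and the three result words are hypothetical and do not appear in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of t3-autoupdate.sh): classify a raw
# `curl -s -i` response the same way the health gate does.
pairing_result() {
  local hdr="$1" code
  # Status code = the three digits after a space on the first response line.
  code="$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')"
  if [ "$code" = "404" ]; then
    echo absent            # endpoint missing in this build: caller tries the next one
    return 0
  fi
  if printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session='; then
    echo paired            # session cookie issued: handshake worked
  else
    echo failed            # endpoint exists but did not hand out a session
  fi
}
```

Keeping "absent" separate from "failed" is what lets the loop fall through to the older endpoint on a 404 without masking a real pairing regression.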
|
||||
|
||||
# Restart only IDLE per-user instances; defer any with an active agent child.
|
||||
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' | awk '{print $1}'); do
|
||||
pid=$(systemctl show -p MainPID --value "$unit")
|
||||
if [[ -n "$pid" && "$pid" != 0 ]] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'; then
|
||||
LOG "deferring $unit (active agent) — updates next cycle when idle"
|
||||
# roll the GLOBAL binary back to last-good. Pre-restart failures need only this
|
||||
# (no real DB migrated yet); post-restart failures also restore the user's DB.
|
||||
rollback_binary() {
|
||||
LOG "rolling back binary $target -> $last_good"
|
||||
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
|
||||
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
|
||||
}
|
||||
|
||||
# is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those.
|
||||
unit_busy() {
|
||||
local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)"
|
||||
[ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'
|
||||
}
|
||||
|
||||
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
|
||||
verify_pairing() {
|
||||
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
|
||||
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
|
||||
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
|
||||
}
|
||||
|
||||
# ---- 3. DRY RUN: preview only (install candidate to temp prefix, gate it) -------
|
||||
if [ "$DRY_RUN" = "1" ]; then
|
||||
LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)"
|
||||
tmp="$(mktemp -d -p "$TMPROOT")"
|
||||
if npm i --prefix "$tmp" "t3@$target" >/dev/null 2>&1; then
|
||||
seed="$(ls -1t "$BACKUP_DIR/wizard/state-"*.sqlite 2>/dev/null | head -1)" # reuse any existing backup as seed
|
||||
if health_check "$tmp/node_modules/.bin/t3" "$seed"; then LOG "DRY_RUN: candidate $target PASSED the gate"; else LOG "DRY_RUN: candidate $target FAILED the gate"; fi
|
||||
else
|
||||
systemctl restart "$unit" && LOG "restarted $unit -> $after"
|
||||
LOG "DRY_RUN: npm could not fetch t3@$target"
|
||||
fi
|
||||
rm -rf "$tmp"; exit 0
|
||||
fi
|
||||
|
||||
# ---- 4. pre-bump backup, then install -------------------------------------------
|
||||
backup_all
|
||||
if ! npm i -g "t3@$target" >/dev/null 2>&1; then
|
||||
LOG "npm install of t3@$target FAILED — staying on $current"; exit 0
|
||||
fi
|
||||
installed="$(ver)"
|
||||
[ "$installed" = "$target" ] || { LOG "post-install version is $installed, expected $target — rolling back"; rollback_binary; exit 1; }
|
||||
|
||||
# ---- 5. gate the new binary on a POPULATED-DB migration + pairing ---------------
|
||||
if ! health_check "$(command -v t3)" "$ADMIN_SEED"; then
|
||||
rollback_binary; exit 1 # nothing restarted yet -> binary rollback is clean
|
||||
fi
|
||||
LOG "health gate passed for $target; canary-restarting idle instances one at a time"
|
||||
|
||||
# ---- 6. canary rollout: idle instances one-by-one, verify pairing after each ----
|
||||
restarted=0; deferred=0
|
||||
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
|
||||
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
|
||||
if unit_busy "$unit"; then
|
||||
LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue
|
||||
fi
|
||||
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
|
||||
ok=0
|
||||
for _ in $(seq 1 15); do
|
||||
if verify_pairing "$u"; then ok=1; break; fi
|
||||
sleep 2
|
||||
done
|
||||
if [ "$ok" = "1" ]; then
|
||||
LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1))
|
||||
else
|
||||
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
|
||||
rollback_binary
|
||||
bak="$(prebump_of "$u")"
|
||||
if [ -n "$bak" ]; then
|
||||
systemctl stop "$unit" 2>/dev/null
|
||||
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
|
||||
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
|
||||
LOG "restored $u state.sqlite from $bak"
|
||||
fi
|
||||
systemctl start "$unit" 2>/dev/null
|
||||
fi
|
||||
touch "$FREEZE_FILE" 2>/dev/null
|
||||
LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
LOG "update complete: $after"
|
||||
|
||||
# ---- 7. success: advance last-good ----------------------------------------------
|
||||
echo "$target" >"$LAST_GOOD_FILE"
|
||||
LOG "update complete: $target (restarted=$restarted deferred=$deferred); last_good now $target"
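The downgrade guard in step 2 calls `newer "$target" "$current"`, but the definition of `newer` is outside this excerpt. One plausible shape, assuming GNU `sort -V` is available on the DevVM, is sketched below. This is an assumption about the helper, not the script's actual code, and `sort -V` is a version-string sort, not a full semver comparison (it is only relied on here for ordering builds of the same kind).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the `newer` helper: succeeds only when $1 sorts
# strictly after $2 in version order. Assumes GNU coreutils `sort -V`.
newer() {
  local a="$1" b="$2"
  [ "$a" != "$b" ] || return 1     # equal versions are not "newer"
  # The highest of the two in version order must be the candidate.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | tail -n 1)" = "$a" ]
}

# Usage mirroring the guard: refuse anything that is not a strict upgrade.
if newer "0.0.10" "0.0.9"; then
  echo "upgrade allowed"
fi
```

A plain string comparison would order `0.0.9` after `0.0.10`, which is why a version-aware sort matters for a mutable tag that can move backward.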
|
||||
|
|
|
|||
|
|
@@ -1,10 +1,13 @@
 [Unit]
-Description=Daily t3 pinned-version enforcer (re-asserts T3_PIN; no-op when correct)
+Description=Daily gated t3 nightly tracker (health-checked + canary + auto-rollback)
 
 [Timer]
 OnCalendar=*-*-* 04:00:00
 RandomizedDelaySec=1h
-Persistent=true
+# Persistent deliberately OMITTED: this now installs a NEW build + migrates DBs +
+# restarts serves, so a missed 04:00 run must NOT fire on boot mid-day with users
+# active (a 2026-06-09 contributing factor). Skipping a day is fine — the next
+# 04:00 picks up the latest nightly.
 
 [Install]
 WantedBy=timers.target
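Because dropping `Persistent=true` is a safety property of this timer, a provisioning or CI step could check that nobody re-adds it. The sketch below is one way to do that under stated assumptions: the function name `timer_is_persistent` is hypothetical, and the check is a plain text match on the unit file (it does not consult systemd, so drop-in overrides are not covered).

```shell
#!/usr/bin/env bash
# Hypothetical lint: succeeds if a timer unit file enables Persistent=.
# Commented-out mentions (lines starting with '#') do not match because the
# pattern is anchored to the key at the start of the line.
timer_is_persistent() {
  grep -qiE '^[[:space:]]*Persistent[[:space:]]*=[[:space:]]*(1|y|yes|t|true|on)[[:space:]]*$' "$1"
}

# Usage: fail loudly if the given unit file has regained Persistent=true.
check_timer() {
  if timer_is_persistent "$1"; then
    echo "ERROR: $1 sets Persistent= (a missed run could fire on boot)" >&2
    return 1
  fi
}
```

The match is deliberately lenient on case and on the common boolean spellings, so it errs toward flagging.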
|
|
|
|||
|
|
@@ -396,11 +396,12 @@ while IFS=$'\t' read -r os_user port; do
 id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true
 done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
 
-# 5b) machine-wide (once, not per-user): keep the t3 pinned-version ENFORCER enabled (it
-# re-asserts T3_PIN daily; a no-op when already correct). NOT --now: with Persistent=true
-# a `--now` enable fires the missed daily job IMMEDIATELY, which on 2026-06-09 pulled a
-# breaking nightly mid-day and took out auth for everyone. `enable` (no --now) just arms
-# the 04:00 schedule; fresh boxes get t3 from setup-devvm.sh's pinned install, not here.
+# 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it
+# follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md).
+# NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing
+# a missed run mid-day with users active is exactly the 2026-06-09 shape. `enable` (no --now)
+# just arms the 04:00 schedule (the timer also dropped Persistent=true so a boot can't fire a
+# missed bump). Fresh boxes get t3 from setup-devvm.sh's nightly install, not here.
 run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
 # tmux session persistence: periodic snapshot + boot-time restore (reboot
 # survival for users' named claude sessions). Safe to --now: save is a
|
|
|
|||
|
|
@@ -55,14 +55,18 @@ PROFILE_EOF
 chmod 0644 /etc/profile.d/10-local-bin.sh
 log "/etc/profile.d/10-local-bin.sh (~/.local/bin on PATH for login shells)"
 
-# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
-# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
-# (2026-06-09 outage: a nightly auto-update broke pairing for ALL users). The daily
-# t3-autoupdate ENFORCER re-asserts this same pin; install it here so a fresh box has t3
-# immediately. Keep T3_PIN in sync with t3-autoupdate.sh.
-T3_PIN="${T3_PIN:-0.0.26}"
-if [[ "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//')" != "$T3_PIN" ]]; then
-log "npm: installing pinned t3@$T3_PIN"; npm install -g "t3@$T3_PIN" >/dev/null
+# 2b) t3 (the per-user coding surface) — GATED NIGHTLY TRACKER (2026-06-16; was pinned).
+# t3 is pre-1.0 and ships breaking auth-schema + bootstrap-API changes (2026-06-09
+# outage: a blind nightly auto-update broke pairing for ALL users). The daily
+# t3-autoupdate now FOLLOWS t3@nightly but GATES each bump (populated-DB health-check
+# + canary + auto-rollback + self-freeze) so a bad nightly self-heals. A fresh box has
+# no user state to migrate or sessions to break, so install the current nightly
+# directly; the gated tracker owns it thereafter. Keep T3_TRACK in sync with
+# t3-autoupdate.sh. To freeze/revert: `touch /etc/t3-autoupdate.freeze`.
+T3_TRACK="${T3_TRACK:-nightly}"
+want_t3="$(npm view "t3@$T3_TRACK" version 2>/dev/null | tail -1)"
+if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//')" != "$want_t3" ]]; then
+log "npm: installing t3@$T3_TRACK ($want_t3)"; npm install -g "t3@$want_t3" >/dev/null
 fi
 
 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
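The new install condition in step 2b has two parts: the channel must have resolved to a version, and the installed version (last field of `t3 --version`, leading `v` stripped) must differ from it. A small sketch of that decision as pure functions is shown below so it can be exercised without `npm` or `t3` present. The names `norm_version` and `needs_install` are hypothetical, and the exact `t3 --version` output format is assumed from the `awk`/`sed` pipeline in the hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the 2b condition in setup-devvm.sh.

# Reduce a `t3 --version` style line to a bare version: last field, no "v".
norm_version() {
  printf '%s\n' "$1" | awk '{print $NF}' | sed 's/^v//'
}

# needs_install <raw --version output> <wanted version>
# True only when a wanted version was resolved AND it differs from what is
# installed. An empty wanted version (npm unreachable) means "leave it alone".
needs_install() {
  local installed wanted="$2"
  installed="$(norm_version "$1")"
  [ -n "$wanted" ] && [ "$installed" != "$wanted" ]
}
```

The empty-`wanted` guard matters on a fresh box with no network: without it, an unresolved channel would compare unequal to any installed version and attempt `npm install -g "t3@"`.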
|
|
|
|||
16
stacks/monitoring/imports.tf
Normal file
|
|
@@ -0,0 +1,16 @@
+# One-shot adoption of two alert-digest resources that exist in-cluster but fell
+# out of Terraform state — the monitoring apply was create-failing on every push
+# with `configmaps "alert-digest-script" already exists` and `secrets
+# "alert-digest" already exists` (pre-existing: pipelines 203 AND 204). Importing
+# reconciles them into state so `terraform apply` UPDATES instead of failing to
+# create. These blocks are idempotent (a no-op once the resources are in state)
+# and may be removed after the next green apply. Defs: modules/monitoring/alert_digest.tf.
+import {
+  to = module.monitoring.kubernetes_config_map.alert_digest_script
+  id = "monitoring/alert-digest-script"
+}
+
+import {
+  to = module.monitoring.kubernetes_secret.alert_digest
+  id = "monitoring/alert-digest"
+}
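The header comment says the import blocks may be removed after the next green apply. One way to confirm that it is safe is to check that both addresses now appear in `terraform state list`. The sketch below assumes that command's one-address-per-line output; the function name `all_in_state` is hypothetical, and it reads the listing from stdin so it can be tried without a Terraform backend.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if every given resource address appears
# as an exact line in a `terraform state list` listing read from stdin.
all_in_state() {
  local listing addr
  listing="$(cat)"
  for addr in "$@"; do
    printf '%s\n' "$listing" | grep -Fxq -- "$addr" || return 1
  done
}

# Intended usage, run from the stack directory (not executed here):
#   terraform state list | all_in_state \
#     module.monitoring.kubernetes_config_map.alert_digest_script \
#     module.monitoring.kubernetes_secret.alert_digest \
#     && echo "both adopted; import blocks can go"
```

Exact whole-line matching (`-Fx`) avoids a false positive from a longer address that merely contains the one being checked.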