t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)

Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-16 11:33:49 +00:00
parent f4f7705127
commit cdd9ecd199
4 changed files with 126 additions and 95 deletions


@@ -32,7 +32,7 @@
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
| reverse-proxy | Generic reverse proxy | reverse-proxy |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin is a deliberate, reversible step via `docs/runbooks/t3-version-bump.md` (pre-bump `state.sqlite` backup → bump `T3_PIN` → enforcer install with a REAL pairing health-check that auto-rolls-back → verify → restore). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works but dispatch **auto-pair is 401-broken on v0.0.26** (latent; live 30-day cookies mask it). | t3code |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=<ver>` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. | t3code |
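
The gate order described in the t3code row (freeze check, downgrade-guard, backup, seeded health-check, one-at-a-time canary restart, rollback + self-freeze) can be sketched roughly as below. This is an illustrative model only, not the contents of `t3-autoupdate.sh`: `run_gates`, `Outcome`, `version_key`, the callback parameters and the action strings are all hypothetical names. It also assumes that a dry run stops before anything is changed and that a failed backup aborts the bump without freezing; neither is confirmed by the row.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Hypothetical comparator: looks only at the numeric dotted prefix,
    e.g. '0.0.27-nightly.1' -> (0, 0, 27). Real nightly tags may need more."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


@dataclass
class Outcome:
    action: str                 # what the run decided (hypothetical labels)
    version: str                # version left installed afterwards
    freeze: bool                # True if the tracker should stay/become frozen
    deferred: List[str] = field(default_factory=list)  # active instances left alone


def run_gates(
    current: str,
    candidate: str,
    last_good: str,
    frozen: bool,
    dry_run: bool,
    instances: Dict[str, bool],                 # instance name -> has an active agent
    backup: Callable[[], bool],                 # stands in for the VACUUM INTO backup
    health_check: Callable[[str], bool],        # stands in for the seeded pairing check
    restart_and_verify: Callable[[str], bool],  # restart one instance, verify pairing
) -> Outcome:
    if frozen:
        return Outcome("frozen", current, True)
    if candidate == current:
        return Outcome("noop", current, False)
    # Downgrade-guard: never follow the channel backwards.
    if version_key(candidate) < version_key(current):
        return Outcome("refused-downgrade", current, False)
    if dry_run:
        # Assumption: a dry run reports the candidate and changes nothing.
        return Outcome("dry-run", current, False)
    if not backup():
        # Assumption: without a fresh backup the bump is skipped.
        return Outcome("backup-failed", current, False)
    if not health_check(candidate):
        return Outcome("rolled-back", last_good, True)

    # Canary: restart idle instances one at a time; stop at the first failure.
    deferred = [name for name, active in instances.items() if active]
    for name, active in instances.items():
        if active:
            continue  # active-agent instances are deferred, never killed
        if not restart_and_verify(name):
            return Outcome("rolled-back", last_good, True, deferred)
    return Outcome("bumped", candidate, False, deferred)
```

Passing the backup, health-check and restart steps as callbacks keeps the ordering logic separate from the side effects, so the sequence can be checked without touching a real `state.sqlite` or a running `t3-serve@<u>` unit.
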
## Active Use
| Service | Description | Stack |