diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md
index 727b0da4..8156530e 100644
--- a/docs/runbooks/claude-auth-renew-workstation.md
+++ b/docs/runbooks/claude-auth-renew-workstation.md
@@ -80,8 +80,64 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
 ```
 
 Never copy another user's `.credentials.json` or scoped Vault token. Never restore
-the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
-login and would silently collapse all users onto one identity.
+a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
+outrank per-user login and would silently collapse all users onto one identity.
+(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
+identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
+
+## Long-lived per-user token (heavy concurrent-agent users)
+
+The six-hourly renewal above assumes Claude owns refresh-token rotation in a
+single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
+sessions** (interactive tmux panes + their `t3-serve` instance + always-on
+`start-claude.sh` agents) breaks that assumption: when the shared access token
+expires, the processes refresh **simultaneously**, the OAuth server rotates the
+refresh token, and the losing writer persists an **empty** refresh token —
+logging the user out roughly every access-token lifetime (~8h). Re-issuing the
+credential does not help; the race recurs.
+
+The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
+**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
+never touches `.credentials.json` — so there is nothing to race on.
+This is the user's OWN Enterprise identity (scope `user:inference`; local MCP
+servers are client-side and unaffected), stored only in their OWN Vault path —
+**NOT** the forbidden shared token, and it never crosses OS users.
+
+**Enable it (one-time, per user):**
+
+1. The user mints their own token (interactive Enterprise SSO):
+
+   ```bash
+   claude setup-token   # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
+   ```
+
+2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
+   like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
+
+   ```bash
+   vault kv patch -method=rw secret/workstation/claude-users/ \
+     setup_token=sk-ant-oat01-…
+   ```
+
+3. Materialize + activate (or just wait ≤6h for the timer):
+
+   ```bash
+   systemctl start claude-auth-sync@.service
+   ```
+
+   `claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
+   (`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
+   the rotating-credential validate/backup/restore (so no false
+   `WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
+   that env file. **Sessions started before activation keep the old credential
+   until relaunched** — the user must restart their agents / `t3-serve` to cut over.
+
+**Disable it:** clear the field (`vault kv patch -method=rw
+secret/workstation/claude-users/ setup_token=""`) — the next sync removes
+the env file and the user reverts to the per-user SSO credential flow.
+
+**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
+re-store (step 2); the env file refreshes on the next sync.
 
 ## Verification
 
diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service
index 4109b36b..7f3d765d 100644
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@@ -11,6 +11,12 @@ Environment=HOME=/home/%i
 Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
 Environment=NODE_ENV=production
 EnvironmentFile=/etc/t3-serve/%i.env
+# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
+# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
+# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
+# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
+# users on the normal per-user Enterprise-SSO credential flow).
+EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
 WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh
index 0ea94f48..b9676df9 100755
--- a/scripts/workstation/claude-auth-sync.sh
+++ b/scripts/workstation/claude-auth-sync.sh
@@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
 CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
 CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
 CAS_LOG="$CAS_STATE_DIR/sync.log"
+# Where a long-lived per-user setup-token is materialized as an env file
+# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
+# already-ReadWritePaths config dir so the sandboxed service may write it.
+CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
 
 cas_log() {
   mkdir -p "$CAS_STATE_DIR"
@@ -133,6 +137,41 @@ cas_restore() {
   cas_log "RECOVERED restored Claude OAuth state from Vault"
 }
 
+# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
+# be stored in this user's OWN Vault path (field `setup_token`). When present it
+# is the authoritative credential: it bypasses the shared
+# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
+# users running many concurrent Claude sessions (interactive + t3-serve + always-on
+# agents) that otherwise race on refresh and wipe each other's refresh token.
+# We materialize it to a user-owned env file that start-claude.sh and
+# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
+# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
+# OS users. Returns 0 when a token is active, so the caller skips the
+# rotating-credential validate/backup/restore (probing the now-vestigial
+# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
+cas_sync_setup_token() {
+  local token desired tmp
+  token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
+  if [[ "$token" != sk-ant-oat01-* ]]; then
+    if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
+      rm -f "$CAS_TOKEN_ENV_FILE"
+      cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
+    fi
+    return 1
+  fi
+  desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
+  if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
+    cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
+    return 0
+  fi
+  tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
+  printf '%s\n' "$desired" > "$tmp"
+  chmod 0600 "$tmp"
+  mv "$tmp" "$CAS_TOKEN_ENV_FILE"
+  cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
+  return 0
+}
+
 cas_main() {
   umask 077
   for bin in jq vault claude timeout flock; do
@@ -143,6 +182,11 @@ cas_main() {
   flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
 
   cas_prepare_vault || return 1
+  # A long-lived per-user setup-token, if provisioned, is authoritative and
+  # non-rotating — materialize it and skip the rotating-credential dance.
+  if cas_sync_setup_token; then
+    return 0
+  fi
   if cas_live_auth_ok; then
     cas_backup
     return
diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh
index b3e25744..45ed9c4a 100755
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@@ -93,6 +93,15 @@ ensure_onboarding() {
 }
 ensure_onboarding
 
+# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
+# materialized one from this user's own Vault path. A non-rotating setup-token
+# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
+# logs out users running many concurrent agents (interactive + t3 + always-on).
+# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
+# token; never shared between OS users.
+_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
+if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
+
 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
 # isn't destroyed-and-recreated in a ttyd auto-reconnect loop.
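
Review note: the runbook's Verification section is not touched by this change, so it may help to have a quick check that the materialized env file looks the way `cas_sync_setup_token` intends (readable, mode 0600, a single `CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-…` line). The sketch below is one possible way to do that. The helper name `check_oauth_env` is hypothetical and not part of this diff, and it assumes GNU coreutils `stat` (`-c '%a'`), which should hold on the systemd hosts these scripts target. It never prints the token value.

```bash
# Hypothetical helper, NOT part of this change: sanity-check the env file that
# claude-auth-sync materializes. Assumes GNU coreutils `stat -c`.
check_oauth_env() {
  local file mode lines content
  # Default path matches the one used by start-claude.sh in this diff.
  file="${1:-$HOME/.config/claude-auth-sync/claude-oauth.env}"

  [ -r "$file" ] || { echo "missing or unreadable: $file" >&2; return 1; }

  # The sync script writes the file with mode 0600.
  mode="$(stat -c '%a' "$file")" || return 1
  [ "$mode" = "600" ] || { echo "unexpected mode $mode on $file" >&2; return 1; }

  # Expect exactly one KEY=VALUE line.
  lines="$(wc -l < "$file")"
  [ "$lines" -eq 1 ] || { echo "expected 1 line, found $lines" >&2; return 1; }

  # Check the key and the setup-token prefix without echoing the secret.
  content="$(cat "$file")"
  case "$content" in
    CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-*) ;;
    *) echo "unexpected content in $file" >&2; return 1 ;;
  esac

  echo "ok: $file"
}
```

Run as the affected user after `claude-auth-sync` has completed, `check_oauth_env` with no argument inspects the default path; a non-zero exit with the file absent is the expected state for users on the normal per-user SSO credential flow.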