workstation: per-user long-lived Claude token to end concurrent-refresh logout

A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly once
per access-token lifetime (~8h). Re-issuing the credential never sticks because
the race recurs (this is why emo's "standalone token" fix kept regressing).
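
The lost update described above can be modeled with a toy script. This is only a sketch: `refresh`, `server_valid`, and the `rt-N` token names are hypothetical stand-ins, not the real Claude client or OAuth server, and the two "processes" run one after the other to keep the outcome deterministic.

```shell
#!/usr/bin/env bash
# Toy model of the refresh race; every name here is illustrative.
set -u
cred_file="$(mktemp)"      # stands in for ~/.claude/.credentials.json
server_valid="rt-1"        # the only refresh token the "server" accepts
echo "rt-1" > "$cred_file"

# Hypothetical refresh: a valid token is exchanged and rotated; a stale one
# fails, and the caller persists an empty refresh token.
refresh() {
  local presented="$1"
  if [[ "$presented" == "$server_valid" ]]; then
    server_valid="rt-$(( ${server_valid#rt-} + 1 ))"
    echo "$server_valid" > "$cred_file"
  else
    : > "$cred_file"       # losing writer wipes the winner's fresh token
  fi
}

# Both processes read the shared file before either one refreshes.
a_token="$(<"$cred_file")"
b_token="$(<"$cred_file")"
refresh "$a_token"         # winner: file now holds rt-2
refresh "$b_token"         # loser: presents stale rt-1, file ends up empty
final="$(<"$cred_file")"
echo "refresh token on disk: '${final}'"
rm -f "$cred_file"
```

With a non-rotating token there is no second write to lose, which is the property the fix relies on.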

Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.

Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.
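
The opt-in behavior can be sketched as follows. This is a minimal illustration, not the shipped scripts: `load_token_env` is a hypothetical helper that mirrors the guarded load, a temporary directory stands in for ~/.config/claude-auth-sync, and the token value is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: demonstrates "absent file -> no-op, present file -> exported".
set -u
demo_dir="$(mktemp -d)"
env_file="$demo_dir/claude-oauth.env"

# Hypothetical helper: clears any inherited value, then loads the env file
# with auto-export only if it is readable.
load_token_env() {
  unset CLAUDE_CODE_OAUTH_TOKEN
  if [ -r "$1" ]; then set -a; . "$1"; set +a; fi
}

# Case 1: no env file, so nothing is set and the normal SSO flow applies.
load_token_env "$env_file"
without="${CLAUDE_CODE_OAUTH_TOKEN:-<unset>}"

# Case 2: env file present (placeholder value, not a real token).
printf '%s\n' 'CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-EXAMPLE' > "$env_file"
load_token_env "$env_file"
with="${CLAUDE_CODE_OAUTH_TOKEN:-<unset>}"

echo "absent file:  $without"
echo "present file: $with"
rm -rf "$demo_dir"
```

Because the load is guarded by a readability test, removing the file is enough to return a user to the rotating-credential flow on the next session start.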

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 08:07:43 +00:00
parent 3cc8f9f661
commit c70810a51b
4 changed files with 117 additions and 2 deletions


@@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
CAS_LOG="$CAS_STATE_DIR/sync.log"
# Where a long-lived per-user setup-token is materialized as an env file
# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
# already-ReadWritePaths config dir so the sandboxed service may write it.
CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
cas_log() {
  mkdir -p "$CAS_STATE_DIR"
@@ -133,6 +137,41 @@ cas_restore() {
  cas_log "RECOVERED restored Claude OAuth state from Vault"
}
# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
# be stored in this user's OWN Vault path (field `setup_token`). When present it
# is the authoritative credential: it bypasses the shared
# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
# users running many concurrent Claude sessions (interactive + t3-serve + always-on
# agents) that otherwise race on refresh and wipe each other's refresh token.
# We materialize it to a user-owned env file that start-claude.sh and
# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
# OS users. Returns 0 when a token is active, so the caller skips the
# rotating-credential validate/backup/restore (probing the now-vestigial
# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
cas_sync_setup_token() {
  local token desired tmp
  token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
  if [[ "$token" != sk-ant-oat01-* ]]; then
    if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
      rm -f "$CAS_TOKEN_ENV_FILE"
      cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
    fi
    return 1
  fi
  desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
  if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
    cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
    return 0
  fi
  tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
  printf '%s\n' "$desired" > "$tmp"
  chmod 0600 "$tmp"
  mv "$tmp" "$CAS_TOKEN_ENV_FILE"
  cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
  return 0
}
cas_main() {
  umask 077
  for bin in jq vault claude timeout flock; do
@@ -143,6 +182,11 @@ cas_main() {
  flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
  cas_prepare_vault || return 1
  # A long-lived per-user setup-token, if provisioned, is authoritative and
  # non-rotating; materialize it and skip the rotating-credential dance.
  if cas_sync_setup_token; then
    return 0
  fi
  if cas_live_auth_ok; then
    cas_backup
    return


@@ -93,6 +93,15 @@ ensure_onboarding() {
}
ensure_onboarding
# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
# materialized one from this user's own Vault path. A non-rotating setup-token
# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
# logs out users running many concurrent agents (interactive + t3 + always-on).
# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
# token; never shared between OS users.
_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
# Deliberately not `exec` so we can branch on the exit code: clean quit ends the
# pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
# isn't destroyed-and-recreated in a ttyd auto-reconnect loop.