All checks were successful
ci/woodpecker/push/default Pipeline was successful
A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance, all sharing one ~/.claude/.credentials.json. When the shared access token expires the processes refresh simultaneously; OAuth refresh-token rotation makes the losing writer persist an EMPTY refresh token, logging the user out roughly every access-token lifetime (~8h). Re-issuing the credential never sticks — the race recurs (this is why emo's "standalone token" fix kept regressing). Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope user:inference) kept in the user's OWN Vault path (field `setup_token`). claude-auth-sync materializes it to a user-owned ~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the rotating-credential validate/backup/restore (so no false WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating token and there is nothing to race on. Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so users on the normal per-user Enterprise-SSO flow are unaffected. This is each user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook documents enable/disable/rotate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
156 lines
6.6 KiB
Markdown
156 lines
6.6 KiB
Markdown
# Workstation Claude authentication renewal
|
|
|
|
## Scope
|
|
|
|
Every roster user authenticates Claude Code with their own Enterprise identity.
|
|
Credentials are never shared between OS users. Claude refreshes its normal OAuth
|
|
access token; `claude-auth-sync@<user>.timer` verifies that refresh using real
|
|
inference every six hours and backs up only the `claudeAiOauth` object to:
|
|
|
|
```text
|
|
secret/workstation/claude-users/<os-user>
|
|
```
|
|
|
|
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
|
|
`kv put` only when the path does not exist yet), so keys that other tools
|
|
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
|
|
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
|
|
|
|
The user's unrelated `mcpOAuth` credentials never leave their home directory.
|
|
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
|
|
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
|
|
path. The service renews the Vault token on every run.
|
|
|
|
## Normal lifecycle
|
|
|
|
1. Add the user to `scripts/workstation/roster.yaml` and apply the Vault stack.
|
|
2. Run `scripts/workstation/setup-devvm.sh` as root with the admin Vault token.
|
|
Its foreground provisioner mints the isolated periodic token and enables the
|
|
user's timer. Routine hourly provisioning never needs an admin token.
|
|
3. The user completes one initial Enterprise login:
|
|
|
|
```bash
|
|
claude auth login --claudeai --sso --email <enterprise-email>
|
|
```
|
|
|
|
4. Start the first sync immediately instead of waiting for the timer:
|
|
|
|
```bash
|
|
systemctl start claude-auth-sync@<os-user>.service
|
|
systemctl status claude-auth-sync@<os-user>.service
|
|
```
|
|
|
|
Success writes no secrets to the journal. The user's private log records `OK` in
|
|
`~/.local/state/claude-auth-sync/sync.log`; journald receives the same status with
|
|
`identifier=claude-auth-sync` for Loki alerting.
|
|
|
|
## Automatic recovery
|
|
|
|
`claude auth status` is not a sufficient health check: it can report logged in
|
|
while inference returns HTTP 401. The service therefore runs a minimal Haiku
|
|
inference with no session persistence. On failure it:
|
|
|
|
1. reads the user's latest OAuth object from Vault;
|
|
2. atomically merges it into `.credentials.json`, preserving MCP OAuth state;
|
|
3. retries inference once;
|
|
4. stores the newly refreshed OAuth object back in Vault on success.
|
|
|
|
Vault KV version history remains available for audit, but the service deliberately
|
|
does not cycle through old refresh tokens: providers commonly invalidate rotated
|
|
refresh tokens, so replaying old versions can make recovery less deterministic.
|
|
|
|
## Recovery requiring a person
|
|
|
|
If both local state and the latest Vault copy fail, the refresh token was revoked,
|
|
invalidated, or the Enterprise session requires reauthorization. Run the login as
|
|
the affected OS user, then rerun the service:
|
|
|
|
```bash
|
|
claude auth login --claudeai --sso --email <enterprise-email>
|
|
systemctl start claude-auth-sync@$(id -un).service
|
|
```
|
|
|
|
If the scoped Vault token expired or drift protection rejected it, rerun the root
|
|
provisioner with an admin Vault token after confirming the matching policy exists:
|
|
|
|
```bash
|
|
export VAULT_ADDR=https://vault.viktorbarzin.me
|
|
export VAULT_TOKEN="$(cat /home/wizard/.vault-token)"
|
|
sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
|
|
```
|
|
|
|
Never copy another user's `.credentials.json` or scoped Vault token. Never restore
|
|
a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
|
|
outrank per-user login and would silently collapse all users onto one identity.
|
|
(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
|
|
identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
|
|
|
|
## Long-lived per-user token (heavy concurrent-agent users)
|
|
|
|
The six-hourly renewal above assumes Claude owns refresh-token rotation in a
|
|
single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
|
|
sessions** (interactive tmux panes + their `t3-serve` instance + always-on
|
|
`start-claude.sh` agents) breaks that assumption: when the shared access token
|
|
expires, the processes refresh **simultaneously**, the OAuth server rotates the
|
|
refresh token, and the losing writer persists an **empty** refresh token —
|
|
logging the user out roughly every access-token lifetime (~8h). Re-issuing the
|
|
credential does not help; the race recurs.
|
|
|
|
The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
|
|
**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
|
|
never touches `.credentials.json` — so there is nothing to race on. This is the
|
|
user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
|
|
client-side and unaffected), stored only in their OWN Vault path — **NOT** the
|
|
forbidden shared token, and it never crosses OS users.
|
|
|
|
**Enable it (one-time, per user):**
|
|
|
|
1. The user mints their own token (interactive Enterprise SSO):
|
|
|
|
```bash
|
|
claude setup-token # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
|
|
```
|
|
|
|
2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
|
|
like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
|
|
|
|
```bash
|
|
vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
|
|
setup_token=sk-ant-oat01-…
|
|
```
|
|
|
|
3. Materialize + activate (or just wait ≤6h for the timer):
|
|
|
|
```bash
|
|
systemctl start claude-auth-sync@<os-user>.service
|
|
```
|
|
|
|
`claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
|
|
(`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
|
|
the rotating-credential validate/backup/restore (so no false
|
|
`WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
|
|
that env file. **Sessions started before activation keep the old credential
|
|
until relaunched** — the user must restart their agents / `t3-serve` to cut over.
|
|
|
|
**Disable it:** clear the field (`vault kv patch -method=rw
|
|
secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
|
|
the env file and the user reverts to the per-user SSO credential flow.
|
|
|
|
**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
|
|
re-store (step 2); the env file refreshes on the next sync.
|
|
|
|
## Verification
|
|
|
|
```bash
|
|
systemctl list-timers 'claude-auth-sync@*'
|
|
systemctl status claude-auth-sync@<os-user>.service
|
|
journalctl -t claude-auth-sync --since today
|
|
```
|
|
|
|
Inspect Vault metadata, not secret values:
|
|
|
|
```bash
|
|
vault kv metadata get secret/workstation/claude-users/<os-user>
|
|
```
|
|
|
|
Alert `WorkstationClaudeAuthInvalid` fires when any renewal agent logs `FAIL`.
|