A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly every
access-token lifetime (~8h). Re-issuing the credential never sticks — the race
recurs (this is why emo's "standalone token" fix kept regressing).
Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.
Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same t3-disconnect root-cause work: a runaway claude agent child grew to
10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off
its spinning disk (system-wide multi-10s freezes = every t3 client's 20s
watchdog firing = the 'frequent disconnects that self-recover'), then
the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min
because the default OOMPolicy=stop fails the unit when ANY cgroup child
is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid
swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue
so a runaway agent dies alone while the WS server keeps serving.
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>