Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
Parent: 834c5e6a2a
Commit: 5549fc3672
11 changed files with 408 additions and 28 deletions

@@ -547,6 +547,8 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`.
**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).

@@ -561,7 +563,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.
**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and **per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (`ha` + `claude_memory` + `.credentials.json` + beads Dolt cred — **per-user playwright browser MCP done 2026-06-16**, see above), and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
## Related

@@ -110,7 +110,7 @@ The Config base / machine-wide managed layer is **secret-free**. Everything carr

| Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) |
|---|---|---|
| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own |
| **Claude OAuth** | `~/.claude/.credentials.json` + isolated Vault backup | own Enterprise SSO login; Claude refreshes locally and `claude-auth-sync@<user>.timer` validates/backs up/recovers `claudeAiOauth` at `secret/workstation/claude-users/<os_user>`; shared token injection is forbidden |
| **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED — not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write — NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. |
| **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file) — only if HA-eligible |
| **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret |

docs/runbooks/claude-auth-renew-workstation.md
Normal file, 95 lines
@@ -0,0 +1,95 @@
# Workstation Claude authentication renewal

## Scope

Every roster user authenticates Claude Code with their own Enterprise identity.
Credentials are never shared between OS users. Claude refreshes its normal OAuth
access token; `claude-auth-sync@<user>.timer` verifies that refresh using real
inference every six hours and backs up only the `claudeAiOauth` object to:

```text
secret/workstation/claude-users/<os-user>
```
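
A minimal sketch of that selection step, assuming the credentials file is a JSON object with top-level `claudeAiOauth` and `mcpOAuth` keys as this runbook describes. The function name `backup_payload` is hypothetical and not taken from the actual sync script, and how the object is wrapped for the Vault write is also an assumption.

```python
import json


def backup_payload(credentials_text: str) -> dict:
    """Return only the claudeAiOauth object from a .credentials.json document.

    Hypothetical helper: every other top-level key (for example mcpOAuth) is
    dropped here, so it can never reach the Vault write.
    """
    credentials = json.loads(credentials_text)
    if not isinstance(credentials, dict) or "claudeAiOauth" not in credentials:
        raise ValueError("no claudeAiOauth object; the user has not logged in yet")
    return {"claudeAiOauth": credentials["claudeAiOauth"]}
```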

The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
path. The service renews the Vault token on every run.
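
The headroom this leaves can be worked out directly. The sketch below assumes the timer fires at exact six-hour intervals and that each successful renewal resets the token's TTL to the full 32-day period. The constant and function names are illustrative and not taken from the service.

```python
from datetime import timedelta

TOKEN_PERIOD = timedelta(days=32)   # periodic Vault token TTL (from this runbook)
SYNC_INTERVAL = timedelta(hours=6)  # timer cadence (from this runbook)


def runs_before_expiry(period: timedelta = TOKEN_PERIOD,
                       interval: timedelta = SYNC_INTERVAL) -> int:
    """Count scheduled runs that fall strictly before the token expires.

    Assumes the last successful renewal happened at t=0 and that runs are
    then scheduled at interval, 2*interval, and so on.
    """
    full, remainder = divmod(period, interval)
    # A run landing exactly on the expiry instant is too late to renew.
    return full if remainder else full - 1


print(runs_before_expiry())  # 127
```

Under those assumptions the token lapses only if every one of those 127 consecutive runs fails to renew it.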

## Normal lifecycle

1. Add the user to `scripts/workstation/roster.yaml` and apply the Vault stack.
2. Run `scripts/workstation/setup-devvm.sh` as root with the admin Vault token.
   Its foreground provisioner mints the isolated periodic token and enables the
   user's timer. Routine hourly provisioning never needs an admin token.
3. The user completes one initial Enterprise login:

   ```bash
   claude auth login --claudeai --sso --email <enterprise-email>
   ```

4. Start the first sync immediately instead of waiting for the timer:

   ```bash
   systemctl start claude-auth-sync@<os-user>.service
   systemctl status claude-auth-sync@<os-user>.service
   ```

Success writes no secrets to the journal. The user's private log records `OK` in
`~/.local/state/claude-auth-sync/sync.log`; journald receives the same status with
`identifier=claude-auth-sync` for Loki alerting.

## Automatic recovery

`claude auth status` is not a sufficient health check: it can report logged in
while inference returns HTTP 401. The service therefore runs a minimal Haiku
inference with no session persistence. On failure it:

1. reads the user's latest OAuth object from Vault;
2. atomically merges it into `.credentials.json`, preserving MCP OAuth state;
3. retries inference once;
4. stores the newly refreshed OAuth object back in Vault on success.

Vault KV version history remains available for audit, but the service deliberately
does not cycle through old refresh tokens: providers commonly invalidate rotated
refresh tokens, so replaying old versions can make recovery less deterministic.
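
The four recovery steps above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: `load_oauth`, `merge_oauth`, `sync_once`, and the injected `check`, `vault_read`, and `vault_write` callables are hypothetical names standing in for the inference probe and the Vault client.

```python
import json
import os
import tempfile
from typing import Callable


def load_oauth(credentials_path: str) -> dict:
    """Read the claudeAiOauth object from the user's credentials file."""
    with open(credentials_path, encoding="utf-8") as handle:
        return json.load(handle)["claudeAiOauth"]


def merge_oauth(credentials_path: str, oauth: dict) -> None:
    """Atomically replace claudeAiOauth, keeping every other key (e.g. mcpOAuth)."""
    try:
        with open(credentials_path, encoding="utf-8") as handle:
            credentials = json.load(handle)
    except FileNotFoundError:
        credentials = {}
    credentials["claudeAiOauth"] = oauth

    directory = os.path.dirname(os.path.abspath(credentials_path))
    # mkstemp creates the file with mode 0600 in the same directory, so the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".credentials.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(credentials, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, credentials_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def sync_once(credentials_path: str,
              check: Callable[[], bool],
              vault_read: Callable[[], dict],
              vault_write: Callable[[dict], None]) -> str:
    """Run one validate/recover cycle and return "OK" or "FAIL"."""
    if not check():
        # Steps 1-2: restore only the latest Vault copy, never older versions.
        merge_oauth(credentials_path, vault_read())
        # Step 3: exactly one retry.
        if not check():
            return "FAIL"
    # Step 4 (and the routine backup on a healthy run).
    vault_write(load_oauth(credentials_path))
    return "OK"
```

Writing to a temporary file in the same directory and then calling `os.replace` means a crash mid-write leaves the previous `.credentials.json` intact rather than a truncated file.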

## Recovery requiring a person

If both local state and the latest Vault copy fail, the refresh token was revoked,
invalidated, or the Enterprise session requires reauthorization. Run the login as
the affected OS user, then rerun the service:

```bash
claude auth login --claudeai --sso --email <enterprise-email>
systemctl start claude-auth-sync@$(id -un).service
```

If the scoped Vault token expired or drift protection rejected it, rerun the root
provisioner with an admin Vault token after confirming the matching policy exists:

```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
export VAULT_TOKEN="$(cat /home/wizard/.vault-token)"
sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
```

Never copy another user's `.credentials.json` or scoped Vault token. Never restore
the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
login and would silently collapse all users onto one identity.

## Verification

```bash
systemctl list-timers 'claude-auth-sync@*'
systemctl status claude-auth-sync@<os-user>.service
journalctl -t claude-auth-sync --since today
```

Inspect Vault metadata, not secret values:

```bash
vault kv metadata get secret/workstation/claude-users/<os-user>
```

Alert `WorkstationClaudeAuthInvalid` fires when any renewal agent logs `FAIL`.