workstation: stop the Claude Code onboarding wizard reappearing for terminal users

emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated.

Root cause is NOT a logout: ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user.

Fix:
- skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. So even if the race drops the flag, the next launch restores it before claude reads it.
- t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied, because terminal-lobby appends a managed section to it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
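For readers who want to see the merge-only repair in isolation, the following is a minimal sketch that replays the key loss and the re-assert against a scratch file, never the real ~/.claude.json. The function name `reassert_flag` and the placeholder keys `theme` and `projects` are assumptions made up for the illustration; they say nothing about the real file's schema, and the sketch is simpler than the committed `ensure_onboarding` (it does not touch `lastOnboardingVersion` and does not skip the write when the flag is already set).

```shell
#!/usr/bin/env bash
# Sketch only: replays the key loss and the merge-only repair on a scratch file.
# "theme" and "projects" are made-up placeholder keys, not the real schema.
set -u

scratch="$(mktemp -d)"
trap 'rm -rf "$scratch"' EXIT
cfg="$scratch/claude.json"

# State after a stale writer has dropped hasCompletedOnboarding.
printf '%s\n' '{"theme":"dark","projects":{"demo":{}}}' > "$cfg"

# Hypothetical helper: set the one flag, pass every other key through untouched.
reassert_flag() {
  local file="$1" tmp
  command -v jq >/dev/null 2>&1 || return 0                      # best-effort: no jq, no-op
  jq -e 'type == "object"' "$file" >/dev/null 2>&1 || return 0   # empty/corrupt/non-object: leave it
  tmp="$(mktemp "${file}.XXXXXX")" || return 0
  if jq '.hasCompletedOnboarding = true' "$file" > "$tmp"; then
    mv "$tmp" "$file"      # temp file sits next to the target, so this is a same-directory rename
  else
    rm -f "$tmp"
  fi
}

reassert_flag "$cfg"
reassert_flag "$cfg"       # running it again rewrites the same content
jq -c . "$cfg" 2>/dev/null || cat "$cfg"
```

Writing to a temp file beside the target and renaming it over the original means a concurrent reader sees either the old file or the new one, never a half-written one. It does not remove the underlying race between writers; it only shortens the window in which the flag is missing.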
2026-06-15 14:37:59 +00:00
commit bb3f5f2329
parent 82a0c5aedf
3 changed files with 52 additions and 0 deletions
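The copy-if-changed behaviour that `deploy_user_launcher` relies on can be tried out without root. This is a rough sketch on scratch directories, assuming only `cmp` and `install` are available; the helper name `deploy_if_changed` and the `copies` counter are hypothetical, and the `chown` step from the real function is left out because it needs root.

```shell
#!/usr/bin/env bash
# Sketch only: copy-if-changed on scratch directories (no root, no chown).
set -u

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/skel" "$work/home"
src="$work/skel/start-claude.sh"
dst="$work/home/start-claude.sh"
printf '#!/bin/sh\necho launcher v1\n' > "$src"

copies=0
# Hypothetical helper mirroring the cmp-then-install pattern.
deploy_if_changed() {
  # cmp -s prints nothing and exits non-zero when the files differ or dst is missing.
  cmp -s "$1" "$2" 2>/dev/null && return 0
  install -m 0755 "$1" "$2"
  copies=$((copies + 1))
}

deploy_if_changed "$src" "$dst"   # dst missing -> copied
deploy_if_changed "$src" "$dst"   # identical   -> skipped
printf '#!/bin/sh\necho launcher v2\n' > "$src"
deploy_if_changed "$src" "$dst"   # src edited  -> copied again
echo "copies performed: $copies"  # prints "copies performed: 2"
```

Comparing first keeps the hourly reconcile quiet: the destination's mtime and the log only change when the launcher content actually differs.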
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@@ -543,6 +543,8 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

+**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
+
 **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).

 **Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@@ -270,6 +270,24 @@ install_user_claude_token() {
  log "shared Claude token -> $user (t3-serve env; restart needed to take effect)"
 }

+# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only
+# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never
+# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's
+# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby
+# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.)
+deploy_user_launcher() {
+  local user="$1" home src dst
+  src="$WORKSTATION_DIR/skel/start-claude.sh"
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -n "$home" && -d "$home" && -f "$src" ]] || return 0
+  dst="$home/start-claude.sh"
+  cmp -s "$src" "$dst" 2>/dev/null && return 0          # already current -> no churn
+  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi
+  install -m 0755 "$src" "$dst"
+  chown "$user:$user" "$dst"
+  log "deployed start-claude.sh -> $user"
+}
+
 [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
 for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
 [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
@@ -346,6 +364,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
    fi
    install_user_kubeconfig "$os_user"
    install_user_claude_token "$os_user"
+    deploy_user_launcher "$os_user"          # keep ~/start-claude.sh current (skel only seeds new accounts)
  fi
  refresh_codex_mirror "$os_user"            # all tiers — mirror of the managed claudeMd
 done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file")
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@@ -51,6 +51,37 @@ launch() {
  fi
 }

+# Re-assert Claude Code's first-run onboarding flag before launch. ~/.claude.json is a
+# SINGLE file that ALL of a user's concurrent claude processes (this terminal, their
+# t3-serve instance, agent/SDK sessions) read-modify-write; a stale writer periodically
+# drops top-level keys — including hasCompletedOnboarding — which throws the next
+# interactive session back to the "Choose the text style" wizard even though the user is
+# fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, which is
+# never affected). Idempotent, runs as the user right before launch, never clobbers other
+# keys. Best-effort: no-op if jq is missing or the file is empty/corrupt (claude self-heals).
+ensure_onboarding() {
+  command -v jq >/dev/null 2>&1 || return 0
+  local cfg="$HOME/.claude.json" ver tmp
+  ver="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
+  if [ -s "$cfg" ]; then
+    jq -e . "$cfg" >/dev/null 2>&1 || return 0                                     # corrupt -> leave for claude
+    [ "$(jq -r '.hasCompletedOnboarding // false' "$cfg")" = "true" ] && return 0  # already set -> no write
+  elif [ -e "$cfg" ]; then
+    return 0                                                                       # empty (mid-write?) -> leave it
+  fi
+  tmp="$(mktemp "${cfg}.XXXXXX")" || return 0
+  if [ -f "$cfg" ]; then
+    jq --arg v "$ver" '.hasCompletedOnboarding = true
+      | (if $v != "" then .lastOnboardingVersion = $v else . end)' "$cfg" > "$tmp" 2>/dev/null \
+      && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
+  else
+    jq -n --arg v "$ver" '{hasCompletedOnboarding: true}
+      + (if $v != "" then {lastOnboardingVersion: $v} else {} end)' > "$tmp" 2>/dev/null \
+      && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
+  fi
+}
+ensure_onboarding
+
 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
 # isn't destroyed-and-recreated in a ttyd auto-reconnect loop.