infra

Author	SHA1	Message	Date
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00
Viktor Barzin	0a6ed4b2fe	workstation: per-user playwright browser MCP for all users, reproducible from git Viktor asked that the playwright browser MCP be available for every devvm user in every directory, with each user running their own server and multiple concurrent sessions per user. Before this, playwright was hand-set-up per user (~/.config/systemd/user/ playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired — emo's and anca's servers ran but their ~/.claude.json had no playwright entry, so their Claude never connected. None of it was reproducible from git (units, refresh script, and the Vault snapshot token lived only in user homes), so a devvm rebuild would silently lose it. This makes it reproducible and fixes the unwired users: - roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931, allocated for every roster user incl. the admin), emitted in the derive JSON. - scripts/workstation/playwright/: system-level TEMPLATE units (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer}, User=%i — system manager, so no systemd --user / linger) + the refresh script. @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll footgun, same rationale as T3_PIN). - setup-devvm.sh: install the templates + script (9e); stage the chrome-service snapshot bearer token from Vault to a root file (8c) — the hourly root reconcile has no Vault token, mirrors the Claude OAuth staging in 8a. - t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --now's the instances (idempotent, never restarts a running server). Also hardened the section-1 .env scan to skip the new playwright-.env files (no T3_PORT -> grep no-match would abort under set -e -o pipefail). - Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3. Supersedes the hand-made per-user --user units (one-time idle-gated migration to follow on the live host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:33:47 +00:00
Viktor Barzin	36521839fc	t3: gated nightly tracker (replaces pinned enforcer) + drop timer Persistent Phase 2 of "track t3 nightly, accept the risk, but make sure session auth works and revert if it breaks". Rewrites the daily t3-autoupdate from a pinned-version enforcer into a NIGHTLY TRACKER that gates every bump so a bad build self-heals instead of repeating 2026-06-09: - follows the t3@nightly npm dist-tag (T3_TRACK; T3_PIN still works as a hard freeze; /etc/t3-autoupdate.freeze is the manual revert switch); - downgrade-guard (the nightly tag is mutable — never move backward) + channel sanity (target must be a -nightly. build); - pre-bump per-user state.sqlite backup (online VACUUM INTO) BEFORE install, so rollback is a restore not sqlite surgery; - health-check now SEEDS a throwaway instance with a COPY of a real POPULATED state.sqlite, exercising the forward MIGRATION (the actual 2026-06-09 failure class) + the real mint->exchange->t3_session pairing handshake before trusting a build. Scratch dir is on /var/tmp (disk), not the 2G tmpfs /tmp; - canary rollout: restart idle instances ONE AT A TIME, verify pairing through the real dispatch after each, and on the first failure roll back (binary + that user's DB from the pre-bump backup) AND self-freeze so it can't re-flap onto bad builds. Active-agent instances are deferred, never killed. Rollback target is the recorded LAST-GOOD, not "whatever was installed"; - DRY_RUN mode (T3_DRY_RUN=1) previews the gate against a temp-prefix install — validated: 0.0.28-nightly.20260616.571 PASSES the populated-DB migration gate. timer: drop Persistent=true (a missed 04:00 must not fire a real bump on boot mid-day with users active — a 2026-06-09 contributing factor). setup-devvm.sh: install t3@nightly on fresh boxes (no state to break), in sync. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 10:08:12 +00:00
Viktor Barzin	ef555c7e02	workstation: put ~/.local/bin on PATH so the launcher finds native claude Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:20:03 +00:00
Viktor Barzin	eecd78233b	workstation: standardize on the native claude install (drop npm-global + npx) Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:05 +00:00
Viktor Barzin	312c418a9a	workstation: setup-devvm.sh installs the systemd service layer (reproducible rebuild) The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users, t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed /etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account created if absent. t3-serve@ stays a per-user template enabled by the provisioner; the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby). Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse. NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:07:20 +00:00
Viktor Barzin	eb8695743b	workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir) - claude-code: detect via `npm ls -g` not `command -v claude` — the admin's personal ~/.local/bin/claude shadowed the PATH check, so the system-wide install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude; fresh non-admins had no claude). Found during the devvm reproducibility audit. - kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes built weeks apart are byte-identical. - /etc/t3-serve: mkdir before the token writes (install -m doesn't create the parent — section 8 would fail on a fresh box). - codex shared auth: stage /opt/codex-shared/auth.json from Vault secret/workstation.codex_shared_auth_json (key already existed but nothing consumed it — was a manual step lost on rebuild), mirroring the Claude token. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	3e7093947d	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	baac46415f	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	39e35ca8c9	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	06f5c12476	workstation: setup-devvm.sh hardens the admin's unlocked tree (o-rx, not world-readable) Codifies the leak fix found during the emo cutover: /home/wizard/code is git-crypt-DECRYPTED in the admin's working tree, but was mode 0775 (o+rx) — so any devvm user (even outside code-shared) could read decrypted secrets by path (verified: emo read certificate.pfx as plaintext DER). setup-devvm.sh now chmod o-rx the admin tree so a rebuild keeps it. Live fix already applied (now drwxrws---). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:08:52 +00:00
Viktor Barzin	2c1865eabb	workstation: roster-driven provisioner (SSoT reconcile, additive-only) t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target). Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:18:12 +00:00
Viktor Barzin	1757cb59e7	workstation: machine-wide config inheritance (managed claudeMd + setup-devvm.sh + skel) Spike confirmed (claude 2.1.168): /etc/claude-code/managed-settings.json claudeMd reaches a session (sentinel echoed). Hybrid inheritance = enforced org claudeMd machine-wide (top precedence, non-overridable) + per-user ~/.claude/{skills,rules,...} symlinks to the config base (live, the proven emo pattern) seeded via /etc/skel. setup-devvm.sh is idempotent: apt toolset, node>=18 + claude-code, system-wide kubelogin (NOT the Azure apt pkg), the managed config, and /etc/skel (launcher that cd's $HOME/code, tmux UX, inheritance symlinks). Verified: emo unchanged (groups/symlinks/live sessions intact), emo can read the managed config, idempotent re-run clean. Security fix (host state): /home/wizard/.claude/settings.json was 0664, exposing MEMORY_API_KEY to all devvm users -> chmod 0600. chezmoi source needs a private_ prefix + the key templated out to persist this (dotfiles-repo follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:07:04 +00:00
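
Commits 0a6ed4b2fe and 2c1865eabb both mention "sticky" per-user ports derived by roster_engine.py. The log does not show that code, so the following is only a minimal sketch of what sticky allocation could look like, under stated assumptions: the function name `allocate_sticky_ports`, its signature, and the idea of passing the previous assignments in as a dict are hypothetical, and the user-to-port mapping in the example is illustrative (the log names ports 8931/8932/8933 but not which user held which). Only `PLAYWRIGHT_BASE_PORT=8931` comes from the commit message.

```python
from typing import Dict, Iterable

PLAYWRIGHT_BASE_PORT = 8931  # base port named in commit 0a6ed4b2fe


def allocate_sticky_ports(
    users: Iterable[str],
    existing: Dict[str, int],
    base_port: int = PLAYWRIGHT_BASE_PORT,
) -> Dict[str, int]:
    """Return a user -> port map that never moves an already-assigned port.

    Hypothetical helper, not the real roster_engine.py implementation.
    """
    # Keep every prior assignment (even for users absent from this roster)
    # so a user who returns later gets the same port back.
    ports = dict(existing)
    used = set(ports.values())
    next_port = base_port
    for user in users:
        if user in ports:
            continue  # sticky: an existing assignment is never changed
        while next_port in used:
            next_port += 1  # skip ports already held by someone else
        ports[user] = next_port
        used.add(next_port)
    return ports


if __name__ == "__main__":
    first = allocate_sticky_ports(["wizard", "emo", "anca"], {})
    # Re-running with a reordered roster plus a new user keeps old ports.
    second = allocate_sticky_ports(["anca", "newuser", "wizard", "emo"], first)
    print(first)
    print(second)
```

Because earlier assignments are carried forward rather than recomputed from roster order, a reconcile that runs hourly would not shuffle ports when the roster is reordered or extended, which is the property the word "sticky" suggests.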

15 commits