Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror

# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00
commit 8a2a3d9eca
parent 88717c61fd 63e714782c
13 changed files with 383 additions and 159 deletions
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@@ -547,6 +547,8 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).

+**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
+
 **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).

 **Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
@@ -559,7 +561,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.

-**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and **per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
+**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and **per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (`ha` + `claude_memory` + `.credentials.json` + beads Dolt cred — **per-user playwright browser MCP done 2026-06-16**, see above), and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.

 ## Related

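The sticky `PLAYWRIGHT_PORT` allocation described in the `multi-tenancy.md` hunk above could look roughly like the sketch below. It is an illustration only, not the logic in `roster_engine.py`: the state-file format (one `<user> <port>` pair per line), the function name `allocate_port`, and the lowest-free-port policy are all assumptions. Only `PLAYWRIGHT_BASE_PORT=8931` comes from the text.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of "sticky" per-user port allocation.
# Not the real roster_engine.py; the state file layout is an assumption.
PLAYWRIGHT_BASE_PORT=8931   # value quoted in the doc text

# allocate_port STATE_FILE USER
# Prints the user's port. An existing assignment is always reused;
# otherwise the lowest unassigned port >= PLAYWRIGHT_BASE_PORT is recorded.
allocate_port() {
  local state_file=$1 user=$2 port
  touch "$state_file"
  # Sticky part: look up a previous assignment for this user first.
  port=$(awk -v u="$user" '$1 == u { print $2; exit }' "$state_file")
  if [ -z "$port" ]; then
    port=$PLAYWRIGHT_BASE_PORT
    # Walk upward while some user already holds this port.
    while awk -v p="$port" '$2 == p { found = 1 } END { exit !found }' "$state_file"; do
      port=$((port + 1))
    done
    printf '%s %s\n' "$user" "$port" >> "$state_file"
  fi
  printf '%s\n' "$port"
}
```

Calling `allocate_port` twice for the same user returns the same port, so a URL already written into that user's `~/.claude.json` would stay valid across reconcile runs under this scheme.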
--- a/docs/plans/2026-06-07-multi-user-workstation-plan.md
+++ b/docs/plans/2026-06-07-multi-user-workstation-plan.md
@@ -129,6 +129,21 @@ users:

 ### Task 2.3: Inject per-user MCP + auth secrets (new users only; never clobber)

+> **PARTIAL — per-user playwright browser MCP DONE (2026-06-16), reproducible from git.**
+> Implemented NOT via the "write a fresh `~/.claude.json`" step below (that skips
+> EXISTING users who have a `.claude.json` lacking the entry — emo + anca were
+> exactly this: server running, never wired). Instead: `roster_engine.py` allocates
+> a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); `setup-devvm.sh`
+> (§8c/§9e) stages the chrome-service token + installs **system-level template units**
+> (`scripts/workstation/playwright/playwright-mcp@.service` + `…-snapshot-refresh@.{service,timer}`,
+> no systemd --user / linger); `t3-provision-users.sh` `install_playwright()` (ALL
+> tiers incl. admin) seeds the token if-absent, runs `claude mcp add --scope user
+> playwright` AS the user (clobber-proof → fixes existing + new + admin), and
+> `enable --now`s the instances. Replaced the hand-made `~/.config/systemd/user/playwright-*`
+> units (one-time idle-gated migration). Runbook: `../runbooks/chrome-service-snapshot.md`
+> → "Provisioning". **Still TODO in this task:** `ha`, `claude_memory`,
+> `.credentials.json`, and the beads Dolt credential.
+
 **Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_secrets`)

 - [ ] **Step 1:** For each non-admin **without** an existing `~/.claude.json` (NEW users only — NEVER touch an existing one): write `~/.claude.json` with `playwright-shared` (localhost), `ha` (shared `ha_sofia_mcp_url` from Vault `secret/openclaw`) if HA-eligible, and `claude_memory` using a **shared/simple key (per-user memory isolation is DEFERRED — not a risk now)**. Seed `~/.claude/.credentials.json` with the shared Claude token (Vault) **or** leave absent for interactive login. **Drop the beads Dolt credential** into `~/code/.beads/` (`.beads-credential-key`, from Vault, or set `DOLT_REMOTE_PASSWORD`) so `bd` authenticates — it's git-ignored, so a fresh clone lacks it. All `0600`, owned by the user. Per-user `playwright-mcp` systemd unit on its own port (existing pattern, id=4015).
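The if-absent token seeding mentioned in the Task 2.3 note above can be sketched as follows. `seed_if_absent` is a hypothetical helper name, not a function taken from `t3-provision-users.sh`, and the sketch leaves out the ownership change (`-o`/`-g`) that a root-run provisioner would need.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy SRC to DEST with mode 0600 only when DEST is missing.
# An existing DEST is never rewritten, so repeated reconcile runs are harmless.
seed_if_absent() {
  local src=$1 dest=$2
  if [ -e "$dest" ]; then
    return 0            # already seeded: leave the user's copy untouched
  fi
  mkdir -p "$(dirname "$dest")"
  install -m 0600 "$src" "$dest"
}
```

The trade-off of this pattern is that a rotated token never reaches an already-seeded user on its own, so a rotation has to overwrite the per-user copy explicitly.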
--- a/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
+++ b/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
@@ -178,6 +178,16 @@ During the recovery, a second cascade was discovered that compounded the outage:

 **Still the real fix (from this PM, still TODO):** the P0 import-side cap, and especially the **IO-isolation** items — move k8s-master **etcd** + node OS disks off sdc onto SSD (generalize P3), and/or give the Immich library its own spindle (P1). Concurrency caps are a band-aid; sdc remains a single shared failure domain that every storm finds. Tracked in beads (see Follow-up Implementation).

+## Update 2026-06-16 — 6th IO-pressure incident (same `anca-elements-import` Job re-triggered)
+
+**Same direct trigger as 2026-05-25.** The original `kubernetes_job_v1.anca_elements_import` resource block was never removed from `stacks/immich/main.tf` after the 2026-05-25 import completed — despite the in-code comment instructing "After successful completion: REMOVE this resource block + apply again." Every subsequent `terragrunt apply` of the immich stack re-created the Job. On 2026-06-16 ~20:50 UTC it ran again with the original `--concurrent-tasks 20`, scanning all 21,643 Immich assets in pure read-scan mode (`Uploaded 0`) for ~51 min. Result mirrored 2026-06-01: 62 of 64 nfsd threads in D-state on `folio_wait_bit_common`, sdc 80–82% util, **etcd starved → kube-apiserver crash-loop with `start-service-ip-repair-controllers failed: unable to perform initial IP and Port allocation check`**. Cluster unreachable; PVE host load average peaked at 102 on a 44-thread host. The 2026-06-01 server-side job concurrency caps (`thumbnailGeneration=2, metadataExtraction=2, library=2`) held — the storm was on the import side, not the ML side.
+
+**Immediate recovery**: `nfsd` throttled `64 → 8` threads on the PVE host (gave apiserver enough headroom to come back), then `kubectl delete job -n immich anca-elements-import` + force-delete the pod. Storm cleared instantly: sdc 80% → 30% util, all nfsd threads idle, apiserver `/readyz: ok`. nfsd restored to 64.
+
+**Permanent fix (this commit)**: Removed `kubernetes_job_v1.anca_elements_import` AND the `module "nfs_anca_elements_host"` PVC from `stacks/immich/main.tf`. The photo batch is complete; per user, the videos batch is not on the near roadmap, so the PVC + the comment scaffold around it are gone too. The on-disk dump at `/srv/nfs/anca-elements` on the PVE host is **kept** (browseable via Nextcloud's admin-only "PVE NFS Pool" mount); decision on deletion deferred to user. A future import would re-add the PVC + a fresh Job (or, better, a one-shot manual `kubectl create job` invocation that does not live in Terraform — see Lessons below).
+
+**Updated lesson — one-shot Jobs do NOT belong in `kubernetes_job_v1`.** TF treats Jobs as long-lived resources and re-creates them on every apply if state drift is detected. A truly one-shot import either (a) becomes a `kubernetes_cron_job_v1` with `suspend = true` (Viktor can un-suspend → run → re-suspend) or (b) lives outside TF entirely as a `kubectl create job --from=...` ad-hoc invocation captured in `docs/runbooks/`. The "REMOVE this resource block + apply again" comment failed as a control because nobody noticed it for 22 days.
+
 ## Related

 - 2026-05-09 IO post-mortem: `docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md`
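The "one-shot import outside Terraform" lesson in the post-mortem update above could be captured in a runbook helper along these lines. This is a sketch under stated assumptions: it presumes a suspended CronJob named `anca-elements-import` exists in the `immich` namespace to act as the template (the commit itself removes the Job and defines no such CronJob), and `run_import_once` plus the `-manual-<suffix>` naming are hypothetical. `kubectl create job NAME --from=cronjob/NAME` is the standard kubectl form.

```bash
#!/usr/bin/env bash
# Hypothetical runbook helper: launch a single ad-hoc Job from a CronJob template.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command without a cluster.
KUBECTL=${KUBECTL:-kubectl}

# run_import_once NAMESPACE CRONJOB SUFFIX
# Creates Job "<CRONJOB>-manual-<SUFFIX>" from cronjob/<CRONJOB> in NAMESPACE.
run_import_once() {
  local ns=$1 cronjob=$2 suffix=$3
  "$KUBECTL" create job "${cronjob}-manual-${suffix}" \
    --from="cronjob/${cronjob}" -n "$ns"
}
```

A typical call would be `run_import_once immich anca-elements-import "$(date +%Y%m%d)"`. Because nothing here lives in Terraform state, a later `terragrunt apply` has no resource to re-create.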
--- a/docs/runbooks/chrome-service-snapshot.md
+++ b/docs/runbooks/chrome-service-snapshot.md
@@ -11,8 +11,36 @@ external Claude Code sessions on the dev box. Architecture in
 | chrome-service Deployment | `chrome-service` ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data |
 | snapshot-server sidecar | same pod | always-on | serves `/api/snapshot`, bearer-gated, port 8088 |
 | snapshot-harvester CronJob | `chrome-service` ns | `23 * * * *` | dumps `storage_state()` via CDP → `/profile/snapshots/storage-state.json` |
-| dev-box refresh timer | each dev box | hourly | curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` |
-| dev-box `playwright-mcp.service` | each dev box | always-on | `@playwright/mcp --isolated --storage-state=…` per-MCP-connection contexts |
+| dev-box refresh timer | each dev box, per OS user | hourly (`*:28`) | `playwright-snapshot-refresh@<user>.timer` curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` |
+| dev-box `playwright-mcp@<user>.service` | each dev box, per OS user | always-on | pinned `@playwright/mcp@<ver> --isolated --storage-state=…` on the user's `PLAYWRIGHT_PORT`; per-MCP-connection (per-session) contexts |
+
+## Provisioning (reproducible from git)
+
+The dev-box side is **per-OS-user** and fully reproducible — no hand-setup.
+Each user gets their own isolated `@playwright/mcp` server (multiple concurrent
+Claude sessions per user, isolated by `--isolated`), wired into their Claude in
+**every directory** via a user-scope `~/.claude.json` entry
+(`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`).
+
+- **System-level template units** (NOT `systemd --user`, so no linger needed):
+  `playwright-mcp@.service` + `playwright-snapshot-refresh@.{service,timer}`,
+  sourced from `infra/scripts/workstation/playwright/`, installed to
+  `/etc/systemd/system/` by `setup-devvm.sh` (§9e). `User=%i`; per-user
+  `PLAYWRIGHT_PORT` from `/etc/t3-serve/playwright-<user>.env`.
+- **Port allocation**: `roster_engine.py` (`PLAYWRIGHT_BASE_PORT=8931`, sticky)
+  — emitted in the derive JSON, written per-user by `t3-provision-users.sh` (§5c).
+- **Snapshot token**: `setup-devvm.sh` (§8c) stages Vault
+  `secret/chrome-service` `api_bearer_token` → root file
+  `/etc/t3-serve/chrome-service-token`; the provisioner copies it (if-absent,
+  0600) to each user's `~/.config/playwright/token` (the hourly root reconcile
+  has no Vault token, hence the staging — mirrors the Claude OAuth token in §8a).
+- **MCP wiring + enablement**: `t3-provision-users.sh` `install_playwright()` runs
+  `claude mcp add --scope user … playwright` AS the user (clobber-proof, if-absent)
+  and `systemctl enable --now` the system instances. Idempotent; never restarts a
+  running instance or rewrites an existing `~/.claude.json` entry.
+- **Pinned version**: bump `@playwright/mcp@<ver>` in
+  `scripts/workstation/playwright/playwright-mcp@.service` (the `@latest` →
+  silent-fleet-roll footgun is why; see the `T3_PIN` rationale in `setup-devvm.sh`).

 ## Day-to-day

@@ -43,14 +71,14 @@ Expected: `wrote snapshot (… bytes) to /profile/snapshots/storage-state.json`.
 ### Trigger dev-box refresh manually

 ```bash
-# On the dev box, as the user whose Claude Code sessions need the new state:
-systemctl --user start playwright-snapshot-refresh.service
+# On the dev box, refresh a specific user's snapshot (system template instance):
+sudo systemctl start playwright-snapshot-refresh@<user>.service

-# Or directly:
-/usr/local/bin/playwright-snapshot-refresh
+# Or run the script directly AS that user:
+sudo -u <user> /usr/local/bin/playwright-snapshot-refresh

 # Verify
-ls -la ~/.cache/playwright-shared-storage-state.json
+sudo ls -la /home/<user>/.cache/playwright-shared-storage-state.json
 ```

 ### Inspect the current snapshot
@@ -108,12 +136,14 @@ The bearer token in `~/.config/playwright/token` doesn't match the
 server's. Almost always means the Vault secret was rotated and the
 local cache is stale.

-**Fix**:
+**Fix** (re-stage centrally so a rebuild stays correct, then re-copy to the user):
 ```bash
 vault login -method=oidc  # if needed
-vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
-chmod 600 ~/.config/playwright/token
-systemctl --user start playwright-snapshot-refresh.service
+vault kv get -field=api_bearer_token secret/chrome-service \
+  | sudo install -m 0600 /dev/stdin /etc/t3-serve/chrome-service-token
+sudo install -o <user> -g <user> -m 0600 \
+  /etc/t3-serve/chrome-service-token /home/<user>/.config/playwright/token
+sudo systemctl start playwright-snapshot-refresh@<user>.service
 ```

 ### Dev-box `playwright-snapshot-refresh` returns 404 with "snapshot not yet available"
@@ -129,9 +159,9 @@ new context with it. **Existing MCP sessions don't hot-reload** — they
 keep the cookies they were seeded with at session start. New sessions
 get the fresh snapshot.

-**Fix**: restart the MCP server on the dev box to pick up the new file:
+**Fix**: restart the user's MCP server on the dev box to pick up the new file:
 ```bash
-systemctl --user restart playwright-mcp.service
+sudo systemctl restart playwright-mcp@<user>.service
 ```

 ### Snapshot file is suspiciously small or empty cookies array
@@ -158,13 +188,18 @@ vault kv put secret/chrome-service \

 # Reloader auto-restarts chrome-service pod (snapshot-server picks up new token).

-# On EVERY dev box that pulls the snapshot:
-vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token
-chmod 600 ~/.config/playwright/token
+# On EVERY dev box: re-stage the root file, then overwrite each user's copy
+# (the provisioner's per-user copy is if-absent, so a ROTATION must overwrite).
+vault kv get -field=api_bearer_token secret/chrome-service \
+  | sudo install -m 0600 /dev/stdin /etc/t3-serve/chrome-service-token
+for u in $(ls /etc/t3-serve/playwright-*.env 2>/dev/null | sed 's#.*/playwright-##;s#\.env$##'); do
+  sudo install -o "$u" -g "$u" -m 0600 \
+    /etc/t3-serve/chrome-service-token /home/"$u"/.config/playwright/token
+done

-# Verify the next refresh succeeds:
-systemctl --user start playwright-snapshot-refresh.service
-journalctl --user -u playwright-snapshot-refresh.service -n 20
+# Verify the next refresh succeeds for a user:
+sudo systemctl start playwright-snapshot-refresh@<user>.service
+sudo journalctl -u playwright-snapshot-refresh@<user>.service -n 20
 ```

 ## Restore from a backup tarball