multi-user-workstation: design + phased implementation plan
devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent
3d6c5b8bc7
commit
bb7bcf803b
3 changed files with 440 additions and 1 deletions
32
CONTEXT.md
|
|
@ -53,9 +53,39 @@ One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas
|
|||
_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
|
||||
|
||||
**Namespace-owner**:
|
||||
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains.
|
||||
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
|
||||
_Avoid_: bare "user", "tenant".
|
||||
|
||||
### Workstation (multi-user devvm)
|
||||
|
||||
**devvm**:
|
||||
The dev VM (`10.0.10.10`), a non-cluster VM on the **PVE host** that hosts each person's Claude Code coding environment (the `t3-serve@<user>` and terminal-lobby sessions). Not a **Node** (it isn't in the cluster).
|
||||
_Avoid_: calling it a "Node"; "host" (reserved for the PVE host).
|
||||
|
||||
**Workstation**:
|
||||
A person's identity-scoped Claude Code environment on the **devvm** — one OS account, their session runs as that uid. The same human may also be a **Namespace-owner**; the cluster identity and the Workstation are two facets of one person.
|
||||
_Avoid_: "t3 instance" (only one surface of a Workstation); bare "user".
|
||||
|
||||
**RBAC tier**:
|
||||
The role band that governs a person everywhere — `kubernetes-admins` (Viktor; cluster-admin, secrets, apply), `kubernetes-power-users` (infra-aware, broad read, no destructive change), `kubernetes-namespace-owners` (own-namespace app dev). The single axis that keys both cluster RBAC **and** the **Workstation profile**.
|
||||
_Avoid_: inventing per-service roles; conflating with **Namespace tier** / **State tier** (those are not identity).
|
||||
|
||||
**Workstation profile**:
|
||||
The **RBAC tier**-keyed bundle a **Workstation** receives: **Config inheritance** (identical for everyone) plus the person's **Infra visibility** and cluster scope (varies by tier). Never hand-tuned per person — one identity decision (Authentik group + `k8s_users`) provisions the cluster facet and the Workstation together.
|
||||
_Avoid_: per-person bespoke setup (the rejected "stitched-together" status quo).
|
||||
|
||||
**Config inheritance**:
|
||||
The universal half of every **Workstation profile** — Viktor's *static* Claude config (skills, rules, agents, commands, `CLAUDE.md`, hooks) **live-extends** from a **Config base**, it is NOT copied: each person's `~/.claude` draws these from the shared base, so an edit Viktor makes appears in every Workstation immediately, with no seed/copy/sync step. Users may layer their own items on top (rarely do). **RBAC tier**-independent. Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, sessions) is never shared — local only.
|
||||
_Avoid_: a periodic copy/seed/sync of `~/.claude` (rejected — inheritance must be live); sharing `~/.claude.json` / `.credentials.json` (per-user, secret-bearing, corrupts under concurrent writes — see emo's multi-session profile).
|
||||
|
||||
**Config base**:
|
||||
The shared, secret-free, version-controlled source of truth for the *static* Claude config that every **Workstation** live-extends (see **Config inheritance**). Viktor's authoring surface — when he edits a skill/rule, he edits the base; the chezmoi dotfiles repo is its versioned form (commit = audit/rollback, NOT a push to users). Holds only skills/rules/agents/commands/`CLAUDE.md`/hooks — never secrets or per-user mutable state.
|
||||
_Avoid_: treating it as a per-user seed target (it is a live shared source, not a copy); putting secrets in it.
|
||||
|
||||
**Infra visibility**:
|
||||
What a non-admin **Workstation** may SEE of the infra: the public repo **code** and the person's own **RBAC**-scoped view of the live cluster (kubectl / dashboard within their namespaces). Explicitly excludes the **git-crypt** secrets (`terraform.tfvars`, `secrets/`) and any out-of-scope mutation. The boundary that "respect their permissions" enforces — violated today because `~/code` is one git-crypt-*unlocked* tree shared via the `code-shared` group.
|
||||
_Avoid_: reading "see the infra" as access to secrets or apply rights.
|
||||
|
||||
### Networking
|
||||
|
||||
**Public domain**:
|
||||
|
|
|
|||
186
docs/plans/2026-06-07-multi-user-workstation-design.md
Normal file
|
|
@ -0,0 +1,186 @@
|
|||
# Multi-User Workstation — Design
|
||||
|
||||
- **Date:** 2026-06-07
|
||||
- **Status:** designed (grilled extensively); not yet implemented
|
||||
- **Owner:** Viktor (wizard)
|
||||
- **Builds on:** the t3code multi-user setup (`docs/plans/2026-06-01-t3-auto-provision-*`), the `k8s_users` multi-tenancy (`docs/architecture/multi-tenancy.md`), and the cloud-init VM-reproducibility decision (memory id=1575).
|
||||
- **Glossary:** see `infra/CONTEXT.md` → "Workstation (multi-user devvm)" for the canonical terms used here (devvm, Workstation, RBAC tier, Workstation profile, Config inheritance, Config base, Infra visibility).
|
||||
|
||||
## Goal
|
||||
|
||||
Let any onboarded person get a fully-configured Claude Code **Workstation** on the devvm that **inherits Viktor's config live** (his edits propagate with no per-user sync), bounded by **their own permissions** (read infra code + RBAC-scoped cluster view, never secrets), provisioned by **one declarative roster + one idempotent script**, and **reproducible from git** so the VM can be rebuilt from a template.
|
||||
|
||||
## How we got here (so the rationale isn't re-litigated)
|
||||
|
||||
This was stress-tested down several branches before landing:
|
||||
|
||||
1. **Adopt a CDE?** Researched Coder / Gitpod-Ona / Eclipse Che / DevPod / OpenHands (2026-06-07). The category consolidated to "Coder or Che, or build it." Coder is architecturally a great fit but the **role model we need is Premium-gated** (groups + OIDC group→role sync + template ACLs are all paid), its agent UI is mid-transition (Tasks→Agents, Sept 2026), and it still needs custom glue. ~80% of the hard parts are already solved by our stack. → **Build on the existing stack** (ADR-0001).
|
||||
2. **K8s ephemeral pods vs devvm OS users?** Ephemeral pods are maximally declarative but, at ~3-4 trusted users, re-platforming the agent + per-pod persistence is **overkill**; the devvm model already runs and config-push is *easier* on one host. → **devvm Linux users** (ADR-0002).
|
||||
3. **Config inheritance — sync vs live?** A periodic sync/seed was rejected; the requirement is **live inheritance** ("I edit, everyone has it"). Realized via **each subsystem's native machine-wide layer + a per-user layer on top** (ADR-0003) — not OverlayFS (kernel disallows live lowerdir edits), not Nix (rebuild, not live), not bespoke symlink-only (clumsy per-item override).
|
||||
|
||||
## Core model
|
||||
|
||||
A person's **RBAC tier** drives one **Workstation profile**. **Inheritance**: `wizard` authors a **Config base** once; every child user (emo, anca, gheorghe) inherits it **live** through native machine-wide layers and may add their own on top. What differs per tier is **Infra visibility** and cluster scope — never the inherited config. Onboarding is **declarative**: a git roster + an idempotent provisioner.
|
||||
|
||||
## Components
|
||||
|
||||
### 1. Roster — the SINGLE source of truth (in git, full lifecycle)
|
||||
|
||||
A git-committed map keyed by **`os_user`**; it drives the **entire lifecycle** (onboard → reconcile → offboard). It carries the multiple identifiers a person actually has (verified live 2026-06-08 — they differ!):
|
||||
|
||||
```yaml
# infra/scripts/workstation/roster.yaml: THE source of truth
# os_user (key) → authentik_user (login local-part) · k8s_user (k8s_users key) · tier · namespaces
users:
  emo: { authentik_user: emil.barzin, k8s_user: emo, tier: power-user }
    # NET-NEW cluster identity: emo is NOT in k8s_users today
  ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] }
    # ALREADY a namespace-owner: preserve plotting-book; do NOT re-provision
  # gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] }
    # already a cluster namespace-owner; uncomment when he wants a devvm workstation
# wizard (admin) is the base author; not provisioned as a child.
```
|
||||
|
||||
**Single source of truth (SSoT):** the roster is authoritative; everything else is **derived or validated against it** — never hand-maintained in parallel:
|
||||
- `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` are **regenerated** from the roster each reconcile (not appended).
|
||||
- The Authentik **`T3 Users`** group membership is reconciled from the roster (a member ⇔ a roster entry).
|
||||
- The reconcile **validates** `roster.tier` against the live `k8s_users` role and **fails loud on mismatch** (e.g. roster says `power-user` but `k8s_users` says `namespace-owner`) — so the workstation tier and the cluster tier can't silently diverge. `k8s_user`/`namespaces` are reconciled into `k8s_users` (or asserted to match for pre-existing users).
|
||||
|
||||
`os_user` is the pinned key (no email→username derivation — avoids the `ancaelena98`-vs-`ancamilea` trap). Onboard = add an entry + reconcile; **offboard = remove the entry** (see "User lifecycle").
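A minimal sketch of the two "derived or validated" behaviours above, assuming the roster has already been flattened into `os_user authentik_user` pairs (the YAML parsing step is out of scope here). The function names `render_user_map` and `assert_tier_matches`, the `0644` mode, and the idea that roster tiers and `k8s_users` roles share one spelling are all illustrative assumptions, not part of the existing scripts.

```bash
#!/usr/bin/env bash
set -eu

# render_user_map TARGET
# Reads "os_user authentik_user" pairs on stdin (a hypothetical flattened
# form of roster.yaml) and REPLACES TARGET with one
# "<authentik_user>=<os_user>" line per entry. Because the file is rebuilt
# from scratch, an entry removed from the roster also leaves the map.
render_user_map() {
  local target="$1" tmp
  tmp=$(mktemp "${target}.XXXXXX")
  # Ignore blank lines and comments; sort for stable, diff-friendly output.
  awk 'NF >= 2 && $1 !~ /^#/ { print $2 "=" $1 }' | sort > "$tmp"
  chmod 0644 "$tmp"    # assumption: the map must be readable by the dispatcher
  mv "$tmp" "$target"  # rename is atomic within one filesystem
}

# assert_tier_matches OS_USER ROSTER_TIER LIVE_TIER
# Fails loud (message on stderr, non-zero status) when the roster tier and
# the tier observed in k8s_users differ. How LIVE_TIER is looked up is not
# shown; both values are assumed to use the same spelling.
assert_tier_matches() {
  if [ "$2" != "$3" ]; then
    echo "FATAL: $1: roster tier '$2' != k8s_users tier '$3'" >&2
    return 1
  fi
}
```

Rebuilding the file on every run is what makes offboarding work without a separate "remove line" code path: the absence of a roster entry is enough.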
|
||||
|
||||
### 2. Eligibility gate (Authentik group, edge-enforced)
|
||||
|
||||
A `T3 Users` Authentik group gates `t3.viktorbarzin.me` at the edge via a one-branch addition to the existing `stacks/authentik/admin-services-restriction.tf` expression policy (`if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`). Non-members 302→login, never reach the box. Verified earlier: `X-authentik-groups` already reaches the dispatcher (it's in the forward-auth middleware `authResponseHeaders`), so a dispatcher-side second check is possible but the edge gate is the primary.
|
||||
|
||||
### 3. Provisioning (idempotent script + roster)
|
||||
|
||||
Extend the existing root reconcile (`infra/scripts/t3-provision-users.sh`) to read `roster.yaml` and, per entry, converge:
|
||||
- `useradd` the OS account if missing — **constrained** per tier (see §6);
|
||||
- assign per-tier groups;
|
||||
- drop the per-user identity-scoped kubeconfig + Vault helper;
|
||||
- regenerate `/etc/ttyd-user-map` from the roster, one `<authentik_user>=<os_user>` line per entry (rebuilt each reconcile, not appended; see §1);
|
||||
- `systemctl enable --now t3-serve@<os_user>`;
|
||||
- provision a writable git-crypt-locked clone at `~/code` for non-admins **only if absent** (§5; never replaces an existing `~/code`).
|
||||
|
||||
Run via the existing systemd timer (OnBoot + periodic) for self-healing, plus on-demand after a roster edit. Account creation is the one new privileged step; it lives only in this root reconcile.
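The "converge only what is missing" shape of the account step could look roughly like this. It is a sketch under stated assumptions: `RUN` is a hypothetical dry-run hook (commands are printed unless it is set to an empty string), homes are assumed to live under `/home`, and `/bin/zsh` is assumed as the login shell because the shell layer in §4 is zsh-based.

```bash
#!/usr/bin/env bash
set -eu

# RUN is a hypothetical dry-run hook: by default commands are only printed.
# A real root reconcile would export RUN="" so they actually execute.
RUN=${RUN-echo}

# ensure_account OS_USER
# Creates the OS account only when it is missing, so a re-run is a no-op.
ensure_account() {
  local user="$1"
  if id -u "$user" >/dev/null 2>&1; then
    return 0  # account exists: converge nothing, touch nothing
  fi
  # No sudo/docker groups are added here; non-admin tiers get neither.
  $RUN useradd --create-home --shell /bin/zsh "$user"
  $RUN chmod 0700 "/home/$user"   # assumption: homes live under /home
}

# ensure_instance OS_USER
# "enable --now" is safe to repeat on an already-enabled unit.
ensure_instance() {
  $RUN systemctl enable --now "t3-serve@$1"
}
```

Keeping the existence check in front of `useradd` is what lets the same function run from the timer without ever touching an account that is already there.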
|
||||
|
||||
### 4. Config inheritance (native machine-wide layers — ADR-0003)
|
||||
|
||||
`wizard` authors the **Config base** (a git checkout of the dotfiles/config-base repo on the devvm). It materializes into the OS's native machine-wide layers, which every user inherits live:
|
||||
|
||||
**Verified 2026-06-08:** t3 is itself built on `@anthropic-ai/claude-agent-sdk` and opts into `settingSources: [user, project, local]`; the SDK also reads `/etc/claude-code/managed-settings.json` independently. So the managed layer + `~/.claude` reach **both** surfaces — the t3 web UI *and* a terminal `claude`. Two caveats: it's **Claude-specific** (a t3 user who picks Codex/OpenCode won't inherit Claude config), and `rules/` loads via the per-user `user` source (so Task 1.1's "managed-`claudeMd` vs per-user symlink" question stays real).
|
||||
|
||||
| What inherits | Base location (machine-wide) | Native mechanism (live) | Per-user override |
|---|---|---|---|
| Claude skills/prompts/rules/CLAUDE.md/hooks/settings | `/etc/claude-code/managed-settings.json` + managed skills | Claude merges enterprise ⊕ user; auto-reloads | `~/.claude/skills…` (adds; base authoritative on clash) |
| Shell (zsh/aliases/env) | `/etc/zsh/zshrc`, `/etc/profile.d/*.sh`, `/etc/skel` | sourced at login; skel seeds new homes | `~/.zshrc` layers on top |
| Tools/binaries | system-wide `/usr/local` + apt manifest | one host → shared `/usr` | `pip install --user` in `~` |
|
||||
|
||||
`wizard` edits the base → commit → every child inherits on next prompt/login. **No copy, no mirror, no drift** (this replaces today's hand-mirrored per-user setup — the documented emo-drift pain, memory id=3205/4015). Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, history) is never shared — local only. *(Caveat: the managed layer natively covers settings/skills/`claudeMd`; the bespoke `~/.claude/rules/` + `agents/` dirs are delivered via the managed `claudeMd` OR a per-user symlink to the base — pinned in plan Task 1.1. This is also what replaces the old `start-claude.sh: cd /home/wizard/code` hack: config now comes from the managed layer regardless of CWD, so a new user's launcher just `cd ~/code`.)*
|
||||
|
||||
### 5. Infra access (per-user writable locked clone — changes NOT gated)
|
||||
|
||||
Each non-admin gets their **own writable**, git-crypt-**locked** clone of the monorepo at `~/code`:
|
||||
- A **keyless** clone (`filter.git-crypt.smudge=cat`): all code/docs are plaintext; the git-crypt'd secret files (`infra/secrets/`, `infra/terraform.tfvars`) stay `\0GITCRYPT\0` ciphertext blobs. They read the code, never the secrets (the repo is public anyway; only git-crypt'd files are sensitive).
|
||||
- **Writable + ungated:** they edit, commit, and `git push` to Forgejo **freely** — no read-only mount, no PR gate. Safe because **pushing infra master does NOT auto-apply** (infra is applied *manually* via `scripts/tg apply`; memory id=4355). Per-user clones also remove the old shared-tree commit-entanglement hazard.
|
||||
- **The real boundary is apply-time, not the repo:** a non-admin can change code but cannot make it take effect — `scripts/tg apply` needs a write-capable Vault token (`vault login -method=oidc` → vault-admin) + cluster RBAC their tier lacks.
|
||||
- **Trade vs the earlier live mirror:** the infra repo's own `CLAUDE.md`/code now updates via `git pull` (standard dev flow), not instantly. The high-value live inheritance — Viktor's skills/prompts/rules/global `CLAUDE.md` — is **unaffected** (it flows through the machine-wide managed layer in §4, not the repo).
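One way the provisioner could assert that a fresh clone really is locked is to check for the git-crypt file header on a known secret file. A minimal sketch, assuming only the 10-byte `\0GITCRYPT\0` prefix mentioned above; the function name `is_gitcrypt_locked` is hypothetical.

```bash
#!/usr/bin/env bash
set -eu

# is_gitcrypt_locked FILE
# Succeeds when FILE still starts with the 10-byte git-crypt header
# (NUL, "GITCRYPT", NUL), i.e. it is ciphertext in this working tree.
is_gitcrypt_locked() {
  local header
  # Compare as hex, because shell variables cannot hold NUL bytes.
  header=$(head -c 10 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$header" = "00474954435259505400" ]
}
```

A post-clone check such as `is_gitcrypt_locked ~/code/infra/terraform.tfvars || exit 1` would then fail the reconcile if a non-admin ever ended up with a decrypted tree.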
|
||||
|
||||
### 6. Permission model
|
||||
|
||||
| Tier | OS account | sudo / docker | code-shared + git-crypt | infra repo | kubectl (own OIDC, per tier) | Vault (own OIDC) |
|---|---|---|---|---|---|---|
| **admin** (Viktor) | wizard | ✅ / ✅ | ✅ (unlocked) | unlocked R/W tree; can `tg apply` | cluster-admin | vault-admin |
| **power-user** (Emo) | emo | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **cluster-wide read-only, no Secrets** | scoped read |
| **namespace-owner** (Anca) | ancamilea | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **admin in own namespace** (full R/W in-ns) + namespace/node LIST only | own-namespace paths |
|
||||
|
||||
Layers: Authentik group (eligibility) → OS account `0700` home + per-tier groups (no sudo/docker for non-admins; rootless podman if containers needed) → **per-user OIDC kubeconfig + Vault** so each session acts as *its own* identity, never Viktor's. **kubectl is enabled per tier** — the provisioner installs each user's kubeconfig at the scope above (admin = cluster-admin; power-user = cluster-wide read-only, no Secrets; namespace-owner = admin in their own namespace), reusing the existing `k8s_users` / dashboard-SA machinery (memory id=4042). **Changing infra is never gated at the repo; it's gated at apply** — only admin can `scripts/tg apply` (write Vault + cluster RBAC). Per-user creds live in each `0700` home; wizard's `~/.vault-token` (`0600`) is unreadable to others.
|
||||
|
||||
**Cluster-RBAC reality (verified 2026-06-08) — two corrections + identity facts:**
|
||||
- **power-user role:** the existing `oidc-power-user` ClusterRole grants cluster-wide **read+write+Secrets** and is currently *unbound* — NOT the read-only-no-Secrets tier ADR-0005 wants. So power-user needs a **NEW** `oidc-power-user-readonly` ClusterRole (get/list/watch on non-secret resources cluster-wide, NO `secrets`), bound to emo's OIDC email. Do not reuse the existing role.
|
||||
- **kubeconfig is OIDC, not SA-token:** the apiserver carries live `--oidc-*` flags for the `kubernetes` audience and accepts Authentik OIDC; the "apiserver rejects OIDC" note in `dashboard-sa.tf` is dashboard-audience-specific (the multi-issuer `authentication-config` isn't live). Install `kubelogin`, smoke-test the OIDC path first, and fall back to the per-user SA-token (dashboard) pattern only if it fails.
|
||||
- **identity reality:** emo has **no `k8s_users` entry** today → power-user is a NET-NEW grant; anca is already namespace-owner of `plotting-book` and gheorghe (`vabbit81`) of `vabbit81` — preserve, don't re-provision.
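Once the new read-only role is bound, an admin could smoke-test the intended scope with `kubectl auth can-i` and impersonation (`--as`). The sketch below is illustrative only: `check_power_user_scope` and the `KUBECTL` override are hypothetical, and the exact username string the apiserver sees for an OIDC identity (plain email or prefixed) is an assumption to confirm against the live `--oidc-*` flags.

```bash
#!/usr/bin/env bash
set -eu

# KUBECTL can be overridden (for example with a stub) for offline testing.
KUBECTL=${KUBECTL-kubectl}

# check_power_user_scope OIDC_USER
# Succeeds only if the identity can read ordinary resources cluster-wide
# but can neither read Secrets nor delete workloads.
check_power_user_scope() {
  local who="$1"
  [ "$($KUBECTL auth can-i list pods --all-namespaces --as="$who")" = "yes" ] || return 1
  [ "$($KUBECTL auth can-i get secrets --all-namespaces --as="$who")" = "no" ] || return 1
  [ "$($KUBECTL auth can-i delete deployments --all-namespaces --as="$who")" = "no" ] || return 1
}
```

Checking the two negative cases matters more than the positive one here, since the failure this section warns about is accidentally reusing a role that also grants write and Secrets.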
|
||||
|
||||
**Shared-host caveat:** a multi-user host is a softer boundary than pods — it relies on standard Linux hardening. Appropriate because these are trusted people. If a user must ever be *untrusted*, that's the signal to revisit K8s pods. Note: non-admins' Claude/t3 runs `--dangerously-skip-permissions` (autonomous tool execution as their uid) — bounded by the `0700` home + no-sudo/no-docker sandbox, but a conscious accepted trade.
|
||||
|
||||
### 7. Secrets & auth (per-user, injected — never in the Config base)
|
||||
|
||||
The Config base / machine-wide managed layer is **secret-free**. Everything carrying a token/auth is **per-user**, in the user's own `0600` files, and **never machine-wide** — per the Google-Workspace-MCP precedent (id=4553: *"do NOT move a secret-bearing MCP server into machine-wide config"*; one user literally can't read another's `~/.claude.json`).
|
||||
|
||||
| Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) |
|---|---|---|
| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own |
| **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED: not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write; NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. |
| **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file), only if HA-eligible |
| **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret |
| **`context7`** | plugin-provided | non-secret (plugins layer) |
|
||||
|
||||
The root provisioner READS these from Vault and writes them into a **new** user's home — **if-absent, never clobbering** an existing user's working config. Minting a new per-user memory key needs an admin Vault write (`vault login -method=oidc`; the agent token can't write KV — id=4181) → an admin onboarding step. **emo's existing MCP/auth is untouched** (additive-only): `managed-settings.json` carries NO `env` secrets, so his `MEMORY_API_KEY` and his `~/.claude.json` MCP servers keep working exactly as today.
|
||||
|
||||
**beads (`bd`) credential — gap found 2026-06-08:** a per-user infra clone does NOT include the Dolt credential (`.beads-credential-key` is git-ignored), so the provisioner must drop it (or set `DOLT_REMOTE_PASSWORD`) into the user's `~/code/.beads/` — else `bd` resolves the central server (`10.0.20.200:3306`) but fails auth. `bd` does **not** depend on `code-shared` (it's server-mode against the central Dolt), so the emo cutover doesn't break `bd` *if* his credential is provisioned.
|
||||
|
||||
## Capacity & prerequisites
|
||||
|
||||
**The devvm is the binding constraint — address before onboarding active users.** Verified 2026-06-08: devvm has **24 GB RAM** (the `proxmox-inventory.md` "8 GB" is STALE → fix that doc), ~8 GB free, **0 swap**; wizard alone already runs ~20 sessions (~10 GB RSS). Each interactive Claude session is ~300–700 MB; each user adds one persistent `t3-serve` daemon (~430 MB). 3–5 active users × several sessions would exhaust RAM → with **0 swap the failure mode is OOM-kill of live sessions** (everyone's), not graceful slowdown — also a `~/.claude.json` corruption trigger (id=2320/2321: multi-session writes + disk pressure).
|
||||
|
||||
**Prerequisites (do FIRST):** (1) **add swap** to the devvm (OOM-kill → graceful pressure); (2) optionally bump RAM (PVE-side — devvm is NOT TF-managed, id=1575); (3) set a per-user RAM budget + a **max-concurrent-active-users** ceiling; (4) memory/disk-pressure monitoring on the devvm. CPU (16 cores, ~7%) and disk (`/` ~28 GB free) are fine for now.
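The capacity argument is simple arithmetic, which a per-user budget check could encode directly. A rough sketch using only the figures quoted above (about 430 MB per `t3-serve` daemon, 300 to 700 MB per session, about 8 GB free); the function names and the chosen scenarios are illustrative assumptions.

```bash
#!/usr/bin/env bash
set -eu

T3_SERVE_MB=430   # per-user persistent daemon, from the figures above
FREE_MB=8192      # roughly 8 GB free on the devvm

# estimate_rss_mb USERS SESSIONS_PER_USER MB_PER_SESSION
# Each user costs one t3-serve daemon plus their interactive sessions.
estimate_rss_mb() {
  echo $(( $1 * (T3_SERVE_MB + $2 * $3) ))
}

# fits_in_mb NEEDED_MB AVAILABLE_MB: succeeds when the load fits.
fits_in_mb() {
  [ "$1" -le "$2" ]
}

estimate_rss_mb 4 4 500   # 4 users x 4 sessions at 500 MB: prints 9720
estimate_rss_mb 3 2 300   # 3 users x 2 light sessions:      prints 3090
```

Four moderately active users already exceed the free 8192 MB (9720 MB), while three light users fit comfortably (3090 MB), which is why the ceiling on concurrent active users is a prerequisite and not a follow-up.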
|
||||
|
||||
## User lifecycle (onboard → reconcile → offboard) — the roster drives all of it
|
||||
|
||||
The roster is the SSoT for the **whole** lifecycle, not just creation:
|
||||
|
||||
- **Onboard:** add a roster entry (the reconcile also adds the person to the `T3 Users` Authentik group). The reconcile creates the constrained account, sets up config inheritance (live, via the machine-wide layers of §4; not a copy), provisions the per-user OIDC kubeconfig + locked clone + MCP/auth (+ the `bd` Dolt credential), and starts `t3-serve@<u>`.
|
||||
- **Reconcile (routine, additive-only):** converges *missing* state UP; never strips an existing user (the don't-break-emo guarantee). Safe to run anytime.
|
||||
- **Offboard (REMOVE the roster entry):** the destructive half — gated + staged, NOT the routine timer:
|
||||
1. **Reversible cut (on roster removal):** stop+disable `t3-serve@<u>`; drop the user from `/etc/ttyd-user-map` + `dispatch.json` (regenerated → 403 at the dispatcher); remove from the `T3 Users` Authentik group (edge-blocked); `passwd -l <u>`. Access fully cut; nothing deleted.
|
||||
2. **Cluster revoke:** remove their `k8s_users` entry + apply (drops RBAC binding + kubeconfig validity) + revoke shared-token / memory creds.
|
||||
3. **Destructive (explicit, separate, never auto):** archive `~<u>` (tar → backup), then `userdel -r`. Irreversible — requires explicit go-ahead.
|
||||
- Write `docs/runbooks/offboard-user.md` (the link in `multi-tenancy.md` currently dead-ends). Rollback of step 1/2 = re-add the roster entry + reconcile.
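The staging above (reversible cut first, irreversible delete only on explicit confirmation) could be sketched as two separate functions. Everything named here is hypothetical: `RUN` is a dry-run hook that prints commands by default, `CONFIRM_DESTROY` is an illustrative guard variable, and homes are assumed to live under `/home`. The map regeneration and Authentik group removal from step 1 are not shown.

```bash
#!/usr/bin/env bash
set -eu

# RUN is a hypothetical dry-run hook: commands are printed unless RUN="".
RUN=${RUN-echo}

# offboard_reversible_cut OS_USER
# Step 1 (host side): stop the instance and lock the password.
# Nothing is deleted, so re-adding the roster entry can undo it.
offboard_reversible_cut() {
  local user="$1"
  $RUN systemctl disable --now "t3-serve@$user"
  $RUN passwd -l "$user"
}

# offboard_destroy OS_USER ARCHIVE_DIR
# Step 3: refuses to run unless CONFIRM_DESTROY names the same user, and
# deletes the account only if the archive step succeeded.
offboard_destroy() {
  local user="$1" archive_dir="$2"
  if [ "${CONFIRM_DESTROY-}" != "$user" ]; then
    echo "refusing: set CONFIRM_DESTROY=$user to proceed" >&2
    return 1
  fi
  $RUN tar -czf "$archive_dir/$user-home.tar.gz" -C /home "$user" &&
    $RUN userdel -r "$user"
}
```

Keeping the destructive step in its own function, behind a guard the timer never sets, is one way to make "never auto" a property of the code and not just of the runbook.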
|
||||
|
||||
## Incrementality & migration (don't break emo)
|
||||
|
||||
emo has a **working** setup that must not break: his `t3-serve@emo` (port 3774) + ~4 concurrent live Claude sessions (id=2320); his own `~/.claude` + `~/.claude.json` (MCP servers incl. `ha` token-in-URL and his `MEMORY_API_KEY`); his `~/code` symlink into wizard's tree; `code-shared` + `docker` membership; tmux/playwright units. Hard guarantees:
|
||||
|
||||
- **The idempotent reconcile is ADDITIVE-ONLY.** It creates *missing* accounts/config/instances and *adds* a user's tier-appropriate access, but it **never removes** an existing user's groups, **never replaces** an existing `~/code` (skip-if-exists), and **never writes into** an existing `~/.claude` / `~/.claude.json`. Running `provision-users.sh` at any time is therefore a no-op on emo's existing state — safe to run repeatedly.
|
||||
- **Every destructive/tightening step is SEPARATE, explicit, idle-gated, and reversible** — never part of the routine reconcile.
|
||||
- **Phases 0–4 are additive and verified non-breaking.** After each, confirm emo's live sessions, his `~/.claude`/MCP, his `~/code`, and his groups are unchanged.
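The skip-if-exists guarantee has one subtle edge: emo's `~/code` is a symlink, and a plain `-e` test reports a dangling symlink as absent. A minimal sketch of an "if-absent" writer that treats symlinks as present; `path_present` and `write_if_absent` are hypothetical helper names.

```bash
#!/usr/bin/env bash
set -eu

# path_present PATH
# True for files, directories, AND symlinks, including dangling ones
# (which "-e" alone would report as missing).
path_present() {
  [ -e "$1" ] || [ -L "$1" ]
}

# write_if_absent DEST
# Copies stdin to DEST only when nothing exists there yet.
write_if_absent() {
  local dest="$1"
  if path_present "$dest"; then
    cat > /dev/null   # drain stdin; the existing file is left untouched
    return 0
  fi
  ( umask 077; cat > "$dest" )   # new per-user files are created 0600
}
```

The same `path_present` guard would sit in front of the clone step, so an existing `~/code` of any kind is never replaced by the routine reconcile.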
|
||||
|
||||
Rollout order:
|
||||
1. **Config base + machine-wide managed layer** → wizard + emo *inherit* wizard's skills/prompts. Additive: the managed layer only ADDS; it must not set keys/hooks that override emo's working `~/.claude` / `MEMORY_API_KEY` / MCP servers. **Verify emo's existing sessions + MCP still work.**
|
||||
2. **Roster + provisioner** alongside the current `/etc/ttyd-user-map` (idempotent; ancamilea already provisioned; emo's instance untouched).
|
||||
3. **Per-user writable locked clones** provisioned **only for users without an existing `~/code`** — emo's symlink is left intact (skip-if-exists).
|
||||
4. **Per-tier kubeconfig** installed **only if absent** (existing `~/.kube/config` backed up, never clobbered) — emo's current kube access untouched.
|
||||
5. **emo cutover — the ONLY step that changes emo; opt-in + reversible, never auto-run:** (a) record rollback state (`readlink ~emo/code`, `id emo`, copy of `start-claude.sh`); (b) idle-gate (id=3201); (c) replace his `~/code` symlink with his own writable locked clone, **point his `start-claude.sh` at `cd ~/code`** (today it hardcodes `cd /home/wizard/code` — *that* is the actual reason his Claude lands in wizard's unlocked tree, so swapping the symlink alone is NOT enough), drop the now-redundant `~/.claude/{rules,skills/file-issue}` symlinks into wizard's home (the managed layer / shared base delivers them now), and `gpasswd -d emo code-shared`. He keeps full edit/commit/push (ungated); loses only secret-read + apply. **Rollback (seconds):** restore the symlink + `start-claude.sh` + the `~/.claude` symlinks + `gpasswd -a emo code-shared`. A `t3-serve@emo` restart only blips his WebSocket (id=3308). Requires explicit go-ahead.
|
||||
6. **Authentik `T3 Users` group + edge gate** last (once instances exist), so no one is locked out mid-migration.
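Step 5(a), recording rollback state before the emo cutover, could be captured by a small helper like the one below. This is a sketch, not the planned implementation: the function name, the output file names, and passing the launcher path as an argument (its real location is not stated here) are all assumptions.

```bash
#!/usr/bin/env bash
set -eu

# record_rollback_state OS_USER HOME_DIR LAUNCHER OUT_DIR
# Saves what is needed to undo the cutover: the ~/code symlink target,
# the user's group memberships, and a copy of the launcher script.
record_rollback_state() {
  local user="$1" home="$2" launcher="$3" out="$4"
  mkdir -p "$out"
  # Current symlink target of ~/code; "-" if it is not a symlink.
  readlink "$home/code" > "$out/code.readlink" || echo "-" > "$out/code.readlink"
  # Group memberships before any gpasswd change.
  id "$user" > "$out/id.txt"
  # Pristine copy of the launcher, if present.
  if [ -f "$launcher" ]; then
    cp -p "$launcher" "$out/start-claude.sh"
  fi
}
```

With that state on disk, the rollback described in step 5 reduces to re-creating the symlink from `code.readlink`, copying the launcher back, and re-adding the group.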
|
||||
|
||||
New users (gheorghe; and ancamilea's enhancement) are born into the new model — no migration needed.
|
||||
|
||||
## Template-readiness ("VM as a template" — future)
|
||||
|
||||
Design principle: **every bit of devvm setup is an idempotent script versioned in git**; nothing lives only as hand-typed host state. Everything sits in `infra/scripts/workstation/`: two scripts, `setup-devvm.sh` (package manifest + managed config + config-base clone) and `provision-users.sh` (roster loop), plus the roster + manifest data files. When the template is wanted: the devvm becomes a cloud-init Proxmox template (the estate's existing reproducibility pattern, id=1575) that clones the infra repo + runs both scripts → identical devvm. Per-user **home data** is the only non-template state → add `/home` to the 3-2-1 backup set, or users re-clone + re-pair on a fresh box.
|
||||
|
||||
## Key decisions (ADR candidates)
|
||||
|
||||
- **ADR-0001 — Build on the existing stack, not a CDE.** Coder/Che/etc. researched; the role model is Premium-gated or the platform lacks the agent layer, and the homelab scale doesn't justify it. Hard to reverse, surprising ("why not Coder?"), real trade-off.
|
||||
- **ADR-0002 — devvm Linux users, not K8s ephemeral pods.** Re-platforming is overkill at this scale; config-push is easier on one host.
|
||||
- **ADR-0003 — Config inheritance via native machine-wide layers + per-user override.** Rejected: periodic sync, OverlayFS (no live lowerdir edits), Nix (rebuild not live).
|
||||
- **ADR-0004 — Infra access via per-user writable git-crypt-locked clones (changes ungated).** Each non-admin gets their own writable, keyless (locked) clone — read + edit + push freely, no PR gate. Safe because infra apply is manual + admin-only (push ≠ apply, id=4355) and the clone can't decrypt secrets. Rejected: the shared read-only mirror (gated changes) and the shared unlocked tree (secret leak + commit entanglement). Trade: repo-local CLAUDE.md updates via pull, not live (global config inheritance stays live via §4).
|
||||
- **ADR-0005 — Power-user = cluster-wide read-only (no Secrets), via a NEW dedicated ClusterRole.** Re-widens cross-tenant READ for the trusted power-user tier only — but via a NEW `oidc-power-user-readonly` ClusterRole (get/list/watch, NO `secrets`), NOT the existing `oidc-power-user` (which grants read+write+Secrets and is unbound). Bound to the user's OIDC identity (kubelogin) — the apiserver accepts Authentik OIDC for the `kubernetes` audience; the dashboard's SA-token pattern is for the dashboard UI only.
|
||||
- **ADR-0006 — The roster is the single source of truth for the FULL lifecycle.** `roster.yaml` drives onboard *and* offboard; `/etc/ttyd-user-map`, `dispatch.json`, and Authentik `T3 Users` membership are *derived* from it, and tier is *validated* against `k8s_users` (fail-loud on mismatch). Rejected: hand-maintaining the four membership lists in parallel (guaranteed drift). Offboarding is first-class + staged (reversible cut → cluster revoke → gated `userdel`), not an afterthought.
|
||||
- **ADR-0007 — Add swap + a capacity budget to the devvm before onboarding active users.** A shared 24 GB / **0-swap** host OOM-kills live sessions under multi-user load (wizard alone runs ~20). Swap + a max-concurrent ceiling are prerequisites, not follow-ups.
|
||||
|
||||
## Out of scope / deferred
|
||||
|
||||
- Zero-touch auto-provision on first Authentik login (admin runs the provisioner / the timer converges — simpler at this scale).
|
||||
- K8s per-user pods (revisit only if a user must be untrusted, or scale grows large).
|
||||
- The actual cloud-init template conversion (design for it now; do it when wanted).
|
||||
- **Per-user memory isolation** (own namespace / service-side `_key_to_user` map + redeploy) — **deferred; not a risk now** (Viktor, 2026-06-08). Revisit if memory cross-read becomes a concern.
|
||||
|
||||
## Verification (acceptance)
|
||||
|
||||
- A new roster entry + `provision-users.sh` → the user can log into `t3.viktorbarzin.me` and lands in a configured Workstation with Viktor's skills/prompts.
|
||||
- wizard edits a skill/CLAUDE.md in the base → a child's next prompt sees it (no pull).
|
||||
- A child's `kubectl`/`vault` is bounded by their tier (kubectl enabled per tier: power-user = cluster-wide read-only, no Secrets; namespace-owner = admin in own namespace plus namespace/node LIST only); a non-admin can neither read git-crypt secrets nor escalate.
|
||||
- A non-admin can edit + commit + push their infra clone **freely**, but cannot `scripts/tg apply` (no write Vault / cluster RBAC) — changes don't take effect until an admin applies.
|
||||
- Re-running the provisioner is idempotent (no changes on a converged host).
|
||||
- `provision-users.sh` + `setup-devvm.sh` reproduce the setup on a fresh host from git.
|
||||
223
docs/plans/2026-06-07-multi-user-workstation-plan.md
Normal file
|
|
@ -0,0 +1,223 @@
|
|||
# Multi-User Workstation — Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement task-by-task. Steps use `- [ ]` for tracking. This is **infra** work — "verify" means an idempotent re-run + a smoke check with expected output (not pytest). Honor the Terraform-only rule for cluster changes; devvm host scripts are the accepted exception (versioned in `infra/scripts/`, deployed via the provisioner). Claim `host:devvm` before mutating the devvm; gate `t3-serve@<user>` restarts on user idle (memory id=3201). **INCREMENTALITY (don't break emo):** every phase is additive; the idempotent reconcile is **additive-only** — it NEVER removes an existing user's groups, NEVER replaces an existing `~/code` (skip-if-exists), and NEVER writes into an existing `~/.claude`/`~/.claude.json`. The emo cutover (Phase 5) is the ONLY destructive step — explicit, idle-gated, reversible, never auto-run. After each of Phases 1–4, **verify emo's live sessions, `~/.claude`/MCP, `~/code`, and groups are unchanged.**
**Goal:** A declarative roster + idempotent scripts that provision per-user Claude Code Workstations on the devvm, inheriting Viktor's config live via native machine-wide layers, scoped by RBAC tier, reproducible from git.
**Architecture:** Config base (machine-wide managed Claude config + system shell files + apt manifest) authored by wizard → all users inherit live. `roster.yaml` + `t3-provision-users.sh` create constrained OS accounts + per-user OIDC kubeconfig (per tier) + per-user writable git-crypt-locked infra clone + `t3-serve@<u>`. Authentik `T3 Users` group gates the edge.
**Tech Stack:** Bash (idempotent host scripts), systemd template units + timer, Claude Code managed-settings, git-crypt, Authentik expression policy (Terraform), the existing `k8s_users` per-user Vault/RBAC.
**Design:** `infra/docs/plans/2026-06-07-multi-user-workstation-design.md`. **Glossary:** `infra/CONTEXT.md`.
---
## File structure
- Create: `infra/scripts/workstation/roster.yaml` — the source-of-truth roster
- Create: `infra/scripts/workstation/packages.txt` — declared host apt/global toolset
- Create: `infra/scripts/workstation/setup-devvm.sh` — host base: packages + managed Claude config + config-base clone (idempotent)
- Create: `infra/scripts/workstation/managed-settings.json` — the machine-wide Claude base (settings + `claudeMd`)
- Modify: `infra/scripts/t3-provision-users.sh` — read `roster.yaml`; create constrained accounts; per-tier groups + kubeconfig; repoint `~/code`
- Modify: `infra/scripts/t3-provision-users.sh` — also provision each non-admin's own writable git-crypt-locked clone at `~/code` (no separate mirror service)
- Modify: `infra/stacks/authentik/admin-services-restriction.tf` — add the `t3.viktorbarzin.me` → `T3 Users` branch
- Create: `infra/stacks/authentik/` group resource (or document the UI-created group) for `T3 Users`
- Docs: update `infra/docs/architecture/multi-tenancy.md` (add the Workstation section) + `.claude/reference/service-catalog.md` (t3code row) in the same commits
---
## Phase −1 — Prerequisites (do FIRST)
### Task −1.1: devvm capacity (P0 — verified 2026-06-08: 24 GB RAM, 0 swap, wizard ~20 sessions)
- [ ] **Step 1:** Add **swap** to the devvm (swapfile, e.g. 8–16 GB) — turns multi-user OOM-kill into graceful pressure. Verify `free -h` shows `Swap` > 0.
- [ ] **Step 2:** Document a per-user RAM budget + a **max-concurrent-active-users** ceiling; add memory/disk-pressure monitoring on the devvm. (Optionally bump RAM PVE-side — devvm is NOT TF-managed, id=1575.)
- [ ] **Step 3:** Fix the stale `infra/.claude/reference/proxmox-inventory.md` devvm RAM (says 8 GB; live = 24 GB). Commit `[ci skip]`.
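
Step 1 of this task could be scripted roughly as follows. This is a minimal sketch, not the plan's actual script: the `/swapfile` path, the `8G` default, and the function names `swap_total_kib` and `ensure_swapfile` are assumptions made for illustration. On filesystems where a `fallocate`-backed swapfile is not supported, writing the file with `dd` is the usual alternative.

```shell
#!/bin/sh
# Sketch only: idempotent swapfile setup for the devvm (run as root).
# The path, size and function names are assumptions, not part of the plan.

# Print SwapTotal in KiB from a meminfo-style file (default: /proc/meminfo).
swap_total_kib() {
  awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Create and enable a swapfile only if the host currently has no swap.
ensure_swapfile() {
  local swapfile="${1:-/swapfile}" size="${2:-8G}"
  if [ "$(swap_total_kib)" -gt 0 ]; then
    echo "swap already present; nothing to do"
    return 0
  fi
  fallocate -l "$size" "$swapfile"
  chmod 600 "$swapfile"
  mkswap "$swapfile"
  swapon "$swapfile"
  # Persist across reboots, appending the fstab line only once.
  grep -qF "$swapfile none swap" /etc/fstab ||
    echo "$swapfile none swap sw 0 0" >> /etc/fstab
}
```

Calling `ensure_swapfile` a second time returns early, which matches the "idempotent re-run" meaning of "verify" used throughout this plan.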
### Task −1.2: tooling
- [ ] **Step 1:** Install `kubelogin` (`kubectl-oidc_login`) on the devvm and add it to `packages.txt` — the per-user OIDC kubeconfig (Task 2.2) needs it; it is NOT installed today.
---
## Phase 0 — Roster + config base in git (no host changes)
### Task 0.1: Create the roster
**Files:** Create `infra/scripts/workstation/roster.yaml`
- [ ] **Step 1:** Write the roster with the current children (two active entries plus one commented-out candidate; wizard is the base author, not listed):
```yaml
# THE single source of truth for the devvm Workstation lifecycle (onboard → offboard).
# os_user (key) → authentik_user · k8s_user · tier · namespaces. Identifiers differ per person (verified 2026-06-08).
users:
  emo:       { authentik_user: emil.barzin, k8s_user: emo,  tier: power-user }                                      # NET-NEW cluster identity (not in k8s_users today)
  ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] }    # ALREADY provisioned: preserve, don't re-create
  # gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] }       # already a cluster ns-owner; uncomment for a devvm workstation
```
(`os_user` is the pinned key; there is no email→username derivation. Note the three separate identifier fields per person: `os_user`, `authentik_user`, and `k8s_user`.)
- [ ] **Step 2: Verify** it parses (needs PyYAML): `python3 -c "import yaml; d = yaml.safe_load(open('infra/scripts/workstation/roster.yaml')); assert d['users']['emo']['tier'] == 'power-user'; print(d)"` → Expected: exit 0 and a printed dict with `users.emo.tier == power-user`.
- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/roster.yaml && git commit -m "workstation: add roster source-of-truth [ci skip]"`
### Task 0.2: Declare the host toolset
**Files:** Create `infra/scripts/workstation/packages.txt`
- [ ] **Step 1:** List the shared tools (one per line, comments allowed): `git`, `zsh`, `tmux`, `ripgrep`, `jq`, `python3`, `nodejs`, `kubectl`, `vault`, `podman` (rootless). Claude Code is installed via npm global in `setup-devvm.sh` (Task 1.2), not apt.
- [ ] **Step 2: Verify:** `grep -vE '^\s*(#|$)' infra/scripts/workstation/packages.txt` lists the expected packages.
- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/packages.txt && git commit -m "workstation: declare host package manifest [ci skip]"`
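
The manifest format above (one package per line, comments allowed) can be consumed with a couple of small helpers. This is a sketch under assumptions: `list_packages` and `missing_packages` are hypothetical names, and the installed-check relies on `dpkg-query`, so it only applies to packages that come from apt.

```shell
#!/bin/sh
# Sketch: read the manifest and report which declared packages are missing.
# Function names are illustrative, not taken from the plan.

# Strip blank lines and full-line comments from a manifest file.
list_packages() {
  grep -vE '^[[:space:]]*(#|$)' "$1" || true
}

# Print declared packages that dpkg does not report as fully installed.
missing_packages() {
  list_packages "$1" | while IFS= read -r pkg; do
    dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null |
      grep -q 'install ok installed' || echo "$pkg"
  done
}
```

An empty `missing_packages` result is one cheap way to confirm a converged host before deciding whether `apt-get install` needs to run at all.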
### Task 0.3: Build the Config base (secret-free, curated — it doesn't exist yet)
**Files:** chezmoi dotfiles repo (`github.com/ViktorBarzin/dot_files`, `dot_claude/`) + `infra/scripts/workstation/managed-settings.json`
- [ ] **Step 1:** Create/refresh the **Config base** = the secret-free curated set the managed layer + `/etc/skel` deploy from: skills/agents/rules/commands/hooks/`CLAUDE.md` + shell (`zshrc`/`profile.d`) + the `start-claude.sh` launcher (`cd "$HOME/code"`). Sanitize OUT all secrets (`.credentials.json`, `~/.claude.json`, `settings.json` `env`); resolve any `~/.agents/skills` symlinks to real files.
- [ ] **Step 2:** Reconcile launcher ownership: the current `start-claude.sh` is deployed by the SEPARATE `viktor/terminal-lobby` repo (its own `deploy.sh`). Decide whether the workstation base or terminal-lobby owns it — not both (avoid two competing launchers).
- [ ] **Step 3: Verify:** secret-scan the base (`grep -rEi 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password' <config-base>` → only docs/placeholders) and confirm there are no dangling symlinks (`find <config-base> -xtype l` prints nothing).
- [ ] **Step 4: Commit/push** the refreshed dotfiles repo.
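
The Step 3 check can be wrapped in a single pass/fail function, sketched below. Two caveats: `scan_config_base` is a hypothetical name, and this version is stricter than the step as written, because it fails on ANY pattern match. A real version would probably need an allowlist for the docs/placeholder hits the step tolerates.

```shell
#!/bin/sh
# Sketch: fail if the config base contains likely secrets or dangling symlinks.
# scan_config_base is an illustrative name; the pattern list is copied from Step 3.

scan_config_base() {
  local dir="$1" status=0
  # -I skips binary files; matches are printed as file:line:text.
  if grep -rEiIn 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password' "$dir"; then
    status=1
  fi
  # A symlink whose target does not exist fails `test -e`.
  if [ -n "$(find "$dir" -type l ! -exec test -e {} \; -print)" ]; then
    echo "dangling symlink(s) found under $dir" >&2
    status=1
  fi
  return "$status"
}
```

The `find ... ! -exec test -e` form is used instead of GNU-only `-xtype l` so the same check also works where GNU find is not available.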
---
## Phase 1 — Config base + machine-wide inheritance (additive; verify wizard+emo inherit)
### Task 1.1: Pin the exact Claude managed-skills mechanism (discovery spike)
**Why:** the managed `settings.json` + `claudeMd` paths are confirmed (`/etc/claude-code/managed-settings.json`), but the exact **managed skills** deployment path needs confirming on the installed Claude Code version before we rely on it for skill inheritance.
- [ ] **Step 1:** On the devvm, check the installed version: `claude --version`.
- [ ] **Step 2:** Confirm the managed location is read: create a throwaway `/etc/claude-code/managed-settings.json` with a benign `claudeMd` string, start a fresh `claude` session as a NON-wizard test user, and confirm the injected guidance appears. Expected: the `claudeMd` text is present in context.
- [ ] **Step 3:** Determine the managed-skills path (managed-settings `skills`/skill-source key, or a managed skills dir) **AND how the bespoke `~/.claude/rules/*.md` + `agents/` are delivered machine-wide** — the managed layer covers settings/skills/`claudeMd`, NOT an arbitrary `rules/` dir, so rules land either (a) folded into the managed `claudeMd`, or (b) a per-user symlink to the shared Config base (replacing today's live `~/.claude/rules → /home/wizard/.claude/rules` symlink). Record the verified mechanism in the design doc's §4 + a memory.
- [ ] **Step 3b — Plan-B (go/no-go):** if managed *skills* aren't supported on the installed Claude Code version, FALL BACK to per-user symlinks of `~/.claude/{skills,agents,rules}` → the shared Config base. The verified `settingSources:[user,…]` (2026-06-08) means both t3 and `claude` read the per-user `user` layer, so symlinks are a complete fallback. Make this an explicit branch, not a silent assumption.
- [ ] **Step 4: Commit** the design-doc update: `git commit -am "workstation: pin verified managed-skills mechanism [ci skip]"`
### Task 1.2: `setup-devvm.sh` — host base (idempotent)
**Files:** Create `infra/scripts/workstation/setup-devvm.sh`, `infra/scripts/workstation/managed-settings.json`
- [ ] **Step 1:** Write `managed-settings.json` — the machine-wide Claude base: the `claudeMd` org guidance + any enforced hooks/permissions, **no secrets** (per-user memory keys etc. stay per-user).
- [ ] **Step 2:** Write `setup-devvm.sh` (run as root, idempotent):
  - (a) `apt-get install -y $(grep -vE '^\s*(#|$)' packages.txt)`;
  - (b) `npm install -g @anthropic-ai/claude-code` if missing;
  - (c) `install -D -m 0644 managed-settings.json /etc/claude-code/managed-settings.json` (`-D` creates `/etc/claude-code` on a fresh host);
  - (d) materialize managed skills from the config-base checkout per the Task 1.1 mechanism;
  - (e) lay down `/etc/profile.d/00-workstation.sh` + `/etc/zsh/zshrc.d/` base shell config + seed `/etc/skel`, **incl. a `start-claude.sh` that does `cd "$HOME/code"` and a `.tmux.conf` that sets `default-command "$HOME/start-claude.sh"`, so a new account auto-launches Claude in ITS OWN clone (never a hardcoded `/home/wizard/code`)**;
  - (f) clone/refresh the config-base repo to a shared path (this must run before (d), which reads that checkout).
- [ ] **Step 3: Verify (inheritance):** as `emo` (idle-gated if a session is live), `sudo -u emo -i claude` shows wizard's managed `claudeMd` + a base skill in `/skills`, with no per-emo copy. Expected: base skill present.
- [ ] **Step 4: Verify (idempotent):** re-run `setup-devvm.sh`; Expected: exit 0, no changes on second run.
- [ ] **Step 5: Commit:** `git add infra/scripts/workstation/setup-devvm.sh infra/scripts/workstation/managed-settings.json && git commit -m "workstation: host base + machine-wide Claude config inheritance"`
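
The "no changes on second run" expectation in Step 4 is easier to verify if every file write reports whether it changed anything. A minimal sketch follows; `install_if_changed` is a hypothetical helper name, not something the plan defines.

```shell
#!/bin/sh
# Sketch: copy src to dst only when content differs, so a re-run is a no-op.
# install_if_changed is an illustrative name.

install_if_changed() {
  local src="$1" dst="$2" mode="${3:-0644}"
  if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
    echo "unchanged: $dst"
    return 0
  fi
  mkdir -p "$(dirname "$dst")"
  install -m "$mode" "$src" "$dst"
  echo "changed: $dst"
}
```

For example, `install_if_changed managed-settings.json /etc/claude-code/managed-settings.json` would print `changed: ...` on the first run and `unchanged: ...` afterwards, which gives the idempotency check something concrete to grep for.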
---
## Phase 2 — Provisioner (additive; create constrained accounts from roster)
### Task 2.1: Extend `t3-provision-users.sh` to read the roster + create accounts
**Files:** Modify `infra/scripts/t3-provision-users.sh`
- [ ] **Step 1:** Add a roster-read + per-entry loop. For each `os_user`: if the account is **absent**, `useradd -m -s /bin/zsh "$os_user"` + `passwd -l "$os_user"` (SSO/t3 only) + `chmod 700 "/home/$os_user"` (not `~`, which is root's home when the provisioner runs as root). `set_tier_groups` is **ADD-ONLY**: it `gpasswd -a`'s the tier's groups (admin → `sudo,docker,code-shared`; power-user/namespace-owner → none beyond their own) but **NEVER removes** a group from an existing account (so a routine reconcile can't strip emo's current `code-shared`/`docker`; removal is the Phase-5 cutover only). Do **not** `passwd -l` or re-`chmod` an already-existing account.
- [ ] **Step 2 (SSoT — derive, don't append):** **Regenerate** `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` from the roster each run (so a removed roster entry DISAPPEARS — this is what makes offboarding's reversible-cut work), allocate sticky ports, `systemctl enable --now t3-serve@<os_user>`. Reconcile the `T3 Users` Authentik group membership from the roster. **Validate** each entry's `tier` against the live `k8s_users` role and **abort with a clear error on mismatch** (workstation tier and cluster tier must not silently diverge).
- [ ] **Step 3: Verify (idempotent + non-breaking):** run as root; Expected: emo + ancamilea instances `active`, dispatch.json unchanged, **AND** `id emo` still shows `code-shared`+`docker` (NOT stripped), emo's `~/code` symlink intact, his live sessions unaffected.
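- [ ] **Step 4: Verify (constrained account):** for a NEWLY created non-admin (e.g. a throwaway roster entry; emo keeps his existing groups until Phase 5, per Step 3), `id <user>` shows no `sudo`/`docker`/`code-shared`, and `sudo -u <user> sudo -n true` fails (no sudo).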
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: roster-driven account creation + per-tier groups"`
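
The add-only rule from Step 1 can be kept honest by separating the decision (which groups are missing) from the mutation (`gpasswd -a`). The sketch below assumes that split; `set_tier_groups` is the name used in the plan, while `groups_to_add` and `tier_groups` are hypothetical helpers introduced here.

```shell
#!/bin/sh
# Sketch: add-only group reconcile. Nothing in here ever removes a group.

# Print desired groups the user is not yet in, one per line.
# $1 = current groups (space-separated), $2 = desired groups (space-separated).
groups_to_add() {
  local current="$1" desired="$2" g
  for g in $desired; do
    case " $current " in
      *" $g "*) ;;          # already a member: nothing to do
      *) echo "$g" ;;
    esac
  done
}

# Tier -> groups, per Step 1 (admin gets sudo, docker, code-shared; others none).
tier_groups() {
  case "$1" in
    admin) echo "sudo docker code-shared" ;;
    power-user|namespace-owner) echo "" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Apply: add missing groups only (requires root).
set_tier_groups() {
  local user="$1" tier="$2" g
  for g in $(groups_to_add "$(id -nG "$user")" "$(tier_groups "$tier")"); do
    gpasswd -a "$user" "$g"
  done
}
```

Because `groups_to_add` only ever emits additions, running it for emo with the `power-user` tier yields nothing, so his current `code-shared`/`docker` membership is left alone until the explicit Phase 5 cutover.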
### Task 2.2: Per-user identity-scoped kubeconfig + Vault helper
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_identity`)
- [ ] **Step 1:** For each non-admin, write `~$os_user/.kube/config` as a **per-user OIDC kubeconfig** (`kubelogin`/`oidc-login`) bound to THEIR email — the apiserver accepts Authentik OIDC for the `kubernetes` audience (verified 2026-06-08; the dashboard SA-token pattern is for the dashboard UI, NOT kubectl). Tier → a ClusterRole bound to their OIDC `User`: namespace-owner → admin in their own namespace via the existing `oidc-ns-owner-*` bindings (for anca that's the EXISTING `plotting-book` — assert, don't re-provision); power-user → a **NEW `oidc-power-user-readonly`** ClusterRole (get/list/watch cluster-wide, **NO `secrets`**), NOT the existing `oidc-power-user` (read+write+Secrets). Owned by the user, `0600`. **Install only if `~/.kube/config` is absent;** else back up to `.bak-<ts>` and skip (never clobber).
- [ ] **Step 2:** Drop a `~/.zshrc.d/vault.sh` that sets `VAULT_ADDR=https://vault.viktorbarzin.me` and documents `vault login -method=oidc` (their own identity). Do NOT seed wizard's token.
- [ ] **Step 3: Verify (OIDC works, then scoping):** FIRST smoke-test the OIDC path — a non-admin `kubectl` via kubelogin actually authenticates (it's currently unexercised by any human; if it fails like the dashboard audience did, fall back to a per-user SA-token kubeconfig). THEN: as emo, `kubectl get pods -A` works (read) but `kubectl get secret -A` is forbidden and `kubectl delete` anything is forbidden; as ancamilea, only `plotting-book` is visible.
- [ ] **Step 4: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user identity-scoped kubeconfig + vault helper"`
*(Prereq: add a **NEW `oidc-power-user-readonly`** ClusterRole + email binding to `stacks/rbac` via `scripts/tg apply` — do NOT reuse the existing `oidc-power-user` (read+write+Secrets, currently unbound). emo also needs a NEW `k8s_users` entry as `power-user` (net-new); anca/gheorghe already exist — assert, don't re-create. Terraform-managed, separate commit.)*
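
The install-if-absent behaviour from Step 1 could look roughly like the sketch below. Everything passed in is a placeholder: the API server URL, the OIDC issuer URL, and the client id are assumptions, and a real kubeconfig would also need the cluster CA. The `exec` stanza uses the `kubectl oidc-login get-token` form that `kubelogin` provides; `write_kubeconfig_if_absent` is a hypothetical function name.

```shell
#!/bin/sh
# Sketch: write a per-user OIDC kubeconfig only when none exists.
# Server, issuer and client id are caller-supplied placeholders.

write_kubeconfig_if_absent() {
  local path="$1" server="$2" issuer="$3" client_id="$4"
  if [ -e "$path" ]; then
    echo "exists, skipping: $path"
    return 0
  fi
  mkdir -p "$(dirname "$path")"
  (
    umask 077   # the file is created with mode 0600
    cat > "$path" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: cluster
  cluster:
    server: $server
users:
- name: oidc
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=$issuer
      - --oidc-client-id=$client_id
contexts:
- name: default
  context:
    cluster: cluster
    user: oidc
current-context: default
EOF
  )
  echo "written: $path"
}
```

When the provisioner runs as root it would still need to `chown` the file (and `~/.kube`) to the target user afterwards; that step is left out here.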
### Task 2.3: Inject per-user MCP + auth secrets (new users only; never clobber)
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_secrets`)
- [ ] **Step 1:** For each non-admin **without** an existing `~/.claude.json` (NEW users only — NEVER touch an existing one): write `~/.claude.json` with `playwright-shared` (localhost), `ha` (shared `ha_sofia_mcp_url` from Vault `secret/openclaw`) if HA-eligible, and `claude_memory` using a **shared/simple key (per-user memory isolation is DEFERRED — not a risk now)**. Seed `~/.claude/.credentials.json` with the shared Claude token (Vault) **or** leave absent for interactive login. **Drop the beads Dolt credential** into `~/code/.beads/` (`.beads-credential-key`, from Vault, or set `DOLT_REMOTE_PASSWORD`) so `bd` authenticates — it's git-ignored, so a fresh clone lacks it. All `0600`, owned by the user. Per-user `playwright-mcp` systemd unit on its own port (existing pattern, id=4015).
- [ ] **Step 2 (DEFERRED — not now):** Per-user memory isolation is NOT built (Viktor, 2026-06-08): a new user shares/omits memory for now. When wanted, it needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78) **and** a Vault key — not just a Vault write (id=413/4181).
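- [ ] **Step 3: Verify (new user gets their own auth files):** as the test user, `claude mcp list` shows their servers `Connected`, and `~/.claude.json` + `~/.claude/.credentials.json` are owned by them with mode `0600`. Because per-user memory isolation is deferred (Step 2) and Step 1 seeds a shared/simple key, `memory_recall` is NOT a pass/fail check for a per-user namespace yet; only confirm that `claude_memory` connects.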
- [ ] **Step 4: Verify (emo untouched):** `~emo/.claude.json`, `~emo/.claude/.credentials.json`, `~emo/.claude/settings.json` are **byte-identical** to before the run (`sha256sum` before/after); `claude mcp list` as emo still shows ha/claude_memory/playwright `Connected`.
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user MCP + auth injection (new users only, if-absent)"`
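
The byte-identical check in Step 4 amounts to taking a hash manifest before the provisioner run and verifying it afterwards. A minimal sketch, assuming `sha256sum` is available; `snapshot_hashes` and `verify_unchanged` are hypothetical names.

```shell
#!/bin/sh
# Sketch: before/after hash comparison for files that must not be touched.

# Print "hash  path" lines for every listed file that exists.
snapshot_hashes() {
  local f
  for f in "$@"; do
    if [ -f "$f" ]; then sha256sum "$f"; fi
  done
}

# Exit 0 only if every file in the manifest still has the same hash.
verify_unchanged() {
  sha256sum -c "$1" >/dev/null 2>&1
}
```

Usage would be along the lines of `snapshot_hashes ~emo/.claude.json ~emo/.claude/.credentials.json ~emo/.claude/settings.json > /root/emo-before.sha256` before the run and `verify_unchanged /root/emo-before.sha256` after it (the manifest path is an arbitrary example).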
---
## Phase 3 — Per-user writable locked infra clone (code view; changes ungated)
### Task 3.1: Provision each non-admin's own writable git-crypt-locked `~/code`
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_infra_clone`)
- [ ] **Step 1:** For each non-admin, **only if `~$os_user/code` does not exist at all** (no symlink, no directory; NEVER touch an existing `~/code`, so emo's symlink stays intact), clone the same repo wizard uses, as that user: `REPO=$(git -C /home/wizard/code config --get remote.origin.url); sudo -u "$os_user" git clone "$REPO" "/home/$os_user/code"` (a bare `~/code` here would expand to root's home, because the provisioner runs as root). Then, in the clone, run `git config filter.git-crypt.smudge cat`, `git config filter.git-crypt.clean cat`, `git config filter.git-crypt.required false`, and `git checkout master`. **No git-crypt key is installed** → secret files stay ciphertext, code/docs are plaintext (memory id=3665/3666). Owned by the user, writable.
- [ ] **Step 2:** Leave it writable with a normal `origin` remote (Forgejo) — no read-only mount, no PR gate; they may edit/commit/push freely. (Optional: `git config push.default current` so a bare `git push` targets their own branch.)
- [ ] **Step 3: Verify (locked + writable):** as emo, `head -c 9 ~/code/infra/terraform.tfvars` shows the `GITCRYPT` magic (ciphertext); `cat ~/code/CLAUDE.md` is plaintext; `echo x >> ~/code/README.md && git -C ~/code commit -am wip` **succeeds** (writable, ungated).
- [ ] **Step 4: Verify (apply-gated, not repo-gated):** as emo, `cd ~/code/infra && scripts/tg apply <a-stack>` **fails** (no write Vault token / cluster RBAC); `vault login -method=oidc` as emo cannot obtain vault-admin. Pushing to Forgejo does NOT trigger an apply (id=4355). So his edits can't take effect without an admin apply.
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user writable git-crypt-locked infra clone"`
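
A possible shape for `install_infra_clone`, plus a helper for the Step 3 ciphertext check, is sketched below. Assumptions: homes live under `/home/<user>`, the optional third argument and the helper names `path_taken` and `is_gitcrypt_ciphertext` are hypothetical, and the header test relies on git-crypt files starting with a NUL byte followed by `GITCRYPT`, which is what the `head -c 9` check in Step 3 looks for.

```shell
#!/bin/sh
# Sketch: skip-if-exists clone of the infra repo, left git-crypt-locked.

# True if the path exists in any form, including a dangling symlink.
path_taken() {
  [ -e "$1" ] || [ -L "$1" ]
}

# Clone as the target user unless the destination already exists.
# $3 (destination override) is an assumption added for testing.
install_infra_clone() {
  local os_user="$1" repo="$2" dest="${3:-/home/$1/code}"
  if path_taken "$dest"; then
    echo "skip: $dest exists"
    return 0
  fi
  sudo -u "$os_user" git clone "$repo" "$dest"
  # Pass-through filters: with no git-crypt key, encrypted files stay ciphertext.
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.smudge cat
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.clean cat
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.required false
}

# True if a file starts with the git-crypt header (NUL byte + "GITCRYPT").
is_gitcrypt_ciphertext() {
  [ "$(head -c 9 "$1" | tail -c 8)" = "GITCRYPT" ]
}
```

Checking `-L` as well as `-e` matters for the emo case: a symlink whose target is temporarily unreachable still counts as "exists", so the reconcile never replaces it.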
---
## Phase 4 — Eligibility gate (Authentik group + edge)
### Task 4.1: Create the `T3 Users` group + edge restriction
**Files:** Modify `infra/stacks/authentik/admin-services-restriction.tf`; add the group resource
- [ ] **Step 1:** Add `resource "authentik_group" "t3_users" { name = "T3 Users" }` (pattern: `stacks/authentik/guest.tf:53`). Add emo/ancamilea (and wizard) as members.
- [ ] **Step 2:** In the expression policy, add a dedicated branch BEFORE the final return: `if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`.
- [ ] **Step 3: Apply:** `vault login -method=oidc` then `scripts/tg apply` in `stacks/authentik` (claim `stack:authentik` first).
- [ ] **Step 4: Verify (gate):** `curl -sI` an unauthenticated request to `t3.viktorbarzin.me` → 302 to Authentik; a member login → reaches their instance; a logged-in NON-member → denied. Confirm the `authentik-walloff` probe stays green for any public carve-outs.
- [ ] **Step 5: Commit:** `git add infra/stacks/authentik/*.tf && git commit -m "workstation: gate t3.viktorbarzin.me to T3 Users group"`
---
## Phase 5 — Migrate existing users (idle-gated, low-disruption)
### Task 5.1: Cut emo over to his own writable locked clone (opt-in, reversible)
**Files:** none (host state; an explicit one-time action — NOT the routine reconcile)
- [ ] **Step 1: Prereqs.** Confirm emo inherits config (Phase 1) + has his scoped kubeconfig (Phase 2). (Phase 3 deliberately SKIPPED emo — his clone is created *here*.)
- [ ] **Step 2: Record rollback state.** Save `readlink -f ~emo/code` (symlink target), `id emo` (groups), a copy of `/home/emo/start-claude.sh`, and the `~/.claude/{rules,skills/file-issue}` symlink targets. This is the instant-rollback snapshot.
- [ ] **Step 3: Idle-gate + go-ahead.** Confirm emo's sessions are keystroke-idle ≥20 min (id=3201); if ambiguous, ASK. Opt-in — never auto-run by the reconcile.
- [ ] **Step 4: Cutover.** (a) `mv ~emo/code ~emo/code.symlink.bak`; provision his own writable locked clone at `~emo/code` (Phase-3 `install_infra_clone`, run explicitly for emo). (b) **Repoint his launcher (REQUIRED):** back up `/home/emo/start-claude.sh`, then change its `cd /home/wizard/code` → `cd "$HOME/code"`. The hardcoded `cd` is the *actual* mechanism landing him in wizard's tree — the symlink swap alone is insufficient. (c) Remove the now-redundant `~/.claude/rules` and `~/.claude/skills/file-issue` symlinks into wizard's home (managed layer / shared base delivers them now). (d) `gpasswd -d emo code-shared`.
- [ ] **Step 5: Verify.** As emo: `cat ~/code/CLAUDE.md` works (his clone); `head -c 9 ~/code/infra/terraform.tfvars` shows `GITCRYPT` ciphertext (locked); he can still `git -C ~/code commit` (ungated) but can no longer read wizard's unlocked secrets nor `scripts/tg apply`. emo's live t3 session still works (only a WS blip if `t3-serve@emo` was restarted).
- [ ] **Step 6: Rollback (seconds, if anything's off):** restore the `~emo/code` symlink without discarding his new clone (`mv ~emo/code ~emo/code.clone.bak && ln -sfn <saved-target> ~emo/code`; a plain `rm -rf ~emo/code` would also delete any unpushed commits), restore `start-claude.sh` from its backup, recreate the `~/.claude/{rules,skills/file-issue}` symlinks, and `gpasswd -a emo code-shared` → emo back to his exact prior state. Otherwise record the cutover in a memory.
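
The launcher repoint in Step 4(b) and its undo in Step 6 are simple enough to script as a matched pair. This is a sketch: the `.pre-cutover` backup suffix and the function names `repoint_launcher` and `restore_launcher` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: repoint a launcher away from wizard's tree, with a one-time backup.

# Back up the launcher once, then replace the hardcoded wizard path.
repoint_launcher() {
  local launcher="$1"
  [ -f "$launcher.pre-cutover" ] || cp -p "$launcher" "$launcher.pre-cutover"
  # Single quotes keep $HOME literal, so it is expanded when the launcher runs.
  sed -i.tmp 's|cd /home/wizard/code|cd "$HOME/code"|' "$launcher"
  rm -f "$launcher.tmp"
}

# Undo: put the saved launcher back.
restore_launcher() {
  local launcher="$1"
  cp -p "$launcher.pre-cutover" "$launcher"
}
```

Keeping the first backup and refusing to overwrite it means a repeated cutover attempt cannot replace the true pre-cutover launcher with an already-modified copy.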
### Task 5.2: Confirm ancamilea + a fresh test user end-to-end
- [ ] **Step 1:** Confirm ancamilea logs into `t3.viktorbarzin.me` → her instance, inherits config, own-namespace kubectl only.
- [ ] **Step 2:** Add a throwaway roster entry, run `t3-provision-users.sh`, confirm the account+instance appear and login works; then remove it + `userdel` and confirm clean teardown.
---
## Phase 6 — Template-readiness (design-for-now; convert when wanted)
### Task 6.1: Verify reproducibility from git (no cloud-init yet)
- [ ] **Step 1:** On a scratch VM (or a container), clone the infra repo and run `setup-devvm.sh` + `t3-provision-users.sh`; confirm the toolset + managed config + users reproduce.
- [ ] **Step 2 (promote out of deferred — do in the main rollout):** Add per-user home data to the 3-2-1 backup set NOW: at minimum `~/.t3` (pairings + 30-day sessions) + `~/.claude` (mutable state), ideally all of `/home`. A devvm rebuild otherwise silently loses every user's pairings + session state.
- [ ] **Step 3 (deferred):** When the template is wanted, wrap `setup-devvm.sh` + `t3-provision-users.sh` in cloud-init (the `modules/create-template-vm` pattern, memory id=1575) and snapshot the devvm as a Proxmox template. File a beads task; do not build now.
---
## Phase 7 — Offboarding (deprovision; staged, gated)
Removing a user = delete their `roster.yaml` entry, then:
### Task 7.1: Reversible cut (driven by roster removal)
- [ ] **Step 1:** On reconcile after the entry is gone: `systemctl disable --now t3-serve@<u>`; regenerate `/etc/ttyd-user-map` + `dispatch.json` (user absent → dispatcher 403s); remove them from the `T3 Users` Authentik group (edge-blocked); `passwd -l <u>`. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302→login, then denied) and can't log in. Nothing deleted yet.
- [ ] **Step 2 (cluster revoke):** remove their `k8s_users` entry + `scripts/tg apply` (drops their RBAC binding; OIDC kubeconfig stops authorizing); revoke any individually-held token/memory key.
### Task 7.2: Destructive removal (explicit, separate, NEVER auto)
- [ ] **Step 1:** Archive `~<u>` → backup: `tar czf /mnt/backup/offboard/<u>-<ts>.tar.gz /home/<u>`.
- [ ] **Step 2:** `userdel -r <u>` (removes home + spool). **Irreversible — requires explicit go-ahead.**
- [ ] **Step 3: Rollback:** before Step 2 (`userdel -r`), re-adding the roster entry + a reconcile restores everything; after it, restore the home from the Step 1 archive.
- [ ] **Step 4:** Write + commit `infra/docs/runbooks/offboard-user.md` (the `multi-tenancy.md` link to it is currently a dead end).
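
The two offboarding stages can be sketched as separate functions so the destructive part is never reachable from the routine reconcile. Assumptions: the previous user list is available as a space-separated string (for example, read from the last generated map), and `removed_users`, `reversible_cut`, and `archive_home` are hypothetical names.

```shell
#!/bin/sh
# Sketch: roster-driven offboarding helpers. Only archive_home writes data;
# nothing here deletes a home directory.

# Users present in the previous generated map but absent from the roster.
# $1 = previous users (space-separated), $2 = roster users (space-separated).
removed_users() {
  local old_users="$1" roster_users="$2" u
  for u in $old_users; do
    case " $roster_users " in
      *" $u "*) ;;          # still in the roster: keep
      *) echo "$u" ;;
    esac
  done
}

# Task 7.1 host side for one user (requires root); fully reversible.
reversible_cut() {
  local u="$1"
  systemctl disable --now "t3-serve@$u"
  passwd -l "$u"
}

# Task 7.2 Step 1: archive a home directory before any destructive step.
archive_home() {
  local home_dir="$1" out="$2"
  tar czf "$out" -C "$(dirname "$home_dir")" "$(basename "$home_dir")"
}
```

Using `-C` with the parent directory stores paths relative to the home's name (for example `tmpuser/notes.txt`), which makes a later restore into `/home` straightforward.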
---
## Self-review
- **Spec coverage:** prerequisites/capacity + kubelogin (Ph−1), roster SSoT + config-base build (Ph0), config inheritance (Ph1), provisioning + per-tier OIDC kubectl + SSoT-derive/validate + secrets/auth + beads-cred (Ph2), infra code access via writable locked clone (Ph3), Authentik gate (Ph4), incremental non-breaking migration (Ph5), reproducibility/template + per-user backups (Ph6), **offboarding / full lifecycle (Ph7)** — all mapped. Per-user **memory isolation DEFERRED** (not a risk now).
- **Open verification carried as a task, not a placeholder:** the exact managed-skills path (Task 1.1) is a discovery spike with a concrete acceptance check.
- **Terraform-only respected:** the only cluster changes (Authentik group/policy, the power-user ClusterRole) go through `scripts/tg apply`; devvm host scripts are the accepted exception.
- **Docs:** multi-tenancy.md + service-catalog.md updates folded into the relevant commits (per the update-docs rule).