diff --git a/CONTEXT.md b/CONTEXT.md
index 28fe5e83..c9a9d033 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -53,9 +53,39 @@ One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas
 _Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
 
 **Namespace-owner**:
-A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains.
+A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains. Also drives a **Workstation profile** (an identity has both a cluster facet and a workstation facet).
 _Avoid_: bare "user", "tenant".
 
+### Workstation (multi-user devvm)
+
+**devvm**:
+The dev VM (`10.0.10.10`), a non-cluster VM on the **PVE host** that hosts each person's Claude Code coding environment (the `t3-serve@` and terminal-lobby sessions). Not a **Node** (it isn't in the cluster).
+_Avoid_: calling it a "Node"; "host" (reserved for the PVE host).
+
+**Workstation**:
+A person's identity-scoped Claude Code environment on the **devvm** — one OS account, their session runs as that uid. The same human may also be a **Namespace-owner**; the cluster identity and the Workstation are two facets of one person.
+_Avoid_: "t3 instance" (only one surface of a Workstation); bare "user".
+
+**RBAC tier**:
+The role band that governs a person everywhere — `kubernetes-admins` (Viktor; cluster-admin, secrets, apply), `kubernetes-power-users` (infra-aware, broad read, no destructive change), `kubernetes-namespace-owners` (own-namespace app dev). The single axis that keys both cluster RBAC **and** the **Workstation profile**.
+_Avoid_: inventing per-service roles; conflating with **Namespace tier** / **State tier** (those are not identity).
+
+**Workstation profile**:
+The **RBAC tier**-keyed bundle a **Workstation** receives: **Config inheritance** (identical for everyone) plus the person's **Infra visibility** and cluster scope (varies by tier). Never hand-tuned per person — one identity decision (Authentik group + `k8s_users`) provisions the cluster facet and the Workstation together.
+_Avoid_: per-person bespoke setup (the rejected "stitched-together" status quo).
+
+**Config inheritance**:
+The universal half of every **Workstation profile** — Viktor's *static* Claude config (skills, rules, agents, commands, `CLAUDE.md`, hooks) **live-extends** from a **Config base**, it is NOT copied: each person's `~/.claude` draws these from the shared base, so an edit Viktor makes appears in every Workstation immediately, with no seed/copy/sync step. Users may layer their own items on top (rarely do). **RBAC tier**-independent. Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, sessions) is never shared — local only.
+_Avoid_: a periodic copy/seed/sync of `~/.claude` (rejected — inheritance must be live); sharing `~/.claude.json` / `.credentials.json` (per-user, secret-bearing, corrupts under concurrent writes — see emo's multi-session profile).
+
+**Config base**:
+The shared, secret-free, version-controlled source of truth for the *static* Claude config that every **Workstation** live-extends (see **Config inheritance**). Viktor's authoring surface — when he edits a skill/rule, he edits the base; the chezmoi dotfiles repo is its versioned form (commit = audit/rollback, NOT a push to users). Holds only skills/rules/agents/commands/`CLAUDE.md`/hooks — never secrets or per-user mutable state.
+_Avoid_: treating it as a per-user seed target (it is a live shared source, not a copy); putting secrets in it.
+
+**Infra visibility**:
+What a non-admin **Workstation** may SEE of the infra: the public repo **code** and the person's own **RBAC**-scoped view of the live cluster (kubectl / dashboard within their namespaces). Explicitly excludes the **git-crypt** secrets (`terraform.tfvars`, `secrets/`) and any out-of-scope mutation. The boundary that "respect their permissions" enforces — violated today because `~/code` is one git-crypt-*unlocked* tree shared via the `code-shared` group.
+_Avoid_: reading "see the infra" as access to secrets or apply rights.
+
 ### Networking
 
 **Public domain**:
diff --git a/docs/plans/2026-06-07-multi-user-workstation-design.md b/docs/plans/2026-06-07-multi-user-workstation-design.md
new file mode 100644
index 00000000..f78f93dc
--- /dev/null
+++ b/docs/plans/2026-06-07-multi-user-workstation-design.md
@@ -0,0 +1,186 @@
+# Multi-User Workstation — Design
+
+- **Date:** 2026-06-07
+- **Status:** designed (grilled extensively); not yet implemented
+- **Owner:** Viktor (wizard)
+- **Builds on:** the t3code multi-user setup (`docs/plans/2026-06-01-t3-auto-provision-*`), the `k8s_users` multi-tenancy (`docs/architecture/multi-tenancy.md`), and the cloud-init VM-reproducibility decision (memory id=1575).
+- **Glossary:** see `infra/CONTEXT.md` → "Workstation (multi-user devvm)" for the canonical terms used here (devvm, Workstation, RBAC tier, Workstation profile, Config inheritance, Config base, Infra visibility).
+
+## Goal
+
+Let any onboarded person get a fully-configured Claude Code **Workstation** on the devvm that **inherits Viktor's config live** (his edits propagate with no per-user sync), bounded by **their own permissions** (read infra code + RBAC-scoped cluster view, never secrets), provisioned by **one declarative roster + one idempotent script**, and **reproducible from git** so the VM can be rebuilt from a template.
+
+## How we got here (so the rationale isn't re-litigated)
+
+This was stress-tested down several branches before landing:
+
+1. **Adopt a CDE?** Researched Coder / Gitpod-Ona / Eclipse Che / DevPod / OpenHands (2026-06-07). The category consolidated to "Coder or Che, or build it." Coder is architecturally a great fit but the **role model we need is Premium-gated** (groups + OIDC group→role sync + template ACLs are all paid), its agent UI is mid-transition (Tasks→Agents, Sept 2026), and it still needs custom glue. ~80% of the hard parts are already solved by our stack. → **Build on the existing stack** (ADR-0001).
+2. **K8s ephemeral pods vs devvm OS users?** Ephemeral pods are maximally declarative but, at ~3-4 trusted users, re-platforming the agent + per-pod persistence is **overkill**; the devvm model already runs and config-push is *easier* on one host. → **devvm Linux users** (ADR-0002).
+3. **Config inheritance — sync vs live?** A periodic sync/seed was rejected; the requirement is **live inheritance** ("I edit, everyone has it"). Realized via **each subsystem's native machine-wide layer + a per-user layer on top** (ADR-0003) — not OverlayFS (kernel disallows live lowerdir edits), not Nix (rebuild, not live), not bespoke symlink-only (clumsy per-item override).
+
+## Core model
+
+A person's **RBAC tier** drives one **Workstation profile**. **Inheritance**: `wizard` authors a **Config base** once; every child user (emo, anca, gheorghe) inherits it **live** through native machine-wide layers and may add their own on top. What differs per tier is **Infra visibility** and cluster scope — never the inherited config. Onboarding is **declarative**: a git roster + an idempotent provisioner.
+
+## Components
+
+### 1. Roster — the SINGLE source of truth (in git, full lifecycle)
+
+A git-committed map keyed by **`os_user`**; it drives the **entire lifecycle** (onboard → reconcile → offboard). It carries the multiple identifiers a person actually has (verified live 2026-06-08 — they differ!):
+
+```yaml
+# infra/scripts/workstation/roster.yaml — THE source of truth
+# os_user (key) → authentik_user (login local-part) · k8s_user (k8s_users key) · tier · namespaces
+users:
+  emo: { authentik_user: emil.barzin, k8s_user: emo, tier: power-user }
+    # NET-NEW cluster identity — emo is NOT in k8s_users today
+  ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] }
+    # ALREADY a namespace-owner — preserve plotting-book; do NOT re-provision
+# gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] }
+    # already a cluster namespace-owner; uncomment when he wants a devvm workstation
+# wizard (admin) is the base author; not provisioned as a child.
+```
+
+**Single source of truth (SSoT):** the roster is authoritative; everything else is **derived or validated against it** — never hand-maintained in parallel:
+- `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` are **regenerated** from the roster each reconcile (not appended).
+- The Authentik **`T3 Users`** group membership is reconciled from the roster (a member ⇔ a roster entry).
+- The reconcile **validates** `roster.tier` against the live `k8s_users` role and **fails loud on mismatch** (e.g. roster says `power-user` but `k8s_users` says `namespace-owner`) — so the workstation tier and the cluster tier can't silently diverge. `k8s_user`/`namespaces` are reconciled into `k8s_users` (or asserted to match for pre-existing users).
+
+`os_user` is the pinned key (no email→username derivation — avoids the `ancaelena98`-vs-`ancamilea` trap). Onboard = add an entry + reconcile; **offboard = remove the entry** (see "User lifecycle").
+
+### 2. Eligibility gate (Authentik group, edge-enforced)
+
+A `T3 Users` Authentik group gates `t3.viktorbarzin.me` at the edge via a one-branch addition to the existing `stacks/authentik/admin-services-restriction.tf` expression policy (`if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`). Non-members 302→login, never reach the box. Verified earlier: `X-authentik-groups` already reaches the dispatcher (it's in the forward-auth middleware `authResponseHeaders`), so a dispatcher-side second check is possible but the edge gate is the primary.
+
+### 3. Provisioning (idempotent script + roster)
+
+Extend the existing root reconcile (`infra/scripts/t3-provision-users.sh`) to read `roster.yaml` and, per entry, converge:
+- `useradd` the OS account if missing — **constrained** per tier (see §6);
+- assign per-tier groups;
+- drop the per-user identity-scoped kubeconfig + Vault helper;
+- append the `=` line to `/etc/ttyd-user-map`;
+- `systemctl enable --now t3-serve@`;
+- provision a writable git-crypt-locked clone at `~/code` for non-admins **only if absent** (§5; never replaces an existing `~/code`).
+
+Run via the existing systemd timer (OnBoot + periodic) for self-healing, plus on-demand after a roster edit. Account creation is the one new privileged step; it lives only in this root reconcile.
+
+### 4. Config inheritance (native machine-wide layers — ADR-0003)
+
+`wizard` authors the **Config base** (a git checkout of the dotfiles/config-base repo on the devvm). It materializes into the OS's native machine-wide layers, which every user inherits live:
+
+**Verified 2026-06-08:** t3 is itself built on `@anthropic-ai/claude-agent-sdk` and opts into `settingSources: [user, project, local]`; the SDK also reads `/etc/claude-code/managed-settings.json` independently. So the managed layer + `~/.claude` reach **both** surfaces — the t3 web UI *and* a terminal `claude`. Two caveats: it's **Claude-specific** (a t3 user who picks Codex/OpenCode won't inherit Claude config), and `rules/` loads via the per-user `user` source (so Task 1.1's "managed-`claudeMd` vs per-user symlink" question stays real).
+
+| What inherits | Base location (machine-wide) | Native mechanism (live) | Per-user override |
+|---|---|---|---|
+| Claude skills/prompts/rules/CLAUDE.md/hooks/settings | `/etc/claude-code/managed-settings.json` + managed skills | Claude merges enterprise ⊕ user; auto-reloads | `~/.claude/skills…` (adds; base authoritative on clash) |
+| Shell (zsh/aliases/env) | `/etc/zsh/zshrc`, `/etc/profile.d/*.sh`, `/etc/skel` | sourced at login; skel seeds new homes | `~/.zshrc` layers on top |
+| Tools/binaries | system-wide `/usr/local` + apt manifest | one host → shared `/usr` | `pip install --user` in `~` |
+
+`wizard` edits the base → commit → every child inherits on next prompt/login. **No copy, no mirror, no drift** (this replaces today's hand-mirrored per-user setup — the documented emo-drift pain, memory id=3205/4015). Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, history) is never shared — local only. *(Caveat: the managed layer natively covers settings/skills/`claudeMd`; the bespoke `~/.claude/rules/` + `agents/` dirs are delivered via the managed `claudeMd` OR a per-user symlink to the base — pinned in plan Task 1.1. This is also what replaces the old `start-claude.sh: cd /home/wizard/code` hack: config now comes from the managed layer regardless of CWD, so a new user's launcher just `cd ~/code`.)*
+
+### 5. Infra access (per-user writable locked clone — changes NOT gated)
+
+Each non-admin gets their **own writable**, git-crypt-**locked** clone of the monorepo at `~/code`:
+- A **keyless** clone (`filter.git-crypt.smudge=cat`): all code/docs are plaintext; the git-crypt'd secret files (`infra/secrets/`, `infra/terraform.tfvars`) stay `\0GITCRYPT\0` ciphertext blobs. They read the code, never the secrets (the repo is public anyway; only git-crypt'd files are sensitive).
+- **Writable + ungated:** they edit, commit, and `git push` to Forgejo **freely** — no read-only mount, no PR gate. Safe because **pushing infra master does NOT auto-apply** (infra is applied *manually* via `scripts/tg apply`; memory id=4355). Per-user clones also remove the old shared-tree commit-entanglement hazard.
+- **The real boundary is apply-time, not the repo:** a non-admin can change code but cannot make it take effect — `scripts/tg apply` needs a write-capable Vault token (`vault login -method=oidc` → vault-admin) + cluster RBAC their tier lacks.
+- **Trade vs the earlier live mirror:** the infra repo's own `CLAUDE.md`/code now updates via `git pull` (standard dev flow), not instantly. The high-value live inheritance — Viktor's skills/prompts/rules/global `CLAUDE.md` — is **unaffected** (it flows through the machine-wide managed layer in §4, not the repo).
+
+### 6. Permission model
+
+| Tier | OS account | sudo / docker | code-shared + git-crypt | infra repo | kubectl (own OIDC, per tier) | Vault (own OIDC) |
+|---|---|---|---|---|---|---|
+| **admin** (Viktor) | wizard | ✅ / ✅ | ✅ (unlocked) | unlocked R/W tree; can `tg apply` | cluster-admin | vault-admin |
+| **power-user** (Emo) | emo | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **cluster-wide read-only, no Secrets** | scoped read |
+| **namespace-owner** (Anca) | ancamilea | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **admin in own namespace** (full R/W in-ns) + namespace/node LIST only | own-namespace paths |
+
+Layers: Authentik group (eligibility) → OS account `0700` home + per-tier groups (no sudo/docker for non-admins; rootless podman if containers needed) → **per-user OIDC kubeconfig + Vault** so each session acts as *its own* identity, never Viktor's. **kubectl is enabled per tier** — the provisioner installs each user's kubeconfig at the scope above (admin = cluster-admin; power-user = cluster-wide read-only, no Secrets; namespace-owner = admin in their own namespace), reusing the existing `k8s_users` / dashboard-SA machinery (memory id=4042). **Changing infra is never gated at the repo; it's gated at apply** — only admin can `scripts/tg apply` (write Vault + cluster RBAC). Per-user creds live in each `0700` home; wizard's `~/.vault-token` (`0600`) is unreadable to others.
+
+**Cluster-RBAC reality (verified 2026-06-08) — two corrections + identity facts:**
+- **power-user role:** the existing `oidc-power-user` ClusterRole grants cluster-wide **read+write+Secrets** and is currently *unbound* — NOT the read-only-no-Secrets tier ADR-0005 wants. So power-user needs a **NEW** `oidc-power-user-readonly` ClusterRole (get/list/watch on non-secret resources cluster-wide, NO `secrets`), bound to emo's OIDC email. Do not reuse the existing role.
+- **kubeconfig is OIDC, not SA-token:** the apiserver carries live `--oidc-*` flags for the `kubernetes` audience and accepts Authentik OIDC; the "apiserver rejects OIDC" note in `dashboard-sa.tf` is dashboard-audience-specific (the multi-issuer `authentication-config` isn't live). Install `kubelogin`, smoke-test the OIDC path first, and fall back to the per-user SA-token (dashboard) pattern only if it fails.
+- **identity reality:** emo has **no `k8s_users` entry** today → power-user is a NET-NEW grant; anca is already namespace-owner of `plotting-book` and gheorghe (`vabbit81`) of `vabbit81` — preserve, don't re-provision.
+
+**Shared-host caveat:** a multi-user host is a softer boundary than pods — it relies on standard Linux hardening. Appropriate because these are trusted people. If a user must ever be *untrusted*, that's the signal to revisit K8s pods. Note: non-admins' Claude/t3 runs `--dangerously-skip-permissions` (autonomous tool execution as their uid) — bounded by the `0700` home + no-sudo/no-docker sandbox, but a conscious accepted trade.
+
+### 7. Secrets & auth (per-user, injected — never in the Config base)
+
+The Config base / machine-wide managed layer is **secret-free**. Everything carrying a token/auth is **per-user**, in the user's own `0600` files, and **never machine-wide** — per the Google-Workspace-MCP precedent (id=4553: *"do NOT move a secret-bearing MCP server into machine-wide config"*; one user literally can't read another's `~/.claude.json`).
+
+| Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) |
+|---|---|---|
+| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own |
+| **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED — not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write — NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. |
+| **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file) — only if HA-eligible |
+| **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret |
+| **`context7`** | plugin-provided | non-secret (plugins layer) |
+
+The root provisioner READS these from Vault and writes them into a **new** user's home — **if-absent, never clobbering** an existing user's working config. Minting a new per-user memory key needs an admin Vault write (`vault login -method=oidc`; the agent token can't write KV — id=4181) → an admin onboarding step. **emo's existing MCP/auth is untouched** (additive-only): `managed-settings.json` carries NO `env` secrets, so his `MEMORY_API_KEY` and his `~/.claude.json` MCP servers keep working exactly as today.
+
+**beads (`bd`) credential — gap found 2026-06-08:** a per-user infra clone does NOT include the Dolt credential (`.beads-credential-key` is git-ignored), so the provisioner must drop it (or set `DOLT_REMOTE_PASSWORD`) into the user's `~/code/.beads/` — else `bd` resolves the central server (`10.0.20.200:3306`) but fails auth. `bd` does **not** depend on `code-shared` (it's server-mode against the central Dolt), so the emo cutover doesn't break `bd` *if* his credential is provisioned.
+
+## Capacity & prerequisites
+
+**The devvm is the binding constraint — address before onboarding active users.** Verified 2026-06-08: devvm has **24 GB RAM** (the `proxmox-inventory.md` "8 GB" is STALE → fix that doc), ~8 GB free, **0 swap**; wizard alone already runs ~20 sessions (~10 GB RSS). Each interactive Claude session is ~300–700 MB; each user adds one persistent `t3-serve` daemon (~430 MB). 3–5 active users × several sessions would exhaust RAM → with **0 swap the failure mode is OOM-kill of live sessions** (everyone's), not graceful slowdown — also a `~/.claude.json` corruption trigger (id=2320/2321: multi-session writes + disk pressure).
+
+**Prerequisites (do FIRST):** (1) **add swap** to the devvm (OOM-kill → graceful pressure); (2) optionally bump RAM (PVE-side — devvm is NOT TF-managed, id=1575); (3) set a per-user RAM budget + a **max-concurrent-active-users** ceiling; (4) memory/disk-pressure monitoring on the devvm. CPU (16 cores, ~7%) and disk (`/` ~28 GB free) are fine for now.
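The capacity argument in this section is simple arithmetic, and a small sketch makes the headroom explicit. The figures (roughly 8 GB free, 300–700 MB per session, about 430 MB per `t3-serve` daemon) come from the section itself; the function names, the example workloads, and the assumption that memory use is purely additive are illustrative only.

```python
# Rough devvm RAM headroom estimate. Figures are taken from the capacity
# section; the workload shapes (users x sessions) are hypothetical.

FREE_MB = 8 * 1024   # ~8 GB free, as measured 2026-06-08
DAEMON_MB = 430      # one persistent t3-serve daemon per user


def added_rss_mb(users: int, sessions_per_user: int, session_mb: int) -> int:
    """Extra resident memory if `users` people each run `sessions_per_user` sessions."""
    return users * (sessions_per_user * session_mb + DAEMON_MB)


def fits_without_swap(users: int, sessions_per_user: int, session_mb: int) -> bool:
    """True if the added load stays within today's free RAM (no swap assumed)."""
    return added_rss_mb(users, sessions_per_user, session_mb) <= FREE_MB


if __name__ == "__main__":
    for users, sessions, mb in [(3, 3, 300), (4, 4, 500), (5, 4, 700)]:
        need = added_rss_mb(users, sessions, mb)
        verdict = "fits" if fits_without_swap(users, sessions, mb) else "exceeds free RAM"
        print(f"{users} users x {sessions} sessions x {mb} MB -> {need} MB ({verdict})")
```

Under these assumptions a light load fits, but four users at four mid-sized sessions each already overshoots the free memory, which is consistent with treating swap and a concurrency ceiling as prerequisites rather than follow-ups.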
+
+## User lifecycle (onboard → reconcile → offboard) — the roster drives all of it
+
+The roster is the SSoT for the **whole** lifecycle, not just creation:
+
+- **Onboard:** add a roster entry (the reconcile also adds them to the `T3 Users` Authentik group). The reconcile creates the constrained account, seeds config inheritance, provisions the per-user OIDC kubeconfig + locked clone + MCP/auth (+ the `bd` Dolt credential), starts `t3-serve@`.
+- **Reconcile (routine, additive-only):** converges *missing* state UP; never strips an existing user (the don't-break-emo guarantee). Safe to run anytime.
+- **Offboard (REMOVE the roster entry):** the destructive half — gated + staged, NOT the routine timer:
+  1. **Reversible cut (on roster removal):** stop+disable `t3-serve@`; drop the user from `/etc/ttyd-user-map` + `dispatch.json` (regenerated → 403 at the dispatcher); remove from the `T3 Users` Authentik group (edge-blocked); `passwd -l `. Access fully cut; nothing deleted.
+  2. **Cluster revoke:** remove their `k8s_users` entry + apply (drops RBAC binding + kubeconfig validity) + revoke shared-token / memory creds.
+  3. **Destructive (explicit, separate, never auto):** archive `~` (tar → backup), then `userdel -r`. Irreversible — requires explicit go-ahead.
+- Write `docs/runbooks/offboard-user.md` (the link in `multi-tenancy.md` currently dead-ends). Rollback of step 1/2 = re-add the roster entry + reconcile.
+
+## Incrementality & migration (don't break emo)
+
+emo has a **working** setup that must not break: his `t3-serve@emo` (port 3774) + ~4 concurrent live Claude sessions (id=2320); his own `~/.claude` + `~/.claude.json` (MCP servers incl. `ha` token-in-URL and his `MEMORY_API_KEY`); his `~/code` symlink into wizard's tree; `code-shared` + `docker` membership; tmux/playwright units. Hard guarantees:
+
+- **The idempotent reconcile is ADDITIVE-ONLY.** It creates *missing* accounts/config/instances and *adds* a user's tier-appropriate access, but it **never removes** an existing user's groups, **never replaces** an existing `~/code` (skip-if-exists), and **never writes into** an existing `~/.claude` / `~/.claude.json`. Running `provision-users.sh` at any time is therefore a no-op on emo's existing state — safe to run repeatedly.
+- **Every destructive/tightening step is SEPARATE, explicit, idle-gated, and reversible** — never part of the routine reconcile.
+- **Phases 0–4 are additive and verified non-breaking.** After each, confirm emo's live sessions, his `~/.claude`/MCP, his `~/code`, and his groups are unchanged.
+
+Rollout order:
+1. **Config base + machine-wide managed layer** → wizard + emo *inherit* wizard's skills/prompts. Additive: the managed layer only ADDS; it must not set keys/hooks that override emo's working `~/.claude` / `MEMORY_API_KEY` / MCP servers. **Verify emo's existing sessions + MCP still work.**
+2. **Roster + provisioner** alongside the current `/etc/ttyd-user-map` (idempotent; ancamilea already provisioned; emo's instance untouched).
+3. **Per-user writable locked clones** provisioned **only for users without an existing `~/code`** — emo's symlink is left intact (skip-if-exists).
+4. **Per-tier kubeconfig** installed **only if absent** (existing `~/.kube/config` backed up, never clobbered) — emo's current kube access untouched.
+5. **emo cutover — the ONLY step that changes emo; opt-in + reversible, never auto-run:** (a) record rollback state (`readlink ~emo/code`, `id emo`, copy of `start-claude.sh`); (b) idle-gate (id=3201); (c) replace his `~/code` symlink with his own writable locked clone, **point his `start-claude.sh` at `cd ~/code`** (today it hardcodes `cd /home/wizard/code` — *that* is the actual reason his Claude lands in wizard's unlocked tree, so swapping the symlink alone is NOT enough), drop the now-redundant `~/.claude/{rules,skills/file-issue}` symlinks into wizard's home (the managed layer / shared base delivers them now), and `gpasswd -d emo code-shared`. He keeps full edit/commit/push (ungated); loses only secret-read + apply. **Rollback (seconds):** restore the symlink + `start-claude.sh` + the `~/.claude` symlinks + `gpasswd -a emo code-shared`. A `t3-serve@emo` restart only blips his WebSocket (id=3308). Requires explicit go-ahead.
+6. **Authentik `T3 Users` group + edge gate** last (once instances exist), so no one is locked out mid-migration.
+
+New users (gheorghe; and ancamilea's enhancement) are born into the new model — no migration needed.
+
+## Template-readiness ("VM as a template" — future)
+
+Design principle: **every bit of devvm setup is an idempotent git script** — nothing lives only as hand-typed host state. Three scripts in `infra/scripts/workstation/`: `setup-devvm.sh` (package manifest + managed config + config-base clone), `provision-users.sh` (roster loop), and the roster + manifest data files. When the template is wanted: the devvm becomes a cloud-init Proxmox template (the estate's existing reproducibility pattern, id=1575) that clones the infra repo + runs both scripts → identical devvm. Per-user **home data** is the only non-template state → add `/home` to the 3-2-1 backup set, or users re-clone + re-pair on a fresh box.
+
+## Key decisions (ADR candidates)
+
+- **ADR-0001 — Build on the existing stack, not a CDE.** Coder/Che/etc. researched; the role model is Premium-gated or the platform lacks the agent layer, and the homelab scale doesn't justify it. Hard to reverse, surprising ("why not Coder?"), real trade-off.
+- **ADR-0002 — devvm Linux users, not K8s ephemeral pods.** Re-platforming is overkill at this scale; config-push is easier on one host.
+- **ADR-0003 — Config inheritance via native machine-wide layers + per-user override.** Rejected: periodic sync, OverlayFS (no live lowerdir edits), Nix (rebuild not live).
+- **ADR-0004 — Infra access via per-user writable git-crypt-locked clones (changes ungated).** Each non-admin gets their own writable, keyless (locked) clone — read + edit + push freely, no PR gate. Safe because infra apply is manual + admin-only (push ≠ apply, id=4355) and the clone can't decrypt secrets. Rejected: the shared read-only mirror (gated changes) and the shared unlocked tree (secret leak + commit entanglement). Trade: repo-local CLAUDE.md updates via pull, not live (global config inheritance stays live via §4).
+- **ADR-0005 — Power-user = cluster-wide read-only (no Secrets), via a NEW dedicated ClusterRole.** Re-widens cross-tenant READ for the trusted power-user tier only — but via a NEW `oidc-power-user-readonly` ClusterRole (get/list/watch, NO `secrets`), NOT the existing `oidc-power-user` (which grants read+write+Secrets and is unbound). Bound to the user's OIDC identity (kubelogin) — the apiserver accepts Authentik OIDC for the `kubernetes` audience; the dashboard's SA-token pattern is for the dashboard UI only.
+- **ADR-0006 — The roster is the single source of truth for the FULL lifecycle.** `roster.yaml` drives onboard *and* offboard; `/etc/ttyd-user-map`, `dispatch.json`, and Authentik `T3 Users` membership are *derived* from it, and tier is *validated* against `k8s_users` (fail-loud on mismatch). Rejected: hand-maintaining the four membership lists in parallel (guaranteed drift). Offboarding is first-class + staged (reversible cut → cluster revoke → gated `userdel`), not an afterthought.
+- **ADR-0007 — Add swap + a capacity budget to the devvm before onboarding active users.** A shared 24 GB / **0-swap** host OOM-kills live sessions under multi-user load (wizard alone runs ~20). Swap + a max-concurrent ceiling are prerequisites, not follow-ups.
+
+## Out of scope / deferred
+
+- Zero-touch auto-provision on first Authentik login (admin runs the provisioner / the timer converges — simpler at this scale).
+- K8s per-user pods (revisit only if a user must be untrusted, or scale grows large).
+- The actual cloud-init template conversion (design for it now; do it when wanted).
+- **Per-user memory isolation** (own namespace / service-side `_key_to_user` map + redeploy) — **deferred; not a risk now** (Viktor, 2026-06-08). Revisit if memory cross-read becomes a concern.
+
+## Verification (acceptance)
+
+- A new roster entry + `provision-users.sh` → the user can log into `t3.viktorbarzin.me` and lands in a configured Workstation with Viktor's skills/prompts.
+- wizard edits a skill/CLAUDE.md in the base → a child's next prompt sees it (no pull).
+- A child's `kubectl`/`vault` is bounded by their tier (kubectl enabled per tier: power-user = cluster-wide read-only; namespace-owner = read/write in own ns only); a non-admin cannot read git-crypt secrets nor escalate.
+- A non-admin can edit + commit + push their infra clone **freely**, but cannot `scripts/tg apply` (no write Vault / cluster RBAC) — changes don't take effect until an admin applies.
+- Re-running the provisioner is idempotent (no changes on a converged host).
+- `provision-users.sh` + `setup-devvm.sh` reproduce the setup on a fresh host from git.
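The idempotency and additive-only guarantees listed in this design can be sketched as a pure planning function. This is a minimal illustration, not the real `provision-users.sh`: the function name, the action labels, and the shape of the host-state inputs are all hypothetical.

```python
# Hypothetical planner: compares the roster with observed host state and
# returns only "create what is missing" actions. It never plans a removal.

from typing import Iterable, List, Set, Tuple

Action = Tuple[str, str]  # (action label, os_user); labels are illustrative


def plan_reconcile(
    roster_users: Iterable[str],
    existing_accounts: Set[str],
    users_with_code_dir: Set[str],
    users_with_kubeconfig: Set[str],
) -> List[Action]:
    """Return the additive steps needed to converge the host onto the roster."""
    actions: List[Action] = []
    for user in sorted(roster_users):
        if user not in existing_accounts:
            actions.append(("create_account", user))
        if user not in users_with_code_dir:
            # skip-if-exists: an existing ~/code (even a symlink) is left alone
            actions.append(("clone_locked_repo", user))
        if user not in users_with_kubeconfig:
            actions.append(("install_kubeconfig", user))
    return actions


if __name__ == "__main__":
    roster = ["emo", "ancamilea"]
    # emo is fully set up; ancamilea has an account but nothing else yet.
    print(plan_reconcile(roster, {"emo", "ancamilea", "wizard"}, {"emo"}, {"emo"}))
    # A converged host yields an empty plan, so a re-run changes nothing.
    everyone = {"emo", "ancamilea"}
    print(plan_reconcile(roster, everyone, everyone, everyone))
```

Keeping the plan separate from its execution is one way to make the "no changes on a converged host" acceptance check testable: an empty plan is the definition of converged. Offboarding deliberately has no counterpart here, since the design keeps removals out of the routine reconcile.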
diff --git a/docs/plans/2026-06-07-multi-user-workstation-plan.md b/docs/plans/2026-06-07-multi-user-workstation-plan.md
new file mode 100644
index 00000000..1bd3275c
--- /dev/null
+++ b/docs/plans/2026-06-07-multi-user-workstation-plan.md
@@ -0,0 +1,223 @@
+# Multi-User Workstation — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement task-by-task. Steps use `- [ ]` for tracking. This is **infra** work — "verify" means an idempotent re-run + a smoke check with expected output (not pytest). Honor the Terraform-only rule for cluster changes; devvm host scripts are the accepted exception (versioned in `infra/scripts/`, deployed via the provisioner). Claim `host:devvm` before mutating the devvm; gate `t3-serve@` restarts on user idle (memory id=3201). **INCREMENTALITY (don't break emo):** every phase is additive; the idempotent reconcile is **additive-only** — it NEVER removes an existing user's groups, NEVER replaces an existing `~/code` (skip-if-exists), and NEVER writes into an existing `~/.claude`/`~/.claude.json`. The emo cutover (Phase 5) is the ONLY destructive step — explicit, idle-gated, reversible, never auto-run. After each of Phases 1–4, **verify emo's live sessions, `~/.claude`/MCP, `~/code`, and groups are unchanged.**
+
+**Goal:** A declarative roster + idempotent scripts that provision per-user Claude Code Workstations on the devvm, inheriting Viktor's config live via native machine-wide layers, scoped by RBAC tier, reproducible from git.
+
+**Architecture:** Config base (machine-wide managed Claude config + system shell files + apt manifest) authored by wizard → all users inherit live. `roster.yaml` + `provision-users.sh` create constrained OS accounts + per-user OIDC kubeconfig (per tier) + per-user writable git-crypt-locked infra clone + `t3-serve@`. Authentik `T3 Users` group gates the edge.
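As a rough illustration of how the roster and the cluster identity stay aligned in this architecture, the design's fail-loud tier check might look like the sketch below. The shape of the `k8s_users` map (a `role` field per user) and every name in the code are assumptions; the real JSON in `secret/platform` may differ.

```python
# Hypothetical tier check: compares roster tiers with the live k8s_users roles.
# The {"role": ...} layout of k8s_users is assumed for illustration.


class TierMismatch(Exception):
    """Raised when the roster tier and the live k8s_users role disagree."""


def check_tiers(roster: dict, k8s_users: dict) -> list:
    """Return os_users with no k8s_users entry yet (net-new grants); raise on mismatch."""
    net_new = []
    for os_user, entry in roster["users"].items():
        live = k8s_users.get(entry["k8s_user"])
        if live is None:
            # No cluster identity yet: to be reconciled in, not treated as an error.
            net_new.append(os_user)
            continue
        if live["role"] != entry["tier"]:
            raise TierMismatch(
                f"{os_user}: roster says {entry['tier']!r}, k8s_users says {live['role']!r}"
            )
    return net_new


if __name__ == "__main__":
    roster = {
        "users": {
            "emo": {"k8s_user": "emo", "tier": "power-user"},
            "ancamilea": {"k8s_user": "anca", "tier": "namespace-owner"},
        }
    }
    live = {"anca": {"role": "namespace-owner"}}
    print(check_tiers(roster, live))
```

Note that the lookup goes through `k8s_user`, not the `os_user` key, because the two identifiers differ for some people (`ancamilea` versus `anca`).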
+ +**Tech Stack:** Bash (idempotent host scripts), systemd template units + timer, Claude Code managed-settings, git-crypt, Authentik expression policy (Terraform), the existing `k8s_users` per-user Vault/RBAC. + +**Design:** `infra/docs/plans/2026-06-07-multi-user-workstation-design.md`. **Glossary:** `infra/CONTEXT.md`. + +--- + +## File structure + +- Create: `infra/scripts/workstation/roster.yaml` — the source-of-truth roster +- Create: `infra/scripts/workstation/packages.txt` — declared host apt/global toolset +- Create: `infra/scripts/workstation/setup-devvm.sh` — host base: packages + managed Claude config + config-base clone (idempotent) +- Create: `infra/scripts/workstation/managed-settings.json` — the machine-wide Claude base (settings + `claudeMd`) +- Modify: `infra/scripts/t3-provision-users.sh` — read `roster.yaml`; create constrained accounts; per-tier groups + kubeconfig; repoint `~/code` +- Modify: `infra/scripts/t3-provision-users.sh` — also provision each non-admin's own writable git-crypt-locked clone at `~/code` (no separate mirror service) +- Modify: `infra/stacks/authentik/admin-services-restriction.tf` — add the `t3.viktorbarzin.me` → `T3 Users` branch +- Create: `infra/stacks/authentik/` group resource (or document the UI-created group) for `T3 Users` +- Docs: update `infra/docs/architecture/multi-tenancy.md` (add the Workstation section) + `.claude/reference/service-catalog.md` (t3code row) in the same commits + +--- + +## Phase −1 — Prerequisites (do FIRST) + +### Task −1.1: devvm capacity (P0 — verified 2026-06-08: 24 GB RAM, 0 swap, wizard ~20 sessions) + +- [ ] **Step 1:** Add **swap** to the devvm (swapfile, e.g. 8–16 GB) — turns multi-user OOM-kill into graceful pressure. Verify `free -h` shows `Swap` > 0. +- [ ] **Step 2:** Document a per-user RAM budget + a **max-concurrent-active-users** ceiling; add memory/disk-pressure monitoring on the devvm. (Optionally bump RAM PVE-side — devvm is NOT TF-managed, id=1575.) 
+- [ ] **Step 3:** Fix the stale `infra/.claude/reference/proxmox-inventory.md` devvm RAM (says 8 GB; live = 24 GB). Commit `[ci skip]`. + +### Task −1.2: tooling + +- [ ] **Step 1:** Install `kubelogin` (`kubectl-oidc_login`) on the devvm and add it to `packages.txt` — the per-user OIDC kubeconfig (Task 2.2) needs it; it is NOT installed today. + +--- + +## Phase 0 — Roster + config base in git (no host changes) + +### Task 0.1: Create the roster + +**Files:** Create `infra/scripts/workstation/roster.yaml` + +- [ ] **Step 1:** Write the roster with the current three children (wizard is the base author, not listed): + +```yaml +# THE single source of truth for the devvm Workstation lifecycle (onboard → offboard). +# os_user (key) → authentik_user · k8s_user · tier · namespaces. Identifiers differ per person (verified 2026-06-08). +users: + emo: { authentik_user: emil.barzin, k8s_user: emo, tier: power-user } # NET-NEW cluster identity (not in k8s_users today) + ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] } # ALREADY provisioned — preserve, don't re-create +# gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] } # already a cluster ns-owner; uncomment for a devvm workstation +``` +(`os_user` is the pinned key — no email→username derivation. Note the three distinct IDs per person.) + +- [ ] **Step 2: Verify** it parses: `python3 -c "import yaml,sys; print(yaml.safe_load(open('infra/scripts/workstation/roster.yaml')))"` → Expected: a dict with `users.emo.tier == power-user`. 
+- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/roster.yaml && git commit -m "workstation: add roster source-of-truth [ci skip]"` + +### Task 0.2: Declare the host toolset + +**Files:** Create `infra/scripts/workstation/packages.txt` + +- [ ] **Step 1:** List the shared tools (one per line, comments allowed): `git`, `zsh`, `tmux`, `ripgrep`, `jq`, `python3`, `nodejs`, `kubectl`, `vault`, `podman` (rootless). Claude Code is installed via npm global in `setup-devvm.sh` (Task 1.2), not apt. +- [ ] **Step 2: Verify:** `grep -vE '^\s*(#|$)' infra/scripts/workstation/packages.txt` lists the expected packages. +- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/packages.txt && git commit -m "workstation: declare host package manifest [ci skip]"` + +### Task 0.3: Build the Config base (secret-free, curated — it doesn't exist yet) + +**Files:** chezmoi dotfiles repo (`github.com/ViktorBarzin/dot_files`, `dot_claude/`) + `infra/scripts/workstation/managed-settings.json` + +- [ ] **Step 1:** Create/refresh the **Config base** = the secret-free curated set the managed layer + `/etc/skel` deploy from: skills/agents/rules/commands/hooks/`CLAUDE.md` + shell (`zshrc`/`profile.d`) + the `start-claude.sh` launcher (`cd "$HOME/code"`). Sanitize OUT all secrets (`.credentials.json`, `~/.claude.json`, `settings.json` `env`); resolve any `~/.agents/skills` symlinks to real files. +- [ ] **Step 2:** Reconcile launcher ownership: the current `start-claude.sh` is deployed by the SEPARATE `viktor/terminal-lobby` repo (its own `deploy.sh`). Decide whether the workstation base or terminal-lobby owns it — not both (avoid two competing launchers). +- [ ] **Step 3: Verify:** secret-scan the base (`grep -rEi 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password'` → only docs/placeholders) + no dangling symlinks. +- [ ] **Step 4: Commit/push** the refreshed dotfiles repo. 
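The Step 3 check in Task 0.3 could be wrapped roughly as follows. This is a sketch only: `scan_config_base` is a hypothetical function name, the grep pattern is the one quoted in the plan, and a non-zero return means a human should review the hits (docs and placeholders are expected to match), not that a secret was definitely found.

```shell
#!/bin/sh
# Sketch only: scan_config_base is a hypothetical helper name.
set -eu

# scan_config_base DIR
# Prints lines matching the plan's secret pattern plus any dangling symlinks.
# Returns 0 when nothing was found, 1 when something needs review.
scan_config_base() {
  local dir="$1" hits dangling
  # grep exits 1 on "no match"; that is the good case here, so tolerate it.
  hits=$(grep -rEi 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password' "$dir" 2>/dev/null || true)
  # "test -e" follows symlinks, so it fails exactly for dangling ones.
  dangling=$(find "$dir" -type l ! -exec test -e {} \; -print)
  if [ -n "$hits" ]; then
    echo "possible secrets:"
    echo "$hits"
  fi
  if [ -n "$dangling" ]; then
    echo "dangling symlinks:"
    echo "$dangling"
  fi
  [ -z "$hits" ] && [ -z "$dangling" ]
}
```

Run against the local checkout of the Config base before Step 4, the function gives a single exit status to gate the commit on while still printing every hit for manual review.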
+ +--- + +## Phase 1 — Config base + machine-wide inheritance (additive; verify wizard+emo inherit) + +### Task 1.1: Pin the exact Claude managed-skills mechanism (discovery spike) + +**Why:** the managed `settings.json` + `claudeMd` paths are confirmed (`/etc/claude-code/managed-settings.json`), but the exact **managed skills** deployment path needs confirming on the installed Claude Code version before we rely on it for skill inheritance. + +- [ ] **Step 1:** On the devvm, check the installed version: `claude --version`. +- [ ] **Step 2:** Confirm the managed location is read: create a throwaway `/etc/claude-code/managed-settings.json` with a benign `claudeMd` string, start a fresh `claude` session as a NON-wizard test user, and confirm the injected guidance appears. Expected: the `claudeMd` text is present in context. +- [ ] **Step 3:** Determine the managed-skills path (managed-settings `skills`/skill-source key, or a managed skills dir) **AND how the bespoke `~/.claude/rules/*.md` + `agents/` are delivered machine-wide** — the managed layer covers settings/skills/`claudeMd`, NOT an arbitrary `rules/` dir, so rules land either (a) folded into the managed `claudeMd`, or (b) a per-user symlink to the shared Config base (replacing today's live `~/.claude/rules → /home/wizard/.claude/rules` symlink). Record the verified mechanism in the design doc's §4 + a memory. +- [ ] **Step 3b — Plan-B (go/no-go):** if managed *skills* aren't supported on the installed Claude Code version, FALL BACK to per-user symlinks of `~/.claude/{skills,agents,rules}` → the shared Config base. The verified `settingSources:[user,…]` (2026-06-08) means both t3 and `claude` read the per-user `user` layer, so symlinks are a complete fallback. Make this an explicit branch, not a silent assumption. 
+- [ ] **Step 4: Commit** the design-doc update: `git commit -am "workstation: pin verified managed-skills mechanism [ci skip]"` + +### Task 1.2: `setup-devvm.sh` — host base (idempotent) + +**Files:** Create `infra/scripts/workstation/setup-devvm.sh`, `infra/scripts/workstation/managed-settings.json` + +- [ ] **Step 1:** Write `managed-settings.json` — the machine-wide Claude base: the `claudeMd` org guidance + any enforced hooks/permissions, **no secrets** (per-user memory keys etc. stay per-user). +- [ ] **Step 2:** Write `setup-devvm.sh` (run as root, idempotent): (a) `apt-get install -y $(grep -vE '^\s*(#|$)' packages.txt)`; (b) `npm install -g @anthropic-ai/claude-code` if missing; (c) `install -m 0644 managed-settings.json /etc/claude-code/managed-settings.json`; (d) materialize managed skills from the config-base checkout per the Task 1.1 mechanism; (e) lay down `/etc/profile.d/00-workstation.sh` + `/etc/zsh/zshrc.d/` base shell config + seed `/etc/skel` — **incl. a `start-claude.sh` that `cd "$HOME/code"` and a `.tmux.conf` with `default-command "$HOME/start-claude.sh"`, so a new account auto-launches Claude in ITS OWN clone (never a hardcoded `/home/wizard/code`)**; (f) clone/refresh the config-base repo to a shared path. +- [ ] **Step 3: Verify (inheritance):** as `emo` (idle-gated if a session is live), `sudo -u emo -i claude` shows wizard's managed `claudeMd` + a base skill in `/skills`, with no per-emo copy. Expected: base skill present. +- [ ] **Step 4: Verify (idempotent):** re-run `setup-devvm.sh`; Expected: exit 0, no changes on second run. 
+- [ ] **Step 5: Commit:** `git add infra/scripts/workstation/setup-devvm.sh infra/scripts/workstation/managed-settings.json && git commit -m "workstation: host base + machine-wide Claude config inheritance"` + +--- + +## Phase 2 — Provisioner (additive; create constrained accounts from roster) + +### Task 2.1: Extend `t3-provision-users.sh` to read the roster + create accounts + +**Files:** Modify `infra/scripts/t3-provision-users.sh` + +- [ ] **Step 1:** Add a roster-read + per-entry loop. For each `os_user`: if the account is **absent**, `useradd -m -s /bin/zsh "$os_user"` + `passwd -l "$os_user"` (SSO/t3 only) + `chmod 700 ~`. `set_tier_groups` is **ADD-ONLY** — it `gpasswd -a`'s the tier's groups (admin → `sudo,docker,code-shared`; power-user/namespace-owner → none beyond their own) but **NEVER removes** a group from an existing account (so a routine reconcile can't strip emo's current `code-shared`/`docker` — removal is the Phase-5 cutover only). Do **not** `passwd -l` or re-`chmod` an already-existing account. +- [ ] **Step 2 (SSoT — derive, don't append):** **Regenerate** `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` from the roster each run (so a removed roster entry DISAPPEARS — this is what makes offboarding's reversible-cut work), allocate sticky ports, `systemctl enable --now t3-serve@`. Reconcile the `T3 Users` Authentik group membership from the roster. **Validate** each entry's `tier` against the live `k8s_users` role and **abort with a clear error on mismatch** (workstation tier and cluster tier must not silently diverge). +- [ ] **Step 3: Verify (idempotent + non-breaking):** run as root; Expected: emo + ancamilea instances `active`, dispatch.json unchanged, **AND** `id emo` still shows `code-shared`+`docker` (NOT stripped), emo's `~/code` symlink intact, his live sessions unaffected. +- [ ] **Step 4: Verify (constrained account):** for a newly created non-admin account (not emo, whose existing groups Step 3 requires to be preserved until the Phase-5 cutover), `id "$os_user"` shows no `sudo`/`docker`/`code-shared`, and `sudo -n true` run as that user fails (no sudo). 
+- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: roster-driven account creation + per-tier groups"` + +### Task 2.2: Per-user identity-scoped kubeconfig + Vault helper + +**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_identity`) + +- [ ] **Step 1:** For each non-admin, write `~$os_user/.kube/config` as a **per-user OIDC kubeconfig** (`kubelogin`/`oidc-login`) bound to THEIR email — the apiserver accepts Authentik OIDC for the `kubernetes` audience (verified 2026-06-08; the dashboard SA-token pattern is for the dashboard UI, NOT kubectl). Tier → a ClusterRole bound to their OIDC `User`: namespace-owner → admin in their own namespace via the existing `oidc-ns-owner-*` bindings (for anca that's the EXISTING `plotting-book` — assert, don't re-provision); power-user → a **NEW `oidc-power-user-readonly`** ClusterRole (get/list/watch cluster-wide, **NO `secrets`**), NOT the existing `oidc-power-user` (read+write+Secrets). Owned by the user, `0600`. **Install only if `~/.kube/config` is absent;** else back up to `.bak-` and skip (never clobber). +- [ ] **Step 2:** Drop a `~/.zshrc.d/vault.sh` that sets `VAULT_ADDR=https://vault.viktorbarzin.me` and documents `vault login -method=oidc` (their own identity). Do NOT seed wizard's token. +- [ ] **Step 3: Verify (OIDC works, then scoping):** FIRST smoke-test the OIDC path — a non-admin `kubectl` via kubelogin actually authenticates (it's currently unexercised by any human; if it fails like the dashboard audience did, fall back to a per-user SA-token kubeconfig). THEN: as emo, `kubectl get pods -A` works (read) but `kubectl get secret -A` is forbidden and `kubectl delete` anything is forbidden; as ancamilea, only `plotting-book` is visible. 
+- [ ] **Step 4: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user identity-scoped kubeconfig + vault helper"` + +*(Prereq: add a **NEW `oidc-power-user-readonly`** ClusterRole + email binding to `stacks/rbac` via `scripts/tg apply` — do NOT reuse the existing `oidc-power-user` (read+write+Secrets, currently unbound). emo also needs a NEW `k8s_users` entry as `power-user` (net-new); anca/gheorghe already exist — assert, don't re-create. Terraform-managed, separate commit.)* + +### Task 2.3: Inject per-user MCP + auth secrets (new users only; never clobber) + +**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_secrets`) + +- [ ] **Step 1:** For each non-admin **without** an existing `~/.claude.json` (NEW users only — NEVER touch an existing one): write `~/.claude.json` with `playwright-shared` (localhost), `ha` (shared `ha_sofia_mcp_url` from Vault `secret/openclaw`) if HA-eligible, and `claude_memory` using a **shared/simple key (per-user memory isolation is DEFERRED — not a risk now)**. Seed `~/.claude/.credentials.json` with the shared Claude token (Vault) **or** leave absent for interactive login. **Drop the beads Dolt credential** into `~/code/.beads/` (`.beads-credential-key`, from Vault, or set `DOLT_REMOTE_PASSWORD`) so `bd` authenticates — it's git-ignored, so a fresh clone lacks it. All `0600`, owned by the user. Per-user `playwright-mcp` systemd unit on its own port (existing pattern, id=4015). +- [ ] **Step 2 (DEFERRED — not now):** Per-user memory isolation is NOT built (Viktor, 2026-06-08): a new user shares/omits memory for now. When wanted, it needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78) **and** a Vault key — not just a Vault write (id=413/4181). +- [ ] **Step 3: Verify (new user gets isolated auth):** as the test user, `claude mcp list` shows their servers `Connected`; `memory_recall` returns THEIR namespace, not Viktor's. 
+- [ ] **Step 4: Verify (emo untouched):** `~emo/.claude.json`, `~emo/.claude/.credentials.json`, `~emo/.claude/settings.json` are **byte-identical** to before the run (`sha256sum` before/after); `claude mcp list` as emo still shows ha/claude_memory/playwright `Connected`. +- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user MCP + auth injection (new users only, if-absent)"` + +--- + +## Phase 3 — Per-user writable locked infra clone (code view; changes ungated) + +### Task 3.1: Provision each non-admin's own writable git-crypt-locked `~/code` + +**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_infra_clone`) + +- [ ] **Step 1:** For each non-admin, **only if `~$os_user/code` does not exist at all** (no symlink, no directory — NEVER touch an existing `~/code`, so emo's symlink stays intact), clone the same repo wizard uses, as that user: `REPO=$(git -C /home/wizard/code config --get remote.origin.url); sudo -u "$os_user" git clone "$REPO" ~/code`. Then in the clone set `git config filter.git-crypt.smudge cat; filter.git-crypt.clean cat; filter.git-crypt.required false` and `git checkout master`. **No git-crypt key is installed** → secret files stay ciphertext, code/docs are plaintext (memory id=3665/3666). Owned by the user, writable. +- [ ] **Step 2:** Leave it writable with a normal `origin` remote (Forgejo) — no read-only mount, no PR gate; they may edit/commit/push freely. (Optional: `git config push.default current` so a bare `git push` targets their own branch.) +- [ ] **Step 3: Verify (locked + writable):** as emo, `head -c 9 ~/code/infra/terraform.tfvars` shows the `GITCRYPT` magic (ciphertext); `cat ~/code/CLAUDE.md` is plaintext; `echo x >> ~/code/README.md && git -C ~/code commit -am wip` **succeeds** (writable, ungated). 
+- [ ] **Step 4: Verify (apply-gated, not repo-gated):** as emo, `cd ~/code/infra && scripts/tg apply ` **fails** (no write Vault token / cluster RBAC); `vault login -method=oidc` as emo cannot obtain vault-admin. Pushing to Forgejo does NOT trigger an apply (id=4355). So his edits can't take effect without an admin apply. +- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user writable git-crypt-locked infra clone"` + +--- + +## Phase 4 — Eligibility gate (Authentik group + edge) + +### Task 4.1: Create the `T3 Users` group + edge restriction + +**Files:** Modify `infra/stacks/authentik/admin-services-restriction.tf`; add the group resource + +- [ ] **Step 1:** Add `resource "authentik_group" "t3_users" { name = "T3 Users" }` (pattern: `stacks/authentik/guest.tf:53`). Add emo/ancamilea (and wizard) as members. +- [ ] **Step 2:** In the expression policy, add a dedicated branch BEFORE the final return: `if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`. +- [ ] **Step 3: Apply:** `vault login -method=oidc` then `scripts/tg apply` in `stacks/authentik` (claim `stack:authentik` first). +- [ ] **Step 4: Verify (gate):** `curl -sI` an unauthenticated request to `t3.viktorbarzin.me` → 302 to Authentik; a member login → reaches their instance; a logged-in NON-member → denied. Confirm the `authentik-walloff` probe stays green for any public carve-outs. +- [ ] **Step 5: Commit:** `git add infra/stacks/authentik/*.tf && git commit -m "workstation: gate t3.viktorbarzin.me to T3 Users group"` + +--- + +## Phase 5 — Migrate existing users (idle-gated, low-disruption) + +### Task 5.1: Cut emo over to his own writable locked clone (opt-in, reversible) + +**Files:** none (host state; an explicit one-time action — NOT the routine reconcile) + +- [ ] **Step 1: Prereqs.** Confirm emo inherits config (Phase 1) + has his scoped kubeconfig (Phase 2). 
(Phase 3 deliberately SKIPPED emo — his clone is created *here*.) +- [ ] **Step 2: Record rollback state.** Save `readlink -f ~emo/code` (symlink target), `id emo` (groups), a copy of `/home/emo/start-claude.sh`, and the `~/.claude/{rules,skills/file-issue}` symlink targets. This is the instant-rollback snapshot. +- [ ] **Step 3: Idle-gate + go-ahead.** Confirm emo's sessions are keystroke-idle ≥20 min (id=3201); if ambiguous, ASK. Opt-in — never auto-run by the reconcile. +- [ ] **Step 4: Cutover.** (a) `mv ~emo/code ~emo/code.symlink.bak`; provision his own writable locked clone at `~emo/code` (Phase-3 `install_infra_clone`, run explicitly for emo). (b) **Repoint his launcher (REQUIRED):** back up `/home/emo/start-claude.sh`, then change its `cd /home/wizard/code` → `cd "$HOME/code"`. The hardcoded `cd` is the *actual* mechanism landing him in wizard's tree — the symlink swap alone is insufficient. (c) Remove the now-redundant `~/.claude/rules` and `~/.claude/skills/file-issue` symlinks into wizard's home (managed layer / shared base delivers them now). (d) `gpasswd -d emo code-shared`. +- [ ] **Step 5: Verify.** As emo: `cat ~/code/CLAUDE.md` works (his clone); `head -c 9 ~/code/infra/terraform.tfvars` shows `GITCRYPT` ciphertext (locked); he can still `git -C ~/code commit` (ungated) but can no longer read wizard's unlocked secrets nor `scripts/tg apply`. emo's live t3 session still works (only a WS blip if `t3-serve@emo` was restarted). +- [ ] **Step 6: Rollback (seconds, if anything's off):** restore the `~emo/code` symlink (`rm -rf ~emo/code && ln -sfn ~emo/code`), restore `start-claude.sh` from its backup, recreate the `~/.claude/{rules,skills/file-issue}` symlinks, and `gpasswd -a emo code-shared` → emo back to his exact prior state. Otherwise record the cutover in a memory. 
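Step 2 of Task 5.1 (record rollback state) could be captured by a small helper like the one below. A sketch under assumptions: `snapshot_cutover_state` and the output file names are hypothetical, and it records the literal link text with plain `readlink` (rather than `readlink -f`) so that each symlink can be recreated exactly as it was.

```shell
#!/bin/sh
# Sketch only: snapshot_cutover_state and its file layout are hypothetical.
set -eu

# snapshot_cutover_state HOME_DIR OS_USER OUT_DIR
# Records the symlink targets, group list, and launcher that the cutover changes.
snapshot_cutover_state() {
  local home="$1" user="$2" out="$3" link
  mkdir -p "$out"
  # One "relative-path<TAB>target" row per symlink that actually exists.
  for link in code .claude/rules .claude/skills/file-issue; do
    if [ -L "$home/$link" ]; then
      printf '%s\t%s\n' "$link" "$(readlink "$home/$link")"
    fi
  done > "$out/symlinks.tsv"
  # Group names only, so a later "gpasswd -a" can restore membership.
  id -Gn "$user" > "$out/groups.txt"
  # Keep a byte-for-byte copy of the launcher before it is edited.
  if [ -f "$home/start-claude.sh" ]; then
    cp -p "$home/start-claude.sh" "$out/start-claude.sh.bak"
  fi
}
```

Taking the snapshot into a root-owned directory outside the user's home keeps it intact even if the cutover touches files under `/home/emo`.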
+ +### Task 5.2: Confirm ancamilea + a fresh test user end-to-end + +- [ ] **Step 1:** Confirm ancamilea logs into `t3.viktorbarzin.me` → her instance, inherits config, own-namespace kubectl only. +- [ ] **Step 2:** Add a throwaway roster entry, run `t3-provision-users.sh`, confirm the account+instance appear and login works; then remove it + `userdel` and confirm clean teardown. + +--- + +## Phase 6 — Template-readiness (design-for-now; convert when wanted) + +### Task 6.1: Verify reproducibility from git (no cloud-init yet) + +- [ ] **Step 1:** On a scratch VM (or a container), clone the infra repo and run `setup-devvm.sh` + `t3-provision-users.sh`; confirm the toolset + managed config + users reproduce. +- [ ] **Step 2 (promote out of deferred — do in the main rollout):** Add per-user home data to the 3-2-1 backup set NOW: at minimum `~/.t3` (pairings + 30-day sessions) + `~/.claude` (mutable state), ideally all of `/home`. A devvm rebuild otherwise silently loses every user's pairings + session state. +- [ ] **Step 3 (deferred):** When the template is wanted, wrap `setup-devvm.sh` + `t3-provision-users.sh` in cloud-init (the `modules/create-template-vm` pattern, memory id=1575) and snapshot the devvm as a Proxmox template. File a beads task; do not build now. + +--- + +## Phase 7 — Offboarding (deprovision; staged, gated) + +Removing a user = delete their `roster.yaml` entry, then: + +### Task 7.1: Reversible cut (driven by roster removal) + +- [ ] **Step 1:** On reconcile after the entry is gone: `systemctl disable --now t3-serve@`; regenerate `/etc/ttyd-user-map` + `dispatch.json` (user absent → dispatcher 403s); remove them from the `T3 Users` Authentik group (edge-blocked); `passwd -l `. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302→login, then denied) and can't log in. Nothing deleted yet. 
+- [ ] **Step 2 (cluster revoke):** remove their `k8s_users` entry + `scripts/tg apply` (drops their RBAC binding; OIDC kubeconfig stops authorizing); revoke any individually-held token/memory key. + +### Task 7.2: Destructive removal (explicit, separate, NEVER auto) + +- [ ] **Step 1:** Archive `~` → backup: `tar czf /mnt/backup/offboard/-.tar.gz /home/`. +- [ ] **Step 2:** `userdel -r ` (removes home + spool). **Irreversible — requires explicit go-ahead.** +- [ ] **Step 3: Rollback:** before 7.2, re-add the roster entry + reconcile restores everything; after 7.2, restore from the archive. +- [ ] **Step 4:** Write + commit `infra/docs/runbooks/offboard-user.md` (the `multi-tenancy.md` link to it is currently a dead end). + +--- + +## Self-review + +- **Spec coverage:** prerequisites/capacity + kubelogin (Ph−1), roster SSoT + config-base build (Ph0), config inheritance (Ph1), provisioning + per-tier OIDC kubectl + SSoT-derive/validate + secrets/auth + beads-cred (Ph2), infra code access via writable locked clone (Ph3), Authentik gate (Ph4), incremental non-breaking migration (Ph5), reproducibility/template + per-user backups (Ph6), **offboarding / full lifecycle (Ph7)** — all mapped. Per-user **memory isolation DEFERRED** (not a risk now). +- **Open verification carried as a task, not a placeholder:** the exact managed-skills path (Task 1.1) is a discovery spike with a concrete acceptance check. +- **Terraform-only respected:** the only cluster changes (Authentik group/policy, the power-user ClusterRole) go through `scripts/tg apply`; devvm host scripts are the accepted exception. +- **Docs:** multi-tenancy.md + service-catalog.md updates folded into the relevant commits (per the update-docs rule).