Viktor Barzin 0a6ed4b2fe workstation: per-user playwright browser MCP for all users, reproducible from git

Viktor asked that the playwright browser MCP be available for every devvm user
in every directory, with each user running their own server and multiple
concurrent sessions per user.

Before this, playwright was hand-set-up per user (~/.config/systemd/user/
playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired —
emo's and anca's servers ran but their ~/.claude.json had no playwright entry,
so their Claude never connected. None of it was reproducible from git (units,
refresh script, and the Vault snapshot token lived only in user homes), so a
devvm rebuild would silently lose it.

This makes it reproducible and fixes the unwired users:

- roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931,
  allocated for every roster user incl. the admin), emitted in the derive JSON.
- scripts/workstation/playwright/: system-level TEMPLATE units
  (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer},
  User=%i — system manager, so no systemd --user / linger) + the refresh script.
  @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll
  footgun, same rationale as T3_PIN).
- setup-devvm.sh: install the templates + script (9e); stage the chrome-service
  snapshot bearer token from Vault to a root file (8c) — the hourly root
  reconcile has no Vault token, mirrors the Claude OAuth staging in 8a.
- t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes
  PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json
  by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes
  existing/new/admin without rewriting a populated config), and enable --now's the
  instances (idempotent, never restarts a running server). Also hardened the
  section-1 *.env scan to skip the new playwright-*.env files (no T3_PORT -> grep
  no-match would abort under set -e -o pipefail).
- Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit
  commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3.

Supersedes the hand-made per-user --user units (one-time idle-gated migration to
follow on the live host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 20:33:47 +00:00


Multi-Tenancy

Overview

The cluster implements namespace-based multi-tenancy where each user receives their own Kubernetes namespace(s), RBAC roles, resource quotas, and CI/CD access. Onboarding is Vault-driven: add user metadata to secret/platform → k8s_users, apply Terraform stacks, and all resources (namespace, policies, RBAC, DNS, TLS) are auto-generated. Users access the cluster via OIDC authentication through Authentik and can self-service via k8s-portal.

Architecture Diagram

graph TB
    A[Admin: Add to Authentik Groups] --> B[Admin: Add to Vault k8s_users]
    B --> C[Apply vault Stack]
    C --> D[Apply platform Stack]
    D --> E[Apply woodpecker Stack]

    C --> C1[Create Namespace]
    C --> C2[Create Vault Policy<br/>namespace-owner-user]
    C --> C3[Create Vault Identity<br/>Entity + OIDC Alias]
    C --> C4[Create K8s Deployer Role<br/>Vault K8s Auth]

    D --> D1[Create RBAC RoleBinding<br/>Namespace Admin]
    D --> D2[Create RBAC ClusterRoleBinding<br/>Cluster Read-Only]
    D --> D3[Create ResourceQuota]
    D --> D4[Create TLS Secret]
    D --> D5[Create Cloudflare DNS]

    E --> E1[Grant Woodpecker Admin]

    F[User: Run Setup Script] --> F1[Install kubectl, kubelogin,<br/>Vault CLI, Terraform]
    F1 --> F2[OIDC Login via Authentik]
    F2 --> G[kubectl Access]

    style A fill:#e74c3c
    style B fill:#e74c3c
    style C fill:#2088ff
    style D fill:#2088ff
    style E fill:#2088ff
    style F fill:#27ae60

Components

Component	Version	Location	Purpose
Authentik	Latest	`authentik` namespace	OIDC provider for K8s + Vault
Vault	Latest	`vault` namespace	Identity source, policy engine
k8s-portal	SvelteKit	`k8s-portal.viktorbarzin.me`	Self-service onboarding UI
Terraform (vault stack)	-	`stacks/vault/`	Namespace, Vault resources
Terraform (platform stack)	-	`stacks/platform/`	RBAC, quotas, DNS, TLS
Terraform (woodpecker stack)	-	`stacks/woodpecker/`	CI/CD admin access
Headscale	Latest	`headscale` namespace	VPN mesh network (user access)

How It Works

Namespace-Owner Model

Each user receives:

Kubernetes Namespace(s): Isolated workload environment
Vault Policy: Read/write access to secret/data/<namespace>/*
RBAC Role: Namespace admin (full control within namespace)
RBAC ClusterRole: Cluster read-only (view cluster resources)
ResourceQuota: CPU, memory, storage limits
TLS Secret: Wildcard cert for *.<namespace>.viktorbarzin.me
DNS Records: Cloudflare A/CNAME for user domains
Woodpecker Admin: Access to create repos and pipelines

Onboarding Flow (3 Steps, No Code Changes)

Step 1: Authentik

Action: Admin adds user to groups

kubernetes-namespace-owners
Headscale Users

Result: User can authenticate to Vault and K8s via OIDC

Step 2: Vault KV

Action: Admin adds JSON entry to secret/platform → k8s_users

Example:

{
  "alice": {
    "role": "namespace-owner",
    "namespaces": ["alice-prod", "alice-dev"],
    "domains": ["alice.viktorbarzin.me", "app.alice.viktorbarzin.me"],
    "quota": {
      "cpu": "4",
      "memory": "8Gi",
      "storage": "20Gi"
    }
  }
}

Fields:

role: Always namespace-owner for standard users
namespaces: List of K8s namespaces to create
domains: Cloudflare DNS records to create
quota: Per-namespace resource limits
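
A minimal sketch of a pre-flight check for one entry, run before it is written to Vault. It assumes all four fields above are required and that `role` is always `namespace-owner`; `validate_k8s_user` is a hypothetical helper, not something the Terraform stacks provide.

```python
REQUIRED_FIELDS = ("role", "namespaces", "domains", "quota")


def validate_k8s_user(name, entry):
    """Return a list of problems found in one k8s_users entry (empty if none)."""
    # Assumption: every field listed above is mandatory.
    problems = [
        f"{name}: missing field '{field}'"
        for field in REQUIRED_FIELDS
        if field not in entry
    ]
    if "role" in entry and entry["role"] != "namespace-owner":
        problems.append(f"{name}: unexpected role '{entry['role']}'")
    for field in ("namespaces", "domains"):
        value = entry.get(field)
        if value is None:
            continue  # already reported as missing
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{name}: '{field}' must be a list of strings")
    if "quota" in entry and not isinstance(entry["quota"], dict):
        problems.append(f"{name}: 'quota' must be an object")
    return problems
```

An empty list means the entry has the expected shape. It says nothing about whether the namespaces or domains are already taken.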

Step 3: Apply Terraform Stacks

Order matters (dependencies):

vault stack:
```
cd stacks/vault
terragrunt apply
```
- Creates namespaces
- Creates Vault policy namespace-owner-alice
- Creates Vault identity entity + OIDC alias
- Creates K8s deployer role for Woodpecker CI

platform stack:
```
cd stacks/platform
terragrunt apply
```
- Creates RBAC RoleBinding (namespace admin)
- Creates RBAC ClusterRoleBinding (cluster read-only)
- Creates ResourceQuota
- Creates TLS Secret (wildcard cert from Let's Encrypt)
- Creates Cloudflare DNS A/CNAME records

woodpecker stack:
```
cd stacks/woodpecker
terragrunt apply
```
- Grants Woodpecker admin access for user's Forgejo repos

Auto-Generated Resources Per User

Resource	Name Pattern	Purpose
Namespace	`<username>-prod`, `<username>-dev`	Workload isolation
Vault Policy	`namespace-owner-<username>`	Secret access control
Vault Identity Entity	`<username>`	OIDC identity mapping
Vault OIDC Alias	Authentik sub claim	Link OIDC to entity
Vault K8s Role	`<namespace>-deployer`	Woodpecker CI access
K8s Role	Auto-generated	Namespace admin permissions
RoleBinding	`<username>-admin`	Bind user to namespace admin
ClusterRoleBinding	`<username>-read-only`	Cluster-wide read access
ResourceQuota	`<namespace>-quota`	CPU/memory/storage limits
Secret	`tls-<namespace>`	Wildcard TLS cert
Cloudflare DNS	A/CNAME records	Domain routing
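
The name patterns in the table can be expressed as a small lookup, which is handy when checking what should exist for a user. This is an illustrative sketch only: `derived_resource_names` is a hypothetical function, and the real names are produced by the Terraform stacks.

```python
def derived_resource_names(username, namespaces):
    """Expand the documented name patterns for one user and their namespaces."""
    return {
        # Per-user resources.
        "vault_policy": f"namespace-owner-{username}",
        "role_binding": f"{username}-admin",
        "cluster_role_binding": f"{username}-read-only",
        # Per-namespace resources.
        "per_namespace": {
            ns: {
                "vault_k8s_role": f"{ns}-deployer",
                "resource_quota": f"{ns}-quota",
                "tls_secret": f"tls-{ns}",
            }
            for ns in namespaces
        },
    }
```

For `alice` with `alice-prod`, this yields `alice-prod-deployer` and `alice-prod-quota`, matching the examples later in this document.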

User Setup (Self-Service)

k8s-portal: k8s-portal.viktorbarzin.me

User logs in with Authentik
Downloads setup script

Runs script:

curl https://k8s-portal.viktorbarzin.me/setup.sh | bash

Script installs:
- kubectl
- kubelogin (OIDC plugin)
- vault CLI
- terraform
- terragrunt

User runs OIDC login:

kubectl oidc-login setup \
  --oidc-issuer-url=https://auth.viktorbarzin.me/application/o/kubernetes/ \
  --oidc-client-id=kubernetes

User can now run kubectl commands

Web Dashboard (auto-login, no token paste)

Namespace-owners just log into https://k8s.viktorbarzin.me with their Authentik account and land straight in the dashboard scoped to their namespace — no token to paste. A token-injector (stacks/k8s-dashboard/dashboard_injector.tf) maps their Authentik identity (X-authentik-username) to their dashboard-<user> SA token (admin on their namespace + read-only on the namespace list & nodes only — they can't read other tenants' resources) and injects it as Authorization: Bearer. Forward-auth admits the kubernetes-* groups for this host (stacks/authentik/admin-services-restriction.tf).

Why not seamless OIDC SSO: the intended oauth2-proxy OIDC path is built but blocked — the apiserver rejects all Authentik OIDC tokens. The injector uses SA tokens (which the apiserver accepts) keyed off the forward-auth identity. See docs/architecture/authentication.md and docs/plans/2026-06-04-k8s-dashboard-sso-design.md §12.
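
The injector's mapping step can be sketched as a pure function over request headers. This is an assumption-heavy illustration and not the contents of `dashboard_injector.tf`: the function name and the in-memory token table are hypothetical, and dropping any client-supplied `Authorization` header is a design choice made for the sketch.

```python
def inject_dashboard_token(headers, sa_tokens):
    """Return headers to forward upstream, or None when no token applies.

    headers:   incoming request headers, already vetted by forward-auth
    sa_tokens: hypothetical mapping of ServiceAccount name -> bearer token
    """
    # Header names are case-insensitive, so compare in lower case.
    username = next(
        (v for k, v in headers.items() if k.lower() == "x-authentik-username"),
        None,
    )
    if not username:
        return None
    token = sa_tokens.get(f"dashboard-{username}")
    if token is None:
        return None
    # Assumption: never trust an Authorization header sent by the client.
    forwarded = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    forwarded["Authorization"] = f"Bearer {token}"
    return forwarded
```

Returning `None` for an unknown user keeps the failure mode closed: the request goes upstream without credentials rather than with someone else's token.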

RBAC Groups

Group	ClusterRole	Scope	Members
`kubernetes-admins`	`cluster-admin`	Full cluster access	Viktor
`kubernetes-power-users`	Custom	Elevated permissions	Senior users
`kubernetes-namespace-owners`	`namespace-admin` + `view`	Namespace admin + cluster read	All users

User CI/CD (Woodpecker)

Flow:

User creates repo in Forgejo
Forgejo username must match Vault k8s_users key (e.g., alice)
Woodpecker authenticates to Vault using K8s SA JWT
Vault issues namespace-scoped deployer token
Pipeline runs kubectl commands within user's namespace(s)

Vault K8s Role (auto-created per namespace):

vault write auth/kubernetes/role/alice-prod-deployer \
  bound_service_account_names=woodpecker-deployer \
  bound_service_account_namespaces=woodpecker \
  policies=namespace-owner-alice \
  ttl=1h

Pipeline Example:

steps:
  deploy:
    image: bitnami/kubectl:latest
    commands:
      - kubectl apply -f k8s/ -n alice-prod
    secrets: [k8s_token]

Configuration

Vault k8s_users Entry

Path: secret/platform → k8s_users

Full Example:

{
  "alice": {
    "role": "namespace-owner",
    "namespaces": ["alice-prod", "alice-dev"],
    "domains": [
      "alice.viktorbarzin.me",
      "app.alice.viktorbarzin.me",
      "api.alice.viktorbarzin.me"
    ],
    "quota": {
      "cpu": "4",
      "memory": "8Gi",
      "storage": "20Gi",
      "pods": "20"
    }
  },
  "bob": {
    "role": "namespace-owner",
    "namespaces": ["bob-staging"],
    "domains": ["bob.viktorbarzin.me"],
    "quota": {
      "cpu": "2",
      "memory": "4Gi",
      "storage": "10Gi"
    }
  }
}

Vault Policy Template

Auto-generated per user:

# Policy: namespace-owner-alice
path "secret/data/alice-prod/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

path "secret/data/alice-dev/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

path "secret/metadata/alice-prod/*" {
  capabilities = ["list"]
}

path "secret/metadata/alice-dev/*" {
  capabilities = ["list"]
}
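
To make the template's shape explicit, here is a sketch that renders the same policy text from a username and namespace list. The real policy is generated by the vault Terraform stack; `render_namespace_owner_policy` is a hypothetical stand-in used only to show the per-namespace expansion.

```python
DATA_CAPS = '["create", "read", "update", "delete", "list"]'
METADATA_CAPS = '["list"]'


def render_namespace_owner_policy(username, namespaces):
    """Render the namespace-owner policy text for one user."""
    lines = [f"# Policy: namespace-owner-{username}"]
    # Full access to the secret data of every owned namespace.
    for ns in namespaces:
        lines += [f'path "secret/data/{ns}/*" {{', f"  capabilities = {DATA_CAPS}", "}", ""]
    # List-only access to the matching metadata paths.
    for ns in namespaces:
        lines += [f'path "secret/metadata/{ns}/*" {{', f"  capabilities = {METADATA_CAPS}", "}", ""]
    return "\n".join(lines).rstrip() + "\n"
```

Calling it with `"alice"` and `["alice-prod", "alice-dev"]` produces four `path` stanzas in the same order as the example above.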

ResourceQuota Example

apiVersion: v1
kind: ResourceQuota
metadata:
  name: alice-prod-quota
  namespace: alice-prod
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    persistentvolumeclaims: "10"
    requests.storage: "20Gi"
    pods: "20"
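
The relationship between a user's `quota` object in k8s_users and the manifest above can be sketched as a key mapping. The mapping is inferred from the two examples in this document and is an assumption about what the platform stack does; `persistentvolumeclaims` is left out because it does not appear in the k8s_users examples.

```python
# Assumed mapping from k8s_users quota keys to ResourceQuota "hard" keys.
QUOTA_KEY_MAP = {
    "cpu": "requests.cpu",
    "memory": "requests.memory",
    "storage": "requests.storage",
    "pods": "pods",
}


def resource_quota_manifest(namespace, quota):
    """Build a ResourceQuota manifest (as a dict) for one namespace."""
    hard = {
        QUOTA_KEY_MAP[key]: str(value)
        for key, value in quota.items()
        if key in QUOTA_KEY_MAP  # unknown keys are ignored in this sketch
    }
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {"hard": hard},
    }
```

Since `quota` is per-namespace, a user with two namespaces would get two such objects with identical limits.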

Factory Pattern for Multi-Instance Services

Structure:

stacks/
  actualbudget/
    main.tf         # Shared configuration
    factory/
      main.tf       # Per-user module

main.tf (service definition):

# Shared NFS export, Cloudflare routes, etc.

factory/main.tf (per-user instance):

module "alice" {
  source = "../"
  user   = "alice"
  domain = "budget.alice.viktorbarzin.me"
}

module "bob" {
  source = "../"
  user   = "bob"
  domain = "budget.bob.viktorbarzin.me"
}

To add user:

Export NFS share: /mnt/data/<service>/<user>
Add Cloudflare route: <user>.<service>.viktorbarzin.me
Add module block in factory/main.tf

Examples:

actualbudget: Personal budgeting app
freedify: Music streaming service

Decisions & Rationale

Why Namespace-Per-User?

Alternatives considered:

Shared namespace: No isolation, quota enforcement difficult
Cluster-per-user: Too expensive, management overhead
Namespace-per-user (chosen): Balance isolation, quotas, RBAC

Benefits:

Strong isolation (network policies, RBAC)
Easy quota enforcement (ResourceQuota)
Simple mental model (1 user = N namespaces)
Scales to hundreds of users

Why Vault-Driven Onboarding?

Alternatives considered:

Manual YAML: Error-prone, no audit trail
CRD-based operator: Complex, requires custom controller
Vault + Terraform (chosen): Single source of truth, auditable

Benefits:

Vault as identity source (integrates with OIDC)
Terraform for declarative infrastructure
Git-tracked changes (audit trail)
Secrets rotation built-in

Why Factory Pattern for Multi-Instance Apps?

Alternatives considered:

Helm chart per user: Duplication, drift risk
Single shared instance: No isolation, security risk
Factory module (chosen): DRY, scalable

Benefits:

No code duplication
Easy to add users (one module block)
Centralized updates (change main.tf, all instances update)

Why OIDC Instead of Static Tokens?

Alternatives considered:

Static ServiceAccount tokens: Never expire, security risk
X.509 client certs: Complex rotation
OIDC (chosen): Centralized auth, automatic rotation

Benefits:

Tokens auto-expire (1h for deployer, 24h for user)
Centralized user management (Authentik)
Integrates with Vault identity engine
Industry standard (OpenID Connect)

Why ResourceQuota Over LimitRange?

ResourceQuota: Total namespace consumption (e.g., max 8Gi memory)
LimitRange: Per-pod limits (e.g., max 2Gi per pod)

Choice: ResourceQuota only

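Users manage their own pod limits (because the quota covers requests.cpu and requests.memory, every pod must declare those requests or the API server rejects it)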
Quota prevents runaway consumption
Simpler mental model

Troubleshooting

User Can't Log In: "Unauthorized"

Cause: User not in Authentik kubernetes-namespace-owners group

Fix:

# Check user groups in Authentik UI
# Add to kubernetes-namespace-owners group

User Has No Namespaces

Cause: vault stack not applied after adding to k8s_users

Fix:

cd stacks/vault
terragrunt apply

User Can't Access Secrets in Vault

Cause: Vault policy not attached to identity entity

Fix:

# Check entity
vault read identity/entity/name/alice

# Check policy exists
vault policy read namespace-owner-alice

# Manually attach policy to entity
vault write identity/entity/name/alice policies=namespace-owner-alice

Woodpecker Pipeline: "Forbidden"

Cause: Forgejo username doesn't match Vault k8s_users key

Fix:

# Rename Forgejo user to match Vault key
# OR update k8s_users key to match Forgejo username, then terragrunt apply

ResourceQuota: "Forbidden: exceeded quota"

Cause: User exceeded namespace quota

Fix:

# Check quota usage
kubectl describe quota -n alice-prod

# User must delete resources or request quota increase
# To increase: update k8s_users in Vault, apply platform stack

DNS Not Resolving

Cause: Cloudflare DNS not created by platform stack

Fix:

# Check domains in k8s_users (-format=json is needed so jq can parse the output)
vault kv get -format=json secret/platform | jq -r '.data.data.k8s_users.alice.domains'

# Apply platform stack
cd stacks/platform
terragrunt apply

# Verify in Cloudflare dashboard

TLS Secret Missing

Cause: cert-manager failed to issue certificate

Fix:

# Check cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager

# Check Certificate resource
kubectl get certificate -n alice-prod

# Check CertificateRequest
kubectl describe certificaterequest -n alice-prod

# If Let's Encrypt rate limited, wait 1 week or use staging

User Can't See Cluster Resources

Cause: ClusterRoleBinding not created

Fix:

# Check ClusterRoleBinding exists
kubectl get clusterrolebinding | grep alice

# Apply platform stack
cd stacks/platform
terragrunt apply

Factory Pattern: New User Not Created

Cause: Module block not added to factory/main.tf

Fix:

# Edit factory/main.tf
cat >> stacks/actualbudget/factory/main.tf <<EOF
module "charlie" {
  source = "../"
  user   = "charlie"
  domain = "budget.charlie.viktorbarzin.me"
}
EOF

# Apply
cd stacks/actualbudget/factory
terragrunt apply

DevVM Workstation (Claude Code multi-user)

Separate from the in-cluster namespace-owner model above, the devvm (10.0.10.10, VMID 102) hosts per-user Claude Code Workstations behind t3.viktorbarzin.me. It reuses the same identity backbone — the Vault k8s_users map and Authentik — but adds a devvm-side layer. Authoritative design + phased plan: docs/plans/2026-06-07-multi-user-workstation-{design,plan}.md (PRD: ViktorBarzin/infra#9).

Single source of truth: infra/scripts/workstation/roster.yaml (os_user → authentik_user / k8s_user / tier / namespaces). roster_engine.py (pytest-covered pure core) derives desired state; t3-provision-users (hourly timer) applies it — additive-only for existing users (never strips a group, replaces a home, or re-locks an account). /etc/ttyd-user-map + dispatch.json are generated from the roster (do not hand-edit).

RBAC tiers: admin (Viktor — cluster-admin, unlocked tree, secrets) · power-user (cluster-wide read-only, NO Secrets, via a dedicated oidc-power-user-readonly ClusterRole) · namespace-owner (admin in own namespace only). Each session acts as the user's own OIDC identity (kubelogin), never the admin's.

Config inheritance (live): wizard authors the base (his chezmoi-versioned ~/.claude). Two native layers carry it to every user — the enforced org claudeMd in /etc/claude-code/managed-settings.json (top precedence, all sessions) and per-user ~/.claude/{skills,rules,…} symlinks to the base (seeded via /etc/skel; edits propagate live). Secrets stay per-user at mode 600, never symlinked. The managed config self-deploys from the repo (2026-06-10): the hourly reconcile's sync_managed_config installs scripts/workstation/managed-settings.json to /etc/claude-code/ whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and refresh_codex_mirror regenerates each user's ~/.codex/AGENTS.md (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (.claude/CLAUDE.md, AGENTS.md, CONTEXT.md in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

Onboarding state self-heals (2026-06-15): ~/.claude.json is a single file that ALL of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including hasCompletedOnboarding — which bounces the next interactive session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, untouched by the race; first observed for emo 2026-06-15). The launcher (skel/start-claude.sh) now idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before it runs claude — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that /etc/skel only seeds at account creation, the reconcile's new deploy_user_launcher step re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — .tmux.conf is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
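
The merge-only re-assert can be illustrated in Python. The launcher itself is a shell script that uses jq, so this is a restatement of the logic and not the shipped code. `reassert_onboarding` is a hypothetical name, the version value is a caller-supplied parameter, and the write-then-rename step is an addition for the sketch.

```python
import json
from pathlib import Path


def reassert_onboarding(path, version=None):
    """Re-assert the onboarding keys in a claude.json file.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    path = Path(path)
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return False  # missing, empty or corrupt: do nothing
    if not isinstance(data, dict):
        return False
    wanted = {"hasCompletedOnboarding": True}
    if version is not None:
        wanted["lastOnboardingVersion"] = version
    if all(data.get(key) == value for key, value in wanted.items()):
        return False  # already correct, so the call is a no-op
    data.update(wanted)  # merge only: every other key is preserved
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    tmp.replace(path)  # assumption: swap the file in with a rename
    return True
```

The sketch does not remove the underlying read-modify-write race between concurrent writers. It only restores the keys right before launch, which is the same scope as the fix described above.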

Claude Code runtime — native, per-user (2026-06-15): claude is the native install (~/.local/bin/claude → ~/.local/share/claude/versions/<v>, self-updating; installMethod: native) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each t3-serve instance. setup-devvm.sh installs node ONLY for the t3 CLI (not claude); per-user native claude is provisioned by the reconcile's install_user_claude_native (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by start-claude.sh on first launch — both via the official https://claude.ai/install.sh. The legacy machine-wide npm install -g @anthropic-ai/claude-code bootstrap and the launcher's npx fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. PATH (~/.local/bin, where the native binary lives): ensured three ways — /etc/profile.d/10-local-bin.sh for login shells (machine-wide, fresh-user-safe), start-claude.sh itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and t3-serve@.service (Environment=PATH=…:/home/%i/.local/bin).

Per-user browser MCP — playwright, reproducible from git (2026-06-16): every user (incl. the admin) gets their OWN isolated @playwright/mcp server so their concurrent Claude sessions don't fight over tabs (--isolated → a fresh browser context per MCP connection), wired into Claude in every directory via a user-scope ~/.claude.json entry (playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp). Mechanism: system-level template units playwright-mcp@<user>.service + playwright-snapshot-refresh@<user>.{service,timer} (User=%i, sourced from scripts/workstation/playwright/, installed by setup-devvm.sh §9e — system manager, so NO systemd --user / linger). roster_engine.py allocates a sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931); the reconcile's install_playwright() writes it, seeds the chrome-service snapshot token if-absent (staged from Vault secret/chrome-service to /etc/t3-serve/chrome-service-token by setup-devvm.sh §8c, since the hourly root reconcile has no Vault token), wires ~/.claude.json by running claude mcp add --scope user AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --nows the instances (idempotent — never restarts a running server). The @playwright/mcp version is pinned in the unit (the @latest-silently-rolls-the-fleet footgun — see T3_PIN). Replaced the earlier hand-made ~/.config/systemd/user/playwright-* units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their .claude.json). Cookie-warming pipeline + ops: ../runbooks/chrome-service-snapshot.md.
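
The sticky port allocation can be pictured with the following sketch. It assumes "sticky" means an existing assignment is never changed and a new user takes the lowest unused port at or above `PLAYWRIGHT_BASE_PORT`. The function name and the shape of the stored mapping are hypothetical and not taken from `roster_engine.py`.

```python
PLAYWRIGHT_BASE_PORT = 8931


def allocate_playwright_ports(users, existing):
    """Return a user -> port mapping that keeps every existing assignment.

    users:    roster users in roster order (admin included)
    existing: previously recorded user -> port assignments
    """
    ports = dict(existing)  # sticky: recorded ports are never reassigned
    used = set(ports.values())
    candidate = PLAYWRIGHT_BASE_PORT
    for user in users:
        if user in ports:
            continue
        while candidate in used:
            candidate += 1
        ports[user] = candidate
        used.add(candidate)
    return ports
```

Keeping recorded assignments for users who have left the roster is a deliberate simplification here: it guarantees that a port is never handed to a second user while an old unit might still be bound to it.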

Infra access: non-admins get their own writable, git-crypt-LOCKED clone of the (public) infra repo — code/docs plaintext, secret files (*.tfvars, secrets/**) stay ciphertext. Its location depends on the per-user code_layout in roster.yaml: single (default) puts the clone AT ~/code; workspace makes ~/code a plain directory of per-project clones — the infra clone at ~/code/infra plus each roster repos entry cloned from Forgejo viktor/<name> as the user (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to workspace auto-migrates their existing ~/code clone to ~/code/infra (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + tripit since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; contribute access is wired per-user on top (see below). The apply boundary still holds (scripts/tg apply needs an admin Vault token + cluster RBAC), but pushing master is NOT inert — the Forgejo→Woodpecker webhook fires .woodpecker/default.yml (event: push, branch: master, require_approval: forks only), which terragrunt-applies changed stacks. master is branch-protected on Forgejo (force-push disabled for everyone — history is append-only; push + merge whitelists = viktor + explicitly granted users, deploy keys allowed). Allow-then-audit (Viktor, 2026-06-10): ebarzin (emo) is on the whitelist and pushes straight to master — no PR gate. The tracking burden moves to: (a) commit messages that record what + why (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the notify-nonadmin-push Slack audit step in .woodpecker/default.yml — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins never use [ci skip] so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to <user>/<topic> branches + PRs. 
Clones stay fresh automatically (2026-06-10): the hourly t3-provision-users reconcile runs refresh_user_clone over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward master, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also wire_forgejo_remote, which idempotently adds the documented forgejo remote + forgejo/master upstream to infra clones that predate that contract. start-claude.sh does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under ~/code).
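
The fast-forward gate described above reduces to a small decision function. This is a sketch of the stated conditions only. `CloneState` and `refresh_action` are hypothetical names, and the behaviour for a clone that is off master or has no upstream (a silent skip) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CloneState:
    branch: str          # currently checked-out branch
    dirty: bool          # uncommitted changes in the working tree
    has_upstream: bool   # master tracks a remote branch
    ahead: int = 0       # local commits not on the upstream


def refresh_action(state):
    """Decide what the hourly refresh does after fetching all remotes."""
    if state.branch != "master":
        return "skip"
    if state.dirty or state.ahead > 0:
        return "warn"  # left alone, with a WARN in the log
    if not state.has_upstream:
        return "skip"
    return "fast-forward"
```

Only the last branch ever moves the working tree, so a user's uncommitted or unpushed work is never rewritten by the timer.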

Contribute access (per non-admin, manual — the anca/tripit PAT precedent):

Add their Forgejo user as a write collaborator on viktor/infra (PUT /api/v1/repos/viktor/infra/collaborators/<login>).
Mint a PAT — the admin REST endpoint 404s here, use the in-pod CLI: kubectl -n forgejo exec deploy/forgejo -- su -s /bin/sh git -c "forgejo admin user generate-access-token --username <login> --token-name devvm-infra-git --scopes 'write:repository'".
Install it in their ~/.git-credentials (https://<login>:<token>@forgejo.viktorbarzin.me, mode 600) + git config --global credential.helper store, set user.name/user.email.
The reconcile wires the clone side automatically (wire_forgejo_remote): forgejo remote + master tracking forgejo/master on every non-admin infra clone (origin stays the anonymous GitHub mirror). No manual step since 2026-06-10.
(Optional — Viktor's call per user) Grant direct master push: add their login to the master branch-protection push + merge whitelists (PATCH /api/v1/repos/viktor/infra/branch_protections/master). Done for ebarzin 2026-06-10.
Verify: branch push succeeds; a master push succeeds for whitelisted users and is rejected with Not allowed to push to protected branch otherwise.

Web-terminal session persistence (2026-06-10): the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — tmux-persist-save.timer (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to /var/lib/tmux-persist/<user>.tsv, and tmux-persist-restore.service recreates missing sessions at boot with claude --resume <uuid> (per-session idempotent; also handles partial loss). The web terminal also exposes an on-demand "Restore sessions" button (terminal-lobby: tmux-api POST /restore → the validated root tmux-restore-user wrapper → tmux-persist restore <user>, a single-user mode of the same script): the boot-only restore service never fires when an OOM kills a user's tmux server without a reboot (the common case under multi-user memory pressure), so the button covers that gap. This is a tmux/terminal-surface feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (~/.t3 state, plus the daily t3-backup-state dump), and Claude conversations themselves were always durable (~/.claude/projects/) — what this adds is the volatile tmux wiring.
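
The per-session idempotent restore can be sketched as follows. The column order of the snapshot file (name, cwd, conversation uuid) is assumed from the description above, and `restore_commands` is a hypothetical helper that only builds the tmux invocations without running them.

```python
import shlex


def restore_commands(tsv_text, running_sessions):
    """Build one tmux command per saved session that is not running.

    tsv_text:         contents of a <user>.tsv snapshot
    running_sessions: names of tmux sessions that currently exist
    """
    commands = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        name, cwd, uuid = line.split("\t")
        if name in running_sessions:
            continue  # per-session idempotent: never duplicate a live session
        commands.append([
            "tmux", "new-session", "-d", "-s", name, "-c", cwd,
            f"claude --resume {shlex.quote(uuid)}",
        ])
    return commands
```

Because sessions that already exist are skipped, the same function covers both the boot-time restore and a partial loss where only some sessions disappeared.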

Status (2026-06-10): built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the oidc-power-user-readonly ClusterRole + emo's k8s_users entry (applied + impersonation-verified), the Authentik T3 Users edge gate, the emo Phase-5 cutover (own clone + launcher repoint + code-shared removal, completed 2026-06-10) and emo's contribute access (ebarzin write collaborator + PAT + protected master), and per-user code_layout with the ancamilea workspace cutover (infra → ~/code/infra, tripit alongside, 2026-06-10). Per the live /etc/skel design, non-admin ~/.claude/{rules,skills} symlinks into the admin base are kept (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). Remaining (held / future): the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (ha + claude_memory + .credentials.json + beads Dolt cred — per-user playwright browser MCP done 2026-06-16, see above), and roster-reconciled T3 Users membership. See ../runbooks/offboard-user.md for deprovisioning.

Related

CI/CD Pipeline — Per-user Woodpecker pipelines
Databases — Vault DB engine for per-user databases
Runbook: ../runbooks/onboard-user.md — Step-by-step onboarding guide
Runbook: ../runbooks/offboard-user.md — Remove user and resources
k8s-portal documentation: Self-service UI
Vault documentation: Identity secrets engine
