# Multi-Tenancy

## Overview

The cluster implements namespace-based multi-tenancy where each user receives their own Kubernetes namespace(s), RBAC roles, resource quotas, and CI/CD access. Onboarding is Vault-driven: add user metadata to `secret/platform → k8s_users`, apply Terraform stacks, and all resources (namespace, policies, RBAC, DNS, TLS) are auto-generated. Users access the cluster via OIDC authentication through Authentik and can self-service via k8s-portal.

## Architecture Diagram

```mermaid
graph TB
    A[Admin: Add to Authentik Groups] --> B[Admin: Add to Vault k8s_users]
    B --> C[Apply vault Stack]
    C --> D[Apply platform Stack]
    D --> E[Apply woodpecker Stack]

    C --> C1[Create Namespace]
    C --> C2[Create Vault Policy<br/>namespace-owner-user]
    C --> C3[Create Vault Identity<br/>Entity + OIDC Alias]
    C --> C4[Create K8s Deployer Role<br/>Vault K8s Auth]

    D --> D1[Create RBAC RoleBinding<br/>Namespace Admin]
    D --> D2[Create RBAC ClusterRoleBinding<br/>Cluster Read-Only]
    D --> D3[Create ResourceQuota]
    D --> D4[Create TLS Secret]
    D --> D5[Create Cloudflare DNS]

    E --> E1[Grant Woodpecker Admin]

    F[User: Run Setup Script] --> F1[Install kubectl, kubelogin,<br/>Vault CLI, Terraform]
    F1 --> F2[OIDC Login via Authentik]
    F2 --> G[kubectl Access]

    style A fill:#e74c3c
    style B fill:#e74c3c
    style C fill:#2088ff
    style D fill:#2088ff
    style E fill:#2088ff
    style F fill:#27ae60
```

## Components

| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik | Latest | `authentik` namespace | OIDC provider for K8s + Vault |
| Vault | Latest | `vault` namespace | Identity source, policy engine |
| k8s-portal | SvelteKit | `k8s-portal.viktorbarzin.me` | Self-service onboarding UI |
| Terraform (vault stack) | - | `stacks/vault/` | Namespace, Vault resources |
| Terraform (platform stack) | - | `stacks/platform/` | RBAC, quotas, DNS, TLS |
| Terraform (woodpecker stack) | - | `stacks/woodpecker/` | CI/CD admin access |
| Headscale | Latest | `headscale` namespace | VPN mesh network (user access) |

## How It Works

### Namespace-Owner Model

Each user receives:

1. **Kubernetes Namespace(s)**: Isolated workload environment
2. **Vault Policy**: Read/write access to `secret/data//*`
3. **RBAC Role**: Namespace admin (full control within namespace)
4. **RBAC ClusterRole**: Cluster read-only (view cluster resources)
5. **ResourceQuota**: CPU, memory, storage limits
6. **TLS Secret**: Wildcard cert for `*..viktorbarzin.me`
7. **DNS Records**: Cloudflare A/CNAME for user domains
8. **Woodpecker Admin**: Access to create repos and pipelines

### Onboarding Flow (3 Steps, No Code Changes)

#### Step 1: Authentik

**Action**: Admin adds user to groups

- `kubernetes-namespace-owners`
- `Headscale Users`

**Result**: User can authenticate to Vault and K8s via OIDC

#### Step 2: Vault KV

**Action**: Admin adds JSON entry to `secret/platform → k8s_users`

**Example**:

```json
{
  "alice": {
    "role": "namespace-owner",
    "namespaces": ["alice-prod", "alice-dev"],
    "domains": ["alice.viktorbarzin.me", "app.alice.viktorbarzin.me"],
    "quota": {
      "cpu": "4",
      "memory": "8Gi",
      "storage": "20Gi"
    }
  }
}
```

**Fields**:

- `role`: Always `namespace-owner` for standard users
- `namespaces`: List of K8s namespaces to create
- `domains`: Cloudflare DNS records to create
- `quota`: Per-namespace resource limits

#### Step 3: Apply Terraform Stacks

**Order matters** (dependencies):

1. **vault stack**:

   ```bash
   cd stacks/vault
   terragrunt apply
   ```

   - Creates namespaces
   - Creates Vault policy `namespace-owner-alice`
   - Creates Vault identity entity + OIDC alias
   - Creates K8s deployer role for Woodpecker CI

2. **platform stack**:

   ```bash
   cd stacks/platform
   terragrunt apply
   ```

   - Creates RBAC RoleBinding (namespace admin)
   - Creates RBAC ClusterRoleBinding (cluster read-only)
   - Creates ResourceQuota
   - Creates TLS Secret (wildcard cert from Let's Encrypt)
   - Creates Cloudflare DNS A/CNAME records

3. **woodpecker stack**:

   ```bash
   cd stacks/woodpecker
   terragrunt apply
   ```

   - Grants Woodpecker admin access for user's Forgejo repos

### Auto-Generated Resources Per User

| Resource | Name Pattern | Purpose |
|----------|--------------|---------|
| Namespace | `-prod`, `-dev` | Workload isolation |
| Vault Policy | `namespace-owner-` | Secret access control |
| Vault Identity Entity | `` | OIDC identity mapping |
| Vault OIDC Alias | Authentik sub claim | Link OIDC to entity |
| Vault K8s Role | `-deployer` | Woodpecker CI access |
| K8s Role | Auto-generated | Namespace admin permissions |
| RoleBinding | `-admin` | Bind user to namespace admin |
| ClusterRoleBinding | `-read-only` | Cluster-wide read access |
| ResourceQuota | `-quota` | CPU/memory/storage limits |
| Secret | `tls-` | Wildcard TLS cert |
| Cloudflare DNS | A/CNAME records | Domain routing |

### User Setup (Self-Service)

**k8s-portal**: `k8s-portal.viktorbarzin.me`

1. User logs in with Authentik
2. Downloads setup script
3. Runs script:

   ```bash
   curl https://k8s-portal.viktorbarzin.me/setup.sh | bash
   ```

4. Script installs:
   - `kubectl`
   - `kubelogin` (OIDC plugin)
   - `vault` CLI
   - `terraform`
   - `terragrunt`
5. User runs OIDC login:

   ```bash
   kubectl oidc-login setup \
     --oidc-issuer-url=https://auth.viktorbarzin.me/application/o/kubernetes/ \
     --oidc-client-id=kubernetes
   ```

6. User can now run `kubectl` commands

### RBAC Groups

| Group | ClusterRole | Scope | Members |
|-------|-------------|-------|---------|
| `kubernetes-admins` | `cluster-admin` | Full cluster access | Viktor |
| `kubernetes-power-users` | Custom | Elevated permissions | Senior users |
| `kubernetes-namespace-owners` | `namespace-admin` + `view` | Namespace admin + cluster read | All users |

### User CI/CD (Woodpecker)

**Flow**:

1. User creates repo in Forgejo
2. Forgejo username **must match** Vault `k8s_users` key (e.g., `alice`)
3. Woodpecker authenticates to Vault using K8s SA JWT
4. Vault issues namespace-scoped deployer token
5. Pipeline runs `kubectl` commands within user's namespace(s)

**Vault K8s Role** (auto-created per namespace):

```bash
vault write auth/kubernetes/role/alice-prod-deployer \
  bound_service_account_names=woodpecker-deployer \
  bound_service_account_namespaces=woodpecker \
  policies=namespace-owner-alice \
  ttl=1h
```

**Pipeline Example**:

```yaml
steps:
  deploy:
    image: bitnami/kubectl:latest
    commands:
      - kubectl apply -f k8s/ -n alice-prod
    secrets: [k8s_token]
```

## Configuration

### Vault k8s_users Entry

**Path**: `secret/platform → k8s_users`

**Full Example**:

```json
{
  "alice": {
    "role": "namespace-owner",
    "namespaces": ["alice-prod", "alice-dev"],
    "domains": [
      "alice.viktorbarzin.me",
      "app.alice.viktorbarzin.me",
      "api.alice.viktorbarzin.me"
    ],
    "quota": {
      "cpu": "4",
      "memory": "8Gi",
      "storage": "20Gi",
      "pods": "20"
    }
  },
  "bob": {
    "role": "namespace-owner",
    "namespaces": ["bob-staging"],
    "domains": ["bob.viktorbarzin.me"],
    "quota": {
      "cpu": "2",
      "memory": "4Gi",
      "storage": "10Gi"
    }
  }
}
```

### Vault Policy Template

**Auto-generated per user**:

```hcl
# Policy: namespace-owner-alice
path "secret/data/alice-prod/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

path "secret/data/alice-dev/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

path "secret/metadata/alice-prod/*" {
  capabilities = ["list"]
}

path "secret/metadata/alice-dev/*" {
  capabilities = ["list"]
}
```

### ResourceQuota Example

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: alice-prod-quota
  namespace: alice-prod
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    persistentvolumeclaims: "10"
    requests.storage: "20Gi"
    pods: "20"
```

### Factory Pattern for Multi-Instance Services

**Structure**:

```
stacks/
  actualbudget/
    main.tf        # Shared configuration
    factory/
      main.tf      # Per-user module
```

**main.tf** (service definition):

```hcl
# Shared NFS export, Cloudflare routes, etc.
```

**factory/main.tf** (per-user instance):

```hcl
module "alice" {
  source = "../"
  user   = "alice"
  domain = "budget.alice.viktorbarzin.me"
}

module "bob" {
  source = "../"
  user   = "bob"
  domain = "budget.bob.viktorbarzin.me"
}
```

**To add user**:

1. Export NFS share: `/mnt/data//`
2. Add Cloudflare route: `..viktorbarzin.me`
3. Add module block in `factory/main.tf`

**Examples**:

- `actualbudget`: Personal budgeting app
- `freedify`: Music streaming service

## Decisions & Rationale

### Why Namespace-Per-User?

**Alternatives considered**:

1. **Shared namespace**: No isolation, quota enforcement difficult
2. **Cluster-per-user**: Too expensive, management overhead
3. **Namespace-per-user (chosen)**: Balance isolation, quotas, RBAC

**Benefits**:

- Strong isolation (network policies, RBAC)
- Easy quota enforcement (ResourceQuota)
- Simple mental model (1 user = N namespaces)
- Scales to hundreds of users

### Why Vault-Driven Onboarding?

**Alternatives considered**:

1. **Manual YAML**: Error-prone, no audit trail
2. **CRD-based operator**: Complex, requires custom controller
3. **Vault + Terraform (chosen)**: Single source of truth, auditable

**Benefits**:

- Vault as identity source (integrates with OIDC)
- Terraform for declarative infrastructure
- Git-tracked changes (audit trail)
- Secrets rotation built-in

### Why Factory Pattern for Multi-Instance Apps?

**Alternatives considered**:

1. **Helm chart per user**: Duplication, drift risk
2. **Single shared instance**: No isolation, security risk
3. **Factory module (chosen)**: DRY, scalable

**Benefits**:

- No code duplication
- Easy to add users (one module block)
- Centralized updates (change `main.tf`, all instances update)

### Why OIDC Instead of Static Tokens?

**Alternatives considered**:

1. **Static ServiceAccount tokens**: Never expire, security risk
2. **X.509 client certs**: Complex rotation
3. **OIDC (chosen)**: Centralized auth, automatic rotation

**Benefits**:

- Tokens auto-expire (1h for deployer, 24h for user)
- Centralized user management (Authentik)
- Integrates with Vault identity engine
- Industry standard (OpenID Connect)

### Why ResourceQuota Over LimitRange?

- **ResourceQuota**: Total namespace consumption (e.g., max 8Gi memory)
- **LimitRange**: Per-pod limits (e.g., max 2Gi per pod)

**Choice**: ResourceQuota only

- Users manage their own pod limits
- Quota prevents runaway consumption
- Simpler mental model

## Troubleshooting

### User Can't Log In: "Unauthorized"

**Cause**: User not in Authentik `kubernetes-namespace-owners` group

**Fix**:

```bash
# Check user groups in Authentik UI
# Add to kubernetes-namespace-owners group
```

### User Has No Namespaces

**Cause**: `vault` stack not applied after adding to `k8s_users`

**Fix**:

```bash
cd stacks/vault
terragrunt apply
```

### User Can't Access Secrets in Vault

**Cause**: Vault policy not attached to identity entity

**Fix**:

```bash
# Check entity
vault read identity/entity/name/alice

# Check policy exists
vault policy read namespace-owner-alice

# Manually attach policy to entity
vault write identity/entity/name/alice policies=namespace-owner-alice
```

### Woodpecker Pipeline: "Forbidden"

**Cause**: Forgejo username doesn't match Vault `k8s_users` key

**Fix**:

```bash
# Rename Forgejo user to match Vault key
# OR update k8s_users key to match Forgejo username, then terragrunt apply
```

### ResourceQuota: "Forbidden: exceeded quota"

**Cause**: User exceeded namespace quota

**Fix**:

```bash
# Check quota usage
kubectl describe quota -n alice-prod

# User must delete resources or request quota increase
# To increase: update k8s_users in Vault, apply platform stack
```

### DNS Not Resolving

**Cause**: Cloudflare DNS not created by platform stack

**Fix**:

```bash
# Check domains in k8s_users (jq needs JSON output, hence -format=json)
vault kv get -format=json secret/platform | jq -r '.data.data.k8s_users.alice.domains'

# Apply platform stack
cd stacks/platform
terragrunt apply

# Verify in Cloudflare dashboard
```

### TLS Secret Missing

**Cause**: cert-manager failed to issue certificate

**Fix**:

```bash
# Check cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager

# Check Certificate resource
kubectl get certificate -n alice-prod

# Check CertificateRequest
kubectl describe certificaterequest -n alice-prod

# If Let's Encrypt rate limited, wait 1 week or use staging
```

### User Can't See Cluster Resources

**Cause**: ClusterRoleBinding not created

**Fix**:

```bash
# Check ClusterRoleBinding exists
kubectl get clusterrolebinding | grep alice

# Apply platform stack
cd stacks/platform
terragrunt apply
```

### Factory Pattern: New User Not Created

**Cause**: Module block not added to `factory/main.tf`

**Fix**:

```bash
# Edit factory/main.tf
cat >> stacks/actualbudget/factory/main.tf <