Compare commits
9 commits
master...feat/chatt
| Author | SHA1 | Date |
|---|---|---|
| | 368560fd92 | |
| | c64ead0112 | |
| | ff0cb9a0d0 | |
| | f39bb2b849 | |
| | 77f03c62af | |
| | f085016d52 | |
| | ed52d1646b | |
| | f8e8f31306 | |
| | 1bc5c92622 | |
21 changed files with 1806 additions and 34 deletions
@@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/<path>"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.203` (with `skip_verify = true`, since the node dials Traefik by IP but the cert is for `forgejo.viktorbarzin.me`) to avoid hairpin NAT. That redirect covers **kubelet pulls** only — in-cluster pods (notably Woodpecker buildkit build pods pushing images) resolve `forgejo.viktorbarzin.me` via a CoreDNS `rewrite name exact ... traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`), since they do NOT use the node containerd mirror; without it, buildkit pushes intermittently timed out on the public-IP hairpin (added 2026-06-04, beads code-yh33). **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left this redirect pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull (cached images kept running, so it stayed hidden until a new image tag was pulled). Redirect source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. 
Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.203` (with `skip_verify = true`, since the node dials Traefik by IP but the cert is for `forgejo.viktorbarzin.me`) to avoid hairpin NAT. That redirect covers **kubelet pulls** only — in-cluster pods (notably Woodpecker buildkit build pods pushing images) resolve `forgejo.viktorbarzin.me` via a CoreDNS `rewrite name exact ... traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`), since they do NOT use the node containerd mirror; without it, buildkit pushes intermittently timed out on the public-IP hairpin (added 2026-06-04, beads code-yh33). **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left this redirect pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull (cached images kept running, so it stayed hidden until a new image tag was pulled). Redirect source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. 
Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag (so `--cache-from`/`--cache-to` refs survive retention — added 2026-06-09); **went live (DRY_RUN=false) 2026-06-09** after verifying 0 running images on the delete set — the registry PVC is at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so live retention is what keeps it from filling. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
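The registry retention rule described in the **Private registry** entry above (keep the newest 10 versions, always keep `:latest`, keep any buildkit `*cache*` tag) can be sketched as a pure selection function. This is only an illustration, not the actual `forgejo-cleanup` job: the `Version` type, the `select_for_deletion` name, ordering by a creation timestamp, and the choice that protected tags do not count toward the 10 are all assumptions.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Version:
    tag: str
    created: int  # hypothetical: creation time as a Unix timestamp


def select_for_deletion(versions, keep_newest=10):
    """Return the tags a retention pass would delete (assumed semantics)."""

    def protected(v):
        # `latest` and buildkit cache tags are never deleted.
        return v.tag == "latest" or fnmatch(v.tag, "*cache*")

    candidates = [v for v in versions if not protected(v)]
    # Newest first; everything past the first `keep_newest` is deletable.
    candidates.sort(key=lambda v: v.created, reverse=True)
    return [v.tag for v in candidates[keep_newest:]]


# 15 ordinary versions plus the two protected tags.
versions = [Version(f"sha{i:02d}", created=i) for i in range(15)]
versions += [Version("latest", created=0), Version("buildcache", created=1)]
doomed = select_for_deletion(versions)
```

Keeping the selection separate from the deletion call is what makes a dry-run mode cheap: the same list can be logged instead of acted on.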
@@ -32,7 +32,7 @@
|---------|-------------|-------|
| k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-<user>` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
| reverse-proxy | Generic reverse proxy | reverse-proxy |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary tracks `nightly`** via `t3-autoupdate` (daily systemd timer; health-check + auto-rollback on a bad build; restarts only idle instances) — so new models (e.g. Opus 4.8) land as t3 ships them. Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@<u>` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@<u>` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 binary is PINNED** (`T3_PIN`, currently `0.0.24`) — `t3-autoupdate` is a daily *enforcer* that re-asserts the pin (a no-op when correct; restarts only idle instances), NOT a nightly tracker. It used to track `nightly`, but on 2026-06-09 a nightly bump migrated every `~/.t3/state.sqlite` forward (`role`→`scopes`) and changed the bootstrap API, breaking pairing for ALL users (post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`). t3 is pre-1.0, so moving the pin requires first verifying `t3-dispatch`'s bootstrap flow against the new build (expect 302 + `t3_session`). Pin set in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code |
## Active Use
| Service | Description | Stack |
@@ -116,7 +116,7 @@
| status-page | Status page | status-page |
| plotting-book | Book plotting/world-building app | plotting-book |
| tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy is ON-DEMAND, no scheduled job** (deliberate — short-term content, avoid rotting artifacts): mirror Drive→NFS via a throwaway `rclone/rclone` container using the existing `google_workspace` OAuth creds in Vault `secret/viktor` (`google_workspace_mcp_token_json`) → rsync to `/srv/nfs/stem-site` (empty-source guard). Just ask Claude to "sync stem95su from Drive" (recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync still works as a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su |
| trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek |
## Cloudflare Domains
@@ -85,12 +85,14 @@ Replaces the in-cluster nginx `t3-dispatch` (the session-mint needs `sudo` + loc
Per request (Authentik forward-auth has injected a trustworthy `X-authentik-username`):
1. Resolve `X-authentik-username` → OS user via `/etc/ttyd-user-map`. No mapping → **403**.
2. **Has a valid t3 session cookie?** → reverse-proxy (incl. WebSocket upgrade) to `127.0.0.1:<T3_PORT>`. (Steady state — the common path.)
3. **No cookie** (first visit / expired) → auto-pair:
2. **Has a valid t3 session cookie?** → reverse-proxy (incl. WebSocket upgrade) to `127.0.0.1:<T3_PORT>`. (Steady state — the common path.) Sub-requests (XHR/asset/WebSocket) take the cookie at face value; on a **top-level document navigation** the cookie is verified against the instance's `GET /api/auth/session` so a present-but-dead cookie doesn't slip through.
3. **No cookie, or an invalid cookie on a document navigation** (first visit / expired / server-side session wiped) → auto-pair:
   - `sudo -u <os_user> t3 auth pairing create --base-dir /home/<os_user>/.t3 --ttl 5m --json` → one-time token.
   - exchange it at the instance's `POST /api/auth/bootstrap` → capture the returned `Set-Cookie`.
   - relay that `Set-Cookie` to the browser + `302 /`. Browser now holds the t3 session cookie → next request is the steady-state path. **Login → straight in.**
> **As-built note (2026-06-09):** the first implementation re-paired only on an *absent* cookie. After an auth-schema rollback wiped every server-side session, browsers still held live-looking-but-dead 30-day `t3_session` cookies, which the dispatcher proxied straight through → t3 rendered its pair page (the "all users must pair again" incident). Fixed by validating a present cookie via `/api/auth/session` and re-pairing on `authenticated:false`, **gated to document navigations** (`isDocumentNav`: trust `Sec-Fetch-Dest: document`, else fall back to `Accept: text/html`) so XHR/asset/WebSocket sub-requests are never answered with a `302`, and **fail-open** (proxy through) on any validation error so no new failure mode is introduced. See `scripts/t3-dispatch/main.go` (`sessionValid`, `isDocumentNav`) + `main_test.go`.

Implementation: a small reverse proxy that supports WebSocket upgrade (Go `httputil.ReverseProxy`, or Python aiohttp), chosen at plan time.
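The per-request decision described above can be sketched as a small pure function. This is a hedged Python illustration of the logic only, not the Go code in `scripts/t3-dispatch/main.go`: the names `decide` and `is_document_nav`, the string return values, and the `session_valid` callback (standing in for the `GET /api/auth/session` check) are hypothetical. It also assumes the `Accept` fallback applies only when `Sec-Fetch-Dest` is absent, and that header names arrive in the exact casing shown.

```python
def is_document_nav(headers):
    """Top-level document navigation? Trust Sec-Fetch-Dest when the browser
    sends it; otherwise fall back to the Accept header (assumed reading)."""
    dest = headers.get("Sec-Fetch-Dest")
    if dest is not None:
        return dest == "document"
    return "text/html" in headers.get("Accept", "")


def decide(username, user_map, headers, has_cookie, session_valid):
    """Return "403", "proxy", or "pair" for one request."""
    if username not in user_map:
        return "403"  # step 1: no mapping
    if not has_cookie:
        return "pair"  # step 3: first visit / expired cookie
    if not is_document_nav(headers):
        return "proxy"  # sub-requests take the cookie at face value
    try:
        ok = session_valid()
    except Exception:
        return "proxy"  # fail-open on any validation error
    return "proxy" if ok else "pair"


def unreachable():
    raise TimeoutError("session endpoint unreachable")


user_map = {"emil.barzin": "emo"}
doc = {"Sec-Fetch-Dest": "document"}
xhr = {"Sec-Fetch-Dest": "empty", "Accept": "application/json"}

dead_cookie_on_nav = decide("emil.barzin", user_map, doc, True, lambda: False)
dead_cookie_on_xhr = decide("emil.barzin", user_map, xhr, True, lambda: False)
check_failed = decide("emil.barzin", user_map, doc, True, unreachable)
```

Only the first of the three cases re-pairs; the XHR is proxied without validation, and the failed check is proxied through rather than turned into a new error path.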
### 4. Terraform — `stacks/t3code` shrinks
469
docs/plans/2026-06-09-workstation-authentik-membership-plan.md
Normal file
@@ -0,0 +1,469 @@
# Workstation Membership v2: Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax. This is **infra** work: the engine tasks are real pytest TDD; the host/Authentik tasks "verify" via an idempotent re-run + a smoke check with expected output. Honor the Terraform-only rule for cluster/Authentik changes (`scripts/tg apply`); devvm host scripts are the accepted exception. Claim `host:devvm` before host mutations and `stack:authentik` before applying Authentik.

**Goal:** Make the Authentik `T3 Users` group membership the single source of truth for who gets a devvm workstation account, identified by email; retire `roster.yaml`.

**Architecture:** The provisioner reads `T3 Users` members from the Authentik API (read-only token) instead of `roster.yaml`. A pure engine derives the Linux `os_user` from each member's email (or an `os_user` Authentik attribute override) and produces the same desired-state shape v1 already applies. Workstation access stays fully decoupled from cluster RBAC (`k8s_users` untouched). wizard is special-cased as the admin/owner.

**Tech Stack:** Python (pure engine, pytest) + Bash (provisioner) + `jq`/`curl` (Authentik API) + Terraform (`stacks/authentik`: read-only token, drop HCL members).

**Design:** `infra/docs/plans/2026-06-09-workstation-authentik-membership-design.md`.

---

## File structure

- Modify: `infra/scripts/workstation/roster_engine.py` - add `derive_os_user()` + `roster_from_members()` (pure).
- Modify: `infra/scripts/workstation/test_roster_engine.py` - tests for the two new functions.
- Modify: `infra/scripts/t3-provision-users.sh` - source members from the Authentik API instead of `roster.yaml`.
- Modify: `infra/scripts/workstation/setup-devvm.sh` - drop the read-only Authentik token to `/etc/t3-serve/authentik-token`.
- Create: `infra/stacks/authentik/t3-provision-token.tf` - read-only service account + API token.
- Modify: `infra/stacks/authentik/t3-users.tf` - drop the HCL `users` list (membership becomes Authentik-managed).
- Delete: `infra/scripts/workstation/roster.yaml` (Task 7).
- Modify: `infra/.claude/reference/service-catalog.md`, `infra/docs/architecture/multi-tenancy.md` (Task 7).

---
## Task 1: Engine `derive_os_user()`

**Files:** Modify `infra/scripts/workstation/roster_engine.py`; Test `infra/scripts/workstation/test_roster_engine.py`

- [ ] **Step 1: Write the failing tests** (append to `test_roster_engine.py`)

```python
# --- derive_os_user: email/attribute -> Linux username (v2) ---

def test_derive_os_user_sanitizes_email_local_part():
    assert eng.derive_os_user("emil.barzin@gmail.com", None) == "emil_barzin"


def test_derive_os_user_attribute_overrides():
    assert eng.derive_os_user("emil.barzin@gmail.com", "emo") == "emo"


def test_derive_os_user_lowercases_and_replaces_unsafe_runs():
    assert eng.derive_os_user("Weird.Name+tag@x.com", None) == "weird_name_tag"


def test_derive_os_user_truncates_to_32():
    long = ("a" * 40) + "@x.com"
    assert eng.derive_os_user(long, None) == "a" * 32


def test_derive_os_user_blank_attribute_is_ignored():
    assert eng.derive_os_user("emil.barzin@gmail.com", "") == "emil_barzin"
```

- [ ] **Step 2: Run to verify they fail**

Run: `cd infra/scripts/workstation && python3 -m pytest test_roster_engine.py -k derive_os_user -q`
Expected: FAIL with `AttributeError: module 'roster_engine' has no attribute 'derive_os_user'`

- [ ] **Step 3: Implement** (add to `roster_engine.py`, after `RosterError`)

```python
import re  # keep with the module's top-level imports (ruff's default E402 flags mid-file imports)

_MAX_USERNAME = 32


def derive_os_user(email: str, os_user_attr: str | None) -> str:
    """Linux username for a workstation member: the explicit `os_user` Authentik
    attribute if set, else the email local-part sanitized to a valid username
    (lowercase; runs of non [a-z0-9_-] -> '_'; stripped; <=32 chars)."""
    if os_user_attr:
        return os_user_attr
    local = email.split("@", 1)[0].lower()
    cleaned = re.sub(r"[^a-z0-9_-]+", "_", local).strip("_")
    return cleaned[:_MAX_USERNAME]
```

- [ ] **Step 4: Run to verify they pass**

Run: `python3 -m pytest test_roster_engine.py -k derive_os_user -q`
Expected: PASS (5 passed)

- [ ] **Step 5: Commit**

```bash
cd /home/wizard/code/infra
git add scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py
git commit -m "workstation: engine derive_os_user (email/attribute -> Linux username)"
```
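The sanitization rule from Task 1 can be probed outside the test suite with a few edge-case inputs. The function body below is restated from Step 3 so the snippet runs on its own (with `Optional` instead of `str | None` for older interpreters); the sample addresses are made up. Two behaviours are worth noticing, assuming the function is used exactly as written: a local-part with no safe characters yields an empty string, and an `os_user` override is returned without any sanitization. Whether callers should reject either case is not something the plan states.

```python
import re
from typing import Optional

_MAX_USERNAME = 32


def derive_os_user(email: str, os_user_attr: Optional[str]) -> str:
    # Restated from Step 3 so this snippet is self-contained.
    if os_user_attr:
        return os_user_attr
    local = email.split("@", 1)[0].lower()
    cleaned = re.sub(r"[^a-z0-9_-]+", "_", local).strip("_")
    return cleaned[:_MAX_USERNAME]


# Hypothetical sample addresses, chosen to probe the edges of the rule.
mixed_case = derive_os_user("Jane.Doe@example.com", None)
dot_run = derive_os_user("a..b@example.com", None)
no_safe_chars = derive_os_user("+++@example.com", None)
raw_override = derive_os_user("jane.doe@example.com", "Bad Name")
```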
---

## Task 2: Engine `roster_from_members()`

Builds a `Roster` (the v1 type `derive_desired_state` already consumes) from the Authentik member list, so the existing tested derivation is reused unchanged.

**Files:** Modify `roster_engine.py`; Test `test_roster_engine.py`

- [ ] **Step 1: Write the failing tests**
```python
# --- roster_from_members: Authentik members -> Roster (v2) ---

MEMBERS = [
    {"email": "vbarzin@gmail.com", "os_user": "wizard"},
    {"email": "emil.barzin@gmail.com", "os_user": "emo"},
    {"email": "ancaelena98@gmail.com", "os_user": "ancamilea"},
]
ADMINS = {"vbarzin@gmail.com"}


def test_roster_from_members_maps_identity_fields():
    r = eng.roster_from_members(MEMBERS, ADMINS)
    u = r.users["emo"]
    assert u.os_user == "emo"
    assert u.authentik_user == "emil.barzin"  # email local-part = t3-dispatch key
    assert u.k8s_user == "emil.barzin@gmail.com"  # email = identity
    assert u.tier == "power-user"  # non-admin


def test_roster_from_members_admin_by_email():
    r = eng.roster_from_members(MEMBERS, ADMINS)
    assert r.users["wizard"].tier == "admin"


def test_roster_from_members_derives_os_user_when_no_override():
    r = eng.roster_from_members([{"email": "jane.doe@x.com", "os_user": None}], set())
    assert "jane_doe" in r.users
    assert r.users["jane_doe"].tier == "power-user"


def test_roster_from_members_raises_on_os_user_collision():
    members = [{"email": "a@x.com", "os_user": "dup"}, {"email": "b@y.com", "os_user": "dup"}]
    with pytest.raises(eng.RosterError, match="collision"):
        eng.roster_from_members(members, set())


def test_roster_from_members_reuses_derive_desired_state():
    r = eng.roster_from_members(MEMBERS, ADMINS)
    ds = eng.derive_desired_state(r, {"wizard": 3773, "emo": 3774, "ancamilea": 3775})
    assert ds.dispatch["emil.barzin"] == {"os_user": "emo", "port": 3774}
    assert ds.accounts["wizard"].groups == ("code-shared", "docker", "sudo")
    assert ds.accounts["emo"].groups == ()
```

- [ ] **Step 2: Run to verify they fail**

Run: `python3 -m pytest test_roster_engine.py -k roster_from_members -q`
Expected: FAIL with `AttributeError: ... 'roster_from_members'`

- [ ] **Step 3: Implement** (add to `roster_engine.py`)
```python
def roster_from_members(members: list[dict], admin_emails: set[str]) -> Roster:
    """Build a Roster from Authentik `T3 Users` members. Each member dict has
    `email` and optional `os_user`. tier = admin iff the email is in admin_emails,
    else power-user (a non-admin workstation: no groups, locked clone). Raises on
    an os_user collision (two emails resolving to the same Linux username)."""
    users: dict[str, User] = {}
    for m in members:
        email = m["email"]
        os_user = derive_os_user(email, m.get("os_user"))
        if os_user in users:
            raise RosterError(
                f"os_user collision: {email!r} and {users[os_user].k8s_user!r} "
                f"both resolve to {os_user!r} (set an os_user attribute to disambiguate)"
            )
        tier = "admin" if email in admin_emails else "power-user"
        users[os_user] = User(
            os_user=os_user,
            authentik_user=email.split("@", 1)[0],
            k8s_user=email,
            tier=tier,
            namespaces=(),
        )
    return Roster(users)
```

- [ ] **Step 4: Run the whole suite**

Run: `python3 -m pytest test_roster_engine.py -q && ruff check roster_engine.py test_roster_engine.py`
Expected: PASS (all, incl. the v1 tests) + ruff clean

- [ ] **Step 5: Commit**

```bash
git add scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py
git commit -m "workstation: engine roster_from_members (Authentik members -> Roster, reuses derive)"
```
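The collision check in Task 2 exists because distinct addresses can sanitize to the same Linux username, for example the same local-part at two domains, or `.` versus `_` in otherwise identical local-parts. The sketch below finds such groups ahead of time for a list of emails. It is a standalone illustration: `local_username` restates only the sanitization branch of `derive_os_user` (no attribute override), and `find_collisions` plus the sample addresses are hypothetical.

```python
import re


def local_username(email):
    # Sanitization branch of derive_os_user, restated; ignores os_user overrides.
    local = email.split("@", 1)[0].lower()
    return re.sub(r"[^a-z0-9_-]+", "_", local).strip("_")[:32]


def find_collisions(emails):
    """Group emails by derived username; keep only groups with more than one."""
    by_user = {}
    for email in emails:
        by_user.setdefault(local_username(email), []).append(email)
    return {user: group for user, group in by_user.items() if len(group) > 1}


collisions = find_collisions(
    ["jane.doe@example.com", "jane_doe@example.org", "bob@example.com"]
)
```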
---

## Task 3: Read-only Authentik token (Terraform)

**Files:** Create `infra/stacks/authentik/t3-provision-token.tf`

- [ ] **Step 1: Write the resources** (service account + API token + view permissions)
```hcl
# Read-only service account whose token the devvm provisioner uses to list
# "T3 Users" members. View-only: it can read users + groups, nothing else.
resource "authentik_user" "t3_provision" {
  username = "t3-provision-bot"
  name     = "T3 Provision (read-only)"
  type     = "service_account"
  path     = "service-accounts"
}

resource "authentik_token" "t3_provision" {
  identifier   = "t3-provision-readonly"
  user         = authentik_user.t3_provision.id
  intent       = "api"
  description  = "devvm t3-provision-users: read T3 Users membership"
  retrieve_key = true
}

# Global view permissions for the service account (users + groups read only).
resource "authentik_rbac_permission_user" "t3_provision_view_user" {
  user       = authentik_user.t3_provision.id
  permission = "authentik_core.view_user"
}

resource "authentik_rbac_permission_user" "t3_provision_view_group" {
  user       = authentik_user.t3_provision.id
  permission = "authentik_core.view_group"
}

output "t3_provision_token" {
  value     = authentik_token.t3_provision.key
  sensitive = true
}
```
|
||||
|
||||
- [ ] **Step 2: Apply** (claim first)
|
||||
|
||||
```bash
|
||||
~/code/scripts/presence claim stack:authentik --purpose "v2: read-only t3-provision token"
|
||||
export VAULT_ADDR=https://vault.viktorbarzin.me && vault login -method=oidc
|
||||
cd /home/wizard/code/infra/stacks/authentik && ../../scripts/tg apply -target=authentik_user.t3_provision -target=authentik_token.t3_provision -target=authentik_rbac_permission_user.t3_provision_view_user -target=authentik_rbac_permission_user.t3_provision_view_group --non-interactive
|
||||
```
|
||||
Expected: 4 added. (If the `authentik_rbac_permission_user` resource/permission codename differs in the installed provider, run `../../scripts/tg console` / check the provider docs and adjust the codename; verify in Step 3.)
|
||||
|
||||
- [ ] **Step 3: Store the token in Vault + verify it is read-only**
|
||||
|
||||
```bash
|
||||
TOK=$(../../scripts/tg output -raw t3_provision_token)
|
||||
vault kv patch secret/authentik t3_provision_token="$TOK"
|
||||
# verify: can LIST T3 Users members...
|
||||
curl -sk -H "Authorization: Bearer $TOK" "https://authentik.viktorbarzin.me/api/v3/core/users/?groups_by_name=T3%20Users" | jq -r '.results[].email'
|
||||
# ...but CANNOT write (expect 403):
|
||||
curl -sk -o /dev/null -w '%{http_code}\n' -X PATCH -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' -d '{"name":"x"}' "https://authentik.viktorbarzin.me/api/v3/core/users/14/"
|
||||
```
|
||||
Expected: the three emails listed; the PATCH returns `403`.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/authentik/t3-provision-token.tf
|
||||
git commit -m "workstation: read-only Authentik token for the t3-provision membership query"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: setup-devvm.sh — stage the token for the root provisioner
|
||||
|
||||
**Files:** Modify `infra/scripts/workstation/setup-devvm.sh`
|
||||
|
||||
- [ ] **Step 1: Add a token-staging step** (after step 6, before the final `log "OK"`). The hourly provisioner runs as root with no Vault token, so `setup-devvm.sh` (run by wizard, who can read Vault) drops it to a root-only file.
|
||||
|
||||
```bash
|
||||
# 8) stage the read-only Authentik token for the root provisioner's membership query.
|
||||
if command -v vault >/dev/null; then
|
||||
export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}"
|
||||
if tok="$(vault kv get -field=t3_provision_token secret/authentik 2>/dev/null)"; then
|
||||
install -m 0600 /dev/stdin /etc/t3-serve/authentik-token <<<"$tok"
|
||||
log "staged /etc/t3-serve/authentik-token (read-only Authentik API)"
|
||||
else
|
||||
log "WARN: t3_provision_token not in Vault -> Authentik membership query will be skipped"
|
||||
fi
|
||||
fi
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Run + verify**
|
||||
|
||||
Run: `sudo bash /home/wizard/code/infra/scripts/workstation/setup-devvm.sh 2>&1 | grep -E 'authentik-token|OK'` then `sudo stat -c '%a %U' /etc/t3-serve/authentik-token`
|
||||
Expected: "staged ... authentik-token" + `OK`; perms `600 root`.
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/workstation/setup-devvm.sh
|
||||
git commit -m "workstation: setup-devvm.sh stages the read-only Authentik token (root-only)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Provisioner — source members from Authentik (replace roster.yaml)
|
||||
|
||||
**Files:** Modify `infra/scripts/t3-provision-users.sh`
|
||||
|
||||
- [ ] **Step 1: Add a members-fetch + swap the engine call.** Replace the roster-read/derive block. Fetch members from Authentik (best-effort); build the members JSON `[{email, os_user}]`; pass it to the engine via a new `derive-members` subcommand that takes `--members-json`.
|
||||
|
||||
First extend the engine CLI (`roster_engine.py` `_main`): add `derive-members` that reads a members JSON + ports JSON + admin emails and emits the same desired-state JSON.
|
||||
|
||||
```python
|
||||
    # in _main(), add a subparser:
    pm = sub.add_parser("derive-members", help="desired state from an Authentik member list")
    pm.add_argument("--members-json", required=True)
    pm.add_argument("--ports-json", required=True)
    pm.add_argument("--admin-emails", default="", help="comma-separated admin emails")
    # ...in the dispatch:
    if args.cmd == "derive-members":
        with open(args.members_json, encoding="utf-8") as fh:
            members = json.load(fh)
        with open(args.ports_json, encoding="utf-8") as fh:
            ports = json.load(fh)
        admins = {e for e in args.admin_emails.split(",") if e}
        ds = derive_desired_state(roster_from_members(members, admins), ports)
        json.dump(_desired_state_to_dict(ds), sys.stdout, indent=2, sort_keys=True)
        sys.stdout.write("\n")
        return 0
|
||||
```
|
||||
|
||||
In `t3-provision-users.sh`, replace the `ROSTER`/validate/derive section with:
|
||||
|
||||
```bash
|
||||
AUTHENTIK_URL="${AUTHENTIK_URL:-https://authentik.viktorbarzin.me}"
TOKEN_FILE="${TOKEN_FILE:-/etc/t3-serve/authentik-token}"
T3_GROUP="${T3_GROUP:-T3 Users}"
ADMIN_EMAILS="${WORKSTATION_ADMIN_EMAILS:-vbarzin@gmail.com}"

members_file="$(mktemp)"; trap 'rm -f "$ports_file" "$members_file" "${desired_file:-}"' EXIT
if [[ -r "$TOKEN_FILE" ]]; then
  tok="$(cat "$TOKEN_FILE")"
  if curl -sf -H "Authorization: Bearer $tok" --get \
       --data-urlencode "groups_by_name=$T3_GROUP" \
       "$AUTHENTIK_URL/api/v3/core/users/" \
     | jq -c '[.results[] | select(.is_active) | {email: .email, os_user: (.attributes.os_user // null)}]' \
     > "$members_file" && [[ -s "$members_file" ]]; then
    :
  else
    log "WARN: Authentik membership query failed -> no membership change this run"; echo '[]' > "$members_file"
    SKIP_RECONCILE=1
  fi
else
  log "WARN: $TOKEN_FILE absent -> no membership change this run"; echo '[]' > "$members_file"; SKIP_RECONCILE=1
fi

if [[ "${SKIP_RECONCILE:-0}" == 1 ]]; then log "reconcile skipped (no Authentik membership)"; exit 0; fi

desired_file="$(mktemp)"
python3 "$ENGINE" derive-members --members-json "$members_file" --ports-json "$ports_file" --admin-emails "$ADMIN_EMAILS" > "$desired_file"
jq -e . "$desired_file" >/dev/null || { echo "[t3-provision] derive-members produced invalid JSON" >&2; exit 1; }
|
||||
```
|
||||
|
||||
(Keep steps 4-6 of the existing script — accounts/groups/clone/kubeconfig, .env/enable, regen map/dispatch — unchanged; they consume `$desired_file`.)
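
The jq filter and the skip-on-failure rule in the shell block can be restated as a small pure function, which may be easier to reason about (or unit-test) than the pipeline. This is a sketch only: `members_from_payload` is a hypothetical helper, not part of `roster_engine.py`, and the example payload is made up to match the fields the jq filter reads (`results`, `is_active`, `email`, `attributes.os_user`).

```python
import json


def members_from_payload(payload_text):
    """Return [{email, os_user}] for active users, or None if the payload is unusable.

    None mirrors the shell's SKIP_RECONCILE=1 branch: a failed or malformed
    query must mean "change nothing this run", never "the group is empty".
    """
    try:
        results = json.loads(payload_text)["results"]
    except (ValueError, KeyError, TypeError):
        return None
    if not isinstance(results, list):
        return None
    members = []
    for user in results:
        if not user.get("is_active"):
            continue  # same effect as jq's select(.is_active)
        attrs = user.get("attributes") or {}
        # jq's `// null` maps a missing or false attribute to null.
        members.append({"email": user.get("email"), "os_user": attrs.get("os_user") or None})
    return members


# Made-up payload in the shape the jq filter expects.
example_payload = json.dumps(
    {
        "results": [
            {"email": "a@example.com", "is_active": True, "attributes": {"os_user": "wizard"}},
            {"email": "b@example.com", "is_active": True, "attributes": {}},
            {"email": "c@example.com", "is_active": False, "attributes": {}},
        ]
    }
)
example_members = members_from_payload(example_payload)
```

One property worth noting: a successful query that returns zero members yields `[]`, not `None`. The shell version behaves the same way, because jq still writes `[]` and the `-s` test passes, so an empty `T3 Users` group is treated as a real (empty) membership and the reconcile proceeds.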
|
||||
|
||||
- [ ] **Step 2: shellcheck + DRY_RUN** (with the staged token present)
|
||||
|
||||
Run: `cd /home/wizard/code/infra/scripts && shellcheck -S warning t3-provision-users.sh && sudo DRY_RUN=1 bash t3-provision-users.sh 2>&1 | grep -iE 'clone|kubeconfig|reconcile|WARN'`
|
||||
Expected: shellcheck clean; dry-run lists the current members, no account creations (all exist), "reconcile complete (DRY-RUN)".
|
||||
|
||||
- [ ] **Step 3: Real run + verify it reproduces current state**
|
||||
|
||||
Run: `sudo jq -S . /etc/t3-serve/dispatch.json > /tmp/d1; sudo DRY_RUN=0 bash t3-provision-users.sh >/dev/null 2>&1; sudo jq -S . /etc/t3-serve/dispatch.json > /tmp/d2; diff /tmp/d1 /tmp/d2 && echo SAME; id -nG emo`
|
||||
Expected: `SAME` (dispatch content unchanged); emo groups unchanged. Redeploy: `sudo install -m0755 t3-provision-users.sh /usr/local/bin/t3-provision-users`.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add scripts/t3-provision-users.sh scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py
|
||||
git commit -m "workstation: provisioner sources members from Authentik T3 Users (replaces roster.yaml)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Authentik — Authentik-managed membership + legacy os_user attributes
|
||||
|
||||
**Files:** Modify `infra/stacks/authentik/t3-users.tf`; set user attributes via API.
|
||||
|
||||
- [ ] **Step 1: Set the legacy os_user attributes** (the 3 existing accounts don't derive from their emails). Read-merge-write so existing attributes are preserved (Authentik PATCH replaces the `attributes` dict).
|
||||
|
||||
```bash
|
||||
export VAULT_ADDR=https://vault.viktorbarzin.me
|
||||
TOK=$(vault kv get -field=tf_api_token secret/authentik)
|
||||
A=https://authentik.viktorbarzin.me/api/v3
|
||||
set_os_user() { # $1=username $2=os_user
|
||||
local pk attrs
|
||||
pk=$(curl -sk -H "Authorization: Bearer $TOK" "$A/core/users/?username=$1" | jq '.results[0].pk')
|
||||
attrs=$(curl -sk -H "Authorization: Bearer $TOK" "$A/core/users/$pk/" | jq -c --arg o "$2" '.attributes + {os_user:$o}')
|
||||
curl -sk -X PATCH -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' \
|
||||
-d "{\"attributes\":$attrs}" "$A/core/users/$pk/" | jq -r '.username + " os_user=" + .attributes.os_user'
|
||||
}
|
||||
set_os_user "vbarzin@gmail.com" wizard
|
||||
set_os_user "emil.barzin@gmail.com" emo
|
||||
set_os_user "ancaelena98@gmail.com" ancamilea
|
||||
```
|
||||
Expected: three lines confirming `os_user=` each.
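
The read-merge-write in `set_os_user` matters because the PATCH replaces the whole `attributes` dict. A minimal sketch of the merge step, equivalent in intent to the jq expression `.attributes + {os_user:$o}`; the helper name and the sample attributes are hypothetical:

```python
import json


def merged_attributes(existing, os_user):
    """Shallow-merge os_user into the user's current attributes (None-safe)."""
    merged = dict(existing or {})
    merged["os_user"] = os_user
    return merged


# Hypothetical attributes already present on the user.
current = {"settings": {"locale": "en"}, "notes": "keep me"}

# Body to send in the PATCH: every existing key survives, os_user is added.
patch_body = json.dumps({"attributes": merged_attributes(current, "emo")})

# What a naive PATCH would send instead, dropping the other keys.
naive_body = json.dumps({"attributes": {"os_user": "emo"}})
```

The merge is shallow, as it is in the jq version: only the top-level `os_user` key is added or overwritten, and nested values are carried over untouched.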
|
||||
|
||||
- [ ] **Step 2: Drop the HCL `users` list** so membership is Authentik-managed. Edit `t3-users.tf`: remove the `users = [...]` argument from `resource "authentik_group" "t3_users"` (and remove the `data "authentik_user"` lookups too if they are now unused). Leave the group resource (name only).
|
||||
|
||||
```hcl
|
||||
resource "authentik_group" "t3_users" {
|
||||
name = "T3 Users"
|
||||
# Membership is managed in Authentik (UI/API), not Terraform — the devvm
|
||||
# provisioner reconciles workstation accounts from this group's members.
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Apply + verify members unchanged**
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code/infra/stacks/authentik && ../../scripts/tg apply -target=authentik_group.t3_users --non-interactive
|
||||
curl -sk -H "Authorization: Bearer $TOK" "$A/core/groups/?search=T3%20Users" | jq -r '.results[0].users_obj[].username'
|
||||
```
|
||||
Expected: apply shows the group updated (no member change / the `users` field no longer managed); the 3 members still listed.
|
||||
|
||||
- [ ] **Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/authentik/t3-users.tf
|
||||
git commit -m "workstation: T3 Users membership is Authentik-managed (drop HCL member list)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 7: Retire roster.yaml + update docs
|
||||
|
||||
**Files:** Delete `infra/scripts/workstation/roster.yaml`; modify `service-catalog.md`, `multi-tenancy.md`.
|
||||
|
||||
- [ ] **Step 1: Confirm nothing reads roster.yaml anymore**
|
||||
|
||||
Run: `grep -rn 'roster.yaml\|roster_engine.*roster\b' /home/wizard/code/infra/scripts /home/wizard/code/infra/docs | grep -v 'load_roster\|test_\|design.md\|-plan.md'`
|
||||
Expected: no live references in the provisioner. The engine keeps `load_roster` for tests, which is fine.
|
||||
|
||||
- [ ] **Step 2: Delete it + update the service-catalog t3code row** — change "Source of truth = roster.yaml" to "Source of truth = the Authentik `T3 Users` group (members → accounts via the read-only API token); `os_user` from the email or a per-user `os_user` attribute". Update the multi-tenancy Workstation section's "single source of truth" line likewise.
|
||||
|
||||
```bash
|
||||
git rm scripts/workstation/roster.yaml
|
||||
# (edit service-catalog.md + multi-tenancy.md per above)
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Commit**
|
||||
|
||||
```bash
|
||||
# `git rm` in Step 2 already staged the roster.yaml deletion, so only the edited docs need adding
git add .claude/reference/service-catalog.md docs/architecture/multi-tenancy.md
|
||||
git commit -m "workstation: retire roster.yaml — Authentik T3 Users group is the membership SSoT"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 8: End-to-end smoke (add + remove a throwaway member)
|
||||
|
||||
- [ ] **Step 1: Add a throwaway test member** to `T3 Users` in Authentik (a test user, or temporarily add an existing one), set no `os_user` attribute. Run `sudo /usr/local/bin/t3-provision-users` and confirm an account `<derived>` is created (`id <derived>`), with a locked `~/code` (secret file shows `GITCRYPT`) and `~/.kube/config`.
|
||||
- [ ] **Step 2: Remove the test member** from the group; run the reconcile; confirm they drop out of `/etc/ttyd-user-map` + `dispatch.json` (the reversible cut). Leave `userdel` to the gated offboarding runbook.
|
||||
- [ ] **Step 3: Verify the 3 real users are intact** — `id emo` (groups unchanged), emo/ancamilea/wizard still in `dispatch.json`, their `t3-serve@` active, emo's locked clone + ancamilea's intact.
|
||||
|
||||
---
|
||||
|
||||
## Self-review
|
||||
|
||||
- **Spec coverage:** Authentik-as-SSoT (Tasks 5,6) · email identity + os_user derive/override (Tasks 1,6) · provisioner reads the API (Task 5) · read-only token for the root timer (Tasks 3,4) · roster.yaml retires (Task 7) · k8s_users/cluster untouched (no task touches it) · wizard special-cased (admin_emails, Task 2). All covered.
|
||||
- **Type consistency:** `derive_os_user(email, os_user_attr)` and `roster_from_members(members, admin_emails)` used consistently; `members` dicts are `{email, os_user}`; reuses the existing `User`/`Roster`/`derive_desired_state`/`DesiredState`.
|
||||
- **apiserver-OIDC:** out of scope here (kubectl auth method only) — flagged in the design; the generic kubeconfig task is unchanged from v1.
|
||||
- **Open risk:** the `authentik_rbac_permission_user` resource name / permission codenames may differ in the installed provider version (Task 3) — Step 3 verifies read-works/write-403 and says to adjust if needed.
|
||||
|
|
@@ -0,0 +1,138 @@
|
|||
# Post-Mortem: t3 Nightly Auto-Update (0.0.25) Migrated `state.sqlite` Forward → mint/pairing Broke for All Devvm Users
|
||||
|
||||
## Summary
|
||||
|
||||
The devvm t3 auto-updater (`t3-autoupdate.timer`) pulled the `t3@nightly`
|
||||
build `0.0.25-nightly.20260608.497`. That build ran two forward schema
|
||||
migrations on every per-user `~/.t3/userdata/state.sqlite` (renaming
|
||||
`role`→`scopes` in `auth_pairing_links` + `auth_sessions`, adding
|
||||
`proof_key_thumbprint`) **and** changed the bootstrap API. The result was a
|
||||
binary-vs-schema mismatch that broke `t3-mint` (pairing-credential issuance)
|
||||
for **all** users — every fresh login landed on the t3 pairing prompt instead
|
||||
of an authenticated session.
|
||||
|
||||
## Impact
|
||||
|
||||
- **Who:** every devvm t3 user — `wizard` (Viktor), `emo`, `ancamilea`.
|
||||
- **What:** `t3 auth pairing create` failed (`AuthControlPlaneError:
|
||||
Failed to create pairing link` → `PersistenceSqlError` on
|
||||
`auth_pairing_links`), so `t3-dispatch` auto-pair returned 500/502 and the
|
||||
browser showed the pairing prompt. Existing *already-authenticated* sessions
|
||||
kept working (validated against `auth_sessions`, not the pairing path).
|
||||
- **When:** ~13:56 (bad nightly installed) → ~15:16 (all users verified 302).
|
||||
- **Trigger of the report:** Anca could not log in ("gets the pair prompt,
|
||||
session broken").
|
||||
|
||||
## Timeline (devvm clock)
|
||||
|
||||
- **13:56** — `t3-provision-users` step 5b ran `systemctl enable --now
|
||||
t3-autoupdate.timer`. The timer is `OnCalendar=04:00 … Persistent=true`;
|
||||
`--now` + a missed 04:00 schedule fired the daily job **immediately**.
|
||||
- **13:56** — updater installed `t3@nightly` = `0.0.25-nightly.20260608.497`
|
||||
(was `0.0.24`). The `GET / → 200` health-check **passed** (it never
|
||||
exercises mint/bootstrap), so no auto-rollback. It restarted *idle* serves
|
||||
(emo) onto 0.0.25 and deferred *active* ones (wizard, ancamilea).
|
||||
- **~14:38** — `t3-mint` (now global 0.0.25) ran migrations 31
|
||||
(`AuthAuthorizationScopes`) + 32 (`AuthPairingProofKeyThumbprint`) against
|
||||
each `state.sqlite` it touched → schemas moved to "level 32".
|
||||
- **~14:40** — first recovery action rolled the **binary** back to `0.0.24`.
|
||||
This did **not** help: the DBs were still at level 32, so the level-30
|
||||
binary's INSERT hit `no column named role` / `NOT NULL constraint failed:
|
||||
scopes`. (Downgrading a binary after a forward migration is not a rollback.)
|
||||
- **~15:01–15:16** — diagnosed the binary-vs-schema mismatch, confirmed
|
||||
`0.0.25` *stable* is **also** dispatch-incompatible (auto-pair → 502, the
|
||||
bootstrap API moved), pinned to `0.0.24`, reset the two new users' disposable
|
||||
DBs, surgically reverted wizard's two auth tables to level 30. All three
|
||||
users verified 302 + `Set-Cookie: t3_session`.
|
||||
|
||||
## Root Cause
|
||||
|
||||
Three compounding factors:
|
||||
|
||||
1. **Auto-tracking a pre-1.0 tool's nightly.** `t3-autoupdate.sh` ran
|
||||
`npm i -g t3@nightly`. t3 ships breaking schema-migration and bootstrap-API
|
||||
changes between builds; our `t3-dispatch` (Go) speaks a fixed bootstrap
|
||||
contract (`POST /api/auth/bootstrap {"credential":…}` → `Set-Cookie`).
|
||||
2. **`enable --now` on a `Persistent=true` timer.** The provisioner's
|
||||
re-assertion of the timer didn't just *arm* the schedule — it fired the
|
||||
missed daily job on the spot, mid-afternoon, with users active.
|
||||
3. **A health-check that proves nothing about auth.** The smoke test only
|
||||
probes `GET / → 200`. The 0.0.25 server answers 200 while its pairing/mint
|
||||
path is incompatible, so the "auto-rollback on bad build" never triggered.
|
||||
|
||||
Forward migrations + a binary downgrade = a DB the old binary can't write.
|
||||
`state.sqlite` also holds the precious projection tables (session history), so
|
||||
a blanket "delete and re-pair" was only safe for the brand-new users.
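
The "forward migration + binary downgrade" failure can be reproduced in a few lines with an in-memory SQLite database. This is an illustrative sketch only: the two-column table is a made-up simplification of `auth_pairing_links`, not t3's real schema, and the migration is written as a table rebuild rather than whatever t3's migrations 31/32 actually execute.

```python
import sqlite3


def old_binary_insert_after_migration():
    """Return the error an old-schema INSERT hits once the table has been migrated."""
    db = sqlite3.connect(":memory:")
    try:
        # "Level 30" shape (simplified, hypothetical columns).
        db.execute("CREATE TABLE auth_pairing_links (id TEXT PRIMARY KEY, role TEXT NOT NULL)")
        db.execute("INSERT INTO auth_pairing_links (id, role) VALUES ('a', 'owner')")

        # The newer binary migrates forward: role becomes scopes.
        db.executescript(
            """
            CREATE TABLE auth_pairing_links_new (id TEXT PRIMARY KEY, scopes TEXT NOT NULL);
            INSERT INTO auth_pairing_links_new (id, scopes)
                SELECT id, role FROM auth_pairing_links;
            DROP TABLE auth_pairing_links;
            ALTER TABLE auth_pairing_links_new RENAME TO auth_pairing_links;
            """
        )

        # The binary is rolled back, the database is not: the old INSERT
        # still names the column that no longer exists.
        try:
            db.execute("INSERT INTO auth_pairing_links (id, role) VALUES ('b', 'owner')")
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        db.close()


error_message = old_binary_insert_after_migration()
```

The existing row survives the migration, which is why read paths can keep working while every new write from the downgraded binary fails.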
|
||||
|
||||
## Detection
|
||||
|
||||
User report (Anca on the pairing prompt). No alert fired — the auto-updater's
|
||||
own health-check is the only automated gate and it passed. **Gap:** nothing
|
||||
monitors the end-to-end pairing flow.
|
||||
|
||||
## Fixes & Mitigations
|
||||
|
||||
### 1. Pin t3, stop tracking nightly (DONE)
|
||||
|
||||
`infra/scripts/t3-autoupdate.sh` is now a **pinned-version enforcer**:
|
||||
`T3_PIN="${T3_PIN:-0.0.24}"`, `npm i -g "t3@$T3_PIN"`. It re-asserts the pin
|
||||
(a no-op when already correct) instead of chasing nightly. Unit `Description`s
|
||||
updated. To move the pin: **first** verify `t3-dispatch`'s bootstrap flow against
the new build (`curl` the dispatch → expect 302 + `Set-Cookie: t3_session`),
**then** bump `T3_PIN`.
|
||||
|
||||
### 2. Drop `--now` from the provisioner (DONE)
|
||||
|
||||
`infra/scripts/t3-provision-users.sh` step 5b now runs `systemctl enable
|
||||
t3-autoupdate.timer` (no `--now`) — it arms the 04:00 schedule without firing a
|
||||
missed job immediately.
|
||||
|
||||
### 3. Pinned install at machine setup (DONE)
|
||||
|
||||
`infra/scripts/workstation/setup-devvm.sh` installs `t3@$T3_PIN` directly, so a
|
||||
fresh box has the pinned t3 immediately rather than depending on the enforcer's
|
||||
first run.
|
||||
|
||||
### 4. Recovery actions taken on the host (DONE)
|
||||
|
||||
- Global `t3` rolled to `0.0.24`; enforcer redeployed + timer re-enabled
|
||||
(verified the enforcer is a no-op at the pin).
|
||||
- New users (`emo` 0 threads, `ancamilea` 1 trivial thread): `state.sqlite`
|
||||
parked aside; serve restarted → fresh level-30 DB.
|
||||
- `wizard` (96 threads, and the serve hosting the recovery session — cannot be
|
||||
restarted): the two auth tables were atomically rebuilt to the level-30
|
||||
schema (copied from a fresh DB) and migration records 31/32 removed.
|
||||
`auth_sessions` had 0 rows and the 0.0.24 serve never reads `scopes`, so the
|
||||
live session and all projection history were untouched. Backup:
|
||||
`/home/wizard/.t3/userdata/auth-backup-*.sql`.
|
||||
|
||||
### 5. End-to-end pairing health-check (DEFERRED)
|
||||
|
||||
The smoke test should exercise mint→bootstrap→cookie, not just `GET /`. Not
|
||||
done here (the pin makes it moot for the known-good build); needed before the
|
||||
enforcer is ever pointed at a new version. A blackbox probe on the dispatch
|
||||
auto-pair (expect 302 + `t3_session`) would have alerted within minutes.
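
As a rough sketch of what such a probe would assert, the pass/fail decision can be written as a pure function over the dispatch response. The function name is hypothetical, and how the probe obtains the response (which URL, which forwarded-auth headers) is deliberately left out because it depends on the deployment.

```python
from http.cookies import SimpleCookie

SESSION_COOKIE = "t3_session"


def pairing_healthy(status, set_cookie_headers):
    """True only for a 302 that also sets a non-empty t3_session cookie."""
    if status != 302:
        return False
    for raw in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(raw)  # parses one Set-Cookie header value
        morsel = jar.get(SESSION_COOKIE)
        if morsel is not None and morsel.value:
            return True
    return False


# Example header values, made up for illustration.
good = pairing_healthy(302, ["t3_session=abc123; Path=/; HttpOnly"])
liveness_only = pairing_healthy(200, [])
mint_failed = pairing_healthy(502, [])
```

Unlike the `GET /` check, this predicate rejects a server that answers 200 on `/` but cannot complete the mint and bootstrap exchange, which is the failure mode this incident exposed.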
|
||||
|
||||
## Lessons
|
||||
|
||||
- **Don't auto-track a pre-1.0 tool's nightly.** Pin to a known-good,
|
||||
contract-verified build; upgrades are a deliberate, tested act.
|
||||
- **`enable --now` on a `Persistent=true` timer fires the missed job now.**
|
||||
Use plain `enable` to arm a schedule without a surprise immediate run.
|
||||
- **A liveness probe (`GET /`) is not a readiness/correctness probe.** If a
|
||||
feature (auth/pairing) can break while `/` stays 200, the health-check must
|
||||
exercise that feature or it gives false confidence.
|
||||
- **A binary downgrade is not a schema rollback.** Once a forward migration
|
||||
runs, the data is migrated; the old binary now mismatches its own DB.
|
||||
- **Separate disposable state from precious state before resetting.** t3's
|
||||
`state.sqlite` mixes ephemeral auth (`auth_pairing_links`, `auth_sessions`)
|
||||
with precious history (`projection_*`); surgical table-level repair
|
||||
preserved 8k+ messages that a blanket reset would have destroyed.
|
||||
|
||||
## References
|
||||
|
||||
- `infra/scripts/t3-autoupdate.sh` (pinned enforcer), `.service`, `.timer`
|
||||
- `infra/scripts/t3-provision-users.sh` step 5b
|
||||
- `infra/scripts/workstation/setup-devvm.sh` step 2b
|
||||
- `infra/.claude/reference/service-catalog.md` (t3 serving layer)
|
||||
- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql`
|
||||
|
|
@@ -1,5 +1,5 @@
|
|||
[Unit]
|
||||
Description=Track latest t3 nightly (health-checked, idle-only restart)
|
||||
Description=Enforce pinned t3 version (health-checked, idle-only restart)
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
|
|
|
|||
|
|
@@ -1,20 +1,30 @@
|
|||
#!/usr/bin/env bash
|
||||
# Track the latest t3 nightly — with a health-check + auto-rollback (lesson from
|
||||
# the Keel auto-update incidents: never blindly trust a new build) and idle-only
|
||||
# restarts (never kill an in-flight coding session). Runs as root via the unit.
|
||||
# Enforce the PINNED t3 version ($T3_PIN) across the box — NOT "latest/nightly".
|
||||
# t3 is pre-1.0 and ships breaking schema-migration + bootstrap-API changes between
|
||||
# builds that our t3-dispatch can't follow blind. 2026-06-09: a nightly auto-update
|
||||
# (0.0.25) migrated every ~/.t3 state.sqlite forward (auth_pairing_links/auth_sessions
|
||||
# role->scopes) AND changed the bootstrap API, breaking mint/pairing for ALL users.
|
||||
# So we PIN; this unit just re-asserts the pin (a no-op when already correct) with a
|
||||
# health-check + auto-rollback and idle-only restarts (never kill an in-flight session).
|
||||
# To move the pin: bump T3_PIN AND first verify t3-dispatch's bootstrap flow against the
|
||||
# new build (curl the dispatch -> expect 302 + Set-Cookie t3_session). See post-mortem
|
||||
# 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
|
||||
# CAVEAT: the health-check below only probes GET / (200) — it does NOT exercise the
|
||||
# mint/bootstrap/pairing path, so it will NOT catch an auth regression on its own.
|
||||
set -uo pipefail
|
||||
T3_PIN="${T3_PIN:-0.0.24}" # known-good, t3-dispatch-compatible (2026-06-09 post-mortem)
|
||||
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
|
||||
|
||||
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
|
||||
|
||||
before=$(ver); LOG "current: ${before:-unknown}"
|
||||
npm i -g t3@nightly >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
|
||||
before=$(ver); LOG "current: ${before:-unknown}; pin: $T3_PIN"
|
||||
npm i -g "t3@$T3_PIN" >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; }
|
||||
after=$(ver)
|
||||
|
||||
if [[ -z "$after" || "$after" == "$before" ]]; then
|
||||
LOG "already latest (${before:-?}); nothing to do"; exit 0
|
||||
LOG "already at pin $T3_PIN (${before:-?}); nothing to do"; exit 0
|
||||
fi
|
||||
LOG "installed $after (was $before); health-checking…"
|
||||
LOG "re-pinned to $after (was $before); health-checking…"
|
||||
|
||||
# Health-check the NEW binary on a throwaway port/base-dir before trusting it.
|
||||
SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d)
|
||||
|
|
|
|||
|
|
@@ -1,5 +1,5 @@
|
|||
[Unit]
|
||||
Description=Daily t3 nightly auto-update
|
||||
Description=Daily t3 pinned-version enforcer (re-asserts T3_PIN; no-op when correct)
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 04:00:00
|
||||
|
|
|
|||
|
|
@@ -59,6 +59,60 @@ func lookup(ak string) (entry, bool) {
|
|||
return e, ok
|
||||
}
|
||||
|
||||
// mintToken mints a one-time pairing token for osUser via the scoped sudoers
|
||||
// entry (the dispatch service can invoke nothing else). Indirected through a var
|
||||
// so tests can stub the privileged exec.
|
||||
var mintToken = func(osUser string) ([]byte, error) {
|
||||
return exec.Command("sudo", "-n", "/usr/local/bin/t3-mint", osUser).Output()
|
||||
}
|
||||
|
||||
var sessionClient = &http.Client{Timeout: 5 * time.Second}
|
||||
|
||||
// sessionValid asks the user's instance whether the presented t3_session cookie
|
||||
// is still valid. Server-side sessions can be wiped/expired independently of the
|
||||
// 30-day cookie (e.g. an auth-schema rollback drops every session row), leaving
|
||||
// the browser with a live-looking but dead cookie. Fails OPEN: any error/non-200/
|
||||
// parse failure returns true so the request still proxies — a re-pair is forced
|
||||
// only on a definitive authenticated:false.
|
||||
func sessionValid(e entry, c *http.Cookie) bool {
|
||||
req, err := http.NewRequest(http.MethodGet,
|
||||
fmt.Sprintf("http://127.0.0.1:%d/api/auth/session", e.Port), nil)
|
||||
if err != nil {
|
||||
return true
|
||||
}
|
||||
req.AddCookie(c)
|
||||
resp, err := sessionClient.Do(req)
|
||||
if err != nil {
|
||||
return true
|
||||
}
|
||||
defer resp.Body.Close()
|
||||
if resp.StatusCode != http.StatusOK {
|
||||
return true
|
||||
}
|
||||
var s struct {
|
||||
Authenticated bool `json:"authenticated"`
|
||||
}
|
||||
if json.NewDecoder(resp.Body).Decode(&s) != nil {
|
||||
return true
|
||||
}
|
||||
return s.Authenticated
|
||||
}
|
||||
|
||||
// isDocumentNav reports whether r is a top-level browser document navigation, as
|
||||
// opposed to an XHR/fetch/asset/WebSocket sub-request. Only such requests are
|
||||
// safe to answer with a re-pair 302 — redirecting a sub-resource would corrupt
|
||||
// the SPA's fetch/WebSocket contract. Trust Sec-Fetch-Dest when present (all
|
||||
// modern browsers send it); fall back to the Accept header otherwise.
|
||||
func isDocumentNav(r *http.Request) bool {
|
||||
if r.Method != http.MethodGet {
|
||||
return false
|
||||
}
|
||||
if dest := r.Header.Get("Sec-Fetch-Dest"); dest != "" {
|
||||
return dest == "document"
|
||||
}
|
||||
return strings.Contains(r.Header.Get("Accept"), "text/html")
|
||||
}
|
||||
|
||||
// autoPair mints a one-time pairing token for the user's instance (as that OS
|
||||
// user, via the scoped sudoers entry) and exchanges it at the instance's
|
||||
// /api/auth/bootstrap, relaying the returned t3_session Set-Cookie to the browser.
|
||||
|
|
@@ -66,7 +120,7 @@ func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
|
|||
// t3-mint (root, via scoped sudoers) validates the OS user is in
|
||||
// /etc/ttyd-user-map, then mints as that user. The dispatch service itself
|
||||
// runs unprivileged and can invoke nothing else.
|
||||
out, err := exec.Command("sudo", "-n", "/usr/local/bin/t3-mint", e.OsUser).Output()
|
||||
out, err := mintToken(e.OsUser)
|
||||
if err != nil {
|
||||
log.Printf("mint for %s failed: %v", e.OsUser, err)
|
||||
http.Error(w, "pairing mint failed", http.StatusInternalServerError)
|
||||
|
|
@@ -111,7 +165,16 @@ func handler(w http.ResponseWriter, r *http.Request) {
|
|||
http.Error(w, "no t3 instance provisioned for this user", http.StatusForbidden)
|
||||
return
|
||||
}
|
||||
if _, err := r.Cookie(cookieName); err != nil {
|
||||
c, err := r.Cookie(cookieName)
|
||||
if err != nil {
|
||||
autoPair(e, w, r)
|
||||
return
|
||||
}
|
||||
// A present cookie can still be server-side-invalid (sessions wiped/expired
|
||||
// while the 30-day cookie lingers). On a top-level navigation, verify it and
|
||||
// re-pair if dead — otherwise the instance just renders its pair page. Gated
|
||||
// to document navs so we never 302 an XHR/asset/WebSocket sub-request.
|
||||
if isDocumentNav(r) && !sessionValid(e, c) {
|
||||
autoPair(e, w, r)
|
||||
return
|
||||
}
|
||||
|
|
|
|||
200
scripts/t3-dispatch/main_test.go
Normal file
|
|
@@ -0,0 +1,200 @@
|
|||
package main
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"net/url"
|
||||
"strconv"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func portOf(t *testing.T, ts *httptest.Server) int {
|
||||
t.Helper()
|
||||
u, err := url.Parse(ts.URL)
|
||||
if err != nil {
|
||||
t.Fatalf("parse %s: %v", ts.URL, err)
|
||||
}
|
||||
p, err := strconv.Atoi(u.Port())
|
||||
if err != nil {
|
||||
t.Fatalf("port %s: %v", u.Port(), err)
|
||||
}
|
||||
return p
|
||||
}
|
||||
|
||||
func TestIsDocumentNav(t *testing.T) {
|
||||
cases := []struct {
|
||||
name string
|
||||
method string
|
||||
headers map[string]string
|
||||
want bool
|
||||
}{
|
||||
{"GET sec-fetch-dest document", "GET", map[string]string{"Sec-Fetch-Dest": "document"}, true},
|
||||
{"GET accept html (no sec-fetch)", "GET", map[string]string{"Accept": "text/html,application/xhtml+xml"}, true},
|
||||
{"GET xhr empty dest beats accept", "GET", map[string]string{"Sec-Fetch-Dest": "empty", "Accept": "text/html"}, false},
|
||||
{"GET json", "GET", map[string]string{"Accept": "application/json"}, false},
|
||||
{"POST html", "POST", map[string]string{"Accept": "text/html"}, false},
|
||||
{"GET no headers", "GET", map[string]string{}, false},
|
||||
}
|
||||
for _, c := range cases {
|
||||
t.Run(c.name, func(t *testing.T) {
|
||||
r, _ := http.NewRequest(c.method, "/", nil)
|
||||
for k, v := range c.headers {
|
||||
r.Header.Set(k, v)
|
||||
}
|
||||
if got := isDocumentNav(r); got != c.want {
|
||||
t.Errorf("isDocumentNav = %v, want %v", got, c.want)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func sessionServer(status int, body string) *httptest.Server {
|
||||
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
if r.URL.Path != "/api/auth/session" {
|
||||
http.NotFound(w, r)
|
||||
return
|
||||
}
|
||||
w.WriteHeader(status)
|
||||
_, _ = w.Write([]byte(body))
|
||||
}))
|
||||
}
|
||||
|
||||
func TestSessionValid(t *testing.T) {
|
||||
ck := &http.Cookie{Name: cookieName, Value: "x"}
|
||||
|
||||
t.Run("authenticated true -> valid", func(t *testing.T) {
|
||||
ts := sessionServer(200, `{"authenticated":true}`)
|
||||
defer ts.Close()
|
||||
if !sessionValid(entry{Port: portOf(t, ts)}, ck) {
|
||||
t.Fatal("want valid (true) for authenticated:true")
|
||||
}
|
||||
})
|
||||
t.Run("authenticated false -> invalid", func(t *testing.T) {
|
||||
ts := sessionServer(200, `{"authenticated":false}`)
|
||||
defer ts.Close()
|
||||
if sessionValid(entry{Port: portOf(t, ts)}, ck) {
|
||||
t.Fatal("want invalid (false) for authenticated:false")
|
||||
}
|
||||
})
|
||||
t.Run("500 -> fail-open valid", func(t *testing.T) {
|
||||
ts := sessionServer(500, `boom`)
|
||||
defer ts.Close()
|
||||
if !sessionValid(entry{Port: portOf(t, ts)}, ck) {
|
||||
t.Fatal("want fail-open true on 500")
|
||||
}
|
||||
})
|
||||
t.Run("malformed json -> fail-open valid", func(t *testing.T) {
|
||||
ts := sessionServer(200, `not json`)
|
||||
defer ts.Close()
|
||||
if !sessionValid(entry{Port: portOf(t, ts)}, ck) {
|
||||
t.Fatal("want fail-open true on unparseable body")
|
||||
}
|
||||
})
|
||||
t.Run("unreachable -> fail-open valid", func(t *testing.T) {
|
||||
ts := sessionServer(200, `{"authenticated":false}`)
|
||||
p := portOf(t, ts)
|
||||
ts.Close() // nothing listening now
|
||||
if !sessionValid(entry{Port: p}, ck) {
|
||||
t.Fatal("want fail-open true on connection refused")
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
// fakeInstance serves the three endpoints the dispatcher touches: the session
|
||||
// check, the bootstrap exchange, and a catch-all standing in for the proxied app.
|
||||
func fakeInstance(authenticated bool, bootstrapCalled *bool) *httptest.Server {
|
||||
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
switch r.URL.Path {
|
||||
case "/api/auth/session":
|
||||
if authenticated {
|
||||
_, _ = w.Write([]byte(`{"authenticated":true}`))
|
||||
} else {
|
||||
_, _ = w.Write([]byte(`{"authenticated":false}`))
|
||||
}
|
||||
case "/api/auth/bootstrap":
|
||||
if bootstrapCalled != nil {
|
||||
*bootstrapCalled = true
|
||||
}
|
||||
http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"})
|
||||
_, _ = w.Write([]byte(`{"authenticated":true}`))
|
||||
default:
|
||||
_, _ = w.Write([]byte("APP"))
|
||||
}
|
||||
}))
|
||||
}
|
||||
|
||||
func setTable(port int) {
|
||||
mu.Lock()
|
||||
table = map[string]entry{"vbarzin": {OsUser: "wizard", Port: port}}
|
||||
mu.Unlock()
|
||||
}
|
||||
|
||||
func TestHandlerRepairsOnInvalidCookieDocNav(t *testing.T) {
|
||||
called := false
|
||||
ts := fakeInstance(false, &called)
|
||||
defer ts.Close()
|
||||
setTable(portOf(t, ts))
|
||||
|
||||
orig := mintToken
|
||||
mintToken = func(string) ([]byte, error) { return []byte(`{"credential":"tok"}`), nil }
|
||||
defer func() { mintToken = orig }()
|
||||
|
||||
r := httptest.NewRequest("GET", "/", nil)
|
||||
r.Header.Set("X-authentik-username", "vbarzin@gmail.com")
|
||||
r.Header.Set("Sec-Fetch-Dest", "document")
|
||||
r.AddCookie(&http.Cookie{Name: cookieName, Value: "stale"})
|
||||
w := httptest.NewRecorder()
|
||||
|
||||
handler(w, r)
|
||||
|
||||
if w.Code != http.StatusFound {
|
||||
t.Fatalf("stale cookie on doc-nav should re-pair (302), got %d body=%q", w.Code, w.Body.String())
|
||||
}
|
||||
if !called {
|
||||
t.Fatal("expected bootstrap to be called during re-pair")
|
||||
}
|
||||
cookies := w.Result().Cookies()
|
||||
if len(cookies) == 0 || cookies[0].Value != "fresh" {
|
||||
t.Fatalf("expected fresh t3_session relayed, got %+v", cookies)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHandlerProxiesOnValidCookie(t *testing.T) {
|
||||
ts := fakeInstance(true, nil)
|
||||
defer ts.Close()
|
||||
setTable(portOf(t, ts))
|
||||
|
||||
r := httptest.NewRequest("GET", "/", nil)
|
||||
r.Header.Set("X-authentik-username", "vbarzin@gmail.com")
|
||||
r.Header.Set("Sec-Fetch-Dest", "document")
|
||||
r.AddCookie(&http.Cookie{Name: cookieName, Value: "good"})
|
||||
w := httptest.NewRecorder()
|
||||
|
||||
handler(w, r)
|
||||
|
||||
if w.Code != http.StatusOK || w.Body.String() != "APP" {
|
||||
t.Fatalf("valid cookie should proxy (200 APP), got %d %q", w.Code, w.Body.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestHandlerProxiesXHREvenIfCookieInvalid(t *testing.T) {
|
||||
called := false
|
||||
ts := fakeInstance(false, &called) // session would say invalid, but XHR must NOT be re-paired
|
||||
defer ts.Close()
|
||||
setTable(portOf(t, ts))
|
||||
|
||||
r := httptest.NewRequest("GET", "/api/threads", nil)
|
||||
r.Header.Set("X-authentik-username", "vbarzin@gmail.com")
|
||||
r.Header.Set("Sec-Fetch-Dest", "empty") // XHR/fetch, not a document nav
|
||||
r.AddCookie(&http.Cookie{Name: cookieName, Value: "stale"})
|
||||
w := httptest.NewRecorder()
|
||||
|
||||
handler(w, r)
|
||||
|
||||
if called {
|
||||
t.Fatal("must NOT re-pair (302) a non-document sub-request — would corrupt the SPA fetch contract")
|
||||
}
|
||||
if w.Code != http.StatusOK || w.Body.String() != "APP" {
|
||||
t.Fatalf("XHR should proxy through, got %d %q", w.Code, w.Body.String())
|
||||
}
|
||||
}
|
||||
|
|
@ -95,6 +95,39 @@ EOF
|
|||
log "wrote OIDC kubeconfig -> $user:~/.kube/config"
|
||||
}
|
||||
|
||||
# Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
|
||||
# T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
|
||||
env_set() {
|
||||
local file="$1" key="$2" val="$3"
|
||||
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] set $key -> $file"; return 0; fi
|
||||
install -d -m 0755 "$(dirname "$file")"
|
||||
if [[ -f "$file" ]] && grep -q "^${key}=" "$file"; then
|
||||
grep -qx "${key}=${val}" "$file" || sed -i "s|^${key}=.*|${key}=${val}|" "$file"
|
||||
else
|
||||
printf '%s=%s\n' "$key" "$val" >> "$file"
|
||||
fi
|
||||
chmod 600 "$file"
|
||||
}
|
||||
|
||||
# Share the admin's Claude subscription with a non-admin: inject CLAUDE_CODE_OAUTH_TOKEN
|
||||
# (the staged long-lived token) into their t3-serve env — ONLY if they have neither their
|
||||
# own ~/.claude/.credentials.json (own login) nor an existing token. Never clobbers. The
|
||||
# agent picks it up when its t3-serve@ instance (re)starts.
|
||||
install_user_claude_token() {
|
||||
local user="$1" home envf tok
|
||||
local token_file="${CLAUDE_TOKEN_FILE:-/etc/t3-serve/claude-oauth-token}"
|
||||
home="$(getent passwd "$user" | cut -d: -f6)"
|
||||
[[ -z "$home" ]] && return 0
|
||||
[[ -f "$home/.claude/.credentials.json" ]] && return 0 # has own login -> leave it
|
||||
[[ -r "$token_file" ]] || return 0
|
||||
envf="${ENVDIR:-/etc/t3-serve}/$user.env"
|
||||
grep -q '^CLAUDE_CODE_OAUTH_TOKEN=' "$envf" 2>/dev/null && return 0 # already shared
|
||||
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] share Claude token -> $envf"; return 0; fi
|
||||
tok="$(cat "$token_file")"
|
||||
env_set "$envf" CLAUDE_CODE_OAUTH_TOKEN "$tok"
|
||||
log "shared Claude token -> $user (t3-serve env; restart needed to take effect)"
|
||||
}
|
||||
|
||||
[[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
|
||||
for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
|
||||
[[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
|
||||
|
|
@ -144,21 +177,27 @@ while IFS=$'\t' read -r os_user tier shell groups_csv; do
|
|||
log "add $os_user -> group $g"; run gpasswd -a "$os_user" "$g" >/dev/null
|
||||
done
|
||||
fi
|
||||
if [[ "$tier" != admin ]]; then # non-admins: locked ~/code clone + OIDC kubeconfig
|
||||
if [[ "$tier" != admin ]]; then # non-admins: locked clone + kubeconfig + shared Claude token
|
||||
install_locked_clone "$os_user"
|
||||
install_user_kubeconfig "$os_user"
|
||||
install_user_claude_token "$os_user"
|
||||
fi
|
||||
done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (.groups|join(","))] | @tsv' "$desired_file")
|
||||
|
||||
# 5) per-user .env (sticky port) + enable t3-serve@
|
||||
while IFS=$'\t' read -r os_user port; do
|
||||
envf="$ENVDIR/$os_user.env"
|
||||
if [[ ! -f "$envf" ]] || ! grep -qx "T3_PORT=$port" "$envf"; then
|
||||
run bash -c "printf 'T3_PORT=%s\n' '$port' > '$envf'"
|
||||
fi
|
||||
env_set "$envf" T3_PORT "$port" # update-or-append; preserves CLAUDE_CODE_OAUTH_TOKEN
|
||||
id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true
|
||||
done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
|
||||
|
||||
# 5b) machine-wide (once, not per-user): keep the t3 pinned-version ENFORCER enabled (it
|
||||
# re-asserts T3_PIN daily; a no-op when already correct). NOT --now: with Persistent=true
|
||||
# a `--now` enable fires the missed daily job IMMEDIATELY, which on 2026-06-09 pulled a
|
||||
# breaking nightly mid-day and took out auth for everyone. `enable` (no --now) just arms
|
||||
# the 04:00 schedule; fresh boxes get t3 from setup-devvm.sh's pinned install, not here.
|
||||
run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
|
||||
|
||||
# 6) regenerate /etc/ttyd-user-map + dispatch.json from the desired state (SSoT:
|
||||
# a roster entry removed here DISAPPEARS, which is what the offboarding cut relies on)
|
||||
if [[ "$DRY_RUN" == 1 ]]; then
|
||||
|
|
|
|||
|
|
@ -33,6 +33,16 @@ if [[ $need_node -eq 1 ]]; then
|
|||
fi
|
||||
command -v claude >/dev/null || { log "npm: installing @anthropic-ai/claude-code"; npm install -g @anthropic-ai/claude-code >/dev/null; }
|
||||
|
||||
# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
|
||||
# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
|
||||
# (2026-06-09 outage: a nightly auto-update broke pairing for ALL users). The daily
|
||||
# t3-autoupdate ENFORCER re-asserts this same pin; install it here so a fresh box has t3
|
||||
# immediately. Keep T3_PIN in sync with t3-autoupdate.sh.
|
||||
T3_PIN="${T3_PIN:-0.0.24}"
|
||||
if [[ "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//')" != "$T3_PIN" ]]; then
|
||||
log "npm: installing pinned t3@$T3_PIN"; npm install -g "t3@$T3_PIN" >/dev/null
|
||||
fi
|
||||
|
||||
# 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool)
|
||||
if [[ ! -x /usr/local/bin/kubelogin ]]; then
|
||||
log "kubelogin: installing int128/kubelogin"
|
||||
|
|
@ -77,4 +87,21 @@ if [[ -d "$ADMIN_CODE" ]]; then
|
|||
log "hardened $ADMIN_CODE (o-rx — not world-readable)"
|
||||
fi
|
||||
|
||||
# 8) stage the shared Claude subscription OAuth token (long-lived sk-ant-oat01) to a
|
||||
# root-readable file the provisioner injects into non-admins' t3-serve env, so they
|
||||
# share the admin's Claude subscription (only those without their own ~/.claude login).
|
||||
if command -v vault >/dev/null; then
|
||||
export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}"
|
||||
# setup-devvm runs as root (no ~/.vault-token); borrow the admin's token to read Vault.
|
||||
if [[ -z "${VAULT_TOKEN:-}" && -r /home/wizard/.vault-token ]]; then
|
||||
VAULT_TOKEN="$(cat /home/wizard/.vault-token)"; export VAULT_TOKEN
|
||||
fi
|
||||
if claude_tok="$(vault kv get -field=claude_oauth_token secret/workstation 2>/dev/null)"; then
|
||||
install -m 0600 /dev/stdin /etc/t3-serve/claude-oauth-token <<<"$claude_tok"
|
||||
log "staged /etc/t3-serve/claude-oauth-token (shared Claude subscription)"
|
||||
else
|
||||
log "WARN: secret/workstation claude_oauth_token absent -> non-admins won't share Claude auth"
|
||||
fi
|
||||
fi
|
||||
|
||||
log "OK (idempotent)"
|
||||
|
|
|
|||
|
|
@ -4,9 +4,17 @@
|
|||
# it's per-user runtime state inside the Forgejo DB. Driving retention from
|
||||
# a CronJob hitting the public API keeps the policy versioned in this repo.
|
||||
#
|
||||
# Auth: a write:package PAT belonging to ci-pusher (same user that pushes
|
||||
# from CI). DELETE on packages requires write:package scope. PAT lives in
|
||||
# Vault at secret/viktor/forgejo_cleanup_token.
|
||||
# Auth: a write:package PAT belonging to VIKTOR (the package OWNER). PAT
|
||||
# lives in Vault at secret/viktor/forgejo_cleanup_token.
|
||||
#
|
||||
# CORRECTION 2026-06-09: this previously said the PAT belonged to ci-pusher.
|
||||
# That was wrong and silently broke retention — Forgejo container packages
|
||||
# are scoped per-user, so ci-pusher gets HTTP 403 on DELETE of viktor/*
|
||||
# (the dry-run only does GETs, which DO work, so the 403 stayed hidden until
|
||||
# the first live run). DELETE requires a write:package PAT owned by viktor.
|
||||
# forgejo_cleanup_token is therefore set to viktor's write:package PAT (today
|
||||
# the same value as secret/ci/global/forgejo_push_token). IF that push token
|
||||
# is ever regenerated, re-mirror it here or retention silently 403s again.
|
||||
|
||||
data "vault_kv_secret_v2" "forgejo_viktor" {
|
||||
mount = "secret"
|
||||
|
|
@ -14,8 +22,12 @@ data "vault_kv_secret_v2" "forgejo_viktor" {
|
|||
}
|
||||
|
||||
locals {
|
||||
# Flip to false after first 7 days of dry-run logs look correct.
|
||||
forgejo_cleanup_dry_run = true
|
||||
# Activated 2026-06-09 after verifying a dry-run delete list against all
|
||||
# running viktor/* images cluster-wide: 0 running images on the delete set
|
||||
# (would prune 317 stale versions, keeping newest 10 + latest + cache tags).
|
||||
# Live retention is what keeps the registry PVC from filling on the HDD
|
||||
# (we deliberately did NOT move Forgejo to SSD — see beads code-oflt).
|
||||
forgejo_cleanup_dry_run = false
|
||||
}
|
||||
|
||||
resource "kubernetes_config_map" "forgejo_cleanup_script" {
|
||||
|
|
|
|||
|
|
@ -2,8 +2,13 @@
|
|||
# Forgejo container-package retention.
|
||||
#
|
||||
# For each container package owned by ${FORGEJO_OWNER}, keep newest
|
||||
# ${KEEP_LAST_N} versions + always keep tag "latest". Deletes the rest via
|
||||
# ${KEEP_LAST_N} versions + always keep tag "latest" + always keep any
|
||||
# buildkit cache tag (matches "cache", e.g. tripit:cache — these back
|
||||
# --cache-from/--cache-to and must survive retention or every build is a
|
||||
# cold rebuild). Deletes the rest via
|
||||
# DELETE /api/v1/packages/{owner}/container/{name}/{version}.
|
||||
# (Note: an 8-char SHA tag is pure hex and cannot contain "cache" — 'h' is
|
||||
# not a hex digit — so the cache match never catches a real image tag.)
|
||||
#
|
||||
# DRY_RUN=true logs what would be deleted but issues no DELETE calls.
|
||||
#
|
||||
|
|
@ -72,9 +77,11 @@ for NAME in $NAMES; do
|
|||
N_VERSIONS=$(jq 'length' "$TMPDIR/$NAME.json")
|
||||
echo "[$NAME] $N_VERSIONS version(s)"
|
||||
|
||||
# Build the keep set: top $KEEP + anything tagged 'latest'.
|
||||
# Build the keep set: top $KEEP + always 'latest' + any buildkit cache tag.
|
||||
jq -r --argjson keep "$KEEP" '
|
||||
[.[0:$keep][].version] + [.[] | select(.version == "latest") | .version]
|
||||
[.[0:$keep][].version]
|
||||
+ [.[] | select(.version == "latest") | .version]
|
||||
+ [.[] | select(.version | test("cache"; "i")) | .version]
|
||||
| unique
|
||||
| .[]
|
||||
' "$TMPDIR/$NAME.json" > "$TMPDIR/$NAME.keep"
|
||||
|
|
|
|||
|
|
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "forgejo" {
|
|||
name = "forgejo"
|
||||
labels = {
|
||||
"istio-injection" : "disabled"
|
||||
tier = local.tiers.edge
|
||||
tier = local.tiers.edge
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
|
|
@ -94,7 +94,7 @@ resource "kubernetes_deployment" "forgejo" {
|
|||
fs_group = 1000
|
||||
}
|
||||
container {
|
||||
name = "forgejo"
|
||||
name = "forgejo"
|
||||
# Pinned to 11.0.14 (latest 11.x as of 2026-05-12) — was on
|
||||
# floating `:11`. On 2026-05-24T15:35:37Z Keel force-policy
|
||||
# rewrote the tag from `11.0.14 → 1.18` (Gitea-era Forgejo
|
||||
|
|
@ -168,13 +168,19 @@ resource "kubernetes_deployment" "forgejo" {
|
|||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
# Bumped 1Gi -> 3Gi 2026-06-09: Forgejo was OOMKilled (exit 137)
|
||||
# under registry-push load from in-cluster CI builds (tripit
|
||||
# buildkit pushes large layers into the OCI registry). VPA
|
||||
# upperBound reads ~1.5Gi, but that's suppressed by the 1Gi cap it
|
||||
# kept OOMing against — size for the push spike, not steady-state.
|
||||
# requests=limits (Guaranteed QoS) per the repo memory convention.
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "15m"
|
||||
memory = "1Gi"
|
||||
memory = "3Gi"
|
||||
}
|
||||
limits = {
|
||||
memory = "1Gi"
|
||||
memory = "3Gi"
|
||||
}
|
||||
}
|
||||
port {
|
||||
|
|
@ -202,7 +208,7 @@ resource "kubernetes_deployment" "forgejo" {
|
|||
metadata[0].annotations["keel.sh/match-tag"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||
metadata[0].annotations["kubernetes.io/change-cause"],
|
||||
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
|
||||
|
|
|
|||
|
|
@ -15,6 +15,15 @@
|
|||
locals {
|
||||
governance_tiers = ["0-core", "1-cluster", "2-gpu", "3-edge", "4-aux"]
|
||||
excluded_namespaces = ["kube-system", "metallb-system", "kyverno", "calico-system", "calico-apiserver"]
|
||||
|
||||
# GPU-priority injection exclude list. Adds `tts` to the base set so the
|
||||
# `inject-gpu-workload-priority` policy does NOT stamp the immich-equal
|
||||
# gpu-workload (1,200,000) priority on Chatterbox-TTS pods. Chatterbox is a
|
||||
# best-effort off-peak batch tenant on the shared T4: it must keep its
|
||||
# tier-2-gpu (600,000) priority so it is ALWAYS the pod evicted under GPU-node
|
||||
# pressure, never immich-ml/frigate/llama-swap. See the tts stack
|
||||
# (stacks/tts/) + docs/plans/2026-06-08-chatterbox-tts-infra.md §3.
|
||||
gpu_priority_excluded_namespaces = concat(local.excluded_namespaces, ["tts"])
|
||||
}
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
|
|
@ -905,7 +914,10 @@ resource "kubectl_manifest" "mutate_gpu_priority" {
|
|||
any = [
|
||||
{
|
||||
resources = {
|
||||
namespaces = local.excluded_namespaces
|
||||
# tts added so Chatterbox-TTS keeps tier-2-gpu priority (it's a
|
||||
# best-effort off-peak batch tenant — must be evicted first,
|
||||
# not promoted to immich-equal gpu-workload). See locals above.
|
||||
namespaces = local.gpu_priority_excluded_namespaces
|
||||
}
|
||||
}
|
||||
]
|
||||
|
|
|
|||
119
stacks/stem95su/gdrive-sync.tf
Normal file
|
|
@ -0,0 +1,119 @@
|
|||
# Automatic Google Drive -> site sync (added 2026-06-09; supersedes the
|
||||
# earlier on-demand-only model now that content is actively maintained).
|
||||
#
|
||||
# A CronJob mirrors the READ-ONLY Drive folder "claude" (servable content in
|
||||
# subfolder "stem claude/files/") onto the NFS content volume every 10 min via
|
||||
# rclone. rclone is delta-aware: an unchanged run lists ~33 files' metadata and
|
||||
# transfers nothing, so the schedule is cheap (not a 24MB re-download). nginx
|
||||
# keeps serving the same volume read-only; updates appear within ~5s (actimeo).
|
||||
#
|
||||
# Drive is treated strictly READ-ONLY: scope=drive.readonly and rclone only ever
|
||||
# reads the remote (sync gdrive: -> /data), never writes back.
|
||||
#
|
||||
# TOKEN LONGEVITY: the GCP OAuth app (project home-lab-1700868541205) MUST be
|
||||
# published to "Production" or its refresh token expires ~weekly and this job
|
||||
# fails. After publishing, re-mint the token and refresh
|
||||
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
|
||||
|
||||
resource "kubernetes_manifest" "rclone_external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "stem95su-rclone"
|
||||
namespace = kubernetes_namespace.stem95su.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "1h"
|
||||
secretStoreRef = {
|
||||
name = "vault-kv"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = { name = "stem95su-rclone" }
|
||||
data = [{
|
||||
secretKey = "rclone.conf"
|
||||
remoteRef = {
|
||||
key = "stem95su"
|
||||
property = "rclone_conf"
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.stem95su]
|
||||
}
|
||||
|
||||
resource "kubernetes_cron_job_v1" "gdrive_sync" {
|
||||
metadata {
|
||||
name = "stem95su-gdrive-sync"
|
||||
namespace = kubernetes_namespace.stem95su.metadata[0].name
|
||||
labels = { run = "stem95su", component = "gdrive-sync" }
|
||||
}
|
||||
spec {
|
||||
schedule = "*/10 * * * *"
|
||||
concurrency_policy = "Forbid"
|
||||
successful_jobs_history_limit = 2
|
||||
failed_jobs_history_limit = 3
|
||||
job_template {
|
||||
metadata {}
|
||||
spec {
|
||||
backoff_limit = 1
|
||||
ttl_seconds_after_finished = 86400
|
||||
template {
|
||||
metadata { labels = { run = "stem95su", component = "gdrive-sync" } }
|
||||
spec {
|
||||
restart_policy = "OnFailure"
|
||||
container {
|
||||
name = "rclone"
|
||||
image = "docker.io/rclone/rclone:1.74.3"
|
||||
# Mirror Drive folder -> /data. Guard: hard-fail on auth/list error
|
||||
# (so an expired token is visible); skip quietly if the source is
|
||||
# empty / missing the dashboard (never wipe the live site);
|
||||
# --max-delete caps catastrophic deletes from a partial listing.
|
||||
command = ["/bin/sh", "-c", <<-EOT
|
||||
set -eu
|
||||
cp /config/rclone.conf /tmp/rc.conf
|
||||
SRC="gdrive:stem claude/files"
|
||||
LIST=$(rclone --config /tmp/rc.conf lsf "$SRC" --files-only) || { echo "FATAL: Drive list failed (auth/network)"; exit 1; }
|
||||
N=$(printf '%s\n' "$LIST" | grep -c . || true)
|
||||
if [ "$N" -lt 1 ] || ! printf '%s\n' "$LIST" | grep -qx "stem_board.html"; then
|
||||
echo "GUARD: source N=$N / stem_board.html missing -- skipping, site untouched"; exit 0
|
||||
fi
|
||||
echo "source OK ($N files) -- mirroring to /data"
|
||||
rclone --config /tmp/rc.conf sync "$SRC" /data --exclude ".DS_Store" --fast-list --transfers 4 --max-delete 25 -v
|
||||
EOT
|
||||
]
|
||||
resources {
|
||||
requests = { cpu = "10m", memory = "64Mi" }
|
||||
limits = { memory = "192Mi" }
|
||||
}
|
||||
volume_mount {
|
||||
name = "rclone-config"
|
||||
mount_path = "/config"
|
||||
read_only = true
|
||||
}
|
||||
volume_mount {
|
||||
name = "content"
|
||||
mount_path = "/data"
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "rclone-config"
|
||||
secret { secret_name = "stem95su-rclone" }
|
||||
}
|
||||
volume {
|
||||
name = "content"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_content.claim_name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
depends_on = [kubernetes_manifest.rclone_external_secret]
|
||||
}
|
||||
|
|
@ -65,6 +65,15 @@ locals {
|
|||
SMTP_USER = "spam@viktorbarzin.me"
|
||||
SMTP_FROM = "plans@viktorbarzin.me"
|
||||
PUBLIC_BASE_URL = "https://tripit.viktorbarzin.me"
|
||||
# Narrator audio (ADR-0004): Chatterbox via the in-cluster `tts` stack.
|
||||
# OpenAI-compatible /v1/audio/speech; the bake POSTs best-effort synth
|
||||
# requests, so a down/Pending Chatterbox is a clean skip (browser-TTS
|
||||
# fallback), never a bake error. ClusterIP-only → no token. Note: the mode
|
||||
# is `openai_compatible` (tripit renamed it from `chatterbox`); TTS_MODEL is
|
||||
# still the `chatterbox` family string tripit sends as the OpenAI `model`.
|
||||
TTS_MODE = "openai_compatible"
|
||||
TTS_BASE_URL = "http://chatterbox-tts.tts.svc.cluster.local:8000"
|
||||
TTS_MODEL = "chatterbox"
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
149
stacks/tts/README.md
Normal file
|
|
@ -0,0 +1,149 @@
|
|||
# tts — Chatterbox TTS (tripit narration)
|
||||
|
||||
In-cluster text-to-speech for tripit's "Tour guide". Runs the
|
||||
[devnen/Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server)
|
||||
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
|
||||
Deployment + ClusterIP Service `chatterbox-tts.tts.svc.cluster.local:8000`,
|
||||
requesting **one time-slice** of the shared Tesla T4 (`nvidia.com/gpu: 1`).
|
||||
|
||||
Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
|
||||
`docs/plans/2026-06-08-chatterbox-tts-infra.md` (in the tripit-tour-guide repo)
|
||||
and `infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.
|
||||
|
||||
> This stack mirrors `infra/stacks/llama-cpp/`. The scaffolding files
|
||||
> (`backend.tf`, `providers.tf`, `cloudflare_provider.tf`, `tiers.tf`,
|
||||
> `.terraform.lock.hcl`) are **generated by Terragrunt** on `init` and are
|
||||
> git-ignored — only `main.tf`, `terragrunt.hcl` and this README are tracked.
|
||||
|
||||
---
|
||||
|
||||
## What this stack creates
|
||||
|
||||
- `kubernetes_namespace.tts` — tier `2-gpu`, keel-enrolled, istio off.
|
||||
- `module.nfs_models` — RWX NFS-SSD PVC at `/srv/nfs-ssd/chatterbox`, mounted at
|
||||
`/data` (predefined voices, narrator reference WAVs, **and** the HuggingFace
|
||||
model cache via `HF_HOME=/data/hf_cache`, so weights download once and persist
|
||||
across the per-window pod recreation).
|
||||
- `kubernetes_config_map.chatterbox_config` — `config.yaml`: `server.port=8004`,
|
||||
`model.repo_id=chatterbox-multilingual`, `tts_engine.device=cuda`, voices /
|
||||
reference paths under `/data`.
|
||||
- `kubernetes_deployment.chatterbox` — **starts at `replicas=0`**; the off-peak
|
||||
CronJobs own the replica count at runtime. `TTS_BF16=off` (T4 = Turing, no
|
||||
bf16). `priority_class_name=tier-2-gpu` (the polite-tenant demotion).
|
||||
- `kubernetes_service.chatterbox` — ClusterIP, **`port 8000 → targetPort 8004`**
|
||||
so tripit's default `TTS_BASE_URL` works unchanged. Prometheus scrape
|
||||
annotations.
|
||||
- **Off-peak control** (SA + Role + RoleBinding + 3 CronJobs): see below.
|
||||
|
||||
## Off-peak control (Option A — window + free-VRAM gate)
|
||||
|
||||
The T4 is time-sliced with **zero VRAM isolation** (post-mortem 2026-06-02), so
|
||||
`nvidia.com/gpu: 1` buys a scheduling turn, NOT memory. Chatterbox must only
|
||||
allocate VRAM when the card is actually free. Implemented as three CronJobs
|
||||
(all `Europe/London`), each a `bitnami/kubectl` pod using the namespace SA:
|
||||
|
||||
| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up` | `0 2 * * *` | **Preflight**: scrape `gpu_pod_memory_used_bytes` from `gpu-pod-exporter.nvidia.svc:80/metrics`, compute `free = 16 GiB − Σused`; scale to **1 only if** `free ≥ vram_free_floor_bytes`. |
| `chatterbox-vram-guard` | `*/5 2-5 * * *` | **Guard**: every 5 min in-window, scale to **0** if `free < floor` (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *` | **Window end**: scale to **0** unconditionally. |
|
||||
|
||||
`tripit`'s bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or
|
||||
aborted window simply backfills on the next one. No latency SLA.
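
The preflight and guard arithmetic described above can be sketched as a small POSIX shell helper. This is an illustration only, not the CronJob script from `main.tf` (which is not shown here): the function names `sum_used_bytes` and `vram_gate` are hypothetical, and it assumes the exporter emits standard Prometheus text lines whose last field is the sample value.

```bash
#!/bin/sh
# Hypothetical helper: sum every gpu_pod_memory_used_bytes sample read on stdin.
# Comment lines (# HELP / # TYPE) and other metric names are ignored.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { s += $NF } END { printf "%.0f\n", s }'
}

# Hypothetical gate: vram_gate TOTAL_BYTES FLOOR_BYTES, metrics text on stdin.
# Prints "up" when free = total - sum(used) is at least the floor, else "hold".
vram_gate() {
  total="$1"; floor="$2"
  used="$(sum_used_bytes)"
  free=$(( total - used ))
  if [ "$free" -ge "$floor" ]; then echo up; else echo hold; fi
}
```

Feeding the exporter's `/metrics` output into `vram_gate 17179869184 6442450944` would apply the 16 GiB total and the 6 GiB default floor. The window-up job would act on `up`, while the guard would scale to 0 on `hold`.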
|
||||
|
||||
### The free-VRAM floor — YOU MUST MEASURE THIS
|
||||
|
||||
`var.vram_free_floor_bytes` defaults to **6 GiB** (a conservative guess:
|
||||
~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the
|
||||
read→`cudaMalloc` race). **The real T4 peak of `chatterbox-multilingual` is not
|
||||
published upstream.** Capture it during the first bake:
|
||||
|
||||
```bash
|
||||
# while a real synth is running on the freed T4:
|
||||
kubectl -n monitoring exec deploy/prometheus -- \
|
||||
promtool query instant http://localhost:9090 \
|
||||
'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
|
||||
# or read the gauge straight from the exporter:
|
||||
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
|
||||
sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
|
||||
```
|
||||
|
||||
Then set the floor to `measured_peak + ~2 GiB` (pass `-var` or add to the stack
|
||||
tfvars). If the peak is too high to coexist even off-peak, switch
|
||||
`model.repo_id` in `main.tf` to `chatterbox` (English, lighter) or
|
||||
`chatterbox-turbo`, or escalate to Option B (scale `immich-machine-learning` to
|
||||
0 for the window).
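
As a worked example of the `measured_peak + ~2 GiB` rule, the conversion to bytes can be sketched in shell. The helper name `floor_bytes` is hypothetical, and the 4 GiB peak used below is only the assumed figure behind the current default, not a measurement.

```bash
#!/bin/sh
GIB=$((1024 * 1024 * 1024))

# Hypothetical helper: floor_bytes PEAK_BYTES [HEADROOM_GIB]
# Prints the measured peak plus headroom (default 2 GiB) in bytes.
floor_bytes() {
  peak="$1"; headroom_gib="${2:-2}"
  echo $(( peak + headroom_gib * GIB ))
}

# Assumed 4 GiB peak + 2 GiB headroom reproduces the 6 GiB default.
floor_bytes $((4 * GIB))
```

The printed number is the value that would go into `vram_free_floor_bytes`. Substitute the peak actually read from `gpu_pod_memory_used_bytes{namespace="tts"}` once it has been captured.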
|
||||
|
||||
---
|
||||
|
||||
## Build + push the image (do this BEFORE the first apply)
|
||||
|
||||
`devnen/Chatterbox-TTS-Server` ships **no published image** — build from the
|
||||
repo's **cu128** target (matches the cluster's pinned 570.195.03 / CUDA 12.8
|
||||
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
|
||||
to `forgejo.viktorbarzin.me`. Run on the devvm (large CUDA image — needs disk +
|
||||
bandwidth):
|
||||
|
||||
```bash
|
||||
# 1. Clone the upstream server repo (outside the monorepo).
|
||||
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
|
||||
cd /tmp/chatterbox-tts-server
|
||||
|
||||
# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
|
||||
# the repo's docker-compose-cu128.yml uses) for linux/amd64.
|
||||
SHA="$(git rev-parse --short=8 HEAD)"
|
||||
docker build \
|
||||
--platform linux/amd64 \
|
||||
--build-arg RUNTIME=nvidia \
|
||||
-f Dockerfile.cu128 \
|
||||
-t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
|
||||
-t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
|
||||
.
|
||||
|
||||
# 3. Push both tags. If docker isn't authed, log in first with the viktor push
#    PAT from Vault (`vault kv get -field=forgejo_push_token secret/ci/global`):
#    `docker login forgejo.viktorbarzin.me -u viktor`.
|
||||
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
|
||||
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
|
||||
```
|
||||
|
||||
> If `Dockerfile.cu128` is not a clean `docker build` target (e.g. it relies on
|
||||
> build args defined only in `docker-compose-cu128.yml`), lift those args onto
|
||||
> the `docker build` line or `docker compose -f docker-compose-cu128.yml build`
|
||||
> then `docker tag` the resulting `chatterbox-tts-server:cu128` image to the
|
||||
> Forgejo ref above before pushing.
|
||||
|
||||
---
|
||||
|
||||
## Apply (admin-gated — run in order)
|
||||
|
||||
```bash
|
||||
vault login -method=oidc
|
||||
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
|
||||
~/code/scripts/presence claim stack:tts --purpose "chatterbox-tts stack apply"
|
||||
|
||||
# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
|
||||
~/code/scripts/tg plan --stack kyverno
|
||||
~/code/scripts/tg apply --stack kyverno
|
||||
|
||||
# 2. This stack.
|
||||
~/code/scripts/tg plan --stack tts
|
||||
~/code/scripts/tg apply --stack tts # apply does NOT wake the GPU (replicas=0)
|
||||
|
||||
# 3. Flip tripit narration on.
|
||||
~/code/scripts/tg plan --stack tripit
|
||||
~/code/scripts/tg apply --stack tripit
|
||||
```
|
||||
|
||||
See `docs/plans/2026-06-08-chatterbox-tts-infra.md` §5 for the full go-live
|
||||
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).
|
||||
|
||||
## Rollback (instant, no data loss)
|
||||
|
||||
- **Narration off:** set `TTS_MODE=none` (or drop the three `TTS_*` lines) in
|
||||
`stacks/tripit/main.tf` → `tg apply --stack tripit`. The bake makes no audio;
|
||||
playback falls back to browser TTS. Cached `story_audio` rows are harmless.
|
||||
- **Chatterbox off the GPU:** `kubectl -n tts scale deploy/chatterbox-tts
--replicas=0` (transient: the window-up CronJob scales it back to 1 at the next
window if the preflight passes) and/or `tg destroy --stack tts` (stays off until
re-applied). Best-effort synth means tripit bakes keep running without audio,
with no error.
|
||||
- Neither touches the resident GPU tenants (Option A never modifies them).
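
To confirm the residents' VRAM footprint after a rollback, or to watch the neighbours during go-live, the gauge the off-peak gate reads can be summed by hand. A minimal sketch, assuming the `gpu_pod_memory_used_bytes` gauge is exposed in Prometheus text format; `vram_free` and `gate_verdict` are hypothetical helper names that mirror the gate's arithmetic rather than reuse its code.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the arithmetic of the stack's gate script for manual checks.

# Read Prometheus text from stdin; print free bytes = total - sum(used).
# $1 = total VRAM in bytes.
vram_free() {
  awk -v total="$1" '
    /^gpu_pod_memory_used_bytes\{/ { used += $NF }
    END { printf "%.0f\n", total - used }'
}

# Print PASS when free >= floor (the admission rule), otherwise SKIP.
# $1 = free bytes, $2 = floor bytes.
gate_verdict() {
  if [ "$1" -ge "$2" ]; then echo PASS; else echo SKIP; fi
}
```

For example, after `kubectl -n nvidia port-forward svc/gpu-pod-exporter 8080:80` (service name and port taken from the gate script's `METRICS_URL`), run `free="$(curl -sf localhost:8080/metrics | vram_free 17179869184)"` and then `gate_verdict "$free" 6442450944`, using the stack's default total and floor.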
|
||||
474
stacks/tts/main.tf
Normal file
|
|
@ -0,0 +1,474 @@
|
|||
variable "image_tag" {
|
||||
type = string
|
||||
default = "latest"
|
||||
description = "chatterbox-tts image tag. Use the 8-char git SHA in CI; :latest for local trials."
|
||||
}
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Option-A off-peak control (see docs/plans/2026-06-08-chatterbox-tts-infra.md §3).
|
||||
# The Deployment sits at replicas=0; a CronJob scales it to 1 at the window start
|
||||
# ONLY IF a free-VRAM preflight passes, and another scales it back to 0 at window
|
||||
# end. A guard CronJob yields the card mid-window if free VRAM drops below the
|
||||
# floor (a resident woke up). tripit's bake is best-effort + idempotent, so a
|
||||
# skipped/aborted window simply backfills on the next one (ADR-0002/0004).
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
variable "vram_free_floor_bytes" {
|
||||
type = number
|
||||
# OPEN ITEM — must be measured (§5 smoke test / §3.X). This is the minimum free
|
||||
# VRAM the preflight requires before it will scale Chatterbox up, and the floor
|
||||
# the guard yields below. Default = 6 GiB: a conservative guess of ~4 GiB for the
# chatterbox-multilingual FP16 peak plus ~2 GiB headroom for the
# read→cudaMalloc race. RAISE/LOWER once the real T4 peak is captured from
|
||||
# gpu_pod_memory_used_bytes{namespace="tts"} during a real synth.
|
||||
default = 6442450944
|
||||
description = "Minimum free GPU VRAM (bytes) required before scaling Chatterbox up; guard yields below it."
|
||||
}
|
||||
|
||||
variable "gpu_total_bytes" {
|
||||
type = number
|
||||
default = 17179869184 # Tesla T4 = 16 GiB
|
||||
description = "Total VRAM on the shared GPU. Free = this minus sum(gpu_pod_memory_used_bytes)."
|
||||
}
|
||||
|
||||
variable "offpeak_window_up_schedule" {
|
||||
type = string
|
||||
default = "0 2 * * *" # 02:00 Europe/London (see timezone on the CronJob)
|
||||
description = "Cron schedule that fires the free-VRAM preflight + scale-up at window start."
|
||||
}
|
||||
|
||||
variable "offpeak_window_down_schedule" {
|
||||
type = string
|
||||
default = "0 6 * * *" # 06:00 Europe/London
|
||||
description = "Cron schedule that scales Chatterbox back to 0 at window end."
|
||||
}
|
||||
|
||||
variable "offpeak_guard_schedule" {
|
||||
type = string
|
||||
default = "*/5 2-5 * * *" # every 5 min inside the 02:00–06:00 window
|
||||
description = "Cron schedule for the mid-window guard that yields the card if free VRAM drops."
|
||||
}
|
||||
|
||||
locals {
|
||||
namespace = "tts"
|
||||
labels = { app = "chatterbox-tts" }
|
||||
image = "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${var.image_tag}"
|
||||
|
||||
# config.yaml rendered into a ConfigMap, mounted at /app/config.yaml (the
|
||||
# server's WORKDIR is /app). Voices, reference audio and the HF model cache
|
||||
# all live on the NFS-SSD PVC (mounted at /data) so weights persist across
|
||||
# restarts and load fast. server.port stays at the devnen default 8004; the
|
||||
# Service remaps 8000->8004 so tripit's default TTS_BASE_URL works unchanged.
|
||||
#
|
||||
# model.repo_id = chatterbox-multilingual (ADR-0004; 23 languages for
|
||||
# worldwide place-names). If the measured T4 VRAM peak is too high to coexist
|
||||
# even off-peak, fall back to "chatterbox" (English, lighter) — a one-line
|
||||
# change here (§3.X / §6 decision 3).
|
||||
chatterbox_config = yamlencode({
|
||||
server = {
|
||||
host = "0.0.0.0"
|
||||
port = 8004
|
||||
}
|
||||
model = {
|
||||
repo_id = "chatterbox-multilingual"
|
||||
}
|
||||
tts_engine = {
|
||||
device = "cuda"
|
||||
predefined_voices_path = "/data/voices"
|
||||
reference_audio_path = "/data/reference_audio"
|
||||
}
|
||||
})
|
||||
|
||||
# Shared script for the off-peak CronJobs. Reads the in-cluster
|
||||
# gpu_pod_memory_used_bytes gauge (the per-namespace gauge the 2026-06-02
|
||||
# post-mortem built — host-PID attribution, no new exporter needed), sums it,
|
||||
# and computes free = GPU_TOTAL - used. Pure POSIX sh + awk, plus curl for the
# scrape, so the CronJob image (bitnami/kubectl below) must ship curl and awk.
# ACTION is "up" | "down" | "guard".
|
||||
# up — scale to 1 ONLY IF free >= FLOOR (positive admission).
|
||||
# guard — scale to 0 IF free < FLOOR (a resident woke mid-window; yield).
|
||||
# down — scale to 0 unconditionally (window end).
|
||||
# Heredoc escaping: only `$${...}` (literal `${...}`) is escaped — Terraform
|
||||
# would otherwise try to interpolate it. Bare `$(...)`, `$((...))` and awk's
|
||||
# `$NF` are literal `$` and pass through unescaped.
|
||||
vram_gate_script = <<-EOT
|
||||
set -eu
|
||||
: "$${ACTION:?}" "$${FLOOR:?}" "$${GPU_TOTAL:?}"
|
||||
METRICS_URL="http://gpu-pod-exporter.nvidia.svc.cluster.local:80/metrics"
|
||||
|
||||
# Sum gpu_pod_memory_used_bytes across all pods. Missing metric / empty
|
||||
# scrape => used=0 (card idle). -f so a non-200 scrape is a hard error we
|
||||
# treat conservatively (skip scale-up).
|
||||
if ! BODY="$(curl -sf -m 10 "$${METRICS_URL}")"; then
|
||||
echo "WARN: could not scrape $${METRICS_URL}"
|
||||
if [ "$${ACTION}" = "up" ]; then
|
||||
echo "preflight: scrape failed -> NOT scaling up (fail-safe)"; exit 0
|
||||
fi
|
||||
# For down, a failed scrape must NOT block yielding the card. For guard, the
# empty body reads as used=0, so the guard does not trip on a failed scrape.
|
||||
BODY=""
|
||||
fi
|
||||
USED="$(printf '%s\n' "$${BODY}" \
|
||||
| awk '/^gpu_pod_memory_used_bytes\{/ { s += $NF } END { printf "%.0f", s }')"
|
||||
USED="$${USED:-0}"
|
||||
FREE="$(( GPU_TOTAL - USED ))"
|
||||
echo "GPU VRAM: used=$${USED} free=$${FREE} floor=$${FLOOR} (total=$${GPU_TOTAL})"
|
||||
|
||||
case "$${ACTION}" in
|
||||
up)
|
||||
if [ "$${FREE}" -ge "$${FLOOR}" ]; then
|
||||
echo "preflight PASS: free >= floor -> scaling chatterbox-tts to 1"
|
||||
kubectl -n tts scale deploy/chatterbox-tts --replicas=1
|
||||
else
|
||||
echo "preflight SKIP: free < floor -> leaving chatterbox-tts at 0 (retry next window)"
|
||||
fi
|
||||
;;
|
||||
guard)
|
||||
if [ "$${FREE}" -lt "$${FLOOR}" ]; then
|
||||
echo "guard TRIP: free < floor -> yielding the card, scaling chatterbox-tts to 0"
|
||||
kubectl -n tts scale deploy/chatterbox-tts --replicas=0
|
||||
else
|
||||
echo "guard OK: free >= floor -> chatterbox-tts may keep running"
|
||||
fi
|
||||
;;
|
||||
down)
|
||||
echo "window end -> scaling chatterbox-tts to 0"
|
||||
kubectl -n tts scale deploy/chatterbox-tts --replicas=0
|
||||
;;
|
||||
esac
|
||||
EOT
|
||||
|
||||
# Common spec for the three off-peak CronJobs. Each runs one bitnami/kubectl
|
||||
# pod (in-cluster SA, no kubeconfig) executing the shared gate script with a
|
||||
# different ACTION. timezone pins the window to Europe/London regardless of
|
||||
# node TZ.
|
||||
offpeak_cronjobs = {
|
||||
chatterbox-window-up = {
|
||||
schedule = var.offpeak_window_up_schedule
|
||||
action = "up"
|
||||
}
|
||||
chatterbox-window-down = {
|
||||
schedule = var.offpeak_window_down_schedule
|
||||
action = "down"
|
||||
}
|
||||
chatterbox-vram-guard = {
|
||||
schedule = var.offpeak_guard_schedule
|
||||
action = "guard"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_namespace" "tts" {
|
||||
metadata {
|
||||
name = local.namespace
|
||||
labels = {
|
||||
tier = local.tiers.gpu
|
||||
"istio-injection" = "disabled"
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
||||
}
|
||||
}
|
||||
|
||||
# Model weights + voices on NFS-SSD (fast load), RWX so a seed Job / kubectl cp
|
||||
# can write the predefined voices + narrator reference WAV while the Deployment
|
||||
# mounts it. Path /srv/nfs-ssd/chatterbox on the Proxmox host. Mirrors
|
||||
# llama-cpp's nfs_models. First start downloads the model into /data/hf_cache
|
||||
# (HF_HOME below), so weights persist across pod restarts.
|
||||
module "nfs_models" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "chatterbox-models"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
nfs_server = "192.168.1.127"
|
||||
nfs_path = "/srv/nfs-ssd/chatterbox"
|
||||
storage = "20Gi" # multilingual weights + HF cache + voices headroom
|
||||
}
|
||||
|
||||
resource "kubernetes_config_map" "chatterbox_config" {
|
||||
metadata {
|
||||
name = "chatterbox-config"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
data = {
|
||||
"config.yaml" = local.chatterbox_config
|
||||
}
|
||||
}
|
||||
|
||||
# Single Deployment running the devnen Chatterbox-TTS-Server (OpenAI-compatible
|
||||
# /v1/audio/speech). Sits at replicas=0 — the off-peak CronJobs below scale it
|
||||
# to 1 only when the free-VRAM preflight passes (Option A), and back to 0 at
|
||||
# window end. wait_for_rollout=false so apply never blocks on a pod that is
|
||||
# intentionally scaled to 0.
|
||||
resource "kubernetes_deployment" "chatterbox" {
|
||||
metadata {
|
||||
name = "chatterbox-tts"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
labels = merge(local.labels, { tier = local.tiers.gpu })
|
||||
}
|
||||
wait_for_rollout = false
|
||||
spec {
|
||||
# Off-peak control owns the replica count at runtime (CronJobs scale 0<->1).
|
||||
# Declare 0 here so a plain `tg apply` outside the window doesn't wake the
|
||||
# card. ignore_changes on replicas (below) stops apply from fighting the
|
||||
# CronJob's scale.
|
||||
replicas = 0
|
||||
strategy { type = "Recreate" }
|
||||
selector {
|
||||
match_labels = { app = "chatterbox-tts" }
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = { app = "chatterbox-tts" }
|
||||
annotations = {
|
||||
"checksum/config" = sha256(local.chatterbox_config)
|
||||
}
|
||||
}
|
||||
spec {
|
||||
node_selector = { "nvidia.com/gpu.present" = "true" }
|
||||
toleration {
|
||||
key = "nvidia.com/gpu"
|
||||
operator = "Equal"
|
||||
value = "true"
|
||||
effect = "NoSchedule"
|
||||
}
|
||||
# C-hardening (§3.RECOMMENDATION.3): Chatterbox is a polite, best-effort
|
||||
# batch tenant — give it the regular tier-2-gpu priority (600000) so it
|
||||
# is ALWAYS the pod evicted under GPU-node pressure, never immich-ml /
|
||||
# frigate / llama-swap. This relies on the `tts` namespace being EXCLUDED
|
||||
# from the Kyverno `inject-gpu-workload-priority` policy (which would
|
||||
# otherwise stamp the immich-equal gpu-workload=1,200,000 priority on any
|
||||
# nvidia.com/gpu pod). That exclusion is the two-line edit to the kyverno
|
||||
# stack flagged in the PR. Without it, this priority_class_name is
|
||||
# overwritten on pod CREATE and Chatterbox would compete as an equal.
|
||||
priority_class_name = "tier-2-gpu"
|
||||
|
||||
image_pull_secrets { name = "registry-credentials" }
|
||||
|
||||
container {
|
||||
name = "chatterbox-tts"
|
||||
image = local.image
|
||||
port {
|
||||
container_port = 8004
|
||||
name = "http"
|
||||
}
|
||||
|
||||
# T4 is Turing — NO bf16 (ADR-0004). Pin off; run FP16/FP32.
|
||||
env {
|
||||
name = "TTS_BF16"
|
||||
value = "off"
|
||||
}
|
||||
# Park the HuggingFace cache on the NFS-SSD PVC so model weights
|
||||
# download once and persist across pod restarts (the pod is recreated
|
||||
# every window). The devnen compose mounts HF cache at /app/hf_cache;
|
||||
# point HF_HOME at the PVC instead.
|
||||
env {
|
||||
name = "HF_HOME"
|
||||
value = "/data/hf_cache"
|
||||
}
|
||||
env {
|
||||
name = "HF_HUB_CACHE"
|
||||
value = "/data/hf_cache"
|
||||
}
|
||||
|
||||
volume_mount {
|
||||
name = "config"
|
||||
mount_path = "/app/config.yaml"
|
||||
sub_path = "config.yaml"
|
||||
}
|
||||
volume_mount {
|
||||
name = "models"
|
||||
mount_path = "/data"
|
||||
}
|
||||
|
||||
# /v1/audio/voices is cheap and only 200s once the model is loaded —
|
||||
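# so it gates real readiness. First start downloads the model, which
# is slow; the pod just stays unready until this returns 200. The actual
# startup budget is the liveness probe below (first check at 120 s, restart
# after 5 consecutive failures 30 s apart).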
|
||||
readiness_probe {
|
||||
http_get {
|
||||
path = "/v1/audio/voices"
|
||||
port = 8004
|
||||
}
|
||||
initial_delay_seconds = 20
|
||||
period_seconds = 15
|
||||
failure_threshold = 12
|
||||
}
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/v1/audio/voices"
|
||||
port = 8004
|
||||
}
|
||||
initial_delay_seconds = 120
|
||||
period_seconds = 30
|
||||
failure_threshold = 5
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "200m"
|
||||
memory = "2Gi"
|
||||
}
|
||||
limits = {
|
||||
memory = "8Gi"
|
||||
"nvidia.com/gpu" = "1" # ONE time-slice (operator advertises 100), NOT the whole card
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
volume {
|
||||
name = "config"
|
||||
config_map {
|
||||
name = kubernetes_config_map.chatterbox_config.metadata[0].name
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "models"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_models.claim_name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
# Off-peak CronJobs own the replica count — don't let apply reset it.
|
||||
spec[0].replicas,
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
|
||||
metadata[0].annotations["keel.sh/match-tag"],
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
metadata[0].annotations["kubernetes.io/change-cause"],
|
||||
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "chatterbox" {
|
||||
metadata {
|
||||
name = "chatterbox-tts"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
labels = local.labels
|
||||
annotations = {
|
||||
# Prometheus annotation-based scrape (mirrors tripit). The devnen server
|
||||
# has no /metrics; this monitors liveness via the blackbox path and keeps
|
||||
# the Service in the scrape set if a /metrics endpoint is added later.
|
||||
"prometheus.io/scrape" = "true"
|
||||
"prometheus.io/path" = "/v1/audio/voices"
|
||||
"prometheus.io/port" = "8000"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
type = "ClusterIP" # in-cluster only — never ingressed (no token needed)
|
||||
selector = { app = "chatterbox-tts" }
|
||||
port {
|
||||
name = "http"
|
||||
port = 8000 # tripit's default TTS_BASE_URL port
|
||||
target_port = 8004 # the devnen server's actual listen port
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Option-A off-peak control: SA + Role (scale the Deployment) + RoleBinding +
|
||||
# three CronJobs (window-up preflight, mid-window guard, window-down). Mirrors
|
||||
# the nextcloud-watchdog in-cluster-kubectl pattern (SA → Role → bitnami/kubectl
|
||||
# CronJob, no kubeconfig).
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
resource "kubernetes_service_account" "offpeak" {
|
||||
metadata {
|
||||
name = "chatterbox-offpeak"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role" "offpeak" {
|
||||
metadata {
|
||||
name = "chatterbox-offpeak"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
}
|
||||
# get + patch on the deployment scale subresource is all the gate needs.
|
||||
rule {
|
||||
api_groups = ["apps"]
|
||||
resources = ["deployments", "deployments/scale"]
|
||||
verbs = ["get", "patch"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "offpeak" {
|
||||
metadata {
|
||||
name = "chatterbox-offpeak"
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.offpeak.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.offpeak.metadata[0].name
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cron_job_v1" "offpeak" {
|
||||
for_each = local.offpeak_cronjobs
|
||||
|
||||
metadata {
|
||||
name = each.key
|
||||
namespace = kubernetes_namespace.tts.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
spec {
|
||||
schedule = each.value.schedule
|
||||
timezone = "Europe/London"
|
||||
concurrency_policy = "Forbid"
|
||||
starting_deadline_seconds = 120
|
||||
successful_jobs_history_limit = 1
|
||||
failed_jobs_history_limit = 3
|
||||
job_template {
|
||||
metadata { labels = local.labels }
|
||||
spec {
|
||||
backoff_limit = 1
|
||||
active_deadline_seconds = 120
|
||||
ttl_seconds_after_finished = 300
|
||||
template {
|
||||
metadata { labels = local.labels }
|
||||
spec {
|
||||
service_account_name = kubernetes_service_account.offpeak.metadata[0].name
|
||||
restart_policy = "Never"
|
||||
container {
|
||||
name = "vram-gate"
|
||||
image = "bitnami/kubectl:latest"
|
||||
command = ["/bin/bash", "-c", local.vram_gate_script]
|
||||
env {
|
||||
name = "ACTION"
|
||||
value = each.value.action
|
||||
}
|
||||
env {
|
||||
name = "FLOOR"
|
||||
value = tostring(var.vram_free_floor_bytes)
|
||||
}
|
||||
env {
|
||||
name = "GPU_TOTAL"
|
||||
value = tostring(var.gpu_total_bytes)
|
||||
}
|
||||
resources {
|
||||
requests = { cpu = "20m", memory = "64Mi" }
|
||||
limits = { memory = "128Mi" }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno mutates dns_config with ndots=2 on CronJobs.
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
|
||||
36
stacks/tts/terragrunt.hcl
Normal file
|
|
@ -0,0 +1,36 @@
|
|||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
dependency "vault" {
|
||||
config_path = "../vault"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
# tts: in-cluster text-to-speech for tripit's "Tour guide" narration.
|
||||
# One Deployment of `forgejo.viktorbarzin.me/viktor/chatterbox-tts` (devnen
|
||||
# Chatterbox-TTS-Server, OpenAI-compatible /v1/audio/speech) at a single
|
||||
# ClusterIP Service `chatterbox-tts.tts.svc:8000` (server listens on 8004;
|
||||
# the Service remaps). Requests ONE time-slice of the shared T4
|
||||
# (nvidia.com/gpu=1) — a slice, not the card.
|
||||
#
|
||||
# OOM-avoidance (Option A, docs/plans/2026-06-08-chatterbox-tts-infra.md §3):
|
||||
# the Deployment sits at replicas=0; an off-peak CronJob scales it to 1 at the
|
||||
# 02:00–06:00 Europe/London window ONLY IF a free-VRAM preflight passes
|
||||
# (gpu_pod_memory_used_bytes from gpu-pod-exporter), a guard CronJob yields the
|
||||
# card mid-window if a resident wakes, and a window-down CronJob scales back to
|
||||
# 0. tripit's bake is best-effort + cached-forever (ADR-0002/0004), so a
|
||||
# skipped/aborted window simply backfills next time — no latency SLA.
|
||||
#
|
||||
# Polite-tenant hardening: the `tts` namespace must be EXCLUDED from the kyverno
|
||||
# `inject-gpu-workload-priority` policy (a separate two-line edit to the kyverno
|
||||
# stack) so Chatterbox keeps tier-2-gpu priority (600000) and is always the pod
|
||||
# evicted under pressure — never immich-ml/frigate/llama-swap.
|
||||
#
|
||||
# Image is built from the devnen repo + pushed to Forgejo — see this stack's
|
||||
# README.md for the exact docker build + push commands.
|
||||