# tts: Chatterbox TTS (tripit narration)

In-cluster text-to-speech for tripit's "Tour guide". Runs the
[devnen/Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server)
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
Deployment + ClusterIP Service `chatterbox-tts.tts.svc.cluster.local:8000`,
requesting **one time-slice** of the shared Tesla T4 (`nvidia.com/gpu: 1`).

Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
`docs/plans/2026-06-08-chatterbox-tts-infra.md` (in the tripit-tour-guide repo) and
`infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.

> This stack mirrors `infra/stacks/llama-cpp/`. The scaffolding files
> (`backend.tf`, `providers.tf`, `cloudflare_provider.tf`, `tiers.tf`,
> `.terraform.lock.hcl`) are **generated by Terragrunt** on `init` and are
> git-ignored; only `main.tf`, `terragrunt.hcl` and this README are tracked.

---

## What this stack creates

- `kubernetes_namespace.tts`: tier `2-gpu`, keel-enrolled, istio off.
- `module.nfs_models`: RWX NFS-SSD PVC at `/srv/nfs-ssd/chatterbox`, mounted at
  `/data` (predefined voices, narrator reference WAVs, **and** the HuggingFace
  model cache via `HF_HOME=/data/hf_cache`, so weights download once and persist
  across the per-window pod recreation).
- `kubernetes_config_map.chatterbox_config`: `config.yaml` with
  `server.port=8004`, `model.repo_id=chatterbox-multilingual`,
  `tts_engine.device=cuda`, voices / reference paths under `/data`.
- `kubernetes_deployment.chatterbox`: **starts at `replicas=0`**; the off-peak
  CronJobs own the replica count at runtime. `TTS_BF16=off` (T4 = Turing, no
  bf16). `priority_class_name=tier-2-gpu` (the polite-tenant demotion).
- `kubernetes_service.chatterbox`: ClusterIP, **`port 8000 → targetPort 8004`**
  so tripit's default `TTS_BASE_URL` works unchanged. Prometheus scrape
  annotations.
- **Off-peak control** (SA + Role + RoleBinding + 3 CronJobs): see below.

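Because the Service maps port 8000 to the container's 8004, clients only ever address port 8000. The following is a minimal smoke-test sketch, not part of the stack. It assumes the server exposes the OpenAI-style `POST /v1/audio/speech` route and accepts the usual `model` / `input` / `voice` / `response_format` fields. The `BASE_URL` variable, the `speech_body` helper, the `chatterbox` model value and the `narrator.wav` voice name are all hypothetical placeholders. The request can only succeed while the Deployment is scaled to 1.

```bash
#!/bin/sh
# Hypothetical smoke test. The route and field names are assumptions about the
# server's OpenAI-compatible API; check them against the upstream docs.
BASE_URL=${BASE_URL:-http://chatterbox-tts.tts.svc.cluster.local:8000}

# Print a JSON request body for the given text ($1) and voice ($2).
# Assumes neither argument contains double quotes or backslashes.
speech_body() {
  printf '{"model":"chatterbox","input":"%s","voice":"%s","response_format":"wav"}\n' "$1" "$2"
}

# Usage, from a pod inside the cluster while replicas=1:
#   speech_body "Welcome to the old town." "narrator.wav" |
#     curl -sf -X POST "$BASE_URL/v1/audio/speech" \
#       -H 'Content-Type: application/json' --data-binary @- -o /tmp/smoke.wav
```

A non-empty `/tmp/smoke.wav` and a zero exit code from `curl -f` would suggest the Service, port mapping and model load all work.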
## Off-peak control (Option A: window + free-VRAM gate)

The T4 is time-sliced with **zero VRAM isolation** (post-mortem 2026-06-02), so
`nvidia.com/gpu: 1` buys a scheduling turn, NOT memory. Chatterbox must only
allocate VRAM when the card is actually free. Implemented as three CronJobs
(all `Europe/London`), each a `bitnami/kubectl` pod using the namespace SA:

| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up` | `0 2 * * *` | **Preflight**: scrape `gpu_pod_memory_used_bytes` from `gpu-pod-exporter.nvidia.svc:80/metrics`, compute `free = 16 GiB − Σused`; scale to **1 only if** `free ≥ vram_free_floor_bytes`. |
| `chatterbox-vram-guard` | `*/5 2-5 * * *` | **Guard**: every 5 min in-window, scale to **0** if `free < floor` (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *` | **Window end**: scale to **0** unconditionally. |

`tripit`'s bake is best-effort + cached-forever (ADR-0002/0004): a skipped or
aborted window simply backfills on the next one. No latency SLA.

### The free-VRAM floor: YOU MUST MEASURE THIS

`var.vram_free_floor_bytes` defaults to **6 GiB** (a conservative guess: ~4 GiB
assumed multilingual FP16 peak + ~2 GiB headroom for the read→`cudaMalloc`
race). **The real T4 peak of `chatterbox-multilingual` is not published
upstream.** Capture it during the first bake:

```bash
# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
  sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
```

Then set the floor to `measured_peak + ~2 GiB` (pass `-var` or add to the stack
tfvars).
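The gate arithmetic from the CronJob table (`free = 16 GiB − Σused`, compared against the floor) and the `measured_peak + ~2 GiB` rule can be sketched in POSIX shell as below. This is an illustration, not the script shipped in `main.tf`. The function names are hypothetical, and it assumes the exporter emits plain Prometheus text lines of the form `gpu_pod_memory_used_bytes{...} <value>` with no trailing timestamp.

```bash
#!/bin/sh
# Sketch of the free-VRAM gate. The real CronJob script may differ.
TOTAL_BYTES=$((16 * 1024 * 1024 * 1024))                  # 16 GiB, as in the table
FLOOR_BYTES=${FLOOR_BYTES:-$((6 * 1024 * 1024 * 1024))}   # default floor: 6 GiB

# Read Prometheus text format on stdin and print the sum of every
# gpu_pod_memory_used_bytes sample as an integer (0 if none are present).
# Assumes the sample value is the last field on each line.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { sum += $NF } END { printf "%.0f\n", sum }'
}

# Read metrics on stdin; print "up" if free VRAM meets the floor, else "down".
gate_decision() {
  used=$(sum_used_bytes)
  free=$((TOTAL_BYTES - used))
  if [ "$free" -ge "$FLOOR_BYTES" ]; then echo up; else echo down; fi
}

# Print a floor in bytes for a measured peak ($1, bytes): peak + 2 GiB headroom.
floor_from_peak() {
  echo $(($1 + 2 * 1024 * 1024 * 1024))
}

# Usage inside the cluster:
#   curl -s http://gpu-pod-exporter.nvidia.svc:80/metrics | gate_decision
#   floor_from_peak 4294967296   # a 4 GiB peak gives the 6 GiB default
```

Summing over all pods rather than filtering by namespace matches the table's `Σused`: the gate cares about every tenant on the card, including Chatterbox itself once it is running.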
If the peak is too high to coexist even off-peak, switch `model.repo_id` in
`main.tf` to `chatterbox` (English, lighter) or `chatterbox-turbo`, or escalate
to Option B (scale `immich-machine-learning` to 0 for the window).

---

## Build + push the image (do this BEFORE the first apply)

`devnen/Chatterbox-TTS-Server` ships **no published image**, so build from the
repo's **cu128** target (matches the cluster's pinned 570.195.03 / CUDA 12.8
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
to `forgejo.viktorbarzin.me`.

Run on the devvm (large CUDA image; needs disk + bandwidth):

```bash
# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server

# 2. Build the cu128 variant (Dockerfile.cu128: PyTorch 2.9.0+cu128, the target
#    the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
  --platform linux/amd64 \
  --build-arg RUNTIME=nvidia \
  -f Dockerfile.cu128 \
  -t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
  -t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
  .

# 3. Push both tags. (If docker isn't authed: fetch the viktor push PAT from
#    Vault with `vault kv get -field=forgejo_push_token secret/ci/global`,
#    then run `docker login forgejo.viktorbarzin.me -u viktor`.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
```

> If `Dockerfile.cu128` is not a clean `docker build` target (e.g. it relies on
> build args defined only in `docker-compose-cu128.yml`), lift those args onto
> the `docker build` line or `docker compose -f docker-compose-cu128.yml build`
> then `docker tag` the resulting `chatterbox-tts-server:cu128` image to the
> Forgejo ref above before pushing.

---

## Apply (admin-gated, run in order)

```bash
vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts --purpose "chatterbox-tts stack apply"

# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan --stack kyverno
~/code/scripts/tg apply --stack kyverno

# 2. This stack.
~/code/scripts/tg plan --stack tts
~/code/scripts/tg apply --stack tts   # apply does NOT wake the GPU (replicas=0)

# 3. Flip tripit narration on.
~/code/scripts/tg plan --stack tripit
~/code/scripts/tg apply --stack tripit
```

See `docs/plans/2026-06-08-chatterbox-tts-infra.md` §5 for the full go-live
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).

## Rollback (instant, no data loss)

- **Narration off:** set `TTS_MODE=none` (or drop the three `TTS_*` lines) in
  `stacks/tripit/main.tf` → `tg apply --stack tripit`. The bake makes no audio;
  playback falls back to browser TTS. Cached `story_audio` rows are harmless.
- **Chatterbox off the GPU:** `kubectl -n tts scale deploy/chatterbox-tts
  --replicas=0` (transient) and/or `tg destroy --stack tts`. Best-effort synth
  means tripit bakes keep running audio-less, with no error.
- Neither touches the resident GPU tenants (Option A never modifies them).