infra/stacks/tts/README.md
Viktor Barzin 48013a4a92 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00

# tts — Chatterbox TTS (tripit narration)
In-cluster text-to-speech for tripit's "Tour guide". Runs the
[devnen/Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server)
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
Deployment + ClusterIP Service `chatterbox-tts.tts.svc.cluster.local:8000`,
requesting **one time-slice** of the shared Tesla T4 (`nvidia.com/gpu: 1`).
Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
`docs/plans/2026-06-08-chatterbox-tts-infra.md` (in the tripit-tour-guide repo)
and `infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.
> This stack mirrors `infra/stacks/llama-cpp/`. The scaffolding files
> (`backend.tf`, `providers.tf`, `cloudflare_provider.tf`, `tiers.tf`,
> `.terraform.lock.hcl`) are **generated by Terragrunt** on `init` and are
> git-ignored — only `main.tf`, `terragrunt.hcl` and this README are tracked.
---
## What this stack creates
- `kubernetes_namespace.tts` — tier `2-gpu`, keel-enrolled, istio off.
- `module.nfs_models` — RWX NFS-SSD PVC at `/srv/nfs-ssd/chatterbox`, mounted at
`/data` (predefined voices, narrator reference WAVs, **and** the HuggingFace
model cache via `HF_HOME=/data/hf_cache`, so weights download once and persist
across the per-window pod recreation).
- `kubernetes_config_map.chatterbox_config`: `config.yaml` with `server.port=8004`,
`model.repo_id=chatterbox-multilingual`, `tts_engine.device=cuda`, voices /
reference paths under `/data`.
- `kubernetes_deployment.chatterbox`: **starts at `replicas=0`**; the off-peak
CronJobs own the replica count at runtime. `TTS_BF16=off` (T4 = Turing, no
bf16). `priority_class_name=tier-2-gpu` (the polite-tenant demotion).
- `kubernetes_service.chatterbox` — ClusterIP, **`port 8000 → targetPort 8004`**
so tripit's default `TTS_BASE_URL` works unchanged. Prometheus scrape
annotations.
- **Off-peak control** (SA + Role + RoleBinding + 3 CronJobs): see below.
## Off-peak control (Option A — window + free-VRAM gate)
The T4 is time-sliced with **zero VRAM isolation** (post-mortem 2026-06-02), so
`nvidia.com/gpu: 1` buys a scheduling turn, NOT memory. Chatterbox must only
allocate VRAM when the card is actually free. Implemented as three CronJobs
(all `Europe/London`), each a `bitnami/kubectl` pod using the namespace SA:

| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up` | `0 2 * * *` | **Preflight**: scrape `gpu_pod_memory_used_bytes` from `gpu-pod-exporter.nvidia.svc:80/metrics`, compute `free = 16 GiB - Σused`; scale to **1 only if** `free ≥ vram_free_floor_bytes`. |
| `chatterbox-vram-guard` | `*/5 2-5 * * *` | **Guard**: every 5 min in-window, scale to **0** if `free < floor` (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *` | **Window end**: scale to **0** unconditionally. |

`tripit`'s bake is best-effort + cached-forever (ADR-0002/0004), so a skipped or
aborted window simply backfills on the next one. No latency SLA.
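The gate arithmetic in the table can be sketched in a few lines of POSIX shell. This is an illustration under stated assumptions, not the script the CronJobs actually run (that lives in `main.tf` and is not reproduced here): the function names `sum_used_bytes` and `vram_gate_ok` and the `FLOOR` variable are hypothetical, and the exporter is assumed to emit plain Prometheus text samples with no trailing timestamp.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical, not taken from main.tf.

TOTAL_VRAM_BYTES=$((16 * 1024 * 1024 * 1024))  # the 16 GiB figure from the table

# Read Prometheus text-format metrics on stdin and print the sum of all
# gpu_pod_memory_used_bytes samples. "# HELP"/"# TYPE" lines and other metrics
# do not match the pattern. Assumes no trailing timestamp, so the sample value
# is the last field on the line.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { total += $NF }
       END { printf "%.0f\n", total }'
}

# Succeed (exit 0) when free = total - used is at least the floor.
# $1 = used bytes, $2 = floor bytes
vram_gate_ok() {
  free=$((TOTAL_VRAM_BYTES - $1))
  [ "$free" -ge "$2" ]
}

# How the two jobs might use it (commented out: needs cluster access):
#   used="$(curl -s gpu-pod-exporter.nvidia.svc:80/metrics | sum_used_bytes)"
#   window-up:  vram_gate_ok "$used" "$FLOOR" && kubectl -n tts scale deploy/chatterbox-tts --replicas=1
#   vram-guard: vram_gate_ok "$used" "$FLOOR" || kubectl -n tts scale deploy/chatterbox-tts --replicas=0
```

One design point worth noting: an empty or failed scrape sums to 0, which would make the card look fully free. A real job should treat a failed `curl` as "do not scale up" rather than let the gate pass by accident.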
### The free-VRAM floor — YOU MUST MEASURE THIS
`var.vram_free_floor_bytes` defaults to **6 GiB** (a conservative guess:
~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the
read→`cudaMalloc` race). **The real T4 peak of `chatterbox-multilingual` is not
published upstream.** Capture it during the first bake:
```bash
# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
promtool query instant http://localhost:9090 \
'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
```
Then set the floor to `measured_peak + ~2 GiB` (pass `-var` or add to the stack
tfvars). If the peak is too high to coexist even off-peak, switch
`model.repo_id` in `main.tf` to `chatterbox` (English, lighter) or
`chatterbox-turbo`, or escalate to Option B (scale `immich-machine-learning` to
0 for the window).
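The floor rule (`measured_peak + ~2 GiB`) is plain byte arithmetic. The helper below is a hypothetical convenience for producing the number to put in `var.vram_free_floor_bytes`; `floor_from_peak` is not part of the stack, and the 2 GiB default headroom is simply the figure quoted above.

```shell
#!/bin/sh
# Sketch only: floor_from_peak is a hypothetical helper, not part of the stack.

GIB=$((1024 * 1024 * 1024))

# Print the free-VRAM floor in bytes.
# $1 = measured peak in bytes, $2 = headroom in bytes (optional, default 2 GiB)
floor_from_peak() {
  peak_bytes=$1
  headroom_bytes=${2:-$((2 * GIB))}
  echo $((peak_bytes + headroom_bytes))
}

# With the assumed 4 GiB peak this reproduces the 6 GiB default:
#   floor_from_peak $((4 * GIB))   prints 6442450944
```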

---
## Build + push the image (do this BEFORE the first apply)
`devnen/Chatterbox-TTS-Server` ships **no published image** — build from the
repo's **cu128** target (matches the cluster's pinned 570.195.03 / CUDA 12.8
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
to `forgejo.viktorbarzin.me`. Run on the devvm (large CUDA image — needs disk +
bandwidth):
```bash
# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server
# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
# the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
--platform linux/amd64 \
--build-arg RUNTIME=nvidia \
-f Dockerfile.cu128 \
-t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
-t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
.
# 3. Push both tags. (If docker isn't authed: log in with the viktor push PAT
# from Vault — `vault kv get -field=forgejo_push_token secret/ci/global` —
# `docker login forgejo.viktorbarzin.me -u viktor`.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
```
> If `Dockerfile.cu128` is not a clean `docker build` target (e.g. it relies on
> build args defined only in `docker-compose-cu128.yml`), lift those args onto
> the `docker build` line or `docker compose -f docker-compose-cu128.yml build`
> then `docker tag` the resulting `chatterbox-tts-server:cu128` image to the
> Forgejo ref above before pushing.
---
## Apply (admin-gated — run in order)
```bash
vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts --purpose "chatterbox-tts stack apply"
# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan --stack kyverno
~/code/scripts/tg apply --stack kyverno
# 2. This stack.
~/code/scripts/tg plan --stack tts
~/code/scripts/tg apply --stack tts # apply does NOT wake the GPU (replicas=0)
# 3. Flip tripit narration on.
~/code/scripts/tg plan --stack tripit
~/code/scripts/tg apply --stack tripit
```
See `docs/plans/2026-06-08-chatterbox-tts-infra.md` §5 for the full go-live
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).
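For the smoke-test step, a request against the OpenAI-compatible endpoint might look like the sketch below. It assumes the server accepts the standard OpenAI speech fields (`model`, `input`, `voice`); the voice name `narrator.wav` is a placeholder, the helper names are hypothetical, and the server may require other fields. The `curl` call only works from inside the cluster while the Deployment is scaled to 1.

```shell
#!/bin/sh
# Sketch only: helper names and the voice file are hypothetical.

# Escape backslashes and double quotes so a single-line string can sit
# inside a JSON string literal. Multi-line input is not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the request body. $1 = text to speak, $2 = voice name.
# "chatterbox" matches the TTS_MODEL value that tripit is configured with.
speech_payload() {
  printf '{"model":"chatterbox","input":"%s","voice":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# From a pod inside the cluster (commented out: needs the running Service):
#   curl -sf -o /tmp/smoke.wav \
#     -H 'Content-Type: application/json' \
#     -d "$(speech_payload 'Hello from the tour guide.' narrator.wav)" \
#     http://chatterbox-tts.tts.svc.cluster.local:8000/v1/audio/speech
```

A non-empty `/tmp/smoke.wav` and a zero exit status from `curl -f` would indicate that the Service remap (8000 to 8004) and the model load both work.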
## Rollback (instant, no data loss)
- **Narration off:** set `TTS_MODE=none` (or drop the three `TTS_*` lines) in
`stacks/tripit/main.tf`, then run `tg apply --stack tripit`. The bake makes no audio;
playback falls back to browser TTS. Cached `story_audio` rows are harmless.
- **Chatterbox off the GPU:** `kubectl -n tts scale deploy/chatterbox-tts
--replicas=0` (transient) and/or `tg destroy --stack tts`. Best-effort synth
means tripit bakes keep running audio-less — no error.
- Neither touches the resident GPU tenants (Option A never modifies them).
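After the transient scale-down, the result can be confirmed by reading the Deployment's desired replica count. The sketch below is one way to script that check; the function names and the `KUBECTL` override are hypothetical. It covers only the scale-to-0 path: after `tg destroy` the Deployment no longer exists, so the query returns nothing and the check reports false.

```shell
#!/bin/sh
# Sketch only: function names and the KUBECTL override are hypothetical.

KUBECTL="${KUBECTL:-kubectl}"

# Print the desired replica count of the Chatterbox Deployment.
chatterbox_replicas() {
  "$KUBECTL" -n tts get deploy chatterbox-tts -o jsonpath='{.spec.replicas}'
}

# Succeed (exit 0) only when the Deployment exists and is scaled to 0.
chatterbox_is_off() {
  [ "$(chatterbox_replicas)" = "0" ]
}

# Example (commented out: needs cluster access):
#   chatterbox_is_off && echo "Chatterbox is off the GPU"
```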