infra/stacks/tts
Viktor Barzin 48013a4a92 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00

tts — Chatterbox TTS (tripit narration)

In-cluster text-to-speech for tripit's "Tour guide". Runs the devnen/Chatterbox-TTS-Server (Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single Deployment + ClusterIP Service chatterbox-tts.tts.svc.cluster.local:8000, requesting one time-slice of the shared Tesla T4 (nvidia.com/gpu: 1).

Full design + rationale (Option-A off-peak control, OOM analysis, ADR links): docs/plans/2026-06-08-chatterbox-tts-infra.md (in the tripit-tour-guide repo) and infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md.

This stack mirrors infra/stacks/llama-cpp/. The scaffolding files (backend.tf, providers.tf, cloudflare_provider.tf, tiers.tf, .terraform.lock.hcl) are generated by Terragrunt on init and are git-ignored — only main.tf, terragrunt.hcl and this README are tracked.


What this stack creates

  • kubernetes_namespace.tts — tier 2-gpu, keel-enrolled, istio off.
  • module.nfs_models — RWX NFS-SSD PVC at /srv/nfs-ssd/chatterbox, mounted at /data (predefined voices, narrator reference WAVs, and the HuggingFace model cache via HF_HOME=/data/hf_cache, so weights download once and persist across the per-window pod recreation).
  • kubernetes_config_map.chatterbox_config: config.yaml with server.port=8004, model.repo_id=chatterbox-multilingual, tts_engine.device=cuda, voices / reference paths under /data.
  • kubernetes_deployment.chatterbox: starts at replicas=0; the off-peak CronJobs own the replica count at runtime. TTS_BF16=off (T4 = Turing, no bf16). priority_class_name=tier-2-gpu (the polite-tenant demotion).
  • kubernetes_service.chatterbox — ClusterIP, port 8000 → targetPort 8004 so tripit's default TTS_BASE_URL works unchanged. Prometheus scrape annotations.
  • Off-peak control (SA + Role + RoleBinding + 3 CronJobs): see below.
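
The Service remap above means a client only ever sees port 8000 and the OpenAI-style /v1/audio/speech path. A minimal smoke-test sketch follows, assuming the server accepts the standard OpenAI speech fields (model, input, voice, response_format). The CHATTERBOX_URL variable, the helper names and the voice value are hypothetical placeholders, not taken from this stack.

```shell
#!/bin/sh
# Hypothetical variable; host and port come from the Service described above.
CHATTERBOX_URL="${CHATTERBOX_URL:-http://chatterbox-tts.tts.svc.cluster.local:8000}"

# Print the full speech endpoint, tolerating a trailing slash on the base URL.
speech_url() {
  printf '%s/v1/audio/speech\n' "${CHATTERBOX_URL%/}"
}

# synth OUTFILE: POST a fixed test sentence and save the returned audio.
# "narrator.wav" is a placeholder voice name (assumption); use one seeded under /data.
synth() {
  curl -sS --fail -X POST "$(speech_url)" \
    -H 'Content-Type: application/json' \
    -d '{"model":"chatterbox","input":"Hello from the T4.","voice":"narrator.wav","response_format":"wav"}' \
    -o "$1"
}
```

Calling synth /tmp/smoke.wav from a pod inside the cluster only succeeds while the Deployment is scaled to 1, so outside the off-peak window a connection failure is the expected result.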

Off-peak control (Option A — window + free-VRAM gate)

The T4 is time-sliced with zero VRAM isolation (post-mortem 2026-06-02), so nvidia.com/gpu: 1 buys a scheduling turn, NOT memory. Chatterbox must only allocate VRAM when the card is actually free. Implemented as three CronJobs (all Europe/London), each a bitnami/kubectl pod using the namespace SA:

| CronJob | Schedule (default) | Action |
| --- | --- | --- |
| chatterbox-window-up | 0 2 * * * | Preflight: scrape gpu_pod_memory_used_bytes from gpu-pod-exporter.nvidia.svc:80/metrics, compute free = 16 GiB - Σused; scale to 1 only if free ≥ vram_free_floor_bytes. |
| chatterbox-vram-guard | */5 2-5 * * * | Guard: every 5 min in-window, scale to 0 if free < floor (a resident woke; yield the card mid-bake). |
| chatterbox-window-down | 0 6 * * * | Window end: scale to 0 unconditionally. |
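
The gate arithmetic shared by the preflight and the guard can be sketched as below. This is an illustration, not the script the CronJobs actually run: the function names are hypothetical, and it assumes the exporter emits plain Prometheus text samples with the value as the last field (no trailing timestamp).

```shell
#!/bin/sh
# Sketch of the free-VRAM gate; function names are hypothetical.
TOTAL_BYTES=$((16 * 1024 * 1024 * 1024))   # 16 GiB, the T4 capacity used above

# Sum every gpu_pod_memory_used_bytes sample read from stdin.
# Assumes the value is the last field on each sample line.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { s += $NF } END { printf "%.0f\n", s }'
}

# gate USED_BYTES FLOOR_BYTES: print "up" if free >= floor, else "hold".
gate() {
  free=$((TOTAL_BYTES - $1))
  if [ "$free" -ge "$2" ]; then echo up; else echo hold; fi
}
```

In the cluster the input would come from something like curl -s http://gpu-pod-exporter.nvidia.svc:80/metrics | sum_used_bytes, with "up" followed by kubectl -n tts scale deploy/chatterbox-tts --replicas=1 and, in the guard, "hold" followed by --replicas=0.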

tripit's bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or aborted window simply backfills on the next one. No latency SLA.

The free-VRAM floor — YOU MUST MEASURE THIS

var.vram_free_floor_bytes defaults to 6 GiB (a conservative guess: ~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the read→cudaMalloc race). The real T4 peak of chatterbox-multilingual is not published upstream. Capture it during the first bake:

# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
  sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'

Then set the floor to measured_peak + ~2 GiB (pass -var or add to the stack tfvars). If the peak is too high to coexist even off-peak, switch model.repo_id in main.tf to chatterbox (English, lighter) or chatterbox-turbo, or escalate to Option B (scale immich-machine-learning to 0 for the window).
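
Since the variable is expressed in bytes, the measured_peak + ~2 GiB rule is easy to get wrong by hand. A small sketch of the conversion, where floor_bytes is a hypothetical helper and the 2 GiB default headroom is the figure quoted above:

```shell
#!/bin/sh
GIB=$((1024 * 1024 * 1024))

# floor_bytes PEAK_BYTES [HEADROOM_GIB]: measured peak plus headroom, in bytes.
# Headroom defaults to 2 GiB, the allowance for the read→cudaMalloc race.
floor_bytes() {
  echo $(( $1 + ${2:-2} * GIB ))
}
```

For the assumed 4 GiB peak, floor_bytes 4294967296 reproduces the 6 GiB default (6442450944). Note that an instant query only samples the moment it runs; wrapping the expression in a subquery such as max_over_time(sum(gpu_pod_memory_used_bytes{namespace="tts"})[4h:1m]) after the window is one way to approximate the peak, provided Prometheus scraped the exporter throughout the bake.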


Build + push the image (do this BEFORE the first apply)

devnen/Chatterbox-TTS-Server ships no published image — build from the repo's cu128 target (matches the cluster's pinned 570.195.03 / CUDA 12.8 driver) and push to the private Forgejo registry. The devvm docker is pre-authed to forgejo.viktorbarzin.me. Run on the devvm (large CUDA image — needs disk + bandwidth):

# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server

# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
#    the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
  --platform linux/amd64 \
  --build-arg RUNTIME=nvidia \
  -f Dockerfile.cu128 \
  -t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
  -t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
  .

# 3. Push both tags. (If docker isn't authed: log in with the viktor push PAT
#    from Vault — `vault kv get -field=forgejo_push_token secret/ci/global` —
#    `docker login forgejo.viktorbarzin.me -u viktor`.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"

If Dockerfile.cu128 is not a clean docker build target (e.g. it relies on build args defined only in docker-compose-cu128.yml), either lift those args onto the docker build line, or run docker compose -f docker-compose-cu128.yml build and then docker tag the resulting chatterbox-tts-server:cu128 image to the Forgejo refs above before pushing.


Apply (admin-gated — run in order)

vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts      --purpose "chatterbox-tts stack apply"

# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan  --stack kyverno
~/code/scripts/tg apply --stack kyverno

# 2. This stack.
~/code/scripts/tg plan  --stack tts
~/code/scripts/tg apply --stack tts        # apply does NOT wake the GPU (replicas=0)

# 3. Flip tripit narration on.
~/code/scripts/tg plan  --stack tripit
~/code/scripts/tg apply --stack tripit

See docs/plans/2026-06-08-chatterbox-tts-infra.md §5 for the full go-live checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).
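
After step 2 it may be worth confirming that the apply really left the Deployment parked, since a non-zero replica count would allocate VRAM outside the window. A minimal check, assuming the Deployment is named chatterbox-tts as in the rollback commands; is_parked is a hypothetical helper name:

```shell
#!/bin/sh
# Succeeds only if the Deployment's desired replica count is 0.
is_parked() {
  replicas="$(kubectl -n tts get deploy chatterbox-tts -o jsonpath='{.spec.replicas}')"
  [ "$replicas" = "0" ]
}
```

Run it as is_parked && echo "parked" || echo "NOT parked"; outside the 02:00 to 06:00 window the first message is the expected one.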

Rollback (instant, no data loss)

  • Narration off: set TTS_MODE=none (or drop the three TTS_* lines) in stacks/tripit/main.tf, then run tg apply --stack tripit. The bake makes no audio; playback falls back to browser TTS. Cached story_audio rows are harmless.
  • Chatterbox off the GPU: kubectl -n tts scale deploy/chatterbox-tts --replicas=0 (transient) and/or tg destroy --stack tts. Best-effort synth means tripit bakes keep running audio-less — no error.
  • Neither touches the resident GPU tenants (Option A never modifies them).