New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4; see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count:

- `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor).
- `chatterbox-vram-guard` yields the card mid-window if a resident wakes.
- `chatterbox-window-down` scales to 0 at 06:00.

tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure, never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token; ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts). The build is impractical inline (large CUDA image + needs the upstream repo).

NOT APPLIED: review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tts — Chatterbox TTS (tripit narration)
In-cluster text-to-speech for tripit's "Tour guide". Runs the
devnen/Chatterbox-TTS-Server
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
Deployment + ClusterIP Service chatterbox-tts.tts.svc.cluster.local:8000,
requesting one time-slice of the shared Tesla T4 (nvidia.com/gpu: 1).
Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
docs/plans/2026-06-08-chatterbox-tts-infra.md (in the tripit-tour-guide repo)
and infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md.
This stack mirrors `infra/stacks/llama-cpp/`. The scaffolding files (`backend.tf`, `providers.tf`, `cloudflare_provider.tf`, `tiers.tf`, `.terraform.lock.hcl`) are generated by Terragrunt on `init` and are git-ignored; only `main.tf`, `terragrunt.hcl` and this README are tracked.
What this stack creates
- `kubernetes_namespace.tts`: tier2-gpu, keel-enrolled, istio off.
- `module.nfs_models`: RWX NFS-SSD PVC at `/srv/nfs-ssd/chatterbox`, mounted at `/data` (predefined voices, narrator reference WAVs, and the HuggingFace model cache via `HF_HOME=/data/hf_cache`, so weights download once and persist across the per-window pod recreation).
- `kubernetes_config_map.chatterbox_config`: `config.yaml` with `server.port=8004`, `model.repo_id=chatterbox-multilingual`, `tts_engine.device=cuda`, and voices / reference paths under `/data`.
- `kubernetes_deployment.chatterbox`: starts at `replicas=0`; the off-peak CronJobs own the replica count at runtime. `TTS_BF16=off` (T4 = Turing, no bf16). `priority_class_name=tier-2-gpu` (the polite-tenant demotion).
- `kubernetes_service.chatterbox`: ClusterIP, `port 8000 → targetPort 8004` so tripit's default `TTS_BASE_URL` works unchanged. Prometheus scrape annotations.
- Off-peak control (SA + Role + RoleBinding + 3 CronJobs): see below.
Off-peak control (Option A — window + free-VRAM gate)
The T4 is time-sliced with zero VRAM isolation (post-mortem 2026-06-02), so
nvidia.com/gpu: 1 buys a scheduling turn, NOT memory. Chatterbox must only
allocate VRAM when the card is actually free. Implemented as three CronJobs
(all Europe/London), each a bitnami/kubectl pod using the namespace SA:
| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up` | `0 2 * * *` | Preflight: scrape `gpu_pod_memory_used_bytes` from `gpu-pod-exporter.nvidia.svc:80/metrics`, compute free = 16 GiB − Σused; scale to 1 only if free ≥ `vram_free_floor_bytes`. |
| `chatterbox-vram-guard` | `*/5 2-5 * * *` | Guard: every 5 min in-window, scale to 0 if free < floor (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *` | Window end: scale to 0 unconditionally. |
tripit's bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or
aborted window simply backfills on the next one. No latency SLA.
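The preflight and guard rows in the table reduce to the same arithmetic. The following is a minimal sketch of that logic, not the script the CronJobs actually run: the function names are hypothetical, and it assumes the exporter emits plain Prometheus text with the sample value as the last field on each line (no trailing timestamp).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical. The metric name and the
# 16 GiB card total are taken from this README.
TOTAL_VRAM_BYTES=$((16 * 1024 * 1024 * 1024))

# Sum every gpu_pod_memory_used_bytes sample from Prometheus text on stdin.
# Assumption: the value is the last field on the line (no timestamp).
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { sum += $NF }
       END { printf "%.0f\n", sum }'
}

# free = 16 GiB - used
free_bytes() {
  echo $(( $TOTAL_VRAM_BYTES - $1 ))
}

# window-up: print "scale-up" when free >= floor, otherwise "skip".
# Arguments: used_bytes floor_bytes
window_up_decision() {
  if [ "$(free_bytes "$1")" -ge "$2" ]; then echo scale-up; else echo skip; fi
}

# vram-guard: print "yield" when free < floor, otherwise "hold".
# Arguments: used_bytes floor_bytes
guard_decision() {
  if [ "$(free_bytes "$1")" -lt "$2" ]; then echo yield; else echo hold; fi
}
```

In a CronJob the input would presumably come from something like `curl -s http://gpu-pod-exporter.nvidia.svc:80/metrics | sum_used_bytes`, with a `scale-up` result followed by `kubectl -n tts scale deploy/chatterbox-tts --replicas=1`. The check is only advisory: a resident can allocate between the read and Chatterbox's first allocation, which is why the floor carries headroom.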
The free-VRAM floor — YOU MUST MEASURE THIS
var.vram_free_floor_bytes defaults to 6 GiB (a conservative guess:
~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the
read→cudaMalloc race). The real T4 peak of chatterbox-multilingual is not
published upstream. Capture it during the first bake:
# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
  sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
Then set the floor to measured_peak + ~2 GiB (pass -var or add to the stack
tfvars). If the peak is too high to coexist even off-peak, switch
model.repo_id in main.tf to chatterbox (English, lighter) or
chatterbox-turbo, or escalate to Option B (scale immich-machine-learning to
0 for the window).
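As a worked example of the "measured peak + ~2 GiB" rule, the sketch below turns a measured byte count into a floor value. `floor_from_peak` and `bytes_to_gib` are hypothetical helpers for illustration, not part of the stack, and the 4.5 GiB peak used in the usage note is an invented figure, not a measurement.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical.
# Headroom of ~2 GiB for the read-to-cudaMalloc race described above.
HEADROOM_BYTES=$((2 * 1024 * 1024 * 1024))

# Print measured_peak + headroom, in bytes.
floor_from_peak() {
  echo $(( $1 + $HEADROOM_BYTES ))
}

# Render a byte count as GiB with two decimals, for checking by eye.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}
```

For instance, if the gauge read 4831838208 bytes (4.5 GiB) at peak, `floor_from_peak 4831838208` prints 6979321856, which `bytes_to_gib` renders as 6.50. That byte count is what would be supplied as `vram_free_floor_bytes`.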
Build + push the image (do this BEFORE the first apply)
devnen/Chatterbox-TTS-Server ships no published image — build from the
repo's cu128 target (matches the cluster's pinned 570.195.03 / CUDA 12.8
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
to forgejo.viktorbarzin.me. Run on the devvm (large CUDA image — needs disk +
bandwidth):
# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server
# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
# the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
  --platform linux/amd64 \
  --build-arg RUNTIME=nvidia \
  -f Dockerfile.cu128 \
  -t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
  -t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
  .
# 3. Push both tags. (If docker isn't authed: log in with the viktor push PAT
# from Vault — `vault kv get -field=forgejo_push_token secret/ci/global` —
# `docker login forgejo.viktorbarzin.me -u viktor`.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
If `Dockerfile.cu128` is not a clean `docker build` target (e.g. it relies on build args defined only in `docker-compose-cu128.yml`), either lift those args onto the `docker build` line, or run `docker compose -f docker-compose-cu128.yml build` and then `docker tag` the resulting `chatterbox-tts-server:cu128` image to the Forgejo ref above before pushing.
Apply (admin-gated — run in order)
vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts --purpose "chatterbox-tts stack apply"
# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan --stack kyverno
~/code/scripts/tg apply --stack kyverno
# 2. This stack.
~/code/scripts/tg plan --stack tts
~/code/scripts/tg apply --stack tts # apply does NOT wake the GPU (replicas=0)
# 3. Flip tripit narration on.
~/code/scripts/tg plan --stack tripit
~/code/scripts/tg apply --stack tripit
See docs/plans/2026-06-08-chatterbox-tts-infra.md §5 for the full go-live
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).
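A synth smoke test against the OpenAI-compatible endpoint could look roughly like the sketch below. It is not the checklist's actual procedure: `speech_payload` is a hypothetical helper, the voice name is a placeholder for whatever was seeded under `/data`, and the request fields follow the OpenAI `/v1/audio/speech` shape on the assumption that the server accepts them as-is.

```shell
#!/bin/sh
# Sketch only: speech_payload is a hypothetical helper.
# Naive JSON building: assumes the arguments contain no quotes or backslashes.
# Arguments: model input_text voice
speech_payload() {
  printf '{"model":"%s","input":"%s","voice":"%s","response_format":"wav"}\n' \
    "$1" "$2" "$3"
}

# Example call from a pod inside the cluster (the Service is ClusterIP only,
# and the Deployment must be at replicas=1, i.e. inside the window):
#   speech_payload chatterbox "Hello from the tour guide." narrator.wav |
#     curl -sf -X POST \
#       http://chatterbox-tts.tts.svc.cluster.local:8000/v1/audio/speech \
#       -H 'Content-Type: application/json' --data-binary @- -o /tmp/smoke.wav
```

A non-empty `/tmp/smoke.wav` and a zero exit status from `curl -f` would indicate the Service remap (8000 to 8004) and the model load both worked.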
Rollback (instant, no data loss)
- Narration off: set `TTS_MODE=none` (or drop the three `TTS_*` lines) in `stacks/tripit/main.tf` → `tg apply --stack tripit`. The bake makes no audio; playback falls back to browser TTS. Cached `story_audio` rows are harmless.
- Chatterbox off the GPU: `kubectl -n tts scale deploy/chatterbox-tts --replicas=0` (transient) and/or `tg destroy --stack tts`. Best-effort synth means tripit bakes keep running audio-less, with no error.
- Neither touches the resident GPU tenants (Option A never modifies them).