infra/stacks/tts
Viktor Barzin bd0cb71f17
tts: TCP probes — http liveness killed the server mid-synthesis
The devnen server runs chunked synthesis as a blocking call inside its
async handler, so the event loop (and every HTTP probe) hangs for the
whole multi-minute story. Kubelet's http liveness probe (1s timeout)
then killed the container mid-story (exit 137, twice within 10 min of
the first real drain), which reset the engine, so every following pass
started cold and tripit's 120s synthesis budget could never be met —
the queue would never drain.

TCP probes keep the meaning that matters: uvicorn binds 8004 only
after the model finishes loading in the lifespan hook, so readiness
still gates 'model loaded', while a GPU-busy server is left alive.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:57:28 +00:00
main.tf tts: TCP probes — http liveness killed the server mid-synthesis 2026-06-12 20:57:28 +00:00
README.md feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] 2026-06-09 21:41:53 +00:00
terragrunt.hcl feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] 2026-06-09 21:41:53 +00:00

tts — Chatterbox TTS (tripit narration)

In-cluster text-to-speech for tripit's "Tour guide". Runs the devnen/Chatterbox-TTS-Server (Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single Deployment + ClusterIP Service chatterbox-tts.tts.svc.cluster.local:8000, requesting one time-slice of the shared Tesla T4 (nvidia.com/gpu: 1).

Full design + rationale (Option-A off-peak control, OOM analysis, ADR links): docs/plans/2026-06-08-chatterbox-tts-infra.md (in the tripit-tour-guide repo) and infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md.

This stack mirrors infra/stacks/llama-cpp/. The scaffolding files (backend.tf, providers.tf, cloudflare_provider.tf, tiers.tf, .terraform.lock.hcl) are generated by Terragrunt on init and are git-ignored — only main.tf, terragrunt.hcl and this README are tracked.


What this stack creates

  • kubernetes_namespace.tts — tier 2-gpu, keel-enrolled, istio off.
  • module.nfs_models — RWX NFS-SSD PVC at /srv/nfs-ssd/chatterbox, mounted at /data (predefined voices, narrator reference WAVs, and the HuggingFace model cache via HF_HOME=/data/hf_cache, so weights download once and persist across the per-window pod recreation).
  • kubernetes_config_map.chatterbox_config (config.yaml): server.port=8004, model.repo_id=chatterbox-multilingual, tts_engine.device=cuda, voices / reference paths under /data.
  • kubernetes_deployment.chatterbox: starts at replicas=0; the off-peak CronJobs own the replica count at runtime. TTS_BF16=off (T4 = Turing, no bf16). priority_class_name=tier-2-gpu (the polite-tenant demotion).
  • kubernetes_service.chatterbox — ClusterIP, port 8000 → targetPort 8004 so tripit's default TTS_BASE_URL works unchanged. Prometheus scrape annotations.
  • Off-peak control (SA + Role + RoleBinding + 3 CronJobs): see below.

Off-peak control (Option A — window + free-VRAM gate)

The T4 is time-sliced with zero VRAM isolation (post-mortem 2026-06-02), so nvidia.com/gpu: 1 buys a scheduling turn, NOT memory. Chatterbox must only allocate VRAM when the card is actually free. Implemented as three CronJobs (all Europe/London), each a bitnami/kubectl pod using the namespace SA:

| CronJob | Schedule (default) | Action |
|---|---|---|
| chatterbox-window-up | 0 2 * * * | Preflight: scrape gpu_pod_memory_used_bytes from gpu-pod-exporter.nvidia.svc:80/metrics, compute free = 16 GiB - Σused; scale to 1 only if free ≥ vram_free_floor_bytes. |
| chatterbox-vram-guard | */5 2-5 * * * | Guard: every 5 min in-window, scale to 0 if free < floor (a resident woke; yield the card mid-bake). |
| chatterbox-window-down | 0 6 * * * | Window end: scale to 0 unconditionally. |

tripit's bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or aborted window simply backfills on the next one. No latency SLA.
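
The gate arithmetic used by window-up and vram-guard can be sketched as below. This is a minimal illustration, not the CronJob script in main.tf: the function names are hypothetical, and it assumes the exporter emits plain Prometheus text lines with the sample value as the last field.

```shell
# Sketch only: sum_used_bytes and gate_decision are hypothetical helper names.
# Assumes exporter lines look like:
#   gpu_pod_memory_used_bytes{namespace="...",pod="..."} <bytes>

TOTAL_VRAM_BYTES=$((16 * 1024 * 1024 * 1024))   # the 16 GiB figure from the table

# Sum every gpu_pod_memory_used_bytes sample read from stdin.
# Comment lines (# HELP / # TYPE) and other metrics are ignored.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { sum += $NF } END { printf "%.0f\n", sum }'
}

# Print "up" when free VRAM meets the floor, "down" otherwise.
# $1 = floor in bytes; metrics text arrives on stdin.
gate_decision() {
  floor=$1
  used=$(sum_used_bytes)
  free=$((TOTAL_VRAM_BYTES - used))
  if [ "$free" -ge "$floor" ]; then echo up; else echo down; fi
}

# Example wiring (not executed here):
#   curl -s http://gpu-pod-exporter.nvidia.svc:80/metrics | gate_decision "$FLOOR"
```

In this sketch window-up would scale the Deployment to 1 only on "up", and vram-guard would scale it to 0 on "down"; both read the same sum, so the two jobs cannot disagree about what "free" means.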

The free-VRAM floor — YOU MUST MEASURE THIS

var.vram_free_floor_bytes defaults to 6 GiB (a conservative guess: ~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the read→cudaMalloc race). The real T4 peak of chatterbox-multilingual is not published upstream. Capture it during the first bake:

# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
  sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'

Then set the floor to measured_peak + ~2 GiB (pass -var or add to the stack tfvars). If the peak is too high to coexist even off-peak, switch model.repo_id in main.tf to chatterbox (English, lighter) or chatterbox-turbo, or escalate to Option B (scale immich-machine-learning to 0 for the window).
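
As a worked example of the floor arithmetic, the sketch below turns a measured peak into a tfvars line. The helper names are hypothetical, and the 4 GiB peak in the usage comment is the assumed figure from above, not a measurement.

```shell
# Sketch only: floor_from_peak and tfvars_line are hypothetical helper names.
GIB=$((1024 * 1024 * 1024))

# floor = measured peak + headroom; headroom defaults to the ~2 GiB suggested above.
# $1 = measured peak in bytes, $2 = optional headroom in bytes.
floor_from_peak() {
  peak_bytes=$1
  headroom_bytes=${2:-$((2 * GIB))}
  echo $((peak_bytes + headroom_bytes))
}

# Emit a tfvars line for the stack variable named in this README.
tfvars_line() {
  echo "vram_free_floor_bytes = $(floor_from_peak "$@")"
}

# Example (not executed here): a 4 GiB peak reproduces the 6 GiB default.
#   tfvars_line $((4 * GIB))
```

Feeding it the value returned by the promtool query keeps the floor in bytes end to end, which avoids a GiB/GB mix-up when the number is pasted into tfvars.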


Build + push the image (do this BEFORE the first apply)

devnen/Chatterbox-TTS-Server ships no published image — build from the repo's cu128 target (matches the cluster's pinned 570.195.03 / CUDA 12.8 driver) and push to the private Forgejo registry. The devvm docker is pre-authed to forgejo.viktorbarzin.me. Run on the devvm (large CUDA image — needs disk + bandwidth):

# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server

# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
#    the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
  --platform linux/amd64 \
  --build-arg RUNTIME=nvidia \
  -f Dockerfile.cu128 \
  -t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
  -t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
  .

# 3. Push both tags. (If docker isn't authed: log in with the viktor push PAT
#    from Vault — `vault kv get -field=forgejo_push_token secret/ci/global` —
#    `docker login forgejo.viktorbarzin.me -u viktor`.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"

If Dockerfile.cu128 is not a clean docker build target (e.g. it relies on build args defined only in docker-compose-cu128.yml), either lift those args onto the docker build line, or run docker compose -f docker-compose-cu128.yml build and then docker tag the resulting chatterbox-tts-server:cu128 image to the Forgejo ref above before pushing.
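
The compose fallback could look roughly like the sketch below. It prints the commands by default so the sequence can be reviewed first; the run wrapper, the DRY_RUN switch and the function name are assumptions for illustration, while the image names are the ones given above.

```shell
# Sketch only: run, DRY_RUN and compose_build_and_retag are hypothetical.
IMAGE=forgejo.viktorbarzin.me/viktor/chatterbox-tts

# Print each command unless DRY_RUN=0 is set, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

# $1 = short commit SHA to use as the immutable tag.
compose_build_and_retag() {
  sha=$1
  run docker compose -f docker-compose-cu128.yml build
  run docker tag chatterbox-tts-server:cu128 "$IMAGE:latest"
  run docker tag chatterbox-tts-server:cu128 "$IMAGE:$sha"
  run docker push "$IMAGE:latest"
  run docker push "$IMAGE:$sha"
}

# Example (not executed here), from the cloned upstream repo:
#   DRY_RUN=0 compose_build_and_retag "$(git rev-parse --short=8 HEAD)"
```

Tagging both latest and the short SHA mirrors the direct docker build path, so either route leaves the registry in the same state.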


Apply (admin-gated — run in order)

vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts      --purpose "chatterbox-tts stack apply"

# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan  --stack kyverno
~/code/scripts/tg apply --stack kyverno

# 2. This stack.
~/code/scripts/tg plan  --stack tts
~/code/scripts/tg apply --stack tts        # apply does NOT wake the GPU (replicas=0)

# 3. Flip tripit narration on.
~/code/scripts/tg plan  --stack tripit
~/code/scripts/tg apply --stack tripit

See docs/plans/2026-06-08-chatterbox-tts-infra.md §5 for the full go-live checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).

Rollback (instant, no data loss)

  • Narration off: set TTS_MODE=none (or drop the three TTS_* lines) in stacks/tripit/main.tf, then run tg apply --stack tripit. The bake makes no audio; playback falls back to browser TTS. Cached story_audio rows are harmless.
  • Chatterbox off the GPU: kubectl -n tts scale deploy/chatterbox-tts --replicas=0 (transient) and/or tg destroy --stack tts. Best-effort synth means tripit bakes keep running audio-less — no error.
  • Neither touches the resident GPU tenants (Option A never modifies them).