tts — Chatterbox TTS (tripit narration)
In-cluster text-to-speech for tripit's "Tour guide". Runs the
devnen/Chatterbox-TTS-Server
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
Deployment + ClusterIP Service chatterbox-tts.tts.svc.cluster.local:8000,
requesting one time-slice of the shared Tesla T4 (nvidia.com/gpu: 1).
Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
docs/plans/2026-06-08-chatterbox-tts-infra.md (in the tripit-tour-guide repo)
and infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md.
This stack mirrors infra/stacks/llama-cpp/. The scaffolding files (backend.tf, providers.tf, cloudflare_provider.tf, tiers.tf, .terraform.lock.hcl) are generated by Terragrunt on init and are git-ignored; only main.tf, terragrunt.hcl and this README are tracked.
What this stack creates
- kubernetes_namespace.tts: tier2-gpu, keel-enrolled, istio off.
- module.nfs_models: RWX NFS-SSD PVC at /srv/nfs-ssd/chatterbox, mounted at /data (predefined voices, narrator reference WAVs, and the HuggingFace model cache via HF_HOME=/data/hf_cache, so weights download once and persist across the per-window pod recreation).
- kubernetes_config_map.chatterbox_config: config.yaml with server.port=8004, model.repo_id=chatterbox-multilingual, tts_engine.device=cuda, and voices / reference paths under /data.
- kubernetes_deployment.chatterbox: starts at replicas=0; the off-peak CronJobs own the replica count at runtime. TTS_BF16=off (T4 = Turing, no bf16). priority_class_name=tier-2-gpu (the polite-tenant demotion).
- kubernetes_service.chatterbox: ClusterIP, port 8000 → targetPort 8004 so tripit's default TTS_BASE_URL works unchanged. Prometheus scrape annotations.
- Off-peak control (SA + Role + RoleBinding + 3 CronJobs): see below.
Off-peak control (Option A — window + free-VRAM gate)
The T4 is time-sliced with zero VRAM isolation (post-mortem 2026-06-02), so
nvidia.com/gpu: 1 buys a scheduling turn, NOT memory. Chatterbox must only
allocate VRAM when the card is actually free. Implemented as three CronJobs
(all Europe/London), each a bitnami/kubectl pod using the namespace SA:
| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up` | `0 2 * * *` | Preflight: scrape gpu_pod_memory_used_bytes from gpu-pod-exporter.nvidia.svc:80/metrics, compute free = 16 GiB − Σused; scale to 1 only if free ≥ vram_free_floor_bytes. |
| `chatterbox-vram-guard` | `*/5 2-5 * * *` | Guard: every 5 min in-window, scale to 0 if free < floor (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *` | Window end: scale to 0 unconditionally. |
tripit's bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or
aborted window simply backfills on the next one. No latency SLA.
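
The gate arithmetic from the table above (free = 16 GiB − Σused, compared against the floor) can be sketched in plain shell. This is an illustration only, not the script the CronJobs actually run: the function names sum_used_bytes and gate_replicas are hypothetical, and the parser assumes each sample is a plain `name{labels} value` line with no trailing timestamp.

```shell
# Sketch of the free-VRAM gate. The function names are hypothetical and the
# metric format (value as the last field, no timestamp) is an assumption.

T4_TOTAL_BYTES=$((16 * 1024 * 1024 * 1024))   # 16 GiB, per the table above

# Sum every gpu_pod_memory_used_bytes sample read from stdin.
# "# HELP" / "# TYPE" comment lines do not match the anchored pattern.
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { sum += $NF } END { printf "%.0f\n", sum }'
}

# Print the replica count the gate would pick: 1 if free >= floor, else 0.
# $1 = used bytes (sum over all pods), $2 = floor bytes
gate_replicas() {
  free=$((T4_TOTAL_BYTES - $1))
  if [ "$free" -ge "$2" ]; then echo 1; else echo 0; fi
}

# Possible use from inside the cluster (not executed here):
#   used=$(curl -s http://gpu-pod-exporter.nvidia.svc:80/metrics | sum_used_bytes)
#   gate_replicas "$used" 6442450944    # 6 GiB default floor
```

The sketch sums usage across all namespaces because the card has no VRAM isolation, so any tenant's allocation reduces what Chatterbox can safely take.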
The free-VRAM floor — YOU MUST MEASURE THIS
var.vram_free_floor_bytes defaults to 6 GiB (a conservative guess:
~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the
read→cudaMalloc race). The real T4 peak of chatterbox-multilingual is not
published upstream. Capture it during the first bake:
# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
promtool query instant http://localhost:9090 \
'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
Then set the floor to measured_peak + ~2 GiB (pass -var or add to the stack
tfvars). If the peak is too high to coexist even off-peak, switch
model.repo_id in main.tf to chatterbox (English, lighter) or
chatterbox-turbo, or escalate to Option B (scale immich-machine-learning to
0 for the window).
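
As a worked example of the floor arithmetic above, the following sketch turns a measured peak into a byte value for var.vram_free_floor_bytes. The helper name vram_floor_bytes is hypothetical, and the 4 GiB input is only the assumed peak from the default, not a measurement.

```shell
# floor = measured peak + headroom (2 GiB unless overridden).
# vram_floor_bytes is a hypothetical helper, not part of the stack.
GIB=$((1024 * 1024 * 1024))

# $1 = measured peak in bytes, $2 = optional headroom in GiB (default 2)
vram_floor_bytes() {
  echo $(( $1 + ${2:-2} * GIB ))
}

# With the assumed 4 GiB peak this reproduces the 6 GiB default:
#   vram_floor_bytes 4294967296    # prints 6442450944
```

The printed number is what would go into the -var value or the stack tfvars; how the tg wrapper forwards -var is not shown in this README, so check that before relying on it.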
Build + push the image (do this BEFORE the first apply)
devnen/Chatterbox-TTS-Server ships no published image — build from the
repo's cu128 target (matches the cluster's pinned 570.195.03 / CUDA 12.8
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
to forgejo.viktorbarzin.me. Run on the devvm (large CUDA image — needs disk +
bandwidth):
# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server
# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
# the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
--platform linux/amd64 \
--build-arg RUNTIME=nvidia \
-f Dockerfile.cu128 \
-t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
-t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
.
# 3. Push both tags. (If docker isn't authed, fetch the viktor push PAT from Vault with
# `vault kv get -field=forgejo_push_token secret/ci/global`, then run
# `docker login forgejo.viktorbarzin.me -u viktor` and enter the PAT as the password.)
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
If Dockerfile.cu128 is not a clean docker build target (e.g. it relies on build args defined only in docker-compose-cu128.yml), lift those args onto the docker build line, or run docker compose -f docker-compose-cu128.yml build and then docker tag the resulting chatterbox-tts-server:cu128 image to the Forgejo ref above before pushing.
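
The compose fallback described above could look roughly like this. It is a sketch under assumptions: image_refs is a hypothetical helper, and the local image name chatterbox-tts-server:cu128 is taken from the note above rather than verified against the upstream compose file.

```shell
# Hypothetical helper: print the two Forgejo refs (latest + short SHA)
# that the build section pushes.
REGISTRY_REPO=forgejo.viktorbarzin.me/viktor/chatterbox-tts

# $1 = short commit SHA of the upstream checkout
image_refs() {
  printf '%s:latest\n%s:%s\n' "$REGISTRY_REPO" "$REGISTRY_REPO" "$1"
}

# Fallback flow (not executed here; assumes the compose build produces
# chatterbox-tts-server:cu128 locally):
#   docker compose -f docker-compose-cu128.yml build
#   for ref in $(image_refs "$(git rev-parse --short=8 HEAD)"); do
#     docker tag chatterbox-tts-server:cu128 "$ref"
#     docker push "$ref"
#   done
```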
Apply (admin-gated — run in order)
vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts --purpose "chatterbox-tts stack apply"
# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan --stack kyverno
~/code/scripts/tg apply --stack kyverno
# 2. This stack.
~/code/scripts/tg plan --stack tts
~/code/scripts/tg apply --stack tts # apply does NOT wake the GPU (replicas=0)
# 3. Flip tripit narration on.
~/code/scripts/tg plan --stack tripit
~/code/scripts/tg apply --stack tripit
See docs/plans/2026-06-08-chatterbox-tts-infra.md §5 for the full go-live
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).
Rollback (instant, no data loss)
- Narration off: set TTS_MODE=none (or drop the three TTS_* lines) in stacks/tripit/main.tf → tg apply --stack tripit. The bake makes no audio; playback falls back to browser TTS. Cached story_audio rows are harmless.
- Chatterbox off the GPU: kubectl -n tts scale deploy/chatterbox-tts --replicas=0 (transient) and/or tg destroy --stack tts. Best-effort synth means tripit bakes keep running audio-less, with no error.
- Neither touches the resident GPU tenants (Option A never modifies them).