infra/stacks/tts/README.md

# tts — Chatterbox TTS (tripit narration)

In-cluster text-to-speech for tripit's "Tour guide". Runs the
[devnen/Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server)
(Resemble AI Chatterbox under an OpenAI-compatible HTTP server) as a single
Deployment + ClusterIP Service `chatterbox-tts.tts.svc.cluster.local:8000`,
requesting **one time-slice** of the shared Tesla T4 (`nvidia.com/gpu: 1`).

Full design + rationale (Option-A off-peak control, OOM analysis, ADR links):
`docs/plans/2026-06-08-chatterbox-tts-infra.md` (in the tripit-tour-guide repo)
and `infra/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.

> This stack mirrors `infra/stacks/llama-cpp/`. The scaffolding files
> (`backend.tf`, `providers.tf`, `cloudflare_provider.tf`, `tiers.tf`,
> `.terraform.lock.hcl`) are **generated by Terragrunt** on `init` and are
> git-ignored — only `main.tf`, `terragrunt.hcl` and this README are tracked.

---

## What this stack creates

- `kubernetes_namespace.tts` — tier `2-gpu`, keel-enrolled, istio off.
- `module.nfs_models` — RWX NFS-SSD PVC at `/srv/nfs-ssd/chatterbox`, mounted at
  `/data` (predefined voices, narrator reference WAVs, **and** the HuggingFace
  model cache via `HF_HOME=/data/hf_cache`, so weights download once and persist
  across the per-window pod recreation).
- `kubernetes_config_map.chatterbox_config` — `config.yaml`: `server.port=8004`,
  `model.repo_id=chatterbox-multilingual`, `tts_engine.device=cuda`, voices /
  reference paths under `/data`.
- `kubernetes_deployment.chatterbox` — **starts at `replicas=0`**; the off-peak
  CronJobs own the replica count at runtime. `TTS_BF16=off` (T4 = Turing, no
  bf16). `priority_class_name=tier-2-gpu` (the polite-tenant demotion).
- `kubernetes_service.chatterbox` — ClusterIP, **`port 8000 → targetPort 8004`**
  so tripit's default `TTS_BASE_URL` works unchanged. Prometheus scrape
  annotations.
- **Off-peak control** (SA + Role + RoleBinding + 3 CronJobs): see below.
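
A minimal sketch of a smoke test against the remapped Service port, assuming the server exposes the OpenAI-style `/v1/audio/speech` route and accepts the `model`, `input`, `voice` and `response_format` fields. The voice name `narrator.wav`, the helper names and the output path are hypothetical; substitute a voice that actually exists under `/data`. It only works from inside the cluster while a window has the pod scaled to 1.

```bash
TTS_BASE_URL="${TTS_BASE_URL:-http://chatterbox-tts.tts.svc.cluster.local:8000}"

# Build an OpenAI-style speech request body.
# ASSUMPTION: "narrator.wav" is a placeholder voice name, not one shipped by this stack.
# The text is not JSON-escaped, so keep quotes and backslashes out of it.
speech_payload() {
  local text="$1" voice="${2:-narrator.wav}"
  printf '{"model":"chatterbox","input":"%s","voice":"%s","response_format":"wav"}\n' \
    "$text" "$voice"
}

# POST the payload to the Service (port 8000, forwarded to the container's 8004)
# and write the returned audio to a file.
synth_smoke_test() {
  local text="$1" out="${2:-/tmp/chatterbox-smoke.wav}"
  speech_payload "$text" | curl -fsS -X POST "${TTS_BASE_URL}/v1/audio/speech" \
    -H 'Content-Type: application/json' --data-binary @- -o "$out"
}

# Usage, from any in-cluster pod that has curl:
#   synth_smoke_test "Welcome to the old town."
```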

## Off-peak control (Option A — window + free-VRAM gate)

The T4 is time-sliced with **zero VRAM isolation** (post-mortem 2026-06-02), so
`nvidia.com/gpu: 1` buys a scheduling turn, NOT memory. Chatterbox must only
allocate VRAM when the card is actually free. Implemented as three CronJobs
(all `Europe/London`), each a `bitnami/kubectl` pod using the namespace SA:

| CronJob | Schedule (default) | Action |
|---|---|---|
| `chatterbox-window-up`   | `0 2 * * *`     | **Preflight**: scrape `gpu_pod_memory_used_bytes` from `gpu-pod-exporter.nvidia.svc:80/metrics`, compute `free = 16 GiB − Σused`; scale to **1 only if** `free ≥ vram_free_floor_bytes`. |
| `chatterbox-vram-guard`  | `*/5 2-5 * * *` | **Guard**: every 5 min in-window, scale to **0** if `free < floor` (a resident woke; yield the card mid-bake). |
| `chatterbox-window-down` | `0 6 * * *`     | **Window end**: scale to **0** unconditionally. |

`tripit`'s bake is best-effort + cached-forever (ADR-0002/0004) — a skipped or
aborted window simply backfills on the next one. No latency SLA.
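
The gate arithmetic in the table above can be sketched as follows. This is an illustration, not the script embedded in `main.tf`: the function names are hypothetical, and it assumes the exporter emits plain Prometheus text with the sample value as the last field on each line (no trailing timestamp).

```bash
TOTAL_VRAM_BYTES=$((16 * 1024 * 1024 * 1024))   # 16 GiB, as used in the table
FLOOR_BYTES="${VRAM_FREE_FLOOR_BYTES:-$((6 * 1024 * 1024 * 1024))}"   # 6 GiB default

# Read Prometheus text from stdin and print the sum of all
# gpu_pod_memory_used_bytes samples as an integer (0 if there are none).
sum_used_bytes() {
  awk '/^gpu_pod_memory_used_bytes[{ ]/ { s += $NF } END { printf "%.0f\n", s }'
}

# Exit 0 when free VRAM (total minus used) is at or above the floor.
vram_gate_open() {
  local used="$1"
  local free=$(( TOTAL_VRAM_BYTES - used ))
  [ "$free" -ge "$FLOOR_BYTES" ]
}

# Preflight (window-up) would then look roughly like:
#   used="$(curl -fsS http://gpu-pod-exporter.nvidia.svc:80/metrics | sum_used_bytes)"
#   if vram_gate_open "$used"; then
#     kubectl -n tts scale deploy/chatterbox-tts --replicas=1
#   fi
# Guard (vram-guard) inverts the check and scales to 0 when the gate is closed.
```

Failing closed is the safer choice here: if the scrape fails, `curl -f` exits non-zero and the preflight should skip the window rather than treat an empty result as a free card.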

### The free-VRAM floor — YOU MUST MEASURE THIS

`var.vram_free_floor_bytes` defaults to **6 GiB** (a conservative guess:
~4 GiB assumed multilingual FP16 peak + ~2 GiB headroom for the
read→`cudaMalloc` race). **The real T4 peak of `chatterbox-multilingual` is not
published upstream.** Capture it during the first bake:

```bash
# while a real synth is running on the freed T4:
kubectl -n monitoring exec deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'sum(gpu_pod_memory_used_bytes{namespace="tts"})'
# or read the gauge straight from the exporter:
kubectl -n nvidia exec ds/gpu-pod-exporter -- \
  sh -c 'curl -s localhost:9401/metrics | grep "namespace=\"tts\""'
```

Then set the floor to `measured_peak + ~2 GiB` (pass `-var` or add to the stack
tfvars). If the peak is too high to coexist even off-peak, switch
`model.repo_id` in `main.tf` to `chatterbox` (English, lighter) or
`chatterbox-turbo`, or escalate to Option B (scale `immich-machine-learning` to
0 for the window).
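
As a worked example of the `measured_peak + ~2 GiB` rule, the sketch below converts a peak in bytes into a floor value. The helper name is hypothetical and the 4.5 GiB input is a made-up figure, not a measurement.

```bash
GIB=$((1024 * 1024 * 1024))
HEADROOM_BYTES=$((2 * GIB))   # the ~2 GiB headroom described above

# Print the floor in bytes for a measured peak given in bytes.
floor_from_peak() {
  local peak_bytes="$1"
  echo $(( peak_bytes + HEADROOM_BYTES ))
}

# Example with an invented 4.5 GiB peak (NOT a real measurement):
floor_from_peak 4831838208   # prints 6979321856
```

The result would then be supplied as `vram_free_floor_bytes`, for instance via `-var vram_free_floor_bytes=6979321856`, assuming the `tg` wrapper forwards extra arguments to Terraform.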

---

## Build + push the image (do this BEFORE the first apply)

`devnen/Chatterbox-TTS-Server` ships **no published image** — build from the
repo's **cu128** target (matches the cluster's pinned 570.195.03 / CUDA 12.8
driver) and push to the private Forgejo registry. The devvm docker is pre-authed
to `forgejo.viktorbarzin.me`. Run on the devvm (large CUDA image — needs disk +
bandwidth):

```bash
# 1. Clone the upstream server repo (outside the monorepo).
git clone https://github.com/devnen/Chatterbox-TTS-Server /tmp/chatterbox-tts-server
cd /tmp/chatterbox-tts-server

# 2. Build the cu128 variant (Dockerfile.cu128 — PyTorch 2.9.0+cu128, the target
#    the repo's docker-compose-cu128.yml uses) for linux/amd64.
SHA="$(git rev-parse --short=8 HEAD)"
docker build \
  --platform linux/amd64 \
  --build-arg RUNTIME=nvidia \
  -f Dockerfile.cu128 \
  -t forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest \
  -t "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}" \
  .

# 3. Push both tags. If docker isn't authed, log in first with the viktor push
#    PAT from Vault:
#      vault kv get -field=forgejo_push_token secret/ci/global \
#        | docker login forgejo.viktorbarzin.me -u viktor --password-stdin
docker push forgejo.viktorbarzin.me/viktor/chatterbox-tts:latest
docker push "forgejo.viktorbarzin.me/viktor/chatterbox-tts:${SHA}"
```

> If `Dockerfile.cu128` is not a clean `docker build` target (e.g. it relies on
> build args defined only in `docker-compose-cu128.yml`), either lift those args
> onto the `docker build` line, or run
> `docker compose -f docker-compose-cu128.yml build` and then `docker tag` the
> resulting `chatterbox-tts-server:cu128` image to the Forgejo refs above before
> pushing.
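
The compose fallback might look like the sketch below. It assumes the compose build really does produce a local image named `chatterbox-tts-server:cu128`; the `run` and `compose_build_and_tag` helpers are hypothetical and default to a dry run so the commands can be inspected first.

```bash
REGISTRY_REF="forgejo.viktorbarzin.me/viktor/chatterbox-tts"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Build via compose, then retag the local image with both Forgejo tags.
# ASSUMPTION: the local image name matches what the compose file produces.
compose_build_and_tag() {
  local sha="$1"
  run docker compose -f docker-compose-cu128.yml build
  run docker tag chatterbox-tts-server:cu128 "${REGISTRY_REF}:latest"
  run docker tag chatterbox-tts-server:cu128 "${REGISTRY_REF}:${sha}"
}

# Dry run, from the cloned upstream repo:
#   compose_build_and_tag "$(git rev-parse --short=8 HEAD)"
# Real run:
#   DRY_RUN=0 compose_build_and_tag "$(git rev-parse --short=8 HEAD)"
```

After a real run, the two `docker push` commands from step 3 apply unchanged.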

---

## Apply (admin-gated — run in order)

```bash
vault login -method=oidc
~/code/scripts/presence claim node:k8s-node1 --purpose "chatterbox-tts first apply (GPU)"
~/code/scripts/presence claim stack:tts      --purpose "chatterbox-tts stack apply"

# 1. The polite-tenant hardening (exclude tts from gpu-workload priority).
~/code/scripts/tg plan  --stack kyverno
~/code/scripts/tg apply --stack kyverno

# 2. This stack.
~/code/scripts/tg plan  --stack tts
~/code/scripts/tg apply --stack tts        # apply does NOT wake the GPU (replicas=0)

# 3. Flip tripit narration on.
~/code/scripts/tg plan  --stack tripit
~/code/scripts/tg apply --stack tripit
```

See `docs/plans/2026-06-08-chatterbox-tts-infra.md` §5 for the full go-live
checklist (seed voices on NFS-SSD, smoke-test a synth, watch the neighbours).

## Rollback (instant, no data loss)

- **Narration off:** set `TTS_MODE=none` (or drop the three `TTS_*` lines) in
  `stacks/tripit/main.tf` → `tg apply --stack tripit`. The bake makes no audio;
  playback falls back to browser TTS. Cached `story_audio` rows are harmless.
- **Chatterbox off the GPU:** `kubectl -n tts scale deploy/chatterbox-tts
  --replicas=0` (transient) and/or `tg destroy --stack tts`. Best-effort synth
  means tripit bakes keep running audio-less — no error.
- Neither touches the resident GPU tenants (Option A never modifies them).

feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]

New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-09 07:30:19 +00:00