feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]

New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
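
The port remap can be pictured as a minimal sketch like the one below. It is not the stack's actual code: the resource label and the `app` selector label are assumptions, and only the name, namespace, and port numbers come from the description above.

```hcl
# Hypothetical sketch: ClusterIP Service exposing 8000 and forwarding to the
# server's listen port 8004. The selector label is an assumption.
resource "kubernetes_service_v1" "chatterbox_tts" {
  metadata {
    name      = "chatterbox-tts"
    namespace = "tts"
  }

  spec {
    type     = "ClusterIP"
    selector = { app = "chatterbox-tts" } # assumed pod label

    port {
      name        = "http"
      port        = 8000 # what clients dial: chatterbox-tts.tts.svc:8000
      target_port = 8004 # what the container listens on
    }
  }
}
```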

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (used = sum of gpu_pod_memory_used_bytes
from gpu-pod-exporter; pass when 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
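
A rough sketch of how the 02:00 window-up gate could look, assuming the preflight scrapes gpu-pod-exporter directly over HTTP. The variables `preflight_image` and `gpu_exporter_metrics_url`, the ServiceAccount name `chatterbox-scaler`, and the Deployment name `chatterbox-tts` are all hypothetical; only the CronJob name, schedule, time zone, metric name, and `var.vram_free_floor_bytes` come from this commit.

```hcl
# Hypothetical inputs, not part of the stack as described.
variable "preflight_image" {
  description = "Image that ships sh, wget, awk and kubectl (assumption)"
  type        = string
}

variable "gpu_exporter_metrics_url" {
  description = "Scrape URL of gpu-pod-exporter (assumption)"
  type        = string
}

locals {
  # Sum the per-pod VRAM samples and scale up only if enough is free.
  preflight_script = <<-EOT
    set -eu
    total=17179869184 # 16 GiB, the figure used in the commit message
    metrics=$(wget -qO- "$METRICS_URL") # a failed scrape aborts the job
    used=$(printf '%s\n' "$metrics" | awk '/^gpu_pod_memory_used_bytes/ { s += $NF } END { printf "%.0f", s }')
    free=$((total - used))
    if [ "$free" -ge "$FLOOR_BYTES" ]; then
      kubectl -n tts scale deployment/chatterbox-tts --replicas=1
    else
      echo "preflight failed: free=$free floor=$FLOOR_BYTES, skipping window"
    fi
  EOT
}

resource "kubernetes_cron_job_v1" "chatterbox_window_up" {
  metadata {
    name      = "chatterbox-window-up"
    namespace = "tts"
  }

  spec {
    schedule           = "0 2 * * *"
    timezone           = "Europe/London"
    concurrency_policy = "Forbid"

    job_template {
      metadata {}
      spec {
        backoff_limit = 0
        template {
          metadata {}
          spec {
            service_account_name = "chatterbox-scaler" # assumed SA name
            restart_policy       = "Never"

            container {
              name    = "preflight"
              image   = var.preflight_image
              command = ["/bin/sh", "-c", local.preflight_script]

              env {
                name  = "METRICS_URL"
                value = var.gpu_exporter_metrics_url
              }
              env {
                name  = "FLOOR_BYTES"
                value = tostring(var.vram_free_floor_bytes)
              }
            }
          }
        }
      }
    }
  }
}
```

Fetching the metrics into a variable before summing means a failed scrape aborts the job under `set -e` instead of being read as zero usage and passing the gate.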

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
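
For illustration, the added entries might look roughly like this. The map shape of `local.app_env` and the `/v1` suffix on the URL are assumptions; the keys, the model name, and the Service host and port are taken from this message.

```hcl
# Hypothetical shape of the tripit env additions; the real local may differ.
locals {
  app_env = {
    TTS_MODE     = "openai_compatible"
    TTS_BASE_URL = "http://chatterbox-tts.tts.svc:8000/v1" # path is assumed
    TTS_MODEL    = "chatterbox"
    # No API token: the endpoint is reachable only via the ClusterIP Service.
  }
}
```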

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
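
As a sketch, the floor could be declared as below. The variable name and 6GiB default come from this message; the description text and the validation rule are illustrative additions.

```hcl
# Illustrative declaration; only the name and default are from the commit.
variable "vram_free_floor_bytes" {
  description = "Minimum free T4 VRAM (bytes) required before scaling Chatterbox to 1"
  type        = number
  default     = 6442450944 # 6 GiB = 6 * 1024^3

  validation {
    # The floor must fit inside the 16 GiB total used by the preflight.
    condition     = var.vram_free_floor_bytes > 0 && var.vram_free_floor_bytes <= 17179869184
    error_message = "vram_free_floor_bytes must be between 1 and 16 GiB (in bytes)."
  }
}
```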

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-09 07:30:19 +00:00
parent edaee13be3
commit 87702bdce8
4 changed files with 672 additions and 1 deletions


@@ -15,6 +15,15 @@
 locals {
   governance_tiers = ["0-core", "1-cluster", "2-gpu", "3-edge", "4-aux"]
   excluded_namespaces = ["kube-system", "metallb-system", "kyverno", "calico-system", "calico-apiserver"]
+  # GPU-priority injection exclude list. Adds `tts` to the base set so the
+  # `inject-gpu-workload-priority` policy does NOT stamp the immich-equal
+  # gpu-workload (1,200,000) priority on Chatterbox-TTS pods. Chatterbox is a
+  # best-effort off-peak batch tenant on the shared T4: it must keep its
+  # tier-2-gpu (600,000) priority so it is ALWAYS the pod evicted under GPU-node
+  # pressure, never immich-ml/frigate/llama-swap. See the tts stack
+  # (stacks/tts/) + docs/plans/2026-06-08-chatterbox-tts-infra.md §3.
+  gpu_priority_excluded_namespaces = concat(local.excluded_namespaces, ["tts"])
 }
 # -----------------------------------------------------------------------------
@@ -905,7 +914,10 @@ resource "kubectl_manifest" "mutate_gpu_priority" {
 any = [
   {
     resources = {
-      namespaces = local.excluded_namespaces
+      # tts added so Chatterbox-TTS keeps tier-2-gpu priority (it is a
+      # best-effort off-peak batch tenant that must be evicted first,
+      # not promoted to immich-equal gpu-workload). See locals above.
+      namespaces = local.gpu_priority_excluded_namespaces
     }
   }
 ]