infra/stacks/tts/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}

dependency "vault" {
  config_path  = "../vault"
  skip_outputs = true
}

# tts: in-cluster text-to-speech for tripit's "Tour guide" narration.
# One Deployment of `forgejo.viktorbarzin.me/viktor/chatterbox-tts` (devnen
# Chatterbox-TTS-Server, OpenAI-compatible /v1/audio/speech) at a single
# ClusterIP Service `chatterbox-tts.tts.svc:8000` (server listens on 8004;
# the Service remaps). Requests ONE time-slice of the shared T4
# (nvidia.com/gpu=1) — a slice, not the card.
#
# OOM-avoidance (Option A, docs/plans/2026-06-08-chatterbox-tts-infra.md §3):
# the Deployment sits at replicas=0; an off-peak CronJob scales it to 1 at the
# 02:00–06:00 Europe/London window ONLY IF a free-VRAM preflight passes
# (gpu_pod_memory_used_bytes from gpu-pod-exporter), a guard CronJob yields the
# card mid-window if a resident wakes, and a window-down CronJob scales back to
# 0. tripit's bake is best-effort + cached-forever (ADR-0002/0004), so a
# skipped/aborted window simply backfills next time — no latency SLA.
#
# Polite-tenant hardening: the `tts` namespace must be EXCLUDED from the kyverno
# `inject-gpu-workload-priority` policy (a separate two-line edit to the kyverno
# stack) so Chatterbox keeps tier-2-gpu priority (600000) and is always the pod
# evicted under pressure — never immich-ml/frigate/llama-swap.
#
# Image is built from the devnen repo + pushed to Forgejo — see this stack's
# README.md for the exact docker build + push commands.
feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]

New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (the time-sliced T4 has no VRAM isolation; see
post-mortem 2026-06-02): the Deployment sits at replicas=0 and three
Europe/London CronJobs own the replica count:

- `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight
  passes (used = sum of gpu_pod_memory_used_bytes from gpu-pod-exporter;
  pass when 16GiB - used >= floor).
- `chatterbox-vram-guard` yields the card mid-window if a resident wakes.
- `chatterbox-window-down` scales to 0 at 06:00.

tripit's bake is best-effort + cached-forever (ADR-0002/0004), so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
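
As an illustration of the SA+Role+RoleBinding grant described above, a minimal sketch follows. It assumes the hashicorp/kubernetes Terraform provider is configured; the resource labels and the `chatterbox-scaler` object name are hypothetical and not taken from the stack, and the exact verb list is a guess at what `kubectl scale` typically needs.

```hcl
# Hypothetical sketch: names are illustrative, not copied from the stack.
resource "kubernetes_service_account_v1" "scaler" {
  metadata {
    name      = "chatterbox-scaler" # assumed name
    namespace = "tts"
  }
}

resource "kubernetes_role_v1" "scaler" {
  metadata {
    name      = "chatterbox-scaler" # assumed name
    namespace = "tts"
  }

  # Read the Deployment so the job can look it up before scaling.
  rule {
    api_groups = ["apps"]
    resources  = ["deployments"]
    verbs      = ["get"]
  }

  # Change the replica count through the scale subresource only.
  rule {
    api_groups = ["apps"]
    resources  = ["deployments/scale"]
    verbs      = ["get", "patch", "update"]
  }
}

resource "kubernetes_role_binding_v1" "scaler" {
  metadata {
    name      = "chatterbox-scaler" # assumed name
    namespace = "tts"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role_v1.scaler.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account_v1.scaler.metadata[0].name
    namespace = "tts"
  }
}
```

Scoping the write verbs to `deployments/scale` keeps the CronJobs from editing anything else on the Deployment, such as its image or pod spec.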

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600000) and is always evicted first under GPU
pressure, never immich-ml/frigate/llama-swap. The LimitRange-fallback policy
still uses the base exclude list, so `tts` is not excluded there.

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
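
A rough sketch of what the three additions to `local.app_env` might look like. The `TTS_BASE_URL` value is an assumption built from the Service name and port given above (the commit does not state the exact URL or path), and any keys already present in the real map are omitted here.

```hcl
locals {
  app_env = {
    TTS_MODE = "openai_compatible"
    # Assumed value: derived from the ClusterIP Service name and port only.
    TTS_BASE_URL = "http://chatterbox-tts.tts.svc:8000"
    TTS_MODEL    = "chatterbox"
  }
}
```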

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
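
For reference, the floor and the preflight arithmetic can be sketched as below. The variable name and 6GiB default come from this commit; the description text and the two local names are illustrative assumptions, as is treating the T4 total as exactly 16GiB.

```hcl
variable "vram_free_floor_bytes" {
  description = "Minimum free T4 VRAM in bytes before chatterbox-window-up scales to 1."
  type        = number
  default     = 6442450944 # 6GiB = 6 * 1024^3
}

locals {
  # Assumed constant: the preflight treats the T4 as 16GiB total.
  t4_total_vram_bytes = 16 * 1024 * 1024 * 1024 # 17179869184

  # free = total - used >= floor  is equivalent to  used <= total - floor.
  # With the 6GiB default this ceiling is 10GiB (10737418240 bytes).
  vram_used_ceiling_bytes = local.t4_total_vram_bytes - var.vram_free_floor_bytes
}
```

Raising the floor after the first measured bake lowers the used-VRAM ceiling by the same amount, so the window is skipped more readily when residents are busy.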

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-09 07:30:19 +00:00