portal-tts: DRAFT stack — Piper TTS (CPU, always-on) for portal-assistant

Draft (NOT applied) of a new infra stack deploying Piper as an in-cluster
text-to-speech service for the portal-assistant Gateway (portal-assistant
issue #3, ADR-0003). Bulgarian (bg_BG-dimitar-medium) + English
(en_US-lessac-medium), voice chosen per request.

Why this shape:
- CPU-only, always-on (replicas=1, no GPU): Piper runs in real time on CPU, so
  this keeps TTS off the OOM-prone shared T4 that the two GPU siblings
  (tts/chatterbox, portal-stt) already contend for. Bulgarian isn't on
  chatterbox anyway (its langs exclude bg).
- OpenAI-compatible image (openedai-speech-min, /v1/audio/speech) so the Gateway
  gets raw audio bytes per its tts.synthesize(text, lang) -> bytes contract and
  treats Piper + the future edge-tts fallback identically — same shape
  chatterbox already uses.
- Voices on an NFS-SSD PVC, downloaded from rhasspy/piper-voices by an init
  container on first boot; a ConfigMap maps request voice bg/en -> .onnx model.
- ClusterIP only (audio stays on the LAN; the Gateway is the only externally
  exposed component, ADR-0001).
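
As a rough illustration of the Gateway-side contract described in the list
above, the sketch below maps a language code to the short voice name and builds
the /v1/audio/speech request. It is a minimal sketch under stated assumptions,
not the code in portal-assistant: voice_for_lang, build_speech_request,
synthesize and TTS_URL are hypothetical names, and the payload fields follow
the API shape noted in main.tf.

```python
import json
import urllib.request

# Hypothetical constant: the in-cluster Service URL named in main.tf.
TTS_URL = "http://portal-tts.portal-tts.svc.cluster.local:8000/v1/audio/speech"


def voice_for_lang(lang: str) -> str:
    """Map a language code to the logical voice: "bg" stays "bg", all else is "en"."""
    return "bg" if lang == "bg" else "en"


def build_speech_request(text: str, lang: str, response_format: str = "wav") -> bytes:
    """Build the JSON body for POST /v1/audio/speech (no network access)."""
    payload = {
        "model": "tts-1",
        "input": text,
        "voice": voice_for_lang(lang),
        "response_format": response_format,
        "speed": 1.0,
    }
    return json.dumps(payload).encode("utf-8")


def synthesize(text: str, lang: str, timeout: float = 30.0) -> bytes:
    """Illustrative tts.synthesize(text, lang) -> bytes: returns the raw audio body."""
    request = urllib.request.Request(
        TTS_URL,
        data=build_speech_request(text, lang),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping the payload builder separate from the HTTP call means the lang -> voice
mapping can be unit-tested without a running Piper pod.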

Mirrors the just-written portal-stt sibling stack's conventions. terraform fmt
clean; terraform validate passes (only the codebase-wide kubernetes_namespace
deprecation warnings). HITL: operator reviews + applies via GitOps; do not apply
from a worktree. Open items flagged in main.tf (image choice on a frozen
upstream; resource sizing to confirm with krr).
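
For reference, the voice files the init container downloads sit under a
predictable rhasspy/piper-voices path. The sketch below derives both download
URLs from a voice stem, assuming the stem always has the form
<locale>-<name>-<quality>. voice_urls is a hypothetical helper for illustration
only and is not part of this stack (main.tf keeps the directory paths as
literals).

```python
HF_BASE = "https://huggingface.co/rhasspy/piper-voices/resolve/main"


def voice_urls(stem: str) -> list[str]:
    """Return the .onnx and .onnx.json URLs for a Piper voice stem."""
    parts = stem.split("-")
    if len(parts) != 3:
        # Assumption: stems look like "bg_BG-dimitar-medium".
        raise ValueError(f"expected <locale>-<name>-<quality>, got {stem!r}")
    locale, name, quality = parts
    lang = locale.split("_")[0]  # "bg_BG" -> "bg"
    base = f"{HF_BASE}/{lang}/{locale}/{name}/{quality}/{stem}"
    # Piper needs the model and the JSON config that sits next to it.
    return [f"{base}.onnx", f"{base}.onnx.json"]
```

Calling voice_urls("bg_BG-dimitar-medium") yields the two Bulgarian URLs in the
layout documented in main.tf.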

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-17 18:59:42 +00:00
parent aac7121ccc
commit fe9364b9c9
2 changed files with 438 additions and 0 deletions

stacks/portal-tts/main.tf

@@ -0,0 +1,405 @@
# =============================================================================
# portal-tts: Piper TTS (CPU, always-on) for the portal-assistant Gateway
# =============================================================================
#
# DRAFT for operator review (portal-assistant issue #3). HITL apply: an agent
# drafts; the operator applies via GitOps and verifies the rollout. Do NOT
# `terragrunt apply` this from a worktree.
#
# WHAT: a single ALWAYS-ON Piper deployment serving Bulgarian
# (`bg_BG-dimitar-medium`) AND English (`en_US-lessac-medium`), with the voice
# chosen PER REQUEST, behind a ClusterIP Service `portal-tts.portal-tts.svc:8000`.
# CPU-ONLY: no GPU node selector / toleration / nvidia.com/gpu request (Piper
# is a fast CPU neural TTS; ADR-0003). Audio never leaves the LAN; the
# portal-assistant Gateway is the only externally exposed component (ADR-0001),
# so there is no ingress / Authentik here.
#
# WHY CPU + ALWAYS-ON (contrast the two GPU siblings):
# * tts/ (chatterbox) scales 0<->1 behind a free-VRAM CronJob gate; it is a
# best-effort BATCH tenant on the shared T4.
# * portal-stt/ (Speaches) is warm-resident on ONE T4 slice: interactive STT
# that must not pay a cold model load.
# Piper needs neither: it runs in real time on CPU (no GPU contention at all),
# so the simplest correct design is replicas=1, always up. Keeping it off the
# T4 also REMOVES one tenant from the OOM-prone shared card
# (docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md); Bulgarian
# isn't on chatterbox anyway (its 23 langs exclude bg; ADR-0003).
#
# API SHAPE (the Gateway team's contract): openedai-speech is OpenAI-compatible.
# POST /v1/audio/speech
# Content-Type: application/json
# { "model": "tts-1", "input": "<text>", "voice": "<bg|en>",
# "response_format": "wav", "speed": 1.0 }
# -> 200, body = raw audio bytes (wav/mp3/opus/flac/aac/pcm per response_format)
# This matches the Gateway's tts.synthesize(text, lang) -> bytes interface
# (portal-assistant gateway/app/pipeline.py): map lang "bg"->voice "bg",
# anything else -> "en". Same OpenAI shape chatterbox already uses, so the
# Gateway can treat Piper and the edge-tts fallback identically.
#
# PLUGGABLE FALLBACK (noted, NOT built here; a Gateway-side concern): ADR-0003
# keeps TTS a swappable backend with Microsoft edge-tts as an online
# Bulgarian-quality fallback. The Gateway selects Piper (this Service, on-LAN
# default) vs edge-tts (cloud) per its own config; nothing in THIS stack needs
# to change to add edge-tts. If a second in-cluster engine is ever wanted,
# add a sibling Deployment+Service and let the Gateway choose.
#
# IMAGE CHOICE (OPEN ITEM: operator, please confirm before apply):
# Primary: ghcr.io/matatonic/openedai-speech-min, the CPU-only, piper-only
# (~1 GiB) variant of openedai-speech. OpenAI /v1/audio/speech, multi-voice via
# a voice_to_speaker.yaml map, returns raw audio bytes. Pre-built for
# linux/amd64+arm64, pullable from ghcr (tags: latest, 0.18.2, 0.18.1, ...).
# CAVEAT: the upstream repo was ARCHIVED 2026-01-04 (read-only, no further
# updates / security patches). It is feature-complete and stable for this use,
# but pinned (not Keel-tracked) precisely because it is frozen upstream.
# Alternative if a maintained image is preferred: arkdevuk/Webpiper (FastAPI,
# actively developed), but it returns JSON {url} requiring a SECOND fetch to
# retrieve the wav, a worse fit for the Gateway's bytes contract; it would
# need a small Gateway adapter. The rhasspy/wyoming-piper HTTP server is NOT
# suitable: it loads ONE voice per process (no per-request voice switch).
# =============================================================================
variable "nfs_server" {
type = string
description = "NFS server (Proxmox host). From config.tfvars (192.168.1.127)."
}
variable "piper_image" {
type = string
# CPU-only piper-only openedai-speech. Pinned to 0.18.2 (the newest published
# tag; repo archived 2026-01-04 so this is effectively the final release) for
# reproducibility, NOT :latest, which would resolve to the same frozen build
# but lose the explicit version record. linux/amd64 confirmed.
default = "ghcr.io/matatonic/openedai-speech-min:0.18.2"
description = "openedai-speech CPU/piper-only image. See IMAGE CHOICE note in main.tf."
}
variable "bg_voice" {
type = string
# The single Bulgarian Piper voice (rhasspy/piper-voices). ADR-0003 names this
# exact model; it was reviewed against edge-tts in the M0.2 bake-off.
default = "bg_BG-dimitar-medium"
description = "Bulgarian Piper voice model stem (rhasspy/piper-voices)."
}
variable "en_voice" {
type = string
# English Piper voice. lessac-medium is the canonical balanced en_US voice
# (the upstream openedai-speech default-quality pick). Swappable to any
# rhasspy/piper-voices en stem (e.g. en_US-amy-medium, en_GB-alba-medium).
default = "en_US-lessac-medium"
description = "English Piper voice model stem (rhasspy/piper-voices)."
}
locals {
namespace = "portal-tts"
labels = { app = "portal-tts" }
# rhasspy/piper-voices HuggingFace layout:
# resolve/main/<lang>/<locale>/<name>/<quality>/<locale>-<name>-<quality>.onnx
# Each voice needs BOTH the .onnx and the matching .onnx.json (Piper reads the
# sample rate / phoneme map from the .json next to the model). All 4 URLs were
# HEAD-verified 200 on 2026-06-17. The file names are derived from the voice
# stems, but the directory paths in voice_models are literals: when changing a
# voice variable, update its directory path here as well.
hf_base = "https://huggingface.co/rhasspy/piper-voices/resolve/main"
voice_models = {
# voice stem => HF directory path "<lang>/<locale>/<name>/<quality>"
(var.bg_voice) = "bg/bg_BG/dimitar/medium"
(var.en_voice) = "en/en_US/lessac/medium"
}
# voice_to_speaker.yaml: the openedai-speech config that maps a REQUEST voice
# name ("bg" / "en") to a Piper .onnx on the PVC. The Gateway sends
# voice="bg" or "en"; the server resolves it here. (We expose short logical
# names, not the long model stems, so the Gateway's lang->voice map is trivial
# and stable even if the underlying model stem changes.) The default model
# name is "tts-1" (openedai-speech's piper model id).
voice_to_speaker = yamlencode({
"tts-1" = {
bg = {
model = "voices/${var.bg_voice}.onnx"
speaker = null # single-speaker model -> default
}
en = {
model = "voices/${var.en_voice}.onnx"
speaker = null
}
}
})
# Init-container provisioning script. Downloads each voice's .onnx + .onnx.json
# into the PVC's voices/ dir IF MISSING (idempotent: re-runs skip
# already-present files), then copies the config map's voice_to_speaker.yaml
# into the PVC's config/ dir. Pure POSIX + wget (busybox has wget).
download_script = <<-EOT
set -eu
mkdir -p /data/voices /data/config
fetch() { # $1 = url, $2 = dest
if [ -s "$2" ]; then echo "have $2"; return 0; fi
echo "get $1 -> $2"
wget -q -O "$2.tmp" "$1"
mv "$2.tmp" "$2"
}
%{for stem, dir in local.voice_models~}
fetch "${local.hf_base}/${dir}/${stem}.onnx" "/data/voices/${stem}.onnx"
fetch "${local.hf_base}/${dir}/${stem}.onnx.json" "/data/voices/${stem}.onnx.json"
%{endfor~}
cp /config-src/voice_to_speaker.yaml /data/config/voice_to_speaker.yaml
echo "voices:"; ls -la /data/voices
echo "config:"; cat /data/config/voice_to_speaker.yaml
EOT
}
resource "kubernetes_namespace" "portal_tts" {
metadata {
name = local.namespace
labels = {
tier = local.tiers.aux # CPU-only best-effort helper, not a GPU tenant
"istio-injection" = "disabled"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Voice models + config on NFS-SSD: fast first-load, persists across restarts,
# and RWX so a future seed/inspect pod can touch it. Path /srv/nfs-ssd/portal-tts
# on the Proxmox host. Small: two medium Piper voices are ~60-120 MiB each.
# Mirrors portal-stt's nfs_models pattern.
module "nfs_models" {
source = "../../modules/kubernetes/nfs_volume"
name = "portal-tts-models"
namespace = kubernetes_namespace.portal_tts.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs-ssd/portal-tts"
storage = "2Gi" # 2 medium Piper voices + headroom for more
}
# One-shot bootstrap: /srv/nfs-ssd is exported whole-tree but the portal-tts
# SUBDIR must exist before kubelet can bind-mount it (chatterbox/portal-stt both
# hit exit 32 on a missing subdir). Mount the export ROOT (which exists) and
# mkdir the subtree; kubelet's mount retry then heals the main pod. Idempotent;
# immutable-once-created.
resource "kubernetes_job" "models_dir_init" {
metadata {
name = "portal-tts-models-dir-init"
namespace = kubernetes_namespace.portal_tts.metadata[0].name
labels = local.labels
}
spec {
backoff_limit = 3
ttl_seconds_after_finished = 86400
template {
metadata { labels = local.labels }
spec {
restart_policy = "Never"
container {
name = "mkdir"
image = "busybox:1.37"
command = ["sh", "-c", "mkdir -p /mnt/portal-tts/voices /mnt/portal-tts/config && ls -la /mnt/portal-tts"]
volume_mount {
name = "nfs-ssd-root"
mount_path = "/mnt"
}
}
volume {
name = "nfs-ssd-root"
nfs {
server = var.nfs_server
path = "/srv/nfs-ssd"
}
}
}
}
}
wait_for_completion = true
timeouts { create = "3m" }
}
# The voice_to_speaker.yaml map, mounted into the init container which copies it
# onto the PVC (openedai-speech reads config from a writable dir). Checksum drives
# a rollout when the voice map changes.
resource "kubernetes_config_map" "voices" {
metadata {
name = "portal-tts-voices"
namespace = kubernetes_namespace.portal_tts.metadata[0].name
labels = local.labels
}
data = {
"voice_to_speaker.yaml" = local.voice_to_speaker
}
}
# Always-on Piper. replicas=1, never scaled to zero (no off-peak gate). CPU-only:
# NO node_selector / toleration / nvidia.com/gpu; it runs on any worker. The
# init container downloads the voices to the PVC and seeds the config before the
# server starts; openedai-speech then serves both voices, selectable per request.
resource "kubernetes_deployment" "portal_tts" {
metadata {
name = "portal-tts"
namespace = kubernetes_namespace.portal_tts.metadata[0].name
labels = merge(local.labels, { tier = local.tiers.aux })
}
# wait_for_rollout left default (true): a plain apply SHOULD block until the
# voices download + the server reports healthy, surfacing a bad image/voice
# early. First boot pulls both voices (4 files) from HuggingFace onto the PVC.
spec {
replicas = 1
# NFS PVC is RWX so RollingUpdate would be safe, but Recreate keeps it simple
# and avoids two pods racing the same voices/ dir on first download.
strategy { type = "Recreate" }
selector {
match_labels = { app = "portal-tts" }
}
template {
metadata {
labels = { app = "portal-tts" }
annotations = {
"checksum/voices" = sha256(local.voice_to_speaker)
}
}
spec {
# Download voices + seed config onto the PVC before the server starts.
init_container {
name = "fetch-voices"
image = "busybox:1.37"
command = ["sh", "-c", local.download_script]
volume_mount {
name = "models"
mount_path = "/data"
}
volume_mount {
name = "config-src"
mount_path = "/config-src"
}
resources {
requests = { cpu = "20m", memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
container {
name = "portal-tts"
image = var.piper_image
# openedai-speech serves /v1/audio/speech on :8000. It reads
# config/voice_to_speaker.yaml and voices/*.onnx relative to its
# workdir (/app); we mount the PVC at /app/voices and /app/config so
# the init-seeded files are found.
port {
container_port = 8000
name = "http"
}
env {
name = "OPENEDAI_LOG_LEVEL"
value = "INFO" # image default is INFO; explicit for legibility
}
volume_mount {
name = "models"
mount_path = "/app/voices"
sub_path = "voices"
}
volume_mount {
name = "models"
mount_path = "/app/config"
sub_path = "config"
}
# openedai-speech has no /health on its OpenAI surface. Use a TCP probe:
# the uvicorn socket binds only after the app (and the piper voices
# index) is ready. The init container already downloaded the voices, so
# process start is fast.
startup_probe {
tcp_socket { port = 8000 }
period_seconds = 5
failure_threshold = 24 # ~2 min
}
readiness_probe {
tcp_socket { port = 8000 }
period_seconds = 15
failure_threshold = 4
}
liveness_probe {
tcp_socket { port = 8000 }
initial_delay_seconds = 20
period_seconds = 30
failure_threshold = 5
}
resources {
# Piper is light on CPU. No CPU limit (cluster policy: CFS throttling
# avoided). Memory sized for the python runtime + 2 loaded onnx voices
# (each medium model ~60-120 MiB + onnxruntime arena). VERIFY with krr
# after a few days of real traffic and tighten.
requests = {
cpu = "100m"
memory = "512Mi"
}
limits = {
memory = "1Gi"
}
}
}
volume {
name = "models"
persistent_volume_claim {
claim_name = module.nfs_models.claim_name
}
}
volume {
name = "config-src"
config_map {
name = kubernetes_config_map.voices.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
# image is TF-OWNED (pinned tag on a FROZEN upstream); do NOT let Keel
# churn it. Ignore keel's annotation noise but keep the image pin applying.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
# ClusterIP: in-cluster only (the Gateway calls this; audio stays on the LAN).
# No ingress, no Authentik: the Gateway is the only externally exposed component
# (ADR-0001) and holds the edge auth. OpenAI speech path:
# http://portal-tts.portal-tts.svc.cluster.local:8000/v1/audio/speech
resource "kubernetes_service" "portal_tts" {
metadata {
name = "portal-tts"
namespace = kubernetes_namespace.portal_tts.metadata[0].name
labels = local.labels
annotations = {
# openedai-speech has no /metrics endpoint. The annotation-based scrape is
# pointed at the OpenAI models list (/v1/models) only as an up/down signal, so
# the Service stays in the scrape set (Ready-endpoint relabeling filters
# non-Ready pods).
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "/v1/models"
"prometheus.io/port" = "8000"
}
}
spec {
type = "ClusterIP"
selector = { app = "portal-tts" }
port {
name = "http"
port = 8000
target_port = 8000
}
}
}


@@ -0,0 +1,33 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
# portal-tts: in-cluster text-to-speech for the portal-assistant Gateway
# (portal-assistant issue #3, ADR-0003). One ALWAYS-ON Deployment of Piper
# (ghcr.io/matatonic/openedai-speech-min, OpenAI-compatible /v1/audio/speech)
# serving Bulgarian `bg_BG-dimitar-medium` + English `en_US-lessac-medium`, voice
# chosen PER REQUEST, behind a single ClusterIP Service
# `portal-tts.portal-tts.svc:8000`. Speech path: /v1/audio/speech.
#
# CPU-ONLY: Piper is a fast CPU neural TTS; NO GPU node selector / toleration /
# nvidia.com/gpu request. This deliberately keeps TTS off the OOM-prone shared
# T4 (the two GPU siblings tts/chatterbox + portal-stt already contend for it);
# Bulgarian isn't available on chatterbox anyway (ADR-0003). replicas=1, never
# scaled to zero; no off-peak gate needed when there's no GPU to free.
#
# Voices live on an NFS-SSD PVC, downloaded from rhasspy/piper-voices by an init
# container on first boot (both .onnx + .onnx.json), then persist. A ConfigMap
# supplies voice_to_speaker.yaml mapping request voice "bg"/"en" -> .onnx model.
#
# PLUGGABLE: ADR-0003 keeps TTS a swappable backend with edge-tts as an online
# Bulgarian fallback. That switch is Gateway-side; nothing here changes for it.
#
# nfs_server comes from config.tfvars (192.168.1.127) via the root inputs.
#
# HITL: agent drafts; operator applies via GitOps, then verifies the rollout +
# a bg/en /v1/audio/speech smoke test (curl returns audio bytes).