infra/stacks/portal-stt/terragrunt.hcl

34 lines
1.6 KiB
HCL
Raw Normal View History

2026-06-17 20:25:29 +00:00
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
# portal-stt: in-cluster speech-to-text for the portal-assistant Gateway
# (portal-assistant issue #2, ADR-0003). One Deployment of Speaches
# (ghcr.io/speaches-ai/speaches, OpenAI-compatible faster-whisper) serving
# `large-v3-turbo` int8, multilingual (Bulgarian + English), behind a single
# ClusterIP Service `portal-stt.portal-stt.svc:8000`. Transcription path:
# /v1/audio/transcriptions. Requests ONE time-slice of the shared T4
# (nvidia.com/gpu=1) — a slice, not the card.
#
# WARM-RESIDENT (NOT the tts/chatterbox demand-gate): replicas=1, never scaled
# to zero. The model is preloaded at startup (PRELOAD_MODELS) and never unloaded
# (STT_MODEL_TTL=-1) so interactive voice Turns never pay a cold model load.
# Chatterbox can scale 0<->1 because it is best-effort batch narration; STT is
# latency-critical and must stay warm. See portal-assistant CONTEXT.md
# "Warm window".
#
# VRAM safety on the shared T4 (16 GiB, no per-tenant isolation): int8 weights
# budget ~1.5 GiB; worst-case alongside immich-ml (~2.1) + frigate (~1.9) +
# llama-swap qwen3-8b (~4.35) leaves ~6 GiB headroom. This pod is NOT excluded
# from the kyverno gpu-priority policy, so it correctly gets the immich-equal
# `gpu-workload` priority (first-class resident, never evicted first) — the
# inverse of tts. Full VRAM math + the OOM post-mortem reference are in main.tf.
#
# HITL: agent drafts; operator presence-claims the T4 and applies via GitOps,
# then verifies the rollout + a bg/en transcription smoke test.