34 lines
1.6 KiB
HCL
34 lines
1.6 KiB
HCL
|
|
include "root" {
|
||
|
|
path = find_in_parent_folders()
|
||
|
|
}
|
||
|
|
|
||
|
|
dependency "platform" {
|
||
|
|
config_path = "../platform"
|
||
|
|
skip_outputs = true
|
||
|
|
}
|
||
|
|
|
||
|
|
# portal-stt: in-cluster speech-to-text for the portal-assistant Gateway
|
||
|
|
# (portal-assistant issue #2, ADR-0003). One Deployment of Speaches
|
||
|
|
# (ghcr.io/speaches-ai/speaches, OpenAI-compatible faster-whisper) serving
|
||
|
|
# `large-v3-turbo` int8, multilingual (Bulgarian + English), behind a single
|
||
|
|
# ClusterIP Service `portal-stt.portal-stt.svc:8000`. Transcription path:
|
||
|
|
# /v1/audio/transcriptions. Requests ONE time-slice of the shared T4
|
||
|
|
# (nvidia.com/gpu=1) — a slice, not the card.
|
||
|
|
#
|
||
|
|
# WARM-RESIDENT (NOT the tts/chatterbox demand-gate): replicas=1, never scaled
|
||
|
|
# to zero. The model is preloaded at startup (PRELOAD_MODELS) and never unloaded
|
||
|
|
# (STT_MODEL_TTL=-1) so interactive voice Turns never pay a cold model load.
|
||
|
|
# Chatterbox can scale 0<->1 because it is best-effort batch narration; STT is
|
||
|
|
# latency-critical and must stay warm. See portal-assistant CONTEXT.md
|
||
|
|
# "Warm window".
|
||
|
|
#
|
||
|
|
# VRAM safety on the shared T4 (16 GiB, no per-tenant isolation): int8 weights
|
||
|
|
# budget ~1.5 GiB; worst-case alongside immich-ml (~2.1) + frigate (~1.9) +
|
||
|
|
# llama-swap qwen3-8b (~4.35) leaves ~6 GiB headroom. This pod is NOT excluded
|
||
|
|
# from the kyverno gpu-priority policy, so it correctly gets the immich-equal
|
||
|
|
# `gpu-workload` priority (first-class resident, never evicted first) — the
|
||
|
|
# inverse of tts. Full VRAM math + the OOM post-mortem reference are in main.tf.
|
||
|
|
#
|
||
|
|
# HITL: agent drafts; operator presence-claims the T4 and applies via GitOps,
|
||
|
|
# then verifies the rollout + a bg/en transcription smoke test.
|