diff --git a/docs/architecture/llama-cpp.md b/docs/architecture/llama-cpp.md new file mode 100644 index 00000000..299c8d4e --- /dev/null +++ b/docs/architecture/llama-cpp.md @@ -0,0 +1,113 @@ +# llama-cpp / llama-swap + +## Overview + +In-cluster, OpenAI-compatible vision-LLM endpoint. A single +`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models +served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on +demand by `llama-swap`. One Service, one `/v1` endpoint, model +selected by the request body `model` field. + +Initial use case: vision-LLM benchmark on a curated Immich album, +choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and +**Qwen3-VL-4B** for instagram-poster's candidate-scoring path. +Future consumers (Home Assistant, agentic tooling) can hit the same +endpoint via LiteLLM at the cluster gateway. + +## Why llama.cpp + llama-swap (not Ollama) + +Verified across 7+7 research/challenger subagents (2026-05-10): + +- **Broader OpenAI-compat surface** — `tool_choice`, `image_url` + remote URLs, native bearer auth via `--api-key`, `/reranking`, + Anthropic `/v1/messages` shim. +- **Native observability** — `/metrics`, `/health` returns 503 during + model load (proper K8s startup-probe semantics), `/slots` per-slot + tracking. Ollama still has the `/metrics` issue + [#3144](https://github.com/ollama/ollama/issues/3144) open. +- **Stricter structured output** — native GBNF on `/completion`, + JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`. +- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports + Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official + `qwen3-vl` tag (community GGUFs broken — split-mmproj + [#14575](https://github.com/ollama/ollama/issues/14575)) and the + `openbmb/minicpm-v4.5` Ollama tag is 8 months stale. + +Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and +ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in) +— the latter is mooted by fronting llama.cpp with **LiteLLM** at the +gateway. + +## Components + +| Component | Resource | Purpose | +|-----------|----------|---------| +| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses | +| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) | +| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080` → `llama-swap.llama-cpp.svc.cluster.local` | +| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi | +| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache | + +## Storage + +NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`). +Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark +run (<10%). The download Job warms the kernel page cache after pulling +GGUFs so first inference reads from warm cache. + +If steady-state cold-load latency becomes a problem, **Path B**: carve +~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1, +mount on-host, expose via a static `kubernetes_persistent_volume` with +`local` source + node1 affinity. NVMe-class load times. Out of scope +for the initial deployment. + +## GPU allocation + +The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4 +allocation). The shared T4 is also used by Immich's ML pod +(`immich.immich-machine-learning`); only one of the two can hold the +GPU at a time. Operator must scale immich-ml to 0 before running a +benchmark and restore it after: + +```bash +kubectl scale -n immich deploy/immich-machine-learning --replicas=0 +# ... benchmark ... +kubectl scale -n immich deploy/immich-machine-learning --replicas=1 +``` + +## Models served + +| ID | HF repo | Quant | Ctx | mmproj | +|----|---------|-------|-----|--------| +| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes | +| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes | +| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes | + +llama.cpp build pinned via the `llama-swap:cuda` image (ships a +recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix +[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and +mtmd Flash-Attention regression fix +[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)). + +## Endpoints + +- `GET /v1/models` — list configured models +- `POST /v1/chat/completions` — standard OpenAI chat (vision via + `image_url` content parts, base64 or remote URL) +- `POST /completion` — llama.cpp native completion (preferred for + GBNF-constrained structured output to avoid 2026 regression magnet + on `/v1/chat/completions`) +- `GET /metrics` — Prometheus +- `GET /health` — 200 once a model is fully loaded; 503 during load + +## Known issues / decisions + +- **Cluster-wide GPU contention** — only one of llama-swap or + immich-ml can hold the T4. No GPU sharing solution wired in + (MPS/MIG would help but T4 has no MIG and MPS is overkill for two + workloads). +- **Filename-agnostic config** — the download Job creates stable + `model.gguf` / `mmproj.gguf` symlinks per model dir so the + llama-swap config doesn't need to track exact HF filenames (which + change between releases). +- **TF schema** — `llama-cpp` (PG backend on dbaas). diff --git a/stacks/llama-cpp/main.tf b/stacks/llama-cpp/main.tf new file mode 100644 index 00000000..f3289856 --- /dev/null +++ b/stacks/llama-cpp/main.tf @@ -0,0 +1,371 @@ +locals { + namespace = "llama-cpp" + labels = { app = "llama-cpp" } + + # llama-swap fronts per-model llama.cpp instances. The :cuda image + # ships a recent llama-server inside, which is what gets spawned per + # model. One Service, one /v1 endpoint, model selected by the + # OpenAI `model` field. mostlygeek/llama-swap is production-grade + # (3.9k★, v211, May 2026). + llamaswap_image = "ghcr.io/mostlygeek/llama-swap:cuda" + + # Three vision models for the benchmark sweep. All Apache-2.0, all GGUF + # Q4_K_M (T4 has no FP8/BF16 — INT4 is the right knob). Image long-edge + # capped at 1024 px to keep prefill <2s on the T4. + # + # Filenames are matched by glob in the download Job (huggingface_hub + # snapshot_download with allow_patterns). Stable symlinks model.gguf / + # mmproj.gguf are created after download so llama-swap config can be + # filename-agnostic. + models = { + qwen3vl-8b = { + hf_repo = "Qwen/Qwen3-VL-8B-Instruct-GGUF" + gguf_pattern = "*Q4_K_M*.gguf" + mmproj_pattern = "*mmproj*.gguf" + ctx_size = 3072 + gpu_layers = 99 + } + minicpm-v-4-5 = { + hf_repo = "openbmb/MiniCPM-V-4_5-gguf" + gguf_pattern = "*Q4_K_M*.gguf" + mmproj_pattern = "*mmproj*.gguf" + ctx_size = 3072 + gpu_layers = 99 + } + qwen3vl-4b = { + hf_repo = "Qwen/Qwen3-VL-4B-Instruct-GGUF" + gguf_pattern = "*Q4_K_M*.gguf" + mmproj_pattern = "*mmproj*.gguf" + ctx_size = 3072 + gpu_layers = 99 + } + } + + # YAML config rendered into the ConfigMap. llama-swap reads /app/config.yaml. + # ${PORT} is substituted by llama-swap; ${MODEL_ID} is the model key. + llama_swap_config = yamlencode({ + healthCheckTimeout = 180 # 60-90s is typical model load on NFS-SSD + logLevel = "info" + logToStdout = "both" + startPort = 5800 + + macros = { + llama_server_base = "/app/llama-server --host 0.0.0.0 --port $${PORT} --jinja -fa -np 1" + } + + models = { + for mid, cfg in local.models : mid => { + cmd = join(" ", [ + "/app/llama-server", + "--host 0.0.0.0", + "--port $${PORT}", + "-m /models/${mid}/model.gguf", + "--mmproj /models/${mid}/mmproj.gguf", + "-ngl ${cfg.gpu_layers}", + "-c ${cfg.ctx_size}", + "-np 1", + "--jinja", + "-fa", + ]) + ttl = 600 # unload after 10 min idle + checkEndpoint = "/health" + } + } + }) +} + +resource "kubernetes_namespace" "llama_cpp" { + metadata { + name = local.namespace + labels = { + tier = local.tiers.gpu + "istio-injection" = "disabled" + } + } + lifecycle { + ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] + } +} + +# Shared model store. NFS-RWX so the download Job can write while +# the llama-swap Deployment mounts it. Path /srv/nfs-ssd/llamacpp on +# the Proxmox host (SSD-backed for fast model load — Q4_K_M 8B mmaps in +# ~2s vs ~10s on HDD NFS). Page-cache is warmed by the download Job so +# first inference reads from warm cache. +module "nfs_models" { + source = "../../modules/kubernetes/nfs_volume" + name = "llama-cpp-models" + namespace = kubernetes_namespace.llama_cpp.metadata[0].name + nfs_server = "192.168.1.127" + nfs_path = "/srv/nfs-ssd/llamacpp" + storage = "30Gi" +} + +# One-shot download Job. Pulls Q4_K_M GGUF + mmproj for every model in +# locals.models into /models//, creates stable model.gguf / +# mmproj.gguf symlinks, then warms the page cache. Idempotent — +# huggingface_hub's snapshot_download skips files that already exist +# with matching size; symlinks are recreated each run. +resource "kubernetes_job_v1" "download_models" { + metadata { + name = "download-models" + namespace = kubernetes_namespace.llama_cpp.metadata[0].name + labels = local.labels + } + spec { + backoff_limit = 2 + ttl_seconds_after_finished = 86400 + template { + metadata { labels = local.labels } + spec { + restart_policy = "OnFailure" + + container { + name = "download" + image = "python:3.12-slim" + command = ["/bin/bash", "-c", <<-EOT + set -euo pipefail + pip install --quiet --no-cache-dir 'huggingface_hub>=0.24' + python - <<'PY' + import json, os, glob + from huggingface_hub import snapshot_download + models = json.loads(os.environ["MODELS_JSON"]) + for mid, cfg in models.items(): + local_dir = f"/models/{mid}" + os.makedirs(local_dir, exist_ok=True) + print(f"==> downloading {mid} from {cfg['hf_repo']} -> {local_dir}", flush=True) + snapshot_download( + repo_id=cfg["hf_repo"], + local_dir=local_dir, + allow_patterns=[cfg["gguf_pattern"], cfg["mmproj_pattern"]], + token=os.environ.get("HF_TOKEN") or None, + # Single-threaded download — multi-worker buffers + # multi-GB chunks per worker and OOMs the Job at 2Gi. + max_workers=1, + ) + # Resolve actual filenames and create stable symlinks so + # llama-swap config is filename-agnostic. + ggufs = [p for p in glob.glob(f"{local_dir}/*Q4_K_M*.gguf") if "mmproj" not in p.lower()] + mmprojs = glob.glob(f"{local_dir}/*mmproj*.gguf") + if not ggufs: + raise SystemExit(f"no GGUF found in {local_dir}") + if not mmprojs: + raise SystemExit(f"no mmproj found in {local_dir}") + gguf_link = f"{local_dir}/model.gguf" + mmproj_link = f"{local_dir}/mmproj.gguf" + for link, target in ((gguf_link, ggufs[0]), (mmproj_link, mmprojs[0])): + if os.path.islink(link) or os.path.exists(link): + os.unlink(link) + os.symlink(os.path.basename(target), link) + print(f"==> done {mid}", flush=True) + for f in sorted(os.listdir(local_dir)): + full = os.path.join(local_dir, f) + if os.path.isfile(full) and not os.path.islink(full): + print(f" {f} ({os.path.getsize(full):,} bytes)", flush=True) + print("==> warming page cache", flush=True) + PY + # Warm the kernel page cache so first inference reads warm. + # Wrapped in bash (not the Python heredoc) to keep the cat + # output out of stdout buffering. + find /models -type f -name '*.gguf' ! -name 'model.gguf' ! -name 'mmproj.gguf' \ + -exec sh -c 'cat "$1" > /dev/null' _ {} \; + echo "ALL DONE" + EOT + ] + env { + name = "MODELS_JSON" + value = jsonencode(local.models) + } + env { + name = "HF_HUB_ENABLE_HF_TRANSFER" + value = "0" + } + # Optional: HF token from Vault (rate-limit avoidance). Sourced + # from the existing `viktor` Vault path which holds personal + # creds. Empty string is acceptable (anonymous downloads). + env { + name = "HF_TOKEN" + value_from { + secret_key_ref { + name = "hf-token" + key = "token" + optional = true + } + } + } + volume_mount { + name = "models" + mount_path = "/models" + } + resources { + requests = { cpu = "100m", memory = "256Mi" } + # 4Gi covers the worst-case huggingface_hub buffer (single + # 5GB GGUF chunked over HTTP) plus interpreter overhead. + # 2Gi was hit by the previous run. + limits = { memory = "4Gi" } + } + } + + volume { + name = "models" + persistent_volume_claim { + claim_name = module.nfs_models.claim_name + } + } + } + } + } + wait_for_completion = false + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + metadata[0].annotations, + ] + } +} + +resource "kubernetes_config_map" "llama_swap_config" { + metadata { + name = "llama-swap-config" + namespace = kubernetes_namespace.llama_cpp.metadata[0].name + labels = local.labels + } + data = { + "config.yaml" = local.llama_swap_config + } +} + +# Single Deployment running llama-swap. Spawns per-model llama-server +# subprocesses on demand and unloads them after `ttl` seconds idle. +# The whole T4 is allocated to this pod via nvidia.com/gpu=1; immich-ml +# must be scaled to 0 during benchmark runs. +resource "kubernetes_deployment" "llama_swap" { + metadata { + name = "llama-swap" + namespace = kubernetes_namespace.llama_cpp.metadata[0].name + labels = merge(local.labels, { tier = local.tiers.gpu }) + } + # Don't block apply on rollout — the GPU is shared with immich-ml and + # the pod stays Pending until the operator scales immich-ml=0 for a + # benchmark window. Apply is "create the desired state, don't wait + # for it to be reachable". + wait_for_rollout = false + spec { + replicas = 1 + strategy { type = "Recreate" } + + selector { + match_labels = { app = "llama-cpp", component = "llama-swap" } + } + + template { + metadata { + labels = { app = "llama-cpp", component = "llama-swap" } + annotations = { + # Bounce the pod whenever the configmap content changes. + "checksum/config" = sha256(local.llama_swap_config) + } + } + spec { + node_selector = { gpu = "true" } + toleration { + key = "nvidia.com/gpu" + operator = "Equal" + value = "true" + effect = "NoSchedule" + } + + container { + name = "llama-swap" + image = local.llamaswap_image + args = ["-config", "/app/config.yaml", "-listen", ":8080"] + port { + container_port = 8080 + name = "http" + } + volume_mount { + name = "models" + mount_path = "/models" + } + volume_mount { + name = "config" + mount_path = "/app/config.yaml" + sub_path = "config.yaml" + } + # llama-swap returns 200 on / once running; per-model readiness + # is gated by the model's own /health endpoint (configured in + # the YAML as checkEndpoint). + readiness_probe { + http_get { + path = "/" + port = 8080 + } + initial_delay_seconds = 5 + period_seconds = 10 + failure_threshold = 6 + } + liveness_probe { + http_get { + path = "/" + port = 8080 + } + initial_delay_seconds = 30 + period_seconds = 30 + failure_threshold = 5 + } + resources { + requests = { + cpu = "200m" + memory = "2Gi" + } + limits = { + memory = "12Gi" + "nvidia.com/gpu" = "1" + } + } + } + + volume { + name = "models" + persistent_volume_claim { + claim_name = module.nfs_models.claim_name + } + } + volume { + name = "config" + config_map { + name = kubernetes_config_map.llama_swap_config.metadata[0].name + } + } + } + } + } + + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + ] + } + + depends_on = [kubernetes_job_v1.download_models] +} + +resource "kubernetes_service" "llama_swap" { + metadata { + name = "llama-swap" + namespace = kubernetes_namespace.llama_cpp.metadata[0].name + labels = local.labels + } + spec { + type = "ClusterIP" + selector = { + app = "llama-cpp" + component = "llama-swap" + } + port { + name = "http" + port = 8080 + target_port = 8080 + } + } +} diff --git a/stacks/llama-cpp/terragrunt.hcl b/stacks/llama-cpp/terragrunt.hcl new file mode 100644 index 00000000..bdc70c25 --- /dev/null +++ b/stacks/llama-cpp/terragrunt.hcl @@ -0,0 +1,23 @@ +include "root" { + path = find_in_parent_folders() +} + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} + +dependency "vault" { + config_path = "../vault" + skip_outputs = true +} + +# llama-cpp: in-cluster vision-LLM server. One Deployment of +# `mostlygeek/llama-swap:cuda` fronts three models (qwen3vl-8b, +# minicpm-v-4-5, qwen3vl-4b) at a single OpenAI-compat /v1 endpoint +# on Service `llama-swap`. llama-swap loads/unloads per-model +# llama-server subprocesses on demand (idle TTL 10 min). The T4 is +# allocated wholly to this pod; immich-ml must be scaled to 0 during +# benchmark runs. See infra/docs/architecture/llama-cpp.md for the +# full rationale (build ≥ b6907 for Qwen3-VL, T4 FP16/INT4 only, +# llama-swap over Ollama, etc.).