infra/llama-cpp: add stack — llama-swap fronting Qwen3-VL + MiniCPM-V

Single Deployment of mostlygeek/llama-swap:cuda hot-swaps three GGUF vision models (qwen3vl-8b, minicpm-v-4-5, qwen3vl-4b) at one OpenAI-compat /v1 endpoint on Service llama-swap.llama-cpp.svc. Idle TTL 10min so models unload between benchmark batches. Storage: NFS-RWX from /srv/nfs-ssd/llamacpp (30Gi). One-shot download Job pulls Q4_K_M GGUF + mmproj per model, creates stable model.gguf / mmproj.gguf symlinks so the llama-swap config is filename-agnostic, then warms the kernel page cache. GPU: nvidia.com/gpu=1 = whole T4 — operator must scale immich-ml to 0 during benchmark windows. wait_for_rollout=false so apply doesn't block on GPU availability. Initial use case: vision-LLM benchmark for instagram-poster candidate scoring; future consumers (HA, agentic tooling) hit the same endpoint via LiteLLM at the gateway. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 14:13:40 +00:00 · 2026-05-10 14:13:40 +00:00 · 34acd98785
commit 34acd98785
parent ce65dc2385
3 changed files with 507 additions and 0 deletions
--- a/docs/architecture/llama-cpp.md
+++ b/docs/architecture/llama-cpp.md
@ -0,0 +1,113 @@
+# llama-cpp / llama-swap
+
+## Overview
+
+In-cluster, OpenAI-compatible vision-LLM endpoint. A single
+`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models
+served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on
+demand by `llama-swap`. One Service, one `/v1` endpoint, model
+selected by the request body `model` field.
+
+Initial use case: vision-LLM benchmark on a curated Immich album,
+choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
+**Qwen3-VL-4B** for instagram-poster's candidate-scoring path.
+Future consumers (Home Assistant, agentic tooling) can hit the same
+endpoint via LiteLLM at the cluster gateway.
+
+## Why llama.cpp + llama-swap (not Ollama)
+
+Verified across 7+7 research/challenger subagents (2026-05-10):
+
+- **Broader OpenAI-compat surface** — `tool_choice`, `image_url`
+  remote URLs, native bearer auth via `--api-key`, `/reranking`,
+  Anthropic `/v1/messages` shim.
+- **Native observability** — `/metrics`, `/health` returns 503 during
+  model load (proper K8s startup-probe semantics), `/slots` per-slot
+  tracking. Ollama still has the `/metrics` issue
+  [#3144](https://github.com/ollama/ollama/issues/3144) open.
+- **Stricter structured output** — native GBNF on `/completion`,
+  JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`.
+- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports
+  Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official
+  `qwen3-vl` tag (community GGUFs broken — split-mmproj
+  [#14575](https://github.com/ollama/ollama/issues/14575)) and the
+  `openbmb/minicpm-v4.5` Ollama tag is 8 months stale.
+
+Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and
+ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in)
+— the latter is mooted by fronting llama.cpp with **LiteLLM** at the
+gateway.
+
+## Components
+
+| Component | Resource | Purpose |
+|-----------|----------|---------|
+| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses |
+| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) |
+| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080` → `llama-swap.llama-cpp.svc.cluster.local` |
+| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi |
+| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache |
+
+## Storage
+
+NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`).
+Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark
+run (<10%). The download Job warms the kernel page cache after pulling
+GGUFs so first inference reads from warm cache.
+
+If steady-state cold-load latency becomes a problem, **Path B**: carve
+~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1,
+mount on-host, expose via a static `kubernetes_persistent_volume` with
+`local` source + node1 affinity. NVMe-class load times. Out of scope
+for the initial deployment.
+
+## GPU allocation
+
+The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4
+allocation). The shared T4 is also used by Immich's ML pod
+(`immich.immich-machine-learning`); only one of the two can hold the
+GPU at a time. Operator must scale immich-ml to 0 before running a
+benchmark and restore it after:
+
+```bash
+kubectl scale -n immich deploy/immich-machine-learning --replicas=0
+# ... benchmark ...
+kubectl scale -n immich deploy/immich-machine-learning --replicas=1
+```
+
+## Models served
+
+| ID | HF repo | Quant | Ctx | mmproj |
+|----|---------|-------|-----|--------|
+| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
+| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
+| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
+
+llama.cpp build pinned via the `llama-swap:cuda` image (ships a
+recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
+[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
+mtmd Flash-Attention regression fix
+[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)).
+
+## Endpoints
+
+- `GET /v1/models` — list configured models
+- `POST /v1/chat/completions` — standard OpenAI chat (vision via
+  `image_url` content parts, base64 or remote URL)
+- `POST /completion` — llama.cpp native completion (preferred for
+  GBNF-constrained structured output to avoid 2026 regression magnet
+  on `/v1/chat/completions`)
+- `GET /metrics` — Prometheus
+- `GET /health` — 200 once a model is fully loaded; 503 during load
+
+## Known issues / decisions
+
+- **Cluster-wide GPU contention** — only one of llama-swap or
+  immich-ml can hold the T4. No GPU sharing solution wired in
+  (MPS/MIG would help but T4 has no MIG and MPS is overkill for two
+  workloads).
+- **Filename-agnostic config** — the download Job creates stable
+  `model.gguf` / `mmproj.gguf` symlinks per model dir so the
+  llama-swap config doesn't need to track exact HF filenames (which
+  change between releases).
+- **TF schema** — `llama-cpp` (PG backend on dbaas).
--- a/stacks/llama-cpp/main.tf
+++ b/stacks/llama-cpp/main.tf
@ -0,0 +1,371 @@
+locals {
+  namespace = "llama-cpp"
+  labels    = { app = "llama-cpp" }
+
+  # llama-swap fronts per-model llama.cpp instances. The :cuda image
+  # ships a recent llama-server inside, which is what gets spawned per
+  # model. One Service, one /v1 endpoint, model selected by the
+  # OpenAI `model` field. mostlygeek/llama-swap is production-grade
+  # (3.9k★, v211, May 2026).
+  llamaswap_image = "ghcr.io/mostlygeek/llama-swap:cuda"
+
+  # Three vision models for the benchmark sweep. All Apache-2.0, all GGUF
+  # Q4_K_M (T4 has no FP8/BF16 — INT4 is the right knob). Image long-edge
+  # capped at 1024 px to keep prefill <2s on the T4.
+  #
+  # Filenames are matched by glob in the download Job (huggingface_hub
+  # snapshot_download with allow_patterns). Stable symlinks model.gguf /
+  # mmproj.gguf are created after download so llama-swap config can be
+  # filename-agnostic.
+  models = {
+    qwen3vl-8b = {
+      hf_repo        = "Qwen/Qwen3-VL-8B-Instruct-GGUF"
+      gguf_pattern   = "*Q4_K_M*.gguf"
+      mmproj_pattern = "*mmproj*.gguf"
+      ctx_size       = 3072
+      gpu_layers     = 99
+    }
+    minicpm-v-4-5 = {
+      hf_repo        = "openbmb/MiniCPM-V-4_5-gguf"
+      gguf_pattern   = "*Q4_K_M*.gguf"
+      mmproj_pattern = "*mmproj*.gguf"
+      ctx_size       = 3072
+      gpu_layers     = 99
+    }
+    qwen3vl-4b = {
+      hf_repo        = "Qwen/Qwen3-VL-4B-Instruct-GGUF"
+      gguf_pattern   = "*Q4_K_M*.gguf"
+      mmproj_pattern = "*mmproj*.gguf"
+      ctx_size       = 3072
+      gpu_layers     = 99
+    }
+  }
+
+  # YAML config rendered into the ConfigMap. llama-swap reads /app/config.yaml.
+  # ${PORT} is substituted by llama-swap; ${MODEL_ID} is the model key.
+  llama_swap_config = yamlencode({
+    healthCheckTimeout = 180 # 60-90s is typical model load on NFS-SSD
+    logLevel           = "info"
+    logToStdout        = "both"
+    startPort          = 5800
+
+    macros = {
+      llama_server_base = "/app/llama-server --host 0.0.0.0 --port $${PORT} --jinja -fa -np 1"
+    }
+
+    models = {
+      for mid, cfg in local.models : mid => {
+        cmd = join(" ", [
+          "/app/llama-server",
+          "--host 0.0.0.0",
+          "--port $${PORT}",
+          "-m /models/${mid}/model.gguf",
+          "--mmproj /models/${mid}/mmproj.gguf",
+          "-ngl ${cfg.gpu_layers}",
+          "-c ${cfg.ctx_size}",
+          "-np 1",
+          "--jinja",
+          "-fa",
+        ])
+        ttl           = 600 # unload after 10 min idle
+        checkEndpoint = "/health"
+      }
+    }
+  })
+}
+
+resource "kubernetes_namespace" "llama_cpp" {
+  metadata {
+    name = local.namespace
+    labels = {
+      tier              = local.tiers.gpu
+      "istio-injection" = "disabled"
+    }
+  }
+  lifecycle {
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+# Shared model store. NFS-RWX so the download Job can write while
+# the llama-swap Deployment mounts it. Path /srv/nfs-ssd/llamacpp on
+# the Proxmox host (SSD-backed for fast model load — Q4_K_M 8B mmaps in
+# ~2s vs ~10s on HDD NFS). Page-cache is warmed by the download Job so
+# first inference reads from warm cache.
+module "nfs_models" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "llama-cpp-models"
+  namespace  = kubernetes_namespace.llama_cpp.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs-ssd/llamacpp"
+  storage    = "30Gi"
+}
+
+# One-shot download Job. Pulls Q4_K_M GGUF + mmproj for every model in
+# locals.models into /models/<id>/, creates stable model.gguf /
+# mmproj.gguf symlinks, then warms the page cache. Idempotent —
+# huggingface_hub's snapshot_download skips files that already exist
+# with matching size; symlinks are recreated each run.
+resource "kubernetes_job_v1" "download_models" {
+  metadata {
+    name      = "download-models"
+    namespace = kubernetes_namespace.llama_cpp.metadata[0].name
+    labels    = local.labels
+  }
+  spec {
+    backoff_limit              = 2
+    ttl_seconds_after_finished = 86400
+    template {
+      metadata { labels = local.labels }
+      spec {
+        restart_policy = "OnFailure"
+
+        container {
+          name  = "download"
+          image = "python:3.12-slim"
+          command = ["/bin/bash", "-c", <<-EOT
+            set -euo pipefail
+            pip install --quiet --no-cache-dir 'huggingface_hub>=0.24'
+            python - <<'PY'
+            import json, os, glob
+            from huggingface_hub import snapshot_download
+            models = json.loads(os.environ["MODELS_JSON"])
+            for mid, cfg in models.items():
+                local_dir = f"/models/{mid}"
+                os.makedirs(local_dir, exist_ok=True)
+                print(f"==> downloading {mid} from {cfg['hf_repo']} -> {local_dir}", flush=True)
+                snapshot_download(
+                    repo_id=cfg["hf_repo"],
+                    local_dir=local_dir,
+                    allow_patterns=[cfg["gguf_pattern"], cfg["mmproj_pattern"]],
+                    token=os.environ.get("HF_TOKEN") or None,
+                    # Single-threaded download — multi-worker buffers
+                    # multi-GB chunks per worker and OOMs the Job at 2Gi.
+                    max_workers=1,
+                )
+                # Resolve actual filenames and create stable symlinks so
+                # llama-swap config is filename-agnostic.
+                ggufs = [p for p in glob.glob(f"{local_dir}/*Q4_K_M*.gguf") if "mmproj" not in p.lower()]
+                mmprojs = glob.glob(f"{local_dir}/*mmproj*.gguf")
+                if not ggufs:
+                    raise SystemExit(f"no GGUF found in {local_dir}")
+                if not mmprojs:
+                    raise SystemExit(f"no mmproj found in {local_dir}")
+                gguf_link = f"{local_dir}/model.gguf"
+                mmproj_link = f"{local_dir}/mmproj.gguf"
+                for link, target in ((gguf_link, ggufs[0]), (mmproj_link, mmprojs[0])):
+                    if os.path.islink(link) or os.path.exists(link):
+                        os.unlink(link)
+                    os.symlink(os.path.basename(target), link)
+                print(f"==> done {mid}", flush=True)
+                for f in sorted(os.listdir(local_dir)):
+                    full = os.path.join(local_dir, f)
+                    if os.path.isfile(full) and not os.path.islink(full):
+                        print(f"   {f} ({os.path.getsize(full):,} bytes)", flush=True)
+            print("==> warming page cache", flush=True)
+            PY
+            # Warm the kernel page cache so first inference reads warm.
+            # Wrapped in bash (not the Python heredoc) to keep the cat
+            # output out of stdout buffering.
+            find /models -type f -name '*.gguf' ! -name 'model.gguf' ! -name 'mmproj.gguf' \
+              -exec sh -c 'cat "$1" > /dev/null' _ {} \;
+            echo "ALL DONE"
+          EOT
+          ]
+          env {
+            name  = "MODELS_JSON"
+            value = jsonencode(local.models)
+          }
+          env {
+            name  = "HF_HUB_ENABLE_HF_TRANSFER"
+            value = "0"
+          }
+          # Optional: HF token from Vault (rate-limit avoidance). Sourced
+          # from the existing `viktor` Vault path which holds personal
+          # creds. Empty string is acceptable (anonymous downloads).
+          env {
+            name = "HF_TOKEN"
+            value_from {
+              secret_key_ref {
+                name     = "hf-token"
+                key      = "token"
+                optional = true
+              }
+            }
+          }
+          volume_mount {
+            name       = "models"
+            mount_path = "/models"
+          }
+          resources {
+            requests = { cpu = "100m", memory = "256Mi" }
+            # 4Gi covers the worst-case huggingface_hub buffer (single
+            # 5GB GGUF chunked over HTTP) plus interpreter overhead.
+            # 2Gi was hit by the previous run.
+            limits = { memory = "4Gi" }
+          }
+        }
+
+        volume {
+          name = "models"
+          persistent_volume_claim {
+            claim_name = module.nfs_models.claim_name
+          }
+        }
+      }
+    }
+  }
+  wait_for_completion = false
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations,
+    ]
+  }
+}
+
+resource "kubernetes_config_map" "llama_swap_config" {
+  metadata {
+    name      = "llama-swap-config"
+    namespace = kubernetes_namespace.llama_cpp.metadata[0].name
+    labels    = local.labels
+  }
+  data = {
+    "config.yaml" = local.llama_swap_config
+  }
+}
+
+# Single Deployment running llama-swap. Spawns per-model llama-server
+# subprocesses on demand and unloads them after `ttl` seconds idle.
+# The whole T4 is allocated to this pod via nvidia.com/gpu=1; immich-ml
+# must be scaled to 0 during benchmark runs.
+resource "kubernetes_deployment" "llama_swap" {
+  metadata {
+    name      = "llama-swap"
+    namespace = kubernetes_namespace.llama_cpp.metadata[0].name
+    labels    = merge(local.labels, { tier = local.tiers.gpu })
+  }
+  # Don't block apply on rollout — the GPU is shared with immich-ml and
+  # the pod stays Pending until the operator scales immich-ml=0 for a
+  # benchmark window. Apply is "create the desired state, don't wait
+  # for it to be reachable".
+  wait_for_rollout = false
+  spec {
+    replicas = 1
+    strategy { type = "Recreate" }
+
+    selector {
+      match_labels = { app = "llama-cpp", component = "llama-swap" }
+    }
+
+    template {
+      metadata {
+        labels = { app = "llama-cpp", component = "llama-swap" }
+        annotations = {
+          # Bounce the pod whenever the configmap content changes.
+          "checksum/config" = sha256(local.llama_swap_config)
+        }
+      }
+      spec {
+        node_selector = { gpu = "true" }
+        toleration {
+          key      = "nvidia.com/gpu"
+          operator = "Equal"
+          value    = "true"
+          effect   = "NoSchedule"
+        }
+
+        container {
+          name  = "llama-swap"
+          image = local.llamaswap_image
+          args  = ["-config", "/app/config.yaml", "-listen", ":8080"]
+          port {
+            container_port = 8080
+            name           = "http"
+          }
+          volume_mount {
+            name       = "models"
+            mount_path = "/models"
+          }
+          volume_mount {
+            name       = "config"
+            mount_path = "/app/config.yaml"
+            sub_path   = "config.yaml"
+          }
+          # llama-swap returns 200 on / once running; per-model readiness
+          # is gated by the model's own /health endpoint (configured in
+          # the YAML as checkEndpoint).
+          readiness_probe {
+            http_get {
+              path = "/"
+              port = 8080
+            }
+            initial_delay_seconds = 5
+            period_seconds        = 10
+            failure_threshold     = 6
+          }
+          liveness_probe {
+            http_get {
+              path = "/"
+              port = 8080
+            }
+            initial_delay_seconds = 30
+            period_seconds        = 30
+            failure_threshold     = 5
+          }
+          resources {
+            requests = {
+              cpu    = "200m"
+              memory = "2Gi"
+            }
+            limits = {
+              memory           = "12Gi"
+              "nvidia.com/gpu" = "1"
+            }
+          }
+        }
+
+        volume {
+          name = "models"
+          persistent_volume_claim {
+            claim_name = module.nfs_models.claim_name
+          }
+        }
+        volume {
+          name = "config"
+          config_map {
+            name = kubernetes_config_map.llama_swap_config.metadata[0].name
+          }
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+    ]
+  }
+
+  depends_on = [kubernetes_job_v1.download_models]
+}
+
+resource "kubernetes_service" "llama_swap" {
+  metadata {
+    name      = "llama-swap"
+    namespace = kubernetes_namespace.llama_cpp.metadata[0].name
+    labels    = local.labels
+  }
+  spec {
+    type = "ClusterIP"
+    selector = {
+      app       = "llama-cpp"
+      component = "llama-swap"
+    }
+    port {
+      name        = "http"
+      port        = 8080
+      target_port = 8080
+    }
+  }
+}
--- a/stacks/llama-cpp/terragrunt.hcl
+++ b/stacks/llama-cpp/terragrunt.hcl
@ -0,0 +1,23 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+
+dependency "vault" {
+  config_path  = "../vault"
+  skip_outputs = true
+}
+
+# llama-cpp: in-cluster vision-LLM server. One Deployment of
+# `mostlygeek/llama-swap:cuda` fronts three models (qwen3vl-8b,
+# minicpm-v-4-5, qwen3vl-4b) at a single OpenAI-compat /v1 endpoint
+# on Service `llama-swap`. llama-swap loads/unloads per-model
+# llama-server subprocesses on demand (idle TTL 10 min). The T4 is
+# allocated wholly to this pod; immich-ml must be scaled to 0 during
+# benchmark runs. See infra/docs/architecture/llama-cpp.md for the
+# full rationale (build ≥ b6907 for Qwen3-VL, T4 FP16/INT4 only,
+# llama-swap over Ollama, etc.).