feat(example): add OpenAI-compatible embeddings endpoint (abetlen#2281)

abetlen · web-flow · commit fe927bd000f0 · 2026-06-07T15:09:29.000-07:00
* feat(example): add embeddings endpoint

* docs: update changelog for embeddings endpoint

* test: remove server embeddings example test

* feat(example): auto-detect embedding model mode
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+- feat(example): add OpenAI-compatible embeddings endpoint by @abetlen in #2281
+
 ## [0.3.27]
 
 - feat: update llama.cpp to ggml-org/llama.cpp@465b1f0e7
diff --git a/examples/server/README.md b/examples/server/README.md
@@ -1,7 +1,7 @@
 # Server Example
 
 This example is an updated OpenAI-compatible web server that depends only on the low-level C bindings.
-It supports batched inference, prompt caching, response parsing, `/v1/responses`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs.
+It supports batched inference, prompt caching, response parsing, `/v1/responses`, `/v1/embeddings`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs.
 
 ## Setup
 
@@ -46,6 +46,7 @@ The smallest checked-in example uses Qwen3.5 0.8B so the server can be started o
 
 | Config | Model | Notes |
 | --- | --- | --- |
+| [`configs/bge-small-en-v1.5.json`](configs/bge-small-en-v1.5.json) | [`CompendiumLabs/bge-small-en-v1.5-gguf`](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf) | Small embedding model config for `/v1/embeddings`. |
 | [`configs/qwen3.5-0.8b.json`](configs/qwen3.5-0.8b.json) | [`lmstudio-community/Qwen3.5-0.8B-GGUF`](https://huggingface.co/lmstudio-community/Qwen3.5-0.8B-GGUF) | Default small multimodal example. |
 | [`configs/gemma-4-12b-it-qat.json`](configs/gemma-4-12b-it-qat.json) | [`unsloth/gemma-4-12B-it-qat-GGUF`](https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF) | Larger Gemma 4 QAT multimodal config with projector. |
 | [`configs/qwen3.6-27b.json`](configs/qwen3.6-27b.json) | [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Larger Qwen3.6 multimodal config. |
@@ -86,11 +87,33 @@ response = client.responses.create(
 print(response.output_text)
 ```
 
+### Embeddings
+
+Start the server with an embedding config before calling `/v1/embeddings`.
+
+```bash
+cd examples/server
+uv run --script server.py -C configs/bge-small-en-v1.5.json
+```
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
+
+response = client.embeddings.create(
+    model="bge-small-en-v1.5",
+    input=["The food was delicious.", "The meal was excellent."],
+)
+print(len(response.data[0].embedding))
+```
+
 ## API Surface
 
 | Endpoint | Purpose | Reference |
 | --- | --- | --- |
 | `POST /v1/completions` | Legacy text completions with streaming, stop sequences, logprobs, penalties, seeds, and grammar-backed JSON output. | [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) |
+| `POST /v1/embeddings` | OpenAI-compatible embeddings for embedding-mode GGUF models, including string inputs, token inputs, base64 output, and dimensions truncation. | [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) |
 | `POST /v1/chat/completions` | Chat completions with streaming, tools, forced tool choice, reasoning parsing, multimodal content parts, and structured response parsing. | [OpenAI Chat API](https://platform.openai.com/docs/api-reference/chat) |
 | `POST /v1/responses` | Stateless Responses API compatibility for clients that use response items and response events. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) |
 | `WS /v1/responses` | Stateful websocket Responses transport with per-connection `previous_response_id` replay. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) |
@@ -190,6 +213,8 @@ Most model runtime fields map to `llama_model_params` or `llama_context_params`
 | `threads` | Decode thread count. |
 | `threads_batch` | Prefill and batch thread count. |
 | `kv_unified` | Selects unified or per-sequence memory layout. |
+| `embedding` | Overrides embedding mode; omit to auto-detect pooled embedding GGUFs from model metadata. |
+| `pooling_type` | Overrides pooled embedding behavior for embedding models, such as `1` for mean pooling. |
 | `store_logits` | Keeps logits after decode when needed by sampling or diagnostics. |
 | `use_mmap` | Memory maps model weights. |
 | `use_mlock` | Attempts to lock model pages into RAM. |
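
The `/v1/embeddings` row added in this README diff mentions base64 output and dimensions truncation. The following is a rough client-side sketch of both, not the server's implementation. It assumes the base64 string packs little-endian float32 values, as in the OpenAI convention, and that truncation keeps the leading components and re-normalizes them; this diff confirms neither. `decode_base64_embedding` and `truncate_embedding` are hypothetical helper names.

```python
import base64
import math
import struct


def decode_base64_embedding(payload: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def truncate_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and L2-normalize the result."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    # Assumption: a zero vector is returned unchanged rather than raising.
    return [x / norm for x in head] if norm > 0 else head


# Round-trip a small vector locally; no server is needed for this check.
example = [3.0, 4.0, 12.0]
encoded = base64.b64encode(struct.pack("<3f", *example)).decode("ascii")
decoded = decode_base64_embedding(encoded)
shortened = truncate_embedding(decoded, 2)
```

Against a running server, `encoded` would instead come from the `embedding` field of a response requested with `encoding_format="base64"`, provided the server uses the layout assumed here.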
diff --git a/examples/server/configs/bge-small-en-v1.5.json b/examples/server/configs/bge-small-en-v1.5.json
@@ -0,0 +1,22 @@
+{
+  "server": {
+    "host": "0.0.0.0",
+    "port": 8000
+  },
+  "model": {
+    "alias": "bge-small-en-v1.5",
+    "from_pretrained": {
+      "repo_id": "CompendiumLabs/bge-small-en-v1.5-gguf",
+      "filename": "bge-small-en-v1.5-q4_k_m.gguf"
+    },
+    "n_ctx": 512,
+    "n_seq_max": 16,
+    "n_batch": 512,
+    "n_ubatch": 512,
+    "threads": 4,
+    "threads_batch": 8,
+    "kv_unified": true,
+    "store_logits": false,
+    "use_mmap": true
+  }
+}
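
The README table in this change describes `embedding` and `pooling_type` as optional overrides, and the checked-in config above omits both so that the mode is auto-detected. The sketch below shows one way to derive a config with explicit overrides. It assumes both fields sit under `model` next to the other runtime fields and that `embedding` takes a boolean; the diff does not show either detail. `with_embedding_overrides` is a hypothetical helper.

```python
import copy
import json

# Subset of configs/bge-small-en-v1.5.json from this change.
base_config = {
    "server": {"host": "0.0.0.0", "port": 8000},
    "model": {
        "alias": "bge-small-en-v1.5",
        "from_pretrained": {
            "repo_id": "CompendiumLabs/bge-small-en-v1.5-gguf",
            "filename": "bge-small-en-v1.5-q4_k_m.gguf",
        },
        "n_ctx": 512,
        "n_seq_max": 16,
    },
}


def with_embedding_overrides(config: dict, pooling_type: int) -> dict:
    """Return a copy of `config` with explicit embedding overrides."""
    updated = copy.deepcopy(config)  # leave the original config untouched
    # Assumption: both keys belong under "model" and `embedding` is a boolean.
    updated["model"]["embedding"] = True
    updated["model"]["pooling_type"] = pooling_type
    return updated


# The README lists 1 as the value for mean pooling.
override = with_embedding_overrides(base_config, pooling_type=1)
rendered = json.dumps(override, indent=2)
print(rendered)
```

Saved to a file, the rendered JSON could be passed to `server.py -C` in the same way as the checked-in config, if the assumed field placement holds.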
diff --git a/examples/server/server.py b/examples/server/server.py