Commit fe927bd

feat(example): add OpenAI-compatible embeddings endpoint (abetlen#2281)
* feat(example): add embeddings endpoint
* docs: update changelog for embeddings endpoint
* test: remove server embeddings example test
* feat(example): auto-detect embedding model mode
1 parent 380177b commit fe927bd

4 files changed

Lines changed: 438 additions & 8 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+- feat(example): add OpenAI-compatible embeddings endpoint by @abetlen in #2281
+
 ## [0.3.27]
 
 - feat: update llama.cpp to ggml-org/llama.cpp@465b1f0e7

examples/server/README.md

Lines changed: 26 additions & 1 deletion
@@ -1,7 +1,7 @@
 # Server Example
 
 This example is an updated OpenAI-compatible web server that depends only on the low-level C bindings.
-It supports batched inference, prompt caching, response parsing, `/v1/responses`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs.
+It supports batched inference, prompt caching, response parsing, `/v1/responses`, `/v1/embeddings`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs.
 
 ## Setup
 
@@ -46,6 +46,7 @@ The smallest checked-in example uses Qwen3.5 0.8B so the server can be started o
 
 | Config | Model | Notes |
 | --- | --- | --- |
+| [`configs/bge-small-en-v1.5.json`](configs/bge-small-en-v1.5.json) | [`CompendiumLabs/bge-small-en-v1.5-gguf`](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf) | Small embedding model config for `/v1/embeddings`. |
 | [`configs/qwen3.5-0.8b.json`](configs/qwen3.5-0.8b.json) | [`lmstudio-community/Qwen3.5-0.8B-GGUF`](https://huggingface.co/lmstudio-community/Qwen3.5-0.8B-GGUF) | Default small multimodal example. |
 | [`configs/gemma-4-12b-it-qat.json`](configs/gemma-4-12b-it-qat.json) | [`unsloth/gemma-4-12B-it-qat-GGUF`](https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF) | Larger Gemma 4 QAT multimodal config with projector. |
 | [`configs/qwen3.6-27b.json`](configs/qwen3.6-27b.json) | [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Larger Qwen3.6 multimodal config. |
@@ -86,11 +87,33 @@ response = client.responses.create(
 print(response.output_text)
 ```
 
+### Embeddings
+
+Start the server with an embedding config before calling `/v1/embeddings`.
+
+```bash
+cd examples/server
+uv run --script server.py -C configs/bge-small-en-v1.5.json
+```
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used")
+
+response = client.embeddings.create(
+    model="bge-small-en-v1.5",
+    input=["The food was delicious.", "The meal was excellent."],
+)
+print(len(response.data[0].embedding))
+```
+
 ## API Surface
 
 | Endpoint | Purpose | Reference |
 | --- | --- | --- |
 | `POST /v1/completions` | Legacy text completions with streaming, stop sequences, logprobs, penalties, seeds, and grammar-backed JSON output. | [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) |
+| `POST /v1/embeddings` | OpenAI-compatible embeddings for embedding-mode GGUF models, including string inputs, token inputs, base64 output, and dimensions truncation. | [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) |
 | `POST /v1/chat/completions` | Chat completions with streaming, tools, forced tool choice, reasoning parsing, multimodal content parts, and structured response parsing. | [OpenAI Chat API](https://platform.openai.com/docs/api-reference/chat) |
 | `POST /v1/responses` | Stateless Responses API compatibility for clients that use response items and response events. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) |
 | `WS /v1/responses` | Stateful websocket Responses transport with per-connection `previous_response_id` replay. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) |
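
Note on the `/v1/embeddings` row above: it lists base64 output, and the README example embeds two near-synonymous sentences. The sketch below shows how a client might decode such a payload and compare two vectors. It assumes the server follows the OpenAI convention of packing embeddings as little-endian float32 values before base64 encoding, which this diff does not confirm. The helper names `decode_base64_embedding` and `cosine_similarity` are hypothetical, and the vectors are stand-ins, not model output.

```python
import base64
import math
import struct


def decode_base64_embedding(payload: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values.

    Assumption: the server uses the same layout as the OpenAI API when
    encoding_format="base64" is requested.
    """
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors (not real model output); `second` is `first` scaled by 0.5,
# so the two point in the same direction.
first = [1.0, 0.0, 0.5, -0.25]
second = [0.5, 0.0, 0.25, -0.125]

# Simulate what a base64 response field would carry for `first`.
encoded = base64.b64encode(struct.pack("<4f", *first)).decode("ascii")
decoded = decode_base64_embedding(encoded)

print(decoded)
print(cosine_similarity(decoded, second))
```

Against a running server, the same helpers would take `response.data[i].embedding` from the README example in place of the stand-in vectors.
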
@@ -190,6 +213,8 @@ Most model runtime fields map to `llama_model_params` or `llama_context_params`
 | `threads` | Decode thread count. |
 | `threads_batch` | Prefill and batch thread count. |
 | `kv_unified` | Selects unified or per-sequence memory layout. |
+| `embedding` | Overrides embedding mode; omit to auto-detect pooled embedding GGUFs from model metadata. |
+| `pooling_type` | Overrides pooled embedding behavior for embedding models, such as `1` for mean pooling. |
 | `store_logits` | Keeps logits after decode when needed by sampling or diagnostics. |
 | `use_mmap` | Memory maps model weights. |
 | `use_mlock` | Attempts to lock model pages into RAM. |
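
Note on the `pooling_type` row above: the value `1` appears to correspond to `LLAMA_POOLING_TYPE_MEAN` in llama.cpp's `llama.h`. The real pooling runs inside llama.cpp, so the following is only an illustration of what mean pooling computes: one vector per input, obtained by averaging the per-token vectors. The function name `mean_pool` and the toy data are hypothetical.

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token vectors into a single sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    if any(len(row) != dim for row in token_embeddings):
        raise ValueError("all token embeddings must share one dimension")
    count = len(token_embeddings)
    # Average each coordinate across all tokens.
    return [sum(row[i] for row in token_embeddings) / count for i in range(dim)]


# Toy input: three tokens, two dimensions each (not real model output).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))
```
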
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+{
+  "server": {
+    "host": "0.0.0.0",
+    "port": 8000
+  },
+  "model": {
+    "alias": "bge-small-en-v1.5",
+    "from_pretrained": {
+      "repo_id": "CompendiumLabs/bge-small-en-v1.5-gguf",
+      "filename": "bge-small-en-v1.5-q4_k_m.gguf"
+    },
+    "n_ctx": 512,
+    "n_seq_max": 16,
+    "n_batch": 512,
+    "n_ubatch": 512,
+    "threads": 4,
+    "threads_batch": 8,
+    "kv_unified": true,
+    "store_logits": false,
+    "use_mmap": true
+  }
+}
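
Note on the config above: it sets neither `embedding` nor `pooling_type`, so it relies on the auto-detection described in the README. A variant with explicit overrides could be produced as sketched below. Two things here are assumptions not confirmed by this diff: that both fields belong in the `model` section alongside the other runtime fields, and that `embedding` accepts a JSON boolean. The helper `with_embedding_overrides` is hypothetical.

```python
import copy
import json

# Copy of the checked-in config shown in the diff above.
BASE_CONFIG = {
    "server": {"host": "0.0.0.0", "port": 8000},
    "model": {
        "alias": "bge-small-en-v1.5",
        "from_pretrained": {
            "repo_id": "CompendiumLabs/bge-small-en-v1.5-gguf",
            "filename": "bge-small-en-v1.5-q4_k_m.gguf",
        },
        "n_ctx": 512,
        "n_seq_max": 16,
        "n_batch": 512,
        "n_ubatch": 512,
        "threads": 4,
        "threads_batch": 8,
        "kv_unified": True,
        "store_logits": False,
        "use_mmap": True,
    },
}


def with_embedding_overrides(config, *, embedding=True, pooling_type=1):
    """Return a copy of `config` with explicit embedding settings.

    Assumption: both keys live under "model" and `embedding` is a boolean.
    """
    updated = copy.deepcopy(config)
    updated["model"]["embedding"] = embedding
    updated["model"]["pooling_type"] = pooling_type  # 1 = mean pooling per the README
    return updated


print(json.dumps(with_embedding_overrides(BASE_CONFIG), indent=2))
```
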
