|
1 | 1 | # Server Example |
2 | 2 |
|
3 | 3 | This example is an updated OpenAI-compatible web server that depends only on the low-level C bindings. |
4 | | -It supports batched inference, prompt caching, response parsing, `/v1/responses`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs. |
| 4 | +It supports batched inference, prompt caching, response parsing, `/v1/responses`, `/v1/embeddings`, disk sequence caching, MTP, LoRA, and multimodal image/audio inputs. |
5 | 5 |
|
6 | 6 | ## Setup |
7 | 7 |
|
@@ -46,6 +46,7 @@ The smallest checked-in example uses Qwen3.5 0.8B so the server can be started o |
46 | 46 |
|
47 | 47 | | Config | Model | Notes | |
48 | 48 | | --- | --- | --- | |
| 49 | +| [`configs/bge-small-en-v1.5.json`](configs/bge-small-en-v1.5.json) | [`CompendiumLabs/bge-small-en-v1.5-gguf`](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf) | Small embedding model config for `/v1/embeddings`. | |
49 | 50 | | [`configs/qwen3.5-0.8b.json`](configs/qwen3.5-0.8b.json) | [`lmstudio-community/Qwen3.5-0.8B-GGUF`](https://huggingface.co/lmstudio-community/Qwen3.5-0.8B-GGUF) | Default small multimodal example. | |
50 | 51 | | [`configs/gemma-4-12b-it-qat.json`](configs/gemma-4-12b-it-qat.json) | [`unsloth/gemma-4-12B-it-qat-GGUF`](https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF) | Larger Gemma 4 QAT multimodal config with projector. | |
51 | 52 | | [`configs/qwen3.6-27b.json`](configs/qwen3.6-27b.json) | [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | Larger Qwen3.6 multimodal config. | |
@@ -86,11 +87,33 @@ response = client.responses.create( |
86 | 87 | print(response.output_text) |
87 | 88 | ``` |
88 | 89 |
|
| 90 | +### Embeddings |
| 91 | + |
| 92 | +Start the server with an embedding config before calling `/v1/embeddings`. |
| 93 | + |
| 94 | +```bash |
| 95 | +cd examples/server |
| 96 | +uv run --script server.py -C configs/bge-small-en-v1.5.json |
| 97 | +``` |
| 98 | + |
| 99 | +```python |
| 100 | +from openai import OpenAI |
| 101 | + |
| 102 | +client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-used") |
| 103 | + |
| 104 | +response = client.embeddings.create( |
| 105 | + model="bge-small-en-v1.5", |
| 106 | + input=["The food was delicious.", "The meal was excellent."], |
| 107 | +) |
| 108 | +print(len(response.data[0].embedding)) |
| 109 | +``` |
| 110 | + |
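The two sentences in the request above are near-paraphrases, so a natural follow-up is to compare their vectors. The sketch below is a plain-Python cosine similarity; the helper name `cosine_similarity` is hypothetical and is not part of the server or the OpenAI client.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# With the `response` object from the embeddings request above:
# score = cosine_similarity(response.data[0].embedding, response.data[1].embedding)
```

A score close to 1.0 indicates that the two inputs point in nearly the same direction in embedding space.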
89 | 111 | ## API Surface |
90 | 112 |
|
91 | 113 | | Endpoint | Purpose | Reference | |
92 | 114 | | --- | --- | --- | |
93 | 115 | | `POST /v1/completions` | Legacy text completions with streaming, stop sequences, logprobs, penalties, seeds, and grammar-backed JSON output. | [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) | |
| 116 | +| `POST /v1/embeddings` | OpenAI-compatible embeddings for embedding-mode GGUF models, including string inputs, token inputs, base64 output, and dimensions truncation. | [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) | |
94 | 117 | | `POST /v1/chat/completions` | Chat completions with streaming, tools, forced tool choice, reasoning parsing, multimodal content parts, and structured response parsing. | [OpenAI Chat API](https://platform.openai.com/docs/api-reference/chat) | |
95 | 118 | | `POST /v1/responses` | Stateless Responses API compatibility for clients that use response items and response events. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) | |
96 | 119 | | `WS /v1/responses` | Stateful websocket Responses transport with per-connection `previous_response_id` replay. | [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) | |
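The `/v1/embeddings` row mentions base64 output. As a minimal sketch, assuming the server follows the OpenAI convention of encoding each vector as packed little-endian float32 values, a client that calls the endpoint without the OpenAI SDK could decode the payload as shown here. The helper name `decode_base64_embedding` is hypothetical.

```python
import base64
import struct


def decode_base64_embedding(payload):
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" selects little-endian byte order, "f" is a 4-byte float.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with values that float32 represents exactly.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
decoded = decode_base64_embedding(encoded)
```

The official `openai` Python client typically performs this decoding itself, so the helper is only relevant for raw HTTP clients.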
@@ -190,6 +213,8 @@ Most model runtime fields map to `llama_model_params` or `llama_context_params` |
190 | 213 | | `threads` | Decode thread count. | |
191 | 214 | | `threads_batch` | Prefill and batch thread count. | |
192 | 215 | | `kv_unified` | Selects unified or per-sequence memory layout. | |
| 216 | +| `embedding` | Overrides embedding mode; omit to auto-detect pooled embedding GGUFs from model metadata. | |
| 217 | +| `pooling_type` | Overrides pooled embedding behavior for embedding models, such as `1` for mean pooling. | |
193 | 218 | | `store_logits` | Keeps logits after decode when needed by sampling or diagnostics. | |
194 | 219 | | `use_mmap` | Memory maps model weights. | |
195 | 220 | | `use_mlock` | Attempts to lock model pages into RAM. | |
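To illustrate what the mean pooling selected by `pooling_type` `1` computes, the sketch below averages per-token vectors into a single sequence vector. It is a conceptual illustration only: in the server the pooling is performed inside the model runtime, and the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one sequence vector."""
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    width = len(token_embeddings[0])
    if any(len(row) != width for row in token_embeddings):
        raise ValueError("all token embeddings must have the same width")
    count = len(token_embeddings)
    # Average each dimension independently across all tokens.
    return [sum(row[i] for row in token_embeddings) / count for i in range(width)]


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Here two token vectors of width 2 collapse into one vector of width 2, which is the shape a pooled embedding model returns per input.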