candle-vllm exposes an OpenAI-compatible `POST /v1/embeddings` endpoint for text-capable models that implement embedding forward passes in this repo.
CUDA example:

```bash
cargo run --release --features cuda -- --m Qwen/Qwen3-0.6B --p 8000
```

Metal example:

```bash
cargo run --release --features metal -- --m google/gemma-3-4b-it --p 8000
```

Float embeddings with mean pooling:
```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":"hello world","model":"default","embedding_type":"mean","encoding_format":"float"}'
```

Base64 embeddings with last-token pooling:
```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["hello","hola"],"embedding_type":"last","encoding_format":"base64"}'
```

Request fields:

- `input`: a string or an array of strings.
- `model`: optional; `default` resolves to the loaded model.
- `embedding_type`: `mean` or `last`.
- `encoding_format`: `float` or `base64` (see the decoding sketch below).
- Embedding requests use the same tokenizer and context limits as chat requests.
- `embedding_type=mean` averages the token hidden states; `embedding_type=last` returns the final token's hidden state (see the toy sketch below).
- Responses follow the OpenAI schema: `data[].embedding` and `usage.prompt_tokens`.
- Unsupported architectures return an error instead of silently falling back.
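For intuition, here is a toy sketch of the two pooling modes over a `(seq_len, hidden_size)` hidden-state tensor using `candle_core`. It is illustrative only, not the repo's implementation, and ignores batching and padding masks:

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Toy hidden states for a 4-token sequence with hidden size 8.
    let seq_len = 4usize;
    let hidden = Tensor::randn(0f32, 1f32, (seq_len, 8), &Device::Cpu)?;

    // embedding_type = "mean": average the hidden states over the sequence dimension.
    let mean_pooled = hidden.mean(0)?;

    // embedding_type = "last": take the final token's hidden state.
    let last_pooled = hidden.get(seq_len - 1)?;

    println!("mean: {:?}, last: {:?}", mean_pooled.shape(), last_pooled.shape());
    Ok(())
}
```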
Embeddings can also be generated in-process through the library API:

```rust
use candle_vllm::api::{EngineBuilder, ModelRepo};
use candle_vllm::openai::requests::{EmbeddingInput, EmbeddingRequest};

#[tokio::main]
async fn main() -> candle_core::Result<()> {
    // Build the async engine for the given model ID.
    let engine = EngineBuilder::new(ModelRepo::ModelID(("Qwen/Qwen3-0.6B", None)))
        .build_async()
        .await?;

    // Default::default() picks the crate's default encoding format and pooling type.
    let request = EmbeddingRequest {
        model: Some("default".to_string()),
        input: EmbeddingInput::String("hello world".to_string()),
        encoding_format: Default::default(),
        embedding_type: Default::default(),
    };

    let response = engine.embed(request)?;
    println!("embeddings: {}", response.data.len());

    engine.shutdown();
    Ok(())
}
```
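If you prefer to go through the HTTP endpoint instead of embedding the engine in-process, the sketch below sends the same request as the curl examples above. It is illustrative, not part of the repo, and assumes the `reqwest` crate (with its `json` feature) and `serde_json` as dependencies, plus a server listening on `localhost:8000`:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same request body as the curl examples above.
    let body = json!({
        "input": ["hello", "hola"],
        "model": "default",
        "embedding_type": "mean",
        "encoding_format": "float"
    });

    let response: serde_json::Value = reqwest::Client::new()
        .post("http://localhost:8000/v1/embeddings")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // data[].embedding holds the vectors; usage.prompt_tokens reports the token count.
    let n = response["data"].as_array().map(|a| a.len()).unwrap_or(0);
    println!("received {} embeddings", n);
    Ok(())
}
```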