Summary
Serve a local embeddings model alongside chat models, exposed through the existing localhost:4000/v1/embeddings endpoint. This enables local RAG, semantic search, and agent memory retrieval without sending data to cloud embedding APIs.
Problem
Agent frameworks increasingly rely on embeddings for:
- RAG (Retrieval-Augmented Generation): Semantic search over knowledge bases, documentation, and past conversations
- Agent memory: Finding relevant past interactions beyond keyword matching ("find that AWS project from February" should work even when the query uses different words)
- Semantic deduplication: Detecting when an agent is generating near-identical outputs (useful for loop detection)
Without a local embeddings endpoint, users must either:
- Call a cloud embedding API (OpenAI, Voyage, Cohere) — adds latency, ongoing cost, and sends potentially sensitive data off-machine. This directly conflicts with the privacy motivation for running local models.
- Fall back to keyword search (FTS5, BM25) — misses semantic matches, which degrades agent memory and RAG quality significantly.
- Run a separate embedding service (Ollama, sentence-transformers) — another process to manage, another port to configure, outside mlx-stack's watchdog and health monitoring.
Use Case
A user running Hermes Agent 24/7 on their Mac Mini has the agent processing client projects, writing code, and doing research. Over weeks, the agent accumulates a knowledge base of past work. When the agent encounters a new task, it should be able to semantically search past work for relevant context.
Today with FTS5 keyword search: Query "optimize API response time" won't find past work tagged as "reduce endpoint latency" — different words, same meaning.
With local embeddings: Both phrases map to nearby vectors. The agent finds relevant context automatically.
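Concretely, "nearby vectors" means high cosine similarity. A toy sketch with made-up 3-dimensional vectors standing in for real embedding outputs (a real model produces hundreds of dimensions, but the comparison works the same way):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: a real embedding model would place both latency-related
# phrases close together and the unrelated phrase far away.
optimize_api = [0.9, 0.1, 0.0]    # "optimize API response time"
reduce_latency = [0.8, 0.2, 0.1]  # "reduce endpoint latency"
bake_a_cake = [0.0, 0.1, 0.9]     # unrelated query

assert cosine_similarity(optimize_api, reduce_latency) > \
       cosine_similarity(optimize_api, bake_a_cake)
```

Keyword search sees zero overlapping terms between the first two phrases; the vectors still land close together.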
Proposed Solution
Architecture
Embedding models are tiny: nomic-embed-text is 137M parameters (~140MB at int4, ~280MB at bf16). On a 64GB machine, this is noise. The model can be served as a fourth process alongside the three chat tiers.
┌─────────────────────────────────────────────┐
│  LiteLLM Proxy (:4000)                      │
│    /v1/chat/completions  → chat tiers       │
│    /v1/embeddings        → embedding server │
├─────────────────────────────────────────────┤
│  vllm-mlx      :8000  (standard)            │
│  vllm-mlx      :8001  (fast)                │
│  vllm-mlx      :8002  (longctx)             │
│  embed-server  :8003  (embeddings)          │
└─────────────────────────────────────────────┘
Serving Layer
vllm-mlx is designed for generative models, not encoder-only embedding models. Options for the embedding server:
Option A (recommended): Minimal FastAPI wrapper around mlx-lm
A lightweight server (~100-150 lines) that:
- Loads an MLX embedding model at startup
- Exposes /v1/embeddings matching the OpenAI API spec
- Handles batched requests
- Returns normalized vectors
This is simple, purpose-built, and has no heavy dependencies beyond mlx and fastapi.
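The core of such a wrapper is small enough to sketch. The snippet below builds an OpenAI-spec /v1/embeddings response body with the MLX forward pass stubbed out as `embed_fn`; a FastAPI POST handler would simply call this. The word-count token usage is a placeholder, not a real tokenizer:

```python
import math

def l2_normalize(vec):
    # Normalized vectors let clients use a plain dot product as cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def embeddings_response(inputs, embed_fn, model_name):
    """Build an OpenAI-spec /v1/embeddings response body.

    embed_fn is a stand-in for the actual MLX model call; inputs may be a
    single string or a batch, matching the OpenAI API's `input` field.
    """
    if isinstance(inputs, str):
        inputs = [inputs]
    data = [
        {"object": "embedding", "index": i, "embedding": l2_normalize(embed_fn(text))}
        for i, text in enumerate(inputs)
    ]
    tokens = sum(len(t.split()) for t in inputs)  # rough word-count approximation
    return {
        "object": "list",
        "data": data,
        "model": model_name,
        "usage": {"prompt_tokens": tokens, "total_tokens": tokens},
    }
```

The FastAPI layer adds only request validation and JSON serialization on top of this, which is where the ~100-150 line estimate comes from.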
Option B: Wait for Ollama backend support
Once Ollama is supported as a backend (planned for v0.2), embeddings come for free — ollama pull nomic-embed-text and configure LiteLLM to route /v1/embeddings to Ollama's endpoint. Zero custom code needed.
Recommendation: If Ollama backend support ships first, use Option B. If embeddings are needed before Ollama support, use Option A as a bridge.
Catalog Extension
Add an embeddings section to the catalog schema:
nomic-embed-text:
  name: "Nomic Embed Text v1.5"
  type: embedding
  source: "mlx-community/nomic-embed-text-v1.5-4bit"
  params_m: 137
  dimensions: 768
  max_sequence_length: 8192
  mteb_score: 62.28
  memory_mb: 280
  quantizations:
    - id: int4
      source: "mlx-community/nomic-embed-text-v1.5-4bit"
      memory_mb: 140
    - id: bf16
      source: "mlx-community/nomic-embed-text-v1.5-bf16"
      memory_mb: 280
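A sketch of how a quantization entry might be resolved from such a catalog record (`resolve_source` and the dict shape mirror the entry above; the function itself is illustrative, not existing mlx-stack code):

```python
# Mirrors the catalog entry above, as mlx-stack might load it from YAML.
CATALOG_ENTRY = {
    "name": "Nomic Embed Text v1.5",
    "type": "embedding",
    "quantizations": [
        {"id": "int4", "source": "mlx-community/nomic-embed-text-v1.5-4bit", "memory_mb": 140},
        {"id": "bf16", "source": "mlx-community/nomic-embed-text-v1.5-bf16", "memory_mb": 280},
    ],
}

def resolve_source(entry, quant_id):
    # Pick the model repo for the requested quantization, or fail loudly.
    for q in entry["quantizations"]:
        if q["id"] == quant_id:
            return q["source"]
    raise KeyError(f"quantization {quant_id!r} not in catalog for {entry['name']}")
```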
Stack Definition
Add an optional embeddings tier:
# stack.yaml
standard:
  model: qwen3-32b
  quant: int4
  port: 8000
fast:
  model: qwen3-8b
  quant: int4
  port: 8001
embeddings:           # new, optional
  model: nomic-embed-text
  quant: int4
  port: 8003
  backend: mlx-embed  # or "ollama" once supported
LiteLLM Config
model_list:
  # ... existing chat models ...
  - model_name: "local-embed"
    litellm_params:
      model: "openai/nomic-embed-text"
      api_base: "http://localhost:8003/v1"
      api_key: "mlx-stack"
Recommended Embedding Models
| Model | Params | Memory | MTEB Score | Notes |
|---|---|---|---|---|
| nomic-embed-text v1.5 | 137M | ~140MB (int4) | 62.28 | Best size/quality ratio |
| bge-small-en-v1.5 | 33M | ~70MB | 51.68 | Ultra-lightweight |
| all-MiniLM-L6-v2 | 22M | ~50MB | 49.54 | Classic, widely used |
| mxbai-embed-large | 335M | ~670MB | 64.68 | Best quality, still tiny |
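As a sanity check on the trade-off the table describes, a small helper that picks the best-scoring model fitting a memory budget (values taken from the table; the function itself is illustrative):

```python
# (name, memory_mb, mteb_score) rows from the table above
MODELS = [
    ("nomic-embed-text v1.5", 140, 62.28),
    ("bge-small-en-v1.5", 70, 51.68),
    ("all-MiniLM-L6-v2", 22, 49.54),
    ("mxbai-embed-large", 670, 64.68),
]

def best_model(budget_mb):
    # Highest MTEB score among models that fit the budget, else None.
    fits = [m for m in MODELS if m[1] <= budget_mb]
    return max(fits, key=lambda m: m[2])[0] if fits else None
```

At typical budgets the answer is nomic-embed-text; mxbai-embed-large only wins when ~670MB is acceptable.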
Process Management
The embedding server integrates into the existing process management:
- Managed by the same watchdog
- Health checks via /v1/models or /health
- Included in mlx-stack status output
- Logs rotated alongside chat model logs
- Started/stopped with mlx-stack up / mlx-stack down
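One way the watchdog could interpret a /v1/models probe as a health signal (a sketch, not existing mlx-stack code; it assumes the embedding server returns the OpenAI-spec model list shape):

```python
def is_healthy(models_response, expected_model):
    """Interpret a GET /v1/models response body as a health signal.

    Healthy means the server answered AND the expected embedding model
    is actually loaded, not just that the port is open.
    """
    ids = {m.get("id") for m in models_response.get("data", [])}
    return expected_model in ids
```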
CLI Changes
# Include embeddings in init
mlx-stack init --with-embeddings
# Or add to existing stack
mlx-stack init --add-embeddings nomic-embed-text
# Verify
curl http://localhost:4000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "test query", "model": "local-embed"}'
Cost Justification
While cloud embedding costs are low ($0.00002/1K tokens for OpenAI text-embedding-3-small), for a 24/7 agent doing 1,000+ embedding calls per day, the reasons to go local are:
- Privacy: Client data, code, and documents stay on-machine. This is the primary motivation.
- Latency: Local embeddings return in <10ms vs 50-200ms for cloud API calls. Matters for real-time RAG during agent reasoning.
- Availability: No dependency on external API uptime for a critical agent capability.
- Simplicity: One fewer external service to configure, authenticate, and monitor.
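To put numbers on the "costs are low" claim, assuming ~500 tokens per call (an assumed average; request sizes aren't specified here):

```python
calls_per_day = 1_000
tokens_per_call = 500            # assumption: average embedding input size
price_per_1k_tokens = 0.00002    # OpenAI text-embedding-3-small

daily_cost = calls_per_day * tokens_per_call / 1_000 * price_per_1k_tokens
monthly_cost = daily_cost * 30
# roughly $0.01/day and $0.30/month: cloud cost really is negligible,
# which is why privacy and latency, not dollars, drive the decision
```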
Sequencing
Recommended: v0.3, after Ollama backend support (v0.2)
If Ollama ships as a backend in v0.2, then embeddings become nearly free to add:
- Ollama handles the serving
- LiteLLM handles the routing
- mlx-stack just needs the catalog entries and stack definition additions
Building a custom MLX embedding server before Ollama support means maintaining a component that becomes redundant.
Acceptance Criteria
- mlx-stack init --with-embeddings includes the embedding model
- LiteLLM routes /v1/embeddings to the embedding server
- mlx-stack status shows embedding server state