
Embeddings endpoint for local RAG and semantic search #28

@weklund-agent

Summary

Serve a local embeddings model alongside chat models, exposed through the existing localhost:4000/v1/embeddings endpoint. This enables local RAG, semantic search, and agent memory retrieval without sending data to cloud embedding APIs.

Problem

Agent frameworks increasingly rely on embeddings for:

  • RAG (Retrieval-Augmented Generation): Semantic search over knowledge bases, documentation, and past conversations
  • Agent memory: Finding relevant past interactions beyond keyword matching ("find that AWS project from February" should work even when the query uses different words)
  • Semantic deduplication: Detecting when an agent is generating near-identical outputs (useful for loop detection)

Without a local embeddings endpoint, users must either:

  1. Call a cloud embedding API (OpenAI, Voyage, Cohere) — adds latency, ongoing cost, and sends potentially sensitive data off-machine. This directly conflicts with the privacy motivation for running local models.
  2. Fall back to keyword search (FTS5, BM25) — misses semantic matches, which degrades agent memory and RAG quality significantly.
  3. Run a separate embedding service (Ollama, sentence-transformers) — another process to manage, another port to configure, outside mlx-stack's watchdog and health monitoring.

Use Case

A user running Hermes Agent 24/7 on their Mac Mini has the agent processing client projects, writing code, and doing research. Over weeks, the agent accumulates a knowledge base of past work. When the agent encounters a new task, it should be able to semantically search past work for relevant context.

Today with FTS5 keyword search: Query "optimize API response time" won't find past work tagged as "reduce endpoint latency" — different words, same meaning.

With local embeddings: Both phrases map to nearby vectors. The agent finds relevant context automatically.
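"Nearby vectors" here means high cosine similarity. A stdlib-only sketch of the comparison the agent would run; the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: ~1.0 = same direction/meaning, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
query = [0.9, 0.1, 0.2]   # "optimize API response time"
match = [0.8, 0.2, 0.3]   # "reduce endpoint latency"
miss  = [0.1, 0.9, 0.1]   # an unrelated document

# The semantically related pair scores far higher than the unrelated one,
# even though the phrases share no keywords.
```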

Proposed Solution

Architecture

Embedding models are tiny — nomic-embed-text is 137M parameters (~140MB at int4, ~280MB at bf16). On a 64GB machine, this is noise. The model can be served as a fourth process alongside the three chat tiers.

┌─────────────────────────────────────────┐
│        LiteLLM Proxy (:4000)            │
│  /v1/chat/completions → chat tiers      │
│  /v1/embeddings → embedding server      │
├─────────────────────────────────────────┤
│  vllm-mlx :8000 (standard)              │
│  vllm-mlx :8001 (fast)                  │
│  vllm-mlx :8002 (longctx)               │
│  embed-server :8003 (embeddings)        │
└─────────────────────────────────────────┘

Serving Layer

vllm-mlx is designed for generative models, not encoder-only embedding models. Options for the embedding server:

Option A (recommended): Minimal FastAPI wrapper around mlx-lm

A lightweight server (~100-150 lines) that:

  • Loads an MLX embedding model at startup
  • Exposes /v1/embeddings matching the OpenAI API spec
  • Handles batched requests
  • Returns normalized vectors

This is simple, purpose-built, and has no heavy dependencies beyond mlx and fastapi.
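A sketch of the core request-to-response logic Option A would implement, using only the standard library. `embed_fn` is a placeholder for the MLX model's encode step, and the function names are illustrative, not existing mlx-stack code; a FastAPI route would simply return this dict as JSON.

```python
import math
from typing import Callable

def normalize(vec: list[float]) -> list[float]:
    """L2-normalize so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embeddings_response(
    inputs: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    model: str,
) -> dict:
    """Assemble an OpenAI /v1/embeddings-shaped response for one batch."""
    vectors = embed_fn(inputs)  # single forward pass over the whole batch
    return {
        "object": "list",
        "model": model,
        "data": [
            {"object": "embedding", "index": i, "embedding": normalize(v)}
            for i, v in enumerate(vectors)
        ],
        "usage": {"prompt_tokens": 0, "total_tokens": 0},  # token accounting omitted
    }
```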

Option B: Wait for Ollama backend support

Once Ollama is supported as a backend (planned for v0.2), embeddings come for free — ollama pull nomic-embed-text and configure LiteLLM to route /v1/embeddings to Ollama's endpoint. Zero custom code needed.

Recommendation: If Ollama backend support ships first, use Option B. If embeddings are needed before Ollama support, use Option A as a bridge.

Catalog Extension

Add an embeddings section to the catalog schema:

nomic-embed-text:
  name: "Nomic Embed Text v1.5"
  type: embedding
  source: "mlx-community/nomic-embed-text-v1.5-4bit"
  params_m: 137
  dimensions: 768
  max_sequence_length: 8192
  mteb_score: 62.28
  memory_mb: 280
  quantizations:
    - id: int4
      source: "mlx-community/nomic-embed-text-v1.5-4bit"
      memory_mb: 140
    - id: bf16
      source: "mlx-community/nomic-embed-text-v1.5-bf16"
      memory_mb: 280
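To make the schema concrete, here is a hypothetical lookup over such an entry (parsed to a dict), resolving a quantization id to its source and memory estimate. Field names mirror the schema above; the function itself is illustrative, not mlx-stack code.

```python
def resolve_quant(entry: dict, quant_id: str) -> dict:
    """Find the quantization record with the given id in a catalog entry."""
    for q in entry.get("quantizations", []):
        if q["id"] == quant_id:
            return q
    raise KeyError(f"quant {quant_id!r} not in catalog entry {entry['name']!r}")

# The nomic-embed-text entry from the schema sketch, as parsed YAML:
nomic = {
    "name": "Nomic Embed Text v1.5",
    "quantizations": [
        {"id": "int4", "source": "mlx-community/nomic-embed-text-v1.5-4bit", "memory_mb": 140},
        {"id": "bf16", "source": "mlx-community/nomic-embed-text-v1.5-bf16", "memory_mb": 280},
    ],
}
```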

Stack Definition

Add an optional embeddings tier:

# stack.yaml
standard:
  model: qwen3-32b
  quant: int4
  port: 8000
fast:
  model: qwen3-8b
  quant: int4
  port: 8001
embeddings:              # new, optional
  model: nomic-embed-text
  quant: int4
  port: 8003
  backend: mlx-embed     # or "ollama" once supported

LiteLLM Config

model_list:
  # ... existing chat models ...
  - model_name: "local-embed"
    litellm_params:
      model: "openai/nomic-embed-text"
      api_base: "http://localhost:8003/v1"
      api_key: "mlx-stack"

Recommended Embedding Models

| Model                 | Params | Memory        | MTEB Score | Notes                   |
|-----------------------|--------|---------------|------------|-------------------------|
| nomic-embed-text v1.5 | 137M   | ~140MB (int4) | 62.28      | Best size/quality ratio |
| bge-small-en-v1.5     | 33M    | ~70MB         | 51.68      | Ultra-lightweight       |
| all-MiniLM-L6-v2      | 22M    | ~50MB         | 49.54      | Classic, widely used    |
| mxbai-embed-large     | 335M   | ~670MB        | 64.68      | Best quality, still tiny |

Process Management

The embedding server integrates into the existing process management:

  • Managed by the same watchdog
  • Health checks via /v1/models or /health
  • Included in mlx-stack status output
  • Logs rotated alongside chat model logs
  • Started/stopped with mlx-stack up / mlx-stack down
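The watchdog probe for the embedding tier could look like this sketch: try /health first, fall back to /v1/models, as listed above. The fetcher is injected so the policy can be exercised without a live server; real code would do an HTTP GET with a timeout. Names are hypothetical.

```python
from typing import Callable

def is_healthy(base_url: str, fetch: Callable[[str], int]) -> bool:
    """Healthy if either /health or /v1/models answers with HTTP 200."""
    for path in ("/health", "/v1/models"):
        try:
            if fetch(base_url + path) == 200:
                return True
        except OSError:  # connection refused, timeout, DNS failure, ...
            continue
    return False
```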

CLI Changes

# Include embeddings in init
mlx-stack init --with-embeddings

# Or add to existing stack
mlx-stack init --add-embeddings nomic-embed-text

# Verify
curl http://localhost:4000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "test query", "model": "local-embed"}'
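The same verification can be done from Python with only the standard library. The model name and port come from the LiteLLM config above; the Authorization header assumes the proxy enforces a key and can be dropped if it does not. This is a sketch, not mlx-stack code.

```python
import json
import urllib.request

def vectors_in_order(body: dict) -> list[list[float]]:
    """Return embeddings sorted by the response's "index" field (request order)."""
    return [item["embedding"] for item in sorted(body["data"], key=lambda d: d["index"])]

def get_embeddings(texts: list[str], base_url: str = "http://localhost:4000") -> list[list[float]]:
    """POST a batch to /v1/embeddings (OpenAI wire format) and return the vectors."""
    req = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps({"input": texts, "model": "local-embed"}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer mlx-stack",  # assumed proxy key; adjust to your setup
        },
    )
    with urllib.request.urlopen(req) as resp:
        return vectors_in_order(json.load(resp))

if __name__ == "__main__":
    vecs = get_embeddings(["optimize API response time", "reduce endpoint latency"])
    print(len(vecs), len(vecs[0]))
```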

Cost Justification

Cloud embedding costs are low ($0.00002/1K tokens for OpenAI text-embedding-3-small; even 1,000+ calls per day from a 24/7 agent amounts to pennies), so the case for local embeddings rests on other factors:

  1. Privacy: Client data, code, and documents stay on-machine. This is the primary motivation.
  2. Latency: Local embeddings return in <10ms vs 50-200ms for cloud API calls. Matters for real-time RAG during agent reasoning.
  3. Availability: No dependency on external API uptime for a critical agent capability.
  4. Simplicity: One fewer external service to configure, authenticate, and monitor.

Sequencing

Recommended: v0.3, after Ollama backend support (v0.2)

If Ollama ships as a backend in v0.2, then embeddings become nearly free to add:

  • Ollama handles the serving
  • LiteLLM handles the routing
  • mlx-stack just needs the catalog entries and stack definition additions

Building a custom MLX embedding server before Ollama support means maintaining a component that becomes redundant.

Acceptance Criteria

  • At least one embedding model in the catalog
  • Embeddings tier configurable in stack definition
  • mlx-stack init --with-embeddings includes embedding model
  • Embedding server managed by process manager and watchdog
  • LiteLLM routes /v1/embeddings to the embedding server
  • Health checks include the embedding server
  • mlx-stack status shows embedding server state
  • Documentation with example RAG integration
