
Embeddings endpoint for local RAG and semantic search #28

@weklund-agent

Summary

Serve a local embeddings model alongside chat models, exposed through the existing localhost:4000/v1/embeddings endpoint. This enables local RAG, semantic search, and agent memory retrieval without sending data to cloud embedding APIs.

Problem

Agent frameworks increasingly rely on embeddings for:

  • RAG (Retrieval-Augmented Generation): Semantic search over knowledge bases, documentation, and past conversations
  • Agent memory: Finding relevant past interactions beyond keyword matching ("find that AWS project from February" should work even when the query uses different words)
  • Semantic deduplication: Detecting when an agent is generating near-identical outputs (useful for loop detection)

Without a local embeddings endpoint, users must either:

  1. Call a cloud embedding API (OpenAI, Voyage, Cohere) — adds latency, ongoing cost, and sends potentially sensitive data off-machine. This directly conflicts with the privacy motivation for running local models.
  2. Fall back to keyword search (FTS5, BM25) — misses semantic matches, which degrades agent memory and RAG quality significantly.
  3. Run a separate embedding service (Ollama, sentence-transformers) — another process to manage, another port to configure, outside mlx-stack's watchdog and health monitoring.

Use Case

A user running Hermes Agent 24/7 on their Mac Mini has the agent processing client projects, writing code, and doing research. Over weeks, the agent accumulates a knowledge base of past work. When the agent encounters a new task, it should be able to semantically search past work for relevant context.

Today with FTS5 keyword search: Query "optimize API response time" won't find past work tagged as "reduce endpoint latency" — different words, same meaning.

With local embeddings: Both phrases map to nearby vectors. The agent finds relevant context automatically.
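"Nearby vectors" here means high cosine similarity. A stdlib-only sketch of the comparison the agent would run; the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: ~1.0 = same direction/meaning, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
query = [0.9, 0.1, 0.2]   # "optimize API response time"
match = [0.8, 0.2, 0.3]   # "reduce endpoint latency"
miss  = [0.1, 0.9, 0.1]   # an unrelated document

# The semantically related pair scores far higher than the unrelated one,
# even though the phrases share no keywords.
```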

Proposed Solution

Architecture

Embedding models are tiny — nomic-embed-text is 137M parameters (~140MB at int4, ~280MB at bf16). On a 64GB machine, this is noise. The model can be served as a fourth process alongside the three chat tiers.

┌─────────────────────────────────────────┐
│        LiteLLM Proxy (:4000)            │
│  /v1/chat/completions → chat tiers      │
│  /v1/embeddings → embedding server      │
├─────────────────────────────────────────┤
│  vllm-mlx :8000 (standard)              │
│  vllm-mlx :8001 (fast)                  │
│  vllm-mlx :8002 (longctx)               │
│  embed-server :8003 (embeddings)        │
└─────────────────────────────────────────┘

Serving Layer

vllm-mlx is designed for generative models, not encoder-only embedding models. Options for the embedding server:

Option A (recommended): Minimal FastAPI wrapper around mlx-lm

A lightweight server (~100-150 lines) that:

  • Loads an MLX embedding model at startup
  • Exposes /v1/embeddings matching the OpenAI API spec
  • Handles batched requests
  • Returns normalized vectors

This is simple, purpose-built, and has no heavy dependencies beyond mlx and fastapi.
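A sketch of the core request-to-response logic Option A would implement, using only the standard library. `embed_fn` is a placeholder for the MLX model's encode step, and the function names are illustrative, not existing mlx-stack code; a FastAPI route would simply return this dict as JSON.

```python
import math
from typing import Callable

def normalize(vec: list[float]) -> list[float]:
    """L2-normalize so that dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embeddings_response(
    inputs: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    model: str,
) -> dict:
    """Assemble an OpenAI /v1/embeddings-shaped response for one batch."""
    vectors = embed_fn(inputs)  # single forward pass over the whole batch
    return {
        "object": "list",
        "model": model,
        "data": [
            {"object": "embedding", "index": i, "embedding": normalize(v)}
            for i, v in enumerate(vectors)
        ],
        "usage": {"prompt_tokens": 0, "total_tokens": 0},  # token accounting omitted
    }
```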

Option B: Wait for Ollama backend support

Once Ollama is supported as a backend (planned for v0.2), embeddings come for free — ollama pull nomic-embed-text and configure LiteLLM to route /v1/embeddings to Ollama's endpoint. Zero custom code needed.

Recommendation: If Ollama backend support ships first, use Option B. If embeddings are needed before Ollama support, use Option A as a bridge.

Catalog Extension

Add an embeddings section to the catalog schema:

nomic-embed-text:
  name: "Nomic Embed Text v1.5"
  type: embedding
  source: "mlx-community/nomic-embed-text-v1.5-4bit"
  params_m: 137
  dimensions: 768
  max_sequence_length: 8192
  mteb_score: 62.28
  memory_mb: 280
  quantizations:
    - id: int4
      source: "mlx-community/nomic-embed-text-v1.5-4bit"
      memory_mb: 140
    - id: bf16
      source: "mlx-community/nomic-embed-text-v1.5-bf16"
      memory_mb: 280
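To make the schema concrete, here is a hypothetical lookup over such an entry (parsed to a dict), resolving a quantization id to its source and memory estimate. Field names mirror the schema above; the function itself is illustrative, not mlx-stack code.

```python
def resolve_quant(entry: dict, quant_id: str) -> dict:
    """Find the quantization record with the given id in a catalog entry."""
    for q in entry.get("quantizations", []):
        if q["id"] == quant_id:
            return q
    raise KeyError(f"quant {quant_id!r} not in catalog entry {entry['name']!r}")

# The nomic-embed-text entry from the schema sketch, as parsed YAML:
nomic = {
    "name": "Nomic Embed Text v1.5",
    "quantizations": [
        {"id": "int4", "source": "mlx-community/nomic-embed-text-v1.5-4bit", "memory_mb": 140},
        {"id": "bf16", "source": "mlx-community/nomic-embed-text-v1.5-bf16", "memory_mb": 280},
    ],
}
```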

Stack Definition

Add an optional embeddings tier:

# stack.yaml
standard:
  model: qwen3-32b
  quant: int4
  port: 8000
fast:
  model: qwen3-8b
  quant: int4
  port: 8001
embeddings:              # new, optional
  model: nomic-embed-text
  quant: int4
  port: 8003
  backend: mlx-embed     # or "ollama" once supported

LiteLLM Config

model_list:
  # ... existing chat models ...
  - model_name: "local-embed"
    litellm_params:
      model: "openai/nomic-embed-text"
      api_base: "http://localhost:8003/v1"
      api_key: "mlx-stack"

Recommended Embedding Models

| Model                 | Params | Memory        | MTEB Score | Notes                   |
|-----------------------|--------|---------------|------------|-------------------------|
| nomic-embed-text v1.5 | 137M   | ~140MB (int4) | 62.28      | Best size/quality ratio |
| bge-small-en-v1.5     | 33M    | ~70MB         | 51.68      | Ultra-lightweight       |
| all-MiniLM-L6-v2      | 22M    | ~50MB         | 49.54      | Classic, widely used    |
| mxbai-embed-large     | 335M   | ~670MB        | 64.68      | Best quality, still tiny |

Process Management

The embedding server integrates into the existing process management:

  • Managed by the same watchdog
  • Health checks via /v1/models or /health
  • Included in mlx-stack status output
  • Logs rotated alongside chat model logs
  • Started/stopped with mlx-stack up / mlx-stack down
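The watchdog probe for the embedding tier could look like this sketch: try /health first, fall back to /v1/models, as listed above. The fetcher is injected so the policy can be exercised without a live server; real code would do an HTTP GET with a timeout. Names are hypothetical.

```python
from typing import Callable

def is_healthy(base_url: str, fetch: Callable[[str], int]) -> bool:
    """Healthy if either /health or /v1/models answers with HTTP 200."""
    for path in ("/health", "/v1/models"):
        try:
            if fetch(base_url + path) == 200:
                return True
        except OSError:  # connection refused, timeout, DNS failure, ...
            continue
    return False
```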

CLI Changes

# Include embeddings in init
mlx-stack init --with-embeddings

# Or add to existing stack
mlx-stack init --add-embeddings nomic-embed-text

# Verify
curl http://localhost:4000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "test query", "model": "local-embed"}'
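The same verification can be done from Python with only the standard library. The model name and port come from the LiteLLM config above; the Authorization header assumes the proxy enforces a key and can be dropped if it does not. This is a sketch, not mlx-stack code.

```python
import json
import urllib.request

def vectors_in_order(body: dict) -> list[list[float]]:
    """Return embeddings sorted by the response's "index" field (request order)."""
    return [item["embedding"] for item in sorted(body["data"], key=lambda d: d["index"])]

def get_embeddings(texts: list[str], base_url: str = "http://localhost:4000") -> list[list[float]]:
    """POST a batch to /v1/embeddings (OpenAI wire format) and return the vectors."""
    req = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps({"input": texts, "model": "local-embed"}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer mlx-stack",  # assumed proxy key; adjust to your setup
        },
    )
    with urllib.request.urlopen(req) as resp:
        return vectors_in_order(json.load(resp))

if __name__ == "__main__":
    vecs = get_embeddings(["optimize API response time", "reduce endpoint latency"])
    print(len(vecs), len(vecs[0]))
```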

Cost Justification

Cloud embedding costs are low ($0.00002/1K tokens for OpenAI text-embedding-3-small; even 1,000+ calls per day from a 24/7 agent amounts to pennies), so the case for local embeddings rests on other factors:

  1. Privacy: Client data, code, and documents stay on-machine. This is the primary motivation.
  2. Latency: Local embeddings return in <10ms vs 50-200ms for cloud API calls. Matters for real-time RAG during agent reasoning.
  3. Availability: No dependency on external API uptime for a critical agent capability.
  4. Simplicity: One fewer external service to configure, authenticate, and monitor.

Sequencing

Recommended: v0.3, after Ollama backend support (v0.2)

If Ollama ships as a backend in v0.2, then embeddings become nearly free to add:

  • Ollama handles the serving
  • LiteLLM handles the routing
  • mlx-stack just needs the catalog entries and stack definition additions

Building a custom MLX embedding server before Ollama support means maintaining a component that becomes redundant.

Acceptance Criteria

  • At least one embedding model in the catalog
  • Embeddings tier configurable in stack definition
  • mlx-stack init --with-embeddings includes embedding model
  • Embedding server managed by process manager and watchdog
  • LiteLLM routes /v1/embeddings to the embedding server
  • Health checks include the embedding server
  • mlx-stack status shows embedding server state
  • Documentation with example RAG integration
