Summary
Implement a persistent, shared KV cache pool that eliminates redundant computation across agent requests. Agent workloads have 100:1 input-to-output token ratios — the same system prompt, tool definitions, and conversation prefix are recomputed from scratch on every request. Prefix caching can reduce time-to-first-token by 85% and make agent workloads 3-7x faster on the same hardware.
Problem
Agent frameworks make hundreds of LLM calls per task. The structure of these calls is highly repetitive:
[System prompt: 2000 tokens] [Tool definitions: 1500 tokens] [Conversation history: 500 tokens] [New user turn: 50 tokens]
On every single call, the model recomputes the KV cache for the entire prefix (4000+ tokens) to generate a response to just 50 new tokens. Roughly 98% of the prefill compute is redundant.
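A quick back-of-envelope check of that figure, using the token counts from the example request shape above:

```python
# Token counts from the example request shape above.
prefix_tokens = 2000 + 1500 + 500   # system prompt + tools + conversation history
new_tokens = 50                     # new user turn

# Fraction of prefill work that re-derives already-seen tokens.
redundant = prefix_tokens / (prefix_tokens + new_tokens)
print(f"{redundant:.1%}")  # 98.8%
```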
The problem compounds with multiple agents. A CrewAI crew with 5 agents might share the same base system prompt and tool definitions. Each agent's requests redundantly compute the shared prefix.
Quantified Impact
Research from KVFlow (July 2025) measured agent workloads specifically:
- Input-to-output token ratio: 100:1 (vs 1:1 for interactive chat)
- Prefill dominates total latency for agent workloads
- vLLM's prefix caching reduces TTFT from 4.3s to 0.6s — an 85% reduction
- For multi-agent workloads with shared prefixes, the savings multiply
On a Mac Mini M4 Pro running a 32B Q4 model:
- Without prefix caching: ~2-4s TTFT per agent request (full prefill of 4K+ tokens)
- With prefix caching: ~0.3-0.6s TTFT (only compute new tokens since last cache hit)
- Over 500 agent requests/day: saves 15-30 minutes of wall-clock time per day
Why Nobody Has Built This for MLX
Proposed Solution
Architecture
┌─────────────────────────────────────────────┐
│ Prefix Cache Manager │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Hash Table: prefix_hash → KV cache ref │ │
│ │ │ │
│ │ "sys_prompt_v1" → [KV block 0-127] │ │
│ │ "sys+tools_v1" → [KV block 0-255] │ │
│ │ "sys+tools+conv" → [KV block 0-300] │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌───────────┐ ┌──────────┐ ┌───────────┐ │
│ │ LRU Evictor│ │ Metrics │ │ Warmup │ │
│ └───────────┘ └──────────┘ └───────────┘ │
└─────────────────────────────────────────────┘
↕ zero-copy (unified memory)
┌─────────────────────────────────────────────┐
│ vllm-mlx Model Server │
│ (modified to accept pre-computed KV) │
└─────────────────────────────────────────────┘
How It Works
Step 1: Prefix hashing
When a request arrives, tokenize the input and compute rolling hashes at block boundaries (e.g., every 128 tokens):
tokens[0:128] → hash_0
tokens[0:256] → hash_1
tokens[0:384] → hash_2
tokens[0:512] → hash_3 (cache miss — new content starts here)
Find the longest prefix that has a cache hit. Start KV computation from that point.
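A minimal sketch of the boundary-hash lookup, assuming a plain dict as the hash table and chained SHA-256 digests (so a block's hash matches only if the entire prefix before it also matches):

```python
import hashlib

BLOCK_SIZE = 128  # tokens per cache block, matching the block_size setting

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes at block boundaries: the i-th digest covers
    tokens[0:(i+1)*block_size], so it matches only if the whole prefix matches."""
    hashes = []
    h = hashlib.sha256()
    full_blocks = len(tokens) // block_size
    for i in range(full_blocks):
        block = tokens[i * block_size:(i + 1) * block_size]
        # Feed only the new block; the running hash already covers everything before it.
        h.update(b",".join(str(t).encode() for t in block))
        hashes.append(h.copy().hexdigest())
    return hashes

def longest_cached_prefix(tokens, cache):
    """Number of leading tokens whose KV blocks are already in `cache`."""
    hit = 0
    for i, digest in enumerate(block_hashes(tokens)):
        if digest not in cache:
            break
        hit = (i + 1) * BLOCK_SIZE
    return hit
```

KV computation then starts at the returned offset instead of token 0.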
Step 2: Cache storage
Store computed KV cache blocks in a pool backed by unified memory. Apple Silicon's unified memory is the key advantage here — the KV cache sits in one memory space accessible to both CPU and GPU with zero-copy overhead. On discrete GPU systems, this would require explicit memory transfers; on Apple Silicon, it's free.
Step 3: Cache eviction
LRU eviction with frequency boosting:
- Most-recently-used blocks stay in cache
- Frequently-accessed blocks (like shared system prompts) get boosted priority
- Configurable memory budget (e.g., 20% of unified memory reserved for KV cache)
- When memory pressure is detected (macOS memory_pressure API), aggressively evict cold blocks
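The eviction policy above can be sketched as a small class. This is illustrative only (the real manager tracks KV blocks, not string keys), and the boost threshold and decay rule are assumptions:

```python
from collections import OrderedDict

class FrequencyBoostedLRU:
    """Sketch of LRU eviction with frequency boosting: entries hit at least
    `boost_hits` times get a second chance (their hit count is decayed rather
    than being evicted), so shared system prompts outlive equally-old cold
    blocks. Sizes are in bytes."""

    def __init__(self, budget_bytes, boost_hits=8):
        self.budget = budget_bytes
        self.boost_hits = boost_hits
        self.entries = OrderedDict()  # key -> (size, hits); rightmost = most recent
        self.used = 0

    def get(self, key):
        if key not in self.entries:
            return None
        size, hits = self.entries.pop(key)
        self.entries[key] = (size, hits + 1)  # move to the MRU position
        return key

    def put(self, key, size):
        while self.used + size > self.budget and self.entries:
            self._evict_one()
        self.entries[key] = (size, 0)
        self.used += size

    def _evict_one(self):
        # Scan from the LRU end; boosted entries are decayed, not evicted.
        for key in list(self.entries):
            size, hits = self.entries[key]
            if hits >= self.boost_hits:
                self.entries[key] = (size, hits // 2)  # decay, second chance
                continue
            del self.entries[key]
            self.used -= size
            return
        # Every entry was boosted: fall back to plain LRU.
        _, (size, _) = self.entries.popitem(last=False)
        self.used -= size
```

Under memory pressure, the same `_evict_one` loop would simply be run repeatedly until usage falls below a lowered budget.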
Step 4: Cache warming
On startup or after a model reload, pre-compute KV caches for known-hot prefixes:
- System prompts registered by the user or detected from recent request patterns
- Tool definition blocks
- Common conversation prefixes
prefix_cache:
  warm_prefixes:
    - name: "hermes-system"
      tokens_file: ~/.mlx-stack/cache/hermes-system-prompt.txt
    - name: "crewai-tools"
      tokens_file: ~/.mlx-stack/cache/crewai-tool-definitions.txt
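The warm_prefixes entries could drive a startup routine that replays each prefix as a prefill-only request (max_tokens=1), so the server computes and caches its KV state while generating almost nothing. A hedged sketch that only builds the request payloads; the field names and the assumption of an OpenAI-compatible /v1/completions endpoint are illustrative:

```python
from pathlib import Path

def build_warmup_requests(warm_prefixes, model="default"):
    """Turn warm_prefixes config entries into prefill-only completion payloads.
    max_tokens=1 forces a full prefill (populating the KV cache) while the
    single generated token is discarded. Actually POSTing the payloads to the
    server is left out of this sketch."""
    payloads = []
    for entry in warm_prefixes:
        text = Path(entry["tokens_file"]).expanduser().read_text()
        payloads.append({
            "model": model,
            "prompt": text,
            "max_tokens": 1,      # prefill only
            "temperature": 0.0,
        })
    return payloads
```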
Cross-Agent Prefix Sharing
Multiple agents using the same model often share prefix components:
Agent A: [shared system prompt] [shared tools] [agent A persona] [conversation A]
Agent B: [shared system prompt] [shared tools] [agent B persona] [conversation B]
The cache manager identifies the shared prefix ([shared system prompt] [shared tools]) and stores it once. Both agents' requests reuse the same KV blocks for the shared portion and only compute their unique suffixes.
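Detecting the shareable portion reduces to finding the longest common token prefix, rounded down to a block boundary so it maps onto whole KV blocks. A minimal sketch:

```python
def shared_prefix_tokens(a, b, block_size=128):
    """Length, rounded down to a block boundary, of the token prefix two
    agents' prompts share; those KV blocks are stored once and reused."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n - n % block_size
```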
Apple Silicon Advantage
This feature exploits Apple Silicon's unified memory architecture in a way that discrete-GPU systems cannot replicate without paying host-device transfer costs:
- Zero-copy KV cache access: The KV cache lives in unified memory, accessible to both the cache manager (CPU code) and the model inference (GPU/ANE). No PCIe transfers needed.
- Large cache capacity: On a 64GB M4 Pro, you could dedicate 12-16GB to KV cache while still running a 32B Q4 model (18GB) with plenty of headroom. That's enough to cache dozens of distinct system prompt variants.
- Simpler memory management: There is no separate GPU memory pool to fragment; managing exactly that kind of fragmentation is why CUDA-based systems need PagedAttention-style paging in the first place.
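The capacity claim can be sanity-checked with rough per-token KV sizing. The model dimensions below are assumptions (a typical 32B model with grouped-query attention: 64 layers, 8 KV heads, head dim 128, fp16 cache); real values vary per model:

```python
# Per-token KV footprint: K and V, per layer, per KV head, per head dim.
layers, kv_heads, head_dim, dtype_bytes = 64, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token // 1024)   # KB per token

budget_bytes = 14 * 1024**3      # midpoint of the 12-16 GB budget above
tokens_cached = budget_bytes // bytes_per_token
print(tokens_cached)             # cached-prefix tokens, e.g. ~14 distinct
                                 # 4K prefixes or ~28 2K system prompts
```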
Integration with vllm-mlx
This requires changes to how vllm-mlx handles the KV cache. Two approaches:
Option A: Middleware (non-invasive)
The prefix cache manager sits between LiteLLM and vllm-mlx. It:
- Intercepts the request
- Checks for prefix cache hits
- If hit: modifies the request to only include tokens after the cached prefix, and passes the pre-computed KV state to vllm-mlx
- If miss: passes the full request through, then caches the computed KV blocks
This requires vllm-mlx to support accepting pre-computed KV state, which may need upstream changes.
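The hit/miss path could look like the sketch below. Note that `backend.prefill(..., kv=...)` stands in for the hypothetical KV-state-injection API this option depends on (it does not exist in vllm-mlx today), and the cache here naively maps token-tuple prefixes to one opaque KV handle:

```python
def handle_request(tokens, cache, backend, block_size=128):
    """Middleware hit/miss path. `cache` maps tuple(prefix tokens) -> KV handle;
    `backend.prefill(suffix, kv=...)` is the assumed API that resumes prefill
    from an injected KV state."""
    hit_len, kv = 0, None
    for end in range(block_size, len(tokens) + 1, block_size):
        handle = cache.get(tuple(tokens[:end]))
        if handle is None:
            break
        hit_len, kv = end, handle

    # Only the suffix after the longest cached prefix needs prefill compute.
    full_kv = backend.prefill(tokens[hit_len:], kv=kv)

    # Register every block boundary of the new sequence for future hits.
    # (A real implementation would store per-block KV slices, not one handle.)
    new_end = len(tokens) - len(tokens) % block_size
    for end in range(block_size, new_end + 1, block_size):
        cache.setdefault(tuple(tokens[:end]), full_kv)
    return full_kv, hit_len
```

On the first request the full 300-token prompt is prefilled; a follow-up request sharing its prefix only prefills the new suffix.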
Option B: Fork/extend vllm-mlx
Directly modify vllm-mlx's inference loop to check the prefix cache before computing KV. This is more performant but creates a maintenance burden.
Recommendation: Start with Option A. Contribute the KV state injection API upstream to vllm-mlx. Fall back to Option B only if the upstream maintainers are unresponsive.
Metrics and Observability
Example output of mlx-stack cache status:
Prefix Cache Status
Memory allocated: 8.2 GB / 12.0 GB budget
Entries: 47
Hit rate (last hour): 87.3%
Avg TTFT reduction: 76%
Top cached prefixes:
hermes-system-v1 hits: 1,234 size: 512 MB age: 2h
crewai-tools hits: 856 size: 384 MB age: 4h
research-agent-ctx hits: 234 size: 128 MB age: 30m
Response headers:
X-MLX-Stack-Cache-Hit: true
X-MLX-Stack-Cache-Prefix-Tokens: 3840
X-MLX-Stack-Cache-New-Tokens: 128
X-MLX-Stack-TTFT-Saved-Ms: 2400
Configuration
prefix_cache:
  enabled: true
  memory_budget_pct: 20           # % of unified memory for KV cache
  block_size: 128                 # tokens per cache block
  eviction_policy: lru_frequency  # LRU with frequency boosting
  max_entries: 1000
  warm_on_startup: true
  warm_prefixes: []               # optional list of known-hot prefixes
  metrics:
    enabled: true
    log_interval: 300             # seconds between metric summaries in logs
Research References
Complexity and Sequencing
Estimated effort: 4-8 weeks of focused work
Recommended sequencing: v0.3
This is the performance differentiator that makes mlx-stack impossible to ignore for agent workloads. It should be built after the v0.2 foundations (add-model, reliability layer) are stable, since it depends on a well-functioning multi-tier serving layer.
Risk: Requires either upstream vllm-mlx changes or a custom fork. The feasibility depends on vllm-mlx's willingness to support KV state injection.
Acceptance Criteria
- mlx-stack cache status shows hit rate, memory usage, and top entries