Summary
Implement a persistent, shared KV cache pool that eliminates redundant computation across agent requests. Agent workloads have 100:1 input-to-output token ratios — the same system prompt, tool definitions, and conversation prefix are recomputed from scratch on every request. Prefix caching can reduce time-to-first-token by 85% and make agent workloads 3-7x faster on the same hardware.
Problem
Agent frameworks make hundreds of LLM calls per task. The structure of these calls is highly repetitive:
[System prompt: 2000 tokens] [Tool definitions: 1500 tokens] [Conversation history: 500 tokens] [New user turn: 50 tokens]
On every single call, the model recomputes the KV cache for the entire prefix (4000+ tokens) to generate a response to just 50 new tokens. Roughly 98% of the prefill compute is redundant.
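A quick back-of-envelope check of that figure, using the token counts from the example request shape above:

```python
# Token counts from the example request shape above.
prefix_tokens = 2000 + 1500 + 500   # system prompt + tools + conversation history
new_tokens = 50                     # new user turn

# Fraction of prefill work that re-derives already-seen tokens.
redundant = prefix_tokens / (prefix_tokens + new_tokens)
print(f"{redundant:.1%}")  # 98.8%
```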
The problem compounds with multiple agents. A CrewAI crew with 5 agents might share the same base system prompt and tool definitions. Each agent's requests redundantly compute the shared prefix.
Quantified Impact
Research from KVFlow (July 2025) measured agent workloads specifically:
- Input-to-output token ratio: 100:1 (vs 1:1 for interactive chat)
- Prefill dominates total latency for agent workloads
- vLLM's prefix caching reduces TTFT from 4.3s to 0.6s — an 85% reduction
- For multi-agent workloads with shared prefixes, the savings multiply
On a Mac Mini M4 Pro running a 32B Q4 model:
- Without prefix caching: ~2-4s TTFT per agent request (full prefill of 4K+ tokens)
- With prefix caching: ~0.3-0.6s TTFT (only compute new tokens since last cache hit)
- Over 500 agent requests/day: saves 15-30 minutes of wall-clock time per day
Why Nobody Has Built This for MLX
Proposed Solution
Architecture
┌─────────────────────────────────────────────┐
│ Prefix Cache Manager │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Hash Table: prefix_hash → KV cache ref │ │
│ │ │ │
│ │ "sys_prompt_v1" → [KV block 0-127] │ │
│ │ "sys+tools_v1" → [KV block 0-255] │ │
│ │ "sys+tools+conv" → [KV block 0-300] │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌───────────┐ ┌──────────┐ ┌───────────┐ │
│ │ LRU Evictor│ │ Metrics │ │ Warmup │ │
│ └───────────┘ └──────────┘ └───────────┘ │
└─────────────────────────────────────────────┘
↕ zero-copy (unified memory)
┌─────────────────────────────────────────────┐
│ vllm-mlx Model Server │
│ (modified to accept pre-computed KV) │
└─────────────────────────────────────────────┘
How It Works
Step 1: Prefix hashing
When a request arrives, tokenize the input and compute rolling hashes at block boundaries (e.g., every 128 tokens):
tokens[0:128] → hash_0
tokens[0:256] → hash_1
tokens[0:384] → hash_2
tokens[0:512] → hash_3 (cache miss — new content starts here)
Find the longest prefix that has a cache hit. Start KV computation from that point.
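A minimal sketch of the boundary-hash lookup, assuming a plain dict as the hash table and chained SHA-256 digests (so a block's hash matches only if the entire prefix before it also matches):

```python
import hashlib

BLOCK_SIZE = 128  # tokens per cache block, matching the block_size setting

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes at block boundaries: the i-th digest covers
    tokens[0:(i+1)*block_size], so it matches only if the whole prefix matches."""
    hashes = []
    h = hashlib.sha256()
    full_blocks = len(tokens) // block_size
    for i in range(full_blocks):
        block = tokens[i * block_size:(i + 1) * block_size]
        # Feed only the new block; the running hash already covers everything before it.
        h.update(b",".join(str(t).encode() for t in block))
        hashes.append(h.copy().hexdigest())
    return hashes

def longest_cached_prefix(tokens, cache):
    """Number of leading tokens whose KV blocks are already in `cache`."""
    hit = 0
    for i, digest in enumerate(block_hashes(tokens)):
        if digest not in cache:
            break
        hit = (i + 1) * BLOCK_SIZE
    return hit
```

KV computation then starts at the returned offset instead of token 0.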
Step 2: Cache storage
Store computed KV cache blocks in a pool backed by unified memory. Apple Silicon's unified memory is the key advantage here — the KV cache sits in one memory space accessible to both CPU and GPU with zero-copy overhead. On discrete GPU systems, this would require explicit memory transfers; on Apple Silicon, it's free.
Step 3: Cache eviction
LRU eviction with frequency boosting:
- Most-recently-used blocks stay in cache
- Frequently-accessed blocks (like shared system prompts) get boosted priority
- Configurable memory budget (e.g., 20% of unified memory reserved for KV cache)
- When memory pressure is detected (macOS memory_pressure API), aggressively evict cold blocks
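The eviction policy above can be sketched as a small class. This is illustrative only (the real manager tracks KV blocks, not string keys), and the boost threshold and decay rule are assumptions:

```python
from collections import OrderedDict

class FrequencyBoostedLRU:
    """Sketch of LRU eviction with frequency boosting: entries hit at least
    `boost_hits` times get a second chance (their hit count is decayed rather
    than being evicted), so shared system prompts outlive equally-old cold
    blocks. Sizes are in bytes."""

    def __init__(self, budget_bytes, boost_hits=8):
        self.budget = budget_bytes
        self.boost_hits = boost_hits
        self.entries = OrderedDict()  # key -> (size, hits); rightmost = most recent
        self.used = 0

    def get(self, key):
        if key not in self.entries:
            return None
        size, hits = self.entries.pop(key)
        self.entries[key] = (size, hits + 1)  # move to the MRU position
        return key

    def put(self, key, size):
        while self.used + size > self.budget and self.entries:
            self._evict_one()
        self.entries[key] = (size, 0)
        self.used += size

    def _evict_one(self):
        # Scan from the LRU end; boosted entries are decayed, not evicted.
        for key in list(self.entries):
            size, hits = self.entries[key]
            if hits >= self.boost_hits:
                self.entries[key] = (size, hits // 2)  # decay, second chance
                continue
            del self.entries[key]
            self.used -= size
            return
        # Every entry was boosted: fall back to plain LRU.
        _, (size, _) = self.entries.popitem(last=False)
        self.used -= size
```

Under memory pressure, the same `_evict_one` loop would simply be run repeatedly until usage falls below a lowered budget.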
Step 4: Cache warming
On startup or after a model reload, pre-compute KV caches for known-hot prefixes:
- System prompts registered by the user or detected from recent request patterns
- Tool definition blocks
- Common conversation prefixes
prefix_cache:
  warm_prefixes:
    - name: "hermes-system"
      tokens_file: ~/.mlx-stack/cache/hermes-system-prompt.txt
    - name: "crewai-tools"
      tokens_file: ~/.mlx-stack/cache/crewai-tool-definitions.txt
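The warm_prefixes entries could drive a startup routine that replays each prefix as a prefill-only request (max_tokens=1), so the server computes and caches its KV state while generating almost nothing. A hedged sketch that only builds the request payloads; the field names and the assumption of an OpenAI-compatible /v1/completions endpoint are illustrative:

```python
from pathlib import Path

def build_warmup_requests(warm_prefixes, model="default"):
    """Turn warm_prefixes config entries into prefill-only completion payloads.
    max_tokens=1 forces a full prefill (populating the KV cache) while the
    single generated token is discarded. Actually POSTing the payloads to the
    server is left out of this sketch."""
    payloads = []
    for entry in warm_prefixes:
        text = Path(entry["tokens_file"]).expanduser().read_text()
        payloads.append({
            "model": model,
            "prompt": text,
            "max_tokens": 1,      # prefill only
            "temperature": 0.0,
        })
    return payloads
```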
Cross-Agent Prefix Sharing
Multiple agents using the same model often share prefix components:
Agent A: [shared system prompt] [shared tools] [agent A persona] [conversation A]
Agent B: [shared system prompt] [shared tools] [agent B persona] [conversation B]
The cache manager identifies the shared prefix ([shared system prompt] [shared tools]) and stores it once. Both agents' requests reuse the same KV blocks for the shared portion and only compute their unique suffixes.
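Detecting the shareable portion reduces to finding the longest common token prefix, rounded down to a block boundary so it maps onto whole KV blocks. A minimal sketch:

```python
def shared_prefix_tokens(a, b, block_size=128):
    """Length, rounded down to a block boundary, of the token prefix two
    agents' prompts share; those KV blocks are stored once and reused."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n - n % block_size
```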
Apple Silicon Advantage
This feature exploits Apple Silicon's unified memory architecture in a way that discrete-GPU systems cannot replicate without paying host-device transfer costs:
- Zero-copy KV cache access: The KV cache lives in unified memory, accessible to both the cache manager (CPU code) and the model inference (GPU/ANE). No PCIe transfers needed.
- Large cache capacity: On a 64GB M4 Pro, you could dedicate 12-16GB to KV cache while still running a 32B Q4 model (18GB) with plenty of headroom. That's enough to cache dozens of distinct system prompt variants.
- Simpler memory management: There is no separate GPU memory pool to fragment; managing exactly that kind of fragmentation is why CUDA-based systems need PagedAttention-style paging in the first place.
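The capacity claim can be sanity-checked with rough per-token KV sizing. The model dimensions below are assumptions (a typical 32B model with grouped-query attention: 64 layers, 8 KV heads, head dim 128, fp16 cache); real values vary per model:

```python
# Per-token KV footprint: K and V, per layer, per KV head, per head dim.
layers, kv_heads, head_dim, dtype_bytes = 64, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token // 1024)   # KB per token

budget_bytes = 14 * 1024**3      # midpoint of the 12-16 GB budget above
tokens_cached = budget_bytes // bytes_per_token
print(tokens_cached)             # cached-prefix tokens, e.g. ~14 distinct
                                 # 4K prefixes or ~28 2K system prompts
```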
Integration with vllm-mlx
This requires changes to how vllm-mlx handles the KV cache. Two approaches:
Option A: Middleware (non-invasive)
The prefix cache manager sits between LiteLLM and vllm-mlx. It:
- Intercepts the request
- Checks for prefix cache hits
- If hit: modifies the request to only include tokens after the cached prefix, and passes the pre-computed KV state to vllm-mlx
- If miss: passes the full request through, then caches the computed KV blocks
This requires vllm-mlx to support accepting pre-computed KV state, which may need upstream changes.
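The hit/miss path could look like the sketch below. Note that `backend.prefill(..., kv=...)` stands in for the hypothetical KV-state-injection API this option depends on (it does not exist in vllm-mlx today), and the cache here naively maps token-tuple prefixes to one opaque KV handle:

```python
def handle_request(tokens, cache, backend, block_size=128):
    """Middleware hit/miss path. `cache` maps tuple(prefix tokens) -> KV handle;
    `backend.prefill(suffix, kv=...)` is the assumed API that resumes prefill
    from an injected KV state."""
    hit_len, kv = 0, None
    for end in range(block_size, len(tokens) + 1, block_size):
        handle = cache.get(tuple(tokens[:end]))
        if handle is None:
            break
        hit_len, kv = end, handle

    # Only the suffix after the longest cached prefix needs prefill compute.
    full_kv = backend.prefill(tokens[hit_len:], kv=kv)

    # Register every block boundary of the new sequence for future hits.
    # (A real implementation would store per-block KV slices, not one handle.)
    new_end = len(tokens) - len(tokens) % block_size
    for end in range(block_size, new_end + 1, block_size):
        cache.setdefault(tuple(tokens[:end]), full_kv)
    return full_kv, hit_len
```

On the first request the full 300-token prompt is prefilled; a follow-up request sharing its prefix only prefills the new suffix.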
Option B: Fork/extend vllm-mlx
Directly modify vllm-mlx's inference loop to check the prefix cache before computing KV. This is more performant but creates a maintenance burden.
Recommendation: Start with Option A. Contribute the KV state injection API upstream to vllm-mlx. Fall back to Option B only if the upstream maintainers are unresponsive.
Metrics and Observability
Example output of mlx-stack cache status:
Prefix Cache Status
Memory allocated: 8.2 GB / 12.0 GB budget
Entries: 47
Hit rate (last hour): 87.3%
Avg TTFT reduction: 76%
Top cached prefixes:
hermes-system-v1 hits: 1,234 size: 512 MB age: 2h
crewai-tools hits: 856 size: 384 MB age: 4h
research-agent-ctx hits: 234 size: 128 MB age: 30m
Response headers:
X-MLX-Stack-Cache-Hit: true
X-MLX-Stack-Cache-Prefix-Tokens: 3840
X-MLX-Stack-Cache-New-Tokens: 128
X-MLX-Stack-TTFT-Saved-Ms: 2400
Configuration
prefix_cache:
  enabled: true
  memory_budget_pct: 20           # % of unified memory for KV cache
  block_size: 128                 # tokens per cache block
  eviction_policy: lru_frequency  # LRU with frequency boosting
  max_entries: 1000
  warm_on_startup: true
  warm_prefixes: []               # optional list of known-hot prefixes
  metrics:
    enabled: true
    log_interval: 300             # seconds between metric summaries in logs
Research References
Complexity and Sequencing
Estimated effort: 4-8 weeks of focused work
Recommended sequencing: v0.3
This is the performance differentiator that makes mlx-stack impossible to ignore for agent workloads. It should be built after the v0.2 foundations (add-model, reliability layer) are stable, since it depends on a well-functioning multi-tier serving layer.
Risk: Requires either upstream vllm-mlx changes or a custom fork. The feasibility depends on vllm-mlx's willingness to support KV state injection.
Acceptance Criteria
- mlx-stack cache status shows hit rate, memory usage, and top entries