
Agent-aware prefix cache sharing for KV cache reuse across requests #30

@weklund-agent

Description


Summary

Implement a persistent, shared KV cache pool that eliminates redundant computation across agent requests. Agent workloads have 100:1 input-to-output token ratios — the same system prompt, tool definitions, and conversation prefix are recomputed from scratch on every request. Prefix caching can reduce time-to-first-token by 85% and make agent workloads 3-7x faster on the same hardware.

Problem

Agent frameworks make hundreds of LLM calls per task. The structure of these calls is highly repetitive:

[System prompt: 2000 tokens] [Tool definitions: 1500 tokens] [Conversation history: 500 tokens] [New user turn: 50 tokens]

On every single call, the model recomputes the KV cache for the entire prefix (4,000+ tokens) just to respond to 50 new tokens, so roughly 98% of the prefill compute is redundant.

The problem compounds with multiple agents. A CrewAI crew with 5 agents might share the same base system prompt and tool definitions. Each agent's requests redundantly compute the shared prefix.

Quantified Impact

Research from KVFlow (July 2025) measured agent workloads specifically:

  • Input-to-output token ratio: 100:1 (vs 1:1 for interactive chat)
  • Prefill dominates total latency for agent workloads
  • vLLM's prefix caching reduces TTFT from 4.3s to 0.6s — an 85% reduction
  • For multi-agent workloads with shared prefixes, the savings multiply

On a Mac Mini M4 Pro running a 32B Q4 model:

  • Without prefix caching: ~2-4s TTFT per agent request (full prefill of 4K+ tokens)
  • With prefix caching: ~0.3-0.6s TTFT (only compute new tokens since last cache hit)
  • Over 500 agent requests/day: saving 15-30 minutes of wall-clock computation time

Why Nobody Has Built This for MLX

Proposed Solution

Architecture

┌─────────────────────────────────────────────┐
│              Prefix Cache Manager            │
│                                              │
│  ┌─────────────────────────────────────────┐ │
│  │  Hash Table: prefix_hash → KV cache ref │ │
│  │                                         │ │
│  │  "sys_prompt_v1"  → [KV block 0-127]   │ │
│  │  "sys+tools_v1"   → [KV block 0-255]   │ │
│  │  "sys+tools+conv" → [KV block 0-300]   │ │
│  └─────────────────────────────────────────┘ │
│                                              │
│  ┌────────────┐  ┌──────────┐  ┌─────────┐  │
│  │ LRU Evictor│  │ Metrics  │  │ Warmup  │  │
│  └────────────┘  └──────────┘  └─────────┘  │
└─────────────────────────────────────────────┘
         ↕ zero-copy (unified memory)
┌─────────────────────────────────────────────┐
│           vllm-mlx Model Server             │
│     (modified to accept pre-computed KV)    │
└─────────────────────────────────────────────┘

How It Works

Step 1: Prefix hashing

When a request arrives, tokenize the input and compute rolling hashes at block boundaries (e.g., every 128 tokens):

tokens[0:128]   → hash_0
tokens[0:256]   → hash_1
tokens[0:384]   → hash_2
tokens[0:512]   → hash_3  (cache miss — new content starts here)

Find the longest prefix that has a cache hit. Start KV computation from that point.
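A minimal sketch of this lookup, assuming cumulative hashing at block boundaries. The names `block_hashes` and `longest_cached_prefix` are illustrative, not existing vllm-mlx APIs:

```python
import hashlib

BLOCK_SIZE = 128  # tokens per cache block, matching the proposed config


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each block-aligned prefix: tokens[0:128], tokens[0:256], ..."""
    hashes = []
    h = hashlib.sha256()
    # Only full blocks are hashed; a partial trailing block is never cached.
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        # Feed only the new block; the running digest still covers the whole
        # prefix, so hash i depends on every token in tokens[0:(i+1)*block_size].
        h.update(b"".join(t.to_bytes(4, "little") for t in tokens[i:i + block_size]))
        hashes.append(h.copy().hexdigest())
    return hashes


def longest_cached_prefix(tokens: list[int], cache: dict) -> int:
    """Return how many leading tokens already have cached KV state."""
    hit = 0
    for idx, hh in enumerate(block_hashes(tokens)):
        if hh not in cache:
            break
        hit = (idx + 1) * BLOCK_SIZE
    return hit
```

Because the digest is cumulative, two requests that diverge mid-prefix produce different hashes from the first divergent block onward, so a hit can never span past the divergence point.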

Step 2: Cache storage

Store computed KV cache blocks in a pool backed by unified memory. Apple Silicon's unified memory is the key advantage here — the KV cache sits in one memory space accessible to both CPU and GPU with zero-copy overhead. On discrete GPU systems, this would require explicit memory transfers; on Apple Silicon, it's free.
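A sketch of the pool's bookkeeping, assuming the cache holds references to per-layer KV tensors. `KVBlockPool` is a hypothetical name, and NumPy arrays stand in for MLX arrays here so the sketch runs anywhere; with MLX the arrays already live in unified memory, so storing a block is just keeping a reference:

```python
import numpy as np


class KVBlockPool:
    """Maps prefix hashes to lists of per-layer KV tensors (references only)."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.blocks: dict[str, list[np.ndarray]] = {}

    def put(self, prefix_hash: str, kv: list[np.ndarray]) -> bool:
        size = sum(a.nbytes for a in kv)
        if self.used_bytes + size > self.budget_bytes:
            return False  # over budget: caller should evict first (Step 3)
        self.blocks[prefix_hash] = kv  # stores references, never copies
        self.used_bytes += size
        return True

    def get(self, prefix_hash: str):
        return self.blocks.get(prefix_hash)
```

The zero-copy property is visible in the sketch: `get` returns the very same tensor objects that were `put`, so a cache hit hands the inference loop existing memory rather than a duplicate.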

Step 3: Cache eviction

LRU eviction with frequency boosting:

  • Most-recently-used blocks stay in cache
  • Frequently-accessed blocks (like shared system prompts) get boosted priority
  • Configurable memory budget (e.g., 20% of unified memory reserved for KV cache)
  • When memory pressure is detected (macOS memory_pressure API), aggressively evict cold blocks

Step 4: Cache warming

On startup or after a model reload, pre-compute KV caches for known-hot prefixes:

  • System prompts registered by the user or detected from recent request patterns
  • Tool definition blocks
  • Common conversation prefixes

Warm prefixes can be registered in configuration:

prefix_cache:
  warm_prefixes:
    - name: "hermes-system"
      tokens_file: ~/.mlx-stack/cache/hermes-system-prompt.txt
    - name: "crewai-tools"
      tokens_file: ~/.mlx-stack/cache/crewai-tool-definitions.txt

Cross-Agent Prefix Sharing

Multiple agents using the same model often share prefix components:

Agent A: [shared system prompt] [shared tools] [agent A persona] [conversation A]
Agent B: [shared system prompt] [shared tools] [agent B persona] [conversation B]

The cache manager identifies the shared prefix ([shared system prompt] [shared tools]) and stores it once. Both agents' requests reuse the same KV blocks for the shared portion and only compute their unique suffixes.

Apple Silicon Advantage

This feature exploits Apple Silicon's unified memory architecture in ways that discrete-GPU systems cannot match:

  1. Zero-copy KV cache access: The KV cache lives in unified memory, accessible to both the cache manager (CPU code) and the model inference (GPU/ANE). No PCIe transfers needed.
  2. Large cache capacity: On a 64GB M4 Pro, you could dedicate 12-16GB to KV cache while still running a 32B Q4 model (18GB) with plenty of headroom. That's enough to cache dozens of distinct system prompt variants.
  3. Reduced fragmentation: A single unified-memory pool avoids much of the device-memory fragmentation that motivated paged KV-cache designs such as CUDA PagedAttention.

Integration with vllm-mlx

This requires changes to how vllm-mlx handles the KV cache. Two approaches:

Option A: Middleware (non-invasive)

The prefix cache manager sits between LiteLLM and vllm-mlx. It:

  1. Intercepts the request
  2. Checks for prefix cache hits
  3. If hit: modifies the request to only include tokens after the cached prefix, and passes the pre-computed KV state to vllm-mlx
  4. If miss: passes the full request through, then caches the computed KV blocks

This requires vllm-mlx to support accepting pre-computed KV state, which may need upstream changes.
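The four steps above can be sketched as one request handler. Note that `backend.generate` and its `precomputed_kv` parameter do not exist in vllm-mlx today; they are exactly the upstream API this issue proposes, and the cache methods are the illustrative ones from the earlier steps:

```python
def handle_request(tokens, cache, backend):
    """Option A middleware: reuse cached KV state when a prefix hit exists."""
    hit_len = cache.longest_prefix(tokens)  # Step 1: block-hash lookup
    if hit_len:
        kv = cache.kv_for(tokens[:hit_len])  # Step 2: fetch cached KV blocks
        # Hit path: only the suffix after the cached prefix is prefilled.
        out, new_kv = backend.generate(tokens[hit_len:], precomputed_kv=kv)
    else:
        # Miss path: full prefill, then the result seeds the cache.
        out, new_kv = backend.generate(tokens, precomputed_kv=None)
    cache.store(tokens, new_kv)  # future requests can hit this prefix
    return out
```

On a hit, the backend never sees the cached prefix tokens at all, which is where the TTFT savings come from.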

Option B: Fork/extend vllm-mlx

Directly modify vllm-mlx's inference loop to check the prefix cache before computing KV. This is more performant but creates a maintenance burden.

Recommendation: Start with Option A. Contribute the KV state injection API upstream to vllm-mlx. Fall back to Option B only if the upstream maintainers are unresponsive.

Metrics and Observability

mlx-stack cache status

Output:

Prefix Cache Status
  Memory allocated: 8.2 GB / 12.0 GB budget
  Entries: 47
  Hit rate (last hour): 87.3%
  Avg TTFT reduction: 76%

Top cached prefixes:
  hermes-system-v1    hits: 1,234   size: 512 MB   age: 2h
  crewai-tools        hits: 856     size: 384 MB   age: 4h
  research-agent-ctx  hits: 234     size: 128 MB   age: 30m

Response headers:

X-MLX-Stack-Cache-Hit: true
X-MLX-Stack-Cache-Prefix-Tokens: 3840
X-MLX-Stack-Cache-New-Tokens: 128
X-MLX-Stack-TTFT-Saved-Ms: 2400

Configuration

prefix_cache:
  enabled: true
  memory_budget_pct: 20          # % of unified memory for KV cache
  block_size: 128                # tokens per cache block
  eviction_policy: lru_frequency # LRU with frequency boosting
  max_entries: 1000
  warm_on_startup: true
  warm_prefixes: []              # optional list of known-hot prefixes
  metrics:
    enabled: true
    log_interval: 300            # seconds between metric summaries in logs

Research References

Complexity and Sequencing

Estimated effort: 4-8 weeks of focused work

Recommended sequencing: v0.3

This is the performance differentiator that makes mlx-stack impossible to ignore for agent workloads. It should be built after the v0.2 foundations (add-model, reliability layer) are stable, since it depends on a well-functioning multi-tier serving layer.

Risk: Requires either upstream vllm-mlx changes or a custom fork. The feasibility depends on vllm-mlx's willingness to support KV state injection.

Acceptance Criteria

  • Prefix hashing correctly identifies shared prefixes across requests
  • KV cache blocks stored in unified memory with zero-copy access
  • LRU eviction with frequency boosting keeps hot prefixes cached
  • Cross-agent prefix sharing works (multiple sessions benefit from same cached prefix)
  • Configurable memory budget for KV cache
  • Cache warming on startup for registered prefixes
  • mlx-stack cache status shows hit rate, memory usage, and top entries
  • Response headers expose cache hit/miss and TTFT savings
  • Memory pressure triggers aggressive eviction
  • Measurable TTFT improvement of >50% on repeated agent requests
  • Benchmark suite comparing cached vs uncached agent workloads
  • Documentation explaining the caching model and tuning guide
