
KVCache: Multi-turn Conversation Only Reads Previous Turn, Not Full Context #281

@LouisDDN

Description

Hi, the following is AI-generated, but essentially I only see the last turn being loaded during step 2 of the KV cache loading process. I’m not sure whether this is a design decision or not. Please let me know. At first glance, it doesn't look right to me.

Summary

The multi-turn conversation feature currently reads only the previous turn (turn N-1) before writing a new turn (turn N), rather than reading all previous turns (turns 1 through N-1) as would be required for realistic LLM inference simulation.

This significantly underestimates I/O volume for long conversations and doesn't match real-world LLM behavior.


Current Behavior

Code Location: kv_cache_benchmark/kv_cache/benchmark.py lines 517-525

# 2. For multi-turn conversations, access cache from previous turn.
if self.conversation_manager and request.turn_number > 1:
    prev_turn_key = f"{request.conversation_id}_turn_{request.turn_number - 1}"
    location, read_latency = self.cache.access_cache(prev_turn_key, InferencePhase.DECODE, 'multi_turn')

What happens:

  • Turn 2: Reads turn 1 only (1 read)
  • Turn 3: Reads turn 2 only (1 read) - Missing turn 1!
  • Turn 10: Reads turn 9 only (1 read) - Missing turns 1-8!

Total reads for 10-turn conversation: 9 reads (one per turn)


Expected Behavior

Real LLM inference requires reading all previous context for each turn.

Option 1: Read All Previous Turns (Separate Entries)

# Proposed fix:
if self.conversation_manager and request.turn_number > 1:
    prev_keys = self.conversation_manager.get_all_previous_turn_keys(
        request.conversation_id, request.turn_number
    )
    for prev_key in prev_keys:
        location, read_latency = self.cache.access_cache(prev_key, InferencePhase.DECODE, 'multi_turn')
        storage_latency += read_latency

What happens:

  • Turn 2: Reads turn 1 (1 read)
  • Turn 3: Reads turns 1, 2 (2 reads)
  • Turn 10: Reads turns 1-9 (9 reads)

Total reads for 10-turn conversation: 1 + 2 + 3 + ... + 9 = 45 reads
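The two read patterns reduce to a simple pair of formulas, sketched below (illustrative only; the helper names are mine, not the benchmark's):

```python
def reads_current(turns: int) -> int:
    # Current behavior: each turn after the first reads exactly one
    # prior entry (turn N reads only turn N-1).
    return turns - 1

def reads_proposed(turns: int) -> int:
    # Option 1: turn N reads all of turns 1..N-1, so the total is
    # 1 + 2 + ... + (turns - 1) = turns * (turns - 1) / 2.
    return turns * (turns - 1) // 2

print(reads_current(10), reads_proposed(10))  # 9 45
print(reads_current(50), reads_proposed(50))  # 49 1225
```

These are the same totals quoted throughout this issue: 9 vs. 45 reads for 10 turns, 49 vs. 1,225 for 50 turns.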

Option 2: Cumulative Cache Entries (Concatenation)

Alternatively, implement cumulative KV cache growth (more realistic but more complex):

# Turn 2 (sketch; generate() is a placeholder for decode output):
prev_data = cache.read("turn_1")                   # 100 tokens
new_data = generate(num_tokens=50)                 # 50 new tokens
combined = np.concatenate([prev_data, new_data])   # 150 tokens
cache.write("turn_2", combined)                    # write 150 tokens (cumulative)

# Turn 3:
prev_data = cache.read("turn_2")                   # 150 tokens (cumulative)
new_data = generate(num_tokens=75)                 # 75 new tokens
combined = np.concatenate([prev_data, new_data])   # 225 tokens
cache.write("turn_3", combined)                    # write 225 tokens (cumulative)

This matches real LLM behavior where KV cache grows with each turn.
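Under this scheme each turn's entry holds the entire history so far, so entry sizes grow as a running sum. A quick sketch (the per-turn token counts are the hypothetical ones from the example above):

```python
# Sketch of cumulative KV-cache entry growth under Option 2.
# new_tokens_per_turn is a made-up decode-output size per turn.
new_tokens_per_turn = [100, 50, 75]  # turns 1-3

entry_sizes = []
total = 0
for tokens in new_tokens_per_turn:
    total += tokens              # entry for turn N holds all tokens so far
    entry_sizes.append(total)

print(entry_sizes)  # [100, 150, 225]
```

Note the trade-off: Option 2 does one read per turn, but the read (and write) sizes grow linearly with conversation length.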


Why This Matters

1. I/O Volume Underestimation

Current implementation (10-turn conversation):

  • Reads: 9 (one per turn)
  • I/O volume: 9 × avg_cache_size

Realistic implementation (Option 1):

  • Reads: 45 (cumulative)
  • I/O volume: 45 × avg_cache_size

Underestimation factor: 5× for 10-turn conversations, worse for longer conversations.

2. Cache Locality Patterns

Current implementation:

  • Only recent turn (turn N-1) is accessed
  • Older turns (1 to N-2) may be evicted without consequence

Realistic implementation:

  • All previous turns must be accessible
  • Evicting old turns causes cache misses
  • Tests realistic cache capacity requirements

3. Storage Throughput Requirements

For a 50-turn conversation (max_turns_per_conv = 50):

Current: 49 reads total
Realistic: 1 + 2 + 3 + ... + 49 = 1,225 reads total

This 25× difference dramatically changes storage throughput requirements.


Evidence of Incomplete Implementation

Function Exists But Is Unused

File: kv_cache_benchmark/kv_cache/conversation.py lines 105-111

def get_all_previous_turn_keys(self, conversation_id: str, current_turn: int) -> List[str]:
    """Retrieves all cache keys from previous turns in a conversation."""
    with self.lock:
        if conversation_id not in self.conversations:
            return []
        state = self.conversations[conversation_id]
        return [key for key in state.cache_keys if key != f"{conversation_id}_turn_{current_turn}"]
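To illustrate what this function returns, here is a standalone sketch of the same key-filtering logic (not the project's class; the conversation id and turn count are hypothetical):

```python
# Standalone illustration of the filtering get_all_previous_turn_keys performs.
conversation_id = "conv_42"   # hypothetical id
current_turn = 3
# One cache key per turn recorded so far, current turn included:
cache_keys = [f"{conversation_id}_turn_{n}" for n in range(1, current_turn + 1)]

prev_keys = [k for k in cache_keys
             if k != f"{conversation_id}_turn_{current_turn}"]
print(prev_keys)  # ['conv_42_turn_1', 'conv_42_turn_2']
```

In other words, it already returns every previous turn's key, which is exactly what Option 1 needs.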

This function is NEVER called in the benchmark code!

$ grep -rn "get_all_previous_turn_keys" kv_cache_benchmark/ --include="*.py"

kv_cache_benchmark/tests/test_kv_cache.py:953:        prev_keys = conversation_manager.get_all_previous_turn_keys(conv_id, 2)
kv_cache_benchmark/kv_cache/conversation.py:105:    def get_all_previous_turn_keys(self, conversation_id: str, current_turn: int) -> List[str]:

Only used in tests, never in actual benchmark.

Documentation Inconsistency

File: kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md line 1831

"Context tokens = cumulative conversation history"

But the code doesn't implement cumulative context! Each turn only reads the previous turn, not the full cumulative history.


Proposed Solution

Minimal Fix (Option 1): Read All Previous Turns

File: kv_cache_benchmark/kv_cache/benchmark.py

Replace lines 517-525 with:

# 2. For multi-turn conversations, access cache from ALL previous turns.
if self.conversation_manager and request.turn_number > 1:
    prev_keys = self.conversation_manager.get_all_previous_turn_keys(
        request.conversation_id, request.turn_number
    )
    for prev_key in prev_keys:
        location, read_latency = self.cache.access_cache(prev_key, InferencePhase.DECODE, 'multi_turn')
        if location is not None:
            storage_latency += read_latency
            with self.results_lock:
                self.results['multi_turn_cache_hits'] += 1
        else:
            with self.results_lock:
                self.results['multi_turn_cache_misses'] += 1

Pros:

  • Simple change (uses existing function)
  • Realistic I/O pattern
  • Tests cache capacity requirements
  • No change to cache entry structure

Cons:

  • Increases I/O volume (but this is realistic!)
  • Longer benchmark runtime for long conversations

Complete Fix (Option 2): Cumulative Cache Entries

Implement true cumulative KV cache growth with concatenation. This is more complex but matches real LLM behavior exactly.

Pros:

  • Matches real LLM inference exactly
  • Single read per turn (of cumulative cache)
  • Tests realistic cache size growth

Cons:

  • Requires significant code changes
  • More complex implementation
  • Higher memory usage

Impact Assessment

Performance Impact

Current (10-turn conversation):

  • 9 reads
  • ~90ms total read latency (assuming ~10ms/read from NVMe)

Proposed (10-turn conversation):

  • 45 reads
  • ~450ms total read latency

This is realistic! Real LLM inference DOES require reading all previous context.

Benchmark Runtime

For --max-turns-per-conv 50 with 100 conversations:

Current: 4,900 reads (49 per conversation)
Proposed: 122,500 reads (1,225 per conversation)

Runtime increase: ~25× for multi-turn reads

Mitigation: Users can reduce --max-turns-per-conv if needed, but the I/O pattern will be realistic.


Recommendation

Implement Option 1 (Read All Previous Turns) as the minimal fix.

This:

  1. Uses existing get_all_previous_turn_keys() function
  2. Requires minimal code changes
  3. Provides realistic I/O patterns
  4. Tests actual cache capacity requirements
  5. Matches documentation claims about "cumulative conversation history"

Option 2 (cumulative cache entries) could be considered for a future version if more accurate LLM simulation is desired.


Related Issues

  • Documentation at line 1831 claims "cumulative conversation history" but code doesn't implement this
  • Function get_all_previous_turn_keys() exists but is unused (dead code)
  • Current implementation may give misleading results for storage capacity planning
