[GH-909] Improve async summary performance. #949
Conversation
Pull request overview
This PR improves async summary performance in short-term memory by extracting summarization logic into a dedicated consolidator class that runs asynchronously, preventing blocking during episode addition.
Changes:
- Introduced ShortTermMemoryConsolidator class to handle summarization in the background
- Modified ShortTermMemory to use read-write locks and delegate summarization to the consolidator
- Extracted common episode formatting logic into episodes_to_string utility function
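The core pattern is worth sketching. The snippet below is not the PR's actual code; PendingConsolidator and the summarize callable are hypothetical stand-ins showing how adds can stay non-blocking while a single background task drains a pending list:

```python
import asyncio

class PendingConsolidator:
    """Minimal sketch: accept episodes without blocking and
    summarize them in a single background task."""

    def __init__(self, summarize):
        self._summarize = summarize  # async callable: list[str] -> str
        self._pending: list[str] = []
        self._task: asyncio.Task | None = None
        self.summary: str = ""

    def add(self, episode: str) -> None:
        # Adding never blocks; it only queues the episode and
        # starts a worker if one is not already running.
        self._pending.append(episode)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._worker())

    async def _worker(self) -> None:
        while self._pending:
            # Take everything queued so far; episodes added while we
            # summarize are picked up on the next loop iteration.
            batch, self._pending = self._pending, []
            self.summary = await self._summarize([self.summary, *batch])

    async def wait(self) -> None:
        # Queries can await this so they always see a settled summary.
        if self._task is not None:
            await self._task
```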
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/memmachine/episodic_memory/short_term_memory/short_term_memory.py | Core refactoring: added consolidator class, converted to async properties, and implemented non-blocking summarization with read-write locks |
| src/memmachine/common/episode_store/episode_model.py | Added episodes_to_string utility function to centralize episode formatting logic |
| src/memmachine/episodic_memory/episodic_memory.py | Updated to use the new episodes_to_string utility instead of duplicated formatting code |
| src/memmachine/common/errors.py | Added ShortTermMemoryClosedError exception for better error handling |
| tests/memmachine/episodic_memory/short_term_memory/test_short_term_memory.py | Added comprehensive tests for new async summarization behavior and updated existing tests to match new summary format |
| tests/memmachine/common/episode_store/test_episode_model.py | Added tests for the new episodes_to_string utility function |
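For illustration, a helper like episodes_to_string typically just joins formatted episode lines into one block. Here is a minimal sketch, with the Episode field names assumed for the example:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # Field names here are assumptions for illustration only.
    producer_id: str
    content: str

def episodes_to_string(episodes: list[Episode]) -> str:
    """Render episodes as one block of 'speaker: content' lines."""
    return "\n".join(f"{e.producer_id}: {e.content}" for e in episodes)
```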
Did you manually verify the performance improvement?
Here is my test code; it inserts 100 entries, each with 1000 random characters, into episodic memory only:

```python
import random
import string
import time

# Import path for the client SDK may differ in your setup.
from memmachine import MemMachineClient, MemoryType

def run():
    client = MemMachineClient(base_url="http://localhost:8080")
    project = client.get_project(
        org_id="performance_test",
        project_id="multi_add",
    )
    memory = project.memory(
        metadata={
            "group_id": "test_group",
            "agent_id": "agent1",
            "user_id": "user1",
        }
    )
    start = time.time()
    characters = string.ascii_letters + string.digits + string.punctuation
    for i in range(100):
        result = memory.add(
            content=''.join(random.choices(characters, k=1000)),
            memory_types=[MemoryType.Episodic],
        )
        assert len(result) == 1
        if i % 10 == 0:
            print(f"Added entry i={i}")
    duration = time.time() - start
    print(f"Added 100 entries in {duration:.2f} seconds ({100/duration:.2f} ops/sec)")
```

I ran the test on my MacBook. Comparing the timings before the change and after the update (screenshots omitted): about 5 times faster.
@jealous @edwinyyyu This looks extremely promising!! I used the Locust load script from #837 and ran the Episodic Write-only test to produce the following results:

Until this PR, I could only achieve up to ~20 requests per second (RPS) with 50 users; see the results in #837. With this fix, I'm getting up to 150 RPS at 50 users (a 7.5x increase). The RPS looks stable, which is a good sign.

The Locust workload starts to fail when we hit 100 users with "neo4j - Failed to read four byte Bolt handshake response from server ResolvedIPv4Address(('172.18.0.3', 7687)) (deadline Deadline(timeout=60.0))". This is not the fault of MemMachine or this PR; it's actually a good sign, as it shows the bottleneck is now in Neo4j. See #940 for similar issues. We have tunables available to help, so I'll continue that effort to see if I can successfully complete the load test with 500 users. In the future, we may need to consider a back-off algorithm when we detect database/storage latency issues or timeouts.
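As a sketch of that back-off idea (write_episode is a hypothetical storage call, and the retry and delay parameters are arbitrary):

```python
import asyncio
import random

async def write_with_backoff(write_episode, episode, retries=5, base_delay=0.1):
    """Retry a storage write with exponential back-off plus jitter."""
    for attempt in range(retries):
        try:
            return await write_episode(episode)
        except (TimeoutError, ConnectionError):
            if attempt == retries - 1:
                raise
            # Sleep 0.1s, 0.2s, 0.4s, ... plus up to 100ms of jitter.
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
```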
Review comment on this section of the consolidator worker:

```python
# No more episodes to process, exit the worker
break
...
batch_to_process = self._pending_episodes[:]
```
The problem here is that the episodes can accumulate; eventually the batch can exceed the model's context window size.
I updated the logic so it recursively splits the episodes into two parts until they fit into the context window. If a single episode plus the summary already exceeds the context window, I have no choice but to drop that episode.
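As a sketch, that splitting strategy could look like the following, where summarize, count_tokens, and max_tokens stand in for whatever model call, tokenizer, and limit are actually used:

```python
async def summarize_fitting(episodes, summary, summarize, count_tokens, max_tokens):
    """Recursively halve the batch until each piece fits the context window."""
    if count_tokens(summary, episodes) <= max_tokens:
        return await summarize(summary, episodes)
    if len(episodes) == 1:
        # A single episode plus the summary is still too large:
        # nothing fits, so the episode has to be dropped.
        return summary
    # Summarize the first half, then fold the second half into that summary.
    mid = len(episodes) // 2
    summary = await summarize_fitting(
        episodes[:mid], summary, summarize, count_tokens, max_tokens
    )
    return await summarize_fitting(
        episodes[mid:], summary, summarize, count_tokens, max_tokens
    )
```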
Purpose of the change
Fixes performance degradation in short-term memory caused by blocking summary generation during episode addition.
Description
Previously, triggering a new summary while a previous one was still running would block the add operation and slow down ingestion.
The fix extracts summary logic into a dedicated consolidator class that runs consolidation asynchronously and avoids blocking writes.
Key changes (sketched further below):
- Move episode summarization logic into a new consolidator class
- Run consolidation in a background routine instead of blocking add operations
- Cache newly added episodes when consolidation is already running and include them in the next consolidation pass
- Ensure queries wait for consolidation to finish so results are always stable and consistent
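To illustrate the last point, a query gate built on asyncio.Condition can make reads wait for an in-flight consolidation pass; ConsolidationGate below is a hypothetical sketch, not the PR's implementation:

```python
import asyncio

class ConsolidationGate:
    """Sketch: reads wait until no consolidation pass is in flight."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._consolidating = False

    async def start_consolidation(self):
        async with self._cond:
            self._consolidating = True

    async def finish_consolidation(self):
        async with self._cond:
            self._consolidating = False
            self._cond.notify_all()

    async def wait_for_stable_read(self):
        # Queries call this first so they never observe a half-updated summary.
        async with self._cond:
            await self._cond.wait_for(lambda: not self._consolidating)
```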
Fixes/Closes
Fixes #909