add conversation history summarization #2730

blublinsky wants to merge 1 commit into openshift:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
blublinsky force-pushed from ceb314a to 9b9a793.
onmete left a comment:
My understanding of how the summarization should work is:
- Retrieve full conversation history (as we do today)
- At prompt preparation time (in _prepare_prompt(), where limit_conversation_history() is called), check if history fits in available tokens
- If it doesn't fit: summarize ALL messages via an LLM call, inject the summary into the system prompt
- Store the summary in cache, replacing the original history
- Next request: summary is retrieved as the "history" - it's small, always fits
- This is a simple "summarize everything when needed" approach - no need to be clever about which messages to keep.
Essentially, what we are looking for is to replace this line https://github.com/openshift/lightspeed-service/blob/main/ols/src/query_helpers/docs_summarizer.py#L236 with the summarization feature.
The PR's approach tries to optimize before fetching (limit what we retrieve), but with summarization, this optimization becomes unnecessary.
Once we summarize, the history is replaced with a compact summary. There's no scenario where we have "too many messages to fetch" because either:
- history hasn't been summarized yet (small enough to fetch)
- history was summarized (only summary exists)
Is my understanding reasonable? Is there a scenario that invalidates it? Can we try not to add more responsibilities to the (already bloated) docs summarizer? :)
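The "summarize everything when needed" flow described above could be sketched roughly like this. All names here (`count_tokens`, `summarize`, `prepare_history`, `TOKEN_LIMIT`) are illustrative stand-ins, not the actual lightspeed-service API:

```python
# Sketch of "summarize everything when needed" at prompt-preparation time.
# count_tokens and summarize are crude stand-ins for the real tokenizer
# and an LLM summarization call.

TOKEN_LIMIT = 50  # tokens available for history after other budgets


def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())


def summarize(messages: list[str]) -> str:
    # Stand-in for an LLM call that condenses the whole history.
    return f"[Previous conversation summary] ({len(messages)} messages condensed)"


def prepare_history(cache: list[str]) -> list[str]:
    """Return history that fits the budget, summarizing all of it if needed."""
    total = sum(count_tokens(m) for m in cache)
    if total <= TOKEN_LIMIT:
        return cache  # small enough: use verbatim
    summary = summarize(cache)
    cache.clear()          # replace the original history in the cache...
    cache.append(summary)  # ...with the compact summary
    return cache
```

On the next request, the summary is retrieved as the "history"; it is a single small entry, so it always fits and there is never "too much history to fetch".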
blublinsky force-pushed from 9b9a793 to dad939c.
/retest
blublinsky force-pushed from dad939c to e402813.
```
Conversation history:
{full_conversation}

Summary:
```
"Please" is probably a waste of tokens :P
I found this prompt somewhere:
You are an expert conversation summarizer. Your job is to create detailed, comprehensive summaries of chat conversations.
Your summary should include:
- What were the main subjects covered?
- Any agreements, choices, or conclusions made
- Revealed preferences, likes, dislikes, or constraints
- Significant Q&A exchanges
- Tasks mentioned or to be completed
Be comprehensive but concise. Focus on information that would be valuable for continuing the conversation later. Write in a natural, narrative style that another AI can easily understand and use as context.
Do not include:
- Pleasantries or greetings unless they reveal something important
- Repetitive information
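A suggested prompt like this could be wired into a summarization call roughly as below. This is only a sketch: `SUMMARIZER_SYSTEM_PROMPT` quotes the opening of the text above, and `llm` is assumed to be any callable taking (system, user) prompts, not the real service API:

```python
# Hypothetical wiring of the suggested summarizer prompt. The transcript
# rendering and the "Conversation history: ... Summary:" framing mirror
# the prompt fragment discussed in this thread.

SUMMARIZER_SYSTEM_PROMPT = (
    "You are an expert conversation summarizer. Your job is to create "
    "detailed, comprehensive summaries of chat conversations."
)


def summarize_history(llm, turns: list[tuple[str, str]]) -> str:
    """Render (role, text) turns into a transcript and ask `llm` to summarize."""
    transcript = "\n".join(f"{role}: {text}" for role, text in turns)
    user_prompt = f"Conversation history:\n{transcript}\n\nSummary:"
    return llm(SUMMARIZER_SYSTEM_PROMPT, user_prompt)
```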
blublinsky force-pushed from 5b5f910 to 9679bdc.
blublinsky force-pushed from 9679bdc to b377810.
blublinsky force-pushed from 928ce14 to 760470a.
/retest

1 similar comment
36c48ac to
6dd787b
Compare
/retest

2 similar comments
/test service-on-pull-request
/retest

1 similar comment
/test evaluation
blublinsky force-pushed from 161ec61 to c256ce3.
/retest

1 similar comment
blublinsky force-pushed from f3beaa3 to dca7ffc.
/retest
blublinsky force-pushed from dca7ffc to b247b83.
I'm sending comments to ensure this PR is aligned with the agreed plan:
blublinsky force-pushed from b247b83 to e4b5010.
Here are the events:

```
data: {"event": "start", "data": {"conversation_id": "47494b6d-70b0-48d5-9376-5d0b34d04f6b"}}
data: {"event": "history_compression_start", "data": {"status": "started"}}
data: {"event": "history_compression_end", "data": {"status": "success", "duration_ms": 1786.2}}
data: {"event": "token", "data": {"id": 0, "token": ""}}
data: {"event": "token", "data": {"id": 1, "token": "{\""}}
```
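A client consuming this stream only needs to parse the `data:` lines and react to the compression markers. A minimal sketch, assuming plain `data: <json>` SSE lines as shown above (not any particular client library):

```python
# Minimal client-side handling of the SSE events shown above.
import json


def parse_sse_events(stream_lines):
    """Yield the JSON payload of each 'data:' line in an SSE stream."""
    for line in stream_lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])


def compression_duration_ms(events):
    """Pick the duration reported by the history_compression_end event, if any."""
    for ev in events:
        if ev["event"] == "history_compression_end":
            return ev["data"]["duration_ms"]
    return None
```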
|
blublinsky force-pushed from e4b5010 to c904bc9.
```python
ainvoke = getattr(bare_llm, "ainvoke", None)
if not callable(ainvoke):
    raise TypeError("LLM object must provide callable ainvoke(messages)")
response = await ainvoke(messages)
```
@blublinsky I think it's good to have a timeout here.
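One way to add that timeout is to wrap the call in `asyncio.wait_for`. A sketch under stated assumptions: the 30-second default is arbitrary, and `ainvoke_with_timeout` is a hypothetical helper, not part of the PR:

```python
# Hypothetical timeout wrapper for the ainvoke call above.
import asyncio


async def ainvoke_with_timeout(bare_llm, messages, timeout_s: float = 30.0):
    """Call the LLM's ainvoke but fail fast if it hangs."""
    ainvoke = getattr(bare_llm, "ainvoke", None)
    if not callable(ainvoke):
        raise TypeError("LLM object must provide callable ainvoke(messages)")
    # wait_for cancels the underlying task and raises TimeoutError on expiry.
    return await asyncio.wait_for(ainvoke(messages), timeout=timeout_s)
```

Callers would then catch `asyncio.TimeoutError` and surface a summarization failure instead of blocking the request indefinitely.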
@blublinsky Can we have a flag to turn summarization on/off? It need not be present in the CR. It may be handy for the UI to test, and useful if users want to have control in the future.
done |
blublinsky force-pushed from c904bc9 to 38351e4.
@onmete I see only one item from your list still to be addressed: item (1). I feel triggering at 85% of available_tokens is fine (total - response window - tool budget - system prompt - RAG - query). I feel it's safer because we are operating specifically on the tokens available for history.
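The budget arithmetic behind that 85% trigger could look like the following. All names and numbers here are assumptions for illustration, not actual lightspeed-service configuration:

```python
# Illustrative budget computation for the 85% compression trigger.
COMPRESSION_TRIGGER = 0.85


def history_budget(context_window: int, response_window: int, tool_budget: int,
                   system_prompt_tokens: int, rag_tokens: int,
                   query_tokens: int) -> int:
    """Tokens left for conversation history after all other budgets."""
    return (context_window - response_window - tool_budget
            - system_prompt_tokens - rag_tokens - query_tokens)


def needs_compression(history_tokens: int, available: int) -> bool:
    """Compress once history exceeds 85% of what's available to it."""
    return history_tokens > COMPRESSION_TRIGGER * available
```

Operating on the history-specific budget (rather than the raw context window) leaves the 15% margin as headroom against tokenizer miscounts.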
```python
assert len(result) == DEFAULT_ENTRIES_TO_KEEP
assert result[0].query.content == "[Previous conversation summary]"
assert result[1:] == cache_entries[:-1]
```
Is this right? It seems this is dropping the latest message; we want to drop the oldest.
Also, test_compress_conversation_history_no_compression_needed is a misleading name, as compression does happen.
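The intended behavior (summarize the oldest entries, keep the newest ones verbatim) could be sketched like this; `compress` and the `DEFAULT_ENTRIES_TO_KEEP` value are illustrative, not the PR's actual code:

```python
# Sketch: summarize the OLDEST entries into one slot and keep the newest
# DEFAULT_ENTRIES_TO_KEEP - 1 entries verbatim.
DEFAULT_ENTRIES_TO_KEEP = 4


def compress(cache_entries: list[str]) -> list[str]:
    """Replace old entries with one summary slot, keeping the newest ones."""
    keep = cache_entries[-(DEFAULT_ENTRIES_TO_KEEP - 1):]
    return ["[Previous conversation summary]"] + keep
```

With that shape, the assertion on the tail would be `result[1:] == cache_entries[-(DEFAULT_ENTRIES_TO_KEEP - 1):]`, i.e. the latest messages survive.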
ols/app/endpoints/ols.py (outdated)

```python
conversation_id = retrieve_conversation_id(llm_request)
if not suid.check_suid(conversation_id):
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
```
blublinsky force-pushed from 0492f18 to fe1d936.
blublinsky force-pushed from fe1d936 to 85428da.
/retest
@blublinsky: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Description
This PR
What this implementation is missing:
Type of change
Related Tickets & Documents
OLS-2500
Checklist before requesting a review
Testing