
add conversation history summarization#2730

Open
blublinsky wants to merge 1 commit into openshift:main from blublinsky:history-retrival

Conversation

@blublinsky
Contributor

@blublinsky blublinsky commented Feb 3, 2026

Description

This PR

  1. prepares for the implementation of conversation summarization
  2. implements the actual summarization

What this implementation is missing:

  1. Externalized configuration for entries_to_keep - it could be exposed in configuration. Is that important? It is currently 5.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

@openshift-ci openshift-ci bot requested review from raptorsun and xrajesh February 3, 2026 13:33
@openshift-ci

openshift-ci bot commented Feb 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xrajesh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@blublinsky blublinsky force-pushed the history-retrival branch 3 times, most recently from ceb314a to 9b9a793 Compare February 3, 2026 15:03
Contributor

@onmete onmete left a comment


My understanding of how the summarization should work is:

  1. Retrieve full conversation history (as we do today)
  2. At prompt preparation time (in _prepare_prompt(), where limit_conversation_history() is called), check if history fits in available tokens
  3. If it doesn't fit: summarize ALL messages via an LLM call, inject the summary into the system prompt
  4. Store the summary in cache, replacing the original history
  5. Next request: summary is retrieved as the "history" - it's small, always fits
  6. This is a simple "summarize everything when needed" approach - no need to be clever about which messages to keep.

Essentially, what we are looking for is to replace this line https://github.com/openshift/lightspeed-service/blob/main/ols/src/query_helpers/docs_summarizer.py#L236, with summarization feature.

The PR's approach tries to optimize before fetching (limit what we retrieve), but with summarization, this optimization becomes unnecessary.
Once we summarize, the history is replaced with a compact summary. There's no scenario where we have "too many messages to fetch" because either:

  • history hasn't been summarized yet (small enough to fetch)
  • history was summarized (only summary exists)

Is my understanding reasonable? Is there a scenario that discards this? Can we try to not add more responsibilities to (already bloated) docs summarizer? :)
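The "summarize everything when needed" flow sketched in the steps above can be written down as a minimal sketch (CacheEntry, token_count, summarize, and prepare_history are illustrative stand-ins, not the actual lightspeed-service API):

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    query: str
    response: str


def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())


def history_tokens(history: list) -> int:
    return sum(token_count(e.query) + token_count(e.response) for e in history)


def summarize(history: list) -> str:
    # Placeholder for the LLM call that compacts the full history.
    return "Summary of %d earlier exchanges" % len(history)


def prepare_history(cache: dict, conv_id: str, available_tokens: int) -> list:
    """Return history for the prompt, compacting it when it does not fit."""
    history = cache.get(conv_id, [])
    if history_tokens(history) <= available_tokens:
        return history  # fits: use verbatim
    # Doesn't fit: summarize ALL messages and replace the cached history,
    # so the next request retrieves only the compact summary.
    summary = summarize(history)
    cache[conv_id] = [CacheEntry("[Previous conversation summary]", summary)]
    return cache[conv_id]
```

Under this sketch there is never "too much history to fetch": the cache holds either a small unsummarized history or a single summary entry.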

@blublinsky
Contributor Author

My understanding of how the summarization should work is:

  1. Retrieve full conversation history (as we do today)
    Not quite. We retrieve only a partial history, based on the token budget. During retrieval, we check whether the history is too large, which we use as the signal for summarization.
  2. At prompt preparation time (in _prepare_prompt(), where limit_conversation_history() is called), check if history fits in available tokens
    It is checked immediately after retrieval
  3. If it doesn't fit: summarize ALL messages via an LLM call, inject the summary into the system prompt
    This is the next step
  4. Store the summary in cache, replacing the original history
    next step
  5. Next request: summary is retrieved as the "history" - it's small, always fits
  6. This is a simple "summarize everything when needed" approach - no need to be clever about which messages to keep.

Essentially, what we are looking for is to replace this line https://github.com/openshift/lightspeed-service/blob/main/ols/src/query_helpers/docs_summarizer.py#L236, with summarization feature.

The PR's approach tries to optimize before fetching (limit what we retrieve), but with summarization, this optimization becomes unnecessary. Once we summarize, the history is replaced with a compact summary. There's no scenario where we have "too many messages to fetch" because either:
It actually is needed - it is our defence mechanism.

  • history hasn't been summarized yet (small enough to fetch)
  • history was summarized (only summary exists)

Is my understanding reasonable? Is there a scenario that discards this? Can we try to not add more responsibilities to (already bloated) docs summarizer? :)

Summary.
What is done in this PR:

  1. Optimization of the read path: we get a signal that the history is too large, instead of checking its size every time. This is moved to the doc summarizer, because that is where we compute the available token budget.
  2. The actual summarization is a simple async function that is trivial to implement on top of this.
  3. These two are split to keep the PR smaller.

@blublinsky
Contributor Author

/retest

@blublinsky blublinsky changed the title Refactor history retrieval in preparation to summarization add conversation history summarization Feb 4, 2026
Conversation history:
{full_conversation}

Summary:"""
Contributor

"Please" is probably a waste of tokens :P

I found this prompt somewhere:

You are an expert conversation summarizer. Your job is to create detailed, comprehensive summaries of chat conversations.

Your summary should include:
- What were the main subjects covered?
- Any agreements, choices, or conclusions made
- Revealed preferences, likes, dislikes, or constraints
- Significant Q&A exchanges
- Tasks mentioned or to be completed

Be comprehensive but concise. Focus on information that would be valuable for continuing the conversation later. Write in a natural, narrative style that another AI can easily understand and use as context.

Do not include:
- Pleasantries or greetings unless they reveal something important
- Repetitive information
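For illustration, the suggested system text could be wired into a prompt builder along these lines (the template and function names are hypothetical, not the PR's actual code):

```python
# Hypothetical prompt builder for the summarizer text suggested above.
SUMMARY_SYSTEM_PROMPT = (
    "You are an expert conversation summarizer. Your job is to create "
    "detailed, comprehensive summaries of chat conversations."
)

SUMMARY_USER_TEMPLATE = """Conversation history:
{full_conversation}

Summary:"""


def build_summary_messages(full_conversation: str) -> list:
    # (role, content) pairs, mirroring a typical chat-completion payload.
    return [
        ("system", SUMMARY_SYSTEM_PROMPT),
        ("user", SUMMARY_USER_TEMPLATE.format(full_conversation=full_conversation)),
    ]
```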

Contributor Author

fixed

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2026
@blublinsky blublinsky force-pushed the history-retrival branch 4 times, most recently from 928ce14 to 760470a Compare February 6, 2026 09:15
@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky blublinsky force-pushed the history-retrival branch 4 times, most recently from 36c48ac to 6dd787b Compare February 10, 2026 13:37
@blublinsky
Contributor Author

/retest

2 similar comments
@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/test service-on-pull-request

@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/test evaluation

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from 161ec61 to c256ce3 Compare February 27, 2026 19:27
@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from f3beaa3 to dca7ffc Compare March 2, 2026 16:30
@blublinsky
Contributor Author

/retest

@onmete
Contributor

onmete commented Mar 6, 2026

I'm sending comments to ensure this PR is aligned to agreed plan:

  1. Trigger - 85% of context window (primary), history overflow (secondary)
    The 0.85 ratio is applied to the remaining history budget, not the full context window. Our architecture says the primary trigger should be when the overall context fills 85% of the window (system prompt + RAG + history + query >= 85% of context_window_size). Currently, a conversation could fill 70% of the context window but only 50% of the history budget, and compression would never trigger. Consider calculating the trigger against context_window_size directly rather than against the already-reduced available_tokens.

  2. Keep last 2 turns verbatim (with degradation guard)
    We agreed we'll do 5 last exchanges without degradation guard (try 4,3,2 ... if 5 doesn't fit) - aligned.

  3. Compact everything older into a structured summary via LLM call
    Aligned.

  4. Replace cache with compacted summary + preserved tail (destructive)
    Aligned.

  5. Truncation as fallback when compaction fails
    Partially aligned, but seems the fallback is destructive when it shouldn't be.
    When compaction fails, the PR still destructively rewrites the cache - deleting the full history and replacing it with only the fallback entries. This means a LLM failure permanently destroys conversation history. The architecture says "keep truncation/cut-off as fallback" - meaning use the existing limit_conversation_history truncation on the unchanged full cache. The fallback path should not call _rewrite_cache; it should simply return the full cache entries and let the downstream limit_conversation_history truncate them for this request, leaving the full history intact in cache for the next attempt.

  6. Send an event to the UI that OLS is compacting
    Missing (and also please create an associated UI story to show compaction is running in chat).
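The non-destructive fallback described in item 5 could be sketched like this (compress_with_fallback and the cache shape are illustrative, assuming the behaviour requested above):

```python
def compress_with_fallback(cache: dict, conv_id: str, summarize, keep: int = 2):
    """Compact history; on LLM failure, leave the cache untouched."""
    entries = cache.get(conv_id, [])
    try:
        summary = summarize(entries[:-keep])
    except Exception:
        # Compaction failed: do NOT rewrite the cache. Return the full
        # entries and let the downstream limit_conversation_history()
        # truncate them for this request only, so the full history is
        # still intact for the next attempt.
        return entries
    compacted = [("[Previous conversation summary]", summary)] + entries[-keep:]
    cache[conv_id] = compacted  # destructive replace only on success
    return compacted
```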

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2026
@blublinsky
Contributor Author

blublinsky commented Mar 6, 2026

I'm sending comments to ensure this PR is aligned to agreed plan:

  1. Trigger - 85% of context window (primary), history overflow (secondary)
    The 0.85 ratio is applied to the remaining history budget, not the full context window. Our architecture says the primary trigger should be when the overall context fills 85% of the window (system prompt + RAG + history + query >= 85% of context_window_size). Currently, a conversation could fill 70% of the context window but only 50% of the history budget, and compression would never trigger. Consider calculating the trigger against context_window_size directly rather than against the already-reduced available_tokens.
    85% of the remaining budget for history is the trigger
  2. Keep last 2 turns verbatim (with degradation guard)
    We agreed we'll do 5 last exchanges without degradation guard (try 4,3,2 ... if 5 doesn't fit) - aligned.
  3. Compact everything older into a structured summary via LLM call
    Aligned.
    Done
  4. Replace cache with compacted summary + preserved tail (destructive)
    Aligned.
  5. Truncation as fallback when compaction fails
    Truncation is not a fallback; it is a safety valve for when compaction occurs, to make sure that summarization does not create too large an entry.
    Partially aligned, but seems the fallback is destructive when it shouldn't be.
    When compaction fails, the PR still destructively rewrites the cache - deleting the full history and replacing it with only the fallback entries. This means a LLM failure permanently destroys conversation history. The architecture says "keep truncation/cut-off as fallback" - meaning use the existing limit_conversation_history truncation on the unchanged full cache. The fallback path should not call _rewrite_cache; it should simply return the full cache entries and let the downstream limit_conversation_history truncate them for this request, leaving the full history intact in cache for the next attempt.
    Fixed
  6. Send an event to the UI that OLS is compacting
    fixed

Here are events:

data: {"event": "start", "data": {"conversation_id": "47494b6d-70b0-48d5-9376-5d0b34d04f6b"}}

data: {"event": "history_compression_start", "data": {"status": "started"}}

data: {"event": "history_compression_end", "data": {"status": "success", "duration_ms": 1786.2}}

data: {"event": "token", "data": {"id": 0, "token": ""}}

data: {"event": "token", "data": {"id": 1, "token": "{\""}}

Missing (and also please create an associated UI story to show compaction is running in chat).
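The compaction events above could be produced with a small helper along these lines (the sse helper is illustrative; only the event names and payload shapes come from the PR's output):

```python
import json


def sse(event: str, data: dict) -> str:
    # Format one server-sent event exactly as shown in the stream above.
    return "data: %s\n\n" % json.dumps({"event": event, "data": data})


def compression_events(duration_ms: float) -> list:
    return [
        sse("history_compression_start", {"status": "started"}),
        sse("history_compression_end",
            {"status": "success", "duration_ms": duration_ms}),
    ]
```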

ainvoke = getattr(bare_llm, "ainvoke", None)
if not callable(ainvoke):
    raise TypeError("LLM object must provide callable ainvoke(messages)")
response = await ainvoke(messages)
Contributor

@blublinsky I think it's good to have a timeout here.

Contributor Author

Done
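A timeout guard of the kind requested above could look like this (the wrapper name and the 30-second default are illustrative assumptions):

```python
import asyncio


async def invoke_with_timeout(bare_llm, messages, timeout_s: float = 30.0):
    # Guard the summarization LLM call so a hung backend cannot stall
    # the request indefinitely; 30s is an assumed default, not the PR's.
    ainvoke = getattr(bare_llm, "ainvoke", None)
    if not callable(ainvoke):
        raise TypeError("LLM object must provide callable ainvoke(messages)")
    return await asyncio.wait_for(ainvoke(messages), timeout=timeout_s)
```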

@xrajesh
Contributor

xrajesh commented Mar 9, 2026

@blublinsky - Can we have a flag to turn summarization on/off? It need not be present in the CR. It may be handy for UI testing, and if users want to have control in the future.

@blublinsky
Contributor Author

@blublinsky - Can we have a flag to turn summarization on/off? It need not be present in the CR. It may be handy for UI testing, and if users want to have control in the future.

done

@xrajesh
Contributor

xrajesh commented Mar 10, 2026

@onmete - I see only one item left to be addressed from your list - item (1). I feel the trigger at 85% of available_tokens (total - response window - tool budget - system prompt - RAG - query) is fine, and safer, because we are operating specifically on the tokens available for history.
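The arithmetic being debated can be written down concretely (a sketch; COMPRESSION_RATIO and the parameter names are illustrative, with system prompt + RAG + query folded into one fixed term):

```python
COMPRESSION_RATIO = 0.85


def should_compress(context_window: int, response_window: int, tool_budget: int,
                    fixed_prompt_tokens: int, history_tokens: int) -> bool:
    # available_tokens = what is left for history after the fixed parts
    # of the prompt are budgeted out of the context window.
    available_tokens = (
        context_window - response_window - tool_budget - fixed_prompt_tokens
    )
    return history_tokens > COMPRESSION_RATIO * available_tokens
```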


assert len(result) == DEFAULT_ENTRIES_TO_KEEP
assert result[0].query.content == "[Previous conversation summary]"
assert result[1:] == cache_entries[:-1]
Contributor

Is this right? It seems this is dropping the latest message - we want to drop the oldest.

Also, test_compress_conversation_history_no_compression_needed is a misleading name, as compression does happen.
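For comparison, slicing that keeps the newest entries after the summary would look like this (illustrative values; keep mirrors the PR's DEFAULT_ENTRIES_TO_KEEP, assumed to be 5):

```python
entries = ["e1", "e2", "e3", "e4", "e5", "e6"]  # oldest .. newest
keep = 5  # assumed DEFAULT_ENTRIES_TO_KEEP

# Reserve one slot for the summary, then take the NEWEST entries:
# this drops the oldest messages rather than the latest one.
result = ["[Previous conversation summary]"] + entries[-(keep - 1):]
```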

conversation_id = retrieve_conversation_id(llm_request)
if not suid.check_suid(conversation_id):
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
Contributor

This should be 400.

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from 0492f18 to fe1d936 Compare March 13, 2026 13:45
@blublinsky
Contributor Author

/retest

@openshift-ci

openshift-ci bot commented Mar 13, 2026

@blublinsky: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
