
add conversation history summarization#2730

Open
blublinsky wants to merge 1 commit into openshift:main from blublinsky:history-retrival

Conversation

@blublinsky
Contributor

@blublinsky blublinsky commented Feb 3, 2026

Description

This PR

  1. prepares for the implementation of conversation summarization
  2. implements the actual summarization

What this implementation is missing:

  1. Externalized configuration for entries_to_keep - it could be exposed in configuration. Is that important? It is currently 5.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

@openshift-ci openshift-ci bot requested review from raptorsun and xrajesh February 3, 2026 13:33
@openshift-ci

openshift-ci bot commented Feb 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xrajesh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@blublinsky blublinsky force-pushed the history-retrival branch 3 times, most recently from ceb314a to 9b9a793 Compare February 3, 2026 15:03
Contributor

@onmete onmete left a comment


My understanding of how the summarization should work is:

  1. Retrieve full conversation history (as we do today)
  2. At prompt preparation time (in _prepare_prompt(), where limit_conversation_history() is called), check if history fits in available tokens
  3. If it doesn't fit: summarize ALL messages via an LLM call, inject the summary into the system prompt
  4. Store the summary in cache, replacing the original history
  5. Next request: summary is retrieved as the "history" - it's small, always fits
  6. This is a simple "summarize everything when needed" approach - no need to be clever about which messages to keep.

Essentially, what we are looking for is to replace this line https://github.com/openshift/lightspeed-service/blob/main/ols/src/query_helpers/docs_summarizer.py#L236, with summarization feature.

The PR's approach tries to optimize before fetching (limit what we retrieve), but with summarization, this optimization becomes unnecessary.
Once we summarize, the history is replaced with a compact summary. There's no scenario where we have "too many messages to fetch" because either:

  • history hasn't been summarized yet (small enough to fetch)
  • history was summarized (only summary exists)

Is my understanding reasonable? Is there a scenario that discards this? Can we try to not add more responsibilities to (already bloated) docs summarizer? :)
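The "summarize everything when needed" flow sketched in the steps above can be written down as a minimal sketch (CacheEntry, token_count, summarize, and prepare_history are illustrative stand-ins, not the actual lightspeed-service API):

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    query: str
    response: str


def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())


def history_tokens(history: list) -> int:
    return sum(token_count(e.query) + token_count(e.response) for e in history)


def summarize(history: list) -> str:
    # Placeholder for the LLM call that compacts the full history.
    return "Summary of %d earlier exchanges" % len(history)


def prepare_history(cache: dict, conv_id: str, available_tokens: int) -> list:
    """Return history for the prompt, compacting it when it does not fit."""
    history = cache.get(conv_id, [])
    if history_tokens(history) <= available_tokens:
        return history  # fits: use verbatim
    # Doesn't fit: summarize ALL messages and replace the cached history,
    # so the next request retrieves only the compact summary.
    summary = summarize(history)
    cache[conv_id] = [CacheEntry("[Previous conversation summary]", summary)]
    return cache[conv_id]
```

Under this sketch there is never "too much history to fetch": the cache holds either a small unsummarized history or a single summary entry.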

@blublinsky
Contributor Author

My understanding of how the summarization should work is:

  1. Retrieve full conversation history (as we do today)
    Not quite. We retrieve only a partial history, based on the token budget. During retrieval, we check whether the history is too large, which we use as the signal for summarization.
  2. At prompt preparation time (in _prepare_prompt(), where limit_conversation_history() is called), check if history fits in available tokens
    It is checked immediately after retrieval
  3. If it doesn't fit: summarize ALL messages via an LLM call, inject the summary into the system prompt
    This is the next step
  4. Store the summary in cache, replacing the original history
    next step
  5. Next request: summary is retrieved as the "history" - it's small, always fits
  6. This is a simple "summarize everything when needed" approach - no need to be clever about which messages to keep.

Essentially, what we are looking for is to replace this line https://github.com/openshift/lightspeed-service/blob/main/ols/src/query_helpers/docs_summarizer.py#L236, with summarization feature.

The PR's approach tries to optimize before fetching (limit what we retrieve), but with summarization, this optimization becomes unnecessary. Once we summarize, the history is replaced with a compact summary. There's no scenario where we have "too many messages to fetch" because either:
It actually is needed - it is our defence mechanism.

  • history hasn't been summarized yet (small enough to fetch)
  • history was summarized (only summary exists)

Is my understanding reasonable? Is there a scenario that discards this? Can we try to not add more responsibilities to (already bloated) docs summarizer? :)

Summary.
What is done in this PR:

  1. Optimization of the read path: we get a signal that the history is too large, instead of checking its size every time. This is moved to the doc summarizer, because that is where we compute the available token budget.
  2. The actual summarization is a simple async function that is trivial to implement on top of this.
  3. These two are split to keep the PR smaller.

@blublinsky
Contributor Author

/retest

@blublinsky blublinsky changed the title Refactor history retrieval in preparation to summarization add conversation history summarization Feb 4, 2026
Conversation history:
{full_conversation}

Summary:"""
Contributor

"Please" is probably a waste of tokens :P

I found this prompt somewhere:

You are an expert conversation summarizer. Your job is to create detailed, comprehensive summaries of chat conversations.

Your summary should include:
- What were the main subjects covered?
- Any agreements, choices, or conclusions made
- Revealed preferences, likes, dislikes, or constraints
- Significant Q&A exchanges
- Tasks mentioned or to be completed

Be comprehensive but concise. Focus on information that would be valuable for continuing the conversation later. Write in a natural, narrative style that another AI can easily understand and use as context.

Do not include:
- Pleasantries or greetings unless they reveal something important
- Repetitive information
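For illustration, the suggested system text could be wired into a prompt builder along these lines (the template and function names are hypothetical, not the PR's actual code):

```python
# Hypothetical prompt builder for the summarizer text suggested above.
SUMMARY_SYSTEM_PROMPT = (
    "You are an expert conversation summarizer. Your job is to create "
    "detailed, comprehensive summaries of chat conversations."
)

SUMMARY_USER_TEMPLATE = """Conversation history:
{full_conversation}

Summary:"""


def build_summary_messages(full_conversation: str) -> list:
    # (role, content) pairs, mirroring a typical chat-completion payload.
    return [
        ("system", SUMMARY_SYSTEM_PROMPT),
        ("user", SUMMARY_USER_TEMPLATE.format(full_conversation=full_conversation)),
    ]
```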

Contributor Author

fixed

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2026
@blublinsky blublinsky force-pushed the history-retrival branch 4 times, most recently from 928ce14 to 760470a Compare February 6, 2026 09:15
@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky blublinsky force-pushed the history-retrival branch 4 times, most recently from 36c48ac to 6dd787b Compare February 10, 2026 13:37
@blublinsky
Contributor Author

/retest

2 similar comments
@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/test service-on-pull-request

@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky
Contributor Author

/test evaluation

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from 161ec61 to c256ce3 Compare February 27, 2026 19:27
@blublinsky
Contributor Author

/retest

1 similar comment
@blublinsky
Contributor Author

/retest

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from f3beaa3 to dca7ffc Compare March 2, 2026 16:30
@blublinsky
Contributor Author

/retest

@onmete
Contributor

onmete commented Mar 6, 2026

I'm sending comments to ensure this PR is aligned to agreed plan:

  1. Trigger - 85% of context window (primary), history overflow (secondary)
    The 0.85 ratio is applied to the remaining history budget, not the full context window. Our architecture says the primary trigger should be when the overall context fills 85% of the window (system prompt + RAG + history + query >= 85% of context_window_size). Currently, a conversation could fill 70% of the context window but only 50% of the history budget, and compression would never trigger. Consider calculating the trigger against context_window_size directly rather than against the already-reduced available_tokens.

  2. Keep last 2 turns verbatim (with degradation guard)
    We agreed we'll do 5 last exchanges without degradation guard (try 4,3,2 ... if 5 doesn't fit) - aligned.

  3. Compact everything older into a structured summary via LLM call
    Aligned.

  4. Replace cache with compacted summary + preserved tail (destructive)
    Aligned.

  5. Truncation as fallback when compaction fails
    Partially aligned, but seems the fallback is destructive when it shouldn't be.
    When compaction fails, the PR still destructively rewrites the cache - deleting the full history and replacing it with only the fallback entries. This means a LLM failure permanently destroys conversation history. The architecture says "keep truncation/cut-off as fallback" - meaning use the existing limit_conversation_history truncation on the unchanged full cache. The fallback path should not call _rewrite_cache; it should simply return the full cache entries and let the downstream limit_conversation_history truncate them for this request, leaving the full history intact in cache for the next attempt.

  6. Send an event to the UI that OLS is compacting
    Missing (and also please create an associated UI story to show compaction is running in chat).
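The non-destructive fallback described in item 5 could be sketched like this (compress_with_fallback and the cache shape are illustrative, assuming the behaviour requested above):

```python
def compress_with_fallback(cache: dict, conv_id: str, summarize, keep: int = 2):
    """Compact history; on LLM failure, leave the cache untouched."""
    entries = cache.get(conv_id, [])
    try:
        summary = summarize(entries[:-keep])
    except Exception:
        # Compaction failed: do NOT rewrite the cache. Return the full
        # entries and let the downstream limit_conversation_history()
        # truncate them for this request only, so the full history is
        # still intact for the next attempt.
        return entries
    compacted = [("[Previous conversation summary]", summary)] + entries[-keep:]
    cache[conv_id] = compacted  # destructive replace only on success
    return compacted
```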

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2026
@blublinsky
Contributor Author

blublinsky commented Mar 6, 2026

I'm sending comments to ensure this PR is aligned to agreed plan:

  1. Trigger - 85% of context window (primary), history overflow (secondary)
    The 0.85 ratio is applied to the remaining history budget, not the full context window. Our architecture says the primary trigger should be when the overall context fills 85% of the window (system prompt + RAG + history + query >= 85% of context_window_size). Currently, a conversation could fill 70% of the context window but only 50% of the history budget, and compression would never trigger. Consider calculating the trigger against context_window_size directly rather than against the already-reduced available_tokens.
    85% of the remaining budget for history is the trigger
  2. Keep last 2 turns verbatim (with degradation guard)
    We agreed we'll do 5 last exchanges without degradation guard (try 4,3,2 ... if 5 doesn't fit) - aligned.
  3. Compact everything older into a structured summary via LLM call
    Aligned.
    Done
  4. Replace cache with compacted summary + preserved tail (destructive)
    Aligned.
  5. Truncation as fallback when compaction fails
    Truncation is not a fallback; it is a safety valve for when compaction occurs, to make sure that summarization does not create too large an entry.
    Partially aligned, but seems the fallback is destructive when it shouldn't be.
    When compaction fails, the PR still destructively rewrites the cache - deleting the full history and replacing it with only the fallback entries. This means a LLM failure permanently destroys conversation history. The architecture says "keep truncation/cut-off as fallback" - meaning use the existing limit_conversation_history truncation on the unchanged full cache. The fallback path should not call _rewrite_cache; it should simply return the full cache entries and let the downstream limit_conversation_history truncate them for this request, leaving the full history intact in cache for the next attempt.
    Fixed
  6. Send an event to the UI that OLS is compacting
    fixed

Here are events:

data: {"event": "start", "data": {"conversation_id": "47494b6d-70b0-48d5-9376-5d0b34d04f6b"}}

data: {"event": "history_compression_start", "data": {"status": "started"}}

data: {"event": "history_compression_end", "data": {"status": "success", "duration_ms": 1786.2}}

data: {"event": "token", "data": {"id": 0, "token": ""}}

data: {"event": "token", "data": {"id": 1, "token": "{\""}}

Missing (and also please create an associated UI story to show compaction is running in chat).
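The compaction events above could be produced with a small helper along these lines (the sse helper is illustrative; only the event names and payload shapes come from the PR's output):

```python
import json


def sse(event: str, data: dict) -> str:
    # Format one server-sent event exactly as shown in the stream above.
    return "data: %s\n\n" % json.dumps({"event": event, "data": data})


def compression_events(duration_ms: float) -> list:
    return [
        sse("history_compression_start", {"status": "started"}),
        sse("history_compression_end",
            {"status": "success", "duration_ms": duration_ms}),
    ]
```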

ainvoke = getattr(bare_llm, "ainvoke", None)
if not callable(ainvoke):
    raise TypeError("LLM object must provide callable ainvoke(messages)")
response = await ainvoke(messages)
Contributor

@blublinsky I think it's good to have a timeout here.

Contributor Author

Done
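A timeout guard of the kind requested above could look like this (the wrapper name and the 30-second default are illustrative assumptions):

```python
import asyncio


async def invoke_with_timeout(bare_llm, messages, timeout_s: float = 30.0):
    # Guard the summarization LLM call so a hung backend cannot stall
    # the request indefinitely; 30s is an assumed default, not the PR's.
    ainvoke = getattr(bare_llm, "ainvoke", None)
    if not callable(ainvoke):
        raise TypeError("LLM object must provide callable ainvoke(messages)")
    return await asyncio.wait_for(ainvoke(messages), timeout=timeout_s)
```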

@xrajesh
Contributor

xrajesh commented Mar 9, 2026

@blublinsky - Can we have a flag to turn summarization on/off? It need not be present in the CR. It may be handy for UI testing, and if users want to have control in the future.

@blublinsky
Contributor Author

@blublinsky - Can we have a flag to turn summarization on/off? It need not be present in the CR. It may be handy for UI testing, and if users want to have control in the future.

done

@xrajesh
Contributor

xrajesh commented Mar 10, 2026

@onmete - I see only one item left to be addressed from your list - item (1). I feel the trigger at 85% of available_tokens (total - response window - tool budget - system prompt - RAG - query) is fine, and safer, because we are operating specifically on the tokens available for history.
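The arithmetic being debated can be written down concretely (a sketch; COMPRESSION_RATIO and the parameter names are illustrative, with system prompt + RAG + query folded into one fixed term):

```python
COMPRESSION_RATIO = 0.85


def should_compress(context_window: int, response_window: int, tool_budget: int,
                    fixed_prompt_tokens: int, history_tokens: int) -> bool:
    # available_tokens = what is left for history after the fixed parts
    # of the prompt are budgeted out of the context window.
    available_tokens = (
        context_window - response_window - tool_budget - fixed_prompt_tokens
    )
    return history_tokens > COMPRESSION_RATIO * available_tokens
```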


assert len(result) == DEFAULT_ENTRIES_TO_KEEP
assert result[0].query.content == "[Previous conversation summary]"
assert result[1:] == cache_entries[:-1]
Contributor

Is this right? It seems this is dropping the latest message - we want to drop the oldest.

Also, test_compress_conversation_history_no_compression_needed is a misleading name, as compression does happen.
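For comparison, slicing that keeps the newest entries after the summary would look like this (illustrative values; keep mirrors the PR's DEFAULT_ENTRIES_TO_KEEP, assumed to be 5):

```python
entries = ["e1", "e2", "e3", "e4", "e5", "e6"]  # oldest .. newest
keep = 5  # assumed DEFAULT_ENTRIES_TO_KEEP

# Reserve one slot for the summary, then take the NEWEST entries:
# this drops the oldest messages rather than the latest one.
result = ["[Previous conversation summary]"] + entries[-(keep - 1):]
```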

conversation_id = retrieve_conversation_id(llm_request)
if not suid.check_suid(conversation_id):
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
Contributor

This should be 400.

@blublinsky blublinsky force-pushed the history-retrival branch 2 times, most recently from 0492f18 to fe1d936 Compare March 13, 2026 13:45
@blublinsky
Contributor Author

/retest

@openshift-ci

openshift-ci bot commented Mar 13, 2026

@blublinsky: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
