LLM-1633: Protect context integrity on model switch#255
Conversation
Kimchi Code Review
Summary📊 Review Score: 72/100 (overall code quality — 0 lowest, 100 highest) 🧪 Tests: yes — Comprehensive unit tests were added for 📝 Found 5 issue(s). See inline comments for details. What to expectKimchi will analyze the changes in this pull request and post:
The review typically completes within a few minutes. This comment will be updated once the review is ready. Interact with Kimchi
ConfigurationReviews are configured by your organization admin. Powered by Kimchi — AI-powered code review by CAST AI |
There was a problem hiding this comment.
📊 Review Score: 72/100 (overall code quality — 0 lowest, 100 highest)
⏱️ Estimated effort to review: 3/5 (1 = trivial, 5 = very complex)
🧪 Tests: yes — Comprehensive unit tests were added for estimateTokens, hasImages, stripImages, and truncateMessages, plus integration tests for the extension handler. However, one truncation test does not actually exercise the truncation path it claims to verify, providing false confidence.
📝 Found 5 issue(s). See inline comments for details.
| let tokens = 0 | ||
| for (const msg of messages) { | ||
| // Use precise usage from assistant messages when available | ||
| if (msg.role === "assistant" && "usage" in msg && msg.usage?.totalTokens) { |
There was a problem hiding this comment.
estimateTokens sums usage.totalTokens across every assistant message in the array. Because totalTokens on a historical assistant turn typically reflects the cumulative token count for that entire API request (including all prior history), adding them together multi-counts earlier conversation history. This produces a grossly inflated estimate and causes premature truncation, stripping valid context that would actually fit within the model window.
💡 Suggestion: Use only the most recent assistant message's usage.totalTokens as the baseline, then add incremental char/4 and image estimates for any messages that follow it. Avoid summing historical usage totals.
| // Walk backwards to find the smallest index i such that messages[i..] fits | ||
| // within effectiveMax AND has at least 2 messages. Start from the minimum | ||
| // cutoff that preserves 2 messages (length - 2). | ||
| let cutoff = messages.length |
There was a problem hiding this comment.
If the last two messages alone exceed effectiveMax, the backward search loop never finds a valid cutoff, leaving it at messages.length. This makes pruned an empty array, so the function returns an array containing only the truncation notice — dropping every message and violating the documented contract to unconditionally preserve at least the last two messages.
💡 Suggestion: After the loop, if cutoff === messages.length, force cutoff = Math.max(0, messages.length - 2) to ensure the final two messages are always retained, even when they exceed the nominal budget.
| expect(result.length).toBeLessThan(30) | ||
| }) | ||
|
|
||
| it("always preserves at least the last 2 messages", () => { |
There was a problem hiding this comment.
The test "always preserves at least the last 2 messages" passes four tiny messages to truncateMessages(msgs, 100). Because the effective budget (floor(100 * 0.95) - 11 ≈ 84) is far larger than the estimated tokens of those messages, the function returns the original reference unchanged. The test therefore never exercises actual truncation or proves the minimum-2 guarantee under pressure.
💡 Suggestion: Force real truncation by using a maxTokens value small enough that even the last two messages exceed the budget (e.g., 1), or inflate the message content size, then assert that the final two messages remain in the result.
| // within effectiveMax AND has at least 2 messages. Start from the minimum | ||
| // cutoff that preserves 2 messages (length - 2). | ||
| let cutoff = messages.length | ||
| for (let i = messages.length - 2; i >= 0; i--) { |
There was a problem hiding this comment.
For each candidate index the code creates a new array with messages.slice(i) and rescans it via estimateTokens. Both operations are O(n) and the loop runs O(n) times, yielding O(n²) time complexity and O(n²) temporary allocations. For long conversation histories this repeated re-scanning is a CPU and memory hotspot.
💡 Suggestion: Replace the repeated slicing and full rescans with a single reverse pass that accumulates a running token sum, or pre-compute per-message estimates and walk backwards while subtracting from a pre-calculated total.
| * Returns the original reference when there is nothing to strip. | ||
| */ | ||
| export function stripImages(messages: ContextEvent["messages"]): ContextEvent["messages"] { | ||
| let changed = false |
There was a problem hiding this comment.
ℹ️🔧 Maintainability
In stripImages, the changed flag is declared outside the messages.map callback. Once any message contains an image, changed remains true for all later iterations, so every subsequent message — even those without images — is unnecessarily shallow-cloned and assigned a new content array produced by content.map(...).
💡 Suggestion: Track changes per-message by declaring a msgChanged flag inside the map callback, and only return { ...msg, content: stripped } when that specific message actually contained an image.
Summary
When users switch between models mid-conversation, the new model may have different capabilities and limits than the original one. This PR introduces a model guard extension that automatically adapts the conversation context to the target model's constraints, preventing silent failures and data loss.
What it does
Image stripping for non-vision models -- If the user switches to a model that does not support image inputs, the guard replaces all image content blocks with text placeholders. This keeps the conversation flowing instead of failing on unsupported content.
Context window overflow protection -- When the accumulated token usage exceeds the target model's context window (with a safety margin), the guard truncates the oldest messages to fit. It always preserves at least the two most recent messages and prepends a truncation notice so the model knows context was lost.
Reject switch when context already overflows -- The model switch tool itself now checks whether the current context size exceeds the target model's window before performing the switch. If it does, the switch is rejected with a clear message asking the user to compact first.
Architecture
The guard is implemented as a standard extension (
model-guard) that hooks into thecontextevent pipeline. It runs before each model invocation and returns modified messages only when adjustments are needed -- otherwise it is a no-op. The image stripping and truncation logic are pure functions with no side effects, making them straightforward to test and reason about.The switch rejection lives in the existing
model-switchextension, right before the actualsetModelcall.Test plan