feat(backend): chat context compaction#261
Open
MesoX wants to merge 23 commits into
Open
Conversation
…paction Adds proactive + reactive context-window management so chats no longer hard-fail when history exceeds a model's window (docs/adr/0009). - Context-window resolution (context-window.ts + litellm-registry.ts): manual override → provider API → vendored litellm registry → conservative default, cached with TTL and evicted on modelMeta change. - Single token estimator over a neutral CountUnit structure (token-estimate.ts), shared by Tier 1 (UIMessages) and Tier 2 (ModelMessages). - Tier 1 cross-turn compaction (compaction.ts): prune-then-summarize behind a single versioned CAS watermark writer; hysteresis trigger/target ratios. - Tier 2 in-turn compaction via prepareStep; sub-agent wiring threads the window. - Recovery middleware (recovery.ts): detect provider context-overflow, trim and retry once; always on, independent of the proactive kill switch. - agent-runner wraps the model with recovery, threads RV1 prior-messages for C4 edit-detection, and stamps §H/§I run stats onto the assistant message. - Schema: additive nullable provider.modelMeta + chat compaction state + per-agent compaction config; migration 0046. RV2 workspace-scope check on chat submit; POST /:chatId/compact for on-demand compaction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ations Surfaces the compaction work in the chat UI. - Context-usage ring beside the model selector (§H): fill = last input tokens / resolved window for the selected model; neutral grey when the window is unknown/default. Clickable to compact on demand (§J). - Per-message stats popover next to Regenerate (§I): input/output tokens, TTFT and total generation time from the server-stamped metadata.stats. - Tool-call run durations in the tool header, reusing server start/complete timestamps with a client-observed fallback (useToolDuration). The hook is written to satisfy the react-hooks purity / set-state-in-effect / refs rules: render reads only state, and all writes are deferred into timer callbacks. - Per-agent compaction config fields in the agent form. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # apps/backend/src/services/chat-execution.ts
…nds in /v1
vLLM and most OpenAI-compatible providers set baseUrl to "{root}/v1" (the
OpenAI SDK needs it that way for chat calls). detectOpenAiCompatible appended
"/v1/models" to that, producing "{root}/v1/v1/models" → 404 → the window
silently fell to the 8192 default and the §H usage ring rendered "unknown".
Strip a trailing "/v1" before building the models URL so the probe hits
"{root}/v1/models" and reads max_model_len. Added a regression test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The §I stats popover (i) icon was gated on metadata.stats, which was only stamped post-stream and persisted to the DB — so it appeared a refetch round-trip after the answer finished, lagging the copy/delete/regenerate icons. Emit the stats via messageMetadata on the `finish` event so they ride the final stream chunk and the (i) appears the instant the answer completes. firstTokenAt is captured in streamText.onChunk (fires before finish) instead of the async snapshot drain. A single buildMessageStats() helper feeds both the streamed copy and the post-stream persist stamp (sharing one finishedAt) so live and reloaded stats match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… trace 11a (U5): ring no longer jumps back to pre-compaction value on user Send. Expiry now tracks `assistantMessageCount` instead of `messages.length` so optimistic user-message pushes don't trigger early fallback. 11b: wrap the agent-info (i) button in a Tooltip so hover reveals its purpose without opening the dialog. Shows agent description or "Agent info". 11c (§K): Tier 1 compaction is now visible in the chat timeline. - Backend emits synthetic `compact_context` tool-call + tool-result chunks into the UIMessage stream (via `prependCompactionChunks`) immediately after the `start` event; the existing tool-call expander renders them for free. - `CompactionTrace` threads through Tier1Output → ChatTurn.compactionTrace → agent-runner.stream() — only when a model summary was produced (prune-only turns produce no trace to avoid empty/confusing entries). - `COMPACT_CONTEXT_TOOL_NAME` constant shared across stream producer, strip filter, §J message builder, and frontend display-name mapping. - `stripCompactionTraceParts` removes trace parts before ModelMessage conversion so the provider never sees the phantom tool call on replay. - `buildCompactionTraceMessage` builds a standalone assistant message for §J (forced compact endpoint — no live stream to inject into). - `title: "Context compaction"` on the chunk + tool.tsx display-name mapping shows a friendly label instead of the raw underscore name. - `context-usage-ring.tsx`: improve neutral-state tooltip copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps the existing Popover trigger in a Tooltip using the TooltipTrigger asChild > PopoverTrigger asChild composition pattern. Hover shows compact stats (In/Out/TTFT/Total); click still opens the full popover. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chunk 12: drop the per-agent context-compaction override surface. Compaction now runs from DEFAULT_COMPACTION_CONFIG (gated by the COMPACTION_ENABLED kill switch); per-model config lands separately. - schema.ts: drop 6 agent columns (compaction_enabled, trigger_ratio, target_ratio, reserve_ratio, keep_recent_messages, min_prunable_chars) - schemas: drop matching base fields + the targetRatio<triggerRatio hysteresis refine; slim agentCreate/agentUpdate picks - compaction.ts: delete CompactionConfigOverrides + resolveCompactionConfig - chat-execution.ts: build config from DEFAULT_COMPACTION_CONFIG - agent-form.tsx: remove the Context compaction form block + state - tests: drop the per-agent config cases Migration: rewrite 0046 to never add the agent columns (it has not reached main) + trim its snapshot, rather than stacking a drop migration. Already-migrated DBs (dev, test server) had the columns dropped manually. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per-agent compaction config was removed in 625ff96; the `agent` parameter threaded into buildCompactionRuntime is now unused. Remove it from the args type, the destructure, and all three call sites. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After Chunk 12 compaction config is global (DEFAULT_COMPACTION_CONFIG + COMPACTION_ENABLED kill switch). Add optional env overrides for the ceiling ratios so a test deployment can lower the trigger and exercise auto-compaction without filling a large context window. Unset/blank/invalid env values fall back to the built-in defaults, so production behavior is unchanged. Knobs: COMPACTION_TRIGGER_RATIO, COMPACTION_TARGET_RATIO, COMPACTION_RESERVE_RATIO, COMPACTION_KEEP_RECENT, COMPACTION_MIN_PRUNABLE_CHARS. Documented in .env.example. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to compaction plan Survey of Hermes Agent, Codex CLI, Claude Code, Cline confirms the shipped design already implements real-token feeding (C1), input-tokens-only window (F1), reserve-carve trigger (C3 = Codex 90% cap), and two-layer proactive+ recovery (P4). Documents the Hermes token-floor (#14690) as a do-not-copy anti-pattern and defers three optional adds (message-count valve, maxOutput reserve floor, model-aware aggressiveness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the 2026-06-12 live test-server findings: per-step-timeout bug that kills pre-stream compaction (150s summarize > 120s watchdog), the 8,631-token runaway, and the lost-turn root cause. Add the four fixes (heartbeat, maxOutputTokens cap + concise prompt, stream-open-before-compaction with live Running status, abortSignal), the prior-art summarization-prompt survey (Claude Code / Codex / OpenCode / Hermes), the proposed replacement prompt, and the decided-against provider-model-selector. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pt overhaul Fix 1 (CRITICAL): thread `onActivity` into `buildCompactionRuntime` and tick it every 10 s via `setInterval` while `generateText` runs inside the `summarize` closure. The 120 s per-step stall watchdog now sees regular activity during a slow summarize call instead of silence, and no longer kills the run before the summary is committed. Fix 2: add `maxOutputTokens: 2000` ceiling to the summarize `generateText` call as a hard backstop against the runaway-expansion failure mode observed in the live test (8,631-token output for a 6,178-token input). Log `finishReason` and emit a `warn` if the ceiling is hit so the event is visible in prod logs. Replace the unstructured one-liner system prompt with a structured four-section handoff prompt (Intent / Decisions / Files / Next step) that survives repeated re-compactions and includes an explicit "aim under ~1500 tokens" length target. Fix 3 (stream-before-compaction) deferred — heartbeat addresses the immediate timeout bug; the live-indicator refactor is a larger architectural change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address review findings on the chunk 13 summarize changes: - Reorder the handoff prompt sections most-critical-first (Intent → Current state & next step → Decisions → Files) so a truncation at the maxOutputTokens ceiling drops file/tool detail rather than the resume-critical intent and next step. Add a front-load instruction. - Thread the run's AbortSignal through prepareChatTurn and buildCompactionRuntime into the summarizer generateText call. The Fix 1 heartbeat suppresses the per-step stall watchdog during summarize, so without this a hung call would leak until the 10 min per-run timeout; now a cancel/timeout aborts it. - Drop the redundant onActivity wrapper at the call site (the optional event param already satisfies the () => void heartbeat signature). - Reconcile the conflicting token-count comments (~150-200 vs ~1500) around the output ceiling. - Add tests for the summarize closure: abort signal + ceiling + ordered prompt threading, the finishReason === length warn path, and the heartbeat tick/clear lifecycle. Export buildCompactionRuntime for test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 2 compaction returned recent messages verbatim, so large tool results (e.g. MCP dumps) in the kept window dominated tokensAfter and prevented reaching targetTokens. Prune recent tool outputs with a higher threshold (minRecentPrunableChars, default minPrunableChars*5) and apply it across every over-target return path — including the empty-prefix / null-watermark bail where the whole history fits within keepRecentMessages. Warn when the post-compaction estimate still exceeds 2x target. Raise summarizer output ceiling 2000 -> 4000 to catch models that ignore the 1500-token prompt limit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three tasks: (1) Tier 1 recent-trim safety (64232d8 shipped; option D overflow-gate + exempt-newest remaining), (2) context-editing-style prune of kept tool results (needs review + bigger plan; session iteration captured), (3) ingestion cap for oversized MCP/sub-agent results (upstream-issue candidate, marked not filed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…newest Tier 1 recent (kept) tool results are now trimmed only when the kept view would breach the hard window wall (inputBudget), not on a soft targetTokens (hysteresis) miss. A soft miss is left at full fidelity and re-compacts next turn. The single newest message is always exempt. - UICompactOptions.inputBudget threaded from applyTier1Compaction as max(0, budget.inputBudget - overheadTokens) (mirrors effectiveTarget). - keepRecentWithinWall applied on both over-target paths (Stage 2 + empty prefix); omitted budget falls back to always-trim (pre-option-D guard). Review fixes: - Re-gate the over-target warning to afterEstimate > inputBudget; the old target*2 heuristic fired on every healthy compaction under a low target ratio. Falls back to target*2 when no wall is supplied. - keepRecentWithinWall returns the recent estimate to avoid a double pass. Tests: 57 pass (option-D verbatim/trim/empty-prefix/no-warn cases). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… elision) editToolResults + elidedToolPlaceholder: pure view transform eliding OLD bulky tool-result bodies to a self-describing placeholder before the Tier 1 trigger, so a leaned view can skip summarization. Recency by tool-result count, newest message exempt, size-gated; no durable state (P1). Post-review hardening: grow-guard (never elide when placeholder >= output; no prompt inflation / negative reclaim) + idempotency guard, both decided up front into a Map so the rewrite runs only on real work and never copies on a no-op. 3 config fields + 3 env overrides. +10 tests; runs/ suite 193 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction-clean # Conflicts: # apps/backend/src/runs/agent-runner.test.ts # apps/backend/src/runs/agent-runner.ts # apps/backend/src/services/chat-execution.ts # apps/backend/src/tools/sub-agent.ts
Defensive fixes to context compaction surfaced by chunk 14 review: - A1: projectTier1Tokens treats lastInputTokens<=0 as no-baseline (usage-less gateways persist contextTokens=0); findLastInputTokens skips trace messages and zero-count turns symmetrically. - A2: no-dimension image token fallback uses pessimistic per-provider ceilings (Anthropic 1600 / OpenAI-high 2000) instead of flat 1200. - A6: cap output reserve at half the window so an input-scoped contextWindow minus a large max_output_tokens cannot collapse inputBudget and thrash. - B-F1/C2: range-validate compaction env overrides (warn + default on out-of-range); restore target<trigger hysteresis clamp lost when chunk 12 removed resolveCompactionConfig. - B-F3: wrap sub-agent models with overflow-recovery middleware (built always; recovery is the P4 net even under the kill switch), guarded on a model instance so a string id degrades instead of dropping the sub-agent. - B-F7: report same-basis tokensBefore/After in compaction.fired. - B-F8/A4: forceCompactChat appends the trace via atomic jsonb concatenation instead of overwriting the whole column. - T10 deviation: version-mismatch skip no longer clears dirty (a concurrent invalidate also bumps version but leaves dirty on purpose). Typecheck clean; compaction + token-estimate suites 92 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Map-reduce summarizer no longer re-overflows: summarizePrefix checks the folded size (prior summary + framing) and recurses the reduce step; chunks split on message boundaries via packSegments instead of arbitrary char slices. Force-compact confirm now gates on significance (ADR-0012 §Force-compact): /compact returns tokensBefore/messagesDropped/keepRecentMessages and the context-window endpoint exposes keepRecentMessages so the client confirms only when the drop is significant, else runs immediately. Token estimator counts errorText on output-error UI parts, restoring the UI/Model adapter count equality (§One estimator). Sub-agent recovery/Tier 2 subtract per-sub-agent overhead. Extracted resolveCompactionConfig so the runtime and the context-window route share one config source. Also: drop fabricated Qwen3-72B registry key, rename ADR 0009→0012 (resolves collision with the invitation ADR-0009), refresh stale plan/0009 references, ring amber via CSS var with hex fallback, add cross-tenant 404 submit test, remove shipped internal plan doc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI `pnpm lint` failed on the compaction branch under type-aware rules: - require-await: de-async test mocks + two prod helpers (loadBuiltinRegistry, the default loadRegistry slot) that never await; return Promise.resolve/reject to preserve behavior. - no-unnecessary-type-assertion: drop redundant `as` casts (autofix). - no-unsafe-* / no-explicit-any: type the captured mock-call args in chat-execution/sub-agent tests instead of leaking `any`. - no-useless-assignment: drop the dead `agent` local in the force-compact path. No behavior change. Lint, typecheck, and 203 affected tests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Chat context compaction: keep chats alive past the model's context window
Design doc:
docs/adr/0012-context-compaction.md(ADR-0012)Problem
Chats hard-fail the moment their message history exceeds a model's context window. The AI SDK (
ai@6) reports real token usage after each call, but it exposes no context-window metadata on the model interface and no pre-call tokenizer. Providers also disagree on whether the window is discoverable at all: Google, OpenRouter, and vLLM expose it via their APIs, while OpenAI, Anthropic, and Bedrock do not. On top of that, the existing error handling only covered auth, rate-limit, and 5xx failures, so a context-overflow rejection killed the turn outright.What this PR does
This change adds a two-tier, view-not-delete compaction model. It is fed by a single token estimator, all of its durable state is mutated through a single versioned compare-and-swap (CAS) writer, an always-on recovery net catches the overflow errors that the proactive path misses, and a deterministic context-editing pass prunes stale, bulky tool results without ever calling a model. Because top-level chats and sub-agents both run through the shared
agent-runner/ToolLoopAgent, one implementation covers both.The full rationale — including every alternative we rejected and why — lives in ADR-0012. The key decisions, summarized from it, are below.
Load-bearing principles
CountUnit[]). Both Tier 1 (UIMessages) and Tier 2 (ModelMessages) normalize into it, and both count only model-bound parts (text,tool-call,tool-result,file,image). Divergence between the two tiers is therefore impossible by construction, because a tier cannot fire on a number the other tier never sees.summaryWatermark,contextSummary,compactionDirty) goes through a single compare-and-swap keyed on aversioncolumn. Concurrency — for example a trigger run racing a user run — and the interaction between compaction and history-edit invalidation are both resolved by version rather than by comparing watermark values. That matters because a watermark can move backward on a history edit, and a value-based comparison would misread that as "not yet advanced" and write a stale summary over mutated history. On a CAS conflict the loser re-reads the row: if the winner already covered its prefix it skips, otherwise it retries once and then skips with a contended warning. There is no recompute loop and no livelock.400/413overflow error is caught, the messages are aggressively trimmed in memory (through the same Tier 2 adapter, so there is no bespoke trimmer) and the call is retried exactly once. Recovery flagscompactionDirtyand leaves the durable compaction to the next turn. Crucially, recovery stays on even when proactive compaction is globally disabled — it is the last line of defense, not a risk surface.Mechanisms
resolveContextWindow(provider, modelId)resolves a window per model, in order: a manual override (provider.modelMeta), then provider API auto-detection (Google / OpenRouter / vLLM), then the community-maintained litellm registry JSON (which covers OpenAI / Anthropic / Bedrock), and finally a conservative8192default. We deliberately do not maintain our own context-window table. Key normalization is boundary-safe, results are cached per provider+model with a TTL, the cache is evicted immediately on amodelMetaedit,defaultresults use a short TTL, and API fetches use a 5 s timeout behind a single-flight guard.usageexists; every later turn uses the real, provider-reportedusage.inputTokens. We accept the first-turn imprecision — guarded by a 1.15 cold-start margin and the recovery net — rather than ship a per-provider tokenizer.prepareChatTurnover the durable history. Its budget math works off the input budget (window − maxOutputReserve − safetyReserve), with a trigger ratio of 0.8 and a target of 0.5. The two ratios are deliberately distinct to give hysteresis, so compaction does not re-fire on the very next turn. The work is staged cheap-first: Stage 0 context-edits (pruning bulky old tool results with no model call), Stage 1 prunes the older prefix, and Stage 2 summarizes the prefix into a single synthetic message only if the conversation is still over target. Tool-call/result pairs stay atomic across the keep boundary, summarization is incremental (only the messages after the watermark), and a visiblecontext-compactedevent keeps the behavior fail-loud.prepareStephook on bothstreamTextandgenerateText, summarizing old completed tool results while keeping recent steps verbatim and preserving call/result pairing. It fires only when genuinely near the limit, and it is not persisted — the next turn's Tier 1 folds the result into the durable summary.DEFAULT_COMPACTION_CONFIG); only the window/output size is per model. SettingCOMPACTION_ENABLED=falsedisables all proactive compaction in production without a deploy, and recovery ignores that flag. Per-agent tuning was shipped and then removed, because no surveyed tool exposes it and the ratios already self-normalize to the model window.Frontend
usedTokens / contextWindow, ramping from green to amber (≥ 0.7) to red (≥ 0.9), and rendering a neutral grey when the window is unknown or defaulted rather than guessing a ramp. The window comes from the currently selected model, and the numerator is the last response's peakcontextTokens.(i)popover under each assistant response shows input/output tokens, time-to-first-token, and total generation time, reusing the existing tool-call timing mechanism.POST /chats/:id/compactruns Tier 1 once regardless of the threshold and returns the post-compact usage so the ring refreshes immediately. A click is deferred while a response is streaming, and a confirmation dialog appears only when the drop would be significant.compact_contexttool-call/result pair that reuses the tool-call UI. The trace part is stripped beforeconvertToModelMessages, so it never replays to the provider as a phantom tool call.Notable rejected alternatives (full list in ADR §Considered Options)
max(window × pct, 64000)). Rejected because it overflows sub-64k models, where the trigger would never fire. A floor belongs only on the window fallback, never on the trigger.Schema changes (additive, nullable/defaulted)
provider.modelMeta(JSONB) — per-model window/output overrides.contextSummary,summaryWatermark,compactionDirty, andversion.apps/backend/drizzle/0046_context_compaction.sql.Rollout is lazy and there is no backfill: existing chats compact only on their next turn. An eager backfill would create a thundering herd of summarize calls.
Cross-tenant safety
The submit route verifies that the body
idbelongs to the caller's workspace before a run starts. The compaction store is keyed by chat id alone, so without that check one workspace could mutate another's summary and watermark.Observability
The feature emits structured
metric:-tagged log lines:compaction.fired,summarize.latency_ms,recovery.*,context_window.fell_to_default,litellm.key_miss,cas.conflict, andcontext_edited.Config
New
apps/backend/.env.examplekeys, all optional with built-in defaults:COMPACTION_ENABLED,COMPACTION_TRIGGER_RATIO,COMPACTION_TARGET_RATIO,COMPACTION_RESERVE_RATIO,COMPACTION_KEEP_RECENT,COMPACTION_MIN_PRUNABLE_CHARS, andCOMPACTION_MIN_RECENT_PRUNABLE_CHARS.Tests
New suites cover
compaction.test.ts,context-window.test.ts,token-estimate.test.ts, andrecovery.test.ts; existing suites foragent-runner.test.ts,chat.test.ts,chat-execution.test.ts,sub-agent.test.ts, andpackages/schemas/index.test.tswere extended. Recovery's per-provider overflow regex is fixture-tested across OpenAI/vLLM, Anthropic, Google, and Bedrock.Known limitations (ADR §Open/deferred)
Pending → Runningcompaction trace (it currently renders post-hoc only), the estimate-vs-real divergence metric (log-only for now), and content-type tool-result media accounting.🤖 Generated with Claude Code