
[None][fix] Batch addSequence with pre-claim to fix MNT overflow#12892

Draft
liji-nv wants to merge 2 commits into NVIDIA:main from
liji-nv:fix/batch-addsequence-mnt-overflow-v2

Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Apr 9, 2026

When KV cache reuse is enabled and add_sequence is called sequentially, an earlier add_sequence may evict blocks that other sequences still rely on for reuse. This makes the scheduler's token estimation unreliable, and at runtime the actual usage overflows max_num_tokens (MNT).

When host offloading is enabled, onboarding a host block to GPU during addSequence can additionally trigger eviction of other host blocks from the radix tree that are reusable for the current sequence. This causes actual KV cache reuse to fall short of the scheduler's estimate, leading to max_num_tokens (MNT) overflow assertions.

Add a new addSequenceBatch API that processes all first-chunk context requests in two phases:

  • Phase 1: Walk the radix tree and claimBlock() for all matching blocks across all requests. No onboarding, no allocation. This protects reusable blocks from eviction.
  • Phase 2: Onboard host blocks and allocate non-matching blocks. Since all reusable blocks are already claimed, evictions during onboarding cannot touch them.
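The two phases above can be sketched with a toy model; `Block`, `match_fn`, `onboard_fn`, and `allocate_fn` are hypothetical stand-ins for the C++ radix-tree machinery, not the actual API.

```python
# Toy model of the two-phase batch addSequence scheme.

class Block:
    def __init__(self, name, on_host=False):
        self.name = name
        self.on_host = on_host   # True: lives in host memory, needs onboarding
        self.claimed = False     # claimed blocks are exempt from eviction

def add_sequence_batch(requests, match_fn, onboard_fn, allocate_fn):
    # Phase 1: claim every matching block for every request up front, so
    # later onboarding/allocation cannot evict a block another request in
    # the batch is about to reuse.
    claims = []
    for req in requests:
        matched = match_fn(req)
        for blk in matched:
            blk.claimed = True   # claimBlock(): protect from eviction
        claims.append((req, matched))

    # Phase 2: onboard host blocks and allocate the non-matching remainder.
    results = []
    for req, matched in claims:
        for blk in matched:
            if blk.on_host:
                onboard_fn(blk)  # evictions here cannot touch claimed blocks
        results.append(allocate_fn(req, matched))
    return results
```

The key invariant is that no allocation or onboarding happens until every request's reusable blocks are claimed, which is what makes the scheduler's reuse estimate hold at runtime.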

Summary by CodeRabbit

Release Notes

  • New Features
    • Added batch sequence onboarding API for improved KV cache management, enabling more efficient allocation and reuse of cache blocks for multiple concurrent sequences.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv liji-nv requested a review from a team as a code owner April 9, 2026 14:01
@liji-nv
Collaborator Author

liji-nv commented Apr 9, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42543 [ run ] triggered by Bot. Commit: 2344609 Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Introduces a new two-phase batch sequence onboarding API for KV cache management. Phase 1 claims matching cached blocks under lock; Phase 2 onboards and allocates remaining blocks. Adds corresponding methods to WindowBlockManager, BlockManager, BaseKVCacheManager, and KVCacheManager classes, along with Python bindings and integration into the resource manager's context allocation path when block reuse is enabled.

Changes

Cohort / File(s) — Summary

  • KV Cache Manager Core — cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
    Added two-phase batch sequence onboarding: a new ClaimResult struct and three methods (claimMatchingBlocks, onboardAndAllocateBlocks, addSequenceBatch) in WindowBlockManager; addSequenceBatch delegation in BlockManager; a new pure virtual addSequenceBatch in BaseKVCacheManager and its implementation in KVCacheManager with input validation, sequence creation, cache offset updates, and statistics tracking.
  • Python Bindings — cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
    Added an add_sequence_batch Python-exposed binding method on BaseKVCacheManager that marshals list arguments into C++ tuples and reference vectors, forwarding to the underlying C++ implementation.
  • Resource Manager Integration — tensorrt_llm/_torch/pyexecutor/resource_manager.py
    Modified the prepare_resources method to conditionally batch eligible first-context-chunk requests when block reuse is enabled, calling add_sequence_batch() once per batch instead of per-request add_sequence calls, while preserving the original path for non-block-reuse scenarios.
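The resource-manager integration can be sketched as follows; `impl`, the request attributes, and the exact `add_sequence_batch` signature are assumptions based on this summary, not the actual TensorRT-LLM API.

```python
# Sketch of the conditional batching path in prepare_resources; the names
# below (impl, request fields) are illustrative assumptions.

def prepare_context_resources(impl, first_chunk_ctx_requests, enable_block_reuse):
    if enable_block_reuse:
        # Gather (request_id, prompt_len, beam_width) tuples and submit the
        # whole batch in one two-phase call.
        infos = [(r.request_id, r.prompt_len, r.beam_width)
                 for r in first_chunk_ctx_requests]
        impl.add_sequence_batch(infos, first_chunk_ctx_requests)
    else:
        # Original path: one add_sequence call per request.
        for r in first_chunk_ctx_requests:
            impl.add_sequence(r.request_id, r.prompt_len, r.beam_width, r)
```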

Sequence Diagram(s)

sequenceDiagram
    participant ResourceMgr as Resource Manager
    participant KVCacheMgr as KVCacheManager
    participant BlockMgr as BlockManager
    participant WindowBlockMgr as WindowBlockManager
    
    ResourceMgr->>KVCacheMgr: addSequenceBatch(requestInfos, llmRequests)
    
    activate KVCacheMgr
    KVCacheMgr->>KVCacheMgr: Validate constraints (block reuse enabled)
    KVCacheMgr->>KVCacheMgr: Create GenerationRequest sequences<br/>Compute inputLengths & numContextBlocksVec
    KVCacheMgr->>BlockMgr: addSequenceBatch(sequences, inputLengths,<br/>numContextBlocksVec, llmRequests, windowSize)
    
    activate BlockMgr
    BlockMgr->>WindowBlockMgr: addSequenceBatch(...)
    
    activate WindowBlockMgr
    WindowBlockMgr->>WindowBlockMgr: Acquire mCachedBlocksRootMutex
    
    loop For each sequence (Phase 1)
        WindowBlockMgr->>WindowBlockMgr: claimMatchingBlocks(sequence, inputLength,<br/>numContextBlocks, llmRequest)
        note over WindowBlockMgr: Radix-tree walk, claim blocks,<br/>update priorities
    end
    
    loop For each sequence (Phase 2)
        WindowBlockMgr->>WindowBlockMgr: onboardAndAllocateBlocks(sequence,<br/>llmRequest, claimResult)
        note over WindowBlockMgr: Onboard copied blocks, allocate<br/>non-matching & non-shared blocks,<br/>finalize prepopulated length
    end
    
    WindowBlockMgr->>WindowBlockMgr: Release mCachedBlocksRootMutex
    WindowBlockMgr-->>BlockMgr: Return prepopulated lengths
    deactivate WindowBlockMgr
    
    BlockMgr-->>KVCacheMgr: Return prepopulated lengths
    deactivate BlockMgr
    
    KVCacheMgr->>KVCacheMgr: Update per-window cache offsets
    KVCacheMgr->>KVCacheMgr: Set llmRequest prepopulated<br/>prompt lengths & clear reusable tokens
    KVCacheMgr-->>ResourceMgr: Return (batch complete)
    deactivate KVCacheMgr
    
    ResourceMgr->>ResourceMgr: Apply per-token add_token calls<br/>& KV connector updates

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning — The PR description is incomplete: it lacks a formal title, test coverage details, and unchecked PR checklist items required by the template. Resolution: add a properly formatted title following the template format (e.g., [TRTLLM-XXXX][fix] Summary), complete the Test Coverage section with specific test details, and verify all PR checklist items.

✅ Passed checks (1 passed)

  • Title check ✅ Passed — The title clearly identifies the main fix: batching addSequence with pre-claim to address MNT overflow when host offloading is enabled.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 1789-1796: The Python trampoline class PyKvCacheManager is missing
a pure-virtual override for the new C++ method addSequenceBatch, causing Python
subclasses not to dispatch; add an override method in PyKvCacheManager with the
signature matching addSequenceBatch that calls
NB_OVERRIDE_PURE(addSequenceBatch, requestInfos, llmRequests), and increment the
NB_TRAMPOLINE count for tbk::BaseKVCacheManager from 36 to 37 so the trampoline
table matches the new virtual. Ensure the override uses the exact parameter
types (std::vector<std::tuple<tb::LlmRequest::RequestIdType, SizeType32,
SizeType32>> const& requestInfos and
std::vector<std::reference_wrapper<tb::LlmRequest>> const& llmRequests) and that
NB_TRAMPOLINE is updated accordingly.

In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 3408-3411: The current approach captures baseline totals (numAllocTotalBlocksPre, numAllocNewBlocksPre, numReusedBlocksPre, numMissedBlocksPre) once for the whole batch and derives per-request stats by subtracting those global baselines, so earlier requests' work gets attributed to later ones. Instead, compute per-request deltas by snapshotting the totals immediately before and after processing each individual request, or by recording increments during Phase 2 as each request performs allocations, reuses, and misses. Update updateAllocTotalPerRequest/updateAllocNewPerRequest/updateReusedPerRequest/updateMissedPerRequest to use these per-request snapshots or Phase 2 counters rather than the single batch-wide pre-snapshot variables, so each request accounts only for its own changes.
- Around line 3387-3449: Before calling mBlockManager.addSequenceBatch in KVCacheManager::addSequenceBatch, enforce the same recurrent-state chunking guard that WindowBlockManager::addSequence applies: for each request, if its kvCacheRetentionConfig indicates a recurrent-state retention mode and inputLength != llmRequest.getPromptLen(), fail early (e.g. with TLLM_CHECK_WITH_INFO) with a message that batched recurrent-state requests must have inputLength equal to the prompt length. Perform this check using the already-captured kvCacheRetentionConfig and llmRequest variables for each index before invoking mBlockManager.addSequenceBatch.

In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Around line 396-413: The lambda bound to "add_sequence_batch" currently uses nb::call_guard<nb::gil_scoped_release>(), which drops the GIL for the entire body and makes the nb::len, list-indexing, and nb::cast calls unsafe. Instead, hold the GIL while marshalling the Python inputs into the C++ locals (build requestInfos and llmRequests using nb::len, nb::cast, nb::tuple, nb::list, std::ref) and release it only around the actual call to BaseKVCacheManager::addSequenceBatch, either by applying the call guard to just that invocation or by using an explicit gil_scoped_release scope for that call.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 644-678: The batching path currently calls impl.add_sequence_batch(batch_request_infos, batch_llm_requests) for all block-reuse requests, which breaks VSWA. Batch only when self.is_vswa is False (i.e. gate the add_sequence_batch call with not self.is_vswa); when self.is_vswa is True, fall back to the per-request impl.add_sequence(...) flow used earlier, including the subsequent impl.add_token(...) loops and kv_connector_manager.update_state_after_alloc calls for each request. Keep the logic around enable_block_reuse, batch_request_infos, batch_llm_requests, and batch_ctx_requests consistent so VSWA continues to use the single-request path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 21f2daa7-dcc7-468a-a725-8af6a3a2a2f3

📥 Commits

Reviewing files that changed from the base of the PR and between 3e942cc and 2344609.

📒 Files selected for processing (4)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py

@tensorrt-cicd
Collaborator

PR_Github #42543 [ run ] completed with state FAILURE. Commit: 2344609
/LLM/main/L0_MergeRequest_PR pipeline #33280 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@liji-nv liji-nv force-pushed the fix/batch-addsequence-mnt-overflow-v2 branch 3 times, most recently from a9a7f73 to e5d8513 Compare April 10, 2026 04:48
@liji-nv
Collaborator Author

liji-nv commented Apr 10, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42669 [ run ] triggered by Bot. Commit: e5d8513 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42669 [ run ] completed with state SUCCESS. Commit: e5d8513
/LLM/main/L0_MergeRequest_PR pipeline #33377 completed with status: 'FAILURE'

CI Report

Link to invocation

liji-nv added 2 commits April 13, 2026 00:08
…NT overflow

When host offloading is enabled, onboarding a host block to GPU during
addSequence can trigger eviction of other reusable host blocks from
the radix tree. This causes actual KV cache reuse to be less than the
scheduler estimated, leading to max_num_tokens (MNT) overflow assertions.

Add a new addSequenceBatch API that processes all first-chunk context
requests in two phases:
- Phase 1: Walk the radix tree and claimBlock() for all matching blocks
  across all requests. No onboarding, no allocation. This protects
  reusable blocks from eviction.
- Phase 2: Onboard host blocks and allocate non-matching blocks. Since
  all reusable blocks are already claimed, evictions during onboarding
  cannot touch them.

On the Python side, replace the TOCTOU-prone revalidation loop
(count_reusable_blocks + budget check) with a single batch call.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
…r for batch addSequence

In addSequenceBatch Phase 1, calling freeLeafBlock on partial-match
leaf blocks modifies the trie structure while other requests' blocks
are still attached. This can cause storeContextBlocks to encounter
blocks whose trie nodes were orphaned, leading to the assertion:
"cascade prune: parent did not find this node as a child".

Fix by deferring freeLeafBlock to Phase 2 (onboardAndAllocateBlocks),
where it is safe because all ancestor blocks have been claimed and
their trie values protect against cascade pruning.

Add PartialClaimTracker to coordinate partial-match ownership across
requests in the same batch:
- If a later request fully matches a leaf block, all earlier partial
  matchers are marked needsCopy.
- If only partial matches exist, the last request reuses the block
  in-place; earlier ones copy.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the fix/batch-addsequence-mnt-overflow-v2 branch from fd77def to 62bd31e Compare April 13, 2026 07:08
@liji-nv
Collaborator Author

liji-nv commented Apr 13, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42980 [ run ] triggered by Bot. Commit: 62bd31e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42980 [ run ] completed with state SUCCESS. Commit: 62bd31e
/LLM/main/L0_MergeRequest_PR pipeline #33634 completed with status: 'FAILURE'

CI Report

Link to invocation

@liji-nv liji-nv requested review from thorjohnsen and removed request for PerkzZheng April 14, 2026 03:15
@liji-nv liji-nv changed the title [None][fix] Batch addSequence with pre-claim to fix host offloading M… [None][fix] Batch addSequence with pre-claim to fix MNT overflow Apr 14, 2026
Collaborator

@thorjohnsen thorjohnsen left a comment

I agree with the main change in this PR: claiming the blocks as soon as they are found for reuse will increase reuse. I expect this will break some of the unit tests in kvCacheManagerTest.cpp, since the blocks that are returned will change, so you will probably need to update those as well. I'll give it a more thorough review tomorrow.

{
// Unreferenced non-leaf: could be in the free queue and evictable.
// Claim to protect during Phase 2 copies; deferred release at batch end.
mEvictionPolicy->claimBlock(matchingBlock, result.perBlockRetentions[bi].retentionPriority,
Collaborator Author

There is a bug where this claim never gets freed correctly in Phase 2. The new PR addresses this.

@liji-nv
Collaborator Author

liji-nv commented Apr 14, 2026

Please review #13029 instead; it contains fixes for the corner case.

@liji-nv liji-nv marked this pull request as draft April 14, 2026 13:43
3 participants