[None][refactor] Batch addSequence with two-phase claim and unified VSWA/non-reuse support #13029
liji-nv wants to merge 4 commits into NVIDIA:main
Conversation
📝 Walkthrough
These changes introduce a new batch KV-cache sequence allocation API implementing a two-phase "claim-then-onboard" approach for context cache reuse. The feature adds data structures and methods across the C++ KV cache manager hierarchy and the Python bindings, enabling efficient multi-request context block claiming and onboarding with deferred allocation/reuse decisions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PyExec as PyExecutor
    participant KVCacheMgr as KVCacheManager
    participant BlockMgr as BlockManager
    participant WindowMgr as WindowBlockManager
    PyExec->>KVCacheMgr: addSequenceBatch(requestInfos, llmRequests)
    KVCacheMgr->>BlockMgr: addSequenceBatch(sequences, inputLengths, numContextBlocks, ...)
    activate BlockMgr
    BlockMgr->>WindowMgr: Phase 1: claimMatchingBlocks(...) or buildClaimResultMetadata(...)
    activate WindowMgr
    WindowMgr-->>BlockMgr: ClaimResult (matched blocks, metadata)
    deactivate WindowMgr
    BlockMgr->>WindowMgr: Phase 2: onboardAndAllocateBlocks(per request)
    activate WindowMgr
    WindowMgr-->>BlockMgr: BatchSeqStats (allocation results)
    deactivate WindowMgr
    BlockMgr-->>KVCacheMgr: vector<BatchSeqStats>
    deactivate BlockMgr
    KVCacheMgr->>KVCacheMgr: Update cache offsets & aggregated stats
    KVCacheMgr-->>PyExec: Allocation complete
```
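The claim-then-onboard flow above can be sketched in miniature. This is a hypothetical toy model, not the TensorRT-LLM implementation; all names (`ToyBlockManager`, `claim_matching_blocks`, `onboard_and_allocate`) are illustrative. Phase 1 resolves every request's cache matches under one lock so eviction cannot race with the batch; Phase 2 allocates the remainder per request without revisiting the cache index.

```python
import threading

class ToyBlockManager:
    """Toy stand-in for the batch claim-then-onboard pattern."""

    def __init__(self, cached):
        self.cached = cached      # token-chunk -> cached block id
        self.claimed = {}         # request id -> claimed block ids
        self.lock = threading.Lock()
        self.next_block = 100     # next fresh block id to hand out

    def claim_matching_blocks(self, requests):
        # Phase 1: one lock covers all matches, so no block that another
        # request in the batch will reuse can be evicted mid-batch.
        with self.lock:
            for req_id, chunks in requests:
                self.claimed[req_id] = [
                    self.cached[c] for c in chunks if c in self.cached
                ]

    def onboard_and_allocate(self, req_id, num_blocks):
        # Phase 2: top up with fresh blocks, no re-traversal of the cache.
        blocks = list(self.claimed[req_id])
        while len(blocks) < num_blocks:
            blocks.append(self.next_block)
            self.next_block += 1
        return blocks

mgr = ToyBlockManager(cached={"a": 0, "b": 1})
mgr.claim_matching_blocks([("r1", ["a", "b", "c"]), ("r2", ["a"])])
r1_blocks = mgr.onboard_and_allocate("r1", 3)   # reuses blocks 0 and 1
r2_blocks = mgr.onboard_and_allocate("r2", 2)   # reuses block 0
```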
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
Line 2: ⚠️ Potential issue | 🟡 Minor: Update the modified-file copyright year.
This file was changed in 2026, but the header still ends at 2025.
As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` at line 2, Update the SPDX copyright header year range in the file by changing the trailing year from 2025 to 2026; locate the SPDX header line starting with "SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the range to include 2026 so the header reflects the file modification year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 1890-1898: The docstring for addSequenceBatch incorrectly states
it requires block reuse; update the public comment for addSequenceBatch to
reflect that it supports both the block-reuse path and the non-reuse path that
uses buildClaimResultMetadata(...) — explain the two-phase claim-then-onboard
behavior and that for non-reuse callers the buildClaimResultMetadata helper is
used to prepare claim results before onboarding/allocating blocks; reference the
methods addSequenceBatch and buildClaimResultMetadata so callers know both code
paths are supported.
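The two code paths named in this comment can be pictured with a small sketch (hypothetical names and result shapes, not the actual addSequenceBatch or buildClaimResultMetadata signatures): the reuse path claims blocks from a cache walk, while the non-reuse path builds an equivalent empty claim, so both converge on the same onboarding step.

```python
# Hypothetical sketch: one batch entry point serving both the block-reuse
# path (radix-tree claim) and the non-reuse path (metadata-only claim).
def add_sequence_batch(requests, enable_block_reuse):
    def claim_matching_blocks(req):
        # Reuse path: pretend the tree matched half of the prompt blocks.
        return {"req": req, "matched": req["num_blocks"] // 2}

    def build_claim_result_metadata(req):
        # Non-reuse path: same result shape, no tree walk, nothing matched.
        return {"req": req, "matched": 0}

    claim = claim_matching_blocks if enable_block_reuse else build_claim_result_metadata
    results = [claim(r) for r in requests]          # Phase 1
    for res in results:                             # Phase 2
        res["allocated"] = res["req"]["num_blocks"] - res["matched"]
    return results

batch = [{"num_blocks": 4}, {"num_blocks": 3}]
reuse = add_sequence_batch(batch, enable_block_reuse=True)
no_reuse = add_sequence_batch(batch, enable_block_reuse=False)
```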
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 2133-2141: The function BlockManager::addSequenceBatch is
misformatted; run LLVM clang-format (max 120 chars/line) on this overload to fix
line wrapping/indentation issues and reformat the block containing the signature
and return statement; ensure the signature and the return call to
mWindowBlockManagers.at(windowSize).addSequenceBatch(...) conform to project
style and line-length limits so clang-format passes CI.
- Around line 1414-1424: The code unconditionally dereferences
llmRequest.getEncoderUniqueTokens() for non-self cache paths, causing optional
access failures for cross-KV requests without encoder tokens; change the logic
that computes uniqueTokens/blockKeys to first check for encoder tokens (e.g.,
bool hasUniqueTokens = (mCacheType==CacheType::kSELF ||
mCacheType==CacheType::kSELFKONLY) ||
llmRequest.getEncoderUniqueTokens().has_value()), and only call
llmRequest.getEncoderUniqueTokens().value(), chopVectorIntoBlocks<UniqueToken>,
and buildBlockKeys when hasUniqueTokens is true; otherwise leave
result.blockKeys empty (skip block-key generation). Apply the same
hasUniqueTokens guard in the later stats update site around the code referencing
getEncoderUniqueTokens() so both locations behave consistently for cross-KV
requests without encoder tokens.
- Around line 1636-1644: The new branch in addSequenceBatch() increments only
mReusedBlocks when claimed.isPlaceholder is false, causing mismatch with
loadOrAllocateBlocks() which also updates mFullReusedBlocks and
mPartialReusedBlocks; update this path to detect full vs partial reuse and
increment mFullReusedBlocks or mPartialReusedBlocks accordingly (in addition to
mReusedBlocks), using the same criteria/load conditions as
loadOrAllocateBlocks(), and keep the unique-block tracking via reusedBlockIds
and matchingBlockId unchanged so getAndResetIterationStats() reports the same
breakdown as before.
- Around line 1444-1451: The current shareLastContextBlockAmongBeams logic
diverges from loadOrAllocateBlocks(): restore the old rule so a full last
context block — and any cross-KV context block logic used by
loadOrAllocateBlocks() — is still shared for beamWidth > 1; update the
computation in kvCacheManager.cpp (variables/functions:
shareLastContextBlockAmongBeams, numSharedContextBlocks, isRecurrentState(),
sequence.getBeamWidth(), inputLength, mTokensPerBlock) to mirror the branching
in loadOrAllocateBlocks()/WindowBlockManager::addSequence (i.e., set
shareLastContextBlockAmongBeams true when beamWidth==1 OR when beamWidth>1 and
the last block is full or identified as a cross-KV context block by the same
predicate used in loadOrAllocateBlocks()), then recompute numSharedContextBlocks
accordingly.
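The encoder-token guard described above might look like the following Python analogue (hypothetical names; the real code is C++ using std::optional): the encoder-token optional is only dereferenced when the cache type implies self tokens or the optional is populated, and a cross-KV request without encoder tokens simply produces no block keys.

```python
from typing import Optional, List, Tuple

SELF, SELF_K_ONLY, CROSS = "kSELF", "kSELFKONLY", "kCROSS"

def block_keys_for(cache_type: str,
                   unique_tokens: List[int],
                   encoder_tokens: Optional[List[int]],
                   tokens_per_block: int) -> List[Tuple[int, ...]]:
    # Only touch the encoder-token optional when it is actually populated.
    has_unique = (cache_type in (SELF, SELF_K_ONLY)
                  or encoder_tokens is not None)
    if not has_unique:
        return []  # cross-KV request without encoder tokens: skip key generation
    toks = unique_tokens if cache_type in (SELF, SELF_K_ONLY) else encoder_tokens
    # Chop the token stream into block-sized keys.
    return [tuple(toks[i:i + tokens_per_block])
            for i in range(0, len(toks), tokens_per_block)]
```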
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 71: NB_TRAMPOLINE currently reserves 37 trampoline slots but
PyKvCacheManager defines 39 NB_OVERRIDE_PURE overrides, causing nanobind virtual
dispatch failures; update the NB_TRAMPOLINE(tbk::BaseKVCacheManager, 37)
invocation to reserve at least 39 slots (e.g.,
NB_TRAMPOLINE(tbk::BaseKVCacheManager, 39)) to match the overrides, and update
the file copyright year string from "2022-2025" to "2022-2026".
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 801-806: The draft path only reserves self.num_extra_kv_tokens and
ignores num_extra_decoding_steps and the draft manager's own extra-KV setting;
update the loop after draft_kv_cache_manager.impl.add_sequence_batch to compute
the number of extra tokens to add per request as (draft manager's own extra-KV
if available, otherwise self.num_extra_kv_tokens) plus
self.num_extra_decoding_steps, then call
draft_kv_cache_manager.impl.add_token(req_id) that many times for each req_id
from draft_batch_request_infos so the draft KV cache mirrors the main allocation
and respects the draft manager's configuration (referencing
draft_kv_cache_manager, add_sequence_batch, add_token, self.num_extra_kv_tokens,
num_extra_decoding_steps, and
draft_batch_request_infos/draft_batch_llm_requests).
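A sketch of the suggested computation, with hypothetical names (the real attributes live on the draft KV cache manager): the per-request extra-token count combines the draft manager's own extra-KV setting, falling back to the main manager's, with the number of extra decoding steps, and add_token is then called that many times per request.

```python
def draft_extra_tokens(draft_mgr_extra_kv, main_extra_kv, num_extra_decoding_steps):
    # Prefer the draft manager's own setting; fall back to the main manager's.
    base = draft_mgr_extra_kv if draft_mgr_extra_kv is not None else main_extra_kv
    return base + num_extra_decoding_steps

def reserve_draft_tokens(add_token, req_ids, n_extra):
    # Mirrors calling draft_kv_cache_manager.impl.add_token(req_id) n times.
    for req_id in req_ids:
        for _ in range(n_extra):
            add_token(req_id)

calls = []
reserve_draft_tokens(calls.append, ["r1", "r2"], draft_extra_tokens(None, 2, 1))
```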
---
Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 2: Update the SPDX copyright header year range in the file by changing
the trailing year from 2025 to 2026; locate the SPDX header line starting with
"SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the
range to include 2026 so the header reflects the file modification year.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e984bffe-8c4a-45cf-8acd-aded0d1b9173
📒 Files selected for processing (4)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
…SWA/non-reuse support

Introduce addSequenceBatch with a two-phase claim-then-onboard strategy that prevents host offloading from evicting reusable blocks in the radix tree. Extend it to handle all cases previously requiring separate addSequence calls:

- Two-phase strategy: Phase 1 claims matching blocks under a single lock, using PartialClaimTracker to coordinate partial-match ownership across requests. Phase 2 onboards and allocates without re-traversing the tree.
- VSWA: Loop over all window sizes in KVCacheManager::addSequenceBatch, tracking minPrepopulatedLen across windows per sequence.
- Non-reuse: Add buildClaimResultMetadata() to build ClaimResult without a radix tree walk. Recurrent state placeholder logic branches on isEnableBlockReuse (aligned with PR NVIDIA#10437).
- Star attention: Compute the special seq_len in Python and route through the batch path.
- Dummy requests: Batch add_sequence calls in add_dummy_requests, deferring is_gen state modifications until after the batch call.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
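The VSWA bookkeeping mentioned in this commit can be illustrated with a toy helper (hypothetical data shape; the real tracking happens inside KVCacheManager::addSequenceBatch): with several attention window sizes, the prepopulated length usable for a sequence is the minimum over all windows, since every window's block manager must have that prefix cached.

```python
def min_prepopulated_len(prepopulated_by_window):
    """prepopulated_by_window: {window_size: [prepopulated len per sequence]}"""
    # Transpose to per-sequence tuples, then take the minimum across windows.
    per_seq = zip(*prepopulated_by_window.values())
    return [min(lens) for lens in per_seq]
```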
Migrate all 84 kvCacheManager.addSequence calls in kvCacheManagerTest.cpp to addSequenceBatch (batch-size-1 for interleaved tests, true batching for 2 consecutive-call tests). Fix partial/full reuse stat tracking in onboardAndAllocateBlocks. Update the SecondaryBlockPrimaryChildTest assertion to match the improved block recovery from the two-phase claim strategy.

BlockManager-level addSequence calls are kept as-is since they test the lower-level API directly.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
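The stat-tracking fix mentioned here amounts to classifying each claimed block as a full or partial reuse, as in this sketch (hypothetical counter and field names): every non-placeholder claim bumps the total reuse counter plus exactly one of the full/partial counters.

```python
def update_reuse_stats(stats, claimed_blocks, tokens_per_block):
    for blk in claimed_blocks:
        if blk["is_placeholder"]:
            continue  # placeholders are not reused cache content
        stats["reused_blocks"] += 1
        if blk["matched_tokens"] == tokens_per_block:
            stats["full_reused_blocks"] += 1     # whole block matched
        else:
            stats["partial_reused_blocks"] += 1  # only a prefix matched
    return stats

stats = {"reused_blocks": 0, "full_reused_blocks": 0, "partial_reused_blocks": 0}
claimed = [
    {"is_placeholder": False, "matched_tokens": 32},  # full reuse
    {"is_placeholder": False, "matched_tokens": 7},   # partial reuse
    {"is_placeholder": True, "matched_tokens": 0},    # placeholder
]
update_reuse_stats(stats, claimed, tokens_per_block=32)
```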
…uenceBatch

When claimMatchingBlocks claims an unreferenced non-leaf partial-match block to protect it during Phase 2 copies, the block is removed from the free queue. Previously it was only released at the end of the batch (deferred release), which was too late: subsequent getFreeBlock calls for non-matching blocks could not find free blocks in tight pools.

Fix: use PartialClaimTracker to assign release responsibility to the last copier in the batch. After the copy completes in Phase 2, the responsible copier releases the source back to the free queue via the shouldReleaseCopySource flag on ClaimedBlock. A full match on the same block revokes the release responsibility.

Also migrate BlockManagerReuseTest to addSequenceBatch to validate the fix, and update the full-match tracker branch to handle both leaf and non-leaf blocks.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
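The last-copier-release rule in this commit can be modelled with a toy tracker (hypothetical class; the real PartialClaimTracker is C++ inside the block manager): each later partial match takes over responsibility for releasing the copy source, and a full match revokes release entirely because the block then stays referenced.

```python
class PartialClaimTracker:
    def __init__(self):
        # block id -> request id responsible for release, or None once revoked
        self.release_owner = {}

    def on_partial_match(self, block_id, req_id):
        if self.release_owner.get(block_id, "unset") is not None:
            self.release_owner[block_id] = req_id  # last copier takes over

    def on_full_match(self, block_id):
        self.release_owner[block_id] = None        # reused whole: never release

    def should_release(self, block_id, req_id):
        return self.release_owner.get(block_id) == req_id

tracker = PartialClaimTracker()
tracker.on_partial_match(7, "r1")
tracker.on_partial_match(7, "r2")                  # r2 copies last
r1_releases = tracker.should_release(7, "r1")
r2_releases = tracker.should_release(7, "r2")
tracker.on_full_match(7)                           # full match revokes release
r2_releases_after_full = tracker.should_release(7, "r2")
```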
…e-last-block logic

- Update the addSequenceBatch docstring to reflect support for both block-reuse and non-reuse paths via buildClaimResultMetadata.
- Guard encoder unique token access in claimMatchingBlocks and onboardAndAllocateBlocks with a hasUniqueTokens check, matching buildClaimResultMetadata and WindowBlockManager::addSequence (PR NVIDIA#10437) for cross-KV requests without encoder tokens (e.g., Whisper).
- Align shareLastContextBlockAmongBeams in claimMatchingBlocks with the unified formula from loadOrAllocateBlocks (PR NVIDIA#10437): isShareLastContextBlock = kCROSS || inputLength % tokensPerBlock == 0.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
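The unified share-last-block rule restated in the last bullet reduces to a small computation, sketched here with illustrative names (the real logic is C++ in claimMatchingBlocks/loadOrAllocateBlocks): for beam search, the last context block is shared across beams only when it is completely full or the request is cross-KV.

```python
import math

def num_shared_context_blocks(beam_width, input_length, tokens_per_block, is_cross_kv):
    num_context_blocks = math.ceil(input_length / tokens_per_block)
    # isShareLastContextBlock = kCROSS || inputLength % tokensPerBlock == 0;
    # with beamWidth == 1 there is nothing to fork, so everything is shared.
    share_last = (beam_width == 1
                  or is_cross_kv
                  or input_length % tokens_per_block == 0)
    return num_context_blocks if share_last else num_context_blocks - 1
```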
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.