
[None][refactor] Batch addSequence with two-phase claim and unified VSWA/non-reuse support#13029

Open
liji-nv wants to merge 4 commits into NVIDIA:main from liji-nv:fix/unify-addsequence-batch

Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Apr 14, 2026

…SWA/non-reuse support

Introduce addSequenceBatch with a two-phase claim-then-onboard strategy that prevents host offloading from evicting reusable blocks in the radix tree. Extend it to handle all cases previously requiring separate addSequence calls:

  • Two-phase strategy: Phase 1 claims matching blocks under a single lock using PartialClaimTracker to coordinate partial-match ownership across requests. Phase 2 onboards and allocates without re-traversing the tree.
  • VSWA: Loop over all window sizes in KVCacheManager::addSequenceBatch, tracking minPrepopulatedLen across windows per sequence.
  • Non-reuse: Add buildClaimResultMetadata() to build ClaimResult without radix tree walk. Recurrent state placeholder logic branches on isEnableBlockReuse (aligned with PR [None][chore] Unify code path for reuse/non-reuse when adding sequence in kv cache manager #10437).
  • Star attention: Compute special seq_len in Python and route through the batch path.
  • Dummy requests: Batch add_sequence calls in add_dummy_requests, deferring is_gen state modifications until after the batch call.
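The claim-then-onboard flow described above can be sketched as a deliberately simplified toy model. Everything here is illustrative: the toy treats every matched block as exclusively claimed, whereas the real PartialClaimTracker coordinates only partial-match ownership, and the `ToyBlockManager` class and its method signatures are invented for this sketch, not the actual C++ API.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class ClaimResult:
    matched: list = field(default_factory=list)  # block ids reused from the radix tree
    needed: int = 0                              # blocks still to allocate in Phase 2

class PartialClaimTracker:
    """First claimer of a block owns it; later requests must allocate instead."""
    def __init__(self):
        self._owner = {}
    def try_claim(self, block_id, req_id):
        return self._owner.setdefault(block_id, req_id) == req_id

class ToyBlockManager:
    def __init__(self, reusable):
        self._lock = threading.Lock()
        self._reusable = dict(reusable)  # prefix key -> reusable block id
        self._next_id = 1000

    def add_sequence_batch(self, requests):
        """requests: list of (req_id, [prefix keys]); returns req_id -> block ids."""
        tracker, claims = PartialClaimTracker(), {}
        # Phase 1: claim matching blocks for the whole batch under ONE lock,
        # so nothing can evict a block another request in the batch just matched.
        with self._lock:
            for req_id, prefixes in requests:
                matched = [self._reusable[p] for p in prefixes
                           if p in self._reusable
                           and tracker.try_claim(self._reusable[p], req_id)]
                claims[req_id] = ClaimResult(matched, len(prefixes) - len(matched))
        # Phase 2: onboard claimed blocks and allocate the rest without
        # re-traversing the tree; the claims already pin the reusable blocks.
        stats = {}
        for req_id, _ in requests:
            c = claims[req_id]
            stats[req_id] = c.matched + [self._alloc() for _ in range(c.needed)]
        return stats

    def _alloc(self):
        self._next_id += 1
        return self._next_id
```

With reusable blocks `{"a": 1, "b": 2}`, a batch of `("r1", ["a", "x"])` and `("r2", ["a", "b"])` gives r1 the reused block plus one fresh allocation, while r2 reuses block 2 but must allocate for "a" because r1 claimed it first.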

Summary by CodeRabbit

  • Refactor
    • Optimized batch KV-cache allocation and context block reuse mechanisms for improved resource efficiency.
    • Enhanced sequence batch processing pipeline with more efficient cache block management and allocation strategies.
    • Improved resource manager with batched cache operations for better performance during context prefill processing.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv liji-nv requested a review from a team as a code owner April 14, 2026 09:10
@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

These changes introduce a new batch KV-cache sequence allocation API implementing a two-phase "claim-then-onboard" approach for context cache reuse. The feature adds data structures and methods across the C++ KV cache manager hierarchy and Python bindings, enabling efficient multi-request context block claiming and onboarding with deferred allocation/reuse decisions.

Changes

Cohort / File(s) Summary
Core API & Data Structures
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Added nested data structures (BatchSeqStats, ClaimResult, PartialClaimTracker) and new batch allocation methods across WindowBlockManager, BlockManager, BaseKVCacheManager, and KVCacheManager to support two-phase claim-then-onboard context cache reuse.
Implementation
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Implemented batch allocation logic including Phase 1 claim operations (claimMatchingBlocks, buildClaimResultMetadata) and Phase 2 onboarding (onboardAndAllocateBlocks), with support for radix-tree block matching, partial-leaf reuse, and deferred release semantics across window-sized block managers.
Python Bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
Exposed new add_sequence_batch method on BaseKVCacheManager via nanobind bindings, marshaling request info tuples and LLM request references between Python and C++ with proper GIL handling.
PyExecutor Integration
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Refactored KVCacheManager.prepare_resources and add_dummy_requests to use the new batched add_sequence_batch API instead of per-request add_sequence calls for eligible context chunks and dummy requests.

Sequence Diagram(s)

sequenceDiagram
    participant PyExec as PyExecutor
    participant KVCacheMgr as KVCacheManager
    participant BlockMgr as BlockManager
    participant WindowMgr as WindowBlockManager
    
    PyExec->>KVCacheMgr: addSequenceBatch(requestInfos, llmRequests)
    KVCacheMgr->>BlockMgr: addSequenceBatch(sequences, inputLengths, numContextBlocks, ...)
    activate BlockMgr
    
    BlockMgr->>WindowMgr: Phase 1: claimMatchingBlocks(...) or buildClaimResultMetadata(...)
    activate WindowMgr
    WindowMgr-->>BlockMgr: ClaimResult (matched blocks, metadata)
    deactivate WindowMgr
    
    BlockMgr->>WindowMgr: Phase 2: onboardAndAllocateBlocks(per request)
    activate WindowMgr
    WindowMgr-->>BlockMgr: BatchSeqStats (allocation results)
    deactivate WindowMgr
    
    BlockMgr-->>KVCacheMgr: vector<BatchSeqStats>
    deactivate BlockMgr
    
    KVCacheMgr->>KVCacheMgr: Update cache offsets & aggregated stats
    KVCacheMgr-->>PyExec: Allocation complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 11.11%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check (⚠️ Warning): the PR description is incomplete. It lacks a formal PR title following the template, the 'Description' section explanation, and 'Test Coverage' details, and the checklist is only partially addressed. Resolution: add a proper PR title with ticket/issue ID and type prefix (e.g., [TRTLLM-1234][feat]), fill in the 'Description' and 'Test Coverage' sections, and complete the relevant checklist items with proper justification.

✅ Passed checks (1 passed)
  • Title check (✅ Passed): the title clearly and specifically describes the main change, introducing batch addSequence with a two-phase claim-and-onboard strategy plus unified support for VSWA and non-reuse paths, which matches the core objective across all modified files.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the modified-file copyright year.

This file was changed in 2026, but the header still ends at 2025.

As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` at line 2, Update
the SPDX copyright header year range in the file by changing the trailing year
from 2025 to 2026; locate the SPDX header line starting with
"SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the
range to include 2026 so the header reflects the file modification year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 1890-1898: The docstring for addSequenceBatch incorrectly states
it requires block reuse; update the public comment for addSequenceBatch to
reflect that it supports both the block-reuse path and the non-reuse path that
uses buildClaimResultMetadata(...) — explain the two-phase claim-then-onboard
behavior and that for non-reuse callers the buildClaimResultMetadata helper is
used to prepare claim results before onboarding/allocating blocks; reference the
methods addSequenceBatch and buildClaimResultMetadata so callers know both code
paths are supported.

In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 2133-2141: The function BlockManager::addSequenceBatch is
misformatted; run LLVM clang-format (max 120 chars/line) on this overload to fix
line wrapping/indentation issues and reformat the block containing the signature
and return statement; ensure the signature and the return call to
mWindowBlockManagers.at(windowSize).addSequenceBatch(...) conform to project
style and line-length limits so clang-format passes CI.
- Around line 1414-1424: The code unconditionally dereferences
llmRequest.getEncoderUniqueTokens() for non-self cache paths, causing optional
access failures for cross-KV requests without encoder tokens; change the logic
that computes uniqueTokens/blockKeys to first check for encoder tokens (e.g.,
bool hasUniqueTokens = (mCacheType==CacheType::kSELF ||
mCacheType==CacheType::kSELFKONLY) ||
llmRequest.getEncoderUniqueTokens().has_value()), and only call
llmRequest.getEncoderUniqueTokens().value(), chopVectorIntoBlocks<UniqueToken>,
and buildBlockKeys when hasUniqueTokens is true; otherwise leave
result.blockKeys empty (skip block-key generation). Apply the same
hasUniqueTokens guard in the later stats update site around the code referencing
getEncoderUniqueTokens() so both locations behave consistently for cross-KV
requests without encoder tokens.
- Around line 1636-1644: The new branch in addSequenceBatch() increments only
mReusedBlocks when claimed.isPlaceholder is false, causing mismatch with
loadOrAllocateBlocks() which also updates mFullReusedBlocks and
mPartialReusedBlocks; update this path to detect full vs partial reuse and
increment mFullReusedBlocks or mPartialReusedBlocks accordingly (in addition to
mReusedBlocks), using the same criteria/load conditions as
loadOrAllocateBlocks(), and keep the unique-block tracking via reusedBlockIds
and matchingBlockId unchanged so getAndResetIterationStats() reports the same
breakdown as before.
- Around line 1444-1451: The current shareLastContextBlockAmongBeams logic
diverges from loadOrAllocateBlocks(): restore the old rule so a full last
context block — and any cross-KV context block logic used by
loadOrAllocateBlocks() — is still shared for beamWidth > 1; update the
computation in kvCacheManager.cpp (variables/functions:
shareLastContextBlockAmongBeams, numSharedContextBlocks, isRecurrentState(),
sequence.getBeamWidth(), inputLength, mTokensPerBlock) to mirror the branching
in loadOrAllocateBlocks()/WindowBlockManager::addSequence (i.e., set
shareLastContextBlockAmongBeams true when beamWidth==1 OR when beamWidth>1 and
the last block is full or identified as a cross-KV context block by the same
predicate used in loadOrAllocateBlocks()), then recompute numSharedContextBlocks
accordingly.

In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 71: NB_TRAMPOLINE currently reserves 37 trampoline slots but
PyKvCacheManager defines 39 NB_OVERRIDE_PURE overrides, causing nanobind virtual
dispatch failures; update the NB_TRAMPOLINE(tbk::BaseKVCacheManager, 37)
invocation to reserve at least 39 slots (e.g.,
NB_TRAMPOLINE(tbk::BaseKVCacheManager, 39)) to match the overrides, and update
the file copyright year string from "2022-2025" to "2022-2026".

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 801-806: The draft path only reserves self.num_extra_kv_tokens and
ignores num_extra_decoding_steps and the draft manager's own extra-KV setting;
update the loop after draft_kv_cache_manager.impl.add_sequence_batch to compute
the number of extra tokens to add per request as (draft manager's own extra-KV
if available, otherwise self.num_extra_kv_tokens) plus
self.num_extra_decoding_steps, then call
draft_kv_cache_manager.impl.add_token(req_id) that many times for each req_id
from draft_batch_request_infos so the draft KV cache mirrors the main allocation
and respects the draft manager's configuration (referencing
draft_kv_cache_manager, add_sequence_batch, add_token, self.num_extra_kv_tokens,
num_extra_decoding_steps, and
draft_batch_request_infos/draft_batch_llm_requests).
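The suggested fix for the draft path can be sketched as a small helper. The function `reserve_draft_extra_tokens` and the attribute-lookup fallback below are hypothetical names for illustration, not the actual resource_manager.py API:

```python
def reserve_draft_extra_tokens(draft_mgr, req_ids, fallback_extra_kv,
                               num_extra_decoding_steps):
    # Prefer the draft manager's own extra-KV setting; fall back to the
    # main manager's value, then add the extra decoding steps on top.
    own_extra = getattr(draft_mgr, "num_extra_kv_tokens", None)
    extra = own_extra if own_extra is not None else fallback_extra_kv
    extra += num_extra_decoding_steps
    for req_id in req_ids:
        for _ in range(extra):  # mirror the main KV allocation per request
            draft_mgr.impl.add_token(req_id)
    return extra
```

This keeps the draft KV cache in step with the main allocation while still respecting a draft manager that configures its own extra-KV budget.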

---

Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 2: Update the SPDX copyright header year range in the file by changing
the trailing year from 2025 to 2026; locate the SPDX header line starting with
"SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the
range to include 2026 so the header reflects the file modification year.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e984bffe-8c4a-45cf-8acd-aded0d1b9173

📥 Commits

Reviewing files that changed from the base of the PR and between f2f9051 and 680d222.

📒 Files selected for processing (4)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py

liji-nv added 3 commits April 14, 2026 05:49
…SWA/non-reuse support

Introduce addSequenceBatch with a two-phase claim-then-onboard strategy
that prevents host offloading from evicting reusable blocks in the radix
tree. Extend it to handle all cases previously requiring separate
addSequence calls:

- Two-phase strategy: Phase 1 claims matching blocks under a single lock
  using PartialClaimTracker to coordinate partial-match ownership across
  requests. Phase 2 onboards and allocates without re-traversing the tree.
- VSWA: Loop over all window sizes in KVCacheManager::addSequenceBatch,
  tracking minPrepopulatedLen across windows per sequence.
- Non-reuse: Add buildClaimResultMetadata() to build ClaimResult without
  radix tree walk. Recurrent state placeholder logic branches on
  isEnableBlockReuse (aligned with PR NVIDIA#10437).
- Star attention: Compute special seq_len in Python and route through
  the batch path.
- Dummy requests: Batch add_sequence calls in add_dummy_requests,
  deferring is_gen state modifications until after the batch call.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Migrate all 84 kvCacheManager.addSequence calls in kvCacheManagerTest.cpp
to addSequenceBatch (batch-size-1 for interleaved tests, true batching
for 2 consecutive-call tests). Fix partial/full reuse stat tracking in
onboardAndAllocateBlocks. Update SecondaryBlockPrimaryChildTest assertion
to match improved block recovery from two-phase claim strategy.

BlockManager-level addSequence calls are kept as-is since they test
the lower-level API directly.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
…uenceBatch

When claimMatchingBlocks claims an unreferenced non-leaf partial-match
block to protect it during Phase 2 copies, the block is removed from
the free queue. Previously it was only released at the end of the batch
(deferred release), which was too late — subsequent getFreeBlock calls
for non-matching blocks could not find free blocks in tight pools.

Fix: use PartialClaimTracker to assign release responsibility to the
last copier in the batch. After the copy completes in Phase 2, the
responsible copier releases the source back to the free queue via
shouldReleaseCopySource flag on ClaimedBlock. A full match on the same
block revokes the release responsibility.

Also migrate BlockManagerReuseTest to addSequenceBatch to validate
the fix, and update the full-match tracker branch to handle both
leaf and non-leaf blocks.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
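The release-responsibility rule from this commit can be modeled with a tiny tracker. This is a toy with invented names, not the C++ implementation: the last partial-match copier in the batch owns the release, and a full match on the same block revokes it.

```python
class ReleaseResponsibilityTracker:
    """Toy model of the fix: the last copier of a partial-match source block
    releases it back to the free queue after its Phase 2 copy completes;
    a full match on the same block revokes the release entirely."""
    def __init__(self):
        self._last_copier = {}  # block_id -> req_id, or None once revoked

    def record_partial_copy(self, block_id, req_id):
        if self._last_copier.get(block_id, req_id) is not None:
            # Responsibility moves to the latest copier (unless already revoked).
            self._last_copier[block_id] = req_id

    def record_full_match(self, block_id):
        # Block stays referenced after a full match; it must never be released.
        self._last_copier[block_id] = None

    def should_release_copy_source(self, block_id, req_id):
        return self._last_copier.get(block_id) == req_id
```

In this model only the request that copied last sees `should_release_copy_source` return true, matching the `shouldReleaseCopySource` flag on ClaimedBlock described in the commit message.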
@liji-nv liji-nv force-pushed the fix/unify-addsequence-batch branch from 680d222 to 847c922 Compare April 14, 2026 12:49
@liji-nv liji-nv changed the title [None][refactor] Batch addSequence with two-phase claim and unified V… [None][refactor] Batch addSequence with two-phase claim and unified VSWA/non-reuse support Apr 14, 2026
…e-last-block logic

- Update addSequenceBatch docstring to reflect support for both block-reuse
  and non-reuse paths via buildClaimResultMetadata.
- Guard encoder unique token access in claimMatchingBlocks and
  onboardAndAllocateBlocks with hasUniqueTokens check, matching
  buildClaimResultMetadata and WindowBlockManager::addSequence (PR NVIDIA#10437)
  for cross-KV requests without encoder tokens (e.g., Whisper).
- Align shareLastContextBlockAmongBeams in claimMatchingBlocks with the
  unified formula from loadOrAllocateBlocks (PR NVIDIA#10437):
  isShareLastContextBlock = kCROSS || inputLength % tokensPerBlock == 0.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
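The unified share-last-block rule quoted in this commit reduces to a one-line predicate. The Python function below is an illustrative restatement with invented names, not the C++ code:

```python
def is_share_last_context_block(is_cross_kv, input_length, tokens_per_block):
    # With beam search, the last context block is shared among beams only
    # when this is a cross-KV cache or the last block is completely full.
    return is_cross_kv or input_length % tokens_per_block == 0
```

For example, a 64-token self-attention input with 32-token blocks shares its last block, while a 65-token input does not; a cross-KV cache shares regardless of fill.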
@liji-nv liji-nv force-pushed the fix/unify-addsequence-batch branch from b50218d to c4093ba Compare April 14, 2026 14:00
