[None][refactor] Batch addSequence with two-phase claim and unified VSWA/non-reuse support #13029
liji-nv wants to merge 4 commits into NVIDIA:main
Conversation
📝 Walkthrough
These changes introduce a new batch KV-cache sequence allocation API implementing a two-phase "claim-then-onboard" approach for context cache reuse. The feature adds data structures and methods across the C++ KV cache manager hierarchy and the Python bindings, enabling efficient multi-request context block claiming and onboarding with deferred allocation/reuse decisions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PyExec as PyExecutor
    participant KVCacheMgr as KVCacheManager
    participant BlockMgr as BlockManager
    participant WindowMgr as WindowBlockManager
    PyExec->>KVCacheMgr: addSequenceBatch(requestInfos, llmRequests)
    KVCacheMgr->>BlockMgr: addSequenceBatch(sequences, inputLengths, numContextBlocks, ...)
    activate BlockMgr
    BlockMgr->>WindowMgr: Phase 1: claimMatchingBlocks(...) or buildClaimResultMetadata(...)
    activate WindowMgr
    WindowMgr-->>BlockMgr: ClaimResult (matched blocks, metadata)
    deactivate WindowMgr
    BlockMgr->>WindowMgr: Phase 2: onboardAndAllocateBlocks(per request)
    activate WindowMgr
    WindowMgr-->>BlockMgr: BatchSeqStats (allocation results)
    deactivate WindowMgr
    BlockMgr-->>KVCacheMgr: vector<BatchSeqStats>
    deactivate BlockMgr
    KVCacheMgr->>KVCacheMgr: Update cache offsets & aggregated stats
    KVCacheMgr-->>PyExec: Allocation complete
```
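The claim-then-onboard flow above can be sketched in miniature. This is a hypothetical toy model, not the TensorRT-LLM implementation; all names (`ToyBlockManager`, `claim_matching_blocks`, `onboard_and_allocate`) are illustrative. Phase 1 resolves every request's cache matches under one lock so eviction cannot race with the batch; Phase 2 allocates the remainder per request without revisiting the cache index.

```python
import threading

class ToyBlockManager:
    """Toy stand-in for the batch claim-then-onboard pattern."""

    def __init__(self, cached):
        self.cached = cached      # token-chunk -> cached block id
        self.claimed = {}         # request id -> claimed block ids
        self.lock = threading.Lock()
        self.next_block = 100     # next fresh block id to hand out

    def claim_matching_blocks(self, requests):
        # Phase 1: one lock covers all matches, so no block that another
        # request in the batch will reuse can be evicted mid-batch.
        with self.lock:
            for req_id, chunks in requests:
                self.claimed[req_id] = [
                    self.cached[c] for c in chunks if c in self.cached
                ]

    def onboard_and_allocate(self, req_id, num_blocks):
        # Phase 2: top up with fresh blocks, no re-traversal of the cache.
        blocks = list(self.claimed[req_id])
        while len(blocks) < num_blocks:
            blocks.append(self.next_block)
            self.next_block += 1
        return blocks

mgr = ToyBlockManager(cached={"a": 0, "b": 1})
mgr.claim_matching_blocks([("r1", ["a", "b", "c"]), ("r2", ["a"])])
r1_blocks = mgr.onboard_and_allocate("r1", 3)   # reuses blocks 0 and 1
r2_blocks = mgr.onboard_and_allocate("r2", 2)   # reuses block 0
```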
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
Line 2: ⚠️ Potential issue | 🟡 Minor: Update the modified-file copyright year.
This file was changed in 2026, but the header still ends at 2025.
As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` at line 2, Update the SPDX copyright header year range in the file by changing the trailing year from 2025 to 2026; locate the SPDX header line starting with "SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the range to include 2026 so the header reflects the file modification year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 1890-1898: The docstring for addSequenceBatch incorrectly states
it requires block reuse; update the public comment for addSequenceBatch to
reflect that it supports both the block-reuse path and the non-reuse path that
uses buildClaimResultMetadata(...) — explain the two-phase claim-then-onboard
behavior and that for non-reuse callers the buildClaimResultMetadata helper is
used to prepare claim results before onboarding/allocating blocks; reference the
methods addSequenceBatch and buildClaimResultMetadata so callers know both code
paths are supported.
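The two code paths named in this comment can be pictured with a small sketch (hypothetical names and result shapes, not the actual addSequenceBatch or buildClaimResultMetadata signatures): the reuse path claims blocks from a cache walk, while the non-reuse path builds an equivalent empty claim, so both converge on the same onboarding step.

```python
# Hypothetical sketch: one batch entry point serving both the block-reuse
# path (radix-tree claim) and the non-reuse path (metadata-only claim).
def add_sequence_batch(requests, enable_block_reuse):
    def claim_matching_blocks(req):
        # Reuse path: pretend the tree matched half of the prompt blocks.
        return {"req": req, "matched": req["num_blocks"] // 2}

    def build_claim_result_metadata(req):
        # Non-reuse path: same result shape, no tree walk, nothing matched.
        return {"req": req, "matched": 0}

    claim = claim_matching_blocks if enable_block_reuse else build_claim_result_metadata
    results = [claim(r) for r in requests]          # Phase 1
    for res in results:                             # Phase 2
        res["allocated"] = res["req"]["num_blocks"] - res["matched"]
    return results

batch = [{"num_blocks": 4}, {"num_blocks": 3}]
reuse = add_sequence_batch(batch, enable_block_reuse=True)
no_reuse = add_sequence_batch(batch, enable_block_reuse=False)
```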
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 2133-2141: The function BlockManager::addSequenceBatch is
misformatted; run LLVM clang-format (max 120 chars/line) on this overload to fix
line wrapping/indentation issues and reformat the block containing the signature
and return statement; ensure the signature and the return call to
mWindowBlockManagers.at(windowSize).addSequenceBatch(...) conform to project
style and line-length limits so clang-format passes CI.
- Around line 1414-1424: The code unconditionally dereferences
llmRequest.getEncoderUniqueTokens() for non-self cache paths, causing optional
access failures for cross-KV requests without encoder tokens; change the logic
that computes uniqueTokens/blockKeys to first check for encoder tokens (e.g.,
bool hasUniqueTokens = (mCacheType==CacheType::kSELF ||
mCacheType==CacheType::kSELFKONLY) ||
llmRequest.getEncoderUniqueTokens().has_value()), and only call
llmRequest.getEncoderUniqueTokens().value(), chopVectorIntoBlocks<UniqueToken>,
and buildBlockKeys when hasUniqueTokens is true; otherwise leave
result.blockKeys empty (skip block-key generation). Apply the same
hasUniqueTokens guard in the later stats update site around the code referencing
getEncoderUniqueTokens() so both locations behave consistently for cross-KV
requests without encoder tokens.
- Around line 1636-1644: The new branch in addSequenceBatch() increments only
mReusedBlocks when claimed.isPlaceholder is false, causing mismatch with
loadOrAllocateBlocks() which also updates mFullReusedBlocks and
mPartialReusedBlocks; update this path to detect full vs partial reuse and
increment mFullReusedBlocks or mPartialReusedBlocks accordingly (in addition to
mReusedBlocks), using the same criteria/load conditions as
loadOrAllocateBlocks(), and keep the unique-block tracking via reusedBlockIds
and matchingBlockId unchanged so getAndResetIterationStats() reports the same
breakdown as before.
- Around line 1444-1451: The current shareLastContextBlockAmongBeams logic
diverges from loadOrAllocateBlocks(): restore the old rule so a full last
context block — and any cross-KV context block logic used by
loadOrAllocateBlocks() — is still shared for beamWidth > 1; update the
computation in kvCacheManager.cpp (variables/functions:
shareLastContextBlockAmongBeams, numSharedContextBlocks, isRecurrentState(),
sequence.getBeamWidth(), inputLength, mTokensPerBlock) to mirror the branching
in loadOrAllocateBlocks()/WindowBlockManager::addSequence (i.e., set
shareLastContextBlockAmongBeams true when beamWidth==1 OR when beamWidth>1 and
the last block is full or identified as a cross-KV context block by the same
predicate used in loadOrAllocateBlocks()), then recompute numSharedContextBlocks
accordingly.
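The encoder-token guard described above might look like the following Python analogue (hypothetical names; the real code is C++ using std::optional): the encoder-token optional is only dereferenced when the cache type implies self tokens or the optional is populated, and a cross-KV request without encoder tokens simply produces no block keys.

```python
from typing import Optional, List, Tuple

SELF, SELF_K_ONLY, CROSS = "kSELF", "kSELFKONLY", "kCROSS"

def block_keys_for(cache_type: str,
                   unique_tokens: List[int],
                   encoder_tokens: Optional[List[int]],
                   tokens_per_block: int) -> List[Tuple[int, ...]]:
    # Only touch the encoder-token optional when it is actually populated.
    has_unique = (cache_type in (SELF, SELF_K_ONLY)
                  or encoder_tokens is not None)
    if not has_unique:
        return []  # cross-KV request without encoder tokens: skip key generation
    toks = unique_tokens if cache_type in (SELF, SELF_K_ONLY) else encoder_tokens
    # Chop the token stream into block-sized keys.
    return [tuple(toks[i:i + tokens_per_block])
            for i in range(0, len(toks), tokens_per_block)]
```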
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 71: NB_TRAMPOLINE currently reserves 37 trampoline slots but
PyKvCacheManager defines 39 NB_OVERRIDE_PURE overrides, causing nanobind virtual
dispatch failures; update the NB_TRAMPOLINE(tbk::BaseKVCacheManager, 37)
invocation to reserve at least 39 slots (e.g.,
NB_TRAMPOLINE(tbk::BaseKVCacheManager, 39)) to match the overrides, and update
the file copyright year string from "2022-2025" to "2022-2026".
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 801-806: The draft path only reserves self.num_extra_kv_tokens and
ignores num_extra_decoding_steps and the draft manager's own extra-KV setting;
update the loop after draft_kv_cache_manager.impl.add_sequence_batch to compute
the number of extra tokens to add per request as (draft manager's own extra-KV
if available, otherwise self.num_extra_kv_tokens) plus
self.num_extra_decoding_steps, then call
draft_kv_cache_manager.impl.add_token(req_id) that many times for each req_id
from draft_batch_request_infos so the draft KV cache mirrors the main allocation
and respects the draft manager's configuration (referencing
draft_kv_cache_manager, add_sequence_batch, add_token, self.num_extra_kv_tokens,
num_extra_decoding_steps, and
draft_batch_request_infos/draft_batch_llm_requests).
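A sketch of the suggested computation, with hypothetical names (the real attributes live on the draft KV cache manager): the per-request extra-token count combines the draft manager's own extra-KV setting, falling back to the main manager's, with the number of extra decoding steps, and add_token is then called that many times per request.

```python
def draft_extra_tokens(draft_mgr_extra_kv, main_extra_kv, num_extra_decoding_steps):
    # Prefer the draft manager's own setting; fall back to the main manager's.
    base = draft_mgr_extra_kv if draft_mgr_extra_kv is not None else main_extra_kv
    return base + num_extra_decoding_steps

def reserve_draft_tokens(add_token, req_ids, n_extra):
    # Mirrors calling draft_kv_cache_manager.impl.add_token(req_id) n times.
    for req_id in req_ids:
        for _ in range(n_extra):
            add_token(req_id)

calls = []
reserve_draft_tokens(calls.append, ["r1", "r2"], draft_extra_tokens(None, 2, 1))
```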
---
Outside diff comments:
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`:
- Line 2: Update the SPDX copyright header year range in the file by changing
the trailing year from 2025 to 2026; locate the SPDX header line starting with
"SPDX-FileCopyrightText" (the comment line shown in the diff) and extend the
range to include 2026 so the header reflects the file modification year.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e984bffe-8c4a-45cf-8acd-aded0d1b9173
📒 Files selected for processing (4)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
…SWA/non-reuse support

Introduce addSequenceBatch with a two-phase claim-then-onboard strategy that prevents host offloading from evicting reusable blocks in the radix tree. Extend it to handle all cases previously requiring separate addSequence calls:

- Two-phase strategy: Phase 1 claims matching blocks under a single lock, using PartialClaimTracker to coordinate partial-match ownership across requests. Phase 2 onboards and allocates without re-traversing the tree.
- VSWA: Loop over all window sizes in KVCacheManager::addSequenceBatch, tracking minPrepopulatedLen across windows per sequence.
- Non-reuse: Add buildClaimResultMetadata() to build ClaimResult without a radix tree walk. Recurrent state placeholder logic branches on isEnableBlockReuse (aligned with PR NVIDIA#10437).
- Star attention: Compute the special seq_len in Python and route through the batch path.
- Dummy requests: Batch add_sequence calls in add_dummy_requests, deferring is_gen state modifications until after the batch call.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
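The VSWA bookkeeping mentioned in this commit can be illustrated with a toy helper (hypothetical data shape; the real tracking happens inside KVCacheManager::addSequenceBatch): with several attention window sizes, the prepopulated length usable for a sequence is the minimum over all windows, since every window's block manager must have that prefix cached.

```python
def min_prepopulated_len(prepopulated_by_window):
    """prepopulated_by_window: {window_size: [prepopulated len per sequence]}"""
    # Transpose to per-sequence tuples, then take the minimum across windows.
    per_seq = zip(*prepopulated_by_window.values())
    return [min(lens) for lens in per_seq]
```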
Migrate all 84 kvCacheManager.addSequence calls in kvCacheManagerTest.cpp to addSequenceBatch (batch-size-1 for interleaved tests, true batching for 2 consecutive-call tests). Fix partial/full reuse stat tracking in onboardAndAllocateBlocks. Update the SecondaryBlockPrimaryChildTest assertion to match the improved block recovery from the two-phase claim strategy.

BlockManager-level addSequence calls are kept as-is since they test the lower-level API directly.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
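The stat-tracking fix mentioned here amounts to classifying each claimed block as a full or partial reuse, as in this sketch (hypothetical counter and field names): every non-placeholder claim bumps the total reuse counter plus exactly one of the full/partial counters.

```python
def update_reuse_stats(stats, claimed_blocks, tokens_per_block):
    for blk in claimed_blocks:
        if blk["is_placeholder"]:
            continue  # placeholders are not reused cache content
        stats["reused_blocks"] += 1
        if blk["matched_tokens"] == tokens_per_block:
            stats["full_reused_blocks"] += 1     # whole block matched
        else:
            stats["partial_reused_blocks"] += 1  # only a prefix matched
    return stats

stats = {"reused_blocks": 0, "full_reused_blocks": 0, "partial_reused_blocks": 0}
claimed = [
    {"is_placeholder": False, "matched_tokens": 32},  # full reuse
    {"is_placeholder": False, "matched_tokens": 7},   # partial reuse
    {"is_placeholder": True, "matched_tokens": 0},    # placeholder
]
update_reuse_stats(stats, claimed, tokens_per_block=32)
```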
…uenceBatch

When claimMatchingBlocks claims an unreferenced non-leaf partial-match block to protect it during Phase 2 copies, the block is removed from the free queue. Previously it was only released at the end of the batch (deferred release), which was too late: subsequent getFreeBlock calls for non-matching blocks could not find free blocks in tight pools.

Fix: use PartialClaimTracker to assign release responsibility to the last copier in the batch. After the copy completes in Phase 2, the responsible copier releases the source back to the free queue via the shouldReleaseCopySource flag on ClaimedBlock. A full match on the same block revokes the release responsibility.

Also migrate BlockManagerReuseTest to addSequenceBatch to validate the fix, and update the full-match tracker branch to handle both leaf and non-leaf blocks.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
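The last-copier-release rule in this commit can be modelled with a toy tracker (hypothetical class; the real PartialClaimTracker is C++ inside the block manager): each later partial match takes over responsibility for releasing the copy source, and a full match revokes release entirely because the block then stays referenced.

```python
class PartialClaimTracker:
    def __init__(self):
        # block id -> request id responsible for release, or None once revoked
        self.release_owner = {}

    def on_partial_match(self, block_id, req_id):
        if self.release_owner.get(block_id, "unset") is not None:
            self.release_owner[block_id] = req_id  # last copier takes over

    def on_full_match(self, block_id):
        self.release_owner[block_id] = None        # reused whole: never release

    def should_release(self, block_id, req_id):
        return self.release_owner.get(block_id) == req_id

tracker = PartialClaimTracker()
tracker.on_partial_match(7, "r1")
tracker.on_partial_match(7, "r2")                  # r2 copies last
r1_releases = tracker.should_release(7, "r1")
r2_releases = tracker.should_release(7, "r2")
tracker.on_full_match(7)                           # full match revokes release
r2_releases_after_full = tracker.should_release(7, "r2")
```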
…e-last-block logic

- Update the addSequenceBatch docstring to reflect support for both block-reuse and non-reuse paths via buildClaimResultMetadata.
- Guard encoder unique token access in claimMatchingBlocks and onboardAndAllocateBlocks with a hasUniqueTokens check, matching buildClaimResultMetadata and WindowBlockManager::addSequence (PR NVIDIA#10437) for cross-KV requests without encoder tokens (e.g., Whisper).
- Align shareLastContextBlockAmongBeams in claimMatchingBlocks with the unified formula from loadOrAllocateBlocks (PR NVIDIA#10437): isShareLastContextBlock = kCROSS || inputLength % tokensPerBlock == 0.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
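The unified share-last-block rule restated in the last bullet reduces to a small computation, sketched here with illustrative names (the real logic is C++ in claimMatchingBlocks/loadOrAllocateBlocks): for beam search, the last context block is shared across beams only when it is completely full or the request is cross-KV.

```python
import math

def num_shared_context_blocks(beam_width, input_length, tokens_per_block, is_cross_kv):
    num_context_blocks = math.ceil(input_length / tokens_per_block)
    # isShareLastContextBlock = kCROSS || inputLength % tokensPerBlock == 0;
    # with beamWidth == 1 there is nothing to fork, so everything is shared.
    share_last = (beam_width == 1
                  or is_cross_kv
                  or input_length % tokens_per_block == 0)
    return num_context_blocks if share_last else num_context_blocks - 1
```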
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.