
[None][fix] Fix chunked prefill API contract for nemotron nano VL#13025

Open
2ez4bz wants to merge 1 commit into NVIDIA:main from 2ez4bz:dev-nano-v3-fix-chunked-prefill-api

Conversation

2ez4bz (Collaborator) commented Apr 14, 2026

Summary by CodeRabbit

  • Improvements

    • Enhanced diagnostic logging for multimodal embedding handling with more detailed information about returned tensors, embedding types, and token counts.
  • Tests

    • Added comprehensive tests validating multimodal embedding caching behavior, encoder contract compliance, and chunked prefill compatibility.

Description

  • Why?

Opting into the caching functionality for chunked prefill requires the encoder's forward function to satisfy certain assumptions about its return type. These assumptions did not hold for Nemotron Nano VL prior to this commit.

  • What?

This commit fixes the issue and adds tests to catch regressions.
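The contract the caching path assumes can be sketched as follows. This is a hypothetical illustration, not the actual TRT-LLM API: the names `MultimodalParams`, `multimodal_data`, and `num_tokens_in_video` mirror those discussed in this PR, but the encoder logic is a toy stand-in that uses plain Python lists in place of `torch.Tensor`.

```python
# Hypothetical sketch of the encoder contract that chunked-prefill caching
# assumes: the encoder returns exactly one concatenated embedding per
# uncached request, and stashes per-request token counts in that request's
# multimodal_data rather than returning them. Lists stand in for tensors.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultimodalParams:
    # Per-request scratch space; the field name mirrors the PR, the
    # contents here are toy data.
    multimodal_data: Dict[str, Any] = field(default_factory=dict)


def encode_multimodal(uncached: List[MultimodalParams]) -> List[List[float]]:
    """Return one concatenated embedding per uncached request."""
    outputs: List[List[float]] = []
    for params in uncached:
        frames = params.multimodal_data["frames"]  # per-frame embeddings
        concat = [tok for frame in frames for tok in frame]
        # The token count is cached on the request, not returned.
        params.multimodal_data["num_tokens_in_video"] = len(concat)
        outputs.append(concat)
    return outputs


req = MultimodalParams(multimodal_data={"frames": [[0.1, 0.2], [0.3]]})
embeds = encode_multimodal([req])
print(len(embeds), req.multimodal_data["num_tokens_in_video"])  # 1 3
```

The key point is the shape of the return value: a list with one entry per request, nothing else, so the caching layer can associate each entry with its request without unpacking a tuple.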

Test Coverage

See above.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz requested a review from a team as a code owner April 14, 2026 06:02
@2ez4bz 2ez4bz requested a review from symphonylyh April 14, 2026 06:02
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch 2 times, most recently from f4313d5 to 35608fe Compare April 14, 2026 06:09
coderabbitai bot (Contributor) commented Apr 14, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

These changes refactor multimodal embedding handling in Nemotron Nano models. The _encode_multimodal method now returns only a list of concatenated tensors instead of a tuple with token counts; token counts are stored directly in multimodal parameters. The forward method reconstructs token counts from context parameters when needed. Enhanced logging provides more detailed warning information for debugging.

Changes

  • Logging Enhancement — tensorrt_llm/_torch/models/modeling_multimodal_utils.py
    Enhanced the logger.warning message to include the number of returned embeddings, embedding type names, shapes, the count of uncached multimodal params, and the encoder function reference when encoder_forward_fn returns multiple tensors.

  • Core API Changes — tensorrt_llm/_torch/models/modeling_nemotron_nano.py
    Modified _encode_multimodal to return List[torch.Tensor] instead of a tuple with token counts. Stores per-request num_tokens_in_video in MultimodalParams.multimodal_data. Updated forward to reconstruct num_tokens_in_videos from context params. Changed the merge_evs_mm_embeds invocation to use sliced context params.

  • Test Updates — tests/unittest/_torch/modeling/test_modeling_nemotron_nano_v2_vl.py
    Added imports and comprehensive tests for the multimodal caching contract. Updated existing dispatch tests for the new return signature. Introduced TestEncodeMultimodalContract, validating the list-of-single-tensor return, and TestChunkedPrefillCaching, verifying encoder caching behavior.
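On the forward side, the walkthrough notes that token counts are now reconstructed from cached context parameters rather than taken from the encoder's return value. A minimal sketch of that step, again with hypothetical names mirroring the PR (`multimodal_data`, `num_tokens_in_video`) and plain dicts standing in for `MultimodalParams`:

```python
from typing import Any, Dict, List


def reconstruct_num_tokens_in_videos(
        context_params: List[Dict[str, Any]]) -> List[int]:
    """Rebuild per-request video token counts from the values the
    encoder cached earlier, so a later prefill chunk never has to
    re-run the encoder just to learn the counts. Dicts stand in for
    MultimodalParams.multimodal_data here."""
    return [p["num_tokens_in_video"] for p in context_params]


ctx = [{"num_tokens_in_video": 3}, {"num_tokens_in_video": 7}]
print(reconstruct_num_tokens_in_videos(ctx))  # [3, 7]
```

Keeping the counts alongside the cached embeddings is what lets `forward` recover them for any chunk of the prefill, which is the compatibility this PR restores.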

Sequence Diagram(s)

sequenceDiagram
    participant Forward as Forward()
    participant Encoder as _encode_multimodal()
    participant Cache as MultimodalParams<br/>(Cache Storage)
    participant Merge as merge_evs_mm_embeds()
    participant Context as Context Params

    Forward->>Encoder: Call with uncached params
    Encoder->>Encoder: Vision/Audio encoding
    Encoder->>Cache: Store embeddings in<br/>multimodal_data
    Encoder->>Cache: Store num_tokens_in_video
    Encoder-->>Forward: Return List[Tensor]
    Forward->>Merge: Pass merged embeddings<br/>+ context params
    Merge-->>Forward: Merged result
    Forward->>Context: Reconstruct num_tokens<br/>from cached values
    Forward-->>Forward: Continue with<br/>merged embeddings

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title clearly summarizes the main change: fixing the chunked prefill API contract for nemotron nano VL, which aligns with the primary objective and code modifications across all three files.

  • Description check ✅ Passed — The description adequately explains the issue (assumptions about the encoder return type not holding for nemotron nano VL), the solution (fixing the API contract and adding tests), and provides test coverage details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

* Why?

Opting into the caching functionality for chunked prefill requires the
encoder's forward function to satisfy certain assumptions about its
return type. These assumptions did not hold for Nemotron Nano VL prior
to this commit.

* What?

This commit fixes the issue and adds tests to catch regressions.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fix-chunked-prefill-api branch from 35608fe to e7868dd Compare April 14, 2026 16:52
2ez4bz (Collaborator, Author) commented Apr 14, 2026

/bot run

tensorrt-cicd (Collaborator) commented:

PR_Github #43265 [ run ] triggered by Bot. Commit: e7868dd Link to invocation

@2ez4bz 2ez4bz enabled auto-merge (squash) April 14, 2026 17:54
tensorrt-cicd (Collaborator) commented:

PR_Github #43265 [ run ] completed with state SUCCESS. Commit: e7868dd
/LLM/main/L0_MergeRequest_PR pipeline #33818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
