
[TRTLLM-11872][perf] Multi-threading async media loading and optimizing video frame decoding in trtllm-serve#13034

Open
yechank-nvidia wants to merge 5 commits into NVIDIA:main from yechank-nvidia:async_media_load

Conversation

@yechank-nvidia
Collaborator

@yechank-nvidia yechank-nvidia commented Apr 14, 2026

Summary

Two performance fixes in tensorrt_llm/inputs/utils.py for trtllm-serve multimodal API requests.

Fix 1: Truly Async Media Loading

  • async_load_image / async_load_video / async_load_audio were declared async but blocked the event loop. CPU-bound work (PIL, cv2, soundfile) now runs in a thread pool via run_in_executor. Additional improvements:

  • Global aiohttp.ClientSession singleton — reuses TCP connections instead of creating a new session per request

  • retrieve_all_async now launches a single asyncio.gather across all modalities simultaneously instead of gathering per-modality sequentially
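The offload-and-gather pattern can be sketched with stdlib-only stand-ins (the helper names and payloads below are hypothetical; the real loaders in tensorrt_llm/inputs/utils.py decode with PIL/cv2/soundfile and fetch over the shared aiohttp session):

```python
import asyncio
import time


def _decode(payload: bytes) -> str:
    """Stand-in for CPU-bound work (PIL / cv2 / soundfile decoding)."""
    time.sleep(0.01)  # simulate decode cost
    return payload.decode()


async def async_load(payload: bytes) -> str:
    # Offload the blocking decode to the default thread pool so the
    # event loop keeps serving other requests in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _decode, payload)


async def retrieve_all(coros_by_modality: dict[str, list]) -> dict[str, list]:
    """One asyncio.gather across every modality instead of per-modality gathers."""
    keys = [m for m, cs in coros_by_modality.items() for _ in cs]
    flat = [c for cs in coros_by_modality.values() for c in cs]
    results = await asyncio.gather(*flat)
    out: dict[str, list] = {m: [] for m in coros_by_modality}
    for m, r in zip(keys, results):
        out[m].append(r)
    return out


async def main() -> dict[str, list]:
    work = {
        "image": [async_load(b"img-1"), async_load(b"img-2")],
        "audio": [async_load(b"aud-1")],
    }
    return await retrieve_all(work)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With all the loads in one gather, total latency approaches the slowest single item rather than the sum per modality.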

Fix 2: Faster Video Frame Decoding

Before: CAP_PROP_POS_FRAMES seek per frame — each seek decodes from the nearest keyframe to the target frame. Gets worse as GOP size grows (real H.264 videos: 1–8s keyframe intervals).

After:

  • Single sequential forward scan with grab() — no per-frame seek
  • PIL Image.fromarray + ToTensor replaced with direct numpy.transpose + torch.from_numpy (7.3× faster tensor conversion)
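A sketch of the scan-and-convert approach (function names here are illustrative, not the exact ones in utils.py; load_video_frames assumes opencv-python and torch are installed):

```python
import numpy as np


def sample_indices(total_frames: int, num_frames: int) -> list[int]:
    """Evenly spaced frame indices; short clips just return every frame."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames).astype(int).tolist()


def frames_to_tensor(frames: list):
    """HWC uint8 frames -> (N, C, H, W) float tensor, no PIL round-trip."""
    import torch  # imported here so sample_indices stays usable without torch

    stacked = np.stack(frames)           # (N, H, W, C)
    chw = stacked.transpose(0, 3, 1, 2)  # (N, C, H, W)
    return torch.from_numpy(chw).float().div_(255.0)


def load_video_frames(path: str, num_frames: int = 32):
    """Sequential forward scan: grab() every frame, retrieve() only sampled ones."""
    import cv2  # lazy import; requires opencv-python

    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(sample_indices(total, num_frames))
    frames = []
    for idx in range(total):
        if not cap.grab():  # cheap decode-advance, no keyframe seek
            break
        if idx in wanted:
            ok, frame = cap.retrieve()  # full decode only for sampled frames
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames_to_tensor(frames)
```

grab() advances the decoder without the per-frame CAP_PROP_POS_FRAMES seek, so decode cost stays linear in video length instead of growing with GOP size.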

Results

Baseline: original per-frame seek + PIL conversion, 32 frames extracted, H.264.

Single request latency

Scenario Before After Speedup
360p, 5s 501 ms 79 ms 6.3×
720p, 5s 628 ms 116 ms 5.4×
1080p, 5s 843 ms 222 ms 3.8×
1080p, 60s 878 ms 560 ms 1.6×

Server throughput (8 concurrent workers)

Scenario Before After Gain
360p, 5s 25 r/s 73 r/s 2.9×
1080p, 5s 4.3 r/s 12.9 r/s 3.0×
1080p, 60s 4.0 r/s 5.3 r/s 1.3×

Short videos benefit most — seek overhead dominates decode time for short clips and is eliminated entirely by the sequential scan.

Files Changed

  • tensorrt_llm/inputs/utils.py
  • tests/unittest/inputs/test_async_media_loading.py (13 new tests)

Summary by CodeRabbit

  • New Features

    • Implemented efficient HTTP session reuse for media downloads.
    • Added concurrent gathering of multimodal data for improved performance.
  • Bug Fixes

    • Optimized video frame extraction logic with improved frame sampling and processing.
  • Tests

    • Added comprehensive test coverage for async media loading, session management, and concurrent data retrieval operations.

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

The changes significantly refactor async media loading utilities to improve performance through global session reuse and concurrent operations while offloading CPU-bound work to thread executors. Video frame extraction logic is reworked to use forward scanning with optimized sampling, and modality retrieval is changed to concurrent gathering.

Changes

Cohort / File(s) Summary
Async Media Loading Core
tensorrt_llm/inputs/utils.py
Introduced global aiohttp session reuse via _get_aiohttp_session(). Updated async_load_image, async_load_video, and async_load_audio to offload CPU-bound operations (decoding, conversions) to asyncio thread pool executor. Reworked _load_video_by_cv2 with forward frame scanning, conditional linspace/range sampling, and direct numpy-to-torch tensor conversion for PyTorch format. Modified MultimodalDataTracker.retrieve_all_async() to gather all modality coroutines concurrently rather than sequentially per modality.
Async Media Loading Tests
tests/unittest/inputs/test_async_media_loading.py
New test module validating async image/audio loading with format variations, thread executor offloading behavior, aiohttp session reuse and lifecycle, and concurrent modality retrieval in MultimodalDataTracker with timing assertions.

Sequence Diagram

sequenceDiagram
    participant Caller as Caller
    participant Tracker as MultimodalDataTracker
    participant EventLoop as Event Loop
    participant Gather as asyncio.gather()
    participant RemoteA as Remote Server (Audio)
    participant RemoteV as Remote Server (Video)
    participant ThreadPool as Thread Pool Executor
    
    Caller->>Tracker: retrieve_all_async()
    Tracker->>EventLoop: Collect all coroutines<br/>(data + embeddings)<br/>across modalities
    Tracker->>Gather: asyncio.gather(*all_coroutines)
    par Concurrent Fetch
        Gather->>RemoteA: fetch audio_1
        Gather->>RemoteV: fetch video_1
        Gather->>RemoteA: fetch audio_2
        Gather->>RemoteV: fetch video_2
    end
    par CPU-Bound Offload
        RemoteA->>ThreadPool: decode audio
        RemoteV->>ThreadPool: extract frames
    end
    ThreadPool-->>Gather: decoded results
    Gather-->>Tracker: all results
    Tracker->>Tracker: Regroup by modality
    Tracker-->>Caller: {modality: {data, embeddings}}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 31.25%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
Title check (✅ Passed): The title accurately reflects both main changes: async media loading optimization and video frame decoding improvements.
Description check (✅ Passed): The PR description covers both fixes, detailed performance metrics, and test coverage, addressing all key template requirements.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
tensorrt_llm/inputs/utils.py (2)

34-45: Consider adding a cleanup function for the global session.

The global aiohttp.ClientSession is created lazily but there's no corresponding close_aiohttp_session() function for graceful shutdown. While connections are cleaned up on process exit, long-running servers that reload modules or undergo hot-restarts may leak connections.

♻️ Proposed addition for session cleanup
 async def _get_aiohttp_session() -> aiohttp.ClientSession:
     """Return the shared aiohttp.ClientSession, creating it on first call."""
     global _global_aiohttp_session
     if _global_aiohttp_session is None or _global_aiohttp_session.closed:
         _global_aiohttp_session = aiohttp.ClientSession()
     return _global_aiohttp_session
+
+
+async def _close_aiohttp_session() -> None:
+    """Close the shared aiohttp.ClientSession if open."""
+    global _global_aiohttp_session
+    if _global_aiohttp_session is not None and not _global_aiohttp_session.closed:
+        await _global_aiohttp_session.close()
+    _global_aiohttp_session = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 34 - 45, Add a cleanup function to
close the lazily-created shared aiohttp session to avoid leaking connections:
implement an async close_aiohttp_session() that checks the module-level
_global_aiohttp_session (and its .closed flag), awaits its .close() if open, and
sets _global_aiohttp_session to None; call or expose this function for graceful
shutdowns and document using it alongside _get_aiohttp_session().

533-534: Add strict=True to zip() for defensive programming.

While the lengths are guaranteed equal by construction (both derived from the same asyncio.gather call), using strict=True would catch any future bugs that might break this invariant.

♻️ Proposed fix
-            for modality, result in zip(modality_keys, results):
+            for modality, result in zip(modality_keys, results, strict=True):
                 out[modality].append(result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 533 - 534, The loop pairing
modality_keys and results should use defensive zip checking: change the zip call
in the block that iterates "for modality, result in zip(modality_keys,
results):" to use zip(..., strict=True) so mismatched lengths raise immediately;
update any tests or callers if they rely on silent truncation. Ensure this
change is applied where modality_keys and results are assembled in
tensorrt_llm/inputs/utils.py so the strictness guards the invariant.
tests/unittest/inputs/test_async_media_loading.py (3)

239-261: Timing-based test may be flaky under CI load.

The concurrency test relies on timing (0.15s delay with 0.08s tolerance). While the approach is sound and the values are reasonable, heavily loaded CI runners might occasionally cause flakiness. Consider increasing the tolerance slightly (e.g., 0.12s) or adding pytest.mark.flaky if flakiness is observed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 239 - 261,
The timing-based concurrency assertion in the test uses DELAY = 0.15 and
TOLERANCE = 0.08 which can be flaky on loaded CI runners; update the test to
increase the tolerance (e.g., set TOLERANCE = 0.12) or mark the test flaky
(e.g., add pytest.mark.flaky) so failures under high load are tolerated;
specifically change the TOLERANCE constant referenced next to DELAY and keep the
rest of the test (the _slow coroutine and the call to
tracker.retrieve_all_async()) unchanged.

114-120: Temp file not cleaned up after test.

Using delete=False without explicit cleanup leaves test artifacts on disk. Consider using pytest's tmp_path fixture or adding explicit cleanup.

♻️ Proposed fix using tmp_path fixture
     `@pytest.mark.asyncio`
-    async def test_load_audio_from_file(self):
-        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
-            path = _make_audio_file(f.name)
+    async def test_load_audio_from_file(self, tmp_path):
+        path = _make_audio_file(str(tmp_path / "test.wav"))
         result = await async_load_audio(path)
         audio_array, sample_rate = result
         assert isinstance(audio_array, np.ndarray)
         assert sample_rate == 16000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 114 - 120,
The test test_load_audio_from_file currently creates a temp file with
tempfile.NamedTemporaryFile(delete=False) and never removes it; either switch
the test to use pytest's tmp_path fixture to create a disposable path and write
the WAV via _make_audio_file there (use tmp_path / "test.wav") or ensure
explicit cleanup by removing the created file (os.remove(path)) in a
finally/teardown after calling async_load_audio; update references to
tempfile.NamedTemporaryFile, _make_audio_file, and async_load_audio accordingly
so no temp artifacts remain.

134-135: Same temp file cleanup issue.

This test also uses delete=False without explicit cleanup. Apply the same tmp_path fixture pattern suggested above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 134 - 135,
The test currently creates a temp WAV with tempfile.NamedTemporaryFile(...,
delete=False) and never removes it; replace that pattern with the pytest
tmp_path fixture: create a Path under tmp_path (e.g., tmp_path / "test.wav"),
pass its str to _make_audio_file (or write the needed contents there), and use
that path for the test so pytest will manage cleanup; update the test function
signature to accept tmp_path and remove the tempfile.NamedTemporaryFile usage in
test_async_media_loading.py around the _make_audio_file call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/inputs/test_async_media_loading.py`:
- Line 17: Remove the unused import "defaultdict" from the top-level imports
(the line containing "from collections import defaultdict"); update any imports
only if actually referenced elsewhere (e.g., keep other collections imports
intact) so the unused symbol defaultdict is deleted to satisfy the
linter/pre-commit checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6dcb14eb-8af7-4199-b355-3c424431e79d

📥 Commits

Reviewing files that changed from the base of the PR and between f2f9051 and 81b73a8.

📒 Files selected for processing (2)
  • tensorrt_llm/inputs/utils.py
  • tests/unittest/inputs/test_async_media_loading.py

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43226 [ run ] triggered by Bot. Commit: 6f35282 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43226 [ run ] completed with state DISABLED
Freeze main and open the PR merge only after CI is back to healthy https://nvidia.slack.com/archives/C059LSY62BT/p1776141760843319?thread_ts=1775985925.442509&cid=C059LSY62BT

Link to invocation

Collaborator

@2ez4bz 2ez4bz left a comment


Approving to unblock, but please consider the comments 🙏

@yechank-nvidia yechank-nvidia force-pushed the async_media_load branch 2 times, most recently from a49f565 to 0aefb69 Compare April 15, 2026 02:06
@yechank-nvidia yechank-nvidia changed the title [TRTLLM-11872][perf] Optimize async media loading and video frame decoding in trtllm-serve [TRTLLM-11872][perf] Multi-threading async media loading and optimizing video frame decoding in trtllm-serve Apr 15, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43347 [ run ] triggered by Bot. Commit: 0c8b86a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43347 [ run ] completed with state FAILURE. Commit: 0c8b86a
/LLM/main/L0_MergeRequest_PR pipeline #33885 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@yechank-nvidia yechank-nvidia added the Multimodal label (issues & PRs regarding Multimodal related objects) Apr 15, 2026
@tensorrt-cicd
Collaborator

PR_Github #43378 [ run ] triggered by Bot. Commit: 8b3671a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43378 [ run ] completed with state SUCCESS. Commit: 8b3671a
/LLM/main/L0_MergeRequest_PR pipeline #33912 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43441 [ run ] triggered by Bot. Commit: 8b3671a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43441 [ run ] completed with state SUCCESS. Commit: 8b3671a
/LLM/main/L0_MergeRequest_PR pipeline #33969 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43493 [ run ] triggered by Bot. Commit: ac02802 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43493 [ run ] completed with state SUCCESS. Commit: ac02802
/LLM/main/L0_MergeRequest_PR pipeline #34009 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz 2ez4bz force-pushed the async_media_load branch from ac02802 to a045c78 Compare April 15, 2026 19:56
@2ez4bz
Collaborator

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43757 [ run ] triggered by Bot. Commit: 6bf5169 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43757 [ run ] completed with state SUCCESS. Commit: 6bf5169
/LLM/main/L0_MergeRequest_PR pipeline #34239 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator

2ez4bz commented Apr 16, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43843 [ run ] triggered by Bot. Commit: 6bf5169 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43843 [ run ] completed with state FAILURE. Commit: 6bf5169
/LLM/main/L0_MergeRequest_PR pipeline #34304 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation


Labels

Multimodal: label for issues & PRs regarding Multimodal related objects
