
[TRTLLM-11872][perf] Multi-threading async media loading and optimizing video frame decoding in trtllm-serve#13034

Open
yechank-nvidia wants to merge 5 commits into NVIDIA:main from yechank-nvidia:async_media_load

Conversation

@yechank-nvidia
Collaborator

@yechank-nvidia yechank-nvidia commented Apr 14, 2026

Summary

Two performance fixes in tensorrt_llm/inputs/utils.py for trtllm-serve multimodal API requests.

Fix 1: Truly Async Media Loading

  • async_load_image / async_load_video / async_load_audio were declared async but blocked the event loop. CPU-bound work (PIL, cv2, soundfile) now runs in a thread pool via run_in_executor. Additional improvements:

  • Global aiohttp.ClientSession singleton — reuses TCP connections instead of creating a new session per request

  • retrieve_all_async now launches a single asyncio.gather across all modalities simultaneously instead of gathering per-modality sequentially
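The offload-and-gather pattern can be sketched with stdlib-only stand-ins (the helper names and payloads below are hypothetical; the real loaders in tensorrt_llm/inputs/utils.py decode with PIL/cv2/soundfile and fetch over the shared aiohttp session):

```python
import asyncio
import time


def _decode(payload: bytes) -> str:
    """Stand-in for CPU-bound work (PIL / cv2 / soundfile decoding)."""
    time.sleep(0.01)  # simulate decode cost
    return payload.decode()


async def async_load(payload: bytes) -> str:
    # Offload the blocking decode to the default thread pool so the
    # event loop keeps serving other requests in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _decode, payload)


async def retrieve_all(coros_by_modality: dict[str, list]) -> dict[str, list]:
    """One asyncio.gather across every modality instead of per-modality gathers."""
    keys = [m for m, cs in coros_by_modality.items() for _ in cs]
    flat = [c for cs in coros_by_modality.values() for c in cs]
    results = await asyncio.gather(*flat)
    out: dict[str, list] = {m: [] for m in coros_by_modality}
    for m, r in zip(keys, results):
        out[m].append(r)
    return out


async def main() -> dict[str, list]:
    work = {
        "image": [async_load(b"img-1"), async_load(b"img-2")],
        "audio": [async_load(b"aud-1")],
    }
    return await retrieve_all(work)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With all the loads in one gather, total latency approaches the slowest single item rather than the sum per modality.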

Fix 2: Faster Video Frame Decoding

Before: CAP_PROP_POS_FRAMES seek per frame — each seek decodes from the nearest keyframe to the target frame. Gets worse as GOP size grows (real H.264 videos: 1–8s keyframe intervals).

After:

  • Single sequential forward scan with grab() — no per-frame seek
  • PIL Image.fromarray + ToTensor replaced with direct numpy.transpose + torch.from_numpy (7.3× faster tensor conversion)
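A sketch of the scan-and-convert approach (function names here are illustrative, not the exact ones in utils.py; load_video_frames assumes opencv-python and torch are installed):

```python
import numpy as np


def sample_indices(total_frames: int, num_frames: int) -> list[int]:
    """Evenly spaced frame indices; short clips just return every frame."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames).astype(int).tolist()


def frames_to_tensor(frames: list):
    """HWC uint8 frames -> (N, C, H, W) float tensor, no PIL round-trip."""
    import torch  # imported here so sample_indices stays usable without torch

    stacked = np.stack(frames)           # (N, H, W, C)
    chw = stacked.transpose(0, 3, 1, 2)  # (N, C, H, W)
    return torch.from_numpy(chw).float().div_(255.0)


def load_video_frames(path: str, num_frames: int = 32):
    """Sequential forward scan: grab() every frame, retrieve() only sampled ones."""
    import cv2  # lazy import; requires opencv-python

    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(sample_indices(total, num_frames))
    frames = []
    for idx in range(total):
        if not cap.grab():  # cheap decode-advance, no keyframe seek
            break
        if idx in wanted:
            ok, frame = cap.retrieve()  # full decode only for sampled frames
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames_to_tensor(frames)
```

grab() advances the decoder without the per-frame CAP_PROP_POS_FRAMES seek, so decode cost stays linear in video length instead of growing with GOP size.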

Results

Baseline: original per-frame seek + PIL conversion, 32 frames extracted, H.264.

Single request latency

Scenario Before After Speedup
360p, 5s 501 ms 79 ms 6.3×
720p, 5s 628 ms 116 ms 5.4×
1080p, 5s 843 ms 222 ms 3.8×
1080p, 60s 878 ms 560 ms 1.6×

Server throughput (8 concurrent workers)

Scenario Before After Gain
360p, 5s 25 r/s 73 r/s 2.9×
1080p, 5s 4.3 r/s 12.9 r/s 3.0×
1080p, 60s 4.0 r/s 5.3 r/s 1.3×

Short videos benefit most — seek overhead dominates decode time for short clips and is eliminated entirely by the sequential scan.

Files Changed

  • tensorrt_llm/inputs/utils.py
  • tests/unittest/inputs/test_async_media_loading.py (13 new tests)

Summary by CodeRabbit

  • New Features

    • Implemented efficient HTTP session reuse for media downloads.
    • Added concurrent gathering of multimodal data for improved performance.
  • Bug Fixes

    • Optimized video frame extraction logic with improved frame sampling and processing.
  • Tests

    • Added comprehensive test coverage for async media loading, session management, and concurrent data retrieval operations.

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

The changes significantly refactor async media loading utilities to improve performance through global session reuse and concurrent operations while offloading CPU-bound work to thread executors. Video frame extraction logic is reworked to use forward scanning with optimized sampling, and modality retrieval is changed to concurrent gathering.

Changes

Cohort / File(s) Summary
Async Media Loading Core
tensorrt_llm/inputs/utils.py
Introduced global aiohttp session reuse via _get_aiohttp_session(). Updated async_load_image, async_load_video, and async_load_audio to offload CPU-bound operations (decoding, conversions) to asyncio thread pool executor. Reworked _load_video_by_cv2 with forward frame scanning, conditional linspace/range sampling, and direct numpy-to-torch tensor conversion for PyTorch format. Modified MultimodalDataTracker.retrieve_all_async() to gather all modality coroutines concurrently rather than sequentially per modality.
Async Media Loading Tests
tests/unittest/inputs/test_async_media_loading.py
New test module validating async image/audio loading with format variations, thread executor offloading behavior, aiohttp session reuse and lifecycle, and concurrent modality retrieval in MultimodalDataTracker with timing assertions.

Sequence Diagram

sequenceDiagram
    participant Caller as Caller
    participant Tracker as MultimodalDataTracker
    participant EventLoop as Event Loop
    participant Gather as asyncio.gather()
    participant RemoteA as Remote Server (Audio)
    participant RemoteV as Remote Server (Video)
    participant ThreadPool as Thread Pool Executor
    
    Caller->>Tracker: retrieve_all_async()
    Tracker->>EventLoop: Collect all coroutines<br/>(data + embeddings)<br/>across modalities
    Tracker->>Gather: asyncio.gather(*all_coroutines)
    par Concurrent Fetch
        Gather->>RemoteA: fetch audio_1
        Gather->>RemoteV: fetch video_1
        Gather->>RemoteA: fetch audio_2
        Gather->>RemoteV: fetch video_2
    end
    par CPU-Bound Offload
        RemoteA->>ThreadPool: decode audio
        RemoteV->>ThreadPool: extract frames
    end
    ThreadPool-->>Gather: decoded results
    Gather-->>Tracker: all results
    Tracker->>Tracker: Regroup by modality
    Tracker-->>Caller: {modality: {data, embeddings}}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 31.25%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
Title check (✅ Passed): The title accurately reflects both main changes: async media loading optimization and video frame decoding improvements.
Description check (✅ Passed): The PR description covers both fixes, detailed performance metrics, and test coverage, addressing all key template requirements.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
tensorrt_llm/inputs/utils.py (2)

34-45: Consider adding a cleanup function for the global session.

The global aiohttp.ClientSession is created lazily but there's no corresponding close_aiohttp_session() function for graceful shutdown. While connections are cleaned up on process exit, long-running servers that reload modules or undergo hot-restarts may leak connections.

♻️ Proposed addition for session cleanup
 async def _get_aiohttp_session() -> aiohttp.ClientSession:
     """Return the shared aiohttp.ClientSession, creating it on first call."""
     global _global_aiohttp_session
     if _global_aiohttp_session is None or _global_aiohttp_session.closed:
         _global_aiohttp_session = aiohttp.ClientSession()
     return _global_aiohttp_session
+
+
+async def _close_aiohttp_session() -> None:
+    """Close the shared aiohttp.ClientSession if open."""
+    global _global_aiohttp_session
+    if _global_aiohttp_session is not None and not _global_aiohttp_session.closed:
+        await _global_aiohttp_session.close()
+    _global_aiohttp_session = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 34 - 45, Add a cleanup function to
close the lazily-created shared aiohttp session to avoid leaking connections:
implement an async close_aiohttp_session() that checks the module-level
_global_aiohttp_session (and its .closed flag), awaits its .close() if open, and
sets _global_aiohttp_session to None; call or expose this function for graceful
shutdowns and document using it alongside _get_aiohttp_session().

533-534: Add strict=True to zip() for defensive programming.

While the lengths are guaranteed equal by construction (both derived from the same asyncio.gather call), using strict=True would catch any future bugs that might break this invariant.

♻️ Proposed fix
-            for modality, result in zip(modality_keys, results):
+            for modality, result in zip(modality_keys, results, strict=True):
                 out[modality].append(result)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/inputs/utils.py` around lines 533 - 534, The loop pairing
modality_keys and results should use defensive zip checking: change the zip call
in the block that iterates "for modality, result in zip(modality_keys,
results):" to use zip(..., strict=True) so mismatched lengths raise immediately;
update any tests or callers if they rely on silent truncation. Ensure this
change is applied where modality_keys and results are assembled in
tensorrt_llm/inputs/utils.py so the strictness guards the invariant.
tests/unittest/inputs/test_async_media_loading.py (3)

239-261: Timing-based test may be flaky under CI load.

The concurrency test relies on timing (0.15s delay with 0.08s tolerance). While the approach is sound and the values are reasonable, heavily loaded CI runners might occasionally cause flakiness. Consider increasing the tolerance slightly (e.g., 0.12s) or adding pytest.mark.flaky if flakiness is observed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 239 - 261,
The timing-based concurrency assertion in the test uses DELAY = 0.15 and
TOLERANCE = 0.08 which can be flaky on loaded CI runners; update the test to
increase the tolerance (e.g., set TOLERANCE = 0.12) or mark the test flaky
(e.g., add pytest.mark.flaky) so failures under high load are tolerated;
specifically change the TOLERANCE constant referenced next to DELAY and keep the
rest of the test (the _slow coroutine and the call to
tracker.retrieve_all_async()) unchanged.

114-120: Temp file not cleaned up after test.

Using delete=False without explicit cleanup leaves test artifacts on disk. Consider using pytest's tmp_path fixture or adding explicit cleanup.

♻️ Proposed fix using tmp_path fixture
     `@pytest.mark.asyncio`
-    async def test_load_audio_from_file(self):
-        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
-            path = _make_audio_file(f.name)
+    async def test_load_audio_from_file(self, tmp_path):
+        path = _make_audio_file(str(tmp_path / "test.wav"))
         result = await async_load_audio(path)
         audio_array, sample_rate = result
         assert isinstance(audio_array, np.ndarray)
         assert sample_rate == 16000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 114 - 120,
The test test_load_audio_from_file currently creates a temp file with
tempfile.NamedTemporaryFile(delete=False) and never removes it; either switch
the test to use pytest's tmp_path fixture to create a disposable path and write
the WAV via _make_audio_file there (use tmp_path / "test.wav") or ensure
explicit cleanup by removing the created file (os.remove(path)) in a
finally/teardown after calling async_load_audio; update references to
tempfile.NamedTemporaryFile, _make_audio_file, and async_load_audio accordingly
so no temp artifacts remain.

134-135: Same temp file cleanup issue.

This test also uses delete=False without explicit cleanup. Apply the same tmp_path fixture pattern suggested above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/inputs/test_async_media_loading.py` around lines 134 - 135,
The test currently creates a temp WAV with tempfile.NamedTemporaryFile(...,
delete=False) and never removes it; replace that pattern with the pytest
tmp_path fixture: create a Path under tmp_path (e.g., tmp_path / "test.wav"),
pass its str to _make_audio_file (or write the needed contents there), and use
that path for the test so pytest will manage cleanup; update the test function
signature to accept tmp_path and remove the tempfile.NamedTemporaryFile usage in
test_async_media_loading.py around the _make_audio_file call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/inputs/test_async_media_loading.py`:
- Line 17: Remove the unused import "defaultdict" from the top-level imports
(the line containing "from collections import defaultdict"); update any imports
only if actually referenced elsewhere (e.g., keep other collections imports
intact) so the unused symbol defaultdict is deleted to satisfy the
linter/pre-commit checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6dcb14eb-8af7-4199-b355-3c424431e79d

📥 Commits

Reviewing files that changed from the base of the PR and between f2f9051 and 81b73a8.

📒 Files selected for processing (2)
  • tensorrt_llm/inputs/utils.py
  • tests/unittest/inputs/test_async_media_loading.py

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43226 [ run ] triggered by Bot. Commit: 6f35282 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43226 [ run ] completed with state DISABLED
Freeze main and open the PR merge only after CI is back to healthy https://nvidia.slack.com/archives/C059LSY62BT/p1776141760843319?thread_ts=1775985925.442509&cid=C059LSY62BT

Link to invocation

Collaborator

@2ez4bz 2ez4bz left a comment


Approving to unblock, but please consider the comments 🙏

@yechank-nvidia yechank-nvidia force-pushed the async_media_load branch 2 times, most recently from a49f565 to 0aefb69 Compare April 15, 2026 02:06
@yechank-nvidia yechank-nvidia changed the title [TRTLLM-11872][perf] Optimize async media loading and video frame decoding in trtllm-serve [TRTLLM-11872][perf] Multi-threading async media loading and optimizing video frame decoding in trtllm-serve Apr 15, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43347 [ run ] triggered by Bot. Commit: 0c8b86a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43347 [ run ] completed with state FAILURE. Commit: 0c8b86a
/LLM/main/L0_MergeRequest_PR pipeline #33885 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@yechank-nvidia yechank-nvidia added the Multimodal label (issues & PRs regarding Multimodal related objects) Apr 15, 2026
@tensorrt-cicd
Collaborator

PR_Github #43378 [ run ] triggered by Bot. Commit: 8b3671a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43378 [ run ] completed with state SUCCESS. Commit: 8b3671a
/LLM/main/L0_MergeRequest_PR pipeline #33912 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43441 [ run ] triggered by Bot. Commit: 8b3671a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43441 [ run ] completed with state SUCCESS. Commit: 8b3671a
/LLM/main/L0_MergeRequest_PR pipeline #33969 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43493 [ run ] triggered by Bot. Commit: ac02802 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43493 [ run ] completed with state SUCCESS. Commit: ac02802
/LLM/main/L0_MergeRequest_PR pipeline #34009 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz 2ez4bz force-pushed the async_media_load branch from ac02802 to a045c78 Compare April 15, 2026 19:56
@2ez4bz
Collaborator

2ez4bz commented Apr 15, 2026

/bot run --disable-fail-fast

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43757 [ run ] triggered by Bot. Commit: 6bf5169 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43757 [ run ] completed with state SUCCESS. Commit: 6bf5169
/LLM/main/L0_MergeRequest_PR pipeline #34239 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator

2ez4bz commented Apr 16, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43843 [ run ] triggered by Bot. Commit: 6bf5169 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43843 [ run ] completed with state FAILURE. Commit: 6bf5169
/LLM/main/L0_MergeRequest_PR pipeline #34304 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation


Labels

Multimodal: label for issues & PRs regarding Multimodal related objects
