MD-TRT Support, Compile/Export, C++ and Python #4183
Open
narendasan wants to merge 21 commits into main from push-vqqzkszwrvyx
Changes from all commits (21):
- 808ee01 (apbose): Multi-Device TensorRT Runtime with Native NCCL Collectives
- aaa6557 (apbose): removing the try-except block in TRTengine.cpp and correcting the typis
- 4c1e68d (apbose): Redesign distributed inference API: auto-detect rank, lazy NCCL setup…
- 7cfa40b (apbose): remove nccl.h dependancy
- ac96255 (apbose): clean up import and add comment
- fe1c6f4 (apbose): moving setup_nccl_library call to example script
- b658c7a (apbose): work on the save/load export part-add is_md flag, guard export tracin…
- a35dfe6 (narendasan): refactor: Adjusting how we use NCCL
- 3def3f7 (narendasan): fix: enable torch.compile(backend='tensorrt') for LLMs with dynamic s…
- 2aa8f14 (narendasan): test: add torch.compile(backend='tensorrt') integration test for Llam…
- 6f81a66 (narendasan): feat: llama3.2 working with MD-TRT
- 0d2d61c (narendasan): feat: Support exported and serialization workflows for MD-TRT
- e08b0c5 (narendasan): ci: fix nccl builds in CI
- 754b62b (narendasan): chore: Some reorg and cleaning the constructor
- bf432ad (narendasan): fix: thread the MD-TRT requirement through the conversion system
- f4e77ad (narendasan): fix: DeviceMesh FakeScriptObjects get passed in as arguments into tor…
- 6ba00cf (narendasan): fix: Address segfaults when a distributed context is manually destroy…
- edf6518 (apbose): replacing torchrun with torchtrtrun for right .so
- 9e390eb (narendasan): chore: apply linting
- 1b4e559 (apbose): use correct group for dummy all_reduce
- df51acf (apbose): Broaden NCCL skip guards to include native TRT collectives and fix di…
```diff
@@ -162,9 +162,7 @@ void setup_input_tensors(
   // Get tensor address, using placeholder for empty tensors
   // TensorRT requires non-null address even if numel() = 0
   // empty_tensor_placeholder is pre-allocated in TRTEngine constructor
-  void* input_addr = (final_input.numel() == 0 || final_input.data_ptr() == nullptr)
-      ? compiled_engine->empty_tensor_placeholder
-      : final_input.data_ptr();
+  void* input_addr = final_input.numel() == 0 ? compiled_engine->empty_tensor_placeholder : final_input.data_ptr();

   TORCHTRT_CHECK(
       compiled_engine->exec_ctx->setTensorAddress(name.c_str(), input_addr),
```
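The simplified line above relies on TensorRT's requirement that tensor bindings always receive a non-null address, even for zero-element tensors, which is why a pre-allocated placeholder buffer is substituted. A minimal sketch of that selection logic, using hypothetical `FakeTensor`/`FakeEngine` stand-ins rather than the real `at::Tensor` and `TRTEngine` types:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the tensor and engine types in the diff.
struct FakeTensor {
  void* data = nullptr;  // backing storage, may be null for empty tensors
  size_t numel_ = 0;     // number of elements
  size_t numel() const { return numel_; }
  void* data_ptr() const { return data; }
};

struct FakeEngine {
  // Pre-allocated dummy buffer, analogous to empty_tensor_placeholder:
  // the runtime needs a non-null address even when numel() == 0.
  alignas(16) unsigned char placeholder[16] = {};
};

// Mirrors the simplified selection in the diff: an empty tensor gets the
// placeholder address; everything else uses its real storage.
inline void* select_address(const FakeTensor& t, FakeEngine& engine) {
  return t.numel() == 0 ? static_cast<void*>(engine.placeholder) : t.data_ptr();
}
```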
```diff
@@ -209,6 +207,27 @@ void create_output_allocator(c10::intrusive_ptr<TRTEngine> compiled_engine) {
 }

 std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intrusive_ptr<TRTEngine> compiled_engine) {
+  // All inputs are expected to be on CUDA. Warn and move any that are not.
+  for (auto& inp : inputs) {
```
Review comment (Collaborator, PR author): I would like to remove this but didn't have time to check if the device operations in Python suppress this correctly.
```diff
+    if (inp.defined() && !inp.is_cuda()) {
+      LOG_WARNING(
+          "Input tensor is not on a CUDA device. Moving it to CUDA automatically. "
+          "For best performance, ensure all inputs are on the correct CUDA device before "
+          "calling the TensorRT engine (e.g. tensor.cuda() or tensor.to(device)).");
+      inp = inp.cuda();
+    }
+  }
```
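The warn-and-move pass above normalizes input devices before execution. A self-contained sketch of the same pattern, with a hypothetical `MiniTensor` standing in for `at::Tensor` and a string vector standing in for `LOG_WARNING`:

```cpp
#include <string>
#include <vector>

// Hypothetical miniature of the warn-and-move pass in the diff: inputs that
// are defined but on the wrong device are moved, and a warning is recorded.
struct MiniTensor {
  bool defined = true;
  bool on_cuda = false;
};

// Returns the number of tensors that had to be moved; `warnings` collects
// one message per moved tensor (stand-in for LOG_WARNING).
inline int ensure_inputs_on_cuda(std::vector<MiniTensor>& inputs,
                                 std::vector<std::string>& warnings) {
  int moved = 0;
  for (auto& inp : inputs) {
    if (inp.defined && !inp.on_cuda) {
      warnings.push_back(
          "Input tensor is not on a CUDA device. Moving it to CUDA automatically.");
      inp.on_cuda = true;  // analogous to inp = inp.cuda();
      ++moved;
    }
  }
  return moved;
}
```

Note that undefined tensors are skipped entirely, matching the `inp.defined()` guard in the diff.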
```diff
+#ifdef ENABLE_TRT_NCCL_COLLECTIVES
+  // Lazy one-shot NCCL bind: fires on the first real execute_engine call when
+  // the constructor-time bind was deferred (e.g. no collective had been issued
+  // at construction time, or for serialized programs loaded inline where there
+  // is no Python _TorchTensorRTModule.forward wrapper).
+  if (compiled_engine->is_md && !compiled_engine->nccl_initialized) {
```
Review comment (Collaborator, PR author): Not entirely sure this is necessary.
```diff
+    compiled_engine->bind_nccl_comm();
+  }
+#endif

   torch::Tensor dynamic_workspace;
   if (compiled_engine->resource_allocation_strategy == TRTEngine::ResourceAllocationStrategy::kDynamic) {
     dynamic_workspace = torch::empty(compiled_engine->cuda_engine->getDeviceMemorySizeV2(), {torch::kCUDA});
```
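The guarded call above is a lazy one-shot initialization: the communicator bind is deferred until the first execution that actually needs it, then never repeated. A sketch of that guard pattern with a hypothetical `MiniEngine` (the `bind_calls` counter is instrumentation added for the sketch, not part of the real `TRTEngine`):

```cpp
// Hypothetical sketch of the lazy one-shot bind guard in the diff: the
// expensive bind runs at most once, on the first execute call that needs it.
struct MiniEngine {
  bool is_md = true;            // multi-device engine flag (mirrors is_md)
  bool nccl_initialized = false;
  int bind_calls = 0;           // instrumentation for the sketch only

  void bind_nccl_comm() {       // stand-in for the real communicator bind
    ++bind_calls;
    nccl_initialized = true;
  }

  void execute() {
    // Fires only when a constructor-time bind was deferred; subsequent
    // calls see nccl_initialized == true and skip the bind.
    if (is_md && !nccl_initialized) {
      bind_nccl_comm();
    }
    // ... enqueue the TensorRT engine here ...
  }
};
```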
Review comment: We should only do this if there is exactly one available group. If multiple NCCL groups are available, we should tell the user to select one manually.