Commit ef0662c

Torch-TensorRT Github Bot authored and cehongwang committed
docs: [Automated] Regenerating documenation for d97cb7a
Signed-off-by: Torch-TensorRT Github Bot <torch-tensorrt.github.bot@nvidia.com>
1 parent 2e26bfa commit ef0662c

62 files changed: 1908 additions & 1905 deletions


docsrc/contributors/complex_number_support.rst

Lines changed: 2 additions & 3 deletions
@@ -128,9 +128,8 @@ runtime modules handle the conversion:
 * ``prepare_inputs`` (``dynamo/utils.py``) — builds the ``Input`` spec with the
   ``view_as_real`` shape/dtype but retains the original complex tensor in
   ``inp.torch_tensor`` for tracing.
-* ``_PythonTorchTensorRTModule.forward`` — applies ``torch.view_as_real(i).contiguous()``
-  for each complex input before feeding it to the engine.
-* ``_TorchTensorRTModule.forward`` — same ``view_as_real`` conversion.
+* ``TorchTensorRTModule.forward`` — applies ``torch.view_as_real(i).contiguous()``
+  for each complex input before feeding tensors to ``execute_engine`` / ``execute_engine_python``.

 Key Implementation Invariants
 -------------------------------
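
As a concrete illustration of the conversion described in the updated bullet, a minimal standalone sketch (plain PyTorch, not the runtime code itself):

.. code-block:: python

    import torch

    # A complex64 tensor of shape (2, 3) is viewed as a float32 tensor of
    # shape (2, 3, 2); the trailing dimension holds (real, imag) pairs.
    x = torch.randn(2, 3, dtype=torch.complex64)
    engine_input = torch.view_as_real(x).contiguous()
    assert engine_input.shape == (2, 3, 2)
    assert engine_input.dtype == torch.float32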

docsrc/contributors/cuda_graphs.rst

Lines changed: 2 additions & 2 deletions
@@ -93,8 +93,8 @@ Subsequent inference launches the instantiated graph instead of calling
 Graph Storage
 ^^^^^^^^^^^^^

-Each runtime module (both C++ ``TorchTensorRTModule`` and Python
-``PythonTorchTensorRTModule``) stores a ``cudaGraphExec_t`` instance. When
+``TorchTensorRTModule`` (C++ or Python execution path) may record a CUDA graph for
+engine execution when CUDA graphs are enabled at runtime. When
 ``use_cuda_graph=True`` is set at compile time the runtime records one graph
 per engine for the first input shape encountered.
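
For background, the record-once / replay-afterwards behavior described in this hunk follows the standard PyTorch CUDA graph pattern. A minimal sketch, assuming a stand-in ``module`` with a static input shape (the runtime itself records at the engine-execution level, not through ``torch.cuda.graph``):

.. code-block:: python

    import torch

    module = torch.nn.Conv2d(3, 8, 3).cuda()  # stand-in for the TRT engine call
    static_input = torch.randn(1, 3, 224, 224, device="cuda")

    # Warm up on a side stream before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_output = module(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Record one graph for the first input shape encountered.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = module(static_input)

    # Replay: copy new data into the captured input buffer, then launch the graph.
    static_input.copy_(torch.randn(1, 3, 224, 224, device="cuda"))
    g.replay()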

docsrc/debugging/troubleshooting.rst

Lines changed: 7 additions & 5 deletions
@@ -126,8 +126,10 @@ Runtime Errors
   the engine. Upgrade TRT or rebuild with ``version_compatible=True``.
 * The GPU compute capability is lower than on the build machine. Rebuild with
   ``hardware_compatible=True`` (requires Ampere or newer).
-* The ``.ep`` file was generated with ``use_python_runtime=True`` which is not
-  serializable. Rebuild with the default C++ runtime.
+* The ``.ep`` export path does not support your compiled module layout (e.g. mixed
+  Python-runtime subgraphs in a specific exporter version). Try the default C++ path
+  at compile time or use ``torch_tensorrt`` module save/load APIs that preserve
+  ``TorchTensorRTModule`` state.

 **Shape mismatch at runtime / "Invalid input shape"**

@@ -153,9 +155,9 @@ Runtime Errors
 The model contains data-dependent-shape ops (``nonzero``, ``unique``,
 ``masked_select``, etc.) which require TRT's output allocator.

-* Use ``PythonTorchTensorRTModule`` (``use_python_runtime=True``) — it
-  activates the dynamic output allocator automatically via
-  ``requires_output_allocator=True``.
+* Use :func:`~torch_tensorrt.runtime.set_runtime_backend` with ``"python"`` or use a module with
+  ``requires_output_allocator=True`` so the runtime can use TRT's output allocator
+  on the Python execution path when needed.
 * See :ref:`cuda_graphs` for ``DynamicOutputAllocator`` details.

 ----
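
Concretely, the suggested workaround might look like the following sketch, using the backend-selection API documented in this commit (``exported_program`` and ``inputs`` are placeholders):

.. code-block:: python

    import torch_tensorrt as tt

    # Scope the Python execution path so engines with data-dependent-shape ops
    # (nonzero, unique, masked_select, ...) can use TRT's output allocator.
    with tt.runtime.set_runtime_backend("python"):
        trt_gm = tt.dynamo.compile(exported_program, arg_inputs=inputs)
        out = trt_gm(*inputs)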

docsrc/py_api/runtime.rst

Lines changed: 16 additions & 0 deletions
@@ -27,13 +27,29 @@ Functions

 .. autofunction:: enable_output_allocator

+Runtime backend selection
+-------------------------
+
+.. autofunction:: torch_tensorrt.runtime.get_runtime_backend
+
+.. autofunction:: torch_tensorrt.runtime.set_runtime_backend
+
 Classes
 ---------

 .. autoclass:: TorchTensorRTModule
    :members:
    :special-members: __init__
+   :show-inheritance:
+
+   Single runtime module for TensorRT engines. Dispatches to the C++ or Python execution
+   implementation based on :func:`~torch_tensorrt.runtime.get_runtime_backend` /
+   :func:`~torch_tensorrt.runtime.set_runtime_backend`. See :ref:`python_runtime`.

 .. autoclass:: PythonTorchTensorRTModule
    :members:
    :special-members: __init__
+   :show-inheritance:
+
+   Subclass of ``TorchTensorRTModule`` that **pins** the Python engine path. Prefer
+   ``TorchTensorRTModule`` plus compile flags unless you need this guarantee. See :ref:`python_runtime`.
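
A short usage sketch for the two new functions; the exact strings returned by ``get_runtime_backend`` (e.g. ``"cpp"``) are an assumption here:

.. code-block:: python

    import torch_tensorrt as tt

    default = tt.runtime.get_runtime_backend()  # process-wide default

    with tt.runtime.set_runtime_backend("python"):
        assert tt.runtime.get_runtime_backend() == "python"
    # The previous default applies again after the context exits.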

docsrc/tutorials/runtime_opt/index.rst

Lines changed: 2 additions & 2 deletions
@@ -2,12 +2,12 @@ Runtime Optimization
 =====================

 Optimize inference throughput and latency: CUDA Graphs for kernel-replay,
-pre-allocated output buffers, and the Python runtime module.
+pre-allocated output buffers, and choosing the Python vs C++ TRT execution path.

 .. toctree::
    :maxdepth: 1

    cuda_graphs
    Example: Torch Export with Cudagraphs <../_rendered_examples/dynamo/torch_export_cudagraphs>
    Example: Pre-allocated output buffer <../_rendered_examples/dynamo/pre_allocated_output_example>
-   python_runtime
+   Python vs C++ runtime <python_runtime>

docsrc/tutorials/runtime_opt/python_runtime.rst

Lines changed: 77 additions & 75 deletions

@@ -1,96 +1,103 @@
 .. _python_runtime:

-Python Runtime
-==============
+Python vs C++ runtime
+=====================

-Torch-TensorRT provides two runtime backends for executing compiled TRT engines
-inside a PyTorch graph:
+Torch-TensorRT uses a single module type, :class:`~torch_tensorrt.runtime.TorchTensorRTModule`,
+to run TensorRT engines inside PyTorch. The **execution path** (which code actually drives
+``execute_async``) is selected at runtime:

-* **C++ runtime** (default) — ``TorchTensorRTModule`` backed by a C++ TorchBind class.
-  Fully serializable, supports CUDAGraphs, multi-device safe.
-* **Python runtime** — ``PythonTorchTensorRTModule`` backed entirely by the TRT Python
-  API. Simpler to instrument for debugging but **not serializable** to
-  ``ExportedProgram``.
+* **C++ path** — ``torch.classes.tensorrt.Engine`` and ``torch.ops.tensorrt.execute_engine``.
+  Preferred for production when the Torch-TensorRT C++ extension is available: TorchScript-friendly,
+  and integrates with the full C++ runtime stack.
+* **Python path** — internal ``PythonTRTEngine`` plus
+  ``torch.ops.tensorrt.execute_engine_python``. Useful when the C++ extension is absent, or when
+  you want easier Python-level debugging and instrumentation.
+
+:class:`~torch_tensorrt.runtime.PythonTorchTensorRTModule` is a **thin subclass** of
+``TorchTensorRTModule`` that **pins** the Python path (same constructor and behavior, but always
+resolves to the Python engine). Prefer ``TorchTensorRTModule`` plus the global backend APIs below
+when you do not need that pin.

 ----

-When to Use the Python Runtime
---------------------------------
+When to use the Python path
+---------------------------

-Use ``use_python_runtime=True`` when:
+Use :func:`~torch_tensorrt.runtime.set_runtime_backend` (typically as a context manager) when:

-* You need to run on a machine where the C++ Torch-TensorRT library is not installed
-  (e.g., a minimal CI container with only the Python wheel).
-* You want to attach Python-level callbacks to the engine execution (via
-  :ref:`observer`) for debugging or profiling without building the C++ extension.
-* You are debugging a conversion issue and want to step through TRT execution in Python.
+* The C++ Torch-TensorRT library is not installed (e.g. a minimal environment with only the Python pieces).
+* You want Python-level hooks (e.g. :ref:`observer`) without relying on the C++ extension.
+* You are debugging conversion or execution and want to break inside the Python TRT wrapper.

-Use the default C++ runtime in all other cases, especially:
+Prefer the C++ path when:

-* When saving a compiled module to disk (``torch_tensorrt.save()``).
-* When using CUDAGraphs for low-latency inference.
-* In production deployments.
+* You rely on the default Torch-TensorRT deployment story and maximum parity with TorchScript export.
+* You use whole-graph CUDAGraph wrappers that assume the C++ runtime (see :ref:`cuda_graphs`).

 ----

-Enabling the Python Runtime
------------------------------
+Enabling the Python path
+------------------------
+
+**Process-wide default (context manager)**

 .. code-block:: python

-    import torch_tensorrt
+    import torch_tensorrt as tt

-    trt_gm = torch_tensorrt.dynamo.compile(
-        exported_program,
-        arg_inputs=inputs,
-        use_python_runtime=True,
-    )
+    with tt.runtime.set_runtime_backend("python"):
+        trt_gm = tt.dynamo.compile(exported_program, inputs)

-Or via ``torch.compile``:
+**``torch.compile``** (same context manager around compile / first run)

 .. code-block:: python

-    trt_model = torch.compile(
-        model,
-        backend="tensorrt",
-        options={"use_python_runtime": True},
-    )
+    import torch_tensorrt as tt

-----
+    with tt.runtime.set_runtime_backend("python"):
+        trt_model = torch.compile(model, backend="tensorrt", options={})

-Limitations
------------
+The context manager does **not** replace :class:`~torch_tensorrt.runtime.PythonTorchTensorRTModule`,
+which always requests the Python path via a class-level pin.

-* **Not serializable**: ``PythonTorchTensorRTModule`` cannot be saved via
-  ``torch_tensorrt.save()`` as an ``ExportedProgram`` or loaded back. The module is
-  Python-only in-process.
+----

-.. code-block:: python
+Serialization
+---------------

-    # This will raise an error with use_python_runtime=True:
-    torch_tensorrt.save(trt_gm, "model.ep", arg_inputs=inputs)
+Module state records which backend was used (``runtime_backend`` in packed metadata). After load,
+``TorchTensorRTModule`` reconstructs either the C++ engine or the Python engine wrapper
+as appropriate. Some **export** workflows (e.g. certain ``ExportedProgram`` save paths) may still
+assume a C++-only graph; validate your deployment path if you mix Python execution with AOT export.

-* **No C++ deployment**: The compiled module cannot be exported to AOTInductor or used
-  in a C++ application without re-compiling with the C++ runtime.
+----

-* **CUDAGraphs**: Whole-graph CUDAGraphs work with the Python runtime, but the
-  per-submodule CUDAGraph recording in ``CudaGraphsTorchTensorRTModule`` is
-  only available with the C++ runtime.
+Limitations
+-----------
+
+* **C++ deployment**: A module that executed on the Python path still needs TensorRT and the
+  Torch-TensorRT Python pieces available in-process unless you recompile targeting the C++ path.
+* **CUDAGraphs**: Whole-graph CUDAGraph wrappers may assume the C++ runtime for some configurations;
+  see :ref:`cuda_graphs`.
+* **Explicit allocator engines**: Engines with data-dependent outputs may set
+  ``requires_output_allocator=True``; the unified module supports the output-allocator execution
+  mode on the Python path. See :ref:`cuda_graphs` for interaction with CUDA graphs.

 ----

-``PythonTorchTensorRTModule`` Direct Instantiation
-----------------------------------------------------
+``PythonTorchTensorRTModule`` direct instantiation
+--------------------------------------------------

-You can instantiate ``PythonTorchTensorRTModule`` directly from raw engine bytes,
-for example when integrating a TRT engine built outside of Torch-TensorRT:
+You can instantiate :class:`~torch_tensorrt.runtime.PythonTorchTensorRTModule` from raw engine bytes
+when you need a **guaranteed** Python execution path (e.g. integrating an engine built outside
+Torch-TensorRT):

 .. code-block:: python

     from torch_tensorrt.dynamo.runtime import PythonTorchTensorRTModule
     from torch_tensorrt.dynamo._settings import CompilationSettings

-    # Load raw engine bytes (e.g., from trtexec output or torch_tensorrt.dynamo.convert_*)
     with open("model.engine", "rb") as f:
         engine_bytes = f.read()

@@ -104,37 +111,32 @@ for example when integrating a TRT engine built outside of Torch-TensorRT:

     output = module(torch.randn(1, 3, 224, 224).cuda())

-**Constructor arguments:**
+**Constructor arguments** (same as ``TorchTensorRTModule``):

 ``serialized_engine`` (``bytes``)
-    The raw serialized TRT engine bytes.
-
-``input_binding_names`` (``List[str]``)
-    TRT input binding names in the order they are passed to ``forward()``.
+    Raw serialized TRT engine.

-``output_binding_names`` (``List[str]``)
-    TRT output binding names in the order they should be returned.
+``input_binding_names`` / ``output_binding_names`` (``List[str]``)
+    Binding names in ``forward`` order.

 ``name`` (``str``, optional)
-    Human-readable name for the module (used in logging).
+    Name for logging and serialization.

-``settings`` (``CompilationSettings``, optional)
-    The compilation settings used to build the engine. Used to determine device
-    placement and other runtime behaviors.
+``settings`` (:class:`~torch_tensorrt.dynamo._settings.CompilationSettings`, optional)
+    Device and runtime options (must match how the engine was built).

 ``weight_name_map`` (``dict``, optional)
-    Mapping of TRT weight names to PyTorch state dict names. Required for refit
-    support via :func:`~torch_tensorrt.dynamo.refit_module_weights`.
+    For refit workflows; see :func:`~torch_tensorrt.dynamo.refit_module_weights`.

-``requires_output_allocator`` (``bool``, default ``False``)
-    Set to ``True`` if the engine contains data-dependent-shape ops (``nonzero``,
-    ``unique``, etc.) that require TRT's output allocator.
+``requires_output_allocator`` (``bool``)
+    Set ``True`` for data-dependent-shape ops that need TRT's output allocator.

 ----

-Runtime Selection Logic
-------------------------
+Runtime selection summary
+-------------------------

-When ``use_python_runtime`` is ``None`` (auto-select), Torch-TensorRT tries to import
-the C++ TorchBind class. If the C++ extension is not available it silently falls back to
-the Python runtime. Pass ``True`` or ``False`` to force a specific runtime.
+* :func:`~torch_tensorrt.runtime.get_runtime_backend` / :func:`~torch_tensorrt.runtime.set_runtime_backend`
+  — process default for newly created ``TorchTensorRTModule`` instances (unless a subclass pins a backend).
+  Use ``set_runtime_backend`` as a context manager to scope C++ vs Python for compile and forward.
+* If the C++ extension is **not** built, only the Python path is available.
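
To make the pin vs. process-default distinction concrete, a sketch reusing placeholders from the page above (``exported_program``, ``inputs``, ``engine_bytes``; the binding names are hypothetical):

.. code-block:: python

    import torch_tensorrt as tt
    from torch_tensorrt.dynamo.runtime import PythonTorchTensorRTModule

    # Process default: modules created in this scope resolve to the Python path.
    with tt.runtime.set_runtime_backend("python"):
        trt_gm = tt.dynamo.compile(exported_program, inputs)

    # Class-level pin: always the Python path, regardless of the process default.
    pinned = PythonTorchTensorRTModule(
        serialized_engine=engine_bytes,
        input_binding_names=["x"],    # hypothetical binding names
        output_binding_names=["y"],
    )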

examples/apps/flux_demo.py

Lines changed: 2 additions & 2 deletions
@@ -125,7 +125,6 @@ def forward_loop(mod):
         "enabled_precisions": enabled_precisions,
         "truncate_double": True,
         "min_block_size": 1,
-        "use_python_runtime": True,
         "immutable_weights": False,
         "offload_module_to_cpu": args.low_vram_mode,
         "use_explicit_typing": use_explicit_typing,

@@ -136,7 +135,8 @@ def forward_loop(mod):
     remove_hook_from_module(pipe.transformer, recurse=True)
     pipe.transformer.to(DEVICE)

-    trt_gm = torch_tensorrt.MutableTorchTensorRTModule(backbone, **settings)
+    with torch_tensorrt.runtime.set_runtime_backend("python"):
+        trt_gm = torch_tensorrt.MutableTorchTensorRTModule(backbone, **settings)
     if dynamic_shapes:
         trt_gm.set_expected_dynamic_shape_range((), dynamic_shapes)
     pipe.transformer = trt_gm

examples/distributed_inference/data_parallel_stable_diffusion.py

Lines changed: 13 additions & 13 deletions
@@ -31,19 +31,19 @@
 backend = "torch_tensorrt"

 # Optimize the UNet portion with Torch-TensorRT
-pipe.unet = torch.compile(  # %%
-    # Inference
-    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-    # Assume there are 2 processes (2 devices)
-    pipe.unet,
-    backend=backend,
-    options={
-        "truncate_long_and_double": True,
-        "precision": torch.float16,
-        "use_python_runtime": True,
-    },
-    dynamic=False,
-)
+with torch_tensorrt.runtime.set_runtime_backend("python"):
+    pipe.unet = torch.compile(  # %%
+        # Inference
+        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+        # Assume there are 2 processes (2 devices)
+        pipe.unet,
+        backend=backend,
+        options={
+            "truncate_long_and_double": True,
+            "precision": torch.float16,
+        },
+        dynamic=False,
+    )
 torch_tensorrt.runtime.set_multi_device_safe_mode(True)

examples/distributed_inference/tensor_parallel_simple_example.py

Lines changed: 12 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -93,18 +93,18 @@ def forward(self, x):
9393
python_result = tp_model(inp)
9494

9595
backend = "torch_tensorrt"
96-
tp_model = torch.compile(
97-
tp_model,
98-
backend=backend,
99-
options={
100-
"truncate_long_and_double": True,
101-
"enabled_precisions": {torch.float32, torch.float16},
102-
"use_python_runtime": True,
103-
"min_block_size": 1,
104-
"use_distributed_mode_trace": True,
105-
},
106-
dynamic=None,
107-
)
96+
with torch_tensorrt.runtime.set_runtime_backend("python"):
97+
tp_model = torch.compile(
98+
tp_model,
99+
backend=backend,
100+
options={
101+
"truncate_long_and_double": True,
102+
"enabled_precisions": {torch.float32, torch.float16},
103+
"min_block_size": 1,
104+
"use_distributed_mode_trace": True,
105+
},
106+
dynamic=None,
107+
)
108108

109109
# For TP, input needs to be same across all TP ranks.
110110
# Setting the random seed is to mimic the behavior of dataloader.
