8 changes: 4 additions & 4 deletions README.md
@@ -54,8 +54,8 @@ Major features include:
frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
- [Dynamic batching](docs/user_guide/batcher.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/batcher.md#sequence-batcher) and
[implicit state management](docs/user_guide/architecture.md#implicit-state-management)
for stateful models
- Provides [Backend API](https://github.com/triton-inference-server/backend) that
@@ -70,8 +70,8 @@ Major features include:
protocols](docs/customization_guide/inference_protocols.md) based on the community
developed [KServe
protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
[Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
- A [C API](docs/customization_guide/inprocess_c_api.md) and
[Java API](docs/customization_guide/inprocess_java_api.md)
allow Triton to link directly into your application for edge and other in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
throughput, server latency, and more
12 changes: 6 additions & 6 deletions docs/README.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -111,17 +111,17 @@ The Model Configuration ModelOptimizationPolicy property is used to specify opti

#### Scheduling and Batching

Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically do not saturate GPU resources and therefore fail to exploit the parallelism GPUs provide. Learn more about Triton's [Batcher and Scheduler](user_guide/model_configuration.md#scheduling-and-batching).
- [Default Scheduler - Non-Batching](user_guide/model_configuration.md#default-scheduler)
- [Dynamic Batcher](user_guide/model_configuration.md#dynamic-batcher)
Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically do not saturate GPU resources and therefore fail to exploit the parallelism GPUs provide. Learn more about Triton's [Batcher and Scheduler](#scheduling-and-batching).
- [Default Scheduler - Non-Batching](user_guide/scheduler.md#default-scheduler)
- [Dynamic Batcher](user_guide/batcher.md#dynamic-batcher)
- [How to Configure Dynamic Batcher](user_guide/model_configuration.md#recommended-configuration-process)
- [Delayed Batching](user_guide/model_configuration.md#delayed-batching)
- [Delayed Batching](user_guide/batcher.md#delayed-batching)
- [Preferred Batch Size](user_guide/model_configuration.md#preferred-batch-sizes)
- [Preserving Request Ordering](user_guide/model_configuration.md#preserve-ordering)
- [Priority Levels](user_guide/model_configuration.md#priority-levels)
- [Queuing Policies](user_guide/model_configuration.md#queue-policy)
- [Ragged Batching](user_guide/ragged_batching.md)
- [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher)
- [Sequence Batcher](user_guide/batcher.md#sequence-batcher)
- [Stateful Models](user_guide/model_execution.md#stateful-models)
- [Control Inputs](user_guide/model_execution.md#control-inputs)
- [Implicit State - Stateful Inference Using a Stateless Model](user_guide/implicit_state_management.md#implicit-state-management)
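For orientation alongside the list above: the sequence batcher and its control inputs are enabled per model in `config.pbtxt`. The sketch below is a minimal, hedged example; the control tensor names and idle timeout are illustrative assumptions, not values taken from this PR.

```
sequence_batching {
  # Assumed idle timeout: release a sequence slot after 5 s without a request.
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      # Hypothetical tensor name; signals the first request of each sequence.
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      # Hypothetical tensor name; marks batch slots that hold a valid request.
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```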
8 changes: 4 additions & 4 deletions docs/contents.rst
@@ -1,5 +1,5 @@
..
.. Copyright 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
..
.. Redistribution and use in source and binary forms, with or without
.. modification, are permitted provided that the following conditions
@@ -37,7 +37,7 @@
:caption: Getting Started

getting_started/quick_deployment_by_backend
LLM With TRT-LLM <getting_started/trtllm_user_guide.md>
LLM With TensorRT-LLM <getting_started/trtllm_user_guide.md>
Multimodal model <../tutorials/Popular_Models_Guide/Llava1.5/llava_trtllm_guide.md>
Stable diffusion <../tutorials/Popular_Models_Guide/StableDiffusion/README.md>

@@ -96,10 +96,10 @@
:hidden:
:caption: Backends

TRT-LLM <tensorrtllm_backend/README>
TensorRT-LLM <tensorrtllm_backend/README>
vLLM <backend_guide/vllm>
Python <python_backend/README>
Pytorch <pytorch_backend/README>
PyTorch <pytorch_backend/README>
ONNX Runtime <onnxruntime_backend/README>
TensorRT <tensorrt_backend/README>
FIL <fil_backend/README>
4 changes: 1 addition & 3 deletions docs/customization_guide/build.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -59,8 +59,6 @@ to build Triton on a platform that is not listed here.

* [Ubuntu 22.04, x86-64](#building-for-ubuntu-2204)

* [Jetpack 4.x, NVIDIA Jetson (Xavier, Nano, TX2)](#building-for-jetpack-4x)

* [Windows 10, x86-64](#building-for-windows-10)

If you are developing or debugging Triton, see [Development and
6 changes: 3 additions & 3 deletions docs/customization_guide/inference_protocols.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -31,7 +31,7 @@
Clients can communicate with Triton using either an [HTTP/REST
protocol](#httprest-and-grpc-protocols), a [GRPC
protocol](#httprest-and-grpc-protocols), or by an [in-process C
API](inprocess_c_api.md#in-process-triton-server-api) or its
API](inprocess_c_api.md) or its
[C++ wrapper](https://github.com/triton-inference-server/developer_tools/tree/main/server).

## HTTP/REST and GRPC Protocols
@@ -142,7 +142,7 @@ For client-side documentation, see [Client-Side GRPC Status Codes](https://githu

#### GRPC Inference Handler Threads

In general, using 2 threads per completion queue seems to give the best performance; see [gRPC Performance Best Practices] (https://grpc.io/docs/guides/performance/#c). However, in cases where the performance bottleneck is at the request handling step (e.g., ensemble models), increasing the number of gRPC inference handler threads may lead to higher throughput.
In general, using 2 threads per completion queue seems to give the best performance; see [gRPC Performance Best Practices](https://grpc.io/docs/guides/performance/#c). However, in cases where the performance bottleneck is at the request handling step (e.g., ensemble models), increasing the number of gRPC inference handler threads may lead to higher throughput.

* `--grpc-infer-thread-count`: 2 by default.
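As a quick, hedged illustration of the flag above (the model repository path is a placeholder, not taken from this PR): `tritonserver --model-repository=/models --grpc-infer-thread-count=4`.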

@@ -1,5 +1,5 @@
<!--
# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2021-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -326,6 +326,6 @@ dynamic_batching {
}
```

To explore further options of the dynamic batcher, see the [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher).
To explore further options of the dynamic batcher, see the [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher).

You can also try enabling both concurrent model execution and dynamic batching.
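A minimal sketch of that combination in `config.pbtxt`, assuming illustrative values (two instances per GPU, preferred batch sizes of 4 and 8) that would be tuned per model, for example with Model Analyzer:

```
# Assumption: run two copies of the model concurrently on each available GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Assumption: coalesce requests into batches of 4 or 8, waiting at most 100 us.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```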
14 changes: 7 additions & 7 deletions docs/getting_started/llm.md
@@ -1,5 +1,5 @@
<!--
# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -30,7 +30,7 @@

This guide captures the steps to build Phi-3 with TRT-LLM and deploy it with Triton Inference Server. It also shows how to use GenAI-Perf to run benchmarks that measure model performance in terms of throughput and latency.

This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.
This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.

- [Build and test TRT-LLM engine](#build-and-test-trt-llm-engine)
- [Deploy with Triton Inference Server](#deploy-with-triton-inference-server)
@@ -76,7 +76,7 @@ Reference: <https://nvidia.github.io/TensorRT-LLM/installation/linux.html>

## Build the TRT-LLM Engine

Reference: <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi>
Reference: <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi>

4. ## Download Phi-3-mini-4k-instruct

@@ -354,7 +354,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b> ensemble/config.pbtxt</b></summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -864,7 +864,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b>postprocessing/config.pbtxt</b></summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -993,7 +993,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b> preprocessing/config.pbtxt</b> </summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -1188,7 +1188,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<summary> <b> tensorrt_llm/config.pbtxt </b></summary>


# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
8 changes: 4 additions & 4 deletions docs/getting_started/trtllm_user_guide.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -50,7 +50,7 @@ to prepare engines for your LLM models and serve them with Triton.
## How to use your custom TRT-LLM model

All the supported models can be found in the
[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) folder in
[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core) folder in
the TRT-LLM repo. Follow the examples to convert your models to TensorRT
engines.

@@ -61,7 +61,7 @@ for Triton, and
Only the *mandatory parameters* need to be set in the model config file. Feel free
to modify the optional parameters as needed. To learn more about the
parameters, model inputs, and outputs, see the
[model config documentation](ttps://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more details.
[model config documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more details.

## Advanced Configuration Options and Deployment Strategies

@@ -95,7 +95,7 @@ to learn how to use GenAI-Perf to benchmark your LLM models.
## Performance Best Practices

Check out the
[Performance Best Practices guide](https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html)
[Performance tuning guide](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/)
to learn how to optimize your TensorRT-LLM models for better performance.

## Metrics
8 changes: 4 additions & 4 deletions docs/index.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -58,9 +58,9 @@ architecture. The [model repository](user_guide/model_repository.md) is a
file-system based repository of the models that Triton will make
available for inferencing. Inference requests arrive at the server via
either [HTTP/REST or GRPC](customization_guide/inference_protocols.md) or by the [C
API](customization_guide/inference_protocols.md) and are then routed to the appropriate per-model
API](customization_guide/inprocess_c_api.md) and are then routed to the appropriate per-model
scheduler. Triton implements [multiple scheduling and batching
algorithms](#models-and-schedulers) that can be configured on a
algorithms](./user_guide/architecture.md#models-and-schedulers) that can be configured on a
model-by-model basis. Each model's scheduler optionally performs
batching of inference requests and then passes the requests to the
[backend](https://github.com/triton-inference-server/backend/blob/main/README.md)
@@ -89,7 +89,7 @@ framework such as Kubernetes.
Major features include:

- [Supports multiple deep learning
frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
frameworks](backend/README.md#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
4 changes: 2 additions & 2 deletions docs/introduction/index.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -60,7 +60,7 @@ available for inferencing. Inference requests arrive at the server via
either [HTTP/REST or GRPC](../customization_guide/inference_protocols.md) or by the [C
API](../customization_guide/inprocess_c_api.md) and are then routed to the appropriate per-model
scheduler. Triton implements [multiple scheduling and batching
algorithms](#models-and-schedulers) that can be configured on a
algorithms](../user_guide/architecture.md#models-and-schedulers) that can be configured on a
model-by-model basis. Each model's scheduler optionally performs
batching of inference requests and then passes the requests to the
[backend](https://github.com/triton-inference-server/backend/blob/main/README.md)
6 changes: 3 additions & 3 deletions docs/protocol/extension_parameters.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -34,9 +34,9 @@ custom parameters that cannot be provided as inputs. Because this extension is
supported, Triton reports “parameters” in the extensions field of its
Server Metadata. This extension uses the optional "parameters"
field in the KServe Protocol in
[HTTP](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#inference-request-json-object)
[HTTP](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#inference-request-json-object)
and
[GRPC](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#parameters).
[GRPC](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#parameters).

The following parameters are reserved for Triton's usage and should not be
used as custom parameters:
6 changes: 3 additions & 3 deletions docs/protocol/extension_schedule_policy.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2020-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -34,9 +34,9 @@ parameters that influence how Triton handles and schedules the
request. Because this extension is supported, Triton reports
“schedule_policy” in the extensions field of its Server Metadata.
Note that the policies are specific to the [dynamic
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher)
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher)
and have only experimental support for the [sequence
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#sequence-batcher)
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#sequence-batcher)
with the [direct](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#direct)
scheduling strategy.
