8 changes: 4 additions & 4 deletions README.md
@@ -54,8 +54,8 @@ Major features include:
frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
- [Dynamic batching](docs/user_guide/batcher.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/batcher.md#sequence-batcher) and
[implicit state management](docs/user_guide/architecture.md#implicit-state-management)
for stateful models
- Provides [Backend API](https://github.com/triton-inference-server/backend) that
@@ -70,8 +70,8 @@ Major features include:
protocols](docs/customization_guide/inference_protocols.md) based on the community
developed [KServe
protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
[Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
- A [C API](docs/customization_guide/inprocess_c_api.md) and
[Java API](docs/customization_guide/inprocess_java_api.md)
allow Triton to link directly into your application for edge and other in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
throughput, server latency, and more
12 changes: 6 additions & 6 deletions docs/README.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -111,17 +111,17 @@ The Model Configuration ModelOptimizationPolicy property is used to specify opti

#### Scheduling and Batching

Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically do not saturate GPU resources and therefore fail to exploit the parallelism GPUs provide. Learn more about Triton's [Batcher and Scheduler](user_guide/model_configuration.md#scheduling-and-batching).
- [Default Scheduler - Non-Batching](user_guide/model_configuration.md#default-scheduler)
- [Dynamic Batcher](user_guide/model_configuration.md#dynamic-batcher)
Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically do not saturate GPU resources and therefore fail to exploit the parallelism GPUs provide. Learn more about Triton's [Batcher and Scheduler](#scheduling-and-batching).
- [Default Scheduler - Non-Batching](user_guide/scheduler.md#default-scheduler)
- [Dynamic Batcher](user_guide/batcher.md#dynamic-batcher)
- [How to Configure Dynamic Batcher](user_guide/model_configuration.md#recommended-configuration-process)
- [Delayed Batching](user_guide/model_configuration.md#delayed-batching)
- [Delayed Batching](user_guide/batcher.md#delayed-batching)
- [Preferred Batch Size](user_guide/model_configuration.md#preferred-batch-sizes)
- [Preserving Request Ordering](user_guide/model_configuration.md#preserve-ordering)
- [Priority Levels](user_guide/model_configuration.md#priority-levels)
- [Queuing Policies](user_guide/model_configuration.md#queue-policy)
- [Ragged Batching](user_guide/ragged_batching.md)
- [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher)
- [Sequence Batcher](user_guide/batcher.md#sequence-batcher)
- [Stateful Models](user_guide/model_execution.md#stateful-models)
- [Control Inputs](user_guide/model_execution.md#control-inputs)
- [Implicit State - Stateful Inference Using a Stateless Model](user_guide/implicit_state_management.md#implicit-state-management)
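For orientation alongside the list above: the sequence batcher and its control inputs are enabled per model in `config.pbtxt`. The sketch below is a minimal, hedged example; the control tensor names and idle timeout are illustrative assumptions, not values taken from this PR.

```
sequence_batching {
  # Assumed idle timeout: release a sequence slot after 5 s without a request.
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      # Hypothetical tensor name; signals the first request of each sequence.
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      # Hypothetical tensor name; marks batch slots that hold a valid request.
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```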
8 changes: 4 additions & 4 deletions docs/contents.rst
@@ -1,5 +1,5 @@
..
.. Copyright 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
..
.. Redistribution and use in source and binary forms, with or without
.. modification, are permitted provided that the following conditions
@@ -37,7 +37,7 @@
:caption: Getting Started

getting_started/quick_deployment_by_backend
LLM With TRT-LLM <getting_started/trtllm_user_guide.md>
LLM With TensorRT-LLM <getting_started/trtllm_user_guide.md>
Multimodal model <../tutorials/Popular_Models_Guide/Llava1.5/llava_trtllm_guide.md>
Stable diffusion <../tutorials/Popular_Models_Guide/StableDiffusion/README.md>

@@ -96,10 +96,10 @@
:hidden:
:caption: Backends

TRT-LLM <tensorrtllm_backend/README>
TensorRT-LLM <tensorrtllm_backend/README>
vLLM <backend_guide/vllm>
Python <python_backend/README>
Pytorch <pytorch_backend/README>
PyTorch <pytorch_backend/README>
ONNX Runtime <onnxruntime_backend/README>
TensorRT <tensorrt_backend/README>
FIL <fil_backend/README>
4 changes: 1 addition & 3 deletions docs/customization_guide/build.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -59,8 +59,6 @@ to build Triton on a platform that is not listed here.

* [Ubuntu 22.04, x86-64](#building-for-ubuntu-2204)

* [Jetpack 4.x, NVIDIA Jetson (Xavier, Nano, TX2)](#building-for-jetpack-4x)

* [Windows 10, x86-64](#building-for-windows-10)

If you are developing or debugging Triton, see [Development and
6 changes: 3 additions & 3 deletions docs/customization_guide/inference_protocols.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2018-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -31,7 +31,7 @@
Clients can communicate with Triton using either an [HTTP/REST
protocol](#httprest-and-grpc-protocols), a [GRPC
protocol](#httprest-and-grpc-protocols), or by an [in-process C
API](inprocess_c_api.md#in-process-triton-server-api) or its
API](inprocess_c_api.md) or its
[C++ wrapper](https://github.com/triton-inference-server/developer_tools/tree/main/server).

## HTTP/REST and GRPC Protocols
@@ -142,7 +142,7 @@ For client-side documentation, see [Client-Side GRPC Status Codes](https://githu

#### GRPC Inference Handler Threads

In general, using 2 threads per completion queue seems to give the best performance; see [gRPC Performance Best Practices] (https://grpc.io/docs/guides/performance/#c). However, in cases where the performance bottleneck is at the request handling step (e.g., ensemble models), increasing the number of gRPC inference handler threads may lead to higher throughput.
In general, using 2 threads per completion queue seems to give the best performance; see [gRPC Performance Best Practices](https://grpc.io/docs/guides/performance/#c). However, in cases where the performance bottleneck is at the request handling step (e.g., ensemble models), increasing the number of gRPC inference handler threads may lead to higher throughput.

* `--grpc-infer-thread-count`: 2 by default.
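As a quick, hedged illustration of the flag above (the model repository path is a placeholder, not taken from this PR): `tritonserver --model-repository=/models --grpc-infer-thread-count=4`.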

@@ -1,5 +1,5 @@
<!--
# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2021-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -326,6 +326,6 @@ dynamic_batching {
}
```

To explore further options of the dynamic batcher, see the [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher).
To explore further options of the dynamic batcher, see the [documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher).

You can also try enabling both concurrent model execution and dynamic batching.
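A minimal sketch of that combination in `config.pbtxt`, assuming illustrative values (two instances per GPU, preferred batch sizes of 4 and 8) that would be tuned per model, for example with Model Analyzer:

```
# Assumption: run two copies of the model concurrently on each available GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Assumption: coalesce requests into batches of 4 or 8, waiting at most 100 us.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```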
14 changes: 7 additions & 7 deletions docs/getting_started/llm.md
@@ -1,5 +1,5 @@
<!--
# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -30,7 +30,7 @@

This guide captures the steps to build Phi-3 with TRT-LLM and deploy it with Triton Inference Server. It also shows how to use GenAI-Perf to run benchmarks that measure model performance in terms of throughput and latency.

This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.
This guide is tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see [Support Matrix](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi) for full list) using TRT-LLM v0.11 and Triton Inference Server 24.07.

- [Build and test TRT-LLM engine](#build-and-test-trt-llm-engine)
- [Deploy with Triton Inference Server](#deploy-with-triton-inference-server)
@@ -76,7 +76,7 @@ Reference: <https://nvidia.github.io/TensorRT-LLM/installation/linux.html>

## Build the TRT-LLM Engine

Reference: <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/phi>
Reference: <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/phi>

4. ## Download Phi-3-mini-4k-instruct

@@ -354,7 +354,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b> ensemble/config.pbtxt</b></summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -864,7 +864,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b>postprocessing/config.pbtxt</b></summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -993,7 +993,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<details>
<summary><b> preprocessing/config.pbtxt</b> </summary>

# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -1188,7 +1188,7 @@ All config files inside /tensorrtllm\_backend/all\_models/inflight\_batcher\_llm
<summary> <b> tensorrt_llm/config.pbtxt </b></summary>


# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
8 changes: 4 additions & 4 deletions docs/getting_started/trtllm_user_guide.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -50,7 +50,7 @@ to prepare engines for your LLM models and serve them with Triton.
## How to use your custom TRT-LLM model

All the supported models can be found in the
[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) folder in
[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core) folder in
the TRT-LLM repo. Follow the examples to convert your models to TensorRT
engines.

@@ -61,7 +61,7 @@ for Triton, and
Only the *mandatory parameters* need to be set in the model config file. Feel free
to modify the optional parameters as needed. To learn more about the
parameters, model inputs, and outputs, see the
[model config documentation](ttps://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more details.
[model config documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more details.

## Advanced Configuration Options and Deployment Strategies

@@ -95,7 +95,7 @@ to learn how to use GenAI-Perf to benchmark your LLM models.
## Performance Best Practices

Check out the
[Performance Best Practices guide](https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html)
[Performance tuning guide](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/)
to learn how to optimize your TensorRT-LLM models for better performance.

## Metrics
8 changes: 4 additions & 4 deletions docs/index.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -58,9 +58,9 @@ architecture. The [model repository](user_guide/model_repository.md) is a
file-system based repository of the models that Triton will make
available for inferencing. Inference requests arrive at the server via
either [HTTP/REST or GRPC](customization_guide/inference_protocols.md) or by the [C
API](customization_guide/inference_protocols.md) and are then routed to the appropriate per-model
API](customization_guide/inprocess_c_api.md) and are then routed to the appropriate per-model
scheduler. Triton implements [multiple scheduling and batching
algorithms](#models-and-schedulers) that can be configured on a
algorithms](./user_guide/architecture.md#models-and-schedulers) that can be configured on a
model-by-model basis. Each model's scheduler optionally performs
batching of inference requests and then passes the requests to the
[backend](https://github.com/triton-inference-server/backend/blob/main/README.md)
@@ -89,7 +89,7 @@ framework such as Kubernetes.
Major features include:

- [Supports multiple deep learning
frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
frameworks](backend/README.md#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
4 changes: 2 additions & 2 deletions docs/introduction/index.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -60,7 +60,7 @@ available for inferencing. Inference requests arrive at the server via
either [HTTP/REST or GRPC](../customization_guide/inference_protocols.md) or by the [C
API](../customization_guide/inprocess_c_api.md) and are then routed to the appropriate per-model
scheduler. Triton implements [multiple scheduling and batching
algorithms](#models-and-schedulers) that can be configured on a
algorithms](../user_guide/architecture.md#models-and-schedulers) that can be configured on a
model-by-model basis. Each model's scheduler optionally performs
batching of inference requests and then passes the requests to the
[backend](https://github.com/triton-inference-server/backend/blob/main/README.md)
6 changes: 3 additions & 3 deletions docs/protocol/extension_parameters.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2023-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -34,9 +34,9 @@ custom parameters that cannot be provided as inputs. Because this extension is
supported, Triton reports “parameters” in the extensions field of its
Server Metadata. This extension uses the optional "parameters"
field in the KServe Protocol in
[HTTP](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#inference-request-json-object)
[HTTP](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#inference-request-json-object)
and
[GRPC](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#parameters).
[GRPC](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol#parameters).

The following parameters are reserved for Triton's usage and should not be
used as custom parameters:
6 changes: 3 additions & 3 deletions docs/protocol/extension_schedule_policy.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2020-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -34,9 +34,9 @@ parameters that influence how Triton handles and schedules the
request. Because this extension is supported, Triton reports
“schedule_policy” in the extensions field of its Server Metadata.
Note that the policies are specific to the [dynamic
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher)
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#dynamic-batcher)
and have only experimental support for the [sequence
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#sequence-batcher)
batcher](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/batcher.md#sequence-batcher)
with the [direct](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#direct)
scheduling strategy.
