Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm#166
Open
wuhuikx wants to merge 75 commits into vllm-project:main
Conversation
Signed-off-by: ganyi <ygan@amd.com>
add mha part
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
[doc] update the blog
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Update reference table with Radeon support indicators; restructure MHA section to separate unified backends from ROCM_ATTN
Merge ROCm attention backend blog post
Update data and content
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Signed-off-by: wuhuikx <hattie.wu@amd.com>
tjtanaa reviewed Feb 25, 2026
- **Decode**: Generating output tokens one at a time. Each decode step loads the entire KV cache from memory to produce a single token. The bottleneck is memory bandwidth, making decode **memory-bound**.
- **Extend**: Continuing conversations where part of the context is already cached (prefix cache hit). This is a hybrid scenario where new tokens must attend to both cached context and fresh input.
Contributor
From @DarkLight1337: extend is about retrieving the KV cache before processing a new chunk, so it is not related to a prefix cache hit.
Contributor
From @DarkLight1337: the table here should use the plural "tokens" for the last 5 rows.
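The memory-bound claim in the Decode bullet can be sanity-checked with back-of-envelope arithmetic: each decode step must stream the whole KV cache from HBM, so peak memory bandwidth puts a floor on per-token latency. The sketch below uses illustrative model shapes and the MI300X spec-sheet bandwidth figure; none of these numbers come from the benchmarks in this post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Total KV cache size: K and V tensors for every layer, fp16 by default."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

hbm_bw = 5.3e12  # MI300X peak HBM3 bandwidth in bytes/s (spec-sheet figure)

# Lower bound on one decode step: time to read the full cache once.
step_time_s = cache / hbm_bw
print(f"KV cache: {cache / 1e9:.2f} GB")          # ~1.07 GB at 8K context
print(f"Per-token floor: {step_time_s * 1e6:.0f} us")
```

Even at peak bandwidth, a single request at 8K context cannot decode faster than the time it takes to stream ~1 GB of cache, which is why batching and bandwidth-efficient attention kernels dominate decode throughput.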

vLLM now provides seven attention backends on AMD ROCm. This post explains each one: why it exists, its trade-offs, and when to use it. We provide transparent benchmarks comparing all backends, and show how the ROCM_AITER_FA backend for MHA and the AITER MLA backends deliver 1.2-4.4x higher throughput (tokens per second) through AMD's AITER primitives and vLLM's kernel orchestration.
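As a quick way to compare backends yourself, vLLM lets you pin the attention backend via the `VLLM_ATTENTION_BACKEND` environment variable. A minimal sketch, assuming a ROCm build of vLLM and using the `ROCM_AITER_FA` backend name from this post (the model is a placeholder):

```shell
# Force the AITER flash-attention backend instead of vLLM's default choice,
# then start a server; swap the value to benchmark other backends.
export VLLM_ATTENTION_BACKEND=ROCM_AITER_FA
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Running the same workload while varying only this variable is the simplest way to reproduce backend-vs-backend throughput comparisons on your own hardware.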