Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm#166

Open
wuhuikx wants to merge 75 commits into vllm-project:main from wuhuikx:long_context
Conversation

@wuhuikx wuhuikx commented Feb 20, 2026

vLLM now provides 7 attention backends on AMD ROCm software. This post explains each one: why they exist, their trade-offs, and when to use them. We provide transparent benchmarks comparing all backends, and show how ROCM_AITER_FA for MHA and the AITER MLA backends deliver 1.2-4.4x higher throughput (TPS) through AMD’s AITER primitives and vLLM’s kernel orchestration.
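As a minimal sketch of how one of these backends would typically be selected: vLLM reads the `VLLM_ATTENTION_BACKEND` environment variable at engine initialization. The backend name `ROCM_AITER_FA` comes from this post; exact names and availability may differ across vLLM versions, so treat this as an assumption rather than a definitive recipe.

```python
# Sketch: forcing a specific attention backend before starting the engine.
# VLLM_ATTENTION_BACKEND is vLLM's standard override; "ROCM_AITER_FA" is the
# backend name used in this post (assumed to match your vLLM build).
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "ROCM_AITER_FA"

# The engine picks the backend up when it is constructed, e.g.:
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
```

If the variable is unset, vLLM falls back to its own backend auto-selection, which is what the benchmarks in this post compare against.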

wuhuikx and others added 30 commits December 16, 2025 22:39
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Update reference table with Radeon support indicators. Restructure MHA section: separate unified backends from ROCM_ATTN
merged ROCm attention backend blog post
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Signed-off-by: wuhuikx <hattie.wu@amd.com>

- **Decode**: Generating output tokens one at a time. Each decode step loads the entire KV cache from memory to produce a single token. The bottleneck is memory bandwidth, making decode **memory-bound**.

- **Extend**: Continuing conversations where part of the context is already cached (prefix cache hit). This is a hybrid scenario where new tokens must attend to both cached context and fresh input.
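The decode bullet above can be made concrete with a toy arithmetic-intensity calculation (illustrative only, not vLLM code): each decode step re-reads the entire KV cache to produce a single token, so FLOPs per byte of KV traffic collapse to roughly 1, while prefill amortizes the same KV reads over thousands of tokens.

```python
# Toy sketch of attention arithmetic intensity for a single fp16 head.
# Shows why decode is memory-bound: intensity scales with the number of
# new tokens processed per KV-cache read.

BYTES_PER_ELEM = 2  # fp16


def attention_intensity(new_tokens: int, context_len: int, head_dim: int) -> float:
    """FLOPs per byte of KV-cache traffic for one attention pass."""
    # QK^T and PV are each ~2 * new_tokens * context_len * head_dim FLOPs.
    flops = 2 * 2 * new_tokens * context_len * head_dim
    # K and V are each [context_len, head_dim] and must be read once.
    kv_bytes = 2 * context_len * head_dim * BYTES_PER_ELEM
    return flops / kv_bytes


# Prefill: 4096 prompt tokens share one KV read -> compute-bound.
print(attention_intensity(new_tokens=4096, context_len=4096, head_dim=128))  # 4096.0
# Decode: 1 token re-reads the whole cache -> ~1 FLOP/byte, memory-bound.
print(attention_intensity(new_tokens=1, context_len=4096, head_dim=128))     # 1.0
```

The ratio equals the number of new tokens per pass, which is why decode throughput is governed by memory bandwidth rather than compute.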
From @DarkLight1337: extend is about retrieving the KV cache before processing a new chunk, so it is not related to a prefix cache hit.


tjtanaa commented Feb 25, 2026

[screenshot of the backend table]

From @DarkLight1337: the table here should use the plural "tokens" for the last 5 rows.
