Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm#166
Open
wuhuikx wants to merge 75 commits into vllm-project:main
Conversation
Signed-off-by: ganyi <ygan@amd.com>
add mha part
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
[doc] update the blog
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Update reference table with Radeon support indicators; restructure MHA section to separate unified backends from ROCM_ATTN
Merge ROCm attention backend blog post
Update data and content
Signed-off-by: wuhuikx <hattie.wu@amd.com>
Signed-off-by: wuhuikx <hattie.wu@amd.com>
tjtanaa reviewed Feb 25, 2026
- **Decode**: Generating output tokens one at a time. Each decode step loads the entire KV cache from memory to produce a single token. The bottleneck is memory bandwidth, making decode **memory-bound**.
- **Extend**: Continuing conversations where part of the context is already cached (prefix cache hit). This is a hybrid scenario where new tokens must attend to both cached context and fresh input.
Contributor
From @DarkLight1337: extend is about retrieving the KV cache before processing a new chunk, so it is not related to a prefix cache hit.
Contributor
From @DarkLight1337: the table here should use the plural "tokens" for the last 5 rows.
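The memory-bound claim in the Decode bullet can be sanity-checked with back-of-envelope arithmetic: each decode step must stream the whole KV cache from HBM, so peak memory bandwidth puts a floor on per-token latency. The sketch below uses illustrative model shapes and the MI300X spec-sheet bandwidth figure; none of these numbers come from the benchmarks in this post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Total KV cache size: K and V tensors for every layer, fp16 by default."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

hbm_bw = 5.3e12  # MI300X peak HBM3 bandwidth in bytes/s (spec-sheet figure)

# Lower bound on one decode step: time to read the full cache once.
step_time_s = cache / hbm_bw
print(f"KV cache: {cache / 1e9:.2f} GB")          # ~1.07 GB at 8K context
print(f"Per-token floor: {step_time_s * 1e6:.0f} us")
```

Even at peak bandwidth, a single request at 8K context cannot decode faster than the time it takes to stream ~1 GB of cache, which is why batching and bandwidth-efficient attention kernels dominate decode throughput.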

vLLM now provides seven attention backends on AMD ROCm. This post explains each one: why it exists, its trade-offs, and when to use it. We provide transparent benchmarks comparing all backends, and show how the ROCM_AITER_FA backend for MHA and the AITER MLA backends deliver 1.2-4.4x higher throughput (tokens per second) through AMD's AITER primitives and vLLM's kernel orchestration.
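As a quick way to compare backends yourself, vLLM lets you pin the attention backend via the `VLLM_ATTENTION_BACKEND` environment variable. A minimal sketch, assuming a ROCm build of vLLM and using the `ROCM_AITER_FA` backend name from this post (the model is a placeholder):

```shell
# Force the AITER flash-attention backend instead of vLLM's default choice,
# then start a server; swap the value to benchmark other backends.
export VLLM_ATTENTION_BACKEND=ROCM_AITER_FA
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Running the same workload while varying only this variable is the simplest way to reproduce backend-vs-backend throughput comparisons on your own hardware.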