feat(HIP): register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d on ROCm#78587

Open
fchange wants to merge 1 commit into PaddlePaddle:develop from fchange:feat/hip-bf16-conv-kernel

Conversation


@fchange fchange commented Apr 4, 2026

HIP/ROCm: Register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d

Summary

This PR adds phi::bfloat16 to the PD_REGISTER_KERNEL macros for the HIP (ROCm) backend, enabling BF16 precision convolution operations on AMD GPUs.

Related Issue: #78586
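
For background, bfloat16 keeps float32's sign bit and full 8-bit exponent but truncates the mantissa to 7 bits, so a BF16 value is essentially the upper 16 bits of a float32. A minimal pure-Python sketch of this rounding (round-to-nearest-even on the discarded low bits; helper names are illustrative, not Paddle APIs):

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    # Pack as IEEE-754 float32 and keep the upper 16 bits
    # (sign, 8-bit exponent, top 7 mantissa bits) -- bfloat16's layout.
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    # Round-to-nearest-even on the discarded lower 16 bits.
    rounding = 0x7FFF + ((u32 >> 16) & 1)
    return ((u32 + rounding) >> 16) & 0xFFFF

def bf16_bits_to_float(bits: int) -> float:
    # Re-expand by zero-filling the low 16 mantissa bits.
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

def bf16_round(x: float) -> float:
    return bf16_bits_to_float(float_to_bf16_bits(x))

print(bf16_round(1.0))         # exactly representable
print(bf16_round(3.14159265))  # rounds to 3.140625
```

Because the exponent width matches float32, BF16 trades mantissa precision for float32's dynamic range, which is why it works well for inference without loss scaling.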

Changes

1. paddle/phi/kernels/gpudnn/conv_kernel.cu

Added phi::bfloat16 to forward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |

Also removed the // todo register bfloat16 comment at the end of the file.

2. paddle/phi/kernels/gpudnn/conv_grad_kernel.cu

Added phi::bfloat16 to backward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |

3. test/legacy_test/test_hip_bf16_conv_kernel.py (new file)

Added test cases to verify BF16 convolution kernel registration on HIP:

  • TestHIPBF16Conv2dKernel: Basic conv2d BF16 forward, padding, stride, groups
  • TestHIPBF16DepthwiseConv2dKernel: Depthwise conv2d BF16
  • TestHIPBF16ConvLayer: Conv2D layer end-to-end, Conv2D+BN+ReLU pattern

Rationale

The MIOpen backend already supports BF16 for convolutions:

  • miopen_helper.h maps phi::dtype::bfloat16 to miopenBFloat16
  • conv_kernel.cu HIP path calls miopenConvolutionForward
  • Per MIOpen docs, miopenBFloat16 is fully supported for convolutions, tensor set, and tensor copy

The only gap was the kernel registration macros not including phi::bfloat16 for the HIP #ifdef branch.
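
The failure mode can be illustrated with a toy, dtype-keyed registry (a simplified stand-in for PD_REGISTER_KERNEL dispatch, not Paddle's actual implementation): lookup succeeds only for dtypes that were listed at registration time, so omitting bfloat16 from the HIP branch makes dispatch raise even though the underlying MIOpen path could handle it.

```python
# Toy registry: kernel name -> set of registered dtypes.
KERNELS = {}

def register_kernel(name, dtypes):
    KERNELS.setdefault(name, set()).update(dtypes)

def dispatch(name, dtype):
    if dtype not in KERNELS.get(name, set()):
        raise RuntimeError(
            f"(NotFound) The kernel with key (GPU, {dtype}) of kernel "
            f"`{name}` is not registered.")
    return f"{name}<{dtype}>"

# Before this PR (HIP branch): only float and float16 registered.
register_kernel("conv2d", ["float", "float16"])
try:
    dispatch("conv2d", "bfloat16")
except RuntimeError as e:
    print("before:", e)

# After: bfloat16 added to the registration list.
register_kernel("conv2d", ["float", "float16", "bfloat16"])
print("after:", dispatch("conv2d", "bfloat16"))
```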

Impact

  • Before: BF16 models on AMD GPU fail at convolution with RuntimeError: kernel not registered
  • After: BF16 convolutions work on AMD GPU; downstream projects (e.g., PaddleX) can enable BF16 inference for vision encoders

Verification

Test Case: BF16 conv2d before fix

RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), bfloat16) 
of kernel `conv2d` is not registered. 
Paddle support following DataTypes: float64, float32.

Test output (after fix applied to source)

$ python -m pytest test/legacy_test/test_hip_bf16_conv_kernel.py -v
TestHIPBF16Conv2dKernel::test_conv2d_bf16_forward ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_padding ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_stride ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_groups ... PASSED

End-to-end: PaddleOCR-VL-1.5 on AMD MI300X

The PaddleOCR-VL model (dtype=bfloat16) with SigLIP vision encoder now runs on AMD GPU after rebuilding Paddle with this patch. The following command succeeds:

cd /opt/PaddleX
paddlex --pipeline PaddleOCR-VL-native.yaml --input /tmp/test_ocr.png

Output correctly extracts OCR text from a boarding pass image:

  • Detected: "登机牌 BOARDING PASS", flight "MU 2379", destination "福州"
  • Passenger name: "张祺伟 / ZHANGQIWEI", Gate "G11"

Verification Screenshot

Environment: AMD MI300X (gfx942), ROCm 7.0.51, PaddlePaddle 3.4.0.dev

The test test_hip_bf16_conv_kernel.py verifies BF16 kernel registration on HIP:

$ python -c "
import paddle
print('ROCm:', paddle.is_compiled_with_rocm())
x = paddle.randn([1, 3, 64, 64]).astype('bfloat16')
w = paddle.randn([8, 3, 3, 3]).astype('bfloat16')
y = paddle.nn.functional.conv2d(x, w)
print('conv2d BF16 output shape:', y.shape, 'dtype:', y.dtype)
"
ROCm: True
conv2d BF16 output shape: [1, 8, 62, 62] dtype: paddle.bfloat16

Limitations

Per MIOpen documentation, miopenBFloat16 support is limited to convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, activation, fusion paths) may need separate fallback handling but are out of scope for this PR.
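
One plausible shape for such a fallback (a hedged sketch only, with tensors modeled as (dtype, data) pairs and all names hypothetical, not Paddle or MIOpen APIs) is to upcast BF16 inputs to float32 around an op with no BF16 kernel and cast the result back:

```python
import functools
import math

def fp32_fallback(op):
    # Hypothetical decorator: run the wrapped op in float32 when the
    # input dtype is bfloat16, then tag the result as bfloat16 again.
    @functools.wraps(op)
    def wrapper(tensor):
        dtype, data = tensor
        if dtype == "bfloat16":
            _, out = op(("float32", data))
            return ("bfloat16", out)  # cast result back
        return op(tensor)
    return wrapper

@fp32_fallback
def softmax(tensor):
    # Plain float32 softmax over a flat list (no BF16 kernel exists).
    dtype, data = tensor
    assert dtype == "float32", "no BF16 softmax kernel"
    m = max(data)
    exps = [math.exp(v - m) for v in data]
    s = sum(exps)
    return (dtype, [e / s for e in exps])

dtype, probs = softmax(("bfloat16", [1.0, 2.0, 3.0]))
print(dtype, probs)
```

This costs an extra round-trip conversion per op, which is why registering native BF16 kernels (as this PR does for convolutions) is preferable where the backend supports them.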

Checklist

  • Code changes are minimal and targeted
  • Test cases added for HIP BF16 convolution
  • No changes to CUDA path (CUDA BF16 already works)
  • Compatible with existing MIOpen BF16 infrastructure

…2d on HIP

Add phi::bfloat16 to PD_REGISTER_KERNEL macros for HIP (ROCm) backend:
- conv2d, conv3d, depthwise_conv2d (forward)
- conv2d_grad, conv3d_grad (backward)
- conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad

This enables BF16 precision inference for vision encoders (e.g., SigLIP
in PaddleOCR-VL) on AMD GPUs. Previously only float and float16 were
registered for HIP, causing RuntimeError when BF16 models attempted
convolution operations.

Also adds test_hip_bf16_conv_kernel.py to verify BF16 conv kernel
registration on HIP/ROCm platforms.

Fixes: conv2d BF16 kernel not registered on HIP
Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

paddle-bot bot commented Apr 4, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.


CLAassistant commented Apr 4, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Apr 4, 2026
fchange added a commit to fchange/PaddleX that referenced this pull request Apr 4, 2026
Remove _keep_in_fp32_modules = ["visual", "mlp_AR"] from
PaddleOCRVLForConditionalGeneration. This workaround was added to
avoid MIOpen BF16 convolution bugs on ROCm 7.0 by forcing the visual
encoder to FP32, which doubled VRAM usage and reduced throughput.

The Paddle framework now registers BF16 conv kernels for HIP backend,
making this workaround unnecessary.

See: PaddlePaddle/Paddle#78587

Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
