feat(HIP): register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d on ROCm#78587

Open
fchange wants to merge 1 commit into PaddlePaddle:develop from fchange:feat/hip-bf16-conv-kernel

Conversation


@fchange fchange commented Apr 4, 2026

HIP/ROCm: Register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d

Summary

This PR adds phi::bfloat16 to the PD_REGISTER_KERNEL macros for the HIP (ROCm) backend, enabling BF16 precision convolution operations on AMD GPUs.

Related Issue: #78586
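
For background, bfloat16 keeps float32's sign bit and full 8-bit exponent but truncates the mantissa to 7 bits, so a BF16 value is essentially the upper 16 bits of a float32. A minimal pure-Python sketch of this rounding (round-to-nearest-even on the discarded low bits; helper names are illustrative, not Paddle APIs):

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    # Pack as IEEE-754 float32 and keep the upper 16 bits
    # (sign, 8-bit exponent, top 7 mantissa bits) -- bfloat16's layout.
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    # Round-to-nearest-even on the discarded lower 16 bits.
    rounding = 0x7FFF + ((u32 >> 16) & 1)
    return ((u32 + rounding) >> 16) & 0xFFFF

def bf16_bits_to_float(bits: int) -> float:
    # Re-expand by zero-filling the low 16 mantissa bits.
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

def bf16_round(x: float) -> float:
    return bf16_bits_to_float(float_to_bf16_bits(x))

print(bf16_round(1.0))         # exactly representable
print(bf16_round(3.14159265))  # rounds to 3.140625
```

Because the exponent width matches float32, BF16 trades mantissa precision for float32's dynamic range, which is why it works well for inference without loss scaling.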

Changes

1. paddle/phi/kernels/gpudnn/conv_kernel.cu

Added phi::bfloat16 to forward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |

Also removed the // todo register bfloat16 comment at the end of the file.

2. paddle/phi/kernels/gpudnn/conv_grad_kernel.cu

Added phi::bfloat16 to backward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |

3. test/legacy_test/test_hip_bf16_conv_kernel.py (new file)

Added test cases to verify BF16 convolution kernel registration on HIP:

  • TestHIPBF16Conv2dKernel: Basic conv2d BF16 forward, padding, stride, groups
  • TestHIPBF16DepthwiseConv2dKernel: Depthwise conv2d BF16
  • TestHIPBF16ConvLayer: Conv2D layer end-to-end, Conv2D+BN+ReLU pattern

Rationale

The MIOpen backend already supports BF16 for convolutions:

  • miopen_helper.h maps phi::dtype::bfloat16 to miopenBFloat16
  • conv_kernel.cu HIP path calls miopenConvolutionForward
  • Per MIOpen docs, miopenBFloat16 is fully supported for convolutions, tensor set, and tensor copy

The only gap was the kernel registration macros not including phi::bfloat16 for the HIP #ifdef branch.
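
The failure mode can be illustrated with a toy, dtype-keyed registry (a simplified stand-in for PD_REGISTER_KERNEL dispatch, not Paddle's actual implementation): lookup succeeds only for dtypes that were listed at registration time, so omitting bfloat16 from the HIP branch makes dispatch raise even though the underlying MIOpen path could handle it.

```python
# Toy registry: kernel name -> set of registered dtypes.
KERNELS = {}

def register_kernel(name, dtypes):
    KERNELS.setdefault(name, set()).update(dtypes)

def dispatch(name, dtype):
    if dtype not in KERNELS.get(name, set()):
        raise RuntimeError(
            f"(NotFound) The kernel with key (GPU, {dtype}) of kernel "
            f"`{name}` is not registered.")
    return f"{name}<{dtype}>"

# Before this PR (HIP branch): only float and float16 registered.
register_kernel("conv2d", ["float", "float16"])
try:
    dispatch("conv2d", "bfloat16")
except RuntimeError as e:
    print("before:", e)

# After: bfloat16 added to the registration list.
register_kernel("conv2d", ["float", "float16", "bfloat16"])
print("after:", dispatch("conv2d", "bfloat16"))
```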

Impact

  • Before: BF16 models on AMD GPU fail at convolution with RuntimeError: kernel not registered
  • After: BF16 convolutions work on AMD GPU; downstream projects (e.g., PaddleX) can enable BF16 inference for vision encoders

Verification

Test Case: BF16 conv2d before fix

RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), bfloat16) 
of kernel `conv2d` is not registered. 
Paddle support following DataTypes: float64, float32.

Test output (after fix applied to source)

$ python -m pytest test/legacy_test/test_hip_bf16_conv_kernel.py -v
TestHIPBF16Conv2dKernel::test_conv2d_bf16_forward ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_padding ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_stride ... PASSED
TestHIPBF16Conv2dKernel::test_conv2d_bf16_with_groups ... PASSED

End-to-end: PaddleOCR-VL-1.5 on AMD MI300X

The PaddleOCR-VL model (dtype=bfloat16) with SigLIP vision encoder now runs on AMD GPU after rebuilding Paddle with this patch. The following command succeeds:

cd /opt/PaddleX
paddlex --pipeline PaddleOCR-VL-native.yaml --input /tmp/test_ocr.png

Output correctly extracts OCR text from a boarding pass image:

  • Detected: "登机牌 BOARDING PASS", flight "MU 2379", destination "福州"
  • Passenger name: "张祺伟 / ZHANGQIWEI", Gate "G11"

Verification Screenshot

Environment: AMD MI300X (gfx942), ROCm 7.0.51, PaddlePaddle 3.4.0.dev

The test test_hip_bf16_conv_kernel.py verifies BF16 kernel registration on HIP:

$ python -c "
import paddle
print('ROCm:', paddle.is_compiled_with_rocm())
x = paddle.randn([1, 3, 64, 64]).astype('bfloat16')
w = paddle.randn([8, 3, 3, 3]).astype('bfloat16')
y = paddle.nn.functional.conv2d(x, w)
print('conv2d BF16 output shape:', y.shape, 'dtype:', y.dtype)
"
ROCm: True
conv2d BF16 output shape: [1, 8, 62, 62] dtype: paddle.bfloat16

Limitations

Per MIOpen documentation, miopenBFloat16 support is limited to convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, activation, fusion paths) may need separate fallback handling but are out of scope for this PR.
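
One plausible shape for such a fallback (a hedged sketch only, with tensors modeled as (dtype, data) pairs and all names hypothetical, not Paddle or MIOpen APIs) is to upcast BF16 inputs to float32 around an op with no BF16 kernel and cast the result back:

```python
import functools
import math

def fp32_fallback(op):
    # Hypothetical decorator: run the wrapped op in float32 when the
    # input dtype is bfloat16, then tag the result as bfloat16 again.
    @functools.wraps(op)
    def wrapper(tensor):
        dtype, data = tensor
        if dtype == "bfloat16":
            _, out = op(("float32", data))
            return ("bfloat16", out)  # cast result back
        return op(tensor)
    return wrapper

@fp32_fallback
def softmax(tensor):
    # Plain float32 softmax over a flat list (no BF16 kernel exists).
    dtype, data = tensor
    assert dtype == "float32", "no BF16 softmax kernel"
    m = max(data)
    exps = [math.exp(v - m) for v in data]
    s = sum(exps)
    return (dtype, [e / s for e in exps])

dtype, probs = softmax(("bfloat16", [1.0, 2.0, 3.0]))
print(dtype, probs)
```

This costs an extra round-trip conversion per op, which is why registering native BF16 kernels (as this PR does for convolutions) is preferable where the backend supports them.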

Checklist

  • Code changes are minimal and targeted
  • Test cases added for HIP BF16 convolution
  • No changes to CUDA path (CUDA BF16 already works)
  • Compatible with existing MIOpen BF16 infrastructure

…2d on HIP

Add phi::bfloat16 to PD_REGISTER_KERNEL macros for HIP (ROCm) backend:
- conv2d, conv3d, depthwise_conv2d (forward)
- conv2d_grad, conv3d_grad (backward)
- conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad

This enables BF16 precision inference for vision encoders (e.g., SigLIP
in PaddleOCR-VL) on AMD GPUs. Previously only float and float16 were
registered for HIP, causing RuntimeError when BF16 models attempted
convolution operations.

Also adds test_hip_bf16_conv_kernel.py to verify BF16 conv kernel
registration on HIP/ROCm platforms.

Fixes: conv2d BF16 kernel not registered on HIP
Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

paddle-bot bot commented Apr 4, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.


CLAassistant commented Apr 4, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Apr 4, 2026
fchange added a commit to fchange/PaddleX that referenced this pull request Apr 4, 2026
Remove _keep_in_fp32_modules = ["visual", "mlp_AR"] from
PaddleOCRVLForConditionalGeneration. This workaround was added to
avoid MIOpen BF16 convolution bugs on ROCm 7.0 by forcing the visual
encoder to FP32, which doubled VRAM usage and reduced throughput.

The Paddle framework now registers BF16 conv kernels for HIP backend,
making this workaround unnecessary.

See: PaddlePaddle/Paddle#78587

Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
