feat(HIP): register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d on ROCm #78587
Open
fchange wants to merge 1 commit into PaddlePaddle:develop from
Conversation
…2d on HIP

Add phi::bfloat16 to PD_REGISTER_KERNEL macros for the HIP (ROCm) backend:
- conv2d, conv3d, depthwise_conv2d (forward)
- conv2d_grad, conv3d_grad (backward)
- conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad

This enables BF16 precision inference for vision encoders (e.g., SigLIP in PaddleOCR-VL) on AMD GPUs. Previously only float and float16 were registered for HIP, causing a RuntimeError when BF16 models attempted convolution operations.

Also adds test_hip_bf16_conv_kernel.py to verify BF16 conv kernel registration on HIP/ROCm platforms.

Fixes: conv2d BF16 kernel not registered on HIP
Signed-off-by: fchange
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
fchange added a commit to fchange/PaddleX that referenced this pull request on Apr 4, 2026
Remove _keep_in_fp32_modules = ["visual", "mlp_AR"] from PaddleOCRVLForConditionalGeneration.

This workaround was added to avoid MIOpen BF16 convolution bugs on ROCm 7.0 by forcing the visual encoder to FP32, which doubled VRAM usage and reduced throughput. The Paddle framework now registers BF16 conv kernels for the HIP backend, making this workaround unnecessary.

See: PaddlePaddle/Paddle#78587
Signed-off-by: fchange
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
HIP/ROCm: Register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d
Summary

This PR adds `phi::bfloat16` to the `PD_REGISTER_KERNEL` macros for the HIP (ROCm) backend, enabling BF16 precision convolution operations on AMD GPUs.

Related Issue: #78586
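For readers unfamiliar with the format: bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa bits, so it trades precision for range. A minimal numpy sketch of that rounding behavior (the `to_bfloat16` helper is illustrative only, not a Paddle or MIOpen API):

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 rounding: keep float32's sign, exponent, and top
    7 mantissa bits, rounding the dropped 16 bits to nearest-even."""
    bits = x.astype(np.float32).view(np.uint32)
    # add 0x7FFF plus the LSB of the kept half -> round to nearest even
    rounded = bits + np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([1.0, 3.14159265, 1e-3], dtype=np.float32)
bx = to_bfloat16(x)
# exact powers of two survive unchanged; other values keep ~2-3 decimal digits
```

Because conv kernels accumulate many products, BF16 inputs typically still yield outputs within a percent-level tolerance of FP32, which is why registering the kernel (rather than forcing FP32) is viable for inference.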
Changes

1. paddle/phi/kernels/gpudnn/conv_kernel.cu

Added `phi::bfloat16` to forward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d | float, phi::float16 | float, phi::float16, phi::bfloat16 |

Also removed the `// todo register bfloat16` comment at the end of the file.

2. paddle/phi/kernels/gpudnn/conv_grad_kernel.cu

Added `phi::bfloat16` to backward convolution kernel registrations for HIP:

| Kernel | Before | After |
| --- | --- | --- |
| conv2d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| conv3d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |
| depthwise_conv2d_double_grad | float, phi::float16 | float, phi::float16, phi::bfloat16 |

3. test/legacy_test/test_hip_bf16_conv_kernel.py (new file)

Added test cases to verify BF16 convolution kernel registration on HIP:
- TestHIPBF16Conv2dKernel: basic conv2d BF16 forward, plus padding, stride, and groups variants
- TestHIPBF16DepthwiseConv2dKernel: depthwise conv2d BF16
- TestHIPBF16ConvLayer: Conv2D layer end-to-end, Conv2D+BN+ReLU pattern

Rationale
The MIOpen backend already supports BF16 for convolutions:
- `miopen_helper.h` maps `phi::dtype::bfloat16` → `miopenBFloat16`
- the HIP path in `conv_kernel.cu` calls `miopenConvolutionForward`
- `miopenBFloat16` is fully supported for convolutions, tensor set, and tensor copy

The only gap was that the kernel registration macros did not include `phi::bfloat16` in the HIP `#ifdef` branch.

Impact

Before this patch, BF16 convolution ops on HIP failed with `RuntimeError: kernel not registered`; with the registrations added, they dispatch to the existing MIOpen BF16 path.

Verification
Test Case: BF16 conv2d before fix
Test output (after fix applied to source)
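The numeric check such a test performs can be sketched in plain numpy: run a small conv2d with inputs truncated to bfloat16 precision and compare against an FP32 reference within a loose tolerance. This is an illustrative sketch under an assumed NCHW layout, not the actual test_hip_bf16_conv_kernel.py:

```python
import numpy as np

def bf16(x):
    """Truncate float32 to bfloat16 precision (round-toward-zero for brevity)."""
    return (x.astype(np.float32).view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def conv2d_ref(x, w):
    """Naive NCHW conv2d reference: stride 1, no padding, no groups."""
    n, c, h, wd = x.shape
    oc, _, kh, kw = w.shape
    out = np.zeros((n, oc, h - kh + 1, wd - kw + 1), dtype=np.float32)
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]          # (n, c, kh, kw)
            out[:, :, i, j] = np.einsum('nchw,ochw->no', patch, w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 8, 8)).astype(np.float32)
w = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)

fp32_out = conv2d_ref(x, w)
bf16_out = conv2d_ref(bf16(x), bf16(w))
# bfloat16 keeps ~8 significant bits, so expect percent-level relative error
rel_err = np.abs(bf16_out - fp32_out).max() / np.abs(fp32_out).max()
assert rel_err < 5e-2
```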
End-to-end: PaddleOCR-VL-1.5 on AMD MI300X
The PaddleOCR-VL model (dtype=bfloat16) with SigLIP vision encoder now runs on AMD GPU after rebuilding Paddle with this patch. The following command succeeds:

```shell
cd /opt/PaddleX
paddlex --pipeline PaddleOCR-VL-native.yaml --input /tmp/test_ocr.png
```

Output correctly extracts OCR text from a boarding pass image:
Verification Screenshot
Environment: AMD MI300X (gfx942), ROCm 7.0.51, PaddlePaddle 3.4.0.dev
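For reference, depthwise_conv2d (one of the ops this PR registers for BF16) convolves each input channel with its own filter rather than mixing channels, i.e. groups equals the channel count. A minimal numpy sketch of that semantics (hypothetical helper; NCHW layout, stride 1, no padding):

```python
import numpy as np

def depthwise_conv2d_ref(x, w):
    """Naive NCHW depthwise conv: channel c of x is convolved only with
    filter c of w (w has shape (C, 1, kh, kw)); stride 1, no padding."""
    n, c, h, wd = x.shape
    _, _, kh, kw = w.shape
    out = np.zeros((n, c, h - kh + 1, wd - kw + 1), dtype=x.dtype)
    for ch in range(c):
        for i in range(out.shape[2]):
            for j in range(out.shape[3]):
                out[:, ch, i, j] = np.sum(
                    x[:, ch, i:i + kh, j:j + kw] * w[ch, 0], axis=(1, 2))
    return out

x = np.arange(1 * 2 * 4 * 4, dtype=np.float32).reshape(1, 2, 4, 4)
w = np.ones((2, 1, 3, 3), dtype=np.float32)  # each channel: 3x3 box sum
y = depthwise_conv2d_ref(x, w)
assert y.shape == (1, 2, 2, 2)
```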
The test test_hip_bf16_conv_kernel.py verifies BF16 kernel registration on HIP.

Limitations
Per MIOpen documentation, `miopenBFloat16` support is limited to convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, activation, fused paths) may need separate fallback handling but are out of scope for this PR.

Checklist