Issue Description
Background
PaddleOCR-VL currently uses _keep_in_fp32_modules = ["visual", "mlp_AR"] in the PaddleOCRVLForConditionalGeneration model class to work around MIOpen BF16 convolution bugs on ROCm 7.0. This forces the entire SigLIP vision encoder to run in FP32 precision, even when the model is loaded with BF16 dtype.
Problem
This workaround has significant downsides:
- VRAM usage doubled: FP32 weights and activations consume twice the memory of their BF16 equivalents
- Throughput reduced: the vision encoder cannot leverage the native BF16 throughput of AMD matrix cores
- Inconsistent behavior: the model reports BF16 dtype, but the vision encoder silently runs in FP32
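A back-of-the-envelope estimate makes the memory cost concrete. The 400M parameter count below is a hypothetical figure for illustration, not the actual SigLIP encoder size:

```python
# Rough estimate of the extra VRAM from keeping the vision encoder in FP32.
# NUM_PARAMS is a hypothetical figure, not the real SigLIP encoder size.
def weight_bytes(num_params: int, bytes_per_element: int) -> int:
    """Memory footprint of the weights alone, ignoring activations."""
    return num_params * bytes_per_element

NUM_PARAMS = 400_000_000          # hypothetical encoder size
fp32 = weight_bytes(NUM_PARAMS, 4)  # FP32: 4 bytes per element
bf16 = weight_bytes(NUM_PARAMS, 2)  # BF16: 2 bytes per element

print(f"FP32 weights: {fp32 / 1e9:.1f} GB")   # 1.6 GB
print(f"BF16 weights: {bf16 / 1e9:.1f} GB")   # 0.8 GB
print(f"Overhead factor: {fp32 // bf16}x")    # 2x
```

Activations in FP32 add a similar factor on top of the weights, so the 2x figure is a lower bound on the workaround's cost.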
Root Cause
The upstream Paddle framework did not register BF16 convolution kernels (conv2d, conv3d, depthwise_conv2d) for the HIP (ROCm) backend. When a BF16 model attempted convolution, it failed with:
RuntimeError: The kernel with key (GPU, Undefined(AnyLayout), bfloat16) of kernel `conv2d` is not registered
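The error comes from kernel dispatch: Paddle looks up kernels by a (backend, layout, dtype) key, and no entry exists for BF16 convolution on the HIP backend. A toy sketch of that lookup, mimicking only the shape of the error (this is illustrative Python, not the phi framework's actual C++ registry):

```python
# Toy model of a kernel registry keyed by (backend, layout, dtype), to show
# why an unregistered BF16 conv kernel raises at dispatch time. Illustrative
# only; the real phi registry is C++ and populated by registration macros.
class KernelNotRegistered(RuntimeError):
    pass

REGISTRY = {
    ("GPU", "AnyLayout", "float32"): "conv2d_fp32_kernel",
    ("GPU", "AnyLayout", "float16"): "conv2d_fp16_kernel",
    # Before the fix, no ("GPU", "AnyLayout", "bfloat16") entry exists on HIP.
}

def dispatch(op: str, key: tuple) -> str:
    if key not in REGISTRY:
        raise KernelNotRegistered(
            f"The kernel with key {key} of kernel `{op}` is not registered"
        )
    return REGISTRY[key]

dispatch("conv2d", ("GPU", "AnyLayout", "float32"))    # resolves fine
# dispatch("conv2d", ("GPU", "AnyLayout", "bfloat16")) # raises KernelNotRegistered
```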
Resolution Path
A fix has been submitted to the Paddle framework: PaddlePaddle/Paddle#78587
This PR adds phi::bfloat16 to the HIP kernel registration macros in conv_kernel.cu and conv_grad_kernel.cu, enabling native BF16 convolution on AMD GPUs.
Proposed Change
Once the Paddle framework fix is merged, this workaround in PaddleX should be removed:
# Before (current)
_keep_in_fp32_modules = ["visual", "mlp_AR"]
# After
_keep_in_fp32_modules = None
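The effect of the change can be sketched with the selection logic behind _keep_in_fp32_modules: parameter names are matched against the listed module names, and matches are upcast to FP32 at load time. This mock mirrors the general idea only; the exact PaddleX matching rules are an assumption for illustration:

```python
# Illustrative mock of how a _keep_in_fp32_modules list selects which
# parameters to upcast at load time. Not the actual PaddleX implementation;
# the name-component matching rule here is an assumption.
def param_dtype(param_name: str, keep_in_fp32_modules, load_dtype: str) -> str:
    """Return the dtype a parameter would end up with at load time."""
    if keep_in_fp32_modules and any(
        mod in param_name.split(".") for mod in keep_in_fp32_modules
    ):
        return "float32"  # forced upcast for the MIOpen workaround
    return load_dtype     # everything else keeps the requested dtype

# With the current workaround, vision-encoder weights land in FP32:
print(param_dtype("visual.blocks.0.attn.qkv.weight",
                  ["visual", "mlp_AR"], "bfloat16"))  # float32
# After the fix (_keep_in_fp32_modules = None), everything stays BF16:
print(param_dtype("visual.blocks.0.attn.qkv.weight",
                  None, "bfloat16"))                  # bfloat16
```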
A corresponding PR for PaddleX will be submitted alongside the Paddle framework fix.