PaddleOCR-VL: Remove ROCm BF16 _keep_in_fp32_modules workaround #5076

@fchange

Description

Background

PaddleOCR-VL currently sets _keep_in_fp32_modules = ["visual", "mlp_AR"] in the PaddleOCRVLForConditionalGeneration model class to work around MIOpen BF16 convolution bugs on ROCm 7.0. This forces the entire SigLIP vision encoder and the mlp_AR module to run in FP32 precision, even when the model is loaded with BF16 dtype.
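The effect of the workaround can be illustrated with a small sketch (the function and its signature are hypothetical, not the actual PaddleX loader code): any parameter whose module path matches a prefix in _keep_in_fp32_modules is kept in FP32 regardless of the dtype the user requested.

```python
def resolve_param_dtype(param_name, requested_dtype, keep_in_fp32_modules):
    """Return the dtype a parameter is actually loaded in.

    Hypothetical helper illustrating _keep_in_fp32_modules semantics:
    parameters under a listed module prefix are pinned to float32.
    """
    if keep_in_fp32_modules:
        for prefix in keep_in_fp32_modules:
            if param_name == prefix or param_name.startswith(prefix + "."):
                return "float32"
    return requested_dtype

# With the ROCm workaround active, the whole vision tower is silently FP32:
workaround = ["visual", "mlp_AR"]
print(resolve_param_dtype("visual.encoder.conv.weight", "bfloat16", workaround))   # float32
print(resolve_param_dtype("language_model.layers.0.weight", "bfloat16", workaround))  # bfloat16
```

This is why the model "claims" BF16 while part of it runs in FP32: the override happens at load time and is invisible to the caller.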

Problem

This workaround has significant downsides:

  1. Doubled VRAM: FP32 weights and activations consume 2x the memory of their BF16 equivalents
  2. Reduced throughput: the model cannot leverage the native BF16 matrix-core performance of AMD GPUs
  3. Inconsistent behavior: the model reports BF16 dtype, but the vision encoder silently runs in FP32

Root Cause

The upstream Paddle framework did not register BF16 convolution kernels (conv2d, conv3d, depthwise_conv2d) for the HIP (ROCm) backend. When a BF16 model attempted convolution, it failed with:

RuntimeError: The kernel with key (GPU, Undefined(AnyLayout), bfloat16) of kernel `conv2d` is not registered
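The dispatch failure can be mimicked with a toy registry (a simplified Python illustration, not Paddle's actual C++ kernel registry): kernels are looked up by a (backend, layout, dtype) key, and the bfloat16 key was never registered for the HIP backend.

```python
# Toy kernel registry keyed by (backend, layout, dtype).
KERNELS = {
    ("GPU", "AnyLayout", "float32"): "conv2d_fp32_kernel",
    ("GPU", "AnyLayout", "float16"): "conv2d_fp16_kernel",
    # ("GPU", "AnyLayout", "bfloat16") is missing on ROCm before the fix.
}

def find_kernel(op, backend, layout, dtype):
    """Look up a kernel; raise an error shaped like Paddle's if absent."""
    key = (backend, layout, dtype)
    if key not in KERNELS:
        raise RuntimeError(
            f"The kernel with key ({backend}, Undefined({layout}), {dtype}) "
            f"of kernel `{op}` is not registered"
        )
    return KERNELS[key]

find_kernel("conv2d", "GPU", "AnyLayout", "float32")   # resolves
try:
    find_kernel("conv2d", "GPU", "AnyLayout", "bfloat16")
except RuntimeError as e:
    print(e)  # mirrors the error above
```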

Resolution Path

A fix has been submitted to the Paddle framework: PaddlePaddle/Paddle#78587

This PR adds phi::bfloat16 to the HIP kernel registration macros in conv_kernel.cu and conv_grad_kernel.cu, enabling native BF16 convolution on AMD GPUs.
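Conceptually, the fix extends the set of dtypes the HIP conv kernels register under, so that a (GPU, AnyLayout, bfloat16) lookup resolves. The sketch below illustrates this in Python; the real change is in Paddle's C++ registration macros, not code like this.

```python
# Simplified model of kernel registration: one registry entry per
# (op, backend, layout, dtype) combination.
REGISTERED = {}

def register_kernel(op, backend, dtypes):
    for dtype in dtypes:
        REGISTERED[(op, backend, "AnyLayout", dtype)] = f"{op}_{dtype}_kernel"

# Before the fix, the HIP dtype list stopped at float16;
# the framework fix effectively appends bfloat16 for the conv ops:
for op in ("conv2d", "conv3d", "depthwise_conv2d"):
    register_kernel(op, "GPU", ["float32", "float16", "bfloat16"])

assert ("conv2d", "GPU", "AnyLayout", "bfloat16") in REGISTERED
```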

Proposed Change

Once the Paddle framework fix is merged, this workaround in PaddleX should be removed:

# Before (current)
_keep_in_fp32_modules = ["visual", "mlp_AR"]

# After
_keep_in_fp32_modules = None

A corresponding PR for PaddleX will be submitted alongside the Paddle framework fix.
