kernels: support AMD-FP4 data type and low-precision training #27

Draft: zhitwang17 wants to merge 6 commits into main from zhitao/support-amd-fp4

@zhitwang17

Collaborator

Add AMD-FP4 low-precision training support

Summary

This PR introduces AMD-FP4 as a first-class FP4 quantization scheme alongside
the existing NVFP4 path. AMD-FP4 shares the NVFP4 micro-block layout (E2M1 elements +
FP32 per-tensor outer scale) but uses a UE5M3 inner-block scale (GFXIPARCH-2067
§19.10 / OCP E5M3) instead of E4M3, giving ~256× wider inner-scale dynamic range at the
same block size. The kernel stack is refactored so NVFP4 and AMD-FP4 are peer recipes
that reuse a single block-quant implementation, differing only in the inner-scale grid.
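
To make the layout concrete, here is a minimal NumPy sketch of two-level scaled FP4 quantization. It assumes a micro-block size of 16 and the usual E2M1 magnitude grid, and it keeps the inner scale as an unrounded float32 instead of rounding it to E4M3 or UE5M3. The function names are hypothetical; this is not the kernel code in this PR.

```python
import numpy as np

# Representable E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_MAX = 6.0
BLOCK = 16  # assumed micro-block size


def quantize_blocks(x, outer_scale):
    """Quantize a 1-D array whose length is a multiple of BLOCK.

    Returns (q, inner_scale): q holds E2M1 values stored as float32, and
    inner_scale holds one unrounded float32 scale per block.
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK) / outer_scale
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Map each block's largest magnitude to the top of the E2M1 grid.
    inner_scale = np.where(amax > 0, amax / E2M1_MAX, 1.0).astype(np.float32)
    scaled = x / inner_scale
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q.astype(np.float32), inner_scale


def dequantize_blocks(q, inner_scale, outer_scale):
    """Invert quantize_blocks up to E2M1 rounding error."""
    return (q * inner_scale * outer_scale).reshape(-1)
```

In the real formats the inner scale is itself rounded to an 8-bit grid, which is where the choice between E4M3 and UE5M3 matters.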

Changes

Kernel reorganization

  • Renamed fp4/fp4_common → fp4/fp4_primitives (moved grouped_utils,
    tensor_wrappers, triton_fp4_ops) and introduced a shared
    fp4/outer_scaled_fp4/ layout (api, kernels, pack_unpack, scales) that backs
    both NVFP4 and AMD-FP4. NVFP4 quantization shrinks ~720 lines by delegating to the
    shared primitives.

Inner-scale dtype primitives

  • Added e4m3_ops and ue5m3_ops inner-scale dtype primitives, with ue5m3 providing
    the wider dynamic range required by AMD-FP4.
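
As a rough illustration of what an inner-scale dtype primitive does, the sketch below decodes an 8-bit scale code for either layout. It assumes an IEEE-style exponent bias of 2^(E-1)-1 (7 for E4M3, 15 for UE5M3), treats UE5M3 as having no sign bit, and ignores any reserved NaN/Inf encodings. decode_scale is a hypothetical helper, not the API of e4m3_ops or ue5m3_ops.

```python
def decode_scale(code, scale_format):
    """Decode an 8-bit inner-scale code to a Python float.

    Assumptions (not taken from the PR): bias = 2**(exp_bits - 1) - 1,
    3 mantissa bits, and no special handling of NaN/Inf codes.
    """
    if scale_format == "e4m3":
        exp_bits, signed = 4, True
    elif scale_format == "ue5m3":
        exp_bits, signed = 5, False
    else:
        raise ValueError(f"unknown scale_format: {scale_format!r}")

    man_bits = 3
    bias = 2 ** (exp_bits - 1) - 1
    man = code & 0b111
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    sign = -1.0 if signed and (code >> 7) & 1 else 1.0

    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / 8) * 2.0 ** (1 - bias)
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)
```

Under these assumptions, the extra exponent bit in UE5M3 comes from dropping the sign bit, which a scale factor does not need.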

scale_format-aware quantization

  • convert_to/from_* now take a scale_format ("e4m3" | "ue5m3") so a single
    block-quant routine serves both schemes. The legacy
    convert_to_nvfp4(scale_format="ue5m3") path is deprecated in favor of the dedicated
    AMD-FP4 ops.
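
A minimal sketch of how a scale_format parameter and the deprecated path could fit together. The Python-level signatures and the block_quantize stub are hypothetical stand-ins, assumed for illustration; the actual ops are registered through ATen and may look different.

```python
import warnings

_VALID_SCALE_FORMATS = ("e4m3", "ue5m3")


def block_quantize(x, scale_format):
    """Stand-in for the shared block-quant routine (hypothetical)."""
    if scale_format not in _VALID_SCALE_FORMATS:
        raise ValueError(f"scale_format must be one of {_VALID_SCALE_FORMATS}")
    # A real implementation would quantize here; this stub only records
    # which inner-scale grid was requested.
    return {"data": x, "scale_format": scale_format}


def convert_to_nvfp4(x, scale_format="e4m3"):
    """NVFP4 entry point; the ue5m3 variant is kept only for compatibility."""
    if scale_format == "ue5m3":
        warnings.warn(
            "convert_to_nvfp4(scale_format='ue5m3') is deprecated; "
            "use convert_to_amdfp4 instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return block_quantize(x, scale_format)


def convert_to_amdfp4(x):
    """AMD-FP4 entry point with the inner scale pinned to ue5m3."""
    return block_quantize(x, "ue5m3")
```

Both entry points call the same routine, so the deprecated call still works while steering callers toward the dedicated op.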

AMD-FP4 ops + dispatch routing

  • New alto::convert_to_amdfp4 / convert_from_amdfp4 ATen ops (inner scale pinned to
    ue5m3), plus amdfp_linear and amdfp_grouped_gemm.
  • Dispatch config adds "amdfp4" as a precision peer of "nvfp4" with an
    inner_scale_format field; __post_init__ forces ue5m3 for amdfp4 and rejects
    mismatched configs. tensor/conversion updated to route AMD-FP4 calls.
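
The config guard described above might look roughly like the following. DispatchConfig and its defaults are assumptions made for illustration (the real config presumably carries more fields and precisions); only the "amdfp4" / "nvfp4" values, the inner_scale_format field, and the __post_init__ behavior come from this description.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default inner-scale format per precision.
_DEFAULT_INNER_SCALE = {"nvfp4": "e4m3", "amdfp4": "ue5m3"}


@dataclass
class DispatchConfig:
    """Hypothetical, reduced stand-in for the dispatch config."""

    precision: str = "nvfp4"
    inner_scale_format: Optional[str] = None  # None means "use the default"

    def __post_init__(self):
        if self.precision not in _DEFAULT_INNER_SCALE:
            raise ValueError(f"unsupported precision: {self.precision!r}")
        if self.inner_scale_format is None:
            # Fill in the default, which is ue5m3 for amdfp4.
            self.inner_scale_format = _DEFAULT_INNER_SCALE[self.precision]
        if self.precision == "amdfp4" and self.inner_scale_format != "ue5m3":
            # Reject a mismatched config instead of silently overriding it.
            raise ValueError("amdfp4 requires inner_scale_format='ue5m3'")
```

Validating in __post_init__ means an inconsistent config fails at construction time, before any kernel is dispatched.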

LPT modifier

  • LowPrecisionTrainingModifier accepts "amdfp4" as a valid scheme and applies the
    same lora_rank % 16 == 0 constraint as NVFP4.
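
For reference, the rank constraint amounts to a check like the one below. validate_lora_rank is a hypothetical helper written for this note; the actual check lives inside LowPrecisionTrainingModifier and may be structured differently.

```python
FP4_SCHEMES = ("nvfp4", "amdfp4")


def validate_lora_rank(scheme, lora_rank):
    """Raise if an FP4 scheme is paired with a rank that is not a multiple of 16."""
    if scheme in FP4_SCHEMES and lora_rank % 16 != 0:
        raise ValueError(
            f"{scheme} requires lora_rank % 16 == 0, got lora_rank={lora_rank}"
        )
```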

Test status

Op-Level Test

All AMD-FP4 and NVFP4 unit tests pass: 748 passed, 0 failed (~111s) on MI300X / ROCm.

  • New tests/unittest/amdfp4/ suite: quantization, linear, grouped GEMM, UE5M3/E4M3
    dtype, dispatch guards, Triton↔PyTorch parity, and an A/B matrix cross-checking
    AMD-FP4 against the reference path.
  • Extended tests/unittest/nvfp4/ suites for the new scale_format parameter and shared
    primitives (no regressions).

E2E Test

Debug Model

GPT-OSS debug model, compared against BF16 and NVFP4:
[image]

E2E full test with GPT-OSS-20B

[image]

@zhitwang17 self-assigned this on Jun 8, 2026