kernels: support AMD-FP4 data type and low-precision training by zhitwang17 · Pull Request #27 · AMD-AGI/ALTO

zhitwang17 · 2026-06-08T08:58:01Z

Add AMD-FP4 low-precision training support

Summary

This PR introduces AMD-FP4 as a first-class FP4 quantization scheme alongside
the existing NVFP4 path. AMD-FP4 shares the NVFP4 micro-block layout (E2M1 elements +
FP32 per-tensor outer scale) but uses a UE5M3 inner-block scale (GFXIPARCH-2067
§19.10 / OCP E5M3) instead of E4M3, giving ~256× wider inner-scale dynamic range at the
same block size. The kernel stack is refactored so NVFP4 and AMD-FP4 are peer recipes
that reuse a single block-quant implementation, differing only in the inner-scale grid.
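The two-level scaling described above can be illustrated with a minimal pure-Python sketch. This is not the PR's kernel code: the function names are hypothetical, the inner scale stays unrounded unless a rounding callback is passed (that callback is where an E4M3 or UE5M3 grid would plug in), and ties round toward the smaller magnitude for simplicity. Only the E2M1 magnitude grid is standard.

```python
# Magnitudes representable by an E2M1 element (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here; real kernels may round differently.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return sign * nearest


def quantize_block(block, outer_scale, round_inner_scale=lambda s: s):
    """Quantize one micro-block; returns (e2m1_values, inner_scale).

    outer_scale is the per-tensor FP32 scale; round_inner_scale is a
    hypothetical hook that snaps the per-block scale to its 8-bit grid.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 0.0
    inner = round_inner_scale(amax / (E2M1_MAX * outer_scale))
    step = inner * outer_scale
    if step == 0.0:  # inner scale underflowed to zero
        return [0.0] * len(block), inner
    return [round_to_e2m1(v / step) for v in block], inner


def dequantize_block(values, inner_scale, outer_scale):
    """Invert quantize_block: element * inner scale * outer scale."""
    return [v * inner_scale * outer_scale for v in values]
```

In this sketch the two recipes would differ only in the `round_inner_scale` callback, which mirrors the PR's statement that NVFP4 and AMD-FP4 differ only in the inner-scale grid.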

Changes

Kernel reorganization

Renamed fp4/fp4_common → fp4/fp4_primitives (moved grouped_utils,
tensor_wrappers, triton_fp4_ops) and introduced a shared
fp4/outer_scaled_fp4/ layout (api, kernels, pack_unpack, scales) that backs
both NVFP4 and AMD-FP4. The NVFP4 quantization code shrinks by ~720 lines because
it now delegates to the shared primitives.

Inner-scale dtype primitives

Added e4m3_ops and ue5m3_ops inner-scale dtype primitives, with ue5m3 providing
the wider dynamic range required by AMD-FP4.
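To make the difference between the two inner-scale grids concrete, here is a small decoding sketch. It is not the PR's e4m3_ops / ue5m3_ops code. The E4M3 parameters (bias 7, largest finite value 448) follow the OCP FP8 format; the UE5M3 parameters (bias 15, only the all-ones code reserved) are assumptions made for illustration, since the PR text does not spell out the bit layout.

```python
def decode_unsigned_minifloat(code, exp_bits, man_bits, bias):
    """Decode the magnitude of a small float with IEEE-style subnormals.

    Any sign bit above the exponent field is ignored, and special values
    (NaN/Inf encodings) are not handled.
    """
    man_field = code & ((1 << man_bits) - 1)
    exp_field = (code >> man_bits) & ((1 << exp_bits) - 1)
    fraction = man_field / (1 << man_bits)
    if exp_field == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return fraction * 2.0 ** (1 - bias)
    return (1.0 + fraction) * 2.0 ** (exp_field - bias)


E4M3 = dict(exp_bits=4, man_bits=3, bias=7)
# Assumed layout for UE5M3: 5 exponent bits, 3 mantissa bits, bias 15.
UE5M3_ASSUMED = dict(exp_bits=5, man_bits=3, bias=15)

# Largest finite magnitudes if the all-ones code is the only reserved one.
e4m3_max = decode_unsigned_minifloat(0x7E, **E4M3)
ue5m3_max = decode_unsigned_minifloat(0xFE, **UE5M3_ASSUMED)
```

Under these assumptions `ue5m3_max / e4m3_max` is 256, which may be what the ~256× figure in the Summary refers to.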

scale_format-aware quantization

convert_to/from_* now take a scale_format ("e4m3" | "ue5m3") so a single
block-quant routine serves both schemes. The legacy
convert_to_nvfp4(scale_format="ue5m3") path is deprecated in favor of the dedicated
AMD-FP4 ops.
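The deprecation described above could look roughly like the following. The function names carry a `_sketch` suffix because they are stand-ins, not the PR's actual signatures, and the shared routine is a placeholder that only records which inner-scale grid was requested.

```python
import warnings

VALID_SCALE_FORMATS = ("e4m3", "ue5m3")


def _block_quant_sketch(values, scale_format):
    """Placeholder for the shared block-quant routine (hypothetical)."""
    if scale_format not in VALID_SCALE_FORMATS:
        raise ValueError(f"unknown scale_format: {scale_format!r}")
    return {"values": list(values), "scale_format": scale_format}


def convert_to_amdfp4_sketch(values):
    """Dedicated AMD-FP4 entry point: inner scale pinned to ue5m3."""
    return _block_quant_sketch(values, "ue5m3")


def convert_to_nvfp4_sketch(values, scale_format="e4m3"):
    """NVFP4 entry point; the ue5m3 variant still works but warns."""
    if scale_format == "ue5m3":
        warnings.warn(
            "scale_format='ue5m3' on the NVFP4 path is deprecated; "
            "use the dedicated AMD-FP4 conversion instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return _block_quant_sketch(values, scale_format)
```

Keeping the legacy call functional while it warns lets existing callers migrate without an immediate break; whether the PR does exactly this is not stated.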

AMD-FP4 ops + dispatch routing

New alto::convert_to_amdfp4 / convert_from_amdfp4 ATen ops (inner scale pinned to
ue5m3), plus amdfp_linear and amdfp_grouped_gemm.
Dispatch config adds "amdfp4" as a precision peer of "nvfp4" with an
inner_scale_format field; __post_init__ forces ue5m3 for amdfp4 and rejects
mismatched configs. tensor/conversion is updated to route AMD-FP4 calls.
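One way to read the `__post_init__` behavior is sketched below. The class name and defaults are hypothetical; only the "amdfp4" / "nvfp4" precision values and the inner_scale_format field come from the PR. The sketch assumes an unset field is filled from the precision and that an explicit mismatch for amdfp4 raises.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default inner-scale grid per precision.
_DEFAULT_INNER_SCALE = {"nvfp4": "e4m3", "amdfp4": "ue5m3"}


@dataclass
class DispatchConfigSketch:
    """Hypothetical stand-in for the dispatch config described above."""

    precision: str = "nvfp4"
    inner_scale_format: Optional[str] = None

    def __post_init__(self):
        if self.precision not in _DEFAULT_INNER_SCALE:
            raise ValueError(f"unsupported precision: {self.precision!r}")
        if self.inner_scale_format is None:
            # Fill the grid from the precision when the caller leaves it unset.
            self.inner_scale_format = _DEFAULT_INNER_SCALE[self.precision]
        if self.inner_scale_format not in ("e4m3", "ue5m3"):
            raise ValueError(
                f"unsupported inner_scale_format: {self.inner_scale_format!r}"
            )
        if self.precision == "amdfp4" and self.inner_scale_format != "ue5m3":
            # AMD-FP4 is defined by its UE5M3 inner scale, so reject a mismatch.
            raise ValueError("amdfp4 requires inner_scale_format='ue5m3'")
```

For example, `DispatchConfigSketch(precision="amdfp4")` ends up with `inner_scale_format == "ue5m3"`, while passing `"e4m3"` explicitly raises `ValueError`.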

LPT modifier

LowPrecisionTrainingModifier accepts "amdfp4" as a valid scheme and applies the
same lora_rank % 16 == 0 constraint as NVFP4.
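The rank constraint amounts to a simple guard, sketched here with a hypothetical helper name. The sketch assumes schemes other than the two FP4 recipes are left unconstrained, which the PR does not state.

```python
FP4_SCHEMES = ("nvfp4", "amdfp4")


def check_lora_rank_sketch(scheme, lora_rank):
    """Raise if an FP4 scheme is paired with a rank that is not a multiple of 16."""
    if scheme in FP4_SCHEMES and lora_rank % 16 != 0:
        raise ValueError(
            f"{scheme} requires lora_rank % 16 == 0, got lora_rank={lora_rank}"
        )
```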

Test status

Op-Level Test

All AMD-FP4 and NVFP4 unit tests pass: 748 passed, 0 failed (~111s) on MI300X / ROCm.

New tests/unittest/amdfp4/ suite: quantization, linear, grouped GEMM, UE5M3/E4M3
dtype, dispatch guards, Triton↔PyTorch parity, and an A/B matrix cross-checking
AMD-FP4 against the reference path.
Extended tests/unittest/nvfp4/ suites for the new scale_format parameter and shared
primitives (no regressions).

E2E Test

Debug Model

GPT-OSS debug model, compared against BF16 and NVFP4

E2E full test with GPT-OSS-20B

zhitwang17 added 6 commits June 5, 2026 02:15

69232b7 refactor: reorganize fp4 kernels into fp4_primitives and outer_scaled_fp4 layout
200b3a6 feat: add UE5M3 and E4M3 inner-scale dtype primitives
c1feb62 feat: add scale_format-aware fp4 quantization and dequantization
c445515 feat: add AMD-FP4 linear and grouped GEMM ops with dispatch routing
2fd86ec test: add AMD-FP4 unit tests and extend NVFP4 suites
cd99b11 modifiers: support lpt for amdfp4

zhitwang17 self-assigned this Jun 8, 2026

zhitwang17 wants to merge 6 commits into main from zhitao/support-amd-fp4