kernels: support AMD-FP4 data type and low-precision training #27
Draft
zhitwang17 wants to merge 6 commits into
Add AMD-FP4 low-precision training support
Summary
This PR introduces AMD-FP4 as a first-class FP4 quantization scheme alongside
the existing NVFP4 path. AMD-FP4 shares the NVFP4 micro-block layout (E2M1 elements +
FP32 per-tensor outer scale) but uses a UE5M3 inner-block scale (GFXIPARCH-2067
§19.10 / OCP E5M3) instead of E4M3, giving ~256× wider inner-scale dynamic range at the
same block size. The kernel stack is refactored so NVFP4 and AMD-FP4 are peer recipes
that reuse a single block-quant implementation, differing only in the inner-scale grid.
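The "~256×" figure can be sanity-checked with a small decoder for unsigned minifloat magnitudes. This is only a sketch, not code from this PR: it assumes E4M3 uses bias 7 and UE5M3 uses bias 15, and that in both formats only the all-ones code is reserved (NaN), as in OCP E4M3. Whether UE5M3 as specified reserves exactly that code is an assumption here; the helper names are hypothetical.

```python
def decode_magnitude(code: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode an unsigned minifloat code (no sign bit) into a Python float."""
    exp_field = (code >> man_bits) & ((1 << exp_bits) - 1)
    man_field = code & ((1 << man_bits) - 1)
    frac = man_field / (1 << man_bits)
    if exp_field == 0:
        # Subnormal: no implicit leading one, exponent fixed at 1 - bias.
        return frac * 2.0 ** (1 - bias)
    return (1.0 + frac) * 2.0 ** (exp_field - bias)


def max_finite(exp_bits: int, man_bits: int, bias: int) -> float:
    # ASSUMPTION: only the all-ones code is reserved (NaN), so the largest
    # finite value is the code just below it.
    top_code = (1 << (exp_bits + man_bits)) - 2
    return decode_magnitude(top_code, exp_bits, man_bits, bias)


def min_subnormal(exp_bits: int, man_bits: int, bias: int) -> float:
    # Smallest positive value: exponent field 0, mantissa field 1.
    return decode_magnitude(1, exp_bits, man_bits, bias)


E4M3_MAX = max_finite(4, 3, 7)      # magnitude bits of E4M3
UE5M3_MAX = max_finite(5, 3, 15)    # assumed UE5M3 layout
E4M3_MIN = min_subnormal(4, 3, 7)
UE5M3_MIN = min_subnormal(5, 3, 15)

print(E4M3_MAX, UE5M3_MAX, UE5M3_MAX / E4M3_MAX)
print(E4M3_MIN, UE5M3_MIN, E4M3_MIN / UE5M3_MIN)
```

Under these assumptions the largest representable inner scale grows by a factor of 256 and the smallest shrinks by the same factor, which is one plausible reading of the figure quoted above.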
Changes
Kernel reorganization
fp4/fp4_common → fp4/fp4_primitives (moved grouped_utils, tensor_wrappers,
triton_fp4_ops) and introduced a shared fp4/outer_scaled_fp4/ layout (api, kernels,
pack_unpack, scales) that backs both NVFP4 and AMD-FP4. NVFP4 quantization shrinks by
~720 lines by delegating to the shared primitives.
Inner-scale dtype primitives
e4m3_ops and ue5m3_ops inner-scale dtype primitives, with ue5m3 providing the wider
dynamic range required by AMD-FP4.
scale_format-aware quantization
convert_to/from_* now take a scale_format ("e4m3" | "ue5m3") so a single block-quant
routine serves both schemes. The legacy convert_to_nvfp4(scale_format="ue5m3") path is
deprecated in favor of the dedicated AMD-FP4 ops.
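To illustrate how one block-quant routine can serve both schemes, here is a pure-Python sketch with a scale_format switch. It is not the PR's kernel: quantize_block, dequantize_block, and round_scale are hypothetical names, the real convert_to/from_* signatures are not shown in this description, and the UE5M3 parameters (bias 15, max 114688) are assumptions. Only the inner-scale grid differs between the two formats; the E2M1 element rounding is shared.

```python
import math

# Magnitudes representable by an E2M1 element (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# (mantissa bits, exponent bias, max finite value) per inner-scale format.
# ASSUMPTION: the ue5m3 entry uses bias 15 and reserves only the all-ones code.
SCALE_FORMATS = {"e4m3": (3, 7, 448.0), "ue5m3": (3, 15, 114688.0)}


def round_scale(x: float, scale_format: str) -> float:
    """Round a positive scale onto the chosen inner-scale grid, saturating at max."""
    man_bits, bias, max_value = SCALE_FORMATS[scale_format]
    if x <= 0.0:
        return 0.0
    x = min(x, max_value)
    _, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= m < 1
    lead = max(e - 1, 1 - bias)            # exponent of the leading bit (subnormal floor)
    step = 2.0 ** (lead - man_bits)        # spacing of representable values near x
    return min(round(x / step) * step, max_value)


def quantize_block(values, scale_format, outer_scale=1.0):
    """Return (signed E2M1 values, inner scale) for one micro-block."""
    amax = max(abs(v) for v in values)
    inner = round_scale(amax / (E2M1_GRID[-1] * outer_scale), scale_format)
    if inner == 0.0:
        return [0.0] * len(values), 0.0
    quantized = []
    for v in values:
        mag = abs(v) / (inner * outer_scale)
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))  # saturates at 6
        quantized.append(math.copysign(nearest, v))
    return quantized, inner


def dequantize_block(quantized, inner, outer_scale=1.0):
    return [q * inner * outer_scale for q in quantized]


block = [6000.0, 3000.0, -1500.0, 0.0]
for fmt in ("e4m3", "ue5m3"):
    q, s = quantize_block(block, fmt)
    print(fmt, s, dequantize_block(q, s))
```

With an outer scale of 1.0, the block above needs an inner scale near 1000. The e4m3 grid saturates at 448 and clips the largest elements, while the assumed ue5m3 grid can hold the scale. In practice the FP32 outer scale is what keeps E4M3 inner scales in range, so this example only shows the effect of the inner-scale grid in isolation.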
AMD-FP4 ops + dispatch routing
alto::convert_to_amdfp4 / convert_from_amdfp4 ATen ops (inner scale pinned to ue5m3),
plus amdfp_linear and amdfp_grouped_gemm.
config adds "amdfp4" as a precision peer of "nvfp4" with an inner_scale_format field;
__post_init__ forces ue5m3 for amdfp4 and rejects mismatched configs.
tensor/conversion updated to route AMD-FP4 calls.
LPT modifier
LowPrecisionTrainingModifier accepts "amdfp4" as a valid scheme and applies the same
lora_rank % 16 == 0 constraint as NVFP4.
Test status
Op-Level Test
All AMD-FP4 and NVFP4 unit tests pass: 748 passed, 0 failed (~111s) on MI300X / ROCm.
tests/unittest/amdfp4/ suite: quantization, linear, grouped GEMM, UE5M3/E4M3 dtype,
dispatch guards, Triton↔PyTorch parity, and an A/B matrix cross-checking AMD-FP4
against the reference path.
tests/unittest/nvfp4/ suites for the new scale_format parameter and shared primitives
(no regressions).
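For reviewers who want a picture of what the dispatch guards check, the config rule from the Changes section (amdfp4 pins ue5m3, mismatches are rejected) and the lora_rank % 16 == 0 constraint can be sketched as below. The class and function names are hypothetical and the real config has more fields; treating an unset inner_scale_format as "fill in the default" is also an assumption about how "forces" behaves.

```python
from dataclasses import dataclass
from typing import Optional

# Default inner-scale format per precision, as described in this PR.
_DEFAULT_INNER_SCALE = {"nvfp4": "e4m3", "amdfp4": "ue5m3"}


@dataclass
class PrecisionConfigSketch:
    """Hypothetical stand-in for the real config; only the two relevant fields."""
    precision: str
    inner_scale_format: Optional[str] = None

    def __post_init__(self):
        if self.precision not in _DEFAULT_INNER_SCALE:
            raise ValueError(f"unknown precision: {self.precision!r}")
        if self.inner_scale_format is None:
            # ASSUMPTION: an unset field is filled with the scheme's default.
            self.inner_scale_format = _DEFAULT_INNER_SCALE[self.precision]
        if self.precision == "amdfp4" and self.inner_scale_format != "ue5m3":
            raise ValueError("amdfp4 requires inner_scale_format='ue5m3'")


def check_lora_rank(lora_rank: int) -> None:
    """Constraint shared by the NVFP4 and AMD-FP4 schemes per the PR text."""
    if lora_rank % 16 != 0:
        raise ValueError(f"lora_rank must be a multiple of 16, got {lora_rank}")


cfg = PrecisionConfigSketch(precision="amdfp4")
print(cfg.inner_scale_format)

try:
    PrecisionConfigSketch(precision="amdfp4", inner_scale_format="e4m3")
except ValueError as err:
    print("rejected:", err)
```

The sketch deliberately leaves nvfp4 with ue5m3 allowed, since the description calls that legacy path deprecated rather than removed.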
E2E Test
Debug Model
GPT-OSS debug model, compared against BF16 and NVFP4

E2E full test with GPT-OSS-20B