Gfx1250 moe by XingerZhu · Pull Request #402 · ROCm/FlyDSL

XingerZhu · 2026-04-14T18:01:51Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

- Fix Python version compatibility in meta.py: add support for Python < 3.11 by checking for positions attribute availability
- Replace hardcoded MLIR library paths in executor.py with environment variable MLIR_PATH, with clear error message when not set
- Update LLVM commit hash and enable ROCM runner in build script
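
The two Python-side fixes in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in meta.py or executor.py: the helper names `frame_positions` and `require_mlir_path` are hypothetical, and the error text is invented. It relies on `inspect.FrameInfo` having a `positions` attribute only from Python 3.11 onward.

```python
import inspect
import os


def frame_positions(frame_info):
    """Return frame_info.positions when present (Python >= 3.11), else None."""
    # Hypothetical helper: getattr with a default avoids AttributeError on older Pythons.
    return getattr(frame_info, "positions", None)


def require_mlir_path(env=None):
    """Read MLIR_PATH from the environment and fail loudly when it is missing."""
    # Hypothetical helper: `env` is injectable so the lookup can be tested without
    # touching the real process environment.
    env = os.environ if env is None else env
    path = env.get("MLIR_PATH")
    if not path:
        raise RuntimeError("MLIR_PATH is not set; export it before running.")
    return path


caller = inspect.stack()[0]
# On Python 3.11+ this is a positions object; on older versions it is None.
print(frame_positions(caller))
```

Reading the path through one function keeps the failure mode in a single place, so a missing variable produces one clear message instead of a later library-loading error.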

* [FLYDSL]:add copy_atom right_inverse
* [FLYDSL]: right_inverse dynamic process bugfix
* [FLYDSL]:Python refactoring and adaptation
* [FLYDSL]:rm example 05

* Migrate Python bindings to PyConcreteType<> and fix TypeID ODR violation
  - FlyExtension.cpp / FlyROCDLExtension.cpp: migrate from legacy mlir_type_subclass() to PyConcreteType<> CRTP pattern (required by new MLIR Python binding API). Types are defined inside namespace mlir::python::MLIR_BINDINGS_PYTHON_DOMAIN::fly, using ::mlir:: global qualifiers to avoid the mlir::python::mlir namespace collision when NB_DOMAIN=mlir.
  - CMakeLists.txt: remove MLIRFlyDialect / MLIRFlyROCDLDialect from _fly.so / _fly_rocdl.so PRIVATE_LINK_LIBS. These static archives were being linked into both the extension modules AND FlyPythonCAPI.so (via EMBED_CAPI_LINK_LIBS → MLIRCPIFly), creating duplicate TypeID static variables. The dialect registered under FlyPythonCAPI.so's TypeIDs but _fly.so looked up types with its own copy, causing "storage uniquer isn't initialized" at runtime. Now all symbols are resolved from FlyPythonCAPI.so.
  - FlyToROCDL.cpp: use string-based type matching for MmaAtomCDNA3_MFMA to work around the same TypeID ODR issue in the conversion pass, and fix ROCDL MFMA intrinsic call to use I32Attr attributes instead of Value operands for cbsz/abid/blgp control parameters.
* Fix pass registry ODR violation: register Fly passes via CAPI
  - PRIVATE_LINK_LIBS MLIRFlyToROCDL in _mlirRegisterEverything pulled in a local copy of MLIRPass, causing registerFlyPasses() to register into a LOCAL pass registry inside _mlirRegisterEverything.so while PassManager.parse() queried the GLOBAL registry in FlyPythonCAPI.so.
  - Fix by introducing CAPI functions (mlirRegisterFlyPasses, mlirRegisterFlyToROCDLConversionPass) in the CAPI libraries so pass registration happens inside FlyPythonCAPI.so's single global registry.
  - update cmake/llvm-hash.txt to keep same with triton llvm hash.
* Sync build_llvm.sh with pre_bumpupllvm and add ROCM runner

  Align script with pre_bumpupllvm branch: full clone, buildmlir dir, NVPTX target, NB_DOMAIN=mlir, package install by default. Keep reading LLVM commit from cmake/llvm-hash.txt. Add MLIR_ENABLE_ROCM_RUNNER=ON for GPU kernel execution support.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

- Rename C++ binding structs with Py prefix (e.g. IntTupleType -> PyIntTupleType) for consistency
- Add __all__ exports to typing, primitive, and gpu modules
- Add Int4 numeric type
- Fix frameInfo.positions compatibility for older Python versions
- Fix dialect import order to ensure _Dialect is properly exported
- Add fly_rocdl ops/enum gen copy rules in CMake
- Improve build_llvm.sh with configurable parallel jobs and --no-install flag
- Clean up redundant comments and formatting

Co-authored-by: Cursor <cursoragent@cursor.com>

gemm test ready

* [FLYDSL]: add recast_layout op
* [FLYDSL]: refactor
* [FLYDSL]: add detail namespace
* [FLYDSL]: add upcast assert
* [FLYDSL]: rm bits number
* [FLYDSL]: rm redundant code
* [FLYDSL]: bits number only support static value
* [FLYDSL]: change APIntAttr to I32Attr
* [FLYDSL]: rm notes

* fix run error
* port all gemm from main
* fix cudagraph hack
* add int4 version
* change flymemref convert
* test ok
* add build script
* fix graph2
* add files
* fix flops
* fix path
* fix local test
* fix
* clean
* update readme

* add compile only and dumpir

- Add fly-opt tool (tools/fly-opt/) for MLIR IR transformations, registering Fly/FlyROCDL dialects and all custom passes
- Add lit.cfg.py with fly-opt/FileCheck configuration
- Test using 'lit -v tests/' to test basic lowering tests
- Add LayoutAlgebra tests: construction, size/cosize, coordinate, composition, product, divide, int_tuple operations
- Add Transforms tests: canonicalize, layout_lowering
- Add Conversion tests for convert-fly-to-rocdl pass, split by category: type_conversion, memref_alloca, memref_ops, pointer_ops, mma_atom, gpu_ops

…gration
- Enable LLVM_BUILD_TOOLS so fly-opt is built with the default ninja target
- Add MLIR lit test section to scripts/run_tests.sh
- Update test/lit.cfg.py to use FLY_BUILD_DIR env var (default: build-fly)

…1250.py Add --bench mode that sweeps model configs (DeepSeek-TP/EP, GPToss) × dtypes (fp4/fp8/a8w4/fp16/bf16) × token counts with tabular TFLOPS/BW output. Reuses existing run_moe_stage1/stage2 runners. Original test mode is unaffected. Made-with: Cursor
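
A sweep of this shape (model configs × dtypes × token counts, reported as a table) can be sketched as below. This is only an illustration: `sweep`, `format_table`, and `fake_run_case` are hypothetical names, the placeholder runner returns made-up timings and counts rather than measurements, and the real harness reuses `run_moe_stage1`/`run_moe_stage2` instead.

```python
import itertools


def sweep(models, dtypes, token_counts, run_case):
    """Run every (model, dtype, tokens) combination and collect one row per case."""
    rows = []
    for model, dtype, tokens in itertools.product(models, dtypes, token_counts):
        # run_case is assumed to return (seconds, flops, bytes_moved) for one case.
        seconds, flops, nbytes = run_case(model, dtype, tokens)
        rows.append({
            "model": model,
            "dtype": dtype,
            "tokens": tokens,
            "tflops": flops / seconds / 1e12,
            "gbps": nbytes / seconds / 1e9,
        })
    return rows


def format_table(rows):
    """Render the collected rows as a fixed-width text table."""
    lines = [f"{'model':<12}{'dtype':<8}{'tokens':>8}{'TFLOPS':>10}{'GB/s':>10}"]
    for r in rows:
        lines.append(
            f"{r['model']:<12}{r['dtype']:<8}{r['tokens']:>8}"
            f"{r['tflops']:>10.2f}{r['gbps']:>10.1f}"
        )
    return "\n".join(lines)


def fake_run_case(model, dtype, tokens):
    # Placeholder runner with invented numbers: 1 ms per case.
    return 1e-3, 2.0e9 * tokens, 1.0e6 * tokens


rows = sweep(["model-a"], ["fp8", "bf16"], [1, 64], fake_run_case)
print(format_table(rows))
```

Passing the runner in as a callable is what lets a sweep reuse existing per-case runners without changing them.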

…non-aligned dimensions
- Fix TypeError in stage2 mxscale non-wave-specialized pipeline loop: when n_accs==1, scf_yield_ returns a single ArithValue instead of a list, causing _res[:n_accs] and _st[:n_accs] to fail. Normalize with isinstance check before slicing.
- Add automatic K-dimension zero-padding in run_moe_stage1 (model_dim) and run_moe_stage2 (inter_dim) for mxscale dtypes (fp4/fp8/a8w4) when the dimension is not divisible by tile_k. This enables GPToss (dim=2880, 2880%128=64) to run without manual dimension adjustment.
- Use original (unpadded) dimensions for FLOPS/bandwidth accounting.

Made-with: Cursor
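
The three fixes in this commit reduce to small pieces of arithmetic and normalization, sketched below in plain Python. The helper names (`as_list`, `round_up`, `zero_pad_k`, `gemm_flops`) are hypothetical and operate on ordinary lists rather than on FlyDSL values or tensors. The `2 * m * n * k` FLOP count is the usual GEMM convention and is assumed here, not taken from the PR.

```python
def as_list(values):
    """Normalize a single result or a sequence of results to a list before slicing."""
    if isinstance(values, (list, tuple)):
        return list(values)
    return [values]


def round_up(dim, tile):
    """Smallest multiple of `tile` that is >= dim."""
    return -(-dim // tile) * tile


def zero_pad_k(rows, tile_k):
    """Zero-pad each row so that its length (the K dimension) is a multiple of tile_k."""
    k = len(rows[0])
    k_pad = round_up(k, tile_k)
    return [row + [0] * (k_pad - k) for row in rows]


def gemm_flops(m, n, k):
    """Assumed convention: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


n_accs = 1
single_result = "acc0"  # stands in for a lone yielded value
print(as_list(single_result)[:n_accs])  # ['acc0']

k, tile_k = 2880, 128
print(k % tile_k, round_up(k, tile_k))  # 64 2944

# Accounting uses the original K, not the padded one.
print(gemm_flops(4, 8, k))
```

Zero padding along K leaves every dot product unchanged, because the extra terms are all zero. That is also why the unpadded dimension is the right one for FLOPS and bandwidth figures.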

Made-with: Cursor

# Conflicts:
# kernels/gemm_common_gfx1250.py
# tests/kernels/test_gemm_fp8fp4_gfx1250.py
# tests/kernels/test_wmma_gemm_gfx1250.py

…ix minor formatting Made-with: Cursor

- Split monolithic moe_gemm_2stage_gfx1250.py into:
  - moe_gemm_2stage_common_gfx1250.py: shared utilities
  - moe_gemm_2stage_wmma_gfx1250.py: fp16/bf16 WMMA kernels with own public API
  - moe_gemm_2stage_mxscale_gfx1250.py: fp4/fp8/a8w4 MXScale kernels with own public API
- Each module has self-contained compile_moe_gemm1/2/2_ex entry points
- Unsupported dtypes raise ValueError instead of fallback
- Split test_moe_gemm_gfx1250.py into test_moe_gemm_wmma_gfx1250.py and test_moe_gemm_mxscale_gfx1250.py with updated imports

Made-with: Cursor
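
The "raise instead of fall back" behaviour can be illustrated with a small validation helper. This is a sketch, not the modules' actual code: `check_dtype` and the two constant names are hypothetical, while the dtype groupings follow the split described above.

```python
# dtype groupings as described for the two kernel modules
WMMA_DTYPES = frozenset({"fp16", "bf16"})
MXSCALE_DTYPES = frozenset({"fp4", "fp8", "a8w4"})


def check_dtype(dtype, supported, module_name):
    """Return dtype if the module supports it; otherwise raise ValueError."""
    if dtype not in supported:
        raise ValueError(
            f"{module_name}: unsupported dtype {dtype!r}; "
            f"expected one of {sorted(supported)}"
        )
    return dtype


print(check_dtype("bf16", WMMA_DTYPES, "wmma"))
try:
    check_dtype("fp8", WMMA_DTYPES, "wmma")
except ValueError as err:
    print(err)
```

Failing early keeps a caller from silently getting a kernel for a different data type than the one requested.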

Copilot

Pull request overview

Adds gfx1250-focused Mixture-of-Experts (MoE) 2-stage GEMM coverage and supporting kernel/runtime utilities, including new WMMA fp16 kernels and TDM gather descriptor support.

Changes:

- Introduces a comprehensive gfx1250 MXScale/int-quant MoE 2-stage test harness with routing, correctness, and perf/benchmark options.
- Adds shared gfx1250 MoE kernel helpers plus new fp16/bf16 WMMA stage1/stage2 kernel implementations.
- Extends ROCDL TDM APIs with gather-mode descriptors/loads/stores, and adds MoE-oriented benchmarking helpers.

Reviewed changes

Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tests/kernels/test_moe_gemm_mxscale_gfx1250.py` | New gfx1250 MoE 2-stage test harness for fp4/fp8/a8w4 + int quant variants. |
| `tests/kernels/benchmark_common.py` | Adds reusable MoE benchmarking utilities (tile resolution, bytes moved, timing). |
| `python/flydsl/expr/rocdl/tdm_ops.py` | Adds TDM gather descriptor + gather load/store APIs for row-indexed transfers. |
| `kernels/moe_gemm_2stage_wmma_gfx1250.py` | New gfx1250 WMMA fp16/bf16 MoE stage1/stage2 kernel compilation entry points. |
| `kernels/moe_gemm_2stage_common_gfx1250.py` | New shared helpers used by gfx1250 MoE kernels (tiling, epilogues, wrappers). |
| `kernels/gemm_common_gfx1250.py` | Extends pipeline/barrier helpers and adds wave-specialized TDM load helper (but currently has duplicated definitions). |

Comments suppressed due to low confidence (1)

kernels/gemm_common_gfx1250.py:178

`WGP_BARRIER_ID`, `pipeline_fence_signal`, `pipeline_fence_wait`, and `issue_tdm_loads` are defined twice in this module (see the second block starting here). In Python the later definitions override the earlier ones, which defeats the new `scf.IfOp`-based implementation above and reintroduces the older `issue_tdm_loads` that uses a Python `if arith.cmpi(...)` (invalid for MLIR values). Please remove the duplicated older definitions (or merge the logic) so there is exactly one set of fence/load helpers, and ensure `issue_tdm_loads` uses IR control flow (`scf.IfOp`) rather than Python conditionals.

WGP_BARRIER_ID = -1


def pipeline_fence_signal(outstanding=0, use_cluster=False):
    """Signal half of a split barrier fence.
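
Both problems raised in this comment can be reproduced with a toy model. The sketch below does not use MLIR or FlyDSL: `SymbolicBool` and `ToyBuilder` are hypothetical stand-ins for a traced IR value and an op builder. In this toy, a Python `if` on the traced value raises. A real binding object may instead be always truthy, in which case the branch is decided once at trace time instead of being emitted into the IR.

```python
class SymbolicBool:
    """Toy stand-in for an IR value whose truth is only known when the kernel runs."""

    def __init__(self, name):
        self.name = name

    def __bool__(self):
        raise TypeError(
            f"{self.name} is a traced value; emit an IR-level if instead of Python `if`"
        )


class ToyBuilder:
    """Records operations instead of executing them, like an IR builder would."""

    def __init__(self):
        self.ops = []

    def if_op(self, cond, then_body):
        # The condition is recorded, not evaluated; the body is traced inside the region.
        self.ops.append(("if", cond.name))
        then_body(self)
        self.ops.append(("end_if", cond.name))

    def load(self, what):
        self.ops.append(("load", what))


def helper():
    return "first"


def helper():  # noqa: F811 - a later definition silently replaces the earlier one
    return "second"


print(helper())  # second

b = ToyBuilder()
is_loader_wave = SymbolicBool("is_loader_wave")
b.if_op(is_loader_wave, lambda bb: bb.load("tile_a"))
print(b.ops)
```

The duplicate `helper` shows why the order of definitions in a module matters: no warning is raised, and only the last one is ever called.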

Copilot

Pull request overview

Copilot reviewed 5 out of 8 changed files in this pull request and generated 3 comments.

… store

Extract shared logic between _compile_stage1_mxscale_kernel_impl and _compile_stage2_mxscale_kernel_impl into four new helpers in the common module:
- _compute_mxscale_tiling(): format config, WMMA constants, tiling math, parameter validation
- _make_mxscale_data_loaders(): factory for 9 identical LDS data-loading adapter closures
- _compute_pipeline_plan(): pre-load / tail plan computation
- _compute_tdm_store_layout(): TDM store D output LDS layout

Net effect: mxscale file shrinks by ~330 lines (3220 -> 2889) while common grows by ~326 lines with reusable infrastructure.

Made-with: Cursor

- Import _bf16_to_f16_wrapper from common instead of duplicating locally
- Merge _compile_moe_stage1/2_wmma_kernel into unified _compile_moe_wmma_gemm
- Simplify compile_moe_gemm1/2/2_ex to thin wrappers using **kw forwarding
- Reduces file from 1101 to 912 lines (-17%)

Made-with: Cursor

…common
- Deduplicate routing utilities (moe_sorting_torch_native, build_routing_buffers, get_topk_valid_mask, RoutingBuffers) by importing from test_moe_gemm.py
- Remove aiter CK comparison blocks (dead code on gfx1250)
- Remove unused w2 allocation/quantization from stage1 runners
- Clean up commented-out debug lines and unused imports
- Extract generic MoE benchmark sweep system (add_moe_bench_args, moe_bench_config, moe_bench_main) into benchmark_common.py

Made-with: Cursor

WGP_BARRIER_ID, pipeline_fence_signal, pipeline_fence_wait, and issue_tdm_loads were each defined twice. Keep the first (correct) versions that use scf.IfOp for proper MLIR IR generation. Made-with: Cursor

sjfeng1999 and others added 30 commits March 3, 2026 08:49

sync pre_v0.1

b2b94a2

update header macro

dd8c6fe

add separate target-specific rocdl dialect

eafa2d6

Add utility nbmodules

3bd4150

Add universalMma Atom

51e45a5

fix example02

866e2ed

Add DLTensorAdaptor for torch Tensor support

8b29d66

Add logger and EnvManager

ce61b2b

Refact Python module

bba5422

Fix missing module

6ca89f6

Add right inverse

8c784fb

* [FLYDSL]:add copy_atom right_inverse
* [FLYDSL]: right_inverse dynamic process bugfix
* [FLYDSL]:Python refactoring and adaptation
* [FLYDSL]:rm example 05

Add numeric typing

65f6a31

Add ASTRewriter and improve jit_function cache mechanism

a3e25cb

unwrap dsl_type before calling ir Op

9b56be5

fix missing export in primitive

26d45a9

Add tiled_copy partition

a250f90

Pre v0.1 gemm (#145)

8085ea0

gemm test ready

Pre v0.1 gemm fix (#153)

2679f23

* fix run error
* port all gemm from main
* fix cudagraph hack
* add int4 version
* change flymemref convert
* test ok
* add build script
* fix graph2
* add files
* fix flops
* fix path
* fix local test
* fix
* clean
* update readme

add compile only and dumpir (#154)

890c860

* add compile only and dumpir

add version and wheel build

6cee3f1

port docs

98fad64

build whl and dist version ok, upload pypi ok

04e7c46

add aot example

0a8f4fe

Apply clang-format to fly-opt.cpp

1570126

[Tests][Lit] Add lit tests to run_tests.sh and fix fly-opt build inte…

45e31d1

…gration
- Enable LLVM_BUILD_TOOLS so fly-opt is built with the default ninja target
- Add MLIR lit test section to scripts/run_tests.sh
- Update test/lit.cfg.py to use FLY_BUILD_DIR env var (default: build-fly)

aoli26 and others added 15 commits April 9, 2026 05:20

add matrix A TDM gather load

7f2dc69

fix tdm gather tensor dim1

6797983

support moe tdm gather store

1bc54c1

support moe stage2 A tdm gather load

c9082e2

moe stage1 valid-token A gather loads

1e8980a

fix b scale preshuffle bug

3ec1a32

optimize moe stage2 pipeline

4c3528c

merge gate up tdm descriptor

e3a6aec

tdm gather no-oob implementation

4e34d88

fix wave specific tdm

32ff433

Merge remote-tracking branch 'origin/main' into gfx1250_moe

5d1523f

Made-with: Cursor

# Conflicts:
# kernels/gemm_common_gfx1250.py
# tests/kernels/test_gemm_fp8fp4_gfx1250.py
# tests/kernels/test_wmma_gemm_gfx1250.py

Simplify hot_loop_scheduler WMMA scheduling to use equal halves and f…

c1d4d92

…ix minor formatting Made-with: Cursor

Copilot AI review requested due to automatic review settings April 14, 2026 18:01

Copilot started reviewing on behalf of XingerZhu April 14, 2026 18:03

Copilot AI reviewed Apr 14, 2026

Comment thread tests/kernels/test_moe_gemm_mxscale_gfx1250.py

XingerZhu requested a review from Copilot April 15, 2026 01:23

Copilot started reviewing on behalf of XingerZhu April 15, 2026 01:24

Copilot AI reviewed Apr 15, 2026

Comment thread tests/kernels/test_moe_gemm_mxscale_gfx1250.py Outdated

Comment thread tests/kernels/test_moe_gemm_mxscale_gfx1250.py Outdated

Comment thread kernels/gemm_common_gfx1250.py

aoli26 reviewed Apr 15, 2026

Comment thread kernels/gemm_common_gfx1250.py

XingerZhu added 4 commits April 15, 2026 12:15

Remove duplicate definitions in gemm_common_gfx1250

b1d6854

WGP_BARRIER_ID, pipeline_fence_signal, pipeline_fence_wait, and issue_tdm_loads were each defined twice. Keep the first (correct) versions that use scf.IfOp for proper MLIR IR generation. Made-with: Cursor

XingerZhu requested a review from coderfeli April 16, 2026 02:44

coderfeli approved these changes Apr 16, 2026

coderfeli merged commit f65e930 into main Apr 16, 2026
9 checks passed

coderfeli deleted the gfx1250_moe_new branch April 16, 2026 03:18

Gfx1250 moe #402
coderfeli merged 164 commits into main from gfx1250_moe_new

8 participants
