
Adapt to FlagGems 3.0 #7

Open
ArCyanic wants to merge 186 commits into xlinsist:cpu-dev from ArCyanic:cpu-dev

Conversation


@ArCyanic ArCyanic commented Aug 1, 2025

Here's a summary of tests:

| Test | Success | Failure | Skipped | Note |
| --- | --- | --- | --- | --- |
| attention_ops | — | — | — | Could not finish; the process was repeatedly killed by the system |
| binary_pointwise_ops | 948 | 53 | 181 | All failures are precision shortfalls |
| blas_ops | 161 | 0 | 0 | |
| distribution_ops | 9 | 0 | 0 | |
| general_reduction_ops | 56 | 70 | 6 | Tests run very long; all failures are compilation errors |
| libentry | 5 | 0 | 1 | ✔; the skipped test appears to require an accelerator card |
| norm_ops | 58 | 0 | 2 | ✔; tests run very long |
| pointwise_dynamic | 65 | 0 | 14 | ✔; none of the skips are CPU-related |
| pointwise_dynamic_type_promotion | 22 | 0 | 3 | `where` operator |
| quant | 0 | 0 | 12 | Requires CUDA |
| reduction_ops | 158 | 12 | 1 | Tests run very long (over ten hours); hard to shrink the problem sizes |
| shape_utils | 17 | 0 | 0 | |
| special_ops | 2270 | 14 | 188 | Some require CUDA; some APIs are not yet adapted |
| tensor_constructor_ops | 339 | 0 | 0 | |
| tensor_wrapper | 4 | 0 | 0 | |
| unary_pointwise_ops | 232 | 4 | 5 | Compilation errors; insufficient precision |

For the full CPU porting document, please refer to CPUPorting.md in the project.

Attached is a screenshot of the last completed test run:
(screenshot)

jiangmf1992 and others added 30 commits May 1, 2025 20:52
Co-authored-by: suxiangM <maxiang992@128.com>
Co-authored-by: jinchengxiong <jinchengxiong@baidu.com>
* add [angle, dot, index_put, nan_to_num, polar] supported

* fix angle
* fix-speedup:rand

* fix-randn: change unroll to 8

* change blocksize 512 to 1024

* seed+1

* fix-speedup:all, normal
* fix codestyle

* update

* [MutiBackend] update MutiBackend Framework

* update

* update multibackend README
* [kunlunxin] fix any buffer_size_limit param

* fix all
* [METAX] modify metax backend debug message

* [METAX] improve index_select and repeat_interleave performance

* [METAX] add max_int accuracy test for metax

---------

Co-authored-by: mx-flaggems-user <m01080@metax-tech.com>
…rm_interface and upsample_bicubic2d_aa.

MTHREADS: Fix op vdot and fill_.

MTHREADS: Fix some ops.

MTHREADS: Fix a bug where ops under the _mthreads backend could not be recognized.

Mthreads: Skip two ops in the benchmark that are not supported, enable op all.
MTHREADS: Add addmm kernel for _mthreads backend, and fix a bug of mm kernel.

MTHREADS: Add bmm kernel for _mthreads backend.
* [hygon] fix accuracy error for trunc div

* [hygon] fix isclose accuracy error

---------

Co-authored-by: suxiangM <maxiang992@128.com>
* [Huawei] Ascend code for FlagGems (flagos-ai#608)

* Add files via upload

* Update __init__.py

* Update device.py

* Update commom_utils.py

* Update __init__.py

* Update gelu_and_mul.py

* Update angle.py

* Update div.py

* Update gelu.py

* Update isinf.py

* Update isnan.py

* Update nan_to_num.py

* Update pow.py

* Update tanh.py

* Update vector_norm.py

* Update performance_utils.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_unary_pointwise_perf.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Create __init__.py

* Update pointwise_dynamic.py

* Update test_blas_ops.py

* Update test_blas_ops.py

* Update test_general_reduction_ops.py

* Update test_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_blas_ops.py

* Update test_special_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_norm_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_blas_ops.py

* Update test_blas_ops.py

* Update test_special_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_special_ops.py

* Update test_unary_pointwise_ops.py

* Update pointwise_dynamic.py

* Update __init__.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_unary_pointwise_perf.py

* Update test_blas_perf.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_binary_pointwise_ops.py

* Update test_blas_ops.py

* Update test_general_reduction_ops.py

* Update test_norm_ops.py

* Update test_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update __init__.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_blas_perf.py

* Delete src/flag_gems/runtime/backend/_ascend/ops directory

* Update test_binary_pointwise_ops.py

* [BACKEND] Init ascend backend

---------

Co-authored-by: Jiang_wj <62932620+Sans1J@users.noreply.github.com>
* [Doc] update README with citation

* [no ci]update

* fix cpp doc
* [KUNLUN] Speed Up Full/Ones/Zeros

* [KUNLUN] Fix Ones/Zeros

---------

Co-authored-by: root <root@zzjg-isa-ai-p800-klxnode04.zzjg.baidu.com>
flagos-ai#532)

* write to a tmp file and os.replace it, to avoid writing one module per process in PointwiseDynamicFunction; add tests for multiprocessing & multithreading
* update for other operators
* Update tests/test_pointwise_dynamic.py
* use os.replace to write the same contents to the same path concurrently
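The write-via-tmp-file-then-replace trick above can be sketched as follows. This is an illustrative stand-alone helper, not FlagGems' actual code path; the function name and cache layout are assumptions.

```python
import os
import tempfile


def write_module_atomically(path: str, source: str) -> None:
    """Write generated module source so that concurrent writers never
    leave a partially written file visible at `path`.

    Each writer dumps the full contents to a private temp file in the
    same directory, then os.replace() swaps it into place atomically.
    Writers racing to produce identical contents are therefore safe:
    readers always see either the old file or a complete new one.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".py.tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(source)
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a filesystem.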

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [Operator] register backward independently for tanh
* [Operator] register backward independently for gelu
* [Operator] implement threshold fwd and bwd, as bwd of relu at the same time
* [Operator] register sigmoid independently
* [Operator] register silu backward independently
* [Operator] register dropout backward independently
* [Operator] register embedding backward
* [Operator] register group_norm backward
* [Operator] register layer_norm backward
* [Test] test backward with torch.ops.aten functions
* [Operator] optimize group_norm_backward to allow larger input
* [Bugfix] wrong call of threshold_backward
* [Operator] register backward of softmax
* [Operator] register log_softmax backward
* [Operator] register batch_norm backward
* [Operator] register weightnorm_interface_backward
* [Operator] modify weight_norm
* [Bugfix] weight_norm test error
* [Bugfix] diagonal_backward
* [Bugfix] initialize cuda context properly and reduce test cases
* remove backward for inplace ops
* impl dropout on train=False and fix error in groupnorm
* [Operator] move ops weight_norm/instance_norm/outer/celoss into fused directory, which are registered as AutogradCUDA before
* reformat
* rename some variables for better understanding; use torch.nn's get_enum to convert reduction string to integer
* delete useless definition of REDUCTION
* misspell fix
* Update weight_norm.py for ci
* Update weight_norm.py
* fix redefinition of test_accuracy_polar
---------
Co-authored-by: Clement Chan <iclementine@outlook.com>
Co-authored-by: Bowen <81504862+Bowen12992@users.noreply.github.com>
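The pattern behind the commits above (pairing an op's forward with an independently registered backward) can be sketched with a plain `torch.autograd.Function`; FlagGems registers its Triton kernels through its own dispatch machinery, so this is only an illustration of the fwd/bwd pairing, using tanh (dy/dx = 1 - y**2).

```python
import torch


class TanhWithExplicitBackward(torch.autograd.Function):
    """Sketch: an explicit backward registered alongside the forward.

    forward computes y = tanh(x) and saves y; backward applies
    dy/dx = 1 - y**2 from the saved output instead of recomputing
    tanh(x), which is how tanh-style backwards avoid extra work.
    """

    @staticmethod
    def forward(ctx, x):
        y = torch.tanh(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out * (1.0 - y * y)
```

In the actual codebase the backward would be registered against the dispatcher (e.g. for `torch.ops.aten` lookups, as the tests above exercise) rather than called through `.apply`.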
flagos-ai#631)

* [bugfix] reorder the computation of weight_norm_backward to pass unit test

* [bugfix] allow grad of weight_norm to be nan

---------

Co-authored-by: i3wanna2 <2535184404@qq.com>
* [LIBENTRY] fix triton 3.3.x support

* [LIBENTRY] Fix tune and heur config when using Triton 3.3
* set environment variable for liboperators.so to find source of triton kernel code
* clean cmake files
* update doc and workflow for building c extensions
nianqi-tian and others added 30 commits July 18, 2025 16:23
…gos-ai#802)

* tmp disable

* format change

* tmp disable

* tmp disable

* tmp disable

* tmp skip
flagos-ai#804)

Adaptation for MThreads backend:
- update heuristics_config
- enable scatter op
- enable scatter_ op
- enable layernorm op
* add environment variable for libtuner cache

* rename env flag

* rename env flag

* rename .

* rename .
Define `get_torch_device_ctx` in runtime, replacing `torch_device_fn.device(device)` with it.

In utils/pointwise_dynamic, add compatibility code around `_DeviceGuard`.
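One plausible shape for such a device-context abstraction is sketched below. This is an assumption about the design, not FlagGems' actual implementation: the point is that CPU has no device to activate, so the runtime can hand back a no-op context and keep call sites uniform.

```python
import contextlib


def get_torch_device_ctx_sketch(vendor: str, device=None):
    """Illustrative sketch of a runtime-provided device context.

    On an accelerator backend this would defer to something like
    torch_device_fn.device(device); on CPU a no-op context lets
    callers write one `with` statement regardless of backend:

        with get_torch_device_ctx_sketch(vendor, dev):
            launch_kernel(...)
    """
    if vendor == "cpu":
        # No device to set on CPU; nullcontext makes this a no-op.
        return contextlib.nullcontext()
    # Real backends would return the vendor's device guard here.
    raise NotImplementedError(f"no sketch for vendor {vendor!r}")
```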
Currently the implementation of `where` is unable to run on CPU.
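For reference, the semantics a CPU-capable `where` would have to match are just elementwise selection. The pure-Python sketch below illustrates those semantics only; the operator itself is a Triton kernel.

```python
def where_reference(cond, x, y):
    """Elementwise select: x[i] where cond[i] is truthy, else y[i].

    Reference semantics only (flat sequences, no broadcasting); the
    real operator must also handle dtype promotion and broadcasting.
    """
    return [a if c else b for c, a, b in zip(cond, x, y)]
```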