
Adapt to FlagGems 3.0 #7

Open
ArCyanic wants to merge 186 commits into xlinsist:cpu-dev from ArCyanic:cpu-dev

Conversation


@ArCyanic ArCyanic commented Aug 1, 2025

Here's a summary of tests:

| Test | Success | Failure | Skipped | Note |
| --- | --- | --- | --- | --- |
| attention_ops | — | — | — | Could not finish; the process was repeatedly killed by the system |
| binary_pointwise_ops | 948 | 53 | 181 | All failures are precision shortfalls |
| blas_ops | 161 | 0 | 0 | |
| distribution_ops | 9 | 0 | 0 | |
| general_reduction_ops | 56 | 70 | 6 | Tests run very long; all failures are compilation errors |
| libentry | 5 | 0 | 1 | ✔; the skipped test appears to require an accelerator card |
| norm_ops | 58 | 0 | 2 | ✔; tests run very long |
| pointwise_dynamic | 65 | 0 | 14 | ✔; none of the skips are CPU-related |
| pointwise_dynamic_type_promotion | 22 | 0 | 3 | `where` operator |
| quant | 0 | 0 | 12 | Requires CUDA |
| reduction_ops | 158 | 12 | 1 | Tests run very long (over ten hours); hard to shrink the problem sizes |
| shape_utils | 17 | 0 | 0 | |
| special_ops | 2270 | 14 | 188 | Some require CUDA; some APIs are not yet adapted |
| tensor_constructor_ops | 339 | 0 | 0 | |
| tensor_wrapper | 4 | 0 | 0 | |
| unary_pointwise_ops | 232 | 4 | 5 | Compilation errors; insufficient precision |

For the full CPU porting document, please refer to CPUPorting.md in the project.

Attached is a screenshot of the last completed test run:
(screenshot)

jiangmf1992 and others added 30 commits May 1, 2025 20:52
Co-authored-by: suxiangM <maxiang992@128.com>
Co-authored-by: jinchengxiong <jinchengxiong@baidu.com>
* add [angle, dot, index_put, nan_to_num, polar] supported

* fix angle
* fix-speedup:rand

* fix-randn: change unroll to 8

* change blocksize 512 to 1024

* seed+1

* fix-speedup:all, normal
* fix codestyle

* update

* [MutiBackend] update MutiBackend Framework

* update

* update multibackend README
* [kunlunxin] fix any buffer_size_limit param

* fix all
* [METAX] modify metax backend debug message

* [METAX] improve index_select and repeat_interleave performance

* [METAX] add max_int accuracy test for metax

---------

Co-authored-by: mx-flaggems-user <m01080@metax-tech.com>
…rm_interface and upsample_bicubic2d_aa.

MTHREADS: Fix op vdot and fill_.

MTHREADS: Fix some ops.

MTHREADS: Fix a bug where ops under the _mthreads backend could not be recognized.

Mthreads: Skip two ops in the benchmark that are not supported, enable op all.
MTHREADS: Add addmm kernel for _mthreads backend, and fix a bug of mm kernel.

MTHREADS: Add bmm kernel for _mthreads backend.
* [hygon] fix accuracy error for trunc div

* [hygon] fix isclose accuracy error

---------

Co-authored-by: suxiangM <maxiang992@128.com>
* [Huawei] Ascend code for FlagGems (flagos-ai#608)

* Add files via upload

* Update __init__.py

* Update device.py

* Update commom_utils.py

* Update __init__.py

* Update gelu_and_mul.py

* Update angle.py

* Update div.py

* Update gelu.py

* Update isinf.py

* Update isnan.py

* Update nan_to_num.py

* Update pow.py

* Update tanh.py

* Update vector_norm.py

* Update performance_utils.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_unary_pointwise_perf.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Create __init__.py

* Update pointwise_dynamic.py

* Update test_blas_ops.py

* Update test_blas_ops.py

* Update test_general_reduction_ops.py

* Update test_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_blas_ops.py

* Update test_special_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_norm_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_blas_ops.py

* Update test_blas_ops.py

* Update test_special_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_binary_pointwise_ops.py

* Update test_reduction_ops.py

* Update test_special_ops.py

* Update test_unary_pointwise_ops.py

* Update pointwise_dynamic.py

* Update __init__.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_unary_pointwise_perf.py

* Update test_blas_perf.py

* Update test_binary_pointwise_perf.py

* Update test_reduction_perf.py

* Update test_binary_pointwise_ops.py

* Update test_blas_ops.py

* Update test_general_reduction_ops.py

* Update test_norm_ops.py

* Update test_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update test_general_reduction_ops.py

* Update test_general_reduction_ops.py

* Update test_unary_pointwise_ops.py

* Update test_binary_pointwise_ops.py

* Update __init__.py

* Update test_binary_pointwise_ops.py

* Update test_unary_pointwise_ops.py

* Update test_blas_perf.py

* Delete src/flag_gems/runtime/backend/_ascend/ops directory

* Update test_binary_pointwise_ops.py

* [BACKEND] Init ascend backend

---------

Co-authored-by: Jiang_wj <62932620+Sans1J@users.noreply.github.com>
* [Doc] update README with citation

* [no ci]update

* fix cpp doc
* [KUNLUN] Speed Up Full/Ones/Zeros

* [KUNLUN] Fix Ones/Zeros

---------

Co-authored-by: root <root@zzjg-isa-ai-p800-klxnode04.zzjg.baidu.com>
flagos-ai#532)

* write to a tmp file and os.replace it, to avoid writing one module per process in PointwiseDynamicFunction; add tests for multiprocessing & multithreading
* update for other operators
* Update tests/test_pointwise_dynamic.py
* use os.replace to write the same contents to the same path concurrently
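The write-via-tmp-file-then-replace trick above can be sketched as follows. This is an illustrative stand-alone helper, not FlagGems' actual code path; the function name and cache layout are assumptions.

```python
import os
import tempfile


def write_module_atomically(path: str, source: str) -> None:
    """Write generated module source so that concurrent writers never
    leave a partially written file visible at `path`.

    Each writer dumps the full contents to a private temp file in the
    same directory, then os.replace() swaps it into place atomically.
    Writers racing to produce identical contents are therefore safe:
    readers always see either the old file or a complete new one.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".py.tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(source)
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a filesystem.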

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [Operator] register backward independently for tanh
* [Operator] register backward independently for gelu
* [Operator] implement threshold fwd and bwd, as bwd of relu at the same time
* [Operator] register sigmoid independently
* [Operator] register silu backward independently
* [Operator] register dropout backward independently
* [Operator] register embedding backward
* [Operator] register group_norm backward
* [Operator] register layer_norm backward
* [Test] test backward with torch.ops.aten functions
* [Operator] optimize group_norm_backward to allow larger input
* [Bugfix] wrong call of threshold_backward
* [Operator] register backward of softmax
* [Operator] register log_softmax backward
* [Operator] register batch_norm backward
* [Operator] register weightnorm_interface_backward
* [Operator] modify weight_norm
* [Bugfix] weight_norm test error
* [Bugfix] diagonal_backward
* [Bugfix] initialize cuda context properly and reduce test cases
* remove backward for inplace ops
* impl dropout on train=False and fix error in groupnorm
* [Operator] move ops weight_norm/instance_norm/outer/celoss into fused directory, which are registered as AutogradCUDA before
* reformat
* rename some variables for better understanding; use torch.nn's get_enum to convert reduction string to integer
* delete useless definition of REDUCTION
* misspell fix
* Update weight_norm.py for ci
* Update weight_norm.py
* fix redefinition of test_accuracy_polar
---------
Co-authored-by: Clement Chan <iclementine@outlook.com>
Co-authored-by: Bowen <81504862+Bowen12992@users.noreply.github.com>
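The pattern behind the commits above (pairing an op's forward with an independently registered backward) can be sketched with a plain `torch.autograd.Function`; FlagGems registers its Triton kernels through its own dispatch machinery, so this is only an illustration of the fwd/bwd pairing, using tanh (dy/dx = 1 - y**2).

```python
import torch


class TanhWithExplicitBackward(torch.autograd.Function):
    """Sketch: an explicit backward registered alongside the forward.

    forward computes y = tanh(x) and saves y; backward applies
    dy/dx = 1 - y**2 from the saved output instead of recomputing
    tanh(x), which is how tanh-style backwards avoid extra work.
    """

    @staticmethod
    def forward(ctx, x):
        y = torch.tanh(x)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out * (1.0 - y * y)
```

In the actual codebase the backward would be registered against the dispatcher (e.g. for `torch.ops.aten` lookups, as the tests above exercise) rather than called through `.apply`.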
flagos-ai#631)

* [bugfix] reorder the computation of weight_norm_backward to pass unit test

* [bugfix] allow grad of weight_norm to be nan

---------

Co-authored-by: i3wanna2 <2535184404@qq.com>
* [LIBENTRY] fix triton 3.3.x support

* [LIBENTRY] Fix tune and heur config when using Triton 3.3
* set environment variable for liboperators.so to find source of triton kernel code
* clean cmake files
* update doc and workflow for building c extensions
nianqi-tian and others added 30 commits July 18, 2025 16:23
…gos-ai#802)

* tmp disable

* format change

* tmp disable

* tmp disable

* tmp disable

* tmp skip
flagos-ai#804)

Adaptation for MThreads backend:
- update heuristics_config
- enable scatter op
- enable scatter_ op
- enable layernorm op
* add environment variable for libtuner cache

* rename env flag

* rename env flag

* rename .

* rename .
Define `get_torch_device_ctx` in runtime, replacing `torch_device_fn.device(device)` with it.

In utils/pointwise_dynamic, add compatibility code around `_DeviceGuard`.
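One plausible shape for such a device-context abstraction is sketched below. This is an assumption about the design, not FlagGems' actual implementation: the point is that CPU has no device to activate, so the runtime can hand back a no-op context and keep call sites uniform.

```python
import contextlib


def get_torch_device_ctx_sketch(vendor: str, device=None):
    """Illustrative sketch of a runtime-provided device context.

    On an accelerator backend this would defer to something like
    torch_device_fn.device(device); on CPU a no-op context lets
    callers write one `with` statement regardless of backend:

        with get_torch_device_ctx_sketch(vendor, dev):
            launch_kernel(...)
    """
    if vendor == "cpu":
        # No device to set on CPU; nullcontext makes this a no-op.
        return contextlib.nullcontext()
    # Real backends would return the vendor's device guard here.
    raise NotImplementedError(f"no sketch for vendor {vendor!r}")
```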
Currently the implementation of `where` is unable to run on CPU.
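For reference, the semantics a CPU-capable `where` would have to match are just elementwise selection. The pure-Python sketch below illustrates those semantics only; the operator itself is a Triton kernel.

```python
def where_reference(cond, x, y):
    """Elementwise select: x[i] where cond[i] is truthy, else y[i].

    Reference semantics only (flat sequences, no broadcasting); the
    real operator must also handle dtype promotion and broadcasting.
    """
    return [a if c else b for c, a, b in zip(cond, x, y)]
```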