Update qd8 gemm config to enable 4x16c2s2 microkernels for wasm simd #9844

Open
yolanda15 wants to merge 1 commit into google:master from yolanda15:qd8_update_simd

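This PR updates the qd8 GEMM config to enable the 4x16c2s2 microkernels for Wasm SIMD. It also fixes a file name typo from a previous PR.
On Coffee Lake, benchmark times drop by roughly 10% in the baseline d8 runs and by roughly 45% to 49% with revectorization enabled, for both the SIMD and Relaxed SIMD builds.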

Tested with d8:

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
Default, SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     175030 us       175035 us           24
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     323143 us       323148 us           13
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      219425 us       219430 us           19
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       75549 us        75554 us           56

Default, Relaxed SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     173748 us       173754 us           40
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     324491 us       324496 us           22
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      218274 us       218280 us           32
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       74628 us        74633 us           93

4x16c2s2, SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     157664 us       157670 us           27
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     288836 us       288841 us           14
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      196272 us       196277 us           21
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       68001 us        68005 us           62

4x16c2s2, Relaxed SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     155883 us       155889 us           45
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     287300 us       287305 us           24
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      195124 us       195128 us           36
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       67092 us        67096 us          104

Tested with revectorization (d8 --experimental-wasm-revectorize):

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
Default, SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     173545 us       173549 us           24
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     319792 us       319797 us           13
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      216803 us       216807 us           19
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       74495 us        74499 us           56

Default, Relaxed SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time     173748 us       173754 us           40
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     324491 us       324496 us           22
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      218274 us       218280 us           32
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       74628 us        74633 us           93

4x16c2s2, SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      90432 us        90437 us           46
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     176882 us       176887 us           24
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      113019 us       113024 us           37
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       37925 us        37928 us          110

4x16c2s2, Relaxed SIMD build:
QD8TransformerBlock/T:128/D:1536/N:6/H:256/F:12288/process_time/real_time      90973 us        90978 us           76
QD8TransformerBlock/T:128/D:2048/N:8/H:256/F:16384/process_time/real_time     177860 us       177865 us           39
QD8TransformerBlock/T:128/D:2304/N:8/H:256/F:9216/process_time/real_time      114225 us       114230 us           61
QD8TransformerBlock/T:128/D:1152/N:4/H:256/F:6912/process_time/real_time       38300 us        38305 us          183
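
For anyone who wants to check the relative gains, here is a minimal sketch that computes the speedup of 4x16c2s2 over the default config from the Time column above. It only includes the SIMD build rows; the variable and function names are illustrative and not part of this PR.

```python
# Wall-clock times in microseconds, copied from the "Time" column of the
# SIMD build rows above. Row order: D:1536, D:2048, D:2304, D:1152.
BASELINE_D8 = {
    "default": [175030, 323143, 219425, 75549],
    "4x16c2s2": [157664, 288836, 196272, 68001],
}
REVECTORIZED_D8 = {
    "default": [173545, 319792, 216803, 74495],
    "4x16c2s2": [90432, 176882, 113019, 37925],
}


def speedups(times):
    """Return default_time / new_time for each benchmark row."""
    return [old / new for old, new in zip(times["default"], times["4x16c2s2"])]


def reductions(times):
    """Return the fractional time reduction (0.10 means 10% less time)."""
    return [1.0 - new / old for old, new in zip(times["default"], times["4x16c2s2"])]


if __name__ == "__main__":
    for label, times in (
        ("d8", BASELINE_D8),
        ("d8 --experimental-wasm-revectorize", REVECTORIZED_D8),
    ):
        ratio_text = ", ".join(f"{r:.2f}x" for r in speedups(times))
        cut_text = ", ".join(f"{100 * c:.1f}%" for c in reductions(times))
        print(f"{label}: speedup {ratio_text}; time reduction {cut_text}")
```

The same helpers can be pointed at the Relaxed SIMD rows by swapping in those times.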
