vectorize GetBestLengths2() hotspot loop. #146
Conversation
Sorry for the lack of reply – I tried to optimize this loop a couple of years ago (on Sandy Bridge, IIRC) but didn't manage to achieve a speedup using AVX. Things might be different on modern hardware, though, so this might still work. I'll try to see if I can reproduce the speedup reported here in the coming days.
A couple of findings here:
As-is, I don't think this is helpful, but you might be able to come up with something better by iterating on it a bit more (e.g. experimenting with the latest clang/gcc, loop unrolling, and reducing pipeline stalls caused by data dependencies). Adding comments to explain what you're doing is also important when dealing with handwritten assembly.
I did the test with gcc 13; let me check whether gcc 14 can auto-vectorize the code. If it can, maybe we don't need to do it manually, though a manual approach is still preferable for old compilers and possible regressions in new ones. The test cases I used are a gcc 15 source tarball and a WSL rootfs tarball, so yes, the data may tend to be more compressible. I use these as test cases because:
Also, I wonder what your CPU model is. I can try testing on more platforms and compilers, then come back later.
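For checking auto-vectorization, one option is to compile a comparable loop with the compiler's vectorization reports enabled. A minimal sketch (the function is a hypothetical stand-in, not the real hotspot; the flags are standard GCC/clang options):

```c
/* Build with vectorization reports, e.g.:
     gcc -O3 -mavx2 -fopt-info-vec-optimized -c vec_check.c
   or with clang:
     clang -O3 -mavx2 -Rpass=loop-vectorize -c vec_check.c
   A simple dependency-free loop like this should be reported as
   vectorized; if the real loop isn't, the report usually says why. */
#include <stdint.h>
#include <stddef.h>

void add_costs(uint32_t *dst, const uint32_t *price, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + price[i];
}
```

With `-fopt-info-vec-missed` (GCC) or `-Rpass-missed=loop-vectorize` (clang) the compiler also explains which dependency or control flow blocked vectorization, which is useful when deciding whether a manual version is worth keeping.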
Up to ~30% gain with AVX, more than 10% gain with SSE (level 9).
Tested on an AMD Ryzen 9 9950X.
Larger files and higher compression levels show greater speed improvement.
At level 9 the hotspot is here; at level 4 the hotspot is LzFind.
The difference between AVX2 and AVX is not observable; the split doesn't make the dependency chain longer. The same holds for SSE4.1 vs SSE2.
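If the SSE and AVX paths are kept side by side, a common pattern is to select the widest supported one once at startup via GCC/clang's `__builtin_cpu_supports`. A sketch under that assumption (the function names are hypothetical and the scalar body stands in for the real variants; how this PR actually dispatches may differ):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*BestLenFn)(const uint32_t *cost, size_t n);

/* Portable fallback; the SSE2/AVX2 variants would share this signature. */
static uint32_t best_len_scalar(const uint32_t *cost, size_t n)
{
    uint32_t best = 0;
    for (size_t i = 0; i < n; i++)
        if (cost[i] > best)
            best = cost[i];
    return best;
}

static BestLenFn select_impl(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return best_len_scalar;   /* would return the AVX2 variant */
    if (__builtin_cpu_supports("sse2"))
        return best_len_scalar;   /* would return the SSE2 variant */
#endif
    return best_len_scalar;       /* non-x86 or unknown CPU */
}
```

Dispatching once through a function pointer keeps the per-call overhead negligible compared with re-checking CPU features inside the hot loop.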