
vectorize GetBestLengths2() hotspot loop. #146

Open
yumeyao wants to merge 1 commit into fhanau:master from yumeyao:simd

Conversation

@yumeyao

@yumeyao yumeyao commented Sep 23, 2025

Up to ~30% gain with AVX, and more than 10% with SSE, at level 9.
Tested on an AMD 9950x.

Larger files and higher levels show greater speed improvement:
at level 9 the hotspot is this loop;
at level 4 the hotspot is LzFind.

The difference between AVX2 and AVX is not observable; the split doesn't make the dependency chain longer.
The same holds for SSE4.1 vs SSE2.

@fhanau
Owner

fhanau commented Oct 5, 2025

Sorry for the lack of reply – I tried to optimize this loop a couple of years ago (on Sandy Bridge IIRC), but didn't manage to achieve a speedup using AVX. Things might be different on modern hardware though, so this might still work. I'll try to see if I can reproduce the speedup reported here in the coming days.

@fhanau
Owner

fhanau commented Oct 19, 2025

A couple of findings here:

  • When testing using gcc-14 and running -9 on the ECT binary itself, I couldn't see a measurable speedup. Before the proposed change, we have 4.93s (default) and 4.46s (with AVX enabled); with the new code we have 4.57s (default) and 4.51s (AVX enabled). It seems the AVX code path is no faster than the auto-vectorized AVX code (which seems odd, I thought this loop wasn't auto-vectorized?), or at least doesn't enable any speedup. The SSE2 code path does seem to help when AVX is unavailable, but hardware without AVX is uncommon nowadays. I also tried adding only the SSE2 code path while compiling with AVX available; that didn't seem to help.
  • I assumed that vectorizing this may fare better on data that compresses better, where the loop will have more iterations on average and a better chance to shine (e.g. PNGs). Testing this using -7 on a large PNG file resulted in 18.27s/8.92s without the change and 12.92s/9.38s with the change. These numbers indicate that there is a speedup without AVX enabled, but a slowdown if AVX is available. We can see more strongly that the change helps for plain SSE2, but not if the compiler can use AVX (such as on modern hardware).
  • The code does not compile with AVX2 enabled, as the cmp_mask variable is missing in that code path. When adding __m256i cmp_mask = _mm256_cmpgt_epi32(_mm256_castps_si256(vcost), _mm256_castps_si256(x8));, the code appears to work correctly but isn't faster either (actually slower on the PNG file).

As-is, I don't think this is helpful, but you might be able to come up with something better when iterating on it a bit more (e.g. experiment a bit with compiling with latest clang/gcc, loop unrolling, trying to reduce pipeline stalls due to data dependencies). Adding comments to explain what you're doing is also important when dealing with handwritten assembly.

@yumeyao
Author

yumeyao commented Oct 23, 2025

I did the test with gcc 13; let me check whether gcc 14 can auto-vectorize the code. If so, maybe we don't need to do it manually, though a manual approach is still preferable for old compilers and in case of regressions in new compilers.

The test cases I used are a gcc 15 source tarball and a WSL rootfs tarball, so yes, the data may tend to be more compressible. I use these as test cases because:

  1. These two scenarios, alongside PNG, are the scenarios where legacy deflate/gzip is a must or a better-to-provide option.
  2. They are large, so they reflect speed changes more clearly. (That is also why I don't know what the best test data for PNG is.)

Also, I wonder what your CPU model is.

I can test on more platforms and compilers, then come back later.

