
vectorize GetBestLengths2() hotspot loop. #146

Open
yumeyao wants to merge 1 commit into fhanau:master from yumeyao:simd

Conversation

@yumeyao

@yumeyao yumeyao commented Sep 23, 2025

Up to ~30% gain with AVX, and more than 10% with SSE, at level 9.
Tested on an AMD 9950x.

Larger files and higher levels show greater speed improvement:
at level 9 the hotspot is this loop;
at level 4 the hotspot is LzFind.

The difference between AVX2 and AVX is not observable; the split doesn't make the dependency chain longer.
The same holds for SSE4.1 vs SSE2.

@fhanau
Owner

fhanau commented Oct 5, 2025

Sorry for the lack of reply – I tried to optimize this loop a couple of years ago (on Sandy Bridge IIRC), but didn't manage to achieve a speedup using AVX. Things might be different on modern hardware though, so this might still work. I'll try to see if I can reproduce the speedup reported here in the coming days.

@fhanau
Owner

fhanau commented Oct 19, 2025

A couple of findings here:

  • When testing using gcc-14 and running -9 on the ECT binary itself, I couldn't see a measurable speedup. Before the proposed change, we have 4.93s (default) and 4.46s (with AVX enabled); with the new code we have 4.57s (default) and 4.51s (AVX enabled). It seems the AVX code path is no faster than the auto-vectorized AVX code (which seems odd, I thought this loop wasn't auto-vectorized?), or at least doesn't enable any speedup. The SSE2 code path does seem to help when AVX is unavailable, but hardware without AVX is uncommon nowadays. I also tried adding only the SSE2 code path while compiling with AVX available; that didn't seem to help.
  • I assumed that vectorizing this may fare better on data that compresses better, where the loop will have more iterations on average and a better chance to shine (e.g. PNGs). Testing this using -7 on a large PNG file resulted in 18.27s/8.92s without the change and 12.92s/9.38s with the change. These numbers indicate that there is a speedup without AVX enabled, but a slowdown if AVX is available. We can see more strongly that the change helps for plain SSE2, but not if the compiler can use AVX (such as on modern hardware).
  • The code does not compile with AVX2 enabled, as the cmp_mask variable is missing in that code path. When adding __m256i cmp_mask = _mm256_cmpgt_epi32(_mm256_castps_si256(vcost), _mm256_castps_si256(x8));, the code appears to work correctly but isn't faster either (actually slower on the PNG file).

As-is, I don't think this is helpful, but you might be able to come up with something better when iterating on it a bit more (e.g. experiment a bit with compiling with latest clang/gcc, loop unrolling, trying to reduce pipeline stalls due to data dependencies). Adding comments to explain what you're doing is also important when dealing with handwritten assembly.

@yumeyao
Author

yumeyao commented Oct 23, 2025

I did the test with gcc 13; let me check whether gcc 14 can auto-vectorize the code. If so, maybe we don't need to do it manually, though a manual approach is still preferable for old compilers and in case of regressions in new compilers.

The test cases I used are a gcc 15 source tarball and a WSL rootfs tarball, so yes, the data may tend to be more compressible. I use these as test cases because:

  1. These two scenarios, alongside PNG, are the scenarios where legacy deflate/gzip is a must or a better-to-provide option.
  2. They are large, so they reflect speed changes more clearly. (That is also why I don't know what the best test data for PNG is.)

Also, I wonder what your CPU model is.

I can test on more platforms and compilers, then come back later.

