@hazzlim (Contributor) commented Jan 16, 2026

This PR adds a vectorized implementation of is_sorted_until using Neon intrinsics 🚀
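For context, here is a minimal scalar sketch of what `is_sorted_until` computes, which is the behavior any vectorized version has to reproduce. The helper name `sorted_until_index` is hypothetical and does not come from this PR; it returns an index rather than an iterator to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical scalar reference (not the PR's code): returns the index of the
// first element that is less than its predecessor, or n if no such element
// exists (that is, the whole range is sorted).
template <class T>
std::size_t sorted_until_index(const T* data, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        if (data[i] < data[i - 1]) {
            return i; // data[0..i) is sorted, data[i] breaks the order
        }
    }
    return n;
}
```

For an `int` array `{1, 2, 3, 2, 5}` this returns 3, matching `std::is_sorted_until(a, a + 5) - a`.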

Performance numbers (speedup relative to the existing code, which is not manually vectorized; higher is better):

| Benchmark | MSVC Speedup | Clang Speedup |
| --- | --- | --- |
| bm_is_sorted_until<std::int8_t, AlgType::Std>/3000/1800 | 10 | 12.895 |
| bm_is_sorted_until<std::int8_t, AlgType::Rng>/3000/1800 | 11.795 | 10.204 |
| bm_is_sorted_until<std::int16_t, AlgType::Std>/3000/1800 | 5.674 | 5.6 |
| bm_is_sorted_until<std::int16_t, AlgType::Rng>/3000/1800 | 6.551 | 5.442 |
| bm_is_sorted_until<std::int32_t, AlgType::Std>/3000/1800 | 3.039 | 2.908 |
| bm_is_sorted_until<std::int32_t, AlgType::Rng>/3000/1800 | 3.566 | 2.908 |
| bm_is_sorted_until<std::int64_t, AlgType::Std>/3000/1800 | 1.549 | 1.507 |
| bm_is_sorted_until<std::int64_t, AlgType::Rng>/3000/1800 | 1.899 | 1.581 |
| bm_is_sorted_until<std::uint8_t, AlgType::Std>/3000/1800 | 9.673 | 12.436 |
| bm_is_sorted_until<std::uint8_t, AlgType::Rng>/3000/1800 | 11.5 | 10.459 |
| bm_is_sorted_until<std::uint16_t, AlgType::Std>/3000/1800 | 5.463 | 6.389 |
| bm_is_sorted_until<std::uint16_t, AlgType::Rng>/3000/1800 | 6.389 | 6.944 |
| bm_is_sorted_until<std::uint32_t, AlgType::Std>/3000/1800 | 3.017 | 3.172 |
| bm_is_sorted_until<std::uint32_t, AlgType::Rng>/3000/1800 | 3.636 | 3.178 |
| bm_is_sorted_until<std::uint64_t, AlgType::Std>/3000/1800 | 1.549 | 1.739 |
| bm_is_sorted_until<std::uint64_t, AlgType::Rng>/3000/1800 | 1.818 | 1.581 |
| bm_is_sorted_until<float, AlgType::Std>/3000/1800 | 3.939 | 3.297 |
| bm_is_sorted_until<float, AlgType::Rng>/3000/1800 | 3.883 | 3.475 |
| bm_is_sorted_until<double, AlgType::Std>/3000/1800 | 2.026 | 1.663 |
| bm_is_sorted_until<double, AlgType::Rng>/3000/1800 | 2.016 | 1.7 |
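One plausible reading of these numbers (an observation, not a claim made in the PR) is that the speedup roughly tracks how many elements fit in one 128-bit Neon register. The sketch below only computes that lane count; `lanes_per_register` and `neon_register_bytes` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Neon Q registers are 128 bits (16 bytes) wide.
constexpr std::size_t neon_register_bytes = 16;

// Hypothetical helper: number of elements of type T processed per register.
template <class T>
constexpr std::size_t lanes_per_register() {
    return neon_register_bytes / sizeof(T);
}

static_assert(lanes_per_register<std::int8_t>() == 16, "16 one-byte lanes");
static_assert(lanes_per_register<std::int16_t>() == 8, "8 two-byte lanes");
static_assert(lanes_per_register<std::int32_t>() == 4, "4 four-byte lanes");
static_assert(lanes_per_register<std::int64_t>() == 2, "2 eight-byte lanes");
```

Under that reading, 8-bit elements have the most headroom (16 lanes) and 64-bit elements the least (2 lanes), which matches the ordering of the measured speedups.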

@hazzlim hazzlim requested a review from a team as a code owner January 16, 2026 13:23
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Jan 16, 2026
@StephanTLavavej StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels Jan 16, 2026
@StephanTLavavej StephanTLavavej self-assigned this Jan 16, 2026
if constexpr (_Traits::_Vectorized) {
    const size_t _Total_size_bytes = _Byte_length(_First, _Last);

    const auto _Cmp_gt_wrap = [](const auto _Right, const auto _Left) noexcept {
Member

No change requested: This parameter order does non-Newtonian things to my brain but I suppose it is consistent with the code below.

Contributor

At the instruction level, both ISAs have a GT mnemonic but no LT mnemonic.
So at the intrinsics level, lt is the odd one out, and SSE4.2/AVX2 don't even provide it (SSE2 does, though).

In C++, the default predicate is std::less.

We need to bridge the two somehow, ideally in a way that makes the bridging stand out.

By putting it into the least comfortable place we ensure it stands out.
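The bridging described above can be sketched in portable C++. This is an illustration under stated assumptions, not the PR's implementation: `cmp_gt` is a hypothetical stand-in for a hardware greater-than comparison, and `less_via_gt` and `breaks_order` are likewise made-up names.

```cpp
#include <functional>

// Hypothetical stand-in for a hardware "compare greater-than" primitive.
template <class T>
bool cmp_gt(const T& a, const T& b) {
    return a > b;
}

// std::less semantics built from the greater-than primitive by swapping the
// operands: left < right is the same as right > left.
template <class T>
bool less_via_gt(const T& left, const T& right) {
    return cmp_gt(right, left);
}

// is_sorted_until stops at the first element that is less than its
// predecessor, so the check per adjacent pair is less(next, prev),
// which is equivalent to cmp_gt(prev, next).
template <class T>
bool breaks_order(const T& prev, const T& next) {
    return less_via_gt(next, prev);
}
```

The operand swap is the only place where the two conventions meet, which is why it helps for that spot to be visually conspicuous in the code.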

Contributor

(See also Pearl River Necklace bridge)

@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jan 17, 2026
@StephanTLavavej StephanTLavavej removed their assignment Jan 17, 2026