Proposal load/store masked#1162
Conversation
1124f01 to
53da643
Compare
|
interesting. Before going into the details, what are the memory effects of a masked load in terms of read memory and value stored? Are the masked elements set to 0 ? to an undefined value ? AVX etc seems to set the value to zero? I'm not sold on the common implementation which looks quite heavy in scalar operation. I can see that we can't do a plain load followed by an and because it could lead to access to unallocated memory. If the mask were constant, we could optimize statically some common patterns, but with a dynamic mask as you propose... |
|
Some thoughts:
In general, I use these operations when I want to vectorize and inner loop that is not a multiple of the simd width. This is a small inner loop nested in a loop executed a lot of times. Depending on the operations, padding sometimes is slower than masking. |
2ae15bf to
510c335
Compare
|
This is starting to take shape now. I still need to clean up. But this is a first working implementation. I extended the mask operation in batch_bool to mimic std::bit so that i could reuse the code for all masked operations. I feel having those living outside the class in detail:: is not the best. They are also useful to the user I think. |
fd0e777 to
ecf4291
Compare
|
@serge-sans-paille do you have any clue on why mingw32 fails? I need to do a final pass make the code coherent (I am mixing reinterpret_casts and c-style casts). Affter that I think will ready for review. |
ecf4291 to
e88369f
Compare
| } | ||
| else | ||
| { | ||
| // Fallback to runtime path for non prefix/suffix masks |
There was a problem hiding this comment.
Unless the compiler optimizes those out, the dynamic load is going to perform serveral checks that are going to be always false (none / all).
I'm still not convinced by the relevancy of dynamic mask support. Maybe we could start small with "just" the static mask support and eventually add a dynamic version later if it appears to be needed? But maybe you have specific case in mind ?
There was a problem hiding this comment.
It is just that in my mind I can offer a public runtime mask API
then the compile time mask optimizes as much as it can and if there is no alternative it calls the runtime mask.
Same as the swizzle. So I was not planning of optimizing the runtime at all we just get a maskl_load/store where possible or a common since sse does not support masking.
PS:
This can be optimized in the future with using avx on sse or avx512vl but I will open a discussion once this is merged
There was a problem hiding this comment.
I am cleaning this to only offer compile time for now and the runtime can be a separate PR
e535375 to
4c457ae
Compare
|
I don't think the error is due to anything related to this PR. But, I like the looks of it now. Ready to review! |
4c457ae to
ec2b824
Compare
|
@DiamonDinoia we plan to do a release in the forthcoming weeks. Do you want this PR to be part of it? |
Only if it does not delay the release by much. I would like to have a release by mid-November. This gives me two weeks to update my downstream projects. For an end of the year release. PS: I am planning on addressing the swizzle issue after this PR. |
That is more or less what we planned ;) |
serge-sans-paille
left a comment
There was a problem hiding this comment.
A new wave of comments. I have the feeling that we could simplify the implementation a lot by making it more generic.
| return load<A>(mem, Mode {}); | ||
| } | ||
| // confined to lower 128-bit half (4 lanes) → forward to SSE | ||
| else XSIMD_IF_CONSTEXPR(mask.countl_zero() >= 4) |
There was a problem hiding this comment.
Thinking of it, we could remove the need for countl_zero by just masking, something like
if((mask.mask() & 0xF0) == 0)
which would be a decent implementation for
if(mask.lower_half() == 0) // or lower_mask ?
which has the nice property of not encoding any "magical value"
There was a problem hiding this comment.
The reason why I went for countl_zero is to maintain parity with std::bit. I feel like this is public API so it is worth for users not having to learn a new naming scheme
| else XSIMD_IF_CONSTEXPR(mask.countr_zero() >= 2) | ||
| { | ||
| constexpr auto mhi = ::xsimd::detail::upper_half<sse2>(mask); | ||
| const batch<double, sse2> hi(_mm256_extractf128_pd(src, 1)); |
There was a problem hiding this comment.
that's a upper_half, right?
| else XSIMD_IF_CONSTEXPR(mask.countl_zero() >= 2) | ||
| { | ||
| constexpr auto mlo = ::xsimd::detail::lower_half<sse2>(mask); | ||
| const batch<double, sse2> lo(_mm256_castpd256_pd128(src)); |
There was a problem hiding this comment.
that's a lower_half, right?
| { | ||
| _mm256_maskstore_pd(mem, mask.as_batch(), src); | ||
| } | ||
| } |
There was a problem hiding this comment.
epiphany : if you express everything in terms of lower / upper half, you're pretty close to a type-generic implementation that would avoid a lot of redunduncy. just need to wrap the actual _mm256_maskstore* and we're good, right?
There was a problem hiding this comment.
I guess you are right. It might be possible to express things in terms of size and halves. This might be recursive though and we might need to arch to half arch, similarly to what happens in widen.
There was a problem hiding this comment.
I'll also have to think if the size of the element matters. I'll have a look next week
There was a problem hiding this comment.
The reason why I kept the implementation separate is as follows:
The general implementation in terms of upper/lower ais possible. This haves the code in avx, avx2 still has to provide a new implementation so that it can take advantage of the new intrinsics so we for from 2 versions in avx down to 1 and 2 in avx2 down to 1. Same for avx512.
But then for merging we can not use more optimized intrinsics than _mm256_castps128_ps256. Is it worth it?
The solution here is to pass the MaskLoad/Store and the merge function as templates to this general implementation. Then we could have 1 general implementation maybe for all the kernels. But, I think it might be too much templating.
In general also wrapping _mm256_maskstore* also makes a clean way to expose this to the used if we wish to do so.
What do you suggest?
| return _mm512_loadu_pd(mem); | ||
| } | ||
|
|
||
| // load_masked |
There was a problem hiding this comment.
why did you split the load_masked in two parts of the header? The generic epiphany for avx seems to apply here too :-)
| { | ||
| constexpr auto mlo = ::xsimd::detail::lower_half<sse4_2>(mask); | ||
| const auto lo = load_masked(mem, mlo, convert<float> {}, Mode {}, sse4_2 {}); | ||
| return batch<float, A>(detail::merge_sse(lo, batch<float, sse4_2>(0.f))); |
There was a problem hiding this comment.
could be a call to _mm256_castps128_ps256 instead of merge + batch of zero
| { | ||
| constexpr auto mhi = ::xsimd::detail::upper_half<sse4_2>(mask); | ||
| const auto hi = load_masked(mem + 4, mhi, convert<float> {}, Mode {}, sse4_2 {}); | ||
| return batch<float, A>(detail::merge_sse(batch<float, sse4_2>(0.f), hi)); |
There was a problem hiding this comment.
I don't know if we could use a similar trick to _mm256_castps128_ps256 here
|
@AntoinePrv once you update that branch we can do a release \o/ |
I will work on this asap. I just have something to deliver on another project this week so it might take a bit for me to get back to this. |
|
Note that once you're done with Intel support, i still have to implement ARM et cie when possible |
3f3acf5 to
8104455
Compare
856d98e to
9185bb2
Compare
|
I removed most of the duplication at least I stopped when I thought things were getting out of hand. I think in the future we might offer direct access to masked load with dynamic mask. Which is basically exposing the internal helper functions but adapting them to the xsimd convention. |
1. Adds new masked API compile time masks (store_masked and load_masked) 2. General use case optimization 3. New tests 4. x86 kernels 5. Adds new APIs to batch_bool_constant for convenience resembling #include<bit> 6. Tests the new APIs
9185bb2 to
fb16c52
Compare
|
Merged as 239fbbd |
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE batch<int32_t, A> load_masked(int32_t const* mem, batch_bool_constant<int32_t, A, Values...> mask, convert<int32_t>, Mode, requires_arch<avx2>) noexcept | ||
| { | ||
| return load_masked<A>(mem, mask, convert<int32_t> {}, Mode {}, avx2 {}); | ||
| } |
There was a problem hiding this comment.
@DiamonDinoia I am having trouble with this function above. Is this an infinite loop?
There was a problem hiding this comment.
I might have meant calling uint -> int32 there.
There was a problem hiding this comment.
I am currently travelling. I have a look once I am back.
Dear @serge-sans-paille,
I made a rough implementation of masked load/store. Before I hash it out. Can I have some early feedback?
Thanks,
Marco