Add xsimd::get<>() for optimized compile-time element extraction #1294
0b6d85f to c6dd311
Nice, thanks for fixing CI! This is ready for review. Once it is approved I will rewrite the history; I don't want to trigger a useless CI run before then.
void check_get_all(batch_type const& res, std::index_sequence<Is...>) const
{
    int dummy[] = { (check_get_element<Is>(res), 0)... };
    (void)dummy;
You could check that loading the generated array ends up being equal to res, right?
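The round-trip check suggested here can be sketched as follows. This is a minimal illustration, not the PR's test code: `toy_batch`, `toy_get`, `roundtrip_ok`, and `check_roundtrip` are hypothetical stand-ins written for this example (a plain `std::array` wrapper instead of a real SIMD register), so the snippet compiles without xsimd.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a SIMD batch: a fixed-size array of lanes.
template <class T, std::size_t N>
struct toy_batch
{
    std::array<T, N> lanes;

    // Stand-in for loading a batch from memory.
    static toy_batch load(std::array<T, N> const& src) { return toy_batch { src }; }
};

// Hypothetical compile-time getter, playing the role of get<I>(batch).
template <std::size_t I, class T, std::size_t N>
T toy_get(toy_batch<T, N> const& b)
{
    static_assert(I < N, "lane index out of range");
    return b.lanes[I];
}

// Extract every lane at compile-time indices 0..N-1, reload the resulting
// array as a batch, and compare it with the original.
template <class T, std::size_t N, std::size_t... Is>
bool roundtrip_impl(toy_batch<T, N> const& res, std::index_sequence<Is...>)
{
    std::array<T, N> extracted = { toy_get<Is>(res)... };
    toy_batch<T, N> reloaded = toy_batch<T, N>::load(extracted);
    return reloaded.lanes == res.lanes;
}

template <class T, std::size_t N>
bool roundtrip_ok(toy_batch<T, N> const& res)
{
    return roundtrip_impl(res, std::make_index_sequence<N> {});
}

// Convenience wrapper a test body could call.
template <class T, std::size_t N>
void check_roundtrip(toy_batch<T, N> const& res)
{
    assert(roundtrip_ok(res));
}
```

Because the pack expansion instantiates the getter once per index, a single comparison covers every lane, including the non-zero indices.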
serge-sans-paille left a comment
Please fix the tests so that we have decent confidence in the getter when index != 0.
Yes, I will! I also noticed some small changes I should make. I just have not had time to get to this yet.
5a371e7 to fd8c743
Introduces get<I>(batch) as a top-level API for extracting a single lane
at a compile-time index. Falls back to the runtime get() when per-arch
overloads aren't present.
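One way such a fallback can be arranged is overload resolution on architecture tags. The sketch below is only an illustration of that general technique under stated assumptions: `common_arch`, `toy_sse2`, `toy_neon`, `kernel_get`, `runtime_get`, and `toy_get` are hypothetical names invented for this example, and the "specialised" overload simply indexes an array rather than using an intrinsic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical architecture tags; a derived tag refines the generic one.
struct common_arch {};
struct toy_sse2 : common_arch {};
struct toy_neon : common_arch {}; // deliberately has no specialised overload

// Hypothetical stand-in for a SIMD batch.
template <class T, std::size_t N>
struct toy_batch { std::array<T, N> lanes; };

// Runtime-indexed getter used by the generic fallback.
template <class T, std::size_t N>
T runtime_get(toy_batch<T, N> const& b, std::size_t i)
{
    assert(i < N);
    return b.lanes[i];
}

// Records which overload was selected, purely so the example is observable.
enum class path { fallback, specialised };

template <class T>
struct result { T value; path taken; };

// Generic fallback: forwards the compile-time index to the runtime getter.
template <std::size_t I, class T, std::size_t N>
result<T> kernel_get(toy_batch<T, N> const& b, common_arch)
{
    return { runtime_get(b, I), path::fallback };
}

// Specialised overload: preferred for toy_sse2 because the tag matches exactly,
// while the fallback would need a derived-to-base conversion.
template <std::size_t I, class T, std::size_t N>
result<T> kernel_get(toy_batch<T, N> const& b, toy_sse2)
{
    static_assert(I < N, "lane index out of range");
    return { b.lanes[I], path::specialised };
}

// Top-level entry point: dispatches on the architecture tag.
template <std::size_t I, class Arch, class T, std::size_t N>
result<T> toy_get(toy_batch<T, N> const& b)
{
    return kernel_get<I>(b, Arch {});
}
```

In this sketch, `toy_get<2, toy_sse2>(b)` selects the specialised overload, while `toy_get<2, toy_neon>(b)` silently lands on the generic one, so adding a new per-arch overload never requires touching the entry point.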
Per-arch optimal lowerings:
- SSE2: pextrw / byte-shift+movd / swizzle+first by lane width.
- SSE4.1: pextrb/w/d/q; I==0 short-circuits to first().
- AVX: I==0 short-circuits to first(); else halve + SSE4.1 path.
- AVX-512F: I==0 short-circuits to first(); 32/64-bit lanes use
valignd/valignq + first() (2 ops); 8/16-bit halve through AVX.
- NEON / NEON64 / RVV: native single-lane extract intrinsics.
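The "I==0 short-circuits to first(); else halve" pattern from the list above can be sketched with plain arrays. This is a hedged illustration of the index arithmetic only: `toy_batch`, `toy_half`, `toy_first`, `toy_get_narrow`, `toy_get_wide`, and `toy_get_runtime` are hypothetical helpers written for this example, and no real intrinsics are involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a SIMD batch.
template <class T, std::size_t N>
struct toy_batch { std::array<T, N> lanes; };

// Stand-in for extracting the lower (Half == 0) or upper (Half == 1) half
// of a wide register.
template <std::size_t Half, class T, std::size_t N>
toy_batch<T, N / 2> toy_half(toy_batch<T, N> const& b)
{
    static_assert(Half < 2, "a batch has exactly two halves");
    static_assert(N % 2 == 0, "lane count must be even");
    toy_batch<T, N / 2> out {};
    for (std::size_t i = 0; i < N / 2; ++i)
        out.lanes[i] = b.lanes[Half * (N / 2) + i];
    return out;
}

// Stand-in for reading lane 0 directly.
template <class T, std::size_t N>
T toy_first(toy_batch<T, N> const& b) { return b.lanes[0]; }

// Stand-in for the narrower architecture's compile-time getter.
template <std::size_t I, class T, std::size_t N>
T toy_get_narrow(toy_batch<T, N> const& b)
{
    static_assert(I < N, "lane index out of range");
    return b.lanes[I];
}

// I == 0: short-circuit to the first lane.
template <std::size_t I, class T, std::size_t N>
T toy_get_wide_impl(toy_batch<T, N> const& b, std::true_type)
{
    return toy_first(b);
}

// I != 0: pick the half containing lane I, then index within that half.
template <std::size_t I, class T, std::size_t N>
T toy_get_wide_impl(toy_batch<T, N> const& b, std::false_type)
{
    constexpr std::size_t half_size = N / 2;
    return toy_get_narrow<I % half_size>(toy_half<I / half_size>(b));
}

template <std::size_t I, class T, std::size_t N>
T toy_get_wide(toy_batch<T, N> const& b)
{
    static_assert(I < N, "lane index out of range");
    return toy_get_wide_impl<I>(b, std::integral_constant<bool, I == 0> {});
}

// Runtime-indexed reference getter, handy for cross-checking the result.
template <class T, std::size_t N>
T toy_get_runtime(toy_batch<T, N> const& b, std::size_t i)
{
    assert(i < N);
    return b.lanes[i];
}
```

Both the half selector `I / half_size` and the inner index `I % half_size` are computed at compile time, so each instantiation reduces to one half extraction followed by one narrow extract.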
fd8c743 to f30c5e0
I like how it is now. I tried to minimize new code by re-using existing APIs. The tests check all values.
template <size_t... Is>
void test_get_impl(batch_type const& res, std::index_sequence<Is...>) const
{
    array_type extracted = { xsimd::get<Is>(res)... };
Exactly what I had in mind, thanks!
@serge-sans-paille @DiamonDinoia this PR was merged without being properly up to date with
Add a free function xsimd::get<I>(batch) API, mirroring std::get<I>(tuple), for fast compile-time element extraction from SIMD batches.
Per-architecture optimized kernel::get overloads use the fastest available intrinsics; the per-arch lowerings are listed in the commit message above.
Also fixes a latent bug in the common fallback for complex batch compile-time get (wrong buffer type).