Implement fallback to smaller vector size for swizzle_dyn #433

cvijdea-bd wants to merge 1 commit into rust-lang:master
Conversation
Force-pushed from 169c541 to 041e43b
workingjubilee left a comment
Hello, sorry for taking so long to get around to this. I like the idea but I would prefer it was written to avoid unnecessary allowances of lints.
```diff
     #[inline]
     pub fn swizzle_dyn(self, idxs: Simd<u8, N>) -> Self {
-        #![allow(unused_imports, unused_unsafe)]
+        #![allow(unused_imports, unused_unsafe, unreachable_patterns)]
```
```diff
                 all(target_arch = "arm", target_feature = "v7")
             ),
             target_feature = "neon",
             target_endian = "little"
```
```rust
            _ => dispatch_compat(self, idxs),
            _ => swizzle_dyn_scalar(self, idxs),
```
workingjubilee: I do not like this new structure in the match. Some allowances are required to keep this code legible, but this one requires allowing a lint that should not be allowed. It can be avoided by making the first `_` guarded by an `if cfg!(..)` so that `swizzle_dyn_scalar` only catches the true base case.
cvijdea-bd: I did consider that approach; however, unless I'm missing something, it would have to effectively duplicate all of the previous if-cfgs in negated form, which seems a lot uglier than just having a silently unreachable pattern in some build configurations. I don't have an issue with changing it if you prefer it that way.
```diff
     #![allow(
         dead_code,
         unused_unsafe,
+        unreachable_patterns,
```
workingjubilee (on `unreachable_patterns`): should not need this
This PR adds a fallback implementation so that e.g. `u8x64::swizzle_dyn` can be reasonably efficient even when compiled with only 128-bit SSSE3.
A "downgraded" swizzle_dyn op on N lanes emits 4 swizzle_dyn ops on N/2 lanes. If the optimizer can deduce that index values are bounded to N/2 or less, then it will generally be more efficient.
For example, `u8x64::swizzle_dyn` will only emit 4 `pshufb` instructions on SSSE3, instead of 16 in the general case, if the optimizer can prove the index values are always < 16 (this is generally achieved by preceding the swizzle with a `0xf` mask).

Additionally, for non-power-of-two N values, this PR adds a fallback implementation which zero-extends to the next power-of-two size.
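As a caller-side illustration of that masking trick (hypothetical example, not code from the PR or its benchmarks):

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

// Hypothetical caller-side pattern: the `& 0x0f` mask bounds every index to
// the first 16 bytes, which is what lets a downgraded u8x64 swizzle collapse
// to a handful of 16-lane shuffles instead of the fully general case.
fn lookup_low_nibbles(table: Simd<u8, 64>, idxs: Simd<u8, 64>) -> Simd<u8, 64> {
    let bounded = idxs & Simd::splat(0x0f);
    table.swizzle_dyn(bounded)
}
```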
Benchmarks
Below are benchmark results for the following code, on 5 target-cpu levels:
- `x86-64` - baseline, no vectorized shuffles
- `x86-64-v2` - ssse3, adds `pshufb` on u8x16
- `x86-64-v3` - avx2, adds `vpshufb` on u8x32 (vpshufb is not a true extension of pshufb to 256-bit; instead it's more like 2 pshufb ops run in parallel, with 4-bit indices in the corresponding lane)
- `x86-64-v4` - avx512, adds `vpshufb` on u8x64 (again, really just 4x pshufb in parallel)
- `icelake-server` - avx512vbmi, adds `vpermb` (this is a true 256/512-bit shuffle with 5/6-bit indices)

N.B. the code used to benchmark includes #431, and removes the `src/masks/bitmask.rs` avx512f mask implementation to work around the problem discussed here.

The main thing to look for is the performance of `swizzle_dyn_64*` on non-vbmi targets (v2, v3, v4), and the performance of `swizzle_dyn_32*` on x86-64-v2. With the previous implementation, sizes without native vector instructions fall back to the scalar implementation, while in the PR version they fall back to a lower vector size, for roughly 4-10x the performance of the scalar version.

Benchmark data for old version
- `-Ctarget-cpu=x86-64`
- `-Ctarget-cpu=x86-64-v2`
- `-Ctarget-cpu=x86-64-v3`
- `-Ctarget-cpu=x86-64-v4`
- `-Ctarget-cpu=icelake-server` (avx512vbmi)

Benchmark data for new version
- `-Ctarget-cpu=x86-64`
- `-Ctarget-cpu=x86-64-v2`
- `-Ctarget-cpu=x86-64-v3`
- `-Ctarget-cpu=x86-64-v4`
- `-Ctarget-cpu=icelake-server` (avx512vbmi)