In the current multi-threaded implementation of the `Encoder::code_with_coding_vector` function (which kicks in when the `parallel` feature is enabled), we use rayon parallel iterators to go over each piece. But it is slower than the single-threaded SIMD implementation 😮💨.
We could instead spawn as many rayon threads as there are logical cores and distribute all the pieces among them. That should do better.
```rust
coded_data.copy_from_slice(
    &self
        .data
        .par_chunks_exact(self.piece_byte_len)
        .zip(coding_vector)
        // Scale each piece by its coding coefficient in GF(256).
        .map(|(piece, &random_symbol)| {
            #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
            {
                let mut scalar_x_piece = piece.to_vec();
                gf256_inplace_mul_vec_by_scalar(&mut scalar_x_piece, random_symbol);

                scalar_x_piece
            }

            #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
            {
                piece
                    .iter()
                    .map(move |&symbol| (Gf256::new(symbol) * Gf256::new(random_symbol)).get())
                    .collect::<Vec<u8>>()
            }
        })
        // Fold into per-thread partial sums; addition in GF(256) is XOR.
        .fold(
            || vec![0u8; self.piece_byte_len],
            |mut acc, cur| {
                #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
                gf256_inplace_add_vectors(&mut acc, &cur);

                #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
                acc.iter_mut().zip(cur).for_each(|(a, b)| {
                    *a ^= b;
                });

                acc
            },
        )
        // Merge the per-thread partials into the final coded piece.
        .reduce(
            || vec![0u8; self.piece_byte_len],
            |mut acc, cur| {
                #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
                gf256_inplace_add_vectors(&mut acc, &cur);

                #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
                acc.iter_mut().zip(cur).for_each(|(a, b)| {
                    *a ^= b;
                });

                acc
            },
        ),
);
```
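The proposed alternative could look roughly like the sketch below: split the `(piece, coefficient)` pairs into one chunk per worker, let each worker fold its chunk into a local partial sum, then XOR-merge the partials. This is only an illustration of the work distribution, not rlnc's actual code: it uses `std::thread::scope` instead of a rayon pool, and a hypothetical scalar `gf256_mul` (assuming the AES reducing polynomial `x^8 + x^4 + x^3 + x + 1`; rlnc's `Gf256` may well use a different polynomial).

```rust
use std::thread;

// Hypothetical scalar GF(256) multiply, standing in for rlnc's Gf256
// arithmetic (assumes the AES reducing polynomial 0x11B).
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            p ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1B;
        }
        b >>= 1;
    }
    p
}

// Split the pieces into one chunk per worker thread (n_threads >= 1);
// each worker folds its chunk into a local partial sum, and the partial
// sums are XOR-merged at the end.
fn code_with_coding_vector(
    data: &[u8],
    piece_byte_len: usize,
    coding_vector: &[u8],
    n_threads: usize,
) -> Vec<u8> {
    let pieces: Vec<(&[u8], u8)> = data
        .chunks_exact(piece_byte_len)
        .zip(coding_vector.iter().copied())
        .collect();
    let chunk_len = (pieces.len() + n_threads - 1) / n_threads;

    let partials: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = pieces
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = vec![0u8; piece_byte_len];
                    for &(piece, coeff) in chunk {
                        for (a, &sym) in acc.iter_mut().zip(piece) {
                            // addition in GF(256) is XOR
                            *a ^= gf256_mul(sym, coeff);
                        }
                    }
                    acc
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // XOR-merge the per-thread partial sums into the final coded piece.
    partials
        .into_iter()
        .fold(vec![0u8; piece_byte_len], |mut acc, cur| {
            for (a, b) in acc.iter_mut().zip(cur) {
                *a ^= b;
            }
            acc
        })
}
```

With coarse chunks like this, each worker touches a contiguous run of pieces and allocates exactly one accumulator, instead of one task (and one temporary `Vec`) per piece as in the parallel-iterator version.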
But is it actually faster? Let's explore that.
(Current implementation: `rlnc/src/full/encoder.rs`, lines 175 to 222 at commit `061ec3f`.)
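One way to explore it is a minimal timing harness (hypothetical, not part of rlnc); `encode` below stands in for whichever implementation is under test:

```rust
use std::time::{Duration, Instant};

// Minimal micro-benchmark: run `encode` once to warm up, then time
// `iters` iterations and return the mean per-iteration duration.
fn bench<F: FnMut() -> Vec<u8>>(label: &str, iters: u32, mut encode: F) -> Duration {
    let _ = encode(); // warm-up: prime allocator and caches
    let start = Instant::now();
    let mut out_len = 0;
    for _ in 0..iters {
        out_len = encode().len();
    }
    let per_iter = start.elapsed() / iters;
    println!("{label}: {per_iter:?} / iter ({out_len} bytes out)");
    per_iter
}
```

For publishable numbers, a proper harness such as criterion (with outlier detection and statistical comparison) would be preferable to a hand-rolled loop like this; the sketch only shows the shape of the comparison.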