
Optimize Multi-threaded implementation of encode and recode #31

@itzmeanjan

Description

In the current multi-threaded implementation of the Encoder::code_with_coding_vector function (which kicks in when the parallel feature is enabled), we use rayon parallel iterators to go over each piece. But it is slower than the single-threaded SIMD implementation 😮‍💨.

We could instead run only as many rayon threads as there are logical cores and distribute all the pieces among them. That should do better.

rlnc/src/full/encoder.rs

Lines 175 to 222 in 061ec3f

coded_data.copy_from_slice(
    &self
        .data
        .par_chunks_exact(self.piece_byte_len)
        .zip(coding_vector)
        .map(|(piece, &random_symbol)| {
            #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
            {
                let mut scalar_x_piece = piece.to_vec();
                gf256_inplace_mul_vec_by_scalar(&mut scalar_x_piece, random_symbol);
                scalar_x_piece
            }
            #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
            {
                piece
                    .iter()
                    .map(move |&symbol| (Gf256::new(symbol) * Gf256::new(random_symbol)).get())
                    .collect::<Vec<u8>>()
            }
        })
        .fold(
            || vec![0u8; self.piece_byte_len],
            |mut acc, cur| {
                #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
                gf256_inplace_add_vectors(&mut acc, &cur);
                #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
                acc.iter_mut().zip(cur).for_each(|(a, b)| {
                    *a ^= b;
                });
                acc
            },
        )
        .reduce(
            || vec![0u8; self.piece_byte_len],
            |mut acc, cur| {
                #[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
                gf256_inplace_add_vectors(&mut acc, &cur);
                #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
                acc.iter_mut().zip(cur).for_each(|(a, b)| {
                    *a ^= b;
                });
                acc
            },
        ),
);

But would that actually be faster? That needs to be explored.
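As a rough illustration of the proposed scheme, the sketch below splits the pieces into one contiguous range per worker, lets each worker fold its range into a private accumulator, and then XORs the per-worker accumulators together. This is a hypothetical standalone sketch, not the crate's code: it uses std::thread::scope instead of rayon, a naive Russian-peasant GF(2^8) multiply with the 0x1D reduction polynomial (the crate's Gf256 may differ), and invented names (code_pieces, gf256_mul).

```rust
use std::thread;

// Russian-peasant multiplication in GF(2^8). The 0x1D reduction
// polynomial is an assumption for this sketch; the rlnc crate's
// actual Gf256 type may use a different representation.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            p ^= a;
        }
        let carry = a & 0x80;
        a <<= 1;
        if carry != 0 {
            a ^= 0x1D;
        }
        b >>= 1;
    }
    p
}

// Hypothetical sketch: hand each worker a contiguous range of pieces
// (plus the matching slice of the coding vector), fold each range into
// a private accumulator, then combine the partial accumulators.
fn code_pieces(data: &[u8], piece_byte_len: usize, coding_vector: &[u8], workers: usize) -> Vec<u8> {
    let pieces: Vec<&[u8]> = data.chunks_exact(piece_byte_len).collect();
    // Ceiling division so every piece lands in exactly one range.
    let per_worker = (pieces.len() + workers - 1) / workers;

    let partials: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = pieces
            .chunks(per_worker)
            .zip(coding_vector.chunks(per_worker))
            .map(|(ps, cs)| {
                s.spawn(move || {
                    let mut acc = vec![0u8; piece_byte_len];
                    for (piece, &scalar) in ps.iter().zip(cs) {
                        for (a, &symbol) in acc.iter_mut().zip(piece.iter()) {
                            // Scale the symbol and add (XOR) it into the accumulator.
                            *a ^= gf256_mul(symbol, scalar);
                        }
                    }
                    acc
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Addition in GF(2^8) is XOR, so combining partial results is a byte-wise XOR.
    partials.into_iter().fold(vec![0u8; piece_byte_len], |mut acc, cur| {
        acc.iter_mut().zip(cur).for_each(|(a, b)| *a ^= b);
        acc
    })
}
```

Because each worker touches only its own accumulator, there is no per-piece allocation inside the hot loop, which is the overhead the per-piece rayon map/fold above pays.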
