Conversation
force-pushed from 4502f99 to dd9ae45
[feature] prefetches [feature] parallel calculations [feature] arena allocator
force-pushed from 6e48681 to 251b82a
src/lib.rs
Outdated
```rust
{
    // Parallel version: create shards in parallel
    (0..original_count as usize)
        .into_par_iter()
```
I understand we want to speed it up, but this is a bit unsafe in the context of polkadot-sdk because all these threads will eat CPU from other important tasks. Could we at least bound it to a fixed number of threads, if it is not already?
By default, the number of threads is equal to the number of logical cores.
Yeah, but that could be a bit problematic if these spawned threads eat time from things like PVF execution for other candidates, which is time-bound.
Can we make it configurable somehow, so that users of the API, i.e. polkadot-sdk, can select how many threads they want to use.
I agree with Alex that for the purposes of polkadot-sdk, we likely want the erasure coding to be single-threaded, as the parallelism comes from running it in parallel for different PoVs; otherwise it might starve other important computation. So I would be in favor of removing the parallel feature altogether.
src/subshard.rs
Outdated
```rust
let mut result = Vec::with_capacity(segments.len());

for _ in 0..segments.len() {
    result.push(Box::new([[0u8; SUBSHARD_SIZE]; TOTAL_SHARDS]));
}
```
Suggested change:
```diff
-let mut result = Vec::with_capacity(segments.len());
-for _ in 0..segments.len() {
-    result.push(Box::new([[0u8; SUBSHARD_SIZE]; TOTAL_SHARDS]));
-}
+let mut result = vec_no_clone![Box::new([[0u8; SUBSHARD_SIZE]; TOTAL_SHARDS]); segments.len()];
```
```rust
macro_rules! vec_no_clone {
    ($elem:expr; $n:expr) => ({
        let n = $n;
        let mut result = Vec::with_capacity(n);
        for _ in 0..n {
            result.push($elem);
        }
        result
    })
}
```
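For illustration, the point of the macro is that `vec![elem; n]` requires `Clone`, while re-evaluating `$elem` on each push sidesteps that bound. The macro is repeated verbatim below so the demo compiles on its own.

```rust
// Repeated verbatim from the suggestion above so this example is
// self-contained.
macro_rules! vec_no_clone {
    ($elem:expr; $n:expr) => ({
        let n = $n;
        let mut result = Vec::with_capacity(n);
        for _ in 0..n {
            result.push($elem);
        }
        result
    })
}

fn main() {
    // `$elem` is re-evaluated on every push, so `Box::new` runs n times
    // and no `Clone` bound is needed (unlike `vec![elem; n]`).
    let boxes: Vec<Box<[u8; 4]>> = vec_no_clone![Box::new([7u8; 4]); 3];
    assert_eq!(boxes.len(), 3);
    assert_eq!(*boxes[0], [7u8; 4]);
}
```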
src/merklize.rs
Outdated
```rust
    .par_iter()
    .map(|chunk| Hash::from(hash_fn(chunk)))
    .collect::<Vec<_>>();
h.resize(target_size, Hash::default());
```
resize is called on both branches; it may be deduplicated.
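One hedged sketch of that deduplication, with stand-in `Hash` and `hash_fn` definitions (the real ones live in src/merklize.rs): keep only the mapping inside the `cfg` branches and hoist the single `resize` below them.

```rust
// Stand-ins for the real types in src/merklize.rs; the toy hash is for
// illustration only.
type Hash = [u8; 32];

fn hash_fn(chunk: &[u8]) -> Hash {
    let mut h = [0u8; 32];
    for (i, b) in chunk.iter().enumerate() {
        h[i % 32] ^= *b; // toy mixing, not a real hash
    }
    h
}

fn hash_chunks(chunks: &[&[u8]], target_size: usize) -> Vec<Hash> {
    // Only the mapping differs between the branches...
    #[cfg(feature = "parallel")]
    let mut hashes: Vec<Hash> = chunks.par_iter().map(|c| hash_fn(c)).collect();
    #[cfg(not(feature = "parallel"))]
    let mut hashes: Vec<Hash> = chunks.iter().map(|c| hash_fn(c)).collect();
    // ...so the padding is written once, after both cfg blocks.
    hashes.resize(target_size, [0u8; 32]);
    hashes
}

fn main() {
    let v = hash_chunks(&[&[1u8][..], &[2u8, 3u8][..]], 4);
    assert_eq!(v.len(), 4);
}
```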
src/merklize.rs
Outdated
```rust
#[cfg(not(feature = "parallel"))]
let hashes = {
    let mut h = Vec::with_capacity(target_size);
    for chunk in chunks.iter() {
```
May simplify changes by replacing .collect() with .collect_with_capacity()
```rust
trait CollectWithCapacity<T>: Iterator<Item = T> {
    fn collect_with_capacity(self) -> Vec<T>;
}

impl<I: Iterator> CollectWithCapacity<I::Item> for I {
    fn collect_with_capacity(self) -> Vec<I::Item> {
        let mut result = Vec::with_capacity(self.size_hint().0);
        for x in self {
            result.push(x);
        }
        result
    }
}
```
size_hint().0 is equal to chunks_len, which can be less than target_size. That causes an additional reallocation in the case where target_size > chunks_len.
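A hypothetical variant that addresses this concern: pass the known target size in as a parameter and reserve the larger of it and the size hint, so the later resize cannot force a reallocation. The names here are made up for illustration.

```rust
// Hypothetical variant: `min_capacity` carries the known target size so
// that neither the size hint nor a later resize forces a reallocation.
fn collect_with_min_capacity<I: Iterator>(iter: I, min_capacity: usize) -> Vec<I::Item> {
    let cap = iter.size_hint().0.max(min_capacity);
    let mut result = Vec::with_capacity(cap);
    for x in iter {
        result.push(x);
    }
    result
}

fn main() {
    // 3 items collected, but capacity reserved for the eventual target of 10.
    let v = collect_with_min_capacity(0..3, 10);
    assert_eq!(v, vec![0, 1, 2]);
    assert!(v.capacity() >= 10);
}
```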
ordian left a comment
any idea if this crate is used by any project?
I don't know if it's used in polkajam anymore since I don't have access to private paritytech repos, but I am assuming it will be used for https://polkadot-fellows.github.io/RFCs/approved/0139-faster-erasure-coding.html
.cargo/config.toml
Outdated
```toml
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-C", "target-cpu=native",
    "-C", "target-feature=+avx2,+fma",
```
we need to be careful about this - maybe distribute 2 versions of the binaries - optimized and unoptimized, otherwise it might trigger an illegal instruction on older hardware - or communicate clearly the expected hardware changes
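A common alternative to baking `+avx2` into the build flags is runtime detection with `std::arch::is_x86_feature_detected!`, dispatching to a portable fallback on older hardware. A minimal sketch (not this crate's code):

```rust
// Sketch of runtime dispatch as an alternative to `-C target-feature`:
// detect AVX2 at startup and fall back on hardware without it.
fn has_avx2() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        std::arch::is_x86_feature_detected!("avx2")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        false
    }
}

fn main() {
    if has_avx2() {
        println!("using the AVX2 fast path");
    } else {
        println!("falling back to the portable path");
    }
}
```

This keeps a single distributed binary working everywhere, at the cost of one branch at the dispatch point.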
src/lib.rs
Outdated
```rust
#[cfg(feature = "arena")]
{
    construct_chunks_arena(n_chunks, data)
}

#[cfg(not(feature = "arena"))]
{
    construct_chunks_default(n_chunks, data)
}
```
not a fan of introducing features especially if they are not recommended - what are the expected gains here from the arena allocator?
I don’t really understand what kind of starvation you’re talking about. Processing the PoV involves too many different computations to say that it needs all of the CPU time. In the tests, CPU utilization doesn’t even reach 30% of a core for multiple nodes. Asynchronous IO calls, syscalls, etc. let the CPU sit idle, not to mention bubbles when reading from memory.
Even if at peak these computations overlap with some other computations and cause a local slowdown, in the overall workflow the performance gain will be greater than this local collision.
yeah, that's what triggered this PR, I was just looking to understand how hardened/used this crate is.
Signed-off-by: iceseer <iceseer@gmail.com>
JAM does not use this crate. It uses …
[fix] `num_threads` for parallel execution Signed-off-by: iceseer <iceseer@gmail.com>
@alexggh I made the thread pool local and cached, and made …
Signed-off-by: Alexander Lednev <iceseer@gmail.com>
2. Added a mechanism for determining data boundaries Signed-off-by: Alexander Lednev <iceseer@gmail.com>
Features
[feature] parallel calculations
[feature] branch prediction
[feature] prefetches
[feature] arena allocator
Additional options (features=)
parallel - enables most of the optimizations
arena - enables the arena allocator (at the moment, enabling this option is not recommended)
The number of threads for `parallel` computing can be set using the `RAYON_NUM_THREADS` environment variable. By default, it is equal to the number of logical cores.

Results
master
feature/optimized "simd,parallel"
feature/optimized "simd"
Comparison