Fix performance regression in fork_join_executor by implementing missing traits#6919
Fix performance regression in fork_join_executor by implementing missing traits#6919arpittkhandelwal wants to merge 8 commits intomasterfrom
Conversation
|
@arpittkhandelwal excellent catch! |
Okay sir. The performance regression has been confirmed fixed in PR #6919 (which is now approved). I have also verified locally that no other performance regressions remain in
I have double-checked the performance benchmarks locally to be absolutely sure:
The regression was isolated specifically to the missing traits in |
Thank you for this anaylsis. If we look at the report from th eperformance CI then we can see that it reported a performance regression specifically for the parallel_executor, not the fork_join_executor. I think we need more data before we can be sure that things are fixed. A general note: We currently have several issues on master that need to be fixed first to make sure the CIs pass for any new PR before it being merged. These problems are being solved with their own PRs: #6914, #6920. Once the CIs pass for them, we will merge. Then we need to rebase all other waiting PRs and make the CIs pass for those. Unfortunately, over the last weekend our CIs were completely overwhelmed by the flurry of PRs submitted by everyone, causing many failures (most likely unrelated). This doesn't allow us to be sure about the state of the code base at this time. Let's take it step-by-step. |
Thank you for the clarification, sir. I understand that CI stability is the priority and we need to wait for #6914 and #6920. |
|
@hkaiser Sir, I have implemented the processing_units_count and get_first_core traits in this PR (initially for Since the CI is reporting a regression in Could we test if merging these changes (or running them through the LSU CI) helps resolve the reports? I have also prepared some diagnostic prints locally to verify the core counts for all executors; if the regression persists, I can use those to provide the exact data needed to pinpoint the cause for |
I'm almost done with #6920 and once that's merged, you can rebase this PR. That should allow us to see the results of the Perf-CI here. |
|
@arpittkhandelwal I have merged #6920, please rebase this PR to see how it fares. |
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
|
@arpittkhandelwal unfortunately, your fix didn't address the performance regression :/ In any case, thanks for pinpointing the problem that is being addressed by this PR. |
Sir, I have traced the parallel_executor Stream benchmark regression. The pu_mask() function (line 585 of parallel_executor.hpp) computes needs_wraparound using get_active_os_thread_count(), which can be temporarily smaller than get_os_thread_count() on a loaded CI machine. When needs_wraparound = true, all threads wrap to cores 0-N instead of spreading across NUMA nodes, which severely hurts memory bandwidth. The fix is to use get_os_thread_count() instead. Would you like me to update this PR with that fix? |
@arpittkhandelwal the function What we may want to try is to make sure we can always cache the computed |
57a2599 to
dca03cb
Compare
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
…ng the processing unit mask * Modifies parallel_policy_executor to unconditionally cache its pu_mask using a shared_ptr, eliminating the expensive mask recomputation overhead on every invocation (such as in Stream benchmarks). * Fixes the wraparound calculation by using get_os_thread_count() instead of get_active_os_thread_count(), which could previously cause non-optimal NUMA mappings on heavily loaded multi-core nodes.
dca03cb to
c83428b
Compare
There was a problem hiding this comment.
This can go away now as well, so can the corresponding #undef at the end of the file.
There was a problem hiding this comment.
Alternatively, we can keep the macro and have two code paths: one that is using allocations (as proposed here) and one using the previous code (there mask_type is simply a std::uint64_t).
| mask_ = mask; | ||
| #endif | ||
| return mask; | ||
| mask_ = new hpx::threads::mask_type(HPX_MOVE(mask)); |
There was a problem hiding this comment.
This function can't be noexcept anymore.
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
47fea3c to
e7ceeed
Compare
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
…to restore chunking
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
| auto const num_threads = get_num_cores(); | ||
| auto const* pool = | ||
| pool_ ? pool_ : threads::detail::get_self_or_default_pool(); | ||
| auto const available_threads = | ||
| static_cast<std::uint32_t>(pool->get_active_os_thread_count()); | ||
| static_cast<std::uint32_t>(pool->get_os_thread_count()); |
There was a problem hiding this comment.
This is not a correct change. The executor can work only with the threads that are currently active on the given pool.
| mask_ = new hpx::threads::mask_type(HPX_MOVE(mask)); | ||
| return *mask_; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Can you try reducing the amount of code duplication, please? Perhaps by introducing a helper function that encapsulates the common functionality?
| struct is_bulk_one_way_executor< | ||
| hpx::execution::parallel_policy_executor<Policy>> : std::true_type | ||
| { | ||
| }; |
There was a problem hiding this comment.
Why did you remove this? The parallel_executor does support bulk_sync_execute.
|
@arpittkhandelwal I think we can now get back to the PRs related to performance tweaks. Could you please rebase this onto master? Also, please resolve the comflicts. |
|
@arpittkhandelwal Would you be willing to update this and finialize it? |
Up to standards ✅🟢 Issues
|
|
Hi @hkaiser sir , Thank you for the follow-up I’ll take this forward and finalize it. |
This PR fixes a significant performance regression (approx. 10-20x slowdown) observed in fork_join_executor benchmarks. The regression was traced back to fork_join_executor missing the processing_units_count and get_first_core traits, causing it to fall back to a default implementation that could return 1 core in certain environments (like CI), triggering sequential execution paths in algorithms.
Details
Problem: In recent changes (specifically around PR #6821), algorithms like for_each became stricter about adhering to the reported core count. fork_join_executor did not explicitly implement tag_invoke for processing_units_count_t, causing the customization point to fall back to a default that wasn't reliable for this executor's internal state.
Fix: Implemented tag_invoke for processing_units_count_t and get_first_core_t in fork_join_executor.hpp.
processing_units_count now correctly returns exec.shared_data_->num_threads_.
get_first_core now correctly calculates the first core from the pu_mask.
Verification
Validated locally using foreach_report_test.
Confirmed that processing_units_count now returns the correct thread count instead of falling back.
Expect CI performance tests to return to baseline levels.