Add Mask::count() method to count true elements #490
Open
GrigoryEvko wants to merge 2 commits into rust-lang:master
Implements a simple, efficient method to count the number of `true` elements in a SIMD mask. This is a common operation needed for:

- Pre-sizing allocations before filtering
- SQL-style COUNT(WHERE ...) operations
- Histogram generation
- Sparse data statistics

Implementation delegates to `to_bitmask().count_ones()`, which compiles to a single POPCNT instruction on x86_64 and equivalent efficient instructions on other platforms (CNT on ARM, CPOP on RISC-V, i64.popcnt on WASM).

Performance: ~0.7ns per operation, O(1) regardless of bit density.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The feature stdarch_x86_avx512 has been stable since Rust 1.89.0 and no longer requires a feature gate.
    #[inline]
    #[must_use]
    pub fn count(self) -> usize {
        self.to_bitmask().count_ones() as usize
Member
I have two concerns with this implementation:

- as supported vector sizes increase, `to_bitmask` will truncate to the first 64 bits
- `to_bitmask` is pretty slow on some architectures. On those architectures, I wonder if something like `(mask.to_int() >> (size_of<T>() * 8 - 1)).reduce_sum()` would work better
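Both concerns can be illustrated with a scalar model in stable Rust. This is only a sketch on plain arrays, not the portable-SIMD API: the lanes are assumed to be in the integer form that `to_int()` produces (0 for false, -1 for true), and the helper names `count_via_bitmask` and `count_via_lane_sum` are hypothetical. The lane-sum variant shifts the lane as an unsigned value, since an arithmetic shift of -1 would leave -1 rather than 1.

```rust
/// Models the bitmask path: pack lane i into bit i of a u64, then popcount.
/// Lanes at index 64 and above do not fit in the u64 and are dropped.
fn count_via_bitmask(lanes: &[i32]) -> usize {
    let mut bits: u64 = 0;
    for (i, &lane) in lanes.iter().enumerate().take(64) {
        if lane != 0 {
            bits |= 1u64 << i;
        }
    }
    bits.count_ones() as usize
}

/// Models the lane-wise alternative: isolate each lane's sign bit with a
/// logical shift (-1 becomes 1, 0 stays 0), then sum across all lanes.
fn count_via_lane_sum(lanes: &[i32]) -> usize {
    lanes
        .iter()
        .map(|&lane| ((lane as u32) >> 31) as usize)
        .sum()
}

fn main() {
    // Eight lanes, four of them true: both approaches agree.
    let small: [i32; 8] = [-1, 0, -1, -1, 0, 0, -1, 0];
    println!("{} {}", count_via_bitmask(&small), count_via_lane_sum(&small));

    // 128 true lanes: the 64-bit bitmask path undercounts.
    let wide = [-1i32; 128];
    println!("{} {}", count_via_bitmask(&wide), count_via_lane_sum(&wide));
}
```

In this model the two functions agree up to 64 lanes and diverge beyond that, which is the truncation described in the first concern.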
Add `Mask::count()` method
Motivation
The `Mask` API currently provides boolean queries (`any()`, `all()`) and index queries (`first_set()`), but lacks a method to count the number of true elements. This forces users to either convert to arrays and iterate, or manually use `to_bitmask().count_ones()`, which exposes implementation details.

Current workarounds:
Proposed:
This pattern appears frequently in SIMD code when pre-sizing allocations to avoid reallocation overhead:
Other common use cases include histogram generation, SQL-style COUNT aggregation, and sparse data analysis.
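The pre-sizing use case can be sketched in stable Rust with plain arrays standing in for SIMD vectors and masks. Everything here is an assumption made for illustration: `LANES`, `count_true` (a stand-in for the proposed `Mask::count()`), and `filter_chunks` are hypothetical names, and the masks are modelled as `[bool; LANES]` rather than the real `Mask` type.

```rust
/// Hypothetical lane count for this sketch.
const LANES: usize = 4;

/// Stand-in for the proposed `Mask::count()`: number of true lanes.
fn count_true(mask: &[bool; LANES]) -> usize {
    mask.iter().filter(|&&lane| lane).count()
}

/// Keeps the values whose mask lane is true, sizing the output up front
/// so that pushing never has to reallocate.
fn filter_chunks(values: &[[i32; LANES]], masks: &[[bool; LANES]]) -> Vec<i32> {
    // First pass: total number of selected lanes across all chunks.
    let total: usize = masks.iter().map(|mask| count_true(mask)).sum();
    let mut out = Vec::with_capacity(total);

    // Second pass: copy the selected lanes in order.
    for (chunk, mask) in values.iter().zip(masks.iter()) {
        for (value, &keep) in chunk.iter().zip(mask.iter()) {
            if keep {
                out.push(*value);
            }
        }
    }
    out
}

fn main() {
    let values = [[1, 2, 3, 4], [5, 6, 7, 8]];
    let masks = [[true, false, false, true], [false, true, true, false]];
    let kept = filter_chunks(&values, &masks);
    println!("{kept:?}");
}
```

With the real API the first pass would call the count method on each mask instead of iterating over booleans; the two-pass structure stays the same.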
API Design
Design decisions:

- Returns `usize` - Consistent with `Iterator::count()` and suitable for array indexing
- Named `count()`, not `len()` - `len()` implies container size; `count()` matches the semantic operation (counting true values)
- `#[must_use]` attribute - Follows `Vec::len()` and `slice::len()` precedent (no message)
- Not `const` - `to_bitmask()` uses intrinsics that cannot be const-evaluated

Implementation
The implementation delegates to `to_bitmask().count_ones()`, which already uses LLVM's `llvm.ctpop` intrinsic. This compiles to efficient platform-specific instructions:

- x86_64: `POPCNT` (SSE4.2)
- ARM: `CNT` (NEON)
- RISC-V: `CPOP` (Zbb extension)
- WASM: `i64.popcnt`

No platform-specific code is required; LLVM handles optimization for each target.
Performance
Benchmarked on x86_64 (Intel Core i7-14700HX, `-C target-cpu=native`):

Assembly verification shows the expected codegen (x86_64):
The operation is branch-free and density-independent: mask16 measured at 1.03-1.05ns across all densities (0%, 25%, 50%, 75%, 100%), confirming constant-time behavior regardless of true element count.