Replace ExtractMostSignificantBits+BitOp patterns with Vector helper methods#126841
Conversation
Replace patterns of ExtractMostSignificantBits() followed by PopCount/TrailingZeroCount/LeadingZeroCount with the optimized Vector helpers: CountWhereAllBitsSet, IndexOfWhereAllBitsSet, and LastIndexOfWhereAllBitsSet. Remove AdvSimd special paths from Vector64/Vector128 internal helpers (CountMatches, IndexOfFirstMatch, IndexOfLastMatch) and the now-unused AdvSimdExtractBitMask/AdvSimdFixupBitCount methods.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fcd7f0bb-7e64-41fa-8773-089705b9a737
Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
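As an illustration of the pattern being replaced, here is a hypothetical, self-contained sketch (the sample vector and values are not from the PR) showing the old `ExtractMostSignificantBits` + `BitOperations` idiom and the scalar results the new helpers compute directly:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical sample input: zero bytes at lanes 3 and 14.
Vector128<byte> search = Vector128.Create(
    (byte)1, 2, 3, 0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 16);
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);

// Old idiom: extract one mask bit per byte lane, then use scalar bit ops.
uint matches = cmp.ExtractMostSignificantBits();

int first = BitOperations.TrailingZeroCount(matches);     // 3
int last  = 31 - BitOperations.LeadingZeroCount(matches); // 14
int count = BitOperations.PopCount(matches);              // 2

Console.WriteLine($"{first} {last} {count}"); // prints "3 14 2"
```

Per the description above, `IndexOfWhereAllBitsSet`, `LastIndexOfWhereAllBitsSet`, and `CountWhereAllBitsSet` produce these first/last/count values for such comparison masks without materializing the scalar bitmask.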
...s/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.IndexOfMax.cs
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Tagging subscribers to this area: @dotnet/area-system-numerics
...tem.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.HammingDistance.cs
…ctly Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fd3915a5-bbf3-4530-8406-f3983be66ffb Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/527f6141-5f1f-41b3-bb40-336677a9c787 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
…ression Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
@copilot please look again in SpanHelpers - there should be plenty of opportunities like:

```csharp
uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
if (matches == 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Find bitflag offset of first match and add to current offset
    return (int)(offset + (uint)BitOperations.TrailingZeroCount(matches));
}
```

to be

```csharp
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
if (cmp != Vector128<byte>.Zero) // fast path check
{
    return (int)(offset + Vector128.IndexOfWhereAllBitsSet(cmp));
}
else
{
    offset += (nuint)Vector128<byte>.Count;
}
```

etc.
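Put together, the suggested shape can be sketched as a complete routine. This is a hypothetical standalone example, not code from the PR: on runtimes where `Vector128.IndexOfWhereAllBitsSet` is not yet available, the local helper below emulates it with the old `ExtractMostSignificantBits` pattern.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Emulation of IndexOfWhereAllBitsSet for comparison-result masks:
// returns the lane index of the first all-ones lane, or -1 if none.
static int IndexOfWhereAllBitsSet(Vector128<byte> mask) =>
    mask == Vector128<byte>.Zero
        ? -1
        : BitOperations.TrailingZeroCount(mask.ExtractMostSignificantBits());

// Hypothetical IndexOf-zero-byte routine using the suggested loop shape.
static int IndexOfZeroByte(ReadOnlySpan<byte> data)
{
    nuint offset = 0;
    while (offset + (nuint)Vector128<byte>.Count <= (nuint)data.Length)
    {
        Vector128<byte> search = Vector128.Create(data.Slice((int)offset, Vector128<byte>.Count));
        Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
        if (cmp != Vector128<byte>.Zero) // fast path check: any match?
        {
            return (int)(offset + (uint)IndexOfWhereAllBitsSet(cmp));
        }
        offset += (nuint)Vector128<byte>.Count;
    }

    // Scalar tail for the remaining (< 16) bytes.
    for (int i = (int)offset; i < data.Length; i++)
    {
        if (data[i] == 0) return i;
    }
    return -1;
}

byte[] buffer = new byte[40];
Array.Fill(buffer, (byte)7);
buffer[20] = 0;
Console.WriteLine(IndexOfZeroByte(buffer)); // prints 20
```

Keeping the comparison result as a vector (rather than extracting a bitmask before the branch) is what lets the JIT use `vptest`/`umaxp` for the any-match check, as discussed below.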
…n SpanHelpers Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/32b80803-7f44-4863-8845-f76d999fe220 Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
Done in 9b3f39f. Replaced all the simple
Left unchanged: the loop patterns in IndexOf/LastIndexOf that iterate through multiple matches with
@MihuBot -arm
```diff
- uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
- if (matches == 0)
+ Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
+ if (cmp == Vector128<byte>.Zero)
```
@EgorBo, so on x64 this is basically going to do:
```asm
; Approx 8 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vptest xmm0, xmm0 ; 7 cycles
jz SHORT NO_MATCH ; fused
MATCH: ; Approx 10 total cycles
vpmovmskb eax, xmm0 ; 5 cycles
tzcnt eax, eax ; 1 cycle
mov ecx, -1 ; 1 cycle
cmp eax, 32 ; 1 cycle
cmove eax, ecx ; 1 cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and on Arm64 (neoverse v2):
```asm
; Approx 7 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
umaxp v17.4s, v16.4s, v16.4s ; 2 cycles
umov x1, v17.d[0] ; 2 cycles
cmp x1, #0 ; 1 cycle
b.eq NO_MATCH ; branch
MATCH: ; Approx 10 total cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
movn w2, #0 ; 1 cycle
cmp w1, #16 ; 1 cycle
csel w1, w1, w2, ne ; fused
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

More ideally the JIT could recognize this general pattern and generate this instead for x64:
```asm
; Approx 7 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vpmovmskb eax, xmm0 ; 5 cycles
cmp eax, 0 ; 1 cycle
jz SHORT NO_MATCH ; fused
MATCH: ; Approx 2 total cycles
tzcnt eax, eax ; 1 cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and this on Arm64:
```asm
; Approx 7 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
cmp w1, #0 ; 1 cycle
b.eq NO_MATCH
MATCH: ; Approx 4 total cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

This would make it significantly cheaper for both, but I think it requires us to recognize the != Zero followed by a Count/IndexOf/LastIndexOf pattern. Specifically, I think CSE would trivially handle this for Arm64, but on x64 we'd need to transform the != Zero in that case so CSE could kick in.
What are your thoughts on this?
The alternative is we setup the managed code to look like this:
```csharp
int index = Vector128.IndexOf(search, 0);
if (index < 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Find bitflag offset of first match and add to current offset
    return (int)(offset + (uint)index);
}
```

Then we'd get this (roughly) on x64:
```asm
; Approx 11 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vpmovmskb eax, xmm0 ; 5 cycles
tzcnt eax, eax ; 1 cycle
mov ecx, -1 ; 1 cycle
cmp eax, 32 ; 1 cycle
cmove eax, ecx ; 1 cycle
cmp eax, 0 ; 1 cycle
jl SHORT NO_MATCH ; fused
MATCH: ; Approx 1 total cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and this on Arm64:
```asm
; Approx 10 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
cmp w1, #0 ; 1 cycle
b.ge NO_MATCH
MATCH: ; Approx 1 total cycle
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

This is a little less than half the cost on match on both platforms, but has a slightly higher cost for the no-match scenario.
But I expect this is also difficult to pattern match and handle to get it to generate what we want in the first scenario, right?
We should probably pick one and have that be the "recommended pattern", where we then have the JIT handle it for the ideal codegen.

The "other" other thing we could do is use Vector128.AnyWhereAllBitsSet(mask) instead of mask != Vector128<T>.Zero, which might then be easier to optimize overall; but I'm interested in your thoughts so we can work towards getting it optimized and have the managed code follow our desired shape.
Replace usages of `ExtractMostSignificantBits()` followed by `PopCount`/`TrailingZeroCount`/`LeadingZeroCount` with the recently optimized `[Intrinsic]` vector helpers: `CountMatches`, `IndexOfFirstMatch`, `IndexOfLastMatch`, and `IndexOfWhereAllBitsSet`.

Description

Pattern replacements

- `BitOperations.PopCount(v.ExtractMostSignificantBits())` → `VectorN.CountMatches(v)` (within CoreLib, using the internal helper to avoid an x64 regression)
- `BitOperations.TrailingZeroCount(v.ExtractMostSignificantBits())` → `VectorN.IndexOfFirstMatch(v)` (within CoreLib) or `VectorN.IndexOfWhereAllBitsSet(v)` (in Tensors, using the public API)
- `Count - 1 - BitOperations.LeadingZeroCount(v.ExtractMostSignificantBits())` → `VectorN.IndexOfLastMatch(v)` (within CoreLib)

Files changed

- `CountValueType` loop bodies using the internal `CountMatches` helper directly; replaced `ComputeFirstIndex` (3 overloads, EMSB+TZC → `IndexOfFirstMatch`) and `ComputeLastIndex` (3 overloads, EMSB+LZC → `IndexOfLastMatch`)
- (EMSB+TZC → `IndexOfFirstMatch`), keeping the vector comparison result and comparing against `VectorN<byte>.Zero` instead of extracting to a bitmask
- `IndexOfFirstMatch` wrapper methods; inlined `VectorN.IndexOfWhereAllBitsSet` directly at all 9 call sites
- `VectorN.IndexOfWhereAllBitsSet` directly at all 9 call sites (previously consumed the shared `IndexOfFirstMatch` helpers)

Not changed

- (`CountMatches`, `IndexOfFirstMatch`, etc.) that the other call sites consume
- `CountWhereAllBitsSet` would introduce an x64 regression, and the internal `CountMatches` helper is not accessible from the Tensors assembly
- `ResetLowestSetBit`/`FlipBit` don't map to single-match helpers
- `TZC(~matches)` for first difference, not first match

Testing

All relevant tests pass:
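For reference, one of the patterns the description notes as intentionally left on `ExtractMostSignificantBits` is the first-difference search. A hypothetical sketch (the sample vectors are not from the PR): `Equals` marks the lanes that match, so the first difference is the first cleared mask bit, found via `TZC(~matches)` rather than any first-match helper.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Two hypothetical inputs differing only at lane 5.
Vector128<byte> left  = Vector128.Create((byte)1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
Vector128<byte> right = Vector128.Create((byte)1, 2, 3, 4, 5, 9, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);

// Mask bits are set for EQUAL lanes; invert to find the first mismatch.
uint matches = Vector128.Equals(left, right).ExtractMostSignificantBits();
int firstDiff = BitOperations.TrailingZeroCount(~matches);

Console.WriteLine(firstDiff); // prints 5
```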