Replace ExtractMostSignificantBits+BitOp patterns with Vector helper methods#126841
Conversation
Replace patterns of ExtractMostSignificantBits() followed by PopCount/TrailingZeroCount/LeadingZeroCount with the optimized Vector helpers: CountWhereAllBitsSet, IndexOfWhereAllBitsSet, and LastIndexOfWhereAllBitsSet. Remove AdvSimd special paths from Vector64/Vector128 internal helpers (CountMatches, IndexOfFirstMatch, IndexOfLastMatch) and the now-unused AdvSimdExtractBitMask/AdvSimdFixupBitCount methods.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fcd7f0bb-7e64-41fa-8773-089705b9a737
Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
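As an illustration of the pattern being replaced, here is a hypothetical, self-contained sketch (the sample vector and values are not from the PR) showing the old `ExtractMostSignificantBits` + `BitOperations` idiom and the scalar results the new helpers compute directly:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical sample input: zero bytes at lanes 3 and 14.
Vector128<byte> search = Vector128.Create(
    (byte)1, 2, 3, 0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 16);
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);

// Old idiom: extract one mask bit per byte lane, then use scalar bit ops.
uint matches = cmp.ExtractMostSignificantBits();

int first = BitOperations.TrailingZeroCount(matches);     // 3
int last  = 31 - BitOperations.LeadingZeroCount(matches); // 14
int count = BitOperations.PopCount(matches);              // 2

Console.WriteLine($"{first} {last} {count}"); // prints "3 14 2"
```

Per the description above, `IndexOfWhereAllBitsSet`, `LastIndexOfWhereAllBitsSet`, and `CountWhereAllBitsSet` produce these first/last/count values for such comparison masks without materializing the scalar bitmask.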
...s/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.IndexOfMax.cs
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Tagging subscribers to this area: @dotnet/area-system-numerics
...tem.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.HammingDistance.cs
…ctly Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fd3915a5-bbf3-4530-8406-f3983be66ffb Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/527f6141-5f1f-41b3-bb40-336677a9c787 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
…ression Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
@copilot please look again in SpanHelpers - there should be plenty of opportunities like:

```csharp
uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
if (matches == 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Find bitflag offset of first match and add to current offset
    return (int)(offset + (uint)BitOperations.TrailingZeroCount(matches));
}
```

to be

```csharp
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
if (cmp != Vector128<byte>.Zero) // fast path check
{
    return (int)(offset + Vector128.IndexOfWhereAllBitsSet(cmp));
}
else
{
    offset += (nuint)Vector128<byte>.Count;
}
```

etc.
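Put together, the suggested shape can be sketched as a complete routine. This is a hypothetical standalone example, not code from the PR: on runtimes where `Vector128.IndexOfWhereAllBitsSet` is not yet available, the local helper below emulates it with the old `ExtractMostSignificantBits` pattern.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Emulation of IndexOfWhereAllBitsSet for comparison-result masks:
// returns the lane index of the first all-ones lane, or -1 if none.
static int IndexOfWhereAllBitsSet(Vector128<byte> mask) =>
    mask == Vector128<byte>.Zero
        ? -1
        : BitOperations.TrailingZeroCount(mask.ExtractMostSignificantBits());

// Hypothetical IndexOf-zero-byte routine using the suggested loop shape.
static int IndexOfZeroByte(ReadOnlySpan<byte> data)
{
    nuint offset = 0;
    while (offset + (nuint)Vector128<byte>.Count <= (nuint)data.Length)
    {
        Vector128<byte> search = Vector128.Create(data.Slice((int)offset, Vector128<byte>.Count));
        Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
        if (cmp != Vector128<byte>.Zero) // fast path check: any match?
        {
            return (int)(offset + (uint)IndexOfWhereAllBitsSet(cmp));
        }
        offset += (nuint)Vector128<byte>.Count;
    }

    // Scalar tail for the remaining (< 16) bytes.
    for (int i = (int)offset; i < data.Length; i++)
    {
        if (data[i] == 0) return i;
    }
    return -1;
}

byte[] buffer = new byte[40];
Array.Fill(buffer, (byte)7);
buffer[20] = 0;
Console.WriteLine(IndexOfZeroByte(buffer)); // prints 20
```

Keeping the comparison result as a vector (rather than extracting a bitmask before the branch) is what lets the JIT use `vptest`/`umaxp` for the any-match check, as discussed below.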
…n SpanHelpers Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/32b80803-7f44-4863-8845-f76d999fe220 Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
Done in 9b3f39f. Replaced all the simple
Left unchanged: the loop patterns in IndexOf/LastIndexOf that iterate through multiple matches with
@MihuBot -arm
```diff
- uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
- if (matches == 0)
+ Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
+ if (cmp == Vector128<byte>.Zero)
```
@EgorBo, so on x64 this is basically going to do:
```asm
; Approx 8 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vptest xmm0, xmm0 ; 7 cycles
jz SHORT NO_MATCH ; fused
MATCH: ; Approx 10 total cycles
vpmovmskb eax, xmm0 ; 5 cycles
tzcnt eax, eax ; 1 cycle
mov ecx, -1 ; 1 cycle
cmp eax, 32 ; 1 cycle
cmove eax, ecx ; 1 cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and on Arm64 (neoverse v2):
```asm
; Approx 7 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
umaxp v17.4s, v16.4s, v16.4s ; 2 cycles
umov x1, v17.d[0] ; 2 cycles
cmp x1, #0 ; 1 cycle
b.eq NO_MATCH ; branch
MATCH: ; Approx 10 total cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
movn w2, #0 ; 1 cycle
cmp w1, #16 ; 1 cycle
csel w1, w1, w2, ne ; fused
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

More ideally the JIT could recognize this general pattern and generate this instead for x64:
```asm
; Approx 7 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vpmovmskb eax, xmm0 ; 5 cycles
cmp eax, 0 ; 1 cycle
jz SHORT NO_MATCH ; fused
MATCH: ; Approx 2 total cycles
tzcnt eax, eax ; 1 cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and this on Arm64:
```asm
; Approx 7 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
cmp w1, #0 ; 1 cycle
b.eq NO_MATCH
MATCH: ; Approx 4 total cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

This would make it significantly cheaper for both, but I think it requires us to recognize the != Zero followed by a Count/IndexOf/LastIndexOf pattern. Specifically, I think CSE would trivially handle this for Arm64, but on x64 we'd need to transform the != Zero in that case so CSE could kick in.
What are your thoughts on this?
The alternative is we setup the managed code to look like this:
```csharp
int index = Vector128.IndexOf(search, 0);
if (index < 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Find bitflag offset of first match and add to current offset
    return (int)(offset + (uint)index);
}
```

Then we'd get this (roughly) on x64:
```asm
; Approx 11 total cycles
vxorps xmm0, xmm0, xmm0 ; 0 cycles
vpcmpeqb xmm0, xmm0, xmm1 ; 1 cycle
vpmovmskb eax, xmm0 ; 5 cycles
tzcnt eax, eax ; 1 cycle
mov ecx, -1 ; 1 cycle
cmp eax, 32 ; 1 cycle
cmove eax, ecx ; 1 cycle
cmp eax, 0 ; 1 cycle
jl SHORT NO_MATCH ; fused
MATCH: ; Approx 1 total cycle
add eax, edx ; 1 cycle
ret ; return
NO_MATCH:
; ...
```

...and this on Arm64:
```asm
; Approx 10 total cycles
cmeq v16.16b, v0.16b, #0 ; 2 cycles
shrn v16.8b, v16.8h, #4 ; 2 cycles
umov x1, v16.d[0] ; 2 cycles
rbit x1, x1 ; 1 cycle
clz x1, x1 ; 1 cycle
lsr w1, w1, #2 ; 1 cycle
cmp w1, #0 ; 1 cycle
b.ge NO_MATCH
MATCH: ; Approx 1 total cycle
add w0, w0, w1 ; 1 cycle
ret lr ; return
NO_MATCH:
; ...
```

This is a little less than half the cost on match on both platforms, but has a slightly higher cost for the no-match scenario.
But I expect this is also difficult to pattern match and handle to get it to generate what we want in the first scenario, right?
We should probably pick one and have that be the "recommended pattern", where we then have the JIT handle it for the ideal codegen.

The "other" other thing we could do is use Vector128.AnyWhereAllBitsSet(mask) instead of mask != Vector128<T>.Zero, which might then be easier to optimize overall; but I'm interested in your thoughts so we can work towards getting it optimized and have the managed code follow our desired shape.
Replace usages of `ExtractMostSignificantBits()` followed by `PopCount`/`TrailingZeroCount`/`LeadingZeroCount` with the recently optimized `[Intrinsic]` vector helpers: `CountMatches`, `IndexOfFirstMatch`, `IndexOfLastMatch`, and `IndexOfWhereAllBitsSet`.

Description

Pattern replacements

- `BitOperations.PopCount(v.ExtractMostSignificantBits())` → `VectorN.CountMatches(v)` (within CoreLib, using the internal helper to avoid an x64 regression)
- `BitOperations.TrailingZeroCount(v.ExtractMostSignificantBits())` → `VectorN.IndexOfFirstMatch(v)` (within CoreLib) or `VectorN.IndexOfWhereAllBitsSet(v)` (in Tensors, using the public API)
- `Count - 1 - BitOperations.LeadingZeroCount(v.ExtractMostSignificantBits())` → `VectorN.IndexOfLastMatch(v)` (within CoreLib)

Files changed

- `CountValueType` loop bodies using the internal `CountMatches` helper directly; replaced `ComputeFirstIndex` (3 overloads, EMSB+TZC → `IndexOfFirstMatch`) and `ComputeLastIndex` (3 overloads, EMSB+LZC → `IndexOfLastMatch`)
- (EMSB+TZC → `IndexOfFirstMatch`), keeping the vector comparison result and comparing against `VectorN<byte>.Zero` instead of extracting to a bitmask
- `IndexOfFirstMatch` wrapper methods; inlined `VectorN.IndexOfWhereAllBitsSet` directly at all 9 call sites
- `VectorN.IndexOfWhereAllBitsSet` directly at all 9 call sites (previously consumed the shared `IndexOfFirstMatch` helpers)

Not changed

- (`CountMatches`, `IndexOfFirstMatch`, etc.) that the other call sites consume
- `CountWhereAllBitsSet` would introduce an x64 regression, and the internal `CountMatches` helper is not accessible from the Tensors assembly
- `ResetLowestSetBit`/`FlipBit` don't map to single-match helpers
- `TZC(~matches)` for first difference, not first match

Testing

All relevant tests pass:
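For reference, one of the patterns the description notes as intentionally left on `ExtractMostSignificantBits` is the first-difference search. A hypothetical sketch (the sample vectors are not from the PR): `Equals` marks the lanes that match, so the first difference is the first cleared mask bit, found via `TZC(~matches)` rather than any first-match helper.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Two hypothetical inputs differing only at lane 5.
Vector128<byte> left  = Vector128.Create((byte)1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
Vector128<byte> right = Vector128.Create((byte)1, 2, 3, 4, 5, 9, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);

// Mask bits are set for EQUAL lanes; invert to find the first mismatch.
uint matches = Vector128.Equals(left, right).ExtractMostSignificantBits();
int firstDiff = BitOperations.TrailingZeroCount(~matches);

Console.WriteLine(firstDiff); // prints 5
```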