
Replace ExtractMostSignificantBits+BitOp patterns with Vector helper methods#126841

Draft
Copilot wants to merge 6 commits into main from
copilot/replace-extract-msb-with-vector-functions

Conversation

Contributor

Copilot AI commented Apr 13, 2026

Replace usages of ExtractMostSignificantBits() followed by PopCount/TrailingZeroCount/LeadingZeroCount with the recently optimized [Intrinsic] vector helpers: CountMatches, IndexOfFirstMatch, IndexOfLastMatch, and IndexOfWhereAllBitsSet.

Description

Pattern replacements

  • BitOperations.PopCount(v.ExtractMostSignificantBits()) → VectorN.CountMatches(v) (within CoreLib, using the internal helper to avoid an x64 regression)
  • BitOperations.TrailingZeroCount(v.ExtractMostSignificantBits()) → VectorN.IndexOfFirstMatch(v) (within CoreLib) or VectorN.IndexOfWhereAllBitsSet(v) (in Tensors, using the public API)
  • Count - 1 - BitOperations.LeadingZeroCount(v.ExtractMostSignificantBits()) → VectorN.IndexOfLastMatch(v) (within CoreLib)
-count += BitOperations.PopCount(Vector512.Equals(Vector512.LoadUnsafe(ref current), targetVector).ExtractMostSignificantBits());
+count += Vector512.CountMatches(Vector512.Equals(Vector512.LoadUnsafe(ref current), targetVector));
-uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
-if (matches == 0)
+Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
+if (cmp == Vector128<byte>.Zero)
 {
     offset += (nuint)Vector128<byte>.Count;
 }
 else
 {
-    return (int)(offset + (uint)BitOperations.TrailingZeroCount(matches));
+    return (int)(offset + (uint)Vector128.IndexOfFirstMatch(cmp));
 }
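For reference, the new helpers are semantically equivalent to the bitmask patterns they replace. A minimal sketch of that equivalence for Vector128<byte> (illustrative scalar fallbacks, not the actual optimized CoreLib implementations; callers are expected to guard that at least one lane matches before asking for an index):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class HelperEquivalenceSketch
{
    // CountMatches: number of lanes whose bits are all set.
    public static int CountMatches(Vector128<byte> v) =>
        BitOperations.PopCount(v.ExtractMostSignificantBits());

    // IndexOfFirstMatch: lane index of the first all-bits-set lane.
    public static int IndexOfFirstMatch(Vector128<byte> v) =>
        BitOperations.TrailingZeroCount(v.ExtractMostSignificantBits());

    // IndexOfLastMatch: lane index of the last all-bits-set lane.
    // The extracted mask is a uint, so the highest set bit is at 31 - LZC.
    public static int IndexOfLastMatch(Vector128<byte> v) =>
        31 - BitOperations.LeadingZeroCount(v.ExtractMostSignificantBits());
}
```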

Files changed

  • SpanHelpers.T.cs — 3 replacements in CountValueType loop bodies using the internal CountMatches helper directly; replaced ComputeFirstIndex (3 overloads, EMSB+TZC → IndexOfFirstMatch) and ComputeLastIndex (3 overloads, EMSB+LZC → IndexOfLastMatch)
  • SpanHelpers.Byte.cs — 9 replacements in the null-terminator search (EMSB+TZC → IndexOfFirstMatch), keeping the vector comparison result and comparing against VectorN<byte>.Zero instead of extracting to a bitmask
  • TensorPrimitives.IndexOfMax.cs — Removed 3 private IndexOfFirstMatch wrapper methods; inlined VectorN.IndexOfWhereAllBitsSet directly at all 9 call sites
  • TensorPrimitives.Max.cs — Inlined VectorN.IndexOfWhereAllBitsSet directly at all 9 call sites (previously consumed the shared IndexOfFirstMatch helpers)

Not changed

  • Vector64.cs / Vector128.cs — Left as-is; these contain the internal helper methods (CountMatches, IndexOfFirstMatch, etc.) that the other call sites consume
  • TensorPrimitives.HammingDistance.cs — Left as-is; CountWhereAllBitsSet would introduce an x64 regression, and the internal CountMatches helper is not accessible from the Tensors assembly
  • SpanHelpers.Byte.cs loop patterns — Left as-is; IndexOf/LastIndexOf patterns that iterate through multiple matches with ResetLowestSetBit/FlipBit don't map to single-match helpers
  • SpanHelpers.Byte.cs SequenceEqual — Left as-is; computes TZC(~matches) for first difference, not first match
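The multi-match loops left unchanged follow roughly this shape (an illustrative sketch, not the exact SpanHelpers code): every set bit of the mask is consumed in turn, so a single-match helper like IndexOfFirstMatch cannot replace the extraction.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class MultiMatchSketch
{
    // Walks every matching lane in a comparison result. The bitmask is
    // extracted once, and the lowest set bit is cleared each iteration
    // (mask & (mask - 1), the ResetLowestSetBit idiom), which has no
    // equivalent among the single-match vector helpers.
    public static void VisitAllMatches(Vector128<byte> cmp, nuint offset, Action<nuint> onMatch)
    {
        uint mask = cmp.ExtractMostSignificantBits();
        while (mask != 0)
        {
            int bit = BitOperations.TrailingZeroCount(mask); // lane index of next match
            onMatch(offset + (uint)bit);
            mask &= mask - 1; // clear the lowest set bit and continue
        }
    }
}
```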

Testing

All relevant tests pass:

  • System.Runtime.Intrinsics.Tests: 12,944 tests passed
  • System.Memory.Tests: 52,229 tests passed
  • System.Numerics.Tensors.Tests: 5,427 tests passed
  • System.Runtime.Tests: 69,682 tests passed

Replace patterns of ExtractMostSignificantBits() followed by
PopCount/TrailingZeroCount/LeadingZeroCount with the optimized
Vector helpers: CountWhereAllBitsSet, IndexOfWhereAllBitsSet,
and LastIndexOfWhereAllBitsSet.

Remove AdvSimd special paths from Vector64/Vector128 internal
helpers (CountMatches, IndexOfFirstMatch, IndexOfLastMatch) and
the now-unused AdvSimdExtractBitMask/AdvSimdFixupBitCount methods.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fcd7f0bb-7e64-41fa-8773-089705b9a737

Co-authored-by: EgorBo <523221+EgorBo@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

…ctly

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/fd3915a5-bbf3-4530-8406-f3983be66ffb

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 13, 2026 17:34
Copilot AI requested a review from tannergooding April 13, 2026 17:36
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/527f6141-5f1f-41b3-bb40-336677a9c787

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 13, 2026 18:02
…ression

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot April 13, 2026 18:37
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/83aad277-6910-4c67-8aac-88c2e15047b3

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 13, 2026 18:38
@EgorBo
Member

EgorBo commented Apr 13, 2026

@copilot please look again in SpanHelpers - there should be plenty of opportunities like:

uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
if (matches == 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Find bitflag offset of first match and add to current offset
    return (int)(offset + (uint)BitOperations.TrailingZeroCount(matches));
}

to be

Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
if (cmp != Vector128<byte>.Zero) // fast path check
{
    return (int)(offset + Vector128.IndexOfWhereAllBitsSet(cmp));
}
else
{
    offset += (nuint)Vector128<byte>.Count;
}

etc

Copilot AI requested review from Copilot and removed request for Copilot April 13, 2026 21:10
Contributor Author

Copilot AI commented Apr 13, 2026

@copilot please look again in SpanHelpers - there should be plenty of opportunities like:

uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBi...

Done in 9b3f39f. Replaced all the simple EMSB+TZC/EMSB+LZC patterns in SpanHelpers with internal IndexOfFirstMatch/IndexOfLastMatch helpers:

  • SpanHelpers.Byte.cs: 9 null-terminator search sites now use IndexOfFirstMatch instead of EMSB+TrailingZeroCount
  • SpanHelpers.T.cs: ComputeFirstIndex (3 overloads) → IndexOfFirstMatch, ComputeLastIndex (3 overloads) → IndexOfLastMatch

Left unchanged: the loop patterns in IndexOf/LastIndexOf that iterate through multiple matches with ResetLowestSetBit/FlipBit, and SequenceEqual which computes TZC(~matches) for first difference — these don't map cleanly to the single-match helpers.

@EgorBo
Member

EgorBo commented Apr 13, 2026

@MihuBot -arm

@EgorBo
Member

EgorBo commented Apr 13, 2026

@MihuBot

Comment on lines -534 to +535
uint matches = Vector128.Equals(Vector128<byte>.Zero, search).ExtractMostSignificantBits();
if (matches == 0)
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
if (cmp == Vector128<byte>.Zero)
Member


@EgorBo, so on x64 this is basically going to do:

                                        ; Approx 8 total cycles
    vxorps    xmm0, xmm0, xmm0          ; 0 cycles
    vpcmpeqb  xmm0, xmm0, xmm1          ; 1 cycle
    vptest    xmm0, xmm0                ; 7 cycles
    jz        SHORT NO_MATCH            ; fused

MATCH:                                  ; Approx 10 total cycles
    vpmovmskb eax, xmm0                 ; 5 cycles
    tzcnt     eax, eax                  ; 1 cycle
    mov       ecx, -1                   ; 1 cycle
    cmp       eax, 32                   ; 1 cycle
    cmove     eax, ecx                  ; 1 cycle
    add       eax, edx                  ; 1 cycle
    ret                                 ; return

NO_MATCH:
    ; ...

and on Arm64 (neoverse v2):

                                        ; Approx 7 total cycles
    cmeq    v16.16b, v0.16b, #0         ; 2 cycles
    umaxp   v17.4s, v16.4s, v16.4s      ; 2 cycles
    umov    x1, v17.d[0]                ; 2 cycles
    cmp     x1, #0                      ; 1 cycle
    b.eq    NO_MATCH                    ; branch

MATCH:                                  ; Approx 10 total cycles
    shrn    v16.8b, v16.8h, #4          ; 2 cycles
    umov    x1, v16.d[0]                ; 2 cycles
    rbit    x1, x1                      ; 1 cycle
    clz     x1, x1                      ; 1 cycle
    lsr     w1, w1, #2                  ; 1 cycle
    movn    w2, #0                      ; 1 cycle
    cmp     w1, #16                     ; 1 cycle
    csel    w1, w1, w2, ne              ; fused
    add     w0, w0, w1                  ; 1 cycle
    ret     lr                          ; return

NO_MATCH:
    ; ...

More ideally the JIT could recognize this general pattern and generate this instead for x64:

                                        ; Approx 7 total cycles
    vxorps    xmm0, xmm0, xmm0          ; 0 cycles
    vpcmpeqb  xmm0, xmm0, xmm1          ; 1 cycle
    vpmovmskb eax, xmm0                 ; 5 cycles
    cmp       eax, 0                    ; 1 cycle
    jz        SHORT NO_MATCH            ; fused

MATCH:                                  ; Approx 2 total cycle
    tzcnt     eax, eax                  ; 1 cycle
    add       eax, edx                  ; 1 cycle
    ret                                 ; return

NO_MATCH:
    ; ...

and this on Arm64:

                                        ; Approx 7 total cycles
    cmeq    v16.16b, v0.16b, #0         ; 2 cycles
    shrn    v16.8b, v16.8h, #4          ; 2 cycles
    umov    x1, v16.d[0]                ; 2 cycles
    cmp     w1, #0                      ; 1 cycle
    b.eq    NO_MATCH

MATCH:                                  ; Approx 4 total cycle
    rbit    x1, x1                      ; 1 cycle
    clz     x1, x1                      ; 1 cycle
    lsr     w1, w1, #2                  ; 1 cycle
    add     w0, w0, w1                  ; 1 cycle
    ret     lr                          ; return

NO_MATCH:
    ; ...

This would make it significantly cheaper for both, but I think requires us to recognize the != Zero followed by an Count/IndexOf/LastIndexOf pattern. Specifically I think CSE would trivially handle this for Arm64, but on x64 we'd need to transform the != Zero in that case so CSE could kick in.

What are your thoughts on this?


The alternative is we setup the managed code to look like this:

int index = Vector128.IndexOf(search, 0);

if (index < 0)
{
    // Zero flags set so no matches
    offset += (nuint)Vector128<byte>.Count;
}
else
{
    // Index of first match is already computed; add to current offset
    return (int)(offset + (uint)index);
}

Then we'd get this (roughly) on x64:

                                        ; Approx 11 total cycles
    vxorps    xmm0, xmm0, xmm0          ; 0 cycles
    vpcmpeqb  xmm0, xmm0, xmm1          ; 1 cycle
    vpmovmskb eax, xmm0                 ; 5 cycles
    tzcnt     eax, eax                  ; 1 cycle
    mov       ecx, -1                   ; 1 cycle
    cmp       eax, 32                   ; 1 cycle
    cmove     eax, ecx                  ; 1 cycle
    cmp       eax, 0                    ; 1 cycle
    jl        SHORT NO_MATCH            ; fused

MATCH:                                  ; Approx 1 total cycle
    add       eax, edx                  ; 1 cycle
    ret                                 ; return

NO_MATCH:
    ; ...

and this on Arm64:

                                        ; Approx 10 total cycles
    cmeq    v16.16b, v0.16b, #0         ; 2 cycles
    shrn    v16.8b, v16.8h, #4          ; 2 cycles
    umov    x1, v16.d[0]                ; 2 cycles
    rbit    x1, x1                      ; 1 cycle
    clz     x1, x1                      ; 1 cycle
    lsr     w1, w1, #2                  ; 1 cycle
    cmp     w1, #0                      ; 1 cycle
    b.ge    NO_MATCH

MATCH:                                  ; Approx 1 total cycle
    add     w0, w0, w1                  ; 1 cycle
    ret     lr                          ; return

NO_MATCH:
    ; ...

This is a little less than half the cost on match on both platforms, but has slightly higher cost for the no match scenario.

But I expect this is also difficult to pattern match and handle to get it to generate what we want in the first scenario, right?

We should probably pick one and have that be the "recommended pattern" where we then have the JIT handle it for the ideal codegen. -- The "other" other thing we could do is use Vector128.AnyWhereAllBitsSet(mask) instead of mask != Vector128<T>.Zero, which might then be easier to optimize overall, but interested in your thoughts so we can work towards getting it optimized and have managed follow our desired shape.
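The AnyWhereAllBitsSet variant mentioned above would shape the managed code roughly as follows (a sketch assuming the public Vector128.AnyWhereAllBitsSet and Vector128.IndexOfWhereAllBitsSet APIs; not code from this PR):

```csharp
Vector128<byte> cmp = Vector128.Equals(Vector128<byte>.Zero, search);
if (Vector128.AnyWhereAllBitsSet(cmp)) // fast-path match check
{
    // The mask computed for the branch could ideally be reused by the
    // index lookup once the JIT recognizes the pair as one pattern.
    return (int)(offset + (uint)Vector128.IndexOfWhereAllBitsSet(cmp));
}
else
{
    offset += (nuint)Vector128<byte>.Count;
}
```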
