Skip to content

feat: add FlipSign SIMD implementation for SSE/AVX/AVX2#10

Open
LHT129 wants to merge 1 commit into
mainfrom
opencode/补齐-flipsign-simd

Hidden character warning

The head ref may contain hidden characters: "opencode/\u8865\u9f50-flipsign-simd"
Open

feat: add FlipSign SIMD implementation for SSE/AVX/AVX2#10
LHT129 wants to merge 1 commit into
mainfrom
opencode/补齐-flipsign-simd

Conversation

@LHT129
Copy link
Copy Markdown
Owner

@LHT129 LHT129 commented Apr 24, 2026

Summary

  • Add FlipSign function to sse namespace using _mm_xor_ps (4 floats per iteration)
  • Add FlipSign function to avx namespace using _mm256_xor_ps (8 floats per iteration)
  • Add FlipSign function to avx2 namespace using _mm256_xor_si256 (8 floats per iteration)
  • Update rabitq_simd.h with declarations for sse and avx2 namespaces
  • Add unit tests for SSE/AVX/AVX2 FlipSign implementations

Test Results

  • All 2,849,632 assertions passed (flip_sign + pattern tests)
  • Tested with multiple dimensions and random flip patterns
  • Precision: error < 1e-5f for standard test, < 1e-6f for pattern test

Summary by Sourcery

Add SIMD-optimized FlipSign implementations for SSE, AVX, and AVX2 and integrate them into the SIMD interface, tests, and benchmarks.

New Features:

  • Introduce FlipSign function in the SSE namespace operating on float vectors.
  • Introduce FlipSign function in the AVX namespace operating on float vectors.
  • Introduce FlipSign function in the AVX2 namespace operating on float vectors and wire it into the shared SIMD header.

Tests:

  • Extend flip_sign correctness tests to validate SSE, AVX, and AVX2 implementations against the generic version.
  • Add SSE, AVX, and AVX2 FlipSign variants to the SIMD FlipSign benchmark suite.

- Add FlipSign function to sse namespace using _mm_xor_ps
- Add FlipSign function to avx namespace using _mm256_xor_ps
- Add FlipSign function to avx2 namespace using _mm256_xor_si256
- Update rabitq_simd.h with declarations for sse and avx2 namespaces
- Add unit tests for SSE/AVX/AVX2 FlipSign implementations
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Apr 24, 2026

Reviewer's Guide

Adds SIMD-optimized FlipSign implementations for SSE, AVX, and AVX2, wires them into the shared SIMD header, and extends unit tests and benchmarks to validate and compare the new paths against the generic implementation.

Class diagram for SIMD FlipSign functions in SSE/AVX/AVX2 namespaces

classDiagram
    namespace_generic <|.. namespace_sse : fallback
    namespace_sse <|.. namespace_avx : fallback
    namespace_avx <|.. namespace_avx2 : fallback

    class namespace_generic {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
    }

    class namespace_sse {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
        +RotateOp(data float*, idx int, dim_ int, step int)
        +FHTRotate(data float*, len uint64_t)
    }

    class namespace_avx {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
        +RotateOp(data float*, idx int, dim_ int, step int)
    }

    class namespace_avx2 {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
    }
Loading

File-Level Changes

Change Details Files
Implement SSE FlipSign using 4-float chunks with XOR on the sign bit and scalar fallback for tail elements.
  • Iterate over input floats in chunks of 4 when SSE is enabled
  • Decode 4 flip bits from the packed uint8_t flip array into a 4-element 0x80000000/0 mask
  • Load mask as __m128i, cast to __m128 and XOR with data vector using _mm_xor_ps
  • Handle remaining elements with scalar sign flips when dimension is not a multiple of 4
  • Fallback to generic::FlipSign when SSE is not enabled
src/simd/sse.cpp
Implement AVX FlipSign using 8-float chunks with AVX XOR and scalar tail handling, with SSE/AVX fallbacks.
  • Process data in 8-float chunks when AVX is enabled
  • Extract 8 flip bits per chunk from packed flip array into 8-element sign mask with 0x80000000 or 0
  • Load mask as __m256i and XOR with data vector via _mm256_xor_ps
  • Apply scalar per-element sign flipping for leftover elements
  • Fallback to sse::FlipSign when AVX is not enabled
src/simd/avx.cpp
Implement AVX2 FlipSign using integer XOR on packed floats, reusing AVX behavior and providing fallback.
  • Process 8 floats at a time when AVX2 is enabled
  • Build 8-wide 0x80000000/0 mask from packed flip bits
  • Load mask as __m256i and flip sign by XOR-ing integer representation with _mm256_xor_si256 and reinterpreting as floats
  • Handle non-multiple-of-8 tails with scalar sign flips
  • Fallback to avx::FlipSign when AVX2 is not enabled
src/simd/avx2.cpp
Expose FlipSign declarations for AVX and AVX2 in the shared SIMD header.
  • Add FlipSign function declaration to avx2 namespace
  • Add FlipSign function declaration to avx namespace
src/simd/rabitq_simd.h
Extend SIMD tests and benchmarks to cover SSE, AVX, and AVX2 FlipSign implementations.
  • In flip_sign correctness tests, invoke sse::FlipSign, avx::FlipSign, and avx2::FlipSign conditionally on runtime SIMD support and compare against generic results
  • Update FlipSign benchmark macro usage to benchmark the new SSE, AVX, and AVX2 implementations under capability checks
  • Extend pattern-based FlipSign correctness tests to validate SSE, AVX, and AVX2 paths against generic output with a tight error bound
src/simd/rabitq_simd_test.cpp

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • The AVX and AVX2 FlipSign implementations are nearly identical (bit extraction + mask construction); consider factoring this into a shared helper to avoid duplication and keep future changes in sync.
  • In avx::FlipSign the #else branch calls sse::FlipSign, which assumes SSE is available whenever AVX is not; if that's not guaranteed by build flags, it may be safer to fall back directly to generic::FlipSign for consistency with other paths.
  • The bit extraction in FlipSign (SSE/AVX/AVX2) repeatedly indexes flip[(bit_idx + j) / 8] within the inner loop; you could precompute the base byte and intra-byte offsets to reduce repeated division/modulo and simplify the logic.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The AVX and AVX2 `FlipSign` implementations are nearly identical (bit extraction + mask construction); consider factoring this into a shared helper to avoid duplication and keep future changes in sync.
- In `avx::FlipSign` the `#else` branch calls `sse::FlipSign`, which assumes SSE is available whenever AVX is not; if that's not guaranteed by build flags, it may be safer to fall back directly to `generic::FlipSign` for consistency with other paths.
- The bit extraction in `FlipSign` (SSE/AVX/AVX2) repeatedly indexes `flip[(bit_idx + j) / 8]` within the inner loop; you could precompute the base byte and intra-byte offsets to reduce repeated division/modulo and simplify the logic.

## Individual Comments

### Comment 1
<location path="src/simd/sse.cpp" line_range="1363-1364" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid narrowing `i` to 32 bits for bit index computations to prevent incorrect behavior on very large `dim`.

Casting `i` (uint64_t) to `uint32_t` will wrap when `dim >= 2^32`, causing incorrect bit selection in `flip`. Unless `dim < 2^32` is guaranteed, keep `bit_idx` as `uint64_t`/`size_t` and remove the cast, and apply the same treatment to all `bit_idx` uses in the SIMD loops.
</issue_to_address>

### Comment 2
<location path="src/simd/avx.cpp" line_range="1286-1287" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
+            flip_bits |= ((flip[(bit_idx + j) / 8] >> ((bit_idx + j) % 8)) & 1) << j;
</code_context>
<issue_to_address>
**issue (bug_risk):** Using a 32-bit `bit_idx` for indexing can overflow for large dimensions.

Casting `i` to `uint32_t` here means bit selection breaks once `dim` exceeds `2^32 - 1`, as in the SSE path. Keeping `bit_idx` (and the related indexing math) as `uint64_t`/`size_t` avoids this overflow while preserving performance on typical platforms.
</issue_to_address>

### Comment 3
<location path="src/simd/avx2.cpp" line_range="1436-1437" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
+            flip_bits |= ((flip[(bit_idx + j) / 8] >> ((bit_idx + j) % 8)) & 1) << j;
</code_context>
<issue_to_address>
**issue (bug_risk):** Same potential overflow of `bit_idx` in AVX2 path as in SSE/AVX implementations.

Keep `bit_idx` as a 64-bit value instead of truncating to `uint32_t` so very large `dim` values don’t silently wrap and behavior stays consistent with the other implementations.
</issue_to_address>

### Comment 4
<location path="src/simd/rabitq_simd_test.cpp" line_range="466-475" />
<code_context>

             generic::FlipSign(flip, gt_data, dim);

+            if (SimdStatus::SupportSSE()) {
+                auto* sse_data = sse_datas.data() + i * dim;
+                sse::FlipSign(flip, sse_data, dim);
+                for (int j = 0; j < dim; j++) {
+                    REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
+                }
+            }
+            if (SimdStatus::SupportAVX()) {
+                auto* avx_data = avx_datas.data() + i * dim;
+                avx::FlipSign(flip, avx_data, dim);
+                for (int j = 0; j < dim; j++) {
+                    REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta);
+                }
+            }
+            if (SimdStatus::SupportAVX2()) {
+                auto* avx2_data = avx2_datas.data() + i * dim;
+                avx2::FlipSign(flip, avx2_data, dim);
</code_context>
<issue_to_address>
**suggestion (testing):** Add explicit tests that exercise byte-boundary and tail cases for SSE/AVX/AVX2 FlipSign

The current checks validate SSE/AVX/AVX2 against the generic path, but they don’t guarantee we hit the edge cases most prone to SIMD bugs: (a) flips whose bit indices cross byte boundaries (e.g., dims 7, 8, 9, 15, 16, …) and (b) dims that trigger the scalar cleanup (non‑multiples of the SIMD width). Please add or extend a test that (1) uses hand-crafted flip masks around byte boundaries (e.g., flipping only indices 7–9, 15–17, etc.) and (2) explicitly exercises non-multiple dims for each ISA (e.g., 1, 3, 5, 7, 9 for SSE; 5, 9, 13 for AVX/AVX2) so we cover those tail and bit-indexing regions deterministically.

Suggested implementation:

```cpp
TEST_CASE("FlipSign SIMD byte-boundary and tail cases", "[simd][flip_sign][boundary]") {
    constexpr float delta = 1e-5f;

    // Indices chosen to cross byte boundaries: bits 7–9, 15–17, ...
    const std::array<int, 6> boundary_indices = {7, 8, 9, 15, 16, 17};

    auto make_flip_mask = [&](int dim) {
        const std::size_t flip_size = (dim + 7) / 8;
        std::vector<uint8_t> flip(flip_size, uint8_t{0});
        for (int idx : boundary_indices) {
            if (idx >= dim) {
                continue;
            }
            const int byte = idx / 8;
            const int bit = idx % 8;
            flip[byte] |= static_cast<uint8_t>(1u << bit);
        }
        return flip;
    };

    auto run_dim = [&](int dim) {
        auto flip = make_flip_mask(dim);

        std::vector<float> base(dim);
        for (int i = 0; i < dim; ++i) {
            base[i] = static_cast<float>(i + 1);
        }

        std::vector<float> gt_data = base;
        generic::FlipSign(flip.data(), gt_data.data(), dim);

        if (SimdStatus::SupportSSE()) {
            std::vector<float> sse_data = base;
            sse::FlipSign(flip.data(), sse_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX()) {
            std::vector<float> avx_data = base;
            avx::FlipSign(flip.data(), avx_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX2()) {
            std::vector<float> avx2_data = base;
            avx2::FlipSign(flip.data(), avx2_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx2_data[j]) < delta);
            }
        }
    };

    // Dims that are not multiples of typical SIMD widths, to force scalar tails.
    const std::array<int, 5> sse_dims = {1, 3, 5, 7, 9};
    const std::array<int, 3> avx_dims = {5, 9, 13};

    for (int dim : sse_dims) {
        run_dim(dim);
    }
    for (int dim : avx_dims) {
        run_dim(dim);
    }
}

    std::vector<uint8_t> flips = fixtures::GenerateVectors<uint8_t>(count, flip_size);

    BENCHMARK_SIMD_FLIP_SIGN(generic, FlipSign);

```

1. Ensure `#include <array>` is present near the top of `src/simd/rabitq_simd_test.cpp`, since the new test uses `std::array`.  
2. If the file wraps tests in a namespace (e.g., `namespace milvus::simd { ... }`), verify that the new `TEST_CASE` is placed inside the same namespace scope (the replacement above assumes it is already within the correct scope).  
3. If the existing tests use a specific float type alias (e.g., `using value_t = float;`), you may want to replace `float` in the new test with that alias for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/simd/sse.cpp
Comment on lines +1363 to +1364
uint32_t bit_idx = static_cast<uint32_t>(i);
uint8_t flip_bits = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Avoid narrowing i to 32 bits for bit index computations to prevent incorrect behavior on very large dim.

Casting i (uint64_t) to uint32_t will wrap when dim >= 2^32, causing incorrect bit selection in flip. Unless dim < 2^32 is guaranteed, keep bit_idx as uint64_t/size_t and remove the cast, and apply the same treatment to all bit_idx uses in the SIMD loops.

Comment thread src/simd/avx.cpp
Comment on lines +1286 to +1287
uint32_t bit_idx = static_cast<uint32_t>(i);
uint16_t flip_bits = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Using a 32-bit bit_idx for indexing can overflow for large dimensions.

Casting i to uint32_t here means bit selection breaks once dim exceeds 2^32 - 1, as in the SSE path. Keeping bit_idx (and the related indexing math) as uint64_t/size_t avoids this overflow while preserving performance on typical platforms.

Comment thread src/simd/avx2.cpp
Comment on lines +1436 to +1437
uint32_t bit_idx = static_cast<uint32_t>(i);
uint16_t flip_bits = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Same potential overflow of bit_idx in AVX2 path as in SSE/AVX implementations.

Keep bit_idx as a 64-bit value instead of truncating to uint32_t so very large dim values don’t silently wrap and behavior stays consistent with the other implementations.

Comment on lines +466 to +475
if (SimdStatus::SupportSSE()) {
auto* sse_data = sse_datas.data() + i * dim;
sse::FlipSign(flip, sse_data, dim);
for (int j = 0; j < dim; j++) {
REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
}
}
if (SimdStatus::SupportAVX()) {
auto* avx_data = avx_datas.data() + i * dim;
avx::FlipSign(flip, avx_data, dim);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add explicit tests that exercise byte-boundary and tail cases for SSE/AVX/AVX2 FlipSign

The current checks validate SSE/AVX/AVX2 against the generic path, but they don’t guarantee we hit the edge cases most prone to SIMD bugs: (a) flips whose bit indices cross byte boundaries (e.g., dims 7, 8, 9, 15, 16, …) and (b) dims that trigger the scalar cleanup (non‑multiples of the SIMD width). Please add or extend a test that (1) uses hand-crafted flip masks around byte boundaries (e.g., flipping only indices 7–9, 15–17, etc.) and (2) explicitly exercises non-multiple dims for each ISA (e.g., 1, 3, 5, 7, 9 for SSE; 5, 9, 13 for AVX/AVX2) so we cover those tail and bit-indexing regions deterministically.

Suggested implementation:

TEST_CASE("FlipSign SIMD byte-boundary and tail cases", "[simd][flip_sign][boundary]") {
    constexpr float delta = 1e-5f;

    // Indices chosen to cross byte boundaries: bits 7–9, 15–17, ...
    const std::array<int, 6> boundary_indices = {7, 8, 9, 15, 16, 17};

    auto make_flip_mask = [&](int dim) {
        const std::size_t flip_size = (dim + 7) / 8;
        std::vector<uint8_t> flip(flip_size, uint8_t{0});
        for (int idx : boundary_indices) {
            if (idx >= dim) {
                continue;
            }
            const int byte = idx / 8;
            const int bit = idx % 8;
            flip[byte] |= static_cast<uint8_t>(1u << bit);
        }
        return flip;
    };

    auto run_dim = [&](int dim) {
        auto flip = make_flip_mask(dim);

        std::vector<float> base(dim);
        for (int i = 0; i < dim; ++i) {
            base[i] = static_cast<float>(i + 1);
        }

        std::vector<float> gt_data = base;
        generic::FlipSign(flip.data(), gt_data.data(), dim);

        if (SimdStatus::SupportSSE()) {
            std::vector<float> sse_data = base;
            sse::FlipSign(flip.data(), sse_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX()) {
            std::vector<float> avx_data = base;
            avx::FlipSign(flip.data(), avx_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX2()) {
            std::vector<float> avx2_data = base;
            avx2::FlipSign(flip.data(), avx2_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx2_data[j]) < delta);
            }
        }
    };

    // Dims that are not multiples of typical SIMD widths, to force scalar tails.
    const std::array<int, 5> sse_dims = {1, 3, 5, 7, 9};
    const std::array<int, 3> avx_dims = {5, 9, 13};

    for (int dim : sse_dims) {
        run_dim(dim);
    }
    for (int dim : avx_dims) {
        run_dim(dim);
    }
}

    std::vector<uint8_t> flips = fixtures::GenerateVectors<uint8_t>(count, flip_size);

    BENCHMARK_SIMD_FLIP_SIGN(generic, FlipSign);
  1. Ensure #include <array> is present near the top of src/simd/rabitq_simd_test.cpp, since the new test uses std::array.
  2. If the file wraps tests in a namespace (e.g., namespace milvus::simd { ... }), verify that the new TEST_CASE is placed inside the same namespace scope (the replacement above assumes it is already within the correct scope).
  3. If the existing tests use a specific float type alias (e.g., using value_t = float;), you may want to replace float in the new test with that alias for consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant