feat: add FlipSign SIMD implementation for SSE/AVX/AVX2 by LHT129 · Pull Request #10 · LHT129/vsag

LHT129 · 2026-04-24T09:06:30Z

Summary

Add FlipSign function to sse namespace using _mm_xor_ps (4 floats per iteration)
Add FlipSign function to avx namespace using _mm256_xor_ps (8 floats per iteration)
Add FlipSign function to avx2 namespace using _mm256_xor_si256 (8 floats per iteration)
Update rabitq_simd.h with declarations for sse and avx2 namespaces
Add unit tests for SSE/AVX/AVX2 FlipSign implementations

Test Results

All 2,849,632 assertions passed (flip_sign + pattern tests)
Tested with multiple dimensions and random flip patterns
Precision: error < 1e-5f for standard test, < 1e-6f for pattern test

Summary by Sourcery

Add SIMD-optimized FlipSign implementations for SSE, AVX, and AVX2 and integrate them into the SIMD interface, tests, and benchmarks.

New Features:

Introduce FlipSign function in the SSE namespace operating on float vectors.
Introduce FlipSign function in the AVX namespace operating on float vectors.
Introduce FlipSign function in the AVX2 namespace operating on float vectors and wire it into the shared SIMD header.

Tests:

Extend flip_sign correctness tests to validate SSE, AVX, and AVX2 implementations against the generic version.
Add SSE, AVX, and AVX2 FlipSign variants to the SIMD FlipSign benchmark suite.

- Add FlipSign function to sse namespace using _mm_xor_ps - Add FlipSign function to avx namespace using _mm256_xor_ps - Add FlipSign function to avx2 namespace using _mm256_xor_si256 - Update rabitq_simd.h with declarations for sse and avx2 namespaces - Add unit tests for SSE/AVX/AVX2 FlipSign implementations

sourcery-ai · 2026-04-24T09:06:37Z

Reviewer's Guide

Adds SIMD-optimized FlipSign implementations for SSE, AVX, and AVX2, wires them into the shared SIMD header, and extends unit tests and benchmarks to validate and compare the new paths against the generic implementation.

Class diagram for SIMD FlipSign functions in SSE/AVX/AVX2 namespaces

classDiagram
    namespace_generic <|.. namespace_sse : fallback
    namespace_sse <|.. namespace_avx : fallback
    namespace_avx <|.. namespace_avx2 : fallback

    class namespace_generic {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
    }

    class namespace_sse {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
        +RotateOp(data float*, idx int, dim_ int, step int)
        +FHTRotate(data float*, len uint64_t)
    }

    class namespace_avx {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
        +RotateOp(data float*, idx int, dim_ int, step int)
    }

    class namespace_avx2 {
        <<namespace>>
        +FlipSign(flip uint8_t*, data float*, dim uint64_t)
        +KacsWalk(data float*, len uint64_t)
        +VecRescale(data float*, dim uint64_t, val float)
    }

File-Level Changes

Change	Details	Files
Implement SSE FlipSign using 4-float chunks with XOR on the sign bit and scalar fallback for tail elements.	Iterate over input floats in chunks of 4 when SSE is enabled Decode 4 flip bits from the packed uint8_t flip array into a 4-element 0x80000000/0 mask Load mask as __m128i, cast to __m128 and XOR with data vector using _mm_xor_ps Handle remaining elements with scalar sign flips when dimension is not a multiple of 4 Fallback to generic::FlipSign when SSE is not enabled	`src/simd/sse.cpp`
Implement AVX FlipSign using 8-float chunks with AVX XOR and scalar tail handling, with SSE/AVX fallbacks.	Process data in 8-float chunks when AVX is enabled Extract 8 flip bits per chunk from packed flip array into 8-element sign mask with 0x80000000 or 0 Load mask as __m256i and XOR with data vector via _mm256_xor_ps Apply scalar per-element sign flipping for leftover elements Fallback to sse::FlipSign when AVX is not enabled	`src/simd/avx.cpp`
Implement AVX2 FlipSign using integer XOR on packed floats, reusing AVX behavior and providing fallback.	Process 8 floats at a time when AVX2 is enabled Build 8-wide 0x80000000/0 mask from packed flip bits Load mask as __m256i and flip sign by XOR-ing integer representation with _mm256_xor_si256 and reinterpreting as floats Handle non-multiple-of-8 tails with scalar sign flips Fallback to avx::FlipSign when AVX2 is not enabled	`src/simd/avx2.cpp`
Expose FlipSign declarations for AVX and AVX2 in the shared SIMD header.	Add FlipSign function declaration to avx2 namespace Add FlipSign function declaration to avx namespace	`src/simd/rabitq_simd.h`
Extend SIMD tests and benchmarks to cover SSE, AVX, and AVX2 FlipSign implementations.	In flip_sign correctness tests, invoke sse::FlipSign, avx::FlipSign, and avx2::FlipSign conditionally on runtime SIMD support and compare against generic results Update FlipSign benchmark macro usage to benchmark the new SSE, AVX, and AVX2 implementations under capability checks Extend pattern-based FlipSign correctness tests to validate SSE, AVX, and AVX2 paths against generic output with a tight error bound	`src/simd/rabitq_simd_test.cpp`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 4 issues, and left some high level feedback:

The AVX and AVX2 FlipSign implementations are nearly identical (bit extraction + mask construction); consider factoring this into a shared helper to avoid duplication and keep future changes in sync.
In avx::FlipSign the #else branch calls sse::FlipSign, which assumes SSE is available whenever AVX is not; if that's not guaranteed by build flags, it may be safer to fall back directly to generic::FlipSign for consistency with other paths.
The bit extraction in FlipSign (SSE/AVX/AVX2) repeatedly indexes flip[(bit_idx + j) / 8] within the inner loop; you could precompute the base byte and intra-byte offsets to reduce repeated division/modulo and simplify the logic.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The AVX and AVX2 `FlipSign` implementations are nearly identical (bit extraction + mask construction); consider factoring this into a shared helper to avoid duplication and keep future changes in sync.
- In `avx::FlipSign` the `#else` branch calls `sse::FlipSign`, which assumes SSE is available whenever AVX is not; if that's not guaranteed by build flags, it may be safer to fall back directly to `generic::FlipSign` for consistency with other paths.
- The bit extraction in `FlipSign` (SSE/AVX/AVX2) repeatedly indexes `flip[(bit_idx + j) / 8]` within the inner loop; you could precompute the base byte and intra-byte offsets to reduce repeated division/modulo and simplify the logic.

## Individual Comments

### Comment 1
<location path="src/simd/sse.cpp" line_range="1363-1364" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
</code_context>
<issue_to_address>
**issue (bug_risk):** Avoid narrowing `i` to 32 bits for bit index computations to prevent incorrect behavior on very large `dim`.

Casting `i` (uint64_t) to `uint32_t` will wrap when `dim >= 2^32`, causing incorrect bit selection in `flip`. Unless `dim < 2^32` is guaranteed, keep `bit_idx` as `uint64_t`/`size_t` and remove the cast, and apply the same treatment to all `bit_idx` uses in the SIMD loops.
</issue_to_address>

### Comment 2
<location path="src/simd/avx.cpp" line_range="1286-1287" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
+            flip_bits |= ((flip[(bit_idx + j) / 8] >> ((bit_idx + j) % 8)) & 1) << j;
</code_context>
<issue_to_address>
**issue (bug_risk):** Using a 32-bit `bit_idx` for indexing can overflow for large dimensions.

Casting `i` to `uint32_t` here means bit selection breaks once `dim` exceeds `2^32 - 1`, as in the SSE path. Keeping `bit_idx` (and the related indexing math) as `uint64_t`/`size_t` avoids this overflow while preserving performance on typical platforms.
</issue_to_address>

### Comment 3
<location path="src/simd/avx2.cpp" line_range="1436-1437" />
<code_context>
+    constexpr uint64_t kFloatsPerChunk = 8;
+    uint64_t i = 0;
+    for (; i + kFloatsPerChunk <= dim; i += kFloatsPerChunk) {
+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;
+        for (int j = 0; j < 8; j++) {
+            flip_bits |= ((flip[(bit_idx + j) / 8] >> ((bit_idx + j) % 8)) & 1) << j;
</code_context>
<issue_to_address>
**issue (bug_risk):** Same potential overflow of `bit_idx` in AVX2 path as in SSE/AVX implementations.

Keep `bit_idx` as a 64-bit value instead of truncating to `uint32_t` so very large `dim` values don’t silently wrap and behavior stays consistent with the other implementations.
</issue_to_address>

### Comment 4
<location path="src/simd/rabitq_simd_test.cpp" line_range="466-475" />
<code_context>

             generic::FlipSign(flip, gt_data, dim);

+            if (SimdStatus::SupportSSE()) {
+                auto* sse_data = sse_datas.data() + i * dim;
+                sse::FlipSign(flip, sse_data, dim);
+                for (int j = 0; j < dim; j++) {
+                    REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
+                }
+            }
+            if (SimdStatus::SupportAVX()) {
+                auto* avx_data = avx_datas.data() + i * dim;
+                avx::FlipSign(flip, avx_data, dim);
+                for (int j = 0; j < dim; j++) {
+                    REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta);
+                }
+            }
+            if (SimdStatus::SupportAVX2()) {
+                auto* avx2_data = avx2_datas.data() + i * dim;
+                avx2::FlipSign(flip, avx2_data, dim);
</code_context>
<issue_to_address>
**suggestion (testing):** Add explicit tests that exercise byte-boundary and tail cases for SSE/AVX/AVX2 FlipSign

The current checks validate SSE/AVX/AVX2 against the generic path, but they don’t guarantee we hit the edge cases most prone to SIMD bugs: (a) flips whose bit indices cross byte boundaries (e.g., dims 7, 8, 9, 15, 16, …) and (b) dims that trigger the scalar cleanup (non‑multiples of the SIMD width). Please add or extend a test that (1) uses hand-crafted flip masks around byte boundaries (e.g., flipping only indices 7–9, 15–17, etc.) and (2) explicitly exercises non-multiple dims for each ISA (e.g., 1, 3, 5, 7, 9 for SSE; 5, 9, 13 for AVX/AVX2) so we cover those tail and bit-indexing regions deterministically.

Suggested implementation:

```cpp
TEST_CASE("FlipSign SIMD byte-boundary and tail cases", "[simd][flip_sign][boundary]") {
    constexpr float delta = 1e-5f;

    // Indices chosen to cross byte boundaries: bits 7–9, 15–17, ...
    const std::array<int, 6> boundary_indices = {7, 8, 9, 15, 16, 17};

    auto make_flip_mask = [&](int dim) {
        const std::size_t flip_size = (dim + 7) / 8;
        std::vector<uint8_t> flip(flip_size, uint8_t{0});
        for (int idx : boundary_indices) {
            if (idx >= dim) {
                continue;
            }
            const int byte = idx / 8;
            const int bit = idx % 8;
            flip[byte] |= static_cast<uint8_t>(1u << bit);
        }
        return flip;
    };

    auto run_dim = [&](int dim) {
        auto flip = make_flip_mask(dim);

        std::vector<float> base(dim);
        for (int i = 0; i < dim; ++i) {
            base[i] = static_cast<float>(i + 1);
        }

        std::vector<float> gt_data = base;
        generic::FlipSign(flip.data(), gt_data.data(), dim);

        if (SimdStatus::SupportSSE()) {
            std::vector<float> sse_data = base;
            sse::FlipSign(flip.data(), sse_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX()) {
            std::vector<float> avx_data = base;
            avx::FlipSign(flip.data(), avx_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta);
            }
        }

        if (SimdStatus::SupportAVX2()) {
            std::vector<float> avx2_data = base;
            avx2::FlipSign(flip.data(), avx2_data.data(), dim);
            for (int j = 0; j < dim; ++j) {
                REQUIRE(std::abs(gt_data[j] - avx2_data[j]) < delta);
            }
        }
    };

    // Dims that are not multiples of typical SIMD widths, to force scalar tails.
    const std::array<int, 5> sse_dims = {1, 3, 5, 7, 9};
    const std::array<int, 3> avx_dims = {5, 9, 13};

    for (int dim : sse_dims) {
        run_dim(dim);
    }
    for (int dim : avx_dims) {
        run_dim(dim);
    }
}

    std::vector<uint8_t> flips = fixtures::GenerateVectors<uint8_t>(count, flip_size);

    BENCHMARK_SIMD_FLIP_SIGN(generic, FlipSign);

```

1. Ensure `#include <array>` is present near the top of `src/simd/rabitq_simd_test.cpp`, since the new test uses `std::array`.  
2. If the file wraps tests in a namespace (e.g., `namespace milvus::simd { ... }`), verify that the new `TEST_CASE` is placed inside the same namespace scope (the replacement above assumes it is already within the correct scope).  
3. If the existing tests use a specific float type alias (e.g., `using value_t = float;`), you may want to replace `float` in the new test with that alias for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-04-24T09:07:57Z

+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint8_t flip_bits = 0;


issue (bug_risk): Avoid narrowing i to 32 bits for bit index computations to prevent incorrect behavior on very large dim.

Casting i (uint64_t) to uint32_t will wrap when dim >= 2^32, causing incorrect bit selection in flip. Unless dim < 2^32 is guaranteed, keep bit_idx as uint64_t/size_t and remove the cast, and apply the same treatment to all bit_idx uses in the SIMD loops.

sourcery-ai · 2026-04-24T09:07:57Z

+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;


issue (bug_risk): Using a 32-bit bit_idx for indexing can overflow for large dimensions.

Casting i to uint32_t here means bit selection breaks once dim exceeds 2^32 - 1, as in the SSE path. Keeping bit_idx (and the related indexing math) as uint64_t/size_t avoids this overflow while preserving performance on typical platforms.

sourcery-ai · 2026-04-24T09:07:57Z

+        uint32_t bit_idx = static_cast<uint32_t>(i);
+        uint16_t flip_bits = 0;


issue (bug_risk): Same potential overflow of bit_idx in AVX2 path as in SSE/AVX implementations.

Keep bit_idx as a 64-bit value instead of truncating to uint32_t so very large dim values don’t silently wrap and behavior stays consistent with the other implementations.

sourcery-ai · 2026-04-24T09:07:57Z

+            if (SimdStatus::SupportSSE()) {
+                auto* sse_data = sse_datas.data() + i * dim;
+                sse::FlipSign(flip, sse_data, dim);
+                for (int j = 0; j < dim; j++) {
+                    REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta);
+                }
+            }
+            if (SimdStatus::SupportAVX()) {
+                auto* avx_data = avx_datas.data() + i * dim;
+                avx::FlipSign(flip, avx_data, dim);


suggestion (testing): Add explicit tests that exercise byte-boundary and tail cases for SSE/AVX/AVX2 FlipSign

The current checks validate SSE/AVX/AVX2 against the generic path, but they don’t guarantee we hit the edge cases most prone to SIMD bugs: (a) flips whose bit indices cross byte boundaries (e.g., dims 7, 8, 9, 15, 16, …) and (b) dims that trigger the scalar cleanup (non‑multiples of the SIMD width). Please add or extend a test that (1) uses hand-crafted flip masks around byte boundaries (e.g., flipping only indices 7–9, 15–17, etc.) and (2) explicitly exercises non-multiple dims for each ISA (e.g., 1, 3, 5, 7, 9 for SSE; 5, 9, 13 for AVX/AVX2) so we cover those tail and bit-indexing regions deterministically.

Suggested implementation:

TEST_CASE("FlipSign SIMD byte-boundary and tail cases", "[simd][flip_sign][boundary]") { constexpr float delta = 1e-5f; // Indices chosen to cross byte boundaries: bits 7–9, 15–17, ... const std::array<int, 6> boundary_indices = {7, 8, 9, 15, 16, 17}; auto make_flip_mask = [&](int dim) { const std::size_t flip_size = (dim + 7) / 8; std::vector<uint8_t> flip(flip_size, uint8_t{0}); for (int idx : boundary_indices) { if (idx >= dim) { continue; } const int byte = idx / 8; const int bit = idx % 8; flip[byte] |= static_cast<uint8_t>(1u << bit); } return flip; }; auto run_dim = [&](int dim) { auto flip = make_flip_mask(dim); std::vector<float> base(dim); for (int i = 0; i < dim; ++i) { base[i] = static_cast<float>(i + 1); } std::vector<float> gt_data = base; generic::FlipSign(flip.data(), gt_data.data(), dim); if (SimdStatus::SupportSSE()) { std::vector<float> sse_data = base; sse::FlipSign(flip.data(), sse_data.data(), dim); for (int j = 0; j < dim; ++j) { REQUIRE(std::abs(gt_data[j] - sse_data[j]) < delta); } } if (SimdStatus::SupportAVX()) { std::vector<float> avx_data = base; avx::FlipSign(flip.data(), avx_data.data(), dim); for (int j = 0; j < dim; ++j) { REQUIRE(std::abs(gt_data[j] - avx_data[j]) < delta); } } if (SimdStatus::SupportAVX2()) { std::vector<float> avx2_data = base; avx2::FlipSign(flip.data(), avx2_data.data(), dim); for (int j = 0; j < dim; ++j) { REQUIRE(std::abs(gt_data[j] - avx2_data[j]) < delta); } } }; // Dims that are not multiples of typical SIMD widths, to force scalar tails. const std::array<int, 5> sse_dims = {1, 3, 5, 7, 9}; const std::array<int, 3> avx_dims = {5, 9, 13}; for (int dim : sse_dims) { run_dim(dim); } for (int dim : avx_dims) { run_dim(dim); } } std::vector<uint8_t> flips = fixtures::GenerateVectors<uint8_t>(count, flip_size); BENCHMARK_SIMD_FLIP_SIGN(generic, FlipSign);

Ensure #include <array> is present near the top of src/simd/rabitq_simd_test.cpp, since the new test uses std::array.

If the file wraps tests in a namespace (e.g., namespace milvus::simd { ... }), verify that the new TEST_CASE is placed inside the same namespace scope (the replacement above assumes it is already within the correct scope).

If the existing tests use a specific float type alias (e.g., using value_t = float;), you may want to replace float in the new test with that alias for consistency.

sourcery-ai Bot reviewed Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add FlipSign SIMD implementation for SSE/AVX/AVX2#10

Hidden character warning

feat: add FlipSign SIMD implementation for SSE/AVX/AVX2#10
LHT129 wants to merge 1 commit into
mainfrom
opencode/补齐-flipsign-simd

LHT129 commented Apr 24, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented Apr 24, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Apr 24, 2026

Uh oh!

sourcery-ai Bot Apr 24, 2026

Uh oh!

sourcery-ai Bot Apr 24, 2026

Uh oh!

sourcery-ai Bot Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		uint32_t bit_idx = static_cast<uint32_t>(i);
		uint8_t flip_bits = 0;

		uint32_t bit_idx = static_cast<uint32_t>(i);
		uint16_t flip_bits = 0;

Conversation

LHT129 commented Apr 24, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Class diagram for SIMD FlipSign functions in SSE/AVX/AVX2 namespaces

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LHT129 commented Apr 24, 2026 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Apr 24, 2026 •

edited

Loading