
Partial Optimization for IVFPQ #4893

Open
littleniuer wants to merge 1 commit into facebookresearch:main from littleniuer:main

Conversation

@littleniuer

Summary

This PR introduces ARM NEON optimizations for the IVFPQ index, focusing on 8-bit lookup table operations and distance computations.

Changes

  • Optimized 8-bit LUT (lookup table) construction and distance calculation
  • Implemented algorithmic improvements combined with NEON intrinsics
  • All core optimization code is placed under faiss/sra_krl/ for clear organization

Design Decisions

  • Code is organized in a separate directory (faiss/sra_krl/) to keep the file structure clean and facilitate initial code review
  • Guarded by __aarch64__ macro, no impact on x86 builds
  • Functionally equivalent to the original implementation (no changes to index format or search results)
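As a rough illustration of the 8-bit LUT idea targeted here, the following is a portable scalar sketch (all names are hypothetical, not the PR's actual code): the per-subquantizer float distance table is quantized to uint8 with a shared scale/bias, and each PQ code's distance becomes a sum of byte lookups, which is exactly the operation a NEON kernel can do 16 lanes at a time with `vqtbl1q_u8`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: quantize a float lookup table to 8 bits, then
// accumulate per-code distances via byte lookups (the operation a NEON
// kernel would vectorize 16 lanes at a time).
struct QuantizedLUT {
    std::vector<uint8_t> tab; // M subquantizers x ksub entries
    float scale, bias;        // entry ~= tab * scale + bias
};

QuantizedLUT quantize_lut(const std::vector<float>& lut, size_t M, size_t ksub) {
    float lo = *std::min_element(lut.begin(), lut.end());
    float hi = *std::max_element(lut.begin(), lut.end());
    QuantizedLUT q;
    q.bias = lo;
    q.scale = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;
    q.tab.resize(M * ksub);
    for (size_t i = 0; i < lut.size(); i++)
        q.tab[i] = (uint8_t)std::lround((lut[i] - lo) / q.scale);
    return q;
}

// Distance of one PQ-encoded vector: sum of M 8-bit table lookups,
// dequantized once at the end.
float lut8_distance(const QuantizedLUT& q, const uint8_t* code, size_t M, size_t ksub) {
    uint32_t acc = 0;
    for (size_t m = 0; m < M; m++)
        acc += q.tab[m * ksub + code[m]];
    return acc * q.scale + (float)M * q.bias;
}
```

The integer accumulator is the key design point: once the table is in bytes, the inner scan needs no float arithmetic at all, and dequantization happens once per vector.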

Testing

  • Passed FAISS built-in unit tests on AArch64 platform
  • Benchmark results show noticeable performance improvements on ARM servers

Notes

We understand that naming conventions and NEON code placement may need adjustments to align with the ongoing SIMD restructuring. We are happy to collaborate with maintainers to refine the code structure as needed.

If this contribution is well-received, we have additional optimizations for HNSW, Refine, and FastScan ready for follow-up PRs.

@littleniuer
Author

I noticed two test failures (test_precomputed_tables and test_hnsw_2level_mixed_search) on aarch64 with my changes. The root cause is a minor floating-point precision difference (~3.8e-06 max absolute error) introduced by the NEON-optimized PQ distance table computation. Specifically, the optimized kernel uses a transposed data layout and processes 16 centroids in parallel with a dimension-wise accumulation order, which differs from the original per-centroid accumulation. Since float32 addition is not associative, this leads to ~1 ULP differences in the distance values.

Both tests currently use bitwise-exact comparison (assert_array_equal / np.all(==)) for distance results. I've verified that relaxing the distance comparison to np.testing.assert_allclose(atol=1e-5, rtol=1e-6) resolves the issue. Note that only the distance (D) comparisons are relaxed — all index (I) comparisons remain strictly equal, confirming that the search results themselves are not affected.

All other 1087 tests pass without any modifications.

@mnorris11
Contributor

cc @algoriddle

@algoriddle
Contributor

algoriddle commented Mar 23, 2026

Thanks for the contribution and the detailed benchmarks, the NEON optimization results look promising.

Unfortunately this PR can't be merged in its current form. The core issue is that sra_krl is a separate library with its own C API, handle lifecycle, and coding conventions being dropped into the Faiss tree. Faiss integrates SIMD optimizations directly into its existing modules as C++ template specializations, not as external libraries called via extern "C" handles.

We're also in the middle of migrating all SIMD code to a dynamic dispatch (DD) framework, so new SIMD contributions need to follow that pattern. Specifically:

1. No separate library directory. Faiss puts per-ISA code in the existing module directories (e.g., impl/pq_code_distance/pq_code_distance-avx2.cpp, utils/simd_impl/distances_aarch64.cpp), not in a separate sra_krl/ subtree. The kernels should be C++ template specializations tagged with SIMDLevel::ARM_NEON, not C functions behind opaque handles.

2. Use COMPILE_SIMD_ARM_NEON / SIMDLevel templates instead of #ifdef __aarch64__. The DD infrastructure uses build-system-controlled COMPILE_SIMD_* defines and template <SIMDLevel SL> to tag SIMD specializations. Sprinkling #ifdef __aarch64__ through core files like IndexIVFPQ.cpp and ProductQuantizer.cpp is the old monolithic pattern we're moving away from.

3. No .c files with C++ features forced to compile via LANGUAGE CXX; just use .cpp.
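A miniature of the pattern in points 1 and 2 might look like the following. `SIMDLevel` and `COMPILE_SIMD_ARM_NEON` follow the names above; everything else is an illustrative stand-in, not the actual Faiss API.

```cpp
#include <cstddef>

// Simplified stand-in for the dynamic-dispatch pattern: kernels are
// C++ templates tagged by SIMD level, with per-ISA specializations in
// their own compilation units, selected at runtime.
enum class SIMDLevel { NONE, ARM_NEON };

// Generic (scalar) kernel, tagged by SIMD level.
template <SIMDLevel SL>
struct PQCodeDistance {
    static float run(const float* tab, const unsigned char* code,
                     size_t M, size_t ksub) {
        float acc = 0.f;
        for (size_t m = 0; m < M; m++)
            acc += tab[m * ksub + code[m]];
        return acc;
    }
};

#ifdef COMPILE_SIMD_ARM_NEON
// A NEON specialization would live in its own compilation unit
// (e.g. a *-neon.cpp file) built with the right target flags.
template <>
struct PQCodeDistance<SIMDLevel::ARM_NEON> { /* NEON intrinsics here */ };
#endif

// Runtime dispatch picks the best level detected on the host.
float pq_code_distance(const float* tab, const unsigned char* code,
                       size_t M, size_t ksub, SIMDLevel level) {
    switch (level) {
#ifdef COMPILE_SIMD_ARM_NEON
        case SIMDLevel::ARM_NEON:
            return PQCodeDistance<SIMDLevel::ARM_NEON>::run(tab, code, M, ksub);
#endif
        default:
            return PQCodeDistance<SIMDLevel::NONE>::run(tab, code, M, ksub);
    }
}
```

The build system, not the source file, decides whether `COMPILE_SIMD_ARM_NEON` is defined, which is what keeps `#ifdef __aarch64__` out of core files.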

I realize this is a significant rework since the kernels appear to come from an existing library (Kunpeng BoostKit?). The algorithmic work in the NEON kernels is valuable, but the integration approach needs to match Faiss conventions.

I've opened a PR with a migration guide that explains the DD architecture and the step-by-step recipe for adding SIMD code: PR #4973. The key sections to look at are "The Conversion Recipe" and "Common Patterns." For the PQ lookup table kernel specifically, the right integration point would be impl/pq_code_distance/ as a NEON specialization (similar to pq_code_distance-sve.cpp).

Happy to answer questions about the restructuring, feel free to ask here or on issue #4763.

cc @mdouze

@littleniuer littleniuer force-pushed the main branch 2 times, most recently from 7d83232 to 1fe669c on March 25, 2026 at 03:54
@littleniuer
Author

littleniuer commented Mar 25, 2026

Thanks for the thorough review and the pointers to the DD migration guide — really appreciate it.

I've reworked the PR to align with Faiss conventions:

  1. Removed the standalone sra_krl/ library. NEON kernels are now integrated directly into the existing Faiss module directories as C++ template specializations:
    • faiss/impl/pq_code_distance/pq_code_distance-neon.cpp (following the pattern of pq_code_distance-sve.cpp)
    • faiss/impl/ProductQuantizer-neon.cpp
    • faiss/utils/simd_impl/distances_neon.cpp
    • faiss/utils/simd_impl/matrix_transpose_neon.cpp
  2. Replaced all #ifdef __aarch64__ guards with COMPILE_SIMD_ARM_NEON / SIMDLevel::ARM_NEON templates as described in the DD framework. Modified files include IndexIVFPQ.cpp, ProductQuantizer.cpp, platform_macros.h, etc.
  3. All new files are .cpp — no more .c with LANGUAGE CXX workarounds.
  4. Updated CMakeLists.txt to register the new NEON compilation units.

The algorithmic logic of the NEON kernels is preserved; only the integration layer has been restructured. Benchmarks remain consistent with the numbers reported earlier.

Looking forward to another round of review. Happy to iterate further if anything still needs adjustment.

@littleniuer littleniuer force-pushed the main branch 2 times, most recently from 69d4c11 to 6130f30 on March 25, 2026 at 04:06
@mdouze
Contributor

mdouze commented Apr 2, 2026

Could you rebase on the current Faiss?
DISPATCH_SIMDLevel does not exist anymore; it has been replaced with with_simd_level.
Unfortunately, the SIMD implementation is evolving quite a bit these days...

@littleniuer
Author

Hi @mdouze,

Thanks for the heads-up. I've rebased onto the current main and migrated all SIMD code from DISPATCH_SIMDLevel to with_simd_level / with_selected_simd_levels.

Key changes in this update:

  • Replaced all DISPATCH_SIMDLevel usage with with_simd_level lambda pattern
  • Removed #ifdef COMPILE_SIMD_ARM_NEON from public headers (ProductQuantizer.h) to keep ABI consistent across platforms — the NEON-specific struct members are now always visible but remain empty on non-aarch64
  • All COMPILE_SIMD_* definitions kept as PRIVATE in CMakeLists.txt, consistent with upstream convention
  • NEON optimizations remain in dedicated -neon.cpp files under existing module directories
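For reference, a hypothetical with_simd_level-style dispatcher (the real Faiss helper may differ in name details and signature) looks roughly like this: the caller passes a generic lambda, and the dispatcher instantiates it with the SIMD level selected at runtime.

```cpp
#include <type_traits>

// Hypothetical sketch of a with_simd_level-style helper; the actual
// Faiss API may differ. The lambda receives the level as a
// compile-time tag, so it can instantiate level-tagged kernels.
enum class SIMDLevel { NONE, ARM_NEON };

template <typename F>
auto with_simd_level(SIMDLevel level, F&& f) {
    switch (level) {
        case SIMDLevel::ARM_NEON:
            return f(std::integral_constant<SIMDLevel, SIMDLevel::ARM_NEON>{});
        default:
            return f(std::integral_constant<SIMDLevel, SIMDLevel::NONE>{});
    }
}

// A kernel tagged by level; only the scalar version is shown here.
template <SIMDLevel SL>
float dot(const float* a, const float* b, int n) {
    float acc = 0.f;
    for (int i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

float dispatch_dot(const float* a, const float* b, int n, SIMDLevel level) {
    return with_simd_level(level, [&](auto tag) {
        return dot<decltype(tag)::value>(a, b, n);
    });
}
```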

Performance: roughly 10% or better average QPS improvement on ann-benchmarks datasets (glove-100-angular, etc.) on aarch64.

Regarding CI: there is 1 test failure in test_Flat_subset_type_2 with max absolute difference of 1.9e-06. This is expected — the NEON batch distance computation changes the floating-point accumulation order compared to the scalar path. IEEE 754 does not guarantee associativity, so (a + b) + c ≠ a + (b + c) when using SIMD vectorized reductions. The existing AVX2/SVE paths have the same characteristic. Happy to add a tolerance-based comparison if preferred.
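Such a tolerance-based check, with the same semantics as np.testing.assert_allclose (pass when |a - b| <= atol + rtol * |b| element-wise), is only a few lines in C++. This is a sketch, not the actual test code:

```cpp
#include <cmath>
#include <cstddef>

// assert_allclose-style comparison, NumPy semantics:
// pass when |a[i] - b[i]| <= atol + rtol * |b[i]| for every element.
bool allclose(const float* a, const float* b, size_t n,
              float rtol = 1e-6f, float atol = 1e-5f) {
    for (size_t i = 0; i < n; i++)
        if (std::fabs(a[i] - b[i]) > atol + rtol * std::fabs(b[i]))
            return false;
    return true;
}
```

With atol = 1e-5, a 1.9e-06 maximum absolute difference like the one above passes while genuinely wrong distances still fail.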

Please let me know if anything else needs adjustment.

