Partial Optimization for IVFPQ #4893

littleniuer wants to merge 1 commit into facebookresearch:main from
Conversation
I noticed two test failures (test_precomputed_tables and test_hnsw_2level_mixed_search) on aarch64 with my changes. The root cause is a minor floating-point precision difference (~3.8e-06 max absolute error) introduced by the NEON-optimized PQ distance table computation. Specifically, the optimized kernel uses a transposed data layout and processes 16 centroids in parallel with a dimension-wise accumulation order, which differs from the original per-centroid accumulation. Since float32 addition is not associative, this leads to ~1 ULP differences in the distance values. Both tests currently use bitwise-exact comparison (assert_array_equal / np.all(==)) for distance results. I've verified that relaxing the distance comparison to np.testing.assert_allclose(atol=1e-5, rtol=1e-6) resolves the issue. Note that only the distance (D) comparisons are relaxed; all index (I) comparisons remain strictly equal, confirming that the search results themselves are not affected. All other 1087 tests pass without any modifications.
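To make the relaxation concrete, here is a small numpy sketch. The arrays and values below are illustrative (not taken from the actual test suite); it demonstrates that float32 sums depend on accumulation order, and shows the proposed tolerance check side by side with the still-exact index comparison:

```python
import numpy as np

# Float32 addition is not associative: summing the same terms in a
# different order (as a 16-wide, dimension-first kernel does) can shift
# the result by about 1 ULP relative to per-centroid accumulation.
rng = np.random.default_rng(0)
terms = rng.standard_normal(64).astype(np.float32)

forward = np.float32(0.0)
for t in terms:                    # per-centroid accumulation order
    forward += t
reverse = np.float32(0.0)
for t in terms[::-1]:              # a different accumulation order
    reverse += t
# forward and reverse agree to float32 precision, not necessarily bitwise.

# Proposed test relaxation: tolerance on distances, exact match on indices.
D_ref = np.array([[0.5, 1.5]], dtype=np.float32)        # illustrative values
D_neon = D_ref + np.float32(3.8e-06)                    # max error from the PR
I_ref = np.array([[3, 9]], dtype=np.int64)
I_neon = I_ref.copy()

np.testing.assert_allclose(D_neon, D_ref, atol=1e-5, rtol=1e-6)  # passes
np.testing.assert_array_equal(I_neon, I_ref)                     # still exact
```

The atol=1e-5 bound comfortably covers the observed ~3.8e-06 maximum error while remaining tight enough to catch real regressions.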
cc @algoriddle
Thanks for the contribution and the detailed benchmarks; the NEON optimization results look promising. Unfortunately, this PR can't be merged in its current form. We're in the middle of migrating all SIMD code to a dynamic dispatch (DD) framework, so new SIMD contributions need to follow that pattern. Specifically:

1. No separate library directory: Faiss puts per-ISA code in the existing module directories.
2. Use the DD framework for dispatch rather than an ad-hoc integration.
3. I realize this is a significant rework, since the kernels appear to come from an existing library (Kunpeng BoostKit?).

The algorithmic work in the NEON kernels is valuable, but the integration approach needs to match Faiss conventions. I've opened a PR with a migration guide that explains the DD architecture and the step-by-step recipe for adding SIMD code: PR #4973. The key sections to look at are "The Conversion Recipe" and "Common Patterns"; they also cover the right integration point for the PQ lookup table kernel specifically.

Happy to answer questions about the restructuring; feel free to ask here or on issue #4763.

cc @mdouze
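For readers unfamiliar with the dynamic-dispatch pattern being asked for here, the idea can be sketched in a few lines. Everything below is invented for illustration; the function names and the selection mechanism are not the actual Faiss DD API, which is described in the migration guide:

```python
# Illustrative sketch of runtime kernel dispatch: a selector picks the
# best implementation for the running CPU once, and generic code calls
# through it, instead of guarding every call site with compile-time ifdefs.
import platform

def l2_table_generic(x, centroids):
    """Portable reference: squared L2 distance from x to each centroid."""
    return [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]

def l2_table_neon(x, centroids):
    """Stand-in for a NEON-optimized kernel with the same contract."""
    return l2_table_generic(x, centroids)  # placeholder body for the sketch

def select_kernel():
    # A real DD framework queries CPU features at startup; machine() is a
    # crude stand-in used only for this illustration.
    if platform.machine() in ("aarch64", "arm64"):
        return l2_table_neon
    return l2_table_generic

l2_table = select_kernel()  # resolved once, called everywhere
```

The point of the pattern is that both kernels share one contract, so tests and callers never need to know which ISA variant actually ran.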
Force-pushed from 7d83232 to 1fe669c
Thanks for the thorough review and the pointers to the DD migration guide — really appreciate it. I've reworked the PR to align with Faiss conventions.
The algorithmic logic of the NEON kernels is preserved; only the integration layer has been restructured. Benchmarks remain consistent with the numbers reported earlier. Looking forward to another round of review. Happy to iterate further if anything still needs adjustment.
Force-pushed from 69d4c11 to 6130f30
Could you rebase on the current Faiss?
Hi @mdouze,

Thanks for the heads-up. I've rebased onto the current main and migrated all SIMD code from

Key changes in this update:
Performance: ~10%+ average QPS improvement on ann-benchmark datasets (glove-100-angular, etc.) on aarch64.

Regarding CI: there is 1 test failure in

Please let me know if anything else needs adjustment.
Summary
This PR introduces ARM NEON optimizations for the IVFPQ index, focusing on 8-bit lookup table operations and distance computations.
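For context on what the lookup-table kernels accelerate, here is a rough numpy sketch of the PQ asymmetric-distance trick. The shapes, seeds, and variable names are illustrative only, not Faiss internals:

```python
import numpy as np

rng = np.random.default_rng(0)
M, ksub, dsub = 4, 256, 8          # subquantizers, centroids each, sub-dim
codebooks = rng.standard_normal((M, ksub, dsub)).astype(np.float32)
codes = rng.integers(0, ksub, size=(1000, M), dtype=np.uint8)  # encoded DB

query = rng.standard_normal(M * dsub).astype(np.float32)

# Per-query lookup table: squared L2 from each query sub-vector to every
# centroid of the corresponding subquantizer.
q_sub = query.reshape(M, dsub)
tables = ((codebooks - q_sub[:, None, :]) ** 2).sum(axis=-1)   # (M, ksub)

# The asymmetric distance of each database code is then just M table
# lookups summed -- this inner loop is what NEON kernels speed up.
dist = tables[np.arange(M), codes].sum(axis=1)                 # (1000,)
```

Because the hot loop is pure gather-and-accumulate over small 8-bit codes, it vectorizes well, which is why this is a natural target for NEON.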
Changes
- All new NEON code lives under faiss/sra_krl/ for clear organization

Design Decisions

- A dedicated directory (faiss/sra_krl/) keeps the file structure clean and facilitates initial code review
- Everything is guarded by the __aarch64__ macro, so there is no impact on x86 builds

Testing
Notes
We understand that naming conventions and NEON code placement may need adjustments to align with the ongoing SIMD restructuring. We are happy to collaborate with maintainers to refine the code structure as needed.
If this contribution is well-received, we have additional optimizations for HNSW, Refine, and FastScan ready for follow-up PRs.