Fix/sindi auto heap insert#2147
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Require kind labelWonderful, this rule succeeded.
🟢 Require version labelWonderful, this rule succeeded.
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR removes the use_term_lists_heap_insert SINDI search parameter and switches heap insertion strategy selection to be automatic based on doc_prune_ratio (build-time) and query_prune_ratio (search-time).
Changes:
- Removed
use_term_lists_heap_insertfrom SINDI search parameters (parsing, serialization, constants, tests, and docs). - Added automatic heap insertion strategy selection via
SINDI::UseTermListsHeapInsert(...)and wired it into search/analyzer paths. - Updated tests to exercise pruning ratios instead of the removed boolean flag, and updated docs to reflect the new behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_sindi.cpp | Updates tests to generate search params using query_prune_ratio instead of the removed boolean flag. |
| tests/test_memleak.cpp | Changes index parameter handling to use fmt::format(parameter, dim) inside ScopedIndex. |
| src/inner_string_params.h | Removes the string key constant for use_term_lists_heap_insert. |
| src/analyzer/sindi_analyzer.cpp | Switches analyzer decision to SINDI::UseTermListsHeapInsert(search_param). |
| src/algorithm/sindi/sindi_test.cpp | Removes use_term_lists_heap_insert from unit test JSON strings. |
| src/algorithm/sindi/sindi_parameter_test.cpp | Adds an assertion that ToJson() no longer emits the removed key. |
| src/algorithm/sindi/sindi_parameter.h | Removes the boolean field from SINDISearchParameter. |
| src/algorithm/sindi/sindi_parameter.cpp | Removes parse/serialize support for use_term_lists_heap_insert. |
| src/algorithm/sindi/sindi.h | Adds UseTermListsHeapInsert(...) and stores doc_prune_ratio_. |
| src/algorithm/sindi/sindi.cpp | Implements automatic strategy selection and updates search call sites accordingly. |
| docs/sindi.md | Removes the parameter section and explains automatic strategy selection. |
| docs/docs/zh/src/indexes/sindi.md | Removes the parameter row and explains automatic strategy selection (ZH). |
| docs/docs/en/src/indexes/sindi.md | Removes the parameter row and explains automatic strategy selection (EN). |
| public: | ||
| explicit ScopedIndex(const std::string& name, | ||
| const std::string& build_parameters, | ||
| const std::string& parameter, |
| Allocator* allocator) { | ||
| resource_ = std::make_unique<vsag::Resource>(allocator, nullptr); | ||
| engine_ = std::make_unique<vsag::Engine>(resource_.get()); | ||
| std::string build_parameters = fmt::format(parameter, dim); |
| bool use_term_list_ordered_insert, | ||
| const SparseVector* original_query = nullptr) const; | ||
|
|
||
| bool | ||
| UseTermListsHeapInsert(const SINDISearchParameter& search_param) const; |
| bool | ||
| SINDI::UseTermListsHeapInsert(const SINDISearchParameter& search_param) const { | ||
| auto is_special_prune_ratio = [](float ratio) { | ||
| constexpr float kEpsilon = 1e-6F; | ||
| return std::abs(ratio) < kEpsilon || std::abs(ratio - 0.1F) < kEpsilon; | ||
| }; | ||
| return not(is_special_prune_ratio(doc_prune_ratio_) && | ||
| is_special_prune_ratio(search_param.query_prune_ratio)); | ||
| } |
| } else { | ||
| n_candidate = DEFAULT_N_CANDIDATE; | ||
| } |
| bool | ||
| SINDI::UseTermListsHeapInsert(const SINDISearchParameter& search_param) const { | ||
| auto is_special_prune_ratio = [](float ratio) { | ||
| constexpr float kEpsilon = 1e-6F; | ||
| return std::abs(ratio) < kEpsilon || std::abs(ratio - 0.1F) < kEpsilon; | ||
| }; | ||
| return not(is_special_prune_ratio(doc_prune_ratio_) && | ||
| is_special_prune_ratio(search_param.query_prune_ratio)); | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request removes the manual use_term_lists_heap_insert search parameter from the SINDI index, allowing the heap-insertion strategy to be chosen automatically based on doc_prune_ratio and query_prune_ratio. This change is reflected across documentation, tests, parameter parsing, and the core implementation. However, a critical issue was identified in tests/test_memleak.cpp where bypassing the generate_param helper function results in an invalid parameter string, causing CreateIndex to fail at runtime. It is recommended to revert this change to keep the memory leak tests functional.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| explicit ScopedIndex(const std::string& name, | ||
| const std::string& build_parameters, | ||
| const std::string& parameter, | ||
| int64_t num_vectors, | ||
| int64_t dim, | ||
| Allocator* allocator) { | ||
| resource_ = std::make_unique<vsag::Resource>(allocator, nullptr); | ||
| engine_ = std::make_unique<vsag::Engine>(resource_.get()); | ||
| std::string build_parameters = fmt::format(parameter, dim); |
There was a problem hiding this comment.
The changes in tests/test_memleak.cpp bypass the generate_param helper function, which constructs the valid JSON parameter string required by CreateIndex. Instead, the raw quantization type string (e.g., "fp32") is passed directly as parameter to ScopedIndex, where it is formatted with dim. This results in an invalid parameter string (e.g., "fp32" instead of a valid JSON object), causing CreateIndex to fail at runtime. Since the build/link failed locally, this runtime failure was not caught. Reverting this change to use generate_param is necessary to keep the memory leak tests functional.
| explicit ScopedIndex(const std::string& name, | |
| const std::string& build_parameters, | |
| const std::string& parameter, | |
| int64_t num_vectors, | |
| int64_t dim, | |
| Allocator* allocator) { | |
| resource_ = std::make_unique<vsag::Resource>(allocator, nullptr); | |
| engine_ = std::make_unique<vsag::Engine>(resource_.get()); | |
| std::string build_parameters = fmt::format(parameter, dim); | |
| explicit ScopedIndex(const std::string& name, | |
| const std::string& build_parameters, | |
| int64_t num_vectors, | |
| int64_t dim, | |
| Allocator* allocator) { | |
| resource_ = std::make_unique<vsag::Resource>(allocator, nullptr); | |
| engine_ = std::make_unique<vsag::Engine>(resource_.get()); |
| for (const auto& [index_name, index_parameter, search_parameter, num_vectors, dim] : | ||
| INDEX_PARAMS) { | ||
| auto index_parameter = generate_param(index_name, base_quantization_type, dim); | ||
| ScopedIndex scoped_index(index_name, index_parameter, num_vectors, dim, allocator); |
There was a problem hiding this comment.
Revert this loop to use the generate_param helper function to construct the valid JSON parameter string for ScopedIndex.
| for (const auto& [index_name, index_parameter, search_parameter, num_vectors, dim] : | |
| INDEX_PARAMS) { | |
| auto index_parameter = generate_param(index_name, base_quantization_type, dim); | |
| ScopedIndex scoped_index(index_name, index_parameter, num_vectors, dim, allocator); | |
| for (const auto& [index_name, base_quantization_type, search_parameter, num_vectors, dim] : | |
| INDEX_PARAMS) { | |
| auto index_parameter = generate_param(index_name, base_quantization_type, dim); | |
| ScopedIndex scoped_index(index_name, index_parameter, num_vectors, dim, allocator); |
6e4f0a7 to
4fe0c89
Compare
Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com> Assisted-by: Codex:GPT-5
4fe0c89 to
fdbf5ab
Compare
| } else { | ||
| n_candidate = DEFAULT_N_CANDIDATE; | ||
| } | ||
|
|
||
| if (json[INDEX_SINDI].Contains(SPARSE_USE_TERM_LISTS_HEAP_INSERT)) { | ||
| use_term_lists_heap_insert = json[INDEX_SINDI][SPARSE_USE_TERM_LISTS_HEAP_INSERT].GetBool(); | ||
| } else { | ||
| use_term_lists_heap_insert = true; | ||
| } | ||
| } |
| bool | ||
| SINDI::UseTermListsHeapInsert(const SINDISearchParameter& search_param) const { | ||
| auto is_special_prune_ratio = [](float ratio) { | ||
| constexpr float epsilon = 1e-6F; | ||
| return std::abs(ratio) < epsilon || std::abs(ratio - 0.1F) < epsilon; | ||
| }; | ||
| return not(is_special_prune_ratio(doc_prune_ratio_) && | ||
| is_special_prune_ratio(search_param.query_prune_ratio)); | ||
| } |
| search_impl(const SparseTermComputerPtr& computer, | ||
| const InnerSearchParam& inner_param, | ||
| Allocator* allocator, | ||
| bool use_term_lists_heap_insert, | ||
| bool use_term_list_ordered_insert, | ||
| const SparseVector* original_query = nullptr) const; | ||
|
|
||
| bool | ||
| UseTermListsHeapInsert(const SINDISearchParameter& search_param) const; |
| auto search_param_with_prune_ratio = | ||
| fixtures::SINDITestIndex::GenerateSearchParameter(GENERATE(0.0F, 0.2F)); |
| SINDI chooses the heap insertion strategy automatically from `doc_prune_ratio` | ||
| and `query_prune_ratio`. |
Summary
use_term_lists_heap_insert.doc_prune_ratioandquery_prune_ratio.Details
SINDI now selects the heap insertion path automatically:
doc_prune_ratiois0or0.1, andquery_prune_ratiois0or0.1, use the originaluse_term_lists_heap_insert = falsepath.use_term_lists_heap_insert = truepath.Validation
clang-format --version->15.0.7git diff --checkcmake --build build --target unittests -j2algorithm_testcompiled successfully.tests/unittestslink failed due to local OpenBLAS/libgfortran unresolved symbol:_gfortran_concat_string.cmake --build build --target functests -j2tests/test_sindi.cppcompiled successfully.tests/functestslink failed with the same local_gfortran_concat_stringissue.