Skip to content

feat: DiskSINDI sparse index with batched and merged rerank IO#2129

Open
Roxanne0321 wants to merge 9 commits into
antgroup:mainfrom
Roxanne0321:feature/disksindi
Open

feat: DiskSINDI sparse index with batched and merged rerank IO#2129
Roxanne0321 wants to merge 9 commits into
antgroup:mainfrom
Roxanne0321:feature/disksindi

Conversation

@Roxanne0321
Copy link
Copy Markdown
Collaborator

Summary

Introduce DiskSINDI, a new sparse index family that stores posting lists on disk through a DiskSparseTermListDataCell, while keeping the term dictionary and (optional) rerank flat datacell in memory.

This PR includes three incremental commits:

  1. Initial DiskSINDI implementation — on-disk posting list layout, parameter handling, factory registration, and end-to-end tests.
  2. Stage3 Phase A — collapse per-candidate rerank IO into a single batched GetCodesByIdsBatch call, eliminating N serialized io_submit/io_getevents round-trips.
  3. Stage3 Phase B — sort candidate inner_ids before issuing the batched IO, then merge adjacent/nearby disk reads (gap ≤ 4 KiB, merged segment ≤ 1 MiB) to further reduce syscall count on DirectIO backends.

Key Changes

  • src/algorithm/disksindi/: new index implementation (disksindi.{h,cpp}, parameters, tests)
  • src/datacell/disk_sparse_term_list_datacell.{h,cpp,inl}: on-disk term list datacell
  • src/datacell/sparse_vector_datacell.{h,inl}: GetCodesByIdsBatch with IO merge logic
  • src/datacell/flatten_interface.h: BatchCodesResult and virtual GetCodesByIdsBatch
  • Factory registration, constants, and shared test fixtures

Testing

  • All unit tests pass (DiskSINDI end-to-end, SparseDataCell batch correctness, IO merge correctness)
  • Cross-checked rerank distances against CalcDistanceById single-id path

Closes: #1957

Keep legacy SINDI rerank deserialization compatible while moving rerank storage to SparseVectorDataCell.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Grow sparse datacell storage based on actual encoded vector size instead of using dense dim as a hard upper bound, and address follow-up review feedback in SINDI and sparse vector retrieval.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Resolve the upstream/main merge while keeping SparseIndex removed from this branch.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: GitHub Copilot:GPT-5.4
Add DiskSINDI as a new sparse index family. Compared to SINDI, the
posting lists are stored on disk through a DiskSparseTermListDataCell,
while the term dictionary and (optional) rerank flat datacell remain
in memory. The index integrates with the existing IO abstractions
(memory / mmap / buffer / async / reader) so the same code path can be
benchmarked under different IO backends.

Main pieces:

- src/algorithm/disksindi/: new index implementation, parameter handling
  and unit tests (disksindi.{h,cpp}, disksindi_parameter.{h,cpp},
  disksindi_parameter_test.cpp, CMakeLists.txt).
- src/datacell/disk_sparse_term_list_datacell.{h,cpp,inl} and matching
  unit test: term list datacell that owns the on-disk posting layout,
  with serialization and search-time scan helpers.
- src/datacell/sparse_vector_datacell.{h,inl} and its test: share rerank
  flat layout with SINDI and surface helpers that DiskSINDI uses for
  the rerank path.
- src/datacell/flatten_interface.h: tidy includes to support the new
  call sites; no functional change to existing flatten datacells.
- src/algorithm/sindi/sindi.cpp: align computer / search-impl helpers
  with the refactored sparse pieces shared with DiskSINDI.
- src/quantization/sparse_quantization/sparse_term_computer.h: expose
  a small accessor needed by DiskSINDI search.
- src/io/mmap_io.cpp: minor change to support DiskSINDI's read pattern.
- include/vsag/constants.h, include/vsag/index.h, src/constants.cpp,
  src/factory/index_creators.cpp, src/algorithm/CMakeLists.txt: register
  the new INDEX_DISKSINDI type and wire up the factory.
- tests/fixtures/unittest.h: include catch2 matchers headers needed by
  the new DiskSINDI / disk sparse term list tests.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: CodeFuse:claude-sonnet-4.5
Stage3 phase A: collapse the per-candidate rerank IO into a single
batched read. This addresses the dominant cost on the DiskSINDI
async-IO rerank path, where each candidate previously triggered an
independent io_submit / io_getevents round-trip.

Code changes:

- src/datacell/flatten_interface.h: introduce BatchCodesResult and a
  new virtual GetCodesByIdsBatch on FlattenInterface. The default
  implementation falls back to a loop over GetCodesById, copying each
  fixed-size code blob into a contiguous buffer; it is safe for any
  backend whose code length equals code_size_.

- src/datacell/sparse_vector_datacell.{h,inl}: override
  GetCodesByIdsBatch. The override walks offset_io_ (in-memory) once to
  collect every DocLocation, computes per-id sizes and in-buffer
  offsets, then issues a single io_->MultiRead for all payloads. The
  rest of the access path (locking, encoding) is unchanged. Phase A
  does not sort or merge requests; that is left to phase B.

- src/algorithm/disksindi/disksindi.cpp: extract
  compute_distance_from_codes so the existing single-candidate path
  (cal_distance_by_id_unsafe) and the new batched path share the same
  scoring routine. Rewrite the rerank loop in search_impl to (1) pop
  the heap into a flat Vector<InnerIdType> preserving the original pop
  order, (2) call rerank_flat_->GetCodesByIdsBatch once, (3) iterate
  with const uint8_t* codes = batch.buffer.data() +
  batch.in_buffer_offsets[i] and feed compute_distance_from_codes.

Tests:

- src/datacell/sparse_vector_datacell_test.cpp: add a "SparseDataCell
  Batch Codes Matches Single" test that compares GetCodesByIdsBatch
  against GetCodesById byte-for-byte across ascending ids, shuffled
  ids, subsets with duplicates, and the empty input, over the
  memory_io and block_memory_io backends.

- src/algorithm/disksindi/disksindi_test.cpp: new end-to-end test that
  builds a DiskSINDI index with buffer_io term_io and memory_io
  rerank_io, runs a top-k query, and asserts that the result dim
  matches k, distances are non-decreasing, and the query itself
  appears in its own top-k.

Behavior:

- Serialization format is unchanged; this is a runtime-only change.
- Recall is unaffected (the batched path produces the same codes as
  the per-id path, only the IO scheduling differs).
- The async-IO rerank backend is the primary beneficiary; memory and
  mmap rerank backends degrade to a contiguous memcpy / sequential
  read and stay within noise.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: CodeFuse:claude-sonnet-4.5
Stage3 phase B: sort candidate inner_ids before the batched rerank IO
and merge adjacent/nearby disk reads into larger segments. This builds
on the Phase A batched IO to further reduce syscall count when the
IO backend is DirectIO-based (async_io).

Code changes:

- src/algorithm/disksindi/disksindi.cpp: after popping candidates from
  the heap into a flat vector, apply std::sort so that disk offsets are
  monotonically increasing before calling GetCodesByIdsBatch.

- src/datacell/sparse_vector_datacell.inl: rewrite GetCodesByIdsBatch
  to build merged IO ranges. After gathering (offset, size) tuples, the
  function scans them in order and coalesces pairs whose gap is within
  MERGE_GAP_LIMIT (DirectIO alignment, typically 4 KiB) and whose
  merged length stays under MAX_MERGED_IO_LEN (1 MiB). A scratch buffer
  receives the merged reads via MultiRead, then a scatter loop copies
  each candidate's payload into its final slot in result.buffer.
  Correctness is independent of input ordering: unsorted ids simply
  produce fewer merges, degenerating to the Phase A behavior.

Tests:

- src/algorithm/disksindi/disksindi_test.cpp: add "DiskSINDI Sorted
  Merge Rerank End-To-End" which cross-checks every result distance
  against CalcDistanceById to ensure the sorted+merged code path
  produces identical distance computations.

- src/datacell/sparse_vector_datacell_test.cpp: add "SparseDataCell
  Batch IO Merge Correctness" covering seven sections (unsorted,
  ascending contiguous, strided, large range, single, two adjacent,
  two distant) over both memory_io and block_memory_io backends.

Signed-off-by: Roxanne0321 <liruoxvan020321@qq.com>
Assisted-by: CodeFuse:claude-sonnet-4
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Jun 2, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require kind label

Wonderful, this rule succeeded.
  • label~=^kind/

🟢 Require version label

Wonderful, this rule succeeded.
  • label~=^version/

🟢 Require linked issue for feature/bug PRs

Wonderful, this rule succeeded.
  • body~=(?im)(?:^|[\s\-\*])(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s+(?:#\d+|[\w.\-]+/[\w.\-]+#\d+|https?://github\.com/[\w.\-]+/[\w.\-]+/issues/\d+)

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the DiskSINDI index type to support disk-backed sparse vector indexing, refactors the existing SINDI index to use FlattenInterface for reranking, and upgrades SparseVectorDataCell to support 64-bit offsets with backward compatibility. The review feedback highlights several critical bug fixes and optimization opportunities, including guarding against undefined behavior and crashes when handling zero-length allocations or empty heaps, avoiding heap allocation overhead for small padding writes, and correcting a minor include path typo.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +765 to +769
data->len_ = ids.size();
data->ids_ = static_cast<uint32_t*>(allocator->Allocate(sizeof(uint32_t) * data->len_));
data->vals_ = static_cast<float*>(allocator->Allocate(sizeof(float) * data->len_));
memcpy(data->ids_, ids.data(), data->len_ * sizeof(uint32_t));
memcpy(data->vals_, vals.data(), data->len_ * sizeof(float));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If data->len_ is 0, calling allocator->Allocate(0) and subsequently passing the potentially null pointers to memcpy (even with size 0) is undefined behavior in C++. We should guard the allocation and copy with a check for data->len_ > 0.

    data->len_ = ids.size();
    if (data->len_ > 0) {
        data->ids_ = static_cast<uint32_t*>(allocator->Allocate(sizeof(uint32_t) * data->len_));
        data->vals_ = static_cast<float*>(allocator->Allocate(sizeof(float) * data->len_));
        memcpy(data->ids_, ids.data(), data->len_ * sizeof(uint32_t));
        memcpy(data->vals_, vals.data(), data->len_ * sizeof(float));
    } else {
        data->ids_ = nullptr;
        data->vals_ = nullptr;
    }

Comment on lines +526 to +535
if (high_precise_distance < cur_heap_top or
high_precise_heap->Size() < static_cast<uint64_t>(k)) {
high_precise_heap->Push(high_precise_distance, label);
}
if (high_precise_heap->Size() > static_cast<uint64_t>(k)) {
high_precise_heap->Pop();
}
cur_heap_top = high_precise_heap->Top().first;
}
if constexpr (mode == RANGE_SEARCH) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If k == 0 or if high_precise_heap is empty, calling high_precise_heap->Top() will cause undefined behavior or a crash. We should guard the Top() call with an empty check to prevent potential crashes.

            if constexpr (mode == KNN_SEARCH) {
                if (high_precise_distance < cur_heap_top or
                    high_precise_heap->Size() < static_cast<uint64_t>(k)) {
                    high_precise_heap->Push(high_precise_distance, label);
                }
                if (high_precise_heap->Size() > static_cast<uint64_t>(k)) {
                    high_precise_heap->Pop();
                }
                if (not high_precise_heap->Empty()) {
                    cur_heap_top = high_precise_heap->Top().first;
                }
            }

Comment on lines +315 to +319
size_t ids_padding = (4 - (tb.ids.size() * sizeof(uint16_t)) % 4) % 4;
if (ids_padding > 0) {
Vector<char> pad(ids_padding, 0, allocator_);
writer.Write(pad.data(), ids_padding);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Vector<char> to write 2 bytes of padding introduces unnecessary heap allocation overhead on every term write during serialization. Since ids_padding is at most 2, we can use a stack-allocated array instead to avoid heap allocation.

Suggested change
size_t ids_padding = (4 - (tb.ids.size() * sizeof(uint16_t)) % 4) % 4;
if (ids_padding > 0) {
Vector<char> pad(ids_padding, 0, allocator_);
writer.Write(pad.data(), ids_padding);
}
size_t ids_padding = (4 - (tb.ids.size() * sizeof(uint16_t)) % 4) % 4;
if (ids_padding > 0) {
char pad[2] = {0, 0};
writer.Write(pad, ids_padding);
}

Comment on lines +422 to +427
size_t ids_padding = (4 - ids_size % 4) % 4;
if (ids_padding > 0) {
Vector<uint8_t> pad(ids_padding, 0, allocator_);
io_->Write(pad.data(), ids_padding, offset);
offset += ids_padding;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Vector<uint8_t> to write padding bytes introduces unnecessary heap allocation overhead on every term write. Since ids_padding is at most 2, we can use a stack-allocated array instead to avoid heap allocation.

Suggested change
size_t ids_padding = (4 - ids_size % 4) % 4;
if (ids_padding > 0) {
Vector<uint8_t> pad(ids_padding, 0, allocator_);
io_->Write(pad.data(), ids_padding, offset);
offset += ids_padding;
}
size_t ids_padding = (4 - ids_size % 4) % 4;
if (ids_padding > 0) {
uint8_t pad[2] = {0, 0};
io_->Write(pad, ids_padding, offset);
offset += ids_padding;
}

Comment on lines +302 to +316
data->ids_ = static_cast<uint32_t*>(allocator->Allocate(sizeof(uint32_t) * data->len_));
try {
data->vals_ = static_cast<float*>(allocator->Allocate(sizeof(float) * data->len_));
} catch (...) {
allocator->Deallocate(data->ids_);
data->ids_ = nullptr;
if (need_release) {
this->Release(codes);
}
throw;
}
for (uint32_t i = 0; i < data->len_; ++i) {
data->ids_[i] = entries[i].id;
data->vals_[i] = entries[i].val;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If data->len_ is 0, calling allocator->Allocate(0) is unnecessary and can be avoided. We should guard the allocation and initialization with a check for data->len_ > 0.

    if (data->len_ > 0) {
        data->ids_ = static_cast<uint32_t*>(allocator->Allocate(sizeof(uint32_t) * data->len_));
        try {
            data->vals_ = static_cast<float*>(allocator->Allocate(sizeof(float) * data->len_));
        } catch (...) {
            allocator->Deallocate(data->ids_);
            data->ids_ = nullptr;
            if (need_release) {
                this->Release(codes);
            }
            throw;
        }
        for (uint32_t i = 0; i < data->len_; ++i) {
            data->ids_[i] = entries[i].id;
            data->vals_[i] = entries[i].val;
        }
    } else {
        data->ids_ = nullptr;
        data->vals_ = nullptr;
    }

#include "index_common_param.h"
#include "io/basic_io.h"
#include "io/io_parameter.h"
#include "quantization/sparse_quantization//sparse_term_computer.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the include path containing a double slash: quantization/sparse_quantization//sparse_term_computer.h. It should be cleaned up to a single slash.

Suggested change
#include "quantization/sparse_quantization//sparse_term_computer.h"
#include "quantization/sparse_quantization/sparse_term_computer.h"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the legacy SparseIndex implementation and introduces DiskSINDI plus supporting data-cells, while also upgrading sparse vector storage for 64-bit offsets and batched IO reads.

Changes:

  • Added DiskSINDI (new index type) and a disk-backed sparse term list data-cell with serialization/deserialization support.
  • Updated SparseVectorDataCell to use 64-bit offsets, add a v2 serialization sentinel/version, and implement GetCodesByIdsBatch with IO-merge logic.
  • Updated tests/docs and refactored SINDI’s rerank-flat storage to use FlattenInterface instead of SparseIndex.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_sparse_index.cpp Removed functional tests for deleted SparseIndex.
tests/fixtures/unittest.h Added Catch2 matcher includes for new REQUIRE_THROWS_WITH usage.
src/quantization/sparse_quantization/sparse_term_computer.h Added DiskSINDI search-param constructor.
src/io/mmap_io.cpp Populate mmap size from existing file size.
src/factory/index_creators.cpp Register DiskSINDI creator; remove SparseIndex creator.
src/datacell/sparse_vector_datacell_test.cpp Added tests for v1/v2 serialization compatibility, batched reads, and sparse vector fetch by inner id.
src/datacell/sparse_vector_datacell.inl Implemented v2 serialization + legacy fallback; added batched code fetch with IO merging; added locking and new inner-id sparse fetch.
src/datacell/sparse_vector_datacell.h Switched to 64-bit offsets with packed DocLocation; added batch API and serialization format constants.
src/datacell/flatten_interface.h Introduced BatchCodesResult + default GetCodesByIdsBatch and sparse fetch hook.
src/datacell/disk_sparse_term_list_datacell_test.cpp Added unit tests for disk term list data-cell IO restore and IO type validation.
src/datacell/disk_sparse_term_list_datacell.inl Added heap insertion helpers / window scan logic implementation.
src/datacell/disk_sparse_term_list_datacell.h Added disk sparse term list data-cell interface + template implementation.
src/datacell/disk_sparse_term_list_datacell.cpp Implemented disk sparse term list data-cell build, IO, and query helpers.
src/constants.cpp Removed INDEX_SPARSE; added INDEX_DISKSINDI.
src/algorithm/sparse_index_parameters.h Deleted legacy SparseIndex parameters.
src/algorithm/sparse_index_parameters.cpp Deleted legacy SparseIndex parameters impl.
src/algorithm/sparse_index.h Deleted legacy SparseIndex.
src/algorithm/sparse_index.cpp Deleted legacy SparseIndex implementation.
src/algorithm/sparse_distance.h Introduced shared sparse sort + distance utilities for SINDI/DiskSINDI.
src/algorithm/sindi/sindi_test.cpp Switched ground-truth from SparseIndex to an “exact SINDI” configuration; adjusted tolerances.
src/algorithm/sindi/sindi.h Replaced rerank SparseIndex with FlattenInterface.
src/algorithm/sindi/sindi.cpp Implemented rerank flat as a datacell, added legacy rerank deserialization path, and reused new sparse distance helpers.
src/algorithm/inner_index_interface.h Updated docs to remove SparseIndex references.
src/algorithm/inner_index_interface.cpp Removed SPARSE handling branch in GetVectorByIds.
src/algorithm/disksindi/disksindi_test.cpp Added end-to-end tests for DiskSINDI batched rerank behavior.
src/algorithm/disksindi/disksindi_parameter_test.cpp Added DiskSINDI parameter parsing and compatibility tests.
src/algorithm/disksindi/disksindi_parameter.h Added DiskSINDI index/search parameters.
src/algorithm/disksindi/disksindi_parameter.cpp Implemented DiskSINDI parameter parsing and compatibility rules.
src/algorithm/disksindi/disksindi.h Added DiskSINDI index interface.
src/algorithm/disksindi/disksindi.cpp Implemented DiskSINDI build/search/serialize/deserialize and batched rerank.
src/algorithm/disksindi/CMakeLists.txt Added build target for DiskSINDI sources.
src/algorithm/CMakeLists.txt Wired DiskSINDI subdirectory and object library.
include/vsag/index.h Public API: removed SPARSE, added DISKSINDI; updated docs text.
include/vsag/constants.h Public constants: removed INDEX_SPARSE, added INDEX_DISKSINDI.
docs/docs/zh/src/advanced/search_allocator.md Removed SparseIndex mention.
docs/docs/zh/src/advanced/introspection.md Removed SparseIndex mention.
docs/docs/en/src/advanced/search_allocator.md Removed SparseIndex mention.
docs/docs/en/src/advanced/introspection.md Removed SparseIndex mention.
Comments suppressed due to low confidence (4)

src/datacell/sparse_vector_datacell.inl:1

  • max_code_size_ is now initialized to sizeof(uint32_t), which makes Resize() (and any pre-allocation strategy that depends on max_code_size_) severely under-estimate IO storage needs before the first insert. Since InsertVector() resizes io_ to required_size incrementally, this can devolve into frequent small resizes and copying. Consider initializing max_code_size_ to a more realistic bound/estimate (e.g., based on configured expected avg sparse length, or a conservative heuristic), and/or grow io_ capacity with an amortized strategy (e.g., geometric growth) instead of resizing to exactly required_size.
    src/datacell/sparse_vector_datacell.inl:1
  • In the legacy deserialization path, legacy_offset_io_size is divided by sizeof(LegacyDocLocation) without validating exact divisibility. If the stream is corrupted/misaligned, truncation will leave unread bytes in the stream, which can shift quantizer_->Deserialize(reader) and cause hard-to-debug failures. Add a strict check that legacy_offset_io_size % sizeof(LegacyDocLocation) == 0 (and ideally that the resulting doc_count matches total_count_ / expected count if that invariant exists) and throw a descriptive exception on mismatch.
    src/datacell/sparse_vector_datacell_test.cpp:1
  • This test sorts expected but not actual, yet compares them for equality, which makes the assertion order-dependent on the internal encoding/storage order of GetSparseVectorByInnerId. If ordering is not strictly guaranteed, sort actual as well (or compare via an order-insensitive matcher) to ensure the test validates content rather than incidental ordering.
    src/datacell/sparse_vector_datacell.h:1
  • __attribute__((packed)) is compiler-specific (GCC/Clang) and may break portability to toolchains like MSVC. If this project supports multiple compilers, consider wrapping packing behind a project-wide portability macro (or using #pragma pack(push, 1) / #pragma pack(pop) guarded per compiler) to keep the on-disk layout guarantees while remaining portable.

Comment on lines +18 to +27
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

#include "container_types.h"
#include "impl/inner_search_param.h"
#include "index_common_param.h"
#include "io/basic_io.h"
#include "io/io_parameter.h"
#include "quantization/sparse_quantization//sparse_term_computer.h"
Comment on lines +368 to +370
uint32_t term_dict_count = static_cast<uint32_t>(term_dict_size / sizeof(DiskTermEntry));
std::vector<DiskTermEntry> term_dict(term_dict_count);
reader.Read(reinterpret_cast<char*>(term_dict.data()), term_dict_size);
Comment on lines +411 to +415
CHECK_ARGUMENT(search_param.n_candidate <= SPARSE_AMPLIFICATION_FACTOR * k,
fmt::format("n_candidate ({}) should be less than {} * k ({})",
search_param.n_candidate,
AMPLIFICATION_FACTOR,
k));
* @brief Calculate distance by ID using DatasetPtr.
*
* Suitable for sparse vector indexes (SINDI, SparseIndex) where vectors
* Suitable for sparse vector indexes (SINDI) where vectors
Comment thread include/vsag/index.h
};

enum class IndexType { HNSW, DISKANN, HGRAPH, IVF, PYRAMID, BRUTEFORCE, SPARSE, SINDI, WARP };
enum class IndexType { HNSW, DISKANN, HGRAPH, IVF, PYRAMID, BRUTEFORCE, SINDI, DISKSINDI, WARP };
@wxyucs wxyucs self-assigned this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support SINDI Disk Algorithm DiskSindi

3 participants