SINDI: Sparse Inverted Non-redundant Distance Index

Proof-of-Concept – Integrated into VSAG

Introduction

This repository hosts the proof-of-concept research prototype of SINDI, an efficient algorithm for Approximate Maximum Inner Product Search (AMIPS) on sparse vectors. It was proposed in the paper "SINDI: An Efficient Index for Sparse Vector Approximate Maximum Inner Product Search".

⚠️ Note for Practitioners:
The SINDI index has been fully integrated into the VSAG framework, which provides a production-grade implementation with ongoing maintenance, cross-version compatibility, and Python/C++ APIs.
For deployment, benchmarking, or large-scale production scenarios, please refer to the
Sindi index within the VSAG repository instead of this prototype.


Parameters

Index Construction Parameters

  • lambda (λ): Window size used for segmented accumulation. Valid range: [10000, 100000].
  • alpha (α): Document pruning ratio for base vectors. For each base vector, sort the entries by value in descending order and keep the shortest prefix whose cumulative mass reaches alpha * total_mass. Valid range: [0, 1].
  • use_reorder: Whether to rerank the coarse candidates with exact scores. Default is true.
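
The alpha rule can be illustrated with a minimal Python sketch. It is not the VSAG implementation: the function name `prune_by_mass` is hypothetical, a vector is represented as a plain dict from dimension to value, and "mass" is assumed to mean the sum of the (non-negative) values.

```python
def prune_by_mass(vector, alpha):
    """Keep the shortest descending-value prefix reaching alpha * total_mass.

    vector: dict mapping dimension id -> non-negative weight (assumed layout).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Sort entries by value, largest first.
    entries = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    total_mass = sum(value for _, value in entries)
    target = alpha * total_mass
    kept = {}
    running = 0.0
    for dim, value in entries:
        if running >= target:
            break  # the prefix already carries enough mass
        kept[dim] = value
        running += value
    return kept


doc = {3: 4.0, 17: 2.0, 42: 1.0, 99: 1.0}  # total mass 8.0
print(prune_by_mass(doc, 0.75))  # keeps dims 3 and 17 (mass 6.0)
```

Under this reading, alpha = 1 keeps every entry with a positive value and alpha = 0 keeps none.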

Query/Search Parameters

  • beta (β): Fraction of query terms retained in coarse retrieval. Each query keeps its top int(query_dim * beta) terms by value. Valid range: [0, 1].
  • gamma (γ): Size of the candidate pool kept after the coarse stage. If gamma < topk, gamma is set to topk internally.
  • num_threads: Number of threads used in search (set via omp_set_num_threads).

Note: prune_stragy is not used by the current SINDI parameter parser.
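
The beta and gamma rules above can be sketched in a few lines of Python. The helper names `prune_query` and `effective_gamma` are hypothetical, and `query_dim` is assumed here to be the number of non-zero terms in the query; the actual implementation may differ.

```python
def prune_query(query, beta):
    """Keep the top int(query_dim * beta) terms of a query by value.

    query: dict mapping dimension id -> weight (assumed layout).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    keep = int(len(query) * beta)  # query_dim taken as the non-zero count
    top = sorted(query.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(top)


def effective_gamma(gamma, topk):
    """The candidate pool is never smaller than the number of results requested."""
    return max(gamma, topk)


query = {5: 0.9, 8: 0.1, 21: 0.6, 34: 0.3, 55: 0.2}
print(prune_query(query, 0.5))   # int(5 * 0.5) = 2 terms: dims 5 and 21
print(effective_gamma(5, 10))    # 10
```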

Installation

You can build SINDI as part of the VSAG framework from source.

git clone -b sparse --single-branch https://github.com/Roxanne0321/vsag.git
cd vsag
make release

📂 Offline Evaluation Datasets

The offline benchmark datasets used in this work are derived from the BigANN Sparse Vector Track,
which in turn is based on the MS MARCO Passage Ranking corpus encoded with the SPLADE model.

  • Base datasets contain SPLADE-encoded sparse vectors for MS MARCO passages.
  • Query dataset contains SPLADE-encoded sparse vectors for 6,980 development queries.
  • Vectors are stored in Compressed Sparse Row (CSR) format, with dimensionality up to ~100,000
    and an average of ~120 non-zero entries per base vector, ~49 per query vector.
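
For readers unfamiliar with CSR, the following Python sketch shows the in-memory layout only; it makes no assumption about the on-disk format of the dataset files. The array names `indptr`, `indices`, and `data` follow common convention, and the helper functions are hypothetical.

```python
# Three sparse vectors stored back to back.
# Row i owns the slice indptr[i]:indptr[i + 1] of indices and data.
indptr = [0, 2, 3, 6]
indices = [1, 4, 2, 0, 1, 5]               # dimension ids of the non-zeros
data = [0.5, 1.5, 2.0, 1.0, 0.25, 3.0]     # matching values


def row(i):
    """Return row i as a dict mapping dimension id -> value."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))


def inner_product(a, b):
    """Inner product of two sparse vectors given as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(value * b[dim] for dim, value in a.items() if dim in b)


print(row(0))                          # {1: 0.5, 4: 1.5}
print(inner_product(row(0), row(2)))   # only dim 1 overlaps: 0.5 * 0.25 = 0.125
```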

Download Links

Dataset Name    Type     Size (vectors)    Download URL
base_small      Base     100,000           Download
base_1M         Base     1,000,000         Download
base_full       Base     8,841,823         Download
queries         Query    6,980             Download

Usage

Build Index

./build-release/sparse/scripts/sindi_index_build <basefile> <lambda> <alpha> <use_reorder> <index_path>

Generate Ground Truth

./build-release/sparse/scripts/generate_gt <basefile> <queryfile> <gtfile> <topk>
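
Ground truth for maximum inner product search is the exact top-k set obtained by scoring every base vector against the query. The Python sketch below illustrates that definition by brute force; it is not the code of generate_gt, and the function names and dict-based vector layout are assumptions.

```python
import heapq


def inner_product(a, b):
    """Inner product of two sparse vectors given as dicts (dim -> value)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[dim] for dim, value in a.items() if dim in b)


def exact_topk(base, query, topk):
    """Return the ids of the topk base vectors with the largest inner product."""
    scored = ((inner_product(vec, query), doc_id) for doc_id, vec in enumerate(base))
    best = heapq.nlargest(topk, scored)  # sorted by score, highest first
    return [doc_id for _, doc_id in best]


base = [
    {0: 1.0, 1: 2.0},   # score 1.0
    {1: 1.0},           # score 0
    {2: 5.0},           # score 10.0
    {0: 3.0, 2: 1.0},   # score 5.0
]
query = {0: 1.0, 2: 2.0}
print(exact_topk(base, query, 2))  # [2, 3]
```

Recall of an approximate search can then be measured as the overlap between its results and this exact list.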

Search Index

./build-release/sparse/scripts/sindi_index_search <index_path> <queryfile> <gtfile> <beta> <gamma> <topk> <num_threads>
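
As a convenience, the build and search command lines can be assembled from a script. The Python sketch below only checks the documented ranges and returns the argument lists in the order shown above; it does not run anything. The function names and the file names in the example are hypothetical, and because the accepted spelling of use_reorder on the command line is not documented here, the value is passed through as given.

```python
BUILD_BIN = "./build-release/sparse/scripts/sindi_index_build"
SEARCH_BIN = "./build-release/sparse/scripts/sindi_index_search"


def build_command(basefile, lam, alpha, use_reorder, index_path):
    """Argument list for the build tool, checked against the documented ranges."""
    if not 10000 <= lam <= 100000:
        raise ValueError("lambda must be in [10000, 100000]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # use_reorder is forwarded verbatim (its CLI spelling is an assumption).
    return [BUILD_BIN, basefile, str(lam), str(alpha), use_reorder, index_path]


def search_command(index_path, queryfile, gtfile, beta, gamma, topk, num_threads):
    """Argument list for the search tool."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return [SEARCH_BIN, index_path, queryfile, gtfile,
            str(beta), str(gamma), str(topk), str(num_threads)]


# Hypothetical file names, for illustration only.
print(build_command("base_small.csr", 20000, 0.8, "1", "sindi.index"))
print(search_command("sindi.index", "queries.csr", "gt.bin", 0.5, 500, 10, 8))
```

Either list can be handed to subprocess.run(cmd, check=True) once the binaries have been built.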
