[Feature]: On-Disk Index #325

@richyreachy

Description

Problem / Motivation

Zvec's current indexes rely mainly on in-memory structures to achieve low-latency nearest-neighbor search. While effective for moderate-sized datasets that fit entirely in RAM, in-memory indexes become impractical as collections grow to large scale.

Moreover, many real-world use cases involve infrequently accessed long-tail vectors where keeping all data in memory is wasteful. A disk-based indexing solution would enable cost-effective scaling by leveraging disk storage while maintaining acceptable query latency.

Proposed Solution

An on-disk index will be introduced into Zvec with the following key components:

1. On-Disk Vector Storage:
Raw vector data (in FP32 or FP16 format) will be stored persistently on disk. Only compressed representations (e.g., quantized centroids, graph links, or PQ codes) and metadata will be kept in memory. During search, relevant raw vectors are fetched from disk only when needed for final distance re-ranking.
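
To make the two-stage flow concrete, here is a minimal sketch in Python (standard library only). It is not Zvec's design: the flat file layout, the function names, and the crude 8-bit scalar quantizer (used here as a stand-in for PQ codes) are all assumptions made for illustration. Only the compact codes stay in memory; raw FP32 vectors are read from disk for the shortlisted candidates and re-ranked exactly.

```python
import os
import struct
import tempfile

DIM = 4  # hypothetical dimensionality for the example


def write_vectors(path, vectors):
    """Store raw FP32 vectors back to back in a flat file (assumed layout)."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack(f"<{DIM}f", *v))


def read_vector(f, idx):
    """Fetch one raw vector from disk by its row index."""
    f.seek(idx * DIM * 4)
    return struct.unpack(f"<{DIM}f", f.read(DIM * 4))


def quantize(v, lo=-1.0, hi=1.0):
    """Crude 8-bit scalar quantization; a stand-in for PQ codes."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in v)


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(path, codes, query, k, n_candidates):
    # Stage 1: rank every vector using only the in-memory codes.
    q_code = quantize(query)
    approx = sorted(range(len(codes)), key=lambda i: l2_sq(codes[i], q_code))
    candidates = approx[:n_candidates]
    # Stage 2: read raw vectors for the candidates only, then re-rank exactly.
    with open(path, "rb") as f:
        exact = [(l2_sq(read_vector(f, i), query), i) for i in candidates]
    exact.sort()
    return [i for _, i in exact[:k]]


vectors = [
    (0.10, 0.20, 0.30, 0.40),
    (0.90, -0.50, 0.00, 0.25),
    (0.12, 0.18, 0.33, 0.41),
    (-0.80, 0.70, -0.60, 0.50),
]
codes = [quantize(v) for v in vectors]  # 4 bytes per vector, kept in memory

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.f32")
    write_vectors(path, vectors)
    top = search(path, codes, query=(0.1, 0.2, 0.3, 0.4), k=2, n_candidates=3)

print(top)  # [0, 2]
```

In this sketch, `n_candidates` controls the trade-off: a larger shortlist costs more disk reads but makes it less likely that the approximate stage drops a true neighbor.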

2. Support for Mainstream Similarity Metrics:
The on-disk index will natively support common similarity functions including:
 2.1. Cosine similarity
 2.2. Inner product (dot product)
 2.3. Euclidean (L2) distance

Distance computations will be performed accurately using the original (uncompressed) vectors retrieved from disk during the refinement stage.
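
As an illustration of the refinement stage, the sketch below scores already-fetched raw vectors under each of the three metrics. The `METRICS` registry, the metric names, and `rerank` are hypothetical and are not part of any Zvec API. The point it shows is that the sort direction depends on the metric: L2 is a distance (smaller is better), while inner product and cosine are similarities (larger is better).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def inner_product(a, b):
    """Larger means more similar."""
    return dot(a, b)


def l2_distance(a, b):
    """Smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the normalized vectors; larger means more similar."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / norm


# Hypothetical registry: metric name -> (function, higher_is_better)
METRICS = {
    "l2": (l2_distance, False),
    "ip": (inner_product, True),
    "cosine": (cosine_similarity, True),
}


def rerank(query, candidates, metric, k):
    """Order candidate ids by exact score on their raw vectors."""
    fn, higher_is_better = METRICS[metric]
    scored = sorted(
        candidates.items(),
        key=lambda item: fn(query, item[1]),
        reverse=higher_is_better,
    )
    return [vec_id for vec_id, _ in scored[:k]]


query = (1.0, 0.0)
# Raw vectors as they would look after being read back from disk.
candidates = {7: (2.0, 0.0), 8: (0.9, 0.1), 9: (0.0, 1.0), 10: (5.0, 5.0)}

print(rerank(query, candidates, "l2", 2))      # [8, 7]
print(rerank(query, candidates, "ip", 2))      # [10, 7]
print(rerank(query, candidates, "cosine", 2))  # [7, 8]
```

The same candidate set yields three different top-2 results, which is why the metric has to be fixed per index rather than chosen freely at query time when the in-memory structure was built for a specific one.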

3. FP32 and FP16 Data Type Support:
Users can store vectors on disk in either 32-bit (FP32) or 16-bit (FP16) floating-point format. The system will handle type conversion and alignment transparently. FP16 halves the bytes stored and read per vector relative to FP32, which improves storage and I/O efficiency without sacrificing compatibility.
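
A small sketch of what transparent conversion could look like, using Python's `struct` module, whose `f` and `e` format codes correspond to IEEE 754 binary32 and binary16. The `encode`, `decode`, and `record_size` helpers and the little-endian, unpadded record layout are assumptions for illustration only; the issue does not specify Zvec's on-disk format.

```python
import struct

# struct format codes for IEEE 754 binary32 and binary16
FORMATS = {"fp32": "f", "fp16": "e"}


def encode(vector, dtype):
    """Serialize a vector in the requested on-disk precision."""
    return struct.pack(f"<{len(vector)}{FORMATS[dtype]}", *vector)


def decode(buf, dim, dtype):
    """Read a vector back as Python floats, whatever the stored precision."""
    return list(struct.unpack(f"<{dim}{FORMATS[dtype]}", buf))


def record_size(dim, dtype):
    """Bytes occupied by one vector of the given dimension."""
    return struct.calcsize(f"<{dim}{FORMATS[dtype]}")


print(record_size(768, "fp32"))  # 3072
print(record_size(768, "fp16"))  # 1536

v = [0.5, -1.25, 0.1]
restored = decode(encode(v, "fp16"), len(v), "fp16")
# 0.5 and -1.25 are exactly representable in FP16; 0.1 is only approximated.
print(restored)
```

Callers always get ordinary floats back, so the distance code does not need to know which precision was used on disk. The cost of FP16 is rounding error in values that binary16 cannot represent exactly.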

Alternatives Considered

No response

Affected Area

C++ Core (storage, indexing)

Metadata

Labels: feature (New feature wanted)

Status: Backlog
