High-Performance Vector Database with Pluggable ANNS Architecture
SageVDB is a C++20 library that provides efficient vector similarity search, metadata management, and a flexible plugin system for Approximate Nearest Neighbor Search (ANNS) algorithms. It serves as the native core for the SAGE VDB middleware component.
Usage Mode Guide: See docs/USAGE_MODES.md for the positioning, data flow, and examples of the Standalone, BYO-Embedding, Plugin, and Service modes.
- Exact and Approximate Search: Support for brute-force exact search and pluggable ANNS algorithms
- Multiple Distance Metrics: L2 (Euclidean), Inner Product, Cosine similarity
- Metadata Management: Efficient key-value metadata storage and filtering
- Batch Operations: Optimized batch insertion and search
- Persistence: Save and load database state to/from disk
- Thread-Safe: Concurrent read operations supported
- Pluggable Architecture: Easy integration of new ANNS algorithms
- Algorithm Registry: Dynamic registration and discovery
- Big-ANN Compatible: Parameters follow big-ann-benchmarks conventions
- Fail-Fast Capability Boundary: Unsupported operations throw explicit errors (no implicit fallback)
- Built-in Algorithms:
  - brute_force: Exact search, supports incremental updates and deletions
  - faiss: FAISS integration (when available)
- Cross-Modal Fusion: Combine features from text, images, audio, video, etc.
- Fusion Strategies: Concatenation, weighted average, attention, tensor fusion, bilinear pooling
- Extensible: Register custom modality processors and fusion strategies
- C++20 compatible compiler (GCC 11+, Clang 14+, or MSVC 19.29+)
- CMake 3.12+
- BLAS/LAPACK (for linear algebra operations)
- OpenMP - Parallel processing (recommended)
- FAISS - Facebook AI Similarity Search integration
- OpenCV - Image processing for multimodal features
- FFmpeg - Audio/video processing for multimodal features
- gperftools - Performance profiling
# Clone and setup in one go
git clone https://github.com/intellistream/sageVDB.git
cd sageVDB
./quickstart.sh
The quickstart.sh script will:
- Install git hooks (pre-commit, pre-push)
- Check dependencies (CMake, C++ compiler, Python)
- Optionally build the project
- Optionally install Python package in development mode
What the git hooks do:
- pre-commit: Checks for trailing whitespace, large files, and debug statements
- pre-push: Manages version updates and the PyPI publishing workflow
cd sageVDB
# Basic build
./build.sh
# Production build with optimizations
BUILD_TYPE=Release ./build.sh
# Enable profiling
SAGE_ENABLE_GPERFTOOLS=ON ./build.sh
# The build produces:
# - build/libsage_vdb.so # Shared library
# - build/test_sage_vdb # Test executable
# - install/lib/libsage_vdb.so # Installed library
# - install/include/sage_vdb/ # Public headers
cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTS=ON \
-DUSE_OPENMP=ON \
-DENABLE_MULTIMODAL=ON \
-DENABLE_OPENCV=OFF \
-DENABLE_FFMPEG=OFF \
-DENABLE_GPERFTOOLS=OFF
cmake --build build -j$(nproc)
cd build
ctest --verbose
# Or run directly
./test_sage_vdb
./test_multimodal
#include <sage_vdb/sage_vdb.h>
#include <iostream>
#include <vector>
using namespace sage_vdb;
int main() {
// Create database configuration
DatabaseConfig config(128); // 128-dimensional vectors
config.index_type = IndexType::FLAT;
config.metric = DistanceMetric::L2;
config.anns_algorithm = "brute_force";
// Initialize database
SageVDB db(config);
// Add vectors with metadata
Vector vec1(128, 0.1f);
Metadata meta1 = {{"category", "A"}, {"text", "first vector"}};
VectorId id1 = db.add(vec1, meta1);
// Batch add
std::vector<Vector> vectors = {
Vector(128, 0.2f),
Vector(128, 0.3f)
};
std::vector<Metadata> metadata = {
{{"category", "B"}},
{{"category", "A"}}
};
auto ids = db.add_batch(vectors, metadata);
// Search for nearest neighbors
Vector query(128, 0.15f);
auto results = db.search(query, 5); // Find 5 nearest neighbors
for (const auto& result : results) {
std::cout << "ID: " << result.id
<< ", Distance: " << result.score
<< ", Category: " << result.metadata.at("category")
<< std::endl;
}
// Filtered search
auto filtered = db.filtered_search(
query,
SearchParams(5),
[](const Metadata& meta) {
return meta.at("category") == "A";
}
);
return 0;
}
#include <sage_vdb/sage_vdb.h>
using namespace sage_vdb;
int main() {
DatabaseConfig config(768);
config.metric = DistanceMetric::L2;
config.anns_algorithm = "faiss";
// FAISS-specific build parameters
config.anns_build_params["index_type"] = "IVF256,Flat";
config.anns_build_params["metric"] = "l2";
// FAISS-specific query parameters
config.anns_query_params["nprobe"] = "8";
SageVDB db(config);
// Training data for IVF index
std::vector<Vector> training_data;
// ... populate training_data ...
db.train_index(training_data);
// Add vectors
// ... add your data ...
// Build index
db.build_index();
// NOTE: capability mismatches fail fast.
// Example: calling remove/update on an algorithm without deletion support throws immediately.
// Query (placeholder vector; substitute a real 768-dimensional embedding)
Vector query(768, 0.1f);
auto results = db.search(query, 10);
return 0;
}
#include <sage_vdb/multimodal_sage_vdb.h>
using namespace sage_vdb;
int main() {
// Configure multimodal database
DatabaseConfig config;
config.dimension = 0; // Will be auto-calculated from modalities
MultimodalSageVDB mdb(config);
// Register modality processors
auto text_processor = std::make_shared<TextModalityProcessor>(768);
auto image_processor = std::make_shared<ImageModalityProcessor>(512);
mdb.register_modality("text", text_processor);
mdb.register_modality("image", image_processor);
// Set fusion strategy
auto attention_fusion = std::make_shared<AttentionFusion>();
mdb.set_fusion_strategy(attention_fusion);
// Add multimodal data
std::unordered_map<std::string, Vector> modality_data;
modality_data["text"] = Vector(768, 0.5f); // Text embedding
modality_data["image"] = Vector(512, 0.3f); // Image embedding
Metadata metadata = {{"caption", "A beautiful sunset"}};
mdb.add_multimodal(modality_data, metadata);
// Multimodal query
std::unordered_map<std::string, Vector> query_data;
query_data["text"] = Vector(768, 0.6f);
auto results = mdb.search_multimodal(query_data, 10);
return 0;
}
#include <sage_vdb/sage_vdb.h>
using namespace sage_vdb;
int main() {
DatabaseConfig config(128);
SageVDB db(config);
// Add data
// ...
// Save to disk
db.save("my_database.SageVDB");
// Later, load from disk
SageVDB db2(config);
db2.load("my_database.SageVDB");
// Database is ready to use (placeholder query vector)
Vector query(128, 0.15f);
auto results = db2.search(query, 10);
return 0;
}
- Implement the ANNSAlgorithm interface:
#include <sage_vdb/anns/anns_interface.h>
class MyANNS : public ANNSAlgorithm {
public:
// Identity
std::string name() const override { return "my_anns"; }
std::string version() const override { return "1.0.0"; }
std::string description() const override { return "My custom ANNS"; }
// Capabilities
bool supports_metric(DistanceMetric metric) const override {
return metric == DistanceMetric::L2;
}
bool supports_incremental_add() const override { return true; }
bool supports_deletion() const override { return false; }
// Build
void fit(const std::vector<VectorEntry>& data,
const AlgorithmParams& params) override {
// Build your index here
dimension_ = data.empty() ? 0 : data[0].vector.size();
// ... your implementation ...
built_ = true; // Mark the index as built so is_built() reports it
}
// Query
ANNSResult query(const Vector& q, const QueryConfig& config) override {
// Perform search
ANNSResult result;
// ... your implementation ...
return result;
}
// Batch query (optional optimization)
std::vector<ANNSResult> query_batch(
const std::vector<Vector>& queries,
const QueryConfig& config) override {
// Default implementation calls query() for each
return ANNSAlgorithm::query_batch(queries, config);
}
// Lifecycle
bool is_built() const override { return built_; }
void save(const std::string& path) override { /* save index */ }
void load(const std::string& path) override { /* load index */ }
private:
bool built_ = false;
Dimension dimension_ = 0;
// ... your data structures ...
};
- Create a factory:
class MyANNSFactory : public ANNSFactory {
public:
std::string algorithm_name() const override { return "my_anns"; }
std::unique_ptr<ANNSAlgorithm> create(
const DatabaseConfig& config) override {
return std::make_unique<MyANNS>();
}
AlgorithmParams default_build_params() const override {
AlgorithmParams params;
params.set("my_param", 42);
return params;
}
AlgorithmParams default_query_params() const override {
AlgorithmParams params;
params.set("search_depth", 10);
return params;
}
};
- Register the algorithm:
// In a .cpp file (NOT in a header)
REGISTER_ANNS_ALGORITHM(MyANNSFactory);
- Use it:
DatabaseConfig config(128);
config.anns_algorithm = "my_anns";
config.anns_build_params["my_param"] = "100";
SageVDB db(config);
#include <sage_vdb/fusion_strategies.h>
class MyFusionStrategy : public FusionStrategy {
public:
std::string name() const override { return "my_fusion"; }
Vector fuse(const std::unordered_map<std::string, Vector>& modality_vectors,
const std::unordered_map<std::string, float>& weights) override {
// Implement your fusion logic
Vector result;
// ... your implementation ...
return result;
}
};
// Register and use
auto strategy = std::make_shared<MyFusionStrategy>();
multimodal_db.register_fusion_strategy("my_fusion", strategy);
multimodal_db.set_fusion_strategy_by_name("my_fusion");
Main database class for vector operations.
Methods:
- add(vector, metadata) - Add single vector
- add_batch(vectors, metadata) - Batch add vectors
- remove(id) - Remove vector by ID
- update(id, vector, metadata) - Update existing vector
- search(query, k) - Find k nearest neighbors
- filtered_search(query, params, filter) - Search with metadata filtering
- batch_search(queries, params) - Batch search
- build_index() - Build/rebuild the index
- train_index(training_data) - Train index (for algorithms that need it)
- save(filepath) - Persist to disk
- load(filepath) - Load from disk
- size() - Number of vectors
- dimension() - Vector dimension
Extended database for multimodal data fusion.
Methods:
- register_modality(name, processor) - Register modality processor
- set_fusion_strategy(strategy) - Set fusion strategy
- add_multimodal(modality_data, metadata) - Add multimodal entry
- search_multimodal(query_data, k) - Multimodal search
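To make the fusion step concrete, here is a minimal, library-independent sketch of two of the strategies named in the feature list: concatenation and weighted average. It does not use SageVDB's FusionStrategy interface. `Vector`, `ModalityMap`, `fuse_concat`, and `fuse_weighted_average` are hypothetical names chosen for illustration, and the built-in strategies may behave differently (for example in how they order modalities or handle mismatched dimensions).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Vector = std::vector<float>;                  // assumption: stand-in for sage_vdb::Vector
using ModalityMap = std::map<std::string, Vector>;  // ordered map, so the output order is deterministic

// Concatenation: append each modality's embedding in key order.
Vector fuse_concat(const ModalityMap& modalities) {
    std::size_t total = 0;
    for (const auto& [name, vec] : modalities) total += vec.size();

    Vector fused;
    fused.reserve(total);
    for (const auto& [name, vec] : modalities) {
        fused.insert(fused.end(), vec.begin(), vec.end());
    }
    assert(fused.size() == total);  // fused dimension is the sum of the inputs
    return fused;
}

// Weighted average: sum(w_i * v_i) / sum(w_i).
// All vectors must share one dimension; a modality without an explicit
// weight defaults to 1.0.
Vector fuse_weighted_average(const ModalityMap& modalities,
                             const std::map<std::string, float>& weights) {
    if (modalities.empty()) return {};
    const std::size_t dim = modalities.begin()->second.size();

    Vector fused(dim, 0.0f);
    float total_weight = 0.0f;
    for (const auto& [name, vec] : modalities) {
        if (vec.size() != dim) {
            throw std::invalid_argument("dimension mismatch for modality: " + name);
        }
        const auto it = weights.find(name);
        const float w = (it != weights.end()) ? it->second : 1.0f;
        for (std::size_t i = 0; i < dim; ++i) fused[i] += w * vec[i];
        total_weight += w;
    }
    if (total_weight == 0.0f) throw std::invalid_argument("weights sum to zero");
    for (float& x : fused) x /= total_weight;
    return fused;
}
```

Concatenation yields a vector whose dimension is the sum of the modality dimensions (768 + 512 = 1280 for the text and image processors in the multimodal example), whereas a weighted average only applies when all modalities share one dimension.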
Low-level vector storage and retrieval.
Metadata management and filtering.
Search coordination and result ranking.
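As an illustration of what search coordination and result ranking involve for exact search, the following standalone sketch filters entries by a metadata predicate, scores the survivors by squared L2 distance, and keeps the k closest. It is not SageVDB's implementation: `Entry`, `Hit`, `squared_l2`, and `filtered_exact_search` are hypothetical names, and the real query engine may rank or filter differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Vector = std::vector<float>;  // assumption: stand-in for sage_vdb::Vector
using Metadata = std::unordered_map<std::string, std::string>;

struct Entry {
    std::uint64_t id;
    Vector vector;
    Metadata metadata;
};

struct Hit {
    std::uint64_t id;
    float score;  // squared L2 distance; smaller is closer
};

float squared_l2(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Brute-force scan: apply the filter first, score what remains, and return
// the k best hits in ascending distance order. An empty filter accepts all.
std::vector<Hit> filtered_exact_search(
        const std::vector<Entry>& entries,
        const Vector& query,
        std::size_t k,
        const std::function<bool(const Metadata&)>& filter) {
    std::vector<Hit> hits;
    for (const auto& e : entries) {
        if (filter && !filter(e.metadata)) continue;
        hits.push_back({e.id, squared_l2(e.vector, query)});
    }
    const std::size_t top = std::min(k, hits.size());
    std::partial_sort(hits.begin(),
                      hits.begin() + static_cast<std::ptrdiff_t>(top),
                      hits.end(),
                      [](const Hit& a, const Hit& b) { return a.score < b.score; });
    hits.resize(top);
    return hits;
}
```

Using `find` inside the predicate (rather than `at`) avoids an exception when an entry lacks the filtered key.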
struct DatabaseConfig {
IndexType index_type;
DistanceMetric metric;
Dimension dimension;
std::string anns_algorithm;
std::unordered_map<std::string, std::string> anns_build_params;
std::unordered_map<std::string, std::string> anns_query_params;
// ... index-specific params ...
};
struct SearchParams {
uint32_t k; // Number of results
uint32_t nprobe; // Search scope (IVF)
float radius; // Radius search
bool include_metadata; // Include metadata in results
};
IndexType values:
- FLAT - Brute force (exact)
- IVF_FLAT - Inverted file
- IVF_PQ - Inverted file with product quantization
- HNSW - Hierarchical NSW
- AUTO - Automatic selection
DistanceMetric values:
- L2 - Euclidean distance
- INNER_PRODUCT - Inner product
- COSINE - Cosine similarity
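For reference, the three metrics can be written out as a short standalone sketch. These are the textbook definitions, not SageVDB's internal kernels: the function names are hypothetical, and whether the library reports L2 as a plain or squared distance is not stated here, so treat the square root as an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;  // assumption: stand-in for sage_vdb::Vector

// Inner product: sum_i a_i * b_i (larger means more similar).
float inner_product(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Euclidean (L2) distance: sqrt(sum_i (a_i - b_i)^2) (smaller means closer).
float l2_distance(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Cosine similarity: <a, b> / (|a| * |b|), in [-1, 1].
// Returns 0 when either vector has zero norm, to avoid dividing by zero.
float cosine_similarity(const Vector& a, const Vector& b) {
    const float norm_a = std::sqrt(inner_product(a, a));
    const float norm_b = std::sqrt(inner_product(b, b));
    if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
    return inner_product(a, b) / (norm_a * norm_b);
}
```

Note that for unit-length vectors, cosine similarity equals the inner product, which is why embeddings are often normalized once at insertion time.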
SageVDB/
├── include/sage_vdb/              # Public headers
│   ├── common.h                   # Common types and constants
│   ├── sage_vdb.h                 # Main database interface
│   ├── multimodal_sage_vdb.h      # Multimodal extension
│   ├── vector_store.h             # Vector storage backend
│   ├── metadata_store.h           # Metadata management
│   ├── query_engine.h             # Search coordinator
│   ├── fusion_strategies.h        # Multimodal fusion
│   ├── modality_processors.h      # Modality handlers
│   └── anns/                      # ANNS plugin system
│       └── anns_interface.h       # Plugin interface
├── src/                           # Implementation
│   ├── sage_vdb.cpp
│   ├── vector_store.cpp
│   ├── metadata_store.cpp
│   ├── query_engine.cpp
│   ├── multimodal_sage_vdb.cpp
│   ├── fusion_strategies.cpp
│   └── anns/
│       ├── anns_interface.cpp
│       ├── register_builtin_algorithms.cpp
│       ├── brute_force_plugin.h
│       ├── brute_force_plugin.cpp
│       ├── faiss_plugin.h
│       └── faiss_plugin.cpp
├── tests/                         # Unit tests
│   ├── test_sage_vdb.cpp
│   └── test_multimodal.cpp
├── cmake/                         # CMake modules
│   ├── FindBLASLAPACK.cmake
│   └── gperftools.cmake
├── build/                         # Build output (generated)
├── install/                       # Install output (generated)
├── CMakeLists.txt                 # Build configuration
├── build.sh                       # Build script
└── README.md                      # This file
# Build and run all tests
cd build
make test
# Run with verbose output
ctest -V
# Run specific test
./test_sage_vdb
./test_multimodal
# Enable profiling
cmake -B build -DENABLE_GPERFTOOLS=ON
cmake --build build
# Run with profiler
CPUPROFILE=sage_vdb.prof ./build/test_sage_vdb
google-pprof --text ./build/test_sage_vdb sage_vdb.prof
GitHub Actions workflows are configured in .github/workflows/:
- ci-tests.yml - Full test suite on push/PR
- quick-test.yml - Fast smoke tests
If you encounter GLIBCXX_3.4.30 errors in conda environments:
# Update libstdc++ in conda
conda install -c conda-forge libstdcxx-ng -y
# Or use system libstdc++
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
The build script (build.sh) automatically detects and handles this.
If FAISS is not detected but you have it installed:
# Set FAISS_ROOT before building
export FAISS_ROOT=/path/to/faiss
cmake -B build -DFAISS_ROOT=$FAISS_ROOT
Or install via conda:
conda install -c conda-forge faiss-cpu
# or
conda install -c conda-forge faiss-gpu
OpenMP is optional but recommended for performance:
# Disable OpenMP if unavailable
cmake -B build -DUSE_OPENMP=OFF
- Use batch operations when adding/querying multiple vectors
- Choose appropriate index type:
  - < 10K vectors: Use FLAT (exact search)
  - 10K-1M vectors: Use IVF_FLAT or HNSW
  - > 1M vectors: Use IVF_PQ for memory efficiency
- Enable OpenMP for parallel processing
- Tune ANNS parameters based on your accuracy/speed tradeoff
- Pre-allocate memory for large datasets
- Use metadata filtering to reduce search space
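The size thresholds in the tips above can be captured in a small helper, shown here as a hedged sketch rather than anything SageVDB ships. `suggest_index_type` and `estimated_flat_bytes` are hypothetical names. The second function estimates the raw float32 storage of an uncompressed index, which is the reason the list points to IVF_PQ for memory efficiency at large scale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Maps a dataset size to the index type suggested in the tips list.
// For the 10K-1M range the list offers IVF_FLAT or HNSW; this sketch
// arbitrarily returns IVF_FLAT.
std::string_view suggest_index_type(std::size_t num_vectors) {
    if (num_vectors < 10'000) return "FLAT";
    if (num_vectors <= 1'000'000) return "IVF_FLAT";
    return "IVF_PQ";
}

// Raw storage for uncompressed float32 vectors: n * dim * 4 bytes.
// Ignores index overhead and metadata, so it is a lower bound.
std::uint64_t estimated_flat_bytes(std::uint64_t num_vectors, std::uint64_t dimension) {
    assert(dimension > 0);
    return num_vectors * dimension * sizeof(float);
}
```

For example, one million 768-dimensional vectors need about 3.07 GB for the raw floats alone, before any index structures are added.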
SageVDB is designed to be service-friendly and can seamlessly integrate with SAGE's multi-threaded service architecture:
// Read operations are thread-safe (concurrent reads allowed)
// Write operations should be serialized
std::vector<QueryResult> results = db.search(query, 10); // Thread-safe
If you plan to upgrade SageVDB to a fully multi-threaded engine, you have several options:
Option 1: Internal Locking (Recommended for Service Use)
class SageVDB {
private:
mutable std::shared_mutex rw_mutex_; // Reader-writer lock
public:
VectorId add(const Vector& vector, const Metadata& metadata = {}) {
std::unique_lock<std::shared_mutex> lock(rw_mutex_);
// ... add implementation ...
}
std::vector<QueryResult> search(const Vector& query, uint32_t k) const {
std::shared_lock<std::shared_mutex> lock(rw_mutex_); // Multiple readers
// ... search implementation ...
}
};
Option 2: Lock-Free Data Structures
// Use concurrent data structures for high-throughput scenarios
#include <tbb/concurrent_vector.h>
#include <tbb/concurrent_hash_map.h>
class VectorStore {
private:
tbb::concurrent_vector<Vector> vectors_;
tbb::concurrent_hash_map<VectorId, size_t> id_to_index_;
};
Option 3: Thread-Local Index Copies (Read-Heavy Workloads)
class SageVDB {
private:
// Immutable index snapshot; std::atomic<std::shared_ptr<T>> is C++20 (<atomic>, <memory>)
std::atomic<std::shared_ptr<const Index>> shared_index_;
std::atomic<int> version_{0};
public:
void rebuild_index() {
// Build new index
auto new_index = std::make_shared<Index>(/* ... */);
shared_index_.store(std::move(new_index)); // Atomic swap
version_.fetch_add(1);
}
};
The good news: SAGE's service architecture is designed to handle multi-threaded backends!
# SAGE's ServiceManager handles thread safety automatically (simplified view)
import threading
from concurrent.futures import ThreadPoolExecutor

class ServiceManager:
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=10)
        self._lock = threading.Lock()

    def call_sync(self, service_name, *args, **kwargs):
        # Each service call runs in isolated context
        # Your multi-threaded SageVDB is safe here!
        # (service lookup by name is omitted in this simplified view)
        return service.method(*args, **kwargs)

    def call_async(self, service_name, *args, **kwargs):
        # Async calls use thread pool
        # Multiple concurrent requests are handled properly
        return self._executor.submit(self.call_sync, ...)

Even with a multi-threaded SageVDB engine, the service wrapper remains simple:
# packages/sage-middleware/.../sage_vdb_service.py
from threading import Lock
from typing import List, Optional

import numpy as np

class SageVDBService:
    """Thread-safe service wrapper for multi-threaded SageVDB."""

    def __init__(self, dimension: int = 768):
        self._db = SageVDB.from_config(DatabaseConfig(dimension))
        # Optional: Add Python-level locking if C++ doesn't provide it
        self._write_lock = Lock()

    def add(self, vector: np.ndarray, metadata: Optional[dict] = None) -> int:
        # Option A: If SageVDB has internal locking, just call it
        return self._db.add(vector, metadata or {})
        # Option B: If you need Python-level coordination
        # with self._write_lock:
        #     return self._db.add(vector, metadata or {})

    def search(self, query: np.ndarray, k: int = 5) -> List[dict]:
        # Read operations are typically thread-safe
        # No locking needed if C++ provides read concurrency
        results = self._db.search(query, k=k)
        return [{"id": r.id, "score": r.score, "metadata": r.metadata}
                for r in results]

from sage.kernel.api.local_environment import LocalEnvironment
from sage.kernel.api.function.map_function import MapFunction

class VectorSearch(MapFunction):
    def execute(self, data):
        # Concurrent calls are safe!
        # SAGE's ServiceManager handles thread coordination
        results = self.call_service("sage_vdb", data["query"], method="search", k=10)
        # Or async for higher throughput
        future = self.call_service_async("sage_vdb", data["query"], method="search", k=10)
        results = future.result(timeout=5.0)
        return results

# Register multi-threaded SageVDB service
env = LocalEnvironment()
env.register_service("sage_vdb", lambda: SageVDBService(dimension=768))
# Multiple concurrent requests work fine
(
    env.from_batch(QuerySource, queries)
    .map(VectorSearch)  # Can run in parallel
    .sink(ResultSink)
)
env.submit()

// For SAGE service integration, prefer these patterns:
// Pattern A: Reader-Writer Lock (balanced read/write)
class SageVDB {
mutable std::shared_mutex mutex_;
// Readers don't block each other
// Writers have exclusive access
};
// Pattern B: Partitioned Locking (high concurrency)
class SageVDB {
static constexpr size_t NUM_PARTITIONS = 16;
std::array<std::mutex, NUM_PARTITIONS> partition_locks_;
size_t get_partition(VectorId id) {
return id % NUM_PARTITIONS;
}
};
// Pattern C: Lock-Free (expert mode)
class SageVDB {
std::atomic<Index*> current_index_;
// RCU-style updates
};
// In Python bindings, release GIL for long operations
#include <pybind11/pybind11.h>
namespace py = pybind11;
py::class_<SageVDB>(m, "SageVDB")
.def("search", [](const SageVDB& db, const Vector& query, int k) {
// Release the Python GIL for the duration of the C++ search. It is
// re-acquired when `release` goes out of scope, before the result
// is converted back to Python objects.
py::gil_scoped_release release;
return db.search(query, k);
}, "Perform vector search");

import threading

class SageVDBServicePool:
    """Pool of SageVDB instances for maximum concurrency."""

    def __init__(self, dimension: int, pool_size: int = 4):
        self._pool = [SageVDB(DatabaseConfig(dimension))
                      for _ in range(pool_size)]
        self._current = 0
        self._lock = threading.Lock()

    def get_instance(self) -> SageVDB:
        with self._lock:
            idx = self._current
            self._current = (self._current + 1) % len(self._pool)
            return self._pool[idx]

    def search(self, query, k=10):
        # Round-robin across instances
        db = self.get_instance()
        return db.search(query, k)

| Scenario | Single-Threaded | Multi-Threaded (4 cores) | Speedup |
|---|---|---|---|
| Concurrent Reads (1M vectors) | 100 QPS | 380 QPS | 3.8x |
| Mixed Read/Write (90/10) | 85 QPS | 240 QPS | 2.8x |
| Batch Insert (10K vectors) | 12K/sec | 35K/sec | 2.9x |
If you're upgrading SageVDB to multi-threaded:
- Add std::shared_mutex or equivalent to core data structures
- Protect index updates with exclusive locks
- Allow concurrent reads with shared locks
- Release Python GIL in pybind11 bindings for long operations
- Add thread-safety tests (see tests/test_thread_safety.cpp)
- Update documentation to specify thread-safety guarantees
- Consider lock-free alternatives for hot paths
- Profile under concurrent load (use perf or gperftools)
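The first three checklist items, together with a basic thread-safety test, can be sketched in isolation as follows. This is an illustrative stand-in, not the contents of tests/test_thread_safety.cpp: `GuardedStore` and `run_concurrent_smoke_test` are hypothetical names, and a real test suite would typically also run under a tool such as ThreadSanitizer.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <utility>
#include <vector>

using Vector = std::vector<float>;  // assumption: stand-in for sage_vdb::Vector

// Minimal store guarded by a reader-writer lock:
// writers take an exclusive lock, readers take a shared lock.
class GuardedStore {
public:
    std::size_t add(Vector v) {
        std::unique_lock lock(mutex_);  // exclusive: blocks readers and other writers
        vectors_.push_back(std::move(v));
        return vectors_.size() - 1;
    }

    std::size_t size() const {
        std::shared_lock lock(mutex_);  // shared: readers do not block each other
        return vectors_.size();
    }

    Vector get(std::size_t i) const {
        std::shared_lock lock(mutex_);
        assert(i < vectors_.size());
        return vectors_[i];  // return a copy so no reference outlives the lock
    }

private:
    mutable std::shared_mutex mutex_;
    std::vector<Vector> vectors_;
};

// Smoke test: several writers insert concurrently while one reader polls size().
// Returns the final element count, which should equal writers * per_writer.
std::size_t run_concurrent_smoke_test(std::size_t writers, std::size_t per_writer) {
    GuardedStore store;
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < writers; ++w) {
        threads.emplace_back([&store, per_writer] {
            for (std::size_t i = 0; i < per_writer; ++i) store.add(Vector(4, 1.0f));
        });
    }
    threads.emplace_back([&store] {
        for (int i = 0; i < 1000; ++i) (void)store.size();
    });
    for (auto& t : threads) t.join();
    return store.size();
}
```

Returning copies from `get` trades some throughput for safety; handing out references would require callers to hold the lock themselves.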
class SageVDB {
private:
mutable std::shared_mutex index_mutex_;
std::unique_ptr<ANNSAlgorithm> index_;
public:
void rebuild_index() {
// Build new index without holding lock
auto new_index = create_new_index();
new_index->fit(vectors_, build_params_); // fit() takes data and AlgorithmParams; build_params_ is an assumed member
// Quick swap under exclusive lock
{
std::unique_lock lock(index_mutex_);
index_.swap(new_index);
}
// old index destroyed here (outside lock)
}
std::vector<QueryResult> search(const Vector& query, uint32_t k) const {
// Shared lock allows concurrent searches
std::shared_lock lock(index_mutex_);
return index_->query(query, QueryConfig{k});
}
};
Yes, SageVDB can absolutely work as a SAGE service even when multi-threaded!
Why it works:
- SAGE's ServiceManager already handles concurrent service calls
- Thread pool executor isolates each request
- Python GIL can be released in C++ for true parallelism
- Service wrapper can add additional coordination if needed
Recommended approach:
- Add internal locking to SageVDB C++ code (reader-writer pattern)
- Release GIL in Python bindings for compute-intensive operations
- Keep service wrapper simple - let C++ handle thread safety
- Use call_service_async for high concurrency in pipelines
No breaking changes needed:
- Service interface remains identical
- Existing SAGE pipelines work without modification
- Performance improves automatically with multi-threading
Python bindings are provided in ../python/ using pybind11:
import _sage_vdb
config = _sage_vdb.DatabaseConfig(128)
db = _sage_vdb.SageVDB(config)
# ... use from Python ...
Use the optional sage-anns Python backend (no C++ rebuild required):
from sagevdb import create_database
db = create_database(
    128,
    backend="sage-anns",
    algorithm="faiss_hnsw",
    metric="l2",
    M=32,
    ef_construction=200,
)
See ../README.md for Python API documentation.
Link against libsage_vdb.so:
find_library(sage_vdb_LIB sage_vdb HINTS ${sage_vdb_ROOT}/lib)
target_link_libraries(my_app ${sage_vdb_LIB})
- ANNS Plugin Guide - Detailed plugin development
- Multimodal Design - Architecture overview
- Multimodal Features - Multimodal usage guide
- Parent README - SageVDB middleware documentation
We welcome contributions! Please:
- Follow C++20 best practices
- Add tests for new features
- Update documentation
- Run clang-format before committing: clang-format -i $(find src include -name '*.cpp' -o -name '*.h')
This project is part of the SAGE system. See the LICENSE file in the repository root.
- Inspired by big-ann-benchmarks
- FAISS integration from Facebook AI
- Built with modern C++20 features
Part of the SAGE Project - Documentation | Issues
| Component | Status | Latest Version |
|---|---|---|
| isage-vdb | | 0.1.5 |