A high-performance similarity search engine that runs entirely on the local machine. Finds the most similar items to a given query from a large dataset without cloud dependencies, prioritizing privacy and performance.
- Privacy-Focused: All data and computations remain local.
- High Performance & Scalability:
  - Support for 1M+ images/content items.
  - Efficient handling of multi-TB datasets.
  - Optimized for high-throughput disk reads and low-latency search.
- Simple API: Easy-to-use Python API and CLI.
```
┌───────────────────────────────────────────────┐
│ main.py (CLI: ingest / search / create-index) │
├───────────────────────────────────────────────┤
│ similarity_engine.py (SimilarityEngine)       │
│ ├── Lazy CLIP model loading (open_clip)       │
│ ├── Parallel ingestion pipeline               │
│ ├── IVF-PQ index creation                     │
│ └── Cross-modal search (image + text)         │
├───────────────────────────────────────────────┤
│ ingestion.py                                  │
│ ├── ImageBatchIterator (lazy dir walk)        │
│ ├── ThreadPool image loading & preprocessing  │
│ └── Batched CLIP inference (GPU/CPU)          │
├───────────────────────────────────────────────┤
│ LanceDB (disk-backed, memory-mapped, Arrow)   │
│ └── IVF-PQ index for sub-100ms ANN search     │
└───────────────────────────────────────────────┘
```
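The lazy model loading shown in the diagram can be sketched as below. This is an illustrative pattern only — `SimilarityEngineSketch` and its method names are assumptions, not the actual `similarity_engine.py` API:

```python
# Sketch of the lazy-loading pattern (names are illustrative assumptions,
# not the real SimilarityEngine implementation).
class SimilarityEngineSketch:
    def __init__(self, model_name: str = "ViT-B-32"):
        self.model_name = model_name
        self._model = None  # heavy model is NOT loaded at construction time

    @property
    def model(self):
        # Load on first access only, keeping import and CLI startup fast.
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def _load_model(self):
        # Real code would call open_clip.create_model_and_transforms(...);
        # a placeholder keeps this sketch dependency-free.
        return f"loaded:{self.model_name}"

engine = SimilarityEngineSketch()
assert engine._model is None  # commands like `stats` never pay the load cost
_ = engine.model              # first search triggers the load
```

Commands that never embed anything (e.g. `stats`) therefore never pay the model-load cost.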
| Component | Technology | Why |
|---|---|---|
| Vector Store | LanceDB | Disk-native, memory-mapped, zero-copy, handles data >> RAM |
| Embeddings | CLIP ViT-B/32 (open_clip) | Strong image+text embeddings, runs locally |
| Indexing | IVF-PQ (LanceDB built-in) | Sub-100ms search at million scale |
| Parallel I/O | concurrent.futures | ThreadPool for I/O, batched GPU inference |
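The parallel-I/O row can be illustrated with a minimal, dependency-free sketch: a thread pool overlaps the I/O-bound loads while results are consumed in inference-sized batches. The `load` function and file names are stand-ins for real image decoding:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def load(path: str) -> bytes:
    # Stand-in for I/O-bound work (disk read + image decode in the real pipeline).
    return path.encode()

def batches(iterable, size):
    # Regroup a stream into fixed-size chunks without materializing it all.
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

paths = [f"img_{i}.jpg" for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = pool.map(load, paths)      # threads overlap the blocking I/O
    for batch in batches(loaded, 4):    # real code: batched CLIP inference here
        pass
```

Threads (not processes) suffice here because the waiting is on disk, while the compute-heavy step runs as batched inference on the GPU/CPU.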
- Query Latency: < 100ms for ANN search at 1M+ scale.
- Indexing Throughput: Saturate local I/O and compute during ingestion.
- Scalability: Linear scaling up to multi-TB datasets.
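A latency measurement in the spirit of `benchmarks/bench_search.py` can be sketched as follows; the harness below is an assumption about shape, not the actual benchmark code, and the lambda stands in for a real `engine.search` call:

```python
import statistics
import time

def measure_latency_ms(search_fn, queries, runs_per_query=5):
    # Time each query several times; the median is less sensitive to
    # warm-up and GC noise than the mean.
    samples = []
    for q in queries:
        for _ in range(runs_per_query):
            t0 = time.perf_counter()
            search_fn(q)
            samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Dummy stand-in for engine search so the sketch runs anywhere.
median_ms = measure_latency_ms(lambda q: sorted(range(1000)), ["a red car", "airplane"])
```

The < 100ms target would then be asserted against this median at 1M+ scale.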
```
image-similarity/
├── main.py                 # CLI entry point
├── app.py                  # Streamlit GUI
├── similarity_engine.py    # Core engine (LanceDB + CLIP)
├── ingestion.py            # Parallel ingestion pipeline
├── datasets.py             # Benchmark dataset downloader
├── requirements.txt        # Python dependencies
├── tests/
│   └── test_engine.py      # Unit tests
├── benchmarks/
│   └── bench_search.py     # Latency & throughput benchmarks
├── GEMINI.md               # This file
└── README.md               # User-facing docs
```
```bash
# Ingest images
python main.py ingest --data-dir /path/to/images --batch-size 256 --workers 8

# Search by text
python main.py search --query "a red car" --top-k 10

# Search by image
python main.py search --query /path/to/query.jpg --top-k 10

# Build ANN index (after ingestion)
python main.py create-index

# Show stats
python main.py stats

# Download benchmark datasets
python main.py download --list
python main.py download --dataset cifar10 --dest ./data

# One-shot demo: download → ingest → search
python main.py demo --dataset cifar10 --query "airplane" --top-k 5

# Graphical user interface
streamlit run app.py
```
- Core Implementation: Complete — `similarity_engine.py`, `ingestion.py`, `main.py`.
- Tests: `tests/test_engine.py` — unit tests for iterator, LanceDB integration.
- Benchmarks: `benchmarks/bench_search.py` — latency & throughput measurement.
- Pending: Install deps, run tests, run benchmarks, validate at scale.
- Model is lazy-loaded to keep import time fast.
- `ImageBatchIterator` uses `os.walk` with lazy yielding to handle multi-TB directories without memory pressure.
- Embeddings are L2-normalized before storage for cosine similarity.
- The IVF-PQ index should be created after ingestion for optimal search performance.
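The notes on lazy directory walking and L2 normalization can be sketched together. The iterator below is a simplified, dependency-free stand-in for `ImageBatchIterator`, not the actual `ingestion.py` code:

```python
import math
import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed extension filter

def iter_image_batches(root: str, batch_size: int = 256):
    # os.walk yields one directory at a time, so a multi-TB tree is never
    # listed in full; only one batch of paths is held in memory.
    batch = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                batch.append(os.path.join(dirpath, name))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch  # final partial batch

def l2_normalize(vec):
    # Unit-length vectors make the inner product equal to cosine similarity,
    # which is what the ANN index then searches over.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

assert l2_normalize([3.0, 4.0]) == [0.6, 0.8]
```

Normalizing once at ingestion time keeps the search path cheap: queries are normalized the same way, and ranking reduces to a dot product.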