GEMINI.md

Project: Local Similarity Search Engine

Overview

A high-performance similarity search engine that runs entirely on the local machine. It finds the items most similar to a given query in a large dataset, with no cloud dependencies, prioritizing privacy and performance.

Goals

Privacy-Focused: All data and computations remain local.
High Performance & Scalability:
- Support for 1M+ images/content items.
- Efficient handling of multi-TB datasets.
- Optimized for high-throughput disk reads and low-latency search.
Simple API: Easy-to-use Python API and CLI.

Architecture

┌────────────────────────────────────────────────────┐
│  main.py (CLI: ingest / search / create-index)     │
├────────────────────────────────────────────────────┤
│  similarity_engine.py (SimilarityEngine)           │
│    ├── Lazy CLIP model loading (open_clip)         │
│    ├── Parallel ingestion pipeline                 │
│    ├── IVF-PQ index creation                       │
│    └── Cross-modal search (image + text)           │
├────────────────────────────────────────────────────┤
│  ingestion.py                                      │
│    ├── ImageBatchIterator (lazy dir walk)          │
│    ├── ThreadPool image loading & preprocessing    │
│    └── Batched CLIP inference (GPU/CPU)            │
├────────────────────────────────────────────────────┤
│  LanceDB (disk-backed, memory-mapped, Arrow)       │
│    └── IVF-PQ index for sub-100ms ANN search       │
└────────────────────────────────────────────────────┘
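The ingestion layer in the diagram (thread-pool loading feeding batched inference) might look roughly like the sketch below. It is not the project's ingestion.py: `load_bytes`, `fake_embed`, and `ingest` are hypothetical names, and the embedding function is a stand-in so the example runs without CLIP or a GPU.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Vector = List[float]


def load_bytes(path: str) -> bytes:
    """Stand-in for image decoding and preprocessing (hypothetical)."""
    return Path(path).read_bytes()


def fake_embed(batch: Sequence[bytes]) -> List[Vector]:
    """Stand-in for batched CLIP inference (hypothetical): one 2-d vector per item."""
    return [[float(len(item)), float(sum(item) % 251)] for item in batch]


def ingest(
    path_batches: Iterable[Sequence[str]],
    embed: Callable[[Sequence[bytes]], List[Vector]] = fake_embed,
    workers: int = 8,
) -> Iterator[Tuple[List[str], List[Vector]]]:
    """Load each batch of files in parallel, then embed the whole batch at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for paths in path_batches:
            # Threads overlap the disk reads; pool.map preserves input order,
            # so row i of the embeddings corresponds to paths[i].
            items = list(pool.map(load_bytes, paths))
            yield list(paths), embed(items)
```

Threads suit the loading step because it is dominated by disk I/O, while the model call stays a single batched operation per batch.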

Technology Stack

| Component    | Technology                | Why                                                        |
|--------------|---------------------------|------------------------------------------------------------|
| Vector Store | LanceDB                   | Disk-native, memory-mapped, zero-copy, handles data >> RAM |
| Embeddings   | CLIP ViT-B/32 (open_clip) | Strong image+text embeddings, runs locally                 |
| Indexing     | IVF-PQ (LanceDB built-in) | Sub-100ms search at million scale                          |
| Parallel I/O | concurrent.futures        | ThreadPool for I/O, batched GPU inference                  |

Performance Requirements

Query Latency: Under 100 ms per ANN search over 1M+ items.
Indexing Throughput: Saturate local I/O and compute during ingestion.
Scalability: Ingestion cost should grow linearly with dataset size, up to multi-TB datasets.
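One way to check the latency target is to time repeated queries and look at percentiles rather than the mean. The helper below is a minimal sketch, not the project's bench_search.py; `measure_latency_ms` and its parameters are hypothetical, and `run_query` stands for any zero-argument callable that performs one search.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    run_query: Callable[[], object],
    repeats: int = 100,
    warmup: int = 5,
) -> Dict[str, float]:
    """Time `run_query` repeatedly and return latency percentiles in milliseconds."""
    # Warm-up calls absorb one-off costs such as lazy model loading or cold caches.
    for _ in range(warmup):
        run_query()

    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)

    # n=100 yields 99 cut points; "inclusive" keeps them within [min, max].
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "max": max(samples)}
```

Comparing `p95` or `p99` against the 100 ms budget is stricter than comparing the median, since tail latency is what a user notices.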

Project Structure

image-similarity/
├── main.py                  # CLI entry point
├── app.py                   # Streamlit GUI
├── similarity_engine.py     # Core engine (LanceDB + CLIP)
├── ingestion.py             # Parallel ingestion pipeline
├── datasets.py              # Benchmark dataset downloader
├── requirements.txt         # Python dependencies
├── tests/
│   └── test_engine.py       # Unit tests
├── benchmarks/
│   └── bench_search.py      # Latency & throughput benchmarks
├── GEMINI.md                # This file
└── README.md                # User-facing docs

CLI Usage

# Ingest images
python main.py ingest --data-dir /path/to/images --batch-size 256 --workers 8

# Search by text
python main.py search --query "a red car" --top-k 10

# Search by image
python main.py search --query /path/to/query.jpg --top-k 10

# Build ANN index (after ingestion)
python main.py create-index

# Show stats
python main.py stats

# Download benchmark datasets
python main.py download --list
python main.py download --dataset cifar10 --dest ./data

# One-shot demo: download → ingest → search
python main.py demo --dataset cifar10 --query "airplane" --top-k 5

# Graphical User Interface
streamlit run app.py
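The subcommand layout above maps naturally onto argparse subparsers. The sketch below is an assumption about how such a CLI could be wired, not the contents of main.py: the defaults are taken from the example values above, the `download` and `demo` subcommands are omitted for brevity, and `query_kind` is a hypothetical helper that guesses whether `--query` is an image path or text.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the ingest / search / create-index / stats commands."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="embed and store images")
    ingest.add_argument("--data-dir", required=True)
    ingest.add_argument("--batch-size", type=int, default=256)  # assumed default
    ingest.add_argument("--workers", type=int, default=8)       # assumed default

    search = sub.add_parser("search", help="search by text or image")
    search.add_argument("--query", required=True)
    search.add_argument("--top-k", type=int, default=10)        # assumed default

    sub.add_parser("create-index", help="build the ANN index")
    sub.add_parser("stats", help="show table statistics")
    return parser


def query_kind(query: str) -> str:
    """Treat the query as an image if it names an existing file, else as text."""
    return "image" if os.path.isfile(query) else "text"
```

For example, `build_parser().parse_args(["search", "--query", "a red car"])` yields a namespace with `command == "search"` and `top_k == 10`.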

Current Status

Core Implementation: Complete (similarity_engine.py, ingestion.py, main.py).
Tests: tests/test_engine.py covers the batch iterator and LanceDB integration.
Benchmarks: benchmarks/bench_search.py measures search latency and throughput.
Pending: Install dependencies, run tests, run benchmarks, validate at scale.

Development Notes

The CLIP model is lazy-loaded on first use to keep import time fast.
ImageBatchIterator walks directories with os.walk and yields lazily, so multi-TB directory trees are never fully listed in memory.
Embeddings are L2-normalized before storage, so the inner product of two stored vectors equals their cosine similarity.
Create the IVF-PQ index after ingestion for optimal search performance.
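Two of the notes above, the lazy directory walk and L2 normalization, can be illustrated with the standard library alone. This is a minimal sketch, not the project's ImageBatchIterator: `iter_image_batches`, `l2_normalize`, `dot`, and the extension filter are all assumptions made for the example.

```python
import math
import os
from typing import Iterator, List, Sequence

# Assumed set of image extensions; the real filter is not specified in this document.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def iter_image_batches(root: str, batch_size: int = 256) -> Iterator[List[str]]:
    """Yield lists of image paths, holding at most one batch in memory at a time."""
    batch: List[str] = []
    # os.walk is itself a generator, so directories are listed one at a time.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                batch.append(os.path.join(dirpath, name))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product; for unit-length inputs this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

Because every stored vector has unit length, ranking by inner product and ranking by cosine similarity give the same order.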
