Levy

Levy is a semantic caching engine for LLM APIs, built as the IT artefact of an MSc Artificial Intelligence capstone project (University of Liverpool). It sits between your application and an LLM provider (Mock, OpenAI-compatible, or Ollama today; an Anthropic connector is planned) and reduces cost and latency by reusing responses for identical or semantically similar prompts.

The research behind Levy benchmarks false positive rates of semantic caching across embedding models (all-MiniLM vs ModernBERT), workloads (FAQ, code, chat), and similarity thresholds. The authoritative project definition lives in docs/Project_Proposal.md and docs/Specification_and_Design_Report.md (university submissions — do not modify).
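As a rough illustration of what a threshold sweep measures, the sketch below computes a false positive rate from labeled prompt pairs. The `false_positive_rate` helper, the made-up scores, and the definition used here (the share of non-equivalent pairs whose similarity clears the threshold) are assumptions for illustration, not the project's evaluation code or results.

```python
from typing import List, Tuple

# Hypothetical labeled pair: (similarity score, are the two prompts truly equivalent?)
Pair = Tuple[float, bool]


def false_positive_rate(pairs: List[Pair], threshold: float) -> float:
    """Share of non-equivalent pairs that a cache would wrongly treat as a hit."""
    negatives = [score for score, equivalent in pairs if not equivalent]
    if not negatives:
        return 0.0
    false_hits = sum(1 for score in negatives if score >= threshold)
    return false_hits / len(negatives)


# Invented scores, only to show the shape of the calculation.
pairs = [(0.95, True), (0.88, False), (0.82, True),
         (0.76, False), (0.60, False), (0.91, True)]

for threshold in (0.70, 0.80, 0.90):
    rate = false_positive_rate(pairs, threshold)
    print(f"threshold {threshold:.2f} -> false positive rate {rate:.2f}")
```

Raising the threshold trades cache hits for fewer wrong answers, which is the trade-off the study sweeps across models and workloads.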

Features

Exact Match Caching: Extremely fast retrieval for identical prompts.
Semantic Caching: Uses vector embeddings (via sentence-transformers) to find and reuse answers for queries with similar meaning (e.g., "What is the capital of France?" vs "Tell me France's capital").
Metrics: Automatically tracks cache hit rates, latency and estimated token savings.
Pluggable Architecture: Easy to swap LLM providers or Vector Stores.
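To make the difference between the exact and semantic tiers concrete, here is a minimal sketch of a two-tier lookup. It is not Levy's implementation: the `TwoTierCache` class, the hand-written toy vectors, and the `'semantic_cache'` source label are hypothetical, and the score uses the 1/(1+L2) form mentioned in the Configuration section.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


class TwoTierCache:
    """Toy cache: exact dictionary lookup first, then nearest stored embedding."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.exact: Dict[str, str] = {}
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, prompt: str, answer: str) -> None:
        self.exact[prompt] = answer
        self.entries.append((self.embed(prompt), answer))

    def get(self, prompt: str) -> Optional[Tuple[str, str]]:
        # Tier 1: identical prompt string.
        if prompt in self.exact:
            return self.exact[prompt], "exact_cache"
        # Tier 2: closest stored embedding, scored as 1 / (1 + L2 distance).
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = 1.0 / (1.0 + math.dist(query, vector))
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer, "semantic_cache"  # label is an assumption
        return None


# Hand-written 2-d vectors stand in for a real embedding model.
toy_vectors = {
    "What is the capital of France?": [0.90, 0.10],
    "Tell me France's capital": [0.88, 0.12],
    "How do I reverse a list in Python?": [0.10, 0.95],
}

cache = TwoTierCache(embed=toy_vectors.__getitem__, threshold=0.85)
cache.put("What is the capital of France?", "Paris")

print(cache.get("What is the capital of France?"))      # exact tier
print(cache.get("Tell me France's capital"))            # semantic tier
print(cache.get("How do I reverse a list in Python?"))  # miss
```

A real deployment would replace the linear scan with a vector index; the scan keeps the sketch dependency-free.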

Project Structure

```
levy/
├── levy/                    # Core package
│   ├── cache/               # Cache logic (Exact, Semantic, InMemory/Redis stores)
│   ├── llm_client.py        # LLM interaction (Mock, OpenAI, Ollama)
│   ├── embeddings.py        # EmbeddingClient ABC + Mock, SentenceTransformer, Ollama
│   ├── embedding_manager.py # EmbeddingManager: study-model registry, runtime switching,
│   │                        #   memoization, symmetric prefix handling
│   ├── engine.py            # Main orchestration engine
│   ├── config.py            # LevyConfig (providers, thresholds, store)
│   ├── metrics.py           # Hit/miss/latency/token-savings tracking
│   └── models.py            # Data classes
├── docs/                    # Research docs (proposal & S&D report are frozen)
├── examples/                # Demo scripts
└── tests/                   # Unit tests
```

Installation

Using Conda (Recommended)

Ensure you have Conda installed.
Create the environment:
```
conda env create -f environment.yml
```
Activate the environment:
```
conda activate levy
```

Running Tests

```
python -m unittest discover -s tests -p "test_*.py"
# or, if pytest is installed:
python -m pytest tests/ -q
```

Usage

Quick Start (Python)

```
from levy import LevyEngine, LevyConfig

# Initialize with defaults (Mock LLM, Exact Cache only)
engine = LevyEngine()

# First call - hits the "LLM"
result1 = engine.generate("Hello world")
print(result1.source)  # 'llm'

# Second call - hits the cache
result2 = engine.generate("Hello world")
print(result2.source)  # 'exact_cache'
```

Running the Experiment Script

A replay script is provided to demonstrate the cache behavior:

```
python examples/simple_replay.py
```

It runs a sequence of prompts through three configurations:

No Cache
Exact Cache Only
Semantic Cache (uses sentence-transformers if available)

Running with Ollama (Local Models)

Install and run Ollama.

Pull required models:

```
ollama pull qwen3
ollama pull nomic-embed-text
```

Run the demo:
```
python examples/ollama_demo.py
```

Using Redis Stack (Docker)

To use Redis for persistence:

Start Redis:
```
docker-compose up -d
```
Configure LevyConfig to use cache_store_type="redis".

Configuration

You can configure Levy via LevyConfig:

```
config = LevyConfig(
    llm_provider="openai",
    openai_api_key="sk-...",
    enable_semantic_cache=True,
    similarity_threshold=0.85,   # in 1/(1+L2) space; study sweep: 0.70–0.90
    # Embedding model for the study (default: all-MiniLM-L6-v2 baseline)
    embedding_provider="sentence-transformers",
    embedding_model="all-MiniLM-L6-v2",   # or "modernbert" for the second study model
    # Vector index backend (default: "auto" → Faiss HNSW if installed, else brute-force)
    vector_index_backend="auto",  # "auto" | "faiss" | "brute_force"
)
```
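The `similarity_threshold` comment says the value lives in 1/(1+L2) space. As a worked example, the sketch below converts a threshold into the largest L2 distance that still counts as a hit. It assumes the score is exactly `1 / (1 + d)` with `d` the distance reported by the index; whether Levy uses plain or squared L2, or normalized vectors, is not stated here, and both helper names are hypothetical.

```python
def similarity_from_l2(distance: float) -> float:
    """Map an L2 distance (>= 0) to a score in (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + distance)


def max_l2_for_threshold(threshold: float) -> float:
    """Invert the mapping: the largest distance whose score still meets the threshold."""
    return 1.0 / threshold - 1.0


for threshold in (0.70, 0.85, 0.90):
    limit = max_l2_for_threshold(threshold)
    print(f"threshold {threshold:.2f} -> L2 distance <= {limit:.4f}")
# threshold 0.70 -> L2 distance <= 0.4286
# threshold 0.85 -> L2 distance <= 0.1765
# threshold 0.90 -> L2 distance <= 0.1111
```

Under this assumption, the 0.70 to 0.90 sweep corresponds to distance limits from roughly 0.43 down to 0.11, so small threshold changes at the high end tighten the match noticeably.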

Faiss HNSW (implemented in LEV-2) is the production vector index. Install via conda to avoid Apple-Silicon segfaults:
```
conda install -c conda-forge faiss-cpu
```
If Faiss is absent the engine falls back to a brute-force numpy index automatically.
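One way such a fallback can be resolved is sketched below. The `resolve_backend` function is hypothetical and only mirrors the three `vector_index_backend` values shown above. Raising an error when `"faiss"` is requested explicitly but not installed is an assumption of this sketch; Levy may fall back silently in that case too.

```python
import importlib.util

VALID_BACKENDS = ("auto", "faiss", "brute_force")


def resolve_backend(requested: str = "auto") -> str:
    """Return the concrete backend name for a vector_index_backend setting."""
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown vector_index_backend: {requested!r}")
    # find_spec returns None when the top-level module cannot be found.
    faiss_available = importlib.util.find_spec("faiss") is not None
    if requested == "auto":
        return "faiss" if faiss_available else "brute_force"
    if requested == "faiss" and not faiss_available:
        # Assumption: an explicit request fails loudly instead of falling back.
        raise ImportError("vector_index_backend='faiss' requested but faiss is not installed")
    return requested


print(resolve_backend("auto"))         # depends on whether faiss is installed
print(resolve_backend("brute_force"))  # always 'brute_force'
```

Checking availability with `find_spec` avoids importing Faiss just to decide which index to build.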

Switching embedding models at runtime

The EmbeddingManager built into the engine resolves study-model aliases:

| Alias | Checkpoint | Notes |
| --- | --- | --- |
| `all-MiniLM-L6-v2` / `all-minilm` | `sentence-transformers/all-MiniLM-L6-v2` | 384-dim, study baseline |
| `modernbert` | `nomic-ai/modernbert-embed-base` | 768-dim, symmetric `search_query:` prefix applied automatically |

To switch models between experiment runs, change embedding_model in LevyConfig — no code changes required. Embeddings are memoized per (model, text) so replay experiments never recompute a vector.
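The memoization and prefix handling described above can be sketched as follows. This is not the `EmbeddingManager` source: the `MemoizedEmbedder` class, the toy length-based encoders, and the exact prefix string (`"search_query: "` with a trailing space) are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


class MemoizedEmbedder:
    """Toy embedder that caches vectors per (model, text) and applies a per-model prefix."""

    def __init__(self,
                 encoders: Dict[str, Callable[[str], Vector]],
                 prefixes: Dict[str, str]) -> None:
        self.encoders = encoders
        self.prefixes = prefixes
        self.cache: Dict[Tuple[str, str], Vector] = {}
        self.encode_calls = 0  # counts real encoder invocations

    def embed(self, model: str, text: str) -> Vector:
        key = (model, text)
        if key not in self.cache:
            self.encode_calls += 1
            # The same prefix is used for every text, hence "symmetric".
            prefixed = self.prefixes.get(model, "") + text
            self.cache[key] = self.encoders[model](prefixed)
        return self.cache[key]


# Stand-in encoders: real ones would call a sentence-transformers model.
embedder = MemoizedEmbedder(
    encoders={
        "all-minilm": lambda text: [float(len(text))],
        "modernbert": lambda text: [float(len(text)), 0.0],
    },
    prefixes={"modernbert": "search_query: "},
)

embedder.embed("modernbert", "hello")
embedder.embed("modernbert", "hello")  # served from the memo, no encoder call
embedder.embed("all-minilm", "hello")  # different model, separate cache key
print(embedder.encode_calls)  # 2
```

Keying the memo on the model name as well as the text keeps vectors from different models apart when the same prompts are replayed under both.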

License

Apache-2.0
