Levy is a semantic caching engine for LLM APIs, built as the IT artefact of an MSc Artificial Intelligence capstone project (University of Liverpool). It sits between your application and an LLM provider (Mock, OpenAI-compatible, or Ollama today; Anthropic connector planned) to optimize costs and latency by reusing responses for identical or semantically similar prompts.
The research behind Levy benchmarks false positive rates of semantic caching across embedding models (all-MiniLM vs ModernBERT), workloads (FAQ, code, chat), and similarity thresholds. The authoritative project definition lives in docs/Project_Proposal.md and docs/Specification_and_Design_Report.md (university submissions — do not modify).
- Exact Match Caching: Extremely fast retrieval for identical prompts.
- Semantic Caching: Uses vector embeddings (via sentence-transformers) to find and reuse answers for queries with similar meaning (e.g., "What is the capital of France?" vs "Tell me France's capital").
- Metrics: Automatically tracks cache hit rates, latency, and estimated token savings.
- Pluggable Architecture: Easy to swap LLM providers or Vector Stores.
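The two cache tiers listed above can be pictured with a small, self-contained sketch. This is not Levy's actual implementation: `TwoTierCache`, `toy_embed`, and the `"semantic_cache"` source label are hypothetical names chosen for illustration, and the toy embedding only flags a few keywords. The similarity score uses the 1/(1+L2) mapping that the configuration comments mention.

```python
import math


def similarity(a, b):
    """Map Euclidean (L2) distance into (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + math.dist(a, b))


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: keyword presence flags."""
    cleaned = text.lower().replace("?", "").replace("'s", "")
    words = set(cleaned.split())
    return [1.0 if w in words else 0.0 for w in ("capital", "france", "weather")]


class TwoTierCache:
    """Illustrative exact-then-semantic lookup (assumed design, not Levy's code)."""

    def __init__(self, embed, threshold=0.85):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}    # prompt string -> response
        self.entries = []  # list of (vector, response) pairs

    def store(self, prompt, response):
        self.exact[prompt] = response
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        # Tier 1: identical prompt, answered by a dictionary lookup.
        if prompt in self.exact:
            return self.exact[prompt], "exact_cache"
        # Tier 2: nearest stored vector, accepted only above the threshold.
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = similarity(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic_cache"
        # Miss: the caller would now ask the LLM and store the answer.
        return None, "llm"


cache = TwoTierCache(toy_embed, threshold=0.85)
cache.store("What is the capital of France?", "Paris")
print(cache.lookup("Tell me France's capital"))  # ('Paris', 'semantic_cache')
```

Checking the exact tier first keeps the common case cheap: an embedding is computed only when the prompt has not been seen verbatim.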
levy/
├── levy/ # Core package
│ ├── cache/ # Cache logic (Exact, Semantic, InMemory/Redis stores)
│ ├── llm_client.py # LLM interaction (Mock, OpenAI, Ollama)
│ ├── embeddings.py # EmbeddingClient ABC + Mock, SentenceTransformer, Ollama
│ ├── embedding_manager.py # EmbeddingManager: study-model registry, runtime switching,
│ │ # memoization, symmetric prefix handling
│ ├── engine.py # Main orchestration engine
│ ├── config.py # LevyConfig (providers, thresholds, store)
│ ├── metrics.py # Hit/miss/latency/token-savings tracking
│ └── models.py # Data classes
├── docs/ # Research docs (proposal & S&D report are frozen)
├── examples/ # Demo scripts
└── tests/ # Unit tests
- Ensure you have Conda installed.
- Create the environment:
conda env create -f environment.yml
- Activate the environment:
conda activate levy
python -m unittest discover -s tests -p "test_*.py"
# or, if pytest is installed:
python -m pytest tests/ -q

from levy import LevyEngine, LevyConfig
# Initialize with defaults (Mock LLM, Exact Cache only)
engine = LevyEngine()
# First call - hits the "LLM"
result1 = engine.generate("Hello world")
print(result1.source) # 'llm'
# Second call - hits the cache
result2 = engine.generate("Hello world")
print(result2.source) # 'exact_cache'

A replay script is provided to demonstrate the cache behavior:
python examples/simple_replay.py

It runs a sequence of prompts through three configurations:
- No Cache
- Exact Cache Only
- Semantic Cache (uses sentence-transformers if available)
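To give a feel for what such a comparison measures, here is a rough sketch of a replay loop that tallies answer sources and derives a hit rate. It does not reproduce examples/simple_replay.py: `replay`, `no_cache`, and `make_exact_only` are hypothetical helpers, and the stub generators return only a source label instead of calling an LLM.

```python
from collections import Counter


def replay(generate, prompts):
    """Run every prompt through `generate` and tally the reported sources."""
    counts = Counter(generate(p) for p in prompts)
    total = sum(counts.values())
    # Anything not answered by the LLM counts as a cache hit.
    hit_rate = (total - counts["llm"]) / total if total else 0.0
    return counts, hit_rate


def no_cache(prompt):
    """Stub configuration: every prompt goes to the LLM."""
    return "llm"


def make_exact_only():
    """Stub configuration: repeats of an identical prompt are cache hits."""
    seen = set()

    def generate(prompt):
        if prompt in seen:
            return "exact_cache"
        seen.add(prompt)
        return "llm"

    return generate


prompts = ["Hello world", "Hello world", "What is Levy?", "Hello world"]
for name, gen in [("no cache", no_cache), ("exact only", make_exact_only())]:
    counts, hit_rate = replay(gen, prompts)
    print(f"{name}: {dict(counts)} hit_rate={hit_rate:.2f}")
```

Replaying the same prompt sequence through each configuration keeps the comparison fair, since only the caching behaviour differs between runs.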
- Install and run Ollama.
- Pull required models:
ollama pull qwen3
ollama pull nomic-embed-text
- Run the demo:
python examples/ollama_demo.py
To use Redis for persistence:
- Start Redis:
docker-compose up -d
- Configure LevyConfig to use cache_store_type="redis".
You can configure Levy via LevyConfig:
config = LevyConfig(
    llm_provider="openai",
    openai_api_key="sk-...",
    enable_semantic_cache=True,
    similarity_threshold=0.85, # in 1/(1+L2) space; study sweep: 0.70–0.90
    # Embedding model for the study (default: all-MiniLM-L6-v2 baseline)
    embedding_provider="sentence-transformers",
    embedding_model="all-MiniLM-L6-v2", # or "modernbert" for the second study model
    # Vector index backend (default: "auto" → Faiss HNSW if installed, else brute-force)
    vector_index_backend="auto", # "auto" | "faiss" | "brute_force"
)

Faiss HNSW (implemented in LEV-2) is the production vector index. Install via conda to avoid Apple-Silicon segfaults:
conda install -c conda-forge faiss-cpu

If Faiss is absent, the engine falls back to a brute-force numpy index automatically.
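The "auto" fallback and the 1/(1+L2) threshold space can be illustrated with a short sketch. Both helpers below are hypothetical and are not taken from Levy's source: `resolve_backend` shows one plausible way to pick a backend by checking whether the `faiss` module is importable (raising when "faiss" is requested explicitly but missing is an assumption), and `max_l2_for_threshold` converts a similarity threshold into the equivalent L2 distance cutoff.

```python
import importlib.util

VALID_BACKENDS = ("auto", "faiss", "brute_force")


def resolve_backend(requested="auto", has_faiss=None):
    """Pick a concrete backend name; `has_faiss` can be injected for testing."""
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {requested!r}")
    if has_faiss is None:
        # find_spec returns None when the module cannot be located.
        has_faiss = importlib.util.find_spec("faiss") is not None
    if requested == "auto":
        return "faiss" if has_faiss else "brute_force"
    if requested == "faiss" and not has_faiss:
        # Assumed behaviour: an explicit request does not fall back silently.
        raise ImportError("faiss was requested but is not installed")
    return requested


def max_l2_for_threshold(threshold):
    """Invert s = 1 / (1 + d): a match needs L2 distance d <= 1/s - 1."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return 1.0 / threshold - 1.0


print(resolve_backend("auto"))
for s in (0.70, 0.85, 0.90):
    print(f"threshold {s:.2f} -> max L2 distance {max_l2_for_threshold(s):.4f}")
```

Seen this way, the study sweep of 0.70 to 0.90 corresponds to L2 cutoffs of roughly 0.43 down to 0.11, so raising the threshold tightens the match radius quickly.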
The EmbeddingManager built into the engine resolves study-model aliases:
| Alias | Checkpoint | Notes |
|---|---|---|
| all-MiniLM-L6-v2 / all-minilm | sentence-transformers/all-MiniLM-L6-v2 | 384-dim, study baseline |
| modernbert | nomic-ai/modernbert-embed-base | 768-dim, symmetric search_query: prefix applied automatically |
To switch models between experiment runs, change embedding_model in LevyConfig — no code changes required. Embeddings are memoized per (model, text) so replay experiments never recompute a vector.
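Memoization keyed on (model, text) could look roughly like the following. This is only a sketch under stated assumptions: `MemoizingEmbedder` is a hypothetical class, not the real EmbeddingManager, the backends are fake functions that return a one-element vector, and the exact prefix string `"search_query: "` is assumed from the table above.

```python
class MemoizingEmbedder:
    """Illustrative per-(model, text) memoization with optional model prefixes."""

    def __init__(self, backends, prefixes=None):
        self.backends = backends        # model name -> callable(text) -> vector
        self.prefixes = prefixes or {}  # model name -> prefix prepended to text
        self.cache = {}                 # (model, text) -> vector
        self.calls = 0                  # number of real backend invocations

    def embed(self, model, text):
        key = (model, text)
        if key not in self.cache:
            self.calls += 1
            prefixed = self.prefixes.get(model, "") + text
            self.cache[key] = self.backends[model](prefixed)
        return self.cache[key]


# Fake backends: the "vector" is just the input length, to make calls visible.
backends = {
    "all-minilm": lambda t: [float(len(t))],
    "modernbert": lambda t: [float(len(t))],
}
embedder = MemoizingEmbedder(backends, prefixes={"modernbert": "search_query: "})

embedder.embed("all-minilm", "hello")
embedder.embed("all-minilm", "hello")  # served from the memo, no backend call
embedder.embed("modernbert", "hello")  # different model, so a new entry
print(embedder.calls)  # 2
```

Including the model name in the key matters when switching models between runs: the same text must not return a vector produced by a different checkpoint.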
Apache-2.0