Compression Infrastructure For Agent Runtimes And Retrieval Stacks
turboagents is a single Python package for TurboQuant-style KV-cache and vector
compression. It is designed to sit underneath existing AI systems, not replace
them. If you already have an agent framework, a local inference stack, or a
RAG pipeline, TurboAgents gives you a way to add compression, reranking, and
benchmarking without rebuilding the rest of your application.
Use the links above as the fastest route through the project. Start with the website if you want the product overview, the public docs if you want the package detail, Getting Started if you want the first commands, Benchmarks if you want the current numbers, and the SuperOptiX integration page if you want the end-to-end application story.
Most AI stacks do not need another agent framework. They need the memory and retrieval layer underneath their existing agents to stop getting in the way.
turboagents is aimed at that layer:
- compress KV-cache payloads so local and server-side inference can hold more context
- compress vector payloads so retrieval systems can store and rerank more cheaply
- benchmark the quality, latency, and recall tradeoffs explicitly instead of hiding them
- integrate with runtimes and vector backends teams already use
| Area | Current State |
|---|---|
| Quant core | Fast Walsh-Hadamard rotation, PolarQuant-style angle/radius stage, seeded QJL-style residual sketch, binary payloads |
| Engines | MLX wrapper, llama.cpp wrapper, experimental vLLM wrapper/plugin scaffold |
| Retrieval | Chroma, FAISS, LanceDB, SurrealDB, and pgvector client adapters |
| Benchmarks | Synthetic CLI, benchmark matrix, MLX sweep, adapter matrix, minimal Needle harness |
| Packaging | uv-first local workflow, docs, CI, release workflow, PyPI package |
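The quant core's first stage is a fast Walsh-Hadamard rotation. As an illustration of the underlying operation, here is a minimal pure-Python sketch of the in-place FWHT (an illustrative implementation, not the turboagents kernel):

```python
def fwht(vec):
    """In-place fast Walsh-Hadamard transform.

    Length must be a power of two; runs in O(n log n) butterfly passes.
    Applying it twice and dividing by len(vec) recovers the input,
    since the Hadamard matrix H satisfies H @ H = n * I.
    """
    n = len(vec)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = vec[j], vec[j + h]
                vec[j], vec[j + h] = a + b, a - b
        h *= 2
    return vec
```

The transform is its own inverse up to a factor of `n`, which is what makes a rotation-then-quantize pipeline cheap to undo at dequantization time.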
The package is already useful in three common situations. If you are running local agents, the MLX and llama.cpp wrappers give you a clean way to script and inspect runtime paths. If you are running retrieval, the TurboRAG adapters let you keep Chroma, FAISS, LanceDB, SurrealDB, or pgvector in place while adding a compressed rerank layer. If you are still evaluating fit, the built-in benchmarks give you a narrow and repeatable way to measure payload size, reconstruction quality, and retrieval agreement before you change application code.
The benchmark story is also real, not just conceptual. Chroma and FAISS both
held recall@10 = 1.0 on the validated adapter sweep, pgvector reached
recall@10 = 0.896875 at 4.0 bits, and the current MLX 3B run showed 3.5
bits as the best quality and throughput tradeoff in that configuration. The
long-context story is intentionally narrower: the minimal Needle harness shows
early-position retrieval, but not robust mid- or late-position recall.
That is the right way to read this project today. TurboAgents is ready to use as compression infrastructure and benchmark tooling. It is not yet making broad claims about long-context quality or production-native kernels.
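The recall@10 figures above compare a compressed retrieval path against exact search over the same corpus. A minimal sketch of that metric, assuming both sides return ranked document IDs (an illustrative helper, not the turboagents benchmark code):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that survives in the approximate top-k.

    approx_ids: ranked IDs from the compressed / reranked index
    exact_ids:  ranked IDs from exact (uncompressed) search
    """
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / min(k, len(exact_top))
```

A recall@10 of 1.0 therefore means the compressed path returned exactly the same top-10 set as exact search, while 0.896875 means roughly nine of every ten exact hits survived.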
Install the package with uv:
```
uv add turboagents
```

Install with useful extras:

```
uv add "turboagents[mlx]"
uv add "turboagents[rag]"
uv add "turboagents[all]"
```

Try the CLI first:

```
turboagents doctor
turboagents bench kv --format json
turboagents bench rag --format markdown
turboagents serve --backend mlx --model mlx-community/Qwen3-0.6B-4bit --dry-run
```

TurboAgents stays framework-agnostic, but the first full reference integration is now in SuperOptiX.
That matters because the validated story is not limited to package-level tests. It also includes real SuperOptiX retrieval paths using TurboAgents under framework runtimes.
- `turboagents-chroma` is wired into SuperOptiX and covered by focused runtime tests
- `turboagents-lancedb` is validated through the real `rag_lancedb_demo` flow
- `turboagents-surrealdb` is validated through the real SuperOptiX OpenAI Agents and Pydantic AI demo flows
If you want the end-to-end integration story, start here after installing TurboAgents:
- SuperOptiX integration guide: https://superagenticai.github.io/superoptix/guides/turboagents-integration/
- SuperOptiX LanceDB demo: https://superagenticai.github.io/superoptix/examples/agents/rag-lancedb-demo/
- SuperOptiX SurrealDB frameworks guide: https://superagenticai.github.io/superoptix/examples/agents/surrealdb-frameworks-demo/
turboagents is not an agent framework. It is the compression layer you put
under existing AI agents, inference engines, and RAG stacks so they can:
- hold longer contexts
- use less KV-cache memory
- store more embeddings at lower cost
- benchmark quality and memory tradeoffs explicitly
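The KV-cache saving behind the first two points is straightforward arithmetic: every cached key/value element shrinks from 16 bits to the quantized width. A back-of-the-envelope sketch (the model shape below is an assumption for illustration, not a measured turboagents figure):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV-cache size: keys and values for every layer,
    at `bits` bits per element. Ignores the per-block scale/zero-point
    overhead that real quantized payloads carry."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits / 8

# Assumed 3B-class shape: 28 layers, 8 KV heads, head_dim 128, 32k tokens.
fp16 = kv_cache_bytes(28, 8, 128, 32768, 16)
q3_5 = kv_cache_bytes(28, 8, 128, 32768, 3.5)
print(f"fp16: {fp16 / 2**30:.2f} GiB, 3.5-bit: {q3_5 / 2**30:.2f} GiB")
```

At the same memory budget, the 16 / 3.5 ≈ 4.6x ratio is what lets an engine hold several times more context, which is why sub-4-bit widths are the interesting regime.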
Think of it as:
- `TurboQuant` for real systems
- `TurboRAG` for vector retrieval stacks
- adapters and tooling around existing engines instead of a replacement for them
turboagents is for teams and developers who already have:
- AI agents that hit memory limits on long prompts
- RAG systems with large embedding stores
- inference stacks built on MLX, llama.cpp, vLLM, Chroma, FAISS, LanceDB, SurrealDB, or pgvector
- agent frameworks that need compression infrastructure, not another framework
Most users approach TurboAgents in one of three ways.
If you already have an agent system, keep the agent layer and use turboagents
to improve the inference or memory layer under it.
Examples:
- use `turboagents.engines.mlx` for MLX-based local agents
- use `turboagents.engines.llamacpp` to build llama.cpp runtime commands
- use `turboagents.engines.vllm` as an experimental runtime wrapper
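The engine wrappers are command and runtime builders rather than new inference engines. A hypothetical sketch of the llama.cpp pattern (`build_llamacpp_cmd` is an illustrative helper, not the turboagents API; `-m`, `-c`, `-n`, and `-p` are standard llama.cpp CLI flags):

```python
def build_llamacpp_cmd(model_path, prompt, ctx=4096, n_predict=256):
    """Build an argv list for llama.cpp's CLI.

    Hypothetical helper for illustration; the real
    turboagents.engines.llamacpp surface may differ.
    """
    return [
        "llama-cli",
        "-m", model_path,        # path to the GGUF model
        "-c", str(ctx),          # context window in tokens
        "-n", str(n_predict),    # tokens to generate
        "-p", prompt,
    ]
```

Building an argv list instead of a shell string keeps the command inspectable and safe to hand to `subprocess.run` without quoting issues.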
If you already have retrieval, keep your current application logic and add TurboRAG where vectors are stored or searched.
Examples:
- use `TurboFAISS` when you want a local FAISS-backed retrieval path
- use `TurboChroma` when you want Chroma candidate search plus TurboAgents rerank
- use `TurboLanceDB` or `TurboSurrealDB` when you want a sidecar/rerank integration
- use `TurboPgvector` when your application already depends on PostgreSQL
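These adapters share one sidecar pattern: the backend does coarse candidate search, and a rerank step re-scores those candidates. A generic sketch of that loop (plain cosine rerank for illustration; the actual TurboAgents adapters score against compressed payloads):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query, candidates, top_k=3):
    """candidates: (doc_id, vector) pairs from the backend's coarse
    search; returns IDs re-ordered by similarity to the query."""
    scored = sorted(candidates, key=lambda c: cosine(query, c[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]
```

Because the rerank step only sees the candidate set, the backend's index, schema, and application logic stay untouched.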
If you are still evaluating whether TurboQuant-style compression makes sense for your stack, use the CLI first:
```
turboagents doctor
turboagents bench kv
turboagents bench rag
turboagents compress
```
That gives you a way to validate fit before deeper integration work.
TurboAgents now includes a Chroma adapter aligned to chromadb 1.5.5.
The right integration model is:
- `Context-1` handles search policy and context management
- TurboAgents handles compressed retrieval and rerank
- Chroma retrieves candidates while TurboAgents reranks or compresses the working set under that loop
Latest validated benchmark work:
| Surface | Result |
|---|---|
| Chroma | recall@1 = 1.0, recall@10 = 1.0 across the tested sweep in the local adapter benchmark |
| MLX sweep | 3.5 bits was the best current quality/performance tradeoff on mlx-community/Llama-3.2-3B-Instruct-4bit |
| FAISS | recall@1 = 1.0, recall@10 = 1.0 across the tested sweep |
| LanceDB | recall@10 landed in the 0.70 to 0.75 range on medium-rag |
| pgvector | recall@10 improved monotonically up to 0.896875 at 4.0 bits |
| Needle | exact match held for insertion fraction 0.1, but failed at 0.5 and 0.9 |
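The monotonic pgvector trend is what uniform quantization predicts: each extra bit halves the grid step, which shrinks reconstruction error and preserves more of the exact ranking. A toy sketch of that relationship (illustrative, not the TurboQuant codec):

```python
def quantize(values, bits):
    """Uniform scalar quantization to 2**bits levels over [min, max].

    Returns dequantized values; more bits => finer grid => lower error.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if levels else 0.0
    return [lo + round((v - lo) / step) * step if step else lo for v in values]

def mse(a, b):
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Running `mse(vals, quantize(vals, b))` for b = 2, 4, 6 on any non-degenerate input shows strictly decreasing error, mirroring the recall curve up to 4.0 bits in the table.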
If you want the full numbers and command paths, see docs/benchmarks.md.
For the shortest path through the public docs:
- docs/getting-started.md for install and first commands
- docs/adapters.md for backend-specific retrieval surfaces
- docs/examples.md for runnable local examples
- docs/benchmarks.md for validated benchmark numbers
- docs/architecture.md for the runtime and retrieval layout
```
uv add turboagents
```

Optional extras:

```
uv add "turboagents[mlx]"
uv add "turboagents[vllm]"
uv add "turboagents[rag]"
uv add "turboagents[all]"
```

For local development in this repository:
```
uv sync
uv sync --extra rag
```

```
turboagents doctor
turboagents bench kv --format json
turboagents bench rag --format markdown
turboagents bench paper
turboagents compress --input vectors.npy --output vectors.npz --head-dim 128
turboagents serve --backend mlx --model mlx-community/Qwen3-0.6B-4bit --dry-run
turboagents serve --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --dry-run
```

```
python3 examples/quickstart.py
python3 examples/bench_profiles.py
python3 examples/faiss_turborag.py
python3 examples/chroma_turborag.py
python3 examples/mlx_server_dry_run.py
```

Common local commands:
```
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run python -m pytest -q
uv run mkdocs serve -f mkdocs.local.yml
uv build
```

Benchmark harness commands:

```
uv sync --extra rag --extra mlx
uv run python scripts/run_benchmark_matrix.py --output-dir benchmark-results/$(date +%Y%m%d-%H%M%S)
uv run python scripts/benchmark_needle.py --model mlx-community/Llama-3.2-3B-Instruct-4bit --context-tokens 2048 4096 8192 --output benchmark-results/needle-$(date +%Y%m%d-%H%M%S).json
```

Community and project health files:
See ATTRIBUTION.md. This repository is not affiliated with Google Research.
