Context
ADR-058 established internal regression tracking (commit-to-commit within lattice). The next step is cross-framework benchmarking — apples-to-apples comparison against the major inference/embedding engines on identical hardware, models, and workloads.
Frameworks to compare
Inference (decode tok/s)
- llama.cpp — C++ GGUF, Metal + CPU, the community baseline
- PyTorch — eager + torch.compile, the research baseline
- ONNX Runtime — ORT with CoreML/CPU EP
- Candle — Rust, HuggingFace's engine (closest architectural peer)
- MLX — Apple's framework (already partially covered by
bench_apples_to_apples.sh)
- Ollama — llama.cpp wrapper (already partially covered)
Embedding (vectors/sec, latency)
- FastEmbed — Python, ONNX-backed, popular for RAG pipelines
- sentence-transformers — PyTorch, the research standard
- ONNX Runtime — direct ORT with BGE/E5 models
- Candle — Rust embedding path
Benchmark dimensions
| Dimension |
Values |
| Model |
Qwen3.5-0.8B (decode), BGE-small-en-v1.5 (embed) |
| Quantization |
FP16, Q8_0, Q4_0 where supported |
| Batch size |
1, 8, 32 (embed only) |
| Sequence length |
64, 512, 2048 (decode); 128, 512 (embed) |
| Hardware |
Apple Silicon (M2 Max), x86_64 (GH Actions) |
| Metric |
tok/s (decode), vectors/s (embed), p50/p99 latency, peak RSS |
Deliverables
scripts/bench-cross-framework.sh — orchestrates all framework runs, collects results into a single TSV
scripts/bench-cross-report.py — generates markdown comparison table from TSV (similar to perf-baselines-readme.py)
- GH Actions workflow (manual trigger only — too expensive for every PR) that runs the suite on
macos-latest (Apple Silicon) and publishes results to a perf-cross-framework branch or wiki page
- README badge or table showing lattice's position vs peers on the headline metric (decode tok/s at Q8, embed vectors/s at dim=384)
Installation constraints
Each framework should be installed in isolation (venv/conda for Python, brew/cargo for native) so they don't interfere. The benchmark script should gracefully skip any framework that isn't installed.
Non-goals
- Not trying to "win" every category — honest numbers, including where lattice is slower
- Not benchmarking training (lattice-tune is not competing with PyTorch training)
- Not benchmarking server throughput (continuous batching, multi-user) — single-client latency only for now
Priority
P2 — valuable for positioning but not blocking any current work. Can be picked up after ADR-058 Phase 1 stabilizes.
References
Context
ADR-058 established internal regression tracking (commit-to-commit within lattice). The next step is cross-framework benchmarking — apples-to-apples comparison against the major inference/embedding engines on identical hardware, models, and workloads.
Frameworks to compare
Inference (decode tok/s)
bench_apples_to_apples.sh)Embedding (vectors/sec, latency)
Benchmark dimensions
Deliverables
scripts/bench-cross-framework.sh— orchestrates all framework runs, collects results into a single TSVscripts/bench-cross-report.py— generates markdown comparison table from TSV (similar toperf-baselines-readme.py)macos-latest(Apple Silicon) and publishes results to aperf-cross-frameworkbranch or wiki pageInstallation constraints
Each framework should be installed in isolation (venv/conda for Python, brew/cargo for native) so they don't interfere. The benchmark script should gracefully skip any framework that isn't installed.
Non-goals
Priority
P2 — valuable for positioning but not blocking any current work. Can be picked up after ADR-058 Phase 1 stabilizes.
References
scripts/bench_apples_to_apples.sh: existing lattice vs MLX vs Ollama comparison