Skip to content

bench: cross-framework performance comparison suite #84

@ohdearquant

Description

@ohdearquant

Context

ADR-058 established internal regression tracking (commit-to-commit within lattice). The next step is cross-framework benchmarking — apples-to-apples comparison against the major inference/embedding engines on identical hardware, models, and workloads.

Frameworks to compare

Inference (decode tok/s)

  • llama.cpp — C++ GGUF, Metal + CPU, the community baseline
  • PyTorch — eager + torch.compile, the research baseline
  • ONNX Runtime — ORT with CoreML/CPU EP
  • Candle — Rust, HuggingFace's engine (closest architectural peer)
  • MLX — Apple's framework (already partially covered by bench_apples_to_apples.sh)
  • Ollama — llama.cpp wrapper (already partially covered)

Embedding (vectors/sec, latency)

  • FastEmbed — Python, ONNX-backed, popular for RAG pipelines
  • sentence-transformers — PyTorch, the research standard
  • ONNX Runtime — direct ORT with BGE/E5 models
  • Candle — Rust embedding path

Benchmark dimensions

Dimension Values
Model Qwen3.5-0.8B (decode), BGE-small-en-v1.5 (embed)
Quantization FP16, Q8_0, Q4_0 where supported
Batch size 1, 8, 32 (embed only)
Sequence length 64, 512, 2048 (decode); 128, 512 (embed)
Hardware Apple Silicon (M2 Max), x86_64 (GH Actions)
Metric tok/s (decode), vectors/s (embed), p50/p99 latency, peak RSS

Deliverables

  1. scripts/bench-cross-framework.sh — orchestrates all framework runs, collects results into a single TSV
  2. scripts/bench-cross-report.py — generates markdown comparison table from TSV (similar to perf-baselines-readme.py)
  3. GH Actions workflow (manual trigger only — too expensive for every PR) that runs the suite on macos-latest (Apple Silicon) and publishes results to a perf-cross-framework branch or wiki page
  4. README badge or table showing lattice's position vs peers on the headline metric (decode tok/s at Q8, embed vectors/s at dim=384)

Installation constraints

Each framework should be installed in isolation (venv/conda for Python, brew/cargo for native) so they don't interfere. The benchmark script should gracefully skip any framework that isn't installed.

Non-goals

  • Not trying to "win" every category — honest numbers, including where lattice is slower
  • Not benchmarking training (lattice-tune is not competing with PyTorch training)
  • Not benchmarking server throughput (continuous batching, multi-user) — single-client latency only for now

Priority

P2 — valuable for positioning but not blocking any current work. Can be picked up after ADR-058 Phase 1 stabilizes.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions