bench: cross-framework performance comparison suite

## Context

ADR-058 established internal regression tracking (commit-to-commit within lattice). The next step is **cross-framework benchmarking** — apples-to-apples comparison against the major inference/embedding engines on identical hardware, models, and workloads.

## Frameworks to compare

### Inference (decode tok/s)
- **llama.cpp** — C++ GGUF, Metal + CPU, the community baseline
- **PyTorch** — eager + torch.compile, the research baseline
- **ONNX Runtime** — ORT with CoreML/CPU EP
- **Candle** — Rust, HuggingFace's engine (closest architectural peer)
- **MLX** — Apple's framework (already partially covered by `bench_apples_to_apples.sh`)
- **Ollama** — llama.cpp wrapper (already partially covered)

### Embedding (vectors/sec, latency)
- **FastEmbed** — Python, ONNX-backed, popular for RAG pipelines
- **sentence-transformers** — PyTorch, the research standard
- **ONNX Runtime** — direct ORT with BGE/E5 models
- **Candle** — Rust embedding path

## Benchmark dimensions

| Dimension | Values |
|---|---|
| Model | Qwen3.5-0.8B (decode), BGE-small-en-v1.5 (embed) |
| Quantization | FP16, Q8_0, Q4_0 where supported |
| Batch size | 1, 8, 32 (embed only) |
| Sequence length | 64, 512, 2048 (decode); 128, 512 (embed) |
| Hardware | Apple Silicon (M2 Max), x86_64 (GH Actions) |
| Metric | tok/s (decode), vectors/s (embed), p50/p99 latency, peak RSS |

## Deliverables

1. `scripts/bench-cross-framework.sh` — orchestrates all framework runs, collects results into a single TSV
2. `scripts/bench-cross-report.py` — generates markdown comparison table from TSV (similar to `perf-baselines-readme.py`)
3. GH Actions workflow (manual trigger only — too expensive for every PR) that runs the suite on `macos-latest` (Apple Silicon) and publishes results to a `perf-cross-framework` branch or wiki page
4. README badge or table showing lattice's position vs peers on the headline metric (decode tok/s at Q8, embed vectors/s at dim=384)

## Installation constraints

Each framework should be installed in isolation (venv/conda for Python, brew/cargo for native) so they don't interfere. The benchmark script should gracefully skip any framework that isn't installed.

## Non-goals

- Not trying to "win" every category — honest numbers, including where lattice is slower
- Not benchmarking training (lattice-tune is not competing with PyTorch training)
- Not benchmarking server throughput (continuous batching, multi-user) — single-client latency only for now

## Priority

P2 — valuable for positioning but not blocking any current work. Can be picked up after ADR-058 Phase 1 stabilizes.

## References

- ADR-058: CPU perf regression tracking (internal baseline)
- `scripts/bench_apples_to_apples.sh`: existing lattice vs MLX vs Ollama comparison
- Issue #77: GPU contention variance (affects local Metal benchmarks)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench: cross-framework performance comparison suite #84

Context

Frameworks to compare

Inference (decode tok/s)

Embedding (vectors/sec, latency)

Benchmark dimensions

Deliverables

Installation constraints

Non-goals

Priority

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Dimension	Values
Model	Qwen3.5-0.8B (decode), BGE-small-en-v1.5 (embed)
Quantization	FP16, Q8_0, Q4_0 where supported
Batch size	1, 8, 32 (embed only)
Sequence length	64, 512, 2048 (decode); 128, 512 (embed)
Hardware	Apple Silicon (M2 Max), x86_64 (GH Actions)
Metric	tok/s (decode), vectors/s (embed), p50/p99 latency, peak RSS

bench: cross-framework performance comparison suite #84

Description

Context

Frameworks to compare

Inference (decode tok/s)

Embedding (vectors/sec, latency)

Benchmark dimensions

Deliverables

Installation constraints

Non-goals

Priority

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions