diff --git a/README.md b/README.md
index f4e8533..aed080c 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,18 @@ for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tok
 print(f"\nThroughput: {tps:.2f} tok/s")
 ```
 
+### openinfer (Rust)
+
+[openinfer](https://github.com/openinfer-project/openinfer) is a pure Rust + CUDA inference engine (no PyTorch) with native DFlash support for Qwen3-4B / 8B, using the [`z-lab/Qwen3-4B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-4B-DFlash-b16) and [`z-lab/Qwen3-8B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16) drafters.
+
+```bash
+cargo run --release -- \
+  --model-path models/Qwen3-4B \
+  --dflash-draft-model-path models/Qwen3-4B-DFlash-b16
+```
+
+Single-stream decode speedup: **1.82× on RTX 5070 Ti**, **1.56× on RTX 5090**.
+
 ## 📊 Evaluation
 
 All benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are automatically downloaded and cached as JSONL in `cache/` on first run.