From 150648f14879dee4f0e967dada8dc49f11414405 Mon Sep 17 00:00:00 2001
From: xiaguan <751080330@qq.com>
Date: Thu, 25 Jun 2026 00:58:41 +0800
Subject: [PATCH] docs(readme): add openinfer (Rust) as a DFlash backend

openinfer (https://github.com/openinfer-project/openinfer) is a pure
Rust + CUDA inference engine with native DFlash support for Qwen3-4B/8B,
using the z-lab/Qwen3-{4B,8B}-DFlash-b16 drafters. Adds a short Quick
Start entry alongside the existing backends.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/README.md b/README.md
index f4e8533..aed080c 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,18 @@ for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tok
 print(f"\nThroughput: {tps:.2f} tok/s")
 ```
 
+### openinfer (Rust)
+
+[openinfer](https://github.com/openinfer-project/openinfer) is a pure Rust + CUDA inference engine (no PyTorch) with native DFlash support for Qwen3-4B / 8B, using the [`z-lab/Qwen3-4B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-4B-DFlash-b16) and [`z-lab/Qwen3-8B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16) drafters.
+
+```bash
+cargo run --release -- \
+    --model-path models/Qwen3-4B \
+    --dflash-draft-model-path models/Qwen3-4B-DFlash-b16
+```
+
+Single-stream decode speedup: **1.82× on RTX 5070 Ti**, **1.56× on RTX 5090**.
+
 ## 📊 Evaluation
 
 All benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are automatically downloaded and cached as JSONL in `cache/` on first run.