From 150648f14879dee4f0e967dada8dc49f11414405 Mon Sep 17 00:00:00 2001
From: xiaguan <751080330@qq.com>
Date: Thu, 25 Jun 2026 00:58:41 +0800
Subject: [PATCH] docs(readme): add openinfer (Rust) as a DFlash backend

openinfer (https://github.com/openinfer-project/openinfer) is a pure
Rust + CUDA inference engine with native DFlash support for Qwen3-4B/8B,
using the z-lab/Qwen3-{4B,8B}-DFlash-b16 drafters. Add a short Quick
Start entry for it alongside the existing backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/README.md b/README.md
index f4e8533..aed080c 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,18 @@ for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tok
 print(f"\nThroughput: {tps:.2f} tok/s")
 ```
 
+### openinfer (Rust)
+
+[openinfer](https://github.com/openinfer-project/openinfer) is a pure Rust + CUDA inference engine (no PyTorch) with native DFlash support for Qwen3-4B and Qwen3-8B, using the [`z-lab/Qwen3-4B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-4B-DFlash-b16) and [`z-lab/Qwen3-8B-DFlash-b16`](https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16) drafters.
+
+```bash
+cargo run --release -- \
+  --model-path models/Qwen3-4B \
+  --dflash-draft-model-path models/Qwen3-4B-DFlash-b16
+```
+
+Single-stream decode speedup: **1.82× on RTX 5070 Ti**, **1.56× on RTX 5090**.
+
 ## 📊 Evaluation
 
 All benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are automatically downloaded and cached as JSONL in `cache/` on first run.