docs(readme): add openinfer (Rust) as a DFlash backend #141

Open: xiaguan wants to merge 1 commit into z-lab:main from xiaguan:docs/openinfer-backend

xiaguan commented Jun 24, 2026

Hi DFlash team — thanks for the work on DFlash, and for releasing the Qwen3-4B/8B -b16 drafters.

We added native DFlash speculative decoding to openinfer, a from-scratch LLM inference engine in pure Rust + CUDA (no PyTorch). It runs the z-lab/Qwen3-{4B,8B}-DFlash-b16 drafters behind a --dflash-draft-model-path flag; single-stream decode comes out to 1.82× on an RTX 5070 Ti and 1.56× on a 5090.

This PR adds a short openinfer (Rust) entry to the Quick Start, after the existing backends, so DFlash users on NVIDIA GPUs have the option. Happy to adjust the wording or placement however you'd prefer — and thanks again for making DFlash easy to integrate.

openinfer (https://github.com/openinfer-project/openinfer) is a pure
Rust + CUDA inference engine with native DFlash support for Qwen3-4B/8B,
using the z-lab/Qwen3-{4B,8B}-DFlash-b16 drafters. Adds a short Quick
Start entry alongside the existing backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xiaguan force-pushed the docs/openinfer-backend branch from 6fb4cbd to 150648f on June 24, 2026 17:00