Warning
This repository is incomplete and has not been fully tested yet. Treat it as experimental research/developer tooling, not production-ready infrastructure.
dflasher trains and tests DFlash-style speculative draft models for Hugging Face
causal language models.
The local implementation is a practical DFlash-lite generic drafter:
- the source model is frozen and acts as the target/verifier;
- selected target hidden states condition a small draft Transformer;
- the draft model predicts a block of future tokens from one clean anchor token;
- generation is target-verified greedy speculative decoding, so accepted output is checked against the target model and should match target greedy tokens exactly.
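The verification rule in the last bullet can be illustrated with a small, self-contained sketch. This is not dflasher's implementation: `verify_block` and `toy_target` are hypothetical names, and the toy target is queried once per position for clarity, whereas a real verifier would typically score the whole drafted block in one target forward pass.

```python
from typing import Callable, List, Sequence


def verify_block(
    context: Sequence[int],
    draft_tokens: Sequence[int],
    target_greedy_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens to append after verifying one drafted block."""
    accepted: List[int] = []
    for proposed in draft_tokens:
        expected = target_greedy_next(list(context) + accepted)
        if proposed != expected:
            # First mismatch: keep the target's token, drop the rest of the block.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_greedy_next(list(context) + accepted))
    return accepted


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for the frozen target: next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    # The draft gets two tokens right, then diverges.
    print(verify_block([1, 2, 3], [4, 5, 9, 9], toy_target))
```

Because a drafted token is only kept when it equals the target's greedy choice, the appended tokens are always a prefix of what plain target greedy decoding would have produced; the draft only changes how many tokens are confirmed per step.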
The project has five practical paths:
- dflasher build: the product-style entry point. It accepts a source model and writes a draft directory to --out. On Linux/CUDA it builds an official vLLM/Speculators DFlash checkpoint; on Mac it builds a local DFlash-lite draft.
- dflasher train: a local DFlash-lite trainer that runs on CPU/MPS/CUDA and is useful for correctness experiments.
- dflasher mac: Mac-friendly wrappers around the local PyTorch/MPS trainer, plus z-lab MLX script generation for Apple Silicon inference.
- dflasher omlx: a local OMLX/MLX backend for quantized Apple Silicon models. It extracts selected target hidden states, writes a reusable cache, trains a lightweight MLX DFlash draft from that cache, and verifies generation against the OMLX target model.
- dflasher official: an orchestration layer for the public vLLM Speculators DFlash pipeline. This is the path that produces vLLM/Speculators checkpoints with config.json and model.safetensors.
git clone https://github.com/dawncr0w/dflasher.git
cd dflasher
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
For the official Speculators client wrappers:
pip install -e ".[dev,official]"
For Mac local DFlash-lite training on Apple Silicon:
pip install -e ".[dev,mac]"
For the z-lab MLX runtime, use a separate Apple Silicon environment and install the z-lab package:
pip install -e ".[zlab-mlx]"
The local OMLX backend needs the Mac dependencies and the z-lab MLX draft runtime:
pip install -e ".[dev,mac,zlab-mlx]"
If you serve MiniMax-M2 through the installed macOS oMLX app, patch the local app bundle once so its DFlash engine can resolve MiniMax-M2 target ops:
dflasher omlx patch-app --app-path /Applications/oMLX.app
For actual vLLM serving/training on Linux/CUDA, install the vLLM extra in that CUDA environment:
pip install -e ".[dev,official,vllm]"
The vllm extra targets vLLM 0.20.1+, which is the version range z-lab
documents for core DFlash serving support. Some z-lab public checkpoints still
need custom builds, such as the Gemma4 Docker image or SWA support branch called
out by the z-lab README.
dflasher smoke --source sshleifer/tiny-gpt2 --workdir ./runs/smoke --device cpu
Use build when you want the CLI contract directly: source model in, draft
model directory out.
dflasher build sshleifer/tiny-gpt2 \
--backend lite \
--texts-file examples/train_texts.txt \
--out ./runs/tiny-gpt2-draft \
--max-steps 30 \
--device cpu
For CUDA/vLLM official DFlash output:
dflasher build Qwen/Qwen3-0.6B \
--backend cuda \
--out ./runs/qwen3-0.6b-dflash-draft \
--workspace ./runs/qwen3-0.6b-official \
--speculators-repo /path/to/speculators \
--max-samples 5000 \
--seq-length 8192
Use --plan-only to write the CUDA script/manifest without executing training.
The normal trainer requires real data. Use --texts-file or --dataset; the
tiny built-in dataset is only available behind --allow-builtin-data for smoke
or debug runs.
dflasher train sshleifer/tiny-gpt2 \
--texts-file examples/train_texts.txt \
--out ./runs/tiny-gpt2-draft \
--block-size 4 \
--draft-layers 1 \
--draft-hidden-size 64 \
--heads 2 \
--batch-size 2 \
--max-steps 30 \
--device cpu
This local trainer uses target logits as the teacher by default:
- --loss-fn kl_div aligns the draft distribution to the frozen target distribution.
- --loss-fn ce trains against the target argmax token.
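To make the difference between the two loss options concrete, here is a minimal per-position sketch in plain Python. It is an illustration under assumptions, not the trainer's code: the function names are hypothetical, and the KL direction (target distribution against draft distribution) is assumed rather than taken from the source.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_div_loss(target_logits: Sequence[float], draft_logits: Sequence[float]) -> float:
    """KL(target || draft) for one position (direction is an assumption)."""
    t = log_softmax(target_logits)
    d = log_softmax(draft_logits)
    return sum(math.exp(tl) * (tl - dl) for tl, dl in zip(t, d))


def ce_loss(target_logits: Sequence[float], draft_logits: Sequence[float]) -> float:
    """Cross entropy of the draft against the target's argmax token."""
    label = max(range(len(target_logits)), key=lambda i: target_logits[i])
    return -log_softmax(draft_logits)[label]


if __name__ == "__main__":
    target = [2.0, 0.5, -1.0]
    draft = [1.0, 1.0, 0.0]
    print("kl_div:", kl_div_loss(target, draft))
    print("ce:", ce_loss(target, draft))
```

The KL variant uses the full target distribution as a soft label, while the cross-entropy variant keeps only the target's top token, which is the quantity greedy verification actually checks.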
Mac support is split into two explicit paths so the hardware expectations stay clear:
- local DFlash-lite training uses this project and PyTorch MPS (dflasher mac);
- z-lab DFlash MLX uses z-lab's public draft checkpoints and MLX runtime (dflasher zlab mlx-script or the dflasher mac zlab-mlx-script alias).
Run the Mac preflight first:
dflasher mac preflight Qwen/Qwen3.5-4B --device mps
Create a local DFlash-lite train/eval script for MPS:
dflasher mac plan sshleifer/tiny-gpt2 \
--workspace ./runs/mac-tiny-gpt2 \
--texts-file examples/train_texts.txt \
--device mps \
--max-steps 30
bash ./runs/mac-tiny-gpt2/run_mac_dflash_lite.sh
The generated script simply wraps the already-tested local trainer/evaluator:
python -m dflasher.cli train ... --device mps
python -m dflasher.cli eval ... --device mps
Or build a local Mac draft directly:
dflasher build sshleifer/tiny-gpt2 \
--backend mac-lite \
--texts-file examples/train_texts.txt \
--out ./runs/tiny-gpt2-mac-draft \
--device mps \
--max-steps 30
Use the dflasher omlx path for local Apple Silicon models stored under directories such as
~/.omlx/models/.... It is designed for MLX-compatible quantized source models,
including MiniMaxM2-style local checkpoints.
Inspect a local source model without loading all weights:
dflasher omlx inspect ~/.omlx/models/MiniMax-M2.7-ultra-uncensored-heretic-oQ4-MLX
Build a local MLX DFlash draft:
dflasher omlx build ~/.omlx/models/MiniMax-M2.7-ultra-uncensored-heretic-oQ4-MLX \
--texts-file examples/train_texts.txt \
--out ./runs/minimax-m27-oq4-dflash \
--cache-dir ./runs/minimax-m27-oq4-cache \
--max-samples 4 \
--max-length 64 \
--block-size 8 \
--draft-layers 2 \
--max-steps 20 \
--overwrite
The same backend is also available through the product entry point:
dflasher build ~/.omlx/models/MiniMax-M2.7-ultra-uncensored-heretic-oQ4-MLX \
--backend omlx \
--texts-file examples/train_texts.txt \
--out ./runs/minimax-m27-oq4-dflash \
--omlx-cache-dir ./runs/minimax-m27-oq4-cache \
--omlx-max-samples 4 \
--max-length 64 \
--block-size 8 \
--draft-layers 2 \
--max-steps 20 \
--overwrite
Verify target-equivalent greedy decoding with the local OMLX verifier:
dflasher omlx eval ~/.omlx/models/MiniMax-M2.7-ultra-uncensored-heretic-oQ4-MLX \
./runs/minimax-m27-oq4-dflash \
--prompt "Explain speculative decoding in one sentence." \
--max-new-tokens 16
Install the draft into the local oMLX model settings for app/API serving:
dflasher omlx patch-app --app-path /Applications/oMLX.app
dflasher omlx install-app \
~/.omlx/models/MiniMax-M2.7-ultra-uncensored-heretic-oQ4-MLX \
./runs/minimax-m27-oq4-dflash \
--overwrite \
--verify-mode adaptive \
--dflash-block-tokens 4 \
--dflash-verify-len-cap 4 \
--dflash-draft-window-size 2048 \
--dflash-draft-sink-size 64
patch-app also installs the local MiniMax-M2 target backend into oMLX, including
MiniMax tree verification support, a MiniMax attention hook for the verifier
fast path, and lifecycle cleanup for class-level DFlash monkey patches.
install-app can tune runtime verifier cost separately from the checkpoint
block size with --dflash-block-tokens and --dflash-verify-len-cap.
Use --draft-quant when the draft model adds too much memory pressure next to
an oQ source model; keep --no-draft-quant for the fastest path when memory is
available.
This backend is intentionally explicit: cache extraction loads the source model, draft training reads the cache, and evaluation loads the OMLX source model again as the verifier. Large quantized MoE models can still require substantial RAM and time even when the draft itself is small.
For z-lab public DFlash checkpoints on Apple Silicon, generate a small MLX script:
dflasher zlab mlx-script Qwen/Qwen3.5-4B \
--out ./runs/qwen35_mlx.py \
--prompt "How many positive whole-number divisors does 196 have?"
python ./runs/qwen35_mlx.py
Or print the z-lab benchmark command:
dflasher zlab mlx-benchmark-command Qwen/Qwen3.5-4B \
--dataset gsm8k \
--max-samples 128 \
--enable-thinking
For families where dflasher cannot confirm public z-lab MLX support, the MLX
script and benchmark commands require an explicit --force.
The Mac DFlash-lite path creates dflasher draft directories. The z-lab MLX path
does not train a new draft; it runs an existing z-lab DFlash checkpoint via
from dflash.model_mlx import load, load_draft, stream_generate.
The official path follows the public Speculators flow:
prepare_data.py
python -m dflasher.hidden_server
data_generation_offline.py hidden-state cache
train.py --speculator-type dflash
vllm serve checkpoint_best
dflasher official benchmark
This requires a Linux/CUDA environment with speculators and vllm installed.
On a Mac CPU/MPS environment, preflight will intentionally fail the CUDA/vLLM
checks.
The generated script starts python -m dflasher.hidden_server instead of calling
the upstream launch_vllm.py directly, so --trust-remote-code also reaches the
hidden-state server's AutoConfig.from_pretrained(...) call for custom-config
families such as MiniMax/Kimi. Current Speculators prepare_data.py also
supports --trust-remote-code, so dflasher passes it through there when the
global flag is enabled.
dflasher official preflight Qwen/Qwen3-0.6B \
--speculators-repo /path/to/speculators \
--static-only
For MiniMax, DeepSeek, Qwen3.5/3.6, Gemma4, gpt-oss, and unknown future
families, dflasher only proceeds by default when the family is marked supported.
Use --allow-experimental when your local Transformers/vLLM/Speculators stack
can actually load that verifier. This keeps the CLI generic without pretending
that every model is verified on every backend.
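The gating rule described above can be pictured as a small lookup. The sketch below is hypothetical: `FAMILY_STATUS` and `may_proceed` are illustrative names, and the statuses are a rough reading of the Speculators column of the family list in this README, not values taken from dflasher's resolver.

```python
# Hypothetical status table for the Speculators path; not dflasher's real data.
FAMILY_STATUS = {
    "qwen3": "supported",
    "qwen3_moe": "supported",
    "qwen3_next": "supported",
    "llama": "supported",
    "qwen3_5_text": "experimental",
    "minimax_m2": "experimental",
    "kimi_k2": "experimental",
    "deepseek_v3": "experimental",
    "gemma4_text": "experimental",
    "gpt_oss": "experimental",
}


def may_proceed(model_type: str, allow_experimental: bool = False) -> bool:
    """Proceed by default only for supported families; otherwise require opt-in."""
    status = FAMILY_STATUS.get(model_type, "unknown")
    if status == "supported":
        return True
    # Experimental and unknown families need the explicit opt-in flag.
    return allow_experimental


if __name__ == "__main__":
    print(may_proceed("qwen3"))
    print(may_proceed("minimax_m2"))
    print(may_proceed("minimax_m2", allow_experimental=True))
```

Treating unknown families the same way as experimental ones keeps the default conservative while still letting a user with a working local stack opt in.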
Inspect any source model before planning:
dflasher official inspect-model Qwen/Qwen3.6-27B --trust-remote-code
dflasher official inspect-model MiniMaxAI/MiniMax-M2.7 --trust-remote-code
dflasher official inspect-model deepseek-ai/DeepSeek-V3.2 --trust-remote-code
inspect-model does not assume Qwen3. It builds a family profile from the
Hugging Face config when available, then reports:
- Speculators compatibility and required decoder fields;
- z-lab vLLM/SGLang/Transformers/MLX support status;
- default Speculators target layers, for example 2, N//2, N-3;
- default z-lab public-checkpoint layers, for example round(linspace(1, N-3, k));
- known z-lab draft model IDs when dflasher can map them from the source model.
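The two layer-selection formulas in the list above can be worked through with a short sketch. The function names are hypothetical, N is assumed to be the number of decoder layers, and `linspace` is reimplemented in plain Python so the example has no dependencies; rounding of exact .5 values follows Python's round-half-to-even rule, which may or may not match what dflasher does.

```python
from typing import List


def speculators_default_layers(num_layers: int) -> List[int]:
    """Layers 2, N//2, N-3, as in the example formula above."""
    return [2, num_layers // 2, num_layers - 3]


def zlab_default_layers(num_layers: int, k: int) -> List[int]:
    """k evenly spaced layers from 1 to N-3, rounded to integers."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if k == 1:
        return [1]
    start, stop = 1, num_layers - 3
    step = (stop - start) / (k - 1)
    return [round(start + i * step) for i in range(k)]


if __name__ == "__main__":
    # Hypothetical 36-layer model.
    print(speculators_default_layers(36))
    print(zlab_default_layers(36, 5))
```

For a hypothetical 36-layer model this gives layers 2, 18, 33 for the first formula and 1, 9, 17, 25, 33 for the second with k = 5.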
Current family handling:
| Family | Examples | dflasher default |
|---|---|---|
| Qwen3 dense/MoE/Coder/Next | qwen3, qwen3_moe, qwen3_next | Speculators supported; z-lab style supported |
| Qwen3.5/Qwen3.6 | qwen3_5_text, Qwen3.6 model IDs | z-lab style first; Speculators experimental |
| MiniMax/Kimi | minimax_m2, kimi_k2 | z-lab preview; Speculators experimental |
| DeepSeek | deepseek_v3, deepseek_v32, deepseek_v4 | marked coming soon/experimental until public DFlash checkpoints are available |
| Gemma 4 | gemma4_text | z-lab custom-vLLM/SGLang/MLX path; Speculators experimental |
| gpt-oss | gpt_oss | z-lab supported; Speculators experimental |
| LLaMA | llama | Speculators and z-lab style supported |
For known z-lab checkpoints, dflasher can print a serving command directly:
dflasher zlab serve-command Qwen/Qwen3.6-27B --backend vllm --trust-remote-code
dflasher zlab serve-command MiniMaxAI/MiniMax-M2.7 --backend sglang --trust-remote-code --force
The z-lab command helper follows the current z-lab README conventions: vLLM
commands include an attention backend by default, Gemma4 commands include the
draft-side flash_attn setting and warn that a Gemma4-capable vLLM build/Docker
image is required, and SGLang commands include the long-context env flag plus the
draft attention/backend scheduler flags from the public examples.
If a z-lab serving backend is only preview/unknown for a family, pass --force
to print the command anyway.
Generate the full 5k-sample Qwen3-0.6B plan:
dflasher official plan Qwen/Qwen3-0.6B \
--workspace ./runs/qwen3-0.6b-official \
--speculators-repo /path/to/speculators \
--max-samples 5000 \
--seq-length 8192 \
--epochs 5 \
--mode offline-cache \
--block-size 8 \
--max-anchors 3072 \
--draft-layers 5 \
--draft-vocab-size 8192 \
--draft-arch llama \
--python-bin python \
--vllm-gpus 0 \
--train-gpus 0
For large or custom verifier models, pass vLLM options through explicitly:
dflasher official plan MiniMaxAI/MiniMax-M2.7 \
--trust-remote-code \
--allow-experimental \
--vllm-arg=--tensor-parallel-size \
--vllm-arg=4 \
--serve-arg=--tensor-parallel-size \
--serve-arg=4
--draft-arch defaults to llama for vLLM serving compatibility with current
Speculators docs. Pass --draft-arch qwen3 only when your installed Speculators
and vLLM stack explicitly supports that draft architecture end to end.
official plan runs static model/repo/script checks by default. Add
--check-environment to require CUDA/package checks before writing the script,
or --skip-preflight only for offline script generation.
Run the generated script in a CUDA environment:
bash ./runs/qwen3-0.6b-official/run_official_dflash.sh
Or build directly to a final official draft output directory:
dflasher build Qwen/Qwen3-0.6B \
--backend cuda \
--out ./runs/qwen3-0.6b-dflash-draft \
--workspace ./runs/qwen3-0.6b-official \
--speculators-repo /path/to/speculators \
--max-samples 5000 \
--seq-length 8192
Or run stages from the manifest:
dflasher official run-stage prepare \
--manifest ./runs/qwen3-0.6b-official/dflasher_official_manifest.json
After training:
dflasher official inspect-checkpoint \
./runs/qwen3-0.6b-official/checkpoints/checkpoint_best
dflasher official serve-command \
./runs/qwen3-0.6b-official/checkpoints/checkpoint_best
vllm serve ./runs/qwen3-0.6b-official/checkpoints/checkpoint_best --port 8000
dflasher official benchmark \
./runs/qwen3-0.6b-official/checkpoints/checkpoint_best \
--base-url http://localhost:8000/v1 \
--output-json ./runs/qwen3-0.6b-official/benchmark.json
dflasher official validate-equivalence \
--target-base-url http://localhost:8001/v1 \
--draft-base-url http://localhost:8000/v1 \
--target-model Qwen/Qwen3-0.6B \
--draft-model ./runs/qwen3-0.6b-official/checkpoints/checkpoint_best
dflasher eval sshleifer/tiny-gpt2 ./runs/tiny-gpt2-draft \
--prompts-file examples/prompts.txt \
--max-new-tokens 12 \
--device cpu
dflasher generate sshleifer/tiny-gpt2 ./runs/tiny-gpt2-draft \
--prompt "Speculative decoding" \
--max-new-tokens 24 \
--device cpu
DFlash trains a lightweight block diffusion drafter for speculative decoding. The paper describes extracting selected target hidden states, conditioning the drafter with those context features, sampling masked response blocks around anchor tokens, using a position-decayed cross entropy loss, and sharing the target embedding/LM head while keeping the target frozen.
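As a rough illustration of what a position-decayed cross entropy could look like, the sketch below weights each position in a drafted block by a geometric factor. The exact schedule used by the paper or by dflasher is not given here, so the geometric form, the `decay` parameter, and the function name are all assumptions.

```python
import math
from typing import Sequence


def position_decayed_ce(
    block_logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    decay: float = 0.8,
) -> float:
    """Weighted mean cross entropy over one block; position j gets weight decay**j."""
    weights = [decay ** j for j in range(len(labels))]
    total = 0.0
    for w, logits, label in zip(weights, block_logits, labels):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += w * (log_z - logits[label])  # per-position cross entropy
    return total / sum(weights)


if __name__ == "__main__":
    good, bad = [5.0, 0.0], [0.0, 5.0]  # the correct label is 0 in both cases
    early_error = position_decayed_ce([bad, good, good], [0, 0, 0])
    late_error = position_decayed_ce([good, good, bad], [0, 0, 0])
    print(early_error > late_error)
```

The intuition behind this kind of weighting is that under prefix-based verification a mistake early in the block invalidates every later token, so early positions are worth more to get right.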
The local dflasher train path follows those implementation ideas where they are
architecture-neutral, but keeps the first version simple:
- generic PyTorch cross-attention instead of z-lab's Qwen3-specific KV injection;
- greedy-only target-equivalent verification;
- no FlexAttention sparse block training yet.
The dflasher official path delegates DFlash KV-injection architecture,
hidden-state extraction, sparse/block training, reduced vocab mapping, and
checkpoint format to vLLM Speculators. For non-Qwen families, the resolver keeps
the workflow explicit: it will plan when the model config exposes the fields that
Speculators needs, and it will surface gated/custom-backend requirements instead
of pretending every model can be trained on every machine.
Useful references:
- DFlash paper: https://arxiv.org/abs/2602.06036
- z-lab DFlash repository: https://github.com/z-lab/dflash
- z-lab model card example: https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16