This directory contains an example data preparation pipeline using Qwen/Qwen3-4B as the target model.
DeepSpec trains draft models against a target model. The data pipeline does three things:
- download and split prompt data,
- regenerate assistant answers with the target model,
- precompute the target cache used by training.
The example below targets Qwen/Qwen3-4B, but the same pipeline applies to other models (e.g. Gemma). To switch targets, change the model name (--model / model_path) and adjust the sampling parameters (--temperature, --top-p, --top-k and --min-p) to match the recommended generation settings for that model. Output paths in the examples reference qwen3_4b; rename them as needed.
The wrapper script prepare_data.sh records the default settings. The individual Python scripts are also documented below for users who want to run each stage manually.
Default outputs:
train_datasets/perfectblend_train.jsonl
train_datasets/qwen3_4b/perfectblend_train_regen.jsonl
~/.cache/deepspec/qwen3_4b_target_cache
The example scripts assume a single machine with eight visible GPUs by default. For fewer GPUs, edit num_workers and CUDA_VISIBLE_DEVICES in the shell scripts.
The source dataset is mlabonne/open-perfectblend. The train split is written as JSONL, and the held-out user turns are written under eval_datasets/.
python scripts/data/download_and_split.py \
--dataset-name mlabonne/open-perfectblend \
--test-size 0.05 \
--train-output-path train_datasets/perfectblend_train.jsonl \
--test-output-dir eval_datasets \
--skip-existing
This produces:
train_datasets/perfectblend_train.jsonl
eval_datasets/perfectblend.jsonl
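As a rough illustration of what a 5% hold-out split written as JSONL involves, here is a minimal standard-library sketch. It is not the implementation of download_and_split.py: the function names, the seed, and the record layout are assumptions made for this example, and it does not reproduce the extraction of user turns that the real script performs for the eval file.

```python
import json
import random
from pathlib import Path


def split_records(records, test_size=0.05, seed=0):
    """Shuffle records and hold out a test_size fraction (hypothetical helper)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled records are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The same read_jsonl helper is handy for spot-checking a few records of train_datasets/perfectblend_train.jsonl before starting the slower regeneration step.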
This step serves the target model and regenerates assistant answers against it. Any inference engine that exposes an OpenAI-compatible /v1 endpoint works (SGLang, vLLM, TGI, etc.); the example below uses SGLang. SGLang is not in requirements.txt; install it separately, e.g. pip install "sglang[all]".
Start local SGLang servers in one terminal:
bash scripts/data/launch_sglang_server.sh
By default this starts eight Qwen/Qwen3-4B workers on ports 30000 to 30007 and writes logs to:
logs/sglang_qwen3_4b/
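If you run fewer workers, the addresses passed to --server-address in the regeneration step have to match the ports that were actually started. The sketch below builds that list. It assumes workers listen on consecutive ports starting at 30000 on 127.0.0.1, as in the default setup; server_addresses is a hypothetical helper, not part of the repository, so check launch_sglang_server.sh for the ports it really uses.

```python
def server_addresses(num_workers=8, host="127.0.0.1", base_port=30000):
    """Return one host:port string per worker, on consecutive ports.

    Assumes the launch script assigns ports base_port, base_port + 1, ...
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [f"{host}:{base_port + i}" for i in range(num_workers)]


# Example: four workers instead of the default eight.
print(" ".join(server_addresses(num_workers=4)))
```

The printed, space-separated list can be pasted after --server-address.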
In another terminal, regenerate the assistant answers:
python scripts/data/generate_train_data.py \
--model Qwen/Qwen3-4B \
--server-address \
127.0.0.1:30000 \
127.0.0.1:30001 \
127.0.0.1:30002 \
127.0.0.1:30003 \
127.0.0.1:30004 \
127.0.0.1:30005 \
127.0.0.1:30006 \
127.0.0.1:30007 \
--concurrency 32 \
--temperature 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0 \
--max-tokens 4096 \
--disable-thinking \
--resume \
--input-file-path train_datasets/perfectblend_train.jsonl \
--output-file-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl
This produces:
train_datasets/qwen3_4b/perfectblend_train_regen.jsonl
If any samples fail, the script writes them to:
train_datasets/qwen3_4b/perfectblend_train_regen_error.jsonl
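Before moving on, it can help to compare how many samples went in, how many were regenerated, and how many failed. The following is a minimal sketch that only counts lines. It assumes each file holds one JSON record per non-empty line; count_jsonl_records and summarize_regen are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file; a missing file counts as zero."""
    path = Path(path)
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def summarize_regen(input_path, output_path, error_path):
    """Return record counts for the input, regenerated, and error files."""
    return {
        "input": count_jsonl_records(input_path),
        "regenerated": count_jsonl_records(output_path),
        "failed": count_jsonl_records(error_path),
    }


if __name__ == "__main__":
    print(summarize_regen(
        "train_datasets/perfectblend_train.jsonl",
        "train_datasets/qwen3_4b/perfectblend_train_regen.jsonl",
        "train_datasets/qwen3_4b/perfectblend_train_regen_error.jsonl",
    ))
```

A missing error file is reported as zero failures, which matches the behavior described above, where the file is written only if samples fail.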
Stop the SGLang servers before the next step if they are using the same GPUs.
The training loop reads a precomputed target cache instead of repeatedly running the target model. Prepare it with:
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}
export MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
export MASTER_PORT=${MASTER_PORT:-29500}
export RANK=${RANK:-0}
export WORLD_SIZE=${WORLD_SIZE:-1}
python scripts/data/prepare_target_cache.py \
--config config/dspark/dspark_qwen3_4b.py \
--train-data-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl \
--output-dir ${HOME}/.cache/deepspec/qwen3_4b_target_cache \
--local-batch-size 16
Storage warning: The target cache stores per-token hidden states for the full training set and can be very large. With the default Qwen/Qwen3-4B setting it takes roughly 38 TB of disk. Make sure the --output-dir filesystem has enough free space (the requirement scales with dataset size, sequence length, and target hidden dimension) before running this step. If storage is limited, use a smaller training set and/or reduce model.target_layer_ids in the config (fewer captured layers means proportionally less cache).
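The scaling described in the warning can be turned into a back-of-the-envelope check. This sketch multiplies token count, number of captured layers, hidden size, and bytes per value, then compares the result with the free space on the target filesystem. It is only an estimate under stated assumptions: the real cache may store additional data, the two-byte value size assumes a 16-bit dtype, and the numbers in the example (including hidden_size=2560) are illustrative placeholders rather than measured DeepSpec values. Read the real hidden size and layer list from your model and config.

```python
import shutil


def estimate_cache_bytes(num_tokens, hidden_size, num_layers, bytes_per_value=2):
    """Rough size of per-token hidden states: tokens x layers x hidden size x bytes."""
    return num_tokens * num_layers * hidden_size * bytes_per_value


def has_enough_space(path, required_bytes, safety_factor=1.1):
    """Check free space on the filesystem containing path (path must exist)."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * safety_factor


# Illustrative numbers only: one billion tokens, four captured layers.
needed = estimate_cache_bytes(num_tokens=1_000_000_000, hidden_size=2560, num_layers=4)
print(f"estimated cache size: {needed / 1e12:.2f} TB")
print("enough free space here:", has_enough_space(".", needed))
```

Because the estimate is linear in every factor, halving the number of captured layers or the number of training tokens halves the expected cache size.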
This produces the cache consumed by scripts/train/train.sh:
~/.cache/deepspec/qwen3_4b_target_cache
The wrapper script combines the default public commands:
bash scripts/data/prepare_data.sh
Use the manual commands above if you want to stop and restart services between stages, change sampling parameters, use fewer GPUs, or inspect intermediate outputs.