Data Preparation

This directory contains an example data preparation pipeline using Qwen/Qwen3-4B as the target model.

DeepSpec trains draft models against a target model. The data pipeline does three things:

download and split prompt data,
regenerate assistant answers with the target model,
precompute the target cache used by training.

The example below targets Qwen/Qwen3-4B, but the same pipeline applies to other models (e.g. Gemma). To switch targets, change the model name (--model / model_path) and adjust the sampling parameters (--temperature, --top-p, --top-k and --min-p) to match the recommended generation settings for that model. Output paths in the examples reference qwen3_4b; rename them as needed.

The wrapper script prepare_data.sh records the default settings. The individual Python scripts are also documented below for users who want to run each stage manually.

Outputs

Default outputs:

train_datasets/perfectblend_train.jsonl
train_datasets/qwen3_4b/perfectblend_train_regen.jsonl
~/.cache/deepspec/qwen3_4b_target_cache

The example scripts assume a single machine with eight visible GPUs by default. For fewer GPUs, edit num_workers and CUDA_VISIBLE_DEVICES in the shell scripts.

Step 1: Download And Split Data

The source dataset is mlabonne/open-perfectblend. The train split is written as JSONL, and the held-out user turns are written under eval_datasets/.

python scripts/data/download_and_split.py \
    --dataset-name mlabonne/open-perfectblend \
    --test-size 0.05 \
    --train-output-path train_datasets/perfectblend_train.jsonl \
    --test-output-dir eval_datasets \
    --skip-existing

This produces:

train_datasets/perfectblend_train.jsonl
eval_datasets/perfectblend.jsonl

Step 2: Regenerate Answers With Qwen3-4B

This step serves the target model and regenerates assistant answers against it. Any OpenAI-compatible inference engine works (SGLang, vLLM, TGI, etc.) — the example below uses SGLang, but you can swap in whatever engine you prefer as long as it exposes an OpenAI-compatible /v1 endpoint. SGLang is not in requirements.txt; install it separately, e.g. pip install "sglang[all]".

Start local sglang servers in one terminal:

bash scripts/data/launch_sglang_server.sh

By default this starts eight Qwen/Qwen3-4B workers on ports 30000 to 30007 and writes logs to:

logs/sglang_qwen3_4b/

In another terminal, regenerate the assistant answers:

python scripts/data/generate_train_data.py \
    --model Qwen/Qwen3-4B \
    --server-address \
        127.0.0.1:30000 \
        127.0.0.1:30001 \
        127.0.0.1:30002 \
        127.0.0.1:30003 \
        127.0.0.1:30004 \
        127.0.0.1:30005 \
        127.0.0.1:30006 \
        127.0.0.1:30007 \
    --concurrency 32 \
    --temperature 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0 \
    --max-tokens 4096 \
    --disable-thinking \
    --resume \
    --input-file-path train_datasets/perfectblend_train.jsonl \
    --output-file-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl

This produces:

train_datasets/qwen3_4b/perfectblend_train_regen.jsonl

If any samples fail, the script writes them to:

train_datasets/qwen3_4b/perfectblend_train_regen_error.jsonl

Stop the sglang servers before the next step if they are using the same GPUs.

Step 3: Prepare Target Cache

The training loop reads a precomputed target cache instead of repeatedly running the target model. Prepare it with:

export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}
export MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
export MASTER_PORT=${MASTER_PORT:-29500}
export RANK=${RANK:-0}
export WORLD_SIZE=${WORLD_SIZE:-1}

python scripts/data/prepare_target_cache.py \
    --config config/dspark/dspark_qwen3_4b.py \
    --train-data-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl \
    --output-dir ${HOME}/.cache/deepspec/qwen3_4b_target_cache \
    --local-batch-size 16

Storage warning: The target cache stores per-token hidden states for the full training set and can be very large. With the default Qwen/Qwen3-4B setting it takes roughly 38 TB of disk. Make sure the --output-dir filesystem has enough free space (scaling with dataset size, sequence length, and target hidden dimension) before running this step. If storage is limited, use a smaller training set and/or reduce model.target_layer_ids in the config (fewer captured layers means proportionally less cache).

This produces the cache consumed by scripts/train/train.sh:

~/.cache/deepspec/qwen3_4b_target_cache

Wrapper Script

The wrapper script combines the default public commands:

bash scripts/data/prepare_data.sh

Use the manual commands above if you want to stop and restart services between stages, change sampling parameters, use fewer GPUs, or inspect intermediate outputs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Preparation

Outputs

Step 1: Download And Split Data

Step 2: Regenerate Answers With Qwen3-4B

Step 3: Prepare Target Cache

Wrapper Script

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Data Preparation

Outputs

Step 1: Download And Split Data

Step 2: Regenerate Answers With Qwen3-4B

Step 3: Prepare Target Cache

Wrapper Script