Skip to content

Latest commit

 

History

History
143 lines (105 loc) · 4.98 KB

File metadata and controls

143 lines (105 loc) · 4.98 KB

Data Preparation

This directory contains an example data preparation pipeline using Qwen/Qwen3-4B as the target model.

DeepSpec trains draft models against a target model. The data pipeline does three things:

  1. download and split prompt data,
  2. regenerate assistant answers with the target model,
  3. precompute the target cache used by training.

The example below targets Qwen/Qwen3-4B, but the same pipeline applies to other models (e.g. Gemma). To switch targets, change the model name (--model / model_path) and adjust the sampling parameters (--temperature, --top-p, --top-k and --min-p) to match the recommended generation settings for that model. Output paths in the examples reference qwen3_4b; rename them as needed.

The wrapper script prepare_data.sh records the default settings. The individual Python scripts are also documented below for users who want to run each stage manually.

Outputs

Default outputs:

train_datasets/perfectblend_train.jsonl
train_datasets/qwen3_4b/perfectblend_train_regen.jsonl
~/.cache/deepspec/qwen3_4b_target_cache

The example scripts assume a single machine with eight visible GPUs by default. For fewer GPUs, edit num_workers and CUDA_VISIBLE_DEVICES in the shell scripts.

Step 1: Download And Split Data

The source dataset is mlabonne/open-perfectblend. The train split is written as JSONL, and the held-out user turns are written under eval_datasets/.

python scripts/data/download_and_split.py \
    --dataset-name mlabonne/open-perfectblend \
    --test-size 0.05 \
    --train-output-path train_datasets/perfectblend_train.jsonl \
    --test-output-dir eval_datasets \
    --skip-existing

This produces:

train_datasets/perfectblend_train.jsonl
eval_datasets/perfectblend.jsonl

Step 2: Regenerate Answers With Qwen3-4B

This step serves the target model and regenerates assistant answers against it. Any OpenAI-compatible inference engine works (SGLang, vLLM, TGI, etc.) — the example below uses SGLang, but you can swap in whatever engine you prefer as long as it exposes an OpenAI-compatible /v1 endpoint. SGLang is not in requirements.txt; install it separately, e.g. pip install "sglang[all]".

Start local sglang servers in one terminal:

bash scripts/data/launch_sglang_server.sh

By default this starts eight Qwen/Qwen3-4B workers on ports 30000 to 30007 and writes logs to:

logs/sglang_qwen3_4b/

In another terminal, regenerate the assistant answers:

python scripts/data/generate_train_data.py \
    --model Qwen/Qwen3-4B \
    --server-address \
        127.0.0.1:30000 \
        127.0.0.1:30001 \
        127.0.0.1:30002 \
        127.0.0.1:30003 \
        127.0.0.1:30004 \
        127.0.0.1:30005 \
        127.0.0.1:30006 \
        127.0.0.1:30007 \
    --concurrency 32 \
    --temperature 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0 \
    --max-tokens 4096 \
    --disable-thinking \
    --resume \
    --input-file-path train_datasets/perfectblend_train.jsonl \
    --output-file-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl

This produces:

train_datasets/qwen3_4b/perfectblend_train_regen.jsonl

If any samples fail, the script writes them to:

train_datasets/qwen3_4b/perfectblend_train_regen_error.jsonl

Stop the sglang servers before the next step if they are using the same GPUs.

Step 3: Prepare Target Cache

The training loop reads a precomputed target cache instead of repeatedly running the target model. Prepare it with:

export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}
export MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
export MASTER_PORT=${MASTER_PORT:-29500}
export RANK=${RANK:-0}
export WORLD_SIZE=${WORLD_SIZE:-1}

python scripts/data/prepare_target_cache.py \
    --config config/dspark/dspark_qwen3_4b.py \
    --train-data-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl \
    --output-dir ${HOME}/.cache/deepspec/qwen3_4b_target_cache \
    --local-batch-size 16

Storage warning: The target cache stores per-token hidden states for the full training set and can be very large. With the default Qwen/Qwen3-4B setting it takes roughly 38 TB of disk. Make sure the --output-dir filesystem has enough free space (scaling with dataset size, sequence length, and target hidden dimension) before running this step. If storage is limited, use a smaller training set and/or reduce model.target_layer_ids in the config (fewer captured layers means proportionally less cache).

This produces the cache consumed by scripts/train/train.sh:

~/.cache/deepspec/qwen3_4b_target_cache

Wrapper Script

The wrapper script combines the default public commands:

bash scripts/data/prepare_data.sh

Use the manual commands above if you want to stop and restart services between stages, change sampling parameters, use fewer GPUs, or inspect intermediate outputs.