Describe the bug
When training an Eagle3 draft model for Qwen2.5-VL, the `--build-dataset-num-proc` parameter in the training script has to be set to 0; otherwise a deadlock is triggered and the dataset `map` workers get stuck.
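For context on why I suspect a fork deadlock: `--build-dataset-num-proc` looks like it is forwarded to Hugging Face `datasets.Dataset.map` as `num_proc` (an assumption on my side, based on the `map` workers hanging). A minimal sketch of the pattern below; the toy dataset and map function are illustrative stand-ins, not SpecForge code:

```python
# Minimal sketch of the suspected failure mode (not SpecForge code).
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"] * 100})

def add_length(example):
    example["length"] = len(example["text"])
    return example

# num_proc > 0 makes `map` fork worker processes. Inside a torchrun rank
# that has already initialized CUDA/NCCL (or a fast tokenizer's threads),
# the forked children can inherit a held lock and block forever.
ds = ds.map(add_length, num_proc=8)

# num_proc=None runs the map in the main process with no fork, which
# matches the behavior I see with --build-dataset-num-proc 0.
ds = ds.map(add_length, num_proc=None)
```

Setting the flag to 0 sidesteps the fork entirely, which is consistent with the hang disappearing.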
Reproduction
```bash
torchrun \
  --standalone \
  --nproc_per_node $NUM_GPUS \
  $ROOT_DIR/scripts/train_eagle3.py \
  --target-model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --target-model-backend hf \
  --draft-model-config $ROOT_DIR/configs/qwen2-5-vl-eagle3.json \
  --build-dataset-num-proc 8 \
  --train-data-path $ROOT_DIR/cache/dataset/allava4v_train.jsonl \
  --output-dir $ROOT_DIR/outputs/Qwen2.5-VL-7B-eagle3 \
  --num-epochs 10 \
  --batch-size 1 \
  --learning-rate 1e-4 \
  --max-length 8192 \
  --dist-timeout 360 \
  --chat-template qwen2-vl \
  --cache-dir $ROOT_DIR/cache \
  --embedding-key model.embed_tokens.weight \
  --tp-size 1 \
  --is-vlm \
  --min-pixels 50176 \
  --max-pixels 802816
```
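If the root cause is indeed forking after CUDA initialization, a possible stopgap would be for the dataset build to fall back to a single-process map in that case. A hypothetical helper as a sketch (`safe_num_proc` is my name, not an existing SpecForge function):

```python
import torch

def safe_num_proc(requested: int) -> int | None:
    """Fall back to a single-process `datasets.map` when forking is unsafe."""
    # Forking after CUDA has been initialized in this rank is a known
    # source of child-process deadlocks; run the map in the main process.
    if requested > 0 and torch.cuda.is_initialized():
        return None  # `datasets` treats num_proc=None as single-process
    return requested if requested > 0 else None
```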
Environment
sglang 0.5.3