[gkd] top-k-logits & teacher server #7918
Changes from all commits
176ef74
379e2d9
a6a5aea
d67b0f8
4e8e4da
859ca40
a6ecebb
44f0e4e
fc1b673
fd35140
f23b71e
17f88e4
51dd414
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| # Teacher server must be running first: | ||
| # CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000 --max-logprobs 64 | ||
| | ||
| CUDA_VISIBLE_DEVICES=1,2 \ | ||
| NPROC_PER_NODE=2 \ | ||
| PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ | ||
| megatron rlhf \ | ||
| --rlhf_type gkd \ | ||
| --model Qwen/Qwen2.5-0.5B \ | ||
| --teacher_model_server http://localhost:8000 \ | ||
| --gkd_logits_topk 64 \ | ||
| --dataset 'modelscope/gsm8k' \ | ||
| --tensor_model_parallel_size 1 \ | ||
| --pipeline_model_parallel_size 1 \ | ||
| --context_parallel_size 1 \ | ||
| --expert_model_parallel_size 1 \ | ||
| --lmbda 1 \ | ||
| --seq_kd false \ | ||
| --beta 0.5 \ | ||
| --torch_dtype bfloat16 \ | ||
| --micro_batch_size 2 \ | ||
| --global_batch_size 32 \ | ||
| --train_iters 500 \ | ||
| --lr 5e-5 \ | ||
| --lr_warmup_fraction 0.1 \ | ||
| --logging_steps 1 \ | ||
| --save_steps 100 \ | ||
| --save_total_limit 10 \ | ||
| --max_length 2048 \ | ||
| --max_completion_length 2048 \ | ||
| --attention_backend flash \ | ||
| --use_vllm true \ | ||
| --vllm_mode colocate \ | ||
| --vllm_gpu_memory_utilization 0.5 \ | ||
| --vllm_tensor_parallel_size 1 \ | ||
| --vllm_max_model_len 4096 \ | ||
| --sleep_level 1 \ | ||
| --finetune \ | ||
| --no_save_optim \ | ||
| --no_save_rng \ | ||
| --temperature 1.0 \ | ||
| --padding_free true \ | ||
| --recompute_granularity selective |
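
The script above notes that the teacher server must be running before training starts. A minimal sketch of a readiness check follows. It assumes the teacher is served through vLLM's OpenAI-compatible server and that polling `/v1/models` is an acceptable liveness probe. The helper name `wait_for_teacher` and its retry defaults are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): poll the teacher server until it
# answers, or give up after a fixed number of attempts.
# Assumption: the server exposes the OpenAI-compatible /v1/models route.
wait_for_teacher() {
    local base_url="$1"        # e.g. http://localhost:8000
    local max_tries="${2:-60}" # number of attempts before giving up
    local delay="${3:-5}"      # seconds to sleep between attempts
    local i
    for ((i = 1; i <= max_tries; i++)); do
        # -f makes curl return non-zero on HTTP errors; --max-time bounds each probe.
        if curl -sf -o /dev/null --max-time 5 "${base_url}/v1/models"; then
            return 0
        fi
        sleep "${delay}"
    done
    return 1
}

# Usage (illustrative): start training only once the teacher responds.
# wait_for_teacher http://localhost:8000 && bash train_gkd.sh
```

The function returns a non-zero status when the server never responds, so it can gate the training command with `&&`. The script name `train_gkd.sh` in the usage comment is a placeholder.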
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,44 @@ | ||||||||||
| # GKD Training with External Teacher Model Server (vLLM) | ||||||||||
| # ===================== Step 1: Start Teacher Server ===================== | ||||||||||
| # Run in a separate terminal / GPU: | ||||||||||
| # | ||||||||||
| # CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct \ | ||||||||||
| # --port 8000 \ | ||||||||||
| # --max-logprobs 64 \ | ||||||||||
| # --gpu-memory-utilization 0.9 | ||||||||||
| | ||||||||||
| # ======================================================================== | ||||||||||
| | ||||||||||
| NPROC_PER_NODE=4 \ | ||||||||||
| CUDA_VISIBLE_DEVICES=0,1,2,3 \ | ||||||||||
| PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ | ||||||||||
| swift rlhf \ | ||||||||||
| --rlhf_type gkd \ | ||||||||||
| --model Qwen/Qwen2.5-0.5B \ | ||||||||||
| --teacher_model_server http://localhost:8000 \ | ||||||||||
| --gkd_logits_topk 64 \ | ||||||||||
| --use_vllm true \ | ||||||||||
| --vllm_mode colocate \ | ||||||||||
| --vllm_gpu_memory_utilization 0.5 \ | ||||||||||
| --vllm_tensor_parallel_size 1 \ | ||||||||||
| --vllm_max_model_len 4096 \ | ||||||||||
| --sleep_level 0 \ | ||||||||||
| --dataset 'modelscope/gsm8k' \ | ||||||||||
| --lmbda 1 \ | ||||||||||
| --seq_kd false \ | ||||||||||
| --beta 0.5 \ | ||||||||||
| --torch_dtype bfloat16 \ | ||||||||||
| --per_device_train_batch_size 2 \ | ||||||||||
| --gradient_accumulation_steps 4 \ | ||||||||||
| --learning_rate 5e-5 \ | ||||||||||
| --logging_steps 1 \ | ||||||||||
| --save_steps 100 \ | ||||||||||
| --save_total_limit 2 \ | ||||||||||
| --max_length 2048 \ | ||||||||||
| --max_completion_length 2048 \ | ||||||||||
Comment on lines +37 to +38
Contributor
The values for
Suggested change
| --warmup_ratio 0.1 \ | ||||||||||
| --save_only_model true \ | ||||||||||
| --dataloader_num_workers 4 \ | ||||||||||
| --dataset_num_proc 4 \ | ||||||||||
| --attn_impl flash_attn \ | ||||||||||
| --report_to tensorboard swanlab | ||||||||||
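
As a sanity check on the batch settings, the effective batch size of the `swift rlhf` script can be worked out with a few lines of shell. This is a sketch that assumes plain data parallelism, where the effective batch size is per-device batch size × gradient accumulation steps × number of processes. The function name `effective_batch_size` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): effective batch size under the
# assumption of pure data parallelism.
effective_batch_size() {
    local per_device="$1"  # --per_device_train_batch_size
    local grad_accum="$2"  # --gradient_accumulation_steps
    local world_size="$3"  # NPROC_PER_NODE (single node assumed)
    echo $(( per_device * grad_accum * world_size ))
}

# Values from the swift rlhf script: 2 per device, 4 accumulation steps, 4 processes.
effective_batch_size 2 4 4   # prints 32
```

Under that assumption the result, 32, matches the `--global_batch_size 32` set in the Megatron script, so the two examples train with the same number of samples per optimizer step.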
There's a minor grammatical error here. "It use" should be "It uses".