Add OLMo-core 32B GRPO launch scripts #1726
OLMo-core (FSDP) versions of the 32B Think RL script and a 32B-scaled RLZero math script, running open_instruct/grpo.py instead of the DeepSpeed grpo_fast.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request adds two OLMo-core (FSDP) launch scripts for 32B GRPO training and updates the changelog. The review feedback identifies a dataset-to-split mismatch in the evaluation config, an over-allocation of nodes (requesting 28 instead of 18), a missing uv run prefix for running python, and a placeholder PR number in the changelog.
| DATASETS="allenai/Dolci-RLZero-Math-7B 1.0" | ||
| LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 16" | ||
| LOCAL_EVAL_SPLITS="train train" |
There is a mismatch between the number of datasets in LOCAL_EVALS (which has 1 dataset: allenai/Dolci-RLZero-Math-7B) and the number of splits in LOCAL_EVAL_SPLITS (which has 2 splits: "train train"). This will likely cause a runtime error during dataset mixer parsing. Please update LOCAL_EVAL_SPLITS to "train".
| LOCAL_EVAL_SPLITS="train train" | |
| LOCAL_EVAL_SPLITS="train" |
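A mismatch like this can be caught before launch. The following is a minimal sketch, assuming `LOCAL_EVALS` is a whitespace-separated list of `dataset count` pairs and `LOCAL_EVAL_SPLITS` holds one split per dataset, as the review comment implies. The helper names `count_words` and `check_eval_splits` are hypothetical and not part of the repository.

```shell
# Hypothetical pre-launch check, not part of open-instruct.
# Assumes LOCAL_EVALS is "dataset count [dataset count ...]" and
# LOCAL_EVAL_SPLITS has exactly one split name per dataset.

# Count whitespace-separated words in a string (POSIX sh word splitting).
count_words() {
    set -f          # disable globbing while the string is split
    set -- $1       # intentionally unquoted: split on whitespace
    set +f
    echo "$#"
}

# Return 0 when the number of datasets matches the number of splits.
check_eval_splits() {
    num_datasets=$(( $(count_words "$1") / 2 ))   # two tokens per dataset
    num_splits=$(count_words "$2")
    if [ "$num_datasets" -ne "$num_splits" ]; then
        echo "mismatch: $num_datasets dataset(s) vs $num_splits split(s)" >&2
        return 1
    fi
    return 0
}

LOCAL_EVALS="allenai/Dolci-RLZero-Math-7B 16"
LOCAL_EVAL_SPLITS="train"
check_eval_splits "$LOCAL_EVALS" "$LOCAL_EVAL_SPLITS" && echo "eval datasets and splits line up"
```

With the original value `LOCAL_EVAL_SPLITS="train train"` the check returns non-zero, which a launch script could use to abort before submitting the job.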
| --priority urgent \ | ||
| --gs_model_name "sft_olmo3_32b_rl_run_testing" \ | ||
| --preemptible \ | ||
| --num_nodes 28 \ |
There is a node allocation mismatch. The script requests --num_nodes 28, but the actual resources required by the configuration are only 18 nodes: Learners need 12 nodes (12 entries in --num_learners_per_node of 8 GPUs each = 96 GPUs), and vLLM needs 6 nodes (--vllm_num_engines 6 with --vllm_tensor_parallel_size 8 = 48 GPUs). Total nodes needed: 12 + 6 = 18 nodes. Requesting 28 nodes will waste 10 nodes (80 GPUs) on the cluster and unnecessarily increase queue wait times. Please update --num_nodes to 18.
| --num_nodes 28 \ | |
| --num_nodes 18 \ |
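The node arithmetic in this comment can be reproduced with a few lines of shell. This is only a sketch under the reviewer's stated assumptions: 8 GPUs per node, 12 learner nodes, and vLLM engines packed densely onto whole nodes with nothing else consuming nodes. The variable names are illustrative and do not come from the launch script.

```shell
# Illustrative arithmetic only; values are taken from the review comment.
GPUS_PER_NODE=8
LEARNER_NODES=12                 # entries in --num_learners_per_node, 8 GPUs each
VLLM_NUM_ENGINES=6               # --vllm_num_engines
VLLM_TENSOR_PARALLEL_SIZE=8      # --vllm_tensor_parallel_size

learner_gpus=$(( LEARNER_NODES * GPUS_PER_NODE ))
vllm_gpus=$(( VLLM_NUM_ENGINES * VLLM_TENSOR_PARALLEL_SIZE ))
# Round up so a partially filled node still counts as a whole node.
vllm_nodes=$(( (vllm_gpus + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
total_nodes=$(( LEARNER_NODES + vllm_nodes ))

echo "learner GPUs: $learner_gpus, vLLM GPUs: $vllm_gpus, nodes needed: $total_nodes"
```

Under these assumptions the total comes to 18 nodes, which matches the suggested `--num_nodes 18`.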
| ### Added | ||
| - Add OLMo-core (FSDP) launch scripts `scripts/train/olmo3/32b_think_rl_olmocore.sh` and `scripts/train/olmo3/32b_rlzero_math_olmocore.sh`, which run 32B GRPO via `open_instruct/grpo.py` instead of the DeepSpeed `grpo_fast.py` (https://github.com/allenai/open-instruct/pull/1726). |
| --env LD_LIBRARY_PATH=/var/lib/tcpxo/lib64 \ | ||
| --env NCCL_LIB_DIR=/var/lib/tcpxo/lib64 \ | ||
| --env HOSTED_VLLM_API_BASE=http://ceres-cs-aus-447.reviz.ai2.in:8001/v1 \ | ||
| -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \ |
The command inside the container runs python open_instruct/grpo.py directly. Since the environment uses uv for dependency management, running python directly might use the system python instead of the virtual environment, potentially leading to ModuleNotFoundError. It is safer and more consistent with the other launch scripts to use uv run python or uv run.
| -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& python open_instruct/grpo.py \ | |
| -- source configs/beaker_configs/ray_node_setup.sh \&\& source configs/beaker_configs/code_api_setup.sh \&\& uv run python open_instruct/grpo.py \ |
Summary
Adds OLMo-core (FSDP) launch scripts for 32B GRPO, running `open_instruct/grpo.py` (OLMo-core `Trainer`) instead of the DeepSpeed `grpo_fast.py`:
- `scripts/train/olmo3/32b_think_rl_olmocore.sh`: faithful OLMo-core port of `32b_think_rl.sh`. Identical args except: `grpo_fast.py` → `grpo.py`; dropped `--deepspeed_stage 3 / --deepspeed_zpg 32 / --gather_whole_model False`; added `--fsdp_shard_degree 32 --fsdp_num_replicas 3` (= 96 = 8×12 learner GPUs) and `--activation_memory_budget 0.1`.
- `scripts/train/olmo3/32b_rlzero_math_olmocore.sh`: `7b_rlzero_math.sh` scaled to 32B + OLMo-core: model `Olmo-3-1025-7B` → `Olmo-3-1025-32B`; dropped DeepSpeed args; added `--fsdp_shard_degree 32 --fsdp_num_replicas 1` (= 32 = 8×4 learner GPUs) and `--activation_memory_budget 0.3`. vLLM scaled for a 32B (`--vllm_num_engines 10 --vllm_tensor_parallel_size 4`).

The DeepSpeed → FSDP conversion swaps `--deepspeed_stage`/`--deepspeed_zpg`/`--gather_whole_model` for `--fsdp_shard_degree`/`--fsdp_num_replicas`/`--activation_memory_budget`. Both topologies satisfy the `grpo_utils.py` validation (`shard_degree × num_replicas == sum(num_learners_per_node)`).

Decisions to sanity-check before launching
- `allenai/Olmo-3-1025-32B` (mirroring the 7B `allenai/Olmo-3-1025-7B`): verify it exists / is intended.
- `Dolci-RLZero-Math-7B`; kept since the change targets model + backend.

Testing
- `bash -n` passes on both scripts.
- GPU_TESTS=bypass
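The topology rule quoted in the summary (`shard_degree × num_replicas == sum(num_learners_per_node)`) can be checked by hand for both scripts. The sketch below mirrors only the rule as stated in this description; it is not the code in `grpo_utils.py`, and `check_fsdp_topology` is a hypothetical helper name.

```shell
# Hypothetical helper mirroring the rule stated in the PR description.
# Usage: check_fsdp_topology SHARD_DEGREE NUM_REPLICAS LEARNERS_PER_NODE...
check_fsdp_topology() {
    shard_degree=$1
    num_replicas=$2
    shift 2
    total_learners=0
    for n in "$@"; do
        total_learners=$(( total_learners + n ))
    done
    # Succeeds only when the FSDP mesh covers exactly the learner GPUs.
    [ $(( shard_degree * num_replicas )) -eq "$total_learners" ]
}

# 32b_think_rl_olmocore.sh: 32 x 3 = 96 = 8 x 12 learner GPUs
check_fsdp_topology 32 3 8 8 8 8 8 8 8 8 8 8 8 8 && echo "think RL topology ok"

# 32b_rlzero_math_olmocore.sh: 32 x 1 = 32 = 8 x 4 learner GPUs
check_fsdp_topology 32 1 8 8 8 8 && echo "RLZero math topology ok"
```

Running a check like this when editing `--num_learners_per_node` or the FSDP flags gives a quick signal before the job reaches the cluster queue.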
🤖 Generated with Claude Code