Official codebase for SCALER (arXiv 2026): a framework for synthesizing verifiable, difficulty-controllable reasoning environments from real-world programming problems, and training LLMs with adaptive multi-environment RL to sustain informative learning signals over long horizons.
- Paper: SCALER: Synthetic sCalable Adaptive Learning Environment for Reasoning — https://arxiv.org/abs/2601.04809
- This repository is built on top of verl (Volcano Engine Reinforcement Learning for LLMs) and follows its environment/runtime conventions.
Reinforcement learning (RL) can enhance LLM reasoning, but progress often slows when:
- task difficulty drifts away from the model’s capability frontier (too easy / too hard), or
- training is dominated by a narrow set of recurring patterns, reducing distributional diversity.
SCALER addresses both via co-adaptation between the model and training environments:
- a scalable synthesis pipeline that converts real-world programming problems into verifiable environments with controllable difficulty and unbounded instance generation;
- an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to maintain informative rewards and sustain improvement.
Given a programming problem (statement + reference solution), SCALER synthesizes a reasoning environment with:
- Verifiability: deterministic oracle / unit tests provide correctness signals.
- Difficulty control: explicit scale parameters discretized into difficulty levels.
- Unbounded instance generation: randomized testcase generation yields unlimited training instances.
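As a concrete illustration of these three properties, a toy synthesized environment might look like the following. This is a hypothetical sketch (a sorting task with an illustrative class and method names), not SCALER's actual API:

```python
import random

class SortingEnv:
    """Toy synthesized environment: sort a list of integers.

    Illustrates the three properties above; the interface is
    hypothetical, not SCALER's actual API.
    """

    # Difficulty control: an explicit scale parameter (list length)
    # discretized into difficulty levels.
    LEVELS = {0: 5, 1: 20, 2: 100, 3: 1000}

    def __init__(self, level, seed=None):
        self.n = self.LEVELS[level]
        self.rng = random.Random(seed)

    def sample_instance(self):
        # Unbounded instance generation: every call yields a fresh
        # randomized test case.
        return [self.rng.randint(-10**6, 10**6) for _ in range(self.n)]

    def verify(self, instance, answer):
        # Verifiability: a deterministic oracle (here, Python's sorted)
        # provides the correctness signal.
        return answer == sorted(instance)

env = SortingEnv(level=1, seed=0)
xs = env.sample_instance()
assert env.verify(xs, sorted(xs))  # correct answer -> positive reward
assert not env.verify(xs, [])      # wrong answer  -> zero reward
```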
SCALER sustains learning signals at two levels:
- In-environment difficulty controller: keeps sampling near a target success regime.
- Environment curation: maintains an active set and replaces saturated/uninformative environments to preserve diversity and long-horizon improvements.
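A minimal sketch of the in-environment idea, assuming a feedback controller that nudges the sampled difficulty level toward a target success rate. The update rule and constants here are illustrative assumptions, not the paper's exact algorithm:

```python
class DifficultyController:
    """Keep sampled difficulty near a target success regime.

    Illustrative sketch: tracks an exponential moving average (EMA) of
    the per-level success rate and steps the level up/down when the
    model is too strong/weak. Constants are assumptions, not the paper's.
    """

    def __init__(self, n_levels, target=0.5, band=0.15, alpha=0.1):
        self.level = 0
        self.n_levels = n_levels
        self.target = target    # desired success rate
        self.band = band        # tolerance around the target
        self.alpha = alpha      # EMA step size
        self.ema = target       # running success-rate estimate

    def update(self, success):
        self.ema += self.alpha * (float(success) - self.ema)
        if self.ema > self.target + self.band and self.level < self.n_levels - 1:
            self.level += 1     # too easy -> sample harder instances
            self.ema = self.target
        elif self.ema < self.target - self.band and self.level > 0:
            self.level -= 1     # too hard -> sample easier instances
            self.ema = self.target
        return self.level

ctrl = DifficultyController(n_levels=4)
for _ in range(30):
    ctrl.update(success=True)   # a model that always succeeds...
print(ctrl.level)               # ...is pushed to the hardest level: 3
```

Environment curation operates one level above this: when even the hardest level of an environment is saturated (success rate pinned near 1), the environment stops producing informative rewards and is a candidate for replacement.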
High-level structure (major directories):
- `SCALER/` — SCALER core code (synthesis, controllers, curation, integration).
- `SCALER-data/` — environment pools / metadata / released artifacts (if any).
- `recipe/environment/` — runnable training / evaluation recipes (paper entry points).
- `verl/` — upstream training infrastructure (forked / vendored).
This repo follows verl for environment setup (CUDA / PyTorch / distributed runtime / Docker, etc.). Please refer to:
- verl documentation: https://verl.readthedocs.io/en/latest/index.html
- verl repo: https://github.com/volcengine/verl
Tip: If you already have verl working on your machine/cluster, SCALER should be a minimal delta.
SCALER's environment synthesis pipeline entry point:
`SCALER/environment_construct.sh`
Run:
```bash
cd SCALER
bash environment_construct.sh
```
Notes:
- The script is intended as the one-click entry point for the pipeline. Customize dataset paths, output directories, and parallelism in the script as needed.
- Synthesized environments and metadata are typically managed under `SCALER-data/` (see repo layout).
Paper-style training runs are organized under `recipe/`.
A concrete entry point (Qwen3-1.7B, 2739 environments):
`recipe/environment/qwen3-1.7b-2739-envs.sh`
Run:
```bash
bash recipe/environment/qwen3-1.7b-2739-envs.sh
```
Notes:
- This script is the main training entry point. It typically sets the model, environment pool, runtime (GPU / distributed), and logging.
- To change the environment pool, difficulty scheduling, or curation knobs, edit the recipe script (and/or its referenced config files).
Performance on five reasoning benchmarks: MATH-500, AMC23, AIME24, MMLU-Pro, BBEH.
Numbers below are taken from Table 1 in the paper (AVG = unweighted mean):
| Base Model | Method | MATH-500 | AMC23 | AIME24 | MMLU-Pro | BBEH | AVG |
|---|---|---|---|---|---|---|---|
| Qwen3-1.7B-base | Base | 59.6 | 29.21 | 3.33 | 33.30 | 3.26 | 25.74 |
| | + SCALER | 75.8 | 49.53 | 12.91 | 50.89 | 11.74 | 40.18 |
| Qwen3-4B-base | Base | 66.4 | 44.70 | 8.75 | 51.60 | 8.10 | 35.91 |
| | + SCALER | 84.4 | 75.00 | 27.29 | 70.00 | 14.56 | 54.25 |
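The AVG column is the unweighted mean of the five benchmark scores; e.g. for the Qwen3-4B + SCALER row:

```python
# AVG = unweighted mean over the five benchmarks (Qwen3-4B + SCALER row).
scores = [84.4, 75.00, 27.29, 70.00, 14.56]  # MATH-500, AMC23, AIME24, MMLU-Pro, BBEH
avg = sum(scores) / len(scores)
print(round(avg, 2))  # 54.25
```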
Environment pool statistics (paper v1): 4973 programming problems → 2739 synthesized SCALER environments.
If you use SCALER in your research, please cite:
```bibtex
@article{xu2026scaler,
  title   = {SCALER: Synthetic sCalable Adaptive Learning Environment for Reasoning},
  author  = {Xu, Caijun and Xiao, Changyi and Peng, Zhongyuan and Wang, Xinrun and Cao, Yixin},
  journal = {arXiv preprint arXiv:2601.04809},
  year    = {2026},
  doi     = {10.48550/arXiv.2601.04809}
}
```
- Released under Apache License 2.0 (see `LICENSE`).
- Built on top of verl and reuses its training infrastructure. Please also check upstream verl license/notice files when redistributing.
Correspondence (paper): cjxu25@m.fudan.edu.cn
