Multiturn RL experiments
$ uv sync
$ cd experiments
$ ./train.sh
- Make custom chat scheduler for terminal environment
- Create custom reward functions for various tasks (SWE-Bench, TACO, math, etc.)
- Collect data
- ToRL dataset?
- Run experiments
  1. Math
     - NuminaMath
     - AIME
     - MATH
     - etc.
  2. Coding
     - TACO
  3. SWE
     - SWE-Bench
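
For the custom reward functions item, here is a minimal sketch of what a math reward could look like. It assumes the model writes its final answer inside `\boxed{...}` and that the dataset provides the ground truth as a plain string; neither assumption is confirmed for these datasets. The names `extract_final_answer` and `math_reward` are hypothetical and not part of any existing library.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces (assumed answer format).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = _BOXED.findall(response)
    return matches[-1].strip() if matches else None


def math_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the final boxed answer equals the ground truth, else 0.0.

    Comparison is an exact string match after removing spaces; this is a
    simplification, not a symbolic equivalence check.
    """
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    same = answer.replace(" ", "") == ground_truth.replace(" ", "")
    return 1.0 if same else 0.0
```

This sketch scores answers with nested braces (for example `\boxed{\frac{1}{2}}`) as 0.0, so a real version would likely need a brace-aware parser and some form of answer normalization. Coding and SWE rewards would instead need to execute tests, which is out of scope here.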
Notes:
- Should we do regular RL training before integrating with the terminal? It seems like it will take a while before the model starts using the terminal.