Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

📢 Latest News

2026-02: Our paper "Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs" has been accepted to ICLR 2026! 🎉
2026-02: We have released our models on HuggingFace.

Introduction

This repository provides the official implementation of "Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs".

MENTOR is a framework that enables LLMs to achieve effective and diverse exploration in reinforcement learning by providing expert guidance only at critical decision points, rather than imitating entire expert trajectories.

Key Highlights

Selective Expert Guidance: Injects expert signals only at critical decision points, avoiding full-trajectory imitation.
Effective & Diverse Exploration: Balances expert guidance with autonomous exploration, preventing entropy collapse.
Absorb Essence, Remove Redundancy: Captures essential expert strategies while discarding unnecessary patterns.
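
To make the idea of selective guidance concrete, here is a toy sketch, not the repository's implementation. It assumes that a "critical decision point" is a position where the policy's next-token distribution has high entropy, and that guidance means blending the expert distribution into the policy distribution at those positions only. The selection criterion, the function names, `threshold`, and `weight` are all hypothetical and chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_guided_positions(policy_dists, threshold):
    """Indices where the policy is uncertain (entropy above threshold).

    Assumption: high entropy marks a critical decision point.
    """
    return [i for i, d in enumerate(policy_dists) if entropy(d) > threshold]


def mix_with_expert(policy_dists, expert_dists, threshold, weight):
    """Blend in the expert distribution only at the selected positions."""
    guided = set(select_guided_positions(policy_dists, threshold))
    mixed = []
    for i, (p, q) in enumerate(zip(policy_dists, expert_dists)):
        if i in guided:
            mixed.append([(1 - weight) * pi + weight * qi for pi, qi in zip(p, q)])
        else:
            mixed.append(list(p))  # the policy is left untouched here
    return mixed


# Toy 3-token vocabulary, 3 positions (made-up numbers).
policy = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25], [0.90, 0.05, 0.05]]
expert = [[0.10, 0.80, 0.10], [0.05, 0.90, 0.05], [0.20, 0.20, 0.60]]

guided_positions = select_guided_positions(policy, threshold=0.8)
mixed = mix_with_expert(policy, expert, threshold=0.8, weight=0.5)
print(guided_positions, mixed)
```

In this toy input only the middle position is uncertain enough to receive guidance; the other two positions keep the policy's own distribution, which is what distinguishes selective guidance from imitating the full expert trajectory.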

🚀 Quick Start

Installation

You can install MENTOR dependencies by running the following commands:

conda create -n mentor python=3.11
conda activate mentor
pip install -r requirements.txt
pip install -e .

Start Training

Before starting training, we strongly recommend using SwanLab to monitor and manage experiments. You can log in with the following command:

  swanlab login

We provide an example script that trains MENTOR on the training set included in this repository. Run the following command to start training:

  bash examples/train.sh

📈 Training Dynamics

MENTOR exhibits clear differences from standard on-policy RL:

Accuracy (acc): Validation accuracy steadily improves and surpasses the baselines.
Entropy: Entropy collapses rapidly under vanilla RL, but MENTOR slows this collapse and sustains higher entropy, enabling broader exploration.
Response Length: Responses first grow longer (absorbing expert-style tokens such as "verify"), then shorten as training progresses, reflecting selective retention of useful reasoning patterns.

Overall, MENTOR achieves better performance, maintains effective and diverse exploration, and converges to more efficient reasoning.
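
The entropy curve discussed above is typically the mean per-token entropy of the policy's next-token distributions. The following is a minimal sketch of that quantity computed from raw logits, assuming nothing about how this repository logs it; the function names and the toy logits are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_from_logits(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(logits_per_position):
    """Average entropy over all positions of one response."""
    values = [entropy_from_logits(l) for l in logits_per_position]
    return sum(values) / len(values)


# Made-up logits: a near-deterministic policy vs. a more exploratory one.
collapsed = [[10.0, 0.0, 0.0], [12.0, 1.0, 0.0]]
exploratory = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

print(mean_token_entropy(collapsed), mean_token_entropy(exploratory))
```

A value near zero means the policy almost always picks the same token (collapse), while the maximum for a vocabulary of size V is ln V, reached by a uniform distribution.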

📃 Evaluation

MENTOR vs. other baselines: compared to on-policy RL, MENTOR achieves average performance improvements of 3.2%, 4.3%, and 3.9% on the three models, respectively.

Citation

If you find our model useful, please kindly cite our paper:

@article{jiang2025selective,
  title={Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs},
  author={Jiang, Zishang and Han, Jinyi and Li, Tingyun and Wang, Xinyi and Jiang, Sihang and Liang, Jiaqing and Dai, Zhaoqian and Ma, Shuguang and Yu, Fei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2510.04140},
  year={2025}
}
