- 2026-02: Our paper "Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs" has been accepted to ICLR 2026! 🎉
- 2026-02: We have released our models on HuggingFace.
This repository provides the official implementation of "Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs".
MENTOR is a framework that enables LLMs to achieve effective and diverse exploration in reinforcement learning by providing expert guidance only at critical decision points, rather than imitating entire expert trajectories.

- Selective Expert Guidance: Injects expert signals only at critical decision points, avoiding full-trajectory imitation.
- Effective & Diverse Exploration: Balances expert guidance with autonomous exploration, preventing entropy collapse.
- Absorb Essence, Remove Redundancy: Captures essential expert strategies while discarding unnecessary patterns.
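This README does not spell out how critical decision points are identified or how expert signals are injected there. As a purely illustrative sketch, assume (hypothetically, not confirmed by this repository) that a decision point is a position where the policy's next-token distribution has high entropy. Under that assumption, the selection step might look like the following; `select_critical_positions` and `top_fraction` are made-up names, not part of MENTOR's API.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_critical_positions(
    step_probs: Sequence[Sequence[float]],
    top_fraction: float = 0.2,
) -> List[int]:
    """Return indices of the highest-entropy positions in a trajectory.

    ASSUMPTION: "critical decision point" is approximated here by high
    policy entropy. The actual MENTOR criterion may differ.
    """
    if not step_probs:
        return []
    entropies = [token_entropy(p) for p in step_probs]
    # Keep at least one position, and at most top_fraction of all positions.
    k = max(1, math.ceil(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Five positions: two are uncertain (uniform), three are confident.
    trajectory = [
        [1.0, 0.0],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.25, 0.25, 0.25, 0.25],
        [0.99, 0.01],
    ]
    print(select_critical_positions(trajectory, top_fraction=0.4))
```

In a sketch like this, only the returned positions would receive expert guidance, while every other token is left to the policy's own sampling.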
You can install MENTOR dependencies by running the following commands:
conda create -n mentor python=3.11
conda activate mentor
pip install -r requirements.txt
pip install -e .

Before starting training, we strongly recommend using SwanLab to monitor and manage experiments. You can log in with the following command:
swanlab login

We provide an example script to train MENTOR on our provided training set. You can run the following command to start training:
bash examples/train.sh

MENTOR exhibits clear differences from standard on-policy RL:
- Accuracy (acc): Validation accuracy steadily improves and surpasses the baselines.
- Entropy: Entropy collapses rapidly under vanilla RL, but MENTOR slows this collapse and sustains higher entropy, enabling broader exploration.
- Response Length: Responses first grow longer (absorbing expert-style tokens such as "verify"), then shorten as training progresses, reflecting selective retention of useful reasoning patterns.
Overall, MENTOR achieves better performance, maintains effective and diverse exploration, and converges to more efficient reasoning.
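For readers who want to track the entropy trend described above in their own runs, here is a minimal sketch of computing the mean per-token policy entropy from raw logits. It uses only the standard library and is not taken from this repository's training code; `softmax` and `mean_policy_entropy` are illustrative names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_policy_entropy(logits_per_token: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over all generated tokens.

    A value near 0 means the policy is almost deterministic (collapsed);
    higher values mean it still spreads probability over alternatives.
    """
    if not logits_per_token:
        raise ValueError("need at least one token position")
    total = 0.0
    for logits in logits_per_token:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(logits_per_token)


if __name__ == "__main__":
    exploratory = [[0.0, 0.0, 0.0, 0.0]] * 3    # uniform over 4 tokens
    collapsed = [[100.0, 0.0, 0.0, 0.0]] * 3    # nearly one-hot
    print("exploratory:", mean_policy_entropy(exploratory))
    print("collapsed:  ", mean_policy_entropy(collapsed))
```

Logging this quantity at each validation step gives a curve that can be compared between a vanilla on-policy run and a MENTOR run.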
MENTOR vs. other baselines. Compared with on-policy RL, MENTOR achieves average performance improvements of 3.2%, 4.3%, and 3.9% on the three models, respectively.
If you find our model useful, please kindly cite our paper:
@article{jiang2025selective,
title={Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs},
author={Jiang, Zishang and Han, Jinyi and Li, Tingyun and Wang, Xinyi and Jiang, Sihang and Liang, Jiaqing and Dai, Zhaoqian and Ma, Shuguang and Yu, Fei and Xiao, Yanghua},
journal={arXiv preprint arXiv:2510.04140},
year={2025}
}
