On-policy self-distillation (OPSD) gives reasoning models dense, token-level supervision by aligning a model's own distribution with the distribution it produces under a privileged context (typically a verified solution). We show that this distributional gap is dominated by style tokens rather than task-bearing tokens β a pathology we call privilege-induced style drift, which destabilizes training and collapses response length.
RLCSD removes this drift by contrasting the teacherβstudent gap under a correct hint against the gap under a wrong hint produced under an identical prompt template. The shared stylistic component cancels in the subtraction, leaving a token-level signal that is more concentrated on task-bearing tokens. We then integrate this signal into GRPO as a verifier-anchored modulation of the outcome advantage, instead of a replacement for it. On Qwen3 (1.7B / 4B / 8B) and Olmo-3-7B-Think, across DeepMath (AMC23 / AIME24 / AIME25) and Knights & Knaves, RLCSD consistently outperforms GRPO and every prior OPSD baseline we tested, while keeping training dynamics stable where existing methods either explode or collapse.
For each query we run three stages:
-
Rollout sampling and partitioning. Sample G rollouts from the student and split them into a correct set π’βΊ and an incorrect set π’β» using a rule-based verifier (binary reward).
-
Contrastive token-level signal. Draw a positive hint y*c from π’βΊ and K negative hints {y*w,k} from π’β», wrap each in an identical "Reference Solution" template, and form
e_ctr,t = log Ο_T(y_t | x, y*_c, y_<t) β log (1/K) Ξ£_k Ο_T(y_t | x, y*_{w,k}, y_<t)Two refinements matter: (i) K-marginalize the negative branch to stay robust against error-type mismatch between the target rollout and the sampled negative, and (ii) exclude the target rollout from the hint pool to avoid self-conditioning over-confidence.
-
Verifier-anchored modulation & two-path loss. Convert ectr,t into a bounded modulation rt via a tanh squash, gate it with a threshold mask, and add it to AORM under a sign-preserving clamp so the verifier always decides the update direction. Aggregate as a two- path PPO-style clipped loss with independent normalization for unmodulated and modulated token sets.
See the paper for derivations of (1)β(3) and ablations on each design choice.
Each YAML config selects a method via the method: key. Implementations live in
third_party/verl/verl/trainer/ppo/core_algos.py (loss) and
src/self_distill_main.py (RLCSD/ECTR rollout-side data path).
| Key | Method | Reference |
|---|---|---|
grpo |
Group Relative Policy Optimization β verifier-only RLVR baseline. | Shao et al., 2024 β arXiv:2402.03300 |
opsd |
On-policy self-distillation with dense forward-KL distillation and per-token KL clipping. | Zhao et al., 2026 β arXiv:2601.18734 |
sdpo |
Dense distillation using JensenβShannon divergence (mode-balancing variant of OPSD). | HΓΌbotter et al., 2026 β arXiv:2601.20802 |
srpo |
Sample-level routing: GRPO on correct rollouts, SDPO-style distillation on failed ones. | Li et al., 2026 β arXiv:2604.02288 |
rlsd |
Per-token sampled-token distillation gap used to modulate AORM. | Yang et al., 2026 β arXiv:2604.03128 |
rlcsd |
This work β contrastive cancellation across symmetric positive/negative hints, then K-marginalized and integrated as a verifier-anchored AORM modulation. | this repo |
opsd_ectr |
OPSD + the contrastive construction grafted onto its dense distillation target (plug-in study, Β§4.3 of the paper). | this repo |
rlsd_ectr |
RLSD + the contrastive construction grafted onto its scalar modulation (plug-in study, Β§4.3 of the paper). | this repo |
The two _ectr variants are not new training methods on their own; they
exist to show that the contrastive principle behind RLCSD is general β see
Contrastive hints as a plug-in component
below.
src/
self_distill_main.py RLCSD / OPSD / SDPO / RLSD / SRPO trainer entry
verl_main.py Legacy non-verl trainer (kept for reference)
losses.py generalized_jsd_loss / sdpo_loss / rlsd_loss
verl_reward.py Custom reward function used by verl
opsd_format.py Prompt template + privileged-context wrapping
data_utils.py / prompts.py / models.py
configs/
math_deepmath/ {model}_{algo}.yaml (4 models Γ 6 algos + 4B-only ectr)
logic_kk/ same layout
scripts/
_run_verl.sh Launcher: reads a YAML and runs the right entry
math_deepmath/run_*.sh Per-config shims
logic_kk/run_*.sh
download_data.py Pull train/eval parquets from HuggingFace
third_party/verl/ Vendored verl with the RLCSD policy losses registered
assets/ Figures from the paper used in this README
requirements.txt
# Recommended: a fresh Python 3.10β3.12 env
pip install -r requirements.txtA few practical notes:
requirements.txtpinstorch>=2.5.0,<2.10to keep a CUDA 12 toolchain. torch 2.10+ defaults to CUDA 13 wheels which require a newer NVIDIA driver than CUDA 12.x systems ship. For an explicit CUDA match install from the PyTorch index:pip install "torch>=2.5.0,<2.10" --index-url https://download.pytorch.org/whl/cu126flash-attnbuilds against the installed torch β usepip install flash-attn --no-build-isolationif pip's build env can't find torch.third_party/verl/is added toPYTHONPATHautomatically by_run_verl.sh.
Training and eval parquets live at Leyiii/RLCSD. Pull everything in one shot:
python scripts/download_data.py --allThis writes to data/verl/<dataset>/{train,val}.parquet. The launcher resolves
paths under that root.
| Dataset | Used by |
|---|---|
deepmath_filtered_level5_7 |
Qwen3-1.7B (math train) |
deepmath_filtered_level6_8 |
Qwen3-4B (math train) |
deepmath_filtered_level7_10 |
Qwen3-8B + Olmo3-7B (math train) |
amc23+aime24+aime25 |
math eval |
kk_4to8 |
logic train (Knights & Knaves 4β8) |
kk_4to8_test+kk_9+kk_10+kk_11 |
logic eval (ID 4β8 + OOD 9β11) |
Training data subsets come from filtering DeepMath-103K (He et al., 2025) by difficulty band. The Knights & Knaves generator follows Logic-RL (Xie et al., 2025).
# RLCSD on Qwen3-4B, math reasoning
bash scripts/math_deepmath/run_qwen3_4b_rlcsd.sh
# SDPO baseline on Olmo3-7B-Think, logic puzzles
bash scripts/logic_kk/run_olmo3_7b_think_sdpo.shEach shim is a one-liner that forwards a config to scripts/_run_verl.sh. To
override individual hyperparameters, append Hydra-style overrides:
bash scripts/math_deepmath/run_qwen3_4b_rlcsd.sh learning_rate=2e-6 group_size=16Common environment overrides:
SWANLAB_API_KEYβ for swanlab logging (setuse_swanlab: truein configs)HF_ENDPOINTβ e.g.https://hf-mirror.comfor a HuggingFace mirrorCUDA_HOMEβ defaults to/usr/local/cuda-12.6
RLCSD attains the strongest average performance in every model block and wins most individual benchmarks across Qwen3 scales and Olmo-3-7B. The gains over the Base model average +4.3 (math) and +10.9 (logic) at 1.7B; +2.5 / +6.8 at 4B; +2.7 / +14.4 at 8B; and +1.8 / +9.9 on Olmo-3-7B. The advantage is especially pronounced on the OOD Knights & Knaves splits (+21.0 on 11-role at 8B; +13.0 on 11-role with Olmo-3-7B), suggesting the cleaned token-level signal improves generalization rather than just fitting the training task difficulty.
| Model | Method | AMC23 | AIME24 | AIME25 | Math Avg. | KK 4β8 | KK 9 | KK 10 | KK 11 | Logic Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B | Base | 74.1 | 48.3 | 33.3 | 51.9 | 63.2 | 53.0 | 43.0 | 31.0 | 47.6 |
| GRPO | 76.6 | 51.6 | 37.2 | 55.1 | 67.4 | 59.0 | 52.0 | 34.0 | 53.1 | |
| OPSD | 76.3 | 50.8 | 37.7 | 54.9 | 64.4 | 55.0 | 52.0 | 32.0 | 50.9 | |
| SDPO | 72.9 | 42.2 | 33.6 | 49.6 | 67.4 | 61.0 | 54.0 | 30.0 | 53.1 | |
| SRPO | 73.2 | 43.6 | 34.4 | 50.4 | 64.4 | 58.0 | 47.0 | 33.0 | 50.6 | |
| RLSD | 73.9 | 46.1 | 36.9 | 52.3 | 66.8 | 59.0 | 50.0 | 35.0 | 52.7 | |
| RLCSD (ours) | 77.2 | 53.1 | 38.3 | 56.2 (+4.3) | 70.0 | 63.0 | 63.0 | 38.0 | 58.5 (+10.9) | |
| Qwen3-4B | Base | 88.6 | 72.5 | 65.3 | 75.5 | 73.2 | 67.0 | 58.0 | 42.0 | 60.1 |
| GRPO | 89.1 | 75.8 | 66.1 | 77.0 | 75.4 | 71.0 | 61.0 | 42.0 | 62.4 | |
| OPSD | 89.4 | 74.2 | 67.5 | 77.0 | 73.4 | 71.0 | 62.0 | 42.0 | 62.1 | |
| SDPO | 88.3 | 68.3 | 64.4 | 73.7 | 74.4 | 72.0 | 62.0 | 45.0 | 63.4 | |
| SRPO | 87.9 | 71.4 | 64.7 | 74.7 | 75.4 | 71.0 | 61.0 | 45.0 | 63.0 | |
| RLSD | 86.9 | 71.2 | 66.9 | 75.0 | 76.8 | 73.0 | 62.0 | 48.0 | 65.0 | |
| RLCSD (ours) | 90.1 | 74.4 | 69.4 | 78.0 (+2.5) | 78.6 | 73.0 | 66.0 | 50.0 | 66.9 (+6.8) | |
| Qwen3-8B | Base | 88.8 | 74.2 | 66.9 | 76.6 | 72.4 | 67.0 | 55.0 | 44.0 | 59.6 |
| GRPO | 90.1 | 76.1 | 69.7 | 78.7 | 75.2 | 75.0 | 63.0 | 49.0 | 66.0 | |
| OPSD | 90.4 | 76.6 | 68.7 | 78.7 | 75.2 | 74.0 | 61.0 | 49.0 | 64.8 | |
| SDPO | 88.8 | 76.1 | 65.6 | 76.8 | 72.4 | 72.0 | 56.0 | 46.0 | 61.6 | |
| SRPO | 88.3 | 75.3 | 65.6 | 76.4 | 74.8 | 72.0 | 60.0 | 49.0 | 64.0 | |
| RLSD | 88.7 | 75.5 | 67.2 | 77.1 | 76.6 | 77.0 | 64.0 | 52.0 | 67.4 | |
| RLCSD (ours) | 90.8 | 77.5 | 69.7 | 79.3 (+2.7) | 81.8 | 79.0 | 70.0 | 65.0 | 74.0 (+14.4) | |
| Olmo-3-7B | Base | 91.2 | 73.9 | 66.9 | 77.3 | 70.6 | 64.0 | 55.0 | 35.0 | 56.2 |
| GRPO | 92.4 | 75.8 | 68.9 | 79.0 | 73.8 | 69.0 | 63.0 | 39.0 | 61.2 | |
| OPSD | 92.2 | 75.6 | 66.9 | 78.2 | 72.4 | 69.0 | 62.0 | 38.0 | 60.4 | |
| SDPO | 91.6 | 74.2 | 67.4 | 77.7 | 73.2 | 67.0 | 59.0 | 46.0 | 61.3 | |
| SRPO | 92.1 | 75.0 | 65.3 | 77.5 | 73.2 | 68.0 | 61.0 | 40.0 | 60.6 | |
| RLSD | 92.6 | 74.7 | 66.9 | 78.1 | 73.8 | 65.0 | 61.0 | 38.0 | 59.4 | |
| RLCSD (ours) | 92.7 | 76.1 | 68.6 | 79.1 (+1.8) | 75.4 | 76.0 | 65.0 | 48.0 | 66.1 (+9.9) |
Math is reported as mean@12; Knights & Knaves as pass@1. KK 4β8 is the in-distribution test set; 9 / 10 / 11 are out-of-distribution role counts.
Existing OPSD methods exhibit two characteristic failure modes:
- Entropy explosion β training collapse. OPSD, SDPO, and SRPO produce rapidly growing actor entropy that destabilizes optimization and manifests as abrupt response-length blow-up and sharp drops in reward and validation accuracy.
- Premature length shrinkage. RLSD steadily decreases response length on math, limiting late-stage performance on long-horizon problems.
RLCSD keeps both entropy and response length stable throughout training while achieving stronger final validation performance:
The contrastive construction is not specific to RLCSD. Applying it on top of two representative methods from different families β OPSD (dense distillation) and RLSD (advantage modulation) β improves every method on nearly every metric.
Concretely, for each method we vary only the source of the privileged signal and hold everything else fixed:
- Dataset CoT β privileged context is the ground-truth solution from the dataset (the original setting in each method's paper).
- Own correct rollout β privileged context is a correct rollout sampled by the model itself from the same group.
- Own rollout + contrast β the contrastive construction (correct rollout
vs. incorrect sibling rollout). This is what
*_ectrandrlcsduse.
Settings (1) and (2) perform essentially the same across all three methods β switching the source of the one-sided hint barely matters. Setting (3) is the one that wins. Ξ below is the gain of (3) over the better of (1) and (2), averaged across benchmarks (Math = AMC23 + AIME24 + AIME25 mean@12, KK = 4β8 + 9 + 10 + 11 pass@1, Qwen3-4B):
| Method | Ξ (Math avg) | Ξ (KK avg) |
|---|---|---|
| OPSD | +0.2 | +2.3 |
| RLSD | +2.2 | +1.0 |
| RLCSD | +3.0 | +5.4 |
opsd_ectr β OPSD + contrastive token-level signal. Without contrast,
OPSD's actor entropy explodes in the late stage (the failure mode above);
adding the contrastive hint keeps entropy bounded and yields a stable training
reward.
rlsd_ectr β RLSD + contrastive evidence ratio. The contrastive signal
mitigates the length collapse seen in vanilla RLSD on math (left panel below).
The right panel is an RLCSD ablation: removing the contrast from RLCSD
reproduces the same length-collapse pathology, confirming that the contrastive
construction is what keeps RLCSD's reasoning traces from shrinking.
These plug-in results are reported only on Qwen3-4B in the paper, and only
those configs ship here (configs/{math_deepmath,logic_kk}/qwen3_4b_opsd_ectr.yaml
and qwen3_4b_rlsd_ectr.yaml).
The vendored fork under third_party/verl/ is based on
verl (Sheng et al., 2024).
RLCSD-specific extensions live in:
third_party/verl/verl/trainer/ppo/core_algos.pyβ new policy losses registered via@register_policy_loss("rlcsd" | "opsd_ectr" | "rlsd_ectr")third_party/verl/verl/workers/{actor,utils}/β minor plumbing for the positive/negative teacher batches required by the RLCSD data path
If you use this code or the released RLCSD method, please cite:




