A single long-CoT trajectory entangles generation, verification, and revision into one token stream, so neither token-level SFT nor outcome-level RL can assign ability-specific credit. As a result, large reasoning models (LRMs) often keep verifying and revising even after reaching the correct answer.
SCR (Structured Reasoning) reorganizes long-CoT reasoning into stages that are explicit, evaluable, and independently trainable, realized through a Generate–Verify–Revise paradigm:
- Dynamic Termination Supervision (DTS). Stops reasoning once self-verification confirms correctness.
- Selective Loss Masking (SLM). Excludes incorrect initial answers from the imitation loss while keeping them as context.
- Progressive Two-Stage RL. Stage I jointly optimizes generation and verification; Stage II focuses on revision once the verifier is reliable.
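
The two SFT-side components above can be illustrated with a minimal sketch. It is not the repository's implementation: the `Segment` structure, its field names, and the helper functions are hypothetical, and the only outside fact used is that `-100` is the default label value ignored by PyTorch's cross-entropy loss. DTS is modeled as cutting the trajectory after the first accepting verification, and SLM as masking the labels of an incorrect initial answer while keeping its tokens in the input.

```python
from dataclasses import dataclass
from typing import List, Tuple

IGNORE_INDEX = -100  # default label value skipped by PyTorch's CrossEntropyLoss


@dataclass
class Segment:
    """One stage of a trajectory (hypothetical structure, not the repo's data format)."""
    stage: str            # "generate", "verify", or "revise"
    token_ids: List[int]
    ok: bool = True       # generate/revise: answer is correct; verify: check accepts the answer


def apply_dts(segments: List[Segment]) -> List[Segment]:
    """DTS sketch: keep everything up to and including the first accepting verification."""
    kept = []
    for seg in segments:
        kept.append(seg)
        if seg.stage == "verify" and seg.ok:
            break
    return kept


def build_labels(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """SLM sketch: incorrect initial answers stay in input_ids but get IGNORE_INDEX labels."""
    input_ids: List[int] = []
    labels: List[int] = []
    for seg in segments:
        input_ids.extend(seg.token_ids)
        masked = seg.stage == "generate" and not seg.ok
        labels.extend([IGNORE_INDEX] * len(seg.token_ids) if masked else seg.token_ids)
    return input_ids, labels


trajectory = [
    Segment("generate", [11, 12], ok=False),  # wrong first attempt (context only)
    Segment("verify", [21], ok=False),        # verification rejects it
    Segment("revise", [31, 32], ok=True),     # corrected answer
    Segment("verify", [41], ok=True),         # verification accepts: stop here
    Segment("verify", [51], ok=True),         # redundant re-check, dropped by DTS
]
input_ids, labels = build_labels(apply_dts(trajectory))
print(input_ids)  # [11, 12, 21, 31, 32, 41]
print(labels)     # [-100, -100, 21, 31, 32, 41]
```

In this toy trajectory the wrong first attempt still conditions the later tokens, but it contributes nothing to the imitation loss, and the redundant final check never enters the training example.
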
Across three backbones and multiple benchmarks, SCR improves reasoning quality and self-verification while reducing output length by up to 50%.
SCR/
├── LLaMA-Factory/ # SFT codebase (adapted)
├── EasyR1/ # RL codebase (adapted)
├── data/ # Training and evaluation data
├── infer/ # Inference / evaluation scripts
├── figs/ # Figures and paper PDF
├── run_SCR-SFT.sh # Entry script for the SFT phase
├── run_SCR-Stage1.sh # Entry script for RL Stage I (generation + verification)
├── run_SCR-Stage2.sh # Entry script for RL Stage II (revision)
├── infer.sh # Entry script for evaluation
└── rl_config.yaml # RL hyper-parameters
conda create -n SCR-SFT python=3.10.9
conda activate SCR-SFT
cd LLaMA-Factory
pip install -e .
bash run_SCR-SFT.sh

Key parameters in run_SCR-SFT.sh:
| Parameter | Description |
|---|---|
| model_path | Path to the base language model, i.e. the model that is fine-tuned during training. |
| template | Conversation/prompt template for the model. |
| dataset | Path to the training dataset. Update the EOS token to match the model type before running SFT. |
conda create -n SCR-RL python=3.10.9
conda activate SCR-RL
cd EasyR1
pip install -e .

The RL phase uses a progressive two-stage curriculum. Stage I jointly trains generation and self-verification; Stage II focuses on revision once the verifier becomes reliable.
# Stage I: jointly optimize initial generation and self-verification
bash run_SCR-Stage1.sh
# Stage II: optimize revision based on the trained verifier
bash run_SCR-Stage2.sh
# Evaluation
bash infer.sh

- ✅ Explicit reasoning stages. Each step (generation / verification / revision) is exposed and can be supervised individually.
- ✅ Trajectory-aware supervision. DTS + SLM provide targeted SFT signals that prevent imitation of false starts and over-long verification.
- ✅ Decoupled credit assignment. Two-stage GRPO assigns ability-specific credit instead of a single outcome-level reward.
- ✅ Shorter, sharper reasoning. Up to 50% reduction in output length without sacrificing accuracy.
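
To make the idea of stage-specific credit concrete, here is a rough sketch of how a stage-dependent reward could feed a GRPO-style group-relative advantage. The reward definition, the 0.5 weights, and the dictionary keys are assumptions made for illustration; the actual rewards are defined in the repository's RL code and `rl_config.yaml`. The only part taken from GRPO itself is the standardization of rewards within a group of rollouts for the same prompt (population standard deviation is used here for simplicity).

```python
from statistics import mean, pstdev
from typing import Dict, List


def stage_reward(sample: Dict[str, bool], stage: int) -> float:
    """Hypothetical reward: Stage I scores generation and verification, Stage II scores revision."""
    if stage == 1:
        # The verdict counts as right when it matches the true correctness of the initial answer.
        verdict_right = sample["verifier_accepts"] == sample["initial_correct"]
        return 0.5 * sample["initial_correct"] + 0.5 * verdict_right
    # Stage II: only the correctness of the final (possibly revised) answer is rewarded.
    return float(sample["final_correct"])


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize rewards within one prompt's group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled for the same prompt (toy data).
rollouts = [
    {"initial_correct": True,  "verifier_accepts": True,  "final_correct": True},
    {"initial_correct": False, "verifier_accepts": True,  "final_correct": False},
    {"initial_correct": False, "verifier_accepts": False, "final_correct": True},
    {"initial_correct": True,  "verifier_accepts": False, "final_correct": True},
]

stage1_rewards = [stage_reward(r, stage=1) for r in rollouts]
stage2_rewards = [stage_reward(r, stage=2) for r in rollouts]
stage1_adv = group_advantages(stage1_rewards)
stage2_adv = group_advantages(stage2_rewards)
print(stage1_rewards)  # [1.0, 0.0, 0.5, 0.5]
print(stage2_rewards)  # [1.0, 0.0, 1.0, 1.0]
```

Under these assumed rewards, the same rollout can receive different credit in the two stages: the third rollout earns only partial Stage I reward (wrong initial answer, correct rejection) but full Stage II reward because its revision succeeds.
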
This repository includes code adapted from the following open-source projects:
- LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
- EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based
We thank the authors and contributors of these projects for making their code publicly available.
