LLM Reasoning

Trained a Qwen2.5 1.5B model to solve grade-school math with explicit structure: a short scratchpad in <reasoning>…</reasoning> and a single final number in <answer>…</answer>. I started with pure reinforcement learning on outcome-only signals, following the widely shared willccbb GRPO demo. In practice, GRPO alone with reward tweaks didn’t work: even after several iterations on the reward, answers stayed unreliable and the XML reasoning format kept breaking. The model simply hadn’t seen enough correct structured completions to anchor the pattern.

I split the training into two stages:

Short LoRA SFT on 100 random GSM8K training examples, to teach format + roughly sensible traces, not to maximize benchmark score.
GRPO on top of that adapter for 2,000 steps.
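As a rough illustration of the data preparation for the SFT stage, the sketch below converts a GSM8K answer field into the structured target and draws a random subset. It relies on the GSM8K convention that each answer ends with a `#### <number>` line. The function names, the newline layout inside the tags, and the seed are assumptions for illustration and are not taken from sft.py.

```python
import random


def to_structured_target(gsm8k_answer: str) -> str:
    """Rewrite a GSM8K answer field into the <reasoning>/<answer> template."""
    # GSM8K answers end with a marker line of the form "#### <final number>".
    rationale, sep, final = gsm8k_answer.rpartition("####")
    if not sep:
        raise ValueError("expected a '####' final-answer marker")
    final = final.strip().replace(",", "")  # "1,000" -> "1000"
    # Assumed layout: the exact whitespace inside the tags is a guess.
    return (
        f"<reasoning>\n{rationale.strip()}\n</reasoning>\n"
        f"<answer>\n{final}\n</answer>"
    )


def sample_sft_subset(examples, k=100, seed=0):
    """Pick k random examples (k=100 follows the text; the seed is arbitrary)."""
    rng = random.Random(seed)
    pool = list(examples)
    return rng.sample(pool, k=min(k, len(pool)))
```

The goal of such a subset is only to expose the model to well-formed completions, so the rationale text is passed through unchanged rather than cleaned or rewritten.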

SFT + GRPO behaved much better than GRPO alone: the model stopped fighting the template, and the credit-assignment problem got easier.
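One way to see why the warm start helps: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. The function name and the `eps` value are assumptions, and real implementations may differ in details such as which standard-deviation estimator they use.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# If every sampled completion fails, all advantages are zero: no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# Once at least one completion earns the large reward, it gets a positive
# advantage and the failures get negative ones.
one_hit = group_advantages([3.0, 0.0, 0.0, 0.0])
```

Under this view, a model that never produces a parseable correct answer yields mostly uniform groups, which is consistent with GRPO alone making little progress here.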

Reward Design:

Small reward for non‑empty reasoning
Soft reward when <reasoning> and <answer> show up in order
Large sparse reward (+3) when the number parsed from <answer> matches the gold answer.
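The three reward components above could be sketched as follows. Only the +3 correctness reward comes from the text. The 0.25 and 0.5 bonuses, the regular expressions, and all function names are hypothetical placeholders and are not taken from grpo.py.

```python
import re

REASONING_RE = re.compile(r"<reasoning>\s*(.*?)\s*</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
ORDER_RE = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)


def reasoning_reward(completion, bonus=0.25):
    """Small reward if the reasoning block exists and is non-empty (assumed weight)."""
    m = REASONING_RE.search(completion)
    return bonus if m and m.group(1).strip() else 0.0


def format_reward(completion, bonus=0.5):
    """Soft reward if <reasoning> is followed by <answer> (assumed weight)."""
    return bonus if ORDER_RE.search(completion) else 0.0


def parse_number(text):
    """Parse a number, tolerating commas and a dollar sign; None on failure."""
    try:
        return float(text.replace(",", "").replace("$", "").strip())
    except ValueError:
        return None


def correctness_reward(completion, gold, bonus=3.0):
    """Large sparse reward (+3 per the text) when the parsed answer matches gold."""
    m = ANSWER_RE.search(completion)
    pred = parse_number(m.group(1)) if m else None
    target = parse_number(gold)
    if pred is None or target is None:
        return 0.0
    return bonus if abs(pred - target) < 1e-6 else 0.0


def total_reward(completion, gold):
    """Sum of the three components for one completion."""
    return (
        reasoning_reward(completion)
        + format_reward(completion)
        + correctness_reward(completion, gold)
    )
```

Keeping the format rewards small relative to the correctness reward means a well-formed but wrong completion still scores far below a correct one.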

Model Weights: https://huggingface.co/KickItLikeShika/Qwen2.5-1.5B-Instruct-SFT-GRPO-GSM8K
W&B Report: https://api.wandb.ai/links/egyttsteam/uja0job9

