KickItLikeShika/llm-reasoning

LLM Reasoning

I trained Qwen2.5 1.5B to solve grade-school math with explicit structure: a short scratchpad in <reasoning>…</reasoning> and a single final number in <answer>…</answer>. Starting from pure reinforcement learning on outcome-only signals, I first followed the widely shared willccbb GRPO demo. In practice, GRPO alone did not work: even after several iterations on the reward, answers stayed unreliable and the XML reasoning format kept breaking. The model simply hadn’t seen enough correct structured completions to anchor the pattern.
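One way to see why a sparse outcome reward can stall is to look at how GRPO turns rewards into advantages. The sketch below assumes the usual group-relative normalization (reward minus group mean, divided by group standard deviation). The function name, the `eps` value, and the use of population standard deviation are illustrative assumptions, not taken from this repo. When every sampled completion for a prompt earns the same reward, all advantages collapse to zero and that prompt contributes no gradient.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: real implementations may use sample std or skip scaling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion fails: no reward spread, so no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# One completion succeeds: it gets a positive advantage, the rest negative.
one_hit = group_advantages([0.0, 0.0, 0.0, 3.0])
```

If the base model almost never produces a correct, well-formed completion, most groups look like `all_fail`, which is consistent with the behaviour described above.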

I split training into two stages:

  1. Short LoRA SFT on 100 random GSM8K training examples, to teach format + roughly sensible traces, not to maximize benchmark score.
  2. GRPO on top of that adapter for 2,000 steps.
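Stage 1 needs each sampled GSM8K example rewritten into the target template. The sketch below is one possible way to do that, assuming the standard GSM8K `answer` layout (a rationale followed by `#### <number>`, with calculator annotations such as `<<48/2=24>>`). The function names, the seed, and the exact whitespace inside the tags are hypothetical and not taken from this repo.

```python
import random
import re


def to_xml_completion(gsm8k_answer: str) -> str:
    """Convert a GSM8K `answer` string into the <reasoning>/<answer> template."""
    rationale, _, final = gsm8k_answer.rpartition("####")
    # Drop calculator annotations like <<48/2=24>> (assumed cleanup step).
    rationale = re.sub(r"<<.*?>>", "", rationale).strip()
    # Remove thousands separators so the answer is a bare number.
    final = final.strip().replace(",", "")
    return f"<reasoning>\n{rationale}\n</reasoning>\n<answer>\n{final}\n</answer>"


def sample_sft_examples(records, k=100, seed=0):
    """Pick k random records and pair each question with a templated completion."""
    rng = random.Random(seed)
    picked = rng.sample(records, k)
    return [
        {"prompt": r["question"], "completion": to_xml_completion(r["answer"])}
        for r in picked
    ]
```

Here `records` is any list of dicts with `question` and `answer` keys, for example the GSM8K training split loaded into memory.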

SFT + GRPO behaved much better than GRPO alone: the model stopped fighting the template, and the credit-assignment problem got easier.

Reward Design:

  1. Small reward for non‑empty reasoning
  2. Soft reward when <reasoning> and <answer> show up in order
  3. Large sparse reward (+3) when the number parsed from <answer> matches the gold answer.
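The three reward terms above could be sketched as plain Python functions. Only the +3 correctness bonus comes from the description. The 0.25 and 0.5 values, the function names, and the number-cleaning rules are assumptions chosen for illustration, and the actual training code may differ.

```python
import re

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
ORDER_RE = re.compile(
    r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>", re.DOTALL
)


def reasoning_reward(text, bonus=0.25):
    """Small reward when the reasoning block exists and is non-empty."""
    m = REASONING_RE.search(text)
    return bonus if m and m.group(1).strip() else 0.0


def format_reward(text, bonus=0.5):
    """Soft reward when a reasoning block is followed by an answer block."""
    return bonus if ORDER_RE.search(text) else 0.0


def parse_number(raw):
    """Parse a number, tolerating commas and a dollar sign; None on failure."""
    try:
        return float(raw.strip().replace(",", "").replace("$", ""))
    except ValueError:
        return None


def correctness_reward(text, gold, bonus=3.0):
    """Large sparse reward when the parsed answer equals the gold number."""
    m = ANSWER_RE.search(text)
    pred = parse_number(m.group(1)) if m else None
    target = parse_number(gold)
    return bonus if pred is not None and pred == target else 0.0


def total_reward(text, gold):
    """Sum of the three terms for one completion."""
    return (
        reasoning_reward(text)
        + format_reward(text)
        + correctness_reward(text, gold)
    )
```

Keeping the shaping terms small relative to the correctness bonus means a well-formatted wrong answer still scores far below a correct one.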

About

Training Qwen2.5 1.5B to reason through SFT and GRPO
