- 🚀 Full RL training pipeline runnable on your laptop (via Tinker API)
- 📈 AIME 2024: 43.3% → 56.7% (+13.3%)
- 🛡️ Novel redundancy penalty to prevent reward hacking
- 💰 Total cost: < $150 (including failed experiments)
TL;DR: A minimal implementation of JustRL-style reasoning model training using the Tinker platform (run it on your MacBook!).
JustRL: Simplicity is all you need. No KL penalty, no entropy regularization, no length penalty—just RL.
- Overview
- Stage 1: Cold-Start SFT
- Stage 2: JustRL (GRPO)
- Project Structure
- Quick Start
- Cost Analysis
- TODO: Process Reward Models with Universal Verifiers
- References
- License
- Citation
This repository demonstrates a two-stage training pipeline to transform a standard instruction-tuned model into a reasoning model with explicit thinking capabilities:
| Stage | Purpose |
|---|---|
| Stage 1: Cold-Start SFT | Teach the model to use <think>...</think> tokens |
| Stage 2: JustRL (GRPO) | Reinforce reasoning via verifiable rewards |
Based on the paper "JustRL: Simplicity at Scale", we adopt a minimalist approach:
- No KL penalty (`kl_coef=0`)
- No length penalty (found harmful in experiments)
- Clip-higher (`clip_ratio=1.28`)
Our implementation follows the evolution from standard GRPO to the simplified JustRL approach, with our own targeted improvements:
| Feature | Standard GRPO | JustRL | JustTinker (Ours) |
|---|---|---|---|
| Rollout | N responses per problem | Same | rollout_n=8 |
| Advantage | Group-relative rewards | Same | Same |
| Critic Model | None (Monte Carlo) | Same | Same |
| KL Penalty | Yes (kl_coef > 0) | Removed | None |
| Entropy Regularization | Optional | Removed | None |
| Clip Ratio | Symmetric (0.8, 1.2) | Asymmetric (0.8, 1.28) | clip-higher |
| Training Samples | All samples | Positive advantage only | Same |
| Length Penalty | Optional | Removed (harmful) | None |
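To make the table concrete, here is a minimal sketch of a GRPO-style objective with group-relative advantages and the asymmetric "clip-higher" bounds. It is illustrative only; the function name, tensor shapes, and per-sequence (rather than per-token) ratios are assumptions, not the repo's actual code:

```python
import torch

def grpo_clip_higher_loss(logp_new, logp_old, rewards,
                          clip_low=0.8, clip_high=1.28):
    """Illustrative GRPO objective for one group of rollout_n responses.

    logp_new / logp_old: summed log-probs of each response under the current
    policy and the rollout policy; rewards: one scalar reward per response.
    """
    # Group-relative advantage: normalize rewards within the rollout group (no critic model).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    # "Clip-higher": allow larger upward ratio moves (1.28) than downward (0.8).
    clipped = torch.clamp(ratio, clip_low, clip_high)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

JustRL additionally trains only on positive-advantage samples, which in this sketch would amount to masking out responses with `adv <= 0` before averaging.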
Our Additional Contributions (JustTinker):
| Modification | Source | Purpose |
|---|---|---|
| `format_reward_weight=0.1` | Added | Encourage `<think>` format |
| `redundancy_penalty` | Original | Prevent reward hacking via repetitive content |
The original JustRL paper uses DeepSeek-R1-Distill-Qwen-1.5B, which was distilled from DeepSeek-R1 (671B) and natively outputs <think>...</think> format. This model already "thinks out loud" by default.
In contrast, Qwen3-4B-Instruct-2507 is an internalized reasoning model:
| Model | Reasoning Style | <think> Output |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Explicit (shows reasoning) | Native |
| Qwen3-4B-Thinking-2507 | Explicit (shows reasoning) | Native |
| Qwen3-4B-Instruct-2507 | Internalized (hidden reasoning) | No |
Qwen3-4B-Instruct-2507 was trained to produce concise, direct answers without exposing the step-by-step reasoning process. The reasoning capability is "compressed" into the model weights through distillation, optimizing for fewer output tokens.
Why use it anyway? It's the most suitable small model available on the Tinker platform. To align with the JustRL setup, we perform a cold-start SFT to teach the model the <think>...</think> format before RL training.
Since Qwen3-4B-Instruct-2507 doesn't output thinking tokens by default, we need to "awaken" this capability:
Before SFT: "The answer is 42."
After SFT: "<think>Let me analyze this step by step...</think>\n\nThe answer is \boxed{42}."
This is a very short training phase (~800 steps, <$30) that teaches format compliance, not reasoning ability. The model already has strong reasoning capabilities—we're just teaching it to show its work.
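Concretely, each cold-start SFT example just pairs a problem with a response that wraps the reasoning in <think> tags and ends with a boxed answer. A minimal sketch of what such a training pair might look like (the helper name, field names, and exact prompt wording are assumptions, not the repo's preprocessing code):

```python
def build_coldstart_example(problem: str, reasoning: str, answer: str) -> dict:
    """Hypothetical helper showing the target format taught during cold-start SFT."""
    prompt = (
        f"{problem}\n\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )
    completion = (
        f"<think>\n{reasoning}\n</think>\n\n"
        f"The answer is \\boxed{{{answer}}}."
    )
    return {"prompt": prompt, "completion": completion}
```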
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Dataset | OpenR1-Math-220k (10K samples, filtered) |
| Max Sequence Length | 8,192 tokens |
| Training Steps | 800 |
| Batch Size | 16 (8 × 2 gradient accumulation) |
| Learning Rate | 2e-5 |
| Total Cost | < $30 |
The trained model is made publicly available on Tinker:
tinker://b0af3bd0-9638-583f-8c2c-2bb348453023:train:0/weights/coldstart_sft_final
You can load this checkpoint directly for Stage 2 (JustRL) training:
```bash
python scripts/tinker/justrl_math_reasoning.py \
  --checkpoint {coldstart_sft_final} \
  --reasoning \
  --scale medium
```

The cold-start SFT successfully taught the model to produce structured thinking:
| Metric | Initial | Final |
|---|---|---|
| Thinking Rate | 0% | 70% |
| Boxed Answer Rate | 36.7% | 80% |
| Training Loss | 0.86 | 0.33 |
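Thinking rate and boxed answer rate are simple format checks over sampled responses. A minimal sketch of how they could be computed (a hypothetical helper, not the repo's evaluation code):

```python
import re

def format_metrics(responses: list[str]) -> dict:
    """Fraction of responses with a closed <think> block and a \\boxed{} answer."""
    n = max(len(responses), 1)
    thinking = sum(bool(re.search(r"<think>.*?</think>", r, re.DOTALL)) for r in responses)
    boxed = sum(1 for r in responses if r"\boxed{" in r)
    return {"thinking_rate": thinking / n, "boxed_rate": boxed / n}
```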
Key Observations
- Rapid Format Learning: The model learned the `<think>` format around steps 250-300, with thinking rate jumping from 0% to ~50%
- Stable Convergence: Both thinking rate (~70-80%) and boxed rate (~80%) stabilized after step 500
- Response Length Growth: Average response length increased from ~8K to ~12K tokens as the model learned to produce detailed reasoning
- Efficient Training: 800 steps were sufficient for format learning; further training showed diminishing returns
Input: How many terms will there be if we expand (4x³ + x⁻³ + 2)²⁰¹⁶?
Output:
```
<think>
Okay, let's see. I need to figure out how many terms there will be when we
expand (4x³ + x⁻³ + 2)²⁰¹⁶ and combine like terms. Hmm, expanding such a
high exponent might be complicated, but maybe there's a pattern or formula
I can use instead of multiplying everything out.
First, let me recall...
</think>

The number of distinct terms is \boxed{12097}.
```
We conducted three experiments to explore JustRL-style training:
| Experiment | Status | Key Finding |
|---|---|---|
| Exp 001 | Failed | Training collapse due to reward hacking |
| Exp 002 | Partial | Redundancy penalty prevents collapse |
| Exp 003 | In Progress | AIME accuracy +13.3% with harder data |
Experiment ID: `failed_exp_001_training_collapse_20260111`
Status: Failed — Documented for learning purposes
During our first JustRL training run, we observed a classic reward hacking phenomenon where the model exploited the reward function in unintended ways.
```yaml
# This configuration led to training collapse
algorithm:
  clip_ratio_low: 0.8
  clip_ratio_high: 1.28
  kl_coef: 0.0            # ❌ No KL penalty - allowed unconstrained drift
training:
  learning_rate: 1e-6
  batch_size: 32
  rollout_n: 8
  max_response_length: 8192
eval:
  eval_interval: 20       # ❌ Too infrequent to catch collapse early
  eval_samples: 200
# - No format reward weight
```

The plot shows clear signs of collapse after step ~120: accuracy drops sharply while response length explodes.
| Step Range | Accuracy | Thinking Rate | Avg Response Length | Status |
|---|---|---|---|---|
| 1-50 | 67-88% | 78-92% | 2,600-4,200 | Normal |
| 51-80 | 70-90% | 80-89% | 3,000-4,400 | Normal |
| 80-120 | 60-85% | 72-88% | 3,500-5,000 | Warning signs |
| 120-135 | 55-75% | 65-75% | 4,500-5,200 | Degrading |
| 135-145 | 36-56% | 37-56% | 4,900-5,700 | Rapid collapse |
| 145-158 | 10-28% | 14-31% | 6,000-7,200 | Complete collapse |
The model discovered that generating longer responses occasionally led to correct answers by chance. Without constraints, this behavior was reinforced:
Feedback Loop:
```
Occasionally long response → correct answer → reward
        ↓
Policy reinforces "generate longer"
        ↓
Quality drops → fewer correct samples
        ↓
Remaining correct samples are mostly long → more bias
        ↓
Collapse: 35,000+ char responses, no reasoning, ~10% acc
```
We extracted typical reward hacking samples from Step 140 evaluation:
| Sample | Response Length | Has Thinking | Extracted Answer |
|---|---|---|---|
| #1 | 35,739 chars | No | (extraction failed) |
| #2 | 32,268 chars | No | (extraction failed) |
| #3 | 30,484 chars | No | (extraction failed) |
| #4 | 29,964 chars | No | ( ( ( ( ( ( ... (repetitive) |
| #5 | 29,195 chars | No | (extraction failed) |
Common patterns:
- Responses hit max_length (8192 tokens) and get truncated
- No `<think>...</think>` structure
- Repetitive text loops (e.g., "Therefore, the three sides..." repeated 100+ times)
- Self-aware "Wait" statements showing the model recognizes issues but can't stop
Full analysis: `docs/research/reward_hacking_mechanism.md`
| JustRL Paper Recommendation | Our Experience |
|---|---|
| `kl_coef=0` | Allowed unconstrained policy drift |
| No length penalty | Contributed to response explosion |
| 800+ steps training | Collapse started at step ~100 |
Key insight: JustRL's "simplicity" works for models already trained to reason (DeepSeek-R1-Distill), but may need guardrails for models learning to reason from scratch.
Rather than abandoning JustRL's simplicity, we developed targeted interventions:
- Format reward: `format_reward_weight=0.1` to incentivize `<think>` usage
- Redundancy penalty ⭐: Penalize repetitive content (our original contribution)
  - Uses compression ratio + N-gram analysis
  - Only activates when redundancy > 30%
  - Max penalty: 0.3 on correct answers
- Early stopping: Stop if eval accuracy drops 5%+ for 3 consecutive evals
- Health monitoring: Warn if response length > 5000, thinking rate < 60%, or redundancy > 40%
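A minimal sketch of how the early-stopping and health-monitoring rules above could be wired up (a hypothetical helper using the thresholds from this README, not the repo's actual monitoring code):

```python
def health_check(avg_response_len, thinking_rate, redundancy,
                 recent_eval_accs, best_acc, drop=0.05, patience=3):
    """Return (warnings, should_stop) based on the monitoring thresholds above."""
    warnings = []
    if avg_response_len > 5000:
        warnings.append("avg response length > 5000")
    if thinking_rate < 0.60:
        warnings.append("thinking rate < 60%")
    if redundancy > 0.40:
        warnings.append("redundancy > 40%")
    # Early stopping: eval accuracy down 5%+ from best for `patience` consecutive evals.
    recent = recent_eval_accs[-patience:]
    should_stop = len(recent) == patience and all(best_acc - a >= drop for a in recent)
    return warnings, should_stop
```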
What we deliberately avoided (following JustRL):
- ❌ KL penalty
- ❌ Length penalty (found harmful in JustRL experiments)
Experiment ID: `exp_002_with_redundancy_penalty`
Status: Not fully trained — Step 120/800
After implementing the mitigations above, we resumed training from Step 81 with the new configuration.
Following JustRL's minimalist philosophy, we avoid KL penalty and length penalty. Instead, we introduce an original redundancy penalty to combat reward hacking.
```yaml
algorithm:
  name: grpo
  rollout_n: 8
  clip_ratio_low: 0.8
  clip_ratio_high: 1.28        # clip-higher (JustRL style)
  kl_coef: 0.0                 # ❌ No KL penalty (JustRL style)
  entropy_coef: 0.0            # ❌ No entropy regularization
reward:
  type: binary
  correct_reward: 1.0
  incorrect_reward: 0.0
  length_penalty: false        # ❌ No length penalty (harmful per JustRL)
  format_reward_weight: 0.1    # ✅ Encourage <think> token usage
  redundancy_weight: 0.3       # ✅ Original: penalize repetitive content
  redundancy_threshold: 0.3    # ✅ Only penalize when redundancy > 30%
training:
  learning_rate: 1e-6
  max_response_length: 8192
  eval_interval: 10
```

| Technique | JustRL | Our Approach | Rationale |
|---|---|---|---|
| KL Penalty | ❌ No | ❌ No | On-policy RL has implicit regularization |
| Length Penalty | ❌ No | ❌ No | Harmful per JustRL findings |
| Format Reward | N/A | ✅ 0.1 | Encourage structured thinking |
| Redundancy Penalty | N/A | ✅ Original | Combat reward hacking without length penalty |
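Putting the reward terms from the config above together, the per-response reward could combine roughly as follows. This is a minimal sketch under assumed semantics for how the weights interact; the repo's reward code is authoritative:

```python
def total_reward(is_correct: bool, has_think: bool, redundancy: float,
                 format_weight: float = 0.1,
                 redundancy_weight: float = 0.3,
                 redundancy_threshold: float = 0.3) -> float:
    """Binary outcome reward + format bonus - redundancy penalty (illustrative only)."""
    reward = 1.0 if is_correct else 0.0           # binary correctness reward
    if has_think:
        reward += format_weight                   # encourage <think>...</think> structure
    if redundancy > redundancy_threshold:
        reward -= redundancy_weight * redundancy  # at most 0.3 for fully redundant text
    return reward
```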
Why No KL Penalty?
According to RL's Razor, on-policy RL training naturally exhibits an implicit bias that keeps the policy close to the base model:
"On-policy RL training implicitly regularizes KL divergence from the base model, even without explicit KL penalties."
This theoretical insight, combined with JustRL's empirical success with kl_coef=0, supports our decision to skip KL penalty.
Redundancy Penalty: Our Original Contribution
Instead of penalizing long responses (which can hurt legitimate reasoning), we penalize repetitive/redundant content — the true signature of reward hacking.
Method: Dual-metric fusion
- Compression ratio (60% weight): High repetition → high compression → high penalty
- N-gram repetition (40% weight): Repeated 5-grams indicate redundancy
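A minimal sketch of this dual-metric score. The weighting follows the bullets above, but the baseline calibration and normalization are assumptions for illustration, not the repo's exact implementation:

```python
import zlib
from collections import Counter

# Compression ratio assumed typical for normal prose; repetitive text compresses far below it.
BASELINE_RATIO = 0.4

def redundancy_score(text: str, n: int = 5,
                     w_compression: float = 0.6, w_ngram: float = 0.4) -> float:
    """Fuse a compression-based signal and a repeated n-gram signal into a 0-1 score."""
    raw = text.encode("utf-8")
    words = text.split()
    if not raw or len(words) < n:
        return 0.0
    # Signal 1: how much better than "normal prose" the text compresses.
    ratio = len(zlib.compress(raw)) / len(raw)
    compression_redundancy = max(0.0, (BASELINE_RATIO - ratio) / BASELINE_RATIO)
    # Signal 2: fraction of word 5-grams that occur more than once.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    ngram_redundancy = sum(c for c in counts.values() if c > 1) / len(ngrams)
    return w_compression * compression_redundancy + w_ngram * ngram_redundancy
```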
Validation on reward hacking samples:
| Sample Type | Redundancy Score | Penalty Applied |
|---|---|---|
| Normal reasoning | 0-4% | None |
| Reward hacking | 62-89% | 0.14-0.25 |
Full methodology: `docs/research/redundancy_penalty_methodology.md`
| Metric | Step 80 (Before) | Step 120 (Current) | Change |
|---|---|---|---|
| Eval Accuracy (MATH) | 83.00% | 84.00% | +1.0% |
| Best Eval Accuracy | 83.50% | 85.50% (Step 100) | +2.0% |
| Thinking Rate | ~80% | ~84% | Stable |
| Avg Response Length | ~4000 | ~3700 | Controlled |
| Avg Redundancy Score | N/A | ~38% | Within limits |
Key Observations:
- No collapse at Step 120 (unlike Exp 001 which collapsed at Step 120-140)
- Redundancy penalty keeping repetitive content in check (~38%, threshold 30%)
- Eval accuracy improved from 83% to 85.5% peak
- Response length stable, not exploding
Experiment ID: `exp_003_justrl_aligned`
Status: In Progress — Step 60/800
This experiment aligns more closely with the JustRL paper's training setup, using the DAPO-Math-17k dataset and adding AIME 2024 as an additional benchmark.
| Metric | SFT Baseline (Step 0) | Step 60 (Current) | Best | Change |
|---|---|---|---|---|
| Eval MATH Accuracy | 91.00% | 90.50% | 91.00% (Step 0) | -0.5% |
| Eval AIME Accuracy | 43.33% | 50.00% | 56.67% (Step 30/50) | +13.3% |
| MATH Thinking Rate | 99% | 98% | — | Stable |
| AIME Thinking Rate | 83% | 83% | — | Stable |
Key Observations:
- MATH accuracy remains stable around 88-91% (no degradation)
- AIME accuracy improved significantly: 43.33% → 56.67% peak (+13.3%)
- Thinking rate stable on both benchmarks
- No reward hacking observed (redundancy score ~39%, within limits)
Same reward/algorithm setup as Experiment 002, with key changes:
| Parameter | Exp 002 | Exp 003 | Note |
|---|---|---|---|
| Training Dataset | MATH train (~7.5K) | DAPO-Math-17k (11.4K) | Harder problems |
| Eval Datasets | MATH only | MATH + AIME-2024 | Added competition benchmark |
| Max Response Length | 8,192 tokens | 15,360 tokens | Aligned with JustRL paper |
| Dataset | Source | Size | Purpose |
|---|---|---|---|
| DAPO-Math-17k | `BytedTsinghua-SIA/DAPO-Math-17k` | 11,384 | Training (RL) |
| MATH-test | `HuggingFaceH4/MATH-500` | 200 (stratified) | Evaluation |
| AIME-2024 | `HuggingFaceH4/aime_2024` | 30 | Evaluation (competition-level) |
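These datasets can be pulled straight from the Hugging Face Hub. A minimal sketch (split names are assumptions; the repo's loaders in `src/data/` handle filtering and the stratified 200-sample MATH subset):

```python
from datasets import load_dataset

# Split names below are assumptions; check the Hub dataset cards if they differ.
dapo = load_dataset("BytedTsinghua-SIA/DAPO-Math-17k", split="train")  # RL training problems
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")         # source of the 200-sample eval set
aime24 = load_dataset("HuggingFaceH4/aime_2024", split="train")        # 30 competition problems
```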
| Aspect | Exp 002 | Exp 003 |
|---|---|---|
| Training Data | MATH train | DAPO-Math-17k |
| Eval Benchmarks | MATH only | MATH + AIME |
| Max Response | 8,192 tokens | 15,360 tokens |
| Training Samples | ~7,500 | ~11,384 |
The Exp 003 checkpoint (Step 50) is available on Tinker:
tinker://fbadbbce-0cfc-53dd-ad26-9117748c5070:train:0/weights/checkpoint_step_50
```
RLVR/
├── README.md            # This file
├── scripts/
│   ├── launchers/       # Shell scripts (run_coldstart_sft.sh, run_justrl_reasoning.sh)
│   ├── tinker/          # Tinker API scripts (coldstart_sft.py, justrl_math_reasoning.py)
│   └── utils/           # Utilities (plot_rlvr_training.py, plot_sft_training.py)
├── src/
│   ├── configs/         # Training configurations (SFTConfig, RLConfig)
│   ├── data/            # Dataset loading (MATH, GSM8K, DAPO-Math-17k, AIME)
│   ├── evaluation/      # Math verification (MathVerifier)
│   └── prompts/         # Prompt formatting utilities
└── resources/           # Training curves and artifacts
    ├── coldstartSFT/    # Stage 1 results
    ├── justRL_exp001/   # Exp 001 (collapsed)
    ├── justRL_exp002/   # Exp 002 (with redundancy penalty)
    └── justRL_exp003/   # Exp 003 (DAPO + AIME)
```
```bash
# Install dependencies
pip install -r requirements.txt

# Set API key
export TINKER_API_KEY=your_api_key

# Stage 1: Cold-start SFT
./scripts/launchers/run_coldstart_sft.sh small

# Exp 003: JustRL training from the cold-start SFT checkpoint
./scripts/launchers/run_justrl_reasoning.sh medium --reasoning \
  --checkpoint tinker://b0af3bd0-9638-583f-8c2c-2bb348453023:train:0/weights/coldstart_sft_final
```

Credits: All experiments were conducted using $150 in free credits gifted by Tinker.
| Stage | Steps | Estimated | Actual |
|---|---|---|---|
| Cold-Start SFT | 800 | ~$46 | < $30 |
| JustRL Exp 001 + 002 | 160 (120+40) | ~$100 | $72 |
| JustRL Exp 003 | 60 | ~$48 | $34 |
| Total Spent | — | — | ~$136 |
| Remaining Credit | — | — | ~$14 |
JustRL Cost Breakdown:
- Exp 001 (collapsed): ~120 steps before reward hacking
- Exp 002 (with redundancy penalty): 40 steps (Step 81-120)
- Average cost: ~$0.45/step
Note: Costs are based on Tinker pricing for Qwen3-4B-Instruct-2507 ($0.22/M tokens). Actual costs are often lower than estimates due to early stopping and response length variance.
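As a quick consistency check on the per-step figure, using only the totals reported in the table above:

```python
# Per-step cost derived from observed totals (numbers from the cost table above).
exp_001_002_cost, exp_001_002_steps = 72, 160   # USD, RL steps (Exp 001 + 002)
exp_003_cost, exp_003_steps = 34, 60            # USD, RL steps (Exp 003)
print(exp_001_002_cost / exp_001_002_steps)     # 0.45 -> matches the ~$0.45/step figure
print(exp_003_cost / exp_003_steps)             # ~0.57 -> higher, plausibly due to the 15,360-token max length
```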
The current pipeline uses Outcome Reward Models (ORMs) — binary correct/incorrect verification of the final answer. A natural next step is to add Process Reward Models (PRMs) that provide intermediate feedback on each reasoning step, complementing the existing ORM-based RLVR.
ORMs create a sparse credit assignment problem: when a multi-step reasoning chain fails, the model cannot determine which step went wrong. PRMs provide dense, step-level signals that:
- Improve credit assignment — errors are localized to specific reasoning steps
- Accelerate RL convergence — dense rewards are more sample-efficient than sparse binary rewards
- Enable test-time search — steps rated negatively can trigger backtracking or re-sampling
- Prevent error propagation — bad intermediate steps are caught before they compound
Rather than training a separate PRM with expensive step-level human annotations, we plan to implement implicit process rewards derived from model log-probabilities. Key approaches to explore:
| Approach | Paper | Core Idea |
|---|---|---|
| Implicit PRM | Free Process Rewards without Process Labels (Yuan et al., ICML 2025) | Parameterize process reward as log-likelihood ratio between policy and reference model on partial responses. No step-level annotations needed. Outperforms Math-Shepherd with 1/38 of the training data. |
| Online Implicit PRM | PRIME: Process Reinforcement through Implicit Rewards (Cui et al., 2025) | Integrate implicit PRM into online RL training loop. The PRM updates alongside the policy, avoiding reward hacking from stale reward models. Achieves 15.1% improvement on reasoning benchmarks from Qwen2.5-Math-7B-Base. |
| Ascending Confidence | PACR: Progressively Ascending Confidence Reward (Yoon et al., 2025) | Use the model's evolving belief in the correct answer (token probability of ground-truth) as dense reward. Inductive bias: along a good reasoning chain, P(correct answer) should monotonically increase. |
| Generative Verifiers | GenRM: Reward Modeling as Next-Token Prediction (Zhang et al., ICLR 2025) | Represent solution correctness via LLM's probability of generating "correct" vs. "incorrect" as next token. Enables chain-of-thought verification and inference-time compute scaling. |
| Universal Likelihood Rewards | Likelihood-Based Reward Designs for General LLM Reasoning (Kwiatkowski et al., 2025) | Log-probability of reference answer as reward — the only variant that works across both verifiable (math) and non-verifiable (long-form proofs) domains. |
| Meta-Evaluation Rewards | RLME: RL from Meta-Evaluation (Rentschler & Roberts, 2026) | Use evaluator LLM's probability of positive judgment on meta-questions (e.g., "Is the reasoning correct?") as reward. No ground-truth labels needed. Uses GRPO. Achieves accuracy comparable to label-based training and generalizes to open-domain settings. |
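For the Implicit PRM and PRIME rows, the core computation is just a per-step log-likelihood ratio between the policy and a frozen reference model (here, the SFT checkpoint). A minimal sketch in the spirit of Yuan et al. (2025), assuming per-token logprobs and step boundary indices are already available (names and shapes are illustrative, not an existing API):

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             step_ends: torch.Tensor,
                             beta: float = 1.0) -> torch.Tensor:
    """Per-step implicit process rewards from token logprobs (sketch).

    policy_logprobs / ref_logprobs: (seq_len,) token logprobs of one response.
    step_ends: indices of the last token of each reasoning step.
    """
    # Cumulative log-likelihood ratio acts as a Q-value estimate at every prefix.
    q = beta * torch.cumsum(policy_logprobs - ref_logprobs, dim=0)
    q_steps = q[step_ends]
    # Process reward for step k = Q(end of step k) - Q(end of step k-1).
    prev = torch.cat([q_steps.new_zeros(1), q_steps[:-1]])
    return q_steps - prev
```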
- Phase 1: Implement implicit PRM baseline following Free Process Rewards — compute log-likelihood ratios on partial response prefixes using the SFT checkpoint as reference model
- Phase 2: Integrate implicit PRM into the GRPO training loop following PRIME — combine dense process rewards with existing binary outcome rewards
- Phase 3: Experiment with ascending confidence rewards (PACR) as an alternative dense signal that requires no reference model
- Phase 4: Evaluate generative verification (GenRM) for domains beyond math where binary verifiers are unavailable
- Phase 5: Explore meta-evaluation rewards (RLME) — use an evaluator LLM's judgment probability as reward via GRPO, enabling training without ground-truth labels for open-domain reasoning tasks
Implicit PRMs via logprobs are particularly well-suited for this project because:
- No extra annotation cost — process rewards are derived from the policy's own log-probabilities, requiring zero step-level labels
- Compatible with JustRL — PRIME demonstrates that implicit PRMs work within on-policy RL (no KL penalty), aligning with our JustRL-style training
- Online updates — the implicit PRM evolves with the policy, avoiding the reward hacking issues we encountered in Exp 001
- Infrastructure-friendly — logprob computation is a standard model inference operation, no additional model training pipeline needed on Tinker
- Theoretical grounding — ReMiT (Huang et al., 2026) independently proves from KL-regularized RL theory that the token-level log-likelihood ratio between an RL model and a base model corresponds to an implicit reward function capturing reasoning quality. This provides additional theoretical justification for using logprob ratios as process rewards, beyond the empirical evidence from Implicit PRM and PRIME
Core Methodology
- JustRL: Simplicity at Scale — Core methodology
- DeepSeek-R1 Technical Report — GRPO algorithm
- DAPO: Decoupled Clip and Dynamic Sampling — Advanced techniques
- RL's Razor: On-Policy Implicit Regularization — Why KL penalty may be unnecessary
Process Reward Models & Universal Verifiers (planned future work)
- Free Process Rewards without Process Labels (Yuan et al., ICML 2025) — Implicit PRM via log-likelihood ratios
- PRIME: Process Reinforcement through Implicit Rewards (Cui et al., 2025) — Online implicit PRM in RL loop
- PACR: Progressively Ascending Confidence Reward (Yoon et al., 2025) — Ascending answer probability as dense reward
- GenRM: Reward Modeling as Next-Token Prediction (Zhang et al., ICLR 2025) — Generative verifiers
- Likelihood-Based Reward Designs for General LLM Reasoning (Kwiatkowski et al., 2025) — Universal logprob rewards
- RLME: RL from Meta-Evaluation (Rentschler & Roberts, 2026) — Meta-evaluation rewards without ground-truth labels
- Let's Verify Step by Step (Lightman et al., ICLR 2024) — Foundational PRM paper (PRM800K)
- Math-Shepherd (Wang et al., ACL 2024) — PRM without human annotations via Monte Carlo rollouts
- Tinker Platform — Training infrastructure
MIT License
If you find this repository helpful, please cite:
```bibtex
@misc{ning2026justtinker,
  author    = {Ning, Guanghan},
  title     = {JustTinker: Minimal Reinforcement Learning with Verifiable Rewards},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/Guanghan/JustTinker},
  note      = {A minimal implementation of JustRL-style reasoning model training with redundancy penalty}
}
```


