From 8dc5e7c6a4b754774a7c34efda019bf0f385b658 Mon Sep 17 00:00:00 2001
From: jaseweston
Date: Tue, 7 Apr 2026 10:23:53 -0400
Subject: [PATCH] Update README.md

---
 projects/thinking_midtraining/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/projects/thinking_midtraining/README.md b/projects/thinking_midtraining/README.md
index 6f62edf..7c801e3 100644
--- a/projects/thinking_midtraining/README.md
+++ b/projects/thinking_midtraining/README.md
@@ -18,7 +18,7 @@ We address a fundamental limitation of the current LLM training paradigm: *the a
 We thus introduce **thinking mid-training**, an intermediate SFT+RL training phase that bridges the gap between pretraining on raw text and post-training for instruction-following and reasoning.
 
-Our experiments demonstrate that thinking mid-training substantially improves post-training effectiveness: our full pipeline achieves a 3.2x improvement in average accuracy across challenging reasoning benchmarks (GSM8K, MATH-500, AMC23, Olympiad, GPQA-Diamond) compared to direct RL post-training on the base model (Llama-3-8B), and more than doubled performance compared to the existing practice of mid-training with raw data.
+Our experiments demonstrate that thinking mid-training substantially improves post-training effectiveness: our full pipeline achieves a 3.2x improvement in average accuracy across challenging reasoning benchmarks (GSM8K, MATH-500, AMC23, Olympiad, GPQA-Diamond) compared to direct RL post-training on the base model (Llama-3-8B), and more than doubled performance compared to mid-training with raw data.
 
 These results suggest that introducing reasoning earlier in the training pipeline yields models that are not only initially better at reasoning, but also better prepared for reasoning-intensive post-training.
 
@@ -98,7 +98,7 @@ $$
 
 where $\theta$ are the parameters of the model. We optimize this using DrGRPO.
-## Main Experimental Results
+## Experimental Results
 
 ### Mid-training Performance
 First, we evaluate whether the proposed approach improves reasoning capabilities without further finetuning on downstream tasks. We find that thinking mid-training significantly improves over both the base model and the existing practice of mid-training (SFT raw). Specifically, simply training on 10B tokens of raw data doubles the average performance, although further scaling up the data size yields a slower increase in overall performance. However, SFT on context-augmented data drastically improves average performance from 0.0264 to 0.1249. RL mid-training (RLMT) brings the largest improvement, reaching 0.1896 (**9x**) despite using much less data. Numbers next to the training method (10k, 7k, 5k) indicate the number of training steps.
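The patched README says the RL mid-training objective is optimized with DrGRPO. As a rough illustration only (not part of this patch), the sketch below shows a group-relative advantage computation in the Dr. GRPO style: rewards are mean-centered within each group of rollouts for the same prompt, while the per-group standard-deviation normalization used in vanilla GRPO is omitted. The function name and the list-of-lists interface are hypothetical.

```python
def drgrpo_advantages(rewards_per_group):
    """Dr. GRPO-style advantages: subtract the group-mean reward.

    Unlike vanilla GRPO, there is no division by the per-group
    reward standard deviation. Each inner list holds the scalar
    rewards of the rollouts sampled for one prompt.
    """
    advantages = []
    for rewards in rewards_per_group:
        mean = sum(rewards) / len(rewards)
        advantages.append([r - mean for r in rewards])
    return advantages
```

By construction, advantages within each group sum to zero, so rollouts are reinforced only relative to their siblings for the same prompt.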