Merged
4 changes: 2 additions & 2 deletions projects/thinking_midtraining/README.md
@@ -18,7 +18,7 @@ We address a fundamental limitation of the current LLM training paradigm: *the a
We thus introduce **thinking mid-training**, an intermediate SFT+RL training phase that bridges the gap between pretraining on raw text and post-training for instruction-following and reasoning.


Our experiments demonstrate that thinking mid-training substantially improves post-training effectiveness: our full pipeline achieves a 3.2x improvement in average accuracy across challenging reasoning benchmarks (GSM8K, MATH-500, AMC23, Olympiad, GPQA-Diamond) compared to direct RL post-training on the base model (Llama-3-8B), and more than doubles performance compared to the existing practice of mid-training with raw data.
Our experiments demonstrate that thinking mid-training substantially improves post-training effectiveness: our full pipeline achieves a 3.2x improvement in average accuracy across challenging reasoning benchmarks (GSM8K, MATH-500, AMC23, Olympiad, GPQA-Diamond) compared to direct RL post-training on the base model (Llama-3-8B), and more than doubles performance compared to mid-training with raw data.

These results suggest that introducing reasoning earlier in the training pipeline yields models that are not only initially better at reasoning, but also better prepared for reasoning-intensive post-training.

@@ -98,7 +98,7 @@ $$

where $\theta$ are the parameters of the model. We optimize this using DrGRPO.
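The group-relative advantage used by DrGRPO can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a verifier-style scalar reward per sampled response, and the function names are illustrative. DrGRPO's key departures from standard GRPO are shown: advantages use only the group-mean baseline (no division by the group's reward standard deviation), and the loss sums token log-probabilities without per-response length normalization.

```python
import numpy as np

def drgrpo_advantages(rewards):
    """Group-relative advantages: subtract the group-mean reward.

    Unlike standard GRPO, DrGRPO does not divide by the group's
    reward standard deviation.
    """
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def drgrpo_loss(logprobs_per_token, advantages):
    """Policy-gradient surrogate for one group of sampled responses.

    logprobs_per_token: list of per-response lists of token log-probs.
    Token log-probs are summed per response WITHOUT dividing by the
    response length (DrGRPO removes that normalization), weighted by
    the group-relative advantage, then averaged over the group.
    """
    total = 0.0
    for token_logprobs, adv in zip(logprobs_per_token, advantages):
        total += adv * sum(token_logprobs)
    return -total / len(advantages)
```

For example, with binary correctness rewards `[1, 0, 0, 1]` the advantages are `[0.5, -0.5, -0.5, 0.5]`, so correct responses are reinforced and incorrect ones suppressed regardless of their length.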

## Main Experimental Results
## Experimental Results

### Mid-training Performance
First, we evaluate whether the proposed approach improves reasoning capabilities without further finetuning on downstream tasks. We find that thinking mid-training significantly improves over both the base model and the existing practice of mid-training (SFT raw). Specifically, simply training on 10B tokens of raw data doubles the average performance, although further scaling up the data size yields a slower increase in overall performance. However, SFT on context-augmented data drastically improves average performance from 0.0264 to 0.1249, and RL mid-training (RLMT) brings the largest improvement, to 0.1896 (**9x**), despite using much less data. Numbers next to the training method (10k, 7k, 5k) indicate the number of training steps.