From e3bb822eaa78fde21365293285338c2d7baea041 Mon Sep 17 00:00:00 2001
From: Ilia Kulikov <kulikov@meta.com>
Date: Tue, 7 Apr 2026 07:35:17 -0700
Subject: [PATCH] Fix LaTeX rendering in section 2 of thinking mid-training
 blog

Split the long paragraph so inline math expressions with subscripts
are in separate paragraphs, preventing the Markdown parser from
pairing underscores across expressions as emphasis markers. The loss
function is now a display equation on its own line.
---
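For reference when checking the reformatted equation, here is a minimal sketch of the per-chunk term of the SFT objective. It is an illustration under stated assumptions, not the project's training code: `log_softmax`, `sft_sequence_loss`, and the `step_logits` layout (one list of logits per position, already conditioned on the preceding tokens) are hypothetical names introduced here. No mask is applied, which mirrors the statement that the loss covers both content tokens and thought tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_sequence_loss(step_logits, token_ids):
    """Negative log-likelihood summed over every token of one augmented chunk.

    step_logits[j]: hypothetical model logits for position j, given the
                    tokens before j (assumed to be supplied by the model).
    token_ids[j]:   id of the token actually found at position j.
    Every position contributes, so content and thought tokens are both scored.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one logit vector per token")
    return -sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )


# Toy chunk: 3 positions over a 4-token vocabulary with uniform logits,
# so each token costs log(4) and the chunk costs 3 * log(4).
uniform_logits = [[0.0] * 4 for _ in range(3)]
loss = sft_sequence_loss(uniform_logits, [2, 0, 3])
```

Averaging this quantity over chunks sampled from the training split would give an estimate of the expectation in the objective.
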
 projects/thinking_midtraining/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/projects/thinking_midtraining/README.md b/projects/thinking_midtraining/README.md
index 6f62edf..d6ac919 100644
--- a/projects/thinking_midtraining/README.md
+++ b/projects/thinking_midtraining/README.md
@@ -58,7 +58,11 @@ $\tilde{\mathcal{D}} = \{\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^N\}$.
 
 ### 2) Thinking SFT Mid-training
 
-We perform supervised fine-tuning (SFT) mid-training on half of the augmented corpus, which we call $\tilde{\mathcal{D}}_{\text{SFT}}$, using standard next-token prediction. Given a base model $\mathcal{M}_{\text{base}}$ parameterized by $\theta$, we optimize the following objective: $\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{\tilde{c}^i \sim \tilde{\mathcal{D}}} \left[ \sum_{j=1}^{|\tilde{c}^i|} \log P_\theta(\tilde{c}^i_j \mid \tilde{c}^i_{<j}) \right]$
+We perform supervised fine-tuning (SFT) mid-training on half of the augmented corpus, which we call $\tilde{\mathcal{D}}_{\text{SFT}}$, using standard next-token prediction.
+
+Given a base model $\mathcal{M}_{\text{base}}$ parameterized by $\theta$, we optimize the following objective:
+
+$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{\tilde{c}^i \sim \tilde{\mathcal{D}}} \left[ \sum_{j=1}^{|\tilde{c}^i|} \log P_\theta(\tilde{c}^i_j \mid \tilde{c}^i_{<j}) \right]$$
 
 where $\tilde{c}^i_j$ denotes the $j$-th token in the augmented chunk $\tilde{c}^i$, and $\tilde{c}^i_{<j}$ represents all preceding tokens. Importantly, the loss is computed over the entire augmented sequence, including both the original content tokens $x_j$ and the generated thought tokens $\tau_j$. This allows the model to learn to produce intermediate reasoning steps alongside the original content.