Conversation
Just to make sure I get that right: you did not change the tokenizer parameter (as you can use it to forbid thinking), but only changed the prefix of the assistant's generated message. When making that graph, which layers did you control? What method did you use to extract the dimensions? Also, can you add detail on how you compute "action confidence"? Is it the average logprob of the generation so far, or just the logprob for the selected token?
Pretty much. I changed 3 things but I have not ablated exactly which ones were needed for which thinking models:
Some combination of those is sufficient. Now this is further complicated because reasoning models are not standardized: some have noisy trajectories ("Qwen/Qwen3-4B-Thinking-2507"), some are smoother ("zai-org/GLM-4.1V-9B-Thinking"), and some use different tokens and templates.
All of them... although the consensus in mechanistic interpretability steering research seems to be that layers from 1/3 to 2/3 through the network are the most important ones.
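For a rough picture of what steering a band of layers looks like in code, here is a minimal sketch (not the library's actual implementation; it assumes a Llama/Qwen-style `transformers` model and a precomputed steering vector `direction`, and the names are made up):

```python
# Minimal sketch: add a steering vector to the residual stream of a band of
# decoder layers via forward hooks. `direction` (hidden_dim,) and `alpha` are
# assumed to come from your own extraction/tuning step.
import torch

def add_steering_hooks(model, direction: torch.Tensor, alpha: float = 8.0,
                       frac: tuple = (1 / 3, 2 / 3)):
    layers = model.model.layers                       # Llama/Qwen-style layer list
    lo, hi = int(len(layers) * frac[0]), int(len(layers) * frac[1])

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handles = [layer.register_forward_hook(hook) for layer in layers[lo:hi]]
    return handles                                    # call h.remove() on each to undo
```

With `frac=(0, 1)` that is "all of them"; narrowing to `(1/3, 2/3)` matches the middle-layers heuristic.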
The code might be clearer than my description, but I'll explain. I'm treating this as a binary classification problem: does the model pick yes or no, how far apart are these choices, and how does steering affect that? For the calculation, it's insufficient to simply look at the raw probability of the "Yes" or "No" tokens, as the model might be saying things like "The user wants me to answer Yes or No; if I answer Yes..." where these words appear in context but don't represent the answer. To handle this, I use a simple approach in PR #72. But in the graphs I use a more complex approach, because I wanted to see how the answer evolves as the thinking model reasons. There you need to fork the KV cache and append a suffix like " the final answer is", then measure logprob(Yes) and logprob(No) after this suffix. All implementation details are available in this notebook and the surrounding library code (e.g., CoT.py).
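To make that concrete, here is a minimal sketch of the forked-cache measurement (not the notebook's exact code; the model name, the suffix string, and the " Yes"/" No" token handling are placeholder assumptions, and real code has to respect the model's chat template):

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-4B-Thinking-2507"   # placeholder; any of the models above
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def yes_no_logprobs(prefix: str, suffix: str = " the final answer is"):
    """Build the KV cache over prompt + reasoning-so-far, fork it, append the
    suffix, and read logprob(Yes) / logprob(No) for the very next token."""
    ids = tok(prefix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, use_cache=True)
        cache = copy.deepcopy(out.past_key_values)   # fork, leaving the original cache untouched
        suf = tok(suffix, add_special_tokens=False,
                  return_tensors="pt").input_ids.to(model.device)
        logits = model(input_ids=suf, past_key_values=cache,
                       use_cache=True).logits[0, -1]
    logprobs = F.log_softmax(logits.float(), dim=-1)
    # assumes " Yes" / " No" are single tokens for this tokenizer
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return logprobs[yes_id].item(), logprobs[no_id].item()
```

Calling this on progressively longer prefixes (the prompt plus more and more of the chain of thought) gives the kind of per-step trajectory shown in the graphs.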
Thank you very much for explaining all this. It's really helping me a lot!
That's what I figured. But then I wondered: if you break the model with an inadequate steering dimension and it generates repeating nonsense, the logprobs of yes and no stop meaning anything, right? That's the issue I'm hitting with my current formula, which is (for a multiple-choice question between 20, 30, 40 and 50)
Can't this be done without forking the KV cache, by generating an answer from a prompt and then restarting a generation with prompt + answer + suffix?
Yeah, totally. I also monitor the probability mass, `prob_all_choices`, which is the sum of probability over all the choices. Because if your prompt is off, your choices only get tiny probabilities, and you know something is wrong.
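Roughly, with illustrative names (and assuming each choice happens to be a single token; multi-token choices need their logprobs summed over the whole continuation):

```python
import torch
import torch.nn.functional as F

def choice_confidence(logits: torch.Tensor, tok, choices=("20", "30", "40", "50")):
    """Renormalise the next-token probability over the allowed choices and report
    how much total mass they get (the sanity check described above)."""
    probs = F.softmax(logits.float(), dim=-1)
    ids = [tok(" " + c, add_special_tokens=False).input_ids[0] for c in choices]
    p = {c: probs[i].item() for c, i in zip(choices, ids)}
    prob_all_choices = sum(p.values())     # tiny value => the prompt/format is off
    confidence = {c: v / prob_all_choices for c, v in p.items()}
    return confidence, prob_all_choices
```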
Oh, if that wasn't a typo, don't forget the log rules: you subtract logs and divide probs. So it should be `logprob(choice) - logprob(all_choices)`, or equivalently `prob(choice) / prob(all_choices)`.
Yes, it's the same, just slower since you have to recompute the previous tokens. It's simpler though, so it's a nice way to code it up as a first pass.
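That slower variant is essentially just the following (a sketch reusing `tok`, `model` and `F` from the earlier one):

```python
def yes_no_logprobs_recompute(prompt: str, answer_so_far: str,
                              suffix: str = " the final answer is"):
    """Same measurement, but re-encoding the whole prefix every time instead of
    forking a KV cache."""
    ids = tok(prompt + answer_so_far + suffix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    logprobs = F.log_softmax(logits.float(), dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return logprobs[yes_id].item(), logprobs[no_id].item()
```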
Thinking models seem to need different steering data to steer their thinking mode. It varies by model, but it is often a narrow trajectory of behaviour with a different context from normal output.
Without reasoning.json (and the token): [image]

With: [image]

(image source)