
add reasoning data #69

Merged
vgel merged 1 commit into vgel:main from wassname:add-reasoning-data
Sep 23, 2025

Conversation

@wassname
Contributor

Thinking models seem to need different steering data to steer their thinking mode. It varies by model, but is often a narrow trajectory of behaviour with a different context than normal output.

Without reasoning.json (and the token): [image]

With reasoning.json: [image]

(image source)

@thiswillbeyourgithub
Contributor

thiswillbeyourgithub commented Sep 21, 2025

Just to make sure I get that right: you did not change the tokenizer parameter (as you can use it to forbid thinking), but only changed the prefix of the assistant's generated message.

When doing that graph, what layers did you control? What method did you use to extract the dimensions?

Also can you add detail on how you compute "action confidence"? Is it the average logprob of the generation so far? Or just the logprob for the selected token?

@wassname
Contributor Author

wassname commented Sep 21, 2025

Just to make sure I get that right: you did not change the tokenizer parameter (as you can use it to forbid thinking), but only changed the prefix of the assistant's generated message.

Pretty much.

I changed 3 things but I have not ablated exactly which ones were needed for which thinking models:

  1. Changed prompt to mention thinking: "Pretend you're a {persona}. You think step by step consistent with your identity."
  2. Added reasoning.json suffixes (this PR)
  3. Added the thinking token during dataset construction

Some combination of those is sufficient; a rough sketch of how the pieces fit together is below.
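
For concreteness, here is a minimal sketch of how those three pieces might combine into contrastive prompt pairs. The thinking token, the personas, and the assumed reasoning.json format are illustrative assumptions, not the PR's literal code:

```python
import json

# Hypothetical values: the real thinking token and personas depend on the model and its chat template.
THINK_TOKEN = "<think>"
TEMPLATE = "Pretend you're a {persona}. You think step by step consistent with your identity."

def build_pairs(reasoning_path: str = "reasoning.json"):
    """Build contrastive (positive, negative) prompts that open the model's thinking mode."""
    with open(reasoning_path) as f:
        suffixes = json.load(f)  # assumed: a flat list of reasoning-style continuation strings
    pairs = []
    for suffix in suffixes:
        positive = f"{TEMPLATE.format(persona='honest person')}\n{THINK_TOKEN}{suffix}"
        negative = f"{TEMPLATE.format(persona='dishonest person')}\n{THINK_TOKEN}{suffix}"
        pairs.append((positive, negative))
    return pairs
```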

Now this is further complicated because reasoning models are not standardized. Some have noisy trajectories ("Qwen/Qwen3-4B-Thinking-2507"), some are smoother ("zai-org/GLM-4.1V-9B-Thinking"), and some use different tokens and templates.

what layers did you control?

All of them... although the consensus in mechanistic interpretability steering research seems to be that layers from 1/3 to 2/3 through the network are the most important ones.
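
As a rough illustration of that band (not code from this repo), assuming you know the model's layer count:

```python
def middle_layer_ids(num_hidden_layers: int) -> list[int]:
    """Layer indices roughly 1/3 to 2/3 of the way through the network."""
    return list(range(num_hidden_layers // 3, 2 * num_hidden_layers // 3))

# e.g. for a 36-layer model this selects layers 12..23
layer_ids = middle_layer_ids(36)
```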

Also can you add detail on how you compute "action confidence"? Is it the average logprob of the generation so far? Or just the logprob for the selected token?

The code might be clearer than my description, but I'll explain. I'm treating this as a binary classification problem: does the model pick yes or no? How far apart are these choices? And how does steering affect that?

The exact measure is log_prob(Yes) - log_prob(No), which equals log(prob(yes)/prob(no)). This is standard in preference optimization or binary classification. This metric makes it easy to see how steering changes model preferences between yes and no, even if the steering only creates a small nudge.
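
A minimal sketch of that measure with a Hugging Face causal LM; the model/tokenizer names and the leading-space " Yes"/" No" encodings are assumptions that may need adjusting per tokenizer:

```python
import torch

@torch.no_grad()
def action_confidence(model, tokenizer, prompt: str) -> float:
    """log p(Yes) - log p(No) = log(p(Yes) / p(No)) at the next-token position."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()
```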

For calculation, it's insufficient to simply look at the raw probability of "Yes" or "No" tokens, as the model might be saying things like "The user wants me to answer Yes or No, if I answer Yes..." where these words appear in context but don't represent the answer. To handle this, I use a simple approach in PR #72.

But in the graphs I use a more complex approach, since I wanted to see how the answer evolves as the thinking model reasons. Here you fork the KV cache and append a suffix like " the final answer is", then measure logprob(Yes) and logprob(No) after this suffix.
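
A rough sketch of the forking step (not the notebook's exact code), assuming a Hugging Face causal LM whose cache object survives a deepcopy, which varies by transformers version and cache class:

```python
import copy
import torch

@torch.no_grad()
def probe_answer(model, tokenizer, past_key_values, suffix: str = " the final answer is") -> float:
    """Fork the KV cache mid-reasoning, append a suffix, and read logprob(Yes) - logprob(No)."""
    forked = copy.deepcopy(past_key_values)  # fork so the ongoing generation is untouched
    suffix_ids = tokenizer.encode(
        suffix, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)
    out = model(input_ids=suffix_ids, past_key_values=forked, use_cache=True)
    logprobs = torch.log_softmax(out.logits[0, -1], dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()
```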

All implementation details are available in this notebook and the surrounding library code (e.g., CoT.py).

@thiswillbeyourgithub
Contributor

Thank you very much for explaining all this. It's really helping me a lot!

The exact measure is log_prob(Yes) - log_prob(No), which equals log(prob(yes)/prob(no)). This is standard in preference optimization or binary classification. This metric makes it easy to see how steering changes model preferences between yes and no, even if the steering only creates a small nudge.

That's what I figured. But then I wondered: if you break down the model with an inadequate dimension and it generates repeating nonsense, the log probs of yes and no stop meaning anything, right? That's the issue I'm hitting with my current formula, which is (for a multiple-choice question between 20, 30, 40 and 50) logprob_20 / logprob_20+30+40+50

But in the graphs I use a more complex approach, since I wanted to see how the answer evolves as the thinking model reasons. Here you fork the KV cache and append a suffix like " the final answer is", then measure logprob(Yes) and logprob(No) after this suffix.

Can't this be done without forking the KV cache, by generating an answer from a prompt, then restarting a generation with prompt + answer + "The final answer is" with the parameter continue_final_message=True for just a few tokens and measuring the logprobs? I'm thinking this way would have fewer ways to break some models than KV cache forking.

@wassname
Contributor Author

wassname commented Sep 21, 2025

That's what I figured. But then I wondered: if you break down the model with an inadequate dimension and it generates repeating nonsense, the log probs of yes and no stop meaning anything, right? That's the issue I'm hitting with my current formula, which is (for a multiple-choice question between 20, 30, 40 and 50) logprob_20 / logprob_20+30+40+50

Yeah totally. I also monitor the probability mass, `prob_all_choices`, which is the sum of probability over all choices. Because if your prompt is off, your choices only have tiny probabilities, and you know something is wrong.

logprob_20 / logprob_20+30+40+50

Oh, if this wasn't a typo, don't forget the log rules: you subtract logs and divide probs. So it should be

prob_20 / (prob_20 + prob_30 + prob_40 + prob_50)

or

logprob_20 - log(prob_20 + prob_30 + prob_40 + prob_50)
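
In log space that normalization is just a logsumexp over the choices. A small sketch with made-up numbers:

```python
import math

def choice_confidence(choice_logprobs: dict[str, float], choice: str = "20") -> float:
    """logprob(choice) - log(prob_20 + prob_30 + prob_40 + prob_50)."""
    log_total = math.log(sum(math.exp(lp) for lp in choice_logprobs.values()))
    return choice_logprobs[choice] - log_total

# illustrative logprobs only
lps = {"20": -1.2, "30": -2.0, "40": -3.1, "50": -3.4}
prob_all_choices = sum(math.exp(lp) for lp in lps.values())  # sanity check: should not be tiny
print(choice_confidence(lps, "20"), prob_all_choices)
```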

Can't this be done without forking the KV cache, by generating an answer from a prompt, then restarting a generation with prompt + answer + "The final answer is" with the parameter continue_final_message=True for just a few tokens and measuring the logprobs? I'm thinking this way would have fewer ways to break some models than KV cache forking.

Yes, that gives the same result, just slower since you have to recompute the previous tokens. It's simpler though, so it's a nice way to code it up as a first pass.
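
A sketch of that simpler restart-based variant, assuming a recent transformers version where apply_chat_template accepts continue_final_message; names and token encodings are illustrative:

```python
import torch

@torch.no_grad()
def probe_by_restart(model, tokenizer, question: str, answer_so_far: str) -> float:
    """Re-run prompt + partial answer + 'The final answer is' and read logprob(Yes) - logprob(No)."""
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer_so_far + " The final answer is"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, continue_final_message=True, return_tensors="pt"
    ).to(model.device)
    logprobs = torch.log_softmax(model(input_ids).logits[0, -1], dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()
```

This recomputes the shared prefix on every probe, which is the slowdown mentioned above.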

@vgel vgel self-requested a review September 23, 2025 21:23
@vgel vgel merged commit d7020ec into vgel:main Sep 23, 2025
5 checks passed
@wassname wassname deleted the add-reasoning-data branch September 23, 2025 23:09