Conversation
Just to make sure I get that right: you did not change the tokenizer parameter (as you can use it to forbid thinking), but only changed the prefix of the assistant's generated message. When making that graph, which layers did you control? What method did you use to extract the dimensions? Also, can you add detail on how you compute "action confidence"? Is it the average logprob of the generation so far, or just the logprob for the selected token?
Pretty much. I changed 3 things but I have not ablated exactly which ones were needed for which thinking models:
Some combination of those is sufficient. Now this is further complicated because reasoning models are not standardized: some have noisy trajectories ("Qwen/Qwen3-4B-Thinking-2507"), some are smoother ("zai-org/GLM-4.1V-9B-Thinking"), and some use different tokens and templates.
All of them... although the consensus in mechanistic interpretability steering research seems to be that layers from 1/3 to 2/3 through the network are the most important ones.
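For a rough picture of what steering a band of layers looks like in code, here is a minimal sketch (not the library's actual implementation; it assumes a Llama/Qwen-style `transformers` model and a precomputed steering vector `direction`, and the names are made up):

```python
# Minimal sketch: add a steering vector to the residual stream of a band of
# decoder layers via forward hooks. `direction` (hidden_dim,) and `alpha` are
# assumed to come from your own extraction/tuning step.
import torch

def add_steering_hooks(model, direction: torch.Tensor, alpha: float = 8.0,
                       frac: tuple = (1 / 3, 2 / 3)):
    layers = model.model.layers                       # Llama/Qwen-style layer list
    lo, hi = int(len(layers) * frac[0]), int(len(layers) * frac[1])

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handles = [layer.register_forward_hook(hook) for layer in layers[lo:hi]]
    return handles                                    # call h.remove() on each to undo
```

With `frac=(0, 1)` that is "all of them"; narrowing to `(1/3, 2/3)` matches the middle-layers heuristic.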
The code might be clearer than my description, but I'll explain. I'm treating this as a binary classification problem: does the model pick yes or no, how far apart are these choices, and how does steering affect that? For the calculation, it's insufficient to simply look at the raw probability of the "Yes" or "No" tokens, as the model might be saying things like "The user wants me to answer Yes or No; if I answer Yes..." where these words appear in context but don't represent the answer. To handle this, I use a simple approach in PR #72. But in the graphs I use a more complex approach, because I wanted to see how the answer evolves as the thinking model reasons. There you need to fork the KV cache and append a suffix like " the final answer is", then measure logprob(Yes) and logprob(No) after this suffix. All implementation details are available in this notebook and the surrounding library code (e.g., CoT.py).
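To make that concrete, here is a minimal sketch of the forked-cache measurement (not the notebook's exact code; the model name, the suffix string, and the " Yes"/" No" token handling are placeholder assumptions, and real code has to respect the model's chat template):

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-4B-Thinking-2507"   # placeholder; any of the models above
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def yes_no_logprobs(prefix: str, suffix: str = " the final answer is"):
    """Build the KV cache over prompt + reasoning-so-far, fork it, append the
    suffix, and read logprob(Yes) / logprob(No) for the very next token."""
    ids = tok(prefix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, use_cache=True)
        cache = copy.deepcopy(out.past_key_values)   # fork, leaving the original cache untouched
        suf = tok(suffix, add_special_tokens=False,
                  return_tensors="pt").input_ids.to(model.device)
        logits = model(input_ids=suf, past_key_values=cache,
                       use_cache=True).logits[0, -1]
    logprobs = F.log_softmax(logits.float(), dim=-1)
    # assumes " Yes" / " No" are single tokens for this tokenizer
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return logprobs[yes_id].item(), logprobs[no_id].item()
```

Calling this on progressively longer prefixes (the prompt plus more and more of the chain of thought) gives the kind of per-step trajectory shown in the graphs.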
Thank you very much for explaining all this. It's really helping me a lot!
That's what I figured. But then I wondered: if you break the model with an inadequate steering dimension and it generates repeating nonsense, the logprobs of yes and no stop meaning anything, right? That's the issue I'm hitting with my current formula, which is (for a multiple-choice question between 20, 30, 40 and 50)
Can't this be done without forking the KV cache, by generating an answer from a prompt and then restarting a generation with prompt + answer + suffix?
Yeah, totally. I also monitor the probability mass, `prob_all_choices`, which is the sum of probability over all the choices. Because if your prompt is off, your choices only get tiny probabilities, and you know something is wrong.
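Roughly, with illustrative names (and assuming each choice happens to be a single token; multi-token choices need their logprobs summed over the whole continuation):

```python
import torch
import torch.nn.functional as F

def choice_confidence(logits: torch.Tensor, tok, choices=("20", "30", "40", "50")):
    """Renormalise the next-token probability over the allowed choices and report
    how much total mass they get (the sanity check described above)."""
    probs = F.softmax(logits.float(), dim=-1)
    ids = [tok(" " + c, add_special_tokens=False).input_ids[0] for c in choices]
    p = {c: probs[i].item() for c, i in zip(choices, ids)}
    prob_all_choices = sum(p.values())     # tiny value => the prompt/format is off
    confidence = {c: v / prob_all_choices for c, v in p.items()}
    return confidence, prob_all_choices
```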
Oh, if that wasn't a typo, don't forget the log rules: you subtract logs and divide probs. So it should be `logprob(choice) - logprob(all_choices)`, or equivalently `prob(choice) / prob(all_choices)`.
Yes, it's the same, just slower since you have to recompute the previous tokens. It's simpler though, so it's a nice way to code it up as a first pass.
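That slower variant is essentially just the following (a sketch reusing `tok`, `model` and `F` from the earlier one):

```python
def yes_no_logprobs_recompute(prompt: str, answer_so_far: str,
                              suffix: str = " the final answer is"):
    """Same measurement, but re-encoding the whole prefix every time instead of
    forking a KV cache."""
    ids = tok(prompt + answer_so_far + suffix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    logprobs = F.log_softmax(logits.float(), dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return logprobs[yes_id].item(), logprobs[no_id].item()
```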
Thinking models seem to need different steering data to steer their thinking mode. It varies by model, but it is often a narrow trajectory of behaviour with a different context from normal output.
Without reasoning.json (and the token): [image]

With: [image]

(image source)