Skip to content

Fix encoding of channel, recipient, and constrain#114

Open
bbrowning wants to merge 1 commit into
openai:mainfrom
bbrowning:channel-constrain-multi-turn-fixes
Open

Fix encoding of channel, recipient, and constrain#114
bbrowning wants to merge 1 commit into
openai:mainfrom
bbrowning:channel-constrain-multi-turn-fixes

Conversation

@bbrowning
Copy link
Copy Markdown
Contributor

We're seeing a lot of failures with gpt-oss-20b and gpt-oss-120b models in vLLM where we get <|channel|> markers leaking into tool call names and the model outputting reasoning after <|constrain|> tokens in very long multi-turn conversations.

These are fairly trivial to reproduce with gpt-oss-20b using BFCL multi_turn_base as a simple eval, and occasionally show up with gpt-oss-120b as well. In SWE Bench Verified with gpt-oss-120b, this shows up a lot more.

Here's an example of the kind of errors we see across 500 tests cases with SWE Bench Verified and gpt-oss-120b:

  HarmonyErrors (special token leaks in message parsing):
      122x  Unknown role: assistant<|channel|>commentary
       25x  Unknown role: assistant<|channel|>bash<|channel|>commentary
       24x  unexpected tokens remaining in message header: Some("to=functions.bash")
        3x  Unknown role: final
        2x  Unknown role: assistant<|channel|>bash<|channel|>analysis
        2x  unexpected tokens remaining in message header: Some("<|constrain|>commentary")
        2x  Unknown role: assistant<|channel|>analysis
        1x  unexpected tokens remaining in message header: Some("<|constrain|>control<|end|><|start|>assistant<|channel|>commentary"
        1x  unexpected tokens remaining in message header: Some("bash")
        1x  unexpected tokens remaining in message header: Some("to=functions.bash # We'll create conftest.py with needed code<|end|
        1x  unexpected tokens remaining in message header: Some("to=functions.bash (We attempt to")
        1x  unexpected tokens remaining in message header: Some("<|constrain|>Complete_TASK_AND_SUBMIT_FINAL_OUTPUT &&")
        1x  could not decode header
        1x  Unknown role: assistant<|channel|>bash<|channel|>commentary,json
        1x  unexpected tokens remaining in message header: Some("to=functions.bash\"I'm thinking the repository is not trusted. We'l
        1x  unexpected tokens remaining in message header: Some("to=functions.bash by using git checkout...<|end|><|start|>assistant
        1x  unexpected tokens remaining in message header: Some("to=functions.bash as<|end|><|start|>assistant<|channel|>commentary"

Through experimentation and closely monitoring what the models output versus the tokens we send back in as input, I discovered two important things.

First, for tool calls output to the commentary channel, the models always output the recipient and channel order backwards from what encoding.rs was doing. So, this swaps the order of the channel and recipient when outputting tool calls to the commentary channel to match what the model does, which presumably is a closer match to the training data. In anecdotal testing this seems to greatly reduce if not eliminate all the extra <|channel|> markers we were seeing in the model output. So, this change may obviate the need for #97.

Second, we caught examples of the model emitting reasoning text after a <|constrain|> token in long multi-turn conversations. A real-world example of that we saw: <|start|>assistant<|channel|>commentary to=functions.echo<|constrain|>2.0? Actually we need to write content. Use echo.<|end|><|start|>assistant<|channel|>commentary to=functions.echo<|constrain|>json<|message|>{"content": "2.0000", "file_name": "result.txt"}<|call|>. To fix this, instead of ever passing in a bare content-type (such as json), this PR always emits the <|constrain|> special token before that content type. In long multi-turn scenarios this appears to solve the problem of the model incorrectly reasoning after that token. This is perhaps also partially a side-effect of roundtripping this model's outputs via APIs that have no provision to include a "content type" that the client can send back in, so we loose the details of whether the model output a json or <|constrain|>json and have to reconstruct as best we can on the inference side.

We'll do some more testing on our side, but I wanted to go ahead and open this PR to get some eyes on the issue and any feedback about these fixes.

We're seeing a lot of failures with gpt-oss-20b and gpt-oss-120b models
in vLLM where we get <|channel|> markers leaking into tool call names
and the model outputting reasoning after <|constrain|> tokens in very
long multi-turn conversations.

These are fairly trivial to reproduce with gpt-oss-20b using BFCL
multi_turn_base as a simple eval, and occasionally show up with
gpt-oss-120b as well. In SWE Bench Verified with gpt-oss-120b, this
shows up a lot more.

Here's an example of the kind of errors we see across 500 tests cases
with SWE Bench Verified and gpt-oss-120b:

```
  HarmonyErrors (special token leaks in message parsing):
      122x  Unknown role: assistant<|channel|>commentary
       25x  Unknown role: assistant<|channel|>bash<|channel|>commentary
       24x  unexpected tokens remaining in message header: Some("to=functions.bash")
        3x  Unknown role: final
        2x  Unknown role: assistant<|channel|>bash<|channel|>analysis
        2x  unexpected tokens remaining in message header: Some("<|constrain|>commentary")
        2x  Unknown role: assistant<|channel|>analysis
        1x  unexpected tokens remaining in message header: Some("<|constrain|>control<|end|><|start|>assistant<|channel|>commentary"
        1x  unexpected tokens remaining in message header: Some("bash")
        1x  unexpected tokens remaining in message header: Some("to=functions.bash # We'll create conftest.py with needed code<|end|
        1x  unexpected tokens remaining in message header: Some("to=functions.bash (We attempt to")
        1x  unexpected tokens remaining in message header: Some("<|constrain|>Complete_TASK_AND_SUBMIT_FINAL_OUTPUT &&")
        1x  could not decode header
        1x  Unknown role: assistant<|channel|>bash<|channel|>commentary,json
        1x  unexpected tokens remaining in message header: Some("to=functions.bash\"I'm thinking the repository is not trusted. We'l
        1x  unexpected tokens remaining in message header: Some("to=functions.bash by using git checkout...<|end|><|start|>assistant
        1x  unexpected tokens remaining in message header: Some("to=functions.bash as<|end|><|start|>assistant<|channel|>commentary"
```

Through experimentation and closely monitoring what the models output
versus the tokens we send back in as input, I discovered two important
things.

First, for tool calls output to the commentary channel, the models
always output the recipient and channel order backwards from what
encoding.rs was doing. So, this swaps the order of the channel and
recipient when outputting tool calls to the commentary channel to match
what the model does, which presumably is a closer match to the training
data. In anecdotal testing this seems to greatly reduce if not eliminate
all the extra <|channel|> markers we were seeing in the model output.
So, this change may obviate the need for openai#97.

Second, we caught examples of the model emitting reasoning text after a
<|constrain|> token in long multi-turn conversations. A real-world
example of that we saw: `<|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>2.0? Actually we need to write content.
Use echo.<|end|><|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>json<|message|>{"content": "2.0000",
"file_name": "result.txt"}<|call|>`. To fix this, instead of ever
passing in a bare content-type (such as `json`), this PR always emits
the `<|constrain|>` special token before that content type. In long
multi-turn scenarios this appears to solve the problem of the model
incorrectly reasoning after that token. This is perhaps also partially a
side-effect of roundtripping this model's outputs via APIs that have no
provision to include a "content type" that the client can send back in,
so we loose the details of whether the model output a `json` or
`<|constrain|>json` and have to reconstruct as best we can on the
inference side.

We'll do some more testing on our side, but I wanted to go ahead and
open this PR to get some eyes on the issue and any feedback about these
fixes.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant