Strip duplicate channel markers from message headers#97
Conversation
gpt-oss models occasionally emit extra <|channel|> markers in message headers during deep multi-turn conversations. Instead of erroring, the parser now strips all duplicate channel markers after extracting the first one, using the same boundary logic (stop at whitespace or '<') that is used when extracting the first channel marker. This also marks `parse_header_from_string` as `pub(crate)` to allow for direct unit testing, and adds tests of the previous and new behavior to make sure that channels are extracted properly.
|
I tried this patch and still get errors from wrong tool call parsing. There are more open pull requests regarding the harmony tool call parsing in the vllm repo. Perhaps it is better to make a combined effort towards a more robust parser? |
|
@neon12345 This change is focused on fixing one targeted failure mode I've observed with gpt-oss-20b in multi-turn scenarios with vLLM. However, you're right that it doesn't fix all potential tool call parsing scenarios. If you want to open an issue in this repository (if you're hitting Harmony errors) or in the vllm repository (if you're getting generic vllm issues with gpt-oss models), I'm happy to take a look and see if we can track down the root cause of that. So, I'm +1 to making parsing more robust. But, I also don't want larger and longer-term plans around the entire strategy and how vLLM integrates with this library to delay shipping targeted fixes that impact users today. |
We're seeing a lot of failures with gpt-oss-20b and gpt-oss-120b models
in vLLM where we get <|channel|> markers leaking into tool call names
and the model outputting reasoning after <|constrain|> tokens in very
long multi-turn conversations.
These are fairly trivial to reproduce with gpt-oss-20b using BFCL
multi_turn_base as a simple eval, and occasionally show up with
gpt-oss-120b as well. In SWE Bench Verified with gpt-oss-120b, this
shows up a lot more.
Here's an example of the kind of errors we see across 500 tests cases
with SWE Bench Verified and gpt-oss-120b:
```
HarmonyErrors (special token leaks in message parsing):
122x Unknown role: assistant<|channel|>commentary
25x Unknown role: assistant<|channel|>bash<|channel|>commentary
24x unexpected tokens remaining in message header: Some("to=functions.bash")
3x Unknown role: final
2x Unknown role: assistant<|channel|>bash<|channel|>analysis
2x unexpected tokens remaining in message header: Some("<|constrain|>commentary")
2x Unknown role: assistant<|channel|>analysis
1x unexpected tokens remaining in message header: Some("<|constrain|>control<|end|><|start|>assistant<|channel|>commentary"
1x unexpected tokens remaining in message header: Some("bash")
1x unexpected tokens remaining in message header: Some("to=functions.bash # We'll create conftest.py with needed code<|end|
1x unexpected tokens remaining in message header: Some("to=functions.bash (We attempt to")
1x unexpected tokens remaining in message header: Some("<|constrain|>Complete_TASK_AND_SUBMIT_FINAL_OUTPUT &&")
1x could not decode header
1x Unknown role: assistant<|channel|>bash<|channel|>commentary,json
1x unexpected tokens remaining in message header: Some("to=functions.bash\"I'm thinking the repository is not trusted. We'l
1x unexpected tokens remaining in message header: Some("to=functions.bash by using git checkout...<|end|><|start|>assistant
1x unexpected tokens remaining in message header: Some("to=functions.bash as<|end|><|start|>assistant<|channel|>commentary"
```
Through experimentation and closely monitoring what the models output
versus the tokens we send back in as input, I discovered two important
things.
First, for tool calls output to the commentary channel, the models
always output the recipient and channel order backwards from what
encoding.rs was doing. So, this swaps the order of the channel and
recipient when outputting tool calls to the commentary channel to match
what the model does, which presumably is a closer match to the training
data. In anecdotal testing this seems to greatly reduce if not eliminate
all the extra <|channel|> markers we were seeing in the model output.
So, this change may obviate the need for openai#97.
Second, we caught examples of the model emitting reasoning text after a
<|constrain|> token in long multi-turn conversations. A real-world
example of that we saw: `<|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>2.0? Actually we need to write content.
Use echo.<|end|><|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>json<|message|>{"content": "2.0000",
"file_name": "result.txt"}<|call|>`. To fix this, instead of ever
passing in a bare content-type (such as `json`), this PR always emits
the `<|constrain|>` special token before that content type. In long
multi-turn scenarios this appears to solve the problem of the model
incorrectly reasoning after that token. This is perhaps also partially a
side-effect of roundtripping this model's outputs via APIs that have no
provision to include a "content type" that the client can send back in,
so we loose the details of whether the model output a `json` or
`<|constrain|>json` and have to reconstruct as best we can on the
inference side.
We'll do some more testing on our side, but I wanted to go ahead and
open this PR to get some eyes on the issue and any feedback about these
fixes.
We're seeing a lot of failures with gpt-oss-20b and gpt-oss-120b models
in vLLM where we get <|channel|> markers leaking into tool call names
and the model outputting reasoning after <|constrain|> tokens in very
long multi-turn conversations.
These are fairly trivial to reproduce with gpt-oss-20b using BFCL
multi_turn_base as a simple eval, and occasionally show up with
gpt-oss-120b as well. In SWE Bench Verified with gpt-oss-120b, this
shows up a lot more.
Here's an example of the kind of errors we see across 500 tests cases
with SWE Bench Verified and gpt-oss-120b:
```
HarmonyErrors (special token leaks in message parsing):
122x Unknown role: assistant<|channel|>commentary
25x Unknown role: assistant<|channel|>bash<|channel|>commentary
24x unexpected tokens remaining in message header: Some("to=functions.bash")
3x Unknown role: final
2x Unknown role: assistant<|channel|>bash<|channel|>analysis
2x unexpected tokens remaining in message header: Some("<|constrain|>commentary")
2x Unknown role: assistant<|channel|>analysis
1x unexpected tokens remaining in message header: Some("<|constrain|>control<|end|><|start|>assistant<|channel|>commentary"
1x unexpected tokens remaining in message header: Some("bash")
1x unexpected tokens remaining in message header: Some("to=functions.bash # We'll create conftest.py with needed code<|end|
1x unexpected tokens remaining in message header: Some("to=functions.bash (We attempt to")
1x unexpected tokens remaining in message header: Some("<|constrain|>Complete_TASK_AND_SUBMIT_FINAL_OUTPUT &&")
1x could not decode header
1x Unknown role: assistant<|channel|>bash<|channel|>commentary,json
1x unexpected tokens remaining in message header: Some("to=functions.bash\"I'm thinking the repository is not trusted. We'l
1x unexpected tokens remaining in message header: Some("to=functions.bash by using git checkout...<|end|><|start|>assistant
1x unexpected tokens remaining in message header: Some("to=functions.bash as<|end|><|start|>assistant<|channel|>commentary"
```
Through experimentation and closely monitoring what the models output
versus the tokens we send back in as input, I discovered two important
things.
First, for tool calls output to the commentary channel, the models
always output the recipient and channel order backwards from what
encoding.rs was doing. So, this swaps the order of the channel and
recipient when outputting tool calls to the commentary channel to match
what the model does, which presumably is a closer match to the training
data. In anecdotal testing this seems to greatly reduce if not eliminate
all the extra <|channel|> markers we were seeing in the model output.
So, this change may obviate the need for openai#97.
Second, we caught examples of the model emitting reasoning text after a
<|constrain|> token in long multi-turn conversations. A real-world
example of that we saw: `<|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>2.0? Actually we need to write content.
Use echo.<|end|><|start|>assistant<|channel|>commentary
to=functions.echo<|constrain|>json<|message|>{"content": "2.0000",
"file_name": "result.txt"}<|call|>`. To fix this, instead of ever
passing in a bare content-type (such as `json`), this PR always emits
the `<|constrain|>` special token before that content type. In long
multi-turn scenarios this appears to solve the problem of the model
incorrectly reasoning after that token. This is perhaps also partially a
side-effect of roundtripping this model's outputs via APIs that have no
provision to include a "content type" that the client can send back in,
so we loose the details of whether the model output a `json` or
`<|constrain|>json` and have to reconstruct as best we can on the
inference side.
We'll do some more testing on our side, but I wanted to go ahead and
open this PR to get some eyes on the issue and any feedback about these
fixes.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
gpt-oss models occasionally emit extra
<|channel|>markers in message headers during deep multi-turn conversations. Instead of erroring, the parser now strips all duplicate channel markers after extracting the first one, using the same boundary logic (stop at whitespace or '<') that is used when extracting the first channel marker.This also marks
parse_header_from_stringaspub(crate)to allow for direct unit testing, and adds tests of the previous and new behavior to make sure that channels are extracted properly.I have personally observed this in multiple real-world scenarios with vLLM and gpt-oss-20b specifically. gpt-oss-120b appears far less prone to this type of failure in following its format. See vllm-project/vllm#32587 and vllm-project/vllm#31677 for vLLM community reports of this as well.
This still takes the first channel marker in the header as before. Any subsequent ones are dropped vs replacing the first we saw. This was preferred to throwing an error because in the vLLM code paths here we're handling the models generated response so anything we can clean up or fix in real-world inference scenarios without ambiguity or loss seem like easy wins to improve the reliability of the models in the wild.