perf: precompute start/message formatting token ids in StreamableParser #142

Open: eddieran wants to merge 1 commit into openai:main from eddieran:perf/streamparser-precompute-format-tokens

eddieran commented Jun 5, 2026

While in the ExpectStart and Header states, StreamableParser::process_next calls render_formatting_token — a full encode_with_special_tokens — for <|start|> / <|message|> on every input token, just to compare the incoming token against it:

StreamState::ExpectStart => {
    let start = self.encoding.render_formatting_token(FormattingToken::Start)?; // per-token BPE encode
    ...
}
StreamState::Header { .. } => {
    let msg_tok = self.encoding.render_formatting_token(FormattingToken::Message)?; // per-token BPE encode
    ...
}

These ids are constant for the lifetime of the parser. stop_tokens is already precomputed once in new_with_options and the Content state compares against that cached set; the start/message ids should be handled the same way.

This PR precomputes both ids once in new_with_options, stores them on the parser, and compares against the cached Rank in the two hot states. Pure caching — no behavior change.
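The caching pattern can be sketched in isolation as below. This is a minimal illustration and not the PR's diff: `MockEncoding`, `CachedParser`, `State`, and the ids 1000/1001 are hypothetical stand-ins, and the state machine is simplified (the sketch ignores unexpected tokens rather than returning an error). The call counter only exists to show that encoding happens at construction time and never again.

```rust
use std::cell::Cell;

type Rank = u32;

#[derive(Clone, Copy)]
enum FormattingToken {
    Start,
    Message,
}

/// Hypothetical stand-in for the real encoding. It counts how often it is
/// asked to render a formatting token so the caching effect is observable.
struct MockEncoding {
    encode_calls: Cell<usize>,
}

impl MockEncoding {
    fn new() -> Self {
        Self { encode_calls: Cell::new(0) }
    }

    fn render_formatting_token(&self, tok: FormattingToken) -> Result<Rank, String> {
        self.encode_calls.set(self.encode_calls.get() + 1);
        Ok(match tok {
            FormattingToken::Start => 1000,   // made-up id
            FormattingToken::Message => 1001, // made-up id
        })
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    ExpectStart,
    Header,
    Content,
}

/// Simplified parser that resolves both ids once and keeps them as fields.
struct CachedParser {
    start_token: Rank,
    message_token: Rank,
    state: State,
}

impl CachedParser {
    fn new(encoding: &MockEncoding) -> Result<Self, String> {
        // The only two encode calls for the lifetime of the parser.
        let start_token = encoding.render_formatting_token(FormattingToken::Start)?;
        let message_token = encoding.render_formatting_token(FormattingToken::Message)?;
        Ok(Self { start_token, message_token, state: State::ExpectStart })
    }

    fn process(&mut self, token: Rank) {
        match self.state {
            // Plain integer comparisons against the cached ids.
            State::ExpectStart if token == self.start_token => self.state = State::Header,
            State::Header if token == self.message_token => self.state = State::Content,
            _ => {}
        }
    }
}
```

Feeding this parser any number of header tokens leaves `encode_calls` at 2, which is the property the PR relies on: per-token work in the two states reduces to one integer comparison.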

Impact

On a long stream that stays in the header state, the per-token re-encode dominates. Benchmarking a 1,000,000-token stream (release build):

|        | time     |
|--------|----------|
| before | ~13.06 s |
| after  | ~3.07 ms |
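A comparison of this shape can be reproduced in miniature with `std::time::Instant`. The sketch below is not the benchmark used for the numbers above: `expensive_render` is a hypothetical stand-in that does arbitrary busy work instead of a BPE encode, and the id 1001 is made up. It only shows the two access patterns being compared, recomputing per token versus computing once.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

type Rank = u32;

/// Hypothetical stand-in for an expensive encode call.
fn expensive_render() -> Rank {
    // Arbitrary busy work so that each call has a visible cost.
    let mut acc: u32 = 0;
    for i in 0..200u32 {
        acc = acc.wrapping_mul(31).wrapping_add(black_box(i));
    }
    black_box(acc);
    1001 // made-up id
}

/// "Before" pattern: the id is recomputed for every incoming token.
fn scan_recompute(tokens: &[Rank]) -> usize {
    tokens.iter().filter(|&&t| t == expensive_render()).count()
}

/// "After" pattern: the id is computed once and then only compared.
fn scan_cached(tokens: &[Rank]) -> usize {
    let message = expensive_render();
    tokens.iter().filter(|&&t| t == message).count()
}

/// Runs a closure and returns its result together with the elapsed wall time.
fn time_it<F: FnOnce() -> usize>(f: F) -> (usize, Duration) {
    let t0 = Instant::now();
    let out = f();
    (out, t0.elapsed())
}
```

Calling `time_it(|| scan_recompute(&tokens))` and `time_it(|| scan_cached(&tokens))` on the same slice should return equal counts with different durations; the absolute timings depend on the machine and build profile, so they are not comparable to the figures in the table.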

Beyond the speedup, this bounds CPU cost when the streaming parser is fed malformed or very long model output directly.

Testing

cargo test — all 30 tests pass unchanged (including the streamable_parser* tests).


Disclosure: this change was prepared with AI assistance; the diff and benchmark were reviewed and run locally against main.

`StreamableParser::process_next` called `render_formatting_token` — a full
`encode_with_special_tokens` — for `<|start|>` and `<|message|>` on every input
token while in the `ExpectStart` and `Header` states. `stop_tokens` is already
precomputed once in `new_with_options`; the start/message ids are equally
constant, so re-encoding them per token is wasted work.

Precompute both ids once (like `stop_tokens`) and compare the incoming token
against the cached `Rank`. Pure caching; no behavior change.

On a 1,000,000-token stream that stays in the header state, parse time drops
from ~13.06s to ~3.07ms (release build). All 30 tests pass unchanged.