perf: precompute start/message formatting token ids in StreamableParser by eddieran · Pull Request #142 · openai/harmony

eddieran · 2026-06-05T06:12:53Z

While in the ExpectStart and Header states, StreamableParser::process_next calls render_formatting_token — a full encode_with_special_tokens — for <|start|> / <|message|> on every input token, just to compare the incoming token against it:

StreamState::ExpectStart => {
    let start = self.encoding.render_formatting_token(FormattingToken::Start)?; // per-token BPE encode
    ...
}
StreamState::Header { .. } => {
    let msg_tok = self.encoding.render_formatting_token(FormattingToken::Message)?; // per-token BPE encode
    ...
}

These ids are constant for the lifetime of the parser. stop_tokens is already precomputed once in new_with_options and the Content state compares against that cached set; the start/message ids should be handled the same way.

This PR precomputes both ids once in new_with_options, stores them on the parser, and compares against the cached Rank in the two hot states. Pure caching — no behavior change.
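The caching pattern described above can be sketched in isolation. This is a minimal illustration, not the code from the diff: `MockEncoding`, `CachedParser`, the helper functions, and the placeholder token ids are all hypothetical stand-ins, and `Rank` is assumed to be a plain integer id. The mock counts how often `render_formatting_token` is called, so the two approaches can be compared directly.

```rust
use std::cell::Cell;

// Assumption: token ids are plain integers; the real `Rank` type may differ.
type Rank = u32;

// Placeholder ids for illustration only, not the real vocabulary ids.
const START_ID: Rank = 1;
const MESSAGE_ID: Rank = 2;
const OTHER_ID: Rank = 3;

#[derive(Clone, Copy)]
enum FormattingToken { Start, Message }

enum StreamState { ExpectStart, Header, Content }

/// Hypothetical stand-in for the encoding; counts render calls instead of running BPE.
struct MockEncoding { render_calls: Cell<usize> }

impl MockEncoding {
    fn new() -> Self { Self { render_calls: Cell::new(0) } }

    fn render_formatting_token(&self, tok: FormattingToken) -> Result<Rank, String> {
        self.render_calls.set(self.render_calls.get() + 1);
        Ok(match tok {
            FormattingToken::Start => START_ID,
            FormattingToken::Message => MESSAGE_ID,
        })
    }
}

/// Parser sketch that renders both ids once at construction time.
struct CachedParser { state: StreamState, start_token: Rank, message_token: Rank }

impl CachedParser {
    fn new(encoding: &MockEncoding) -> Result<Self, String> {
        Ok(Self {
            state: StreamState::ExpectStart,
            start_token: encoding.render_formatting_token(FormattingToken::Start)?,
            message_token: encoding.render_formatting_token(FormattingToken::Message)?,
        })
    }

    /// Compares the incoming token against the cached ids; no encoding work here.
    fn process(&mut self, token: Rank) -> Result<(), String> {
        match self.state {
            StreamState::ExpectStart => {
                if token != self.start_token {
                    return Err(format!("unexpected token {token}"));
                }
                self.state = StreamState::Header;
            }
            StreamState::Header => {
                if token == self.message_token {
                    self.state = StreamState::Content;
                }
            }
            StreamState::Content => {}
        }
        Ok(())
    }
}

/// A start token followed by `n` tokens that keep the parser in the header state.
fn header_stream(n: usize) -> impl Iterator<Item = Rank> {
    std::iter::once(START_ID).chain(std::iter::repeat(OTHER_ID).take(n))
}

/// Render calls made when the ids are cached at construction.
fn cached_render_calls(header_tokens: usize) -> Result<usize, String> {
    let enc = MockEncoding::new();
    let mut parser = CachedParser::new(&enc)?;
    for token in header_stream(header_tokens) {
        parser.process(token)?;
    }
    Ok(enc.render_calls.get())
}

/// Render calls made when the id is re-rendered for every incoming token.
fn uncached_render_calls(header_tokens: usize) -> Result<usize, String> {
    let enc = MockEncoding::new();
    let mut state = StreamState::ExpectStart;
    for token in header_stream(header_tokens) {
        match state {
            StreamState::ExpectStart => {
                let start = enc.render_formatting_token(FormattingToken::Start)?;
                if token != start {
                    return Err(format!("unexpected token {token}"));
                }
                state = StreamState::Header;
            }
            StreamState::Header => {
                let msg = enc.render_formatting_token(FormattingToken::Message)?;
                if token == msg {
                    state = StreamState::Content;
                }
            }
            StreamState::Content => {}
        }
    }
    Ok(enc.render_calls.get())
}
```

In this sketch the cached variant makes a fixed number of render calls regardless of stream length, while the per-token variant makes one call per token processed in the `ExpectStart` and `Header` states.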

Impact

On a long stream that stays in the header state, the per-token re-encode dominates. Benchmarking a 1,000,000-token stream (release build):

|        | time     |
|--------|----------|
| before | ~13.06 s |
| after  | ~3.07 ms |

Besides the obvious speedup, it bounds CPU on malformed / very long model output that the streaming parser is fed directly.

Testing

cargo test — all 30 tests pass unchanged (including the streamable_parser* tests).

Disclosure: this change was prepared with AI assistance; the diff and benchmark were reviewed and run locally against main.

`StreamableParser::process_next` called `render_formatting_token` — a full `encode_with_special_tokens` — for `<|start|>` and `<|message|>` on every input token while in the `ExpectStart` and `Header` states. `stop_tokens` is already precomputed once in `new_with_options`; the start/message ids are equally constant, so re-encoding them per token is wasted work. Precompute both ids once (like `stop_tokens`) and compare the incoming token against the cached `Rank`. Pure caching; no behavior change. On a 1,000,000-token stream that stays in the header state, parse time drops from ~13.06s to ~3.07ms (release build). All 30 tests pass unchanged.

eddieran wants to merge 1 commit into openai:main from eddieran:perf/streamparser-precompute-format-tokens