perf: precompute start/message formatting token ids in StreamableParser #142
Open
eddieran wants to merge 1 commit into
`StreamableParser::process_next` called `render_formatting_token` — a full `encode_with_special_tokens` — for `<|start|>` and `<|message|>` on every input token while in the `ExpectStart` and `Header` states. `stop_tokens` is already precomputed once in `new_with_options`; the start/message ids are equally constant, so re-encoding them per token is wasted work. Precompute both ids once (like `stop_tokens`) and compare the incoming token against the cached `Rank`. Pure caching; no behavior change. On a 1,000,000-token stream that stays in the header state, parse time drops from ~13.06s to ~3.07ms (release build). All 30 tests pass unchanged.
While in the `ExpectStart` and `Header` states, `StreamableParser::process_next` calls `render_formatting_token` (a full `encode_with_special_tokens`) for `<|start|>` / `<|message|>` on every input token, just to compare the incoming token against it.

These ids are constant for the lifetime of the parser. `stop_tokens` is already precomputed once in `new_with_options` and the `Content` state compares against that cached set; the start/message ids should be handled the same way.

This PR precomputes both ids once in `new_with_options`, stores them on the parser, and compares against the cached `Rank` in the two hot states. Pure caching; no behavior change.

Impact
On a long stream that stays in the header state, the per-token re-encode dominates. Benchmarking a 1,000,000-token stream (release build), parse time drops from ~13.06s to ~3.07ms.
Besides the obvious speedup, it bounds CPU on malformed / very long model output that the streaming parser is fed directly.
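
For readers who want to see the shape of the change without opening the diff, here is a minimal, self-contained sketch of the caching pattern. It is not the code from this PR: `ToyEncoding`, `CachedParser`, `matches_uncached`, `encode_calls_for`, and the token values are hypothetical stand-ins, and only the names `Rank` and `render_formatting_token` are borrowed from the description. The toy encoder counts how often it is asked to encode, so the difference shows up as a call count rather than a timing.

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Hypothetical alias; the real parser has its own `Rank` type.
type Rank = u32;

/// Toy stand-in for the real encoding. It only knows two special tokens
/// and counts how many times an encode was requested.
struct ToyEncoding {
    special: HashMap<&'static str, Rank>,
    encode_calls: Cell<usize>,
}

impl ToyEncoding {
    fn new() -> Self {
        let mut special = HashMap::new();
        special.insert("<|start|>", 1); // made-up id
        special.insert("<|message|>", 2); // made-up id
        ToyEncoding { special, encode_calls: Cell::new(0) }
    }

    /// Stand-in for the comparatively expensive special-token encode.
    fn render_formatting_token(&self, text: &str) -> Rank {
        self.encode_calls.set(self.encode_calls.get() + 1);
        self.special[text]
    }
}

/// "Before" shape: re-encode both formatting tokens for each incoming token.
fn matches_uncached(enc: &ToyEncoding, token: Rank) -> bool {
    token == enc.render_formatting_token("<|start|>")
        || token == enc.render_formatting_token("<|message|>")
}

/// "After" shape: encode once at construction, then compare cached ids.
struct CachedParser {
    start_token: Rank,
    message_token: Rank,
}

impl CachedParser {
    fn new(enc: &ToyEncoding) -> Self {
        CachedParser {
            start_token: enc.render_formatting_token("<|start|>"),
            message_token: enc.render_formatting_token("<|message|>"),
        }
    }

    fn matches(&self, token: Rank) -> bool {
        token == self.start_token || token == self.message_token
    }
}

/// Feeds `n` non-formatting tokens through both shapes and returns
/// (encode calls without caching, encode calls with caching).
pub fn encode_calls_for(n: usize) -> (usize, usize) {
    let tokens: Vec<Rank> = vec![7; n]; // 7 is neither special token here

    let uncached_enc = ToyEncoding::new();
    for &t in &tokens {
        let _ = matches_uncached(&uncached_enc, t);
    }

    let cached_enc = ToyEncoding::new();
    let parser = CachedParser::new(&cached_enc);
    for &t in &tokens {
        let _ = parser.matches(t);
    }

    (uncached_enc.encode_calls.get(), cached_enc.encode_calls.get())
}
```

In this toy model, `encode_calls_for(1000)` returns `(2000, 2)`: two encodes per token in the uncached shape, and two in total once the ids are cached. The real cost per encode, and therefore the real speedup, depends on the actual tokenizer.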
Testing
`cargo test`: all 30 tests pass unchanged (including the `streamable_parser*` tests).

Disclosure: this change was prepared with AI assistance; the diff and benchmark were reviewed and run locally against `main`.