fix: Prevent TokenDecoder overflow causing substr exceptions in text summarization #9
Open
ZbigniewTomanek wants to merge 1 commit into Knowledgator:main
…summarization

Both GPU and CPU inference can generate token offsets that overflow signed 32-bit integers, particularly when using gliner-multitask-large-v0.5 with Q4 quantization on x86 Linux servers with ONNX Runtime 1.20.1. This leads to spans with startIdx/endIdx values around 1.6e9 during text summarization tasks. When these invalid indices are passed to std::string::substr, it throws "basic_string::substr: __pos > this->size()" exceptions.

Added bounds checking and safe text extraction:
- adjustSpanToTextBounds() validates span indices against text bounds
- safeCopySpanText() guards substr calls with validation
- Invalid spans are now skipped instead of causing crashes
- Both SpanDecoder and TokenDecoder use the safety functions

Resolves crashes in text summarization tasks and maintains compatibility with existing regression tests.
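The guard functions named in the commit message could look roughly like the following. This is a minimal sketch and not the code from the PR: the `Span` struct, the signatures, the choice of 64-bit signed indices, the exclusive `endIdx`, and the decision to clamp an oversized `endIdx` instead of rejecting it are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical span type: the field names come from the PR text,
// the layout and integer width are assumed.
struct Span {
    std::int64_t startIdx;  // inclusive start offset into the text
    std::int64_t endIdx;    // exclusive end offset (assumption)
};

// Returns false when the span cannot refer to any part of `text`.
// An end index past the text is clamped to the text length (assumption;
// a stricter variant could reject such spans outright).
bool adjustSpanToTextBounds(Span& span, const std::string& text) {
    const auto size = static_cast<std::int64_t>(text.size());
    if (span.startIdx < 0 || span.startIdx >= size) return false;
    if (span.endIdx <= span.startIdx) return false;
    if (span.endIdx > size) span.endIdx = size;
    return true;
}

// Copies the span's text, or returns std::nullopt for an invalid span
// so that the caller can skip it instead of crashing.
std::optional<std::string> safeCopySpanText(const std::string& text, Span span) {
    if (!adjustSpanToTextBounds(span, text)) return std::nullopt;
    // Invariant after validation: 0 <= startIdx < endIdx <= text.size().
    assert(span.startIdx < span.endIdx);
    return text.substr(static_cast<std::size_t>(span.startIdx),
                       static_cast<std::size_t>(span.endIdx - span.startIdx));
}
```

A decoder loop would call `safeCopySpanText` for each candidate span and continue to the next span whenever it returns `std::nullopt`. Returning an optional keeps the "skip invalid spans" policy in the caller, so the same helper can serve both decoders.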
Contributor
@ZbigniewTomanek, thanks for the contribution. @oleksandrlukashov, please review it.
Fixes rare token decoder overflow issues that cause std::string::substr exceptions during text summarization tasks. The bug manifests when using the gliner-multitask-large-v0.5 model with Q4 quantization on x86 Linux servers with ONNX Runtime 1.20.1, where token offsets can overflow signed 32-bit integers (reaching values around 1.6e9) and cause crashes when passed to std::string::substr.

Changes
- adjustSpanToTextBounds() function to validate span indices against text bounds before processing
- safeCopySpanText() function to guard substr calls with proper validation
- SpanDecoder::decode() and TokenDecoder::decode() now use the new safety functions

Root Cause
Both GPU and CPU inference can generate token offsets that exceed safe integer bounds during text chunking operations. When these invalid indices (startIdx/endIdx values around 1.6e9) are passed to std::string::substr on text of normal length (~820 characters), it throws "basic_string::substr: __pos > this->size()" exceptions.

Testing

Impact
- Resolves std::string::substr exceptions in text summarization workflows
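The failure mode described under Root Cause can be reproduced in isolation. The sketch below is illustrative only: `substrThrowsOutOfRange` is a hypothetical helper, not part of the PR. It relies on standard `std::string::substr` behavior, which throws `std::out_of_range` only when the start position is greater than `size()`; an oversized length argument is silently clamped.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether text.substr(pos, 1) throws.
// substr throws std::out_of_range only when pos > text.size();
// pos == text.size() is allowed and yields an empty string.
bool substrThrowsOutOfRange(const std::string& text, std::size_t pos) {
    try {
        (void)text.substr(pos, 1);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a text of 820 characters, a start position of 1600000000 makes the helper return true, while positions 0 through 820 return false. This is why validating the start index against the text length before calling `substr` is enough to avoid the exception.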