fix: Prevent TokenDecoder overflow causing substr exceptions in text summarization by ZbigniewTomanek · Pull Request #9 · Knowledgator/GLiNER.cpp

ZbigniewTomanek · 2025-09-22T10:19:21Z

Fixes rare token decoder overflow issues that cause std::string::substr exceptions during text summarization tasks. The bug manifests when using the gliner-multitask-large-v0.5 model with Q4 quantization on x86 Linux servers with ONNX Runtime 1.20.1, where token offsets can overflow signed 32-bit integers (reaching values around 1.6e9) and cause crashes when passed to std::string::substr.

Changes

Added adjustSpanToTextBounds() function to validate span indices against text bounds before processing
Added safeCopySpanText() function to guard substr calls with proper validation
Updated both SpanDecoder::decode() and TokenDecoder::decode() to use the new safety functions
Invalid spans are now skipped instead of causing application crashes
Maintains backward compatibility with existing functionality
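
As a rough illustration of the two guards listed above, here is a minimal C++17 sketch. The names `SpanSketch`, `isSpanWithinText`, and `copySpanTextChecked` are hypothetical stand-ins, not the PR's actual `adjustSpanToTextBounds()` and `safeCopySpanText()`, whose signatures are not shown here. It also assumes spans are half-open character ranges and that out-of-range spans are rejected rather than clamped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical span type; the real GLiNER.cpp struct may differ.
struct SpanSketch {
    std::int64_t startIdx;  // inclusive character offset (assumed)
    std::int64_t endIdx;    // exclusive character offset (assumed)
};

// Returns true when [startIdx, endIdx) is a non-empty range inside the text.
// Assumption: invalid spans are rejected outright rather than clamped.
inline bool isSpanWithinText(const SpanSketch& span, std::size_t textSize) {
    if (span.startIdx < 0 || span.endIdx < 0) return false;
    const auto start = static_cast<std::uint64_t>(span.startIdx);
    const auto end = static_cast<std::uint64_t>(span.endIdx);
    return start < end && end <= textSize;
}

// Copies the span text only after validation, so substr can never throw here.
inline std::optional<std::string> copySpanTextChecked(const std::string& text,
                                                      const SpanSketch& span) {
    if (!isSpanWithinText(span, text.size())) return std::nullopt;
    const auto start = static_cast<std::size_t>(span.startIdx);
    const auto length = static_cast<std::size_t>(span.endIdx - span.startIdx);
    assert(start + length <= text.size());  // guaranteed by the check above
    return text.substr(start, length);
}
```

Under these assumptions, a caller can skip a span whenever the returned `std::optional` is empty, which matches the "skip instead of crash" behavior described in the list.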

Root Cause

Both GPU and CPU inference can generate token offsets that exceed safe integer bounds during text chunking operations. When these invalid indices (startIdx/endIdx values around 1.6e9) are passed to std::string::substr on text of normal length (~820 characters), the start position is greater than the string size, so substr throws std::out_of_range with the message basic_string::substr: __pos > this->size().
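
The failure mode can be reproduced in isolation. The sketch below is not code from the PR; `substrThrows` and `checkSubstrBoundaryBehaviour` are hypothetical helpers, and the 820-character string and 1.6e9 offset simply mirror the figures quoted above. It relies only on the standard rule that `std::string::substr(pos, len)` throws `std::out_of_range` when `pos > size()`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if text.substr(pos, len) throws std::out_of_range.
inline bool substrThrows(const std::string& text, std::size_t pos, std::size_t len) {
    try {
        (void)text.substr(pos, len);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Walks through the boundary cases on a text of the size mentioned above.
inline void checkSubstrBoundaryBehaviour() {
    const std::string text(820, 'x');
    assert(substrThrows(text, 1600000000u, 5));   // pos far beyond size(): throws
    assert(!substrThrows(text, 820, 5));          // pos == size(): empty result, no throw
    assert(!substrThrows(text, 0, 1600000000u));  // oversized len is clamped, no throw
}
```

Only the start position triggers the exception. An oversized length is silently clamped, which is why a guard has to check the start index and the end index separately.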

Testing

Resolves crashes in text summarization tasks with affected model configurations
Existing regression tests continue to pass
Invalid spans are gracefully handled without affecting valid results
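
To make the last point concrete, here is a hedged sketch of a decode loop that drops invalid spans while keeping valid ones. `RawSpanSketch`, `DecodedSpanSketch`, and `decodeSkippingInvalid` are hypothetical names chosen for illustration, and the half-open offset convention is an assumption. The actual SpanDecoder::decode() and TokenDecoder::decode() are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical input/output types; not the GLiNER.cpp definitions.
struct RawSpanSketch {
    std::int64_t startIdx;  // inclusive (assumed)
    std::int64_t endIdx;    // exclusive (assumed)
    std::string label;
};

struct DecodedSpanSketch {
    std::string text;
    std::string label;
};

// Extracts text for every in-bounds span and silently skips the rest.
inline std::vector<DecodedSpanSketch> decodeSkippingInvalid(
        const std::string& text, const std::vector<RawSpanSketch>& raw) {
    std::vector<DecodedSpanSketch> out;
    const auto size = static_cast<std::int64_t>(text.size());
    for (const auto& span : raw) {
        const bool valid = span.startIdx >= 0 &&
                           span.startIdx < span.endIdx &&
                           span.endIdx <= size;
        if (!valid) continue;  // skip instead of letting substr throw
        assert(span.endIdx <= size);
        const auto start = static_cast<std::size_t>(span.startIdx);
        const auto length = static_cast<std::size_t>(span.endIdx - span.startIdx);
        out.push_back({text.substr(start, length), span.label});
    }
    return out;
}
```

A test in this style can mix one corrupted span (for example, offsets near 1.6e9) with ordinary spans and check that only the ordinary ones come back, in their original order.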

Impact

Eliminates std::string::substr exceptions in text summarization workflows
No configuration changes required - bounds checking operates transparently

…summarization

Both GPU and CPU inference can generate token offsets that overflow signed 32-bit integers, particularly when using gliner-multitask-large-v0.5 with Q4 quantization on x86 Linux servers with ONNX Runtime 1.20.1. This leads to spans with startIdx/endIdx values around 1.6e9 during text summarization tasks. When these invalid indices are passed to std::string::substr, it throws "basic_string::substr: __pos > this->size()" exceptions.

Added bounds checking and safe text extraction:
- adjustSpanToTextBounds() validates span indices against text bounds
- safeCopySpanText() guards substr calls with validation
- Invalid spans are now skipped instead of causing crashes
- Both SpanDecoder and TokenDecoder use the safety functions

Resolves crashes in text summarization tasks and maintains compatibility with existing regression tests.

Ingvarstep · 2025-09-30T09:11:37Z

@ZbigniewTomanek, thanks for the contribution. @oleksandrlukashov, please review it.

ZbigniewTomanek wants to merge 1 commit into Knowledgator:main from ZbigniewTomanek:main
