Scan multi-line quoted scalars incrementally #59
Merged
The multi-line quoted-scalar collector re-scanned the entire accumulated buffer from byte 0 on every appended continuation line to find the closing quote, making close detection O(N^2) in the input size. A long unterminated quoted scalar therefore took quadratic time to parse-or-reject: with many continuation lines, run time grew toward tens of seconds for inputs that should be handled in milliseconds. Carry the close-detection state (escape carry-over, the first closing-quote offset, and the trailing characteristics) across appended lines and inspect only newly appended bytes, mirroring the incremental scanner already used for flow collections. Parse results, error messages, and spans are unchanged for all valid and invalid inputs; only the scanning cost drops from quadratic to linear. Add regression tests covering single- and double-quoted scalars.
# Conflicts: # tests/dos_hardening.rs
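As a rough illustration of the cost argument in the commit message above, the sketch below counts how many bytes each strategy examines. It does not use the crate's code: `rescan_cost` and `incremental_cost` are hypothetical helpers, and the 16-byte line length is an arbitrary assumption.

```rust
/// Bytes examined when the whole buffer is re-scanned from byte 0
/// after every appended continuation line (hypothetical model).
fn rescan_cost(line_lens: &[usize]) -> usize {
    let mut buffered = 0;
    let mut examined = 0;
    for len in line_lens {
        buffered += len;
        examined += buffered; // each pass starts again at byte 0
    }
    examined
}

/// Bytes examined when only the newly appended bytes are inspected.
fn incremental_cost(line_lens: &[usize]) -> usize {
    line_lens.iter().sum()
}

fn main() {
    for lines in [1_000usize, 2_000, 4_000] {
        // Assumption: every continuation line is 16 bytes long.
        let lens = vec![16usize; lines];
        println!(
            "{lines} lines: rescan examines {} bytes, incremental examines {} bytes",
            rescan_cost(&lens),
            incremental_cost(&lens)
        );
    }
}
```

In this model, doubling the line count roughly quadruples the re-scan total and only doubles the incremental total, which matches the quadratic-versus-linear behavior the message describes.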
Summary
Closes a parse-time O(N²) denial-of-service in the multi-line quoted-scalar collector — the same vulnerability class previously fixed for flow collections, but the quoted-scalar path was missed.
The collector built up a multi-line quoted scalar line by line and, on every appended continuation line, re-scanned the entire accumulated buffer from byte 0 (via `quoted_scalar_close_end`) to locate the closing quote. That makes close detection Σ O(k) = O(N²) in the number of accumulated bytes, so a long unterminated quoted scalar took quadratic time to parse-or-reject. Before the fix, a multi-line unterminated double-quoted scalar scaled ~4× per input doubling (16k continuation lines ≈ 0.5s, growing toward tens of seconds at larger sizes); single-quoted scalars were identically quadratic.
Fix
A new private incremental scanner (`QuotedScalarScan`) carries the close-detection state across appended lines: the `"`-style escape carry-over, the first closing-quote offset (cached once found), and the trailing characteristics (first trailing char and first non-whitespace trailing char). It inspects only the newly appended bytes instead of re-scanning from byte 0. This mirrors the incremental `FlowCollectionState` scanner already used for flow collections.
The scanner reproduces the previous `quoted_scalar_close_end` / `quoted_scalar_accepted_end` semantics exactly: the close offset is the byte just past the first closing quote, and an accepted close additionally requires the trailing text to be all-whitespace or a whitespace-separated comment. Because the collector evaluates the close state after every appended line, each scanned chunk always ends at the current end of the buffer, matching the original `chars.peek()` seeing `None` at the buffer end, so a lone `'` at a chunk boundary closes (a `''` escape can never straddle the boundary, since the first `'` would already have closed the scalar and stopped the collector).
Behavior is byte-for-byte unchanged for all valid and invalid inputs: same parse trees, same error messages, same spans. The now-unused `quoted_scalar_accepted_end` free function is removed. The public API is unchanged.
Tests
Two regression tests are added to `tests/dos_hardening.rs`, covering both single- and double-quoted scalars with inputs that are fully scanned to end-of-input (an unterminated quoted scalar is not a nested collection, so the nesting-depth limit never short-circuits the scan; the timing is a genuine measurement of close-detection cost):
- `multiline_unterminated_quoted_scalar_scales_subquadratically`: doubling the line count (100k → 200k) must scale well below 4× (asserts ≤ 3× with headroom for timer/allocator noise).
- `large_multiline_unterminated_quoted_scalar_finishes_quickly`: a >1 MB unterminated quoted scalar must parse-or-reject under 1s.
Post-fix timing is cleanly linear (50k/100k/200k lines ≈ 2.1 / 4.3 / 8.8 ms; a 1.2 MB run ≈ 26 ms). All listed suites pass: `dos_hardening`, `yaml_test_suite`, `event_parity`, `tree_parity`, `parser_properties`, `parser_indicator_regressions`, `diagnostics`; `cargo clippy --all-features` is clean; `scripts/check-public-api.sh` reports no diff.
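The incremental close detection described under Fix can be sketched roughly as below. This is not the PR's `QuotedScalarScan`: `CloseScan`, `advance`, and `close_end_from_start` are hypothetical names. The sketch covers only the double-quoted escape carry-over and the cached close offset, leaving out single-quoted scalars and the trailing-text checks. It also assumes the buffer starts just after the opening quote.

```rust
/// Hypothetical incremental close detector for a double-quoted scalar.
/// Assumption: the buffer begins just after the opening quote.
#[derive(Default)]
struct CloseScan {
    scanned: usize,           // bytes already inspected
    escaped: bool,            // a backslash is pending from an earlier chunk
    close_end: Option<usize>, // offset just past the first closing quote
}

impl CloseScan {
    /// Inspect only the bytes appended since the previous call.
    fn advance(&mut self, buf: &[u8]) -> Option<usize> {
        if self.close_end.is_some() {
            return self.close_end; // cached once found
        }
        let start = self.scanned;
        for (i, &b) in buf[start..].iter().enumerate() {
            if self.escaped {
                self.escaped = false; // this byte is consumed by the escape
            } else if b == b'\\' {
                self.escaped = true;
            } else if b == b'"' {
                self.close_end = Some(start + i + 1);
                break;
            }
        }
        self.scanned = buf.len();
        self.close_end
    }
}

/// Reference behavior: scan the whole buffer from byte 0.
fn close_end_from_start(buf: &[u8]) -> Option<usize> {
    let mut escaped = false;
    for (i, &b) in buf.iter().enumerate() {
        if escaped {
            escaped = false;
        } else if b == b'\\' {
            escaped = true;
        } else if b == b'"' {
            return Some(i + 1);
        }
    }
    None
}

fn main() {
    // The first chunk ends in a backslash, so the quote that opens the
    // second chunk is escaped and must not close the scalar.
    let chunks: [&[u8]; 3] = [b"abc \\", b"\" still open", b"end\" # c"];
    let mut buf = Vec::new();
    let mut scan = CloseScan::default();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        // The incremental result must agree with a full re-scan.
        assert_eq!(scan.advance(&buf), close_end_from_start(&buf));
    }
    println!("close offset: {:?}", scan.advance(&buf));
}
```

Checking the incremental result against the from-scratch scan after every append is the same equivalence the PR claims for the real scanner, here exercised on a single hand-written input.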