
Scan multi-line quoted scalars incrementally #59

Merged
jskoiz merged 3 commits into main from fix/quoted-scalar-quadratic-dos on Jun 5, 2026

@jskoiz jskoiz commented Jun 5, 2026

Owner

Summary

Closes a parse-time O(N²) denial-of-service in the multi-line quoted-scalar collector — the same vulnerability class previously fixed for flow collections, but the quoted-scalar path was missed.

The collector built up a multi-line quoted scalar line by line and, on every appended continuation line, re-scanned the entire accumulated buffer from byte 0 (via quoted_scalar_close_end) to locate the closing quote. Each append therefore costs O(k) for the current buffer length k, and summing over all appends gives Σ O(k) = O(N²) in the total number of accumulated bytes N, so a long unterminated quoted scalar took quadratic time to parse-or-reject. Before the fix, a multi-line unterminated double-quoted scalar scaled ~4× per input doubling (16k continuation lines ≈ 0.5s, growing toward tens of seconds at larger sizes); single-quoted scalars were identically quadratic.
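To make the Σ O(k) argument concrete, here is a small cost model rather than parser code. It only counts how many bytes each strategy inspects for a given sequence of appended line lengths; the function names are hypothetical and do not appear in the crate.

```rust
/// Bytes inspected when every appended line triggers a rescan of the
/// whole accumulated buffer from byte 0 (the pre-fix pattern).
fn bytes_inspected_with_rescan(line_lens: &[usize]) -> usize {
    let mut buffer_len = 0;
    let mut inspected = 0;
    for &len in line_lens {
        buffer_len += len;
        // The whole buffer is scanned again after each append.
        inspected += buffer_len;
    }
    inspected
}

/// Bytes inspected when only the newly appended bytes are scanned
/// (the post-fix pattern): each byte is visited once.
fn bytes_inspected_incrementally(line_lens: &[usize]) -> usize {
    line_lens.iter().sum()
}
```

In this model, doubling the number of equal-length lines roughly quadruples the rescan count while exactly doubling the incremental count, which is consistent with the ~4× per doubling reported above.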

Fix

A new private incremental scanner (QuotedScalarScan) carries the close-detection state across appended lines: the escape carry-over for "-style (double-quoted) scalars, the first closing-quote offset (cached once found), and the trailing characteristics (first trailing char and first non-whitespace trailing char). It inspects only the newly appended bytes instead of re-scanning from byte 0. This mirrors the incremental FlowCollectionState scanner already used for flow collections.
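The idea of carrying close-detection state across appends can be sketched as follows. This is a simplified illustration, not the crate's QuotedScalarScan: the type and method names are hypothetical, the trailing-character tracking is omitted, and the chunks are assumed to contain only the text after the opening quote.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuoteStyle {
    Single,
    Double,
}

/// Hypothetical sketch of an incremental close scanner.
#[derive(Debug)]
struct CloseScanSketch {
    style: QuoteStyle,
    /// True when the previous chunk ended on an unconsumed backslash.
    escape_pending: bool,
    /// Total bytes fed so far (offset of the next chunk in the buffer).
    scanned: usize,
    /// Offset just past the first closing quote, cached once found.
    close_end: Option<usize>,
}

impl CloseScanSketch {
    fn new(style: QuoteStyle) -> Self {
        Self { style, escape_pending: false, scanned: 0, close_end: None }
    }

    /// Scan only the newly appended `chunk`; never revisit earlier bytes.
    fn feed(&mut self, chunk: &str) -> Option<usize> {
        if self.close_end.is_none() {
            let mut chars = chunk.char_indices().peekable();
            while let Some((i, c)) = chars.next() {
                match self.style {
                    QuoteStyle::Double => {
                        if self.escape_pending {
                            // This char is escaped, whatever it is.
                            self.escape_pending = false;
                        } else if c == '\\' {
                            self.escape_pending = true;
                        } else if c == '"' {
                            self.close_end = Some(self.scanned + i + 1);
                            break;
                        }
                    }
                    QuoteStyle::Single => {
                        if c == '\'' {
                            if matches!(chars.peek(), Some(&(_, '\''))) {
                                chars.next(); // '' is an escaped quote
                            } else {
                                // Lone quote, including one at chunk end.
                                self.close_end = Some(self.scanned + i + 1);
                                break;
                            }
                        }
                    }
                }
            }
        }
        self.scanned += chunk.len();
        self.close_end
    }
}
```

A caller would invoke `feed` once per appended line and stop collecting as soon as it returns `Some(offset)`, so the total work stays proportional to the bytes appended.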

The scanner reproduces the previous quoted_scalar_close_end / quoted_scalar_accepted_end semantics exactly: the close offset is the byte just past the first closing quote, and an accepted close additionally requires the trailing text to be all-whitespace or a whitespace-separated comment. Because the collector evaluates the close state after every appended line, each scanned chunk always ends at the current end of the buffer. This matches the original scan, where chars.peek() saw None at the buffer end, so a lone ' at a chunk boundary closes the scalar. A '' escape can never straddle the boundary, since the first ' would already have closed the scalar and stopped the collector.
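The acceptance rule for the text after the closing quote can be sketched from just two characters, which is why the scanner only needs to remember the first trailing char and the first non-whitespace trailing char. The function below is a hypothetical illustration, not the removed quoted_scalar_accepted_end; it assumes `#` introduces a comment and uses Rust's `char::is_whitespace` as a stand-in for the parser's own whitespace definition.

```rust
/// Hypothetical sketch: decide whether a close is accepted, given the
/// text that follows the closing quote.
fn close_is_accepted(trailing: &str) -> bool {
    // The two cached characteristics described in the PR text.
    let first = trailing.chars().next();
    let first_non_ws = trailing.chars().find(|c| !c.is_whitespace());

    match first_non_ws {
        // Nothing but whitespace (or nothing at all) after the quote.
        None => true,
        // A comment is fine only if whitespace separates it from the quote.
        Some('#') => first.map_or(false, |c| c.is_whitespace()),
        // Any other trailing content rejects the close.
        Some(_) => false,
    }
}
```

Under these assumptions, `" # note"` is accepted while `"# note"` and `" x"` are not.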

Behavior is byte-for-byte unchanged for all valid and invalid inputs: same parse trees, same error messages, same spans. The now-unused quoted_scalar_accepted_end free function is removed. The public API is unchanged.

Tests

Two regression tests added to tests/dos_hardening.rs, covering both single- and double-quoted scalars with inputs that are fully scanned to end-of-input (an unterminated quoted scalar is not a nested collection, so the nesting-depth limit never short-circuits the scan — the timing is a genuine measurement of close-detection cost):

  • multiline_unterminated_quoted_scalar_scales_subquadratically — doubling the line count (100k → 200k) must scale well below 4× (asserts ≤ 3× with headroom for timer/allocator noise).
  • large_multiline_unterminated_quoted_scalar_finishes_quickly — a >1 MB unterminated quoted scalar must parse-or-reject under 1s.
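The shape of such a scaling test can be sketched with a few helpers. This is not the code in tests/dos_hardening.rs: the helper names, the `key: ` prefix, and the continuation-line text are all assumptions, and the parser is passed in as a closure because the crate's entry point is not shown here.

```rust
use std::time::{Duration, Instant};

/// Build an input whose quoted scalar is opened but never closed,
/// followed by `lines` continuation lines (hypothetical layout).
fn unterminated_quoted_scalar(quote: char, lines: usize) -> String {
    let mut input = String::from("key: ");
    input.push(quote);
    input.push_str("start\n");
    for _ in 0..lines {
        input.push_str("  continuation line\n");
    }
    input
}

/// Time a single run of `parse` over `input`.
fn time_once<F: FnMut(&str)>(mut parse: F, input: &str) -> Duration {
    let start = Instant::now();
    parse(input);
    start.elapsed()
}

/// True when doubling the input grew the runtime by at most `max_ratio`.
/// A quadratic scan approaches 4x per doubling; a linear one stays near 2x.
fn scales_subquadratically(small: Duration, large: Duration, max_ratio: f64) -> bool {
    large.as_secs_f64() <= small.as_secs_f64() * max_ratio
}
```

A test in this style would time the parser on 100k and 200k continuation lines and assert `scales_subquadratically(t_small, t_large, 3.0)`, which leaves headroom above the ideal 2× for timer and allocator noise.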

Post-fix timing is cleanly linear (50k/100k/200k lines ≈ 2.1 / 4.3 / 8.8 ms; a 1.2 MB run ≈ 26 ms). All listed suites pass: dos_hardening, yaml_test_suite, event_parity, tree_parity, parser_properties, parser_indicator_regressions, diagnostics; cargo clippy --all-features is clean; scripts/check-public-api.sh reports no diff.

jskoiz added 3 commits June 5, 2026 09:21
The multi-line quoted-scalar collector re-scanned the entire accumulated
buffer from byte 0 on every appended continuation line to find the closing
quote, making close detection O(N^2) in the input size. A long unterminated
quoted scalar therefore took quadratic time to parse-or-reject: with many
continuation lines, parse time grew toward tens of seconds for inputs that
should be handled in milliseconds.

Carry the close-detection state (escape carry-over, the first closing-quote
offset, and the trailing characteristics) across appended lines and inspect
only newly appended bytes, mirroring the incremental scanner already used for
flow collections. Parse results, error messages, and spans are unchanged for
all valid and invalid inputs; only the scanning cost drops from quadratic to
linear. Add regression tests covering single- and double-quoted scalars.
# Conflicts:
#	tests/dos_hardening.rs
@jskoiz jskoiz merged commit 73f9479 into main Jun 5, 2026
7 checks passed