Make single-line quote scanning linear #61
Detecting whether a quote opens a quoted scalar previously re-scanned the current line backwards from the candidate position on every quote, so a single line consisting of a long run of quote characters did O(N^2) work during mapping-colon detection. A 50k-character line of quotes took tens of seconds to parse. Track the start of the current unquoted token during the existing forward scan instead, so the quoted-context check is O(1) per quote and a whole line is scanned once. Parse results, errors, and spans are unchanged; only the scanning cost changes. Add regression tests covering a large single-line quote run and the sub-quadratic scaling of parse time as the line grows.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2e1316ebf
/// precomputed so this check stays O(1) instead of re-scanning the prefix.
fn can_start_quoted_context(text: &str, idx: usize, token_start: usize) -> bool {
Avoid rescanning tag prefixes for quote candidates
For single-line inputs such as `>"` repeated many times, every `"` is still outside a quoted context, so `can_start_quoted_context` reaches the `prefix.ends_with('>')` path, where `prefix.rfind("!<")` scans the entire prefix again. That keeps mapping-colon detection quadratic for a small variant of the same DoS class this change is meant to close, so the quote-start check is not actually O(1) unless the tag-prefix lookup is also made incremental or otherwise bounded.
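One way to make the tag-prefix lookup incremental is sketched below. This is only an illustration under assumptions: `TagTracker`, `tag_opens_incremental`, and `tag_opens_rescan` are hypothetical names, not code from this PR, and the sketch covers only the position of the most recent `!<`, not the rest of the tag check.

```rust
/// Hypothetical tracker (not from this PR): remembers the byte offset of the
/// most recent "!<" seen so far during a single forward pass over the line.
#[derive(Default)]
struct TagTracker {
    last_open: Option<usize>,
    prev_was_bang: bool,
}

impl TagTracker {
    /// Feed the character at byte offset `idx`; O(1) per character.
    fn observe(&mut self, idx: usize, ch: char) {
        if ch == '<' && self.prev_was_bang {
            // '!' is one byte in UTF-8, so the "!<" match starts at idx - 1.
            self.last_open = Some(idx - 1);
        }
        self.prev_was_bang = ch == '!';
    }
}

/// For every char boundary `idx`, the value `text[..idx].rfind("!<")` would
/// return, computed incrementally in one pass.
fn tag_opens_incremental(text: &str) -> Vec<Option<usize>> {
    let mut tracker = TagTracker::default();
    let mut out = Vec::new();
    for (idx, ch) in text.char_indices() {
        out.push(tracker.last_open); // state for the prefix text[..idx]
        tracker.observe(idx, ch);
    }
    out
}

/// Reference version that rescans the prefix at every position (quadratic).
fn tag_opens_rescan(text: &str) -> Vec<Option<usize>> {
    text.char_indices()
        .map(|(idx, _)| text[..idx].rfind("!<"))
        .collect()
}
```

Comparing the two functions on inputs like `>"` repeated gives a cheap equivalence check before swapping the lookup inside the real scan.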
Summary
Closes a parse-time denial-of-service: a single line consisting of a long run of quote characters (e.g. `"` or `'` repeated, with no newlines) parsed in O(N^2) time under default options. A 50k-character line of quotes took roughly 50 seconds; 8k took ~0.7s and 16k ~2.7s, i.e. ~4x per doubling.
Root cause
`find_mapping_col` scans a line once to locate an unquoted mapping colon. For every `"`/`'` candidate it called `can_start_quoted_context`, which re-scanned the line prefix backwards (`text[..idx].trim_end()` then a reverse `char_indices` walk) to find the start of the current token. For a line that is one long run of quotes, roughly half the positions triggered that backward scan and each scan was O(idx), giving overall O(N^2). This is the single-line scanning path and is distinct from the multi-line quoted-scalar collector.
Fix
Track the start of the current unquoted token incrementally during the existing forward pass of `find_mapping_col`, and pass it to `can_start_quoted_context` so the quoted-context decision is O(1) per quote. The whole line is now scanned once, making single-line parsing linear in line length.
The change is purely a performance rewrite of an internal scan: parse results, error messages, and spans are unchanged for both valid and invalid inputs. The incremental token-start computation was validated to be byte-for-byte equivalent to the previous backward scan across structured cases and millions of randomized fuzz inputs, and the full existing test suite (YAML suite, event/tree parity, property tests, indicator regressions, diagnostics, schema modes, public-API snapshot) passes unchanged.
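The difference between the two strategies can be illustrated with a minimal sketch. It assumes a token is a whitespace-delimited run, which may be simpler than the parser's real rule. The function names are hypothetical, and for simplicity the step counter calls the backward scan at every position rather than only at quote candidates.

```rust
/// Backward scan in the style described above: trim the prefix, then walk it
/// in reverse until whitespace. Returns (token start, characters visited).
fn token_start_backward(text: &str, idx: usize) -> (usize, usize) {
    let prefix = text[..idx].trim_end();
    let mut steps = 0;
    for (i, ch) in prefix.char_indices().rev() {
        steps += 1;
        if ch.is_whitespace() {
            // The token begins right after the last whitespace character.
            return (i + ch.len_utf8(), steps);
        }
    }
    (0, steps)
}

/// Total characters visited when the backward scan runs at every position.
fn total_backward_steps(text: &str) -> usize {
    text.char_indices()
        .map(|(idx, _)| token_start_backward(text, idx).1)
        .sum()
}

/// Incremental version: one forward pass that records, for each position,
/// the start of the most recent whitespace-delimited token before it.
fn token_starts_incremental(text: &str) -> Vec<usize> {
    let mut token_start = 0;
    let mut prev_was_ws = true; // start of line behaves like whitespace
    let mut out = Vec::with_capacity(text.len());
    for (idx, ch) in text.char_indices() {
        out.push(token_start); // answer for the prefix text[..idx]
        if ch.is_whitespace() {
            prev_was_ws = true;
        } else {
            if prev_was_ws {
                token_start = idx; // a new token begins here
            }
            prev_was_ws = false;
        }
    }
    out
}
```

For a run of N quotes with no whitespace, the backward scan visits `idx` characters at position `idx`, so the total is N(N-1)/2, while the incremental pass does constant work per character.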
Results
Tests
Added two regression tests in `tests/dos_hardening.rs`:
- `single_line_quote_run_is_handled_quickly`: a 500k-character single-line run of `"` and of `'` must parse-or-reject within a bounded time.
- `single_line_quote_run_scales_subquadratically`: doubling the line length (200k -> 400k) must keep the time ratio well below quadratic (asserts <= 3x), mirroring the existing multi-line sub-quadratic regression test.
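A scaling check of this kind might be structured roughly as follows. This is a sketch, not the contents of `tests/dos_hardening.rs`: `parse_stub` is a hypothetical linear stand-in for the crate's real parse entry point, and `best_of` and `scaling_ratio` are illustrative helper names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real parser: a single linear pass that
/// counts quote characters. Replace with the crate's parse function.
fn parse_stub(input: &str) -> usize {
    input.chars().filter(|c| *c == '"' || *c == '\'').count()
}

/// Fastest of `runs` timings, to dampen scheduler and cache noise.
fn best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    (0..runs)
        .map(|_| {
            let started = Instant::now();
            f();
            started.elapsed()
        })
        .min()
        .unwrap_or_default()
}

/// Ratio of parse time at length 2n to parse time at length n for a
/// single-line run of double quotes. About 2 suggests linear scaling,
/// about 4 suggests quadratic.
fn scaling_ratio(parse: impl Fn(&str) -> usize, n: usize) -> f64 {
    let small = "\"".repeat(n);
    let large = "\"".repeat(2 * n);
    let t_small = best_of(5, || {
        black_box(parse(&small));
    });
    let t_large = best_of(5, || {
        black_box(parse(&large));
    });
    // Guard against a zero denominator on very fast machines.
    t_large.as_secs_f64() / t_small.as_secs_f64().max(1e-9)
}
```

A test would then assert something like `scaling_ratio(parse_stub, 200_000) <= 3.0`. Wall-clock ratios can be noisy on shared CI runners, which is why the sketch takes the minimum over several runs.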