Make single-line quote scanning linear #61

Merged
jskoiz merged 1 commit into main from fix/single-line-quote-quadratic-dos
Jun 5, 2026
@jskoiz jskoiz commented Jun 5, 2026

Owner

Summary

Closes a parse-time denial-of-service: a single line consisting of a long run of quote characters (e.g. " or ' repeated, with no newlines) parsed in O(N^2) time under default options. A 50k-character line of quotes took roughly 50 seconds; 8k took ~0.7s, 16k ~2.7s — i.e. ~4x per doubling.

Root cause

find_mapping_col scans a line once to locate an unquoted mapping colon. For every "/' candidate it called can_start_quoted_context, which re-scanned the line prefix backwards (text[..idx].trim_end() then a reverse char_indices walk) to find the start of the current token. For a line that is one long run of quotes, roughly half the positions triggered that backward scan and each scan was O(idx), giving overall O(N^2). This is the single-line scanning path and is distinct from the multi-line quoted-scalar collector.
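To make the cost concrete, here is a minimal step-counting sketch of the old behaviour. It is not the parser's code: `backward_scan_steps` is a hypothetical helper, and it assumes a simplified model in which quotes alternate between opening and closing, so every second quote triggers a backward walk over the whole prefix.

```rust
/// Hypothetical model of the old behaviour (not the real parser code):
/// counts how many characters the backward walks visit for a line made
/// of `n` double quotes.
fn backward_scan_steps(n: usize) -> u64 {
    let line = "\"".repeat(n);
    let mut steps = 0u64;
    // Assumption: quotes alternate open/close, so only every second
    // quote is an "opening candidate" that triggers the backward walk.
    let mut in_quote = false;
    for (idx, _) in line.char_indices() {
        if !in_quote {
            // Walk the prefix backwards until whitespace or line start.
            steps += line[..idx]
                .chars()
                .rev()
                .take_while(|c| !c.is_whitespace())
                .count() as u64;
        }
        in_quote = !in_quote;
    }
    steps
}
```

Under this model the total is the sum of the even indices below N, roughly N^2/4. For 1,000 quotes it counts 249,500 steps and for 2,000 quotes 999,000, which matches the ~4x per doubling reported in the summary.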

Fix

Track the start of the current unquoted token incrementally during the existing forward pass of find_mapping_col, and pass it to can_start_quoted_context so the quoted-context decision is O(1) per quote. The whole line is now scanned once, making single-line parsing linear in line length.
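The idea can be sketched as follows. This is an illustrative simplification, not the code in src/parse.rs: both function names are hypothetical, and "token start" is assumed here to mean the start of the last whitespace-delimited run before the quote, ignoring the quoted-context handling the real scanner performs.

```rust
/// Old style (hypothetical reconstruction): O(idx) per call.
fn token_start_backward(text: &str, idx: usize) -> usize {
    let prefix = text[..idx].trim_end();
    let mut start = prefix.len();
    for (i, ch) in prefix.char_indices().rev() {
        if ch.is_whitespace() {
            break;
        }
        start = i;
    }
    start
}

/// New style (sketch): one forward pass that carries the token start
/// along, so each quote gets its answer in O(1).
/// Returns (quote index, token start) for every quote character.
fn quote_token_starts(text: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut token_start = 0;
    let mut prev_was_ws = true;
    for (idx, ch) in text.char_indices() {
        if ch == '"' || ch == '\'' {
            // The value reflects the prefix text[..idx] only.
            out.push((idx, token_start));
        }
        if ch.is_whitespace() {
            prev_was_ws = true;
        } else {
            if prev_was_ws {
                // A new whitespace-delimited token begins here.
                token_start = idx;
            }
            prev_was_ws = false;
        }
    }
    out
}
```

Comparing the two functions on the same inputs is the kind of equivalence check described in the next paragraph, in miniature.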

The change is purely a performance rewrite of an internal scan: parse results, error messages, and spans are unchanged for both valid and invalid inputs. The incremental token-start computation was validated to be byte-for-byte equivalent to the previous backward scan across structured cases and millions of randomized fuzz inputs, and the full existing test suite (YAML suite, event/tree parity, property tests, indicator regressions, diagnostics, schema modes, public-API snapshot) passes unchanged.

Results

  • Double- and single-quote runs now scale ~2x per doubling (linear) instead of ~4x.
  • The previously ~50s 50k-quote input now parses in well under a millisecond.

Tests

Added two regression tests in tests/dos_hardening.rs:

  • single_line_quote_run_is_handled_quickly: a 500k-character single-line run of " and of ' must parse-or-reject within a bounded time.
  • single_line_quote_run_scales_subquadratically: doubling the line length (200k -> 400k) must keep the time ratio well below quadratic (asserts <= 3x), mirroring the existing multi-line sub-quadratic regression test.
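A scaling check of this kind might look roughly like the sketch below. It is not the content of tests/dos_hardening.rs: `best_of` and `doubling_ratio` are hypothetical helpers, and the parser is passed in as a closure because the crate's parse entry point is not shown in this PR.

```rust
use std::time::{Duration, Instant};

/// Fastest of `runs` timings, to dampen scheduler noise.
fn best_of<F: Fn(&str)>(parse: &F, input: &str, runs: u32) -> Duration {
    (0..runs)
        .map(|_| {
            let started = Instant::now();
            parse(input);
            started.elapsed()
        })
        .min()
        .unwrap_or(Duration::ZERO)
}

/// Time ratio between a line of `2 * len` quotes and one of `len` quotes.
/// About 2 suggests linear scaling, about 4 suggests quadratic.
fn doubling_ratio<F: Fn(&str)>(parse: F, len: usize) -> f64 {
    let small = "\"".repeat(len);
    let large = "\"".repeat(len * 2);
    // Floor the denominator so a very fast parse cannot divide by zero.
    let t_small = best_of(&parse, &small, 5).max(Duration::from_micros(1));
    let t_large = best_of(&parse, &large, 5);
    t_large.as_secs_f64() / t_small.as_secs_f64()
}
```

A test built on this would call `doubling_ratio` with the real parser and `len = 200_000`, then assert the ratio stays at or below 3. Timing assertions are inherently noisy, so taking the best of several runs and leaving headroom between 2x and the 3x threshold helps keep such a test stable on shared CI machines.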

Detecting whether a quote opens a quoted scalar previously re-scanned the
current line backwards from the candidate position on every quote, so a
single line consisting of a long run of quote characters did O(N^2) work
during mapping-colon detection. A 50k-character line of quotes took tens
of seconds to parse.

Track the start of the current unquoted token during the existing forward
scan instead, so the quoted-context check is O(1) per quote and a whole
line is scanned once. Parse results, errors, and spans are unchanged; only
the scanning cost changes.

Add regression tests covering a large single-line quote run and the
sub-quadratic scaling of parse time as the line grows.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2e1316ebf

Comment thread src/parse.rs
Comment on lines +4063 to +4064
/// precomputed so this check stays O(1) instead of re-scanning the prefix.
fn can_start_quoted_context(text: &str, idx: usize, token_start: usize) -> bool {

P1: Avoid rescanning tag prefixes for quote candidates

For single-line inputs like >" repeated many times, every " is still outside a quoted context and can_start_quoted_context reaches the prefix.ends_with('>') path, where prefix.rfind("!<") scans the entire prefix again. That keeps mapping-colon detection quadratic for a small variant of the same DoS class this change is meant to close, so the quote-start check is not actually O(1) unless the tag-prefix lookup is also made incremental or otherwise bounded.
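One way the lookup could be made incremental, as the comment suggests, is to remember the most recent "!<" during the same forward pass. The sketch below only illustrates that idea: `tag_open_at_quotes` is a hypothetical name, and nothing in this PR shows whether or how the suggestion was adopted.

```rust
/// Sketch: for each quote, report the position of the last "!<" before
/// it, maintained during a single forward pass instead of calling
/// `prefix.rfind("!<")` once per quote.
fn tag_open_at_quotes(text: &str) -> Vec<(usize, Option<usize>)> {
    let mut out = Vec::new();
    let mut last_open: Option<usize> = None;
    let mut prev: Option<(usize, char)> = None;
    for (idx, ch) in text.char_indices() {
        if ch == '"' || ch == '\'' {
            // O(1): the answer for text[..idx] is already known.
            out.push((idx, last_open));
        }
        if ch == '<' {
            if let Some((bang_idx, '!')) = prev {
                // Found "!<" starting at bang_idx.
                last_open = Some(bang_idx);
            }
        }
        prev = Some((idx, ch));
    }
    out
}
```

For every quote the tracked value equals `text[..idx].rfind("!<")`, but a line such as `>"` repeated many times now costs one pass rather than one prefix scan per quote.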

@jskoiz jskoiz merged commit 665a505 into main Jun 5, 2026
7 checks passed