Make single-line quote scanning linear by jskoiz · Pull Request #61 · jskoiz/saneyaml

jskoiz · 2026-06-05T19:37:58Z

Summary

Closes a parse-time denial-of-service: under default options, a single line consisting of a long run of quote characters (e.g. " or ' repeated, with no newlines) parsed in O(N^2) time. An 8k-character line took ~0.7s and a 16k-character line ~2.7s, i.e. ~4x per doubling; a 50k-character line of quotes took roughly 50 seconds.
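As a quick check on the "~4x per doubling" reading, the growth exponent can be estimated from two timings. This is a minimal sketch, assuming parse time grows like N^k; `scaling_exponent` is a hypothetical helper and not part of saneyaml.

```rust
/// Empirical scaling exponent k, assuming time grows like N^k:
/// k = ln(t_large / t_small) / ln(n_large / n_small).
/// Hypothetical helper for illustration only.
fn scaling_exponent(n_small: f64, t_small: f64, n_large: f64, t_large: f64) -> f64 {
    (t_large / t_small).ln() / (n_large / n_small).ln()
}
```

With the 8k (~0.7s) and 16k (~2.7s) timings quoted above, `scaling_exponent(8_000.0, 0.7, 16_000.0, 2.7)` comes out close to 2, which is what quadratic behavior predicts; a linear scan would give a value close to 1.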

Root cause

find_mapping_col scans a line once to locate an unquoted mapping colon. For every "/' candidate it called can_start_quoted_context, which re-scanned the line prefix backwards (text[..idx].trim_end() then a reverse char_indices walk) to find the start of the current token. For a line that is one long run of quotes, roughly half the positions triggered that backward scan and each scan was O(idx), giving overall O(N^2). This is the single-line scanning path and is distinct from the multi-line quoted-scalar collector.
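The cost pattern described above can be reproduced with a simplified stand-in. This is only a sketch: `token_start_backward` and `total_backward_steps` are hypothetical names, the token boundary is assumed to be plain whitespace, and every quote triggers the backward walk here (the real code triggered it for roughly half the positions).

```rust
/// Hypothetical stand-in for the old approach: find the start of the token
/// before `idx` by trimming the prefix and walking it backwards.
/// Returns (token_start, steps); the walk costs O(idx) per call.
fn token_start_backward(text: &str, idx: usize) -> (usize, usize) {
    let prefix = text[..idx].trim_end();
    let mut steps = 0;
    for (i, ch) in prefix.char_indices().rev() {
        steps += 1;
        if ch.is_whitespace() {
            // The token starts just after the whitespace character.
            return (i + ch.len_utf8(), steps);
        }
    }
    (0, steps)
}

/// Total backward-walk steps when every quote character on the line
/// triggers a fresh backward scan (assumption of this sketch).
fn total_backward_steps(line: &str) -> usize {
    line.char_indices()
        .filter(|&(_, ch)| ch == '"' || ch == '\'')
        .map(|(idx, _)| token_start_backward(line, idx).1)
        .sum()
}
```

For a line of N quotes with no whitespace, the quote at position idx walks idx characters, so the total is N(N-1)/2 steps. Doubling N therefore roughly quadruples the work, matching the ~4x per doubling measured above.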

Fix

Track the start of the current unquoted token incrementally during the existing forward pass of find_mapping_col, and pass it to can_start_quoted_context so the quoted-context decision is O(1) per quote. The whole line is now scanned once, making single-line parsing linear in line length.
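The idea of carrying the token start along the forward pass can be sketched as follows. This is not the actual find_mapping_col code: `quote_candidates_with_token_start` is a hypothetical function, and "token start" is assumed to mean the start of the nearest non-whitespace run before the quote.

```rust
/// Hypothetical sketch: for every quote character in `line`, report
/// (quote_index, token_start), where token_start is the byte index at which
/// the nearest non-whitespace run before the quote begins (0 if none).
/// The value is maintained during a single forward pass, so each quote
/// candidate gets its answer in O(1).
fn quote_candidates_with_token_start(line: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut token_start = 0usize;
    let mut prev_was_space = true; // start of line counts as a boundary
    for (idx, ch) in line.char_indices() {
        if ch == '"' || ch == '\'' {
            // Use the state accumulated from characters before `idx`.
            out.push((idx, token_start));
        }
        if ch.is_whitespace() {
            prev_was_space = true;
        } else {
            if prev_was_space {
                // A new non-whitespace run begins here.
                token_start = idx;
            }
            prev_was_space = false;
        }
    }
    out
}
```

The state is two small variables, so the total work is proportional to the line length regardless of how many quotes it contains. In the real change the tracked value is handed to can_start_quoted_context as its token_start argument.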

The change is purely a performance rewrite of an internal scan: parse results, error messages, and spans are unchanged for both valid and invalid inputs. The incremental token-start computation was validated to be byte-for-byte equivalent to the previous backward scan across structured cases and millions of randomized fuzz inputs, and the full existing test suite (YAML suite, event/tree parity, property tests, indicator regressions, diagnostics, schema modes, public-API snapshot) passes unchanged.

Results

Double- and single-quote runs now scale ~2x per doubling (linear) instead of ~4x.
The previously ~50s 50k-quote input now parses in well under a millisecond.

Tests

Added two regression tests in tests/dos_hardening.rs:

single_line_quote_run_is_handled_quickly: a 500k-character single-line run of " and of ' must parse-or-reject within a bounded time.
single_line_quote_run_scales_subquadratically: doubling the line length (200k -> 400k) must keep the time ratio well below quadratic (asserts <= 3x), mirroring the existing multi-line sub-quadratic regression test.
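A doubling-ratio check of the kind described in the second test could look roughly like this. It is a sketch under assumptions: `best_of`, `ratio_within`, and `doubling_ratio_ok` are hypothetical helpers, the parser is passed in as a closure rather than calling saneyaml's real API, and taking the best of several runs is just one way to reduce timing noise.

```rust
use std::time::{Duration, Instant};

/// Run `f` `runs` times and keep the fastest wall-clock time.
fn best_of<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .unwrap_or(Duration::ZERO)
}

/// True if `large` is at most `max_ratio` times `small`.
/// The small time is clamped to 1 microsecond to avoid dividing by zero.
fn ratio_within(small: Duration, large: Duration, max_ratio: f64) -> bool {
    let small_secs = small.as_secs_f64().max(1e-6);
    large.as_secs_f64() / small_secs <= max_ratio
}

/// Time `parse` on a run of `n` quotes and on `2 * n` quotes, then compare.
/// `parse` is a placeholder for whatever entry point is under test.
fn doubling_ratio_ok<P: Fn(&str)>(parse: P, n: usize, max_ratio: f64) -> bool {
    let small = "\"".repeat(n);
    let large = "\"".repeat(2 * n);
    let t_small = best_of(5, || parse(&small));
    let t_large = best_of(5, || parse(&large));
    ratio_within(t_small, t_large, max_ratio)
}
```

A linear scan should roughly double its time when the input doubles, while a quadratic one roughly quadruples it, so a threshold of 3x separates the two cases with some margin for noise.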

Detecting whether a quote opens a quoted scalar previously re-scanned the current line backwards from the candidate position on every quote, so a single line consisting of a long run of quote characters did O(N^2) work during mapping-colon detection. A 50k-character line of quotes took tens of seconds to parse.

Track the start of the current unquoted token during the existing forward scan instead, so the quoted-context check is O(1) per quote and a whole line is scanned once. Parse results, errors, and spans are unchanged; only the scanning cost changes.

Add regression tests covering a large single-line quote run and the sub-quadratic scaling of parse time as the line grows.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2e1316ebf

chatgpt-codex-connector · 2026-06-05T19:44:22Z

+/// precomputed so this check stays O(1) instead of re-scanning the prefix.
+fn can_start_quoted_context(text: &str, idx: usize, token_start: usize) -> bool {


Avoid rescanning tag prefixes for quote candidates

For single-line inputs like >" repeated many times, every " is still outside a quoted context and can_start_quoted_context reaches the prefix.ends_with('>') path, where prefix.rfind("!<") scans the entire prefix again. That keeps mapping-colon detection quadratic for a small variant of the same DoS class this change is meant to close, so the quote-start check is not actually O(1) unless the tag-prefix lookup is also made incremental or otherwise bounded.
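One way to bound the tag-prefix lookup is to remember the position of the most recent "!<" during the same forward pass, instead of calling rfind on the prefix for each quote. The sketch below is an assumption about how that could be done, not the fix that was merged; `quotes_with_last_tag_open` is a hypothetical name.

```rust
/// Hypothetical sketch: for every quote character in `line`, report
/// (quote_index, last_tag_open), where last_tag_open is the byte index of
/// the most recent "!<" that ends before the quote, if any.
/// This matches `line[..quote_index].rfind("!<")` but costs O(1) per quote.
fn quotes_with_last_tag_open(line: &str) -> Vec<(usize, Option<usize>)> {
    let mut out = Vec::new();
    let mut last_tag_open: Option<usize> = None;
    let mut prev: Option<(usize, char)> = None;
    for (idx, ch) in line.char_indices() {
        if ch == '<' {
            // A '<' directly after '!' opens a verbatim tag.
            if let Some((p, '!')) = prev {
                last_tag_open = Some(p);
            }
        }
        if ch == '"' || ch == '\'' {
            out.push((idx, last_tag_open));
        }
        prev = Some((idx, ch));
    }
    out
}
```

With this, an input such as >" repeated many times never triggers a prefix rescan: each quote reads the cached position (here always None) instead of searching backwards.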

jskoiz merged commit 665a505 into main Jun 5, 2026
7 checks passed

jskoiz merged 1 commit into main from fix/single-line-quote-quadratic-dos
