
perf: merge UTF-8 + control char + escape scans into single string-content pass #40

@membphis


PR #38 introduced eager RFC 8259 validation, which slows quickdecode.parse + access 3 fields by roughly 4–48x relative to the lazy/main baseline, depending on the payload (see the PR description for the full bench table). The dominant cost is string-content validation, which currently makes three independent passes over every string's raw bytes.

Current state

For every string span between two " structurals, the eager pass calls validate_string_span(span), which today runs:

  1. span.iter().any(|&b| b < 0x20) — reject raw control characters
  2. std::str::from_utf8(span) — reject non-UTF-8 byte sequences
  3. A separate byte-by-byte escape-grammar walker — reject \a, \uZZZZ, dangling \, truncated \uXX, etc.

Each pass walks the full string. For payloads where strings dominate (most real-world JSON), this means traversing string content ~3 times per parse.
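For reference, the three passes described above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the crate's actual code: the `StringError` enum and the `check_escapes` helper are hypothetical names, and the real error codes may differ.

```rust
/// Hypothetical error codes; the crate's real error type is not shown in this issue.
#[derive(Debug, PartialEq, Eq)]
pub enum StringError {
    ControlChar,
    InvalidUtf8,
    BadEscape,
}

/// Pass 3 (hypothetical helper): walk the escape grammar of RFC 8259 section 7.
fn check_escapes(span: &[u8]) -> Result<(), StringError> {
    let mut i = 0;
    while i < span.len() {
        if span[i] != b'\\' {
            i += 1;
            continue;
        }
        match span.get(i + 1).copied() {
            // Two-byte escapes: \" \\ \/ \b \f \n \r \t
            Some(b'"' | b'\\' | b'/' | b'b' | b'f' | b'n' | b'r' | b't') => i += 2,
            // \u must be followed by exactly four hex digits.
            Some(b'u') => {
                let hex = span.get(i + 2..i + 6).ok_or(StringError::BadEscape)?;
                if !hex.iter().all(u8::is_ascii_hexdigit) {
                    return Err(StringError::BadEscape);
                }
                i += 6;
            }
            // Unknown introducer (e.g. \a) or a dangling backslash at the end.
            _ => return Err(StringError::BadEscape),
        }
    }
    Ok(())
}

/// Three independent traversals of the same bytes.
pub fn validate_string_span(span: &[u8]) -> Result<(), StringError> {
    // Pass 1: raw control characters.
    if span.iter().any(|&b| b < 0x20) {
        return Err(StringError::ControlChar);
    }
    // Pass 2: UTF-8 well-formedness.
    if std::str::from_utf8(span).is_err() {
        return Err(StringError::InvalidUtf8);
    }
    // Pass 3: escape grammar.
    check_escapes(span)
}
```

On a valid string none of the passes exits early, so every byte is read three times.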

In real traffic these checks almost always pass (invalid UTF-8 from non-UTF-8-aware upstreams is the most common rejection cause; control chars and bad escapes are rare). So on the happy path, two of the three traversals are redundant work.

Proposed optimization

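The optimization that actually saves work is to merge the UTF-8, control char, and escape scans into a single scan (one walk over the bytes that drives all three state machines), with a SIMD fast path for the common case of ASCII-only strings.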

Concretely:

  • One byte-level state machine that simultaneously tracks:
    • UTF-8 continuation state (using DFA tables like the Hoehrmann decoder, or simdjson-style validation)
    • Whether the current byte is a control char (< 0x20)
    • Whether the previous byte was \ (so the next byte must be a valid escape introducer; u enters a 4-hex sub-state)
  • An ASCII-only fast path: if a 32/64-byte chunk has no high bits set, no bytes < 0x20, and no \, advance past it in one SIMD compare. The non-ASCII / has-backslash slow path falls back to the state machine.
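A minimal sketch of the combined validator, under several assumptions: `StringError`, `Esc`, and `is_plain_ascii` are hypothetical names; the UTF-8 state is tracked with explicit continuation-byte ranges (following the RFC 3629 table) rather than a Hoehrmann-style DFA table; and the fast path uses portable 8-byte SWAR arithmetic on a `u64` as a stand-in for a real 32/64-byte SIMD compare.

```rust
/// Hypothetical error codes; the crate's real error type is not shown in this issue.
#[derive(Debug, PartialEq, Eq)]
pub enum StringError { ControlChar, InvalidUtf8, BadEscape }

/// Escape-grammar state carried from one byte to the next.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Esc { None, AfterBackslash, Hex(u8) }

const HI: u64 = 0x8080_8080_8080_8080;
const LO: u64 = 0x0101_0101_0101_0101;

/// True if all 8 bytes are in 0x20..=0x7F and none is a backslash.
fn is_plain_ascii(w: u64) -> bool {
    // High bit of a lane ends up set if that lane (or a lower one) is < 0x20.
    let below_0x20 = w.wrapping_sub(LO * 0x20) & !w;
    // XOR turns backslash lanes into zero; then apply the classic zero-byte test.
    let x = w ^ (LO * b'\\' as u64);
    let is_backslash = x.wrapping_sub(LO) & !x;
    (w | below_0x20 | is_backslash) & HI == 0
}

pub fn validate_string_span(span: &[u8]) -> Result<(), StringError> {
    let (mut i, mut esc) = (0usize, Esc::None);
    // UTF-8 state: continuation bytes still owed, and the range allowed for the next one.
    let (mut need, mut lo, mut hi) = (0u8, 0x80u8, 0xBFu8);
    while i < span.len() {
        // Fast path: only when no escape or multi-byte sequence is in flight.
        if esc == Esc::None && need == 0 {
            while let Some(chunk) = span.get(i..i + 8) {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                if !is_plain_ascii(u64::from_le_bytes(buf)) { break; }
                i += 8;
            }
            if i == span.len() { break; }
        }
        let b = span[i];
        i += 1;
        if need > 0 {
            if b < lo || b > hi { return Err(StringError::InvalidUtf8); }
            need -= 1;
            lo = 0x80;
            hi = 0xBF;
            continue;
        }
        if b < 0x20 { return Err(StringError::ControlChar); }
        esc = match (esc, b) {
            (Esc::AfterBackslash, b'"' | b'\\' | b'/' | b'b' | b'f' | b'n' | b'r' | b't') => Esc::None,
            (Esc::AfterBackslash, b'u') => Esc::Hex(4),
            (Esc::Hex(1), _) if b.is_ascii_hexdigit() => Esc::None,
            (Esc::Hex(n), _) if b.is_ascii_hexdigit() => Esc::Hex(n - 1),
            (Esc::AfterBackslash | Esc::Hex(_), _) => return Err(StringError::BadEscape),
            (Esc::None, b'\\') => Esc::AfterBackslash,
            (Esc::None, 0x20..=0x7F) => Esc::None,
            (Esc::None, _) => {
                // Lead byte: the ranges exclude overlongs, surrogates, and > U+10FFFF.
                (need, lo, hi) = match b {
                    0xC2..=0xDF => (1, 0x80, 0xBF),
                    0xE0 => (2, 0xA0, 0xBF),
                    0xED => (2, 0x80, 0x9F),
                    0xE1..=0xEF => (2, 0x80, 0xBF),
                    0xF0 => (3, 0x90, 0xBF),
                    0xF4 => (3, 0x80, 0x8F),
                    0xF1..=0xF3 => (3, 0x80, 0xBF),
                    _ => return Err(StringError::InvalidUtf8),
                };
                Esc::None
            }
        };
    }
    // A span that ends mid-sequence or mid-escape is truncated.
    if need > 0 { return Err(StringError::InvalidUtf8); }
    if esc != Esc::None { return Err(StringError::BadEscape); }
    Ok(())
}
```

One design point worth checking against the "identical error codes" criterion: a single pass reports the first defect by position, whereas a three-pass validator reports by pass order. For a span with several defects (for example a bad escape followed by a raw control character) the two may return different codes, so the merged validator may need a tie-breaking rule or a slow re-check on the error path.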

Acceptance criteria

  • Single-pass string-content validator replacing the current 3-pass validate_string_span.
  • SIMD fast path for ASCII-only chunks (no \, no control, no high bit).
  • Performance: close enough to lazy/main throughput that the eager-by-default mode is acceptable for the API-gateway use case — target within 2x of lazy on real-world payloads (open to revision based on measurements).
  • No regression in correctness: every test in tests/rfc8259_compliance.rs and tests/json_test_suite.rs continues to pass with identical error codes.
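The SIMD fast-path criterion could be met with a chunk test along these lines. This is a sketch only: `plain_ascii_prefix` and `chunk_is_plain` are hypothetical names, it uses 16-byte SSE2 vectors (part of the x86_64 baseline) instead of the AVX2/NEON paths mentioned in this issue, and other targets fall back to a scalar loop. A signed byte compare against 0x20 flags control characters and high-bit bytes together, because bytes 0x80..=0xFF are negative as `i8`.

```rust
/// Length of the leading run of "plain" bytes (0x20..=0x7F, no backslash),
/// rounded down to a multiple of 16. The caller resumes the byte-level
/// state machine at the returned offset.
pub fn plain_ascii_prefix(span: &[u8]) -> usize {
    let mut i = 0;
    while i + 16 <= span.len() {
        if !chunk_is_plain(&span[i..i + 16]) {
            break;
        }
        i += 16;
    }
    i
}

#[cfg(target_arch = "x86_64")]
fn chunk_is_plain(chunk: &[u8]) -> bool {
    use std::arch::x86_64::*;
    assert_eq!(chunk.len(), 16);
    // SAFETY: SSE2 is always available on x86_64, and the unaligned load
    // reads exactly the 16 bytes whose presence was asserted above.
    unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        // Signed compare: true for 0x00..=0x1F and for 0x80..=0xFF.
        let control_or_high = _mm_cmplt_epi8(v, _mm_set1_epi8(0x20));
        let backslash = _mm_cmpeq_epi8(v, _mm_set1_epi8(b'\\' as i8));
        _mm_movemask_epi8(_mm_or_si128(control_or_high, backslash)) == 0
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn chunk_is_plain(chunk: &[u8]) -> bool {
    // Scalar fallback with the same semantics as the SSE2 version.
    chunk.iter().all(|&b| (0x20..0x80).contains(&b) && b != b'\\')
}
```

The trailing bytes (fewer than 16) and any chunk that fails the test are left to the slow path, so the fast path never has to reason about escapes or multi-byte sequences.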

Out of scope

  • Optimizing scalar gap dispatch (check_gap / validate_number). That's a secondary cost and a separate concern.
  • Merging the eager pass into the SIMD scanner itself. Doable but requires touching the AVX2/NEON code paths and the crosscheck proptest; tackle only if the in-pass optimization is insufficient.
  • Grammar-aware structural validation (tracked in #39, "RFC 8259: grammar-aware structural validation pass (close gaps from #37)").

Bench reference

PR #38 description has the before/after table. Key data points (parse + access 3 fields):

| Scenario | main (ops/s) | PR (ops/s) | Slowdown |
|---|---|---|---|
| 1m | 18,657 | 392 | 48x |
| 100k | 116,009 | 4,004 | 29x |
| medium 60K | 137,061 | 8,795 | 16x |
| github-100k | 7,035 | 1,711 | 4x |
