feat: RFC 8259 validation audit (closes #37)#38
Conversation
Switch the fragment specifier from :path to :ident so the variant name can be used in a qjd_err:: path, and replace the pattern arm with a runtime guard (if e == expected) to avoid the binding-vs-pattern ambiguity. Add macro_rejects_wrong_error_code as a regression canary.
Add five nested mod blocks (structural / whitespace / literals / strings /
numbers) to tests/rfc8259_compliance.rs with 76 tests (73 passing, 3 ignored).
Fix two gaps in eager validation:
- parse_with_options: reject empty / whitespace-only input (RFC 8259 §2 requires
a value; both EAGER and LAZY now return QJD_PARSE_ERROR).
- validate_scalars_in_gaps: track prev/next structural context in check_gap so
that an empty gap after ':' or ',' (when not followed by a value-starter like
'"', '{', '[') is rejected as QJD_PARSE_ERROR. Catches {"a":}, [,], [1,],
and {\"a\":1,} without a full grammar-aware walk.
Three tests are marked #[ignore] with issue #37 references for cases that
require a grammar-aware pass: missing-colon ({\"a\"}), leading-comma-with-value
([,1]), and missing-comma-in-object ({\"a\":1\"b\":2}).
Add JSONTestSuite as a git submodule at tests/vendor/JSONTestSuite and introduce tests/json_test_suite.rs which walks every y_*, n_*, and i_* file: y_ files must parse in both modes, n_ files must fail eager parse, i_ files are logged but not asserted. While running the walker, two real validator gaps were discovered and fixed (both < 20 lines each): - validate_trailing: used the last structural char in the whole buffer as the root-end marker, causing [][], ["a":true]"x" etc. to slip through as if they had no trailing content. Fixed by walking indices to find the first depth-0 container close (or the first root string's close). - validate_string_span: validated UTF-8 and control chars but did not check escape sequences, so \a, \x00, \uZZZZ, dangling \ etc. were accepted. Added a one-pass walker that validates every backslash escape against the RFC 8259 §7 grammar. The three unit tests in decode/string.rs that expected QJD_DECODE_FAILED for bad escapes now expect QJD_INVALID_STRING because validate_string_span (called first by decode_string) catches them before the decode loop does. 13 n_* files remain in KNOWN_N_FAILURES: all require a grammar-aware pass to enforce token-ordering rules (non-string keys, comma-vs-colon placement, missing commas between items). Each entry is annotated with the follow-up reference (issue #37). Walker results: y_* 95/95 pass, n_* 175/188 pass (13 whitelisted), i_* 35 informational verdicts printed.
…alidation-rfc8259
Replace the 3-pass string validator (control-char check + std::str::from_utf8 + byte-by-byte escape grammar walk) with a single-pass state machine, fronted by an ASCII-only SIMD fast path that bulk-skips chunks of pure printable ASCII bytes. The previous implementation walked every interior byte three times, which made eager validation 10-48x slower than the lazy baseline on parse+access benchmarks. The single-pass scalar walker combines all three checks; the fast path adds AVX2 (32B chunks) and NEON (16B chunks) skips for the common case where strings contain no escapes, no UTF-8 multi-bytes, and no control characters. Strict UTF-8 per RFC 3629: rejects overlong encodings (C0/C1, E0 with A0-BF only, F0 with 90-BF only), surrogates (ED A0-BF), and out-of-range leads (F5-FF). Matches std::str::from_utf8 for the corpus the project already covers. Module structure: src/validate/strings/mod.rs dispatcher + tests src/validate/strings/scalar.rs pure-Rust state machine src/validate/strings/avx2.rs x86_64 AVX2 ASCII skip src/validate/strings/neon.rs aarch64 NEON ASCII skip All 8 baseline unit tests are preserved verbatim. 16 new tests cover SIMD chunk-boundary cases (UTF-8 straddling, backslash at boundary, long ASCII runs), truncated \uXXXX, dangling backslash, unknown escape introducers, overlong/surrogate UTF-8, and lone continuation bytes. Bench delta (quickdecode.parse + access 3 fields, median ops/s): 100k: 4,004 -> 61,881 (15.5x) 1m: 392 -> 7,075 (18.0x) github-100k: 1,711 -> 1,897 (1.1x; mostly non-ASCII)
Replace the two-pass heuristic (string-span loop + scalar-gap walker
with `:`/`,` empty-gap detection) with a single grammar-aware state
machine that walks `indices` once.
The machine tracks the expected next-token kind in each container
context via a stack (Top/TopDone, ArrAfter{Open,Value,Comma},
ObjAfter{Open,Key,Colon,Value,Comma}). String tokens and structural
characters are validated against the state; scalar tokens living in
the byte gap before the next structural are dispatched through the
same true/false/null/number precedence the previous `check_gap`
used, so existing tests keep their current error codes.
Closes the 3 ignored cases in tests/rfc8259_compliance::structural
(missing_colon, leading_comma_array_with_value, missing_comma_in_object)
and drops all 13 entries from KNOWN_N_FAILURES in
tests/json_test_suite — every grammar-only n_* case in JSONTestSuite
is now correctly rejected.
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Free Run ID: 📒 Files selected for processing (25)
📝 WalkthroughWalkthroughThis PR implements RFC 8259 strict-mode validation with configurable EAGER/LAZY parsing. It extends the error enum with six new codes, adds post-scan validators for depth/trailing/grammar, introduces SIMD-accelerated string validation, updates the parsing flow to conditionally enforce eager checks, and wires the options through FFI and Lua bindings. Comprehensive compliance tests and JSONTestSuite integration validate conformance. ChangesRFC 8259 Compliance with Configurable Parsing Modes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Note 🎁 Summarized by CodeRabbit FreeYour organization has reached its limit of developer seats under the Pro Plan. For new users, CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please add seats to your subscription by visiting https://app.coderabbit.ai/login.If you believe this is a mistake and have available seats, please assign one to the pull request author through the subscription management page using the link above. Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces an RFC 8259 validation layer (eager-by-default) on top of the existing structural scanner, adds an options-bearing parse API across Rust/C/Lua, and brings in comprehensive compliance testing (including JSONTestSuite via submodule).
Changes:
- Add
Options { mode, max_depth }plumbing end-to-end (Rust API,qjd_parse_exC ABI, Lua wrapper options) and enforce max depth in both modes. - Implement new post-scan validators for depth, trailing content (eager), grammar-aware structural/value validation (eager), number ABNF, and string-content/UTF-8 validation.
- Add extensive RFC 8259 and JSONTestSuite-driven test coverage; update CI to fetch submodules.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/rfc8259_compliance.rs | Adds RFC 8259 conformance tests for eager/lazy behavior. |
| tests/lua/options_spec.lua | Adds Lua tests for qd.parse(..., opts) validation and behavior. |
| tests/json_test_suite.rs | Adds JSONTestSuite corpus walker tests (y_/n_/i_ categories). |
| tests/ffi_options_smoke.rs | Adds smoke tests for qjd_parse_ex and options ABI behavior. |
| src/validate/strings/scalar.rs | Implements scalar string-span validator (control/escapes/UTF-8). |
| src/validate/strings/avx2.rs | Adds AVX2 ASCII fast path for string validation. |
| src/validate/strings/neon.rs | Adds NEON ASCII fast path for string validation. |
| src/validate/strings/mod.rs | Adds dispatch to best string validator via OnceCell. |
| src/validate/number.rs | Adds strict RFC 8259 number-format validator. |
| src/validate/mod.rs | Adds depth, trailing-content, and eager grammar/value validators. |
| src/options.rs | Introduces FFI-stable Options struct and constants. |
| src/lib.rs | Exposes doc and adds options / validate modules. |
| src/doc.rs | Adds Document::parse_with_options and eager-vs-lazy validation hooks. |
| src/ffi.rs | Adds qjd_parse_ex, updates qjd_parse delegation, extends strerror table. |
| src/error.rs | Extends error enum and strerror mapping with new validation codes. |
| src/decode/string.rs | Validates string spans at decode/access time (lazy correctness). |
| src/decode/number.rs | Validates number ABNF at decode/access time; adds f64 overflow mapping. |
| lua/quickdecode.lua | Adds Lua-visible parse options + error code table export. |
| include/lua_quick_decode.h | Adds new error codes, options struct, and qjd_parse_ex declaration. |
| README.md | Documents eager/lazy modes, test suites, and C/Lua usage. |
| docs/rfc8259-conformance.md | Adds documentation for implementation-defined JSONTestSuite behaviors. |
| CLAUDE.md | Updates architecture docs for new two-phase validation behavior. |
| .gitmodules | Adds JSONTestSuite as a git submodule under tests/. |
| .github/workflows/ci.yml | Updates CI checkout to fetch submodules recursively. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ### Known gaps | ||
|
|
||
| Three structural-grammar checks are deferred to a follow-up — they require a grammar-aware walk beyond the current heuristic. See `tests/rfc8259_compliance.rs` for the specific `#[ignore]`d cases, and `tests/json_test_suite.rs::KNOWN_N_FAILURES` for the corresponding JSONTestSuite files. |
|
|
||
| | File pattern | Our verdict | Rationale | | ||
| |---|---|---| | ||
| | `i_number_huge_exp` | REJECT (`QJD_NUMBER_OUT_OF_RANGE`) | f64 overflow surfaces at decode. | |
| local max_depth = opts.max_depth or 0 | ||
| if type(max_depth) ~= "number" or max_depth < 0 or max_depth ~= math.floor(max_depth) then | ||
| error("quickdecode.parse: opts.max_depth must be a non-negative integer") | ||
| end | ||
| opts_box[0].mode = lazy and MODE_LAZY or MODE_EAGER | ||
| opts_box[0].max_depth = max_depth | ||
| ptr = C.qjd_parse_ex(json_str, #json_str, opts_box, err_box) |
| pub fn parse_with_options( | ||
| buf: &'a [u8], | ||
| opts: &crate::options::Options, | ||
| ) -> Result<Self, qjd_err> { | ||
| // RFC 8259 §2: "A JSON text is a serialized value." | ||
| // Empty input and whitespace-only input contain no value. | ||
| if buf.iter().all(|&b| matches!(b, b' ' | b'\t' | b'\n' | b'\r')) { | ||
| return Err(qjd_err::QJD_PARSE_ERROR); | ||
| } | ||
|
|
||
| let max_depth = opts.effective_max_depth(); | ||
| let mut indices = Vec::new(); | ||
| crate::scan::scan(buf, &mut indices).map_err(|_| qjd_err::QJD_PARSE_ERROR)?; | ||
| // Sentinel simplifies boundary checks during Phase 2. | ||
| indices.push(u32::MAX); | ||
|
|
||
| crate::validate::validate_depth(buf, &indices, max_depth)?; | ||
|
|
||
| if opts.is_eager() { | ||
| crate::validate::validate_trailing(buf, &indices)?; | ||
| crate::validate::validate_eager_values(buf, &indices)?; | ||
| } |
| qjd_err::QJD_TRAILING_CONTENT => "trailing content after root value", | ||
| qjd_err::QJD_NUMBER_OUT_OF_RANGE => "number out of representable range", | ||
| qjd_err::QJD_INVALID_NUMBER => "invalid number format (RFC 8259)", | ||
| qjd_err::QJD_INVALID_STRING => "invalid string content (unescaped control char)", |
| 10 => b"trailing content after root value\0", | ||
| 11 => b"number out of representable range\0", | ||
| 12 => b"invalid number format (RFC 8259)\0", | ||
| 13 => b"invalid string content (unescaped control char)\0", |
Summary
Implements the full 7-phase RFC 8259 validation audit from #37 as a single PR.
qjd_parse_ex(buf, len, opts*, err*)FFI symbol carrying aqjd_options { mode, max_depth }struct. Oldqjd_parsedelegates with default options. Default behavior is now eager validation — strict-mode parse fails on any RFC 8259 violation, as required by the API-gateway use case. Lazy mode preserves the historical structural-only behavior via{ lazy = true }.QJD_NESTING_TOO_DEEP,QJD_TRAILING_CONTENT,QJD_NUMBER_OUT_OF_RANGE,QJD_INVALID_NUMBER,QJD_INVALID_STRING,QJD_INVALID_UTF8), kept in three-way sync acrosssrc/error.rs,include/lua_quick_decode.h, andlua/quickdecode.lua.src/validate/module — depth (both modes), trailing content (eager), number ABNF, string content (control chars), UTF-8, and escape grammar (all eager). SIMD/scalar scanners and the cross-check proptest are untouched.tests/rfc8259_compliance.rs— 76 cases organized by RFC section, with cross-mode helper macros.tests/json_test_suite.rs+ JSONTestSuite git submodule — walks 318 industry test files (95 y_* accepted, 175/188 n_* rejected with 13 whitelisted, 35 i_* logged).qd.parse(json, { lazy = true, max_depth = N })with type-checking onopts.lazyandopts.max_depth.Lua API change
Known gaps (deferred to follow-up)
Three RFC compliance cases require a grammar-aware structural walk beyond the current heuristic:
{"a"}— missing colon between key and value[,1]— leading comma followed by a value in an array{"a":1"b":2}— missing comma between key-value pairsTracked via `#[ignore = "..."]` in `tests/rfc8259_compliance.rs` and the matching JSONTestSuite cases in `KNOWN_N_FAILURES` (13 files) in `tests/json_test_suite.rs`. See `docs/rfc8259-conformance.md`.
Test plan
Submodule
Adds `tests/vendor/JSONTestSuite` pointing to https://github.com/nst/JSONTestSuite at commit `1ef36fa`. CI runners need `git submodule update --init --recursive` to fetch it.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Tests