feat: RFC 8259 validation audit (closes #37) #38

Merged
nic-6443 merged 22 commits into main from worktree-audit-json-validation-rfc8259 on May 18, 2026

membphis (Collaborator) commented May 17, 2026

Summary

Implements the full 7-phase RFC 8259 validation audit from #37 as a single PR.

  • New qjd_parse_ex(buf, len, opts*, err*) FFI symbol carrying a qjd_options { mode, max_depth } struct. Old qjd_parse delegates with default options. Default behavior is now eager validation — strict-mode parse fails on any RFC 8259 violation, as required by the API-gateway use case. Lazy mode preserves the historical structural-only behavior via { lazy = true }.
  • 6 new error codes (QJD_NESTING_TOO_DEEP, QJD_TRAILING_CONTENT, QJD_NUMBER_OUT_OF_RANGE, QJD_INVALID_NUMBER, QJD_INVALID_STRING, QJD_INVALID_UTF8), kept in three-way sync across src/error.rs, include/lua_quick_decode.h, and lua/quickdecode.lua.
  • Value-level validation in a new src/validate/ module — depth (both modes), trailing content (eager), number ABNF, string content (control chars), UTF-8, and escape grammar (all eager). SIMD/scalar scanners and the cross-check proptest are untouched.
  • tests/rfc8259_compliance.rs — 76 cases organized by RFC section, with cross-mode helper macros.
  • tests/json_test_suite.rs + JSONTestSuite git submodule — walks 318 industry test files (95 y_* accepted, 175/188 n_* rejected with 13 whitelisted, 35 i_* logged).
  • Lua wrapper: qd.parse(json, { lazy = true, max_depth = N }) with type-checking on opts.lazy and opts.max_depth.
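
The number ABNF check listed above could look roughly like the following. This is a minimal sketch of the RFC 8259 §6 grammar (`number = [ "-" ] int [ frac ] [ exp ]`), not the code in src/validate/number.rs; the function name `is_valid_json_number` is hypothetical.

```rust
/// Sketch: returns true if `s` is exactly one RFC 8259 number token.
/// The function name is hypothetical, chosen for illustration only.
pub fn is_valid_json_number(s: &[u8]) -> bool {
    let at = |i: usize| s.get(i).copied();
    let is_digit = |b: Option<u8>| matches!(b, Some(b'0'..=b'9'));
    let mut i = 0;

    // Optional minus sign; a leading '+' is not part of the grammar.
    if at(i) == Some(b'-') {
        i += 1;
    }

    // int = "0" / ( digit1-9 *DIGIT ), so leading zeros are rejected.
    match at(i) {
        Some(b'0') => i += 1,
        Some(b'1'..=b'9') => {
            while is_digit(at(i)) {
                i += 1;
            }
        }
        _ => return false,
    }

    // frac = "." 1*DIGIT
    if at(i) == Some(b'.') {
        i += 1;
        let start = i;
        while is_digit(at(i)) {
            i += 1;
        }
        if i == start {
            return false;
        }
    }

    // exp = ( "e" / "E" ) [ "-" / "+" ] 1*DIGIT
    if matches!(at(i), Some(b'e' | b'E')) {
        i += 1;
        if matches!(at(i), Some(b'+' | b'-')) {
            i += 1;
        }
        let start = i;
        while is_digit(at(i)) {
            i += 1;
        }
        if i == start {
            return false;
        }
    }

    // Anything left over means the token is not a pure number.
    i == s.len()
}
```

Inputs such as `01`, `1.`, `.5`, `+1`, and `1e` are rejected by this sketch, while `-0`, `1E+5`, and `-12.50e-3` are accepted.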

Lua API change

local doc = qd.parse(json)                            -- eager (default)
local doc = qd.parse(json, { lazy = true })           -- lazy mode
local doc = qd.parse(json, { max_depth = 256 })       -- stricter depth limit
local doc = qd.parse(json, { lazy = true, max_depth = 256 })
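
On the Rust side, the option combinations above plausibly map to a two-field struct. The sketch below is an illustration only: the names `Options`, `is_eager`, `effective_max_depth`, `MODE_EAGER`, and `MODE_LAZY` appear in this PR, but the numeric constant values and the rule that `max_depth = 0` selects the default of 1024 are assumptions.

```rust
// Assumed numeric values; the real constants live in src/options.rs.
pub const MODE_EAGER: u32 = 0;
pub const MODE_LAZY: u32 = 1;
pub const DEFAULT_MAX_DEPTH: u32 = 1024;

/// FFI-stable layout so the same struct can be filled in from C or LuaJIT.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Options {
    pub mode: u32,
    pub max_depth: u32,
}

impl Default for Options {
    /// Eager validation with the default depth limit.
    fn default() -> Self {
        Options { mode: MODE_EAGER, max_depth: 0 }
    }
}

impl Options {
    /// Every mode other than lazy is treated as eager in this sketch.
    pub fn is_eager(&self) -> bool {
        self.mode != MODE_LAZY
    }

    /// Assumption: 0 means "use the default limit".
    pub fn effective_max_depth(&self) -> u32 {
        if self.max_depth == 0 {
            DEFAULT_MAX_DEPTH
        } else {
            self.max_depth
        }
    }
}
```

Treating 0 as "unset" lets a caller zero-initialize the struct and still get the documented defaults.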

Known gaps (deferred to follow-up)

Three RFC compliance cases require a grammar-aware structural walk beyond the current heuristic:

  • {"a"} — missing colon between key and value
  • [,1] — leading comma followed by a value in an array
  • {"a":1"b":2} — missing comma between key-value pairs

Tracked via `#[ignore = "..."]` in `tests/rfc8259_compliance.rs` and the matching JSONTestSuite cases in `KNOWN_N_FAILURES` (13 files) in `tests/json_test_suite.rs`. See `docs/rfc8259-conformance.md`.

Test plan

  • `cargo test --release` (default features, AVX2 enabled when supported)
  • `cargo test --release --no-default-features` (scalar-only gate)
  • `cargo test --features test-panic --release` (FFI panic barrier)
  • `LD_LIBRARY_PATH=target/release busted --lua=$(command -v luajit) tests/lua` (86 Lua tests)
  • `cargo test --release --test rfc8259_compliance` (73 pass / 3 ignored)
  • `cargo test --release --test json_test_suite` (3 walker tests pass; 13 KNOWN_N_FAILURES skipped)
  • `make lint` (clippy `-D warnings` clean)
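
For context on the `test-panic` run: an FFI panic barrier typically wraps the entry point in `std::panic::catch_unwind` so that a Rust panic never unwinds into the C or Lua caller. A minimal sketch, assuming hypothetical names (`guarded_call`, `QJD_OK`, `QJD_OOM`) and made-up numeric values; the review summary on this PR describes the real boundary as panic-to-OOM.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical error codes for illustration; the real values are in src/error.rs.
pub const QJD_OK: i32 = 0;
pub const QJD_OOM: i32 = 2;

/// Runs `f` and converts any panic into an error code instead of unwinding
/// across the FFI boundary. Requires the default `panic = "unwind"` strategy.
pub fn guarded_call<F: FnOnce() -> i32>(f: F) -> i32 {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(code) => code,
        Err(_) => QJD_OOM,
    }
}
```

An `extern "C"` entry point would call `guarded_call` around its body and write the returned code to the caller's error slot.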

Submodule

Adds `tests/vendor/JSONTestSuite` pointing to https://github.com/nst/JSONTestSuite at commit `1ef36fa`. CI runners need `git submodule update --init --recursive` to fetch it.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable parsing modes (eager validation vs lazy deferred validation)
    • Added configurable maximum nesting depth limit
    • Expanded error codes for more granular error reporting
  • Bug Fixes

    • Improved validation for JSON numbers, strings, and UTF-8 encoding
    • Added detection and rejection of trailing non-whitespace content
    • More precise error messages for parsing failures
  • Documentation

    • Added RFC 8259 conformance guide with test coverage details
    • Documented strict vs lenient parsing modes
    • Added known validation gaps reference
  • Tests

    • Integrated JSONTestSuite for comprehensive compliance testing
    • Added RFC 8259 compliance test suite

membphis added 20 commits May 17, 2026 15:29
Switch the fragment specifier from :path to :ident so the variant name
can be used in a qjd_err:: path, and replace the pattern arm with a
runtime guard (if e == expected) to avoid the binding-vs-pattern
ambiguity. Add macro_rejects_wrong_error_code as a regression canary.
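
The macro change described in this commit can be illustrated as follows. This is a sketch, not the macro from tests/rfc8259_compliance.rs: the name `assert_err` and the two-variant enum are hypothetical stand-ins. An `:ident` fragment can be spliced after `qjd_err::`, and comparing with a guard avoids a bare identifier in pattern position being read as a fresh binding that matches anything.

```rust
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum qjd_err {
    QJD_PARSE_ERROR,
    QJD_INVALID_NUMBER,
}

/// Hypothetical helper: asserts that `$result` is `Err` with exactly the
/// named variant of `qjd_err`.
macro_rules! assert_err {
    ($result:expr, $variant:ident) => {{
        // `$variant` is an identifier, so it can complete the enum path.
        let expected = qjd_err::$variant;
        match $result {
            // Runtime comparison instead of a pattern on the variant.
            Err(e) if e == expected => {}
            other => panic!("expected Err({:?}), got {:?}", expected, other),
        }
    }};
}
```

A canary in the spirit of macro_rejects_wrong_error_code would pass a result carrying one code, name a different variant, and check that the macro panics.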

Add five nested mod blocks (structural / whitespace / literals / strings /
numbers) to tests/rfc8259_compliance.rs with 76 tests (73 passing, 3 ignored).

Fix two gaps in eager validation:
- parse_with_options: reject empty / whitespace-only input (RFC 8259 §2 requires
  a value; both EAGER and LAZY now return QJD_PARSE_ERROR).
- validate_scalars_in_gaps: track prev/next structural context in check_gap so
  that an empty gap after ':' or ',' (when not followed by a value-starter like
  '"', '{', '[') is rejected as QJD_PARSE_ERROR. Catches {"a":}, [,], [1,],
  and {"a":1,} without a full grammar-aware walk.

Three tests are marked #[ignore] with issue #37 references for cases that
require a grammar-aware pass: missing-colon ({"a"}), leading-comma-with-value
([,1]), and missing-comma-in-object ({"a":1"b":2}).

Add JSONTestSuite as a git submodule at tests/vendor/JSONTestSuite and
introduce tests/json_test_suite.rs which walks every y_*, n_*, and i_*
file: y_ files must parse in both modes, n_ files must fail eager parse,
i_ files are logged but not asserted.
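
The per-prefix policy above can be sketched as a small classifier. The names `Expectation`, `classify`, and `check_eager` are hypothetical and the real walker in tests/json_test_suite.rs may be organized differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Expectation {
    MustAccept,    // y_* files
    MustReject,    // n_* files
    Informational, // i_* files: implementation-defined, only logged
}

/// Maps a JSONTestSuite file name to what the walker expects of it.
pub fn classify(file_name: &str) -> Option<Expectation> {
    match file_name.get(..2)? {
        "y_" => Some(Expectation::MustAccept),
        "n_" => Some(Expectation::MustReject),
        "i_" => Some(Expectation::Informational),
        _ => None,
    }
}

/// Returns an error message when an eager-parse outcome contradicts the
/// expectation; informational and unrecognized files never fail.
pub fn check_eager(file_name: &str, parsed_ok: bool) -> Result<(), String> {
    match (classify(file_name), parsed_ok) {
        (Some(Expectation::MustAccept), false) => {
            Err(format!("{file_name}: valid JSON was rejected"))
        }
        (Some(Expectation::MustReject), true) => {
            Err(format!("{file_name}: invalid JSON was accepted"))
        }
        _ => Ok(()),
    }
}
```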

While running the walker, two real validator gaps were discovered and fixed
(both < 20 lines each):

- validate_trailing: used the last structural char in the whole buffer as
  the root-end marker, causing [][], ["a":true]"x" etc. to slip through
  as if they had no trailing content.  Fixed by walking indices to find
  the first depth-0 container close (or the first root string's close).

- validate_string_span: validated UTF-8 and control chars but did not
  check escape sequences, so \a, \x00, \uZZZZ, dangling \ etc. were
  accepted.  Added a one-pass walker that validates every backslash escape
  against the RFC 8259 §7 grammar.
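
The validate_trailing fix can be illustrated with a byte-level sketch. It is an assumption-laden simplification: it scans raw bytes rather than the scanner's `indices`, it only handles roots that are containers or strings, and the names `root_end` and `has_trailing_content` are hypothetical.

```rust
/// Offset just past the root value, found as the first container close that
/// returns to depth 0, or the close of a root-level string.
pub fn root_end(buf: &[u8]) -> Option<usize> {
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in buf.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false; // byte after a backslash is never a terminator
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
                if depth == 0 {
                    return Some(i + 1);
                }
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' | b'[' => depth += 1,
            b'}' | b']' => {
                depth = depth.checked_sub(1)?; // stray close: no root found
                if depth == 0 {
                    return Some(i + 1);
                }
            }
            _ => {}
        }
    }
    None
}

/// True when anything other than JSON whitespace follows the root value.
pub fn has_trailing_content(buf: &[u8]) -> bool {
    match root_end(buf) {
        Some(end) => buf[end..]
            .iter()
            .any(|b| !matches!(b, b' ' | b'\t' | b'\n' | b'\r')),
        None => false,
    }
}
```

Because the search stops at the first return to depth 0, an input such as `[][]` reports the second `[]` as trailing content instead of treating the final `]` as the root end.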

The three unit tests in decode/string.rs that expected QJD_DECODE_FAILED
for bad escapes now expect QJD_INVALID_STRING because validate_string_span
(called first by decode_string) catches them before the decode loop does.

13 n_* files remain in KNOWN_N_FAILURES: all require a grammar-aware pass
to enforce token-ordering rules (non-string keys, comma-vs-colon placement,
missing commas between items).  Each entry is annotated with the follow-up
reference (issue #37).

Walker results: y_* 95/95 pass, n_* 175/188 pass (13 whitelisted), i_* 35
informational verdicts printed.

membphis added 2 commits May 18, 2026 00:43

Replace the 3-pass string validator (control-char check + std::str::from_utf8
+ byte-by-byte escape grammar walk) with a single-pass state machine, fronted
by an ASCII-only SIMD fast path that bulk-skips chunks of pure printable
ASCII bytes.

The previous implementation walked every interior byte three times, which
made eager validation 10-48x slower than the lazy baseline on parse+access
benchmarks. The single-pass scalar walker combines all three checks; the
fast path adds AVX2 (32B chunks) and NEON (16B chunks) skips for the
common case where strings contain no escapes, no UTF-8 multi-bytes, and
no control characters.
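
The fast-path idea can be shown without intrinsics. The sketch below is a scalar stand-in, not the AVX2 or NEON code in this commit: it skips whole 16-byte chunks that contain nothing the state machine cares about and stops at the first chunk that does. The names `is_plain` and `skip_plain_ascii` and the exact definition of an "interesting" byte are assumptions.

```rust
/// Bytes the fast path may skip: ASCII at or above 0x20, except backslash.
/// Control characters, escapes, and non-ASCII bytes are left to the
/// scalar state machine.
fn is_plain(b: u8) -> bool {
    (0x20..=0x7F).contains(&b) && b != b'\\'
}

/// Returns how many leading bytes can be skipped, always a multiple of the
/// chunk size. The caller resumes scalar validation at that offset.
pub fn skip_plain_ascii(bytes: &[u8]) -> usize {
    const CHUNK: usize = 16; // NEON-sized chunk; AVX2 would use 32
    let mut skipped = 0;
    for chunk in bytes.chunks_exact(CHUNK) {
        if !chunk.iter().copied().all(is_plain) {
            break;
        }
        skipped += CHUNK;
    }
    skipped
}
```

A SIMD version would replace the per-chunk `all` with a few vector comparisons and a mask test, but the contract stays the same: skip only chunks that are entirely plain.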

Strict UTF-8 per RFC 3629: rejects overlong encodings (lead bytes C0/C1
are always invalid; E0 accepts only A0-BF as its second byte; F0 accepts
only 90-BF), surrogates (ED followed by A0-BF), and out-of-range leads
(F5-FF). Matches std::str::from_utf8 for the corpus the project already
covers.
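
The byte ranges above follow the well-formed sequence table in RFC 3629 §4. A minimal sketch of that table, with hypothetical function names (`utf8_seq_len`, `is_strict_utf8`); it checks UTF-8 shape only and leaves control characters and escapes to other checks.

```rust
/// Length of the well-formed UTF-8 sequence at the start of `b`, or None
/// if the bytes are truncated, overlong, a surrogate, or out of range.
pub fn utf8_seq_len(b: &[u8]) -> Option<usize> {
    // True when byte `i` exists and lies in lo..=hi.
    let in_range =
        |i: usize, lo: u8, hi: u8| b.get(i).map_or(false, |&x| lo <= x && x <= hi);
    // Ordinary continuation byte (80-BF).
    let cont = |i: usize| in_range(i, 0x80, 0xBF);

    match *b.first()? {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => cont(1).then_some(2),
        // E0 must be followed by A0-BF, otherwise the encoding is overlong.
        0xE0 => (in_range(1, 0xA0, 0xBF) && cont(2)).then_some(3),
        // ED followed by A0-BF would encode a UTF-16 surrogate.
        0xED => (in_range(1, 0x80, 0x9F) && cont(2)).then_some(3),
        0xE1..=0xEF => (cont(1) && cont(2)).then_some(3),
        // F0 must be followed by 90-BF, otherwise the encoding is overlong.
        0xF0 => (in_range(1, 0x90, 0xBF) && cont(2) && cont(3)).then_some(4),
        0xF1..=0xF3 => (cont(1) && cont(2) && cont(3)).then_some(4),
        // F4 is capped at 8F so the code point stays at or below U+10FFFF.
        0xF4 => (in_range(1, 0x80, 0x8F) && cont(2) && cont(3)).then_some(4),
        // 80-BF (lone continuation), C0/C1 (overlong), F5-FF (out of range).
        _ => None,
    }
}

/// True when the whole slice is well-formed UTF-8.
pub fn is_strict_utf8(mut b: &[u8]) -> bool {
    while !b.is_empty() {
        match utf8_seq_len(b) {
            Some(n) => b = &b[n..],
            None => return false,
        }
    }
    true
}
```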

Module structure:
  src/validate/strings/mod.rs     dispatcher + tests
  src/validate/strings/scalar.rs  pure-Rust state machine
  src/validate/strings/avx2.rs    x86_64 AVX2 ASCII skip
  src/validate/strings/neon.rs    aarch64 NEON ASCII skip

All 8 baseline unit tests are preserved verbatim. 16 new tests cover SIMD
chunk-boundary cases (UTF-8 straddling, backslash at boundary, long ASCII
runs), truncated \uXXXX, dangling backslash, unknown escape introducers,
overlong/surrogate UTF-8, and lone continuation bytes.

Bench delta (quickdecode.parse + access 3 fields, median ops/s):
  100k:        4,004 ->  61,881  (15.5x)
  1m:            392 ->   7,075  (18.0x)
  github-100k: 1,711 ->   1,897  (1.1x; mostly non-ASCII)

Replace the two-pass heuristic (string-span loop + scalar-gap walker
with `:`/`,` empty-gap detection) with a single grammar-aware state
machine that walks `indices` once.

The machine tracks the expected next-token kind in each container
context via a stack (Top/TopDone, ArrAfter{Open,Value,Comma},
ObjAfter{Open,Key,Colon,Value,Comma}). String tokens and structural
characters are validated against the state; scalar tokens living in
the byte gap before the next structural are dispatched through the
same true/false/null/number precedence the previous `check_gap`
used, so existing tests keep their current error codes.

Closes the 3 ignored cases in tests/rfc8259_compliance::structural
(missing_colon, leading_comma_array_with_value, missing_comma_in_object)
and drops all 13 entries from KNOWN_N_FAILURES in
tests/json_test_suite — every grammar-only n_* case in JSONTestSuite
is now correctly rejected.
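
The state machine described in this commit can be sketched at the token level. The state names mirror the commit message; everything else is an assumption for illustration: the `Tok` enum, the `grammar_ok` function, and the choice to consume pre-classified tokens instead of walking `indices` and byte gaps as the real validator does.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tok { ObjOpen, ObjClose, ArrOpen, ArrClose, Colon, Comma, Str, Scalar }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum St {
    Top, TopDone,
    ArrAfterOpen, ArrAfterValue, ArrAfterComma,
    ObjAfterOpen, ObjAfterKey, ObjAfterColon, ObjAfterValue, ObjAfterComma,
}

/// State of the current context once a value has been completed in it.
fn after_value(st: St) -> St {
    match st {
        St::Top => St::TopDone,
        St::ArrAfterOpen | St::ArrAfterComma => St::ArrAfterValue,
        St::ObjAfterColon => St::ObjAfterValue,
        other => other, // not reached for value-expecting states
    }
}

/// True when the token sequence follows the RFC 8259 structural grammar.
pub fn grammar_ok(toks: &[Tok]) -> bool {
    // For each open container: the state its parent resumes in on close.
    let mut stack: Vec<St> = Vec::new();
    let mut st = St::Top;
    for &t in toks {
        let wants_value = matches!(
            st,
            St::Top | St::ArrAfterOpen | St::ArrAfterComma | St::ObjAfterColon
        );
        st = match (st, t) {
            (_, Tok::Str | Tok::Scalar) if wants_value => after_value(st),
            (_, Tok::ArrOpen) if wants_value => {
                stack.push(after_value(st));
                St::ArrAfterOpen
            }
            (_, Tok::ObjOpen) if wants_value => {
                stack.push(after_value(st));
                St::ObjAfterOpen
            }
            (St::ArrAfterOpen | St::ArrAfterValue, Tok::ArrClose)
            | (St::ObjAfterOpen | St::ObjAfterValue, Tok::ObjClose) => {
                match stack.pop() {
                    Some(parent) => parent,
                    None => return false,
                }
            }
            (St::ArrAfterValue, Tok::Comma) => St::ArrAfterComma,
            (St::ObjAfterValue, Tok::Comma) => St::ObjAfterComma,
            // Object keys must be strings.
            (St::ObjAfterOpen | St::ObjAfterComma, Tok::Str) => St::ObjAfterKey,
            (St::ObjAfterKey, Tok::Colon) => St::ObjAfterColon,
            _ => return false,
        };
    }
    st == St::TopDone && stack.is_empty()
}
```

Under this sketch the three previously ignored shapes fail at a specific transition: `{"a"}` has no edge from ObjAfterKey on a close, `[,1]` has no edge from ArrAfterOpen on a comma, and `{"a":1"b":2}` has no edge from ObjAfterValue on a string.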
Copilot AI review requested due to automatic review settings May 18, 2026 01:03

coderabbitai Bot commented May 18, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 9de1556 and d0999de.

📒 Files selected for processing (25)
  • .github/workflows/ci.yml
  • .gitmodules
  • CLAUDE.md
  • README.md
  • docs/rfc8259-conformance.md
  • include/lua_quick_decode.h
  • lua/quickdecode.lua
  • src/decode/number.rs
  • src/decode/string.rs
  • src/doc.rs
  • src/error.rs
  • src/ffi.rs
  • src/lib.rs
  • src/options.rs
  • src/validate/mod.rs
  • src/validate/number.rs
  • src/validate/strings/avx2.rs
  • src/validate/strings/mod.rs
  • src/validate/strings/neon.rs
  • src/validate/strings/scalar.rs
  • tests/ffi_options_smoke.rs
  • tests/json_test_suite.rs
  • tests/lua/options_spec.lua
  • tests/rfc8259_compliance.rs
  • tests/vendor/JSONTestSuite

Walkthrough

This PR implements RFC 8259 strict-mode validation with configurable EAGER/LAZY parsing. It extends the error enum with six new codes, adds post-scan validators for depth/trailing/grammar, introduces SIMD-accelerated string validation, updates the parsing flow to conditionally enforce eager checks, and wires the options through FFI and Lua bindings. Comprehensive compliance tests and JSONTestSuite integration validate conformance.

Changes

RFC 8259 Compliance with Configurable Parsing Modes

| Layer / File(s) | Summary |
|---|---|
| Error Codes and Options API (include/lua_quick_decode.h, src/error.rs, src/options.rs) | Error enum extended with six new codes (NESTING_TOO_DEEP through INVALID_UTF8). Options struct and mode/depth constants added with FFI-stable #[repr(C)] layout and default eager mode with 1024 max depth. |
| Validation Implementations (src/validate/number.rs, src/validate/strings/scalar.rs, src/validate/strings/avx2.rs, src/validate/strings/neon.rs, src/validate/strings/mod.rs) | RFC 8259 number grammar validator. Scalar string validator rejecting control chars, invalid escapes, and UTF-8 violations. AVX2 (x86_64) and NEON (aarch64) SIMD accelerators batch-scan for interesting bytes then delegate to scalar; runtime AVX2 detection with OnceCell dispatch. Comprehensive edge-case tests for SIMD boundaries, overlong encodings, surrogates, and truncations. |
| Post-Scan Validation Orchestration (src/validate/mod.rs) | Validates nesting depth using post-scan indices vector with u32::MAX sentinel. Detects trailing non-whitespace content after logical root. Eager-only grammar validator using context stack to enforce structural transitions, object key types, colon/comma placement, string spans, and scalar literals. 126 lines of test coverage. |
| Decode Integration (src/decode/number.rs, src/decode/string.rs) | Number decoders call validate_number upfront; parse_f64 maps non-finite results to NUMBER_OUT_OF_RANGE. String decoder calls validate_string_span before escape handling. Test expectations updated to expect INVALID_NUMBER and INVALID_STRING codes. |
| Core Parsing Flow (src/lib.rs, src/doc.rs) | Document::parse delegates to parse_with_options using default Options. New parse_with_options rejects empty/whitespace-only input, builds indices, validates depth per options, conditionally runs eager validators (trailing, grammar) when not lazy. Library exports options and doc modules, declares validate module. |
| FFI and Lua API (src/ffi.rs, include/lua_quick_decode.h, lua/quickdecode.lua) | New qjd_parse_ex accepts Options pointer, implements panic-to-OOM boundary, validates NULL buffer only when err_out provided. qjd_parse refactored to delegate with defaults. FFI strerror extended for new codes. Lua bindings add ERR table export and parse(json_str, opts) supporting lazy boolean and max_depth integer. |
| Compliance Testing (tests/rfc8259_compliance.rs, tests/json_test_suite.rs, tests/ffi_options_smoke.rs, tests/lua/options_spec.lua) | RFC 8259 corpus with 714 lines of test cases covering depth limits, trailing content diffs between modes, number/string grammar, and error code validation. JSONTestSuite integration with y_/n_/i_ file filtering and skip lists. FFI smoke tests for qjd_parse_ex. Lua options validation tests. |
| Documentation and CI (.github/workflows/ci.yml, .gitmodules, README.md, CLAUDE.md, docs/rfc8259-conformance.md) | CI workflow enables recursive submodule fetch. JSONTestSuite submodule added. README documents strict vs lenient modes with Lua/C examples. CLAUDE.md expanded with phase 1 validation semantics. New conformance doc records i_* implementation-defined verdict table. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Copilot AI left a comment

Pull request overview

This PR introduces an RFC 8259 validation layer (eager-by-default) on top of the existing structural scanner, adds an options-bearing parse API across Rust/C/Lua, and brings in comprehensive compliance testing (including JSONTestSuite via submodule).

Changes:

  • Add Options { mode, max_depth } plumbing end-to-end (Rust API, qjd_parse_ex C ABI, Lua wrapper options) and enforce max depth in both modes.
  • Implement new post-scan validators for depth, trailing content (eager), grammar-aware structural/value validation (eager), number ABNF, and string-content/UTF-8 validation.
  • Add extensive RFC 8259 and JSONTestSuite-driven test coverage; update CI to fetch submodules.
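
The depth check that applies in both modes can be pictured as a single pass over the scanner output. A minimal sketch, assuming that `indices` holds byte offsets of structural characters outside strings and ends with a `u32::MAX` sentinel (as the src/doc.rs snippet in this PR suggests); the function name `exceeds_max_depth` is hypothetical.

```rust
/// True when nesting goes deeper than `max_depth` containers.
/// Assumes `indices` contains only structural positions found outside
/// strings, terminated by a u32::MAX sentinel.
pub fn exceeds_max_depth(buf: &[u8], indices: &[u32], max_depth: u32) -> bool {
    let mut depth: u32 = 0;
    for &idx in indices {
        if idx == u32::MAX {
            break; // sentinel marks the end of real entries
        }
        match buf.get(idx as usize).copied() {
            Some(b'{') | Some(b'[') => {
                depth += 1;
                if depth > max_depth {
                    return true;
                }
            }
            // Unbalanced closes are a grammar problem, not a depth problem.
            Some(b'}') | Some(b']') => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    false
}
```

Because the pass touches only structural positions, its cost scales with the number of structural characters, not the input length, which is presumably why it is cheap enough to run in lazy mode as well.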

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| tests/rfc8259_compliance.rs | Adds RFC 8259 conformance tests for eager/lazy behavior. |
| tests/lua/options_spec.lua | Adds Lua tests for qd.parse(..., opts) validation and behavior. |
| tests/json_test_suite.rs | Adds JSONTestSuite corpus walker tests (y_/n_/i_ categories). |
| tests/ffi_options_smoke.rs | Adds smoke tests for qjd_parse_ex and options ABI behavior. |
| src/validate/strings/scalar.rs | Implements scalar string-span validator (control/escapes/UTF-8). |
| src/validate/strings/avx2.rs | Adds AVX2 ASCII fast path for string validation. |
| src/validate/strings/neon.rs | Adds NEON ASCII fast path for string validation. |
| src/validate/strings/mod.rs | Adds dispatch to best string validator via OnceCell. |
| src/validate/number.rs | Adds strict RFC 8259 number-format validator. |
| src/validate/mod.rs | Adds depth, trailing-content, and eager grammar/value validators. |
| src/options.rs | Introduces FFI-stable Options struct and constants. |
| src/lib.rs | Exposes doc and adds options / validate modules. |
| src/doc.rs | Adds Document::parse_with_options and eager-vs-lazy validation hooks. |
| src/ffi.rs | Adds qjd_parse_ex, updates qjd_parse delegation, extends strerror table. |
| src/error.rs | Extends error enum and strerror mapping with new validation codes. |
| src/decode/string.rs | Validates string spans at decode/access time (lazy correctness). |
| src/decode/number.rs | Validates number ABNF at decode/access time; adds f64 overflow mapping. |
| lua/quickdecode.lua | Adds Lua-visible parse options + error code table export. |
| include/lua_quick_decode.h | Adds new error codes, options struct, and qjd_parse_ex declaration. |
| README.md | Documents eager/lazy modes, test suites, and C/Lua usage. |
| docs/rfc8259-conformance.md | Adds documentation for implementation-defined JSONTestSuite behaviors. |
| CLAUDE.md | Updates architecture docs for new two-phase validation behavior. |
| .gitmodules | Adds JSONTestSuite as a git submodule under tests/. |
| .github/workflows/ci.yml | Updates CI checkout to fetch submodules recursively. |

Comment thread README.md
Comment on lines +150 to +152
### Known gaps

Three structural-grammar checks are deferred to a follow-up — they require a grammar-aware walk beyond the current heuristic. See `tests/rfc8259_compliance.rs` for the specific `#[ignore]`d cases, and `tests/json_test_suite.rs::KNOWN_N_FAILURES` for the corresponding JSONTestSuite files.

| File pattern | Our verdict | Rationale |
|---|---|---|
| `i_number_huge_exp` | REJECT (`QJD_NUMBER_OUT_OF_RANGE`) | f64 overflow surfaces at decode. |
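
The verdict in this row depends on how Rust parses oversized floats: `str::parse::<f64>` returns `Ok(inf)` for a literal such as `1e400` instead of an error, so overflow has to be detected by checking finiteness. A minimal sketch, with a hypothetical function name and string labels standing in for the real `qjd_err` codes:

```rust
/// Parses a number token as f64 and reports overflow separately from
/// malformed input. The error labels are placeholders for qjd_err codes.
pub fn parse_f64_checked(text: &str) -> Result<f64, &'static str> {
    match text.parse::<f64>() {
        Ok(v) if v.is_finite() => Ok(v),
        // Overflow shows up as an infinite value, not as a parse error.
        Ok(_) => Err("QJD_NUMBER_OUT_OF_RANGE"),
        Err(_) => Err("QJD_INVALID_NUMBER"),
    }
}
```

Rust's float parser also accepts spellings that JSON forbids (for example `inf` or `.5`), so a check like this only makes sense after the token has passed the RFC 8259 number grammar.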
Comment thread lua/quickdecode.lua
Comment on lines +110 to +116
local max_depth = opts.max_depth or 0
if type(max_depth) ~= "number" or max_depth < 0 or max_depth ~= math.floor(max_depth) then
  error("quickdecode.parse: opts.max_depth must be a non-negative integer")
end
opts_box[0].mode = lazy and MODE_LAZY or MODE_EAGER
opts_box[0].max_depth = max_depth
ptr = C.qjd_parse_ex(json_str, #json_str, opts_box, err_box)
Comment thread src/doc.rs
Comment on lines +18 to +38
pub fn parse_with_options(
    buf: &'a [u8],
    opts: &crate::options::Options,
) -> Result<Self, qjd_err> {
    // RFC 8259 §2: "A JSON text is a serialized value."
    // Empty input and whitespace-only input contain no value.
    if buf.iter().all(|&b| matches!(b, b' ' | b'\t' | b'\n' | b'\r')) {
        return Err(qjd_err::QJD_PARSE_ERROR);
    }

    let max_depth = opts.effective_max_depth();
    let mut indices = Vec::new();
    crate::scan::scan(buf, &mut indices).map_err(|_| qjd_err::QJD_PARSE_ERROR)?;
    // Sentinel simplifies boundary checks during Phase 2.
    indices.push(u32::MAX);

    crate::validate::validate_depth(buf, &indices, max_depth)?;

    if opts.is_eager() {
        crate::validate::validate_trailing(buf, &indices)?;
        crate::validate::validate_eager_values(buf, &indices)?;
    }
Comment thread src/error.rs
qjd_err::QJD_TRAILING_CONTENT => "trailing content after root value",
qjd_err::QJD_NUMBER_OUT_OF_RANGE => "number out of representable range",
qjd_err::QJD_INVALID_NUMBER => "invalid number format (RFC 8259)",
qjd_err::QJD_INVALID_STRING => "invalid string content (unescaped control char)",
Comment thread src/ffi.rs
10 => b"trailing content after root value\0",
11 => b"number out of representable range\0",
12 => b"invalid number format (RFC 8259)\0",
13 => b"invalid string content (unescaped control char)\0",
@nic-6443 nic-6443 merged commit 062a1cd into main May 18, 2026
8 checks passed
@nic-6443 nic-6443 deleted the worktree-audit-json-validation-rfc8259 branch May 18, 2026 01:32