diff --git a/README.md b/README.md index 3e785e7..b2fdd3d 100644 --- a/README.md +++ b/README.md @@ -75,3 +75,4 @@ Items intentionally pushed out of the first implementation. Each will be picked - **AVX-512 scanner backend** — 64-byte → 128-byte chunks. On the 1 MB string-heavy bench, profile shows scan throughput is L3-bandwidth-bound, so realistic win is ~1.5–1.8×, not a clean 2×; larger wins need fixtures that fit in L1/L2. Needs `avx512bw` + `vpclmulqdq` (Sapphire Rapids, Zen 4+). - **`cargo fmt --check` not enforced** — `make lint` runs clippy only. The codebase uses intentional manual column alignment in struct definitions and compact single-line literals that default rustfmt would reflow. Skip rather than reformat until a project-wide style decision is made. - **`validate_brackets` fusion into scan emit loop** — surfaced by profiling: on structurally-dense workloads `validate_brackets` is 65% of parse time (second linear pass over emitted indices). Folding bracket pairing into the scan emit loop via an inline depth stack eliminates that pass. No effect on the current string-heavy bench (0.3% there); a win for config / JSONL / table-shape JSON. +- **`memchr2` cross-chunk jump for very long string interiors** — the AVX2 in-string fast probe (issue #5) drops per-chunk cost from ~25 to ~10 ops but still pays ALU work for every 64-byte chunk in a string. A `memchr2(b'"', b'\\')` jump can approach memory bandwidth on multi-MB single-string payloads. Deferred until a workload that benefits clearly emerges; needs careful `bs_carry` reasoning across the jump. 
diff --git a/docs/superpowers/specs/2026-05-15-avx2-memchr-string-skip-design.md b/docs/superpowers/specs/2026-05-15-avx2-memchr-string-skip-design.md new file mode 100644 index 0000000..d125cc9 --- /dev/null +++ b/docs/superpowers/specs/2026-05-15-avx2-memchr-string-skip-design.md @@ -0,0 +1,195 @@ +# AVX2 scanner: cheaper in-string fast path + +**Status**: design approved, ready for implementation plan +**Issue**: [#5 perf(scan): memchr-based fast path for in-string content](https://github.com/membphis/lua-quick-decode/issues/5) +**Touches**: `src/scan/avx2.rs`, `benches/lua_bench.lua`, `README.md` (Roadmap / Deferred) + +## Problem + +The AVX2 scanner's current in-string fast path (`src/scan/avx2.rs:34-43`, added in PR #3) detects when a 64-byte chunk lies fully inside a string and skips the structural-mask + PCLMUL prefix-XOR work. The condition is `in_string != 0 && real_quote == 0`, which still requires computing both the backslash mask and the escape mask before it can fire. + +Per-chunk cost when the current fast path *fires*: + +- 2 × `loadu` (free, needed for any path) +- `backslash` byte mask: ~6 ops +- `quote` byte mask: ~6 ops +- `find_escape_mask_with_carry`: ~10 scalar ALU ops + several branches +- final `real_quote == 0` test + +≈ 25 ops per "skip" chunk. On string-heavy payloads — e.g. a multimodal-shaped JSON whose `data` field is ~10 MB of base64 — ~95% of chunks hit this path, making it the dominant scanner cost. + +## Goal + +Lower per-chunk cost on string-interior chunks from ~25 ops to ~10 ops, by replacing the current fast-path *condition* with a cheaper probe that detects "chunk has no `"` and no `\`" directly, before computing the escape mask. + +Estimated speedup on a 10 MB string-heavy payload: ~3× scan-phase throughput (op-count analysis; the implementation will validate via `make bench` against a synthetic fixture). + +This proposal is the chunk-granularity step (Option 1 in brainstorming). 
Cross-chunk `memchr2` jumps for very long string interiors are deferred (see Roadmap / Deferred). + +## Non-goals + +- Touching the scalar scanner (`src/scan/scalar.rs`). The hot path for the targeted workloads is the AVX2 backend. +- Changing validation semantics. Every byte still gets scanned for well-formedness; bracket balance still validated at end. +- Adding a new cargo feature. The change rides on the existing `avx2` feature. +- Cross-chunk jumps (`memchr2` jump path). Deferred — see Roadmap / Deferred. + +## Design + +### Code change + +Single file: `src/scan/avx2.rs::scan_avx2_impl`. The chunk loop body becomes: + +```rust +while i + 64 <= buf.len() { + let chunk_lo = _mm256_loadu_si256(buf.as_ptr().add(i) as *const __m256i); + let chunk_hi = _mm256_loadu_si256(buf.as_ptr().add(i + 32) as *const __m256i); + + // in_string fast-probe: only enter when previous chunk left us inside + // a string. Cheap quote-or-backslash mask; if zero, the chunk is pure + // string interior and we can skip ALL mask computation including the + // escape-run scan. + if in_string != 0 { + let interesting = quote_or_backslash_mask(chunk_lo, chunk_hi); + if interesting == 0 { + // No `"` or `\` in chunk → no escapes can originate here, so + // bs_carry must be 0 leaving this chunk. in_string stays 1. + bs_carry = 0; + i += 64; + continue; + } + } + + // Slow path unchanged below. 
+ let backslash = byte_mask(chunk_lo, chunk_hi, b'\\'); + let quote = byte_mask(chunk_lo, chunk_hi, b'"'); + let escaped = find_escape_mask_with_carry(backslash, &mut bs_carry); + let real_quote = quote & !escaped; + + let (inside, new_in_string) = inside_string_mask(real_quote, in_string); + in_string = new_in_string; + + let struct_mask = structural_mask_chunk(chunk_lo, chunk_hi); + let final_mask = (struct_mask & !inside) | real_quote; + + emit_bits(final_mask, i as u32, out); + + i += 64; +} +``` + +The current fast-path branch (`if in_string != 0 && real_quote == 0 { i += 64; continue; }`) is **removed**. The new probe's trigger condition is a strict subset of the old one (proof in §"Correctness"), so removing the late fast path does not change the emitted output and the code reads more linearly. The only chunks that lose an early exit are in-string chunks that contain a `\` or an escaped `"` but no unescaped quote; they now take the full slow path. + +### New helper + +```rust +#[inline(always)] +unsafe fn quote_or_backslash_mask(lo: __m256i, hi: __m256i) -> u64 { + let vq = _mm256_set1_epi8(b'"' as i8); + let vb = _mm256_set1_epi8(b'\\' as i8); + let lo_or = _mm256_or_si256(_mm256_cmpeq_epi8(lo, vq), _mm256_cmpeq_epi8(lo, vb)); + let hi_or = _mm256_or_si256(_mm256_cmpeq_epi8(hi, vq), _mm256_cmpeq_epi8(hi, vb)); + let mlo = _mm256_movemask_epi8(lo_or) as u32 as u64; + let mhi = _mm256_movemask_epi8(hi_or) as u32 as u64; + mlo | (mhi << 32) +} +``` + +Matches the style of existing helpers (`byte_mask`, `structural_mask_chunk`): `#[inline(always)] unsafe fn` with no explicit `#[target_feature]` annotation — the caller `scan_avx2_impl` carries `#[target_feature(enable = "avx2,pclmulqdq")]` and inlining propagates the feature set. + +Op count: 4 `cmpeq` + 2 `or` + 2 `movemask` + 1 shift + 1 or ≈ 10 ops (8 vector, plus a scalar shift and or to combine the two halves), no branches. 
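As a cross-check for the helper above, here is a minimal portable sketch of the same mask computed one byte at a time. The names `quote_or_backslash_mask_scalar` and `probe_hit` are hypothetical and not part of this patch; the sketch assumes bit `i` of the mask corresponds to byte `i` of the 64-byte chunk, which is how the `movemask` results are combined in the AVX2 version.

```rust
/// Scalar model of the probe mask (hypothetical helper, not in the patch):
/// bit i of the result is set when byte i of the chunk is `"` or `\`.
fn quote_or_backslash_mask_scalar(chunk: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'"' || b == b'\\' {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Models the probe's branch condition: skip the chunk only when the
/// scanner is already inside a string and the chunk holds no `"` or `\`.
fn probe_hit(in_string: bool, chunk: &[u8; 64]) -> bool {
    in_string && quote_or_backslash_mask_scalar(chunk) == 0
}
```

A scalar model like this could serve as an oracle in a unit test that compares it against `quote_or_backslash_mask` on random chunks, in the same spirit as the existing parity checks against the scalar scanner.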
+ +### Op-count comparison + +| chunk shape | current path | new path | delta | +|---|---|---|---| +| not in_string | full mask path (~25 ops, no fast path) | unchanged | 0 | +| in_string, chunk pure string interior | ~25 ops (current fast path) | ~10 ops (new probe) | **−60%** | +| in_string, chunk has `\` or `"` | ~25 ops slow path | ~10 ops probe + ~25 slow = ~35 | +40% | + +Net effect on a 10 MB base64-style payload (~95% pure-interior chunks): probe-hit case dominates. By the table's counts the average cost drops from ~25 ops to at most 0.95 × 10 + 0.05 × 35 ≈ 11 ops per chunk (about 2.2×), so the expected ~3× scan throughput holds only if the removed scalar ALU ops and branches cost more than the vector ops that replace them; the bench will confirm. Mixed payloads with frequent escapes inside strings see a smaller win or slight regression on the in-string-with-escapes chunks; bench will measure the crossover. + +## Correctness + +The new fast path fires when `in_string == 1 ∧ chunk contains no '"' and no '\'`. We must prove that taking the branch (skip 64 bytes, set `bs_carry = 0`, keep `in_string = 1`) produces output identical to letting the slow path run. + +### (a) `bs_carry` leaves the chunk as 0 + +`bs_carry` represents whether the trailing backslash run of the current chunk has odd parity (and thus escapes byte 0 of the next chunk). With `backslash == 0`: + +- `trailing_bs = 0` in `find_escape_mask_with_carry` +- Falls into the `else` branch: `new_carry = 0 & 1 = 0` + +So slow-path `bs_carry` after this chunk is 0, regardless of incoming `bs_carry`. Setting it to 0 explicitly is equivalent. + +### (b) `in_string` stays 1 + +With `real_quote == 0` (which follows from `quote == 0`), `inside_string_mask` computes: + +- `q = 0`, prefix-XOR via `_mm_clmulepi64_si128` = 0 +- If `prev_in_string != 0`, `mask = !0 = u64::MAX` +- `new_state = (u64::MAX >> 63) & 1 = 1` + +Slow path leaves `in_string = 1`. Explicit retention is equivalent. + +### (c) No structural offsets are emitted for this chunk + +Slow path: `final_mask = (struct_mask & !inside) | real_quote`. With the whole chunk inside the string (`inside = u64::MAX`) and `real_quote = 0`, `final_mask = 0`. Zero offsets emitted. 
Skipping the chunk emits nothing. Equivalent. + +### (d) New condition is strictly narrower than current fast path + +Current condition `in_string != 0 ∧ real_quote == 0` fires when `quote & !escaped == 0`. New condition fires when `quote == 0 ∧ backslash == 0`. The new condition gives `quote == 0`, hence `real_quote == 0`, so any chunk hit by the new path was also hit by the current fast path. The reverse is not true: a chunk that contains backslashes but no quotes, or one with `quote != 0` where every quote bit is escaped (preceded by an odd backslash run), hits the current fast path but not the new one. Those chunks now go through the slow path: correctness is unchanged, but they pay the probe plus the structural-mask and PCLMUL prefix-XOR work that the old fast path skipped. + +### Edge cases + +| scenario | behavior | +|---|---| +| Entering chunk with `bs_carry == 1`, chunk byte 0 is `\` | `backslash != 0` → probe miss → slow path → `pc=1` handled by `find_escape_mask_with_carry` as before | +| Entering chunk with `bs_carry == 1`, chunk has no `"` or `\` | Probe hit → `bs_carry := 0`, equivalent to slow path's `else` branch returning `new_carry = 0` | +| 64-aligned input ending mid-string | Unchanged — main loop exits with `i == buf.len()`, existing post-loop `if i < buf.len() ... else if in_string != 0 { return Err(buf.len()) }` still flags unterminated | +| Non-aligned tail with `bs_carry=1` from probe-hit chunk | `bs_carry = 0` after probe hit, so `scalar_start = i` (existing logic), correct | + +## Bench fixture + +`benches/lua_bench.lua` gains a synthetic "string-heavy" scenario. **Fixture is generated at run time, not committed.** + +- Top-level shape: `{"id": "...", "ts": , "data": ""}` +- `data` value: `QJD_BENCH_BIG_MB` MB (default 10) of characters drawn from `A-Za-z0-9+/`. Guaranteed no `"` or `\` in the payload. Deterministic seed for reproducibility. +- Bench reports fixture size + three-run median for: + - `lua-cjson` full parse + - `quickdecode` parse + single-field extract on `data` + +Bench is a manual `make bench` target. 
**Not a CI gate.** Its output goes into the PR description and a Performance section update in `README.md`. + +## Tests + +Rust unit tests in `src/scan/avx2.rs::tests`. The host-AVX2 guard pattern (`if !host_supports_avx2() { return; }`) is preserved. + +| test | new / modified | purpose | +|---|---|---| +| `long_string_engages_skip_fastpath` | modified | bump from ~10 KB to ≥1 MB string interior — multiple probe-hit chunks in a row | +| `long_string_with_periodic_backslash` | **new** | every ~5 chunks inject `\\n` / `\\\"` escape sequences; alternates probe-hit and slow path, asserts parity with scalar | +| `bs_carry_one_at_pure_string_chunk_boundary` | **new** | construct prior chunk ending in odd-length backslash run (`bs_carry=1`), next chunk fully pure string interior with no `"`/`\`; assert parity (verifies §(a)) | +| `escaped_quotes_remain_correct_with_fastpath` | unchanged | existing test, still passes | +| `scanner_crosscheck` (proptest, `tests/scanner_crosscheck.rs`) | unchanged | 2000-case property test; if shrinking finds a regression case, `.proptest-regressions` gets committed | + +## CI matrix + +Unchanged. No new cargo features, no new test binaries. + +1. `cargo test --release` — exercises new path (host AVX2 required) +2. `cargo test --release --no-default-features` — scalar-only, new code excluded by `#![cfg(target_arch = "x86_64")]` + feature gate +3. `cargo test --features test-panic --release` — FFI panic barrier unchanged +4. Lua busted suite under LuaJIT — unchanged + +## Roadmap / Deferred + +After landing, add to `README.md` under Roadmap / Deferred: + +> - **memchr2 jump for ≥N consecutive in-string chunks** — current chunk-per-chunk probe leaves ~10 vector ops/chunk on the table for very large string-interior runs (≥1 MB single string). A `memchr2(b'"', b'\\')` jump path can approach memory bandwidth; deferred until a workload that benefits clearly emerges. + +## Out of scope + +- Scalar scanner changes. 
+- Auto-tuning the probe threshold or making the probe optional. +- Reworking `find_escape_mask_with_carry` (its cost is paid only on slow-path chunks now). +- Cross-chunk `memchr2` jumps (Option 2 from brainstorming; tracked in Roadmap). diff --git a/src/scan/avx2.rs b/src/scan/avx2.rs index 37fb455..48fc3e0 100644 --- a/src/scan/avx2.rs +++ b/src/scan/avx2.rs @@ -26,22 +26,27 @@ unsafe fn scan_avx2_impl(buf: &[u8], out: &mut Vec) -> Result<(), usize> { let chunk_lo = _mm256_loadu_si256(buf.as_ptr().add(i) as *const __m256i); let chunk_hi = _mm256_loadu_si256(buf.as_ptr().add(i + 32) as *const __m256i); + // In-string fast-probe: when the previous chunk left us inside a + // string, check for `"` or `\` BEFORE computing the backslash / + // escape masks. If neither byte appears in the chunk, the whole + // chunk is pure string interior — skip without computing the + // ~10-op scalar `find_escape_mask_with_carry`. bs_carry must be + // 0 leaving this chunk (no backslashes in chunk → no trailing + // run); in_string stays 1 (no real quote → no polarity flip). + if in_string != 0 { + let interesting = quote_or_backslash_mask(chunk_lo, chunk_hi); + if interesting == 0 { + bs_carry = 0; + i += 64; + continue; + } + } + let backslash = byte_mask(chunk_lo, chunk_hi, b'\\'); let quote = byte_mask(chunk_lo, chunk_hi, b'"'); let escaped = find_escape_mask_with_carry(backslash, &mut bs_carry); let real_quote = quote & !escaped; - // String-skip fast path: when the previous chunk left us inside a - // string and this chunk contains no unescaped quote, the entire - // chunk is string interior. No structural chars to emit and - // in_string stays 1; bs_carry was already updated above. Skip the - // 14 cmpeq / movemask ops in structural_mask_chunk plus the PCLMUL - // prefix-XOR — the dominant cost on string-heavy payloads. 
- if in_string != 0 && real_quote == 0 { - i += 64; - continue; - } - let (inside, new_in_string) = inside_string_mask(real_quote, in_string); in_string = new_in_string; @@ -110,6 +115,21 @@ fn emit_bits(mut mask: u64, base: u32, out: &mut Vec) { } } +/// Build a u64 mask where bit i is 1 if byte i in (lo|hi) equals `"` OR `\`. +/// Used by the in-string fast-probe to detect pure string-interior chunks +/// in ~10 vector ops (4 cmpeq + 2 or + 2 movemask + shift/or), avoiding +/// the ~25-op slow path including find_escape_mask_with_carry. +#[inline(always)] +unsafe fn quote_or_backslash_mask(lo: __m256i, hi: __m256i) -> u64 { + let vq = _mm256_set1_epi8(b'"' as i8); + let vb = _mm256_set1_epi8(b'\\' as i8); + let lo_or = _mm256_or_si256(_mm256_cmpeq_epi8(lo, vq), _mm256_cmpeq_epi8(lo, vb)); + let hi_or = _mm256_or_si256(_mm256_cmpeq_epi8(hi, vq), _mm256_cmpeq_epi8(hi, vb)); + let mlo = _mm256_movemask_epi8(lo_or) as u32 as u64; + let mhi = _mm256_movemask_epi8(hi_or) as u32 as u64; + mlo | (mhi << 32) +} + /// Build a u64 mask where bit i is 1 if byte i in (lo|hi) equals `c`. #[inline(always)] unsafe fn byte_mask(lo: __m256i, hi: __m256i, c: u8) -> u64 { @@ -270,19 +290,89 @@ mod tests { /// chunks with no internal quotes. The fast-path branch must produce /// the same emitted offsets as the slow path (which the parity check /// against scalar implicitly verifies). + /// + /// Sized at ≥1 MB so thousands of consecutive probe-hit chunks exercise + /// the new in-string fast-probe path; smaller inputs would only hit a + /// few hundred chunks and miss patterns that need a long pure-interior + /// run to surface. #[test] fn long_string_engages_skip_fastpath() { if !host_supports_avx2() { return; } let mut buf = Vec::new(); buf.extend_from_slice(b"{\"k\":\""); - // ~10 KB of string interior — many chunks fully inside the string. 
- buf.resize(buf.len() + 10_000, b'a'); + // ≥1 MB of string interior — thousands of chunks fully inside the + // string, all hitting the in_string probe path. + buf.resize(buf.len() + 1_048_576, b'a'); buf.extend_from_slice(b"\"}"); // Pad to 64-aligned to also exercise the no-tail branch. while buf.len() % 64 != 0 { buf.push(b' '); } parity(&buf); } + /// Long string with periodic backslash-escape sequences. Alternates + /// probe-hit chunks (pure interior) and probe-miss chunks (containing + /// `\` or escaped `"`), so the slow path engages every few chunks + /// while the fast probe handles the rest. Parity guarantees the two + /// paths agree under the new condition. + #[test] + fn long_string_with_periodic_backslash() { + if !host_supports_avx2() { return; } + let mut buf = Vec::new(); + buf.extend_from_slice(b"{\"k\":\""); + // ~5 chunks (320 bytes) of pure interior, then an escape sequence, + // repeated. Mix `\\n` (escaped newline letter) and `\\\"` (escaped + // quote) so both backslash-only and quote-after-backslash chunks + // appear. + for i in 0..200 { + buf.resize(buf.len() + 320, b'a'); + if i % 2 == 0 { + buf.extend_from_slice(b"\\n"); + } else { + buf.extend_from_slice(b"\\\""); + } + } + buf.push(b'"'); + buf.push(b'}'); + while buf.len() % 64 != 0 { buf.push(b' '); } + parity(&buf); + } + + /// bs_carry = 1 leaving a chunk that ends in an odd-length backslash + /// run, then the next chunk is pure string interior (no `"`, no `\`). + /// Verifies that the in-string fast probe correctly resets bs_carry + /// to 0 (matching the slow path's `find_escape_mask_with_carry` else + /// branch). If the probe forgot to clear bs_carry, the third chunk's + /// byte 0 would be wrongly treated as escaped. + #[test] + fn bs_carry_one_at_pure_string_chunk_boundary() { + if !host_supports_avx2() { return; } + let mut buf = Vec::new(); + // Chunk 0 (bytes 0..64): open object, open string, then padding + // ending with exactly one trailing backslash at byte 63. 
The + // backslash is preceded only by non-backslash bytes, so the + // trailing run has length 1 (odd) → bs_carry=1 leaving chunk 0. + buf.extend_from_slice(b"{\"k\":\""); // 6 bytes + buf.resize(63, b'a'); // pad to byte 63 + buf.push(b'\\'); // byte 63: single backslash + assert_eq!(buf.len(), 64); + // Chunk 1 (bytes 64..128): byte 64 is the escape TARGET (any + // non-special byte). Then pure interior — no `"`, no `\` — for + // the rest of the chunk. This is the chunk the probe must handle + // correctly. With incoming bs_carry=1, slow path would set + // escaped[0]=1; new fast probe just clears bs_carry to 0. Both + // produce zero emitted offsets in this chunk. + buf.push(b'n'); // byte 64: escape target + buf.resize(128, b'a'); // bytes 65..128: pure interior + // Chunk 2 (bytes 128..192): another pure-interior chunk to + // confirm bs_carry stays clean across multiple probe hits. + buf.resize(192, b'a'); + // Close the string and object in chunk 3 (bytes 192..256). A stale + // bs_carry=1 would mark this closing quote as escaped and fail parity. + buf.push(b'"'); + buf.push(b'}'); + while buf.len() % 64 != 0 { buf.push(b' '); } + parity(&buf); + } + /// String contains escaped quotes — the parity output must still /// match scalar. (We cannot directly observe whether the fast path /// took the branch; parity asserts equivalence either way.)