From f73a968f239ff9d9e3d48660fbc67614a2c3f218 Mon Sep 17 00:00:00 2001
From: Yuansheng Wang
Date: Sat, 16 May 2026 19:34:36 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20roadmap=20=E2=80=94=20record=20empirica?=
 =?UTF-8?q?l=20findings=20from=20PR=20#18=20attempt?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The validate_brackets fusion entry now references the closed PR #18,
explains why the prototype showed no measurable improvement on the
string-heavy multimodal bench (per-emit buf[pos] lookup cancels the
savings from eliminating the second indices pass), and pins the revisit
condition to a structurally-dense bench fixture appearing.

Keeps the entry actionable: future contributors know the design has
been tried, what failed, and what data to gather before retrying.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b836e85..c02f9cd 100644
--- a/README.md
+++ b/README.md
@@ -132,7 +132,7 @@ Items intentionally pushed out of the first implementation. Each will be picked
 - **Adaptive `out.reserve` in scanners** — `out.reserve(buf.len() / 6)` is calibrated for object-heavy JSON. On string-heavy multimodal payloads (one big content array, mostly base64) the actual emit rate is <1 structural per 1 KB, so we over-reserve by 100x+. Mainly a memory hygiene concern (mmap'd pages stay lazily faulted), <5% throughput effect.
 - **AVX-512 scanner backend** — 64-byte → 128-byte chunks. On the 1 MB string-heavy bench, profile shows scan throughput is L3-bandwidth-bound, so realistic win is ~1.5–1.8×, not a clean 2×; larger wins need fixtures that fit in L1/L2. Needs `avx512bw` + `vpclmulqdq` (Sapphire Rapids, Zen 4+).
 - **`cargo fmt --check` not enforced** — `make lint` runs clippy only. The codebase uses intentional manual column alignment in struct definitions and compact single-line literals that default rustfmt would reflow. Skip rather than reformat until a project-wide style decision is made.
-- **`validate_brackets` fusion in SIMD scanners** — fused into `ScalarScanner` via `scan_and_validate`; AVX2 and NEON scanners still run the two-pass emit + `validate_brackets` design. Folding bracket pairing into the SIMD emit loops would require carrying a depth stack across chunks (the inline `emit_bits` loop currently has no such state). <1% effect on string-heavy workloads; worth revisiting only if profiling on structurally-dense input flags it.
+- **`validate_brackets` fusion in SIMD scanners** — fused into `ScalarScanner` via `scan_and_validate`; AVX2 and NEON scanners still run the two-pass emit + `validate_brackets` design. A working implementation was prototyped in [#18](https://github.com/membphis/lua-quick-decode/pull/18) (closed): `emit_bits_validate` carries a depth stack inline and dispatches on `buf[pos]` per emitted bit, eliminating the second pass over `indices`. Measured ±2% (within noise) on the multimodal bench because the per-emit `buf[pos]` lookup adds back roughly what the eliminated pass saved, and the structural-char density is too low for the savings to dominate. Revisit only when a structurally-dense fixture (config / JSONL / object-shape JSON with hundreds of keys per chunk) is added to the bench harness and profiles flag `validate_brackets` as the bottleneck.
 - **`memchr2` cross-chunk jump for very long string interiors** — the AVX2 in-string fast probe (issue #5) drops per-chunk cost from ~25 to ~10 ops but still pays ALU work for every 64-byte chunk in a string. A `memchr2(b'"', b'\\')` jump can approach memory bandwidth on multi-MB single-string payloads. Deferred until a workload that benefits clearly emerges; needs careful `bs_carry` reasoning across the jump.
 - **Stateful O(N) iterator FFI** — current `qd.pairs` and the `__newindex` materialization path walk the object cursor from the start on every step,