perf: Lua encode + cache + scanner + validation stack optimizations#54
Conversation
- Replace per-byte encode_string with bulk-copy segments and fast escape-free path (zero table allocations for clean strings) - Replace recursive is_dirty with O(1) _dirty flag propagated via parent chain on __newindex - Eliminate is_array pre-scan by tagging materialized tables with __qjson_type marker during materialization - Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1) cache-hit access instead of O(n) vector clone - Replace 7-cmpeq loop in AVX2 structural_mask_chunk with PSHUFB-based nibble LUT classification - Replace heap-allocated validation state stack with fixed-size [CtxKind; 64] array for typical JSON depths - Replace byte-by-byte float-detection in parse_i64 with memchr::memchr3 - Add modified-encode benchmark scenarios (modify top / add field / modify nested) to bench harness - Add correctness tests for modified-encode round-trip
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 50 minutes and 3 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughRefactors lazy JSON mutation tracking to explicit ChangesIntegrated Performance Optimization Suite
Sequence Diagram(s)No sequence diagram generated for this PR. Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Performance Comparison (aarch64, Apple M-series, NEON)modify top + encode
add field + encode
modify nested + encode
unmodified encode (memcpy fast path)
decode + access content
|
There was a problem hiding this comment.
Pull request overview
Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).
Changes:
- Lua: faster string encoding, O(1) mutation tracking via
_dirtypropagation, and table type tagging to avoid repeated array/object scans during encode. - Rust:
SkipCacheavoids per-hitVeccloning viaRc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, andparse_i64usesmemchr3.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/lua/lazy_table_spec.lua | Adds encode/mutation test coverage for new fast paths and dirty propagation. |
| lua/qjson/table.lua | Implements Lua-layer encode + mutation tracking optimizations (string escaper, _dirty, type tagging). |
| src/skip_cache.rs | Switches skip-cache child index storage to Rc<[u32]> for O(1) cache-hit cloning. |
| src/cursor.rs | Uses Rc::clone on cache hits and stores populated vectors into Rc<[u32]>. |
| src/scan/avx2.rs | Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals. |
| src/validate/mod.rs | Reworks eager validation stack to avoid heap allocation for typical depths. |
| src/decode/number.rs | Uses memchr::memchr3 for faster integer/float discrimination. |
| benches/lua_bench.lua | Adds modified-encode benchmark scenarios (modify top/add field/modify nested). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/skip_cache.rs (1)
25-27: 💤 Low valueConsider a single shared empty slice constant.
Both
new()andget_or_insert()create new emptyRc<[u32]>instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.♻️ Optional refactor using lazy_static or thread_local
+use std::cell::RefCell; + +thread_local! { + static EMPTY_SLICE: Rc<[u32]> = Rc::from([]); +} + impl SkipCache { pub(crate) fn new() -> Self { - let empty: Rc<[u32]> = Rc::from([]); Self { - slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }], + slots: vec![SkipSlot { + child_starts: EMPTY_SLICE.with(Rc::clone), + child_ends: EMPTY_SLICE.with(Rc::clone) + }], by_opener: FxHashMap::default(), } } pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) { if let Some(&slot) = self.by_opener.get(&opener_idx) { return (slot, true); } let new = self.slots.len() as u32; - let empty: Rc<[u32]> = Rc::from([]); - self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }); + self.slots.push(SkipSlot { + child_starts: EMPTY_SLICE.with(Rc::clone), + child_ends: EMPTY_SLICE.with(Rc::clone) + }); self.by_opener.insert(opener_idx, new); (new, false) }Also applies to: 39-40
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty Rc<[u32]> and use it from both new() and get_or_insert() instead of calling Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot construction and any place in get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts: Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or equivalent) to eliminate duplicate empty allocations while keeping types Rc<[u32]> consistent.tests/lua/lazy_table_spec.lua (1)
394-442: ⚡ Quick winAdd a regression test for
__qjson_typepayload-key collision.Please add a case where decoded data includes a user key named
__qjson_typeand verify encode still preserves object semantics after mutation.Example test to add
+ it("does not let user __qjson_type key alter table dispatch", function() + local cjson = require("cjson") + local t = qjson.decode('{"__qjson_type":"array","a":1}') + t.a = 2 + local out = qjson.encode(t) + local parsed = cjson.decode(out) + assert.are.equal("array", parsed.__qjson_type) + assert.are.equal(2, parsed.a) + end)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied "__qjson_type" key via qjson.decode, mutates that field (and optionally other fields or nested entries), re-encodes with qjson.encode, then cjson.decode the output and assert the resulting table preserves object semantics and the mutated "__qjson_type" value; locate the test near the existing it("...") cases and follow their pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the "__qjson_type" field and other fields remain correct after encode/decode.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.
In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.
---
Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.
In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0
📒 Files selected for processing (8)
benches/lua_bench.lualua/qjson/table.luasrc/cursor.rssrc/decode/number.rssrc/scan/avx2.rssrc/skip_cache.rssrc/validate/mod.rstests/lua/lazy_table_spec.lua
- Replace __qjson_type string key with local weak side-table (TABLE_TYPE_HINT) to prevent collision with user payload keys - Fix github_table_modify_add to mutate first element instead of array root so the added field appears in encoded output
| // After the walk, the stack must hold exactly one frame: the root | ||
| // context, which must be `TopDone` (root value consumed). | ||
| if stack.len() != 1 || stack[0] != CtxKind::TopDone { | ||
| if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone { |
There was a problem hiding this comment.
Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.
| macro_rules! push { | ||
| ($kind:expr) => { | ||
| if sp < STACK_CAP { | ||
| stack_buf[sp] = $kind; | ||
| sp += 1; | ||
| } else { | ||
| let fb = fallback.get_or_insert_with(|| { | ||
| let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16); | ||
| v.extend_from_slice(&stack_buf[..sp]); | ||
| v | ||
| }); | ||
| sp = STACK_CAP.wrapping_add(fb.len() + 1); | ||
| fb.push($kind); | ||
| } | ||
| }; | ||
| } | ||
| macro_rules! pop { | ||
| () => {{ | ||
| if sp <= STACK_CAP { | ||
| if sp == 0 { None } | ||
| else { sp -= 1; Some(stack_buf[sp]) } | ||
| } else { | ||
| let fb = fallback.as_mut().unwrap(); | ||
| let val = fb.pop(); | ||
| if fb.is_empty() { sp = STACK_CAP; } | ||
| val | ||
| } |
There was a problem hiding this comment.
The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.
| -- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON | ||
| -- keys with these names would collide (minor, deferred). Centralized here so | ||
| -- the dirty check and __newindex can share the list. | ||
| local INTERNAL_KEYS = { | ||
| _doc = true, _cur_box = true, _cur = true, _bs = true, _be = true, | ||
| _parent = true, _dirty = true, | ||
| } |
There was a problem hiding this comment.
Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.
| for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do | ||
| rawset(t, f, nil) | ||
| end | ||
| setmetatable(t, nil) | ||
| TABLE_TYPE_HINT[t] = "object" |
There was a problem hiding this comment.
Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.
| for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do | ||
| rawset(t, f, nil) | ||
| end |
There was a problem hiding this comment.
Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.
Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.
Performance Comparison: main vs PR (aarch64, Apple M-series)Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code. Key metrics (median ops/s)
Existing paths (parse, access, encode unmod)Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches. Modified-encode scenarios (small/medium multimodal payloads)
|
Scanner (src/scan/avx2.rs):
- Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier
back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant
measured -45% parse on small payloads (-7% on 1m where the
in-string fast-probe dominates anyway): VPSHUFB ymm is split into
two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq->
xor->movemask chain has a longer critical path, and VPMOVMSKB
(lat 4) plus the LUT-load constants pressure the FP ports. The
parallel cmpeq design lets independent compare chains dispatch
across multiple ports. NEON path is unaffected (separate file).
Lua layer (lua/qjson/table.lua) -- review feedback:
- Update INTERNAL_KEYS comment to reference its actual consumers
(__newindex cache snapshotting and encode_lazy_object_walking),
not the removed recursive is_dirty walk.
- Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex
and LazyArray.__newindex with seven inline rawset(t, ..., nil)
calls. Eliminates the per-call 7-elem table-literal allocation
on the materialization path. Uses rawset (not multi-assignment
to nil) because root views from _M.decode lack _parent, so
't._parent = nil' could fire __newindex recursively.
Bench (benches/lua_bench.lua):
- Force LuaJIT to evaluate qjson.encode results: replace
'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t);
if #_enc < 2 then error(...) end' in all 8 mutation/encode cases.
Without this, LuaJIT could partially DCE the encode call on some
trace shapes and produce misleading speedups.
- Raise the warmup floor from max(3, iters/5) to max(50, iters/5).
LuaJIT's default hotloop is 56; with the original floor of 3, the
1m payload (iters=15) measured pre-JIT interpreter mode for most
of the run.
- Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall
time exceeds os.clock granularity (~1ms) by a comfortable margin.
| @@ -34,7 +36,8 @@ impl SkipCache { | |||
| return (slot, true); | |||
| } | |||
| let new = self.slots.len() as u32; | |||
| self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() }); | |||
| let empty: Rc<[u32]> = Rc::from([]); | |||
| self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }); | |||
| self.by_opener.insert(opener_idx, new); | |||
| -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave | ||
| -- the bench measuring interpreter mode for the large-payload scenarios | ||
| -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise). | ||
| local warmup = math.max(50, math.floor(iters / 5)) |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582
📒 Files selected for processing (3)
benches/lua_bench.lualua/qjson/table.luasrc/scan/avx2.rs
✅ Files skipped from review due to trivial changes (1)
- src/scan/avx2.rs
…pty Rc - Collapse nested if in parse_f64 skip_validation path (clippy::collapsible_if) - Allow clippy::approx_constant on intentional test value 3.14 - Replace push loops with vec![b'['; N] in depth tests (clippy::same-item-push) - Update warmup comment in lua_bench.lua to reference current iters values - Use shared empty_rc field in SkipCache to avoid per-slot Rc::from([]) allocation
…code results Changes: - benches/lua_bench.lua: accept arg[1] as scenario name filter; when set, only that single scenario runs (backward-compatible — no arg runs all as before). - Makefile: bench target loops BENCH_SCENARIOS, launching a fresh resty process for each. Eliminates accumulated GC/JIT trace-cache interference between payload sizes. - docs/benchmarks.md: document fresh-process methodology, add modify+encode workload descriptions and throughput table. - README.md: reference modify+encode scenarios added in PR #54. Benchmark summary (x86_64 Linux, 3-round fresh-process, PR vs main): modify + encode path (median ops/s, PR main): small modify top: +41%..+75% (33k→59k) small modify nested: +44%..+80% (29k→53k) medium modify nested: +58%..+106% (92k→190k) 100k modify nested: +31%..+45% (47k→68k) interleaved mod top: +2%..+30% (5.8k→7.2k) 30/33 data points PR > main (91%). encode(unmodified) broadly flat or slightly up (+2%..+26%), consistent with JIT trace slot competition from the additional modify+encode scenarios.
Update README.md summary table and docs/benchmarks.md (throughput, speedup, memory, modify+encode) with results from a clean make bench run on the current branch.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/lua_bench.lua`:
- Around line 299-307: The filter check using local variable "filter" can
silently skip all scenarios when arg[1] is mistyped; modify the loop over
"scenarios" that uses "if filter and s.name ~= filter then goto
continue_scenario end" to set a boolean flag (e.g., "matched = true") when any
scenario is executed, and after the loop check that flag and call error/print
and os.exit(1) if no scenario matched; apply the same pattern to the second loop
that uses the same filter logic (the block using the "continue_scenario" label)
so CI fails fast when an invalid filter was supplied.
In `@README.md`:
- Line 116: Replace the non-hyphenated phrase "full size ladder" with the
hyphenated compound adjective "full-size ladder" in the README text (search for
the exact string "full size ladder" near the benchmarks link and update it to
"full-size ladder") so the phrase uses the correct hyphenation for a compound
adjective.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c7a1775e-967d-4c22-8a44-b581e7e5c3b7
📒 Files selected for processing (7)
MakefileREADME.mdbenches/lua_bench.luadocs/benchmarks.mdsrc/decode/number.rssrc/skip_cache.rssrc/validate/mod.rs
✅ Files skipped from review due to trivial changes (1)
- docs/benchmarks.md
| local filter = arg[1] | ||
|
|
||
| if not simdjson then | ||
| print("lua-resty-simdjson unavailable; skipping simdjson rows: " | ||
| .. tostring(simdjson_or_err)) | ||
| end | ||
|
|
||
| for _, s in ipairs(scenarios) do | ||
| if filter and s.name ~= filter then goto continue_scenario end |
There was a problem hiding this comment.
Fail fast when arg[1] does not match any scenario.
A typo in the filter currently exits with no benchmark rows, which is easy to misread as success in CI/manual runs. Please track whether any scenario matched and error if none did.
Suggested patch
local filter = arg[1]
+local matched = false
@@
for _, s in ipairs(scenarios) do
if filter and s.name ~= filter then goto continue_scenario end
+ matched = true
print(string.format("=== %s (%d bytes) ===", s.name, `#s.payload`))
@@
end
+
+if filter and filter ~= "interleaved" and not matched then
+ error("unknown benchmark scenario filter: " .. tostring(filter))
+endAlso applies to: 378-379
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benches/lua_bench.lua` around lines 299 - 307, The filter check using local
variable "filter" can silently skip all scenarios when arg[1] is mistyped;
modify the loop over "scenarios" that uses "if filter and s.name ~= filter then
goto continue_scenario end" to set a boolean flag (e.g., "matched = true") when
any scenario is executed, and after the loop check that flag and call
error/print and os.exit(1) if no scenario matched; apply the same pattern to the
second loop that uses the same filter logic (the block using the
"continue_scenario" label) so CI fails fast when an invalid filter was supplied.
| OpenResty environment; otherwise it skips the simdjson rows. | ||
| Modify-then-encode scenarios (PR #54) add decode → mutate field → re-encode | ||
| workloads; small payload modify+encode reaches 48k–60k ops/s. See | ||
| [`docs/benchmarks.md`](docs/benchmarks.md) for the full size ladder, |
There was a problem hiding this comment.
Use hyphenated compound adjective in README text.
“full size ladder” should be “full-size ladder” for standard technical prose.
🧰 Tools
🪛 LanguageTool
[grammar] ~116-~116: Use a hyphen to join words.
Context: ...ks.md`](docs/benchmarks.md) for the full size ladder, modify+encode results, memo...
(QB_NEW_EN_HYPHEN)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 116, Replace the non-hyphenated phrase "full size ladder"
with the hyphenated compound adjective "full-size ladder" in the README text
(search for the exact string "full size ladder" near the benchmarks link and
update it to "full-size ladder") so the phrase uses the correct hyphenation for
a compound adjective.
|
|
||
| bench: build vendor/lua-cjson/cjson.so ## Run each scenario in a fresh LuaJIT process | ||
| @for s in $(BENCH_SCENARIOS); do \ | ||
| $(LUA_ENV) $(RESTY) benches/lua_bench.lua $$s; \ |
| // 7 parallel byte-equality compares. On AMD Zen2 these dispatch across | ||
| // multiple FP ports and beat a PSHUFB-LUT nibble classifier (PSHUFB ymm | ||
| // is split into 2 micro-ops per lane, the LUT chain lengthens the | ||
| // critical path, and VPMOVMSKB has lat=4 — the 14-movemask total is | ||
| // still cheaper than the LUT path's serial dependency). PR #54 tried | ||
| // PSHUFB-LUT but measured -45% parse on small payloads on Zen2; this | ||
| // form is what shipped through #51. |
| let mut buf = vec![b'['; 1024]; | ||
| buf.extend_from_slice(&vec![b']'; 1024]); | ||
| assert!( |
…son in README Adds cjson.decode + modify top/add field/modify nested + cjson.encode benchmarks so the modify+encode path has an eager baseline. README encode table now shows cjson/qjson side-by-side for modify workloads: qjson is 10-43x faster at 60 KB+.
| for _, s in ipairs(scenarios) do | ||
| if filter and s.name ~= filter then goto continue_scenario end | ||
| print(string.format("=== %s (%d bytes) ===", s.name, #s.payload)) |
Summary
Performance optimization across the full decode → access → modify → encode pipeline.
Lua layer (Phase 2)
' .. s .. 'with zero table allocations. Escape lookup via table instead of if-elseif chain.is_dirtytree walk with a_dirtyboolean propagated up the parent chain on__newindex.__qjson_typemarker during materialization, eliminating duplicate key scan in encode.Rust layer (Phase 3)
Vec::cloneon every cache hit replaced with O(1)Rc::clone, eliminating O(n) heap allocation per repeated field access.structural_mask_chunk7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).validate_eager_valuesheap allocation replaced with[CtxKind; 64]fixed-size stack (falls back to Vec for depth > 64).parse_i64float-detection changed from byte-by-byte.any()to SIMD-acceleratedmemchr::memchr3.Benchmarks
Performance (aarch64, small/medium multimodal payloads)
Summary by CodeRabbit