[Java Extension] Ported the C extension parser to Java and removed the Ragel-generated parser. #1004

Open
samyron wants to merge 1 commit into ruby:master from samyron:sm/jruby-parser-rewrite
samyron commented Jun 15, 2026

Contributor

Overview

This PR ports the current C extension parser to Java and removes the Ragel-generated parser.

Implementation Notes

  1. This uses the same frame-based parse loop as the C parser. Since Java has no goto, dispatch is done with a while/switch loop instead.
  2. JSON objects are built with RubyHash#fastASet instead of RubyHash#op_aset, because strings and symbols used as keys are explicitly frozen in this module. This gave the parser a significant performance boost. The keys are frozen in cachedKey / internedKey, which are called from parseString when isName=true.
  3. The parser includes both a SWAR and a Vector API StringScanner. The Vector API implementation is loaded at runtime if the JDK supports it, the module is available, and the ruby.json.useVectorizedParser=true JVM property is set. If the Vector API is unavailable or explicitly disabled, the SWAR implementation is used.
  4. This also uses the same rvalue_cache-style cache as a quick cache for object keys. However, since the cache is heap-allocated in Java, its size is 128 entries.
  5. The SWAR consecutive-digits decoder was ported to the Java parser.
  6. It retains the Java extension's Unicode character validation but moves the logic from the generated Ragel parser to the StringDecoder.
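
For readers unfamiliar with SWAR (SIMD within a register), here is a minimal sketch of the general technique behind a scanner like the one in note 3: read 8 bytes as one long and test all lanes at once for a quote or backslash. This is not the code from this PR. The class name SwarScanSketch and the method indexOfQuoteOrBackslash are hypothetical, and the real StringScanner presumably also handles control characters and differs in how it loads bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names are not taken from the PR.
public class SwarScanSketch {
    private static final long ONES  = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // Sets the high bit of every byte lane in v that equals b.
    // Lanes above the first match may be flagged spuriously (borrow
    // propagation), so only the lowest set lane is trusted.
    private static long matchByte(long v, byte b) {
        long x = v ^ (ONES * (b & 0xFF)); // lanes equal to b become 0x00
        return (x - ONES) & ~x & HIGHS;   // classic "has zero byte" test
    }

    // Returns the index of the first '"' or '\\' at or after 'from', or -1.
    public static int indexOfQuoteOrBackslash(byte[] buf, int from) {
        // Little-endian so that lane 0 of the long is the lowest address.
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        int i = from;
        for (; i + 8 <= buf.length; i += 8) {
            long v = bb.getLong(i);
            long hits = matchByte(v, (byte) '"') | matchByte(v, (byte) '\\');
            if (hits != 0) {
                // The lowest set high bit identifies the first matching lane.
                return i + (Long.numberOfTrailingZeros(hits) >>> 3);
            }
        }
        // Scalar tail for the last (fewer than 8) bytes.
        for (; i < buf.length; i++) {
            if (buf[i] == '"' || buf[i] == '\\') return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] json = "\"hello, swar world\" : 1".getBytes(StandardCharsets.US_ASCII);
        // Start after the opening quote and look for the closing one.
        System.out.println(indexOfQuoteOrBackslash(json, 1));
    }
}
```

The point of the trick is that the hot loop does one load and a handful of arithmetic operations per 8 bytes instead of 8 compare-and-branch steps, which is why a SWAR path is a reasonable fallback when the Vector API is not available.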

Performance

These benchmarks were run on an M1 MacBook Air using JRuby 10.0.5.0 and OpenJDK 64-Bit Server VM 24.0.1+9-30.

With the SWAR StringScanner

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   656.000 i/100ms
Calculating -------------------------------------
               after      6.719k (± 1.2%) i/s  (148.83 μs/i) -     34.112k in   5.076849s

Comparison:
before:     1351.3 i/s
 after:     6719.1 i/s - 4.97x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    69.000 i/100ms
Calculating -------------------------------------
               after    706.750 (± 1.1%) i/s    (1.41 ms/i) -      3.588k in   5.076758s

Comparison:
before:      113.1 i/s
 after:      706.8 i/s - 6.25x  faster


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    38.000 i/100ms
Calculating -------------------------------------
               after    390.007 (± 2.3%) i/s    (2.56 ms/i) -      1.976k in   5.066570s

Comparison:
before:       37.6 i/s
 after:      390.0 i/s - 10.39x  faster


== Parsing github_events.json (65130 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   744.000 i/100ms
Calculating -------------------------------------
               after      7.261k (± 5.0%) i/s  (137.72 μs/i) -     36.456k in   5.020636s

Comparison:
before:     1160.6 i/s
 after:     7261.2 i/s - 6.26x  faster


== Parsing semanticscholar-corpus.json (8493528 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after     4.000 i/100ms
Calculating -------------------------------------
               after     53.332 (± 3.8%) i/s   (18.75 ms/i) -    268.000 in   5.025161s

Comparison:
before:        8.1 i/s
 after:       53.3 i/s - 6.62x  faster

With the Vector API based StringScanner

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   671.000 i/100ms
Calculating -------------------------------------
               after      6.880k (± 1.3%) i/s  (145.34 μs/i) -     34.892k in   5.071152s

Comparison:
before:     1361.7 i/s
 after:     6880.5 i/s - 5.05x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    72.000 i/100ms
Calculating -------------------------------------
               after    736.863 (± 1.4%) i/s    (1.36 ms/i) -      3.744k in   5.080998s

Comparison:
before:      109.3 i/s
 after:      736.9 i/s - 6.74x  faster


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    38.000 i/100ms
Calculating -------------------------------------
               after    394.387 (± 1.5%) i/s    (2.54 ms/i) -      1.976k in   5.010313s

Comparison:
before:       38.8 i/s
 after:      394.4 i/s - 10.17x  faster


== Parsing github_events.json (65130 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   777.000 i/100ms
Calculating -------------------------------------
               after      7.681k (± 2.9%) i/s  (130.20 μs/i) -     38.850k in   5.058152s

Comparison:
before:     1097.3 i/s
 after:     7680.7 i/s - 7.00x  faster


== Parsing semanticscholar-corpus.json (8493528 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after     5.000 i/100ms
Calculating -------------------------------------
               after     56.260 (± 3.6%) i/s   (17.77 ms/i) -    285.000 in   5.065794s

Comparison:
before:        7.3 i/s
 after:       56.3 i/s - 7.67x  faster

samyron commented Jun 15, 2026

Contributor Author

Tagging @headius for a review on this PR.

samyron commented Jun 15, 2026

Contributor Author

Additional thought: we should probably add more JRuby versions to CI and set the Vector API system properties there, so that the VectorizedStringScanner and the VectorizedStringEncoder (from a previous PR) are thoroughly tested.
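
To make the CI point concrete, here is a rough sketch of the kind of runtime gate a CI job would have to satisfy before the vectorized path runs at all. It is an assumption about the shape of the check, not the code in this PR: the class ScannerSelectionSketch and the method chooseScanner are hypothetical, and only the property name ruby.json.useVectorizedParser comes from the description above. On current JDKs the Vector API lives in the incubator module jdk.incubator.vector, which is only resolved when the JVM is started with --add-modules jdk.incubator.vector.

```java
// Hypothetical illustration only; the PR's actual detection logic may differ.
public class ScannerSelectionSketch {
    // Property name as given in the PR description.
    static final String PROP = "ruby.json.useVectorizedParser";
    // Incubator module that provides the Vector API.
    static final String VECTOR_MODULE = "jdk.incubator.vector";

    // True only if the module was resolved into the boot layer,
    // e.g. via --add-modules jdk.incubator.vector.
    public static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule(VECTOR_MODULE).isPresent();
    }

    // Pure decision function, kept separate so it is easy to test.
    public static String chooseScanner(boolean requested, boolean modulePresent) {
        return (requested && modulePresent) ? "vector" : "swar";
    }

    public static void main(String[] args) {
        boolean requested = Boolean.getBoolean(PROP); // reads -Druby.json.useVectorizedParser=true
        System.out.println(chooseScanner(requested, vectorModulePresent()));
    }
}
```

Under these assumptions, a CI job that sets only the property but does not add the module would silently fall back to SWAR, so both settings would need to be part of the matrix entry.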

byroot requested a review from headius June 15, 2026 06:40

byroot commented Jun 15, 2026

Member

Is this the parser mentioned by @enebo in #983 (comment) or is it a concurrent effort?

samyron commented Jun 15, 2026

Contributor Author

> Is this the parser mentioned by @enebo in #983 (comment) or is it a concurrent effort?

This is a concurrent effort. I missed that comment as I didn't refer back to that issue once #989 was opened.
