Hello there :D
Thanks for this great lib. I ran into some issues with recent Unicode characters that are not supported by re2j's Unicode tables.
Is there a plan to update these tables?
I ran some tests with icu4j 71.1 (Unicode 14.0) and managed to fix the issues I had with the out-of-date Unicode range table definitions.
// Generated at 2022-09-29T11:43:59.963000520Z by Java 17.0.2 using Unicode version 14.0.0.0.
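To illustrate what regenerating a range table involves, here is a minimal sketch (not the actual re2j generator). It assumes plain `java.lang.Character` as the data source instead of icu4j, so that it runs with the JDK alone; the class name `RangeTableSketch`, its methods, and the `{lo, hi}` layout are hypothetical and may not match the layout re2j really uses. The ranges it prints reflect whichever Unicode version the running JDK ships.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derives contiguous code point ranges for one
// Unicode general category from the JDK's own character data.
final class RangeTableSketch {

    // Returns inclusive {lo, hi} ranges of code points whose general
    // category (as reported by Character.getType) equals `category`.
    static List<int[]> rangesFor(int category) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a range"
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean inCategory = Character.getType(cp) == category;
            if (inCategory && start < 0) {
                start = cp;                            // a new range opens here
            } else if (!inCategory && start >= 0) {
                ranges.add(new int[] {start, cp - 1}); // the range closed at cp - 1
                start = -1;
            }
        }
        if (start >= 0) {
            // The last range runs up to the final code point.
            ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return ranges;
    }

    // Linear membership test; enough for a sketch.
    static boolean contains(List<int[]> ranges, int codePoint) {
        for (int[] r : ranges) {
            if (codePoint >= r[0] && codePoint <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> upper = rangesFor(Character.UPPERCASE_LETTER);
        System.out.println("// Generated by Java " + System.getProperty("java.version"));
        // Print only the first few ranges to keep the output short.
        for (int i = 0; i < Math.min(5, upper.size()); i++) {
            System.out.printf("{0x%04X, 0x%04X},%n", upper.get(i)[0], upper.get(i)[1]);
        }
    }
}
```

With icu4j on the classpath, the same loop could presumably call `UCharacter.getType(cp)` instead of `Character.getType(cp)`, which is what would pick up the Unicode 14.0 data.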
I also ran the benchmarks for you, but I have no results from the previous re2j version to compare against, so I can't tell whether these numbers are bad or not.
Benchmark                                                 (binary)  (impl)  (regex)  (repeats)  Mode  Cnt      Score       Error  Units
BenchmarkBacktrack.matched                                     N/A     JDK      N/A          5  avgt    5      0.409  ±    0.010  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         10  avgt    5     11.181  ±    1.372  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         15  avgt    5    421.301  ±    5.390  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         20  avgt    5  14601.885  ±  585.185  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A          5  avgt    5      0.725  ±    0.162  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         10  avgt    5      2.198  ±    0.174  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         15  avgt    5      4.858  ±    0.359  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         20  avgt    5      8.456  ±    0.091  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true     JDK      N/A        N/A  avgt    5    206.578  ±    4.268  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true    RE2J      N/A        N/A  avgt    5    594.817  ±    3.601  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false     JDK      N/A        N/A  avgt    5    208.050  ±    5.570  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false    RE2J      N/A        N/A  avgt    5    622.479  ±   33.410  us/op
BenchmarkCompile.compile                                       N/A     JDK     DATE        N/A  avgt    5    606.043  ±   43.850  ns/op
BenchmarkCompile.compile                                       N/A     JDK    EMAIL        N/A  avgt    5    155.496  ±   13.520  ns/op
BenchmarkCompile.compile                                       N/A     JDK    PHONE        N/A  avgt    5    295.075  ±   15.263  ns/op
BenchmarkCompile.compile                                       N/A     JDK   RANDOM        N/A  avgt    5   1087.170  ±  115.227  ns/op
BenchmarkCompile.compile                                       N/A     JDK   SOCIAL        N/A  avgt    5    251.806  ±   28.439  ns/op
BenchmarkCompile.compile                                       N/A     JDK   STATES        N/A  avgt    5   1120.005  ±   80.479  ns/op
BenchmarkCompile.compile                                       N/A    RE2J     DATE        N/A  avgt    5   3081.459  ±  438.926  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    EMAIL        N/A  avgt    5   1056.468  ±  146.337  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    PHONE        N/A  avgt    5   1683.852  ±  173.123  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   RANDOM        N/A  avgt    5   5299.210  ±  735.038  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   SOCIAL        N/A  avgt    5   1167.235  ±   28.007  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   STATES        N/A  avgt    5   7610.245  ± 1019.406  ns/op
BenchmarkFullMatch.matched                                    true     JDK      N/A        N/A  avgt    5     77.149  ±    3.385  ns/op
BenchmarkFullMatch.matched                                    true    RE2J      N/A        N/A  avgt    5    478.138  ±   48.828  ns/op
BenchmarkFullMatch.matched                                   false     JDK      N/A        N/A  avgt    5     68.403  ±    3.378  ns/op
BenchmarkFullMatch.matched                                   false    RE2J      N/A        N/A  avgt    5    462.312  ±   86.473  ns/op
BenchmarkFullMatch.notMatched                                 true     JDK      N/A        N/A  avgt    5     57.041  ±    4.191  ns/op
BenchmarkFullMatch.notMatched                                 true    RE2J      N/A        N/A  avgt    5    411.035  ±   39.113  ns/op
BenchmarkFullMatch.notMatched                                false     JDK      N/A        N/A  avgt    5     48.511  ±    3.818  ns/op
BenchmarkFullMatch.notMatched                                false    RE2J      N/A        N/A  avgt    5    405.113  ±   54.661  ns/op
BenchmarkSplit.benchmarkSplit                                  N/A     JDK      N/A        N/A  avgt    5     10.665  ±    0.301  us/op
BenchmarkSplit.benchmarkSplit                                  N/A    RE2J      N/A        N/A  avgt    5     40.286  ±    6.797  us/op
BenchmarkSubMatch.findPhoneNumbers                            true     JDK      N/A        N/A  avgt    5      3.868  ±    0.538  ms/op
BenchmarkSubMatch.findPhoneNumbers                            true    RE2J      N/A        N/A  avgt    5     11.425  ±    0.257  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false     JDK      N/A        N/A  avgt    5      3.164  ±    0.334  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false    RE2J      N/A        N/A  avgt    5     11.907  ±    0.295  ms/op
I read https://groups.google.com/g/re2j-discuss/c/9XkVsnxngjc and I wonder whether there is a plan to move on from Unicode 6.0.
I could even try to help if you need it or already have plans for this 👍
NB: icu4j 71.1 is the latest official release, but the next version, 72 (Unicode 15.0), is planned for 8 October.
Let me know what you think,
Kidi