RFC on unicode tables management #161

@kidi

Hello there :D

Thanks for this great lib. I ran into some issues with recent Unicode characters that are not covered by re2j's Unicode tables.

Is there a plan to update these tables?

I did some tests with icu4j 71.1 (Unicode 14.0) and managed to fix some of the issues I had with the out-of-date Unicode range table definitions.

// Generated at 2022-09-29T11:43:59.963000520Z by Java 17.0.2 using Unicode version 14.0.0.0.
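For anyone unfamiliar with how such range tables are put together, here is a minimal sketch of the idea, not the actual generator. It assumes the tables are rows of `{lo, hi, stride}` and always emits a stride of 1, which is a simplification. To stay dependency-free it takes membership from a plain predicate backed by the running JDK's `Character` data instead of icu4j, so the printed rows depend on the JDK's Unicode version. The names `RangeTableSketch` and `buildRanges` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class RangeTableSketch {
    /** Collapses every code point accepted by member into {lo, hi, stride} rows (stride is always 1 here). */
    static int[][] buildRanges(IntPredicate member) {
        List<int[]> rows = new ArrayList<>();
        int lo = -1; // start of the run being built, or -1 when outside a run
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean in = member.test(cp);
            if (in && lo < 0) {
                lo = cp; // a new run starts here
            } else if (!in && lo >= 0) {
                rows.add(new int[] {lo, cp - 1, 1}); // the run ended at the previous code point
                lo = -1;
            }
        }
        if (lo >= 0) {
            rows.add(new int[] {lo, Character.MAX_CODE_POINT, 1}); // run reaches the last code point
        }
        return rows.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        // Stand-in membership source (assumption): the Cyrillic script as known to the running JDK.
        int[][] table = buildRanges(
            cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC);
        for (int[] r : table) {
            System.out.printf("{0x%04X, 0x%04X, %d},%n", r[0], r[1], r[2]);
        }
    }
}
```

Swapping the predicate for one backed by a newer Unicode database is what changes the output; the collapsing step itself stays the same.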

I ran the benchmarks for you, but I have no baseline from the previous re2j version to compare against, so I can't tell whether these numbers are bad or not.

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                                 (binary)  (impl)  (regex)  (repeats)  Mode  Cnt      Score      Error  Units
BenchmarkBacktrack.matched                                     N/A     JDK      N/A          5  avgt    5      0.409 ±    0.010  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         10  avgt    5     11.181 ±    1.372  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         15  avgt    5    421.301 ±    5.390  us/op
BenchmarkBacktrack.matched                                     N/A     JDK      N/A         20  avgt    5  14601.885 ±  585.185  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A          5  avgt    5      0.725 ±    0.162  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         10  avgt    5      2.198 ±    0.174  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         15  avgt    5      4.858 ±    0.359  us/op
BenchmarkBacktrack.matched                                     N/A    RE2J      N/A         20  avgt    5      8.456 ±    0.091  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true     JDK      N/A        N/A  avgt    5    206.578 ±    4.268  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch      true    RE2J      N/A        N/A  avgt    5    594.817 ±    3.601  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false     JDK      N/A        N/A  avgt    5    208.050 ±    5.570  us/op
BenchmarkCaseInsensitiveSubmatch.caseInsensitiveSubMatch     false    RE2J      N/A        N/A  avgt    5    622.479 ±   33.410  us/op
BenchmarkCompile.compile                                       N/A     JDK     DATE        N/A  avgt    5    606.043 ±   43.850  ns/op
BenchmarkCompile.compile                                       N/A     JDK    EMAIL        N/A  avgt    5    155.496 ±   13.520  ns/op
BenchmarkCompile.compile                                       N/A     JDK    PHONE        N/A  avgt    5    295.075 ±   15.263  ns/op
BenchmarkCompile.compile                                       N/A     JDK   RANDOM        N/A  avgt    5   1087.170 ±  115.227  ns/op
BenchmarkCompile.compile                                       N/A     JDK   SOCIAL        N/A  avgt    5    251.806 ±   28.439  ns/op
BenchmarkCompile.compile                                       N/A     JDK   STATES        N/A  avgt    5   1120.005 ±   80.479  ns/op
BenchmarkCompile.compile                                       N/A    RE2J     DATE        N/A  avgt    5   3081.459 ±  438.926  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    EMAIL        N/A  avgt    5   1056.468 ±  146.337  ns/op
BenchmarkCompile.compile                                       N/A    RE2J    PHONE        N/A  avgt    5   1683.852 ±  173.123  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   RANDOM        N/A  avgt    5   5299.210 ±  735.038  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   SOCIAL        N/A  avgt    5   1167.235 ±   28.007  ns/op
BenchmarkCompile.compile                                       N/A    RE2J   STATES        N/A  avgt    5   7610.245 ± 1019.406  ns/op
BenchmarkFullMatch.matched                                    true     JDK      N/A        N/A  avgt    5     77.149 ±    3.385  ns/op
BenchmarkFullMatch.matched                                    true    RE2J      N/A        N/A  avgt    5    478.138 ±   48.828  ns/op
BenchmarkFullMatch.matched                                   false     JDK      N/A        N/A  avgt    5     68.403 ±    3.378  ns/op
BenchmarkFullMatch.matched                                   false    RE2J      N/A        N/A  avgt    5    462.312 ±   86.473  ns/op
BenchmarkFullMatch.notMatched                                 true     JDK      N/A        N/A  avgt    5     57.041 ±    4.191  ns/op
BenchmarkFullMatch.notMatched                                 true    RE2J      N/A        N/A  avgt    5    411.035 ±   39.113  ns/op
BenchmarkFullMatch.notMatched                                false     JDK      N/A        N/A  avgt    5     48.511 ±    3.818  ns/op
BenchmarkFullMatch.notMatched                                false    RE2J      N/A        N/A  avgt    5    405.113 ±   54.661  ns/op
BenchmarkSplit.benchmarkSplit                                  N/A     JDK      N/A        N/A  avgt    5     10.665 ±    0.301  us/op
BenchmarkSplit.benchmarkSplit                                  N/A    RE2J      N/A        N/A  avgt    5     40.286 ±    6.797  us/op
BenchmarkSubMatch.findPhoneNumbers                            true     JDK      N/A        N/A  avgt    5      3.868 ±    0.538  ms/op
BenchmarkSubMatch.findPhoneNumbers                            true    RE2J      N/A        N/A  avgt    5     11.425 ±    0.257  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false     JDK      N/A        N/A  avgt    5      3.164 ±    0.334  ms/op
BenchmarkSubMatch.findPhoneNumbers                           false    RE2J      N/A        N/A  avgt    5     11.907 ±    0.295  ms/op
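One way to read the table is to reduce each JDK/RE2J pair to a ratio. The sketch below does that for a few scores copied from the rows above; the class and method names are hypothetical. Note that this only compares RE2J against the JDK within this run. It says nothing about a regression relative to the previous tables, which would need the same benchmarks run on the unmodified re2j.

```java
class SlowdownSketch {
    /** Returns the RE2J score divided by the JDK score; values above 1 mean RE2J was slower. */
    static double ratio(double jdkScore, double re2jScore) {
        return re2jScore / jdkScore;
    }

    public static void main(String[] args) {
        // Scores copied from the table above; units match within each pair.
        System.out.printf("Backtrack, repeats=20:   %.4f%n", ratio(14601.885, 8.456));
        System.out.printf("FullMatch.matched, true: %.2f%n", ratio(77.149, 478.138));
        System.out.printf("Split:                   %.2f%n", ratio(10.665, 40.286));
    }
}
```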

I read https://groups.google.com/g/re2j-discuss/c/9XkVsnxngjc and I wonder whether there is a plan to move on from Unicode 6.0.

I could even try to help, if you need it or already have plans for this 👍

NB: icu4j 71.1 is the latest official release, but the next version, 72 (Unicode 15.0), is planned for 8 October.

Let me know what you think,

Kidi
