Add support for Java Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}.#21
Conversation
…isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}.
|
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
|
|
Oh, one other thing - MakeJavaCategories is currently under a source root, so it will be included in the distributed artefact. If you'd prefer, I can move that under the test hierarchy somewhere. |
|
|
||
| package com.google.re2j; | ||
|
|
||
| // AUTOGENERATED by MakeJavaCategories.java - do not modify |
There was a problem hiding this comment.
You shouldn't check in autogenerated files. They should be generated as part of the build process.
There was a problem hiding this comment.
Is this the case? The existing unicode tables are autogenerated and are checked in. Generating this file during the build would probably involve writing a Maven plugin.
|
Can you explain why you don't just use Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart()? They are constant time operations. |
|
I looked for a way to do this, but it seemed like a much more complex change. There is no Regexp.Op or Machine.Frag that supports anything like this, and I was less confident of being able to make the change without breaking anything. If that is the preferred approach I can attempt that - any pointers would be appreciated. |
This change takes a brute-force approach and generates the ranges accepted by these methods by testing all the code points. MakeJavaCategories.java generates the source file containing the category ranges. I've added simple tests for the parsing and matching, let me know if you'd like to see more about any particular aspect of this. Since it reuses the existing technique used by the current Unicode character ranges and shouldn't affect any existing code paths, hopefully the change is pretty safe.
I only generated these two ranges since they're the ones I need, but it would be trivial to add the rest of the Character.is* ranges if required. I suspect they're probably covered reasonably well by the existing unicode ranges.
One thing I learned from this change -
Character.isJavaIdentifierPart(0) == true- who knew?I haven't signed a CLA - let me know if this looks ok and I'll do so.