use a open addressing map to store folding paths#36
use a open addressing map to store folding paths#36arnaudroger wants to merge 3 commits intogoogle:masterfrom
Conversation
java/com/google/re2j/Inst.java
Outdated
| int[] folds = Unicode.optimizedFoldOrbit(r0); | ||
|
|
||
| if (folds == null) { | ||
| return (r == Unicode.optimizedFoldNonOrbit(r0)); |
There was a problem hiding this comment.
Have you benchmarked this against an implementation based on Character.toLowerCase(r) == r? Ideally with a few changes like that, we could throw away all the Unicode tables and use the ones built into java.lang.
Be sure to use the int (code point) not char (UTF-16 code) variants of those functions.
There was a problem hiding this comment.
I guess you are talking about the optimizedFoldNonOrbit? no I have not, was trying to be the least impactful as possible :) will have a look. From what I understand the Orbit exception will still need to be there ?
There was a problem hiding this comment.
Yes, the orbit case will still need special handling. I suggest you do this experiment as a separate change.
java/com/google/re2j/Unicode.java
Outdated
| static int[] optimizedFoldOrbit(int r) { | ||
| if (r >= ORBIT_MIN_VALUE) { | ||
| return FOLD_MAP.get(r); | ||
| } else return null; |
There was a problem hiding this comment.
Use {...} around every block, or outdent the else block:
if (r < ORBIT_MIN_VALUE) {
return null;
}
return FOLD_MAP.get(r);
java/com/google/re2j/Unicode.java
Outdated
| private static final int ORBIT_MIN_VALUE = UnicodeTables.CASE_ORBIT[0][0]; | ||
|
|
||
| static { | ||
| for(int i = 0; i < UnicodeTables.CASE_ORBIT.length; i++) { |
There was a problem hiding this comment.
for (int[][] orbit : UnicodeTables.CASE_ORBIT)
java/com/google/re2j/Unicode.java
Outdated
| int[] folds = new int[0]; | ||
| int r1 = r0; | ||
| while((r1 = simpleFold(r1)) != r0) { | ||
| folds = Arrays.copyOf(folds, folds.length + 1); |
There was a problem hiding this comment.
Isn't an orbit necessarily of length at least 3? I realize they are small and few, but it seems perverse to copy the entire array each time we add an element. Use a temporary large enough for the longest orbit (MAX_DISTANCE?) and copy it once before each call to FOLD_MAP.put.
java/com/google/re2j/Unicode.java
Outdated
| } | ||
|
|
||
| /* | ||
| associate a rune to it's unfolded case equivalent |
There was a problem hiding this comment.
"A FoldMap maps a rune to an array of runes that are equivalent to it in a case-insensitive pattern."
java/com/google/re2j/Unicode.java
Outdated
| return toUpper(r); | ||
| } | ||
|
|
||
| private static final FoldMap FOLD_MAP = new FoldMap(UnicodeTables.CASE_ORBIT.length * 4); |
There was a problem hiding this comment.
Whence 4? Is it a load factor based on MAX_DISTANCE?
I don't think this should be a parameter. FoldMap is instantiated exactly once; specialize it.
java/com/google/re2j/Inst.java
Outdated
| */ | ||
| class Inst { | ||
|
|
||
| private static final int RUNES_LINER_SEARCH_SIZE = 10; |
There was a problem hiding this comment.
LINEAR?
Move this constant into the sole function that uses it.
java/com/google/re2j/Inst.java
Outdated
| // Otherwise binary search. | ||
| for (int lo = 0, hi = runes.length / 2; lo < hi; ) { | ||
| // Otherwise binary search on rest of the array | ||
| for (int lo = 0, hi = (runes.length - RUNES_LINER_SEARCH_SIZE) / 2; lo < hi; ) { |
There was a problem hiding this comment.
I'm not sure it's worth the fuss to subtract the portion initially scanned. It's a logarithmic algorithm and this is not the common case. I'm sure it's fast enough to search the complete array. Did you find evidence to the contrary?
There was a problem hiding this comment.
no, will remove it
java/com/google/re2j/Inst.java
Outdated
|
|
||
| // Otherwise binary search. | ||
| for (int lo = 0, hi = runes.length / 2; lo < hi; ) { | ||
| // Otherwise binary search on rest of the array |
There was a problem hiding this comment.
While you're here, please document:
// Invariant: lo, hi, m are even.
java/com/google/re2j/Unicode.java
Outdated
| associate a rune to it's unfolded case equivalent | ||
| */ | ||
| private static class FoldMap { | ||
| private static final int MAX_DISTANCE = 4; |
There was a problem hiding this comment.
Document:
// number of closed-hashing probes before giving up
java/com/google/re2j/Unicode.java
Outdated
| private final int mask; | ||
| private final int maxDistance; | ||
|
|
||
| FoldMap(int size) { |
There was a problem hiding this comment.
FoldMap serves exactly one purpose. It doesn't need a parameter, and it doesn't need to be separated from the logic that instantiates and populates it.
|
I did check the Character.toLower/toUpper it does not affect the results, though I change the isUpper/isLower to use the Character one and it broke some test so I rolled it back. |
This optimisation came from profiling my app with real payload, the simpleFold was taking around 15% cpu and is now negligible.
I added a few test to verify that the logic is still working as expected.