-
Notifications
You must be signed in to change notification settings - Fork 48
PHP: Significant performance regression in toShort conversion due to bad regex #42
Description
In JoyPixels\Client, the $unicodeRegexp value has a mistake that causes it to match more often than needed. In the library itself, this is only used by JoyPixels\Client::toShort.
Specifically, the regex contains this portion:
...|[\x{1F1E0}-\x{1F1FF}]{2}||[\x{1F468}-\x{1F469}\x{1F9D0}-\x{1F9DF}][\x{1F3FA}-\x{1F3FF}]?\x{200D}?[\x{2640}\x{2642}\x{2695}\x{2696}\x{2708}]?\x{FE0F}?|...
What's notable is the || part, which causes the regex to match empty strings. This causes toShortCallback to be called with an empty match for each (non-emoji) character. This returns immediately, but this the match and callback overhead can be significant. With a single | this works as expected.
For example, here's a basic performance test:
$runCount = 100;
$str = str_repeat("testing \xF0\x9F\x98\x83 ", 1000);
$client = new \JoyPixels\Client();
$s = microtime(true);
for ($i = 0; $i < $runCount; $i++) {
$client->toShort($str);
}
$e = microtime(true);
echo ($e - $s) / $runCount;
This gives me the following results:
| Run type | Time per execution (s) | Total time (s) |
|---|---|---|
| Unfixed | 0.018859 | 1.88592 |
| Fixed | 0.000538 | 0.0538551 |
Though easily derivable, the difference in the total run time shows the size of the regression as we go from 1.88 seconds to run the test to 0.05 with a single character change!
This change first appeared in 5.5 (af05016). I assume this regex is auto generated, so this will presumably keep coming back unless the tool that causes the issue is fixed. Otherwise, a PR to resolve this is generally simple.
Potentially related, JoyPixels::$unicodeRegexp does not appear to have this issue, though the regex differs from the one in JoyPixels\Client. It did actually get the || in the 5.5 release as well (you can see it in the same commit), but it disappeared from there in 6.0. I'm not clear on why these two regexes are not identical, but I might be missing something.