Skip to content

PHP: Significant performance regression in toShort conversion due to bad regex  #42

@mikesul

Description

@mikesul

In JoyPixels\Client, the $unicodeRegexp value has a mistake that causes it to match more often than needed. In the library itself, this is only used by JoyPixels\Client::toShort.

Specifically, the regex contains this portion:

...|[\x{1F1E0}-\x{1F1FF}]{2}||[\x{1F468}-\x{1F469}\x{1F9D0}-\x{1F9DF}][\x{1F3FA}-\x{1F3FF}]?\x{200D}?[\x{2640}\x{2642}\x{2695}\x{2696}\x{2708}]?\x{FE0F}?|...

What's notable is the || part, which causes the regex to match empty strings. This causes toShortCallback to be called with an empty match for each (non-emoji) character. This returns immediately, but this the match and callback overhead can be significant. With a single | this works as expected.

For example, here's a basic performance test:

$runCount = 100;
$str = str_repeat("testing \xF0\x9F\x98\x83 ", 1000);
$client = new \JoyPixels\Client();

$s = microtime(true);
for ($i = 0; $i < $runCount; $i++) {
	$client->toShort($str);
}
$e = microtime(true);

echo ($e - $s) / $runCount;

This gives me the following results:

Run type Time per execution (s) Total time (s)
Unfixed 0.018859 1.88592
Fixed 0.000538 0.0538551

Though easily derivable, the difference in the total run time shows the size of the regression as we go from 1.88 seconds to run the test to 0.05 with a single character change!

This change first appeared in 5.5 (af05016). I assume this regex is auto generated, so this will presumably keep coming back unless the tool that causes the issue is fixed. Otherwise, a PR to resolve this is generally simple.

Potentially related, JoyPixels::$unicodeRegexp does not appear to have this issue, though the regex differs from the one in JoyPixels\Client. It did actually get the || in the 5.5 release as well (you can see it in the same commit), but it disappeared from there in 6.0. I'm not clear on why these two regexes are not identical, but I might be missing something.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions