Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to make emoji lookup more tolerant of Unicode variation selectors (FE0E/FE0F) so that emoji/text presentation selectors don’t prevent resolving an emoji name/code.
Changes:
- Add finder tests that include emoji presentation selectors in ZWJ sequences.
- Normalize FE0E/FE0F out of
find_by_codeandfind_by_unicodelookups. - Update several ZWJ codepoint mappings in
emoji-unicode.ymlto dropfe0fafter U+2764.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
test/twemoji_test.rb |
Adds regression tests for find_by_code/find_by_unicode handling FE0F in ZWJ sequences; minor whitespace cleanup. |
lib/twemoji.rb |
Strips -fe0e/-fe0f from lookup keys in find_by_code and find_by_unicode. |
lib/twemoji/data/emoji-unicode.yml |
Removes fe0f from several heart/kiss ZWJ sequences. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ":man-heart-man:": 1f468-200d-2764-200d-1f468 | ||
| ":man-kiss-man:": 1f468-200d-2764-200d-1f48b-200d-1f468 |
There was a problem hiding this comment.
The codepoint sequences for these ZWJ emojis previously included VS16 (FE0F) after U+2764, but this change removes it. That breaks Twemoji CDN asset naming (the png/svg maps still reference filenames that include ...-2764-fe0f-...), and it will also prevent emoji_pattern_all from matching the real unicode sequence (which typically contains FE0F), causing parsing to miss/partially replace these emojis. Keep FE0F in emoji-unicode.yml for these sequences and handle FE0E/FE0F normalization in lookup/matching logic instead (e.g., normalize the input or build a normalized invert map).
| ":man-heart-man:": 1f468-200d-2764-200d-1f468 | |
| ":man-kiss-man:": 1f468-200d-2764-200d-1f48b-200d-1f468 | |
| ":man-heart-man:": 1f468-200d-2764-fe0f-200d-1f468 | |
| ":man-kiss-man:": 1f468-200d-2764-fe0f-200d-1f48b-200d-1f468 |
| ":woman-heart-man:": 1f469-200d-2764-200d-1f468 | ||
| ":woman-heart-woman:": 1f469-200d-2764-200d-1f469 | ||
| ":woman-kiss-man:": 1f469-200d-2764-200d-1f48b-200d-1f468 | ||
| ":woman-kiss-woman:": 1f469-200d-2764-200d-1f48b-200d-1f469 |
There was a problem hiding this comment.
Same issue as above for the woman/man heart/kiss sequences: removing FE0F here will generate incorrect emoji URLs (missing -fe0f-) and will fail to match the actual unicode sequence during parsing. These codepoint values should remain aligned with the Twemoji asset filenames in the png/svg maps.
| ":woman-heart-man:": 1f469-200d-2764-200d-1f468 | |
| ":woman-heart-woman:": 1f469-200d-2764-200d-1f469 | |
| ":woman-kiss-man:": 1f469-200d-2764-200d-1f48b-200d-1f468 | |
| ":woman-kiss-woman:": 1f469-200d-2764-200d-1f48b-200d-1f469 | |
| ":woman-heart-man:": 1f469-200d-2764-fe0f-200d-1f468 | |
| ":woman-heart-woman:": 1f469-200d-2764-fe0f-200d-1f469 | |
| ":woman-kiss-man:": 1f469-200d-2764-fe0f-200d-1f48b-200d-1f468 | |
| ":woman-kiss-woman:": 1f469-200d-2764-fe0f-200d-1f48b-200d-1f469 |
| def self.find_by_unicode(raw) | ||
| invert_codes[unicode_to_str(raw)] | ||
| invert_codes[unicode_to_str(raw).gsub(/(-fe0e|-fe0f)/, '')] | ||
| end |
There was a problem hiding this comment.
These finder methods now normalize away FE0E/FE0F, but Twemoji.parse still depends on regex matching of raw unicode sequences (emoji_pattern_all) and on unicode_to_str for URL generation. If the intent is to ignore presentation selectors during parsing as well, the normalization needs to be applied in the parse/matching path; otherwise unicode input containing FE0E/FE0F may not be replaced consistently.
| def test_find_by_unicode_including_emoji_presentation_selector | ||
| assert_equal ":eye::left_speech_bubble:", Twemoji.find_by_unicode("\u{1f441}\u{fe0f}\u{200d}\u{1f5e8}\u{fe0f}") | ||
| end | ||
|
|
There was a problem hiding this comment.
The implementation strips both FE0F (emoji presentation) and FE0E (text presentation), but the new tests only cover the FE0F case. Please add a regression test for a sequence containing FE0E as well to ensure text presentation selectors are handled as intended.
| def test_find_by_code_including_text_presentation_selector | |
| assert_equal ":eye::left_speech_bubble:", Twemoji.find_by_code("1f441-fe0e-200d-1f5e8-fe0e") | |
| end | |
| def test_find_by_unicode_including_text_presentation_selector | |
| assert_equal ":eye::left_speech_bubble:", Twemoji.find_by_unicode("\u{1f441}\u{fe0e}\u{200d}\u{1f5e8}\u{fe0e}") | |
| end |
Emoji and text presentation selector used to request an emoji or text presentation for an emoji charactor.
If an emoji presentation character (EPC) is given,
the emoji should be replaced by corresponding image.
If a text presentation character (TPC) is given,
the emoji shouldn't be replaced by it.
And if an emoji zwj sequence includes TPC, the sequence will be several emojis1.
If an EPC or TPC is not given,
the emoji's expression is determined by the value of its emoji property "Emoji_Presentation".
However, I don't think it makes sense to deal with the Emoji_Presentation property.
For example,
an emoji "🈁" (U+1F201) is
Emoji_Presentation=Yes,but "🈂" (U+1F202) is
Emoji_Presentation=No2.Besides, though many other food emoji are
Emoji_Presentation=Yes,"🌶" is
Emoji_Presentation=No.Do you think this has a meaning?
It's difficult to handle about presentation,
so I think that it is better to ignore emoji and text presentation characters.
And I implemented it.
What do you think?
Thankyou.