Incorrect properties for characters that are range boundaries in UnicodeData.txt #30

@stephane-chazelas

Full background: https://unix.stackexchange.com/questions/805290/why-doesnt-%e4%b8%80-have-a-numeric-value-in-debians-unicode-utility/805291#805291

As noted there:

$ unicode --format='{numeric_desc}' U+4E00

Outputs nothing even though:

$ python3 -c 'import unicodedata; print(unicodedata.numeric("\u4e00"))'
1.0

The name is also incorrect. For characters in the ranges that UnicodeData.txt lists only by their first and last code points, the name is meant to be derived automatically:

$ unicode --format='{name}\n' U+4E00
<CJK Ideograph, First>
$ python3 -c 'import unicodedata; print(unicodedata.name("\u4e00"))'
CJK UNIFIED IDEOGRAPH-4E00
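
As a minimal sketch of what "automatically derived" means for the CJK unified ideograph ranges: the name is a fixed prefix followed by the code point in uppercase hexadecimal. The helper name `derived_cjk_name` is hypothetical and is not part of the `unicode` utility; other ranges (Hangul syllables, for instance) follow different rules and are not covered here.

```python
import unicodedata


def derived_cjk_name(cp: int) -> str:
    """Derive the name of a CJK unified ideograph from its code point.

    Hypothetical helper for illustration only: it assumes the caller has
    already established that cp lies in a CJK unified ideograph range.
    """
    # Uppercase hex, padded to at least four digits.
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"


# Cross-check against Python's own Unicode database.
assert derived_cjk_name(0x4E00) == unicodedata.name("\u4e00")
print(derived_cjk_name(0x4E00))
```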

This affects every character whose second field (the name field) in UnicodeData.txt is <* First> or <* Last>, that is, the first and last code point of each range.

An easy fix might be to not cache characters whose second field matches ^<.*(First|Last)>$.
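
A rough sketch of that check, assuming the usual semicolon-separated layout of UnicodeData.txt. The function name `is_range_boundary` and the sample lines are illustrative; this is not the utility's actual code, and how the result is wired into its cache is left open.

```python
import re

# The pattern suggested above: name fields such as
# "<CJK Ideograph, First>" or "<Hangul Syllable, Last>".
RANGE_BOUNDARY = re.compile(r"^<.*(First|Last)>$")


def is_range_boundary(line: str) -> bool:
    """Return True if a UnicodeData.txt line marks the start or end of a range."""
    fields = line.rstrip("\n").split(";")
    # Field 0 is the code point, field 1 is the name.
    return len(fields) > 1 and RANGE_BOUNDARY.match(fields[1]) is not None


# Illustrative lines in UnicodeData.txt format.
samples = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]

for line in samples:
    code_point = line.split(";")[0]
    action = "skip caching" if is_range_boundary(line) else "cache"
    print(f"U+{code_point}: {action}")
```

Note that `<control>` also starts with `<` but does not match, so ordinary control characters would still be cached under this check.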
