[codex] Fix Chinese name order and splitting by sergeyf · Pull Request #14 · allenai/sinonym

sergeyf · 2026-06-15T22:10:44Z

Summary

Preserve explicit all-Han group boundaries for spaced Chinese input such as 增友叶.
Add a guarded given-first order bonus for two-token romanized names where surname-frequency evidence strongly supports preserving input order.
Add a given-position-only gold splitter for surname-like fused tokens, fixing Junjie Fang-style outputs without splitting ambiguous one-anchor tokens such as Yuxuan or protected compound-surname variants.
Add regression coverage for the new order, splitting, and all-Han grouped-input behavior.
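
The guarded given-first bonus described above can be pictured as a surname-frequency ratio check. This is a minimal sketch, not the code in this PR: the frequency table, the smoothing constant, the threshold, and the function name `keep_given_first` are all assumptions made for illustration.

```python
# Hypothetical surname frequencies (made-up numbers, illustration only).
SURNAME_FREQ = {"fang": 4000.0, "wang": 70000.0, "li": 72000.0}

SMOOTHING = 1.0         # assumed: avoids division by zero for unseen tokens
RATIO_THRESHOLD = 50.0  # assumed: how strong the evidence must be


def keep_given_first(first: str, second: str) -> bool:
    """Return True when a two-token romanized name should keep its input
    order, i.e. be read as (given, surname).

    The guard only fires when the second token is far more surname-like
    than the first; otherwise the decision is left to the default logic.
    """
    first_freq = SURNAME_FREQ.get(first.lower(), 0.0) + SMOOTHING
    second_freq = SURNAME_FREQ.get(second.lower(), 0.0) + SMOOTHING
    return second_freq / first_freq >= RATIO_THRESHOLD


if __name__ == "__main__":
    print(keep_given_first("Junjie", "Fang"))  # second token dominates
    print(keep_given_first("Wang", "Li"))      # both common surnames: guard stays off
```

Using a ratio rather than an absolute frequency keeps the guard from firing when both tokens are common surnames, which appears to be the case the regression tests for compound and common surname boundaries are meant to protect.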

Root Cause

The parser was collapsing whitespace-separated all-Han groups into character-level tokens before deciding order, so 增友叶 lost the original given/surname boundary. Separately, two-token romanized names could be flipped even when the second token was overwhelmingly more likely to be the surname. Finally, the fused-token splitter had a broad surname early-exit, so valid given-name splits like Jun + Jie were skipped when the full token also appeared as a surname form.
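
Two of the failure modes above, the collapsed all-Han groups and the broad surname early-exit, can be sketched as follows. This is an illustrative approximation under stated assumptions, not the parser's actual code: the function names, the `SURNAME_FORMS` and `GOLD_SPLITS` tables, and the sample input `小明 王` are hypothetical.

```python
import re

# Matches a token made only of characters in the CJK Unified Ideographs block.
ALL_HAN = re.compile(r"^[\u4e00-\u9fff]+$")


def collapse_to_chars(text: str) -> list[str]:
    """Behavior described as the root cause: every character becomes a token,
    so the whitespace boundary between groups is lost."""
    return [ch for group in text.split() for ch in group]


def keep_groups(text: str) -> list[str]:
    """Described fix: when the input is several whitespace-separated all-Han
    groups, keep those groups as the tokens."""
    groups = text.split()
    if len(groups) > 1 and all(ALL_HAN.match(g) for g in groups):
        return groups
    return collapse_to_chars(text)


# Hypothetical lookup tables for the fused-token splitter.
SURNAME_FORMS = {"junjie", "fang"}
GOLD_SPLITS = {"junjie": ("jun", "jie")}


def split_given_token(token: str, in_given_position: bool) -> list[str]:
    """Check the gold split before the surname early-exit, but only for a
    token that sits in the given-name position."""
    key = token.lower()
    gold = GOLD_SPLITS.get(key)
    if in_given_position and gold is not None:
        return list(gold)
    if key in SURNAME_FORMS:
        return [key]  # surname-like token outside given position stays fused
    return [key]      # ambiguous tokens without a gold split stay unsplit


if __name__ == "__main__":
    print(collapse_to_chars("小明 王"))
    print(keep_groups("小明 王"))
    print(split_given_token("Junjie", in_given_position=True))
    print(split_given_token("Yuxuan", in_given_position=True))
```

Restricting the gold split to the given-name position is what lets a token such as Junjie be split while the same string in surname position, or an ambiguous token such as Yuxuan, is left alone.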

Validation

uv run --no-sync --python 3.12 pytest -q \
  tests/test_regression_proposals.py::test_given_context_gold_split_bypasses_surname_guard \
  tests/test_regression_proposals.py::test_given_context_gold_split_keeps_ambiguous_non_gold_tokens_unsplit \
  tests/test_regression_proposals.py::test_guarded_low_frequency_surname_ratio_preserves_given_first_order \
  tests/test_regression_proposals.py::test_guarded_low_frequency_surname_ratio_keeps_compound_and_common_surname_boundaries \
  tests/test_regression_proposals.py::test_spaced_all_chinese_given_first_preserves_group_boundary
uv run --no-sync --python 3.12 python scripts/check_test_status.py

Special status script result: 44 individual test-case failures; performance tests passed. This improves on the previous expected baseline of 52 failures.

Fix Chinese name order and splitting

fb933f7

sergeyf requested a review from atalyaalon June 15, 2026 22:11

sergeyf marked this pull request as ready for review June 15, 2026 22:11

sergeyf added 4 commits June 15, 2026 15:53

Fix spaced Hanzi compound surname parsing

4c8c44f

Fix batch order regressions

79c3af7

Fix batch guarded given-first regressions

c5a400e

Bump version to 0.2.5

7583bc7

sergeyf merged commit 5fe2096 into main Jun 16, 2026
2 checks passed

sergeyf deleted the codex/chinese-name-order-splitting branch June 16, 2026 21:01

sergeyf merged 5 commits into main from codex/chinese-name-order-splitting