[codex] Fix Chinese name order and splitting #14

Merged
sergeyf merged 5 commits into main from codex/chinese-name-order-splitting on Jun 16, 2026

@sergeyf sergeyf commented Jun 15, 2026

Collaborator

Summary

  • Preserve explicit all-Han group boundaries for spaced Chinese input such as 增友 叶.
  • Add a guarded given-first order bonus for two-token romanized names where surname-frequency evidence strongly supports preserving input order.
  • Add a given-position-only gold splitter for surname-like fused tokens, fixing Junjie Fang-style outputs without splitting ambiguous one-anchor tokens such as Yuxuan or protected compound-surname variants.
  • Add regression coverage for the new order, splitting, and all-Han grouped-input behavior.
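As a rough illustration of the guarded given-first order bonus described in the second bullet, here is a minimal sketch. It is not the code from this PR: the function name `given_first_bonus`, the frequency table, and both thresholds are hypothetical, and the frequencies are made-up numbers chosen only to exercise the guard.

```python
import math

# Hypothetical, made-up surname frequencies (illustration only, not real data).
SURNAME_FREQ = {"wang": 0.07, "fang": 0.004, "jun": 0.00002}


def given_first_bonus(first, second, freq=SURNAME_FREQ,
                      min_log_ratio=2.0, bonus=1.0):
    """Return a score bonus for keeping 'given surname' input order.

    The bonus is guarded: it applies only when the second token is a known
    surname and is far more surname-like than the first token.
    """
    f_first = freq.get(first.lower(), 0.0)
    f_second = freq.get(second.lower(), 0.0)
    if f_second <= 0.0:
        return 0.0  # second token is not a known surname: no evidence
    floor = 1e-9  # avoids division by zero when the first token is unseen
    log_ratio = math.log10(f_second / max(f_first, floor))
    return bonus if log_ratio >= min_log_ratio else 0.0
```

With these illustrative numbers, `given_first_bonus("Jun", "Fang")` awards the bonus, while `given_first_bonus("Wang", "Fang")` does not, because the first token is itself a common surname and the ratio guard fails.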

Root Cause

The parser was collapsing whitespace-separated all-Han groups into character-level tokens before deciding order, so 增友 叶 lost the original given/surname boundary. Separately, two-token romanized names could be flipped even when the second token was overwhelmingly more likely to be the surname. Finally, the fused-token splitter had a broad surname early-exit, so valid given-name splits like Jun + Jie were skipped when the full token also appeared as a surname form.
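The first and third root causes can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser's actual code: `tokenize_han`, `split_fused`, the `position` argument, and the `SURNAMES` and `GOLD` sets are all hypothetical, and "gold" is interpreted here as "both halves are high-confidence given-name syllables". Protected compound-surname variants are not modeled.

```python
import re

# Basic CJK Unified Ideographs block; an assumption about what counts as "all-Han".
HAN = re.compile(r"[\u4e00-\u9fff]+")


def tokenize_han(text):
    """Keep whitespace-separated all-Han groups intact."""
    groups = text.split()
    if not groups or not all(HAN.fullmatch(g) for g in groups):
        return groups  # not all-Han input: leave it to other logic
    if len(groups) > 1:
        return groups  # explicit boundaries: do not explode into characters
    return list(groups[0])  # unspaced input carries no boundary information


# Hypothetical lookup sets, for illustration only.
SURNAMES = {"junjie", "fang"}   # forms that can also appear as surnames
GOLD = {"jun", "jie", "yu"}     # high-confidence given-name syllables


def split_fused(token, position, surnames=SURNAMES, gold=GOLD):
    """Split a fused romanized token when both halves are gold syllables."""
    t = token.lower()
    # The surname early-exit now applies only outside the given position.
    if t in surnames and position != "given":
        return [token]
    for i in range(1, len(t)):
        left, right = t[:i], t[i:]
        if left in gold and right in gold:
            return [left.capitalize(), right.capitalize()]
    return [token]  # one anchor or none: leave the token unsplit
```

Here `tokenize_han("增友 叶")` keeps the two groups, `split_fused("Junjie", "given")` splits into `Jun` and `Jie`, and `split_fused("Yuxuan", "given")` stays whole because only one half is a gold syllable in this toy set.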

Validation

  • uv run --no-sync --python 3.12 pytest -q \
      tests/test_regression_proposals.py::test_given_context_gold_split_bypasses_surname_guard \
      tests/test_regression_proposals.py::test_given_context_gold_split_keeps_ambiguous_non_gold_tokens_unsplit \
      tests/test_regression_proposals.py::test_guarded_low_frequency_surname_ratio_preserves_given_first_order \
      tests/test_regression_proposals.py::test_guarded_low_frequency_surname_ratio_keeps_compound_and_common_surname_boundaries \
      tests/test_regression_proposals.py::test_spaced_all_chinese_given_first_preserves_group_boundary
  • uv run --no-sync --python 3.12 python scripts/check_test_status.py

Special status script result: 44 individual test-case failures; performance tests passed. This is an improvement on the previous expected baseline of 52 failures.

@sergeyf sergeyf requested a review from atalyaalon June 15, 2026 22:11
@sergeyf sergeyf marked this pull request as ready for review June 15, 2026 22:11
@sergeyf sergeyf merged commit 5fe2096 into main Jun 16, 2026
2 checks passed
@sergeyf sergeyf deleted the codex/chinese-name-order-splitting branch June 16, 2026 21:01