Fix HMM segmentation of digit+letter tokens to align with Python jieba #212

Merged
yanyiwu merged 2 commits into master from copilot/fix-word-segmentation-alignment
Mar 11, 2026


Copilot AI commented Mar 11, 2026

C++ cppjieba splits alphanumeric combinations like "5G" and "3D" into separate tokens, while Python jieba's finalseg keeps them together via re_skip = [a-zA-Z0-9]+(?:\.\d+)?.
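As a rough illustration of what the quoted pattern accepts, the sketch below runs the same expression through `std::regex` on plain ASCII input. The helper name `SkipTokens` is hypothetical and is not part of cppjieba or jieba; it only shows which substrings the pattern keeps together.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not cppjieba API): returns every substring of `text`
// matched by the pattern quoted in the PR description.
std::vector<std::string> SkipTokens(const std::string& text) {
  // One or more ASCII letters/digits, then at most one ".digits" suffix.
  static const std::regex kSkip(R"([a-zA-Z0-9]+(?:\.\d+)?)");
  std::vector<std::string> tokens;
  for (std::sregex_iterator it(text.begin(), text.end(), kSkip), end;
       it != end; ++it) {
    tokens.push_back(it->str());
  }
  return tokens;
}
```

Under these assumptions, `SkipTokens("5G")` yields `{"5G"}`, while `SkipTokens("3.5KG")` yields `{"3.5", "KG"}` because the decimal suffix ends the first match.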

Root cause

HMMSegment::NumbersRule consumed only [0-9.]+, so "5G" produced ["5", "G"] instead of ["5G"].
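The root cause can be reproduced with a simplified stand-in for the pre-fix rule. This is a sketch on ASCII `std::string` input, not the actual cppjieba code (which operates on rune iterators); the function name `OldNumbersRuleEnd` is an assumption made for illustration.

```cpp
#include <cstddef>
#include <string>

// Simplified stand-in for the pre-fix behavior: a leading digit followed by
// any run of characters from [0-9.]. Returns the index one past the consumed
// span, or `begin` when no digit starts there.
std::size_t OldNumbersRuleEnd(const std::string& s, std::size_t begin) {
  if (begin >= s.size() || s[begin] < '0' || s[begin] > '9') return begin;
  std::size_t i = begin + 1;
  while (i < s.size() && ((s[i] >= '0' && s[i] <= '9') || s[i] == '.')) ++i;
  return i;
}
```

For `"5G"` the span ends at index 1, so the `G` is left for a separate token. The loop also accepts any number of dots, which is the multi-dot permissiveness the fix removes.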

Changes

  • HMMSegment::NumbersRule: Inner loop changed from [0-9.] to [a-zA-Z0-9], allowing letters after an initial digit. Replaced multi-dot permissiveness with a single optional decimal suffix \.\d+, matching Python's pattern.
  • HMMSegment::SequentialLetterRule: Added the same optional \.\d+ suffix (e.g. abc1.2 stays together).

Before:  "3D打印" → ["3", "D", "打印"]
After:   "3D打印" → ["3D", "打印"]   ✓ matches Python jieba

Before:  "5G网络" → ["5", "G", "网络"]
After:   "5G网络" → ["5G", "网络"]   ✓ matches Python jieba

         "3.5KG"  → ["3.5", "KG"]    ✓ unchanged (decimal stops letter merging)

  • Tests: Added HMMSegmentTest.AlphanumericCombinations and two new cases in MixSegmentTest.Test1 covering the above patterns.
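The two rule changes listed above can be sketched as follows. This is a minimal ASCII-only approximation, assuming both rules share one body that differs only in the first character it accepts. All names in the `sketch` namespace are hypothetical, and the real implementation works on cppjieba's rune arrays rather than `std::string`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

inline bool IsDigit(char c) { return c >= '0' && c <= '9'; }
inline bool IsLetter(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Consumes one optional "\.\d+" suffix starting at i; returns the new index.
inline std::size_t DecimalSuffixEnd(const std::string& s, std::size_t i) {
  if (i + 1 < s.size() && s[i] == '.' && IsDigit(s[i + 1])) {
    i += 2;
    while (i < s.size() && IsDigit(s[i])) ++i;
  }
  return i;
}

// Shared body: the first char must satisfy `first_ok`, then any run of
// [a-zA-Z0-9], then the optional decimal suffix. Returns `begin` on no match.
inline std::size_t AlnumRunEnd(const std::string& s, std::size_t begin,
                               bool (*first_ok)(char)) {
  if (begin >= s.size() || !first_ok(s[begin])) return begin;
  std::size_t i = begin + 1;
  while (i < s.size() && (IsDigit(s[i]) || IsLetter(s[i]))) ++i;
  return DecimalSuffixEnd(s, i);
}

// Splits an ASCII string with the letter-led rule, then the digit-led rule;
// any other character becomes a one-character token.
inline std::vector<std::string> Split(const std::string& s) {
  std::vector<std::string> out;
  std::size_t i = 0;
  while (i < s.size()) {
    std::size_t j = AlnumRunEnd(s, i, IsLetter);
    if (j == i) j = AlnumRunEnd(s, i, IsDigit);
    if (j == i) j = i + 1;
    out.push_back(s.substr(i, j - i));
    i = j;
  }
  return out;
}

}  // namespace sketch
```

With this sketch, `sketch::Split("3D")` gives `{"3D"}`, `sketch::Split("abc1.2")` gives `{"abc1.2"}`, and `sketch::Split("3.5KG")` gives `{"3.5", "KG"}`, since the decimal suffix closes the digit-led token before the letters begin.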
Original prompt

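This section details the original issue you should resolve

<issue_title>C++ and Python versions produce misaligned word segmentation</issue_title>
<issue_description>Segmentation with C++ and with Python does not give fully consistent word boundaries or part-of-speech tags. Is there a version where the two are aligned?</issue_description>

Comments on the Issue (you are @copilot in this section)

@yanyiwu There is no fully aligned version yet. Both repositories keep changing, so exact consistency is hard to achieve. The differences should be small, though.

@yanyiwu To avoid scattering the discussion, I will merge #197 and #198 into this issue and track C++/Python segmentation boundary mismatches here (especially number+unit and alphanumeric combinations). If you have reproducible samples, please add them here first:

  • Input text (UTF-8)
  • C++ output
  • Python jieba output
  • Dictionary and version information used

This makes it easier to assess the rule differences and feasible fixes in one place.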


…binations

Co-authored-by: yanyiwu <2162645+yanyiwu@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix word segmentation alignment between C++ and Python" to "Fix HMM segmentation of digit+letter tokens to align with Python jieba" on Mar 11, 2026
@yanyiwu marked this pull request as ready for review on March 11, 2026 12:19
@yanyiwu merged commit 348ac91 into master on Mar 11, 2026
56 checks passed