Fix FullSegment skipping the rune after a null dict match #214

Merged: yanyiwu merged 2 commits into master from copilot/fix-fullsegment-cut-issue on Mar 12, 2026

Copilot AI (Contributor) commented on Mar 11, 2026

FullSegment::Cut could advance maxIdx with a stale wordLen when a DAG edge had du == NULL. In inputs like 崎岖的牙齿, the unmatched rune 岖 reused the length 2 left over from 崎岖, which pushed maxIdx past the next rune, so the segmenter dropped 的 from the output.

  • Root cause

    • wordLen was only updated on dictionary hits.
    • On du == NULL, the previous matched word length leaked into the next maxIdx calculation.
  • Change

    • Reset wordLen to 1 for null dictionary-unit edges so unknown runes advance by a single rune instead of reusing the prior match length.
    • Keep the existing emission behavior unchanged; only the index advancement logic is corrected.
  • Regression coverage

    • Add a focused FullSegment test for 崎岖的牙齿.
    • Expected output now includes the skipped rune:
segment.Cut("崎岖的牙齿", words);
ASSERT_EQ("崎岖/的/牙齿", Join(words.begin(), words.end(), "/"));
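
The index bookkeeping described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the library's code: `Edge`, `FullCutModel`, and `DemoCut` are hypothetical names, the DAG is built by hand rather than looked up in a dictionary, and it assumes 崎岖, 的 and 牙齿 are dictionary words while the single runes 崎, 岖, 牙 and 齿 are not. The emission rule (emit words of two or more runes, or a lone edge that no earlier word covers) is also an assumption about the existing behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One outgoing DAG edge from a rune position (hypothetical type).
// dictLen == 0 stands in for the du == NULL case.
struct Edge {
  std::size_t next;     // index of the last rune covered by this edge
  std::size_t dictLen;  // matched word length in runes, 0 if not in the dictionary
};

// Simplified model of the maxIdx / wordLen bookkeeping.
// resetOnNull == false reproduces the stale-wordLen behavior;
// resetOnNull == true applies the reset described in the change.
std::string FullCutModel(const std::vector<std::string>& runes,
                         const std::vector<std::vector<Edge>>& dag,
                         bool resetOnNull) {
  std::string out;
  std::size_t maxIdx = 0;   // one past the furthest rune covered so far
  std::size_t wordLen = 0;  // length of the most recent match
  for (std::size_t i = 0; i < dag.size(); ++i) {
    for (const Edge& e : dag[i]) {
      // A lone edge is emitted only if no earlier word already covers it.
      const bool lone = dag[i].size() == 1 && maxIdx <= i;
      bool emit = false;
      if (e.dictLen == 0) {
        if (resetOnNull) wordLen = 1;  // unknown rune advances by one rune
        emit = lone;
      } else {
        wordLen = e.dictLen;
        emit = wordLen >= 2 || lone;
      }
      if (emit) {
        if (!out.empty()) out += "/";
        for (std::size_t k = i; k <= e.next; ++k) out += runes[k];
      }
      if (i + wordLen > maxIdx) maxIdx = i + wordLen;
    }
  }
  return out;
}

// Hand-built DAG for 崎岖的牙齿 under the assumptions stated above.
std::string DemoCut(bool resetOnNull) {
  const std::vector<std::string> runes = {"崎", "岖", "的", "牙", "齿"};
  const std::vector<std::vector<Edge>> dag = {
      {{0, 0}, {1, 2}},  // 崎 (unknown), 崎岖
      {{1, 0}},          // 岖 (unknown)
      {{2, 1}},          // 的
      {{3, 0}, {4, 2}},  // 牙 (unknown), 牙齿
      {{4, 0}},          // 齿 (unknown)
  };
  return FullCutModel(runes, dag, resetOnNull);
}
```

In this model, `DemoCut(false)` returns `崎岖/牙齿`: at 岖 the stale length 2 raises maxIdx to 3, so 的 at index 2 looks already covered. `DemoCut(true)` returns `崎岖/的/牙齿`, matching the regression test's expectation.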
Original prompt

This section details the original issue you should resolve

<issue_title>FullSegment cut wrong when du is nullptr.</issue_title>
<issue_description>Case: weicheng.utf8 line 68:
  明天他到苏家,唐小姐已先到了。他还没坐定,赵辛楣也来了,招呼后说:“方先生,昨天去得迟,今天来得早。想是上银行办公养成的好习惯,勤勉可嘉,佩服佩服!”“过奖,过奖!”方鸿渐本想说辛楣昨天早退,今天迟到,是学衙门里上司的官派,一转念,忍住不说,还对辛楣善意地微笑。辛楣想不到他会这样无的抵抗,反有一拳打个空的惊慌。唐小姐藏不了脸上的诧异。苏小姐也觉得奇怪,但忽然明白这是胜利者的大度,鸿渐知道自己爱的是他,所以不与辛楣计较了。沈氏夫妇也来了。乘大家介绍寒喧的时候,赵辛楣拣最近苏小姐沙发坐下,沈氏夫妇合坐一张长沙发,唐小姐坐在苏小姐和沈先生坐位中间的一个绣垫上,鸿渐孤零零地近太太坐了。一坐下去,他后悔无及,因为沈太太身上有一股味道,文言里的雅称跟古罗马成语都借羊来比喻:“愠羝。”这暖烘烘的味道,搀了脂粉香和花香,熏得方鸿渐泛胃,又不好意思抽烟解秽。心里想这真是从法国新回来的女人,把巴黎大菜场的“臭味交响曲”都带到中国来了,可见巴黎大而天下小。沈太太生得怪样,打扮得妖气。她眼睛下两个黑袋,像圆壳行军热水瓶,想是储蓄着多情的热泪,嘴唇涂的浓胭脂给唾沫进了嘴,把黯黄崎岖的牙齿染道红痕,血淋淋的像侦探小说里谋杀案的线索,说话常有“Tiens!”“O la,la!”那些法文慨叹,把自己身躯扭摆出媚态柔姿。她身体动一下,那气味又添了新的一阵。鸿渐恨不能告诉她,话用嘴说就够了,小心别把身体一扭两段。沈先生下唇肥厚倒垂,一望而知是个说话多而快像嘴里在泻肚子下痢的人。他在讲他怎样向法国人作战事宣传,怎样博得不少人对中国的同情:“南京撤退以后,他们都说中国完了。我对他们说:”欧洲大战的时候,你们政府不是也迁都离开巴黎么?可是你们是最后的胜利者。“他没有话讲,唉,他们没有话讲。”鸿渐想政府可以迁都,自己倒不能换座位。
cut result:
Old (566 words): [ ] [ ] [明天] [他] [到] [苏] [家] [,] [唐小姐] [小姐] [已] [先] [到] [了] [。] [他] [还] [没] [坐定] [,] [赵辛楣] [也] [来] [了] [,] [招呼] [后] [说] [:] [“] [方] [先生] [,] [昨天] [去] [得] [迟] [,] [今天] [来得] [来得早] [。] [想] [是] [上] [银行] [办公] [养成] [的] [好] [习惯] [,] [勤勉] [可嘉] [,] [佩服] [佩服] [!] [”] [“] [过奖] [,] [过奖] [!] [”] [方鸿渐] [本] [想] [说] [辛] [楣] [昨天] [早退] [,] [今天] [迟到] [,] [是] [学] [衙门] [门里] [上司] [的] [官派] [,] [一] [转念] [,] [忍住] [不] [说] [,] [还] [对] [辛] [楣] [善意] [地] [微笑] [。] [辛] [楣] [想不到] [不到] [他] [会] [这样] [无] [的] [抵抗] [,] [反] [有] [一拳] [拳打] [个] [空] [的] [惊慌] [。] [唐小姐] [小姐] [藏] [不了] [脸上] [的] [诧异] [。] [苏] [小姐] [也] [觉得] [奇怪] [,] [但] [忽然] [明白] [这] [是] [胜利] [胜利者] [的] [大度] [,] [鸿] [渐] [知道] [自己] [爱] [的] [是] [他] [,] [所以] [不] [与] [辛] [楣] [计较] [了] [。] [沈] [氏] [夫妇] [也] [来] [了] [。] [乘] [大家] [介绍] [寒喧] [的] [时候] [,] [赵辛楣] [拣] [最近] [苏] [小姐] [沙发] [坐下] [,] [沈] [氏] [夫妇] [合] [坐] [一张] [长沙] [沙发] [,] [唐小姐] [小姐] [坐在] [苏] [小姐] [和] [沈先生] [先生] [坐位] [中间] [的] [一个] [绣] [垫] [上] [,] [鸿] [渐] [孤零] [孤零零] [零零] [地] [近] [太太] [坐] [了] [。] [一] [坐下] [下去] [,] [他] [后悔] [后悔无及] [无及] [,] [因为] [沈] [太太] [身上] [上有] [一股] [味道] [,] [文言] [里] [的] [雅称] [跟] [古罗马] [罗马] [成语] [都] [借] [羊] [来] [比喻] [:] [“] [愠] [羝] [。] [”] [这] [暖烘烘] [的] [味道] [,] [搀] [了] [脂粉] [香] [和] [花香] [,] [熏得] [方鸿渐] [泛] [胃] [,] [又] [不好] [不好意思] [好意] [好意思] [意思] [抽烟] [解] [秽] [。] [心里] [想] [这] [真是] [是从] [法国] [新] [回来] [的] [女人] [,] [把] [巴黎] [大菜] [菜场] [的] [“] [臭味] [交响] [交响曲] [”] [都] [带到] [中国] [来] [了] [,] [可见] [巴黎] [大] [而] [天下] [小] [。] [沈] [太太] [生得] [怪样] [,] [打扮] [扮得] [妖气] [。] [她] [眼睛] [下] [两个] [黑] [袋] [,] [像] [圆] [壳] [行军] [热水] [热水瓶] [水瓶] [,] [想] [是] [储蓄] [着] [多情] [的] [热泪] [,] [嘴唇] [涂] [的] [浓] [胭脂] [给] [唾沫] [进] [了] [嘴] [,] [把] [黯] [黄] **[崎岖] [牙齿]** [染] [道] [红] [痕] [,] [血淋淋] [淋淋] [的] [像] [侦探] [侦探小说] [小说] [里] [谋杀] [谋杀案] [的] [线索] [,] [说话] [常有] [“] [T] [i] [e] [n] [s] [!] [”] [“] [O] [ ] [l] [a] [,] [l] [a] [!] 
[”] [那些] [法文] [慨叹] [,] [把] [自己] [己身] [身躯] [扭摆] [摆出] [媚态] [柔] [姿] [。] [她] [身体] [动] [一下] [,] [那] [气味] [又] [添] [了] [新] [的] [一阵] [。] [鸿] [渐] [恨不能] [不能] [告诉] [她] [,] [话] [用] [嘴] [说] [就] [够] [了] [,] [小心] [别] [把] [身体] [一] [扭] [两段] [。] [沈先生] [先生] [生下] [下唇] [肥厚] [倒垂] [,] [一望而知] [是] [个] [说话] [多] [而] [快] [像] [嘴里] [在] [泻肚] [泻肚子] [肚子] [下痢] [的] [人] [。] [他] [在] [讲] [他] [怎样] [向] [法国] [法国人] [国人] [作战] [战事] [宣传] [,] [怎样] [博得] [不少] [人] [对] [中国] [的] [同情] [:] [“] [南京] [撤退] [以后] [,] [他们] [都] [说] [中国] [完] [了] [。] [我] [对] [他们] [说] [:] [”] [欧洲] [大战] [的] [时候] [,] [你们] [政府] [不是] [也] [迁都] [离开] [巴黎] [么] [?] [可是] [你们] [是] [最后] [的] [胜利] [胜利者] [。] [“] [他] [没有] [话] [讲] [,] [唉] [,] [他们] [没有] [话] [讲] [。] [”] [鸿] [渐] [想] [政府] [可以] [迁都] [,] [自己] [倒] [不能] [换] [座位] [。]

In FullSegment::Cut, when du == NULL, the variable wordLen is not reset, so maxIdx advances by the wrong step. In this case, "崎岖的牙齿": "岖" is not a dictionary word, so wordLen keeps the value 2 left over from "崎岖", which makes maxIdx skip "的".

Suggested modification: (image not included)

</issue_description>

Comments on the Issue (you are @copilot in this section)


Co-authored-by: yanyiwu <2162645+yanyiwu@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix FullSegment cut issue when du is nullptr" to "Fix FullSegment skipping the rune after a null dict match" on Mar 11, 2026
yanyiwu marked this pull request as ready for review on March 12, 2026 02:42
yanyiwu merged commit b236d3a into master on Mar 12, 2026
56 checks passed

Development

Successfully merging this pull request may close these issues.

FullSegment cut wrong when du is nullptr.

2 participants