Skip to content

fix: jargon UX cleanup, protect edited meanings, and deduplicate style dialogue pairs#170

Closed
YumemiDream wants to merge 7 commits into
NickCharlie:mainfrom
YumemiDream:main
Closed

fix: jargon UX cleanup, protect edited meanings, and deduplicate style dialogue pairs#170
YumemiDream wants to merge 7 commits into
NickCharlie:mainfrom
YumemiDream:main

Conversation

@YumemiDream

@YumemiDream YumemiDream commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

描述

概述

本 PR 合并了 4 个独立但相关的修复,覆盖黑话 UX、用户数据保护、风格学习数据质量三个方向。
所有改动向后兼容,不修改任何对外 HTTP API。

改动一览

# 改动 文件数 净行数 风险
1 移除"按出现次数"排序 1 -3 🟢 极低
2 保护用户手动编辑过的黑话释义 5 +12 🟡 中(DB 迁移)
3 风格学习优先真实对话对 1 +1 🟢 低
4 风格学习对话对去重(提取+注入) 2 +24 🟡 中(行为变更)

1. fix: remove jargon sort by occurrences (count not reliable)

背景

黑话审查页原先提供"按出现次数"排序选项,但 count 字段的实际取值不可靠——
之前的 fix: change jargon count to actual chat occurrence frequency 已因统计口径问题被回滚,
导致该排序结果毫无意义,会给用户造成误导。

改动

  • web_res/static/html/dashboard.html
    • 移除排序下拉中的 <option value="occurrences">按出现次数</option>
    • 移除排序逻辑中的 state.jargon.sort === 'occurrences' 分支

影响

  • 仅影响 UI 排序选项,无后端行为变化
  • 之前依赖此排序的用户将少一个选项(按时间/按名称不受影响)
  • 无数据库 schema 变动,零迁移风险

2. fix: prevent plugin from overwriting manually edited jargon meanings

背景

当用户通过 dashboard 手动编辑了某条黑话的释义(meaning)后,插件在后续的推断流程中
会再次覆盖这条释义,导致用户的手动修改被无声丢失。这是数据上的明显 bug。

改动

数据模型新增字段

  • models/jargon.py:dataclass 新增 meaning_edited: bool = False
  • models/orm/jargon.py:对应新增 meaning_edited = Column(Boolean, default=False)
  • 启动时会自动迁移加列(向下兼容)

写入路径标记

  • webui/services/jargon_service.py:用户通过 dashboard 调用 update 接口修改释义时,
    额外把 meaning_edited=True 一起传过去
  • services/database/facades/jargon_facade.py:识别到 meaning_edited 字段才落库为 True
    避免推断流程误标

读取路径返回

  • services/database/facades/jargon_facade.py:序列化时把 meaning_edited 一并返回给前端

推断时跳过

  • services/jargon/jargon_miner.py_should_infer_meaning
    jargon.meaning_edited 为 True 时直接返回 False,跳过该条的 LLM 推断与释义覆盖

数据库迁移说明

新增列 meaning_edited BOOLEAN DEFAULT 0,对存量数据无影响(默认 False 视为可推断)。
如果需要手动执行 SQL:

ALTER TABLE jargons ADD COLUMN meaning_edited BOOLEAN DEFAULT 0;

影响

- 用户手动编辑过的黑话释义在后续学习流程中保持不变
- 仅扩展 dataclass 字段、facade 序列化与一个判断条件,向后兼容

---
3. fix: prioritize real dialogue pairs over LLM-generated patterns

背景

当前风格学习默认使用 LLM 生成的"表达模式"作为示范对话,但这些模板的语境连贯性差、
不如真实对话对自然,影响后续 Bot 模仿质量。

改动

services/core_learning/progressive_learning.py 的风格学习保存流程:
- 调整前:直接使用 LLM 生成的 expression_patterns
- 调整后:先尝试从真实聊天记录中提取 user→bot 对话对,只有当真实对话不足时
才回退到 LLM 生成的 expression_patterns
- 这样保证展示给 LLM 模仿的对话是真实有上下文的

影响

- 提升风格学习示范对话的质量(真实对话 > 模板)
- 不影响其他行为

---
4. fix: deduplicate style learning dialogue pairs at extraction and injection

背景

风格学习存在数据冗余:相同或相似的 user→bot 对话对在提取阶段和注入 begin_dialogs 阶段都
会被重复保存,造成冗余和 token 浪费。

改动

提取阶段去重 - 真实对话

_extract_fewshot_pairs_from_merged 新增 seen: set- 遍历消息对时若 (situation[:50], expression[:100]) key 已存在则跳过
- 单批次内不再重复保存相同对

提取阶段去重 - 风格学习

_extract_style_dialog_pairs:
- 从 learned_patterns 提取时去重
- 从 few-shots 文本解析对话对时也去重(与已收集的合并去重)

注入阶段去重

_build_style_begin_dialogs:
- 在追加新对话对之前,扫描现有 begin_dialogs,提取已包含的 user 消息集合
- 跳过 user 消息已存在的对,避免多次批准风格学习审查时反复追加相同内容

行为变化

- ✅ 减少冗余数据存储和 LLM token 消耗
- ✅ 用户可见效果:begin_dialogs 不再被重复污染
- ✅ 不修改任何对外 HTTP API

兼容性

- 数据库无 schema 变动
- 不影响 state.jargon、few_shots_content 等现有字段的读写
- 仅优化内部生成/注入逻辑

---
测试建议

1. 黑话排序(改动 1- 打开审查页 → 黑话 tab → 排序下拉应不再有"按出现次数"
- 其他排序方式(最新/最早/按名称)应正常工作

2. 黑话释义保护(改动 21. 编辑某条黑话释义
2. 触发一次学习批次
3. 检查该条黑话的释义未被覆盖,meaning_edited 字段为 True

3-4. 风格学习去重(改动 3-41. 连续批准两条相似的风格学习审查
2. 检查 begin_dialogs 中不应出现重复的 user 消息
3. 触发一次学习批次
4. 确认 expression_patterns 来源优先是真实对话
5. 故意构造重复的 learned_patterns,验证去重生效

---
兼容性总览

- ✅ 无 HTTP API 变更
- ✅ 仅 PR 2 涉及数据库 schema(向下兼容的新列)
- ✅ 所有改动可独立回滚
- ✅ 不影响现有插件配置和用户数据

相关 Commit

- 28a9de1 fix: remove jargon sort by occurrences (count not reliable)
- a0dfab4 fix: prevent plugin from overwriting manually edited jargon meanings
- 6a9b68f fix: prioritize real dialogue pairs over LLM-generated patterns
- adaffe2 fix: deduplicate style learning dialogue pairs at extraction and injection

## Summary by Sourcery

Protect manually edited jargon meanings, improve style learning style-learning data quality, and simplify jargon review sorting options.

Bug Fixes:
- Prevent LLM-based jargon inference from overwriting meanings that users have manually edited.
- Prioritize real user–bot dialogue pairs over LLM-generated templates when saving style learning records.
- Deduplicate style learning dialogue pairs during extraction and injection to avoid redundant storage and repeated begin_dialog entries.
- Remove the unreliable 'sort by occurrences' option from the jargon dashboard to avoid misleading UX.

Enhancements:
- Expose and persist a meaning_edited flag in jargon models and APIs so downstream components can respect user-edited meanings.

YumemiDream and others added 7 commits June 4, 2026 19:41
- Add meaning_edited flag to Jargon ORM and dataclass
- Set meaning_edited=True when user edits meaning via dashboard
- _should_infer_meaning skips jargon with meaning_edited=True
- Auto-migration will add the new column on startup
- Add sync_jargon_counts to facade: bulk-update count from statistical
  filter's term frequency table
- mine_jargon syncs filter frequencies to DB before inference
- Remove manual count+1 in save_or_update_jargon (count now managed
  by frequency sync)
- Inference thresholds [3,6,10,20,40,60,100] now reflect actual chat
  occurrences, not LLM validation pass count
Style learning now extracts user->bot pairs from actual chat history
first (chronologically matched), falling back to LLM-generated
expression patterns only when no real pairs are found.
…ction

- Extraction: _extract_fewshot_pairs_from_merged deduplicates by
  (situation, expression) content within a single batch
- Extraction: _extract_style_dialog_pairs deduplicates learned_patterns
  and few-shot pairs
- Injection: _build_style_begin_dialogs checks existing begin_dialogs
  for matching user messages before appending, preventing duplicates
  across multiple approved reviews
@sourcery-ai

sourcery-ai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Reviewer's Guide

This PR makes four related fixes around jargon UX and style-learning data quality: it removes an unreliable “occurrences” sort option, introduces a meaning_edited flag to protect user-edited jargon meanings from being overwritten by inference, prioritizes real dialogue pairs over LLM-generated patterns when saving style learning records, and deduplicates style-learning dialogue pairs both at extraction and injection to reduce redundancy.

Sequence diagram for protecting user-edited jargon meanings

sequenceDiagram
    actor User
    participant Dashboard as Dashboard
    participant JargonService as JargonService.update_jargon
    participant JargonFacade as JargonFacade.update_jargon
    participant DB as JargonTable
    participant Miner as JargonMiner._should_infer_meaning

    User->>Dashboard: Edit jargon meaning
    Dashboard->>JargonService: update_jargon(term, meaning)
    JargonService->>JargonService: payload[meaning]
    JargonService->>JargonService: payload[meaning_edited] = True
    JargonService->>JargonFacade: update_jargon(payload)
    JargonFacade->>DB: record.meaning = meaning
    JargonFacade->>JargonFacade: [meaning_edited in jargon_data]
    JargonFacade->>DB: record.meaning_edited = True
    DB-->>JargonFacade: updated record
    JargonFacade-->>Dashboard: jargon with meaning_edited

    Miner->>Miner: _should_infer_meaning(jargon)
    Miner->>Miner: [jargon.is_complete]
    Miner-->>Miner: return False
    Miner->>Miner: [jargon.meaning_edited]
    Miner-->>Miner: return False
    Miner-->>DB: Skip inference for edited jargon
Loading

File-Level Changes

Change Details Files
Protect user‑edited jargon meanings by tracking a meaning_edited flag through the model, DB, and inference paths.
  • Add meaning_edited field to the Jargon dataclass and ORM model with default False and include it in to_dict serialization.
  • On dashboard updates, set meaning_edited=True when a meaning is edited and only persist that flag in the facade when explicitly provided.
  • Expose meaning_edited on search responses so the frontend can distinguish manually edited entries.
  • Skip meaning inference in jargon_miner when meaning_edited is True and preserve this flag when saving or updating existing jargons.
models/jargon.py
models/orm/jargon.py
services/database/facades/jargon_facade.py
services/jargon/jargon_miner.py
webui/services/jargon_service.py
Improve style learning data quality by prioritizing real dialogue pairs and deduplicating them during extraction and injection.
  • In progressive_learning, first attempt to extract user→bot pairs from merged real messages and only fall back to filtered expression_patterns when no real pairs are available.
  • Introduce deduplication in _extract_fewshot_pairs_from_merged using a (situation, expression) key truncated to 50/100 chars to avoid repeated pairs.
  • Deduplicate style review dialog pairs extracted from learned_patterns and few_shots_content using a seen set so identical pairs are only included once.
  • Prevent duplicate style examples in begin_dialogs by collecting existing user messages and skipping dialog_pairs whose user text is already present before appending new STYLE_BEGIN_DIALOG_PREFIX entries.
services/core_learning/progressive_learning.py
webui/services/persona_review_service.py
Simplify jargon review UI sorting by removing the unreliable “sort by occurrences” option.
  • Remove the occurrences option from the jargon sort dropdown in the dashboard HTML.
  • Delete the corresponding occurrences-based sort branch in the frontend sorting logic so only time- and name-based sorts remain.
web_res/static/html/dashboard.html

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In _build_style_begin_dialogs, the dedup key for user messages is user_msg.strip(), but existing entries are derived via substring + strip(); consider extracting a shared normalization helper (e.g., trimming, optional lowercasing) so both existing and new entries use exactly the same normalization and you don’t get subtle duplicates due to whitespace or minor formatting differences.
  • In _save_style_learning_record, once any real dialogue pairs are extracted, LLM-generated expression_patterns are no longer used at all; if the intent is to only prioritize real dialogue rather than completely replace patterns, you might want to append filtered expression_patterns when the extracted real pairs are sparse, instead of discarding them.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_build_style_begin_dialogs`, the dedup key for user messages is `user_msg.strip()`, but existing entries are derived via substring + `strip()`; consider extracting a shared normalization helper (e.g., trimming, optional lowercasing) so both existing and new entries use exactly the same normalization and you don’t get subtle duplicates due to whitespace or minor formatting differences.
- In `_save_style_learning_record`, once any real dialogue pairs are extracted, LLM-generated `expression_patterns` are no longer used at all; if the intent is to only *prioritize* real dialogue rather than completely replace patterns, you might want to append filtered `expression_patterns` when the extracted real pairs are sparse, instead of discarding them.

## Individual Comments

### Comment 1
<location path="services/database/facades/jargon_facade.py" line_range="167-169" />
<code_context>
                         record.meaning = json.dumps(meaning_val, ensure_ascii=False)
                     else:
                         record.meaning = str(meaning_val) if meaning_val is not None else None
+                    # Only mark meaning_edited when explicitly set (not from inference)
+                    if jargon_data.get('meaning_edited'):
+                        record.meaning_edited = True
                 if 'is_jargon' in jargon_data:
                     record.is_jargon = jargon_data['is_jargon']
</code_context>
<issue_to_address>
**issue (bug_risk):** Allow `meaning_edited` to be cleared or explicitly set to False when needed.

Because `record.meaning_edited` is only set when `jargon_data.get('meaning_edited')` is truthy, you can never clear or explicitly set it back to `False` once it’s been set. To support workflows like undoing edits or re-enabling inference, consider checking for key presence and assigning the value directly:

```python
if 'meaning_edited' in jargon_data:
    record.meaning_edited = bool(jargon_data['meaning_edited'])
```

This allows explicit `True`/`False` updates without unintended flips from inference-only changes.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread services/database/facades/jargon_facade.py
@YumemiDream YumemiDream closed this Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant