fix: jargon UX cleanup, protect edited meanings, and deduplicate style dialogue pairs#170
Closed
YumemiDream wants to merge 7 commits into
Closed
fix: jargon UX cleanup, protect edited meanings, and deduplicate style dialogue pairs#170YumemiDream wants to merge 7 commits into
YumemiDream wants to merge 7 commits into
Conversation
- Add meaning_edited flag to Jargon ORM and dataclass - Set meaning_edited=True when user edits meaning via dashboard - _should_infer_meaning skips jargon with meaning_edited=True - Auto-migration will add the new column on startup
- Add sync_jargon_counts to facade: bulk-update count from statistical filter's term frequency table - mine_jargon syncs filter frequencies to DB before inference - Remove manual count+1 in save_or_update_jargon (count now managed by frequency sync) - Inference thresholds [3,6,10,20,40,60,100] now reflect actual chat occurrences, not LLM validation pass count
This reverts commit 0194dd4.
Style learning now extracts user->bot pairs from actual chat history first (chronologically matched), falling back to LLM-generated expression patterns only when no real pairs are found.
…ction - Extraction: _extract_fewshot_pairs_from_merged deduplicates by (situation, expression) content within a single batch - Extraction: _extract_style_dialog_pairs deduplicates learned_patterns and few-shot pairs - Injection: _build_style_begin_dialogs checks existing begin_dialogs for matching user messages before appending, preventing duplicates across multiple approved reviews
Contributor
Reviewer's GuideThis PR makes four related fixes around jargon UX and style-learning data quality: it removes an unreliable “occurrences” sort option, introduces a meaning_edited flag to protect user-edited jargon meanings from being overwritten by inference, prioritizes real dialogue pairs over LLM-generated patterns when saving style learning records, and deduplicates style-learning dialogue pairs both at extraction and injection to reduce redundancy. Sequence diagram for protecting user-edited jargon meaningssequenceDiagram
actor User
participant Dashboard as Dashboard
participant JargonService as JargonService.update_jargon
participant JargonFacade as JargonFacade.update_jargon
participant DB as JargonTable
participant Miner as JargonMiner._should_infer_meaning
User->>Dashboard: Edit jargon meaning
Dashboard->>JargonService: update_jargon(term, meaning)
JargonService->>JargonService: payload[meaning]
JargonService->>JargonService: payload[meaning_edited] = True
JargonService->>JargonFacade: update_jargon(payload)
JargonFacade->>DB: record.meaning = meaning
JargonFacade->>JargonFacade: [meaning_edited in jargon_data]
JargonFacade->>DB: record.meaning_edited = True
DB-->>JargonFacade: updated record
JargonFacade-->>Dashboard: jargon with meaning_edited
Miner->>Miner: _should_infer_meaning(jargon)
Miner->>Miner: [jargon.is_complete]
Miner-->>Miner: return False
Miner->>Miner: [jargon.meaning_edited]
Miner-->>Miner: return False
Miner-->>DB: Skip inference for edited jargon
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
_build_style_begin_dialogs, the dedup key for user messages isuser_msg.strip(), but existing entries are derived via substring +strip(); consider extracting a shared normalization helper (e.g., trimming, optional lowercasing) so both existing and new entries use exactly the same normalization and you don’t get subtle duplicates due to whitespace or minor formatting differences. - In
_save_style_learning_record, once any real dialogue pairs are extracted, LLM-generatedexpression_patternsare no longer used at all; if the intent is to only prioritize real dialogue rather than completely replace patterns, you might want to append filteredexpression_patternswhen the extracted real pairs are sparse, instead of discarding them.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_build_style_begin_dialogs`, the dedup key for user messages is `user_msg.strip()`, but existing entries are derived via substring + `strip()`; consider extracting a shared normalization helper (e.g., trimming, optional lowercasing) so both existing and new entries use exactly the same normalization and you don’t get subtle duplicates due to whitespace or minor formatting differences.
- In `_save_style_learning_record`, once any real dialogue pairs are extracted, LLM-generated `expression_patterns` are no longer used at all; if the intent is to only *prioritize* real dialogue rather than completely replace patterns, you might want to append filtered `expression_patterns` when the extracted real pairs are sparse, instead of discarding them.
## Individual Comments
### Comment 1
<location path="services/database/facades/jargon_facade.py" line_range="167-169" />
<code_context>
record.meaning = json.dumps(meaning_val, ensure_ascii=False)
else:
record.meaning = str(meaning_val) if meaning_val is not None else None
+ # Only mark meaning_edited when explicitly set (not from inference)
+ if jargon_data.get('meaning_edited'):
+ record.meaning_edited = True
if 'is_jargon' in jargon_data:
record.is_jargon = jargon_data['is_jargon']
</code_context>
<issue_to_address>
**issue (bug_risk):** Allow `meaning_edited` to be cleared or explicitly set to False when needed.
Because `record.meaning_edited` is only set when `jargon_data.get('meaning_edited')` is truthy, you can never clear or explicitly set it back to `False` once it’s been set. To support workflows like undoing edits or re-enabling inference, consider checking for key presence and assigning the value directly:
```python
if 'meaning_edited' in jargon_data:
record.meaning_edited = bool(jargon_data['meaning_edited'])
```
This allows explicit `True`/`False` updates without unintended flips from inference-only changes.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
描述
概述
本 PR 合并了 4 个独立但相关的修复,覆盖黑话 UX、用户数据保护、风格学习数据质量三个方向。
所有改动向后兼容,不修改任何对外 HTTP API。
改动一览
1. fix: remove jargon sort by occurrences (count not reliable)
背景
黑话审查页原先提供"按出现次数"排序选项,但
count字段的实际取值不可靠——之前的
fix: change jargon count to actual chat occurrence frequency已因统计口径问题被回滚,导致该排序结果毫无意义,会给用户造成误导。
改动
web_res/static/html/dashboard.html:<option value="occurrences">按出现次数</option>state.jargon.sort === 'occurrences'分支影响
2. fix: prevent plugin from overwriting manually edited jargon meanings
背景
当用户通过 dashboard 手动编辑了某条黑话的释义(meaning)后,插件在后续的推断流程中
会再次覆盖这条释义,导致用户的手动修改被无声丢失。这是数据上的明显 bug。
改动
数据模型新增字段
models/jargon.py:dataclass 新增meaning_edited: bool = Falsemodels/orm/jargon.py:对应新增meaning_edited = Column(Boolean, default=False)列写入路径标记
webui/services/jargon_service.py:用户通过 dashboard 调用 update 接口修改释义时,额外把
meaning_edited=True一起传过去services/database/facades/jargon_facade.py:识别到meaning_edited字段才落库为True,避免推断流程误标
读取路径返回
services/database/facades/jargon_facade.py:序列化时把meaning_edited一并返回给前端推断时跳过
services/jargon/jargon_miner.py的_should_infer_meaning:当
jargon.meaning_edited为 True 时直接返回 False,跳过该条的 LLM 推断与释义覆盖数据库迁移说明
新增列
meaning_edited BOOLEAN DEFAULT 0,对存量数据无影响(默认 False 视为可推断)。如果需要手动执行 SQL: