fix(asr): drop Whisper hallucinated segments using verbose_json metadata (OpenAI/Groq only) #572
Merged
H-Chris233 merged 1 commit on Jun 1, 2026
…I/Groq only)
Whisper fabricates plausible-but-unspoken text on silence/noise (the classic hallucination defect): leading/trailing silence or mic hiss turns into unrelated words. When the provider returns verbose_json, each segment carries no_speech_prob / avg_logprob / compression_ratio; use them to drop segments that clearly aren't speech (conservative thresholds so real speech is never trimmed). No segments in the response → fall back to text.
Provider-gated to avoid breaking non-Whisper backends:
- whisper (OpenAI) / groq: native Whisper, verbose_json fully supported with the metrics above; filter is effective. Verified against both providers' current docs.
- siliconflow: SenseVoice / TeleSpeech, response_format is undocumented; sending verbose_json risks a 4xx, so it stays on the existing json path.
- zhipu (GLM-ASR): accepts verbose_json but does not emit those metrics (filter would be a no-op), so it also stays on json to minimize behavior change.
Only whisper/groq opt in. whisper_supports_verbose_json(provider_id) decides the flag; WhisperBatchASR gains a verbose_json bool. Missing metric fields are treated as "keep" so the filter is harmless for any provider that returns segments without them.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
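The filtering rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR. The `Segment` struct and the `is_hallucinated` helper are hypothetical stand-ins for the parsed `verbose_json` payload, and the real `extract_confident_text` may have a different signature. The thresholds are the ones listed in the PR description. Treating an empty segment list like a missing one, and concatenating segment texts before trimming, are assumptions of this sketch.

```rust
/// Hypothetical, simplified view of one `verbose_json` segment.
/// Every metric is optional because some providers omit them.
#[derive(Debug, Clone)]
struct Segment {
    text: String,
    no_speech_prob: Option<f64>,
    avg_logprob: Option<f64>,
    compression_ratio: Option<f64>,
}

/// Returns true when the metrics say the segment is clearly not speech.
/// A missing metric never counts against the segment ("keep" by default).
fn is_hallucinated(seg: &Segment) -> bool {
    // High silence probability combined with low confidence.
    let silent_and_unsure = matches!(
        (seg.no_speech_prob, seg.avg_logprob),
        (Some(p), Some(lp)) if p > 0.6 && lp < -0.5
    );
    // Highly compressible text, i.e. repetitive output.
    let repetitive = seg.compression_ratio.map_or(false, |r| r > 2.4);
    // Extremely low confidence on its own.
    let very_unsure = seg.avg_logprob.map_or(false, |lp| lp < -1.0);
    silent_and_unsure || repetitive || very_unsure
}

/// Keeps only the segments that pass the filter. With no segments
/// (missing or empty, an assumption of this sketch) it falls back to
/// the plain `text` field.
fn extract_confident_text(text: &str, segments: Option<&[Segment]>) -> String {
    let segs = match segments {
        Some(s) if !s.is_empty() => s,
        _ => return text.trim().to_string(),
    };
    segs.iter()
        .filter(|s| !is_hallucinated(s))
        .map(|s| s.text.as_str())
        .collect::<String>()
        .trim()
        .to_string()
}
```

Because each rule only fires when the relevant metric is present, a provider that returns segments without metrics passes through unfiltered, which matches the "missing fields are treated as keep" behavior described above.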
katanumahotori added a commit to katanumahotori/openless that referenced this pull request on Jun 1, 2026
Aligns the fork with PR Open-Less#572: the Whisper hallucination filter only requests response_format=verbose_json for providers that return the metrics (whisper/groq). SiliconFlow (SenseVoice/TeleSpeech, no response_format) and zhipu (GLM-ASR, no metrics) keep the plain json path. Previously the fork always sent verbose_json, which was fine on Groq but would risk a 4xx if switched to SiliconFlow. WhisperBatchASR gains a verbose_json bool; whisper_supports_verbose_json decides it at construction. strip_prompt_echo still runs on both paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
User description
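Problem
On silent, faint, or noisy audio, Whisper generates text that sounds plausible but was never spoken (the known hallucination defect). Silence before and after a recording, and the microphone's noise floor, are often transcribed as unrelated words that pollute the final result. Currently `transcribe_chunk` takes `json["text"]` directly, with no filtering at all.
Solution
When the provider returns `verbose_json`, each segment carries `no_speech_prob` / `avg_logprob` / `compression_ratio`. Segments that are clearly not speech are dropped using conservative thresholds:
- `no_speech_prob > 0.6` and `avg_logprob < -0.5` (high silence probability plus low confidence)
- `compression_ratio > 2.4` (repetitive hallucination; Whisper's standard threshold)
- `avg_logprob < -1.0` (extremely low confidence; noise turned into words)
Deleting real speech by mistake is the worst outcome, so the thresholds lean conservative. When the response has no `segments`, the code falls back to using `text` directly (same as the old behavior). When a metric field is missing, the segment is kept, so for a provider that returns segments without metrics the filter is a harmless no-op.
Provider gating (key)
`verbose_json` is enabled only for providers that are confirmed to support it and that benefit from it, so other backends are not broken:
- `whisper` (OpenAI): enabled
- `groq`: enabled
- `zhipu` (GLM-ASR): stays on `json`
- `siliconflow`: stays on `json` (no `response_format` parameter in its docs)
Rationale: the current OpenAI and Groq docs both state explicitly that `verbose_json` returns the segment metrics above. The transcription endpoint in SiliconFlow's docs has no `response_format` parameter, and its models are SenseVoice/TeleSpeech. GLM-ASR accepts `verbose_json`, but its segments have a different shape.
`whisper_supports_verbose_json(provider_id)` decides whether the feature is enabled; `WhisperBatchASR` gains a `verbose_json` bool parameter. When enabled, `temperature` is also fixed at 0 (transcription is a deterministic task).
Tests
- `extract_confident_text`: drops hallucinated segments / keeps confident segments / falls back to `text` when there are no segments / keeps segments whose metrics are missing.
- `whisper_supports_verbose_json`: true only for whisper/groq; false for siliconflow/zhipu.
- `cargo check --lib --tests` passes.
Platform / compatibility
Only `transcribe_chunk` and the constructor parameters change. Behavior is completely unchanged for providers that do not opt in.
PR Type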
Bug fix, Tests
Description
Add verbose_json support to filter hallucinated segments via metadata (no_speech_prob, avg_logprob, compression_ratio)
Gate the feature to only OpenAI/Groq to avoid breaking other providers
Add extract_confident_text function with conservative thresholds
Add unit tests for the new function and provider gating
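The provider gate summarized above might look roughly like this. The function name `whisper_supports_verbose_json` and the provider ids come from this PR, but the body is a guess at the simplest form, not the merged code. `verbose_extra_fields` is a hypothetical helper added only to show how the flag could translate into request fields. Whatever the existing `json` path already sends is deliberately left out.

```rust
/// Only providers documented to return the segment metrics opt in.
fn whisper_supports_verbose_json(provider_id: &str) -> bool {
    matches!(provider_id, "whisper" | "groq")
}

/// Hypothetical helper: the extra form fields added when the gate is
/// open. Gated-off providers get nothing extra, so their existing
/// request stays untouched.
fn verbose_extra_fields(provider_id: &str) -> Vec<(&'static str, &'static str)> {
    if whisper_supports_verbose_json(provider_id) {
        vec![
            ("response_format", "verbose_json"),
            // The PR description pins temperature to 0 when enabled.
            ("temperature", "0"),
        ]
    } else {
        Vec::new()
    }
}
```

Matching on an explicit allow-list means an unknown or newly added provider id defaults to the plain path, so a new backend cannot start receiving `verbose_json` by accident.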
File Walkthrough
whisper.rs
Add verbose_json hallucination filter and tests
openless-all/app/src-tauri/src/asr/whisper.rs
- Add a `verbose_json` boolean field to `WhisperBatchASR`
- Update `transcribe` to conditionally request `response_format=verbose_json` and use the `extract_confident_text` filter
- Add the `extract_confident_text` function to drop hallucinated segments using thresholds (`no_speech_prob`, `avg_logprob`, `compression_ratio`)
coordinator.rs
Gate verbose_json support to whisper/groq providers
openless-all/app/src-tauri/src/coordinator.rs
- Add the `whisper_supports_verbose_json` function to gate the feature only to providers "whisper" and "groq"
- Update `build_qa_asr_start` to pass the flag when constructing `WhisperBatchASR`
dictation.rs
Pass verbose_json flag in dictation session
openless-all/app/src-tauri/src/coordinator/dictation.rs
- Update `begin_session` to pass the `verbose_json` flag when creating `WhisperBatchASR`