[Feature] support and harden native multimodal file handling #865

Merged
dingyi222666 merged 7 commits into ChatLunaLab:v1-dev from yabo083:codex/mimo-multimodal-service
May 19, 2026

Conversation

@yabo083
Contributor

@yabo083 yabo083 commented May 17, 2026

This PR adds native multimodal file handling for MiMo/OpenAI-compatible models and hardens read_files/audio request conversion across adapters.

New Features

  • Detect audio-capable OpenAI-compatible models and expose file handling config with audio/image MIME limits.
  • Convert supported audio_url content into OpenAI-compatible input_audio parts for MiMo and GPT audio models.
  • Inject read_files image, audio, video, and file content through native conversation context when the active model supports it.
  • Transcode unsupported voice/audio inputs to MP3 before native audio injection when FFmpeg is available.

Bug fixes

  • Keep supported OpenAI audio models available instead of filtering every model name containing audio.
  • Prevent JPEG headers from being misdetected as MP3 audio frame sync.
  • Parse HTTP response headers from both Fetch-style headers and plain header objects in read_files.
  • Accept stringified files payloads from tool calls in the inline read_files schema.
  • Check GIF frame upload size per extracted PNG frame and report an error when no frame can be injected.
  • Tell the model when files were read but no conversation id was available for context injection.
  • Fail loudly when an unsupported audio MIME reaches OpenAI input_audio conversion, while preserving size-limit drop warnings.
  • Drop unsupported image/audio parts before OpenAI-compatible requests instead of sending invalid payloads.
  • Keep MiMo v2.5-pro out of MiMo image/audio capability matching because it is text-only.

Other Changes

  • Simplify service-multimodal helpers by merging MIME detection, GIF frame extraction, audio conversion, and content utilities into utils.ts.
  • Inline the read_files schema and remove obsolete helper modules.
  • Add AAC/WebM output filename extensions for native audio injection.
  • Update multimodal README docs for native audio/image/file handling.
  • Validation: yarn lint-fix; git pull --rebase origin v1-dev.

@coderabbitai
Contributor

coderabbitai Bot commented May 17, 2026

Walkthrough

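This PR refactors the multimodal input pipeline: it adds or rewrites the multimedia utils (MIME inference, audio detection and transcoding, GIF frame splitting, image description), rewrites the read_files tool, refactors the injection logic of the audio/image plugins, removes the handling path for the deprecated additional_kwargs.images across several adapters, and injects audio/file handling configuration into the shared adapter and the OpenAI adapters.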

Changes

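Multimodal refactor (main line)

Layer / File(s) Summary
Adapters: deprecated images path
packages/adapter-claude/src/utils.ts, packages/adapter-gemini/src/utils.ts, packages/adapter-ollama/src/utils.ts, packages/adapter-qwen/src/utils.ts
The four adapters no longer process additional_kwargs.images; they only log a deprecation warning and handle media through the image_url/audio_url parts of msg.content instead.
Multimedia utilities and MIME handling
packages/service-multimodal/src/utils.ts
Adds IMAGE_MIME_TYPES, infer/normalizeMimeType, detectAudioMimeType, convertAudioToMp3, GIF frame extraction (parseGifToFrames), readImage, processImageWithModel, message-content construction helpers, and BROWSER_UA.
Audio plugin refactor
packages/service-multimodal/src/plugins/audio.ts
Adds a modelAcceptsAudio gate; resolves sourceUrl with fallbacks (including OneBot private resources); downloads the buffer, detects the MIME type, and transcodes to MP3 when needed; builds the base64 data URL in one place and updates the message/element attributes and injection logic.
Image plugin refactor
packages/service-multimodal/src/plugins/image.ts
Rewrites the img interception: unified image reading, a modelAcceptsImage check, injectGifFrames for per-frame injection, and describeAndInject, which describes images for models without native image support and injects the description.
ReadFiles tool rewrite
packages/service-multimodal/src/plugins/read_files.ts
Promotes the schema to a file-level readFilesSchema and adds NativePart; implements _fetch, _describeImage, parseGifToFrames, ToolReport, checkSize, and others; routes each file by MIME type and model capability to either native injection or an image-model description, and injects the multimodal message when a conversationId exists.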

Shared adapter and OpenAI mapping

Layer / File(s) Summary
Shared adapter audio and file handling
packages/shared-adapter/src/client.ts, packages/shared-adapter/src/utils.ts
Adds DEFAULT_IMAGE_MAX_BASE64_BYTES/DEFAULT_AUDIO_MAX_BASE64_BYTES and refactors the model-name matcher (with a regex exclusion for mimo-v2.5-pro); shared utils gain fetching, encoding, and size limiting for image_url and audio_url, plus helpers that build OpenAI input_audio parts.
OpenAI / OpenAI-like adapter audio capability
packages/adapter-openai/src/client.ts, packages/adapter-openai-like/src/client.ts
refreshModels uses supportAudioInput to decide whether to append ModelCapabilities.AudioInput; _createModel passes getOpenAIFileHandlingConfig(model) as the fileHandlingConfig of ChatLunaChatModel.

Docs and configuration

Layer / File(s) Summary
Docs and plugin inject configuration
packages/service-multimodal/README.md, packages/service-multimodal/src/index.ts
The README becomes the documentation entry point for koishi-plugin-chatluna-multimodal-service; chatluna_storage is removed from inject.optional, leaving only ffmpeg and silk.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • dingyi222666

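"I'm just a bunny who loves to dance 🐇
Busy, busy in the transcoding library, and MIME doesn't fluster me at all,
GIF frames shine bright, image descriptions come out clear,
Audio turns into MP3, and true feeling shows in the messages,
Once it's merged, don't forget to pay me a carrot!"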

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.21% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
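Title check ✅ Passed The pull request title clearly and accurately summarizes the main change: adding native multimodal file handling support, which closely matches the core goal of the changes.
Description check ✅ Passed The pull request description details the new features, bug fixes, and other changes, and is closely tied to the code changes, covering audio detection, file injection, transcoding, and more.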

@yabo083 yabo083 marked this pull request as ready for review May 17, 2026 12:11
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces multimodal support for MiMo models, specifically adding audio and image understanding capabilities. It includes logic for transcoding audio to MP3 using ffmpeg, detecting MIME types from file headers, and handling Base64 data URLs for OpenAI-compatible interfaces. Key feedback points out an inconsistency where the read_files tool lacks Silk audio decoding support compared to the main audio plugin, suggesting a unification of media utilities. Additionally, the getHeaderValue utility needs to be more robust to ensure full case-insensitivity when retrieving HTTP headers from plain objects.
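The header feedback above can be illustrated with a minimal sketch. It assumes that getHeaderValue receives either a Fetch-style object exposing get() or a plain header record; the two type names and the exact signature are hypothetical, since the real helper is not shown in this thread.

```typescript
// Hypothetical shapes: a Fetch-style object exposing get(), or a plain record.
type FetchStyleHeaders = { get(name: string): string | null }
type PlainHeaders = Record<string, string | string[] | undefined>

function isFetchStyle(
    headers: FetchStyleHeaders | PlainHeaders
): headers is FetchStyleHeaders {
    return typeof (headers as FetchStyleHeaders).get === 'function'
}

// Assumed signature; the real getHeaderValue may differ.
function getHeaderValue(
    headers: FetchStyleHeaders | PlainHeaders | undefined,
    name: string
): string | null {
    if (!headers) return null
    // Fetch-style headers already match names case-insensitively.
    if (isFetchStyle(headers)) return headers.get(name)

    // Plain objects keep whatever casing the server or client used,
    // so compare lowercased keys instead of indexing directly.
    const wanted = name.toLowerCase()
    for (const key of Object.keys(headers)) {
        if (key.toLowerCase() !== wanted) continue
        const value = headers[key]
        if (value === undefined) continue
        return Array.isArray(value) ? value.join(', ') : value
    }
    return null
}
```

Comparing lowercased keys means 'Content-Type', 'content-type', and 'CONTENT-TYPE' all resolve to the same entry, which is the case-insensitivity the review asks for.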

Comment thread packages/service-multimodal/src/plugins/read_files.ts Outdated
Comment thread packages/service-multimodal/src/plugins/read_files.ts Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
packages/service-multimodal/src/plugins/read_files.ts (1)

294-298: 💤 Low value

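If buildAudioContent returns an unrecognized type, the audio content is silently dropped.

When audioContent is neither isMessageContentAudio nor type === 'input_audio', nothing is added to the content array. Consider logging a warning so unexpected formats can be debugged.

🔧 Suggested fix: add a warning log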
 if (isMessageContentAudio(audioContent as MessageContentComplex)) {
     content.push(audioContent as MessageContentComplex)
 } else if (audioContent.type === 'input_audio') {
     content.push(audioContent as MessageContentComplex)
+} else {
+    logger.warn(`Unexpected audio content type: ${(audioContent as { type?: string }).type}`)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/plugins/read_files.ts` around lines 294 -
298, The current branch that handles audioContent in the if/else block using
isMessageContentAudio and audioContent.type === 'input_audio' silently drops any
other return type from buildAudioContent; update that block (the conditional
around isMessageContentAudio and audioContent.type checks that pushes into
content) to add a warning log (using the existing logger) when audioContent is
neither a recognized MessageContentAudio nor type === 'input_audio', include the
actual audioContent (or its type/shape) in the log message to aid debugging and
ensure you still skip invalid entries.
packages/service-multimodal/src/audio.ts (1)

31-34: ⚡ Quick win

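The implementation is an exact duplicate of isMimoAudioModel.

isMimoImageModel and isMimoAudioModel have identical bodies, and both use the same mimoModels set. This gives the API clear semantics, but it violates the DRY principle.

Consider one of the following two options:

  1. If MiMo models really do support both audio and images, extract a shared isMimoModel function and make isMimoAudioModel and isMimoImageModel aliases of it or wrappers around it.
  2. If the audio and image model sets might diverge later, add a code comment explaining why they currently share one set.
♻️ Option 1: extract a shared function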
+function isMimoModel(model?: string): boolean {
+    if (!model) return false
+    return mimoModels.has(model.split('/').pop()?.toLowerCase() ?? '')
+}
+
 export function isMimoAudioModel(model?: string): boolean {
-    if (!model) return false
-    return mimoModels.has(model.split('/').pop()?.toLowerCase() ?? '')
+    return isMimoModel(model)
 }
 
 export function isMimoImageModel(model?: string): boolean {
-    if (!model) return false
-    return mimoModels.has(model.split('/').pop()?.toLowerCase() ?? '')
+    return isMimoModel(model)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/audio.ts` around lines 31 - 34, The two
functions isMimoImageModel and isMimoAudioModel are identical and both check
mimoModels; refactor by extracting a single helper isMimoModel(model?: string):
boolean that performs the shared logic (use
model.split('/').pop()?.toLowerCase() and mimoModels.has(...)) and then make
isMimoImageModel and isMimoAudioModel thin wrappers that call isMimoModel (or
export them as aliases), or if you prefer to keep separate sets in future, add a
clear comment above isMimoImageModel/isMimoAudioModel explaining they
intentionally share the mimoModels set today and where to change it later.
packages/service-multimodal/src/media.ts (1)

14-16: 💤 Low value

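The MP3 detection logic can produce false positives.

Checking buffer[0] === 0xff alone is a weak test for MP3, because many binary formats can start with 0xFF. The first byte of an MP3 frame sync is 0xFF, and it should be followed by a specific bit pattern (commonly 0xFB, 0xFA, and so on).

The ID3 tag check offers some protection, but for raw MP3 streams without an ID3 tag this test can misclassify other formats that start with 0xFF.

Suggestion: if this false-positive rate is acceptable in practice, leave it as is; otherwise strengthen the detection, for example by checking the high bits of buffer[1].

♻️ Optional: strengthen MP3 detection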
-    if (header.startsWith('ID3') || buffer[0] === 0xff) {
+    if (header.startsWith('ID3') || 
+        (buffer[0] === 0xff && buffer.length > 1 && (buffer[1] & 0xe0) === 0xe0)) {
         return 'audio/mpeg'
     }

Explanation: the MP3 frame sync word is eleven 1 bits (0xFFE), so checking the top 3 bits of the second byte reduces false positives.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/media.ts` around lines 14 - 16, The current
MP3 detection uses header.startsWith('ID3') || buffer[0] === 0xff which can
false-positive on other formats; update the condition in
packages/service-multimodal/src/media.ts to also validate the second byte's high
bits (e.g., ensure buffer.length > 1 && buffer[0] === 0xFF && (buffer[1] & 0xE0)
=== 0xE0) so the check uses header and a proper MP3 frame-sync test (reference
the existing header and buffer variables in the detection logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/service-multimodal/src/audio.ts`:
- Around line 83-89: Add a concrete MessageContentInputAudio type and a
type-guard, then return that type instead of using a double cast: declare export
type MessageContentInputAudio = { type: 'input_audio'; input_audio: { data:
string } } in packages/core/src/utils/langchain.ts, implement an
isMessageContentInputAudio(value): value is MessageContentInputAudio guard, and
update buildAudioContent (the block using isMimoAudioModel(model)) to construct
and return a MessageContentInputAudio instance directly rather than using "as
unknown as MessageContentComplex".
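A minimal sketch of the type and guard requested in the audio.ts comment above. The type shape is copied from that comment; the guard body is an assumption about how it might be written, not the code that was merged.

```typescript
// Shape taken from the review comment; it may differ from the merged definition.
type MessageContentInputAudio = {
    type: 'input_audio'
    input_audio: { data: string }
}

// Assumed guard: narrows an unknown value without any double cast.
function isMessageContentInputAudio(
    value: unknown
): value is MessageContentInputAudio {
    if (typeof value !== 'object' || value === null) return false
    const part = value as { type?: unknown; input_audio?: unknown }
    if (part.type !== 'input_audio') return false
    const audio = part.input_audio
    return (
        typeof audio === 'object' &&
        audio !== null &&
        typeof (audio as { data?: unknown }).data === 'string'
    )
}
```

With a guard like this, a caller can branch on the part type and let the compiler narrow it, instead of relying on "as unknown as MessageContentComplex".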

In `@packages/service-multimodal/src/index.ts`:
- Around line 103-107: The documentation comment lines under the "MiMo 音频理解" and
"MiMo 图片理解" blocks exceed the eslint max length; locate the long literal strings
containing "MiMo 音频理解" and "MiMo 图片理解" in
packages/service-multimodal/src/index.ts and break them into multiple shorter
string/template-literal lines (or use string concatenation) so no single source
line exceeds 160 characters; preserve the exact wording and punctuation while
splitting at sensible boundaries (clauses or after commas) to keep readability.

In `@packages/service-multimodal/src/media.ts`:
- Around line 8-13: The SILK detection includes a non-standard variant check
(buffer.subarray(1, 10).toString('latin1') === '#!SILK_V3') in addition to
header.startsWith('#!SILK_V3'); add a clear inline comment above these checks
explaining that this offset-1 marker is a non-standard variant observed in a
specific platform/app (name the platform/app where known), why we need to handle
it, and whether it can be removed in the future; make the same explanatory
comment in the isSilkAudio() function where the identical logic appears so both
detection sites (the header.startsWith check and the buffer.subarray(1, 10)
check) document the origin and necessity of the special-case handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 337f9324-0eff-43c9-a0b9-9712ce246abe

📥 Commits

Reviewing files that changed from the base of the PR and between e70df1d and be511e9.

⛔ Files ignored due to path filters (1)
  • packages/service-multimodal/package.json is excluded by !**/*.json
📒 Files selected for processing (9)
  • packages/service-multimodal/README.md
  • packages/service-multimodal/src/audio.ts
  • packages/service-multimodal/src/index.ts
  • packages/service-multimodal/src/media.ts
  • packages/service-multimodal/src/plugins/audio.ts
  • packages/service-multimodal/src/plugins/image.ts
  • packages/service-multimodal/src/plugins/read_files.ts
  • packages/service-multimodal/src/read_files_schema.ts
  • packages/service-multimodal/tests/audio-mimo.test.ts

Comment thread packages/service-multimodal/src/audio.ts Outdated
Comment thread packages/service-multimodal/src/index.ts Outdated
Comment thread packages/service-multimodal/src/media.ts Outdated
yabo083 and others added 2 commits May 18, 2026 13:07
`detectAudioMimeType` checked only `buffer[0] === 0xFF` to identify
MP3 frame sync, but JPEG files also start with 0xFF (FF D8).
This caused every JPEG passed through `read_files` to be injected
into the conversation as `audio/mpeg`, crashing model APIs that
reject unsupported audio formats.

Tighten the check to require the full MPEG sync word:
`buffer[0] === 0xFF && (buffer[1] & 0xE0) === 0xE0`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
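
The tightened check in this commit message can be sketched as a standalone function. The name looksLikeMp3 and the use of Uint8Array are illustrative assumptions; the PR's detectAudioMimeType is not reproduced here.

```typescript
// Illustrative helper, not the PR's detectAudioMimeType.
function looksLikeMp3(bytes: Uint8Array): boolean {
    // "ID3" tag at the start of the file (0x49 0x44 0x33).
    const hasId3 =
        bytes.length >= 3 &&
        bytes[0] === 0x49 &&
        bytes[1] === 0x44 &&
        bytes[2] === 0x33

    // MPEG audio frame sync: 0xFF followed by a byte whose top 3 bits are set.
    const hasFrameSync =
        bytes.length > 1 && bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0

    return hasId3 || hasFrameSync
}

// JPEG starts with FF D8: 0xD8 & 0xE0 is 0xC0, so the sync test fails.
const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0])
// A typical MP3 frame starts with FF FB: 0xFB & 0xE0 is 0xE0, so it passes.
const mp3Header = new Uint8Array([0xff, 0xfb, 0x90, 0x00])

const jpegIsMp3 = looksLikeMp3(jpegHeader) // false
const mp3IsMp3 = looksLikeMp3(mp3Header) // true
```

This is still a heuristic: ADTS AAC streams, for example, start with FF F1 or FF F9 and also satisfy the 11-bit sync test, so a stricter detector would need to inspect more header fields.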
@dingyi222666 dingyi222666 force-pushed the codex/mimo-multimodal-service branch from 5e1232e to 6ac2a42 Compare May 18, 2026 05:09
@dingyi222666 dingyi222666 changed the title feat(service-multimodal): support MiMo audio and image inputs [Feature] support native multimodal file handling May 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/adapter-openai/src/client.ts (1)

85-93: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

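This audio-capability branch is now essentially unreachable.

Lines 71-76 still filter out every model whose name contains audio, so gpt-4o-audio-* / gpt-audio-* are removed before they get here. As a result, neither the new AudioInput capability nor fileHandlingConfig is applied to the automatically fetched OpenAI audio models.

🛠️ Suggested fix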
                 .filter(
                     (model) =>
                         !(
                             model.includes('instruct') ||
-                            [
-                                'whisper',
-                                'tts',
-                                'dall-e',
-                                'audio',
-                                'realtime'
-                            ].some((keyword) => model.includes(keyword))
+                            ['whisper', 'tts', 'dall-e', 'realtime'].some(
+                                (keyword) => model.includes(keyword)
+                            ) ||
+                            (model.includes('audio') &&
+                                !supportAudioInput(model))
                         )
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

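In `@packages/adapter-openai/src/client.ts` around lines 85 - 93, the current
logic filters out every model containing "audio" earlier on, so adding
ModelCapabilities.AudioInput (and fileHandlingConfig) via
supportAudioInput(model) in capabilities never triggers; change two things:
first, adjust or remove the earlier filter that drops models containing "audio"
(do not discard gpt-4o-audio-* / gpt-audio-* in the pre-filter); second, make
sure supportAudioInput(model) correctly recognizes gpt-4o-audio-* and
gpt-audio-* and returns true, so that these automatically fetched OpenAI audio
models are handled correctly when building capabilities (including
ModelCapabilities.AudioInput) and applying fileHandlingConfig.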
🧹 Nitpick comments (1)
packages/service-multimodal/src/utils.ts (1)

68-78: 💤 Low value

Prettier formatting issue: remove the redundant parentheses.

The static analysis tool reports redundant parentheses on line 74.

🔧 Suggested fix
-        return dot < 0
-            ? null
-            : (FILE_EXTENSION_TO_MIME_TYPE[path.slice(dot)] ?? null)
+        return dot < 0
+            ? null
+            : FILE_EXTENSION_TO_MIME_TYPE[path.slice(dot)] ?? null
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/utils.ts` around lines 68 - 78, The Prettier
warning flags an unnecessary pair of parentheses in inferMimeTypeFromUrl; update
the return expression in that function to remove the extra parentheses around
the nullish-coalescing lookup so it directly returns
FILE_EXTENSION_TO_MIME_TYPE[path.slice(dot)] ?? null (referencing
inferMimeTypeFromUrl and FILE_EXTENSION_TO_MIME_TYPE to locate the code),
keeping the rest of the try/catch logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/service-multimodal/src/plugins/audio.ts`:
- Around line 27-34: MIME_TO_EXT is missing mappings for formats present in
NATIVE_AUDIO_MIMES (specifically audio/aac and audio/webm), causing a fallback
to 'mp3' and mismatched filenames; update MIME_TO_EXT to include 'audio/aac' =>
'aac' and 'audio/webm' => 'webm' so the code that looks up MIME_TO_EXT (used
when determining output filenames/extensions) produces the correct extensions
rather than defaulting to 'mp3'.

In `@packages/service-multimodal/src/plugins/read_files.ts`:
- Around line 112-116: The current MIME selection assigns mime to detectedAudio
even when detectedAudio is null, causing valid declared audio types (e.g.,
audio/wav) to be lost; in the MIME resolution logic in read_files.ts (variables
declared, detectedAudio, and the call to detectAudioMimeType), change the
selection to prefer detectedAudio when it is non-null/defined, otherwise fall
back to declared (and only treat it as audio if declared?.startsWith('audio/'));
ensure mime is never set to null so downstream code won’t hit "Could not
determine MIME type".
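- Around line 314-332: In the _fetch method, response.headers is not a Headers
instance, so content-type cannot be read with .get() and contentType is always
null; change _fetch (which uses the response returned by this.ctx.http) to read
the value directly via response.headers['content-type'] (or
response.headers['Content-Type']) with defensive checks (handle both lowercase
and mixed case, and a possibly undefined value), assign it to contentType, and
return the correct Buffer and contentType; make sure response.headers is no
longer asserted as Headers, and keep the existing timeout/headers configuration.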
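
The MIME selection described in the first read_files.ts item above (around lines 112-116) could look roughly like the following. The function name pickMime and the returned shape are hypothetical; only the declared and detectedAudio inputs come from the review comment.

```typescript
// Hypothetical helper: prefer the sniffed audio type, else the declared type.
function pickMime(
    declared: string | null,
    detectedAudio: string | null
): { mime: string | null; isAudio: boolean } {
    // Use the detected type only when detection actually found something,
    // so a declared type such as audio/wav is not overwritten by null.
    const mime = detectedAudio ?? declared
    const isAudio =
        detectedAudio !== null || (declared?.startsWith('audio/') ?? false)
    return { mime, isAudio }
}
```

The result is null only when both inputs are null, which is the one case where a "Could not determine MIME type" error is appropriate.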

In `@packages/shared-adapter/src/utils.ts`:
- Around line 387-397: The current code handles audio content with try { return
await fetchAudioContentPart(plugin, content) } catch { return null }, which
silently swallows every exception and makes lost audio hard to diagnose; change
the isMessageContentAudio branch so the catch does not simply return null; use
catch (err) { logger.error(`Failed to fetch audio part for model
${normalizedModel}`, err); throw err } instead, and keep an explicit check for
fetchAudioContentPart returning null (if it returns null, log a clear
warning/error via logger.warn/logger.error and either return null as expected
or return an explicit error marker) so callers can see why it failed; symbols
involved: isMessageContentAudio, supportsAudio, fetchAudioContentPart,
logger.warn.
- Around line 696-698: The function audioMimeToFormat currently falls back to
'mp3' for unknown MIME types (audioMimeToFormat and AUDIO_MIME_TO_FORMAT), which
can produce an incorrect input_audio.format; change it to validate
mime.toLowerCase() against AUDIO_MIME_TO_FORMAT and throw a clear, explicit
Error (including the unsupported mime value) when there is no mapping instead of
returning 'mp3' so callers fail fast and avoid sending mismatched format/bytes
to the OpenAI API.
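
A fail-fast version of the mapping described in the last item might look like this. The table contents are an assumption (the real AUDIO_MIME_TO_FORMAT is not shown in this thread), limited here to the wav and mp3 formats.

```typescript
type InputAudioFormat = 'mp3' | 'wav'

// Assumed table contents; the real AUDIO_MIME_TO_FORMAT may list other MIME types.
const AUDIO_MIME_TO_FORMAT: Record<string, InputAudioFormat> = {
    'audio/mpeg': 'mp3',
    'audio/mp3': 'mp3',
    'audio/wav': 'wav',
    'audio/x-wav': 'wav'
}

function audioMimeToFormat(mime: string): InputAudioFormat {
    const format = AUDIO_MIME_TO_FORMAT[mime.toLowerCase()]
    if (!format) {
        // Throw instead of guessing 'mp3', so mismatched bytes never reach the API.
        throw new Error(`Unsupported audio MIME type for input_audio: ${mime}`)
    }
    return format
}
```

Callers that want to keep a request alive when one attachment is unsupported would catch this error and drop that part with a warning, rather than relying on a silent default.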

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a9729694-da41-4a1c-b630-a282324ed506

📥 Commits

Reviewing files that changed from the base of the PR and between 5e1232e and 8819b13.

📒 Files selected for processing (14)
  • packages/adapter-claude/src/utils.ts
  • packages/adapter-gemini/src/utils.ts
  • packages/adapter-ollama/src/utils.ts
  • packages/adapter-openai-like/src/client.ts
  • packages/adapter-openai/src/client.ts
  • packages/adapter-qwen/src/utils.ts
  • packages/service-multimodal/README.md
  • packages/service-multimodal/src/index.ts
  • packages/service-multimodal/src/plugins/audio.ts
  • packages/service-multimodal/src/plugins/image.ts
  • packages/service-multimodal/src/plugins/read_files.ts
  • packages/service-multimodal/src/utils.ts
  • packages/shared-adapter/src/client.ts
  • packages/shared-adapter/src/utils.ts

Comment thread packages/service-multimodal/src/plugins/audio.ts
Comment thread packages/service-multimodal/src/plugins/read_files.ts Outdated
Comment thread packages/service-multimodal/src/plugins/read_files.ts
Comment thread packages/shared-adapter/src/utils.ts Outdated
Comment thread packages/shared-adapter/src/utils.ts
@dingyi222666 dingyi222666 changed the title [Feature] support native multimodal file handling [Feature] support and harden native multimodal handling May 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
packages/service-multimodal/src/utils.ts (2)

80-84: 💤 Low value

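The function name normalizeMimeType conflicts with the coding guidelines.

The coding guidelines explicitly ban function names that follow the normalizeXXX pattern. This function is a single line of logic and can be inlined at the call sites:

const mime = raw?.split(';')[0]?.trim()?.toLowerCase() || null

If it is called in several places and reuse is really needed, use a more concise name such as cleanMime or baseMime.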

As per coding guidelines: "Do NOT create resolveXXX, normalizeXXX, ensureXXX, toSafeXXX functions—these are banned patterns".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/utils.ts` around lines 80 - 84, The function
normalizeMimeType violates the naming rule banning normalizeXXX functions;
either inline its one-line logic at call sites (replace uses with
raw?.split(';')[0]?.trim()?.toLowerCase() || null) or rename and export it to an
approved shorter name (e.g., cleanMime or baseMime) and update all references to
that symbol (normalizeMimeType) to the new name, preserving the signature
(string | null) and export. Ensure you update any imports/usages across the
codebase and tests to reference the new symbol or the inlined expression.

362-369: 💤 Low value

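The function name ensureContentArray conflicts with the coding guidelines.

The coding guidelines ban the ensureXXX pattern. This function has more than 5 lines of logic and is called in several places (such as audio.ts), so it qualifies for extraction, but consider a more descriptive name such as toContentArray, or simply contentAsArray.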

As per coding guidelines: "Do NOT create resolveXXX, normalizeXXX, ensureXXX, toSafeXXX functions—these are banned patterns".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/service-multimodal/src/utils.ts` around lines 362 - 369, The
function named ensureContentArray violates the banned `ensureXXX` naming
pattern; rename the function to a descriptive allowed name (e.g., toContentArray
or contentAsArray) while keeping the same signature (message: Message,
fallbackText = '') and preserving its behavior, then update all call sites (for
example in audio.ts and any other imports/exports) to use the new name and
adjust exports if necessary so builds/tests pick up the rename.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 99ad8af7-be73-4bf7-b18f-ea58308c6665

📥 Commits

Reviewing files that changed from the base of the PR and between 8819b13 and e00ea56.

📒 Files selected for processing (5)
  • packages/adapter-openai/src/client.ts
  • packages/service-multimodal/src/plugins/audio.ts
  • packages/service-multimodal/src/plugins/read_files.ts
  • packages/service-multimodal/src/utils.ts
  • packages/shared-adapter/src/utils.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/adapter-openai/src/client.ts
  • packages/shared-adapter/src/utils.ts
  • packages/service-multimodal/src/plugins/read_files.ts

@dingyi222666
Member

Find some time to test this in your local environment with yarn fast-build. If everything works, let me know and I'll merge.

@yabo083
Contributor Author

yabo083 commented May 18, 2026

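Local test feedback

I deployed and tested this on a local Koishi instance (built with yarn fast-build, then swapped into node_modules) and found that the read_files tool does not work.

Symptoms

Schema validation of the read_files files parameter always fails. The LLM (openai-like/gpt-5-mini) passes files as a JSON string rather than a JSON object/array:

// Actually passed (string)
{"files": "{\"url\": \"https://multimedia.nt.qq.com.cn/...\"}"}

// Expected (object or array)
{"files": {"url": "https://multimedia.nt.qq.com.cn/..."}}
{"files": [{"url": "https://multimedia.nt.qq.com.cn/..."}]}

The model retried repeatedly (single object, array, file:// URI, and other formats), and every attempt was rejected for not matching the schema. The agent eventually gave up and fell back to downloading manually with bash curl + ffmpeg, but that could not complete audio understanding either.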

Test environment

  • Koishi Desktop (Linux, /root/.koishi/data/instances/default)
  • adapter: koishi-plugin-adapter-onebot (QQ)
  • model: openai-like/gpt-5-mini + mimo/mimo-v2.5
  • ffmpeg + silk both enabled
  • multimodal-service config: enableAudioFfmpegConversion: true, enableImageReadTool: true, enableFileReadTool: true

Key log excerpts

agent tool call: {"tool": "read_files", "toolInput": {"files": "[{\"url\": \"https://multimedia.nt.qq.com.cn/...\"}]"}}
→ schema validation failed (files received a string instead of an object/array)

agent tool call: {"tool": "read_files", "toolInput": {"files": "{\"url\": \"https://...\"}"}}
→ same failure

agent reasoning: "The read_files tool keeps failing with schema errors.
It seems like the tool is receiving the JSON as a string rather than as a parsed object."

I have rolled back to the pre-deployment version, and the instance is back to normal.

Suggestion

The read_files files parameter may need compatibility parsing for string-typed JSON (typeof files === 'string' ? JSON.parse(files) : files), or the tool schema registration should be checked to confirm the parameter types are passed to the LLM correctly.
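
A minimal sketch of the compatibility parsing suggested above. The FileRef shape (an object with a url string) is inferred from the payloads in this comment, and parseFilesInput is a hypothetical name; the inline schema that was eventually merged may handle this differently.

```typescript
// Inferred from the payloads above; the real schema may carry more fields.
type FileRef = { url: string }

// Hypothetical helper that accepts an object, an array, or either one stringified.
function parseFilesInput(files: unknown): FileRef[] {
    // Some models send the argument as a JSON string; parse it first.
    // JSON.parse throws a SyntaxError if the string is not valid JSON.
    const parsed: unknown =
        typeof files === 'string' ? JSON.parse(files) : files

    // Accept both a single object and an array of objects.
    const list: unknown[] = Array.isArray(parsed) ? parsed : [parsed]

    return list.map((item) => {
        if (
            typeof item !== 'object' ||
            item === null ||
            typeof (item as { url?: unknown }).url !== 'string'
        ) {
            throw new Error('Each files entry must be an object with a string url')
        }
        return { url: (item as FileRef).url }
    })
}
```

Validating after parsing keeps the error message useful for the model: a malformed entry is rejected with a specific reason instead of a generic schema mismatch.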

@yabo083
Contributor Author

yabo083 commented May 18, 2026

Your call, boss, meow. After rolling back to my own hacked-up version, it works again.
[image]

@dingyi222666
Member

Done, try again.

@dingyi222666
Member

How can a model be this dumb?

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/service-multimodal/src/plugins/read_files.ts`:
- Around line 308-333: The code reports "file content has been added to the
conversation context" whenever native.length > 0 even if conversationId is
missing; fix by basing the success note on the actual injection outcome: compute
an injected boolean (e.g., injected = native.length > 0 && conversationId) and
use that to choose the note message, or enforce conversationId as a precondition
before attempting injection (wrap both the inject call and the success message
behind the same conversationId check); update references to native,
conversationId, this.ctx.chatluna.contextManager.inject, buildMultimodalMessage
and the JSON.stringify return so the note accurately reflects whether injection
occurred.
- Around line 225-249: When splitting GIFs into frames (parseGifToFrames) we
currently warn+break when remaining frames would exceed maxTotal but silently
drop the whole GIF if no frames were pushed; fix by detecting whether any frames
from that GIF were successfully pushed (track a local counter before/inside the
for-loop that calls pushNative) and if zero frames were pushed, record a failure
for that sourceUrl in the report (use the existing failure mechanism: call the
project’s failure helper such as pushFailure/report.files append with an error
entry or a dedicated pushFileFailure function) with a clear message like "GIF
frames exceed total size limit", instead of just warning, so the URL appears in
report.files as failed.
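
The first item above (basing the note on the real injection outcome) can be sketched as follows. The function name injectionNote and the message wording are assumptions; only native.length and conversationId come from the review comment.

```typescript
// Hypothetical helper; the note texts are illustrative, not the tool's real output.
function injectionNote(
    nativeCount: number,
    conversationId: string | undefined
): string {
    // Injection only happens when there is something to inject
    // and a conversation to inject it into.
    const injected = nativeCount > 0 && Boolean(conversationId)

    if (injected) {
        return 'File content has been added to the conversation context.'
    }
    if (nativeCount > 0) {
        return (
            'Files were read, but no conversation id was available, ' +
            'so nothing was added to the conversation context.'
        )
    }
    return 'No file content was available for injection.'
}
```

Computing injected once and reusing it for both the inject call and the note keeps the two from drifting apart, which is the inconsistency the review points out.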

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f393673b-2a86-4c83-bce8-9c6d3b3f2767

📥 Commits

Reviewing files that changed from the base of the PR and between e00ea56 and bf9a7db.

📒 Files selected for processing (1)
  • packages/service-multimodal/src/plugins/read_files.ts

Comment thread packages/service-multimodal/src/plugins/read_files.ts
Comment thread packages/service-multimodal/src/plugins/read_files.ts Outdated
@dingyi222666 dingyi222666 force-pushed the codex/mimo-multimodal-service branch from bf9a7db to 64d3eef Compare May 18, 2026 21:59
@dingyi222666 dingyi222666 changed the title [Feature] support and harden native multimodal handling [Feature] support and harden native multimodal file handling May 18, 2026
@dingyi222666
Member

Are there still any problems in testing? If not, I'll merge tomorrow.

@yabo083
Contributor Author

yabo083 commented May 19, 2026

One moment.

@yabo083
Contributor Author

yabo083 commented May 19, 2026

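Production testing passed!
Images, audio, and files can all be read by the model correctly
(except that, for some unknown reason, a single message occasionally gets sent many times over).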

@dingyi222666 dingyi222666 merged commit 4a93ee0 into ChatLunaLab:v1-dev May 19, 2026
2 checks passed