
extract images of attachments uploaded during conversations#3217

Open
yzAiden wants to merge 10 commits into ModelEngine-Group:develop from yzAiden:file_processing

Conversation

yzAiden commented Jun 11, 2026

Contributor

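Fix description: the return type changed, so a variable that previously received a single value raised an error. The fix unpacks the returned value.
Before the fix:
505ddf5f8e83eb4e8a8ae222549c6bc2

56e5d7c221960cbebb0933f36fac660e

After the fix:
db8ef6fe9de4100571140f5c9f8703cd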

yzAiden requested review from Dallas98 and WMC001 as code owners June 11, 2026 02:33
yzAiden commented Jun 15, 2026

Contributor Author

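Feature: extract the images from attachments uploaded during a conversation, and output the images the user asks for.

Implementation approach:
1. Reuse the existing image extraction feature.
2. In the file analysis tool, identify the file type and obtain the image information returned by the SDK layer. The images themselves are stored in MinIO; the image metadata is merged with the plain text and passed to the LLM together for analysis.
3. The LLM extracts the image URLs and analyzes them with the image analysis tool, and the conversation finally outputs the images the user specified. The Sources section can show the image metadata, and the Images section can show the extracted images.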
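Step 2 above (merging image metadata with the plain text before it reaches the LLM) could look roughly like the sketch below. The function name `merge_text_and_image_metadata` and the `url` / `caption` keys are hypothetical and chosen for illustration; the PR's actual data layout is not shown in this conversation.

```python
from typing import Dict, List


def merge_text_and_image_metadata(text: str, images: List[Dict[str, str]]) -> str:
    """Append one metadata line per extracted image to the plain text.

    Hypothetical helper: the real PR may use different keys and formatting.
    """
    lines = [text.strip()]
    for idx, image in enumerate(images, start=1):
        # Missing keys fall back to an empty string instead of raising.
        url = image.get("url", "")
        caption = image.get("caption", "")
        lines.append(f"[image {idx}] url={url} caption={caption}")
    return "\n".join(lines)
```

Keeping the URL on the same line as the caption gives the LLM a single place to read the image reference from when it later calls the image analysis tool.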

Demo of the result:
7f62f5dddc1d96cf53721fcdf5aa7adc

bb39083ea9e3cbaac6b3e18e8ec2486b ffc78b04b6a8904044f4682d637783f1

yzAiden changed the title from adapt_to_return_type_change to extract images of attachments uploaded during conversations Jun 15, 2026
user_prompt=user_prompt
)
return result.content, truncation_percentage

Member

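The parameter of detect_file_type is declared as file_bytes: bytes, but the single_file passed in at the call site may be a URL string. bytes.startswith() will throw AttributeError on a string. Please confirm that the call chain always passes bytes, or add a type check.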
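A type check of the kind requested here might look like the following sketch. The body of `detect_file_type` is not shown in this conversation, so the magic-byte checks below are assumptions for illustration only. One detail worth noting: if the real function calls `startswith` with a bytes prefix, passing a `str` fails in CPython with `TypeError` rather than `AttributeError`, so an explicit guard with a clear message is useful either way.

```python
def detect_file_type(file_bytes: bytes) -> str:
    """Guess a file type from leading magic bytes (illustrative sketch only)."""
    # Reject URL strings and other non-bytes input up front with a clear message.
    if not isinstance(file_bytes, (bytes, bytearray)):
        raise TypeError(
            f"detect_file_type expects bytes, got {type(file_bytes).__name__}"
        )
    if file_bytes.startswith(b"%PDF-"):
        return "pdf"
    # DOCX, XLSX and PPTX are ZIP containers and share this signature.
    if file_bytes.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"
```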

for idx, img_data in enumerate(images_chunks):
    if not isinstance(img_data, dict):
        logger.warning(f"Skipping image entry at index {idx}: unexpected type {type(img_data)}")
        continue

Member

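There is no error handling when images are uploaded to MinIO. If an upload fails, the whole file-processing flow is interrupted and the user does not even get the text content. Suggest wrapping the upload in try/except, skipping the failed image and logging a warning, so that the text content remains available.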
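The suggestion could be sketched as below. The `upload` callable is a hypothetical stand-in for whatever MinIO upload helper the PR uses, and the function name `upload_images_best_effort` is invented for this example.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def upload_images_best_effort(
    images_chunks: List[Any],
    upload: Callable[[Dict[str, Any]], str],
) -> List[str]:
    """Upload each image; skip and log failures instead of aborting."""
    urls: List[str] = []
    for idx, img_data in enumerate(images_chunks):
        if not isinstance(img_data, dict):
            logger.warning("Skipping image entry at index %d: unexpected type %s", idx, type(img_data))
            continue
        try:
            # `upload` is assumed to return the stored object's URL.
            urls.append(upload(img_data))
        except Exception as exc:
            # A failed image must not prevent the text content from being returned.
            logger.warning("Skipping image at index %d: upload failed (%s)", idx, exc)
    return urls
```

Catching `Exception` broadly is a deliberate choice in this sketch, since any single upload failure should degrade to "no image" rather than "no result"; a real implementation might narrow it to the storage client's own error types.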

"task_id": None,
"filename": filename,
"text": full_text.strip(),
"images_info": [images_list_urls, image_info],

Member

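chunks_count also counts the image entries (len(text_chunks) + len(images_chunks)), but downstream consumers may assume that chunks_count only represents the number of text chunks. Suggest splitting it into text_chunks_count and image_chunks_count.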
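A minimal sketch of the proposed split, assuming the result is a plain dict like the one in the diff above. The helper name `summarize_chunk_counts` is hypothetical.

```python
from typing import Any, Dict, List


def summarize_chunk_counts(
    text_chunks: List[Any], images_chunks: List[Any]
) -> Dict[str, int]:
    """Report text and image chunk counts separately (illustrative only)."""
    return {
        # Only text segments, which is what downstream consumers may expect.
        "text_chunks_count": len(text_chunks),
        # Image entries are counted on their own.
        "image_chunks_count": len(images_chunks),
    }
```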


async def convert_office_to_pdf_impl(self, object_name: str, pdf_object_name: str) -> None:
"""Full conversion pipeline: download -> convert -> upload -> validate -> cleanup.
"""Full conversion pipeline: download → convert → upload → validate → cleanup.

Contributor

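[Code style] chunks, _ = data_process.file_process(...) uses _ to discard the images_info return value. If the image information in uploaded files needs to be handled later, suggest assigning the return value to a meaningfully named variable (such as images_info) and adding a comment that explains why the value is currently ignored.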
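The naming suggestion can be illustrated as follows. `file_process_stub` is a hypothetical stand-in for `data_process.file_process`, whose real signature is not shown here; only the two-value return shape is taken from the comment above.

```python
from typing import Dict, List, Tuple


def file_process_stub(data: bytes) -> Tuple[List[str], List[Dict[str, str]]]:
    """Hypothetical stand-in that returns (chunks, images_info)."""
    chunks = [data.decode("utf-8", errors="replace")]
    images_info: List[Dict[str, str]] = []
    return chunks, images_info


# images_info is not consumed by this caller yet; it is kept under a
# descriptive name so that later image handling has an obvious entry point.
chunks, images_info = file_process_stub(b"hello")
```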

WMC001 commented Jun 24, 2026

Contributor

The image extraction logic from PDF files is a useful addition. Please ensure edge cases (corrupted images, unsupported formats) are handled gracefully and covered by tests.


4 participants