extract images of attachments uploaded during conversations #3217
yzAiden wants to merge 10 commits into
| user_prompt=user_prompt | ||
| ) | ||
| return result.content, truncation_percentage | ||
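The parameter of detect_file_type is declared as file_bytes: bytes, but the single_file passed in at the call site may be a URL string. Calling startswith() with a bytes prefix on a str raises TypeError. Please confirm that the call chain always passes bytes, or add a type check.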
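One possible shape for the type check, as a minimal sketch. The helper name ensure_bytes is hypothetical, and the detect_file_type body below is a stand-in that only checks the PDF magic number; the real function in this PR may look different.

```python
from typing import Union


def ensure_bytes(single_file: Union[bytes, bytearray, str]) -> bytes:
    """Hypothetical guard: accept only raw bytes, fail loudly on anything else."""
    if isinstance(single_file, (bytes, bytearray)):
        return bytes(single_file)
    raise TypeError(
        f"expected bytes, got {type(single_file).__name__}; "
        "download URL inputs before detecting the file type"
    )


def detect_file_type(file_bytes: bytes) -> str:
    # Stand-in for the PR's function: validate first, then sniff the header.
    file_bytes = ensure_bytes(file_bytes)
    if file_bytes.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"
```

Failing early with an explicit message makes a URL string slipping through the call chain easier to diagnose than an error raised from inside the sniffing logic.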
| for idx, img_data in enumerate(images_chunks): | ||
| if not isinstance(img_data, dict): | ||
| logger.warning(f"Skipping image entry at index {idx}: unexpected type {type(img_data)}") | ||
| continue |
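There is no error handling when images are uploaded to MinIO. If an upload fails, the whole file-processing flow is aborted and the user does not even get the text content. Suggest wrapping the upload in try/except, skipping failed images with a logged warning, so that the text content stays available.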
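A rough sketch of the best-effort loop, assuming the upload step can be passed in as a callable. The name upload_images_best_effort and the upload parameter are hypothetical; the actual MinIO call in this PR is not shown here.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def upload_images_best_effort(
    images_chunks: List[Any],
    upload: Callable[[Dict[str, Any]], str],  # hypothetical: uploads one image, returns its URL
) -> List[str]:
    """Upload each image; skip and log failures instead of aborting the whole file."""
    urls: List[str] = []
    for idx, img_data in enumerate(images_chunks):
        if not isinstance(img_data, dict):
            logger.warning("Skipping image entry at index %d: unexpected type %s", idx, type(img_data))
            continue
        try:
            urls.append(upload(img_data))
        except Exception as exc:  # broad on purpose: one bad image must not lose the text
            logger.warning("Skipping image at index %d: upload failed: %s", idx, exc)
    return urls
```

With this shape the text result can be returned regardless of how many images failed, and the returned list contains only the URLs that were actually stored.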
| "task_id": None, | ||
| "filename": filename, | ||
| "text": full_text.strip(), | ||
| "images_info": [images_list_urls, image_info], |
chunks_count also counts the image entries (len(text_chunks) + len(images_chunks)), but downstream consumers may assume chunks_count is the number of text segments only. Suggest splitting it into text_chunks_count and image_chunks_count.
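A small sketch of the split. The function name build_chunk_counts is hypothetical, and keeping the combined chunks_count alongside the new fields is an assumption made here for backward compatibility, not something the PR requires.

```python
from typing import Any, Dict, List


def build_chunk_counts(text_chunks: List[Any], images_chunks: List[Any]) -> Dict[str, int]:
    """Report text and image counts separately so consumers cannot confuse them."""
    text_count = len(text_chunks)
    image_count = len(images_chunks)
    return {
        "text_chunks_count": text_count,
        "image_chunks_count": image_count,
        # Assumption: keep the old combined field until existing consumers migrate.
        "chunks_count": text_count + image_count,
    }
```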
| async def convert_office_to_pdf_impl(self, object_name: str, pdf_object_name: str) -> None: | ||
| """Full conversion pipeline: download -> convert -> upload -> validate -> cleanup. | ||
| """Full conversion pipeline: download → convert → upload → validate → cleanup. |
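[Code style] chunks, _ = data_process.file_process(...) uses _ to discard the images_info return value. If image information from uploaded files needs to be handled later, suggest assigning the return value to a meaningful variable name (e.g. images_info) and adding a comment explaining why it is currently ignored.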
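For illustration, the unpacking could look roughly like this. The file_process below is a stand-in with an assumed (chunks, images_info) return shape, and extract_text is a hypothetical caller; neither is taken from the PR.

```python
from typing import Any, Dict, List, Tuple


def file_process(file_data: bytes) -> Tuple[List[Dict[str, Any]], List[Any]]:
    # Stand-in for data_process.file_process: returns (chunks, images_info).
    chunks = [{"content": file_data.decode("utf-8", errors="replace")}]
    images_info: List[Any] = []
    return chunks, images_info


def extract_text(file_data: bytes) -> str:
    # images_info is unpacked under a descriptive name but intentionally unused:
    # this path only needs the text for now; image handling can be added later.
    chunks, images_info = file_process(file_data)
    return "\n".join(chunk["content"] for chunk in chunks)
```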
The image extraction logic from PDF files is a useful addition. Please ensure edge cases (corrupted images, unsupported formats) are handled gracefully and covered by tests.



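Fix description: because the return type changed, variables that previously received a single value raised an error; this was fixed by unpacking the return value.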

Before the fix:
After the fix:
