
add MinerU cloud API support #191

Open
technoadnan wants to merge 1 commit into THU-MAIC:main from technoadnan:feat/mineru-cloud

Conversation

@technoadnan

@technoadnan technoadnan commented Mar 21, 2026

Summary

When https://mineru.net was configured as the MinerU base URL, requests were incorrectly routed to the self-hosted code path (POST /file_parse), causing 413 errors. The PDF would then silently fall back to unpdf instead of using MinerU cloud. This happens because the cloud API follows a different flow from the self-hosted endpoint: upload, then async polling, then ZIP download.
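For readers unfamiliar with the async polling step, here is a minimal sketch of what such a loop can look like. It is not the code in this PR: the `TaskStatus` states, the `pollUntilDone` name, and the option names are all hypothetical, and the status check is passed in as a function so the sketch makes no claim about the real MinerU response format.

```typescript
// Hypothetical task states; the real MinerU cloud API may use different names.
type TaskStatus =
  | { state: "pending" }
  | { state: "done"; zipUrl: string }
  | { state: "failed"; message: string };

interface PollOptions {
  intervalMs: number; // delay between status checks
  maxAttempts: number; // give up after this many checks
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls checkStatus until the task finishes, fails, or the attempts run out.
// Resolves with the URL of the result ZIP on success.
async function pollUntilDone(
  checkStatus: () => Promise<TaskStatus>,
  { intervalMs, maxAttempts, sleep = defaultSleep }: PollOptions,
): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await checkStatus();
    if (status.state === "done") return status.zipUrl;
    if (status.state === "failed") {
      throw new Error(`MinerU task failed: ${status.message}`);
    }
    // Still pending: wait before the next check, unless this was the last one.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`MinerU task still pending after ${maxAttempts} checks`);
}
```

Bounding the loop with `maxAttempts` means a stuck task surfaces as an explicit error rather than a silent fallback.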

Changes

  • lib/pdf/pdf-providers.ts — detect mineru.net base URL and route to cloud v4 path
  • lib/pdf/mineru-cloud.ts — new file handling the full cloud v4 flow (upload → poll → ZIP parse)
  • lib/pdf/mineru-parser.ts — new file normalizing MinerU output into ParsedPdfContent, shared by both self-hosted and cloud paths
  • lib/pdf/types.ts — added mineruModelVersion field to PDFParserConfig to support switching between vlm and pipeline model versions
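As a rough illustration of the base URL detection described in the first bullet, the sketch below routes on the hostname only. The names `selectMineruRoute` and `MineruRoute` are hypothetical and the actual check in lib/pdf/pdf-providers.ts may differ. Treating subdomains of mineru.net as cloud is also an assumption.

```typescript
type MineruRoute = "cloud-v4" | "self-hosted";

// Assumption: the cloud service can be identified by its hostname alone.
const MINERU_CLOUD_HOST = "mineru.net";

function selectMineruRoute(baseUrl: string): MineruRoute {
  let hostname: string;
  try {
    hostname = new URL(baseUrl).hostname.toLowerCase();
  } catch {
    // Not a parseable absolute URL: leave it to the existing self-hosted path.
    return "self-hosted";
  }
  // Match the cloud host itself or any subdomain of it (assumption).
  const isCloud =
    hostname === MINERU_CLOUD_HOST ||
    hostname.endsWith(`.${MINERU_CLOUD_HOST}`);
  return isCloud ? "cloud-v4" : "self-hosted";
}
```

Because only the hostname is compared, a full endpoint URL such as https://mineru.net/api/v4/extract/task would still be routed to the cloud path, while any other host falls through to the self-hosted path.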

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)

Verification

Steps to reproduce the tests:

  1. Set MinerU base URL to https://mineru.net with a valid API token
  2. Upload a PDF
  3. Observe logs show cloud v4 routing, presigned upload, and successful parse
  4. Verify self-hosted deployments with other base URLs work unchanged

Proof

Before

image image

After

image

What you personally verified:

  • https://mineru.net correctly routes to cloud v4
  • Self-hosted URLs fall through to the original code path unchanged
  • tsc --noEmit passes cleanly

Checklist

  • I have performed a self-review of my code
  • My changes do not introduce new warnings

@fubaobao2023

Do you mean that https://mineru.net is the MinerU base URL? But
image
image
I configured it this way, and when I upload a PDF from the frontend it still cannot be parsed. This is what appears:
ddc6f0598c2054e7b43ca5de793e52a1

@fubaobao2023

Couldn't the MinerU setting show a standard base URL as a guide? Making users guess is not great. Other API integrations all state what the API address should be.

@fubaobao2023

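Also, MinerU's official documentation says:
image
That is, https://mineru.net/api/v4/extract/task or https://mineru.net/api/v4/file-urls/batch. But even if I enter either of these two official API URLs, enter the key, and click Test in OpenMAIC, the test still shows as passing, yet uploading a PDF still fails to parse and returns an error. This interface really needs to be improved.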

@fubaobao2023

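> Do you mean that https://mineru.net is the MinerU base URL? I configured it this way, and when I upload a PDF from the frontend it still cannot be parsed. This is what appears: image image ddc6f0598c2054e7b43ca5de793e52a1

I just tested the failure shown here. Whether I use unpdf or MinerU, the same failure is returned. This needs further investigation on your side.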

@fubaobao2023

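image
Does the frontend limit the PDF file size? Could the API settings let users choose according to which interface is being called, so that the setup is foolproof? Here is my curl: [新建文本文档.txt](https://github.com/user-attachments/files/26228793/default.txt)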

@fubaobao2023

Feedback from MinerU's official technical support: the interface code has a problem.
image

@fubaobao2023

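Core problem confirmed:
OpenMAIC calls the self-hosted MinerU API (/file_parse, which takes a file upload), whereas the official MinerU cloud API requires a url parameter.

Solution: the OpenMAIC code needs to be changed to work with the official API.
So the MinerU settings should add an option to choose between the cloud API and a self-hosted deployment.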
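To make the request concrete, an explicit mode setting could look roughly like the sketch below. Everything here is hypothetical: `MineruSettings`, `firstRequest`, and the field names are illustrative and not OpenMAIC's actual config. The cloud endpoint is the one quoted from the MinerU docs earlier in this thread, and using POST for it is an assumption.

```typescript
// Hypothetical settings shape: the user picks a mode instead of guessing a URL.
type MineruSettings =
  | { mode: "cloud"; apiKey: string }
  | { mode: "self-hosted"; baseUrl: string; apiKey?: string };

interface PlannedRequest {
  method: "POST";
  url: string;
}

// Returns the first request each mode would make, to show how they diverge.
function firstRequest(settings: MineruSettings): PlannedRequest {
  switch (settings.mode) {
    case "cloud":
      // Fixed cloud host, so the user only supplies an API key.
      // Endpoint taken from this thread; the POST method is an assumption.
      return {
        method: "POST",
        url: "https://mineru.net/api/v4/file-urls/batch",
      };
    case "self-hosted": {
      // Strip trailing slashes so the joined path has exactly one separator.
      const base = settings.baseUrl.replace(/\/+$/, "");
      return { method: "POST", url: `${base}/file_parse` };
    }
  }
}
```

With a discriminated union like this, the base URL field only exists in self-hosted mode, so a cloud user is never asked to type one.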

@fubaobao2023

@wyuc @claude @technoadnan

