perf: parallelize HTTP OCR batch requests #93

Open
AdemBoukhris457 wants to merge 1 commit into run-llama:main from AdemBoukhris457:perf/ocr-parallel-batch
@AdemBoukhris457
Contributor

Summary

  • Replace sequential await-in-loop with Promise.all in HttpOcrEngine.recognizeBatch() to send all HTTP OCR requests concurrently.

Problem

recognizeBatch() in src/engines/ocr/http-simple.ts processes images one at a time using await in a for loop. HTTP OCR servers can generally handle concurrent requests, so this serializes work unnecessarily. With 10 images at ~500 ms each, the sequential loop takes ~5 s, versus ~500 ms if all requests run in parallel and the server can serve them simultaneously.

The Tesseract engine's recognizeBatch() already uses Promise.all for parallel processing. The HTTP engine was inconsistent.
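
The difference between the two approaches can be sketched with a simulated request. This is only an illustration: fakeRecognize, recognizeSequential, recognizeConcurrent, and measurePeak are hypothetical names, not code from this repository, and the 20 ms delay stands in for a real HTTP round trip.

```typescript
// Hypothetical stand-in for a single HTTP OCR call; not the project's real API.
type OcrResult = { text: string };

let inFlight = 0;
let peakInFlight = 0;

async function fakeRecognize(image: string, delayMs = 20): Promise<OcrResult> {
  inFlight++;
  peakInFlight = Math.max(peakInFlight, inFlight);
  // Simulated network latency.
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  inFlight--;
  return { text: `text of ${image}` };
}

// Sequential: each request starts only after the previous one has finished.
async function recognizeSequential(images: string[]): Promise<OcrResult[]> {
  const results: OcrResult[] = [];
  for (const image of images) {
    results.push(await fakeRecognize(image));
  }
  return results;
}

// Concurrent: every request starts immediately; results keep the input order.
async function recognizeConcurrent(images: string[]): Promise<OcrResult[]> {
  return Promise.all(images.map((image) => fakeRecognize(image)));
}

// Reports the peak number of simultaneous calls made while `run` executes.
async function measurePeak(run: () => Promise<unknown>): Promise<number> {
  inFlight = 0;
  peakInFlight = 0;
  await run();
  return peakInFlight;
}
```

With N images the sequential version never has more than one request in flight, while the Promise.all version has all N in flight at once. That is where the speedup comes from, and it also means the request count grows with the batch size.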

Changes

  • src/engines/ocr/http-simple.ts: Replace sequential loop with Promise.all(images.map(...)), matching the Tesseract engine's implementation.

Closes #92

recognizeBatch() used await in a for loop, processing images sequentially. Replace with Promise.all to send all HTTP requests concurrently, consistent with the Tesseract engine implementation.
-      results.push(result);
-    }
-    return results;
+    return Promise.all(images.map((image) => this.recognize(image, options)));
Contributor


Uhh, if it's a 400-page document this will send 400 requests at once. Most will time out unless the user has a production-level deployment that scales.

Contributor


I think a better fix here is limiting concurrency to --num-workers? Even then it's not great, but at least it's controllable.
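
A minimal sketch of what bounded concurrency could look like, assuming the worker count is passed in as a plain number. mapWithConcurrency is a hypothetical helper, not something that exists in this repository, and how the --num-workers value would reach the engine is not shown in this PR.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at any time. Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  // JavaScript runs this on a single thread, so `next++` cannot race.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Under that assumption, recognizeBatch() could return mapWithConcurrency(images, numWorkers, (image) => this.recognize(image, options)), where numWorkers is an assumed variable holding the configured worker count. As with Promise.all, the first rejected request rejects the whole batch, so timeouts would still need separate handling.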

Development

Successfully merging this pull request may close these issues.

[Bug] HttpOcrEngine.recognizeBatch processes images sequentially instead of in parallel

2 participants