Add OCR text verification to prevent false positive completions #46
Draft
Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add verification for OCR after our step" to "Add OCR text verification to prevent false positive completions" on Aug 28, 2025
Currently, the OCR service trusts only the exit code from OCRmyPDF to determine whether OCR processing was successful. This can lead to false positives where OCR appears to complete successfully (exit code 0) but no actual text was extracted from the document.
Problem
OCRmyPDF can return exit code 0 in cases where:
In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text extraction occurred.
Solution
This PR adds a verification step after successful OCR completion:
Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.
Improved status logic:
OCRStatus.COMPLETED
OCRStatus.FAILED
OCRStatus.OUTPUT_ERROR
Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
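For reviewers, the verification step described above could be sketched roughly as follows. This is an illustration, not the code in ocr_service/main.py: the helper name verify_ocr_result, the enum values, and the mapping of outcomes to OCRStatus.FAILED and OCRStatus.OUTPUT_ERROR are assumptions, and extract_text is passed in as a plain callable because its real signature is not shown here.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger("ocr_service")


class OCRStatus(Enum):
    # Member names come from the PR description; the values are placeholders.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def verify_ocr_result(
    exit_code: int,
    output_path: str,
    extract_text: Callable[[str], str],
) -> OCRStatus:
    """Hypothetical helper: pick the final status after OCRmyPDF has run.

    The outcome-to-status mapping below is an assumption made for this sketch.
    """
    if exit_code != 0:
        # OCRmyPDF itself reported a failure; nothing to verify.
        return OCRStatus.FAILED

    try:
        text = extract_text(output_path)
    except Exception:
        # Assumed: an unreadable output file is reported as OUTPUT_ERROR.
        logger.exception("Could not read OCR output %s", output_path)
        return OCRStatus.OUTPUT_ERROR

    # Whitespace-only output counts as "no text extracted".
    char_count = len((text or "").strip())
    logger.info(
        "OCR verification extracted %d characters from %s",
        char_count,
        output_path,
    )
    return OCRStatus.COMPLETED if char_count > 0 else OCRStatus.FAILED
```

Passing the extractor as an argument keeps the sketch testable without a real PDF, for example verify_ocr_result(0, "out.pdf", lambda path: "") for a blank page.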
Changes
ocr_service/main.py: Added import for extract_text, implemented verification logic, and fixed a bug where the completion status was set regardless of the OCR result
tests/test_ocr_verification.py: Added unit tests covering various text extraction scenarios
Example Impact
Before this change, a blank PDF page would result in:
After this change:
This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred.
Fixes #41.