1552 implement needs ocr in sample caller #1554

Closed
Luis-manzur wants to merge 8 commits into main from 1552-implement-needs_ocr-in-sample-caller
Conversation

@Luis-manzur
Contributor

This pull request adds a new --ocr-available flag to the sample_caller script, enhancing its ability to detect when OCR (Optical Character Recognition) should be used for document extraction. It introduces a new utility module for robustly detecting when OCR is needed, and updates the workflow to leverage this logic when the flag is set.

This PR addresses #1552.
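As an illustration of the flag described above, here is a minimal sketch of how an opt-in `--ocr-available` switch could gate an OCR check. It assumes `argparse`-style parsing; `build_parser` and `plan_extraction` are hypothetical names and are not taken from the PR's diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the flag discussed in this PR."""
    parser = argparse.ArgumentParser(prog="sample_caller")
    parser.add_argument(
        "--ocr-available",
        action="store_true",
        default=False,
        help="Check extracted text and fall back to OCR when it looks empty.",
    )
    return parser


def plan_extraction(ocr_available: bool, text_looks_empty: bool) -> str:
    """Hypothetical helper: choose an extraction path for one document."""
    # OCR is only attempted when the caller opted in AND plain-text
    # extraction produced nothing useful.
    if ocr_available and text_looks_empty:
        return "ocr"
    return "plain"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(plan_extraction(args.ocr_available, text_looks_empty=True))
```

Keeping the flag off by default would leave the existing behavior of the script unchanged unless a caller asks for the OCR check.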

@Luis-manzur Luis-manzur requested a review from grossir August 21, 2025 16:06
@Luis-manzur Luis-manzur linked an issue Aug 21, 2025 that may be closed by this pull request
@Luis-manzur Luis-manzur requested a review from flooie August 21, 2025 16:06
@Luis-manzur Luis-manzur moved this to PRs to Review in Sprint (Case Law) Aug 21, 2025
@flooie
Contributor

flooie commented Aug 22, 2025

Can you explain the point of the OCR utils, in particular is_doc_common_header? I'm not sure I understand why we are adding this.

Sample caller is meant to be just that: a sample caller?

@flooie flooie assigned Luis-manzur and unassigned flooie Aug 22, 2025
@Luis-manzur
Contributor Author

> Can you explain the point of the OCR utils, in particular is_doc_common_header? I'm not sure I understand why we are adding this.
>
> Sample caller is meant to be just that: a sample caller?

I added it so I could test certain scenarios, such as in texbizct, where in some PDFs much of the information is in images rather than plain text.

So this is to simulate CL's behavior when necessary.

ocr_utils and the functions within it are the same functions originally used in CL to detect whether OCR is needed. The problem with that implementation is that it targets PACER documents, which is why I added the option to detect missing pages, which I consider a more general approach.

These changes should also be applied in CL.
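To make the two approaches concrete, here is a rough sketch rather than the code from this PR or from CL. It assumes the extracted text is available page by page, and that PACER-style stamps look like `Case 1:20-cv-...` and `Page 3 of 12`. The regex and all function names are hypothetical.

```python
import re

# Assumed shape of lines that carry no real content: PACER-style case
# stamps ("Case 1:20-cv-00001 ...") and page markers ("Page 3 of 12").
HEADER_RE = re.compile(r"^(?:case\s+\d\S*|page\s+\d+)", re.IGNORECASE)


def is_header_line(line: str) -> bool:
    """Return True when a line looks like a repeated header stamp."""
    return bool(HEADER_RE.match(line.strip()))


def page_has_content(page_text: str) -> bool:
    """Return True when a page has at least one non-header, non-blank line."""
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped and not is_header_line(stripped):
            return True
    return False


def pages_missing_text(pages: list[str]) -> list[int]:
    """Return the 1-based numbers of pages with no usable text."""
    return [n for n, text in enumerate(pages, start=1) if not page_has_content(text)]


def needs_ocr_sketch(pages: list[str]) -> bool:
    """Flag a document when it has no pages or any page lacks text."""
    # Treating an empty page list as "needs OCR" is an assumption.
    return not pages or bool(pages_missing_text(pages))
```

The header filter reflects the PACER-oriented check, while the per-page pass reflects the missing-pages idea: a document whose first page extracts cleanly but whose later pages are scanned images would still be flagged.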

@Luis-manzur Luis-manzur requested a review from Copilot August 22, 2025 15:43

This comment was marked as spam.

@grossir
Contributor

grossir commented Apr 2, 2026

Closing this since:

  • It has grown stale.
  • For these special courts, I think it will be better to test the whole integration through the CL docker environment. This change would just make juriscraper more complex for an uncommon use case.

We should eventually stop copying code from CL into juriscraper and vice versa (like we did with download_content). A good chunk is copied from here: https://github.com/freelawproject/courtlistener/blob/main/cl/lib/recap_utils.py

> The problem with that implementation is that its use case is for PACER, which is why I added the option to detect missing pages, what I consider a more general approach.

These kinds of fixes should be in doctor. We should probably write about the problematic courts / documents first.

@grossir grossir closed this Apr 2, 2026
@github-project-automation github-project-automation Bot moved this from PRs to Review to Done in Sprint (Case Law) Apr 2, 2026
Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Implement needs_ocr in sample caller

4 participants