1552 implement needs ocr in sample caller by Luis-manzur · Pull Request #1554 · freelawproject/juriscraper

Luis-manzur · 2025-08-21T16:06:19Z

This pull request adds a new --ocr-available flag to the sample_caller script, enhancing its ability to detect when OCR (Optical Character Recognition) should be used for document extraction. It introduces a new utility module for robustly detecting when OCR is needed, and updates the workflow to leverage this logic when the flag is set.

This PR addresses #1552.
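
For readers following along, here is a rough sketch of how a flag like this might be wired up. Only the flag name `--ocr-available` and the `ocr_available` parameter name come from this PR; `build_parser`, `extraction_params`, the help text, and the use of `argparse` are assumptions for illustration, not the actual sample_caller code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --ocr-available flag name is from the PR.
    parser = argparse.ArgumentParser(description="sample caller sketch")
    parser.add_argument(
        "--ocr-available",
        action="store_true",
        default=False,
        help="Allow OCR when the extracted text looks incomplete",
    )
    return parser


def extraction_params(ocr_available: bool) -> dict:
    # Assumed shape of the parameters forwarded to the extraction step.
    return {"ocr_available": ocr_available}


# Simulate passing the flag on the command line.
args = build_parser().parse_args(["--ocr-available"])
params = extraction_params(args.ocr_available)
```

With `store_true`, the flag defaults to off, so existing invocations would behave as before unless the flag is passed.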

flooie · 2025-08-22T14:44:18Z

Can you explain the point of the ocr utils. In particular is_doc_common_header. I'm not sure I understand why we are adding this?

Sample caller is meant to be, just that, a sample caller?

Luis-manzur · 2025-08-22T15:40:34Z

> Can you explain the point of the ocr utils. In particular is_doc_common_header. I'm not sure I understand why we are adding this?
>
> Sample caller is meant to be, just that, a sample caller?

I added it so I could test certain scenarios, such as in texbizct, where in some PDFs much of the information is in images rather than plain text.

So this is to simulate CL's behavior when necessary.

ocr_utils and the functions within it are the same functions initially used in CL to detect whether OCR should be used. The problem with that implementation is that its use case is PACER, which is why I added the option to detect missing pages, which I consider a more general approach.

These changes should also be applied in CL.
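
The two ideas described above (ignore header-only text, and compare pages that have text against the expected page count) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR or from CL: the header patterns, the form-feed page separator, and the `page_has_body_text` helper are all hypothetical.

```python
import re
from typing import Optional

# Assumed header patterns, loosely modelled on PACER-style page stamps.
HEADER_RE = re.compile(r"^(Case|Appellate|Appeal|USCA)\b", re.IGNORECASE)
# Assumed pagination patterns such as "3" or "Page 3 of 10".
PAGE_RE = re.compile(r"^(Page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def page_has_body_text(page: str) -> bool:
    """Return True if the page has a line that is not a header or page number."""
    for line in page.splitlines():
        line = line.strip()
        if not line or HEADER_RE.match(line) or PAGE_RE.match(line):
            continue
        return True
    return False


def needs_ocr(content: str, page_count: Optional[int] = None) -> bool:
    """Guess whether OCR is needed for already-extracted text.

    Assumes pages are separated by form feeds ("\f").
    """
    pages = content.split("\f")
    pages_with_text = sum(page_has_body_text(p) for p in pages)
    if pages_with_text == 0:
        # Nothing but headers and page numbers was extracted.
        return True
    if page_count is not None and pages_with_text < page_count:
        # Some pages yielded no body text, so they may be image-only.
        return True
    return False
```

The page-count check is what makes the heuristic less PACER-specific: it can flag a document where only some pages are images, even when the remaining pages contain plenty of text.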

grossir · 2026-04-02T23:11:19Z

Closing this since:

- it has grown stale
- For these special courts I think it will be better to just test the whole integration through the CL docker environment. This will just make juriscraper more complex for an uncommon use case.

We should eventually stop copying code from CL into juriscraper and vice versa (like we did with download_content). A good chunk is copied from here: https://github.com/freelawproject/courtlistener/blob/main/cl/lib/recap_utils.py

> The problem with that implementation is that its use case is for PACER, which is why I added the option to detect missing pages, what I consider a more general approach.

These kinds of fixes should be in doctor. We should probably write about the problematic courts / documents first.

Luis-manzur added 4 commits August 20, 2025 14:55

feat(ocr): implement needs_ocr function to determine OCR necessity fo…

61e0426

…r PDFs

feat(ocr): enhance needs_ocr function to integrate page count

f3e93ce

feat(ocr): update help message for --ocr-available option in sample_c…

c0b8fbc

…aller

chore: add ocr feat to CHANGES.md

b73fc00

Luis-manzur requested a review from grossir August 21, 2025 16:06

Luis-manzur assigned flooie Aug 21, 2025

Luis-manzur linked an issue Aug 21, 2025 that may be closed by this pull request

Implement needs_ocr in sample caller #1552

Open

Luis-manzur requested a review from flooie August 21, 2025 16:06

Luis-manzur added this to Sprint (Case Law) Aug 21, 2025

Luis-manzur moved this to PRs to Review in Sprint (Case Law) Aug 21, 2025

flooie assigned Luis-manzur and unassigned flooie Aug 22, 2025

Luis-manzur requested a review from Copilot August 22, 2025 15:43

This comment was marked as spam.


Luis-manzur and others added 4 commits August 22, 2025 11:46

fix(sample_caller): update params handling for ocr_available in extra…

f91436f

…ct_content_from_doctor Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix(sample_caller): correct ocr_available parameter handling in extra…

f2d255b

…ct_content_from_doctor

Merge remote-tracking branch 'origin/1552-implement-needs_ocr-in-samp…

5968a7f

…le-caller' into 1552-implement-needs_ocr-in-sample-caller

Merge branch 'main' into 1552-implement-needs_ocr-in-sample-caller

1a9924b

grossir closed this Apr 2, 2026

github-project-automation Bot moved this from PRs to Review to Done in Sprint (Case Law) Apr 2, 2026

Luis-manzur wants to merge 8 commits into main from 1552-implement-needs_ocr-in-sample-caller

4 participants
