1552 implement needs ocr in sample caller #1554
Can you explain the point of the OCR utils, in particular is_doc_common_header? I'm not sure I understand why we are adding this. Sample caller is meant to be just that, a sample caller.
I added it so I could test certain scenarios, such as in So this is to simulate CL's behavior when necessary.
These changes should also be applied in CL.
…ct_content_from_doctor Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ct_content_from_doctor
…le-caller' into 1552-implement-needs_ocr-in-sample-caller
Closing this since:
We should eventually stop copying code from CL into juriscraper and vice versa (like we did with download_content). A good chunk is copied from here: https://github.com/freelawproject/courtlistener/blob/main/cl/lib/recap_utils.py
These kinds of fixes should be in doctor. We should probably write about the problematic courts / documents first.
This pull request adds a new --ocr-available flag to the sample_caller script, enhancing its ability to detect when OCR (Optical Character Recognition) should be used for document extraction. It introduces a new utility module for robustly detecting when OCR is needed, and updates the workflow to leverage this logic when the flag is set.

This PR addresses #1552.
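
For readers who have not seen the diff, here is a rough sketch of the kind of check the description refers to. It is not the code from this PR or from CL. The names needs_ocr, is_doc_common_header and --ocr-available come from the discussion above, but the header pattern, the function bodies and build_parser are assumptions made purely for illustration.

```python
import argparse
import re

# Hypothetical pattern for header lines that court filing systems stamp on
# every page. The prefixes and layout are assumptions; a real list would have
# to come from the problematic documents themselves.
HEADER_PATTERN = re.compile(
    r"^(Case|Appellate|Appeal|USCA)\b.*\bPage\s+\d+\s+of\s+\d+",
    re.IGNORECASE,
)


def is_doc_common_header(line: str) -> bool:
    """Return True if the line looks like a stamped page header (assumed format)."""
    return bool(HEADER_PATTERN.match(line.strip()))


def needs_ocr(content: str) -> bool:
    """Return True when the extracted text holds nothing but blank lines and
    page headers, which suggests the pages are scanned images."""
    for line in content.splitlines():
        stripped = line.strip()
        if not stripped or is_doc_common_header(stripped):
            continue
        # Found a line of real text, so plain extraction worked.
        return False
    return True


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing how a boolean --ocr-available flag could be declared."""
    parser = argparse.ArgumentParser(description="Sample caller sketch")
    parser.add_argument(
        "--ocr-available",
        action="store_true",
        help="Retry extraction with OCR when needs_ocr() reports no real text",
    )
    return parser
```

With a layout like this, the caller would only re-request extraction with OCR when the flag is set and needs_ocr returns True for the first pass, which keeps the default run free of any OCR dependency.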