feat(rag): add OCR fallback for scanned PDFs#520
Conversation
|
@upasana-2006 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
Warning Review limit reached
More reviews will be available in 50 minutes and 46 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR adds OCR fallback support to the RAG service for scanned and image-only PDFs. When normal text extraction yields insufficient text, the service can optionally convert PDF pages to images and run OCR via Tesseract. OCR behavior is controlled by environment variables and optional dependencies. ChangesOCR Fallback for Scanned PDFs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
.env.example (2)
277-278: 💤 Low valueConsider alphabetizing OCR configuration keys.
Static analysis suggests reordering
OCR_DPIandOCR_LANGUAGEbeforeOCR_MAX_PAGESto maintain alphabetical ordering. While not required, consistent ordering improves maintainability of the environment template.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.env.example around lines 277 - 278, Reorder the OCR env keys in the .env.example so they are alphabetized: move OCR_DPI and OCR_LANGUAGE to appear before OCR_MAX_PAGES (so the sequence is OCR_DPI, OCR_LANGUAGE, OCR_MAX_PAGES), ensuring consistent alphabetical ordering of OCR configuration entries (OCR_DPI, OCR_LANGUAGE, OCR_MAX_PAGES).Source: Linters/SAST tools
271-271: 💤 Low valueRemove the extra blank line.
Static analysis detected an unnecessary blank line. While this doesn't affect functionality, maintaining consistent formatting in the environment template improves readability.
✨ Suggested fix
SEMANTIC_CHUNK_HIERARCHICAL=true - # OCR fallback for scanned/image-only PDFs🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.env.example at line 271, Remove the single extraneous blank line present in the environment template example file (the stray empty line in the .env example) so the file has consistent formatting; simply delete that empty line and keep the rest of the file unchanged.Source: Linters/SAST tools
rag-service/tests/test_ocr_extractor.py (1)
1-62: ⚡ Quick winConsider adding tests for error cases.
The test suite covers happy paths but is missing coverage for failure modes that could occur in production:
- Invalid integer values in environment variables (e.g.,
OCR_MAX_PAGES="abc")- File not found when calling
extract_text_with_ocr- OCR operations that return empty results for all pages (Line 72-73 raises in that case)
- Corrupt images that
pytesseractcannot processThese edge cases would help catch the issues flagged in
ocr_extractor.py(unhandledValueErroron int conversion, missing path validation).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rag-service/tests/test_ocr_extractor.py` around lines 1 - 62, Add tests to rag-service/tests/test_ocr_extractor.py that cover the error cases for get_ocr_config and extract_text_with_ocr: (1) verify get_ocr_config raises or handles invalid integer env vars by setting OCR_MAX_PAGES to "abc" and asserting a ValueError or fallback behavior; (2) simulate file-not-found by calling extract_text_with_ocr with a non-existent path and patching pdf2image.convert_from_path to raise a FileNotFoundError and assert the function raises or returns a clear error; (3) simulate OCR returning empty strings for all pages by patching pdf2image.convert_from_path to return mocked images and pytesseract.image_to_string to return "" for each page and assert extract_text_with_ocr raises the expected error (the case noted at lines that raise when all pages empty); and (4) simulate pytesseract throwing an exception on a corrupt image by patching pytesseract.image_to_string to raise an Exception and assert extract_text_with_ocr surfaces that error; use monkeypatch to set environment vars (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_DPI, OCR_LANGUAGE) and unittest.mock.patch to stub pdf2image and pytesseract, and reference the existing test helpers get_ocr_config and extract_text_with_ocr in assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rag-service/main.py`:
- Around line 70-73: The function load_pdf_with_optional_ocr currently
instantiates PyMuPDFLoader directly (duplicate of _load_with_pymupdf); instead,
remove that direct PyMuPDFLoader usage and call the existing
load_pdf_documents_async (which encapsulates PyMuPDFLoader with timeout
protection and PyPDFLoader fallback) to obtain documents so behavior, error
handling, and metadata remain consistent; update load_pdf_with_optional_ocr to
accept/forward the same args expected by load_pdf_documents_async and remove the
redundant _load_with_pymupdf-like logic in that function.
- Around line 65-90: The function load_pdf_with_optional_ocr is dead code and
the real fix is to unify OCR fallback behavior so it respects the
ENABLE_OCR_FALLBACK flag and centralizes logic: update load_pdf_documents_async
to either call the existing load_pdf_with_optional_ocr (preferred) or delegate
to the same helpers used there (get_ocr_config and extract_text_with_ocr)
instead of using the internal _load_with_ocr duplication; remove
load_pdf_with_optional_ocr only if you replace all callers with the unified
behavior. Concretely, locate load_pdf_documents_async and replace its OCR branch
to call load_pdf_with_optional_ocr(pdf_path, filename) (or mirror its checks:
use get_ocr_config().min_chars & .enabled and call extract_text_with_ocr when
enabled) and delete the redundant _load_with_ocr implementation or mark it as
internal if still needed; ensure the ocr_extractor.py/get_ocr_config and
extract_text_with_ocr symbols are the single source of truth for fallback
configuration.
In `@rag-service/ocr_extractor.py`:
- Around line 32-75: extract_text_with_ocr is missing file validation, timeout
protection, and bounds on resource usage; fix it by (1) validating pdf_path
exists and is readable at the top of extract_text_with_ocr and raise a clear
RuntimeError if not, (2) enforce sane limits using the existing get_ocr_config
values (e.g., config.max_pages and a capped config.dpi) before calling
convert_from_path to avoid converting more pages or overly-high DPI than
allowed, and (3) run both the convert_from_path call and each
pytesseract.image_to_string call inside a cancellable worker with a timeout (for
example, use concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor to submit
convert_from_path and per-image OCR tasks and call result(timeout=...) or cancel
on timeout), catching TimeoutError and raising a descriptive RuntimeError; also
ensure resources are cleaned up on timeout and that empty or non-readable
results produce the existing "no readable text" error.
- Around line 17-24: The get_ocr_config function currently calls int() on
OCR_MAX_PAGES, OCR_MIN_CHARS, and OCR_DPI without validation, which will raise
ValueError for non-numeric env values; update get_ocr_config to safely parse
these three env vars (OCR_MAX_PAGES, OCR_MIN_CHARS, OCR_DPI) by wrapping
conversions in try/except or using a small helper like parse_int_env that
returns the parsed int or a safe default (10, 30, 200) and logs a warning when
parsing fails, then construct OCRConfig with those validated values; keep
ENABLE_OCR_FALLBACK and OCR_LANGUAGE parsing unchanged.
In `@rag-service/requirements.txt`:
- Around line 20-22: The requirements file currently lists pdf2image and
pytesseract without version pins; replace those two lines with explicit, fixed
versions (for example pdf2image==1.16.0 and pytesseract==0.3.10 or the
project-approved stable versions), commit the change to
rag-service/requirements.txt, and then run your dependency update/CI steps (pip
install -r, tests) to verify compatibility; ensure you pick and document the
exact pinned versions used.
---
Nitpick comments:
In @.env.example:
- Around line 277-278: Reorder the OCR env keys in the .env.example so they are
alphabetized: move OCR_DPI and OCR_LANGUAGE to appear before OCR_MAX_PAGES (so
the sequence is OCR_DPI, OCR_LANGUAGE, OCR_MAX_PAGES), ensuring consistent
alphabetical ordering of OCR configuration entries (OCR_DPI, OCR_LANGUAGE,
OCR_MAX_PAGES).
- Line 271: Remove the single extraneous blank line present in the environment
template example file (the stray empty line in the .env example) so the file has
consistent formatting; simply delete that empty line and keep the rest of the
file unchanged.
In `@rag-service/tests/test_ocr_extractor.py`:
- Around line 1-62: Add tests to rag-service/tests/test_ocr_extractor.py that
cover the error cases for get_ocr_config and extract_text_with_ocr: (1) verify
get_ocr_config raises or handles invalid integer env vars by setting
OCR_MAX_PAGES to "abc" and asserting a ValueError or fallback behavior; (2)
simulate file-not-found by calling extract_text_with_ocr with a non-existent
path and patching pdf2image.convert_from_path to raise a FileNotFoundError and
assert the function raises or returns a clear error; (3) simulate OCR returning
empty strings for all pages by patching pdf2image.convert_from_path to return
mocked images and pytesseract.image_to_string to return "" for each page and
assert extract_text_with_ocr raises the expected error (the case noted at lines
that raise when all pages empty); and (4) simulate pytesseract throwing an
exception on a corrupt image by patching pytesseract.image_to_string to raise an
Exception and assert extract_text_with_ocr surfaces that error; use monkeypatch
to set environment vars (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_DPI,
OCR_LANGUAGE) and unittest.mock.patch to stub pdf2image and pytesseract, and
reference the existing test helpers get_ocr_config and extract_text_with_ocr in
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 80d485a6-d47d-4405-b66f-12791dfa7f0c
📒 Files selected for processing (5)
.env.examplerag-service/main.pyrag-service/ocr_extractor.pyrag-service/requirements.txtrag-service/tests/test_ocr_extractor.py
| def load_pdf_with_optional_ocr(pdf_path: str, filename: str = "uploaded.pdf"): | ||
| """ | ||
| Load PDF text normally first. If the document appears scanned/image-only, | ||
| optionally fall back to OCR when ENABLE_OCR_FALLBACK=true. | ||
| """ | ||
| from langchain_community.document_loaders import PyMuPDFLoader | ||
|
|
||
| loader = PyMuPDFLoader(pdf_path) | ||
| documents = loader.load() | ||
|
|
||
| combined_text = "\n".join(doc.page_content for doc in documents) | ||
| ocr_config = get_ocr_config() | ||
|
|
||
| if should_use_ocr(combined_text, ocr_config.min_chars): | ||
| if not ocr_config.enabled: | ||
| raise ValueError( | ||
| "No readable text found in this PDF. It may be scanned/image-only. " | ||
| "Enable OCR fallback by setting ENABLE_OCR_FALLBACK=true." | ||
| ) | ||
| return extract_text_with_ocr(pdf_path, filename) | ||
|
|
||
| for doc in documents: | ||
| doc.metadata.setdefault("source", filename) | ||
| doc.metadata.setdefault("extraction_method", "text") | ||
|
|
||
| return documents |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify that load_pdf_with_optional_ocr is defined but never called
echo "=== Definition of load_pdf_with_optional_ocr ==="
rg -n "^def load_pdf_with_optional_ocr" --type=py
echo ""
echo "=== Calls to load_pdf_with_optional_ocr ==="
rg -n "load_pdf_with_optional_ocr\(" --type=py
echo ""
echo "=== Current PDF processing call site ==="
rg -n -A2 "load_pdf_documents_async\(temp_path" --type=pyRepository: FireFistisDead/pdf-qa-bot
Length of output: 673
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find any other references/usages of load_pdf_with_optional_ocr (string match)
echo "=== All references to load_pdf_with_optional_ocr ==="
rg -n "load_pdf_with_optional_ocr" --hidden --glob='!**/node_modules/**' --glob='!**/.venv/**' || true
# 2) Find OCR-related helpers/config usage to see if fallback is already wired elsewhere
echo
echo "=== References to OCR config/helpers ==="
rg -n "ENABLE_OCR_FALLBACK|get_ocr_config|should_use_ocr|extract_text_with_ocr" --hidden --glob='!**/node_modules/**' --glob='!**/.venv/**' || true
# 3) Inspect load_pdf_documents_async to determine whether it already performs OCR fallback internally
echo
echo "=== Definition of load_pdf_documents_async ==="
rg -n "def load_pdf_documents_async|load_pdf_documents_async\s*\(" rag-service/main.py --hidden --glob='!**/node_modules/**' || true
echo
echo "=== Context around load_pdf_documents_async definition (if present) ==="
rg -n -A60 -B10 "def load_pdf_documents_async" rag-service/main.py --hidden --glob='!**/node_modules/**' || trueRepository: FireFistisDead/pdf-qa-bot
Length of output: 5654
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show any OCR helper usage in rag-service/main.py
echo "=== extract_text_with_ocr usage in rag-service/main.py ==="
rg -n "extract_text_with_ocr" rag-service/main.py --hidden || true
echo
echo "=== should_use_ocr usage in rag-service/main.py ==="
rg -n "should_use_ocr" rag-service/main.py --hidden || true
echo
echo "=== get_ocr_config usage in rag-service/main.py ==="
rg -n "get_ocr_config" rag-service/main.py --hidden || true
echo
echo "=== load_pdf_documents_async: tail context ==="
# Print more of the function to confirm whether OCR fallback exists inside it
rg -n -A250 -B10 "async def load_pdf_documents_async" rag-service/main.py --hidden --glob='*.py'Repository: FireFistisDead/pdf-qa-bot
Length of output: 11563
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Definition of _load_with_ocr ==="
rg -n "def _load_with_ocr|_load_with_ocr\s*\(" rag-service/main.py --hidden
echo
echo "=== Context around _load_with_ocr ==="
rg -n -A80 -B10 "def _load_with_ocr" rag-service/main.py --hiddenRepository: FireFistisDead/pdf-qa-bot
Length of output: 3584
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "ENABLE_OCR_FALLBACK" rag-service/main.py --hidden || trueRepository: FireFistisDead/pdf-qa-bot
Length of output: 213
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== .env.example: ENABLE_OCR_FALLBACK ==="
rg -n "ENABLE_OCR_FALLBACK" -n .env.example rag-service -S || true
echo
echo "=== rag-service/ocr_extractor.py (key parts) ==="
sed -n '1,120p' rag-service/ocr_extractor.pyRepository: FireFistisDead/pdf-qa-bot
Length of output: 2945
OCR fallback is already implemented; the unused load_pdf_with_optional_ocr is the real issue
load_pdf_with_optional_ocris defined inrag-service/main.pybut never called, so that function is dead code.- However,
load_pdf_documents_asyncalready attempts OCR fallback when extracted text is empty or low-density (via_load_with_ocr), so the OCR feature is not “missing” from the pipeline. - The fallback logic inside
load_pdf_documents_asyncdoes not respectENABLE_OCR_FALLBACK(unlikeload_pdf_with_optional_ocr/ocr_extractor.py), and it duplicates OCR implementation/config decisions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@rag-service/main.py` around lines 65 - 90, The function
load_pdf_with_optional_ocr is dead code and the real fix is to unify OCR
fallback behavior so it respects the ENABLE_OCR_FALLBACK flag and centralizes
logic: update load_pdf_documents_async to either call the existing
load_pdf_with_optional_ocr (preferred) or delegate to the same helpers used
there (get_ocr_config and extract_text_with_ocr) instead of using the internal
_load_with_ocr duplication; remove load_pdf_with_optional_ocr only if you
replace all callers with the unified behavior. Concretely, locate
load_pdf_documents_async and replace its OCR branch to call
load_pdf_with_optional_ocr(pdf_path, filename) (or mirror its checks: use
get_ocr_config().min_chars & .enabled and call extract_text_with_ocr when
enabled) and delete the redundant _load_with_ocr implementation or mark it as
internal if still needed; ensure the ocr_extractor.py/get_ocr_config and
extract_text_with_ocr symbols are the single source of truth for fallback
configuration.
8a9099e to
d2b80ac
Compare
d2b80ac to
4656b6c
Compare
Summary
Added an optional OCR fallback mechanism for scanned and image-only PDFs in the RAG service. The existing text extraction flow remains unchanged for standard PDFs, but when little or no readable text is detected, the service can automatically perform OCR to extract content from images embedded in PDFs.
This enhancement improves the usability of the PDF QA Bot by enabling support for a wider range of real-world documents, such as scanned notes, certificates, forms, and archived PDFs that previously could not be processed successfully.
Related issue
Closes #514
Testing
Additional testing performed:
Checklist:
Screenshots / recordings
N/A – This change affects backend document processing and does not introduce any UI modifications.
Notes
Added a dedicated
ocr_extractor.pyutility to isolate OCR-related functionality from the main RAG pipeline.Introduced environment variables to control OCR behavior:
ENABLE_OCR_FALLBACKOCR_MAX_PAGESOCR_MIN_CHARSOCR_DPIOCR_LANGUAGEAdded
pdf2imageandpytesseractas dependencies required for OCR support.OCR is only invoked when the initial text extraction produces insufficient content, minimizing unnecessary overhead.
Follow-up improvements could include support for additional OCR languages and confidence-based quality checks.
Security
Summary by CodeRabbit