Skip to content

feat(rag): add OCR fallback for scanned PDFs#520

Open
upasana-2006 wants to merge 1 commit into
FireFistisDead:masterfrom
upasana-2006:feat/ocr-fallback-scanned-pdfs
Open

feat(rag): add OCR fallback for scanned PDFs#520
upasana-2006 wants to merge 1 commit into
FireFistisDead:masterfrom
upasana-2006:feat/ocr-fallback-scanned-pdfs

Conversation

@upasana-2006

@upasana-2006 upasana-2006 commented Jun 13, 2026

Copy link
Copy Markdown

Summary

Added an optional OCR fallback mechanism for scanned and image-only PDFs in the RAG service. The existing text extraction flow remains unchanged for standard PDFs, but when little or no readable text is detected, the service can automatically perform OCR to extract content from images embedded in PDFs.

This enhancement improves the usability of the PDF QA Bot by enabling support for a wider range of real-world documents, such as scanned notes, certificates, forms, and archived PDFs that previously could not be processed successfully.

Related issue

Closes #514

Testing

  • I ran the relevant checks locally
  • I verified the app still starts
  • I tested the affected flow end-to-end

Additional testing performed:

  • Verified that regular text-based PDFs continue to use the existing extraction pipeline.
  • Verified that scanned/image-only PDFs trigger the OCR fallback when enabled.
  • Confirmed that OCR-generated documents preserve page metadata and source information.
  • Verified that disabling OCR preserves the existing behavior for unsupported scanned PDFs.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

N/A – This change affects backend document processing and does not introduce any UI modifications.

Notes

  • Added a dedicated ocr_extractor.py utility to isolate OCR-related functionality from the main RAG pipeline.

  • Introduced environment variables to control OCR behavior:

    • ENABLE_OCR_FALLBACK
    • OCR_MAX_PAGES
    • OCR_MIN_CHARS
    • OCR_DPI
    • OCR_LANGUAGE
  • Added pdf2image and pytesseract as dependencies required for OCR support.

  • OCR is only invoked when the initial text extraction produces insufficient content, minimizing unnecessary overhead.

  • Follow-up improvements could include support for additional OCR languages and confidence-based quality checks.

Security

  • No sensitive data included

Summary by CodeRabbit

  • New Features
    • Added OCR fallback support to enable text extraction from scanned and image-only PDFs when standard document processing methods fail
    • Introduced configurable OCR settings: language selection, DPI resolution, maximum pages to process, and minimum text detection thresholds
  • Tests
    • Added OCR extraction unit tests
  • Chores
    • Added OCR processing dependencies

@vercel

vercel Bot commented Jun 13, 2026

Copy link
Copy Markdown

@upasana-2006 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@upasana-2006, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 50 minutes and 46 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ad1e2b74-321f-4735-9b96-6985d2234dba

📥 Commits

Reviewing files that changed from the base of the PR and between 95cd817 and 4656b6c.

📒 Files selected for processing (5)
  • .env.example
  • rag-service/main.py
  • rag-service/ocr_extractor.py
  • rag-service/requirements.txt
  • rag-service/tests/test_ocr_extractor.py
📝 Walkthrough

Walkthrough

This PR adds OCR fallback support to the RAG service for scanned and image-only PDFs. When normal text extraction yields insufficient text, the service can optionally convert PDF pages to images and run OCR via Tesseract. OCR behavior is controlled by environment variables and optional dependencies.

Changes

OCR Fallback for Scanned PDFs

Layer / File(s) Summary
OCR Configuration and Extraction Logic
.env.example, rag-service/ocr_extractor.py, rag-service/requirements.txt, rag-service/tests/test_ocr_extractor.py
OCRConfig dataclass and get_ocr_config() read environment variables (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_MIN_CHARS, OCR_DPI, OCR_LANGUAGE) into typed settings. should_use_ocr() determines when OCR is needed based on extracted text length. extract_text_with_ocr() validates feature enablement, imports OCR dependencies dynamically, converts PDF pages to images with configured DPI and page limits, runs Tesseract per page, and returns LangChain Document objects with page/source/extraction_method metadata. Dependencies pdf2image and pytesseract added. Tests cover configuration parsing, decision logic, disabled-mode error handling, and successful OCR output mapping.
PDF Loading Integration
rag-service/main.py
load_pdf_with_optional_ocr() loads PDFs via PyMuPDFLoader, inspects extracted text length, conditionally invokes OCR fallback based on OCR configuration, raises ValueError when scanned PDF detected but OCR disabled, and normalizes document metadata for both extraction paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • FireFistisDead/pdf-qa-bot#518: Implements OCR fallback in crawler/pdf_extractor.py using the same pdf2image and pytesseract stack, offering an alternative or complementary extraction path.

Suggested labels

feature, rag-service, backend, type:testing, level:intermediate, quality:clean

Poem

🐰 A scanned PDF hops through the light,
Text fades but images stay bright.
OCR whispers each word it finds,
Now all your documents unwind!
No manual work—just fallback might.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(rag): add OCR fallback for scanned PDFs' clearly and concisely describes the main change: adding OCR fallback functionality to handle scanned PDFs in the RAG service.
Description check ✅ Passed The PR description is comprehensive and follows the template structure, including summary, related issue, testing, checklist items, screenshots/recordings section, notes, and security confirmation.
Linked Issues check ✅ Passed The changes fully implement the objectives from issue #514: optional OCR fallback with normal extraction first, PDF-to-image conversion using pdf2image, OCR via pytesseract, LangChain Document objects with metadata, environment variable controls (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_MIN_CHARS, OCR_DPI, OCR_LANGUAGE), and lightweight local solution.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing OCR fallback for scanned PDFs: .env.example (configuration), ocr_extractor.py (OCR logic), main.py (integration), requirements.txt (dependencies), and tests (validation). No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend Express or API gateway work docs Documentation only enhancement New feature or request feature A new feature or improvement frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing bug Something isn't working labels Jun 13, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
.env.example (2)

277-278: 💤 Low value

Consider alphabetizing OCR configuration keys.

Static analysis suggests reordering OCR_DPI and OCR_LANGUAGE before OCR_MAX_PAGES to maintain alphabetical ordering. While not required, consistent ordering improves maintainability of the environment template.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.env.example around lines 277 - 278, Reorder the OCR env keys in the
.env.example so they are alphabetized: move OCR_DPI and OCR_LANGUAGE to appear
before OCR_MAX_PAGES (so the sequence is OCR_DPI, OCR_LANGUAGE, OCR_MAX_PAGES),
ensuring consistent alphabetical ordering of OCR configuration entries (OCR_DPI,
OCR_LANGUAGE, OCR_MAX_PAGES).

Source: Linters/SAST tools


271-271: 💤 Low value

Remove the extra blank line.

Static analysis detected an unnecessary blank line. While this doesn't affect functionality, maintaining consistent formatting in the environment template improves readability.

✨ Suggested fix
 SEMANTIC_CHUNK_HIERARCHICAL=true
-

 
 # OCR fallback for scanned/image-only PDFs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.env.example at line 271, Remove the single extraneous blank line present in
the environment template example file (the stray empty line in the .env example)
so the file has consistent formatting; simply delete that empty line and keep
the rest of the file unchanged.

Source: Linters/SAST tools

rag-service/tests/test_ocr_extractor.py (1)

1-62: ⚡ Quick win

Consider adding tests for error cases.

The test suite covers happy paths but is missing coverage for failure modes that could occur in production:

  • Invalid integer values in environment variables (e.g., OCR_MAX_PAGES="abc")
  • File not found when calling extract_text_with_ocr
  • OCR operations that return empty results for all pages (Line 72-73 raises in that case)
  • Corrupt images that pytesseract cannot process

These edge cases would help catch the issues flagged in ocr_extractor.py (unhandled ValueError on int conversion, missing path validation).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/tests/test_ocr_extractor.py` around lines 1 - 62, Add tests to
rag-service/tests/test_ocr_extractor.py that cover the error cases for
get_ocr_config and extract_text_with_ocr: (1) verify get_ocr_config raises or
handles invalid integer env vars by setting OCR_MAX_PAGES to "abc" and asserting
a ValueError or fallback behavior; (2) simulate file-not-found by calling
extract_text_with_ocr with a non-existent path and patching
pdf2image.convert_from_path to raise a FileNotFoundError and assert the function
raises or returns a clear error; (3) simulate OCR returning empty strings for
all pages by patching pdf2image.convert_from_path to return mocked images and
pytesseract.image_to_string to return "" for each page and assert
extract_text_with_ocr raises the expected error (the case noted at lines that
raise when all pages empty); and (4) simulate pytesseract throwing an exception
on a corrupt image by patching pytesseract.image_to_string to raise an Exception
and assert extract_text_with_ocr surfaces that error; use monkeypatch to set
environment vars (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_DPI, OCR_LANGUAGE) and
unittest.mock.patch to stub pdf2image and pytesseract, and reference the
existing test helpers get_ocr_config and extract_text_with_ocr in assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/main.py`:
- Around line 70-73: The function load_pdf_with_optional_ocr currently
instantiates PyMuPDFLoader directly (duplicate of _load_with_pymupdf); instead,
remove that direct PyMuPDFLoader usage and call the existing
load_pdf_documents_async (which encapsulates PyMuPDFLoader with timeout
protection and PyPDFLoader fallback) to obtain documents so behavior, error
handling, and metadata remain consistent; update load_pdf_with_optional_ocr to
accept/forward the same args expected by load_pdf_documents_async and remove the
redundant _load_with_pymupdf-like logic in that function.
- Around line 65-90: The function load_pdf_with_optional_ocr is dead code and
the real fix is to unify OCR fallback behavior so it respects the
ENABLE_OCR_FALLBACK flag and centralizes logic: update load_pdf_documents_async
to either call the existing load_pdf_with_optional_ocr (preferred) or delegate
to the same helpers used there (get_ocr_config and extract_text_with_ocr)
instead of using the internal _load_with_ocr duplication; remove
load_pdf_with_optional_ocr only if you replace all callers with the unified
behavior. Concretely, locate load_pdf_documents_async and replace its OCR branch
to call load_pdf_with_optional_ocr(pdf_path, filename) (or mirror its checks:
use get_ocr_config().min_chars & .enabled and call extract_text_with_ocr when
enabled) and delete the redundant _load_with_ocr implementation or mark it as
internal if still needed; ensure the ocr_extractor.py/get_ocr_config and
extract_text_with_ocr symbols are the single source of truth for fallback
configuration.

In `@rag-service/ocr_extractor.py`:
- Around line 32-75: extract_text_with_ocr is missing file validation, timeout
protection, and bounds on resource usage; fix it by (1) validating pdf_path
exists and is readable at the top of extract_text_with_ocr and raise a clear
RuntimeError if not, (2) enforce sane limits using the existing get_ocr_config
values (e.g., config.max_pages and a capped config.dpi) before calling
convert_from_path to avoid converting more pages or overly-high DPI than
allowed, and (3) run both the convert_from_path call and each
pytesseract.image_to_string call inside a cancellable worker with a timeout (for
example, use concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor to submit
convert_from_path and per-image OCR tasks and call result(timeout=...) or cancel
on timeout), catching TimeoutError and raising a descriptive RuntimeError; also
ensure resources are cleaned up on timeout and that empty or non-readable
results produce the existing "no readable text" error.
- Around line 17-24: The get_ocr_config function currently calls int() on
OCR_MAX_PAGES, OCR_MIN_CHARS, and OCR_DPI without validation, which will raise
ValueError for non-numeric env values; update get_ocr_config to safely parse
these three env vars (OCR_MAX_PAGES, OCR_MIN_CHARS, OCR_DPI) by wrapping
conversions in try/except or using a small helper like parse_int_env that
returns the parsed int or a safe default (10, 30, 200) and logs a warning when
parsing fails, then construct OCRConfig with those validated values; keep
ENABLE_OCR_FALLBACK and OCR_LANGUAGE parsing unchanged.

In `@rag-service/requirements.txt`:
- Around line 20-22: The requirements file currently lists pdf2image and
pytesseract without version pins; replace those two lines with explicit, fixed
versions (for example pdf2image==1.16.0 and pytesseract==0.3.10 or the
project-approved stable versions), commit the change to
rag-service/requirements.txt, and then run your dependency update/CI steps (pip
install -r, tests) to verify compatibility; ensure you pick and document the
exact pinned versions used.

---

Nitpick comments:
In @.env.example:
- Around line 277-278: Reorder the OCR env keys in the .env.example so they are
alphabetized: move OCR_DPI and OCR_LANGUAGE to appear before OCR_MAX_PAGES (so
the sequence is OCR_DPI, OCR_LANGUAGE, OCR_MAX_PAGES), ensuring consistent
alphabetical ordering of OCR configuration entries (OCR_DPI, OCR_LANGUAGE,
OCR_MAX_PAGES).
- Line 271: Remove the single extraneous blank line present in the environment
template example file (the stray empty line in the .env example) so the file has
consistent formatting; simply delete that empty line and keep the rest of the
file unchanged.

In `@rag-service/tests/test_ocr_extractor.py`:
- Around line 1-62: Add tests to rag-service/tests/test_ocr_extractor.py that
cover the error cases for get_ocr_config and extract_text_with_ocr: (1) verify
get_ocr_config raises or handles invalid integer env vars by setting
OCR_MAX_PAGES to "abc" and asserting a ValueError or fallback behavior; (2)
simulate file-not-found by calling extract_text_with_ocr with a non-existent
path and patching pdf2image.convert_from_path to raise a FileNotFoundError and
assert the function raises or returns a clear error; (3) simulate OCR returning
empty strings for all pages by patching pdf2image.convert_from_path to return
mocked images and pytesseract.image_to_string to return "" for each page and
assert extract_text_with_ocr raises the expected error (the case noted at lines
that raise when all pages empty); and (4) simulate pytesseract throwing an
exception on a corrupt image by patching pytesseract.image_to_string to raise an
Exception and assert extract_text_with_ocr surfaces that error; use monkeypatch
to set environment vars (ENABLE_OCR_FALLBACK, OCR_MAX_PAGES, OCR_DPI,
OCR_LANGUAGE) and unittest.mock.patch to stub pdf2image and pytesseract, and
reference the existing test helpers get_ocr_config and extract_text_with_ocr in
assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 80d485a6-d47d-4405-b66f-12791dfa7f0c

📥 Commits

Reviewing files that changed from the base of the PR and between 5590b87 and 95cd817.

📒 Files selected for processing (5)
  • .env.example
  • rag-service/main.py
  • rag-service/ocr_extractor.py
  • rag-service/requirements.txt
  • rag-service/tests/test_ocr_extractor.py

Comment thread rag-service/main.py Outdated
Comment on lines +65 to +90
def load_pdf_with_optional_ocr(pdf_path: str, filename: str = "uploaded.pdf"):
"""
Load PDF text normally first. If the document appears scanned/image-only,
optionally fall back to OCR when ENABLE_OCR_FALLBACK=true.
"""
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(pdf_path)
documents = loader.load()

combined_text = "\n".join(doc.page_content for doc in documents)
ocr_config = get_ocr_config()

if should_use_ocr(combined_text, ocr_config.min_chars):
if not ocr_config.enabled:
raise ValueError(
"No readable text found in this PDF. It may be scanned/image-only. "
"Enable OCR fallback by setting ENABLE_OCR_FALLBACK=true."
)
return extract_text_with_ocr(pdf_path, filename)

for doc in documents:
doc.metadata.setdefault("source", filename)
doc.metadata.setdefault("extraction_method", "text")

return documents

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that load_pdf_with_optional_ocr is defined but never called

echo "=== Definition of load_pdf_with_optional_ocr ==="
rg -n "^def load_pdf_with_optional_ocr" --type=py

echo ""
echo "=== Calls to load_pdf_with_optional_ocr ==="
rg -n "load_pdf_with_optional_ocr\(" --type=py

echo ""
echo "=== Current PDF processing call site ==="
rg -n -A2 "load_pdf_documents_async\(temp_path" --type=py

Repository: FireFistisDead/pdf-qa-bot

Length of output: 673


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find any other references/usages of load_pdf_with_optional_ocr (string match)
echo "=== All references to load_pdf_with_optional_ocr ==="
rg -n "load_pdf_with_optional_ocr" --hidden --glob='!**/node_modules/**' --glob='!**/.venv/**' || true

# 2) Find OCR-related helpers/config usage to see if fallback is already wired elsewhere
echo
echo "=== References to OCR config/helpers ==="
rg -n "ENABLE_OCR_FALLBACK|get_ocr_config|should_use_ocr|extract_text_with_ocr" --hidden --glob='!**/node_modules/**' --glob='!**/.venv/**' || true

# 3) Inspect load_pdf_documents_async to determine whether it already performs OCR fallback internally
echo
echo "=== Definition of load_pdf_documents_async ==="
rg -n "def load_pdf_documents_async|load_pdf_documents_async\s*\(" rag-service/main.py --hidden --glob='!**/node_modules/**' || true

echo
echo "=== Context around load_pdf_documents_async definition (if present) ==="
rg -n -A60 -B10 "def load_pdf_documents_async" rag-service/main.py --hidden --glob='!**/node_modules/**' || true

Repository: FireFistisDead/pdf-qa-bot

Length of output: 5654


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show any OCR helper usage in rag-service/main.py
echo "=== extract_text_with_ocr usage in rag-service/main.py ==="
rg -n "extract_text_with_ocr" rag-service/main.py --hidden || true

echo
echo "=== should_use_ocr usage in rag-service/main.py ==="
rg -n "should_use_ocr" rag-service/main.py --hidden || true

echo
echo "=== get_ocr_config usage in rag-service/main.py ==="
rg -n "get_ocr_config" rag-service/main.py --hidden || true

echo
echo "=== load_pdf_documents_async: tail context ==="
# Print more of the function to confirm whether OCR fallback exists inside it
rg -n -A250 -B10 "async def load_pdf_documents_async" rag-service/main.py --hidden --glob='*.py'

Repository: FireFistisDead/pdf-qa-bot

Length of output: 11563


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Definition of _load_with_ocr ==="
rg -n "def _load_with_ocr|_load_with_ocr\s*\(" rag-service/main.py --hidden

echo
echo "=== Context around _load_with_ocr ==="
rg -n -A80 -B10 "def _load_with_ocr" rag-service/main.py --hidden

Repository: FireFistisDead/pdf-qa-bot

Length of output: 3584


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "ENABLE_OCR_FALLBACK" rag-service/main.py --hidden || true

Repository: FireFistisDead/pdf-qa-bot

Length of output: 213


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== .env.example: ENABLE_OCR_FALLBACK ==="
rg -n "ENABLE_OCR_FALLBACK" -n .env.example rag-service -S || true

echo
echo "=== rag-service/ocr_extractor.py (key parts) ==="
sed -n '1,120p' rag-service/ocr_extractor.py

Repository: FireFistisDead/pdf-qa-bot

Length of output: 2945


OCR fallback is already implemented; the unused load_pdf_with_optional_ocr is the real issue

  • load_pdf_with_optional_ocr is defined in rag-service/main.py but never called, so that function is dead code.
  • However, load_pdf_documents_async already attempts OCR fallback when extracted text is empty or low-density (via _load_with_ocr), so the OCR feature is not “missing” from the pipeline.
  • The fallback logic inside load_pdf_documents_async does not respect ENABLE_OCR_FALLBACK (unlike load_pdf_with_optional_ocr / ocr_extractor.py), and it duplicates OCR implementation/config decisions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 65 - 90, The function
load_pdf_with_optional_ocr is dead code and the real fix is to unify OCR
fallback behavior so it respects the ENABLE_OCR_FALLBACK flag and centralizes
logic: update load_pdf_documents_async to either call the existing
load_pdf_with_optional_ocr (preferred) or delegate to the same helpers used
there (get_ocr_config and extract_text_with_ocr) instead of using the internal
_load_with_ocr duplication; remove load_pdf_with_optional_ocr only if you
replace all callers with the unified behavior. Concretely, locate
load_pdf_documents_async and replace its OCR branch to call
load_pdf_with_optional_ocr(pdf_path, filename) (or mirror its checks: use
get_ocr_config().min_chars & .enabled and call extract_text_with_ocr when
enabled) and delete the redundant _load_with_ocr implementation or mark it as
internal if still needed; ensure the ocr_extractor.py/get_ocr_config and
extract_text_with_ocr symbols are the single source of truth for fallback
configuration.

Comment thread rag-service/main.py Outdated
Comment thread rag-service/ocr_extractor.py
Comment thread rag-service/ocr_extractor.py
Comment thread rag-service/requirements.txt
@upasana-2006 upasana-2006 force-pushed the feat/ocr-fallback-scanned-pdfs branch from 8a9099e to d2b80ac Compare June 13, 2026 16:35
@upasana-2006 upasana-2006 force-pushed the feat/ocr-fallback-scanned-pdfs branch from d2b80ac to 4656b6c Compare June 13, 2026 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Express or API gateway work bug Something isn't working docs Documentation only enhancement New feature or request feature A new feature or improvement frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add OCR fallback for scanned/image-only PDFs in RAG service

1 participant