feat!: Public threaded PDF parser and rendering API #265

Merged: cau-git merged 10 commits into main from cau/threaded-parse-api on May 11, 2026

@cau-git cau-git commented Apr 28, 2026

Summary

Redesigns the threaded PDF parser public API to remove C++ internals from user code, unify parse-only and parse-and-render workflows behind one public parser, and add a more Pythonic result iteration model.

Before:

# Required a dummy PdfDocument hack to convert C++ decoders
dummy_doc = PdfDocument.__new__(PdfDocument)
dummy_doc._boundary_type = PdfPageBoundaryType.CROP_BOX

while parser.has_tasks():
    task = parser.get_task()
    page_decoder, timings = task.get()  # C++ object exposed to user code
    seg_page = dummy_doc._to_segmented_page_from_decoder(page_decoder)

After:

for result in parser.iterate_results():
    seg_page = result.get_page()       # lazy, cached
    timings = result.get_timings()     # typed Timings
    image = result.get_image()         # only when render_config is set
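
The move from the manual `has_tasks()` / `get_task()` loop to a plain `for` loop can be pictured as a generator that drains the task queue. The following is a minimal sketch only: `FakeThreadedParser` is a hypothetical stand-in that works on strings, and it does not reproduce how docling-parse implements `iterate_results()`.

```python
from collections import deque
from typing import Deque, Iterator, List


class FakeThreadedParser:
    """Hypothetical stand-in that mimics the method names from this PR."""

    def __init__(self, pages: List[str]) -> None:
        # Pretend every page has already been parsed and queued.
        self._pending: Deque[str] = deque(pages)

    def has_tasks(self) -> bool:
        return bool(self._pending)

    def get_task(self) -> str:
        return self._pending.popleft()

    def iterate_results(self) -> Iterator[str]:
        # Wrap the manual-control loop so callers can use a for-loop.
        while self.has_tasks():
            yield self.get_task()


parser = FakeThreadedParser(["page-1", "page-2", "page-3"])
collected = list(parser.iterate_results())
```

The manual methods stay usable next to the generator, which matches the PR's note that `has_tasks()` / `get_task()` remain for manual control.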

Changes

  • PageParseResult replaces the old raw threaded result types with a Python-facing result object:
    • 1-indexed page_number
    • lazy, cached get_page()
    • typed get_timings()
    • get_image(...) / has_image
    • page_width / page_height without full page conversion
    • error_message
  • DoclingThreadedPdfParser is now the single public threaded entry point:
    • render_config=None means parse-only
    • render_config=RenderConfig(...) means parse-and-render
    • separate public DoclingThreadedPdfRenderer / ThreadedPdfRendererConfig are removed
  • ThreadedPdfParserConfig now includes:
    • boundary_type
    • render_config
  • segmented_page_from_decoder() is now a public module-level helper; PdfDocument._to_segmented_page_from_decoder() delegates to it
  • Threaded document lifecycle and scheduling are now public:
    • load(..., page_numbers=...)
    • page_count(doc_key)
    • scheduled_page_count(doc_key)
    • unload(doc_key)
    • unload_all()
  • iterate_results() is added for normal consumption; has_tasks() / get_task() remain for manual control
  • The C++ number_of_pages() and scheduled-page-count plumbing is exposed through the threaded parser API
  • DoclingThreadedPdfParser.__init__() copies decode_config before mutating page_boundary, so caller-owned config objects are not modified in place
  • Rendered image access now supports:
    • default image reuse
    • rerendering at arbitrary scale
    • rerendering at canvas_size
    • Python-side cropbox cropping
  • The C++ rerender path releases the GIL during render instruction replay
  • Tests and perf scripts were updated to the new API shape
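
The "lazy, cached get_page()" and "1-indexed page_number" items above can be illustrated with a small sketch. `LazyPageResult`, `fake_convert`, and the 0-based internal index are assumptions made for illustration; they are not the actual `PageParseResult` implementation.

```python
from typing import Any, Callable, List, Optional


class LazyPageResult:
    """Hypothetical result object: converts its page on first access only."""

    def __init__(
        self, page_index: int, decoder: Any, convert: Callable[[Any], Any]
    ) -> None:
        # Assumption: the backend counts pages from 0; expose them from 1.
        self.page_number = page_index + 1
        self._decoder = decoder
        self._convert = convert
        self._page: Optional[Any] = None
        self._converted = False

    def get_page(self) -> Any:
        # Run the (potentially expensive) conversion once, then reuse it.
        if not self._converted:
            self._page = self._convert(self._decoder)
            self._converted = True
        return self._page


conversions: List[Any] = []


def fake_convert(decoder: Any) -> dict:
    # Record each call so the caching behaviour is observable.
    conversions.append(decoder)
    return {"source": decoder, "cells": []}


result = LazyPageResult(page_index=0, decoder="decoder-1", convert=fake_convert)
first = result.get_page()
second = result.get_page()
```

A separate `_converted` flag is used instead of checking `self._page is None`, so a conversion that legitimately returns `None` is not repeated.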

Replace the split DoclingThreadedPdfParser / DoclingThreadedPdfRenderer
classes with a single DoclingThreadedPdfParser whose ThreadedPdfParserConfig
selects parse-only or parse-and-render mode via an optional render_config field.
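
A rough sketch of how one optional field can select the mode is shown below. `SketchRenderConfig`, `SketchParserConfig`, their fields, and `describe_mode` are hypothetical stand-ins; the real `RenderConfig` and `ThreadedPdfParserConfig` may carry different fields and defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchRenderConfig:
    # Hypothetical field; the PR mentions scale support but not its default.
    scale: float = 1.0


@dataclass
class SketchParserConfig:
    # Hypothetical stand-in for the two config fields added in this PR.
    boundary_type: str = "crop_box"
    render_config: Optional[SketchRenderConfig] = None


def describe_mode(config: SketchParserConfig) -> str:
    # None means parse-only; any render config means parse-and-render.
    if config.render_config is None:
        return "parse-only"
    return "parse-and-render"


parse_only = describe_mode(SketchParserConfig())
parse_and_render = describe_mode(
    SketchParserConfig(render_config=SketchRenderConfig(scale=2.0))
)
```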

Key changes:
- Extract segmented_page_from_decoder() as a public module-level function;
  PdfDocument._to_segmented_page_from_decoder() delegates to it
- Add PageParseResult: typed result with 1-indexed page_number, lazy get_page(),
  typed get_timings(), get_image(), has_image, page_width/height, error_message
- Add ThreadedPdfParserConfig.boundary_type and render_config fields
- Add DoclingThreadedPdfParser.page_count() and iterate_results()
- Expose number_of_pages() on both C++ threaded backends via pybind11
- Remove DoclingThreadedPdfRenderer, PdfPageRenderResult, ThreadedPdfRendererConfig
- Fix DoclingThreadedPdfParser.__init__ to copy decode_config before mutating
  page_boundary, so the caller's object is never modified in place
- Update all perf scripts and tests to the new API; restore full groundtruth
  regression coverage in test_threaded_parse.py and test_threaded_render.py
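
The `decode_config` fix follows the usual defensive-copy pattern, sketched here under assumptions: `SketchDecodeConfig`, `SketchParser`, and the string boundary values are hypothetical, and only the copy-before-mutate idea is taken from the commit message.

```python
import copy
from dataclasses import dataclass


@dataclass
class SketchDecodeConfig:
    # Hypothetical stand-in for the decode configuration object.
    page_boundary: str = "media_box"


class SketchParser:
    def __init__(self, decode_config: SketchDecodeConfig, boundary_type: str) -> None:
        # Copy first, then mutate the private copy; the caller's object stays
        # as it was. A shallow copy is enough here because the field is a
        # plain string; nested mutable fields would need copy.deepcopy.
        self.decode_config = copy.copy(decode_config)
        self.decode_config.page_boundary = boundary_type


caller_config = SketchDecodeConfig()
parser = SketchParser(caller_config, boundary_type="crop_box")
```

After construction, `caller_config.page_boundary` is still `"media_box"`, while the parser's private copy holds `"crop_box"`.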

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

mergify Bot commented Apr 28, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:


github-actions Bot commented Apr 28, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

@cau-git cau-git marked this pull request as draft April 28, 2026 08:30
cau-git added 4 commits April 28, 2026 11:22
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Prepare docling-parse for the upcoming docling_release threaded backend.

- add selected-page scheduling to threaded document loads via `page_numbers`
- expose `scheduled_page_count()` alongside physical `page_count()`
- add public threaded document cleanup with `unload()` and `unload_all()`
- reject unload attempts while threaded iteration is still active
- extend `PageParseResult.get_image()` with true scale-based rerendering
- support `cropbox` cropping in Python while preserving default-image fast paths
- validate render config and add scale support across pybind and renderers
- cover scheduling, unload, scaling, canvas sizing, and cropping with tests
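
The scheduling and unload rules in this commit can be modelled with a toy store. This is a sketch under assumptions: `ToyThreadedStore`, the `physical_pages` argument, and the choice of `RuntimeError` / `ValueError` are hypothetical, and no PDF is loaded or parsed.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class ToyThreadedStore:
    """Hypothetical model of page scheduling and guarded unloads."""

    def __init__(self) -> None:
        # doc_key -> (physical page count, scheduled 1-indexed page numbers)
        self._docs: Dict[str, Tuple[int, List[int]]] = {}
        self._iterating = False

    def load(
        self,
        doc_key: str,
        physical_pages: int,
        page_numbers: Optional[List[int]] = None,
    ) -> None:
        if page_numbers is None:
            scheduled = list(range(1, physical_pages + 1))
        else:
            for number in page_numbers:
                if not 1 <= number <= physical_pages:
                    raise ValueError(f"page {number} is out of range")
            scheduled = list(page_numbers)
        self._docs[doc_key] = (physical_pages, scheduled)

    def page_count(self, doc_key: str) -> int:
        return self._docs[doc_key][0]

    def scheduled_page_count(self, doc_key: str) -> int:
        return len(self._docs[doc_key][1])

    def iterate_results(self) -> Iterator[Tuple[str, int]]:
        self._iterating = True
        try:
            for doc_key, (_, scheduled) in list(self._docs.items()):
                for number in scheduled:
                    yield doc_key, number
        finally:
            # Runs when the generator is exhausted or closed early.
            self._iterating = False

    def unload(self, doc_key: str) -> None:
        if self._iterating:
            raise RuntimeError("cannot unload while iteration is active")
        del self._docs[doc_key]


store = ToyThreadedStore()
store.load("report", physical_pages=10, page_numbers=[2, 5])
physical = store.page_count("report")
scheduled = store.scheduled_page_count("report")

results = store.iterate_results()
first = next(results)  # iteration is now active
try:
    store.unload("report")
    unload_rejected = False
except RuntimeError:
    unload_rejected = True

results.close()  # ends iteration; unloading is allowed again
store.unload("report")
```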

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@PeterStaar-IBM PeterStaar-IBM self-requested a review May 3, 2026 21:23
PeterStaar-IBM previously approved these changes May 11, 2026

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm!

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git marked this pull request as ready for review May 11, 2026 12:50
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git requested a review from PeterStaar-IBM May 11, 2026 12:57
@cau-git cau-git changed the title from "feat(parser)!: Redesign public threaded PDF parser API" to "feat!: Redesign public threaded PDF parser API" May 11, 2026
@cau-git cau-git changed the title from "feat!: Redesign public threaded PDF parser API" to "feat!: Public threaded PDF parser and rendering API" May 11, 2026

@PeterStaar-IBM PeterStaar-IBM left a comment

lovely!

@cau-git cau-git merged commit b066b26 into main May 11, 2026
34 checks passed
@cau-git cau-git deleted the cau/threaded-parse-api branch May 11, 2026 13:37

2 participants