feat!: Public threaded PDF parser and rendering API #265
Merged
Replace the split DoclingThreadedPdfParser / DoclingThreadedPdfRenderer classes with a single DoclingThreadedPdfParser whose ThreadedPdfParserConfig selects parse-only or parse-and-render mode via an optional render_config field.

Key changes:
- Extract segmented_page_from_decoder() as a public module-level function; PdfDocument._to_segmented_page_from_decoder() delegates to it
- Add PageParseResult: typed result with 1-indexed page_number, lazy get_page(), typed get_timings(), get_image(), has_image, page_width/height, error_message
- Add ThreadedPdfParserConfig.boundary_type and render_config fields
- Add DoclingThreadedPdfParser.page_count() and iterate_results()
- Expose number_of_pages() on both C++ threaded backends via pybind11
- Remove DoclingThreadedPdfRenderer, PdfPageRenderResult, ThreadedPdfRendererConfig
- Fix DoclingThreadedPdfParser.__init__ to copy decode_config before mutating page_boundary, so the caller's object is never modified in place
- Update all perf scripts and tests to the new API; restore full groundtruth regression coverage in test_threaded_parse.py and test_threaded_render.py

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
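The `decode_config` fix in the list above follows the usual defensive-copy pattern. The sketch below illustrates only that pattern: `DecodeConfig`, `Parser`, and the boundary values are hypothetical stand-ins, not the docling-parse classes, and a shallow copy is assumed to be enough because only a top-level attribute is changed.

```python
import copy
from dataclasses import dataclass


@dataclass
class DecodeConfig:
    """Hypothetical stand-in for the caller-owned decode config."""
    page_boundary: str = "crop_box"


class Parser:
    """Hypothetical stand-in showing only the copy-before-mutate step."""

    def __init__(self, decode_config: DecodeConfig, boundary_type: str):
        # Copy first, then mutate the private copy, so the caller's
        # object keeps whatever page_boundary it was created with.
        self._decode_config = copy.copy(decode_config)
        self._decode_config.page_boundary = boundary_type


caller_config = DecodeConfig()
parser = Parser(caller_config, boundary_type="media_box")
```

After construction, `caller_config.page_boundary` is still `"crop_box"`, while the parser's private copy holds `"media_box"`. If the config held nested mutable objects that the constructor also changed, `copy.deepcopy` would be the safer choice.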
Prepare docling-parse for the upcoming docling_release threaded backend.

- add selected-page scheduling to threaded document loads via `page_numbers`
- expose `scheduled_page_count()` alongside physical `page_count()`
- add public threaded document cleanup with `unload()` and `unload_all()`
- reject unload attempts while threaded iteration is still active
- extend `PageParseResult.get_image()` with true scale-based rerendering
- support `cropbox` cropping in Python while preserving default-image fast paths
- validate render config and add scale support across pybind and renderers
- cover scheduling, unload, scaling, canvas sizing, and cropping with tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
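To make the scheduling and unload rules in this commit concrete, here is a minimal pure-Python sketch. `ThreadedDocs` is a hypothetical stand-in, not the docling-parse implementation; the 1-indexed `page_numbers`, the `physical_pages` argument, and the use of `RuntimeError` and `ValueError` are assumptions made for illustration.

```python
class ThreadedDocs:
    """Hypothetical stand-in for a threaded parser's document registry."""

    def __init__(self):
        self._docs = {}          # doc_key -> (physical page count, scheduled pages)
        self._iterating = False

    def load(self, doc_key, physical_pages, page_numbers=None):
        # Assumption: page_numbers is 1-indexed; None schedules every page.
        if page_numbers is None:
            scheduled = list(range(1, physical_pages + 1))
        else:
            scheduled = list(page_numbers)
        for number in scheduled:
            if not 1 <= number <= physical_pages:
                raise ValueError(f"page {number} is out of range")
        self._docs[doc_key] = (physical_pages, scheduled)

    def page_count(self, doc_key):
        # Physical pages in the document, regardless of scheduling.
        return self._docs[doc_key][0]

    def scheduled_page_count(self, doc_key):
        # Only the pages selected for parsing.
        return len(self._docs[doc_key][1])

    def iterate_results(self):
        self._iterating = True
        try:
            for doc_key, (_, scheduled) in list(self._docs.items()):
                for number in scheduled:
                    yield doc_key, number
        finally:
            # Runs when the generator is exhausted or closed.
            self._iterating = False

    def unload(self, doc_key):
        if self._iterating:
            raise RuntimeError("cannot unload while iteration is active")
        return self._docs.pop(doc_key, None) is not None


docs = ThreadedDocs()
docs.load("report", physical_pages=10, page_numbers=[2, 5])
```

Here `docs.page_count("report")` reports the 10 physical pages while `docs.scheduled_page_count("report")` reports the 2 selected ones, and `unload("report")` raises for as long as an `iterate_results()` generator is still partway through.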
Summary
Redesigns the threaded PDF parser public API to remove C++ internals from user code, unify parse-only and parse-and-render workflows behind one public parser, and add a more Pythonic result iteration model.
Before:
After:
Changes
- `PageParseResult` replaces the old raw threaded result types with a Python-facing result object:
  - `page_number`
  - `get_page()`
  - `get_timings()`
  - `get_image(...)` / `has_image`
  - `page_width` / `page_height` without full page conversion
  - `error_message`
- `DoclingThreadedPdfParser` is now the single public threaded entry point:
  - `render_config=None` means parse-only
  - `render_config=RenderConfig(...)` means parse-and-render
  - `DoclingThreadedPdfRenderer` / `ThreadedPdfRendererConfig` are removed
- `ThreadedPdfParserConfig` now includes:
  - `boundary_type`
  - `render_config`
- `segmented_page_from_decoder()` is now a public module-level helper; `PdfDocument._to_segmented_page_from_decoder()` delegates to it
- `load(..., page_numbers=...)`
- `page_count(doc_key)`
- `scheduled_page_count(doc_key)`
- `unload(doc_key)`
- `unload_all()`
- `iterate_results()` is added for normal consumption; `has_tasks()` / `get_task()` remain for manual control
- `number_of_pages()` / scheduled page count plumbing is exposed through the threaded parser API
- `DoclingThreadedPdfParser.__init__()` copies `decode_config` before mutating `page_boundary`, so caller-owned config objects are not modified in place
- `scale`
- `canvas_size`
- `cropbox` cropping
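The sketch below shows one plausible way `iterate_results()` could sit on top of `has_tasks()` / `get_task()`, together with a lazily converted page. `Result` and `TaskQueue` are hypothetical stand-ins that borrow the method names from the list above; the real classes may be structured differently.

```python
from collections import deque


class Result:
    """Hypothetical per-page result with a lazily converted page."""

    def __init__(self, page_number, convert, error_message=None):
        self.page_number = page_number      # 1-indexed
        self.error_message = error_message
        self._convert = convert
        self._page = None

    def get_page(self):
        # Run the (potentially expensive) conversion once, on first access.
        if self._page is None:
            self._page = self._convert()
        return self._page


class TaskQueue:
    """Hypothetical stand-in for the parser's result queue."""

    def __init__(self, results):
        self._pending = deque(results)

    def has_tasks(self):
        return bool(self._pending)

    def get_task(self):
        return self._pending.popleft()

    def iterate_results(self):
        # Convenience generator over the manual has_tasks()/get_task() loop.
        while self.has_tasks():
            yield self.get_task()


conversions = []


def make_converter(number):
    def convert():
        conversions.append(number)
        return {"page": number}
    return convert


queue = TaskQueue([
    Result(1, make_converter(1)),
    Result(2, make_converter(2), error_message="decode failed"),
])
ok_results = [r for r in queue.iterate_results() if r.error_message is None]
```

Filtering on `error_message` does not trigger any conversion; the work happens only when `get_page()` is first called on a result, and repeated calls reuse the cached page.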