feat!: Public threaded PDF parser and rendering API by cau-git · Pull Request #265 · docling-project/docling-parse

cau-git · 2026-04-28T08:00:13Z

Summary

Redesigns the threaded PDF parser public API to remove C++ internals from user code, unify parse-only and parse-and-render workflows behind one public parser, and add a more Pythonic result iteration model.

Before:

# Required a dummy PdfDocument hack to convert C++ decoders
dummy_doc = PdfDocument.__new__(PdfDocument)
dummy_doc._boundary_type = PdfPageBoundaryType.CROP_BOX

while parser.has_tasks():
    task = parser.get_task()
    page_decoder, timings = task.get()  # C++ object exposed to user code
    seg_page = dummy_doc._to_segmented_page_from_decoder(page_decoder)

After:

for result in parser.iterate_results():
    seg_page = result.get_page()       # lazy, cached
    timings = result.get_timings()     # typed Timings
    image = result.get_image()         # only when render_config is set
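
The relationship between the manual loop and the new iterator can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `QueueBackedParser` and its in-memory queue are hypothetical, and only the method names `has_tasks()`, `get_task()`, and `iterate_results()` are taken from this PR.

```python
from collections import deque
from typing import Deque, Iterator, List


class QueueBackedParser:
    """Hypothetical stand-in for a threaded parser; not the docling-parse class."""

    def __init__(self, results: List[str]) -> None:
        # A real parser would fill this from worker threads.
        self._pending: Deque[str] = deque(results)

    def has_tasks(self) -> bool:
        return bool(self._pending)

    def get_task(self) -> str:
        return self._pending.popleft()

    def iterate_results(self) -> Iterator[str]:
        # Generator wrapper around the manual has_tasks() / get_task() loop.
        while self.has_tasks():
            yield self.get_task()


parser = QueueBackedParser(["page-1", "page-2", "page-3"])
collected = list(parser.iterate_results())
```

With a wrapper like this, callers that need fine-grained control can still drive `has_tasks()` / `get_task()` directly, while ordinary consumers use a plain `for` loop.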

Changes

- PageParseResult replaces the old raw threaded result types with a Python-facing result object:
  - 1-indexed page_number
  - lazy, cached get_page()
  - typed get_timings()
  - get_image(...) / has_image
  - page_width / page_height without full page conversion
  - error_message
- DoclingThreadedPdfParser is now the single public threaded entry point:
  - render_config=None means parse-only
  - render_config=RenderConfig(...) means parse-and-render
  - separate public DoclingThreadedPdfRenderer / ThreadedPdfRendererConfig are removed
- ThreadedPdfParserConfig now includes:
  - boundary_type
  - render_config
- segmented_page_from_decoder() is now a public module-level helper; PdfDocument._to_segmented_page_from_decoder() delegates to it
- Threaded document lifecycle and scheduling are now public:
  - load(..., page_numbers=...)
  - page_count(doc_key)
  - scheduled_page_count(doc_key)
  - unload(doc_key)
  - unload_all()
- iterate_results() is added for normal consumption; has_tasks() / get_task() remain for manual control
- C++ number_of_pages() / scheduled page count plumbing is exposed through the threaded parser API
- DoclingThreadedPdfParser.__init__() copies decode_config before mutating page_boundary, so caller-owned config objects are not modified in place
- Rendered image access now supports:
  - default image reuse
  - rerendering at arbitrary scale
  - rerendering at canvas_size
  - Python-side cropbox cropping
- The C++ rerender path releases the GIL during render instruction replay
- Tests and perf scripts were updated to the new API shape
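
The "lazy, cached get_page()" behaviour listed above can be sketched as follows. This is an illustration under stated assumptions, not the PageParseResult implementation: `LazyPageResult`, its `convert` callback, and the 0-based `page_index` argument are hypothetical; only `page_number` (1-indexed) and `get_page()` are names from this PR.

```python
from typing import Any, Callable, List, Optional


class LazyPageResult:
    """Hypothetical result object; mirrors names from this PR but is not the library class."""

    def __init__(self, page_index: int, convert: Callable[[], Any]) -> None:
        # Assumption: the backend counts pages from 0; the public field is 1-indexed.
        self.page_number = page_index + 1
        self._convert = convert
        self._page: Optional[Any] = None
        self._converted = False

    def get_page(self) -> Any:
        # Run the expensive conversion only on first access, then reuse the result.
        if not self._converted:
            self._page = self._convert()
            self._converted = True
        return self._page


calls: List[int] = []


def expensive_conversion() -> dict:
    calls.append(1)  # record each real conversion
    return {"cells": ["hello"]}


result = LazyPageResult(page_index=0, convert=expensive_conversion)
first = result.get_page()
second = result.get_page()
```

A separate `_converted` flag is used instead of testing `_page is None`, so that a conversion which legitimately returns `None` is still cached.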

Replace the split DoclingThreadedPdfParser / DoclingThreadedPdfRenderer classes with a single DoclingThreadedPdfParser whose ThreadedPdfParserConfig selects parse-only or parse-and-render mode via an optional render_config field.

Key changes:
- Extract segmented_page_from_decoder() as a public module-level function; PdfDocument._to_segmented_page_from_decoder() delegates to it
- Add PageParseResult: typed result with 1-indexed page_number, lazy get_page(), typed get_timings(), get_image(), has_image, page_width/height, error_message
- Add ThreadedPdfParserConfig.boundary_type and render_config fields
- Add DoclingThreadedPdfParser.page_count() and iterate_results()
- Expose number_of_pages() on both C++ threaded backends via pybind11
- Remove DoclingThreadedPdfRenderer, PdfPageRenderResult, ThreadedPdfRendererConfig
- Fix DoclingThreadedPdfParser.__init__ to copy decode_config before mutating page_boundary, so the caller's object is never modified in place
- Update all perf scripts and tests to the new API; restore full groundtruth regression coverage in test_threaded_parse.py and test_threaded_render.py

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
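
The copy-before-mutate fix mentioned in this commit message can be sketched with stand-in types. Everything here is hypothetical except the attribute name `page_boundary` and the parameter name `decode_config`; the real config class, its fields, and the constructor signature may differ.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDecodeConfig:
    """Hypothetical config; only the field name page_boundary comes from this PR."""

    page_boundary: str = "crop_box"


class ConfigCopyingParser:
    """Hypothetical parser showing the pattern, not the docling-parse constructor."""

    def __init__(
        self,
        decode_config: Optional[FakeDecodeConfig] = None,
        boundary: str = "media_box",
    ) -> None:
        # Copy first, so the caller's object is left untouched by the mutation below.
        if decode_config is None:
            self._decode_config = FakeDecodeConfig()
        else:
            self._decode_config = copy.copy(decode_config)
        self._decode_config.page_boundary = boundary


caller_config = FakeDecodeConfig(page_boundary="crop_box")
parser = ConfigCopyingParser(decode_config=caller_config, boundary="media_box")
```

A shallow copy is enough in this sketch because the config holds only immutable values; a config with nested mutable members would need `copy.deepcopy` instead.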

mergify · 2026-04-28T08:00:50Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
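
As a quick check, the pattern from this rule can be run against the titles this PR carried. The sketch below uses Python's `re` module as a stand-in for Mergify's matcher, on the assumption that both interpret this pattern the same way; the helper names `is_conventional` and `is_breaking` are illustrative.

```python
import re

# Pattern copied from the merge protection rule quoted above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


def is_breaking(title: str) -> bool:
    """True when the prefix carries the '!' breaking-change marker."""
    match = CONVENTIONAL_TITLE.search(title)
    return match is not None and match.group(2) == "!"


titles = [
    "feat(parser)!: Redesign public threaded PDF parser API",
    "feat!: Redesign public threaded PDF parser API",
    "feat!: Public threaded PDF parser and rendering API",
    "Update plan",  # a commit title from this PR, not a PR title
]
results = {title: is_conventional(title) for title in titles}
```

All three PR titles satisfy the rule and are flagged as breaking; a free-form title such as "Update plan" would not pass it.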

github-actions · 2026-04-28T08:06:32Z

✅ DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Prepare docling-parse for the upcoming docling_release threaded backend.

- add selected-page scheduling to threaded document loads via `page_numbers`
- expose `scheduled_page_count()` alongside physical `page_count()`
- add public threaded document cleanup with `unload()` and `unload_all()`
- reject unload attempts while threaded iteration is still active
- extend `PageParseResult.get_image()` with true scale-based rerendering
- support `cropbox` cropping in Python while preserving default-image fast paths
- validate render config and add scale support across pybind and renderers
- cover scheduling, unload, scaling, canvas sizing, and cropping with tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
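
The scheduling and unload rules in this commit message can be modelled with a toy class. This is a sketch under loud assumptions: `FakeThreadedParser`, the `num_pages` argument (standing in for what a real parser reads from the PDF), and the use of `RuntimeError` are all hypothetical; only the method names and the `page_numbers` keyword come from this PR.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class FakeThreadedParser:
    """Hypothetical model of the described lifecycle; not the real implementation."""

    def __init__(self) -> None:
        self._page_counts: Dict[str, int] = {}
        self._scheduled: Dict[str, List[int]] = {}
        self._iterating = False

    def load(self, doc_key: str, num_pages: int,
             page_numbers: Optional[List[int]] = None) -> None:
        # With no selection, every page (1-indexed) is scheduled.
        if page_numbers is None:
            page_numbers = list(range(1, num_pages + 1))
        self._page_counts[doc_key] = num_pages
        self._scheduled[doc_key] = list(page_numbers)

    def page_count(self, doc_key: str) -> int:
        return self._page_counts[doc_key]  # physical pages in the document

    def scheduled_page_count(self, doc_key: str) -> int:
        return len(self._scheduled[doc_key])  # pages selected for parsing

    def iterate_results(self) -> Iterator[Tuple[str, int]]:
        self._iterating = True
        try:
            for doc_key, pages in self._scheduled.items():
                for number in pages:
                    yield doc_key, number
        finally:
            self._iterating = False

    def unload(self, doc_key: str) -> None:
        # Assumed error type; the PR only says such attempts are rejected.
        if self._iterating:
            raise RuntimeError("cannot unload while iteration is active")
        del self._page_counts[doc_key]
        del self._scheduled[doc_key]


parser = FakeThreadedParser()
parser.load("doc", num_pages=10, page_numbers=[2, 5])
results = parser.iterate_results()
first = next(results)
try:
    parser.unload("doc")
    rejected = False
except RuntimeError:
    rejected = True
remaining = list(results)  # finishing the iteration lifts the restriction
parser.unload("doc")
```

The point of keeping two counts is that a caller sizing a progress bar wants the scheduled count, while a caller validating page selections wants the physical count.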

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

…ed-parse-api

PeterStaar-IBM

lgtm!

…ed-parse-api

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

PeterStaar-IBM

lovely!

cau-git marked this pull request as draft April 28, 2026 08:30

cau-git added 4 commits April 28, 2026 11:22

Update plan

0ebd26d

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

fix: address threaded render and unload race issues

8174041

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Update plans

e0fb2e2

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

PeterStaar-IBM self-requested a review May 3, 2026 21:23

Merge branch 'main' of github.com:DS4SD/docling-parse into cau/thread…

eff5e5d

…ed-parse-api

PeterStaar-IBM previously approved these changes May 11, 2026


cau-git added 2 commits May 11, 2026 10:57

Merge branch 'main' of github.com:DS4SD/docling-parse into cau/thread…

61a939a

…ed-parse-api

Hide raw threaded pybind parser types behind internal names

b7e1faa

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

cau-git dismissed PeterStaar-IBM’s stale review via b7e1faa May 11, 2026 10:21

Remove DoclingPdfRenderer (deprecation)

6b681d3

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

cau-git marked this pull request as ready for review May 11, 2026 12:50

lint/format stuff

75bc9a4

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

cau-git requested a review from PeterStaar-IBM May 11, 2026 12:57

cau-git changed the title ~~feat(parser)!: Redesign public threaded PDF parser API~~ feat!: Redesign public threaded PDF parser API May 11, 2026

cau-git changed the title ~~feat!: Redesign public threaded PDF parser API~~ feat!: Public threaded PDF parser and rendering API May 11, 2026

PeterStaar-IBM approved these changes May 11, 2026


cau-git merged commit b066b26 into main May 11, 2026
34 checks passed

cau-git deleted the cau/threaded-parse-api branch May 11, 2026 13:37

cau-git merged 10 commits into main from cau/threaded-parse-api

2 participants
