
feat: update DoclingDocument model with correct wire formats #558

Merged
edeandrea merged 1 commit into docling-project:main from ai-pipestream:chore/update-docling-document-model
Jun 22, 2026

@krickert

Contributor

Updates the docling-core Java DoclingDocument model to match the Python docling-core wire format (v2.83.x).

New types

FineRef, TrackSource, EntityMention, EntitiesMetaField, KeywordsMetaField, TopicsMetaField, LanguageMetaField, CodeMetaField, FieldHeadingItem, FieldValueItem, FieldRegionItem, FieldItem.

Updated types

  • BaseMeta/FloatingMeta/PictureMeta gain language, entities, keywords, topics; PictureMeta also gains code (CodeMetaField).
  • All BaseTextItem implementations gain source (List<TrackSource>) and comments (List<FineRef>) matching DocItem.
  • DocItemLabel adds FIELD_REGION, FIELD_HEADING, FIELD_ITEM, FIELD_KEY, FIELD_VALUE, FIELD_HINT, MARKER.
  • DoclingDocument adds field_regions and field_items lists.
  • TableData.table_cells typed as List<TableCell> (was List<Object>).
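
As a rough illustration of the new DocItemLabel constants, the sketch below maps each one to a lowercase wire string and back. `DocItemLabelSketch` and its nested `Label` enum are hypothetical stand-ins, not the real enum, and the lowercase snake_case wire values are an assumption based on how the Python model typically names labels.

```java
import java.util.Locale;

// Hypothetical stand-in; the real DocItemLabel has many more constants.
final class DocItemLabelSketch {

    // Only the labels named in this PR are listed here.
    enum Label {
        FIELD_REGION, FIELD_HEADING, FIELD_ITEM, FIELD_KEY, FIELD_VALUE, FIELD_HINT, MARKER;

        // Assumed wire value: the constant name in lowercase (FIELD_KEY -> "field_key").
        String wireValue() {
            return name().toLowerCase(Locale.ROOT);
        }

        // Reverse lookup; throws IllegalArgumentException for an unknown label.
        static Label fromWire(String value) {
            return valueOf(value.toUpperCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        // Print each constant next to its assumed wire string.
        for (Label label : Label.values()) {
            System.out.println(label + " <-> " + label.wireValue());
        }
    }
}
```

Keeping the mapping purely name-based means a new constant needs no extra lookup table, at the cost of tying the Java constant names to the wire strings.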

Wire-format details

  • charspan / range serialize as 2-element [int, int] JSON arrays (Python tuple[int, int] parity).
  • TrackSource is a flat object with a kind: "track" discriminator.
  • Polymorphic dispatch on label for field heading/value text items.
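
To make the first bullet concrete, here is a minimal stdlib-only sketch of a span that serializes as a 2-element array instead of an object. The real model relies on Jackson, and its annotations are not shown in this conversation; `CharSpanWireSketch`, `CharSpan`, `toJson`, and `fromJson` are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-rolled illustration of the [int, int] wire shape; not the library's code.
final class CharSpanWireSketch {

    // Two-int span, mirroring Python's tuple[int, int].
    record CharSpan(int start, int end) {

        private static final Pattern WIRE =
                Pattern.compile("\\[\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*\\]");

        // Serialize as a 2-element JSON array rather than a JSON object.
        String toJson() {
            return "[" + start + "," + end + "]";
        }

        // Parse the array form; an object such as {"start":0,"end":5} is rejected.
        static CharSpan fromJson(String json) {
            Matcher m = WIRE.matcher(json.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("expected [int, int] but got: " + json);
            }
            return new CharSpan(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
        }
    }

    public static void main(String[] args) {
        CharSpan span = CharSpan.fromJson("[0, 5]");
        System.out.println(span + " -> " + span.toJson());
    }
}
```

The point of the array shape is parity: a Python `tuple[int, int]` dumps to a JSON array, so a Java type that wrote `{"start": ..., "end": ...}` would not round-trip against Python-produced documents.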

Tests

Adds round-trip tests for the new wire shapes (charspan/range arrays, flat TrackSource, polymorphic field text items, null-safe field collections, PictureMeta sub-fields incl. code, typed table_cells). ./gradlew :docling-core:test is green and spotless is clean.
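
The "null-safe field collections" behaviour can be pictured with a plain record that normalizes a missing list to an empty one. This is only a sketch of the idea: the commit message says the model uses `@JsonSetter(nulls = AS_EMPTY)` for this, whereas the hypothetical `NullSafeListsSketch.Doc` below does the equivalent by hand in a compact constructor and uses `String` references in place of the real item types.

```java
import java.util.List;

// Hypothetical, trimmed-down document holder with just the two new lists.
final class NullSafeListsSketch {

    record Doc(List<String> fieldRegions, List<String> fieldItems) {
        // Compact constructor: treat a null list as empty and copy defensively.
        Doc {
            fieldRegions = fieldRegions == null ? List.of() : List.copyOf(fieldRegions);
            fieldItems = fieldItems == null ? List.of() : List.copyOf(fieldItems);
        }
    }

    public static void main(String[] args) {
        // A payload that omits field_regions would hand null to the constructor.
        Doc doc = new Doc(null, List.of("#/field_items/0"));
        System.out.println("regions empty: " + doc.fieldRegions().isEmpty());
        System.out.println("items: " + doc.fieldItems());
    }
}
```

Callers can then iterate over either list without a null check, which is what the round-trip tests for these collections are guarding.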

Copilot AI review requested due to automatic review settings June 22, 2026 12:29
@krickert krickert force-pushed the chore/update-docling-document-model branch from b212f26 to 88e20e4 Compare June 22, 2026 12:32
@krickert krickert changed the title from "feat(core): update DoclingDocument model with correct wire formats" to "feat: update DoclingDocument model with correct wire formats" Jun 22, 2026

Copilot AI left a comment

Contributor

Pull request overview

Updates docling-core’s DoclingDocument model to align Java serialization/deserialization with the Python docling-core v2.83.x wire format, adding new metadata fields and new document node types while tightening JSON shapes used on the wire.

Changes:

  • Added new wire-format model types (e.g., FineRef, TrackSource, field region/item nodes, and new meta fields) and extended existing nodes with source and comments.
  • Extended metadata (language, entities, keywords, topics) and added PictureMeta.code; added Orientation and typed TableData.table_cells as List<TableCell>.
  • Added/expanded Jackson round-trip tests for the new shapes (tuple-like arrays for charspan/range, flat TrackSource, field text polymorphism, null-as-empty lists, typed table_cells, etc.).
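
The "field text polymorphism" mentioned above amounts to picking a concrete type from the `label` property. A minimal sketch of that dispatch, without Jackson, might look like the following; `LabelDispatchSketch`, `FieldTextItem`, and `fromFields` are hypothetical, the two records carry only a `text` field, and the label strings `field_heading` and `field_value` are assumed wire values.

```java
import java.util.Map;

// Illustration of label-based dispatch; the real model does this through Jackson.
final class LabelDispatchSketch {

    // Sealed hierarchy so that the set of subtypes is closed.
    sealed interface FieldTextItem permits FieldHeadingItem, FieldValueItem {
        String text();
    }

    record FieldHeadingItem(String text) implements FieldTextItem {}

    record FieldValueItem(String text) implements FieldTextItem {}

    // Choose the concrete type from the "label" entry of an already-parsed object.
    static FieldTextItem fromFields(Map<String, String> fields) {
        String label = fields.get("label");
        if (label == null) {
            throw new IllegalArgumentException("missing label");
        }
        String text = fields.getOrDefault("text", "");
        return switch (label) {
            case "field_heading" -> new FieldHeadingItem(text);
            case "field_value" -> new FieldValueItem(text);
            default -> throw new IllegalArgumentException("unsupported label: " + label);
        };
    }

    public static void main(String[] args) {
        FieldTextItem item = fromFields(Map.of("label", "field_value", "text", "42"));
        System.out.println(item);
    }
}
```

Failing loudly on an unknown label is a choice made for this sketch; whether the actual model rejects or tolerates unknown labels is not stated in this conversation.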

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docling-core/src/main/java/ai/docling/core/DoclingDocument.java | Adds/updates model types and Jackson annotations to match the updated Python wire format (new meta fields, field nodes, source/comments, typed table cells, orientation, picture code meta). |
| docling-core/src/test/java/ai/docling/core/DoclingDocumentTests.java | Adds coverage for the new JSON wire shapes and null/empty collection handling. |

Comment thread docling-core/src/test/java/ai/docling/core/DoclingDocumentTests.java Outdated
Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java
Copilot AI review requested due to automatic review settings June 22, 2026 13:31
@krickert krickert force-pushed the chore/update-docling-document-model branch from 88e20e4 to af2fb07 Compare June 22, 2026 13:31

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java Outdated
Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java
Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java
@krickert krickert force-pushed the chore/update-docling-document-model branch from af2fb07 to 43a3eb1 Compare June 22, 2026 13:51
Add new types from the Python docling-core model:
- FineRef: $ref + range as [int,int] JSON array (not object)
- TrackSource: flat discriminated object with kind/start_time/end_time
- EntityMention: charspan as [int,int] JSON array matching CharSpan tuple
- EntitiesMetaField, KeywordsMetaField, TopicsMetaField, LanguageMetaField
- FieldHeadingItem, FieldValueItem (as sealed BaseTextItem subtypes)
- FieldRegionItem, FieldItem (new top-level document item types)
- CodeMetaField (PictureMeta.code) for code-backed picture nodes

Update existing types:
- BaseMeta/FloatingMeta/PictureMeta: add language, entities, keywords, topics;
  PictureMeta also gains the code (CodeMetaField) field
- All BaseTextItem implementations: add source (List<TrackSource>) and
  comments (List<FineRef>) fields matching DocItem in the Python model
- DocItemLabel: add FIELD_REGION, FIELD_HEADING, FIELD_ITEM, FIELD_KEY,
  FIELD_VALUE, FIELD_HINT, MARKER
- DoclingDocument: add field_regions and field_items lists; add missing
  @JsonSetter(nulls = AS_EMPTY) to tables, key_value_items, form_items
- TableData.table_cells: type as List<TableCell> (was List<Object>) so cells
  deserialize into typed TableCell instances, matching list[AnyTableCell]

Add tests verifying correct JSON array wire format for charspan/range,
flat TrackSource deserialization, polymorphic FieldHeadingItem/FieldValueItem
dispatch, null-safe field_regions/field_items handling, PictureMeta sub-fields
(description/classification/molecule/tabular_chart/code), and typed table_cells.

Signed-off-by: Kristian Rickert <krickert@gmail.com>
Copilot AI review requested due to automatic review settings June 22, 2026 14:52
@edeandrea edeandrea force-pushed the chore/update-docling-document-model branch from 43a3eb1 to 4e3961e Compare June 22, 2026 14:52

@edeandrea edeandrea left a comment

Contributor

Thanks @krickert for this!

@github-actions

JaCoCo coverage report

Overall Project: 46.12% 🔴

There is no coverage information present for the files changed.

@github-actions

github-actions Bot commented Jun 22, 2026

| | Tests | Passed ✅ | Skipped | Failed |
| --- | --- | --- | --- | --- |
| Gradle Test Results (all modules & JDKs) | 1520 ran | 1520 passed | 0 skipped | 0 failed |

| Test | Result |
| --- | --- |

No test annotations available

@github-actions

HTML test reports are available as workflow artifacts (zipped HTML).

• Download: Artifacts for this run

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java
Comment thread docling-core/src/main/java/ai/docling/core/DoclingDocument.java
@edeandrea edeandrea disabled auto-merge June 22, 2026 15:10
@edeandrea edeandrea merged commit ed4d834 into docling-project:main Jun 22, 2026
27 checks passed
@krickert

Contributor Author

> Thanks @krickert for this!

No problem. I have to maintain the model on my branch anyway.

Now...

If only there were a technology out there that let you do model updates just once... Like a blockchain integration? Langblock4j?!
