feat: update DoclingDocument model with correct wire formats #558
Pull request overview
Updates docling-core’s DoclingDocument model to align Java serialization/deserialization with the Python docling-core v2.83.x wire format, adding new metadata fields and new document node types while tightening JSON shapes used on the wire.
Changes:
- Added new wire-format model types (e.g., FineRef, TrackSource, field region/item nodes, and new meta fields) and extended existing nodes with source and comments.
- Extended metadata (language, entities, keywords, topics) and added PictureMeta.code; added Orientation and typed TableData.table_cells as List<TableCell>.
- Added/expanded Jackson round-trip tests for the new shapes (tuple-like arrays for charspan/range, flat TrackSource, field text polymorphism, null-as-empty lists, typed table_cells, etc.).
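For readers unfamiliar with the tuple-like array shape mentioned above, here is a minimal, dependency-free sketch of what "charspan/range as a 2-element array" means on the wire. It is not the PR's code: the class name CharSpanSketch, the record CharSpan, and the hand-rolled toJson/fromJson helpers are hypothetical stand-ins for what the Jackson-annotated model does, and rejecting the object shape is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; the real model relies on Jackson, not on this helper. */
final class CharSpanSketch {

    /** Two ints that travel on the wire as [start, end] rather than as a JSON object. */
    public record CharSpan(int start, int end) {

        private static final Pattern WIRE =
                Pattern.compile("\\[\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*\\]");

        /** Serialize as a 2-element JSON array, mirroring Python's tuple[int, int]. */
        public String toJson() {
            return "[" + start + "," + end + "]";
        }

        /** Parse the array shape; any other shape (for example an object) is rejected. */
        public static CharSpan fromJson(String json) {
            Matcher m = WIRE.matcher(json.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("expected [int, int] but got: " + json);
            }
            return new CharSpan(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
        }
    }

    public static void main(String[] args) {
        CharSpan span = CharSpan.fromJson("[0, 5]");
        System.out.println(span.toJson()); // prints [0,5]
    }
}
```

The point of the round trip is that the array form survives unchanged, whereas an object form such as {"start": 0, "end": 5} would not match what the Python side emits.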
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docling-core/src/main/java/ai/docling/core/DoclingDocument.java | Adds/updates model types and Jackson annotations to match the updated Python wire format (new meta fields, field nodes, source/comments, typed table cells, orientation, picture code meta). |
| docling-core/src/test/java/ai/docling/core/DoclingDocumentTests.java | Adds coverage for the new JSON wire shapes and null/empty collection handling. |
Add new types from the Python docling-core model:
- FineRef: $ref + range as [int,int] JSON array (not object)
- TrackSource: flat discriminated object with kind/start_time/end_time
- EntityMention: charspan as [int,int] JSON array matching CharSpan tuple
- EntitiesMetaField, KeywordsMetaField, TopicsMetaField, LanguageMetaField
- FieldHeadingItem, FieldValueItem (as sealed BaseTextItem subtypes)
- FieldRegionItem, FieldItem (new top-level document item types)
- CodeMetaField (PictureMeta.code) for code-backed picture nodes

Update existing types:
- BaseMeta/FloatingMeta/PictureMeta: add language, entities, keywords, topics; PictureMeta also gains the code (CodeMetaField) field
- All BaseTextItem implementations: add source (List<TrackSource>) and comments (List<FineRef>) fields matching DocItem in the Python model
- DocItemLabel: add FIELD_REGION, FIELD_HEADING, FIELD_ITEM, FIELD_KEY, FIELD_VALUE, FIELD_HINT, MARKER
- DoclingDocument: add field_regions and field_items lists; add missing @JsonSetter(nulls = AS_EMPTY) to tables, key_value_items, form_items
- TableData.table_cells: type as List<TableCell> (was List<Object>) so cells deserialize into typed TableCell instances, matching list[AnyTableCell]

Add tests verifying correct JSON array wire format for charspan/range, flat TrackSource deserialization, polymorphic FieldHeadingItem/FieldValueItem dispatch, null-safe field_regions/field_items handling, PictureMeta sub-fields (description/classification/molecule/tabular_chart/code), and typed table_cells.

Signed-off-by: Kristian Rickert <krickert@gmail.com>
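The null-safe handling of field_regions and field_items described in this commit can be illustrated without Jackson. The sketch below is an assumption-laden stand-in, not the committed code: the class NullAsEmptySketch, the record Doc, and the sample reference string are hypothetical. It only mimics the observable effect of a null-as-empty setter, where an absent or null collection is read back as an empty list.

```java
import java.util.List;

/** Hypothetical illustration of null-as-empty semantics; not the PR's DoclingDocument. */
final class NullAsEmptySketch {

    /** Trimmed-down document holding only the two new collections. */
    public record Doc(List<String> fieldRegions, List<String> fieldItems) {
        // Compact canonical constructor: normalise null to an empty, immutable list.
        public Doc {
            fieldRegions = fieldRegions == null ? List.of() : List.copyOf(fieldRegions);
            fieldItems = fieldItems == null ? List.of() : List.copyOf(fieldItems);
        }
    }

    public static void main(String[] args) {
        // The reference string is an illustrative value, not taken from a real document.
        Doc doc = new Doc(null, List.of("#/field_items/0"));
        System.out.println(doc.fieldRegions().isEmpty()); // prints true
        System.out.println(doc.fieldItems());             // prints [#/field_items/0]
    }
}
```

Callers can then iterate over either collection without a null check, which is the practical benefit the tests for null-safe handling are guarding.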
:java_duke: JaCoCo coverage report
HTML test reports are available as workflow artifacts (zipped HTML). Download: Artifacts for this run
No problem. I have to maintain the model on my branch anyway. Now... if only there were a technology out there so you could do model updates just once... Like a blockchain integration? Langblock4j?!
Updates the docling-core Java DoclingDocument model to match the Python docling-core wire format (v2.83.x).

New types
FineRef, TrackSource, EntityMention, EntitiesMetaField, KeywordsMetaField, TopicsMetaField, LanguageMetaField, CodeMetaField, FieldHeadingItem, FieldValueItem, FieldRegionItem, FieldItem.

Updated types
- BaseMeta/FloatingMeta/PictureMeta gain language, entities, keywords, topics; PictureMeta also gains code (CodeMetaField).
- BaseTextItem implementations gain source (List<TrackSource>) and comments (List<FineRef>) matching DocItem.
- DocItemLabel adds FIELD_REGION, FIELD_HEADING, FIELD_ITEM, FIELD_KEY, FIELD_VALUE, FIELD_HINT, MARKER.
- DoclingDocument adds field_regions and field_items lists.
- TableData.table_cells typed as List<TableCell> (was List<Object>).

Wire-format details
- charspan/range serialize as 2-element [int, int] JSON arrays (Python tuple[int, int] parity).
- TrackSource is a flat object with a kind: "track" discriminator.
- label for field heading/value text items.

Tests
Adds round-trip tests for the new wire shapes (charspan/range arrays, flat TrackSource, polymorphic field text items, null-safe field collections, PictureMeta sub-fields incl. code, typed table_cells).
./gradlew :docling-core:test is green and spotless is clean.
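To make the label-based dispatch for field heading/value text items concrete, here is a small dependency-free sketch of choosing a concrete subtype from a label value within a closed (sealed) hierarchy. Everything in it is hypothetical: the names LabelDispatchSketch, FieldText, FieldHeading, and FieldValue, and the lowercase label strings "field_heading" and "field_value", are assumptions for illustration and are not confirmed by the PR.

```java
/** Hypothetical illustration of label-driven subtype selection; not the PR's model. */
final class LabelDispatchSketch {

    /** Closed hierarchy standing in for the field text items. */
    public sealed interface FieldText permits FieldHeading, FieldValue {
        String text();
    }

    public record FieldHeading(String text) implements FieldText {}

    public record FieldValue(String text) implements FieldText {}

    /** Pick the concrete type from the label value, as a polymorphic deserializer would. */
    public static FieldText fromLabel(String label, String text) {
        switch (label) {
            case "field_heading": // assumed wire value
                return new FieldHeading(text);
            case "field_value":   // assumed wire value
                return new FieldValue(text);
            default:
                throw new IllegalArgumentException("unsupported label: " + label);
        }
    }

    public static void main(String[] args) {
        FieldText item = fromLabel("field_value", "ACME Corp");
        System.out.println(item.getClass().getSimpleName() + ": " + item.text());
        // prints FieldValue: ACME Corp
    }
}
```

Because the hierarchy is sealed, an unknown label fails loudly instead of silently producing a generic text item, which is the behaviour a round-trip test for polymorphic items would want to pin down.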