fix(serializer): stop duplicating spanning cells in markdown/dataframe output #601

Open
scottf007 wants to merge 1 commit into docling-project:main from scottf007:fix/markdown-table-no-span-duplicate
@scottf007
Summary

The markdown table serializer and the legacy markdown / dataframe export paths iterate data.grid and emit each grid cell's text. When a TableCell has col_span > 1 or row_span > 1, the same cell reference occupies every grid position it covers, so iterating produces N copies of the cell's text in the output instead of one.

The underlying TableCell data is correct (TableFormer emits one cell per merged region with col_span / row_span set). Only the serialized presentation duplicates.
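
To make the duplication concrete, here is a minimal sketch of the mechanism. It does not use docling itself: `Cell` and `build_grid` are hypothetical stand-ins that mirror only the `TableCell` fields named in this PR, and the child header texts other than `Lp.` and `TAK` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for TableCell; only the fields discussed in this PR.
    text: str
    start_row_offset_idx: int
    start_col_offset_idx: int
    row_span: int = 1
    col_span: int = 1


def build_grid(cells, num_rows, num_cols):
    """Place the same Cell object at every grid position it covers."""
    grid = [[None] * num_cols for _ in range(num_rows)]
    for cell in cells:
        for r in range(cell.start_row_offset_idx, cell.start_row_offset_idx + cell.row_span):
            for c in range(cell.start_col_offset_idx, cell.start_col_offset_idx + cell.col_span):
                grid[r][c] = cell
    return grid


header = Cell("KRYTERIA FORMALNE", 0, 0, col_span=4)
# "Lp." and "TAK" appear in the PR; "Opis" and "NIE" are made-up fillers.
children = [Cell(t, 1, i) for i, t in enumerate(["Lp.", "Opis", "TAK", "NIE"])]
grid = build_grid([header] + children, num_rows=2, num_cols=4)

# Naive serialization: emit the text of whatever cell sits at each position.
naive_rows = [[cell.text for cell in row] for row in grid]
print(naive_rows[0])
```

The header row comes out as four copies of the same string, even though only one `Cell` object exists for the merged region.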

Visible symptoms

  • A col_span=4 header like `KRYTERIA FORMALNE` rendered as four repeated cells instead of one cell + three empties.
  • A row_span=2 cell like `Energy reduction` rendered twice in consecutive rows of the markdown table.
  • export_to_dataframe() concatenated the parent header text with each child header (e.g. `KRYTERIA FORMALNE.Lp.`, `KRYTERIA FORMALNE.TAK`).

Fix

In all three code paths:

  • MarkdownTableSerializer.serialize (modern path)
  • TableItem.export_to_markdown (legacy path)
  • TableItem.export_to_dataframe

emit cell text only at the cell's origin position (start_row_offset_idx, start_col_offset_idx). Spanned positions render as empty string. Cells with col_span=1 and row_span=1 are unaffected because their single grid position equals their origin.
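
The origin-only rule can be sketched as follows. This is an illustration under assumptions, not the code in the diff: `Cell`, `build_grid`, and `text_at` are hypothetical helpers, and the `Target A` / `Target B` texts are invented. Only the field names `start_row_offset_idx`, `start_col_offset_idx`, `row_span`, and `col_span` come from the PR.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for TableCell; only the fields discussed in this PR.
    text: str
    start_row_offset_idx: int
    start_col_offset_idx: int
    row_span: int = 1
    col_span: int = 1


def build_grid(cells, num_rows, num_cols):
    """Place the same Cell object at every grid position it covers."""
    grid = [[None] * num_cols for _ in range(num_rows)]
    for cell in cells:
        for r in range(cell.start_row_offset_idx, cell.start_row_offset_idx + cell.row_span):
            for c in range(cell.start_col_offset_idx, cell.start_col_offset_idx + cell.col_span):
                grid[r][c] = cell
    return grid


def text_at(cell, row_idx, col_idx):
    """Return the cell text only at its origin; spanned positions give ''."""
    is_origin = (
        cell.start_row_offset_idx == row_idx
        and cell.start_col_offset_idx == col_idx
    )
    return cell.text if is_origin else ""


cells = [
    Cell("Energy reduction", 0, 0, row_span=2),  # spans two rows
    Cell("Target A", 0, 1),  # made-up neighbour cells
    Cell("Target B", 1, 1),
]
grid = build_grid(cells, num_rows=2, num_cols=2)

rows = [
    [text_at(cell, r, c) for c, cell in enumerate(row)]
    for r, row in enumerate(grid)
]
markdown = "\n".join("| " + " | ".join(row) + " |" for row in rows)
print(markdown)
```

A cell with `row_span=1` and `col_span=1` has exactly one grid position, which is its origin, so `text_at` returns its text unchanged.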

Test plan

  • `pytest test/test_docling_doc.py` — 67 pass after fixture regen
  • Full non-chunker test suite — 376 pass, 2 xfailed, no regressions
  • Manual run via `docling` on three PDFs that exercise spans:
    • CVS Group 2025 annual report (TCFD table, page 35) — `Energy reduction` rowspan now renders once
    • Polish gov "Czyste Powietrze" — `KRYTERIA FORMALNE` colspan=4 now renders once
    • iShares dividend statement — `Net Cash for Reinvestment#` colspan=2 now renders once
  • Regenerated 6 affected groundtruth fixtures (`.yaml.md`, `.paged.md`, `constructed_doc*.md.gt`). Diffs are precisely the removed cell-content duplication, nothing else.

Issues addressed

Closes docling-project/docling#2862 — `export_to_dataframe` produces duplicated cells when tables contain spanning cells. The dataframe path is fixed by this change.

Notes

  • `data.grid` semantics are unchanged — the same `TableCell` reference still occupies multiple grid positions for spans. Only the serializers' use of it changes.
  • HTML serializer wasn't modified; HTML supports native `colspan` / `rowspan` so the duplication issue may not apply there. Please flag if it does.

…e output

The markdown table serializer and the legacy markdown/dataframe export
paths iterate `data.grid` and emit cell text at every grid position.
Because a TableCell with `col_span > 1` or `row_span > 1` occupies every
grid position it covers (with the same cell reference), this produces N
copies of the cell's text in the output instead of one.

Visible symptoms:
  - A colspan=4 header like "KRYTERIA FORMALNE" rendered as four repeated
    columns instead of one.
  - A rowspan=2 cell like "Energy reduction" rendered twice in
    consecutive rows of the markdown table.
  - export_to_dataframe concatenated the parent header text with each
    child header (e.g. "KRYTERIA FORMALNE.Lp.", "KRYTERIA FORMALNE.TAK").

Fix: in all three code paths (modern MarkdownTableSerializer.serialize,
legacy TableItem.export_to_markdown, TableItem.export_to_dataframe),
emit cell text only at the cell's origin position
(start_row_offset_idx, start_col_offset_idx). Spanned positions render
as empty strings. Underlying TableCell data is unchanged; only the
serialized presentation changes. Cells with col_span=1 and row_span=1
are unaffected because their single grid position equals their origin.

Regenerated affected groundtruth fixtures
(test/data/doc/2206.01062.yaml.md, *.paged.md, constructed_doc*.md.gt,
constructed_document.yaml.md). The diff in those files is precisely the
removed cell-content duplication.

Closes docling-project/docling#2862

Signed-off-by: scott <scott@fletchcorp.com>
@github-actions
Contributor

github-actions Bot commented May 4, 2026

DCO Check Passed

Thanks @scottf007, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented May 4, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Development

Successfully merging this pull request may close these issues.

Tables in PDF: parsed table element contains many duplicated cells when .export_to_dataframe()