fix(serializer): collapse multi-row column headers into one markdown header #602

Open
scottf007 wants to merge 2 commits into docling-project:main from scottf007:fix/markdown-multi-row-headers

@scottf007

Summary

MarkdownTableSerializer hardcoded headers=rows[0] and used rows[1:] as the body. When TableFormer correctly marks multiple leading rows as column_header=True (e.g. "Cash per Security" + "($)" on successive grid rows), only the first row rendered as the markdown header and the continuation row leaked into the body as a spurious data row.

Mirrors _export_to_dataframe_with_options (document.py:2219-2245): count the leading grid rows that contain any column_header cell, concatenate their cell text per column to build the header, and use the remaining rows as the body. Spanning siblings render empty so a colspan=N header isn't concatenated N times.

Falls back to "first row is header" when no column_header cells are marked, which preserves prior behaviour for that path.
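
The steps described in this summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Cell` dataclass and `split_header_and_body` function are hypothetical stand-ins for the real table types, the single-space separator is an assumption, and the cell values are made up.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for one grid position (not docling's real cell type)."""

    text: str
    start_row: int  # first grid row covered by the cell
    start_col: int  # first grid column covered by the cell
    column_header: bool = False


def split_header_and_body(grid):
    """Return (header, body) for a rectangular grid of Cell objects."""
    if not grid:
        return [], []

    # Count the leading rows that contain at least one column_header cell.
    n_header = 0
    for row in grid:
        if not any(cell.column_header for cell in row):
            break
        n_header += 1

    if n_header == 0:
        # Fallback: nothing is marked, so treat the first row as the header.
        n_header = 1

    header = []
    for c in range(len(grid[0])):
        parts = []
        for r in range(n_header):
            cell = grid[r][c]
            # A spanning cell fills several grid positions; keep its text only
            # at its origin so a colspan=N header is not repeated N times.
            is_origin = (cell.start_row, cell.start_col) == (r, c)
            if is_origin and cell.text.strip():
                parts.append(cell.text.strip())
        header.append(" ".join(parts))  # assumption: join with a single space

    body = [[cell.text for cell in row] for row in grid[n_header:]]
    return header, body


# Made-up example: a column title that wraps onto a second header row.
grid = [
    [Cell("Payment Date", 0, 0, True), Cell("Cash per Security", 0, 1, True)],
    [Cell("", 1, 0, True), Cell("($)", 1, 1, True)],
    [Cell("x", 2, 0), Cell("y", 2, 1)],
]
header, body = split_header_and_body(grid)
```

With the old "first row is header" behaviour, the second grid row of this example would have ended up in the body; here it is folded into the header of its column.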

Empirical result

10-PDF corpus, byte-comparison of rendered markdown:

  • IOZ_2025_04_17.pdf: 2-row header collapses into 1. Body unchanged.
  • 9 other fixtures: byte-identical.
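
A byte-comparison like the one above could be scripted along these lines. This is only a sketch under assumptions: `diff_rendered` is a hypothetical helper, and the two mappings (fixture name to rendered markdown bytes, before and after the change) are assumed to have been produced elsewhere.

```python
def diff_rendered(before, after):
    """Return the sorted fixture names whose rendered markdown bytes differ.

    `before` and `after` map a fixture name to the markdown bytes rendered
    without and with the change (hypothetical inputs). A fixture present in
    only one mapping is also reported as changed.
    """
    changed = []
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            changed.append(name)
    return changed
```

An empty result means every fixture is byte-identical; any name returned is a fixture whose output should be inspected by hand.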

Related: docling-project/docling#2985.

…header

MarkdownTableSerializer hardcoded `headers=rows[0]` and `rows[1:]` as the
body. When TableFormer (correctly) marks multiple leading rows as
column_header=True — the case where a column title wraps onto two visual
lines like "Cash per Security" + "($)" — only the first row rendered as
the markdown header and the continuation leaked into the body as a
spurious "data" row.

Mirrors the logic that already exists in
`_export_to_dataframe_with_options` (document.py:2219-2245): count leading
grid rows containing any column_header cell, concatenate their cell text
per column to build the markdown header, and use the remaining grid rows
as the body. Spanning siblings render empty in both cases so a colspan=N
header is not concatenated N times.

Falls back to "first row is header" if no row has any column_header cell,
preserving prior behaviour for tables that arrive without header marking.

Empirical result on a 10-PDF dividend-statement corpus
(finance_nexus tests/fixtures/dividend_statement/):

* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header
  collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 5, 2026

DCO Check Failed

Hi @scottf007, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for scott <scott@fletchcorp.com>

I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bc830f3ec36b43653569acae438982a4d1d"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@mergify
Contributor

mergify Bot commented May 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
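
To check a title locally before pushing, the pattern above can be tried with Python's `re` module. This sketch assumes the rule behaves like an unanchored regular-expression search over the title; `follows_conventional_commit` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commit(title):
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


ok = follows_conventional_commit(
    "fix(serializer): collapse multi-row column headers into one markdown header"
)
```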

I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bcadbf3cdbeb20bbe10f7c14dde0f9ec88a

Signed-off-by: scott <scott@fletchcorp.com>