fix(serializer): collapse multi-row column headers into one markdown header #602
Open
scottf007 wants to merge 2 commits into docling-project:main
MarkdownTableSerializer hardcoded `headers=rows[0]` and `rows[1:]` as the body. When TableFormer (correctly) marks multiple leading rows as column_header=True (the case where a column title wraps onto two visual lines, like "Cash per Security" + "($)"), only the first row rendered as the markdown header and the continuation leaked into the body as a spurious "data" row.

Mirrors the logic that already exists in `_export_to_dataframe_with_options` (document.py:2219-2245): count leading grid rows containing any column_header cell, concatenate their cell text per column to build the markdown header, and use the remaining grid rows as the body. Spanning siblings render empty in both cases so a colspan=N header is not concatenated N times.

Falls back to "first row is header" if no row has any column_header cell, preserving prior behaviour for tables that arrive without header marking.

Empirical result on a 10-PDF dividend-statement corpus (finance_nexus tests/fixtures/dividend_statement/):

* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ DCO Check Failed

Hi @scottf007, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for scott <scott@fletchcorp.com>
    I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bc830f3ec36b43653569acae438982a4d1d"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bcadbf3cdbeb20bbe10f7c14dde0f9ec88a

Signed-off-by: scott <scott@fletchcorp.com>
Summary
`MarkdownTableSerializer` hardcoded `headers=rows[0]` and `rows[1:]`. When TableFormer correctly marks multiple leading rows as `column_header=True` (e.g. "Cash per Security" + "($)" on successive grid rows), only the first rendered as the markdown header and the continuation leaked into the body as a spurious data row.

Mirrors `_export_to_dataframe_with_options` (document.py:2219-2245): count leading grid rows containing any `column_header` cell, concatenate per column for the header, use the rest as body. Spanning siblings render empty so a `colspan=N` header isn't concatenated N times.

Falls back to "first row is header" when no `column_header` cells are marked, which preserves prior behaviour for that path.

Empirical result
10-PDF corpus, byte-comparison of rendered markdown:
Related: docling-project/docling#2985.
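
For readers skimming the thread, the header-collapsing steps described in the summary can be sketched roughly as below. This is a standalone illustration, not the code in this PR and not docling's API: `Cell`, `is_span_sibling`, `split_header_and_body` and `to_markdown` are hypothetical names, the grid is assumed to be rectangular, joining header fragments with a single space is an assumption, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cell:
    """Hypothetical stand-in for a table grid cell (not docling's own type)."""
    text: str
    column_header: bool = False
    # True for the extra grid slots covered by a colspan; these render empty.
    is_span_sibling: bool = False


def _render(cell: Cell) -> str:
    # Spanning siblings render empty so a colspan=N title is emitted only once.
    return "" if cell.is_span_sibling else cell.text.strip()


def split_header_and_body(grid: List[List[Cell]]) -> Tuple[List[str], List[List[str]]]:
    """Collapse the leading column-header rows into one header row."""
    if not grid:
        return [], []

    # Count leading rows that contain at least one column_header cell.
    n_header = 0
    for row in grid:
        if any(cell.column_header for cell in row):
            n_header += 1
        else:
            break

    # Fallback: nothing is marked as a header, so treat the first row as one.
    if n_header == 0:
        n_header = 1

    # Concatenate the header fragments per column (separator is an assumption).
    header = []
    for col in range(len(grid[0])):
        parts = [_render(grid[r][col]) for r in range(n_header)]
        header.append(" ".join(p for p in parts if p))

    body = [[_render(cell) for cell in row] for row in grid[n_header:]]
    return header, body


def to_markdown(header: List[str], body: List[List[str]]) -> str:
    """Render a header row, a separator row, and the body rows."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up grid: the second column's title wraps onto a second header row.
grid = [
    [Cell("Payment Date", True), Cell("Cash per Security", True)],
    [Cell("", True), Cell("($)", True)],
    [Cell("17/04/2025"), Cell("0.25")],
]
header, body = split_header_and_body(grid)
print(to_markdown(header, body))
```

With the made-up grid above, the two header rows collapse into a single `Cash per Security ($)` column title and the body keeps only the one data row, which is the behaviour the summary describes.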