## Description

When using `force_full_page_ocr=True` to handle PDFs with problematic fonts, GLYPH artifacts still appear in table content. The OCR step correctly replaces `textline_cells`, but `TableStructureModel` uses `word_cells` from the PDF backend, which still contain the corrupted text.
## Steps to Reproduce

1. Process a PDF with problematic fonts (e.g. Type3 fonts, or fonts with a missing ToUnicode CMap)
2. Enable `force_full_page_ocr=True` in the OCR options
3. Extract tables from the document
## Expected Behavior

All extracted text, including table content, should use the OCR-extracted text when `force_full_page_ocr=True`.
## Actual Behavior

Table content contains GLYPH artifacts like:

```
GLYPH<c=1,font=/AAAAAH+font000000002ed64673> GLYPH<c=1,font=/AAAAAH+font000000002ed64673>
```
## Root Cause Analysis

I traced through the code and identified the issue:

1. `PagePreprocessingModel` populates `page.parsed_page` with cells from the PDF backend:
   - `textline_cells` - line-level text (may contain GLYPHs)
   - `word_cells` - word-level text (may contain GLYPHs)
   - `char_cells` - character-level text (may contain GLYPHs)
2. The OCR model (`base_ocr_model.py`) with `force_full_page_ocr=True`:
   - Correctly replaces `page.parsed_page.textline_cells` with OCR text ✅
   - Does NOT clear `word_cells` or `char_cells` ❌
3. `TableStructureModel` (`table_structure_model.py`, lines 224-236):

   ```python
   sp = page._backend.get_segmented_page()
   if sp is not None:
       tcells = sp.get_cells_in_bbox(
           cell_unit=TextCellUnit.WORD,  # requests word-level cells
           bbox=table_cluster.bbox,
       )
       if len(tcells) == 0:
           tcells = table_cluster.cells  # only falls back if EMPTY
   ```

The fallback to the OCR cells only triggers when `word_cells` is empty, not when it contains garbage.
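The faulty branch can be reproduced in isolation. The `Cell` class and cell texts below are hypothetical stand-ins, not docling's real types:

```python
from dataclasses import dataclass


@dataclass
class Cell:  # stand-in for the backend's text-cell objects
    text: str


# word cells as the PDF backend returns them for a broken-font table
backend_word_cells = [
    Cell("GLYPH<c=1,font=/AAAAAH+font000000002ed64673>"),
    Cell("GLYPH<c=2,font=/AAAAAH+font000000002ed64673>"),
]
# what OCR recovered for the same bbox
ocr_cells = [Cell("Total"), Cell("42")]

tcells = backend_word_cells
if len(tcells) == 0:  # the only condition that triggers the OCR fallback
    tcells = ocr_cells

# The list is non-empty, so the garbage passes straight through:
print(tcells[0].text)  # GLYPH<c=1,font=/AAAAAH+font000000002ed64673>
```

Because the fallback checks only emptiness, any non-empty garbage wins over the OCR cells.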
## Environment
- docling version: 2.64.0
- Python version: 3.12
- OS: Linux (Docker container)
## Possible Solutions

### Option A: Clear word/char cells in the OCR model (minimal change)

When `force_full_page_ocr=True`, also clear `word_cells` and `char_cells` in `post_process_cells()`:

```python
if self.options.force_full_page_ocr:
    page.parsed_page.word_cells = []
    page.parsed_page.char_cells = []
    page.parsed_page.has_words = False
    page.parsed_page.has_chars = False
```
Pros: Simple; semantically consistent (if the PDF text layer is unreliable, it is unreliable at every granularity)

Cons: Loses word-level accuracy even for the portions that were extracted correctly
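With Option A, the existing empty-list fallback starts doing the right thing. A sketch with hypothetical stand-in objects (not docling's actual parsed-page class):

```python
from dataclasses import dataclass, field


@dataclass
class FakeParsedPage:  # stand-in, not docling-core's SegmentedPdfPage
    word_cells: list = field(default_factory=lambda: ["GLYPH<c=1,font=/F0>"])
    has_words: bool = True


def post_process(parsed_page, force_full_page_ocr: bool) -> None:
    # Option A: drop backend-derived word cells after full-page OCR
    if force_full_page_ocr:
        parsed_page.word_cells = []
        parsed_page.has_words = False


page = FakeParsedPage()
post_process(page, force_full_page_ocr=True)

ocr_cells = ["Total", "42"]
tcells = page.word_cells or ocr_cells  # empty -> the OCR fallback now fires
print(tcells)  # ['Total', '42']
```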
### Option B: GLYPH detection in TableStructureModel (surgical)

Check for GLYPH patterns before using the word cells:

```python
import re

_GLYPH_PATTERN = re.compile(r"GLYPH<[^>]+>")

def _contains_glyph_artifacts(self, cells):
    return any(_GLYPH_PATTERN.search(c.text) for c in cells)

# In predict_tables():
if len(tcells) == 0 or self._contains_glyph_artifacts(tcells):
    tcells = table_cluster.cells
```
Pros: Preserves word-level accuracy where the text layer is valid

Cons: More code; per-table pattern-matching overhead
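A self-contained version of the Option B predicate can be exercised directly; `SimpleNamespace` stands in here for real cell objects:

```python
import re
from types import SimpleNamespace

GLYPH_RE = re.compile(r"GLYPH<[^>]+>")


def contains_glyph_artifacts(cells) -> bool:
    """True if any cell's text carries a GLYPH<...> artifact."""
    return any(GLYPH_RE.search(cell.text) for cell in cells)


clean = [SimpleNamespace(text="Revenue"), SimpleNamespace(text="1,234")]
corrupt = [SimpleNamespace(text="GLYPH<c=1,font=/AAAAAH+font000000002ed64673>")]

print(contains_glyph_artifacts(clean))    # False
print(contains_glyph_artifacts(corrupt))  # True
```

Compiling the pattern once at module level avoids re-running `re.compile` for every table.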
I have a working fix for Option A that I've tested successfully. Happy to submit a PR or discuss alternative approaches.