extract_text: wide-spaced positioned-text kept over flow prose on JSTOR-scanned books #318

@yfedoseev

Description

Summary

On JSTOR-sourced multi-column textbooks, certain pages have two copies of the same content in the PDF content stream:

  1. A flow-prose copy with normal single-spaced words.
  2. A positioned copy where each word sits at its explicit x-coordinate, producing runs of 4–20 spaces between words.

Through v0.3.23, both copies were emitted, so each affected page appeared twice in the output. v0.3.25 correctly deduplicates them via the #315/#316 reading-order fixes, but it consistently keeps the positioned, wide-spaced copy and drops the clean flow-prose copy. All content is preserved (whitespace-normalized word counts match), but the extracted text is much less readable than it could be.

Reproduction

./target/release/examples/extract_text_simple \
  "pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf" > out.txt
grep -n "Daimler-Maybach" out.txt

Output at current HEAD around page 244:

         In 1894        a new           Daimler-Maybach              gasoline       engine          installed      in a            car that
won      the            Paris-Bordeaux  race           rated         less than      30 g/W          (Beaumont      1902),          leaving
no       place for      steam           engines        in road       transportation.                And even       the first       commer-

v0.3.23 output for the same page had both copies, starting with the flow-prose one:

In 1894 a new Daimler-Maybach gasoline engine installed in a car that
won the Paris-Bordeaux race rated less than 30 g/W (Beaumont 1902), leaving
no place for steam engines in road transportation. And even the first commer-

immediately followed by the wide-spaced copy above.

Expected behavior

When deduplicating two positionally-overlapping text copies, prefer the one with tighter word spacing (more "prose-like") over the one with wide positioned gaps.
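The preference could be ranked roughly as follows. This is a minimal sketch, not the extractor's actual dedup code: the names `mean_gap`, `same_words`, and `prefer_prose` are hypothetical, and it operates on already-extracted strings, whereas the real heuristic presumably works on positioned spans. It treats two copies as duplicates when their whitespace-separated words match in order, then keeps the copy whose average inter-word gap is smaller.

```rust
/// Mean length of the whitespace runs that separate words within a line.
/// Leading indentation and trailing whitespace are ignored.
fn mean_gap(text: &str) -> f64 {
    let mut runs = 0usize;
    let mut total = 0usize;
    for line in text.lines() {
        let mut current = 0usize;
        for ch in line.trim().chars() {
            if ch.is_whitespace() {
                current += 1;
            } else if current > 0 {
                // A word starts: close the whitespace run that preceded it.
                runs += 1;
                total += current;
                current = 0;
            }
        }
    }
    if runs == 0 {
        0.0
    } else {
        total as f64 / runs as f64
    }
}

/// True when both copies contain the same words in the same order,
/// regardless of how much whitespace separates them.
fn same_words(a: &str, b: &str) -> bool {
    a.split_whitespace().eq(b.split_whitespace())
}

/// If `a` and `b` are duplicates, return the more prose-like copy
/// (smaller mean gap); otherwise return `None`.
fn prefer_prose<'a>(a: &'a str, b: &'a str) -> Option<&'a str> {
    if !same_words(a, b) {
        return None;
    }
    if mean_gap(a) <= mean_gap(b) {
        Some(a)
    } else {
        Some(b)
    }
}
```

With the page 244 sample, the flow-prose copy has a mean gap of 1 and the positioned copy has a much larger one, so `prefer_prose` would return the flow-prose copy regardless of argument order.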

Impact

  • Affects JSTOR-scanned academic PDFs with a text layer rendered via per-glyph positioning.
  • Content is preserved. This was verified by a whitespace-normalized word-count comparison and by searching for distinctive phrases such as "primitive harness", "Maybach designed", and "to transport people"; all are present in the HEAD output after tr -s '[:space:]' ' '.
  • Readability suffers: paragraphs no longer look like paragraphs, and downstream NLP pipelines that tokenize on runs of whitespace may produce noise.
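The content-preservation check described above can be approximated in Rust. This is a sketch under assumptions: `normalize`, `word_count`, and `missing_phrases` are illustrative helper names, not functions from the library. `normalize` behaves roughly like `tr -s '[:space:]' ' '`, except that it also trims leading and trailing whitespace. Downstream consumers could apply the same normalization as a stopgap.

```rust
/// Collapse every run of whitespace (spaces, tabs, newlines) into one space.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Number of whitespace-separated words; unaffected by gap width.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Return the phrases that do NOT occur in the normalized text.
/// An empty result means every distinctive phrase survived extraction.
fn missing_phrases<'a>(text: &str, phrases: &[&'a str]) -> Vec<&'a str> {
    let normalized = normalize(text);
    phrases
        .iter()
        .copied()
        .filter(|p| !normalized.contains(*p))
        .collect()
}
```

Comparing `word_count` across the v0.3.23 and v0.3.25 outputs is only meaningful per page or per deduplicated copy, since v0.3.23 emitted the duplicated pages twice.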

Reference corpus

  • pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf
  • Similar pattern likely on other JSTOR scans in pdfs_slow* dirs.

Tested versions

  • 0.3.23: both copies emitted (output ~1.8 MB for this file, with visible duplicates)
  • 0.3.25 (release/v0.3.25): only wide-spaced copy emitted (output ~1.6 MB)

Priority

Low: there is no content loss, and the deduplication itself is correct. This is a preference-ranking refinement for the dedup heuristic.
