Summary
On JSTOR-sourced multi-column textbooks, certain pages have two copies of the same content in the PDF content stream:
- A flow-prose copy with normal single-spaced words.
- A positioned copy where each word sits at its explicit x-coordinate, producing runs of 4–20 spaces between words.
Through v0.3.23, both copies were emitted, so the output contained each affected page twice. v0.3.25 correctly deduplicates them via the #315/#316 reading-order fixes, but it consistently keeps the positioned, wide-spaced copy and drops the clean flow-prose copy. All content is preserved (whitespace-normalized word counts match), but the extracted text is much less readable than it could be.
Reproduction
./target/release/examples/extract_text_simple \
"pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf" > out.txt
grep -n "Daimler-Maybach" out.txt
Current head of file around page 244:
In 1894 a new Daimler-Maybach gasoline engine installed in a car that
won the Paris-Bordeaux race rated less than 30 g/W (Beaumont 1902), leaving
no place for steam engines in road transportation. And even the first commer-
v0.3.23 output for the same page had both:
In 1894 a new Daimler-Maybach gasoline engine installed in a car that
won the Paris-Bordeaux race rated less than 30 g/W (Beaumont 1902), leaving
no place for steam engines in road transportation. And even the first commer-
immediately followed by the wide-spaced copy above.
Expected behavior
When deduplicating two positionally-overlapping text copies, prefer the one with tighter word spacing (more "prose-like") over the one with wide positioned gaps.
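The preference could be sketched roughly as follows. This is a minimal illustration in Python, not the library's actual dedup code: it assumes the two candidate copies are available as plain strings, and the names `mean_gap`, `same_words`, and `prefer_prose_copy` are hypothetical. A real implementation would more likely score positioned spans than strings.

```python
import re


def mean_gap(text: str) -> float:
    """Average length of the space/tab runs that separate words on one line."""
    gaps = [
        len(m.group())
        for line in text.splitlines()
        for m in re.finditer(r"(?<=\S)[ \t]+(?=\S)", line)
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0


def same_words(a: str, b: str) -> bool:
    """True when both copies hold the same words once whitespace is ignored."""
    return a.split() == b.split()


def prefer_prose_copy(a: str, b: str) -> str:
    """Of two duplicate copies, keep the one with the tighter word spacing."""
    if not same_words(a, b):
        raise ValueError("copies differ in content; not treating them as duplicates")
    # Ties go to the first copy (assumption: callers pass copies in stream order).
    return a if mean_gap(a) <= mean_gap(b) else b
```

With a flow-prose copy (mean gap of 1.0) and a positioned copy (gaps of 4–20 spaces), `prefer_prose_copy` returns the flow-prose copy regardless of argument order.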
Impact
- Affects JSTOR-scanned academic PDFs with a text layer rendered via per-glyph positioning.
- Content is preserved (verified by whitespace-normalized word-count comparison and by searching for distinctive phrases such as "primitive harness", "Maybach designed", and "to transport people"; all are present in head after tr -s '[:space:]' ' ').
- Readability suffers: paragraphs no longer look like paragraphs, and downstream NLP pipelines that tokenize on runs of whitespace may produce noise.
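The content-preservation check and the readability symptom can both be measured with a short script. The sketch below is an assumption about how one might do this in Python, not part of the project: all helper names are hypothetical, and `normalize_ws` only approximates tr -s '[:space:]' ' ' (Python's `\s` also matches some Unicode whitespace).

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse every whitespace run to a single space."""
    return re.sub(r"\s+", " ", text)


def word_count(text: str) -> int:
    """Number of whitespace-separated words."""
    return len(text.split())


def phrases_present(text: str, phrases: list[str]) -> bool:
    """True when every phrase occurs in the whitespace-normalized text."""
    flat = normalize_ws(text)
    return all(p in flat for p in phrases)


def wide_gap_lines(text: str, min_run: int = 4) -> int:
    """Count lines with at least one run of min_run or more spaces between words."""
    pattern = re.compile(r"\S {%d,}\S" % min_run)
    return sum(1 for line in text.splitlines() if pattern.search(line))
```

Comparing `word_count` across two versions' outputs checks that no content was lost, while `wide_gap_lines` gives a rough count of how many lines carry the positioned-copy spacing.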
Reference corpus
- pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf
- Similar pattern likely on other JSTOR scans in pdfs_slow* dirs.
Tested versions
- 0.3.23: both copies emitted (output ~1.8 MB for this file, with visible duplicates)
- 0.3.25 (release/v0.3.25): only wide-spaced copy emitted (output ~1.6 MB)
Priority
Low: no content is lost, and the deduplication itself is correct. This is a preference-ranking refinement for the dedup heuristic.