extract_text: wide-spaced positioned-text kept over flow prose on JSTOR-scanned books #318

## Summary

On JSTOR-sourced multi-column textbooks, certain pages have two copies of the same content in the PDF content stream:

1. A **flow-prose** copy with normal single-spaced words.
2. A **positioned** copy where each word sits at its explicit x-coordinate, producing runs of 4–20 spaces between words.

Through v0.3.23 both copies were emitted, so each affected page appeared twice in the output. v0.3.25 correctly deduplicates them via the #315/#316 reading-order fixes, but it consistently keeps the positioned, wide-spaced copy and drops the clean flow-prose copy. All content is preserved (whitespace-normalized word counts match), but the extracted text is far less readable than it could be.

## Reproduction

```bash
./target/release/examples/extract_text_simple \
  "pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf" > out.txt
grep -n "Daimler-Maybach" out.txt
```

Current HEAD output around page 244:
```
         In 1894        a new           Daimler-Maybach              gasoline       engine          installed      in a            car that
won      the            Paris-Bordeaux  race           rated         less than      30 g/W          (Beaumont      1902),          leaving
no       place for      steam           engines        in road       transportation.                And even       the first       commer-
```

v0.3.23 output for the same page contained both copies, starting with the flow-prose one:
```
In 1894 a new Daimler-Maybach gasoline engine installed in a car that
won the Paris-Bordeaux race rated less than 30 g/W (Beaumont 1902), leaving
no place for steam engines in road transportation. And even the first commer-
```
immediately followed by the wide-spaced copy above.

## Expected behavior

When deduplicating two positionally-overlapping text copies, prefer the one with tighter word spacing (more "prose-like") over the one with wide positioned gaps.
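
One way to rank the two copies is by the average width of the whitespace runs between words. The sketch below is only an illustration of that idea, written in Python for brevity; `mean_gap_width` and `prefer_tighter_copy` are hypothetical names, not functions from this project, and a real fix would presumably score spans inside the extractor rather than finished strings.

```python
import re

def mean_gap_width(text: str) -> float:
    """Average length of the space/tab runs between words, line by line."""
    gaps = [
        len(m.group())
        for line in text.splitlines()
        for m in re.finditer(r"[ \t]+", line.strip())
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0

def prefer_tighter_copy(a: str, b: str) -> str:
    """Return the copy with the smaller mean gap; ties keep the first copy."""
    return a if mean_gap_width(a) <= mean_gap_width(b) else b

flow = "In 1894 a new Daimler-Maybach gasoline engine"
positioned = "In 1894        a new           Daimler-Maybach              gasoline       engine"

# The flow-prose copy wins regardless of argument order.
chosen = prefer_tighter_copy(positioned, flow)
```

Flow prose scores exactly 1.0 on this metric, while the positioned copy scores well above it, so the comparison needs no tuned threshold as long as both candidates are known to carry the same words.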

## Impact

- Affects JSTOR-scanned academic PDFs with a text layer rendered via per-glyph positioning.
- Content is preserved: whitespace-normalized word counts match, and distinctive phrases such as `primitive harness`, `Maybach designed`, and `to transport people` are all present in the HEAD output after `tr -s '[:space:]' ' '`.
- Readability suffers: paragraphs no longer look like paragraphs, and downstream NLP pipelines that tokenize on runs of whitespace may produce noise.
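
The content-preservation check described above can be approximated with a few lines of Python. This is a sketch of the idea, not the script actually used for this report; `normalize` and `content_preserved` are hypothetical helpers, and the sample strings are shortened from the page 244 excerpt.

```python
def normalize(text: str) -> str:
    """Collapse every whitespace run to one space, like tr -s '[:space:]' ' '."""
    return " ".join(text.split())

def content_preserved(reference: str, candidate: str, phrases: list[str]) -> bool:
    """True if word counts match and every phrase survives normalization."""
    if len(reference.split()) != len(candidate.split()):
        return False
    flat = normalize(candidate)
    return all(phrase in flat for phrase in phrases)

flow = "no place for steam engines in road transportation."
positioned = "no       place for      steam           engines        in road       transportation."

# A raw substring search misses the phrase in the wide-spaced copy...
found_raw = "steam engines" in positioned
# ...but it is found once whitespace runs are collapsed.
ok = content_preserved(flow, positioned, ["steam engines"])
```

The same `normalize` step also works as a stopgap for downstream consumers that need single-spaced text, although it would flatten line breaks as well.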

## Reference corpus

- `pdfs_slow9/[Vaclav-Smil]-Energy-and-Civilization_-A-History(z-lib.org).pdf`
- Similar pattern likely on other JSTOR scans in `pdfs_slow*` dirs.

## Tested versions

- 0.3.23: both copies emitted (output ~1.8 MB for this file, with visible duplicates)
- 0.3.25 (release/v0.3.25): only wide-spaced copy emitted (output ~1.6 MB)

## Priority

Low: there is no content loss, and the deduplication itself is correct. This is a preference-ranking refinement for the dedup heuristic.
