Add olmOCR-bench old_scans experiment for PaddleOCR-VL-1.6 by davanstrien · Pull Request #25 · davanstrien/ocr-bench

davanstrien · 2026-06-27T11:30:35Z

What

A standalone experiment (in experiments/olmocr-bench-oldscans/, outside the ocr_bench lib) that scores PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench — a number its technical report (arXiv 2606.03264) never publishes (it only reports OmniDocBench v1.6 / Real5).

Three files: convert.py, score.py, README.md.

Result (preliminary)

Category	Pass rate	Tests
old_scans (present/absent/order)	38.6%	203/526
baseline (auto, 1/PDF)	84.7%	83/98
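
The percentages in the table follow directly from the pass counts. A minimal arithmetic check (the helper name `pass_rate` is illustrative and not part of the experiment's scripts):

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Counts taken from the result table above.
old_scans = pass_rate(203, 526)
baseline = pass_rate(83, 98)
print(f"old_scans: {old_scans}%  baseline: {baseline}%")
```

The same counts give 98 - 83 = 15 baseline failures.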

Mid-pack on the absolute leaderboard, but roughly best-in-class for its 0.9B size (it ties Qwen2.5-VL-7B). Notable finding: it hallucinates Chinese characters on English handwritten scans, which shows up as disallowed-character failures in the baseline tests.

Design

convert (hf jobs run, GPU l4x1): PaddlePaddle's vendor docker image (paddle + paddleocr + 1.6 weights baked in) → markdown, pushed to a bucket via sync_bucket.
score (hf jobs uv run, CPU): stock olmocr.bench.benchmark run over the bucket.
Fidelity: mirrors olmOCR-bench's own run_paddlevl.py exactly; the only delta is pipeline_version="v1.6". The README documents the image, every command, and all gotchas hit during bring-up.
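
The extraction step described above can be sketched roughly as follows. This is not the code in convert.py or run_paddlevl.py. It only assumes what the PR states: the pipeline object has a `predict` method and each result exposes `markdown["markdown_texts"]`. The function name `pdf_to_markdown`, the `MarkdownPipeline` protocol, and the choice to join multiple pages with blank lines are illustrative assumptions. In the real job the pipeline would presumably be constructed from the paddleocr package with pipeline_version="v1.6", as the PR says; that construction is not exercised here.

```python
from typing import Any, Iterable, Protocol


class MarkdownPipeline(Protocol):
    """Minimal interface this sketch relies on (hypothetical name)."""

    def predict(self, input: str) -> Iterable[Any]: ...


def pdf_to_markdown(pipeline: MarkdownPipeline, pdf_path: str) -> str:
    """Run the pipeline on one PDF and return its markdown text.

    Each result is expected to carry a `markdown` dict with a
    "markdown_texts" entry, as referenced in the PR. Joining pages with a
    blank line is an assumption made for this sketch.
    """
    page_texts = [res.markdown["markdown_texts"] for res in pipeline.predict(pdf_path)]
    return "\n\n".join(page_texts)
```

Passing the pipeline in as an argument keeps the default, untuned configuration in one place and lets the same function serve both the v1.6 and v1 runs.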

To review / open questions

Numbers are preliminary — not yet validated by reproducing a published olmOCR-bench number (olmOCR/Qwen) through this harness. Recommend doing that before quoting 38.6% publicly.
Decoding determinism is not pinned, and the image is referenced as :latest rather than by digest (see README reproducibility notes).
Intended as the source.url target for a possible model-card eval-results PR + discussion.

🤖 Generated with Claude Code

Standalone experiment (outside the lib, in experiments/) scoring PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench, a number its technical report (arXiv 2606.03264) never publishes (the report only covers OmniDocBench v1.6 / Real5).

Faithful to olmOCR-bench's own run_paddlevl.py extraction (res.markdown['markdown_texts'], default pipeline, no tuning); the only delta is pipeline_version=v1.6. Scoring is stock olmocr.bench.benchmark. Two HF Jobs: PaddlePaddle vendor image convert -> olmocr scoring, with a bucket handoff.

Result (preliminary, pending anchor-reproduction validation): old_scans 38.6% (present 31.2 / absent 95.7 / order 27.7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop bring-up narrative; state config facts as requirements. Add a Reproducibility section (pin image digest / dataset revision / olmocr version / greedy decoding).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

generation_config.json has no sampling params -> greedy/deterministic. Candidate outputs spot-checked against source scans (real, untruncated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
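
The reasoning "no sampling params -> greedy" can be sketched as a small check. It assumes Hugging Face transformers-style semantics, where sampling happens only when `do_sample` is true and beam search only when `num_beams` exceeds 1. Whether the PaddleOCR-VL runtime interprets the file the same way is an assumption, and the example config contents below are made up for illustration.

```python
import json


def is_greedy(generation_config: dict) -> bool:
    """True if the config implies greedy decoding under transformers-style rules.

    Assumption: absent keys fall back to do_sample=False and num_beams=1.
    """
    samples = bool(generation_config.get("do_sample", False))
    beams = int(generation_config.get("num_beams", 1))
    return not samples and beams == 1


# Illustrative contents only; not the model's actual generation_config.json.
config = json.loads('{"bos_token_id": 1, "eos_token_id": 2}')
print(is_greedy(config))
```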

Self-contained static page: 7 old_scans docs, source scan beside the model markdown, hallucinated CJK glyphs highlighted. gen_samples.py regenerates it from scan/output pairs pulled from the bucket.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
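
One way to highlight CJK glyphs in model output for a static page is sketched below. This is not gen_samples.py's actual logic. The function name, the use of `<mark>` tags, and the choice of Unicode ranges (Hiragana, Katakana, CJK Unified Ideographs and Extension A) are assumptions made for illustration.

```python
import html
import re

# Hiragana + Katakana (U+3040-U+30FF), CJK Extension A (U+3400-U+4DBF),
# CJK Unified Ideographs (U+4E00-U+9FFF). Other CJK blocks are ignored here.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")


def highlight_cjk(text: str) -> str:
    """HTML-escape text and wrap each run of CJK characters in <mark> tags."""
    parts = []
    pos = 0
    for match in CJK_RUN.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        parts.append(f"<mark>{html.escape(match.group())}</mark>")
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


print(highlight_cjk("Dear Sir 的 <b>"))
```

Escaping each piece before wrapping keeps any markup in the model output from being interpreted by the browser.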

- olmOCR-bench lists 'PaddleOCR-VL' (unversioned) at Old scans 37.8; its run_paddlevl runner landed 2025-10-20, pre-1.6 (2026-05-28). So 38.6 is the first 1.6 number, ~consistent with the older figure; it is not a same-version reproduction. Finding: v1.6's OmniDocBench gains don't transfer to old scans.
- Clarify 38.6 = old_scans.jsonl sub-score; harness also prints ~61.6 Average.
- baseline 84.7% (old_scans-only auto-baseline) != leaderboard Base 98.5% (full-bench auto-baseline); all 15 baseline failures are CJK/JP disallowed-char.
- Soften 'exactly' re: the v1.6 pin; tighten LIMIT smoke-test wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- convert.py reuses PDFs from the bucket mount if present, else downloads once and stages them under pdfs/ so sync_bucket adds them for the scorer. Removes the separate hf download + hf buckets sync step and the per-run re-download.
- PIPELINE_VERSION / CANDIDATE env vars: same script runs v1.6 (default) or the original v1 into a second candidate; the score job ranks both together.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
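
The reuse-or-download behaviour and the two environment variables described above could look roughly like this. It is a sketch under stated assumptions, not the contents of convert.py: the function names, the `download` callback, and the default candidate name are hypothetical. Only the PIPELINE_VERSION and CANDIDATE variable names, the v1.6 default, and the pdfs/ directory come from the commit message.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def read_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the pipeline version and candidate name from the environment."""
    return {
        "pipeline_version": env.get("PIPELINE_VERSION", "v1.6"),
        # Default candidate name is a placeholder, not taken from the PR.
        "candidate": env.get("CANDIDATE", "paddleocr_vl_candidate"),
    }


def ensure_pdfs(bucket_root: Path, download: Callable[[Path], None]) -> Path:
    """Return the pdfs/ directory, downloading into it only when it is empty."""
    pdf_dir = bucket_root / "pdfs"
    if pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf")):
        return pdf_dir  # reuse PDFs already staged in the bucket mount
    pdf_dir.mkdir(parents=True, exist_ok=True)
    download(pdf_dir)  # caller supplies the actual dataset download
    return pdf_dir
```

Staging the PDFs inside the synced directory means the scoring job sees them without a separate upload step.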

Ran the original PaddleOCR-VL (v1) through the same harness: old_scans 38.2 vs the leaderboard's published 37.8 -> within 0.4 pt / CI, so the convert+scoring are validated.

v1.6 (38.6) is statistically identical to v1 (38.2) on old_scans -> OmniDocBench gains don't transfer; v1.6 even hallucinates slightly more CJK (baseline 84.7 vs 88.8).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
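
To give a feel for why a 0.4 pt gap sits inside the noise, here is a rough normal-approximation (Wald) interval for 203 passes out of 526 tests. This is not necessarily how the olmOCR-bench harness computes its confidence interval, and it treats tests as independent, which is an assumption.

```python
import math


def wald_half_width(passed: int, total: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a pass proportion."""
    p = passed / total
    return z * math.sqrt(p * (1.0 - p) / total)


half_width = wald_half_width(203, 526)  # v1.6 old_scans counts from the PR
gap = 0.386 - 0.382                     # v1.6 vs v1 old_scans scores
print(f"half-width: {100 * half_width:.1f} pt, gap: {100 * gap:.1f} pt")
```

With roughly ±4 pt of uncertainty on a 526-test subset, differences of a few tenths of a point between versions cannot be distinguished.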

davanstrien and others added 7 commits June 27, 2026 12:30

Tighten README + convert docstring to a factual spec

72569f2

README: record greedy decoding + eyeball spot-check in status

66c3ff6

davanstrien merged commit d07a412 into main Jun 27, 2026

davanstrien deleted the add-olmocr-bench-oldscans branch June 27, 2026 15:26
