Add olmOCR-bench old_scans experiment for PaddleOCR-VL-1.6 #25

Merged: davanstrien merged 7 commits into main from add-olmocr-bench-oldscans on Jun 27, 2026



@davanstrien

Owner

What

A standalone experiment (in experiments/olmocr-bench-oldscans/, outside the ocr_bench lib) that scores PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench — a number its technical report (arXiv 2606.03264) never publishes (it only reports OmniDocBench v1.6 / Real5).

Three files: convert.py, score.py, README.md.

Result (preliminary)

| Category | Pass rate | Tests |
| --- | --- | --- |
| old_scans (present/absent/order) | 38.6% | 203/526 |
| baseline (auto, 1/PDF) | 84.7% | 83/98 |
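
As a quick sanity check on the table, the percentages can be recomputed from the raw pass/total counts. This is a minimal sketch; the helper name `pass_rate` is made up for illustration and is not part of score.py or olmocr.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Counts taken from the results table in this PR.
old_scans = pass_rate(203, 526)
baseline = pass_rate(83, 98)

# Number of baseline tests that did not pass.
baseline_failures = 98 - 83

print(f"old_scans: {old_scans}%  baseline: {baseline}%  "
      f"baseline failures: {baseline_failures}")
```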

Mid-pack on the absolute leaderboard, but roughly best-in-class for its 0.9B size (it ties Qwen2.5-VL-7B). Notable finding: it hallucinates Chinese characters on English handwritten scans, which shows up as disallowed-character failures in the baseline tests.

Design

  • convert (hf jobs run, GPU l4x1): PaddlePaddle's vendor Docker image (paddle, paddleocr, and the 1.6 weights baked in) → markdown, pushed to a bucket via sync_bucket.
  • score (hf jobs uv run, CPU): stock olmocr.bench.benchmark run over the bucket.
  • Fidelity: mirrors olmOCR-bench's own run_paddlevl.py exactly; the only delta is pipeline_version="v1.6". The README documents the image, every command, and all gotchas hit during bring-up.
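
The shape of the convert step could look roughly like the sketch below. It is not the actual convert.py: the pipeline object is passed in so the logic can be exercised without PaddleOCR installed, and `convert_pdf`, `pages_to_markdown`, and the `<stem>.md` output naming are hypothetical. The only details taken from this PR are that each result exposes `res.markdown['markdown_texts']` and that the pipeline is pinned with `pipeline_version="v1.6"`.

```python
from pathlib import Path
from typing import Any, Iterable


def pages_to_markdown(results: Iterable[Any]) -> str:
    """Join the per-page markdown text exposed by each pipeline result."""
    return "\n\n".join(res.markdown["markdown_texts"] for res in results)


def convert_pdf(pipeline: Any, pdf_path: Path, out_dir: Path) -> Path:
    """Run one PDF through the pipeline and write its markdown to out_dir.

    `pipeline` is assumed to expose predict(path) and yield results with a
    `.markdown` dict; the output file name used here is illustrative only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    text = pages_to_markdown(pipeline.predict(str(pdf_path)))
    out_path = out_dir / f"{pdf_path.stem}.md"
    out_path.write_text(text, encoding="utf-8")
    return out_path


# In the real job the pipeline would come from the vendor image, presumably
# constructed with pipeline_version="v1.6" (constructor details are an
# assumption here and should be checked against run_paddlevl.py).
```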

To review / open questions

  • Numbers are preliminary: they have not yet been validated by reproducing a published olmOCR-bench number (olmOCR/Qwen) through this harness. I recommend doing that before quoting 38.6% publicly.
  • Decoding determinism is not pinned, and the image is referenced as :latest rather than by digest (see the README reproducibility notes).
  • Intended as the source.url target for a possible model-card eval-results PR + discussion.

🤖 Generated with Claude Code

davanstrien and others added 7 commits June 27, 2026 12:30
Standalone experiment (outside the lib, in experiments/) scoring
PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench — a number
its technical report (arXiv 2606.03264) never publishes (report only covers
OmniDocBench v1.6 / Real5).

Faithful to olmOCR-bench's own run_paddlevl.py extraction
(res.markdown['markdown_texts'], default pipeline, no tuning); the only delta
is pipeline_version=v1.6. Scoring is stock olmocr.bench.benchmark. Two HF Jobs:
PaddlePaddle vendor image convert -> olmocr scoring, with a bucket handoff.

Result (preliminary, pending anchor-reproduction validation):
old_scans 38.6% (present 31.2 / absent 95.7 / order 27.7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop bring-up narrative; state config facts as requirements. Add a
Reproducibility section (pin image digest / dataset revision / olmocr version /
greedy decoding).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generation_config.json has no sampling params -> greedy/deterministic.
Candidate outputs spot-checked against source scans (real, untruncated).
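
A check along these lines can be scripted. The sketch below assumes Hugging Face transformers-style semantics, where sampling is off unless `do_sample` is true and beam search is off unless `num_beams` exceeds 1; whether the PaddleOCR-VL runtime honours the file the same way is an assumption, and the helper names are hypothetical.

```python
import json

# Keys that commonly control sampling in transformers-style generation configs.
SAMPLING_KEYS = ("do_sample", "temperature", "top_p", "top_k")


def sampling_params(config: dict) -> dict:
    """Return whichever sampling-related keys are present in the config."""
    return {key: config[key] for key in SAMPLING_KEYS if key in config}


def looks_greedy(config: dict) -> bool:
    """True if the config enables neither sampling nor beam search."""
    if config.get("do_sample", False):
        return False
    return config.get("num_beams", 1) == 1


# Example with an inline config; in practice this would be
# json.load(open("generation_config.json")).
example = json.loads('{"eos_token_id": 2, "pad_token_id": 0}')
print(sampling_params(example), looks_greedy(example))
```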

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-contained static page: 7 old_scans docs, source scan beside the model
markdown, hallucinated CJK glyphs highlighted. gen_samples.py regenerates it
from scan/output pairs pulled from the bucket.
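
For readers curious how CJK highlighting can work, here is a minimal sketch. It is not gen_samples.py: the function names and the chosen Unicode ranges (Hiragana, Katakana, CJK Unified Ideographs and Extension A) are assumptions, and the real script may detect glyphs differently.

```python
import html
import re

# Hiragana + Katakana (U+3040-U+30FF), CJK Extension A (U+3400-U+4DBF),
# and CJK Unified Ideographs (U+4E00-U+9FFF).
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")


def count_cjk(text: str) -> int:
    """Count the characters that fall in the CJK ranges above."""
    return sum(len(run) for run in CJK_RUN.findall(text))


def highlight_cjk(text: str) -> str:
    """HTML-escape the text, then wrap each run of CJK characters in <mark>."""
    escaped = html.escape(text)
    return CJK_RUN.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped)


print(highlight_cjk("Dear Sir 你好 <b>"))
```

Escaping before highlighting keeps any markup in the model output from being interpreted by the browser while leaving the inserted `<mark>` tags intact.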

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- olmOCR-bench lists 'PaddleOCR-VL' (unversioned) at Old scans 37.8; its
  run_paddlevl runner landed 2025-10-20, pre-1.6 (2026-05-28). So 38.6 is the
  first 1.6 number, ~consistent with the older figure -- not a same-version
  reproduction. Finding: v1.6's OmniDocBench gains don't transfer to old scans.
- Clarify 38.6 = old_scans.jsonl sub-score; harness also prints ~61.6 Average.
- baseline 84.7% (old_scans-only auto-baseline) != leaderboard Base 98.5%
  (full-bench auto-baseline); all 15 baseline failures are CJK/JP disallowed-char.
- Soften 'exactly' re: the v1.6 pin; tighten LIMIT smoke-test wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- convert.py reuses PDFs from the bucket mount if present, else downloads once
  and stages them under pdfs/ so sync_bucket adds them for the scorer. Removes
  the separate hf download + hf buckets sync step and the per-run re-download.
- PIPELINE_VERSION / CANDIDATE env vars: same script runs v1.6 (default) or the
  original v1 into a second candidate; the score job ranks both together.
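
The two behaviours described in this commit can be sketched as follows. This is an illustration, not the code in convert.py: `read_settings`, `ensure_pdfs`, the injected `download` callable, and the default candidate name are all hypothetical; only the `PIPELINE_VERSION` and `CANDIDATE` variable names, the v1.6 default, and the `pdfs/` staging directory come from the commit message.

```python
import os
from pathlib import Path
from typing import Callable, Mapping


def read_settings(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Read the pipeline version and candidate name, defaulting to v1.6."""
    version = env.get("PIPELINE_VERSION", "v1.6")
    # The default candidate name below is made up for this sketch.
    candidate = env.get("CANDIDATE", f"paddleocr-vl-{version}")
    return version, candidate


def ensure_pdfs(bucket_dir: Path, download: Callable[[Path], None]) -> Path:
    """Reuse PDFs already staged under bucket_dir/pdfs, else download once."""
    pdf_dir = bucket_dir / "pdfs"
    if pdf_dir.is_dir() and any(pdf_dir.rglob("*.pdf")):
        return pdf_dir  # already staged by a previous run
    pdf_dir.mkdir(parents=True, exist_ok=True)
    download(pdf_dir)  # caller supplies the actual dataset download
    return pdf_dir
```

Passing the download step in as a callable keeps the reuse logic independent of how the dataset is fetched, which also makes it easy to exercise with a temporary directory.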

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ran the original PaddleOCR-VL (v1) through the same harness: old_scans 38.2 vs
the leaderboard's published 37.8 -> within 0.4 pt / CI, so the convert+scoring
are validated. v1.6 (38.6) is statistically identical to v1 (38.2) on old_scans
-> OmniDocBench gains don't transfer; v1.6 even hallucinates slightly more CJK
(baseline 84.7 vs 88.8).
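
To give a feel for why a 0.4 pt gap is within noise at this sample size, the sketch below computes a normal-approximation (Wald) 95% interval for the v1.6 old_scans pass rate. This is an assumption-laden illustration: it treats the 526 tests as independent trials and is not necessarily the interval the olmOCR-bench harness reports.

```python
import math


def wald_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a pass rate, in percent."""
    p = passed / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


# v1.6 old_scans counts from this PR: 203 of 526 tests passed.
low, high = wald_interval(203, 526)
print(f"v1.6 old_scans 95% interval: {low:.1f}% to {high:.1f}%")

# The v1 score (38.2) and the leaderboard figure (37.8) quoted in this PR.
for label, score in (("v1 via this harness", 38.2), ("leaderboard", 37.8)):
    print(label, "inside interval:", low < score < high)
```

Under these assumptions the half-width is a few percentage points, so sub-point differences between v1, v1.6, and the leaderboard figure are not distinguishable on this subset.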

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@davanstrien davanstrien merged commit d07a412 into main Jun 27, 2026
@davanstrien davanstrien deleted the add-olmocr-bench-oldscans branch June 27, 2026 15:26