Add olmOCR-bench old_scans experiment for PaddleOCR-VL-1.6 #25
Merged
Standalone experiment (outside the lib, in experiments/) scoring PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench, a number its technical report (arXiv 2606.03264) never publishes (the report only covers OmniDocBench v1.6 / Real5).
Faithful to olmOCR-bench's own run_paddlevl.py extraction (res.markdown['markdown_texts'], default pipeline, no tuning); the only delta is pipeline_version=v1.6. Scoring is stock olmocr.bench.benchmark.
Two HF Jobs: PaddlePaddle vendor image convert -> olmocr scoring, with a bucket handoff.
Result (preliminary, pending anchor-reproduction validation): old_scans 38.6% (present 31.2 / absent 95.7 / order 27.7).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
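The extraction step this commit describes can be sketched roughly as follows. This is an illustration, not the contents of convert.py: the helper name `pages_to_markdown` and the page separator are hypothetical, and the `predict` call pattern is assumed from PaddleOCR's pipeline API. Only the `res.markdown['markdown_texts']` access comes from the commit message. The pipeline object is passed in so the sketch does not depend on a particular constructor signature.

```python
from typing import Any, Iterable


def pages_to_markdown(pipeline: Any, pdf_path: str, sep: str = "\n\n") -> str:
    """Run a PaddleOCR-VL-style pipeline on one PDF and join per-page markdown.

    Assumes `pipeline.predict(path)` yields one result per page and that each
    result exposes `markdown["markdown_texts"]`, as the commit message states.
    The separator is an illustrative choice, not taken from run_paddlevl.py.
    """
    results: Iterable[Any] = pipeline.predict(pdf_path)
    return sep.join(res.markdown["markdown_texts"] for res in results)
```

In a real run the pipeline would be built with `pipeline_version` set to v1.6, as the commit states; construction is left out here on purpose.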
Drop bring-up narrative; state config facts as requirements. Add a Reproducibility section (pin image digest / dataset revision / olmocr version / greedy decoding).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
generation_config.json has no sampling params -> greedy/deterministic. Candidate outputs spot-checked against source scans (real, untruncated).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
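A check along the lines of the one described here might look like the sketch below. It assumes Hugging Face-style semantics, where sampling stays off unless `do_sample` is true; whether the PaddleOCR-VL runtime reads the file the same way is not confirmed by this PR. The function names and the list of keys are illustrative.

```python
import json

# Keys that, in Hugging Face-style generation configs, enable or shape sampling.
SAMPLING_KEYS = ("do_sample", "temperature", "top_p", "top_k")


def sampling_params(config_text: str) -> dict:
    """Return the sampling-related entries found in a generation_config.json."""
    config = json.loads(config_text)
    return {key: config[key] for key in SAMPLING_KEYS if key in config}


def looks_greedy(config_text: str) -> bool:
    """True when do_sample is absent or false (assumed to mean greedy decoding)."""
    return not sampling_params(config_text).get("do_sample", False)
```

An empty result from `sampling_params` corresponds to the "no sampling params" observation in the commit message.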
Self-contained static page: 7 old_scans docs, source scan beside the model markdown, hallucinated CJK glyphs highlighted. gen_samples.py regenerates it from scan/output pairs pulled from the bucket.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
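The glyph highlighting could be done with something like the sketch below. It is not taken from gen_samples.py: the function name, the choice of Unicode ranges (CJK Unified Ideographs with Extension A, plus Hiragana and Katakana), and the use of `<mark>` are all assumptions made for illustration.

```python
import html
import re

# CJK Unified Ideographs (plus Extension A), Hiragana and Katakana.
CJK_RE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")


def highlight_cjk(text: str) -> str:
    """HTML-escape text and wrap each run of CJK/kana characters in <mark>."""
    parts = []
    pos = 0
    for match in CJK_RE.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        parts.append(f"<mark>{html.escape(match.group())}</mark>")
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

Escaping the surrounding text before wrapping keeps stray `<` or `&` in model output from breaking the static page.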
- olmOCR-bench lists 'PaddleOCR-VL' (unversioned) at Old scans 37.8; its run_paddlevl runner landed 2025-10-20, pre-1.6 (2026-05-28). So 38.6 is the first 1.6 number, ~consistent with the older figure -- not a same-version reproduction. Finding: v1.6's OmniDocBench gains don't transfer to old scans.
- Clarify 38.6 = old_scans.jsonl sub-score; harness also prints ~61.6 Average.
- baseline 84.7% (old_scans-only auto-baseline) != leaderboard Base 98.5% (full-bench auto-baseline); all 15 baseline failures are CJK/JP disallowed-char.
- Soften 'exactly' re: the v1.6 pin; tighten LIMIT smoke-test wording.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- convert.py reuses PDFs from the bucket mount if present, else downloads once and stages them under pdfs/ so sync_bucket adds them for the scorer. Removes the separate hf download + hf buckets sync step and the per-run re-download.
- PIPELINE_VERSION / CANDIDATE env vars: same script runs v1.6 (default) or the original v1 into a second candidate; the score job ranks both together.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
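The reuse-or-download behaviour and the two environment variables could be sketched as below. The helper names, the `download` callback, and the default candidate name are hypothetical; only the PIPELINE_VERSION / CANDIDATE variable names, the v1.6 default, and the pdfs/ staging directory come from the commit message.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Tuple


def read_run_config(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the pipeline version and candidate name from the environment.

    PIPELINE_VERSION defaults to "v1.6" as in the commit; deriving the
    CANDIDATE default from the version is an assumption for illustration.
    """
    version = env.get("PIPELINE_VERSION", "v1.6")
    candidate = env.get("CANDIDATE", f"paddleocr_vl_{version}")
    return version, candidate


def ensure_pdfs(bucket_dir: Path, download: Callable[[Path], None]) -> Path:
    """Reuse PDFs staged under bucket_dir/pdfs, downloading once if absent."""
    pdf_dir = bucket_dir / "pdfs"
    if not pdf_dir.is_dir() or not any(pdf_dir.glob("*.pdf")):
        pdf_dir.mkdir(parents=True, exist_ok=True)
        download(pdf_dir)  # hypothetical callback that fetches the subset
    return pdf_dir
```

Passing the downloader in as a callback keeps the reuse logic testable without network access.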
Ran the original PaddleOCR-VL (v1) through the same harness: old_scans 38.2 vs the leaderboard's published 37.8 -> within 0.4 pt / CI, so the convert+scoring steps are validated. v1.6 (38.6) is statistically indistinguishable from v1 (38.2) on old_scans -> OmniDocBench gains don't transfer; v1.6 even hallucinates slightly more CJK (baseline 84.7 vs 88.8 for v1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
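For readers unfamiliar with the "within CI" comparison, the sketch below shows one common way to get a confidence interval for a pass rate, a percentile bootstrap over per-test pass/fail outcomes. It is not the harness's own code, and the function names and defaults are illustrative; the actual interval reported by olmocr.bench.benchmark should be used for any real claim.

```python
import random
from typing import List, Tuple


def bootstrap_ci(passes: List[bool], n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for a pass rate, in percent."""
    rng = random.Random(seed)
    n = len(passes)
    # Resample the per-test outcomes with replacement and record each pass rate.
    rates = sorted(
        100.0 * sum(rng.choices(passes, k=n)) / n for _ in range(n_boot)
    )
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def within_ci(score: float, ci: Tuple[float, float]) -> bool:
    """True when a reference score falls inside the interval."""
    return ci[0] <= score <= ci[1]
```

Two scores whose intervals both contain the other's point estimate, as with 38.2 and 37.8 here, are not distinguishable at that confidence level.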
What
A standalone experiment (in experiments/olmocr-bench-oldscans/, outside the ocr_bench lib) that scores PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench, a number its technical report (arXiv 2606.03264) never publishes (it only reports OmniDocBench v1.6 / Real5).
Three files: convert.py, score.py, README.md.
Result (preliminary)
old_scans: 38.6% (present 31.2 / absent 95.7 / order 27.7).
Mid-pack on the absolute leaderboard but ~best-in-class for its 0.9B size (ties Qwen2.5-VL-7B). Notable finding: it hallucinates Chinese characters on English handwritten scans (baseline disallowed-character failures).
Design
- Convert job (hf jobs run, GPU l4x1): PaddlePaddle's vendor Docker image (paddle + paddleocr + 1.6 weights baked in) → markdown, pushed to a bucket via sync_bucket.
- Score job (hf jobs uv run, CPU): stock olmocr.bench.benchmark over the bucket.
- Extraction follows olmOCR-bench's run_paddlevl.py exactly; the only delta is pipeline_version="v1.6". The README documents the image, every command, and all gotchas hit during bring-up.
To review / open questions
- ... 38.6% publicly.
- ... :latest, not a digest (see README reproducibility notes).
- ... source.url target for a possible model-card eval-results PR + discussion.
🤖 Generated with Claude Code