From d167ece26552145f68f4e01a8a8de2ea3bb52b7b Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 12:30:30 +0100 Subject: [PATCH 1/7] Add olmOCR-bench old_scans experiment for PaddleOCR-VL-1.6 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Standalone experiment (outside the lib, in experiments/) scoring PaddleOCR-VL-1.6 on the old_scans subset of allenai/olmOCR-bench — a number its technical report (arXiv 2606.03264) never publishes (report only covers OmniDocBench v1.6 / Real5). Faithful to olmOCR-bench's own run_paddlevl.py extraction (res.markdown['markdown_texts'], default pipeline, no tuning); the only delta is pipeline_version=v1.6. Scoring is stock olmocr.bench.benchmark. Two HF Jobs: PaddlePaddle vendor image convert -> olmocr scoring, with a bucket handoff. Result (preliminary, pending anchor-reproduction validation): old_scans 38.6% (present 31.2 / absent 95.7 / order 27.7). Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 137 +++++++++++++++++++ experiments/olmocr-bench-oldscans/convert.py | 106 ++++++++++++++ experiments/olmocr-bench-oldscans/score.py | 35 +++++ 3 files changed, 278 insertions(+) create mode 100644 experiments/olmocr-bench-oldscans/README.md create mode 100644 experiments/olmocr-bench-oldscans/convert.py create mode 100644 experiments/olmocr-bench-oldscans/score.py diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md new file mode 100644 index 0000000..c2de80f --- /dev/null +++ b/experiments/olmocr-bench-oldscans/README.md @@ -0,0 +1,137 @@ +# PaddleOCR-VL-1.6 on olmOCR-bench (old_scans) + +A standalone experiment — **not part of the `ocr_bench` library**. 
Scores +[PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the +`old_scans` subset of [allenai/olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), +a number the PaddleOCR-VL-1.6 technical report (arXiv 2606.03264) never reports — +it only measures OmniDocBench v1.6 / Real5-OmniDocBench. + +`old_scans` = 98 single-page Library-of-Congress scans, 526 tests +(text-present / text-absent / reading-order). No math, no tables → pure string +matching, so the scorer needs **no KaTeX/chromium**. + +## Fidelity (why the number is fair) + +- **Scoring** is stock `olmocr.bench.benchmark`, untouched. +- **Conversion** mirrors olmOCR-bench's own runner + [`olmocr/bench/runners/run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py) + exactly: `res.markdown["markdown_texts"]`, per page, **bare default pipeline, + no tuning** (no `max_pixels` / prompts / dpi). The only intentional difference + is `pipeline_version="v1.6"` — what the model card tells you to pass. +- We run inside **PaddlePaddle's own docker image**, so paddle/paddleocr are the + vendor's exact builds, not a PyPI guess. Arguably *more* faithful than + assembling the stack ourselves. + +## Design + +Two HF Jobs, one bucket as the handoff. The two stacks never share an env: + +| Job | Command | Hardware | Stack | Does | +|-----|---------|----------|-------|------| +| `convert.py` | `hf jobs run` | GPU (`l4x1`) | PaddlePaddle image | PaddleOCR-VL-1.6 → markdown into the bucket | +| `score.py` | `hf jobs uv run` | CPU (`cpu-upgrade`) | olmocr (PyPI) | `olmocr.bench.benchmark` → prints the score | + +The candidate path is written as `{splitext(pdf_field)}_pg{page}_repeat1.md` — +the literal string transform `benchmark.py` uses to locate it — so the layout is +guaranteed to match without going through olmocr's convert machinery. 
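A minimal sketch of the path transform described above, for checking the layout by hand. The helper name `candidate_path` and its `repeat` parameter are hypothetical (they appear in neither `convert.py` nor olmocr); only the `{splitext(pdf_field)}_pg{page}_repeat1.md` pattern comes from this README.

```python
import os
from pathlib import Path


def candidate_path(cand_dir: Path, pdf_field: str, page: int, repeat: int = 1) -> Path:
    """Hypothetical helper: build the candidate markdown path for one test page."""
    # Drop the ".pdf" extension but keep any subfolder, e.g. "old_scans/5".
    md_base = os.path.splitext(pdf_field)[0]
    return cand_dir / f"{md_base}_pg{page}_repeat{repeat}.md"


# Page 1 of old_scans/5.pdf for the candidate folder used in this experiment.
print(candidate_path(Path("paddleocr_vl_16"), "old_scans/5.pdf", 1).as_posix())
```

Keeping the subfolder from the `pdf` field means the candidate tree has the same shape as the `pdfs/` tree, so a page's output sits at the same relative path as its source PDF.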
+ +### Image (probed, not guessed) + +`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu` +(~8 GB; **the Baidu registry is pullable by HF Jobs**). Probed on `cpu-basic`: + +``` +PY=/usr/local/bin/python3 # uv: none -> use `hf jobs run`, not `uv run` +paddleocr 3.6.0 # paddlex + huggingface_hub 1.16.4 also present +site-packages: /usr/local/lib/python3.10/site-packages +``` + +Because the image lacks `uv`, `convert.py` is a **plain python script** (no PEP 723 +header) run by the image's python; every import it needs is already in the image. +`hf jobs run` has no local-file upload, so the script is delivered via the bucket +(mounted read-only). The image runs as the **non-root `paddleocr` user**, which +can't write the root-owned bucket FUSE mount — so `convert.py` writes to a local +dir and pushes results with `sync_bucket()` (mount-free HTTP upload). + +Handoff bucket: `hf://buckets/davanstrien/paddleocr-vl16-oldscans` + +## Run + +```bash +BUCKET=hf://buckets/davanstrien/paddleocr-vl16-oldscans +IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu + +# deliver the script into the bucket (re-cp whenever you edit it) +hf buckets cp convert.py $BUCKET/convert.py + +# 1a. smoke-test the plumbing on 3 PDFs first (mount ro: script delivery only; +# results go back to the bucket via sync_bucket over HTTP) +hf jobs run --flavor l4x1 -s HF_TOKEN -e LIMIT=3 -v $BUCKET:/bucket:ro \ + $IMAGE python3 /bucket/convert.py + +# 1b. full convert — 98 PDFs. USE l4x1 (see gotchas) and a generous --timeout. +hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \ + $IMAGE python3 /bucket/convert.py + +# 2. score (CPU; olmocr from PyPI; read-only mount) +hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro \ + score.py +``` + +`hf jobs run`/`uv run` accept `-d` to detach; then block on the job with +`hf jobs wait ` and read `hf jobs logs `. 
Inspect the bucket +between steps: `hf buckets ls $BUCKET`. + +## Notes / gotchas (all hit during bring-up) + +- **`/data` is reserved** by Jobs for local-script artifacts → mount the bucket + at `/bucket`. +- **Transient `Volume mount failed`** on a fresh bucket (CSI driver not ready on + a fresh node, unrelated to the bucket being empty) → just re-run the job. +- **Non-root image can't write the bucket mount**: this image runs as + `paddleocr`, the FUSE mount is root-owned → `PermissionError` on write. Fix: + write locally, upload via `sync_bucket()` (HTTP, no FUSE). Mount the bucket + `:ro` in job 1 since it's only used to deliver the script. +- **Why the image, not a uv script**: paddlex pulls the GUI build of opencv + (`libGL.so.1`, absent in the slim uv image) *and* the VL pipeline needs the + `paddlex[ocr]` extra; the 1.8 GB paddle wheel rebuilds every run. The vendor + image sidesteps all of it. (`uv run --image` does *not* help: uv still + reinstalls declared deps, and reusing the image's packages needs + `--system-site-packages`, which uv lacks. The `--python`+`PYTHONPATH` trick + needs uv *in* the image, which this one doesn't have.) +- **Use `l4x1` for convert, not bigger GPUs**: this paddle image's CUDA build + matches the older `l4x1` driver. On `l40sx1` the model hung on the first PDF + (driver/CUDA mismatch). More compute doesn't help anyway — a 0.9B model over 98 + pages is bound by image-pull + model-load, not GPU throughput. +- **Convert `--timeout`**: the full run can outlast the default job timeout. It + still finishes and `sync_bucket`s before being killed (so the bucket is + complete), but the job shows `ERROR: Job timeout` — pass `--timeout 1h` to keep + the status clean. +- **`numpy` for scoring**: `olmocr[bench]` imports numpy without declaring it + (assumes their conda env) → `score.py` adds `numpy` explicitly. 
+- **old_scans only.** For `old_scans_math` (458 math tests): change `JSONL_PATH` + in `convert.py`, and `score.py` then needs `playwright install chromium` for + KaTeX. + +## Result + +PaddleOCR-VL-1.6, default v1.6 pipeline, no tuning (run 2026-06-27): + +| Category | Pass rate | Tests | +|---|---|---| +| **old_scans (present/absent/order)** | **38.6%** | 203/526 | +| → present | 31.2% | 279 | +| → absent | 95.7% | 70 | +| → order | 27.7% | 177 | +| baseline (auto-generated, 1/PDF) | 84.7% | 83/98 | +| tool "Average Score" (mean of the two jsonl files) | 61.6% ± 4.0% | 624 | + +**Headline = 38.6%** — the leaderboard-comparable `old_scans` number. (The tool's +"61.6%" averages in the easy auto-baseline tests; don't quote it as the score.) +For reference, olmOCR-bench's published OldScan column: olmOCR-Ours 44.5, GPT-4o +40.7, Qwen2.5-VL 38.6, Gemini-Flash-2 34.2, most others 17–29. So PaddleOCR-VL-1.6 +is **mid-pack on degraded historical scans** despite being SOTA on OmniDocBench. + +**Notable:** ~15 baseline failures are `Text contains disallowed characters` +(CJK: 场, 景, 民, 生, …) — the model **hallucinates Chinese characters on English +handwritten scans**. Clean-benchmark SOTA ≠ real-world historical data. diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py new file mode 100644 index 0000000..647df31 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/convert.py @@ -0,0 +1,106 @@ +""" +Job 1 (GPU): run PaddleOCR-VL-1.6 over the olmOCR-bench `old_scans` subset and +write candidate markdown in the exact layout `olmocr.bench.benchmark` expects. + +This runs with PaddlePaddle's OWN docker image (paddle 3.2.1 + paddleocr 3.6.0 +preinstalled), NOT uv -- the image has no uv and assembling paddle from PyPI is +brittle (libGL, paddlex[ocr]). So there is no PEP 723 header: every import here +(paddleocr, huggingface_hub, stdlib) is already in the image's python3.10. 
+ +Fidelity: the markdown extraction mirrors olmOCR-bench's own runner +(`olmocr/bench/runners/run_paddlevl.py`) exactly -- `res.markdown["markdown_texts"]`, +per page, with a bare default pipeline and NO tuning (no max_pixels / prompts / +dpi). The ONLY intentional difference is `pipeline_version="v1.6"`, which is what +the PaddleOCR-VL-1.6 model card tells you to pass. So we follow both the bench +runner and PaddlePaddle's documented defaults. + +This image runs as the non-root `paddleocr` user, which CANNOT write the bucket +FUSE mount (root-owned). So we write outputs to a container-local dir and push +them with `sync_bucket()` (mount-free HTTP upload) at the end. The bucket is +mounted read-only purely to deliver this script. + +Delivery + run (see README for full commands): + hf buckets cp convert.py hf://buckets/davanstrien/paddleocr-vl16-oldscans/convert.py + hf jobs run --flavor l4x1 -s HF_TOKEN \ + -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \ + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \ + python3 /bucket/convert.py + +Env: + OUT_ROOT local staging dir (default /tmp/olmocr-oldscans-out) + BUCKET bucket to sync results to (default below) + LIMIT cap number of PDFs (smoke test; 0 = all). A capped run scores low + because the un-converted tests count as failures. 
+""" +import json +import os +from collections import defaultdict +from pathlib import Path + +from huggingface_hub import hf_hub_download, sync_bucket +from paddleocr import PaddleOCRVL + +BENCH_REPO = "allenai/olmOCR-bench" +JSONL_PATH = "bench_data/old_scans.jsonl" +CANDIDATE = "paddleocr_vl_16" # any name except "pdfs" +OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/tmp/olmocr-oldscans-out")) +BUCKET = os.environ.get("BUCKET", "hf://buckets/davanstrien/paddleocr-vl16-oldscans") +LIMIT = int(os.environ.get("LIMIT", "0")) + +# ---- test manifest ---------------------------------------------------------- +jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset") +tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()] + +pages_by_pdf = defaultdict(set) +for t in tests: + pages_by_pdf[t["pdf"]].add(int(t.get("page", 1))) +print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs -> {OUT_ROOT}", flush=True) + +# ---- model (vendor default for v1.6, no tuning) ----------------------------- +pipeline = PaddleOCRVL(pipeline_version="v1.6") + + +def resolve_pdf(pdf_field): + """The jsonl `pdf` field may or may not carry a pdfs/ prefix; try variants.""" + for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field): + try: + return hf_hub_download(BENCH_REPO, cand, repo_type="dataset") + except Exception: + continue + raise FileNotFoundError(pdf_field) + + +def page_markdowns(pdf_path): + """Per-page markdown, exactly as run_paddlevl.py does: res.markdown['markdown_texts'].""" + return [res.markdown["markdown_texts"] for res in pipeline.predict(str(pdf_path))] + + +# ---- convert ---------------------------------------------------------------- +cand_dir = OUT_ROOT / CANDIDATE +items = sorted(pages_by_pdf.items()) +if LIMIT: + items = items[:LIMIT] + print(f"LIMIT={LIMIT} (plumbing smoke test -- expect a low score)", flush=True) + +for i, (pdf_field, pages) in enumerate(items, 1): + try: + mds = 
page_markdowns(resolve_pdf(pdf_field)) + except Exception as e: # keep going; a missing page just fails its tests + print(f"[WARN] {pdf_field}: {e}", flush=True) + mds = [] + md_base = os.path.splitext(pdf_field)[0] # mirrors benchmark.py exactly + for pg in pages: + md = mds[pg - 1] if 0 <= pg - 1 < len(mds) else "" # 1-indexed page -> 0-indexed + fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md" + fp.parent.mkdir(parents=True, exist_ok=True) + fp.write_text(md) + n = len(mds[0]) if mds else 0 + print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True) + +# the scorer needs the jsonl next to the candidate folder +(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text()) + +# push results to the bucket over HTTP (the FUSE mount is not writable as non-root) +print(f"Syncing {OUT_ROOT} -> {BUCKET}", flush=True) +sync_bucket(str(OUT_ROOT), BUCKET) +print("Done.", flush=True) diff --git a/experiments/olmocr-bench-oldscans/score.py b/experiments/olmocr-bench-oldscans/score.py new file mode 100644 index 0000000..0ea4be1 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/score.py @@ -0,0 +1,35 @@ +# /// script +# requires-python = ">=3.11,<3.12" +# dependencies = [ +# "olmocr[bench]", +# "numpy", # olmocr.bench.tests imports numpy but doesn't declare it +# ] +# /// +""" +Job 2 (CPU): score the candidate produced by convert.py with the official +olmocr.bench.benchmark harness. old_scans = text-present / text-absent / +reading-order tests only -> pure string matching, no KaTeX/chromium needed. + +Reads from DATA (default /bucket). 
Mount the same bucket the convert job wrote to, +read-only is fine: + + hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN \\ + -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\ + experiments/olmocr-bench-oldscans/score.py + +Env: + DATA directory holding old_scans.jsonl + the candidate folder (default /bucket) +""" +import os +import subprocess +import sys + +DATA = os.environ.get("DATA", "/bucket") + +# --dir globs *.jsonl (only old_scans.jsonl is present) and treats each subdir +# other than "pdfs" as a candidate (only paddleocr_vl_16 is present). +proc = subprocess.run( + [sys.executable, "-m", "olmocr.bench.benchmark", "--dir", DATA], + text=True, +) +sys.exit(proc.returncode) From 72569f2dcda12ecefdfa38964b89235e58014414 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 12:37:36 +0100 Subject: [PATCH 2/7] Tighten README + convert docstring to a factual spec Drop bring-up narrative; state config facts as requirements. Add a Reproducibility section (pin image digest / dataset revision / olmocr version / greedy decoding). Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 167 +++++++------------ experiments/olmocr-bench-oldscans/convert.py | 8 +- 2 files changed, 68 insertions(+), 107 deletions(-) diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md index c2de80f..e8b5577 100644 --- a/experiments/olmocr-bench-oldscans/README.md +++ b/experiments/olmocr-bench-oldscans/README.md @@ -1,59 +1,37 @@ # PaddleOCR-VL-1.6 on olmOCR-bench (old_scans) -A standalone experiment — **not part of the `ocr_bench` library**. Scores -[PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the -`old_scans` subset of [allenai/olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), -a number the PaddleOCR-VL-1.6 technical report (arXiv 2606.03264) never reports — -it only measures OmniDocBench v1.6 / Real5-OmniDocBench. 
+Scores [PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the +`old_scans` subset of [`allenai/olmOCR-bench`](https://huggingface.co/datasets/allenai/olmOCR-bench). +Standalone experiment — not part of the `ocr_bench` library. `old_scans` = 98 single-page Library-of-Congress scans, 526 tests -(text-present / text-absent / reading-order). No math, no tables → pure string -matching, so the scorer needs **no KaTeX/chromium**. +(text-present / text-absent / reading-order). No math or tables, so scoring needs +no KaTeX/chromium. -## Fidelity (why the number is fair) +## Fidelity -- **Scoring** is stock `olmocr.bench.benchmark`, untouched. -- **Conversion** mirrors olmOCR-bench's own runner - [`olmocr/bench/runners/run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py) - exactly: `res.markdown["markdown_texts"]`, per page, **bare default pipeline, - no tuning** (no `max_pixels` / prompts / dpi). The only intentional difference - is `pipeline_version="v1.6"` — what the model card tells you to pass. -- We run inside **PaddlePaddle's own docker image**, so paddle/paddleocr are the - vendor's exact builds, not a PyPI guess. Arguably *more* faithful than - assembling the stack ourselves. +- **Scoring**: stock `olmocr.bench.benchmark`, unmodified. +- **Conversion**: matches olmOCR-bench's own runner + [`run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py) + — `res.markdown["markdown_texts"]`, per page, default pipeline, no tuning. The + only difference is `pipeline_version="v1.6"` (as the model card specifies). +- Runs inside PaddlePaddle's own image, so paddle/paddleocr are the vendor builds. -## Design +## Method -Two HF Jobs, one bucket as the handoff. 
The two stacks never share an env: +Two HF Jobs with a bucket as the handoff: -| Job | Command | Hardware | Stack | Does | -|-----|---------|----------|-------|------| -| `convert.py` | `hf jobs run` | GPU (`l4x1`) | PaddlePaddle image | PaddleOCR-VL-1.6 → markdown into the bucket | -| `score.py` | `hf jobs uv run` | CPU (`cpu-upgrade`) | olmocr (PyPI) | `olmocr.bench.benchmark` → prints the score | +| Step | Command | Hardware | Does | +|------|---------|----------|------| +| `convert.py` | `hf jobs run` | GPU `l4x1` | PaddleOCR-VL-1.6 → markdown → `sync_bucket` to the bucket | +| `score.py` | `hf jobs uv run` | CPU `cpu-upgrade` | `olmocr.bench.benchmark` → score | -The candidate path is written as `{splitext(pdf_field)}_pg{page}_repeat1.md` — -the literal string transform `benchmark.py` uses to locate it — so the layout is -guaranteed to match without going through olmocr's convert machinery. +Candidate files are written as `{splitext(pdf_field)}_pg{page}_repeat1.md`, the +path `benchmark.py` looks them up by. -### Image (probed, not guessed) - -`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu` -(~8 GB; **the Baidu registry is pullable by HF Jobs**). Probed on `cpu-basic`: - -``` -PY=/usr/local/bin/python3 # uv: none -> use `hf jobs run`, not `uv run` -paddleocr 3.6.0 # paddlex + huggingface_hub 1.16.4 also present -site-packages: /usr/local/lib/python3.10/site-packages -``` - -Because the image lacks `uv`, `convert.py` is a **plain python script** (no PEP 723 -header) run by the image's python; every import it needs is already in the image. -`hf jobs run` has no local-file upload, so the script is delivered via the bucket -(mounted read-only). The image runs as the **non-root `paddleocr` user**, which -can't write the root-owned bucket FUSE mount — so `convert.py` writes to a local -dir and pushes results with `sync_bucket()` (mount-free HTTP upload). 
- -Handoff bucket: `hf://buckets/davanstrien/paddleocr-vl16-oldscans` +- **Image**: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu` + — paddle 3.2.1 + paddleocr 3.6.0 + 1.6 weights baked in; python `/usr/local/bin/python3` (3.10); no `uv`. +- **Bucket**: `hf://buckets/davanstrien/paddleocr-vl16-oldscans` ## Run @@ -61,77 +39,60 @@ Handoff bucket: `hf://buckets/davanstrien/paddleocr-vl16-oldscans` BUCKET=hf://buckets/davanstrien/paddleocr-vl16-oldscans IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu -# deliver the script into the bucket (re-cp whenever you edit it) +# deliver the script into the bucket (re-cp after editing) hf buckets cp convert.py $BUCKET/convert.py -# 1a. smoke-test the plumbing on 3 PDFs first (mount ro: script delivery only; -# results go back to the bucket via sync_bucket over HTTP) -hf jobs run --flavor l4x1 -s HF_TOKEN -e LIMIT=3 -v $BUCKET:/bucket:ro \ - $IMAGE python3 /bucket/convert.py - -# 1b. full convert — 98 PDFs. USE l4x1 (see gotchas) and a generous --timeout. +# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test) hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \ $IMAGE python3 /bucket/convert.py -# 2. score (CPU; olmocr from PyPI; read-only mount) -hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro \ - score.py +# add the source PDFs the scorer requires under pdfs/ +hf download allenai/olmOCR-bench --repo-type dataset \ + --include "bench_data/pdfs/old_scans/*" --local-dir /tmp/olm +hf buckets sync /tmp/olm/bench_data/pdfs $BUCKET/pdfs + +# score +hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro score.py ``` -`hf jobs run`/`uv run` accept `-d` to detach; then block on the job with -`hf jobs wait ` and read `hf jobs logs `. Inspect the bucket -between steps: `hf buckets ls $BUCKET`. 
- -## Notes / gotchas (all hit during bring-up) - -- **`/data` is reserved** by Jobs for local-script artifacts → mount the bucket - at `/bucket`. -- **Transient `Volume mount failed`** on a fresh bucket (CSI driver not ready on - a fresh node, unrelated to the bucket being empty) → just re-run the job. -- **Non-root image can't write the bucket mount**: this image runs as - `paddleocr`, the FUSE mount is root-owned → `PermissionError` on write. Fix: - write locally, upload via `sync_bucket()` (HTTP, no FUSE). Mount the bucket - `:ro` in job 1 since it's only used to deliver the script. -- **Why the image, not a uv script**: paddlex pulls the GUI build of opencv - (`libGL.so.1`, absent in the slim uv image) *and* the VL pipeline needs the - `paddlex[ocr]` extra; the 1.8 GB paddle wheel rebuilds every run. The vendor - image sidesteps all of it. (`uv run --image` does *not* help: uv still - reinstalls declared deps, and reusing the image's packages needs - `--system-site-packages`, which uv lacks. The `--python`+`PYTHONPATH` trick - needs uv *in* the image, which this one doesn't have.) -- **Use `l4x1` for convert, not bigger GPUs**: this paddle image's CUDA build - matches the older `l4x1` driver. On `l40sx1` the model hung on the first PDF - (driver/CUDA mismatch). More compute doesn't help anyway — a 0.9B model over 98 - pages is bound by image-pull + model-load, not GPU throughput. -- **Convert `--timeout`**: the full run can outlast the default job timeout. It - still finishes and `sync_bucket`s before being killed (so the bucket is - complete), but the job shows `ERROR: Job timeout` — pass `--timeout 1h` to keep - the status clean. -- **`numpy` for scoring**: `olmocr[bench]` imports numpy without declaring it - (assumes their conda env) → `score.py` adds `numpy` explicitly. -- **old_scans only.** For `old_scans_math` (458 math tests): change `JSONL_PATH` - in `convert.py`, and `score.py` then needs `playwright install chromium` for - KaTeX. 
+Add `-d` to detach, then `hf jobs wait ` / `hf jobs logs `. + +## Configuration + +- **Flavor `l4x1`**: the image's CUDA build matches the `l4x1` driver; larger GPUs (l40s / a100) do not. +- **Mount path `/bucket`**: `/data` is reserved by Jobs for local-script artifacts. +- **`sync_bucket`, not a FUSE write**: the image runs as a non-root user that cannot write the mount, so `convert.py` writes locally and uploads over HTTP; the mount is `:ro` (script delivery only). +- **`pdfs/` folder**: `benchmark.py` requires `/pdfs` to exist; the source PDFs are synced there (run step above). +- **`numpy`**: declared in `score.py` because `olmocr[bench]` imports it without declaring it. +- **`old_scans_math`** variant: change `JSONL_PATH` in `convert.py`; `score.py` then also needs `playwright install chromium`. + +## Reproducibility + +This run uses floating refs. To make it bit-stable, pin: + +- the **image by digest** (`...paddleocr-vl@sha256:...`) instead of `:latest` — this pins paddle, paddleocr, and the weights together; +- `allenai/olmOCR-bench` by `revision`; +- `olmocr` to an exact version in `score.py`; +- **greedy decoding** (confirm / pin) for run-to-run stability. ## Result -PaddleOCR-VL-1.6, default v1.6 pipeline, no tuning (run 2026-06-27): +PaddleOCR-VL-1.6, default v1.6 pipeline, no tuning (2026-06-27): | Category | Pass rate | Tests | |---|---|---| -| **old_scans (present/absent/order)** | **38.6%** | 203/526 | +| **old_scans (present / absent / order)** | **38.6%** | 203 / 526 | | → present | 31.2% | 279 | | → absent | 95.7% | 70 | | → order | 27.7% | 177 | -| baseline (auto-generated, 1/PDF) | 84.7% | 83/98 | -| tool "Average Score" (mean of the two jsonl files) | 61.6% ± 4.0% | 624 | - -**Headline = 38.6%** — the leaderboard-comparable `old_scans` number. (The tool's -"61.6%" averages in the easy auto-baseline tests; don't quote it as the score.) 
-For reference, olmOCR-bench's published OldScan column: olmOCR-Ours 44.5, GPT-4o -40.7, Qwen2.5-VL 38.6, Gemini-Flash-2 34.2, most others 17–29. So PaddleOCR-VL-1.6 -is **mid-pack on degraded historical scans** despite being SOTA on OmniDocBench. - -**Notable:** ~15 baseline failures are `Text contains disallowed characters` -(CJK: 场, 景, 民, 生, …) — the model **hallucinates Chinese characters on English -handwritten scans**. Clean-benchmark SOTA ≠ real-world historical data. +| baseline (auto-generated, 1/PDF) | 84.7% | 83 / 98 | + +For reference, olmOCR-bench's published OldScan column (no-anchor): olmOCR 43.7, +GPT-4o 40.9, Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, +PaddleOCR-VL-1.6 ties the 7B Qwen2.5-VL. + +~15 baseline failures are `disallowed characters`: the model emits CJK glyphs +(场, 景, 民, 生, …) on English handwritten scans. + +> **Status: preliminary.** Not yet validated by reproducing a published +> olmOCR-bench number through this harness; decoding determinism unconfirmed. diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py index 647df31..a3525e8 100644 --- a/experiments/olmocr-bench-oldscans/convert.py +++ b/experiments/olmocr-bench-oldscans/convert.py @@ -2,10 +2,10 @@ Job 1 (GPU): run PaddleOCR-VL-1.6 over the olmOCR-bench `old_scans` subset and write candidate markdown in the exact layout `olmocr.bench.benchmark` expects. -This runs with PaddlePaddle's OWN docker image (paddle 3.2.1 + paddleocr 3.6.0 -preinstalled), NOT uv -- the image has no uv and assembling paddle from PyPI is -brittle (libGL, paddlex[ocr]). So there is no PEP 723 header: every import here -(paddleocr, huggingface_hub, stdlib) is already in the image's python3.10. +Runs with PaddlePaddle's docker image (paddle 3.2.1 + paddleocr 3.6.0 + the v1.6 +weights preinstalled) via the image's python3.10 -- not uv. 
Every import here +(paddleocr, huggingface_hub, stdlib) is already in the image, so there is no +PEP 723 header. Fidelity: the markdown extraction mirrors olmOCR-bench's own runner (`olmocr/bench/runners/run_paddlevl.py`) exactly -- `res.markdown["markdown_texts"]`, From 66c3ff60d68519bb332636e35337762c40657d09 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 12:43:08 +0100 Subject: [PATCH 3/7] README: record greedy decoding + eyeball spot-check in status generation_config.json has no sampling params -> greedy/deterministic. Candidate outputs spot-checked against source scans (real, untruncated). Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md index e8b5577..d3f838e 100644 --- a/experiments/olmocr-bench-oldscans/README.md +++ b/experiments/olmocr-bench-oldscans/README.md @@ -72,8 +72,11 @@ This run uses floating refs. To make it bit-stable, pin: - the **image by digest** (`...paddleocr-vl@sha256:...`) instead of `:latest` — this pins paddle, paddleocr, and the weights together; - `allenai/olmOCR-bench` by `revision`; -- `olmocr` to an exact version in `score.py`; -- **greedy decoding** (confirm / pin) for run-to-run stability. +- `olmocr` to an exact version in `score.py`. + +Decoding is already **greedy** (the model's `generation_config.json` has no +`do_sample`/`temperature`, so transformers defaults to greedy), so runs are +deterministic modulo GPU-kernel nondeterminism — no sampling seed to pin. ## Result @@ -94,5 +97,7 @@ PaddleOCR-VL-1.6 ties the 7B Qwen2.5-VL. ~15 baseline failures are `disallowed characters`: the model emits CJK glyphs (场, 景, 民, 生, …) on English handwritten scans. 
-> **Status: preliminary.** Not yet validated by reproducing a published -> olmOCR-bench number through this harness; decoding determinism unconfirmed. +> **Status: preliminary.** Decoding is greedy (deterministic) and the candidate +> outputs were spot-checked against the source scans (real, untruncated). Not yet +> validated by reproducing a published olmOCR-bench number through this harness — +> do that before quoting the figure externally. From 42ab3d7a30d59e903ab572d095c9d7155cf5b434 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 13:24:57 +0100 Subject: [PATCH 4/7] Add samples.html (scan vs output, CJK highlighted) + generator Self-contained static page: 7 old_scans docs, source scan beside the model markdown, hallucinated CJK glyphs highlighted. gen_samples.py regenerates it from scan/output pairs pulled from the bucket. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../olmocr-bench-oldscans/gen_samples.py | 106 +++++++ .../olmocr-bench-oldscans/samples.html | 279 ++++++++++++++++++ 2 files changed, 385 insertions(+) create mode 100644 experiments/olmocr-bench-oldscans/gen_samples.py create mode 100644 experiments/olmocr-bench-oldscans/samples.html diff --git a/experiments/olmocr-bench-oldscans/gen_samples.py b/experiments/olmocr-bench-oldscans/gen_samples.py new file mode 100644 index 0000000..4f0a4b2 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/gen_samples.py @@ -0,0 +1,106 @@ +"""Generate a self-contained samples.html: source scan vs. PaddleOCR-VL-1.6 +output for a handful of old_scans docs, with hallucinated CJK glyphs highlighted. +Scans are embedded as base64 JPEG so the page is a single portable file. 
+ +Populate the data dir from the bucket first, then render: + + B=hf://buckets/davanstrien/paddleocr-vl16-oldscans + for id in 1 5 10 27 30 50 56; do + hf buckets cp $B/pdfs/old_scans/$id.pdf samples_data/$id.pdf + hf buckets cp $B/paddleocr_vl_16/old_scans/${id}_pg1_repeat1.md samples_data/$id.md + done + uv run --with pypdfium2 --with pillow gen_samples.py --data samples_data +""" +import argparse +import base64 +import html +import io +import re +from pathlib import Path + +import pypdfium2 as pdfium +from PIL import Image # noqa: F401 (pypdfium2 .to_pil needs Pillow installed) + +DOCS = [ + ("5", "Typed letter — near-perfect transcription"), + ("10", "Typed letter — cursive signature dropped; 'Sincerely,' loops x3"), + ("1", "Handwritten letter — readable, character-level errors"), + ("30", "Typed letter — Chinese 场景 inserted mid-sentence"), + ("56", "Q&A catechism — 源 emitted for 'sources'"), + ("50", "Dense cursive — garbled + multiple CJK glyphs"), + ("27", "Ornate blackletter header skipped + cursive garbled"), +] + +CJK = re.compile(r"[㐀-鿿＀-￯]+") + + +def scan_b64(pdf_path: Path, width: int = 1000) -> str: + pdf = pdfium.PdfDocument(str(pdf_path)) + page = pdf[0] + scale = width / page.get_size()[0] + pil = page.render(scale=scale).to_pil().convert("RGB") + buf = io.BytesIO() + pil.save(buf, "JPEG", quality=80) + return base64.b64encode(buf.getvalue()).decode() + + +def render_output(text: str) -> str: + return CJK.sub(lambda m: f"{m.group(0)}", html.escape(text)) + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--data", default="samples_data") + ap.add_argument("--out", default="samples.html") + args = ap.parse_args() + data = Path(args.data) + + cards = [] + for did, cap in DOCS: + img = scan_b64(data / f"{did}.pdf") + md = (data / f"{did}.md").read_text() + cards.append( + f""" +
+

old_scans/{did} — {html.escape(cap)}

+
+
scan {did}
+
{render_output(md)}
+
+
""" + ) + + page = f""" + + +PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples + + +

PaddleOCR-VL-1.6 on olmOCR-bench old_scans

+

Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning. + Highlighted spans are hallucinated CJK glyphs on English documents. Overall old_scans score: + 38.6% (preliminary). Scans: Library of Congress via + allenai/olmOCR-bench (ODC-BY).

+ {"".join(cards)} +""" + + out = Path(args.out) + out.write_text(page) + print(f"wrote {out} ({len(page) // 1024} KB)") + + +if __name__ == "__main__": + main() diff --git a/experiments/olmocr-bench-oldscans/samples.html b/experiments/olmocr-bench-oldscans/samples.html new file mode 100644 index 0000000..bea7456 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/samples.html @@ -0,0 +1,279 @@ + + + +PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples + + +

PaddleOCR-VL-1.6 on olmOCR-bench old_scans

+

Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning. + Highlighted spans are hallucinated CJK glyphs on English documents. Overall old_scans score: + 38.6% (preliminary). Scans: Library of Congress via + allenai/olmOCR-bench (ODC-BY).

+ +
+

old_scans/5 — Typed letter — near-perfect transcription

+
+
scan 5
+
“WE NEVER DISAPPOINT”
+
+3
+
+136, 138, 140 West Short Street
+
+LEXINGTON, KY.
+
+public affairs, he would stand much better.
+
+You have educated us to expect the president
+
+to talk, and he who falls short of your
+
+means, will be a public disappointment.
+
+When the president ought to say a
+
+wise word to allay the Protestant ill-
+
+feeling and at the same time, let
+
+Rome know her place in such a
+
+way as would prevent exception being
+
+taken, be would�minently please again.
+
+He is a Protestant, and can not be
+
+expected to attend Catholic ceremonials,
+
+and send himself to even the appearance
+
+of intrigue.
+
+All together the public wants
+
+You, and is looking to you to say
+
+Something to check the ozone of Peace
+
+going on in Washington, Kindle D $25,000,000
+
+
+
+

old_scans/10 — Typed letter — cursive signature dropped; 'Sincerely,' loops x3

+
+
scan 10
+
ack
+
+5127114
+
+The Hon. Theodore Roosevelt.
+
+287 Fourth Avenue,
+
+New York City.
+
+Dear Sir:-
+
+Indiana County Progressives send you congratulations on your safe return from your epoch-making journey to South America.
+
+Ours is the Pennsylvania County that gave you six hundred more votes than Taft and Wilson combined received.I hereby make a special plea that you honor us with a campaign speech when you tour Pennsylvania.Indiana, the county-seat, is situated at the center of the county and is entered by rail-roads and trolley-lbnes leading to different sections.When you come thousands will greet you, and in no uncertain tones.
+
+Progressivism is a live issue here. We are in the fight to stay. We want no fusion or amalgamation with the Republican party, the party of Penrose and his ilk.
+
+I am,
+
+Sincerely,
+
+Sincerely,
+
+Sincerely,
+
+Chairman of Washington Party in Indiana County, Pa.
+
+
+
+

old_scans/1 — Handwritten letter — readable, character-level errors

+
+
scan 1
+
Bangor Pa. May. 22. nd 1914.
+
+Col. Rosevelt  Gennrade & Friend
+
+Gen. Dear Sir
+
+I am one of D. E. Sickles old
+
+Regt & Brigade I Served through
+
+all of the civil war. During the
+
+5. years was with Gen Sickles
+
+when He Loost His Logg. then He
+
+Loest us. I Regret Your Absence from
+
+this country During your Trip to
+
+S.A. I Say We Pinhott Here &
+
+Had a Loitte talk with Him about
+
+you & the Political Situation of Vty
+
+Country I am with you for the
+
+election of Gifford Pinchott as U.S.
+
+Senitor. One Loaw Here must Be
+
+Repeated. as we are Deprived of
+
+Nothing for Whoever We Desired at
+
+the Performaries. if you come to Easton
+
+or Bangor Pa I would like to
+
+See you Address 25 Market St Bangor
+
+Yours Very Rest. Caleb Aben
+
+
+
+

old_scans/30 — Typed letter — Chinese 场景 inserted mid-sentence

+
+
scan 30
+
indeed strengthened by the proposed change in your formal relation to the Outlook. With this change I am quite sure that we can do more to promote the interests which we both have as heart.
+
+He shall want our correspondence, when it is published, to make clear that our interest in and loyalty to the principles from which both you and we have stood, so you as the leader in this great demarcation movement, is unchanged, and that we can still corent on you as special contribution on serial and political topics.
+
+
+
+Of course nothing will be published, now by us anything said, until you return to America. Meanwhile I shall endeavor to draft a letter in response to your and get it into Laurancès hands, for consultation between you and Iri, if, as you have intended, you return to their in the same场景.
+
+Believe I may you in many
+
+since affection and esteem for you
+
+and my faith in what you have
+
+to / splanchically stood for in our
+
+
+
+

old_scans/56 — Q&A catechism — 源 emitted for 'sources'

+
+
scan 56
+
## Dario Ciroli! Laguir
+
+1. what is Logic II?
+
+A. the art of reason in the human mind in acquiring
+
+2. into how many parts is it divided?
+
+A. in to form.
+
+3. of what do they think?
+
+A. the first treat of simple approach to
+
+4. what is simple approach to
+
+5. it is the attention of the mind to the improve
+
+6. what is the influence of the mind to the own operation.
+
+7. what are the sources from which all original ideas
+
+8. the situation is the development of the world.
+
+9. what does it do we get from the generation?
+
+A. what does it do we get from the to.
+
+B. what does it do we get from reflection.
+
+C. how does it thinking willing believing it.
+
+D. how are our ideas divided.
+
+A. into simple and fuller.
+
+10. what is a simple idea?
+
+A. it is an original impression existing under
+
+the mind under one uniform appearance, without
+
+variety or composition.
+
+11. give an instance of a simple idea of sensation
+
+also of reflection.
+
+A. set the idea we have of colour in a simple
+
+idea of sensation. The idea we have of willing in a
+
+simple idea of reflection.
+
+2602
+
+
+
+

old_scans/50 — Dense cursive — garbled + multiple CJK glyphs

+
+
scan 50
+
## Tacping Sabbetta
+
+O Vincus means series to comply with what he believes to be the will of God, but commsion of his own faith, he revolts from entroning thus to conform to his own opinions. Even compulsion,政务的行动it may in the expense of persecution — God has given man no authority to come in his dinnert to his precepts.
+
+The people in a Ming bulwam a man this mother-mother it be the voluntary offering of the heart it must be a man cold征兆 of words which cannot be an apology to lead. If a man does not believe that religion exacts it by minor activity to ab�ane from labor are the subbate he will submit with relentless and this this will join them will be no party on his part now will it advance the party of address.
+
+There can be no more parties or elections in this less than that would be an enforcing information of the policy upon the public的手, an obstinance for the meat decing but in an� for the Admiralty Christans to work with Mohoutti —
+
+1. There is no one precept in the new Government commanding no to keep as abbat — If we are banned to keep one, it is in correspondence of the most air land —
+
+2. The language of the 2th commandment is, "The Leisure day is the sabbat of the Lord" But the Christmas keep the first day not the seventh —
+
+3. There is not a single word either in the adorn new Government nor even our admissions relating to the substitution of the first day for the seventh — The subject is not mentioned in any of the discourse of Christ was in any of the epistles of his apotheles —
+
+(1)
+
+
+
+

old_scans/27 — Ornate blackletter header skipped + cursive garbled

+
+
scan 27
+
audience
+
+Turning from will favor
+
+any request with your
+
+consideration I am
+
+Costoutfully you're!
+
+Philip L. J.
+
+by P. Pursuius, Levy
+
+
+ \ No newline at end of file From 1c4b09ec103fd3249d7eec3fbf41ca8c4ca62c37 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 13:44:35 +0100 Subject: [PATCH 5/7] Address review: published number is the OLDER PaddleOCR-VL, not 1.6 - olmOCR-bench lists 'PaddleOCR-VL' (unversioned) at Old scans 37.8; its run_paddlevl runner landed 2025-10-20, pre-1.6 (2026-05-28). So 38.6 is the first 1.6 number, ~consistent with the older figure -- not a same-version reproduction. Finding: v1.6's OmniDocBench gains don't transfer to old scans. - Clarify 38.6 = old_scans.jsonl sub-score; harness also prints ~61.6 Average. - baseline 84.7% (old_scans-only auto-baseline) != leaderboard Base 98.5% (full-bench auto-baseline); all 15 baseline failures are CJK/JP disallowed-char. - Soften 'exactly' re: the v1.6 pin; tighten LIMIT smoke-test wording. Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 42 +++++++++++++++----- experiments/olmocr-bench-oldscans/convert.py | 18 +++++---- 2 files changed, 41 insertions(+), 19 deletions(-) diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md index d3f838e..1851fed 100644 --- a/experiments/olmocr-bench-oldscans/README.md +++ b/experiments/olmocr-bench-oldscans/README.md @@ -88,16 +88,36 @@ PaddleOCR-VL-1.6, default v1.6 pipeline, no tuning (2026-06-27): | → present | 31.2% | 279 | | → absent | 95.7% | 70 | | → order | 27.7% | 177 | -| baseline (auto-generated, 1/PDF) | 84.7% | 83 / 98 | - -For reference, olmOCR-bench's published OldScan column (no-anchor): olmOCR 43.7, -GPT-4o 40.9, Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. 
At 0.9B, +| baseline (auto BaselineTest, 1/PDF, over old_scans) | 84.7% | 83 / 98 | + +**vs the published leaderboard.** olmOCR-bench lists a `PaddleOCR-VL` row at +**Old scans = 37.8**, but *unversioned* — and its `run_paddlevl` runner landed +2025-10-20, ~7 months before PaddleOCR-VL-**1.6** (2026-05-28), so that figure is +an earlier PaddleOCR-VL, not 1.6. Our **38.6** is the first **1.6** old_scans +number, ~0.8 pt above it. Takeaway: **v1.6's OmniDocBench gains don't transfer to +degraded historical scans** — it scores essentially like the original here. The +cheap same-version anchor (run the *original* PaddleOCR-VL through this harness, +expect ~37.8) is the obvious next check — same image, just the version flag. + +**Which "38.6".** It is the `old_scans.jsonl` sub-score (present/absent/order), +matching the leaderboard's "Old scans" column. Stock `olmocr.bench.benchmark` +*also* prints `Average Score ≈ 61.6%` = mean(old_scans, auto-baseline) — that is +**not** the leaderboard figure; don't quote it as the headline. + +**baseline 84.7% ≠ leaderboard "Base" (98.5%).** "Base" is the auto-baseline over +the *whole* benchmark (~1,400 mostly-clean PDFs); ours is the same test over only +the 98 hardest old_scans PDFs — a different population, not a regression. All 15 +baseline failures are `disallowed characters`: CJK/Japanese glyphs (场, 景, 民, 生, +ら …) emitted on English scans — the hallucination that pulls old-scans baseline +below the full-bench Base. See `samples.html` (regenerate via `gen_samples.py`) +for scan↔output pairs with the glyphs highlighted. + +**Size context** (published no-anchor Old scans): olmOCR 43.7, GPT-4o 40.9, +Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, PaddleOCR-VL-1.6 ties the 7B Qwen2.5-VL. -~15 baseline failures are `disallowed characters`: the model emits CJK glyphs -(场, 景, 民, 生, …) on English handwritten scans. 
- -> **Status: preliminary.** Decoding is greedy (deterministic) and the candidate -> outputs were spot-checked against the source scans (real, untruncated). Not yet -> validated by reproducing a published olmOCR-bench number through this harness — -> do that before quoting the figure externally. +> **Status.** Consistent with the published (earlier-version) PaddleOCR-VL figure +> (37.8 → 38.6). Greedy/deterministic decoding; outputs spot-checked vs source +> scans (real, untruncated). For a strict same-version reproduction, run the +> original PaddleOCR-VL through this harness; pin the image digest (see +> Reproducibility) before citing the figure on a model card. diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py index a3525e8..40cb80c 100644 --- a/experiments/olmocr-bench-oldscans/convert.py +++ b/experiments/olmocr-bench-oldscans/convert.py @@ -7,12 +7,13 @@ (paddleocr, huggingface_hub, stdlib) is already in the image, so there is no PEP 723 header. -Fidelity: the markdown extraction mirrors olmOCR-bench's own runner -(`olmocr/bench/runners/run_paddlevl.py`) exactly -- `res.markdown["markdown_texts"]`, -per page, with a bare default pipeline and NO tuning (no max_pixels / prompts / -dpi). The ONLY intentional difference is `pipeline_version="v1.6"`, which is what -the PaddleOCR-VL-1.6 model card tells you to pass. So we follow both the bench -runner and PaddlePaddle's documented defaults. +Fidelity: the markdown extraction matches olmOCR-bench's own runner +(`olmocr/bench/runners/run_paddlevl.py`) -- `res.markdown["markdown_texts"]`, per +page, with a bare default pipeline and NO tuning (no max_pixels / prompts / dpi). +The one intentional difference is `pipeline_version="v1.6"`: upstream calls +`PaddleOCRVL()` with no version (an earlier PaddleOCR-VL), while this measures 1.6 +as its model card specifies. So we follow the bench runner's extraction and +PaddlePaddle's documented v1.6 defaults. 
This image runs as the non-root `paddleocr` user, which CANNOT write the bucket FUSE mount (root-owned). So we write outputs to a container-local dir and push @@ -29,8 +30,9 @@ Env: OUT_ROOT local staging dir (default /tmp/olmocr-oldscans-out) BUCKET bucket to sync results to (default below) - LIMIT cap number of PDFs (smoke test; 0 = all). A capped run scores low - because the un-converted tests count as failures. + LIMIT cap number of PDFs (plumbing smoke test; 0 = all). With a cap, the + un-converted docs are scored FAILED, so the result is not a + representative score -- use a smoke run only to check the pipeline. """ import json import os From 74380bbe2e5505e5b79d3ccd4f18deaf177d7c0c Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 15:52:48 +0100 Subject: [PATCH 6/7] De-duplicate PDF handling + parameterize version - convert.py reuses PDFs from the bucket mount if present, else downloads once and stages them under pdfs/ so sync_bucket adds them for the scorer. Removes the separate hf download + hf buckets sync step and the per-run re-download. - PIPELINE_VERSION / CANDIDATE env vars: same script runs v1.6 (default) or the original v1 into a second candidate; the score job ranks both together. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 14 ++++---- experiments/olmocr-bench-oldscans/convert.py | 38 ++++++++++++++------ 2 files changed, 34 insertions(+), 18 deletions(-) diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md index 1851fed..6155393 100644 --- a/experiments/olmocr-bench-oldscans/README.md +++ b/experiments/olmocr-bench-oldscans/README.md @@ -42,16 +42,13 @@ IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvid # deliver the script into the bucket (re-cp after editing) hf buckets cp convert.py $BUCKET/convert.py -# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test) +# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test). On the first run this +# also stages the source PDFs into the bucket under pdfs/ (which the scorer needs); +# later runs reuse them from the mount instead of re-downloading. hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \ $IMAGE python3 /bucket/convert.py -# add the source PDFs the scorer requires under pdfs/ -hf download allenai/olmOCR-bench --repo-type dataset \ - --include "bench_data/pdfs/old_scans/*" --local-dir /tmp/olm -hf buckets sync /tmp/olm/bench_data/pdfs $BUCKET/pdfs - -# score +# score (ranks every candidate folder in the bucket together) hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro score.py ``` @@ -62,8 +59,9 @@ Add `-d` to detach, then `hf jobs wait ` / `hf jobs logs `. - **Flavor `l4x1`**: the image's CUDA build matches the `l4x1` driver; larger GPUs (l40s / a100) do not. - **Mount path `/bucket`**: `/data` is reserved by Jobs for local-script artifacts. - **`sync_bucket`, not a FUSE write**: the image runs as a non-root user that cannot write the mount, so `convert.py` writes locally and uploads over HTTP; the mount is `:ro` (script delivery only). 
-- **`pdfs/` folder**: `benchmark.py` requires `/pdfs` to exist; the source PDFs are synced there (run step above). +- **`pdfs/` folder**: `benchmark.py` requires `/pdfs` to exist; `convert.py` stages the source PDFs into the bucket on the first run and reuses them from the mount after, so no separate sync step is needed. - **`numpy`**: declared in `score.py` because `olmocr[bench]` imports it without declaring it. +- **Versions**: `convert.py` takes `PIPELINE_VERSION` (default `v1.6`) and `CANDIDATE`. Run `-e PIPELINE_VERSION=v1 -e CANDIDATE=paddleocr_vl_orig` to also convert the original 0.9B PaddleOCR-VL (the leaderboard's 37.8); both candidates then sit in the bucket and the score job ranks them together. - **`old_scans_math`** variant: change `JSONL_PATH` in `convert.py`; `score.py` then also needs `playwright install chromium`. ## Reproducibility diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py index 40cb80c..c9198e6 100644 --- a/experiments/olmocr-bench-oldscans/convert.py +++ b/experiments/olmocr-bench-oldscans/convert.py @@ -28,14 +28,20 @@ python3 /bucket/convert.py Env: - OUT_ROOT local staging dir (default /tmp/olmocr-oldscans-out) - BUCKET bucket to sync results to (default below) - LIMIT cap number of PDFs (plumbing smoke test; 0 = all). With a cap, the - un-converted docs are scored FAILED, so the result is not a - representative score -- use a smoke run only to check the pipeline. + OUT_ROOT local staging dir (default /tmp/olmocr-oldscans-out) + BUCKET bucket to sync results to (default below) + PIPELINE_VERSION PaddleOCRVL version (default v1.6). "v1" = the original 0.9B + PaddleOCR-VL = the version on the olmOCR-bench leaderboard + (37.8) -> run it (with a distinct CANDIDATE) for a strict + same-version reproduction. + CANDIDATE output subfolder + model label (default paddleocr_vl_16) + LIMIT cap number of PDFs (plumbing smoke test; 0 = all). 
With a cap, + the un-converted docs are scored FAILED, so the result is not + a representative score -- use a smoke run only to check plumbing. """ import json import os +import shutil from collections import defaultdict from pathlib import Path @@ -44,7 +50,8 @@ BENCH_REPO = "allenai/olmOCR-bench" JSONL_PATH = "bench_data/old_scans.jsonl" -CANDIDATE = "paddleocr_vl_16" # any name except "pdfs" +CANDIDATE = os.environ.get("CANDIDATE", "paddleocr_vl_16") # any name except "pdfs" +PIPELINE_VERSION = os.environ.get("PIPELINE_VERSION", "v1.6") OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/tmp/olmocr-oldscans-out")) BUCKET = os.environ.get("BUCKET", "hf://buckets/davanstrien/paddleocr-vl16-oldscans") LIMIT = int(os.environ.get("LIMIT", "0")) @@ -58,17 +65,28 @@ pages_by_pdf[t["pdf"]].add(int(t.get("page", 1))) print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs -> {OUT_ROOT}", flush=True) -# ---- model (vendor default for v1.6, no tuning) ----------------------------- -pipeline = PaddleOCRVL(pipeline_version="v1.6") +# ---- model (vendor default, no tuning) -------------------------------------- +print(f"pipeline_version={PIPELINE_VERSION} candidate={CANDIDATE}", flush=True) +pipeline = PaddleOCRVL(pipeline_version=PIPELINE_VERSION) def resolve_pdf(pdf_field): - """The jsonl `pdf` field may or may not carry a pdfs/ prefix; try variants.""" + """Reuse the PDF already on the bucket mount if present; otherwise download it + once and stage it under OUT_ROOT/pdfs so sync_bucket adds it to the bucket for + the scorer (benchmark.py needs /pdfs). 
Avoids a separate download + sync, + and skips re-downloading on later runs (the bucket is mounted at /bucket).""" + mounted = Path("/bucket/pdfs") / pdf_field + if mounted.is_file(): + return str(mounted) for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field): try: - return hf_hub_download(BENCH_REPO, cand, repo_type="dataset") + local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset") except Exception: continue + dest = OUT_ROOT / "pdfs" / pdf_field + dest.parent.mkdir(parents=True, exist_ok=True) + shutil.copy(local, dest) + return local raise FileNotFoundError(pdf_field) From 52f7c66105484bec30f31ea7201e44aac1e78906 Mon Sep 17 00:00:00 2001 From: Daniel van Strien Date: Sat, 27 Jun 2026 16:26:20 +0100 Subject: [PATCH 7/7] Fold in v1 anchor: harness reproduces published 37.8 (got 38.2) Ran the original PaddleOCR-VL (v1) through the same harness: old_scans 38.2 vs the leaderboard's published 37.8 -> within 0.4 pt / CI, so the convert+scoring are validated. v1.6 (38.6) is statistically identical to v1 (38.2) on old_scans -> OmniDocBench gains don't transfer; v1.6 even hallucinates slightly more CJK (baseline 84.7 vs 88.8). Co-Authored-By: Claude Opus 4.8 (1M context) --- experiments/olmocr-bench-oldscans/README.md | 76 ++++++++++----------- 1 file changed, 37 insertions(+), 39 deletions(-) diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md index 6155393..68d06ef 100644 --- a/experiments/olmocr-bench-oldscans/README.md +++ b/experiments/olmocr-bench-oldscans/README.md @@ -78,44 +78,42 @@ deterministic modulo GPU-kernel nondeterminism — no sampling seed to pin. 
## Result -PaddleOCR-VL-1.6, default v1.6 pipeline, no tuning (2026-06-27): - -| Category | Pass rate | Tests | -|---|---|---| -| **old_scans (present / absent / order)** | **38.6%** | 203 / 526 | -| → present | 31.2% | 279 | -| → absent | 95.7% | 70 | -| → order | 27.7% | 177 | -| baseline (auto BaselineTest, 1/PDF, over old_scans) | 84.7% | 83 / 98 | - -**vs the published leaderboard.** olmOCR-bench lists a `PaddleOCR-VL` row at -**Old scans = 37.8**, but *unversioned* — and its `run_paddlevl` runner landed -2025-10-20, ~7 months before PaddleOCR-VL-**1.6** (2026-05-28), so that figure is -an earlier PaddleOCR-VL, not 1.6. Our **38.6** is the first **1.6** old_scans -number, ~0.8 pt above it. Takeaway: **v1.6's OmniDocBench gains don't transfer to -degraded historical scans** — it scores essentially like the original here. The -cheap same-version anchor (run the *original* PaddleOCR-VL through this harness, -expect ~37.8) is the obvious next check — same image, just the version flag. - -**Which "38.6".** It is the `old_scans.jsonl` sub-score (present/absent/order), -matching the leaderboard's "Old scans" column. Stock `olmocr.bench.benchmark` -*also* prints `Average Score ≈ 61.6%` = mean(old_scans, auto-baseline) — that is -**not** the leaderboard figure; don't quote it as the headline. - -**baseline 84.7% ≠ leaderboard "Base" (98.5%).** "Base" is the auto-baseline over -the *whole* benchmark (~1,400 mostly-clean PDFs); ours is the same test over only -the 98 hardest old_scans PDFs — a different population, not a regression. All 15 -baseline failures are `disallowed characters`: CJK/Japanese glyphs (场, 景, 民, 生, -ら …) emitted on English scans — the hallucination that pulls old-scans baseline -below the full-bench Base. See `samples.html` (regenerate via `gen_samples.py`) -for scan↔output pairs with the glyphs highlighted. +PaddleOCR-VL on olmOCR-bench `old_scans` — default pipeline, no tuning, greedy +(2026-06-27). 
Both versions scored through the same harness: + +| Version | **old_scans** | present | absent | order | baseline | +|---|---|---|---|---|---| +| v1.6 | **38.6%** (203/526) | 31.2 | 95.7 | 27.7 | 84.7 | +| v1 (original 0.9B) | **38.2%** (201/526) | 32.3 | 95.7 | 24.9 | 88.8 | + +`old_scans` = the present/absent/order tests = the leaderboard's "Old scans" column. + +**Harness validated against the published figure.** olmOCR-bench lists the original +`PaddleOCR-VL` (unversioned; its `run_paddlevl` runner is dated 2025-10-20, pre-1.6) +at **Old scans = 37.8**. Running that same original (`v1`) through this harness gives +**38.2** — within 0.4 pt, inside the ±3.6 % CI. So our convert + scoring reproduce +the published number; the harness is sound. + +**v1.6's gains don't transfer to old scans.** v1.6 (38.6) and the original v1 (38.2) +are statistically indistinguishable here — the upgrade that made v1.6 SOTA on +OmniDocBench buys nothing on degraded historical scans. v1.6 even regresses slightly +on `baseline` (84.7 vs 88.8): it emits *more* CJK/Japanese disallowed-character +hallucinations (场, 景, 民, 生, ら …) on English scans than the original did. See +`samples.html` (regenerate via `gen_samples.py`) for scan↔output pairs with the +glyphs highlighted. + +**Reading the harness output.** "38.6" / "38.2" are the `old_scans.jsonl` sub-scores. +`olmocr.bench.benchmark` *also* prints an `Average Score` (61.6 % / 63.5 %) = mean of +the old_scans sub-score and the auto-baseline category — that is **not** the +leaderboard "Old scans" figure; don't quote it. And the per-version `baseline` here +(auto-BaselineTest over only the 98 old_scans PDFs) is **not** the leaderboard's +"Base" column (the same test over the whole ~1,400-PDF benchmark, 98.5 %). **Size context** (published no-anchor Old scans): olmOCR 43.7, GPT-4o 40.9, -Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, -PaddleOCR-VL-1.6 ties the 7B Qwen2.5-VL. 
- -> **Status.** Consistent with the published (earlier-version) PaddleOCR-VL figure -> (37.8 → 38.6). Greedy/deterministic decoding; outputs spot-checked vs source -> scans (real, untruncated). For a strict same-version reproduction, run the -> original PaddleOCR-VL through this harness; pin the image digest (see -> Reproducibility) before citing the figure on a model card. +Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, PaddleOCR-VL +ties the 7B Qwen2.5-VL. + +> **Status: validated.** The harness reproduces the published original-PaddleOCR-VL +> figure (37.8 → 38.2, within CI), and v1.6 (38.6) is statistically the same on +> old_scans. Greedy/deterministic decoding; outputs spot-checked vs source scans. +> Pin the image digest (see Reproducibility) before citing externally.
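
A quick sanity check on the Result table, as a minimal sketch rather than part of the harness: it recomputes the pass rates from the counts quoted in the README and checks that the `Average Score` the harness prints is consistent with a plain mean of the `old_scans` sub-score and the auto-baseline rate. The v1.6 baseline count (83 / 98) comes from the earlier revision of the table; the v1 baseline count (87 / 98) is an assumption inferred from the reported 88.8% and is not stated anywhere in the README. The helper names `pass_rate` and `average_score` are illustrative only and are not olmocr functions.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate in percent."""
    return 100.0 * passed / total


def average_score(*rates: float) -> float:
    """Unweighted mean of per-category pass rates.

    Assumption: this mirrors how the harness averages the old_scans
    sub-score with the auto-baseline category, as the README describes.
    """
    return sum(rates) / len(rates)


# (passed, total) per category. The v1 baseline count of 87 is inferred
# from the reported 88.8%, not taken from the README.
RESULTS = {
    "v1.6": {"old_scans": (203, 526), "baseline": (83, 98)},
    "v1": {"old_scans": (201, 526), "baseline": (87, 98)},
}

for version, counts in RESULTS.items():
    old_scans = pass_rate(*counts["old_scans"])
    baseline = pass_rate(*counts["baseline"])
    average = average_score(old_scans, baseline)
    print(
        f"{version}: old_scans={old_scans:.1f} "
        f"baseline={baseline:.1f} average={average:.1f}"
    )
```

Run as is, this prints 38.6 / 84.7 / 61.6 for v1.6 and 38.2 / 88.8 / 63.5 for v1, which agrees with the table and with the `Average Score` values quoted above. Only the `old_scans` column corresponds to the leaderboard's "Old scans" figure.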