119 changes: 119 additions & 0 deletions experiments/olmocr-bench-oldscans/README.md
@@ -0,0 +1,119 @@
# PaddleOCR-VL-1.6 on olmOCR-bench (old_scans)

Scores [PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the
`old_scans` subset of [`allenai/olmOCR-bench`](https://huggingface.co/datasets/allenai/olmOCR-bench).
Standalone experiment — not part of the `ocr_bench` library.

`old_scans` = 98 single-page Library-of-Congress scans, 526 tests
(text-present / text-absent / reading-order). No math or tables, so scoring needs
no KaTeX/chromium.

## Fidelity

- **Scoring**: stock `olmocr.bench.benchmark`, unmodified.
- **Conversion**: matches olmOCR-bench's own runner
[`run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py)
— `res.markdown["markdown_texts"]`, per page, default pipeline, no tuning. The
only difference is `pipeline_version="v1.6"` (as the model card specifies).
- Runs inside PaddlePaddle's own image, so paddle/paddleocr are the vendor builds.

## Method

Two HF Jobs with a bucket as the handoff:

| Step | Command | Hardware | Does |
|------|---------|----------|------|
| `convert.py` | `hf jobs run` | GPU `l4x1` | PaddleOCR-VL-1.6 → markdown → `sync_bucket` to the bucket |
| `score.py` | `hf jobs uv run` | CPU `cpu-upgrade` | `olmocr.bench.benchmark` → score |

Candidate files are written as `{splitext(pdf_field)[0]}_pg{page}_repeat1.md`, which
is the path `benchmark.py` looks them up by.
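
A minimal sketch of that naming rule, assuming the same `os.path.splitext(pdf_field)[0]` logic that `convert.py` uses. The helper name `candidate_md_path` is hypothetical and is not part of the experiment:

```python
import os


def candidate_md_path(candidate: str, pdf_field: str, page: int) -> str:
    """Relative path of one candidate markdown file (hypothetical helper)."""
    if candidate == "pdfs":
        # convert.py notes the candidate folder may have any name except "pdfs"
        raise ValueError('candidate folder must not be named "pdfs"')
    md_base = os.path.splitext(pdf_field)[0]  # drop ".pdf", keep any subfolder
    return f"{candidate}/{md_base}_pg{page}_repeat1.md"


print(candidate_md_path("paddleocr_vl_16", "old_scans/5.pdf", 1))
# paddleocr_vl_16/old_scans/5_pg1_repeat1.md
```

The printed path has the same shape as the one `gen_samples.py` copies out of the bucket.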

- **Image**: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu`
— paddle 3.2.1 + paddleocr 3.6.0 + 1.6 weights baked in; python `/usr/local/bin/python3` (3.10); no `uv`.
- **Bucket**: `hf://buckets/davanstrien/paddleocr-vl16-oldscans`

## Run

```bash
BUCKET=hf://buckets/davanstrien/paddleocr-vl16-oldscans
IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu

# deliver the script into the bucket (re-cp after editing)
hf buckets cp convert.py $BUCKET/convert.py

# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test). On the first run this
# also stages the source PDFs into the bucket under pdfs/ (which the scorer needs);
# later runs reuse them from the mount instead of re-downloading.
hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \
$IMAGE python3 /bucket/convert.py

# score (ranks every candidate folder in the bucket together)
hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro score.py
```

Add `-d` to detach, then `hf jobs wait <id>` / `hf jobs logs <id>`.

## Configuration

- **Flavor `l4x1`**: the image's CUDA build matches the `l4x1` driver; it does not match the drivers on the larger GPU flavors (l40s / a100).
- **Mount path `/bucket`**: `/data` is reserved by Jobs for local-script artifacts.
- **`sync_bucket`, not a FUSE write**: the image runs as a non-root user that cannot write to the mount, so `convert.py` writes locally and uploads over HTTP; the mount is `:ro` and is only read from (script delivery and reuse of already-staged PDFs).
- **`pdfs/` folder**: `benchmark.py` requires `<dir>/pdfs` to exist; `convert.py` stages the source PDFs into the bucket on the first run and reuses them from the mount after, so no separate sync step is needed.
- **`numpy`**: declared in `score.py` because `olmocr[bench]` imports it without declaring it.
- **Versions**: `convert.py` takes `PIPELINE_VERSION` (default `v1.6`) and `CANDIDATE`. Run `-e PIPELINE_VERSION=v1 -e CANDIDATE=paddleocr_vl_orig` to also convert the original 0.9B PaddleOCR-VL (the leaderboard's 37.8); both candidates then sit in the bucket and the score job ranks them together.
- **`old_scans_math`** variant: change `JSONL_PATH` in `convert.py`; `score.py` then also needs `playwright install chromium`.

## Reproducibility

This run uses floating refs. To make it bit-stable, pin:

- the **image by digest** (`...paddleocr-vl@sha256:...`) instead of `:latest` — this pins paddle, paddleocr, and the weights together;
- `allenai/olmOCR-bench` by `revision`;
- `olmocr` to an exact version in `score.py`.

Decoding is already **greedy** (the model's `generation_config.json` has no
`do_sample`/`temperature`, so transformers defaults to greedy), so runs are
deterministic modulo GPU-kernel nondeterminism — no sampling seed to pin.
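
As a rough pre-flight check, a run could refuse to start on floating refs. The sketch below is an illustration only: the helper names are hypothetical, and it assumes a pinned image ends in `@sha256:` plus 64 hex digits and a pinned dataset revision is a full 40-character commit hash.

```python
import re

_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}")
_COMMIT = re.compile(r"[0-9a-f]{40}")


def image_is_pinned(image_ref: str) -> bool:
    """True if the image reference ends in a sha256 digest rather than a tag."""
    tail = image_ref[image_ref.rfind("@"):] if "@" in image_ref else ""
    return _DIGEST.fullmatch(tail) is not None


def revision_is_pinned(revision: str) -> bool:
    """True if the revision is a full commit hash rather than a branch or tag."""
    return _COMMIT.fullmatch(revision) is not None


print(image_is_pinned("example.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu"))  # False
print(revision_is_pinned("main"))  # False
```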

## Result

PaddleOCR-VL on olmOCR-bench `old_scans` — default pipeline, no tuning, greedy
(2026-06-27). Both versions scored through the same harness:

| Version | **old_scans** | present | absent | order | baseline |
|---|---|---|---|---|---|
| v1.6 | **38.6%** (203/526) | 31.2 | 95.7 | 27.7 | 84.7 |
| v1 (original 0.9B) | **38.2%** (201/526) | 32.3 | 95.7 | 24.9 | 88.8 |

`old_scans` = the present/absent/order tests = the leaderboard's "Old scans" column.

**Harness validated against the published figure.** olmOCR-bench lists the original
`PaddleOCR-VL` (unversioned; its `run_paddlevl` runner is dated 2025-10-20, pre-1.6)
at **Old scans = 37.8**. Running that same original (`v1`) through this harness gives
**38.2** — within 0.4 pt, inside the ±3.6 % CI. So our convert + scoring reproduce
the published number; the harness is sound.

**v1.6's gains don't transfer to old scans.** v1.6 (38.6) and the original v1 (38.2)
are statistically indistinguishable here: the upgrade that made v1.6 SOTA on
OmniDocBench buys nothing on degraded historical scans. v1.6 even regresses slightly
on `baseline` (84.7 vs 88.8) because it emits *more* disallowed CJK characters
(Chinese ideographs and Japanese kana such as 场, 景, 民, 生, ら …) on English scans
than the original did. See `samples.html` (regenerate via `gen_samples.py`) for
scan↔output pairs with the glyphs highlighted.
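
To spot-check a single output for such glyphs, a small regex scan is enough. This is an illustrative sketch, not the harness's own baseline check; the function name and the chosen Unicode ranges (kana, CJK ideographs including Extension A, fullwidth forms) are assumptions.

```python
import re

# Assumed ranges: hiragana + katakana, CJK ideographs (incl. Extension A), fullwidth forms.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u3400-\u9fff\uff00-\uffef]+")


def cjk_runs(text: str) -> list[str]:
    """Return every contiguous run of CJK/kana characters found in text."""
    return CJK_RUN.findall(text)


print(cjk_runs("in the 场景 described above ら"))
print(cjk_runs("Sincerely,"))
```

An empty list for an English scan means no glyphs in these ranges were emitted.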

**Reading the harness output.** "38.6" / "38.2" are the `old_scans.jsonl` sub-scores.
`olmocr.bench.benchmark` *also* prints an `Average Score` (61.6 % / 63.5 %) = mean of
the old_scans sub-score and the auto-baseline category — that is **not** the
leaderboard "Old scans" figure; don't quote it. And the per-version `baseline` here
(auto-BaselineTest over only the 98 old_scans PDFs) is **not** the leaderboard's
"Base" column (the same test over the whole ~1,400-PDF benchmark, 98.5 %).

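The arithmetic behind those figures can be reproduced from the table. The sketch below assumes, as described above, that `Average Score` is the plain mean of the old_scans sub-score and the `baseline` category; the function names are hypothetical.

```python
def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def average_score(sub_score: float, baseline: float) -> float:
    """Mean of the old_scans sub-score and the auto-baseline category."""
    return (sub_score + baseline) / 2


v16 = pct(203, 526)  # v1.6: 203 of 526 tests passed
v1 = pct(201, 526)   # v1:   201 of 526 tests passed

print(f"{v16:.1f} {v1:.1f}")  # 38.6 38.2
print(f"{average_score(v16, 84.7):.1f} {average_score(v1, 88.8):.1f}")  # 61.6 63.5
```

The first line is the number to quote; the second is the harness's `Average Score`, shown only to make clear where it comes from.
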
**Size context** (published no-anchor Old scans): olmOCR 43.7, GPT-4o 40.9,
Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, PaddleOCR-VL
ties the 7B Qwen2.5-VL.

> **Status: validated.** The harness reproduces the published original-PaddleOCR-VL
> figure (37.8 → 38.2, within CI), and v1.6 (38.6) is statistically the same on
> old_scans. Greedy/deterministic decoding; outputs spot-checked vs source scans.
> Pin the image digest (see Reproducibility) before citing externally.
126 changes: 126 additions & 0 deletions experiments/olmocr-bench-oldscans/convert.py
@@ -0,0 +1,126 @@
"""
Job 1 (GPU): run PaddleOCR-VL-1.6 over the olmOCR-bench `old_scans` subset and
write candidate markdown in the exact layout `olmocr.bench.benchmark` expects.

Runs with PaddlePaddle's docker image (paddle 3.2.1 + paddleocr 3.6.0 + the v1.6
weights preinstalled) via the image's python3.10 -- not uv. Every import here
(paddleocr, huggingface_hub, stdlib) is already in the image, so there is no
PEP 723 header.

Fidelity: the markdown extraction matches olmOCR-bench's own runner
(`olmocr/bench/runners/run_paddlevl.py`) -- `res.markdown["markdown_texts"]`, per
page, with a bare default pipeline and NO tuning (no max_pixels / prompts / dpi).
The one intentional difference is `pipeline_version="v1.6"`: upstream calls
`PaddleOCRVL()` with no version (an earlier PaddleOCR-VL), while this measures 1.6
as its model card specifies. So we follow the bench runner's extraction and
PaddlePaddle's documented v1.6 defaults.

This image runs as the non-root `paddleocr` user, which CANNOT write the bucket
FUSE mount (root-owned). So we write outputs to a container-local dir and push
them with `sync_bucket()` (mount-free HTTP upload) at the end. The bucket is
mounted read-only to deliver this script and to reuse PDFs staged by earlier runs.

Delivery + run (see README for full commands):
    hf buckets cp convert.py hf://buckets/davanstrien/paddleocr-vl16-oldscans/convert.py
    hf jobs run --flavor l4x1 -s HF_TOKEN \\
        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\
        ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \\
        python3 /bucket/convert.py

Env:
    OUT_ROOT          local staging dir (default /tmp/olmocr-oldscans-out)
    BUCKET            bucket to sync results to (default below)
    PIPELINE_VERSION  PaddleOCRVL version (default v1.6). "v1" = the original 0.9B
                      PaddleOCR-VL = the version on the olmOCR-bench leaderboard
                      (37.8) -> run it (with a distinct CANDIDATE) for a strict
                      same-version reproduction.
    CANDIDATE         output subfolder + model label (default paddleocr_vl_16)
    LIMIT             cap number of PDFs (plumbing smoke test; 0 = all). With a cap,
                      the un-converted docs are scored FAILED, so the result is not
                      a representative score -- use a smoke run only to check plumbing.
"""
import json
import os
import shutil
from collections import defaultdict
from pathlib import Path

from huggingface_hub import hf_hub_download, sync_bucket
from paddleocr import PaddleOCRVL

BENCH_REPO = "allenai/olmOCR-bench"
JSONL_PATH = "bench_data/old_scans.jsonl"
CANDIDATE = os.environ.get("CANDIDATE", "paddleocr_vl_16") # any name except "pdfs"
PIPELINE_VERSION = os.environ.get("PIPELINE_VERSION", "v1.6")
OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/tmp/olmocr-oldscans-out"))
BUCKET = os.environ.get("BUCKET", "hf://buckets/davanstrien/paddleocr-vl16-oldscans")
LIMIT = int(os.environ.get("LIMIT", "0"))

# ---- test manifest ----------------------------------------------------------
jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()]

pages_by_pdf = defaultdict(set)
for t in tests:
    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs -> {OUT_ROOT}", flush=True)

# ---- model (vendor default, no tuning) --------------------------------------
print(f"pipeline_version={PIPELINE_VERSION} candidate={CANDIDATE}", flush=True)
pipeline = PaddleOCRVL(pipeline_version=PIPELINE_VERSION)


def resolve_pdf(pdf_field):
    """Reuse the PDF already on the bucket mount if present; otherwise download it
    once and stage it under OUT_ROOT/pdfs so sync_bucket adds it to the bucket for
    the scorer (benchmark.py needs <dir>/pdfs). Avoids a separate download + sync,
    and skips re-downloading on later runs (the bucket is mounted at /bucket)."""
    mounted = Path("/bucket/pdfs") / pdf_field
    if mounted.is_file():
        return str(mounted)
    # the PDF's location inside the dataset repo varies, so try each layout in turn
    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
        try:
            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
        except Exception:
            continue
        dest = OUT_ROOT / "pdfs" / pdf_field
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local, dest)
        return local
    raise FileNotFoundError(pdf_field)


def page_markdowns(pdf_path):
    """Per-page markdown, exactly as run_paddlevl.py does: res.markdown['markdown_texts']."""
    return [res.markdown["markdown_texts"] for res in pipeline.predict(str(pdf_path))]


# ---- convert ----------------------------------------------------------------
cand_dir = OUT_ROOT / CANDIDATE
items = sorted(pages_by_pdf.items())
if LIMIT:
    items = items[:LIMIT]
    print(f"LIMIT={LIMIT} (plumbing smoke test -- expect a low score)", flush=True)

for i, (pdf_field, pages) in enumerate(items, 1):
    try:
        mds = page_markdowns(resolve_pdf(pdf_field))
    except Exception as e:  # keep going; a missing page just fails its tests
        print(f"[WARN] {pdf_field}: {e}", flush=True)
        mds = []
    md_base = os.path.splitext(pdf_field)[0]  # mirrors benchmark.py exactly
    for pg in pages:
        md = mds[pg - 1] if 0 <= pg - 1 < len(mds) else ""  # 1-indexed page -> 0-indexed
        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
        fp.parent.mkdir(parents=True, exist_ok=True)
        fp.write_text(md)
    n = len(mds[0]) if mds else 0
    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True)

# the scorer needs the jsonl next to the candidate folder
(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text())

# push results to the bucket over HTTP (the FUSE mount is not writable as non-root)
print(f"Syncing {OUT_ROOT} -> {BUCKET}", flush=True)
sync_bucket(str(OUT_ROOT), BUCKET)
print("Done.", flush=True)
106 changes: 106 additions & 0 deletions experiments/olmocr-bench-oldscans/gen_samples.py
@@ -0,0 +1,106 @@
"""Generate a self-contained samples.html: source scan vs. PaddleOCR-VL-1.6
output for a handful of old_scans docs, with hallucinated CJK glyphs highlighted.
Scans are embedded as base64 JPEG so the page is a single portable file.

Populate the data dir from the bucket first, then render:

    B=hf://buckets/davanstrien/paddleocr-vl16-oldscans
    for id in 1 5 10 27 30 50 56; do
      hf buckets cp $B/pdfs/old_scans/$id.pdf samples_data/$id.pdf
      hf buckets cp $B/paddleocr_vl_16/old_scans/${id}_pg1_repeat1.md samples_data/$id.md
    done
    uv run --with pypdfium2 --with pillow gen_samples.py --data samples_data
"""
import argparse
import base64
import html
import io
import re
from pathlib import Path

import pypdfium2 as pdfium
from PIL import Image # noqa: F401 (pypdfium2 .to_pil needs Pillow installed)

DOCS = [
("5", "Typed letter — near-perfect transcription"),
("10", "Typed letter — cursive signature dropped; 'Sincerely,' loops x3"),
("1", "Handwritten letter — readable, character-level errors"),
("30", "Typed letter — Chinese 场景 inserted mid-sentence"),
("56", "Q&A catechism — 源 emitted for 'sources'"),
("50", "Dense cursive — garbled + multiple CJK glyphs"),
("27", "Ornate blackletter header skipped + cursive garbled"),
]

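# Kana (U+3040-30FF), CJK ideographs incl. Extension A (U+3400-9FFF), and fullwidth
# forms (U+FF00-FFEF). Kana is included so Japanese glyphs such as ら are highlighted too.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u9fff\uff00-\uffef]+")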


def scan_b64(pdf_path: Path, width: int = 1000) -> str:
    """Render page 1 of the PDF at the given pixel width and return it as base64 JPEG."""
    pdf = pdfium.PdfDocument(str(pdf_path))
    page = pdf[0]
    scale = width / page.get_size()[0]
    pil = page.render(scale=scale).to_pil().convert("RGB")
    buf = io.BytesIO()
    pil.save(buf, "JPEG", quality=80)
    return base64.b64encode(buf.getvalue()).decode()


def render_output(text: str) -> str:
    """HTML-escape the model output, then wrap each CJK run in <mark>."""
    return CJK.sub(lambda m: f"<mark>{m.group(0)}</mark>", html.escape(text))


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--data", default="samples_data")
    ap.add_argument("--out", default="samples.html")
    args = ap.parse_args()
    data = Path(args.data)

    cards = []
    for did, cap in DOCS:
        img = scan_b64(data / f"{did}.pdf")
        # read as UTF-8 explicitly: the outputs contain CJK glyphs
        md = (data / f"{did}.md").read_text(encoding="utf-8")
        cards.append(
f"""
<section class="card">
<h2>old_scans/{did} <span>— {html.escape(cap)}</span></h2>
<div class="pair">
<div class="scan"><img loading="lazy" src="data:image/jpeg;base64,{img}" alt="scan {did}"></div>
<pre class="out">{render_output(md)}</pre>
</div>
</section>"""
        )

    page = f"""<!doctype html>
<html lang="en"><head><meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples</title>
<style>
body {{ font: 15px/1.5 -apple-system, system-ui, sans-serif; max-width: 1200px; margin: 2rem auto; padding: 0 1rem; color: #1a1a1a; }}
h1 {{ margin-bottom: .2rem; }}
.lede {{ color: #555; }}
mark {{ background: #ffd54f; padding: 0 2px; border-radius: 2px; }}
.card {{ border: 1px solid #e3e3e3; border-radius: 8px; margin: 1.5rem 0; overflow: hidden; }}
.card h2 {{ font-size: 1rem; margin: 0; padding: .6rem .9rem; background: #f6f6f6; border-bottom: 1px solid #e3e3e3; }}
.card h2 span {{ font-weight: 400; color: #666; }}
.pair {{ display: grid; grid-template-columns: 1fr 1fr; }}
.scan {{ background: #fafafa; border-right: 1px solid #eee; padding: .5rem; text-align: center; }}
.scan img {{ max-width: 100%; height: auto; box-shadow: 0 1px 4px rgba(0,0,0,.12); }}
.out {{ margin: 0; padding: .9rem; white-space: pre-wrap; word-break: break-word; font: 13px/1.55 ui-monospace, monospace; max-height: 82vh; overflow: auto; }}
@media (max-width: 820px) {{ .pair {{ grid-template-columns: 1fr; }} .scan {{ border-right: none; border-bottom: 1px solid #eee; }} }}
</style></head>
<body>
<h1>PaddleOCR-VL-1.6 on olmOCR-bench <code>old_scans</code></h1>
<p class="lede">Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning.
<mark>Highlighted</mark> spans are hallucinated CJK glyphs on English documents. Overall old_scans score:
<b>38.6%</b> (preliminary). Scans: Library of Congress via
<a href="https://huggingface.co/datasets/allenai/olmOCR-bench">allenai/olmOCR-bench</a> (ODC-BY).</p>
{"".join(cards)}
</body></html>"""

    out = Path(args.out)
    # the page declares charset=utf-8, so write it as UTF-8 regardless of locale
    out.write_text(page, encoding="utf-8")
    print(f"wrote {out} ({len(page) // 1024} KB)")


if __name__ == "__main__":
    main()
279 changes: 279 additions & 0 deletions experiments/olmocr-bench-oldscans/samples.html

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions experiments/olmocr-bench-oldscans/score.py
@@ -0,0 +1,35 @@
# /// script
# requires-python = ">=3.11,<3.12"
# dependencies = [
# "olmocr[bench]",
# "numpy", # olmocr.bench.tests imports numpy but doesn't declare it
# ]
# ///
"""
Job 2 (CPU): score the candidate produced by convert.py with the official
olmocr.bench.benchmark harness. old_scans = text-present / text-absent /
reading-order tests only -> pure string matching, no KaTeX/chromium needed.

Reads from DATA (default /bucket). Mount the same bucket the convert job wrote to,
read-only is fine:

hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN \\
-v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\
experiments/olmocr-bench-oldscans/score.py

Env:
DATA directory holding old_scans.jsonl + the candidate folder (default /bucket)
"""
import os
import subprocess
import sys

DATA = os.environ.get("DATA", "/bucket")

# --dir globs *.jsonl (only old_scans.jsonl is present) and treats each subdir
# other than "pdfs" as a candidate, so every candidate folder in the bucket
# (e.g. paddleocr_vl_16, paddleocr_vl_orig) is scored and ranked in the same run.
proc = subprocess.run(
    [sys.executable, "-m", "olmocr.bench.benchmark", "--dir", DATA],
    text=True,
)
sys.exit(proc.returncode)