davanstrien · Jun 27, 2026
diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md
@@ -0,0 +1,119 @@
+# PaddleOCR-VL-1.6 on olmOCR-bench (old_scans)
+
+Scores [PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the
+`old_scans` subset of [`allenai/olmOCR-bench`](https://huggingface.co/datasets/allenai/olmOCR-bench).
+Standalone experiment — not part of the `ocr_bench` library.
+
+`old_scans` = 98 single-page Library-of-Congress scans, 526 tests
+(text-present / text-absent / reading-order). No math or tables, so scoring needs
+no KaTeX/chromium.
+
+## Fidelity
+
+- **Scoring**: stock `olmocr.bench.benchmark`, unmodified.
+- **Conversion**: matches olmOCR-bench's own runner
+  [`run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py)
+  — `res.markdown["markdown_texts"]`, per page, default pipeline, no tuning. The
+  only difference is `pipeline_version="v1.6"` (as the model card specifies).
+- Runs inside PaddlePaddle's own image, so paddle/paddleocr are the vendor builds.
+
+## Method
+
+Two HF Jobs with a bucket as the handoff:
+
+| Step | Command | Hardware | Does |
+|------|---------|----------|------|
+| `convert.py` | `hf jobs run` | GPU `l4x1` | PaddleOCR-VL-1.6 → markdown → `sync_bucket` to the bucket |
+| `score.py` | `hf jobs uv run` | CPU `cpu-upgrade` | `olmocr.bench.benchmark` → score |
+
+Candidate files are written as `{splitext(pdf_field)}_pg{page}_repeat1.md`, the
+path `benchmark.py` looks them up by.
+
+- **Image**: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu`
+  — paddle 3.2.1 + paddleocr 3.6.0 + 1.6 weights baked in; python `/usr/local/bin/python3` (3.10); no `uv`.
+- **Bucket**: `hf://buckets/davanstrien/paddleocr-vl16-oldscans`
+
+## Run
+
+```bash
+BUCKET=hf://buckets/davanstrien/paddleocr-vl16-oldscans
+IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu
+
+# deliver the script into the bucket (re-cp after editing)
+hf buckets cp convert.py $BUCKET/convert.py
+
+# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test). On the first run this
+# also stages the source PDFs into the bucket under pdfs/ (which the scorer needs);
+# later runs reuse them from the mount instead of re-downloading.
+hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \
+    $IMAGE python3 /bucket/convert.py
+
+# score (ranks every candidate folder in the bucket together)
+hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro score.py
+```
+
+Add `-d` to detach, then `hf jobs wait <id>` / `hf jobs logs <id>`.
+
+## Configuration
+
+- **Flavor `l4x1`**: the image's CUDA build matches the `l4x1` driver; larger GPUs (l40s / a100) do not.
+- **Mount path `/bucket`**: `/data` is reserved by Jobs for local-script artifacts.
+- **`sync_bucket`, not a FUSE write**: the image runs as a non-root user that cannot write to the mount, so `convert.py` writes locally and uploads over HTTP; the mount is `:ro` (script delivery and reuse of already-staged PDFs).
+- **`pdfs/` folder**: `benchmark.py` requires `<dir>/pdfs` to exist; `convert.py` stages the source PDFs into the bucket on the first run and reuses them from the mount after, so no separate sync step is needed.
+- **`numpy`**: declared in `score.py` because `olmocr[bench]` imports it without declaring it.
+- **Versions**: `convert.py` takes `PIPELINE_VERSION` (default `v1.6`) and `CANDIDATE`. Run `-e PIPELINE_VERSION=v1 -e CANDIDATE=paddleocr_vl_orig` to also convert the original 0.9B PaddleOCR-VL (the leaderboard's 37.8); both candidates then sit in the bucket and the score job ranks them together.
+- **`old_scans_math`** variant: change `JSONL_PATH` in `convert.py`; `score.py` then also needs `playwright install chromium`.
+
+## Reproducibility
+
+This run uses floating refs. To make it bit-stable, pin:
+
+- the **image by digest** (`...paddleocr-vl@sha256:...`) instead of `:latest` — this pins paddle, paddleocr, and the weights together;
+- `allenai/olmOCR-bench` by `revision`;
+- `olmocr` to an exact version in `score.py`.
+
+Decoding is already **greedy** (the model's `generation_config.json` has no
+`do_sample`/`temperature`, so transformers defaults to greedy), so runs are
+deterministic modulo GPU-kernel nondeterminism — no sampling seed to pin.
+
+## Result
+
+PaddleOCR-VL on olmOCR-bench `old_scans` — default pipeline, no tuning, greedy
+(2026-06-27). Both versions scored through the same harness:
+
+| Version | **old_scans** | present | absent | order | baseline |
+|---|---|---|---|---|---|
+| v1.6 | **38.6%** (203/526) | 31.2 | 95.7 | 27.7 | 84.7 |
+| v1 (original 0.9B) | **38.2%** (201/526) | 32.3 | 95.7 | 24.9 | 88.8 |
+
+`old_scans` = the present/absent/order tests = the leaderboard's "Old scans" column.
+
+**Harness validated against the published figure.** olmOCR-bench lists the original
+`PaddleOCR-VL` (unversioned; its `run_paddlevl` runner is dated 2025-10-20, pre-1.6)
+at **Old scans = 37.8**. Running that same original (`v1`) through this harness gives
+**38.2** — within 0.4 pt, inside the ±3.6 % CI. So our convert + scoring reproduce
+the published number; the harness is sound.
+
+**v1.6's gains don't transfer to old scans.** v1.6 (38.6) and the original v1 (38.2)
+are statistically indistinguishable here — the upgrade that made v1.6 SOTA on
+OmniDocBench buys nothing on degraded historical scans. v1.6 even regresses slightly
+on `baseline` (84.7 vs 88.8): it emits *more* CJK/Japanese disallowed-character
+hallucinations (场, 景, 民, 生, ら …) on English scans than the original did. See
+`samples.html` (regenerate via `gen_samples.py`) for scan↔output pairs with the
+glyphs highlighted.
+
+**Reading the harness output.** "38.6" / "38.2" are the `old_scans.jsonl` sub-scores.
+`olmocr.bench.benchmark` *also* prints an `Average Score` (61.6 % / 63.5 %) = mean of
+the old_scans sub-score and the auto-baseline category — that is **not** the
+leaderboard "Old scans" figure; don't quote it. And the per-version `baseline` here
+(auto-BaselineTest over only the 98 old_scans PDFs) is **not** the leaderboard's
+"Base" column (the same test over the whole ~1,400-PDF benchmark, 98.5 %).
+
+**Size context** (published no-anchor Old scans): olmOCR 43.7, GPT-4o 40.9,
+Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, PaddleOCR-VL
+ties the 7B Qwen2.5-VL.
+
+> **Status: validated.** The harness reproduces the published original-PaddleOCR-VL
+> figure (37.8 → 38.2, within CI), and v1.6 (38.6) is statistically the same on
+> old_scans. Greedy/deterministic decoding; outputs spot-checked vs source scans.
+> Pin the image digest (see Reproducibility) before citing externally.
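The figures quoted in the README's Result section can be re-derived from its own table. The sketch below is a hand check, not part of the harness: it assumes, as the README states, that `Average Score` is the plain mean of the old_scans sub-score and the auto-baseline category. The names `sub_score` and `average_score` are illustrative only.

```python
# Pass counts and per-version auto-baseline figures copied from the Result table.
RESULTS = {
    "v1.6": {"passed": 203, "total": 526, "baseline": 84.7},
    "v1": {"passed": 201, "total": 526, "baseline": 88.8},
}
PUBLISHED_V1 = 37.8    # leaderboard "Old scans" figure for the original model
CI_HALF_WIDTH = 3.6    # the +/- interval quoted in the README


def sub_score(passed: int, total: int) -> float:
    """old_scans sub-score in percent: the number to quote."""
    return 100.0 * passed / total


def average_score(sub: float, baseline: float) -> float:
    """Assumed definition of the harness's Average Score: mean of the two categories."""
    return (sub + baseline) / 2.0


for name, r in RESULTS.items():
    sub = sub_score(r["passed"], r["total"])
    avg = average_score(sub, r["baseline"])
    print(f"{name}: old_scans={sub:.1f}  average={avg:.1f}")

gap = sub_score(201, 526) - PUBLISHED_V1
print(f"v1 vs published: {gap:+.1f} pt (within CI: {abs(gap) <= CI_HALF_WIDTH})")
```

Keeping the two functions separate mirrors the README's warning: the sub-score is comparable to the leaderboard column, the average is not.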
diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py
@@ -0,0 +1,126 @@
+"""
+Job 1 (GPU): run PaddleOCR-VL-1.6 over the olmOCR-bench `old_scans` subset and
+write candidate markdown in the exact layout `olmocr.bench.benchmark` expects.
+
+Runs with PaddlePaddle's docker image (paddle 3.2.1 + paddleocr 3.6.0 + the v1.6
+weights preinstalled) via the image's python3.10 -- not uv. Every import here
+(paddleocr, huggingface_hub, stdlib) is already in the image, so there is no
+PEP 723 header.
+
+Fidelity: the markdown extraction matches olmOCR-bench's own runner
+(`olmocr/bench/runners/run_paddlevl.py`) -- `res.markdown["markdown_texts"]`, per
+page, with a bare default pipeline and NO tuning (no max_pixels / prompts / dpi).
+The one intentional difference is `pipeline_version="v1.6"`: upstream calls
+`PaddleOCRVL()` with no version (an earlier PaddleOCR-VL), while this measures 1.6
+as its model card specifies. So we follow the bench runner's extraction and
+PaddlePaddle's documented v1.6 defaults.
+
+This image runs as the non-root `paddleocr` user, which CANNOT write the bucket
+FUSE mount (root-owned). So we write outputs to a container-local dir and push
+them with `sync_bucket()` (mount-free HTTP upload) at the end. The bucket is
+mounted read-only to deliver this script and to serve PDFs staged by earlier runs.
+
+Delivery + run (see README for full commands):
+    hf buckets cp convert.py hf://buckets/davanstrien/paddleocr-vl16-oldscans/convert.py
+    hf jobs run --flavor l4x1 -s HF_TOKEN \\
+        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\
+        ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \\
+        python3 /bucket/convert.py
+
+Env:
+  OUT_ROOT          local staging dir (default /tmp/olmocr-oldscans-out)
+  BUCKET            bucket to sync results to (default below)
+  PIPELINE_VERSION  PaddleOCRVL version (default v1.6). "v1" = the original 0.9B
+                    PaddleOCR-VL = the version on the olmOCR-bench leaderboard
+                    (37.8) -> run it (with a distinct CANDIDATE) for a strict
+                    same-version reproduction.
+  CANDIDATE         output subfolder + model label (default paddleocr_vl_16)
+  LIMIT             cap number of PDFs (plumbing smoke test; 0 = all). With a cap,
+                    the un-converted docs are scored FAILED, so the result is not
+                    a representative score -- use a smoke run only to check plumbing.
+"""
+import json
+import os
+import shutil
+from collections import defaultdict
+from pathlib import Path
+
+from huggingface_hub import hf_hub_download, sync_bucket
+from paddleocr import PaddleOCRVL
+
+BENCH_REPO = "allenai/olmOCR-bench"
+JSONL_PATH = "bench_data/old_scans.jsonl"
+CANDIDATE = os.environ.get("CANDIDATE", "paddleocr_vl_16")   # any name except "pdfs"
+PIPELINE_VERSION = os.environ.get("PIPELINE_VERSION", "v1.6")
+OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/tmp/olmocr-oldscans-out"))
+BUCKET = os.environ.get("BUCKET", "hf://buckets/davanstrien/paddleocr-vl16-oldscans")
+LIMIT = int(os.environ.get("LIMIT", "0"))
+
+# ---- test manifest ----------------------------------------------------------
+jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
+tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()]
+
+pages_by_pdf = defaultdict(set)
+for t in tests:
+    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
+print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs -> {OUT_ROOT}", flush=True)
+
+# ---- model (vendor default, no tuning) --------------------------------------
+print(f"pipeline_version={PIPELINE_VERSION}  candidate={CANDIDATE}", flush=True)
+pipeline = PaddleOCRVL(pipeline_version=PIPELINE_VERSION)
+
+
+def resolve_pdf(pdf_field):
+    """Reuse the PDF already on the bucket mount if present; otherwise download it
+    once and stage it under OUT_ROOT/pdfs so sync_bucket adds it to the bucket for
+    the scorer (benchmark.py needs <dir>/pdfs). Avoids a separate download + sync,
+    and skips re-downloading on later runs (the bucket is mounted at /bucket)."""
+    mounted = Path("/bucket/pdfs") / pdf_field
+    if mounted.is_file():
+        return str(mounted)
+    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
+        try:
+            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
+        except Exception:
+            continue
+        dest = OUT_ROOT / "pdfs" / pdf_field
+        dest.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(local, dest)
+        return local
+    raise FileNotFoundError(pdf_field)
+
+
+def page_markdowns(pdf_path):
+    """Per-page markdown, exactly as run_paddlevl.py does: res.markdown['markdown_texts']."""
+    return [res.markdown["markdown_texts"] for res in pipeline.predict(str(pdf_path))]
+
+
+# ---- convert ----------------------------------------------------------------
+cand_dir = OUT_ROOT / CANDIDATE
+items = sorted(pages_by_pdf.items())
+if LIMIT:
+    items = items[:LIMIT]
+    print(f"LIMIT={LIMIT} (plumbing smoke test -- expect a low score)", flush=True)
+
+for i, (pdf_field, pages) in enumerate(items, 1):
+    try:
+        mds = page_markdowns(resolve_pdf(pdf_field))
+    except Exception as e:  # keep going; a missing page just fails its tests
+        print(f"[WARN] {pdf_field}: {e}", flush=True)
+        mds = []
+    md_base = os.path.splitext(pdf_field)[0]          # mirrors benchmark.py exactly
+    for pg in pages:
+        md = mds[pg - 1] if 0 <= pg - 1 < len(mds) else ""   # 1-indexed page -> 0-indexed
+        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
+        fp.parent.mkdir(parents=True, exist_ok=True)
+        fp.write_text(md, encoding="utf-8")   # outputs may contain non-ASCII glyphs
+    n = sum(len(m) for m in mds)              # total chars across all pages
+    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True)
+
+# the scorer needs the jsonl next to the candidate folder (named after JSONL_PATH)
+shutil.copy(jsonl_local, OUT_ROOT / Path(JSONL_PATH).name)
+
+# push results to the bucket over HTTP (the FUSE mount is not writable as non-root)
+print(f"Syncing {OUT_ROOT} -> {BUCKET}", flush=True)
+sync_bucket(str(OUT_ROOT), BUCKET)
+print("Done.", flush=True)
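The candidate naming and page lookup in `convert.py` can be exercised in isolation, which is handy before spending GPU time. This is a minimal sketch that restates the same two expressions as functions; `candidate_path`, `page_text`, and the `repeat` parameter are hypothetical names introduced here, and the example output directory is arbitrary.

```python
import os
from pathlib import Path


def candidate_path(cand_dir: Path, pdf_field: str, page: int, repeat: int = 1) -> Path:
    """Build the markdown path for one page, following the naming used in convert.py."""
    md_base = os.path.splitext(pdf_field)[0]   # "old_scans/5.pdf" -> "old_scans/5"
    return cand_dir / f"{md_base}_pg{page}_repeat{repeat}.md"


def page_text(mds: list[str], page: int) -> str:
    """Map a 1-indexed test page onto the 0-indexed list of per-page markdown.

    A page that the model did not produce yields an empty string, so its
    tests fail instead of the whole run aborting.
    """
    idx = page - 1
    return mds[idx] if 0 <= idx < len(mds) else ""


if __name__ == "__main__":
    out = Path("/tmp/olmocr-oldscans-out/paddleocr_vl_16")
    print(candidate_path(out, "old_scans/5.pdf", 1))
    print(repr(page_text(["first page"], 2)))
```

Because the `pdf` field carries its own subfolder, the candidate tree mirrors the layout under `pdfs/`.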
diff --git a/experiments/olmocr-bench-oldscans/gen_samples.py b/experiments/olmocr-bench-oldscans/gen_samples.py
@@ -0,0 +1,106 @@
+"""Generate a self-contained samples.html: source scan vs. PaddleOCR-VL-1.6
+output for a handful of old_scans docs, with hallucinated CJK glyphs highlighted.
+Scans are embedded as base64 JPEG so the page is a single portable file.
+
+Populate the data dir from the bucket first, then render:
+
+    B=hf://buckets/davanstrien/paddleocr-vl16-oldscans
+    for id in 1 5 10 27 30 50 56; do
+      hf buckets cp $B/pdfs/old_scans/$id.pdf            samples_data/$id.pdf
+      hf buckets cp $B/paddleocr_vl_16/old_scans/${id}_pg1_repeat1.md samples_data/$id.md
+    done
+    uv run --with pypdfium2 --with pillow gen_samples.py --data samples_data
+"""
+import argparse
+import base64
+import html
+import io
+import re
+from pathlib import Path
+
+import pypdfium2 as pdfium
+from PIL import Image  # noqa: F401  (pypdfium2 .to_pil needs Pillow installed)
+
+DOCS = [
+    ("5", "Typed letter — near-perfect transcription"),
+    ("10", "Typed letter — cursive signature dropped; 'Sincerely,' loops x3"),
+    ("1", "Handwritten letter — readable, character-level errors"),
+    ("30", "Typed letter — Chinese 场景 inserted mid-sentence"),
+    ("56", "Q&A catechism — 源 emitted for 'sources'"),
+    ("50", "Dense cursive — garbled + multiple CJK glyphs"),
+    ("27", "Ornate blackletter header skipped + cursive garbled"),
+]
+
+CJK = re.compile(r"[\u3040-\u30ff\u3400-\u9fff\uff00-\uffef]+")  # kana, CJK ideographs, half/fullwidth forms
+
+
+def scan_b64(pdf_path: Path, width: int = 1000) -> str:
+    pdf = pdfium.PdfDocument(str(pdf_path))
+    page = pdf[0]
+    scale = width / page.get_size()[0]
+    pil = page.render(scale=scale).to_pil().convert("RGB")
+    buf = io.BytesIO()
+    pil.save(buf, "JPEG", quality=80)
+    return base64.b64encode(buf.getvalue()).decode()
+
+
+def render_output(text: str) -> str:
+    return CJK.sub(lambda m: f"<mark>{m.group(0)}</mark>", html.escape(text))
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--data", default="samples_data")
+    ap.add_argument("--out", default="samples.html")
+    args = ap.parse_args()
+    data = Path(args.data)
+
+    cards = []
+    for did, cap in DOCS:
+        img = scan_b64(data / f"{did}.pdf")
+        md = (data / f"{did}.md").read_text(encoding="utf-8")
+        cards.append(
+            f"""
+    <section class="card">
+      <h2>old_scans/{did} <span>— {html.escape(cap)}</span></h2>
+      <div class="pair">
+        <div class="scan"><img loading="lazy" src="data:image/jpeg;base64,{img}" alt="scan {did}"></div>
+        <pre class="out">{render_output(md)}</pre>
+      </div>
+    </section>"""
+        )
+
+    page = f"""<!doctype html>
+<html lang="en"><head><meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples</title>
+<style>
+  body {{ font: 15px/1.5 -apple-system, system-ui, sans-serif; max-width: 1200px; margin: 2rem auto; padding: 0 1rem; color: #1a1a1a; }}
+  h1 {{ margin-bottom: .2rem; }}
+  .lede {{ color: #555; }}
+  mark {{ background: #ffd54f; padding: 0 2px; border-radius: 2px; }}
+  .card {{ border: 1px solid #e3e3e3; border-radius: 8px; margin: 1.5rem 0; overflow: hidden; }}
+  .card h2 {{ font-size: 1rem; margin: 0; padding: .6rem .9rem; background: #f6f6f6; border-bottom: 1px solid #e3e3e3; }}
+  .card h2 span {{ font-weight: 400; color: #666; }}
+  .pair {{ display: grid; grid-template-columns: 1fr 1fr; }}
+  .scan {{ background: #fafafa; border-right: 1px solid #eee; padding: .5rem; text-align: center; }}
+  .scan img {{ max-width: 100%; height: auto; box-shadow: 0 1px 4px rgba(0,0,0,.12); }}
+  .out {{ margin: 0; padding: .9rem; white-space: pre-wrap; word-break: break-word; font: 13px/1.55 ui-monospace, monospace; max-height: 82vh; overflow: auto; }}
+  @media (max-width: 820px) {{ .pair {{ grid-template-columns: 1fr; }} .scan {{ border-right: none; border-bottom: 1px solid #eee; }} }}
+</style></head>
+<body>
+  <h1>PaddleOCR-VL-1.6 on olmOCR-bench <code>old_scans</code></h1>
+  <p class="lede">Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning.
+  <mark>Highlighted</mark> spans are hallucinated CJK glyphs on English documents. Overall old_scans score:
+  <b>38.6%</b>. Scans: Library of Congress via
+  <a href="https://huggingface.co/datasets/allenai/olmOCR-bench">allenai/olmOCR-bench</a> (ODC-BY).</p>
+  {"".join(cards)}
+</body></html>"""
+
+    out = Path(args.out)
+    out.write_text(page, encoding="utf-8")
+    print(f"wrote {out} ({len(page) // 1024} KB)")
+
+
+if __name__ == "__main__":
+    main()
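The glyph highlighting in `gen_samples.py` escapes the model output first and wraps matching runs in `<mark>` afterwards, so the markup it adds is never escaped itself. A small standalone sketch of that step follows. The character class used here is an assumption about intent: it includes the kana block (U+3040 to U+30FF) alongside the ideograph and half/fullwidth ranges, so that glyphs such as ら, which the README lists among the hallucinations, are matched.

```python
import html
import re

# Kana, CJK ideographs (including Extension A), and half/fullwidth forms.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u9fff\uff00-\uffef]+")


def render_output(text: str) -> str:
    """Escape HTML first, then wrap each run of CJK glyphs in <mark>."""
    return CJK.sub(lambda m: f"<mark>{m.group(0)}</mark>", html.escape(text))


def count_cjk(text: str) -> int:
    """Number of matching characters: a quick way to rank pages by hallucination."""
    return sum(len(m.group(0)) for m in CJK.finditer(text))


if __name__ == "__main__":
    sample = "the 场景 of <1900> ら"
    print(render_output(sample))
    print(count_cjk(sample))
```

`count_cjk` is a hypothetical helper, not something the script defines; it only illustrates how the same pattern could be reused to pick sample pages.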
diff --git a/experiments/olmocr-bench-oldscans/samples.html b/experiments/olmocr-bench-oldscans/samples.html
diff --git a/experiments/olmocr-bench-oldscans/score.py b/experiments/olmocr-bench-oldscans/score.py
@@ -0,0 +1,35 @@
+# /// script
+# requires-python = ">=3.11,<3.12"
+# dependencies = [
+#     "olmocr[bench]",
+#     "numpy",  # olmocr.bench.tests imports numpy but doesn't declare it
+# ]
+# ///
+"""
+Job 2 (CPU): score the candidate produced by convert.py with the official
+olmocr.bench.benchmark harness. old_scans = text-present / text-absent /
+reading-order tests only -> pure string matching, no KaTeX/chromium needed.
+
+Reads from DATA (default /bucket). Mount the same bucket the convert job wrote to,
+read-only is fine:
+
+    hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN \\
+        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\
+        experiments/olmocr-bench-oldscans/score.py
+
+Env:
+  DATA   directory holding old_scans.jsonl + the candidate folder (default /bucket)
+"""
+import os
+import subprocess
+import sys
+
+DATA = os.environ.get("DATA", "/bucket")
+
+# --dir globs *.jsonl (only old_scans.jsonl is present) and treats each subdir
+# other than "pdfs" as a candidate, so every candidate in the bucket is ranked together.
+proc = subprocess.run(
+    [sys.executable, "-m", "olmocr.bench.benchmark", "--dir", DATA],
+    text=True,
+)
+sys.exit(proc.returncode)
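Before launching the score job it can help to confirm that the bucket has the layout the comment in `score.py` describes: the test `.jsonl` files at the top level, a `pdfs` folder, and one subfolder per candidate. The sketch below is a pre-flight check under that assumption. It is not olmocr's implementation, and `discover_inputs` is a hypothetical name.

```python
from pathlib import Path


def discover_inputs(data_dir: str) -> tuple[list[str], list[str]]:
    """Return (jsonl file names, candidate folder names) found under data_dir.

    Assumes the layout described in score.py: top-level *.jsonl files, and
    every subdirectory except "pdfs" counted as a candidate.
    """
    root = Path(data_dir)
    jsonls = sorted(p.name for p in root.glob("*.jsonl"))
    candidates = sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name != "pdfs"
    )
    return jsonls, candidates


def check_layout(data_dir: str) -> list[str]:
    """Collect human-readable problems instead of raising, so all are reported."""
    problems = []
    jsonls, candidates = discover_inputs(data_dir)
    if not jsonls:
        problems.append("no *.jsonl test file found")
    if not (Path(data_dir) / "pdfs").is_dir():
        problems.append("pdfs/ folder is missing")
    if not candidates:
        problems.append("no candidate folder found")
    return problems
```

Running `check_layout("/bucket")` inside a job, or against a local copy of the bucket, returns an empty list when all three pieces are present.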