diff --git a/experiments/olmocr-bench-oldscans/README.md b/experiments/olmocr-bench-oldscans/README.md new file mode 100644 index 0000000..68d06ef --- /dev/null +++ b/experiments/olmocr-bench-oldscans/README.md @@ -0,0 +1,119 @@ +# PaddleOCR-VL-1.6 on olmOCR-bench (old_scans) + +Scores [PaddleOCR-VL-1.6](https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6) on the +`old_scans` subset of [`allenai/olmOCR-bench`](https://huggingface.co/datasets/allenai/olmOCR-bench). +Standalone experiment — not part of the `ocr_bench` library. + +`old_scans` = 98 single-page Library-of-Congress scans, 526 tests +(text-present / text-absent / reading-order). No math or tables, so scoring needs +no KaTeX/chromium. + +## Fidelity + +- **Scoring**: stock `olmocr.bench.benchmark`, unmodified. +- **Conversion**: matches olmOCR-bench's own runner + [`run_paddlevl.py`](https://github.com/allenai/olmocr/blob/main/olmocr/bench/runners/run_paddlevl.py) + — `res.markdown["markdown_texts"]`, per page, default pipeline, no tuning. The + only difference is `pipeline_version="v1.6"` (as the model card specifies). +- Runs inside PaddlePaddle's own image, so paddle/paddleocr are the vendor builds. + +## Method + +Two HF Jobs with a bucket as the handoff: + +| Step | Command | Hardware | Does | +|------|---------|----------|------| +| `convert.py` | `hf jobs run` | GPU `l4x1` | PaddleOCR-VL-1.6 → markdown → `sync_bucket` to the bucket | +| `score.py` | `hf jobs uv run` | CPU `cpu-upgrade` | `olmocr.bench.benchmark` → score | + +Candidate files are written as `{splitext(pdf_field)}_pg{page}_repeat1.md`, the +path `benchmark.py` looks them up by. + +- **Image**: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu` + — paddle 3.2.1 + paddleocr 3.6.0 + 1.6 weights baked in; python `/usr/local/bin/python3` (3.10); no `uv`. 
+- **Bucket**: `hf://buckets/davanstrien/paddleocr-vl16-oldscans` + +## Run + +```bash +BUCKET=hf://buckets/davanstrien/paddleocr-vl16-oldscans +IMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu + +# deliver the script into the bucket (re-cp after editing) +hf buckets cp convert.py $BUCKET/convert.py + +# convert — 98 PDFs (add -e LIMIT=3 for a 3-PDF smoke test). On the first run this +# also stages the source PDFs into the bucket under pdfs/ (which the scorer needs); +# later runs reuse them from the mount instead of re-downloading. +hf jobs run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $BUCKET:/bucket:ro \ + $IMAGE python3 /bucket/convert.py + +# score (ranks every candidate folder in the bucket together) +hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $BUCKET:/bucket:ro score.py +``` + +Add `-d` to detach, then `hf jobs wait ` / `hf jobs logs `. + +## Configuration + +- **Flavor `l4x1`**: the image's CUDA build matches the `l4x1` driver; larger GPUs (l40s / a100) do not. +- **Mount path `/bucket`**: `/data` is reserved by Jobs for local-script artifacts. +- **`sync_bucket`, not a FUSE write**: the image runs as a non-root user that cannot write the mount, so `convert.py` writes locally and uploads over HTTP; the mount is `:ro` (script delivery only). +- **`pdfs/` folder**: `benchmark.py` requires `/pdfs` to exist; `convert.py` stages the source PDFs into the bucket on the first run and reuses them from the mount after, so no separate sync step is needed. +- **`numpy`**: declared in `score.py` because `olmocr[bench]` imports it without declaring it. +- **Versions**: `convert.py` takes `PIPELINE_VERSION` (default `v1.6`) and `CANDIDATE`. Run `-e PIPELINE_VERSION=v1 -e CANDIDATE=paddleocr_vl_orig` to also convert the original 0.9B PaddleOCR-VL (the leaderboard's 37.8); both candidates then sit in the bucket and the score job ranks them together. 
+- **`old_scans_math`** variant: change `JSONL_PATH` in `convert.py`; `score.py` then also needs `playwright install chromium`. + +## Reproducibility + +This run uses floating refs. To make it bit-stable, pin: + +- the **image by digest** (`...paddleocr-vl@sha256:...`) instead of `:latest` — this pins paddle, paddleocr, and the weights together; +- `allenai/olmOCR-bench` by `revision`; +- `olmocr` to an exact version in `score.py`. + +Decoding is already **greedy** (the model's `generation_config.json` has no +`do_sample`/`temperature`, so transformers defaults to greedy), so runs are +deterministic modulo GPU-kernel nondeterminism — no sampling seed to pin. + +## Result + +PaddleOCR-VL on olmOCR-bench `old_scans` — default pipeline, no tuning, greedy +(2026-06-27). Both versions scored through the same harness: + +| Version | **old_scans** | present | absent | order | baseline | +|---|---|---|---|---|---| +| v1.6 | **38.6%** (203/526) | 31.2 | 95.7 | 27.7 | 84.7 | +| v1 (original 0.9B) | **38.2%** (201/526) | 32.3 | 95.7 | 24.9 | 88.8 | + +`old_scans` = the present/absent/order tests = the leaderboard's "Old scans" column. + +**Harness validated against the published figure.** olmOCR-bench lists the original +`PaddleOCR-VL` (unversioned; its `run_paddlevl` runner is dated 2025-10-20, pre-1.6) +at **Old scans = 37.8**. Running that same original (`v1`) through this harness gives +**38.2** — within 0.4 pt, inside the ±3.6 % CI. So our convert + scoring reproduce +the published number; the harness is sound. + +**v1.6's gains don't transfer to old scans.** v1.6 (38.6) and the original v1 (38.2) +are statistically indistinguishable here — the upgrade that made v1.6 SOTA on +OmniDocBench buys nothing on degraded historical scans. v1.6 even regresses slightly +on `baseline` (84.7 vs 88.8): it emits *more* CJK/Japanese disallowed-character +hallucinations (场, 景, 民, 生, ら …) on English scans than the original did. 
See +`samples.html` (regenerate via `gen_samples.py`) for scan↔output pairs with the +glyphs highlighted. + +**Reading the harness output.** "38.6" / "38.2" are the `old_scans.jsonl` sub-scores. +`olmocr.bench.benchmark` *also* prints an `Average Score` (61.6 % / 63.5 %) = mean of +the old_scans sub-score and the auto-baseline category — that is **not** the +leaderboard "Old scans" figure; don't quote it. And the per-version `baseline` here +(auto-BaselineTest over only the 98 old_scans PDFs) is **not** the leaderboard's +"Base" column (the same test over the whole ~1,400-PDF benchmark, 98.5 %). + +**Size context** (published no-anchor Old scans): olmOCR 43.7, GPT-4o 40.9, +Qwen2.5-VL 38.6, Gemini-Flash-2 27.8, GOT-OCR (0.58B) 22.1. At 0.9B, PaddleOCR-VL +ties the 7B Qwen2.5-VL. + +> **Status: validated.** The harness reproduces the published original-PaddleOCR-VL +> figure (37.8 → 38.2, within CI), and v1.6 (38.6) is statistically the same on +> old_scans. Greedy/deterministic decoding; outputs spot-checked vs source scans. +> Pin the image digest (see Reproducibility) before citing externally. diff --git a/experiments/olmocr-bench-oldscans/convert.py b/experiments/olmocr-bench-oldscans/convert.py new file mode 100644 index 0000000..c9198e6 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/convert.py @@ -0,0 +1,126 @@ +""" +Job 1 (GPU): run PaddleOCR-VL-1.6 over the olmOCR-bench `old_scans` subset and +write candidate markdown in the exact layout `olmocr.bench.benchmark` expects. + +Runs with PaddlePaddle's docker image (paddle 3.2.1 + paddleocr 3.6.0 + the v1.6 +weights preinstalled) via the image's python3.10 -- not uv. Every import here +(paddleocr, huggingface_hub, stdlib) is already in the image, so there is no +PEP 723 header. 
+ +Fidelity: the markdown extraction matches olmOCR-bench's own runner +(`olmocr/bench/runners/run_paddlevl.py`) -- `res.markdown["markdown_texts"]`, per +page, with a bare default pipeline and NO tuning (no max_pixels / prompts / dpi). +The one intentional difference is `pipeline_version="v1.6"`: upstream calls +`PaddleOCRVL()` with no version (an earlier PaddleOCR-VL), while this measures 1.6 +as its model card specifies. So we follow the bench runner's extraction and +PaddlePaddle's documented v1.6 defaults. + +This image runs as the non-root `paddleocr` user, which CANNOT write the bucket +FUSE mount (root-owned). So we write outputs to a container-local dir and push +them with `sync_bucket()` (mount-free HTTP upload) at the end. The bucket is +mounted read-only purely to deliver this script. + +Delivery + run (see README for full commands): + hf buckets cp convert.py hf://buckets/davanstrien/paddleocr-vl16-oldscans/convert.py + hf jobs run --flavor l4x1 -s HF_TOKEN \ + -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \ + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \ + python3 /bucket/convert.py + +Env: + OUT_ROOT local staging dir (default /tmp/olmocr-oldscans-out) + BUCKET bucket to sync results to (default below) + PIPELINE_VERSION PaddleOCRVL version (default v1.6). "v1" = the original 0.9B + PaddleOCR-VL = the version on the olmOCR-bench leaderboard + (37.8) -> run it (with a distinct CANDIDATE) for a strict + same-version reproduction. + CANDIDATE output subfolder + model label (default paddleocr_vl_16) + LIMIT cap number of PDFs (plumbing smoke test; 0 = all). With a cap, + the un-converted docs are scored FAILED, so the result is not + a representative score -- use a smoke run only to check plumbing. 
+""" +import json +import os +import shutil +from collections import defaultdict +from pathlib import Path + +from huggingface_hub import hf_hub_download, sync_bucket +from paddleocr import PaddleOCRVL + +BENCH_REPO = "allenai/olmOCR-bench" +JSONL_PATH = "bench_data/old_scans.jsonl" +CANDIDATE = os.environ.get("CANDIDATE", "paddleocr_vl_16") # any name except "pdfs" +PIPELINE_VERSION = os.environ.get("PIPELINE_VERSION", "v1.6") +OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/tmp/olmocr-oldscans-out")) +BUCKET = os.environ.get("BUCKET", "hf://buckets/davanstrien/paddleocr-vl16-oldscans") +LIMIT = int(os.environ.get("LIMIT", "0")) + +# ---- test manifest ---------------------------------------------------------- +jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset") +tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()] + +pages_by_pdf = defaultdict(set) +for t in tests: + pages_by_pdf[t["pdf"]].add(int(t.get("page", 1))) +print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs -> {OUT_ROOT}", flush=True) + +# ---- model (vendor default, no tuning) -------------------------------------- +print(f"pipeline_version={PIPELINE_VERSION} candidate={CANDIDATE}", flush=True) +pipeline = PaddleOCRVL(pipeline_version=PIPELINE_VERSION) + + +def resolve_pdf(pdf_field): + """Reuse the PDF already on the bucket mount if present; otherwise download it + once and stage it under OUT_ROOT/pdfs so sync_bucket adds it to the bucket for + the scorer (benchmark.py needs /pdfs). 
Avoids a separate download + sync, + and skips re-downloading on later runs (the bucket is mounted at /bucket).""" + mounted = Path("/bucket/pdfs") / pdf_field + if mounted.is_file(): + return str(mounted) + for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field): + try: + local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset") + except Exception: + continue + dest = OUT_ROOT / "pdfs" / pdf_field + dest.parent.mkdir(parents=True, exist_ok=True) + shutil.copy(local, dest) + return local + raise FileNotFoundError(pdf_field) + + +def page_markdowns(pdf_path): + """Per-page markdown, exactly as run_paddlevl.py does: res.markdown['markdown_texts'].""" + return [res.markdown["markdown_texts"] for res in pipeline.predict(str(pdf_path))] + + +# ---- convert ---------------------------------------------------------------- +cand_dir = OUT_ROOT / CANDIDATE +items = sorted(pages_by_pdf.items()) +if LIMIT: + items = items[:LIMIT] + print(f"LIMIT={LIMIT} (plumbing smoke test -- expect a low score)", flush=True) + +for i, (pdf_field, pages) in enumerate(items, 1): + try: + mds = page_markdowns(resolve_pdf(pdf_field)) + except Exception as e: # keep going; a missing page just fails its tests + print(f"[WARN] {pdf_field}: {e}", flush=True) + mds = [] + md_base = os.path.splitext(pdf_field)[0] # mirrors benchmark.py exactly + for pg in pages: + md = mds[pg - 1] if 0 <= pg - 1 < len(mds) else "" # 1-indexed page -> 0-indexed + fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md" + fp.parent.mkdir(parents=True, exist_ok=True) + fp.write_text(md) + n = len(mds[0]) if mds else 0 + print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True) + +# the scorer needs the jsonl next to the candidate folder +(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text()) + +# push results to the bucket over HTTP (the FUSE mount is not writable as non-root) +print(f"Syncing {OUT_ROOT} -> {BUCKET}", flush=True) +sync_bucket(str(OUT_ROOT), BUCKET) 
+print("Done.", flush=True) diff --git a/experiments/olmocr-bench-oldscans/gen_samples.py b/experiments/olmocr-bench-oldscans/gen_samples.py new file mode 100644 index 0000000..4f0a4b2 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/gen_samples.py @@ -0,0 +1,106 @@ +"""Generate a self-contained samples.html: source scan vs. PaddleOCR-VL-1.6 +output for a handful of old_scans docs, with hallucinated CJK glyphs highlighted. +Scans are embedded as base64 JPEG so the page is a single portable file. + +Populate the data dir from the bucket first, then render: + + B=hf://buckets/davanstrien/paddleocr-vl16-oldscans + for id in 1 5 10 27 30 50 56; do + hf buckets cp $B/pdfs/old_scans/$id.pdf samples_data/$id.pdf + hf buckets cp $B/paddleocr_vl_16/old_scans/${id}_pg1_repeat1.md samples_data/$id.md + done + uv run --with pypdfium2 --with pillow gen_samples.py --data samples_data +""" +import argparse +import base64 +import html +import io +import re +from pathlib import Path + +import pypdfium2 as pdfium +from PIL import Image # noqa: F401 (pypdfium2 .to_pil needs Pillow installed) + +DOCS = [ + ("5", "Typed letter — near-perfect transcription"), + ("10", "Typed letter — cursive signature dropped; 'Sincerely,' loops x3"), + ("1", "Handwritten letter — readable, character-level errors"), + ("30", "Typed letter — Chinese 场景 inserted mid-sentence"), + ("56", "Q&A catechism — 源 emitted for 'sources'"), + ("50", "Dense cursive — garbled + multiple CJK glyphs"), + ("27", "Ornate blackletter header skipped + cursive garbled"), +] + +CJK = re.compile(r"[\u3040-\u30ff㐀-鿿＀-￯]+") # kana range added so Japanese glyphs (e.g. ら) are matched too + + +def scan_b64(pdf_path: Path, width: int = 1000) -> str: + pdf = pdfium.PdfDocument(str(pdf_path)) + page = pdf[0] + scale = width / page.get_size()[0] + pil = page.render(scale=scale).to_pil().convert("RGB") + buf = io.BytesIO() + pil.save(buf, "JPEG", quality=80) + return base64.b64encode(buf.getvalue()).decode() + + +def render_output(text: str) -> str: + return CJK.sub(lambda m: f"{m.group(0)}",
html.escape(text)) + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--data", default="samples_data") + ap.add_argument("--out", default="samples.html") + args = ap.parse_args() + data = Path(args.data) + + cards = [] + for did, cap in DOCS: + img = scan_b64(data / f"{did}.pdf") + md = (data / f"{did}.md").read_text() + cards.append( + f""" +
+

old_scans/{did} — {html.escape(cap)}

+
+
scan {did}
+
{render_output(md)}
+
+
""" + ) + + page = f""" + + +PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples + + +

PaddleOCR-VL-1.6 on olmOCR-bench old_scans

+

Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning. + Highlighted spans are hallucinated CJK glyphs on English documents. Overall old_scans score: + 38.6% (preliminary). Scans: Library of Congress via + allenai/olmOCR-bench (ODC-BY).

+ {"".join(cards)} +""" + + out = Path(args.out) + out.write_text(page) + print(f"wrote {out} ({len(page) // 1024} KB)") + + +if __name__ == "__main__": + main() diff --git a/experiments/olmocr-bench-oldscans/samples.html b/experiments/olmocr-bench-oldscans/samples.html new file mode 100644 index 0000000..bea7456 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/samples.html @@ -0,0 +1,279 @@ + + + +PaddleOCR-VL-1.6 — olmOCR-bench old_scans samples + + +

PaddleOCR-VL-1.6 on olmOCR-bench old_scans

+

Source scan (left) vs. the model's markdown output (right) — default v1.6 pipeline, no tuning. + Highlighted spans are hallucinated CJK glyphs on English documents. Overall old_scans score: + 38.6% (preliminary). Scans: Library of Congress via + allenai/olmOCR-bench (ODC-BY).

+ +
+

old_scans/5 — Typed letter — near-perfect transcription

+
+
scan 5
+
“WE NEVER DISAPPOINT”
+
+3
+
+136, 138, 140 West Short Street
+
+LEXINGTON, KY.
+
+public affairs, he would stand much better.
+
+You have educated us to expect the president
+
+to talk, and he who falls short of your
+
+means, will be a public disappointment.
+
+When the president ought to say a
+
+wise word to allay the Protestant ill-
+
+feeling and at the same time, let
+
+Rome know her place in such a
+
+way as would prevent exception being
+
+taken, be would�minently please again.
+
+He is a Protestant, and can not be
+
+expected to attend Catholic ceremonials,
+
+and send himself to even the appearance
+
+of intrigue.
+
+All together the public wants
+
+You, and is looking to you to say
+
+Something to check the ozone of Peace
+
+going on in Washington, Kindle D $25,000,000
+
+
+
+

old_scans/10 — Typed letter — cursive signature dropped; 'Sincerely,' loops x3

+
+
scan 10
+
ack
+
+5127114
+
+The Hon. Theodore Roosevelt.
+
+287 Fourth Avenue,
+
+New York City.
+
+Dear Sir:-
+
+Indiana County Progressives send you congratulations on your safe return from your epoch-making journey to South America.
+
+Ours is the Pennsylvania County that gave you six hundred more votes than Taft and Wilson combined received.I hereby make a special plea that you honor us with a campaign speech when you tour Pennsylvania.Indiana, the county-seat, is situated at the center of the county and is entered by rail-roads and trolley-lbnes leading to different sections.When you come thousands will greet you, and in no uncertain tones.
+
+Progressivism is a live issue here. We are in the fight to stay. We want no fusion or amalgamation with the Republican party, the party of Penrose and his ilk.
+
+I am,
+
+Sincerely,
+
+Sincerely,
+
+Sincerely,
+
+Chairman of Washington Party in Indiana County, Pa.
+
+
+
+

old_scans/1 — Handwritten letter — readable, character-level errors

+
+
scan 1
+
Bangor Pa. May. 22. nd 1914.
+
+Col. Rosevelt  Gennrade & Friend
+
+Gen. Dear Sir
+
+I am one of D. E. Sickles old
+
+Regt & Brigade I Served through
+
+all of the civil war. During the
+
+5. years was with Gen Sickles
+
+when He Loost His Logg. then He
+
+Loest us. I Regret Your Absence from
+
+this country During your Trip to
+
+S.A. I Say We Pinhott Here &
+
+Had a Loitte talk with Him about
+
+you & the Political Situation of Vty
+
+Country I am with you for the
+
+election of Gifford Pinchott as U.S.
+
+Senitor. One Loaw Here must Be
+
+Repeated. as we are Deprived of
+
+Nothing for Whoever We Desired at
+
+the Performaries. if you come to Easton
+
+or Bangor Pa I would like to
+
+See you Address 25 Market St Bangor
+
+Yours Very Rest. Caleb Aben
+
+
+
+

old_scans/30 — Typed letter — Chinese 场景 inserted mid-sentence

+
+
scan 30
+
indeed strengthened by the proposed change in your formal relation to the Outlook. With this change I am quite sure that we can do more to promote the interests which we both have as heart.
+
+He shall want our correspondence, when it is published, to make clear that our interest in and loyalty to the principles from which both you and we have stood, so you as the leader in this great demarcation movement, is unchanged, and that we can still corent on you as special contribution on serial and political topics.
+
+
+
+Of course nothing will be published, now by us anything said, until you return to America. Meanwhile I shall endeavor to draft a letter in response to your and get it into Laurancès hands, for consultation between you and Iri, if, as you have intended, you return to their in the same场景.
+
+Believe I may you in many
+
+since affection and esteem for you
+
+and my faith in what you have
+
+to / splanchically stood for in our
+
+
+
+

old_scans/56 — Q&A catechism — 源 emitted for 'sources'

+
+
scan 56
+
## Dario Ciroli! Laguir
+
+1. what is Logic II?
+
+A. the art of reason in the human mind in acquiring
+
+2. into how many parts is it divided?
+
+A. in to form.
+
+3. of what do they think?
+
+A. the first treat of simple approach to
+
+4. what is simple approach to
+
+5. it is the attention of the mind to the improve
+
+6. what is the influence of the mind to the own operation.
+
+7. what are the sources from which all original ideas
+
+8. the situation is the development of the world.
+
+9. what does it do we get from the generation?
+
+A. what does it do we get from the to.
+
+B. what does it do we get from reflection.
+
+C. how does it thinking willing believing it.
+
+D. how are our ideas divided.
+
+A. into simple and fuller.
+
+10. what is a simple idea?
+
+A. it is an original impression existing under
+
+the mind under one uniform appearance, without
+
+variety or composition.
+
+11. give an instance of a simple idea of sensation
+
+also of reflection.
+
+A. set the idea we have of colour in a simple
+
+idea of sensation. The idea we have of willing in a
+
+simple idea of reflection.
+
+2602
+
+
+
+

old_scans/50 — Dense cursive — garbled + multiple CJK glyphs

+
+
scan 50
+
## Tacping Sabbetta
+
+O Vincus means series to comply with what he believes to be the will of God, but commsion of his own faith, he revolts from entroning thus to conform to his own opinions. Even compulsion,政务的行动it may in the expense of persecution — God has given man no authority to come in his dinnert to his precepts.
+
+The people in a Ming bulwam a man this mother-mother it be the voluntary offering of the heart it must be a man cold征兆 of words which cannot be an apology to lead. If a man does not believe that religion exacts it by minor activity to ab�ane from labor are the subbate he will submit with relentless and this this will join them will be no party on his part now will it advance the party of address.
+
+There can be no more parties or elections in this less than that would be an enforcing information of the policy upon the public的手, an obstinance for the meat decing but in an� for the Admiralty Christans to work with Mohoutti —
+
+1. There is no one precept in the new Government commanding no to keep as abbat — If we are banned to keep one, it is in correspondence of the most air land —
+
+2. The language of the 2th commandment is, "The Leisure day is the sabbat of the Lord" But the Christmas keep the first day not the seventh —
+
+3. There is not a single word either in the adorn new Government nor even our admissions relating to the substitution of the first day for the seventh — The subject is not mentioned in any of the discourse of Christ was in any of the epistles of his apotheles —
+
+(1)
+
+
+
+

old_scans/27 — Ornate blackletter header skipped + cursive garbled

+
+
scan 27
+
audience
+
+Turning from will favor
+
+any request with your
+
+consideration I am
+
+Costoutfully you're!
+
+Philip L. J.
+
+by P. Pursuius, Levy
+
+
+ \ No newline at end of file diff --git a/experiments/olmocr-bench-oldscans/score.py b/experiments/olmocr-bench-oldscans/score.py new file mode 100644 index 0000000..0ea4be1 --- /dev/null +++ b/experiments/olmocr-bench-oldscans/score.py @@ -0,0 +1,35 @@ +# /// script +# requires-python = ">=3.11,<3.12" +# dependencies = [ +# "olmocr[bench]", +# "numpy", # olmocr.bench.tests imports numpy but doesn't declare it +# ] +# /// +""" +Job 2 (CPU): score the candidate(s) produced by convert.py with the official +olmocr.bench.benchmark harness. old_scans = text-present / text-absent / +reading-order tests only -> pure string matching, no KaTeX/chromium needed. + +Reads from DATA (default /bucket). Mount the same bucket the convert job wrote to, +read-only is fine: + + hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN \\ + -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket:ro \\ + experiments/olmocr-bench-oldscans/score.py + +Env: + DATA directory holding old_scans.jsonl + the candidate folder(s) (default /bucket) +""" +import os +import subprocess +import sys + +DATA = os.environ.get("DATA", "/bucket") + +# --dir globs *.jsonl (only old_scans.jsonl is present) and treats each subdir +# other than "pdfs" as a candidate (paddleocr_vl_16, plus any other CANDIDATE convert.py synced). +proc = subprocess.run( + [sys.executable, "-m", "olmocr.bench.benchmark", "--dir", DATA], + text=True, +) +sys.exit(proc.returncode)
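
Reviewer note: the candidate-file naming that `convert.py` writes and the README describes (`{splitext(pdf_field)}_pg{page}_repeat1.md`) can be sanity-checked in isolation. The sketch below is illustrative only and not part of the diff. `candidate_md_path` is a hypothetical helper, the `repeat` parameter is an assumption (`convert.py` hardcodes `repeat1`), and the `old_scans/5.pdf` value for the `pdf` field is inferred from the bucket paths in the `gen_samples.py` docstring.

```python
import os
from pathlib import PurePosixPath


def candidate_md_path(candidate: str, pdf_field: str, page: int = 1, repeat: int = 1) -> str:
    """Hypothetical helper: bucket-relative path of the markdown for one test page.

    Mirrors the naming in convert.py: strip the PDF extension, then append
    _pg{page}_repeat{repeat}.md under the candidate folder.
    """
    md_base = os.path.splitext(pdf_field)[0]  # "old_scans/5.pdf" -> "old_scans/5"
    return str(PurePosixPath(candidate) / f"{md_base}_pg{page}_repeat{repeat}.md")


# Assumed example values; the pdf field format is inferred, not confirmed.
print(candidate_md_path("paddleocr_vl_16", "old_scans/5.pdf"))
```

If the scorer reports every test for a document as failed, comparing the path this produces against the files actually present in the bucket is a quick first check.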