Hi,
Running dots-ocr via ocr-bench fails on Hugging Face Jobs before inference starts.
Reproduction:
ocr-bench run Lukaszl/pl-government-docs-mix-ocr-dataset Lukaszl/pl-government-docs-mix-ocr-dataset-v1 --models dots-ocr --max-samples 169
Model mapping
dots-ocr is mapped in run.py to a remote HF script:
    "dots-ocr": ModelConfig(
        script="https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr.py",
        model_id="rednote-hilab/dots.ocr",
        size="1.7B",
        default_flavor="l4x1",
    ),
What happens
The HF Job executes:
uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr.py
and fails before inference starts:
Installed 181 packages in 501ms
Traceback (most recent call last):
  File "/tmp/dots-ocrzkyi0e.py", line 43, in <module>
    from datasets import load_dataset
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/__init__.py", line 22, in <module>
    from .arrow_dataset import Dataset
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 67, in <module>
    from .arrow_writer import ArrowWriter, OptimizedTypedSequence
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/arrow_writer.py", line 27, in <module>
    from .features import Features, Image, Value
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/features/__init__.py", line 18, in <module>
    from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, Sequence, Value
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/features/features.py", line 634, in <module>
    class _ArrayXDExtensionType(pa.PyExtensionType):
                                ^^^^^^^^^^^^^^^^^^
AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'. Did you mean: 'ExtensionType'?
Analysis
This looks like a dependency mismatch in the HF Job environment used by the remote dots-ocr.py script:
- datasets expects pyarrow.PyExtensionType
- but the installed pyarrow version no longer provides it
So this is not related to the dataset or CLI usage: the packages install without errors, and the job fails at import time (from datasets import load_dataset), before any inference code runs.
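For anyone who wants to confirm the mismatch in their own environment, here is a minimal diagnostic sketch. It only checks whether the installed pyarrow still exposes PyExtensionType, the attribute the traceback reports as missing. The helper names has_py_extension_type and diagnose are made up for illustration and are not part of ocr-bench, datasets, or pyarrow.

```python
import importlib


def has_py_extension_type(pa_module) -> bool:
    """Return True if the given module object exposes PyExtensionType."""
    return hasattr(pa_module, "PyExtensionType")


def diagnose() -> str:
    """Describe whether the installed pyarrow matches what the failing datasets code expects."""
    try:
        pa = importlib.import_module("pyarrow")
    except ImportError:
        return "pyarrow is not installed in this environment"

    version = getattr(pa, "__version__", "unknown")
    if has_py_extension_type(pa):
        return f"pyarrow {version}: PyExtensionType is present"
    # This is the situation shown in the traceback above.
    return (
        f"pyarrow {version}: PyExtensionType is missing; "
        "a datasets release that still references it will fail on import"
    )


if __name__ == "__main__":
    print(diagnose())
```

If this reports the attribute as missing, the likely remedies are to have the script resolve a datasets release that no longer references pa.PyExtensionType, or to constrain pyarrow to a release that still provides it. I have not verified which exact versions the remote script resolves, so that part is an assumption.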