dots-ocr fails on Hugging Face Jobs due to datasets / pyarrow incompatibility #18

Description

@arit2

Hi,
Running dots-ocr via ocr-bench fails on Hugging Face Jobs before inference starts.
Reproduction:

ocr-bench run Lukaszl/pl-government-docs-mix-ocr-dataset Lukaszl/pl-government-docs-mix-ocr-dataset-v1 --models dots-ocr --max-samples 169

Model mapping
dots-ocr is mapped in run.py to a remote HF script:
"dots-ocr": ModelConfig(
    script="https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr.py",
    model_id="rednote-hilab/dots.ocr",
    size="1.7B",
    default_flavor="l4x1",
),

What happens

The HF Job executes:
uv run https://huggingface.co/datasets/uv-scripts/ocr/raw/main/dots-ocr.py

and fails before inference starts:

Installed 181 packages in 501ms

Traceback (most recent call last):
  File "/tmp/dots-ocrzkyi0e.py", line 43, in <module>
    from datasets import load_dataset
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/__init__.py", line 22, in <module>
    from .arrow_dataset import Dataset
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 67, in <module>
    from .arrow_writer import ArrowWriter, OptimizedTypedSequence
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/arrow_writer.py", line 27, in <module>
    from .features import Features, Image, Value
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/features/__init__.py", line 18, in <module>
    from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, Sequence, Value
  File "/root/.cache/uv/environments-v2/98db52c6ac57f55b/lib/python3.12/site-packages/datasets/features/features.py", line 634, in <module>
    class _ArrayXDExtensionType(pa.PyExtensionType):
                                ^^^^^^^^^^^^^^^^^^
AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'. Did you mean: 'ExtensionType'?

Analysis
This looks like a dependency mismatch in the HF Job environment used by the remote dots-ocr.py script:

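  • the installed datasets release subclasses pyarrow.PyExtensionType at import time (datasets/features/features.py, line 634)
  • the installed pyarrow release no longer provides PyExtensionType; the error message suggests ExtensionType instead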

So this is not related to the dataset or to CLI usage: the packages install successfully, and the job then fails on the first `from datasets import load_dataset`, before any inference code runs.
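To confirm the mismatch in a given environment, a small check along these lines could be run with the same interpreter the job uses. This is only a diagnostic sketch: the helper names `module_has_attr` and `describe` are made up for illustration and are not part of ocr-bench, datasets, or pyarrow. It only reports whether the installed pyarrow exposes the two attribute names that appear in the traceback.

```python
import importlib
from typing import Optional


def module_has_attr(module_name: str, attr: str) -> Optional[bool]:
    """Return True/False if the module imports and has/lacks `attr`.

    Returns None when the module cannot be imported at all.
    (Hypothetical helper, for illustration only.)
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


def describe(module_name: str, attr: str) -> str:
    """Build a one-line, human-readable report for the check above."""
    result = module_has_attr(module_name, attr)
    if result is None:
        return f"{module_name} is not installed"
    module = importlib.import_module(module_name)
    version = getattr(module, "__version__", "unknown")
    state = "provides" if result else "does not provide"
    return f"{module_name} {version} {state} {attr}"


if __name__ == "__main__":
    # The attribute that datasets/features/features.py tries to subclass.
    print(describe("pyarrow", "PyExtensionType"))
    # The attribute suggested by the AttributeError message.
    print(describe("pyarrow", "ExtensionType"))
```

If the first line reports "does not provide" while the second reports "provides", the resolved datasets and pyarrow releases are incompatible. Which side to change (constraining pyarrow in the script's dependencies, or requiring a datasets release that no longer uses PyExtensionType) depends on the versions actually resolved, which this issue does not show.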
