Fails to load dataset #9

@AmitMY

Since no default dataset config is published, and I would like to iterate on diverse data, I tried:

from datasets import get_dataset_config_names, load_dataset, interleave_datasets

configs = get_dataset_config_names("HuggingFaceFW/fineweb-2")
print(configs)

streams = [
    load_dataset("HuggingFaceFW/fineweb-2", c, split="train", streaming=True)
    for c in configs
]

# Round-robin interleaving (equal mixing across languages).
# Note: seed only has an effect when probabilities are passed.
ds = interleave_datasets(streams, seed=42)

# ds is now an IterableDataset; languages are naturally mixed as you iterate.
for ex in ds.take(3):
    print(ex.keys())

This prints all the configs (['aai_Latn', 'aak_Latn', 'aau_Latn', 'aaz_Latn', ...]) and then raises:

ValueError: At least one valid data file must be specified, all the data_files are invalid: {'test': [], 'train': ['hf://datasets/HuggingFaceFW/fineweb-2@af9c13333eb981300149d5ca60a8e9d659b276b9/data/abi_Latn/train/000_00000.parquet']}

Minimal reproduction:

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "abi_Latn", split="train", streaming=True)
ds.take(1)

This works on my Mac but fails on my server.
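
Since the same code behaves differently on the two machines, a first step might be to compare the installed library versions on each. The snippet below is only a sketch for collecting that information: the list of packages is my assumption about what could be involved in resolving hf:// paths, not a confirmed cause, and `environment_report` is a hypothetical helper name.

```python
from importlib import metadata
import platform

# Packages assumed (not confirmed) to be relevant to resolving hf:// data files.
PACKAGES = ("datasets", "huggingface_hub", "fsspec", "pyarrow")


def environment_report(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is not installed."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Run this on both machines and diff the output.
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

Running it on the Mac and on the server and diffing the two outputs should show whether the environments differ in any of these packages.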