fix(loading errors): improving dataset loading errors handling and logging by CarolinePascal · Pull Request #3342 · huggingface/lerobot

CarolinePascal · 2026-04-09T16:21:18Z

Type / Scope

Type: Bug
Scope: LeRobotDataset

Summary / Motivation

This PR improves the handling of dataset loading errors when files are missing (e.g. not yet downloaded or inexistent).

Related issues

Fixes / Closes: # (if any)
Related: # (if any)

What changed

DatasetReader.try_load now catches the ValueError triggered when files are missing locally but may exist remotely.
load_nested_dataset wraps the unclear ValueError raised by Dataset.from_parquet when parquet files are missing.

How was this tested (or how to run locally)

This was tested while trying to fix lerobot/droid_1.0.1 dataset, which was missing a bunch of files remotely.

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest)
Documentation updated
CI is green

Reviewer notes

Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
Anyone in the community is free to review the PR.

…gging

Copilot

Pull request overview

This PR aims to make LeRobot dataset loading failures easier to diagnose and more recoverable when parquet files are missing or unreadable, especially in local-cache load paths.

Changes:

Wraps Dataset.from_parquet(...) errors in load_nested_dataset() with a clearer ValueError.
Treats ValueError during DatasetReader.try_load() as a cache-miss (returns False) alongside missing-path errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`src/lerobot/datasets/io_utils.py`	Adds a `try/except ValueError` around parquet loading to raise a clearer error when `from_parquet` fails.
`src/lerobot/datasets/dataset_reader.py`	Expands `try_load()` exception handling to include `ValueError` so initialization can fall back to downloading data.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-09T16:23:44Z

+        try:
+            return Dataset.from_parquet([str(path) for path in paths], filters=filters, features=features)
+        except ValueError:
+            raise ValueError(f"Failed to load parquet files in {pq_dir}, make sure the dataset is valid and is not missing any files.")


The re-raised ValueError drops the original exception context, which makes debugging (e.g., which parquet file failed and why) much harder. Capture the exception as e and re-raise with exception chaining (... from e), ideally including e’s message (and/or the offending path) in the new error message.

Copilot · 2026-04-09T16:23:44Z

        try:
            self.hf_dataset = self._load_hf_dataset()
-        except (FileNotFoundError, NotADirectoryError):
+        except (FileNotFoundError, NotADirectoryError, ValueError):
            self.hf_dataset = None
            return False


try_load() now treats any ValueError from _load_hf_dataset() as a cache-miss and silently returns False, which can hide real data/schema errors and gives no hint why the local load failed. Consider narrowing this to the specific missing-file ValueError (e.g., by checking the message / exception type) and/or logging the exception before returning False so users can diagnose corrupted caches vs. expected downloads.

Copilot · 2026-04-09T16:23:44Z

    def try_load(self) -> bool:
        """Attempt to load from local cache. Returns True if data is sufficient."""
        try:
            self.hf_dataset = self._load_hf_dataset()
-        except (FileNotFoundError, NotADirectoryError):
+        except (FileNotFoundError, NotADirectoryError, ValueError):
            self.hf_dataset = None
            return False


New behavior (catching ValueError in try_load() for missing/invalid parquet reads) isn’t covered by existing DatasetReader contract tests; current tests only cover FileNotFound when data/ is absent. Add a test that creates a partially-written/corrupted parquet under data/ (so glob finds files but Dataset.from_parquet raises) and asserts try_load() returns False.

Copilot · 2026-04-09T16:23:45Z

        try:
            self.hf_dataset = self._load_hf_dataset()
-        except (FileNotFoundError, NotADirectoryError):
+        except (FileNotFoundError, NotADirectoryError, ValueError):
            self.hf_dataset = None
            return False


PR title mentions improved "logging", but the change here only adds a broader exception catch without emitting any log (it silently returns False). Either add a log statement (warning/debug) when this exception path is hit, or adjust the PR title/description to avoid implying new logging behavior.

imstevenpmwork · 2026-04-09T16:40:54Z

@claude Review this PR

claude · 2026-04-09T16:41:11Z

Claude finished @imstevenpmwork's task in 1m 28s —— View job

Review of PR #3342

Gather context and read changed files
Analyze changes and review code
Post review feedback

Summary

This PR catches ValueError in two places to handle missing/corrupt parquet files more gracefully during dataset loading. The intent is sound — Dataset.from_parquet raises an unhelpful ValueError when files are missing — but the current implementation has several issues that should be addressed before merging.

Issues

1. Overly broad `ValueError` catch in `try_load()` silently masks real errors

src/lerobot/datasets/dataset_reader.py:90

except (FileNotFoundError, NotADirectoryError, ValueError):
    self.hf_dataset = None
    return False

ValueError is one of Python's most common exceptions. Catching it here means any ValueError from the entire _load_hf_dataset() → load_nested_dataset() → Dataset.from_parquet() call chain — including genuine schema mismatches, feature type errors, or bugs in transform code — will be silently swallowed, and the dataset will be re-downloaded instead of surfacing the real problem.

Since try_load() returning False triggers a download attempt (lerobot_dataset.py:230-233), a user with a corrupted local dataset would enter an infinite download-then-fail loop or silently get a fresh download when they should have been told their data is corrupt.

Suggestion: At minimum, narrow this to only catch the specific ValueError you're targeting (e.g., check the message string), and log a warning so the user knows why the cache was invalidated. For example:

except (FileNotFoundError, NotADirectoryError):
    self.hf_dataset = None
    return False
except ValueError as e:
    logging.warning("Failed to load cached dataset, will re-download: %s", e)
    self.hf_dataset = None
    return False

Fix this →

2. Exception chaining lost in `load_nested_dataset()`

src/lerobot/datasets/io_utils.py:81-84

try:
    return Dataset.from_parquet(...)
except ValueError:
    raise ValueError(f"Failed to load parquet files in {pq_dir}, ...")

The original exception (which contains the actual error — e.g., which specific file failed and why) is discarded. This makes debugging harder, not easier. Use raise ... from e to preserve the chain:

except ValueError as e:
    raise ValueError(
        f"Failed to load parquet files in {pq_dir}, "
        "make sure the dataset is valid and is not missing any files."
    ) from e

Fix this →

3. PR title mentions "logging" but no logging is added

The PR title says "improving dataset loading errors handling and logging" but neither change adds any logging. The try_load() path silently returns False. If better observability is a goal (and it should be), add a logging.warning() call when the ValueError path is hit in try_load().

4. Missing test coverage

The existing test_try_load_returns_false_when_no_data only covers the FileNotFoundError path (no data/ directory). The new ValueError path — where parquet files exist but are corrupted/unreadable — has no test. Consider adding a test that creates an invalid parquet file under data/chunk-000/ and asserts try_load() returns False.

5. `load_nested_dataset` is also called from `load_episodes()`

Note that load_nested_dataset is called from load_episodes() at io_utils.py:254 as well, not just from DatasetReader._load_hf_dataset(). The new ValueError wrapping in load_nested_dataset will affect that call site too — make sure the error message ("make sure the dataset is valid and is not missing any files") makes sense in that context as well.

Minor nit

src/lerobot/datasets/io_utils.py:84 — The error message is a single long line. Consider breaking it for readability, and including the original error message for diagnostics.

Checklist items from PR description

The PR's own checklist has all items unchecked (linting, tests, docs, CI). These should be addressed before merge.

fix(loading errors): improving dataset loading errors handling and lo…

5231ba5

…gging

Copilot AI review requested due to automatic review settings April 9, 2026 16:21

github-actions bot added the dataset Issues regarding data inputs, processing, or datasets label Apr 9, 2026

Copilot started reviewing on behalf of CarolinePascal April 9, 2026 16:21 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(loading errors): improving dataset loading errors handling and logging#3342

fix(loading errors): improving dataset loading errors handling and logging#3342
CarolinePascal wants to merge 1 commit intomainfrom
fix/loading-errors

CarolinePascal commented Apr 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

imstevenpmwork commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

CarolinePascal commented Apr 9, 2026

Type / Scope

Summary / Motivation

Related issues

What changed

How was this tested (or how to run locally)

Checklist (required before merge)

Reviewer notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

imstevenpmwork commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #3342

Summary

Issues

1. Overly broad ValueError catch in try_load() silently masks real errors

2. Exception chaining lost in load_nested_dataset()

3. PR title mentions "logging" but no logging is added

4. Missing test coverage

5. load_nested_dataset is also called from load_episodes()

Minor nit

Checklist items from PR description

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Apr 9, 2026 •

edited

Loading

1. Overly broad `ValueError` catch in `try_load()` silently masks real errors

2. Exception chaining lost in `load_nested_dataset()`

5. `load_nested_dataset` is also called from `load_episodes()`