Skip to content

fix(loading errors): improving dataset loading errors handling and logging#3342

Open
CarolinePascal wants to merge 1 commit intomainfrom
fix/loading-errors
Open

fix(loading errors): improving dataset loading errors handling and logging#3342
CarolinePascal wants to merge 1 commit intomainfrom
fix/loading-errors

Conversation

@CarolinePascal
Copy link
Copy Markdown
Collaborator

Type / Scope

  • Type: Bug
  • Scope: LeRobotDataset

Summary / Motivation

This PR improves the handling of dataset loading errors when files are missing (e.g. not yet downloaded or inexistent).

Related issues

  • Fixes / Closes: # (if any)
  • Related: # (if any)

What changed

  • DatasetReader.try_load now catches the ValueError triggered when files are missing locally but may exist remotely.
  • load_nested_dataset wraps the unclear ValueError raised by Dataset.from_parquet when parquet files are missing.

How was this tested (or how to run locally)

This was tested while trying to fix lerobot/droid_1.0.1 dataset, which was missing a bunch of files remotely.

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • Anything the reviewer should focus on (performance, edge-cases, specific files) or general notes.
  • Anyone in the community is free to review the PR.

Copilot AI review requested due to automatic review settings April 9, 2026 16:21
@github-actions github-actions bot added the dataset Issues regarding data inputs, processing, or datasets label Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make LeRobot dataset loading failures easier to diagnose and more recoverable when parquet files are missing or unreadable, especially in local-cache load paths.

Changes:

  • Wraps Dataset.from_parquet(...) errors in load_nested_dataset() with a clearer ValueError.
  • Treats ValueError during DatasetReader.try_load() as a cache-miss (returns False) alongside missing-path errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/lerobot/datasets/io_utils.py Adds a try/except ValueError around parquet loading to raise a clearer error when from_parquet fails.
src/lerobot/datasets/dataset_reader.py Expands try_load() exception handling to include ValueError so initialization can fall back to downloading data.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +81 to +84
try:
return Dataset.from_parquet([str(path) for path in paths], filters=filters, features=features)
except ValueError:
raise ValueError(f"Failed to load parquet files in {pq_dir}, make sure the dataset is valid and is not missing any files.")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The re-raised ValueError drops the original exception context, which makes debugging (e.g., which parquet file failed and why) much harder. Capture the exception as e and re-raise with exception chaining (... from e), ideally including e’s message (and/or the offending path) in the new error message.

Copilot uses AI. Check for mistakes.
Comment on lines 88 to 92
try:
self.hf_dataset = self._load_hf_dataset()
except (FileNotFoundError, NotADirectoryError):
except (FileNotFoundError, NotADirectoryError, ValueError):
self.hf_dataset = None
return False
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_load() now treats any ValueError from _load_hf_dataset() as a cache-miss and silently returns False, which can hide real data/schema errors and gives no hint why the local load failed. Consider narrowing this to the specific missing-file ValueError (e.g., by checking the message / exception type) and/or logging the exception before returning False so users can diagnose corrupted caches vs. expected downloads.

Copilot uses AI. Check for mistakes.
Comment on lines 86 to 92
def try_load(self) -> bool:
"""Attempt to load from local cache. Returns True if data is sufficient."""
try:
self.hf_dataset = self._load_hf_dataset()
except (FileNotFoundError, NotADirectoryError):
except (FileNotFoundError, NotADirectoryError, ValueError):
self.hf_dataset = None
return False
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behavior (catching ValueError in try_load() for missing/invalid parquet reads) isn’t covered by existing DatasetReader contract tests; current tests only cover FileNotFound when data/ is absent. Add a test that creates a partially-written/corrupted parquet under data/ (so glob finds files but Dataset.from_parquet raises) and asserts try_load() returns False.

Copilot uses AI. Check for mistakes.
Comment on lines 88 to 92
try:
self.hf_dataset = self._load_hf_dataset()
except (FileNotFoundError, NotADirectoryError):
except (FileNotFoundError, NotADirectoryError, ValueError):
self.hf_dataset = None
return False
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR title mentions improved "logging", but the change here only adds a broader exception catch without emitting any log (it silently returns False). Either add a log statement (warning/debug) when this exception path is hit, or adjust the PR title/description to avoid implying new logging behavior.

Copilot uses AI. Check for mistakes.
@imstevenpmwork
Copy link
Copy Markdown
Collaborator

@claude Review this PR

@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Claude finished @imstevenpmwork's task in 1m 28s —— View job


Review of PR #3342

  • Gather context and read changed files
  • Analyze changes and review code
  • Post review feedback

Summary

This PR catches ValueError in two places to handle missing/corrupt parquet files more gracefully during dataset loading. The intent is sound — Dataset.from_parquet raises an unhelpful ValueError when files are missing — but the current implementation has several issues that should be addressed before merging.


Issues

1. Overly broad ValueError catch in try_load() silently masks real errors

src/lerobot/datasets/dataset_reader.py:90

except (FileNotFoundError, NotADirectoryError, ValueError):
    self.hf_dataset = None
    return False

ValueError is one of Python's most common exceptions. Catching it here means any ValueError from the entire _load_hf_dataset()load_nested_dataset()Dataset.from_parquet() call chain — including genuine schema mismatches, feature type errors, or bugs in transform code — will be silently swallowed, and the dataset will be re-downloaded instead of surfacing the real problem.

Since try_load() returning False triggers a download attempt (lerobot_dataset.py:230-233), a user with a corrupted local dataset would enter an infinite download-then-fail loop or silently get a fresh download when they should have been told their data is corrupt.

Suggestion: At minimum, narrow this to only catch the specific ValueError you're targeting (e.g., check the message string), and log a warning so the user knows why the cache was invalidated. For example:

except (FileNotFoundError, NotADirectoryError):
    self.hf_dataset = None
    return False
except ValueError as e:
    logging.warning("Failed to load cached dataset, will re-download: %s", e)
    self.hf_dataset = None
    return False

Fix this →

2. Exception chaining lost in load_nested_dataset()

src/lerobot/datasets/io_utils.py:81-84

try:
    return Dataset.from_parquet(...)
except ValueError:
    raise ValueError(f"Failed to load parquet files in {pq_dir}, ...")

The original exception (which contains the actual error — e.g., which specific file failed and why) is discarded. This makes debugging harder, not easier. Use raise ... from e to preserve the chain:

except ValueError as e:
    raise ValueError(
        f"Failed to load parquet files in {pq_dir}, "
        "make sure the dataset is valid and is not missing any files."
    ) from e

Fix this →

3. PR title mentions "logging" but no logging is added

The PR title says "improving dataset loading errors handling and logging" but neither change adds any logging. The try_load() path silently returns False. If better observability is a goal (and it should be), add a logging.warning() call when the ValueError path is hit in try_load().

4. Missing test coverage

The existing test_try_load_returns_false_when_no_data only covers the FileNotFoundError path (no data/ directory). The new ValueError path — where parquet files exist but are corrupted/unreadable — has no test. Consider adding a test that creates an invalid parquet file under data/chunk-000/ and asserts try_load() returns False.

5. load_nested_dataset is also called from load_episodes()

Note that load_nested_dataset is called from load_episodes() at io_utils.py:254 as well, not just from DatasetReader._load_hf_dataset(). The new ValueError wrapping in load_nested_dataset will affect that call site too — make sure the error message ("make sure the dataset is valid and is not missing any files") makes sense in that context as well.


Minor nit

src/lerobot/datasets/io_utils.py:84 — The error message is a single long line. Consider breaking it for readability, and including the original error message for diagnostics.


Checklist items from PR description

The PR's own checklist has all items unchecked (linting, tests, docs, CI). These should be addressed before merge.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataset Issues regarding data inputs, processing, or datasets

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants