fix(loading errors): improving dataset loading errors handling and logging#3342
fix(loading errors): improving dataset loading errors handling and logging#3342CarolinePascal wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to make LeRobot dataset loading failures easier to diagnose and more recoverable when parquet files are missing or unreadable, especially in local-cache load paths.
Changes:
- Wraps
Dataset.from_parquet(...)errors inload_nested_dataset()with a clearerValueError. - Treats
ValueErrorduringDatasetReader.try_load()as a cache-miss (returnsFalse) alongside missing-path errors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/lerobot/datasets/io_utils.py |
Adds a try/except ValueError around parquet loading to raise a clearer error when from_parquet fails. |
src/lerobot/datasets/dataset_reader.py |
Expands try_load() exception handling to include ValueError so initialization can fall back to downloading data. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| return Dataset.from_parquet([str(path) for path in paths], filters=filters, features=features) | ||
| except ValueError: | ||
| raise ValueError(f"Failed to load parquet files in {pq_dir}, make sure the dataset is valid and is not missing any files.") |
There was a problem hiding this comment.
The re-raised ValueError drops the original exception context, which makes debugging (e.g., which parquet file failed and why) much harder. Capture the exception as e and re-raise with exception chaining (... from e), ideally including e’s message (and/or the offending path) in the new error message.
| try: | ||
| self.hf_dataset = self._load_hf_dataset() | ||
| except (FileNotFoundError, NotADirectoryError): | ||
| except (FileNotFoundError, NotADirectoryError, ValueError): | ||
| self.hf_dataset = None | ||
| return False |
There was a problem hiding this comment.
try_load() now treats any ValueError from _load_hf_dataset() as a cache-miss and silently returns False, which can hide real data/schema errors and gives no hint why the local load failed. Consider narrowing this to the specific missing-file ValueError (e.g., by checking the message / exception type) and/or logging the exception before returning False so users can diagnose corrupted caches vs. expected downloads.
| def try_load(self) -> bool: | ||
| """Attempt to load from local cache. Returns True if data is sufficient.""" | ||
| try: | ||
| self.hf_dataset = self._load_hf_dataset() | ||
| except (FileNotFoundError, NotADirectoryError): | ||
| except (FileNotFoundError, NotADirectoryError, ValueError): | ||
| self.hf_dataset = None | ||
| return False |
There was a problem hiding this comment.
New behavior (catching ValueError in try_load() for missing/invalid parquet reads) isn’t covered by existing DatasetReader contract tests; current tests only cover FileNotFound when data/ is absent. Add a test that creates a partially-written/corrupted parquet under data/ (so glob finds files but Dataset.from_parquet raises) and asserts try_load() returns False.
| try: | ||
| self.hf_dataset = self._load_hf_dataset() | ||
| except (FileNotFoundError, NotADirectoryError): | ||
| except (FileNotFoundError, NotADirectoryError, ValueError): | ||
| self.hf_dataset = None | ||
| return False |
There was a problem hiding this comment.
PR title mentions improved "logging", but the change here only adds a broader exception catch without emitting any log (it silently returns False). Either add a log statement (warning/debug) when this exception path is hit, or adjust the PR title/description to avoid implying new logging behavior.
|
@claude Review this PR |
|
Claude finished @imstevenpmwork's task in 1m 28s —— View job Review of PR #3342
SummaryThis PR catches Issues1. Overly broad
|
Type / Scope
Summary / Motivation
This PR improves the handling of dataset loading errors when files are missing (e.g. not yet downloaded or inexistent).
Related issues
What changed
DatasetReader.try_loadnow catches theValueErrortriggered when files are missing locally but may exist remotely.load_nested_datasetwraps the unclearValueErrorraised byDataset.from_parquetwhen parquet files are missing.How was this tested (or how to run locally)
This was tested while trying to fix
lerobot/droid_1.0.1dataset, which was missing a bunch of files remotely.Checklist (required before merge)
pre-commit run -a)pytest)Reviewer notes