feat: add P0 v3 validators for LeRobot dataset pre-ingestion checks by kck325 · Pull Request #9 · Physical-Intelligence/pi-data-sharing

kck325 · 2026-03-23T05:09:54Z

Summary

- Add 6 P0 validators as a new lerobot_validator/v3_checks.py module to catch the most common data quality issues before partner upload:
  - V1 validate_tasks_format -- error if neither tasks file exists, warn if only the jsonl file (old format) exists
  - V2 validate_codebase_version -- reject datasets whose codebase_version does not start with "v3."
  - V5 validate_feature_shapes -- reject shape: [] (0-D), require a 3-element shape for image/video features
  - V7 validate_timestamps -- reject absolute Unix epoch timestamps in data parquets; warn on non-monotonic timestamps or large starting offsets
  - V11 validate_custom_metadata_csv -- require episode_index/episode_id columns, reject null/duplicate episode_id values
  - V12 validate_start_timestamp -- require that start_timestamp values are plausible Unix epoch floats (year 2000-2100 range)
- Wire validate_v3_dataset() into the LerobotDatasetValidator orchestrator so issues surface automatically
- Add a get_warnings() method to the orchestrator and display warnings in print_results()
- Export Issue and validate_v3_dataset from __init__.py for direct usage
- Add 40 new tests in tests/test_v3_checks.py covering all validators and edge cases
- Update existing test fixtures in test_integration.py and test_is_eval_data_consistency.py to include the codebase_version and tasks.parquet required by the new checks
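
For reviewers who want a feel for the V2 and V12 rules without opening the diff, here is a minimal sketch of the kind of check they describe. It is not the code in v3_checks.py: SketchIssue, check_codebase_version, and check_start_timestamps are hypothetical names, and the fields of the real Issue type are not shown in this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Iterable, List


@dataclass
class SketchIssue:
    """Hypothetical stand-in for the exported Issue type (real fields unknown)."""
    level: str    # "error" or "warning"
    check: str    # validator ID, e.g. "V2"
    message: str


# Plausible Unix epoch window (year 2000-2100), expressed in seconds.
EPOCH_MIN = datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp()
EPOCH_MAX = datetime(2100, 1, 1, tzinfo=timezone.utc).timestamp()


def check_codebase_version(info: dict) -> List[SketchIssue]:
    """V2-style rule: codebase_version must be a string starting with 'v3.'."""
    version = info.get("codebase_version")
    if not isinstance(version, str) or not version.startswith("v3."):
        msg = f"codebase_version must start with 'v3.', got {version!r}"
        return [SketchIssue("error", "V2", msg)]
    return []


def check_start_timestamps(values: Iterable[Any]) -> List[SketchIssue]:
    """V12-style rule: each value must parse as a float inside the epoch window."""
    issues: List[SketchIssue] = []
    for row, value in enumerate(values):
        try:
            ts = float(value)
        except (TypeError, ValueError):
            issues.append(SketchIssue("error", "V12", f"row {row}: {value!r} is not a float"))
            continue
        # NaN fails both comparisons, so it is flagged here as well.
        if not (EPOCH_MIN <= ts <= EPOCH_MAX):
            issues.append(SketchIssue("error", "V12", f"row {row}: {ts} is outside 2000-2100"))
    return issues
```

Under these assumptions, check_start_timestamps([12.5]) reports one error because 12.5 looks like a relative offset rather than an epoch value, while a value such as 1700000000.0 passes.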

Test plan

All 40 new tests in test_v3_checks.py pass
All 74 tests (including existing ones) pass with pytest tests/ -v
Run validator against a real partner dataset to verify end-to-end behavior
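
As an illustration of the style of test in the plan above, the sketch below exercises a V1-like rule against a temporary directory. It is not taken from tests/test_v3_checks.py: check_tasks_format is a hypothetical helper, and the meta/tasks.parquet and meta/tasks.jsonl paths are assumptions about the dataset layout.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def check_tasks_format(root: Path) -> List[Tuple[str, str]]:
    """V1-style rule: error if no tasks file exists, warn if only the jsonl one does.

    The meta/ file locations are assumed for illustration.
    """
    meta = root / "meta"
    has_parquet = (meta / "tasks.parquet").is_file()
    has_jsonl = (meta / "tasks.jsonl").is_file()
    if not has_parquet and not has_jsonl:
        return [("error", "no tasks file found")]
    if has_jsonl and not has_parquet:
        return [("warning", "only tasks.jsonl (old format) found")]
    return []


def test_tasks_format_levels() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "meta").mkdir()

        # Neither file present: expect an error.
        assert [level for level, _ in check_tasks_format(root)] == ["error"]

        # Only the old jsonl file present: expect a warning.
        (root / "meta" / "tasks.jsonl").touch()
        assert [level for level, _ in check_tasks_format(root)] == ["warning"]

        # Parquet file present: no findings.
        (root / "meta" / "tasks.parquet").touch()
        assert check_tasks_format(root) == []
```

The check only looks at file existence, so empty placeholder files are enough here; a test of the real validators would likely need valid parquet content.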

🤖 Generated with Claude Code

Add 6 P0 validators as lerobot_validator/v3_checks.py to catch the most common data quality issues before partner upload:

- V1 validate_tasks_format: error if no tasks file, warn if only jsonl
- V2 validate_codebase_version: require codebase_version starts with v3.
- V5 validate_feature_shapes: reject shape=[], require 3-element image shapes
- V7 validate_timestamps: reject absolute Unix epoch in data parquets
- V11 validate_custom_metadata_csv: require episode_index/episode_id, reject null/duplicate episode_ids
- V12 validate_start_timestamp: require plausible Unix epoch floats

Wire validate_v3_dataset() into the LerobotDatasetValidator orchestrator so errors surface automatically, and add get_warnings() support.

Update existing test fixtures to include codebase_version and tasks.parquet so integration tests pass with the new checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
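
The V7 rule treats per-episode timestamps as relative offsets, so an epoch-sized value is an error while ordering and start-offset problems are only warnings. A rough sketch of that split follows; check_episode_timestamps is a hypothetical name and both thresholds are assumptions, not values taken from v3_checks.py.

```python
from typing import List, Sequence, Tuple

# 2000-01-01T00:00:00Z in Unix seconds; anything at or above this is treated
# as an absolute epoch timestamp (assumed cutoff).
EPOCH_2000 = 946_684_800.0
# Largest first-frame offset, in seconds, accepted without a warning (assumed).
MAX_START_OFFSET_S = 1.0


def check_episode_timestamps(timestamps: Sequence[float]) -> List[Tuple[str, str]]:
    """Return (level, message) findings for one episode's timestamp column."""
    findings: List[Tuple[str, str]] = []
    if len(timestamps) == 0:
        return findings

    # Absolute epoch values are rejected outright; the other checks assume
    # relative offsets, so stop here.
    if any(t >= EPOCH_2000 for t in timestamps):
        findings.append(("error", "timestamps look like absolute Unix epoch values"))
        return findings

    if timestamps[0] > MAX_START_OFFSET_S:
        findings.append(("warning", f"episode starts at {timestamps[0]}s, expected about 0"))

    # Any step backwards in time makes the sequence non-monotonic.
    if any(later < earlier for earlier, later in zip(timestamps, timestamps[1:])):
        findings.append(("warning", "timestamps are not monotonically non-decreasing"))

    return findings
```

Returning early on the epoch error keeps the warnings from firing on data that is already known to use the wrong time base.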

kck325 · 2026-03-23T05:12:34Z

Closing duplicate -- changes pushed to PR #8 on branch chandra/lerobot-v3-metadata-checker instead.

kck325 closed this Mar 23, 2026

kck325 deleted the feat/v3-validators branch March 23, 2026 05:12

kck325 wants to merge 1 commit into main from feat/v3-validators
