feat: add P0 v3 validators for LeRobot dataset pre-ingestion checks #9

Closed
kck325 wants to merge 1 commit into main from feat/v3-validators

kck325 commented Mar 23, 2026

Summary

  • Add 6 P0 validators as a new lerobot_validator/v3_checks.py module that catch the most common data quality issues before partner upload:
    • V1 validate_tasks_format -- error if neither tasks file exists, warn if only jsonl (old format)
    • V2 validate_codebase_version -- reject datasets without codebase_version starting with v3.
    • V5 validate_feature_shapes -- reject shape: [] (0-D), require 3-element shape for image/video features
    • V7 validate_timestamps -- reject absolute Unix epoch timestamps in data parquets; warn on non-monotonic or large starting offsets
    • V11 validate_custom_metadata_csv -- require episode_index/episode_id columns, reject null/duplicate episode_id values
    • V12 validate_start_timestamp -- require start_timestamp values to be plausible Unix epoch floats (year 2000-2100 range)
  • Wire validate_v3_dataset() into the LerobotDatasetValidator orchestrator so issues surface automatically
  • Add get_warnings() method to the orchestrator and display warnings in print_results()
  • Export Issue and validate_v3_dataset from __init__.py for direct usage
  • Add 40 new tests in tests/test_v3_checks.py covering all validators and edge cases
  • Update existing test fixtures in test_integration.py and test_is_eval_data_consistency.py to include codebase_version and tasks.parquet required by the new checks
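
To make the metadata checks concrete, here is a minimal sketch of what V2- and V5-style validators could look like. It is not the code in v3_checks.py: the `Issue` fields, the function names `check_codebase_version` and `check_feature_shapes`, and the assumption that image/video features are marked by a `dtype` of `"image"` or `"video"` in the info metadata are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue record; the real Issue class may carry other fields."""

    level: str      # "error" or "warning"
    validator: str  # e.g. "V2"
    message: str


def check_codebase_version(info: dict[str, Any]) -> list[Issue]:
    """V2-style check: codebase_version must be a string starting with 'v3.'."""
    version = info.get("codebase_version")
    if not isinstance(version, str) or not version.startswith("v3."):
        msg = f"codebase_version must start with 'v3.', got {version!r}"
        return [Issue("error", "V2", msg)]
    return []


def check_feature_shapes(info: dict[str, Any]) -> list[Issue]:
    """V5-style check: no 0-D shapes; image/video features need 3 elements."""
    issues: list[Issue] = []
    for name, feature in info.get("features", {}).items():
        # Assumption: a missing shape is treated the same as an empty one.
        shape = list(feature.get("shape") or [])
        if not shape:
            issues.append(Issue("error", "V5", f"{name}: shape is empty (0-D)"))
        elif feature.get("dtype") in ("image", "video") and len(shape) != 3:
            msg = f"{name}: image/video shape needs 3 elements, got {shape}"
            issues.append(Issue("error", "V5", msg))
    return issues


if __name__ == "__main__":
    sample_info = {
        "codebase_version": "v2.1",
        "features": {
            "action": {"dtype": "float32", "shape": []},
            "observation.images.top": {"dtype": "video", "shape": [480, 640]},
        },
    }
    for issue in check_codebase_version(sample_info) + check_feature_shapes(sample_info):
        print(f"[{issue.level}] {issue.validator}: {issue.message}")
```

Returning a list of issues rather than raising on the first failure lets an orchestrator collect every problem in one pass and separate errors from warnings afterwards.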

Test plan

  • All 40 new tests in test_v3_checks.py pass
  • All 74 tests (including existing ones) pass with pytest tests/ -v
  • Run validator against a real partner dataset to verify end-to-end behavior
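
The timestamp rules (V7 and V12) are the kind of logic such tests would exercise. The sketch below is an illustration under stated assumptions, not the module's implementation: the names `is_plausible_epoch` and `classify_episode_timestamps`, the heuristic that any value at or above the year-2000 epoch counts as "absolute", and the `max_start_offset` threshold of 1.0 second are all hypothetical.

```python
from datetime import datetime, timezone

# Bounds for a "plausible" Unix epoch value: year 2000 up to year 2100 (UTC).
EPOCH_MIN = datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp()
EPOCH_MAX = datetime(2100, 1, 1, tzinfo=timezone.utc).timestamp()


def is_plausible_epoch(value) -> bool:
    """V12-style check: value parses as a float inside the 2000-2100 range."""
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        return False
    # NaN and infinities fail this comparison and are rejected.
    return EPOCH_MIN <= seconds < EPOCH_MAX


def classify_episode_timestamps(timestamps, max_start_offset=1.0):
    """V7-style check on one episode's timestamps; returns (errors, warnings)."""
    errors, warnings = [], []
    if not timestamps:
        return errors, warnings
    # Assumed heuristic: values this large are epoch seconds, not
    # episode-relative seconds.
    if any(t >= EPOCH_MIN for t in timestamps):
        errors.append("absolute Unix epoch timestamp found in data parquet")
        return errors, warnings
    if any(later < earlier for earlier, later in zip(timestamps, timestamps[1:])):
        warnings.append("timestamps are not monotonic")
    if timestamps[0] > max_start_offset:
        warnings.append(f"episode starts at a large offset: {timestamps[0]}")
    return errors, warnings


if __name__ == "__main__":
    print(is_plausible_epoch(1774242720.0))
    print(classify_episode_timestamps([0.0, 0.033, 0.066]))
    print(classify_episode_timestamps([5.0, 4.9]))
```

Keeping the checks as pure functions over plain values means edge cases (empty episodes, NaN, string input) can be tested without building parquet fixtures.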

🤖 Generated with Claude Code

Add 6 P0 validators as lerobot_validator/v3_checks.py to catch the most
common data quality issues before partner upload:

- V1  validate_tasks_format: error if no tasks file, warn if only jsonl
- V2  validate_codebase_version: require codebase_version starts with v3.
- V5  validate_feature_shapes: reject shape=[], require 3-element image shapes
- V7  validate_timestamps: reject absolute Unix epoch in data parquets
- V11 validate_custom_metadata_csv: require episode_index/episode_id, reject
      null/duplicate episode_ids
- V12 validate_start_timestamp: require plausible Unix epoch floats

Wire validate_v3_dataset() into the LerobotDatasetValidator orchestrator
so errors surface automatically, and add get_warnings() support. Update
existing test fixtures to include codebase_version and tasks.parquet so
integration tests pass with the new checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kck325 commented Mar 23, 2026

Closing duplicate -- changes pushed to PR #8 on branch chandra/lerobot-v3-metadata-checker instead.

kck325 closed this Mar 23, 2026
kck325 deleted the feat/v3-validators branch Mar 23, 2026 05:12