A comprehensive data preprocessing pipeline for preparing multiple autonomous driving datasets for the **DriveFusionQA** project. This repository handles the preprocessing and standardization of diverse driving-related question-answering datasets into unified formats suitable for training and evaluating vision-language models.
This project consolidates and preprocesses multiple driving-domain datasets into consistent formats (Llama, LLaVA, etc.), enabling seamless integration with DriveFusionQA's model training pipelines. It provides dataset-specific preprocessors and JSON file creators to standardize data across different sources.
The following datasets are currently supported and can be preprocessed:
- LingoQA: A driving-focused visual question-answering dataset with action and scenery components
- DriveGPT4 + BDD-X: Comprehensive driving dataset combining DriveGPT4 and BDD-X evaluation benchmarks
- DriveLM: Autonomous driving language modeling dataset (v1.1)
- MapLM v2: Map-based language modeling dataset with training, validation, and test splits
- nuInstruct + nuScenes: Comprehensive autonomous driving instruction dataset built on nuScenes
- COVLA: An additional supported driving vision-language dataset

Key features of the pipeline:
- Multi-dataset Support: Handle diverse dataset formats and structures
- Flexible Output Formats: Convert datasets to Llama and LLaVA instruction-following formats
- Modular Architecture: Dataset-specific preprocessors and JSON creators for extensibility
- Format Validation: Built-in testers for LLaVA and Llama formats to ensure data integrity
- Batch Processing: Efficient handling of large-scale driving datasets
- Image Path Management: Robust handling of image references and path validation
```
drivefusion_preprocessing/
├── dataset_preprocessors/                  # Dataset-specific preprocessing logic
│   ├── dataset_preprocessor.py             # Base preprocessor class
│   ├── drivegpt4_bddx_dataset_preprocessor.py
│   ├── drivelm_dataset_preprocessor.py
│   ├── lingo_dataset_preprocessor.py
│   ├── maplm_dataset_preprocessor.py
│   └── nuinstruct_nuscenes_dataset_preprocessor.py
├── json_file_creators/                     # Format converters for output datasets
│   ├── json_file_creator.py                # Base JSON creator class
│   ├── drivegpt4_bddx_json_file_creator.py
│   ├── drivelm_json_creator.py
│   ├── lingo_json_file_creator.py
│   ├── maplm_json_file_creator.py
│   └── nuinstruct_nuscenes_json_file_creator.py
└── testers/                                # Format validation utilities
    ├── format_tester.py
    ├── llama_format_tester.py
    └── llava_format_tester.py
```
- Clone the repository:
```bash
git clone https://github.com/DriveFusion/data-preprocessing.git
cd data-preprocessing
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The pipeline requires:
- Python 3.10+
- PyTorch 2.5.1
- Transformers (from HuggingFace)
- pandas, pyarrow, fastparquet
- OpenCV for image processing
- Additional dependencies listed in `requirements.txt`
The main preprocessing entry point is `preprocess_datasets.py`:
```python
from preprocess_datasets import preprocess_lingoqa, preprocess_drivegpt4, preprocess_drivelm

# Preprocess the LingoQA dataset
preprocess_lingoqa(system_path='/path/to/data')

# Preprocess other datasets
preprocess_drivegpt4(system_path='/path/to/data')
preprocess_drivelm(system_path='/path/to/data')
```

To process a new dataset:
- Create a preprocessor class inheriting from `DatasetPreprocessor`
- Implement the `llama_train_preprocess()` and `llava_train_preprocess()` methods
- Create a corresponding JSON file creator to generate the output format (see the sketch after this list)
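As a concrete sketch of these steps, a new preprocessor might look like the following. This is a hypothetical example: the base-class import path, constructor signature, record keys, and the `_load_raw_annotations()` helper are assumptions for illustration, not the repository's documented interface.

```python
import json
import os

from dataset_preprocessors.dataset_preprocessor import DatasetPreprocessor


class MyDatasetPreprocessor(DatasetPreprocessor):
    """Hypothetical preprocessor for a new driving-QA dataset."""

    def __init__(self, system_path):
        # Constructor signature is assumed; match the real base class.
        super().__init__()
        self.system_path = system_path

    def llama_train_preprocess(self):
        # Emit Llama-style records with an inline '<image>' token.
        return [
            {
                'image': item['image_path'],
                'question': '<image>' + item['question'],
                'answer': item['answer'],
            }
            for item in self._load_raw_annotations()
        ]

    def llava_train_preprocess(self):
        # Same records, but with the LLaVA '<image>\n' token convention.
        return [
            {
                'image': item['image_path'],
                'question': '<image>\n' + item['question'],
                'answer': item['answer'],
            }
            for item in self._load_raw_annotations()
        ]

    def _load_raw_annotations(self):
        # Placeholder loader; real datasets ship their own annotation layout.
        with open(os.path.join(self.system_path, 'annotations.json')) as f:
            return json.load(f)
```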
- Llama Format: Instruction-following format compatible with Llama models, using `<image>` tokens
- LLaVA Format: Vision-language format using `<image>\n` token formatting
Both formats follow the standard instruction-following dataset structure with image paths, questions, and answers.
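To make the distinction concrete, here is a pair of hypothetical records. The key names are assumptions based on the "image paths, questions, and answers" structure described above, not a documented schema:

```python
# Llama format: '<image>' token inline in the question.
llama_sample = {
    'image': 'images/lingoqa/scene_000001.jpg',
    'question': '<image>What should the ego vehicle do at this intersection?',
    'answer': 'Slow down and yield to the pedestrian crossing from the right.',
}

# LLaVA format: '<image>\n' prefix on the question instead.
llava_sample = {
    'image': 'images/lingoqa/scene_000001.jpg',
    'question': '<image>\nWhat should the ego vehicle do at this intersection?',
    'answer': 'Slow down and yield to the pedestrian crossing from the right.',
}
```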
Use the built-in testers to validate preprocessed datasets:
```python
from drivefusion_preprocessing.testers import LLaVAFormatTester, LlamaFormatTester

# Validate LLaVA format
tester = LLaVAFormatTester(json_path='path/to/dataset.json')
tester.validate()

# Validate Llama format
tester = LlamaFormatTester(json_path='path/to/dataset.json')
tester.validate()
```

The `utils.py` file provides helper functions (a usage sketch follows the list):
- `dataset_length(json_path)`: Get the number of samples in a dataset
- `check_images_existence(json_path)`: Validate that all image paths exist
- Additional data manipulation and verification utilities
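A typical pre-training sanity check might chain these helpers as below. The output path is hypothetical, and a numeric return value from `dataset_length` is assumed rather than documented:

```python
from utils import check_images_existence, dataset_length

# Hypothetical output path from an earlier preprocessing run.
json_path = 'output/lingoqa_llava_train.json'

# dataset_length is documented to report the number of samples;
# a numeric return value is assumed here.
print(f'{json_path} contains {dataset_length(json_path)} samples')

# Validate that every image referenced by the dataset exists on disk.
check_images_existence(json_path)
```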
Contributions are welcome! To add support for new datasets:
- Create a new preprocessor in `dataset_preprocessors/`
- Create a corresponding JSON creator in `json_file_creators/`
- Add tests to validate the output format
- Update the documentation
For issues, questions, or feedback, please open an issue on the GitHub repository. You can also refer to the main DriveFusion project for additional resources and documentation.
This repository is part of the DriveFusion project. See the LICENSE file for details.