Data Preprocessing for DriveFusionQA

A comprehensive data preprocessing pipeline for preparing multiple autonomous driving datasets for the **DriveFusionQA** project. This repository handles the preprocessing and standardization of diverse driving-related question-answering datasets into unified formats suitable for training and evaluating vision-language models.

Overview

This project consolidates and preprocesses multiple driving-domain datasets into consistent formats (currently Llama and LLaVA), enabling seamless integration with DriveFusionQA's model training pipelines. It provides dataset-specific preprocessors and JSON file creators to standardize data across different sources.

Supported Datasets

The following datasets are currently supported and can be preprocessed:

  • LingoQA: A driving-focused visual question-answering dataset with action and scenery components
  • DriveGPT4 + BDD-X: Comprehensive driving dataset combining DriveGPT4 and BDD-X evaluation benchmarks
  • DriveLM: Autonomous driving language modeling dataset (v1.1)
  • MapLM v2: Map-based language modeling dataset with training, validation, and test splits
  • nuInstruct + nuScenes: Comprehensive autonomous driving instruction dataset built on nuScenes
  • COVLA: Additional driving vision-language dataset support

Features

  • Multi-dataset Support: Handle diverse dataset formats and structures
  • Flexible Output Formats: Convert datasets to Llama and LLaVA instruction-following formats
  • Modular Architecture: Dataset-specific preprocessors and JSON creators for extensibility
  • Format Validation: Built-in testers for LLaVA and Llama formats to ensure data integrity
  • Batch Processing: Efficient handling of large-scale driving datasets
  • Image Path Management: Robust handling of image references and path validation

Project Structure

drivefusion_preprocessing/
├── dataset_preprocessors/        # Dataset-specific preprocessing logic
│   ├── dataset_preprocessor.py   # Base preprocessor class
│   ├── drivegpt4_bddx_dataset_preprocessor.py
│   ├── drivelm_dataset_preprocessor.py
│   ├── lingo_dataset_preprocessor.py
│   ├── maplm_dataset_preprocessor.py
│   └── nuinstruct_nuscenes_dataset_preprocessor.py
├── json_file_creators/           # Format converters for output datasets
│   ├── json_file_creator.py      # Base JSON creator class
│   ├── drivegpt4_bddx_json_file_creator.py
│   ├── drivelm_json_creator.py
│   ├── lingo_json_file_creator.py
│   ├── maplm_json_file_creator.py
│   └── nuinstruct_nuscenes_json_file_creator.py
└── testers/                      # Format validation utilities
    ├── format_tester.py
    ├── llama_format_tester.py
    └── llava_format_tester.py

Installation

  1. Clone the repository:

git clone https://github.com/DriveFusion/data-preprocessing.git
cd data-preprocessing

  2. Install dependencies:

pip install -r requirements.txt

Requirements

  • Python 3.10+
  • PyTorch 2.5.1
  • Transformers (from HuggingFace)
  • pandas, pyarrow, fastparquet
  • OpenCV for image processing
  • Additional dependencies listed in requirements.txt

Usage

Preprocessing a Dataset

The main preprocessing entry point is preprocess_datasets.py:

from preprocess_datasets import preprocess_lingoqa, preprocess_drivegpt4, preprocess_drivelm

# Preprocess LingoQA dataset
preprocess_lingoqa(system_path='/path/to/data')

# Preprocess other datasets
preprocess_drivegpt4(system_path='/path/to/data')
preprocess_drivelm(system_path='/path/to/data')

Custom Dataset Processing

To process a new dataset:

  1. Create a preprocessor class inheriting from DatasetPreprocessor
  2. Implement the llama_train_preprocess() and llava_train_preprocess() methods
  3. Create a corresponding JSON file creator to generate the output format (see the sketch below)
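
For illustration, here is a minimal sketch of such a preprocessor. The DatasetPreprocessor base class and the two method names come from this repository; the record schema, the example class name, and the _raw_records() helper are invented for the example, and the real base class may require additional constructor arguments:

from dataset_preprocessors.dataset_preprocessor import DatasetPreprocessor

class MyDrivingQADatasetPreprocessor(DatasetPreprocessor):
    """Hypothetical preprocessor for a new driving QA dataset."""

    def _raw_records(self):
        # Hypothetical loader; assumes raw annotations are dicts with
        # 'image_path', 'question', and 'answer' keys.
        yield {
            'image_path': 'images/0001.jpg',
            'question': 'Is the traffic light ahead green?',
            'answer': 'Yes, the light is green.',
        }

    def llama_train_preprocess(self):
        # Llama format: prefix the question with the bare <image> token.
        return [
            {
                'image': r['image_path'],
                'question': f"<image> {r['question']}",
                'answer': r['answer'],
            }
            for r in self._raw_records()
        ]

    def llava_train_preprocess(self):
        # LLaVA format: use the "<image>\n" token variant instead.
        return [
            {
                'image': r['image_path'],
                'question': f"<image>\n{r['question']}",
                'answer': r['answer'],
            }
            for r in self._raw_records()
        ]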

Dataset Formats

Output Formats

  • Llama Format: Instruction-following format compatible with Llama models using <image> tokens
  • LLaVA Format: Vision-language format with <image>\n token formatting

Both formats follow the standard instruction-following dataset structure with image paths, questions, and answers.
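
As a point of reference, a LLaVA-style sample typically looks like the JSON below. The field layout follows the common LLaVA convention and the values are illustrative; the exact schema emitted by this repository's JSON file creators may differ:

[
  {
    "id": "sample_000001",
    "image": "images/scene_000001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat should the ego vehicle do at this intersection?"},
      {"from": "gpt", "value": "Slow down and yield to the pedestrian crossing from the right."}
    ]
  }
]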

Validation & Testing

Use the built-in testers to validate preprocessed datasets:

from drivefusion_preprocessing.testers import LLaVAFormatTester, LlamaFormatTester

# Validate LLaVA format
tester = LLaVAFormatTester(json_path='path/to/dataset.json')
tester.validate()

# Validate Llama format
tester = LlamaFormatTester(json_path='path/to/dataset.json')
tester.validate()

Utilities

The utils.py file provides helper functions:

  • dataset_length(json_path): Get the number of samples in a dataset
  • check_images_existence(json_path): Validate that all image paths exist
  • Additional data manipulation and verification utilities (see the example below)
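
For example (the output path below is hypothetical):

from utils import dataset_length, check_images_existence

# Number of samples in a preprocessed dataset.
print(dataset_length('output/lingoqa_llava_train.json'))

# Verify that every image referenced in the JSON exists on disk.
check_images_existence('output/lingoqa_llava_train.json')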

Contributing

Contributions are welcome! To add support for new datasets:

  1. Create a new preprocessor in dataset_preprocessors/
  2. Create a corresponding JSON creator in json_file_creators/
  3. Add tests to validate the output format
  4. Update the documentation

Support

For issues, questions, or feedback, please open an issue on the GitHub repository. You can also refer to the main DriveFusion project for additional resources and documentation.

License

This project is part of the DriveFusion project. See the LICENSE file for details.
