A comprehensive data preprocessing pipeline for preparing multiple autonomous driving datasets for the **DriveFusionQA** project. This repository handles the preprocessing and standardization of diverse driving-related question-answering datasets into unified formats suitable for training and evaluating vision-language models.
This project consolidates and preprocesses multiple driving-domain datasets into consistent formats (Llama, LLaVA, etc.), enabling seamless integration with DriveFusionQA's model training pipelines. It provides dataset-specific preprocessors and JSON file creators to standardize data across different sources.
The following datasets are currently supported and can be preprocessed:
- LingoQA: A driving-focused visual question-answering dataset with action and scenery components
- DriveGPT4 + BDD-X: Comprehensive driving dataset combining DriveGPT4 and BDD-X evaluation benchmarks
- DriveLM: Autonomous driving language modeling dataset (v1.1)
- MapLM v2: Map-based language modeling dataset with training, validation, and test splits
- nuInstruct + nuScenes: Comprehensive autonomous driving instruction dataset built on nuScenes
- COVLA: An additional supported driving vision-language dataset

Key features of the pipeline:
- Multi-dataset Support: Handle diverse dataset formats and structures
- Flexible Output Formats: Convert datasets to Llama and LLaVA instruction-following formats
- Modular Architecture: Dataset-specific preprocessors and JSON creators for extensibility
- Format Validation: Built-in testers for LLaVA and Llama formats to ensure data integrity
- Batch Processing: Efficient handling of large-scale driving datasets
- Image Path Management: Robust handling of image references and path validation
```
drivefusion_preprocessing/
├── dataset_preprocessors/                  # Dataset-specific preprocessing logic
│   ├── dataset_preprocessor.py             # Base preprocessor class
│   ├── drivegpt4_bddx_dataset_preprocessor.py
│   ├── drivelm_dataset_preprocessor.py
│   ├── lingo_dataset_preprocessor.py
│   ├── maplm_dataset_preprocessor.py
│   └── nuinstruct_nuscenes_dataset_preprocessor.py
├── json_file_creators/                     # Format converters for output datasets
│   ├── json_file_creator.py                # Base JSON creator class
│   ├── drivegpt4_bddx_json_file_creator.py
│   ├── drivelm_json_creator.py
│   ├── lingo_json_file_creator.py
│   ├── maplm_json_file_creator.py
│   └── nuinstruct_nuscenes_json_file_creator.py
└── testers/                                # Format validation utilities
    ├── format_tester.py
    ├── llama_format_tester.py
    └── llava_format_tester.py
```
- Clone the repository:
```bash
git clone https://github.com/DriveFusion/data-preprocessing.git
cd data-preprocessing
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The pipeline requires:
- Python 3.10+
- PyTorch 2.5.1
- Transformers (from HuggingFace)
- pandas, pyarrow, fastparquet
- OpenCV for image processing
- Additional dependencies listed in `requirements.txt`
The main preprocessing entry point is `preprocess_datasets.py`:
```python
from preprocess_datasets import preprocess_lingoqa, preprocess_drivegpt4, preprocess_drivelm

# Preprocess the LingoQA dataset
preprocess_lingoqa(system_path='/path/to/data')

# Preprocess other datasets
preprocess_drivegpt4(system_path='/path/to/data')
preprocess_drivelm(system_path='/path/to/data')
```

To process a new dataset:
- Create a preprocessor class inheriting from `DatasetPreprocessor`
- Implement the `llama_train_preprocess()` and `llava_train_preprocess()` methods
- Create a corresponding JSON file creator to generate the output format (see the sketch after this list)
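As a concrete sketch of these steps, a new preprocessor might look like the following. This is a hypothetical example: the base-class import path, constructor signature, record keys, and the `_load_raw_annotations()` helper are assumptions for illustration, not the repository's documented interface.

```python
import json
import os

from dataset_preprocessors.dataset_preprocessor import DatasetPreprocessor


class MyDatasetPreprocessor(DatasetPreprocessor):
    """Hypothetical preprocessor for a new driving-QA dataset."""

    def __init__(self, system_path):
        # Constructor signature is assumed; match the real base class.
        super().__init__()
        self.system_path = system_path

    def llama_train_preprocess(self):
        # Emit Llama-style records with an inline '<image>' token.
        return [
            {
                'image': item['image_path'],
                'question': '<image>' + item['question'],
                'answer': item['answer'],
            }
            for item in self._load_raw_annotations()
        ]

    def llava_train_preprocess(self):
        # Same records, but with the LLaVA '<image>\n' token convention.
        return [
            {
                'image': item['image_path'],
                'question': '<image>\n' + item['question'],
                'answer': item['answer'],
            }
            for item in self._load_raw_annotations()
        ]

    def _load_raw_annotations(self):
        # Placeholder loader; real datasets ship their own annotation layout.
        with open(os.path.join(self.system_path, 'annotations.json')) as f:
            return json.load(f)
```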
- Llama Format: Instruction-following format compatible with Llama models, using `<image>` tokens
- LLaVA Format: Vision-language format using `<image>\n` token formatting
Both formats follow the standard instruction-following dataset structure with image paths, questions, and answers.
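To make the distinction concrete, here is a pair of hypothetical records. The key names are assumptions based on the "image paths, questions, and answers" structure described above, not a documented schema:

```python
# Llama format: '<image>' token inline in the question.
llama_sample = {
    'image': 'images/lingoqa/scene_000001.jpg',
    'question': '<image>What should the ego vehicle do at this intersection?',
    'answer': 'Slow down and yield to the pedestrian crossing from the right.',
}

# LLaVA format: '<image>\n' prefix on the question instead.
llava_sample = {
    'image': 'images/lingoqa/scene_000001.jpg',
    'question': '<image>\nWhat should the ego vehicle do at this intersection?',
    'answer': 'Slow down and yield to the pedestrian crossing from the right.',
}
```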
Use the built-in testers to validate preprocessed datasets:
```python
from drivefusion_preprocessing.testers import LLaVAFormatTester, LlamaFormatTester

# Validate LLaVA format
tester = LLaVAFormatTester(json_path='path/to/dataset.json')
tester.validate()

# Validate Llama format
tester = LlamaFormatTester(json_path='path/to/dataset.json')
tester.validate()
```

The `utils.py` file provides helper functions (a usage sketch follows the list):
- `dataset_length(json_path)`: Get the number of samples in a dataset
- `check_images_existence(json_path)`: Validate that all image paths exist
- Additional data manipulation and verification utilities
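A typical pre-training sanity check might chain these helpers as below. The output path is hypothetical, and a numeric return value from `dataset_length` is assumed rather than documented:

```python
from utils import check_images_existence, dataset_length

# Hypothetical output path from an earlier preprocessing run.
json_path = 'output/lingoqa_llava_train.json'

# dataset_length is documented to report the number of samples;
# a numeric return value is assumed here.
print(f'{json_path} contains {dataset_length(json_path)} samples')

# Validate that every image referenced by the dataset exists on disk.
check_images_existence(json_path)
```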
Contributions are welcome! To add support for new datasets:
- Create a new preprocessor in `dataset_preprocessors/`
- Create a corresponding JSON creator in `json_file_creators/`
- Add tests to validate the output format
- Update the documentation
For issues, questions, or feedback, please open an issue on the GitHub repository. You can also refer to the main DriveFusion project for additional resources and documentation.
This repository is part of the DriveFusion project. See the LICENSE file for details.