A multimodal AI model that combines vision, language, and driving context to generate natural language descriptions and predict driving trajectories and target speeds from autonomous vehicle data.
DriveFusion is an advanced multimodal model built on top of Qwen2.5-VL that integrates:
- Vision Processing: Handles images and videos from vehicle cameras
- Language Understanding: Processes and generates natural language descriptions
- Driving Intelligence: Predicts vehicle trajectories and target speeds using GPS and speed data
- Multi-modal Fusion: Seamlessly combines all modalities for comprehensive scene understanding
This model is designed for autonomous driving applications, enabling vehicles to understand their environment and predict safe driving behaviors.
| Model | Task / Description | Link |
|---|---|---|
| DriveFusion-V0.2 | Full Multimodal. Includes MLP heads for GPS/Speed context and trajectory prediction. | Check on Hugging Face |
| DriveFusionQA | Scenario Reasoning. Optimized for high-accuracy driving-related Q&A and scenario explanation. | Check on Hugging Face |
DriveFusion pairs the Qwen2.5-VL backbone with a set of driving-specific prediction heads:

- Vision Transformer: Processes images and videos using a 32-layer transformer
- Image/Video Processor: Handles preprocessing of visual inputs with configurable patch sizes
- Text Model: Qwen2.5-VL text encoder with 36 transformer layers
- Tokenizer: Qwen2 tokenizer with 151,936 vocabulary tokens
- SpeedMLP: Predicts vehicle target speeds from speed context
- GPSTargetPointsMLP: Processes GPS coordinates for trajectory context
- TrajectoryMLP: Generates trajectory predictions (20 points)
- TargetSpeedMLP: Generates target speed predictions (10 values)
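The exact layer sizes of these heads live in `modeling_drivefusion.py`; as a rough illustration of the pattern, here is a minimal sketch assuming simple two-layer MLPs, the configured `hidden_size` of 2048, and a hypothetical shared `PredictionHead` class standing in for the separate `TrajectoryMLP`/`TargetSpeedMLP` modules (whose real output dimensions may differ):

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative stand-in for the TrajectoryMLP / TargetSpeedMLP heads:
    a small MLP mapping a fused hidden state to a fixed number of queries."""

    def __init__(self, hidden_size: int = 2048, num_queries: int = 20, out_dim: int = 2):
        super().__init__()
        self.num_queries = num_queries
        self.out_dim = out_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_queries * out_dim),
        )

    def forward(self, fused_state: torch.Tensor) -> torch.Tensor:
        # fused_state: (batch_size, hidden_size) -> (batch_size, num_queries, out_dim)
        return self.mlp(fused_state).view(-1, self.num_queries, self.out_dim)

# Shape check with the query counts listed above
fused = torch.randn(2, 2048)
trajectory_head = PredictionHead(num_queries=20, out_dim=2)    # 20 (x, y) waypoints
target_speed_head = PredictionHead(num_queries=10, out_dim=1)  # 10 speed values
print(trajectory_head(fused).shape)    # torch.Size([2, 20, 2])
print(target_speed_head(fused).shape)  # torch.Size([2, 10, 1])
```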
The DriveFusionProcessor handles:
- Image/video preprocessing with configurable resolutions
- Text tokenization and templating
- Integration of GPS and speed data
- Batch processing and tensor conversion
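The processor's real call signature is defined in `processing_drivefusion.py`; the sketch below only illustrates the general flow, assuming a standard Hugging Face processor interface and hypothetical `gps`/`speed` keyword arguments for the driving context:

```python
from PIL import Image
from drivefusion import DriveFusionProcessor

processor = DriveFusionProcessor.from_pretrained("./path/to/model")

# Hypothetical call: text/images/return_tensors follow the usual Hugging Face
# processor interface; the gps and speed keyword names are assumptions.
image = Image.open("path/to/image.jpg")
inputs = processor(
    text=["Describe this driving scene."],
    images=[image],
    gps=[[40.7128, -74.0060]],   # assumed keyword for GPS context
    speed=[[30.5]],              # assumed keyword for speed context
    return_tensors="pt",
)
print({k: tuple(v.shape) for k, v in inputs.items() if hasattr(v, "shape")})
```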
Key features:

- Multimodal Input Processing: Accepts images, videos, text, GPS coordinates, and speed data
- Trajectory Prediction: Generates predicted vehicle trajectories (20 queries)
- Speed Prediction: Predicts target speeds for the vehicle (10 queries)
- Language Generation: Produces natural language descriptions of driving scenes
- Flexible Architecture: Built on proven transformer architecture with custom driving-specific modules
- GPU Optimized: Supports mixed precision inference with float16 for efficient processing
- Clone the repository:

  ```bash
  git clone https://github.com/DriveFusion/drivefusion.git
  cd drivefusion
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

Requirements:

- Python 3.10+
- PyTorch with CUDA support
- transformers 4.52.4+
- CUDA-capable GPU (recommended for inference)
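A quick, purely illustrative way to confirm the environment meets these requirements:

```python
import torch
import transformers

print("transformers:", transformers.__version__)   # expect 4.52.4 or newer
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```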
Basic usage:

```python
import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output
# Load the model (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")
# Prepare input message with image
message = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "path/to/image.jpg",
},
{
"type": "text",
"text": "Describe this driving scene and predict the vehicle's next moves."
}
],
}
]
# Optional: Provide GPS and speed data
gps_points = [[40.7128, -74.0060], [40.7130, -74.0058]] # Latitude, Longitude pairs
speed_data = [[30.5]] # Speed values
# Generate predictions
output = generate_drivefusion_output(
model=model,
processor=processor,
message=message,
gps=gps_points,
speed=speed_data,
max_new_tokens=4000,
device="cuda"
)
# Access results
print("Description:", output["text"])
print("Trajectory:", output["trajectory"])
print("Target Speeds:", output["target_speeds"])Model configuration is stored in config/config.json. Key parameters include:
- `hidden_size`: 2048 (model dimension)
- `num_hidden_layers`: 36 (transformer layers)
- `num_attention_heads`: 16 (attention heads)
- `num_trajectory_queries`: 20 (trajectory prediction points)
- `num_target_speed_queries`: 10 (speed prediction points)
- `max_position_embeddings`: 128000 (maximum sequence length)
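If you want to inspect these values programmatically, a small sketch assuming `config/config.json` is plain JSON containing the keys above:

```python
import json

# Read the DriveFusion configuration file and print the key parameters
with open("config/config.json") as f:
    config = json.load(f)

for key in (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_trajectory_queries",
    "num_target_speed_queries",
    "max_position_embeddings",
):
    print(f"{key}: {config.get(key)}")
```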
Project structure:

```
drivefusion/
├── __init__.py
├── configuration_drivefusion.py   # Model configuration
├── generation_drivefusion.py      # Generation utilities
├── modeling_drivefusion.py        # Core model architecture
├── processing_drivefusion.py      # Input processor
├── config/
│   ├── config.json                # DriveFusion config
│   └── qwen_config.json           # Qwen2.5-VL config
├── main.py                        # Model setup examples
├── utils.py                       # Helper functions
├── requirements.txt               # Dependencies
└── README.md
```
Usage examples:

```python
import torch
from drivefusion import DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output
# Load model (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")
# Simple image description
message = [{
"role": "user",
"content": [
{"type": "image", "image": "road_scene.jpg"},
{"type": "text", "text": "What's happening in this traffic scene?"}
]
}]
result = generate_drivefusion_output(model, processor, message)
print(result["text"])# Complete driving context
message = [{
"role": "user",
"content": [
{"type": "image", "image": "highway.jpg"},
{"type": "text", "text": "Analyze this highway scene and predict the next trajectory."}
]
}]
gps = [[37.7749, -122.4194], [37.7750, -122.4190]]
speed = [[65.0], [66.5]]
result = generate_drivefusion_output(
model, processor, message,
gps=gps, speed=speed, use_queries=True
)
print("Scene analysis:", result["text"])
print("Predicted waypoints:", result["trajectory"].shape)
print("Target speeds:", result["target_speeds"])The model generates:
- Text Output: Natural language description of the driving scene
- Trajectory: Predicted vehicle trajectory (shape: `batch_size × num_trajectory_queries × 2`); each point represents a predicted waypoint (x, y coordinates)
- Target Speeds: Predicted vehicle speeds (shape: `batch_size × num_target_speed_queries × 2`); speed values for each prediction point
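Assuming the `trajectory` and `target_speeds` fields come back as PyTorch tensors with the shapes above, a short sketch of turning them into plain Python values (the tensors here are random stand-ins):

```python
import torch

# Stand-ins with the documented shapes
trajectory = torch.randn(1, 20, 2)     # batch_size × num_trajectory_queries × 2
target_speeds = torch.randn(1, 10, 2)  # batch_size × num_target_speed_queries × 2

waypoints = trajectory[0].tolist()     # list of [x, y] pairs for the first sample
speeds = target_speeds[0].tolist()

for i, (x, y) in enumerate(waypoints[:5]):
    print(f"waypoint {i}: x={x:.2f}, y={y:.2f}")
```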
System requirements:

- GPU: NVIDIA GPU with 24GB+ VRAM recommended (for float16 inference)
- Memory: 16GB+ RAM
- Python: 3.8 or higher
- CUDA: 11.8+
Performance tips:

- Use Mixed Precision: The default float16 roughly halves memory usage compared to float32 inference
- Batch Processing: Process multiple inputs together for better throughput
- Optimize Input Size: Adjust image resolutions to balance quality and speed
- GPU Memory: Monitor VRAM usage, reduce batch size if needed
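One way to keep an eye on VRAM while tuning batch size, using standard PyTorch calls rather than anything DriveFusion-specific:

```python
import torch

def report_vram(tag: str) -> None:
    """Print current and peak GPU memory usage in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")

report_vram("before inference")
# ... run generate_drivefusion_output(...) here ...
report_vram("after inference")
```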
See `requirements.txt` for the complete list:
- `transformers==4.52.4` - Hugging Face transformers library
- `torch` - PyTorch deep learning framework
- `torchvision` - Computer vision utilities
- `torchaudio` - Audio processing utilities
- `qwen-vl-utils` - Qwen vision-language utilities
- `accelerate` - Distributed training utilities
- `llmcompressor` - Model compression tools
- `deepdiff` - Deep object comparison
For detailed implementation:
- Check `main.py` for model initialization examples
- See `utils.py` for helper functions
- Review the `drivefusion/` modules for implementation details
If you hit CUDA out-of-memory errors:

- Reduce the batch size
- Use float16 precision (already default)
- Reduce max_new_tokens parameter
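A sketch of applying these mitigations around the `generate_drivefusion_output` call from the quick start; the retry wrapper is illustrative, not part of DriveFusion, and `torch.cuda.OutOfMemoryError` requires PyTorch 1.13+:

```python
import torch
from drivefusion.utils import generate_drivefusion_output

def generate_with_oom_fallback(model, processor, message, max_new_tokens=4000):
    """Retry with a smaller generation budget if the GPU runs out of memory."""
    try:
        return generate_drivefusion_output(
            model, processor, message, max_new_tokens=max_new_tokens, device="cuda"
        )
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return generate_drivefusion_output(
            model, processor, message, max_new_tokens=max_new_tokens // 2, device="cuda"
        )
```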
If CUDA is not available:

- Verify the CUDA installation: `python -c "import torch; print(torch.cuda.is_available())"`
- Install a CUDA-compatible PyTorch build: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
If the model fails to load:

- Verify the model path/ID is correct
- Ensure sufficient disk space for model weights
- Check internet connection for remote model downloads
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
DriveFusion Team - Graduation Project
This is a graduation project from the DriveFusion team. For more information about the project, visit the GitHub repository.
