DriveFusion

A multimodal AI model that combines vision, language, and driving context to generate natural language descriptions and predict driving trajectories and target speeds from autonomous vehicle data.

What is DriveFusion?

DriveFusion is an advanced multimodal model built on top of Qwen2.5-VL that integrates:

  • Vision Processing: Handles images and videos from vehicle cameras
  • Language Understanding: Processes and generates natural language descriptions
  • Driving Intelligence: Predicts vehicle trajectories and target speeds using GPS and speed data
  • Multi-modal Fusion: Seamlessly combines all modalities for comprehensive scene understanding

This model is designed for autonomous driving applications, enabling vehicles to understand their environment and predict safe driving behaviors.

Available Models

  • DriveFusion-V0.2: Full multimodal. Includes MLP heads for GPS/Speed context and trajectory prediction. (Check on Hugging Face)
  • DriveFusionQA: Scenario reasoning. Optimized for high-accuracy driving-related Q&A and scenario explanation. (Check on Hugging Face)

Architecture

(DriveFusion architecture diagram)

Core Components

Vision Components

  • Vision Transformer: Processes images and videos using a 32-layer transformer
  • Image/Video Processor: Handles preprocessing of visual inputs with configurable patch sizes

Language Components

  • Text Model: Qwen2.5-VL language model (decoder-only) with 36 transformer layers
  • Tokenizer: Qwen2 tokenizer with 151,936 vocabulary tokens

Driving Components

  • SpeedMLP: Predicts vehicle target speeds from speed context
  • GPSTargetPointsMLP: Processes GPS coordinates for trajectory context
  • TrajectoryMLP: Generates trajectory predictions (20 points)
  • TargetSpeedMLP: Generates target speed predictions (10 values)
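The actual head definitions live in modeling_drivefusion.py. Purely as an illustrative sketch (the layer sizes, activation, and pooling here are assumptions, not the real implementation), a trajectory head that projects a pooled hidden state to 20 (x, y) waypoints could look like this:

import torch
import torch.nn as nn

class TrajectoryHeadSketch(nn.Module):
    """Hypothetical sketch of a trajectory MLP head; not the actual DriveFusion code."""

    def __init__(self, hidden_size: int = 2048, num_queries: int = 20):
        super().__init__()
        self.num_queries = num_queries
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_queries * 2),  # (x, y) per waypoint
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) -> (batch, num_queries, 2)
        return self.mlp(hidden).view(hidden.shape[0], self.num_queries, 2)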

Processing Pipeline

The DriveFusionProcessor handles:

  1. Image/video preprocessing with configurable resolutions
  2. Text tokenization and templating
  3. Integration of GPS and speed data
  4. Batch processing and tensor conversion
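The helper functions shown in Quick Start wrap this pipeline end to end. Purely as a hypothetical sketch of a direct processor call (the keyword names gps and speed, and the signature as a whole, are assumptions inferred from the examples below rather than a confirmed API):

# Hypothetical sketch: keyword names are assumptions, not a confirmed API.
inputs = processor(
    text=["Describe this driving scene."],  # step 2: tokenization/templating
    images=[image],                         # step 1: a loaded PIL image
    gps=[[40.7128, -74.0060]],              # step 3: GPS context points
    speed=[[30.5]],                         # step 3: speed context values
    return_tensors="pt",                    # step 4: batched PyTorch tensors
)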

Key Features

  • Multimodal Input Processing: Accepts images, videos, text, GPS coordinates, and speed data
  • Trajectory Prediction: Generates predicted vehicle trajectories (20 queries)
  • Speed Prediction: Predicts target speeds for the vehicle (10 queries)
  • Language Generation: Produces natural language descriptions of driving scenes
  • Flexible Architecture: Built on proven transformer architecture with custom driving-specific modules
  • GPU Optimized: Supports mixed precision inference with float16 for efficient processing

Getting Started

Installation

  1. Clone the repository:
git clone https://github.com/DriveFusion/drivefusion.git
cd drivefusion
  2. Install dependencies:
pip install -e .

Requirements

  • Python 3.10+
  • PyTorch with CUDA support
  • transformers 4.52.4+
  • CUDA-capable GPU (recommended for inference)

Quick Start

Loading the Model

import torch
from drivefusion import DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model

# Load the model and processor (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")

Generating Output

from drivefusion.utils import generate_drivefusion_output

# Prepare input message with image
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpg",
            },
            {
                "type": "text",
                "text": "Describe this driving scene and predict the vehicle's next moves."
            }
        ],
    }
]

# Optional: Provide GPS and speed data
gps_points = [[40.7128, -74.0060], [40.7130, -74.0058]]  # Latitude, Longitude pairs
speed_data = [[30.5]]  # Speed values

# Generate predictions
output = generate_drivefusion_output(
    model=model,
    processor=processor,
    message=message,
    gps=gps_points,
    speed=speed_data,
    max_new_tokens=4000,
    device="cuda"
)

# Access results
print("Description:", output["text"])
print("Trajectory:", output["trajectory"])
print("Target Speeds:", output["target_speeds"])

Configuration

Model configuration is stored in config/config.json. Key parameters include:

  • hidden_size: 2048 (model dimension)
  • num_hidden_layers: 36 (transformer layers)
  • num_attention_heads: 16 (attention heads)
  • num_trajectory_queries: 20 (trajectory prediction points)
  • num_target_speed_queries: 10 (speed prediction points)
  • max_position_embeddings: 128000 (max sequence length)
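These parameters can be read straight from the JSON file; a minimal sketch, assuming the keys are stored under the names listed above:

import json

# Print the key DriveFusion parameters from config/config.json
with open("config/config.json") as f:
    cfg = json.load(f)

for key in (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_trajectory_queries",
    "num_target_speed_queries",
    "max_position_embeddings",
):
    print(f"{key} = {cfg[key]}")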

Project Structure

drivefusion/
├── __init__.py
├── configuration_drivefusion.py    # Model configuration
├── generation_drivefusion.py       # Generation utilities
├── modeling_drivefusion.py         # Core model architecture
├── processing_drivefusion.py       # Input processor
├── config/
│   ├── config.json                 # DriveFusion config
│   └── qwen_config.json            # Qwen2.5-VL config
├── main.py                         # Model setup examples
├── utils.py                        # Helper functions
├── requirements.txt                # Dependencies
└── README.md                       

Usage Examples

Example 1: Basic Inference with Image

import torch
from drivefusion import DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output

# Load model and processor (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")

# Simple image description
message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "road_scene.jpg"},
        {"type": "text", "text": "What's happening in this traffic scene?"}
    ]
}]

result = generate_drivefusion_output(model, processor, message)
print(result["text"])

Example 2: Full Context with GPS and Speed

# Complete driving context
message = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "highway.jpg"},
        {"type": "text", "text": "Analyze this highway scene and predict the next trajectory."}
    ]
}]

gps = [[37.7749, -122.4194], [37.7750, -122.4190]]
speed = [[65.0], [66.5]]

result = generate_drivefusion_output(
    model, processor, message,
    gps=gps, speed=speed, use_queries=True
)

print("Scene analysis:", result["text"])
print("Predicted waypoints:", result["trajectory"].shape)
print("Target speeds:", result["target_speeds"])

Model Outputs

The model generates:

  1. Text Output: Natural language description of the driving scene
  2. Trajectory: Predicted vehicle trajectory (shape: batch_size × num_trajectory_queries × 2)
    • Each point represents a predicted waypoint (x, y coordinates)
  3. Target Speeds: Predicted vehicle speeds (shape: batch_size × num_target_speed_queries × 2)
    • Speed values for each prediction point
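Continuing the Quick Start example, the three outputs can be unpacked directly from the returned dictionary (key names as used throughout this README):

# Unpack the dictionary returned by generate_drivefusion_output
description = output["text"]             # natural language scene description
trajectory = output["trajectory"]        # (batch_size, num_trajectory_queries, 2)
target_speeds = output["target_speeds"]  # per-point speed predictions

# e.g. print the predicted (x, y) waypoints for the first sample
for x, y in trajectory[0].tolist():
    print(f"waypoint: x={x:.2f}, y={y:.2f}")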

System Requirements

  • GPU: NVIDIA GPU with 24GB+ VRAM recommended (for float16)
  • Memory: 16GB+ RAM
  • Python: 3.10 or higher
  • CUDA: 11.8+

Performance Tips

  1. Use Mixed Precision: the default float16 roughly halves memory usage relative to float32
  2. Batch Processing: Process multiple inputs together for better throughput
  3. Optimize Input Size: Adjust image resolutions to balance quality and speed
  4. GPU Memory: Monitor VRAM usage, reduce batch size if needed
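Tips 1 and 4 combine naturally in practice: keep the default float16 weights and run generation without autograd bookkeeping. A short sketch, reusing the processor and message from Quick Start (torch.inference_mode is standard PyTorch, not DriveFusion-specific):

import torch
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output

# float16 weights roughly halve VRAM relative to float32
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)

# inference_mode disables gradient tracking, lowering peak memory
with torch.inference_mode():
    output = generate_drivefusion_output(
        model=model,
        processor=processor,   # from Quick Start
        message=message,       # from Quick Start
        max_new_tokens=1024,   # a smaller budget also reduces peak memory
        device="cuda",
    )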

Dependencies

See requirements.txt for the complete list:

  • transformers==4.52.4 - Hugging Face transformers library
  • torch - PyTorch deep learning framework
  • torchvision - Computer vision utilities
  • torchaudio - Audio processing utilities
  • qwen-vl-utils - Qwen vision-language utilities
  • accelerate - Distributed training utilities
  • llmcompressor - Model compression tools
  • deepdiff - Deep object comparison

Support & Documentation

For detailed implementation:

  • Check main.py for model initialization examples
  • See utils.py for helper functions
  • Review drivefusion/ modules for implementation details

Troubleshooting

Out of Memory Errors

  • Reduce batch size
  • Use float16 precision (already default)
  • Reduce max_new_tokens parameter
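If out-of-memory errors persist across repeated runs, releasing references and cached allocations between them can help. A minimal sketch using standard PyTorch calls (not DriveFusion-specific):

import gc
import torch

del output                # drop references to large tensors first
gc.collect()              # reclaim Python-side objects
torch.cuda.empty_cache()  # return cached CUDA blocks to the driver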

CUDA Not Available

  • Verify CUDA installation: python -c "import torch; print(torch.cuda.is_available())"
  • Install CUDA-compatible PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Model Not Loading

  • Verify model path/ID is correct
  • Ensure sufficient disk space for model weights
  • Check internet connection for remote model downloads

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Authors

DriveFusion Team - Graduation Project


This is a graduation project from the DriveFusion team. For more information about the project, visit the GitHub repository.
