A multimodal AI model that combines vision, language, and driving context to generate natural language descriptions and predict driving trajectories and target speeds from autonomous vehicle data.
DriveFusion is an advanced multimodal model built on top of Qwen2.5-VL that integrates:
- Vision Processing: Handles images and videos from vehicle cameras
- Language Understanding: Processes and generates natural language descriptions
- Driving Intelligence: Predicts vehicle trajectories and target speeds using GPS and speed data
- Multi-modal Fusion: Seamlessly combines all modalities for comprehensive scene understanding
This model is designed for autonomous driving applications, enabling vehicles to understand their environment and predict safe driving behaviors.
| Model | Task / Description | Link |
|---|---|---|
| DriveFusion-V0.2 | Full Multimodal. Includes MLP heads for GPS/Speed context and trajectory prediction. | Check on Hugging Face |
| DriveFusionQA | Scenario Reasoning. Optimized for high-accuracy driving-related Q&A and scenario explanation. | Check on Hugging Face |
DriveFusion pairs the Qwen2.5-VL backbone with a set of driving-specific prediction heads:

- Vision Transformer: Processes images and videos using a 32-layer transformer
- Image/Video Processor: Handles preprocessing of visual inputs with configurable patch sizes
- Text Model: Qwen2.5-VL text encoder with 36 transformer layers
- Tokenizer: Qwen2 tokenizer with 151,936 vocabulary tokens
- SpeedMLP: Predicts vehicle target speeds from speed context
- GPSTargetPointsMLP: Processes GPS coordinates for trajectory context
- TrajectoryMLP: Generates trajectory predictions (20 points)
- TargetSpeedMLP: Generates target speed predictions (10 values)
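The exact layer sizes of these heads live in `modeling_drivefusion.py`; as a rough illustration of the pattern, here is a minimal sketch assuming simple two-layer MLPs, the configured `hidden_size` of 2048, and a hypothetical shared `PredictionHead` class standing in for the separate `TrajectoryMLP`/`TargetSpeedMLP` modules (whose real output dimensions may differ):

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative stand-in for the TrajectoryMLP / TargetSpeedMLP heads:
    a small MLP mapping a fused hidden state to a fixed number of queries."""

    def __init__(self, hidden_size: int = 2048, num_queries: int = 20, out_dim: int = 2):
        super().__init__()
        self.num_queries = num_queries
        self.out_dim = out_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_queries * out_dim),
        )

    def forward(self, fused_state: torch.Tensor) -> torch.Tensor:
        # fused_state: (batch_size, hidden_size) -> (batch_size, num_queries, out_dim)
        return self.mlp(fused_state).view(-1, self.num_queries, self.out_dim)

# Shape check with the query counts listed above
fused = torch.randn(2, 2048)
trajectory_head = PredictionHead(num_queries=20, out_dim=2)    # 20 (x, y) waypoints
target_speed_head = PredictionHead(num_queries=10, out_dim=1)  # 10 speed values
print(trajectory_head(fused).shape)    # torch.Size([2, 20, 2])
print(target_speed_head(fused).shape)  # torch.Size([2, 10, 1])
```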
The DriveFusionProcessor handles:
- Image/video preprocessing with configurable resolutions
- Text tokenization and templating
- Integration of GPS and speed data
- Batch processing and tensor conversion
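The processor's real call signature is defined in `processing_drivefusion.py`; the sketch below only illustrates the general flow, assuming a standard Hugging Face processor interface and hypothetical `gps`/`speed` keyword arguments for the driving context:

```python
from PIL import Image
from drivefusion import DriveFusionProcessor

processor = DriveFusionProcessor.from_pretrained("./path/to/model")

# Hypothetical call: text/images/return_tensors follow the usual Hugging Face
# processor interface; the gps and speed keyword names are assumptions.
image = Image.open("path/to/image.jpg")
inputs = processor(
    text=["Describe this driving scene."],
    images=[image],
    gps=[[40.7128, -74.0060]],   # assumed keyword for GPS context
    speed=[[30.5]],              # assumed keyword for speed context
    return_tensors="pt",
)
print({k: tuple(v.shape) for k, v in inputs.items() if hasattr(v, "shape")})
```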
Key features:

- Multimodal Input Processing: Accepts images, videos, text, GPS coordinates, and speed data
- Trajectory Prediction: Generates predicted vehicle trajectories (20 queries)
- Speed Prediction: Predicts target speeds for the vehicle (10 queries)
- Language Generation: Produces natural language descriptions of driving scenes
- Flexible Architecture: Built on proven transformer architecture with custom driving-specific modules
- GPU Optimized: Supports mixed precision inference with float16 for efficient processing
- Clone the repository:

  ```bash
  git clone https://github.com/DriveFusion/drivefusion.git
  cd drivefusion
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```

Requirements:

- Python 3.10+
- PyTorch with CUDA support
- transformers 4.52.4+
- CUDA-capable GPU (recommended for inference)
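A quick, purely illustrative way to confirm the environment meets these requirements:

```python
import torch
import transformers

print("transformers:", transformers.__version__)   # expect 4.52.4 or newer
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```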
Basic usage:

```python
import torch
from drivefusion import DriveFusionForConditionalGeneration, DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output
# Load the model (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")
# Prepare input message with image
message = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "path/to/image.jpg",
},
{
"type": "text",
"text": "Describe this driving scene and predict the vehicle's next moves."
}
],
}
]
# Optional: Provide GPS and speed data
gps_points = [[40.7128, -74.0060], [40.7130, -74.0058]] # Latitude, Longitude pairs
speed_data = [[30.5]] # Speed values
# Generate predictions
output = generate_drivefusion_output(
model=model,
processor=processor,
message=message,
gps=gps_points,
speed=speed_data,
max_new_tokens=4000,
device="cuda"
)
# Access results
print("Description:", output["text"])
print("Trajectory:", output["trajectory"])
print("Target Speeds:", output["target_speeds"])Model configuration is stored in config/config.json. Key parameters include:
- `hidden_size`: 2048 (model dimension)
- `num_hidden_layers`: 36 (transformer layers)
- `num_attention_heads`: 16 (attention heads)
- `num_trajectory_queries`: 20 (trajectory prediction points)
- `num_target_speed_queries`: 10 (speed prediction points)
- `max_position_embeddings`: 128000 (maximum sequence length)
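If you want to inspect these values programmatically, a small sketch assuming `config/config.json` is plain JSON containing the keys above:

```python
import json

# Read the DriveFusion configuration file and print the key parameters
with open("config/config.json") as f:
    config = json.load(f)

for key in (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_trajectory_queries",
    "num_target_speed_queries",
    "max_position_embeddings",
):
    print(f"{key}: {config.get(key)}")
```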
Project structure:

```
drivefusion/
├── __init__.py
├── configuration_drivefusion.py   # Model configuration
├── generation_drivefusion.py      # Generation utilities
├── modeling_drivefusion.py        # Core model architecture
├── processing_drivefusion.py      # Input processor
├── config/
│   ├── config.json                # DriveFusion config
│   └── qwen_config.json           # Qwen2.5-VL config
├── main.py                        # Model setup examples
├── utils.py                       # Helper functions
├── requirements.txt               # Dependencies
└── README.md
```
Usage examples:

```python
import torch
from drivefusion import DriveFusionProcessor
from drivefusion.utils import load_drivefusion_model, generate_drivefusion_output
# Load model (replace with your actual model path)
model = load_drivefusion_model("./path/to/model", dtype=torch.float16)
processor = DriveFusionProcessor.from_pretrained("./path/to/model")
# Simple image description
message = [{
"role": "user",
"content": [
{"type": "image", "image": "road_scene.jpg"},
{"type": "text", "text": "What's happening in this traffic scene?"}
]
}]
result = generate_drivefusion_output(model, processor, message)
print(result["text"])# Complete driving context
message = [{
"role": "user",
"content": [
{"type": "image", "image": "highway.jpg"},
{"type": "text", "text": "Analyze this highway scene and predict the next trajectory."}
]
}]
gps = [[37.7749, -122.4194], [37.7750, -122.4190]]
speed = [[65.0], [66.5]]
result = generate_drivefusion_output(
model, processor, message,
gps=gps, speed=speed, use_queries=True
)
print("Scene analysis:", result["text"])
print("Predicted waypoints:", result["trajectory"].shape)
print("Target speeds:", result["target_speeds"])The model generates:
- Text Output: Natural language description of the driving scene
- Trajectory: Predicted vehicle trajectory (shape: `batch_size × num_trajectory_queries × 2`); each point represents a predicted waypoint (x, y coordinates)
- Target Speeds: Predicted vehicle speeds (shape: `batch_size × num_target_speed_queries × 2`); speed values for each prediction point
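Assuming the `trajectory` and `target_speeds` fields come back as PyTorch tensors with the shapes above, a short sketch of turning them into plain Python values (the tensors here are random stand-ins):

```python
import torch

# Stand-ins with the documented shapes
trajectory = torch.randn(1, 20, 2)     # batch_size × num_trajectory_queries × 2
target_speeds = torch.randn(1, 10, 2)  # batch_size × num_target_speed_queries × 2

waypoints = trajectory[0].tolist()     # list of [x, y] pairs for the first sample
speeds = target_speeds[0].tolist()

for i, (x, y) in enumerate(waypoints[:5]):
    print(f"waypoint {i}: x={x:.2f}, y={y:.2f}")
```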
System requirements:

- GPU: NVIDIA GPU with 24GB+ VRAM recommended (for float16 inference)
- Memory: 16GB+ RAM
- Python: 3.8 or higher
- CUDA: 11.8+
Performance tips:

- Use Mixed Precision: The default float16 roughly halves memory usage compared to float32 inference
- Batch Processing: Process multiple inputs together for better throughput
- Optimize Input Size: Adjust image resolutions to balance quality and speed
- GPU Memory: Monitor VRAM usage, reduce batch size if needed
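One way to keep an eye on VRAM while tuning batch size, using standard PyTorch calls rather than anything DriveFusion-specific:

```python
import torch

def report_vram(tag: str) -> None:
    """Print current and peak GPU memory usage in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")

report_vram("before inference")
# ... run generate_drivefusion_output(...) here ...
report_vram("after inference")
```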
See `requirements.txt` for the complete list:
- `transformers==4.52.4` - Hugging Face transformers library
- `torch` - PyTorch deep learning framework
- `torchvision` - Computer vision utilities
- `torchaudio` - Audio processing utilities
- `qwen-vl-utils` - Qwen vision-language utilities
- `accelerate` - Distributed training utilities
- `llmcompressor` - Model compression tools
- `deepdiff` - Deep object comparison
For detailed implementation:
- Check `main.py` for model initialization examples
- See `utils.py` for helper functions
- Review the `drivefusion/` modules for implementation details
If you hit CUDA out-of-memory errors:

- Reduce the batch size
- Use float16 precision (already default)
- Reduce max_new_tokens parameter
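A sketch of applying these mitigations around the `generate_drivefusion_output` call from the quick start; the retry wrapper is illustrative, not part of DriveFusion, and `torch.cuda.OutOfMemoryError` requires PyTorch 1.13+:

```python
import torch
from drivefusion.utils import generate_drivefusion_output

def generate_with_oom_fallback(model, processor, message, max_new_tokens=4000):
    """Retry with a smaller generation budget if the GPU runs out of memory."""
    try:
        return generate_drivefusion_output(
            model, processor, message, max_new_tokens=max_new_tokens, device="cuda"
        )
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        return generate_drivefusion_output(
            model, processor, message, max_new_tokens=max_new_tokens // 2, device="cuda"
        )
```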
If CUDA is not available:

- Verify the CUDA installation: `python -c "import torch; print(torch.cuda.is_available())"`
- Install a CUDA-compatible PyTorch build: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
If the model fails to load:

- Verify the model path/ID is correct
- Ensure sufficient disk space for model weights
- Check internet connection for remote model downloads
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
DriveFusion Team - Graduation Project
This is a graduation project from the DriveFusion team. For more information about the project, visit the GitHub repository.
